Datasets for Portuguese Legal Semantic Textual Similarity


  • Daniel da Silva Junior, Institute of Computing - UFF
  • Paulo Roberto dos Santos Corval, Law School - UFF
  • Daniel de Oliveira, Institute of Computing - UFF
  • Aline Paes, Institute of Computing - UFF



Keywords: Legal Dataset, Semantic Textual Similarity, Data Annotation


The Brazilian judiciary faces a heavy workload, which prolongs legal proceedings. In response, the Brazilian National Council of Justice introduced Resolution 469/2022, which provides formal guidelines for digitalizing documents and processes, thereby creating an opportunity to apply automatic techniques in the legal field. These techniques aim to assist with various tasks, especially managing the large volume of text involved in legal procedures. Notably, Artificial Intelligence (AI) techniques make it possible to process textual data and extract valuable information from it, which could significantly expedite proceedings. However, one challenge is the scarcity of legal-domain datasets that many AI techniques require. Such datasets are difficult to obtain because labeling them requires domain expertise. To address this challenge, this article presents four datasets from the legal domain: two contain unlabeled documents and metadata, while the other two are labeled using a heuristic approach, for use in semantic textual similarity tasks. Additionally, the article presents a small ground truth dataset built from domain expert annotations to evaluate the effectiveness of the proposed heuristic labeling process. The analysis of the ground truth labels shows that semantic analysis of domain-specific texts can be challenging, even for domain experts. Nonetheless, the comparison between the ground truth and heuristic labels demonstrates the utility and effectiveness of the heuristic labeling approach.




How to Cite

da Silva Junior, D., dos Santos Corval, P. R., de Oliveira, D., & Paes, A. (2024). Datasets for Portuguese Legal Semantic Textual Similarity. Journal of Information and Data Management, 15(1), 206–215.



Dataset Showcase Workshop 2022 - Extended Papers