Datasets for Portuguese Legal Semantic Textual Similarity


  • Daniel da Silva Junior, Institute of Computing - UFF
  • Paulo Roberto dos Santos Corval, Law School - UFF
  • Daniel de Oliveira, Institute of Computing - UFF
  • Aline Paes, Institute of Computing - UFF



Keywords: Legal Dataset, Semantic Textual Similarity, Data Annotation


The Brazilian judiciary faces a heavy workload, which prolongs legal proceedings. In response, the Brazilian National Council of Justice introduced Resolution 469/2022, which provides formal guidelines for digitalizing documents and processes, thereby creating an opportunity to apply automated techniques in the legal field. These techniques aim to assist with various tasks, especially managing the large volume of text involved in legal proceedings. Notably, Artificial Intelligence (AI) techniques make it possible to process textual data and extract valuable information from it, which could significantly expedite proceedings. However, one challenge is the scarcity of legal-domain datasets that many AI techniques require. Such datasets are difficult to obtain because labeling them requires domain expertise. To address this challenge, this article presents four datasets from the legal domain: two contain unlabeled documents and metadata, while the other two are labeled with a heuristic approach for use in semantic textual similarity tasks. Additionally, the article presents a small ground-truth dataset built from domain-expert annotations to evaluate the effectiveness of the proposed heuristic labeling process. The analysis of the ground-truth labels shows that semantic analysis of domain-specific texts can be challenging even for domain experts. Nonetheless, the comparison between the ground-truth and heuristic labels demonstrates the utility and effectiveness of the heuristic labeling approach.








How to Cite

da Silva Junior, D., dos Santos Corval, P. R., de Oliveira, D., & Paes, A. (2024). Datasets for Portuguese Legal Semantic Textual Similarity. Journal of Information and Data Management, 15(1), 206–215.



Dataset Showcase Workshop 2022 - Extended Papers