Legal Document Segmentation and Labeling Through Named Entity Recognition Approaches


  • Gabriel M. C. Guimarães, University of Brasilia
  • Felipe X. B. da Silva, University of Brasilia
  • Lucas A. B. Macedo, University of Brasilia
  • Victor H. F. Lisboa, University of Brasilia
  • Ricardo M. Marcacini, University of São Paulo
  • Andrei L. Queiroz, University of Brasilia
  • Vinicius R. P. Borges, University of Brasilia
  • Thiago P. Faleiros, University of Brasilia
  • Luis P. F. Garcia, University of Brasilia



Keywords: Legal documents, Named Entity Recognition, Segmentation


Document segmentation divides a document into smaller parts, known as segments, which can then be labeled with different categories. The problem can be split into two steps: extracting the segments and labeling them. We address document segmentation and segment labeling for official gazettes or legal documents. Because these documents are divided into labeled segments, their structure can benefit from token classification approaches, especially Named Entity Recognition (NER). In this study, we use word-based and sentence-based CRF, CNN-CNN-LSTM, and CNN-biLSTM-CRF models to bring together text segmentation and token classification. To validate our experiments, we propose a new annotated data set named PersoSEG, composed of 127 documents in Portuguese from the Official Gazette of the Federal District, published between 2001 and 2015, with a Krippendorff's alpha agreement coefficient of 0.984. We observed better performance for word-based models, especially with the CRF architecture, which achieved an average F1-Score of 75.65% across 12 different categories of segments.
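To illustrate how segmentation and labeling can be cast as a single token classification problem, the sketch below converts labeled segments into per-token tags and back. It assumes an IOB-style tagging scheme, which is common in NER but is not stated in the abstract; the function names and the labels `SEG_A` and `SEG_B` are hypothetical and do not come from PersoSEG.

```python
from typing import List, Tuple

# A segment is (start_token, end_token_exclusive, label).
Segment = Tuple[int, int, str]


def segments_to_iob(n_tokens: int, segments: List[Segment]) -> List[str]:
    """Encode non-overlapping labeled segments as one IOB tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in segments:
        tags[start] = f"B-{label}"          # first token opens the segment
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens continue it
    return tags


def iob_to_segments(tags: List[str]) -> List[Segment]:
    """Decode a tag sequence back into labeled segments.

    Decoding is lenient: an I- tag that does not continue the current
    segment is treated as the start of a new one.
    """
    segments: List[Segment] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            segments.append((start, i, label))  # close the open segment
        if tag == "O":
            start, label = None, None
        else:
            start, label = i, tag[2:]
    if label is not None:
        segments.append((start, len(tags), label))
    return segments


# Hypothetical example: 7 tokens, two labeled segments, one unlabeled token.
example_segments = [(0, 3, "SEG_A"), (4, 7, "SEG_B")]
example_tags = segments_to_iob(7, example_segments)
recovered = iob_to_segments(example_tags)
```

Under this encoding, any sequence labeler that predicts one tag per token (or per sentence, with sentences in place of tokens) yields both the segment boundaries and the segment categories in a single pass.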








How to Cite

Guimarães, G. M. C., da Silva, F. X. B., Macedo, L. A. B., Lisboa, V. H. F., Marcacini, R. M., Queiroz, A. L., Borges, V. R. P., Faleiros, T. P., & Garcia, L. P. F. (2024). Legal Document Segmentation and Labeling Through Named Entity Recognition Approaches. Journal of Information and Data Management, 15(1), 123–131.



Best Papers of KDMiLe 2022 - Extended Papers