Text Classification in Law Area: a Systematic Review
DOI:
https://doi.org/10.5753/jidm.2022.2547
Keywords:
Law, Text Classification
Abstract
This article is an extension of an accepted KDMile 2021 submission. Automatic text classification represents a great improvement in the legal workflow, mainly in the migration from physical to electronic lawsuits. A systematic review of studies on text classification in the legal context, covering January 2017 to February 2021, was conducted. The search strategy identified 20 studies, which were analyzed and compared. The review addresses the following research questions: what the state-of-the-art language models (LMs) are; how LMs are applied to text classification on English and Brazilian Portuguese datasets from the legal area; whether language models pre-trained on Brazilian Portuguese are available; and which datasets from the Brazilian judicial context are available. It concludes that automatic text classification is applied in Brazil, although there is a gap in the use of language models compared with studies on English-language datasets; that in-domain pre-training of language models is important for improving results; and that two studies make Brazilian Portuguese language models available, while one introduces a dataset from the Brazilian legal area.