Interpreting Lawsuits Contexts through Probabilistic Topic Modeling

Authors

Étore B. e Santos, I. A. Rodello

DOI:

https://doi.org/10.5753/reic.2025.5962

Keywords:

Topic Modeling, Jurimetry, LDA, LSA, pLSA, Natural Language Processing, Text Mining, Legal Analytics

Abstract

The increasing volume and complexity of digital legal records have underscored the need for scalable analytical tools capable of extracting meaningful insights from unstructured text. This study investigates the application of the topic modeling techniques Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Probabilistic Latent Semantic Analysis (pLSA) to the classification and interpretation of legal documents, with a focus on lawsuits related to special education policies in Brazil. A corpus of 4,259 judicial cases was collected from the São Paulo Court of Justice, preprocessed, and analyzed to identify latent thematic structures. Model performance was evaluated using the coherence score and further validated through human interpretability assessments. The results indicate that pLSA performs well with fewer topics, capturing broader legal themes, while LDA excels at higher topic counts, effectively distinguishing nuanced legal issues such as access to education and contractual disputes. The findings highlight the potential of probabilistic topic modeling as a decision-support tool, reinforcing the role of artificial intelligence in improving legal transparency, accessibility, and analytical depth, without compromising the autonomy of legal interpretation.


References

Blei, D. M. (2012). Probabilistic topic models. Commun. ACM, 55(4):77–84. DOI: 10.1145/2133806.2133826.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022. Available at: [link].

Chen, Y., Peng, Z., Kim, S.-H., and Choi, C. W. (2023). What we can do and cannot do with topic modeling: A systematic review. Communication Methods and Measures, 17(2):1–20. DOI: 10.1080/19312458.2023.2167965.

Devins, N., Levine, R., Liptak, A., and Bhatia, K. S. (2017). The law and big data. Cornell Journal of Law and Public Policy, 27(2):357–401. Available at: [link].

Garg, A. and Ma, M. (2025). Opportunities and challenges in legal AI. Technical report, Stanford Law School. White Paper, CodeX. Available at: [link].

Griffiths, T. L., Steyvers, M., Blei, D. M., and Tenenbaum, J. B. (2004). Integrating topics and syntax. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS'04, pages 537–544, Cambridge, MA, USA. MIT Press. Available at: [link].

Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, pages 289–296, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. Available at: [link].

Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21. DOI: 10.1108/eb026526.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86. DOI: 10.1214/aoms/1177729694.

Ma, M., Sinha, A., Tandon, A., and Richards, J. (2024). Generative ai legal landscape 2024. Technical report, Stanford Law School. White Paper, CodeX. Available at: [link].

Mehrotra, R., Sanner, S., Buntine, W., and Xie, L. (2013). Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 889–892, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/2484028.2484166.

Meng, X.-L. and van Dyk, D. (1997). The EM algorithm—an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3):511–567. Available at: [link].

Mimno, D., Wallach, H. M., Talley, E., Leenders, M., and McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP'11, pages 262–272, USA. Association for Computational Linguistics. Available at: [link].

Naskar, D., Mokaddem, S., Rebollo, M., and Onaindia, E. (2016). Sentiment analysis in social networks through topic modeling. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Available at: [link].

Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 248–256, USA. Association for Computational Linguistics. Available at: [link].

Röder, M., Both, A., and Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM), pages 399–408. ACM. DOI: 10.1145/2684822.2685324.

Steyvers, M. and Griffiths, T. (2006). Probabilistic topic models. In Landauer, T., McNamara, D., Dennis, S., and Kintsch, W., editors, Latent Semantic Analysis: A Road to Meaning, pages 427–448. Erlbaum. Available at: [link].

Wang, Y. (2008). Distributed Gibbs sampling of latent Dirichlet allocation: The gritty details. Accessed: January 16, 2025. Available at: [link].

Ződi, Z. (2017). Law and legal science in the age of big data. Intersections. East European Journal of Society and Politics, 3(2). DOI: 10.17356/ieejsp.v3i2.324.


Published

2025-06-13

How to Cite

Santos, Étore B. e, & Rodello, I. A. (2025). Interpreting Lawsuits Contexts through Probabilistic Topic Modeling. Electronic Journal of Undergraduate Research on Computing, 23(1), 81–90. https://doi.org/10.5753/reic.2025.5962

Issue

Section

Full Papers