Open LLMs Meet Causality in Portuguese: A Corpus-Based Fine-Tuning Approach
DOI: https://doi.org/10.5753/jbcs.2025.5825
Keywords: Causal Reasoning, Open Source LLMs, Corpus-Based Fine-Tuning, Portuguese NLP
Abstract
Causal reasoning is a key component in the development of more robust, fair, and explainable language models. However, the ability of open-source Large Language Models (LLMs) to perform causal reasoning, especially in languages other than English, remains an open challenge. In this paper, we introduce an expanded version of CaLQuest.PT, a corpus of 2,500 natural questions in Portuguese designed to support multi-level causal evaluation. This dataset enables three layers of classification: (1) causal vs. non-causal questions, (2) causal action types such as cause-seeking, effect-seeking, and recommendation-seeking, and (3) reasoning types based on Pearl’s Ladder of Causality—associational, interventional, and counterfactual. We also present an enhanced Few-Shot Learning prompting strategy and evaluate the performance of open-source models fine-tuned on this corpus. Our results show that, with targeted training and prompt design, smaller open-source LLMs can approach and even surpass the performance of larger models in several causal classification tasks. This study highlights the viability of corpus-based fine-tuning as a low-resource alternative for enhancing causal reasoning in open LLMs and advancing natural language understanding in Portuguese.
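The paper's exact prompt design is not reproduced on this page. As a purely illustrative sketch of the kind of Few-Shot Learning prompting the abstract describes, the snippet below assembles a prompt for the first classification layer (causal vs. non-causal questions in Portuguese); the example questions, labels, and function name are hypothetical, not taken from CaLQuest.PT.

```python
# Hypothetical sketch of a few-shot prompt for causal vs. non-causal
# classification of Portuguese questions. None of these examples come
# from the CaLQuest.PT corpus; they only illustrate the prompt shape.

FEW_SHOT_EXAMPLES = [
    ("Por que o céu é azul?", "causal"),             # "Why is the sky blue?"
    ("Qual é a capital do Brasil?", "não causal"),   # "What is the capital of Brazil?"
    ("O que acontece se eu não dormir?", "causal"),  # "What happens if I don't sleep?"
]

def build_prompt(question: str) -> str:
    """Assemble a few-shot classification prompt in Portuguese."""
    lines = ["Classifique cada pergunta como 'causal' ou 'não causal'.", ""]
    for q, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Pergunta: {q}")
        lines.append(f"Classe: {label}")
        lines.append("")
    # The model is expected to complete the final "Classe:" field.
    lines.append(f"Pergunta: {question}")
    lines.append("Classe:")
    return "\n".join(lines)

prompt = build_prompt("Por que os preços sobem quando há inflação?")
print(prompt)
```

The same template extends naturally to the deeper layers (causal action types, and Pearl's associational/interventional/counterfactual rungs) by swapping the label set and demonstrations.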
References
Almeida, F. C. and Caminha, C. (2024). Evaluation of entry-level open-source large language models for information extraction from digitized documents. In Symposium on Knowledge Discovery, Mining and Learning (KDMiLe), pages 25-32. SBC. DOI: 10.5753/kdmile.2024.243859.
Bondarenko, A., Wolska, M., Heindorf, S., Blübaum, L., Ngonga Ngomo, A.-C., Stein, B., Braslavski, P., Hagen, M., and Potthast, M. (2022). CausalQA: A benchmark for causal question answering. In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), pages 3296-3308, Gyeongju, Republic of Korea. International Committee on Computational Linguistics. Available at: [link].
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. DOI: 10.48550/arXiv.2005.14165.
Ceraolo, R., Kharlapenko, D., Reymond, A., Mihalcea, R., Sachan, M., Schölkopf, B., and Jin, Z. (2024). CausalQuest: Collecting natural causal questions for AI agents. DOI: 10.48550/arXiv.2405.20318.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46. DOI: 10.1177/001316446002000104.
Cui, Y., He, P., Tang, X., He, Q., Luo, C., Tang, J., and Xing, Y. (2024). A theoretical understanding of chain-of-thought: Coherent reasoning and error-aware demonstration. DOI: 10.48550/arXiv.2410.16540.
DeepSeek-AI, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Bao, H., Xu, H., Wang, H., Zhang, H., Ding, H., Xin, H., Gao, H., Li, H., Qu, H., Cai, J. L., Liang, J., Guo, J., Ni, J., Li, J., Wang, J., Chen, J., Chen, J., Yuan, J., Qiu, J., Li, J., Song, J., Dong, K., Hu, K., Gao, K., Guan, K., Huang, K., Yu, K., Wang, L., Zhang, L., Xu, L., Xia, L., Zhao, L., Wang, L., Zhang, L., Li, M., Wang, M., Zhang, M., Zhang, M., Tang, M., Li, M., Tian, N., Huang, P., Wang, P., Zhang, P., Wang, Q., Zhu, Q., Chen, Q., Du, Q., Chen, R. J., Jin, R. L., Ge, R., Zhang, R., Pan, R., Wang, R., Xu, R., Zhang, R., Chen, R., Li, S. S., Lu, S., Zhou, S., Chen, S., Wu, S., Ye, S., Ye, S., Ma, S., Wang, S., Zhou, S., Yu, S., Zhou, S., Pan, S., Wang, T., Yun, T., Pei, T., Sun, T., Xiao, W. L., Zeng, W., Zhao, W., An, W., Liu, W., Liang, W., Gao, W., Yu, W., Zhang, W., Li, X. Q., Jin, X., Wang, X., Bi, X., Liu, X., Wang, X., Shen, X., Chen, X., Zhang, X., Chen, X., Nie, X., Sun, X., Wang, X., Cheng, X., Liu, X., Xie, X., Liu, X., Yu, X., Song, X., Shan, X., Zhou, X., Yang, X., Li, X., Su, X., Lin, X., Li, Y. K., Wang, Y. Q., Wei, Y. X., Zhu, Y. X., Zhang, Y., Xu, Y., Xu, Y., Huang, Y., Li, Y., Zhao, Y., Sun, Y., Li, Y., Wang, Y., Yu, Y., Zheng, Y., Zhang, Y., Shi, Y., Xiong, Y., He, Y., Tang, Y., Piao, Y., Wang, Y., Tan, Y., Ma, Y., Liu, Y., Guo, Y., Wu, Y., Ou, Y., Zhu, Y., Wang, Y., Gong, Y., Zou, Y., He, Y., Zha, Y., Xiong, Y., Ma, Y., Yan, Y., Luo, Y., You, Y., Liu, Y., Zhou, Y., Wu, Z. F., Ren, Z. Z., Ren, Z., Sha, Z., Fu, Z., Xu, Z., Huang, Z., Zhang, Z., Xie, Z., Zhang, Z., Hao, Z., Gou, Z., Ma, Z., Yan, Z., Shao, Z., Xu, Z., Wu, Z., Zhang, Z., Li, Z., Gu, Z., Zhu, Z., Liu, Z., Li, Z., Xie, Z., Song, Z., Gao, Z., and Pan, Z. (2025). Deepseek-v3 technical report. DOI: 10.48550/arxiv.2412.19437.
Du, L., Ding, X., Xiong, K., Liu, T., and Qin, B. (2022). e-CARE: A new dataset for exploring explainable causal reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 432-446, Dublin, Ireland. Association for Computational Linguistics. DOI: 10.18653/v1/2022.acl-long.33.
Feder, A., Keith, K. A., Manzoor, E., Pryzant, R., Sridhar, D., Wood-Doughty, Z., Eisenstein, J., Grimmer, J., Reichart, R., Roberts, M. E., Stewart, B. M., Veitch, V., and Yang, D. (2022). Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138-1158. DOI: 10.1162/tacl_a_00511.
Gusev, I. and Tikhonov, A. (2022). HeadlineCause: A dataset of news headlines for detecting causalities. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022), pages 6153-6161, Marseille, France. European Language Resources Association. DOI: 10.48550/arXiv.2108.12626.
Jin, Z., Chen, Y., Leeb, F., Gresele, L., Kamal, O., LYU, Z., Blin, K., Gonzalez Adauto, F., Kleiman-Weiner, M., Sachan, M., and Schölkopf, B. (2023). CLadder: Assessing causal reasoning in language models. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S., editors, Advances in Neural Information Processing Systems, volume 36, pages 31038-31065. Curran Associates, Inc. DOI: 10.48550/arXiv.2312.04350.
Jin, Z., Liu, J., LYU, Z., Poff, S., Sachan, M., Mihalcea, R., Diab, M. T., and Schölkopf, B. (2024). Can large language models infer causation from correlation? In The Twelfth International Conference on Learning Representations. DOI: 10.48550/arXiv.2306.05836.
Kejriwal, M., Santos, H., Mulvehill, A. M., Shen, K., McGuinness, D. L., and Lieberman, H. (2024). Can AI have common sense? Finding out will be key to achieving machine intelligence. Nature, 634:291-294. DOI: 10.1038/d41586-024-03262-z.
Kıcıman, E., Ness, R., Sharma, A., and Tan, C. (2024). Causal reasoning and large language models: Opening a new frontier for causality. DOI: 10.48550/arxiv.2305.00050.
Lasheras, U., Alves, E., and Pinheiro, V. (2025). Interventional and counterfactual causal reasoning for LLM-based AI agents: A dataset and evaluation in Portuguese. Procesamiento del Lenguaje Natural, 74. Available at: [link].
Lasheras, U. A. and Pinheiro, V. (2025). CaLQuest.PT: Towards the collection and evaluation of natural causal ladder questions in Portuguese for AI agents. In Proceedings of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025), pages 325-343, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Available at: [link].
Liu, X., Xu, P., Wu, J., Yuan, J., Yang, Y., Zhou, Y., Liu, F., Guan, T., Wang, H., Yu, T., McAuley, J. J., Ai, W., and Huang, F. (2024). Large language models and causal inference in collaboration: A comprehensive survey. ArXiv, abs/2403.09606. DOI: 10.18653/v1/2025.findings-naacl.427.
McClure, J., Hilton, D. J., Cowan, J., Ishida, L., and Wilson, M. (2001). When people explain difficult actions, is the causal question how or why? Journal of Language and Social Psychology, 20(3):339-357. DOI: 10.1177/0261927X01020003004.
Meta (2024). Introducing Llama 3.1: Our most capable models to date. Available at: [link].
Mostafazadeh, N., Kalyanpur, A., Moon, L., Buchanan, D., Berkowitz, L., Biran, O., and Chu-Carroll, J. (2020). GLUCOSE: GeneraLized and COntextualized story explanations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4569-4586, Online. Association for Computational Linguistics. DOI: 10.18653/v1/2020.emnlp-main.370.
OpenAI (2024). Hello GPT-4o. Available at: [link].
OpenAI, Achiam, J., et al. (2024). GPT-4 technical report. DOI: 10.48550/arXiv.2303.08774.
Pearl, J. and Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books, Inc., USA, 1st edition.
Schank, R. C. (1995). The structure of episodes in memory. DOI: 10.1016/b978-0-12-108550-6.50014-8.
Tandon, N., Dalvi, B., Sakaguchi, K., Clark, P., and Bosselut, A. (2019). WIQA: A dataset for "what if..." reasoning over procedural text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6076-6085, Hong Kong, China. Association for Computational Linguistics. DOI: 10.18653/v1/D19-1629.
Wang, Z. (2024). CausalBench: A comprehensive benchmark for evaluating causal reasoning capabilities of large language models. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 143-151, Bangkok, Thailand. Association for Computational Linguistics. Available at: [link].
Zhang, L., Xu, H., Yang, Y., Zhou, S., You, W., Arora, M., and Callison-Burch, C. (2023). Causal reasoning of entities and events in procedural texts. In Findings of the Association for Computational Linguistics: EACL 2023, pages 415-431, Dubrovnik, Croatia. Association for Computational Linguistics. DOI: 10.18653/v1/2023.findings-eacl.31.
License
Copyright (c) 2025 Uriel Lasheras, Elioenai Alves, Caio Ponte, Carlos Caminha, Vládia Pinheiro

This work is licensed under a Creative Commons Attribution 4.0 International License.

