Pt-HotpotQA: Evaluating Multi-Hop Question Answering on Original and Portuguese-translated Datasets Using LLMs
DOI: https://doi.org/10.5753/jbcs.2025.5801
Keywords: Multi-hop QA, Portuguese NLP, Large Language Models, Dataset Translation, Cross-lingual Evaluation
Abstract
Multi-hop Question Answering (MHQA) advances Natural Language Processing by pushing models to combine information from multiple sources across a series of reasoning steps. Despite substantial progress in MHQA for English, resources for evaluating Large Language Models (LLMs) in Portuguese remain scarce. To address this gap, we introduce a publicly available Portuguese translation of the HotpotQA dataset, a well-established English MHQA benchmark. We systematically evaluate several variants of the Llama multilingual LLM on both the original and translated datasets, analyzing performance variations by language. Our findings demonstrate that multilingual models consistently perform better in English than in Portuguese, though this gap narrows with increased model size. Additionally, we show that fine-tuning improves MHQA performance in Portuguese. This study provides valuable insights into optimizing LLMs for multilingual contexts and contributes a relevant benchmark for Portuguese-language MHQA research.
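The abstract does not detail the scoring procedure, but HotpotQA answer accuracy is conventionally reported with SQuAD-style exact match (EM) and token-level F1 over normalized answer strings. The sketch below is a minimal illustration of those two metrics, not the authors' pipeline; note that the article-stripping step is English-specific, and a Portuguese adaptation would need to drop articles such as "o", "a", "os", "as", "um", and "uma" instead.

```python
# Minimal sketch of SQuAD-style answer metrics commonly used with HotpotQA:
# exact match (EM) and token-level F1 over normalized strings. Illustrative
# only; the paper's actual evaluation code may differ.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # English-specific step
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if prediction and gold answer are identical after normalization."""
    return float(normalize(prediction) == normalize(gold))


def f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
print(round(f1("tower of Gustave Eiffel", "Eiffel Tower"), 2))  # partial overlap: 0.67
```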
References
Baudiš, P. and Šedivý, J. (2015). Modeling of the question answering task in the YodaQA system. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 222-228. Springer International Publishing. DOI: 10.1007/978-3-319-24027-5_20.
Berant, J., Chou, A., Frostig, R., and Liang, P. (2013). Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533-1544. DOI: 10.18653/v1/d13-1160.
Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. (2020). PIQA: Reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 7432-7439. DOI: 10.1609/aaai.v34i05.6239.
Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. (2018). QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174-2184. DOI: 10.18653/v1/d18-1241.
Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. (2019). BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924-2936, Minneapolis, Minnesota. Association for Computational Linguistics. DOI: 10.48550/arXiv.1905.10044.
Clark, P., Cowhey, S., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018). Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457. DOI: 10.48550/arXiv.1803.05457.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, J., Jun, H., Kaiser, L., Miller, J., Plappert, M., Tworek, J., Hilton, J., Schulman, J., Salakhutdinov, R., and Amodei, D. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. DOI: 10.48550/arxiv.2110.14168.
Criscuolo, M., Fonseca, E. R., Aluisio, S. M., and Speranca-Criscuolo, A. C. (2017). MilkQA: A Dataset of Consumer Questions for the Task of Answer Selection. In 2017 Brazilian Conference on Intelligent Systems (BRACIS), pages 354-359, Los Alamitos, CA, USA. IEEE Computer Society. DOI: 10.1109/BRACIS.2017.12.
Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2368-2378. DOI: 10.48550/arxiv.1903.00161.
Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., and Auli, M. (2019). ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558-3567, Florence, Italy. Association for Computational Linguistics. DOI: 10.48550/arXiv.1907.09190.
Fang, Y., Deng, J., Zhang, F., and Wang, H. (2023). An intelligent question-answering model over educational knowledge graph for sustainable urban living. Sustainability, 15(2). DOI: 10.3390/su15021139.
Ferreira, P., Pais, F., Silva, C., Alves, A., and Oliveira, H. G. (2024). MultiWOZ-PT: um conjunto de diálogos orientados a tarefas em português [MultiWOZ-PT: a set of task-oriented dialogues in Portuguese]. Linguamática, 16(2):75-90. DOI: 10.21814/lm.16.2.431.
Gao, P., Gao, F., Ni, J., Wang, Y., Wang, F., and Zhang, Q. (2024). Medical knowledge graph question answering for drug-drug interaction prediction based on multi-hop machine reading comprehension. CAAI Transactions on Intelligence Technology, 9(3):1217-1228. DOI: 10.1049/cit2.12332.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. NeurIPS. DOI: 10.48550/arxiv.2103.03874.
Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601-1611. DOI: 10.48550/arXiv.1705.03551.
Khot, T., Sabharwal, A., and Clark, P. (2020). QASC: A dataset for question answering via sentence composition. AAAI. DOI: 10.1609/aaai.v34i05.6319.
Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E. (2018). The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics. DOI: 10.1162/tacl_a_00023.
Kwiatkowski, T., Palomaki, J., Redfield, O., et al. (2019). Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466. DOI: 10.1162/tacl_a_00276.
Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. (2017). RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785-794, Copenhagen, Denmark. Association for Computational Linguistics. DOI: 10.18653/v1/D17-1082.
Lee, K., Chang, M.-W., and Toutanova, K. (2019). Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086-6096. DOI: 10.18653/v1/p19-1612.
Martinez-Gil, J. (2023). A survey on legal question–answering systems. Computer Science Review, 48:100552. DOI: 10.1016/j.cosrev.2023.100552.
Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. (2018). Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381-2391. Association for Computational Linguistics. DOI: 10.48550/arXiv.1809.02789.
Mucciaccia, S. S., Meireles Paixão, T., Wall Mutz, F., Santos Badue, C., Ferreira de Souza, A., and Oliveira-Santos, T. (2025). Automatic multiple-choice question generation and evaluation systems based on LLM: A study case with university resolutions. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2246-2260, Abu Dhabi, UAE. Association for Computational Linguistics. Available online [link].
Osório, T. F., Leite, B., Lopes Cardoso, H., Gomes, L., Rodrigues, J., Santos, R., and Branco, A. (2024). PORTULAN ExtraGLUE datasets and models: Kick-starting a benchmark for the neural processing of Portuguese. In Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, pages 24-34, Torino, Italia. ELRA and ICCL.
Paschoal, A. F. A., Pirozelli, P., Freire, V., Delgado, K. V., Peres, S. M., José, M. M., Nakasato, F., Oliveira, A. S., Brandão, A. A. F., Costa, A. H. R., and Cozman, F. G. (2021). Pirá: A bilingual Portuguese-English dataset for question-answering about the ocean. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 4544-4553, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3459637.3482012.
Patel, A., Bhattamishra, S., and Goyal, N. (2021). Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080-2094, Online. Association for Computational Linguistics. DOI: 10.18653/v1/2021.naacl-main.168.
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 784-789. Association for Computational Linguistics. DOI: 10.48550/arXiv.1806.03822.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392. Association for Computational Linguistics. DOI: 10.18653/v1/D16-1264.
Reddy, S., Chen, D., and Manning, C. D. (2019). CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249-266. DOI: 10.1162/tacl_a_00266.
Richardson, M., Burges, C., and Renshaw, E. (2013). MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193-203. DOI: 10.18653/v1/d13-1020.
Rodrigues, R. C. (2023). Lessons learned from the evaluation of Portuguese language models. Master's thesis, University of Malta, Msida, Malta. Available online [link].
Sen, P., Aji, A. F., and Saffari, A. (2022). Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1604-1619, Gyeongju, Republic of Korea. International Committee on Computational Linguistics. DOI: 10.48550/arXiv.2210.01613.
Talmor, A., Herzig, J., Lourie, N., and Berant, J. (2019). CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4149-4158. DOI: 10.48550/arxiv.1811.00937.
Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal, A. (2018). FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 809-819. DOI: 10.48550/arXiv.1803.05355.
Trischler, A., Wang, T., Yuan, X., et al. (2017). NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191-200. DOI: 10.48550/arXiv.1611.09830.
Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. (2022). MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539-554. DOI: 10.1162/tacl_a_00475.
Tsatsaronis, G., Balikas, G., Malakasiotis, P., et al. (2015). An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1):138. DOI: 10.1186/s12859-015-0564-6.
Welbl, J., Stenetorp, P., and Riedel, S. (2018). Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics. DOI: 10.1162/tacl_a_00021.
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369-2380, Brussels, Belgium. Association for Computational Linguistics. DOI: 10.18653/v1/D18-1259.
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791-4800. DOI: 10.48550/arXiv.1905.07830.
Zhao, Y., Zhang, W., Chen, G., Kawaguchi, K., and Bing, L. (2024). How do large language models handle multilingualism? In Advances in Neural Information Processing Systems 37 (NeurIPS 2024). DOI: 10.48550/arxiv.2402.18815.
License
Copyright (c) 2025 Sérgio S. Mucciaccia, Thiago M. Paixão, Filipe Mutz, Alberto F. De Souza, Claudine S. Badue, Thiago Oliveira-Santos

This work is licensed under a Creative Commons Attribution 4.0 International License.

