Evaluating Reranking Strategies for Portuguese Information Retrieval: Fine-Tuning, LLMs, and Sociocultural Aspects
DOI: https://doi.org/10.5753/jbcs.2026.5659

Keywords: Information Retrieval, Reranking, Fine-Tuning, Large Language Models, Not-so-Large Language Models

Abstract
Reranking plays a crucial role in improving Information Retrieval (IR) performance, particularly in low-resource languages such as Portuguese. In this study, we evaluate different reranking strategies for Portuguese IR, comparing multilingual and Portuguese-specific models, as well as not-so-large language models and large language models (LLMs). We assess the performance of BM25 combined with PTT5 fine-tuned on multilingual and Brazilian Portuguese datasets, alongside a state-of-the-art multilingual reranker (BGE M3) and two LLM-based rerankers: RankGPT (GPT-4) and Sabiá-3, a Portuguese-specific LLM. Additionally, we introduce a novel dynamic In-Context Learning (DICL) prompting strategy to enhance LLM performance. Experiments conducted on the Quati and Pirá 2.0 datasets show that fine-tuning on native Brazilian Portuguese data significantly improves retrieval effectiveness, by up to 5 p.p. in nDCG, compared to fine-tuning on translated multilingual datasets. Two fine-tuning approaches were tested, a binary classification strategy using ‘true’ and ‘false’ target tokens and a relevance-score-based training objective, and both outperformed models fine-tuned on translated multilingual data. RankGPT achieved the best overall results, yet Sabiá-3 demonstrated competitive performance, particularly on queries related to sociocultural aspects. The DICL strategy further improved the results of both LLMs, significantly boosting their MRR@10. These findings highlight the importance of language-specific training and suggest that not-so-large language models can be viable alternatives for reranking tasks in Portuguese IR.
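The ‘true’/‘false’ fine-tuning strategy mentioned above follows the pointwise seq2seq recipe of Nogueira et al. (2020): the model receives a query-document prompt and the probability it assigns to the ‘true’ token at the first decoding step is used as the relevance score. The sketch below illustrates the inference side of that recipe; it assumes a T5/PTT5-style checkpoint already fine-tuned this way, and the checkpoint name is a placeholder, not the authors' released model.

```python
# Minimal sketch of monoT5-style pointwise reranking (Nogueira et al., 2020).
# Assumes a PTT5-style seq2seq model already fine-tuned so that the first
# decoded token is 'true' for relevant pairs and 'false' otherwise.
# The checkpoint name is a placeholder, not the authors' released model.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "unicamp-dl/ptt5-base-portuguese-vocab"  # placeholder checkpoint
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).eval()

# Ids of the first subword of 'true' and 'false' in this vocabulary.
TRUE_ID = tokenizer.encode("true", add_special_tokens=False)[0]
FALSE_ID = tokenizer.encode("false", add_special_tokens=False)[0]

def relevance_score(query: str, passage: str) -> float:
    """P('true') over {'true', 'false'} at the first decoding step."""
    prompt = f"Query: {query} Document: {passage} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    # Softmax over just the two target tokens, as in monoT5.
    probs = torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)
    return probs[0].item()

# Rerank BM25 candidates by descending relevance score.
candidates = ["Brasília é a capital do Brasil.", "O futebol é popular no Brasil."]
ranked = sorted(candidates,
                key=lambda p: relevance_score("Qual é a capital do Brasil?", p),
                reverse=True)
```

The relevance-score-based variant tested in the paper would presumably change only the training targets (graded scores instead of the two tokens), keeping the same prompt layout.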
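The abstract does not spell out how DICL selects its in-context examples. One plausible reading, consistent with the all-MiniLM-L6-v2 entry in the references, is that demonstrations are retrieved dynamically by embedding similarity to the incoming query before prompting the LLM reranker. The sketch below is a hypothetical illustration of that idea only; the demonstration pool, labels, and prompt wording are all assumptions, not the paper's actual DICL implementation.

```python
# Hypothetical sketch of dynamic in-context example selection for an LLM
# reranker: retrieve the labeled demonstrations closest to the new query
# and prepend them to the reranking prompt. Pool contents and prompt
# wording are illustrative assumptions, not the paper's DICL design.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# A small pool of labeled (query, passage, judgment) demonstrations.
pool = [
    {"query": "capital do Brasil", "passage": "Brasília é a capital do Brasil.", "label": "relevante"},
    {"query": "clima da Amazônia", "passage": "A floresta amazônica tem clima úmido.", "label": "relevante"},
    {"query": "capital do Brasil", "passage": "O futebol é popular no Brasil.", "label": "irrelevante"},
]
pool_embeddings = encoder.encode([ex["query"] for ex in pool], convert_to_tensor=True)

def build_prompt(query: str, passages: list[str], k: int = 2) -> str:
    """Assemble a reranking prompt with the k nearest demonstrations."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, pool_embeddings, top_k=k)[0]
    demos = "\n\n".join(
        f"Consulta: {pool[h['corpus_id']]['query']}\n"
        f"Passagem: {pool[h['corpus_id']]['passage']}\n"
        f"Julgamento: {pool[h['corpus_id']]['label']}"
        for h in hits
    )
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (f"{demos}\n\n"
            f"Ordene as passagens abaixo da mais para a menos relevante.\n"
            f"Consulta: {query}\n{numbered}")
```

The resulting prompt would then be sent to the LLM reranker (RankGPT or Sabiá-3 in the paper's setup) in place of a static few-shot prompt.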
References
Abonizio, H., Almeida, T., Laitz, T., Junior, R., Bonás, G., Nogueira, R., and Pires, R. (2024). Sabiá-3 technical report. ArXiv.
Bonifacio, L., Campiotti, I., Lotufo, R., and Nogueira, R. (2021). mMARCO: A multilingual version of the MS MARCO passage ranking dataset. CoRR, abs/2108.13897.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA. Curran Associates Inc.
Bueno, M., Oliveira, E., Nogueira, R., Lotufo, R., and Pereira, J. (2024). Quati: A Brazilian Portuguese information retrieval dataset from native speakers. Proceedings of the XV Brazilian Symposium on Information Technology and Human Language (STIL).
Carmo, D., Piau, M., Campiotti, I., Nogueira, R., and Lotufo, R. (2020). PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data. ArXiv.
Caseli, H. and Nunes, M. (2023). Processamento de Linguagem Natural: Conceitos, Técnicas e Aplicações em Português. Brasileiras em Processamento de Linguagem Natural.
Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z. (2024). M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. Findings of the Association for Computational Linguistics: ACL 2024.
Guo, F., Li, W., Zhuang, H., Luo, Y., Li, Y., Yan, L., Zhu, Q., and Zhang, Y. (2025). MCRanker: Generating diverse criteria on-the-fly to improve pointwise LLM rankers. Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining.
Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. Annual Meeting of the Association for Computational Linguistics.
Jones, K., Walker, S., and Robertson, S. (2000). A probabilistic model of information retrieval: Development and comparative experiments. Information Processing & Management, 36(6).
Laitz, T., Papakostas, K., Lotufo, R., and Nogueira, R. (2025). InRanker: Distilled rankers for zero-shot information retrieval. Intelligent Systems: BRACIS 2024, Lecture Notes in Computer Science, vol. 15413. DOI: 10.1007/978-3-031-79032-4_10.
Lewis, D., Yang, Y., Rose, T., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(1).
Ma, X., Zhang, X., Pradeep, R., and Lin, J. (2023). Zero-shot listwise document reranking with a large language model. CoRR.
Maritaca AI (2025). Maritaca ai documentation - models. [link]. Accessed on 15 January 2025.
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., and Deng, L. (2016). MS MARCO: A human generated machine reading comprehension dataset. Annual Conference on Neural Information Processing Systems (NIPS).
Nogueira, R., Jiang, Z., Pradeep, R., and Lin, J. (2020). Document ranking with a pretrained sequence-to-sequence model. Empirical Methods in Natural Language Processing.
Nogueira, R., Yang, W., Cho, K., and Lin, J. (2019). Multi-stage document ranking with BERT. ArXiv.
Oliveira, L., Romeu, R., and Moreira, V. (2021). REGIS: A test collection for geoscientific documents in Portuguese. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
OpenAI (2025). Openai documentation - pricing. [link]. Accessed on 15 January 2025.
Paschoal, A., Pirozelli, P., Freire, V., Delgado, K., Peres, S., José, M., Nakasato, F., Oliveira, A., Brandão, A., Costa, A., and Cozman, F. (2021). Pirá: A bilingual Portuguese-English dataset for question-answering about the ocean. Proceedings of the 30th ACM International Conference on Information & Knowledge Management.
Pirozelli, P., José, M., Silveira, I., Nakasato, F., Peres, S., Brandão, A., Costa, A., and Cozman, F. (2024). Benchmarks for Pirá 2.0, a reading comprehension dataset about the ocean, the Brazilian coast, and climate change. Data Intelligence, 6(1):29-63.
Qin, Z., Jagerman, R., Hui, K., Zhuang, H., Wu, J., Yan, L., Shen, J., Liu, J., Liu, J., Metzler, D., Wang, X., and Bendersky, M. (2024). Large language models are effective text rankers with pairwise ranking prompting. Findings of the Association for Computational Linguistics: NAACL 2024.
Sachan, D., Lewis, M., Joshi, M., Aghajanyan, A., Yih, W., Pineau, J., and Zettlemoyer, L. (2022). Improving passage retrieval with zero-shot question generation. Empirical Methods in Natural Language Processing.
Sentence-Transformers (2021). all-MiniLM-L6-v2. [link]. Pretrained sentence embedding model, fine-tuned on over 1B sentence pairs.
Sun, W., Yan, L., Ma, X., Wang, S., Ren, P., Chen, Z., Yin, D., and Ren, Z. (2023). Is ChatGPT good at search? Investigating large language models as re-ranking agents. Empirical Methods in Natural Language Processing.
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. (2021). mT5: A massively multilingual pre-trained text-to-text transformer. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Zhu, Y., Yuan, H., Wang, S., Liu, S., Liu, W., Deng, C., Chen, H., Liu, Z., Dou, Z., and Wen, J. (2024). Large language models for information retrieval: A survey. ArXiv.
License
Copyright (c) 2026 Renato Okabayashi Miyaji, Pedro Luiz Pizzigatti Corrêa

This work is licensed under a Creative Commons Attribution 4.0 International License.

