Enhancing Large Language Models for Underrepresented Varieties: Pretraining Strategies in the Galician-Portuguese Diasystem
DOI: https://doi.org/10.5753/jbcs.2025.5766

Keywords: Large Language Models, Continual Pretraining, European Portuguese, Galician

Abstract
This study presents a systematic exploration of strategies for pretraining generative Large Language Models (LLMs) within the Galician-Portuguese diasystem, focusing on two underrepresented varieties of this diasystem: European Portuguese and Galician. We investigate the impact of combining versus separating linguistic varieties during continued pretraining, the trade-offs between large-scale noisy data and smaller high-quality corpora, and the potential gains from incorporating instruction-based data during the pretraining phase rather than in post-training (e.g., instruction tuning). Our findings show that including both language varieties in training enhances task-solving performance as well as the linguistic quality of generated text, especially when curated linguistic resources are leveraged. By integrating technical experimentation with sociolinguistic insight, this work underscores the importance of equitable and context-aware LLM development in multilingual and minority-language settings.
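The core setting compared in the abstract (combining versus separating the two varieties during continued pretraining) can be sketched with standard tooling. The snippet below is a minimal illustration using Hugging Face Transformers and Datasets, assuming plain-text corpora for each variety; the base checkpoint name, corpus file names, mixing ratio, and hyperparameters are illustrative assumptions, not the configuration used in the study.

# Minimal sketch of continued pretraining on a mixed Galician / European
# Portuguese corpus. All names and hyperparameters below are illustrative
# assumptions, not the paper's actual setup.
from datasets import load_dataset, interleave_datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Hypothetical plain-text corpora, one file per variety.
gl = load_dataset("text", data_files={"train": "corpus_gl.txt"})["train"]
pt = load_dataset("text", data_files={"train": "corpus_pt.txt"})["train"]

# "Combined" setting: interleave both varieties in one training stream.
# Training on gl or pt alone would correspond to the "separate" setting.
mixed = interleave_datasets([gl, pt], probabilities=[0.5, 0.5], seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = mixed.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ckpt-gl-pt",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

In this sketch, instruction-based data could be folded into the same training stream simply by adding a third text corpus of formatted instruction-response pairs to the interleaved mix, which is the intuition behind moving instruction data from post-training into the pretraining phase.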
License
Copyright (c) 2025 Pablo Rodríguez, Pablo Gamallo, Daniel Santos, Susana Sotelo, Silvia Paniagua, José Ramom Pichel, Pedro Salgueiro, Vítor Nogueira, Paulo Quaresma, Marcos Garcia, Senén Barro

This work is licensed under a Creative Commons Attribution 4.0 International License.

