Cross-Lingual Keyword Extraction for Pesticide Terminology in Brazilian Portuguese and English

DOI:

https://doi.org/10.5753/jbcs.2025.5815

Keywords:

Multilingual keyword extraction, word alignment, BERT embeddings, pesticides

Abstract

Agriculture plays a crucial role in Brazil's economy. As the country intensifies its activities in the sector, the use of pesticides also increases. Hence, the risks associated with the consumption of pesticide-laden food have become a concern for chemistry researchers. One issue affecting the regulatory standardization of pesticides in Brazil is the difficulty of translating pesticide names, particularly from English. For example, the word malathion can be translated from English to Portuguese as malatiom or malatião, resulting in inconsistent labeling. This issue is an instance of the broader problem of translating highly technical terms between languages, in particular for low-resource languages. In this work, we investigate terminological variation in the chemistry of organophosphorus pesticides. Our goal is to study strategies for domain-specific multilingual keyword extraction. To that end, two corpora were built from pesticide-related scientific documents in Brazilian Portuguese and English, totaling 84 and 210 texts, respectively, and representing the low- and high-resource languages in this study. We then assessed six methods for keyword extraction: Simple Maths, TF-IDF, YAKE, TextRank, MultipartiteRank, and KeyBERT. We relied on a multilingual contextual BERT embedding to retrieve the corresponding pesticide names in the target language. Fine-tuning was also explored to further improve the multilingual representation. Moreover, we evaluated the use of large language models (LLMs) combined with the retrieval-augmented generation (RAG) framework. We found that the contextual approach, combined with fine-tuning, provided the best results, contributing to improved pesticide terminology extraction in a multilingual scenario.
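As a rough illustration of one of the six extraction methods named in the abstract, the sketch below ranks the terms of each document by TF-IDF. It is a minimal sketch rather than the authors' pipeline: the function name `tfidf_keywords`, the whitespace tokenization, the unsmoothed IDF, and the toy sentences are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k terms of each document, ranked by TF-IDF.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which is an assumption.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # TF is the relative frequency in the document; IDF is log(N / df).
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Toy corpus (invented sentences, not data from the study).
toy_corpus = [
    "malathion is an organophosphorus pesticide",
    "parathion is an organophosphorus pesticide",
    "the corpus is a collection of texts",
]
keywords = tfidf_keywords(toy_corpus, top_k=1)
```

In this toy setting, a term that occurs in every document receives a score of zero, so the document-specific pesticide name rises to the top of each of the first two documents. A realistic setup would add stop-word filtering and multi-word candidates.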

Published

2025-10-09

How to Cite

de Souza, J. V., Amamou, H., Chen, R., Salari, E., Gubelmann, R., Niklaus, C., Serpa, T., Lima, M. M. de F., Pinto, P. T., Kshirsagar, S., Davoust, A., Handschuh, S., & Avila, A. R. (2025). Cross-Lingual Keyword Extraction for Pesticide Terminology in Brazilian Portuguese and English. Journal of the Brazilian Computer Society, 31(1), 973–990. https://doi.org/10.5753/jbcs.2025.5815

Section

Articles