Analyzing Discourses in Portuguese Word Embeddings: A Case of Gender Bias Outside the English-Speaking World

Authors

F. T. de S. Taso, V. Q. dos Reis, F. V. Martinez

DOI:

https://doi.org/10.5753/jis.2025.5958

Keywords:

Natural Language Processing, Computational Linguistics, Algorithmic Sexism, Ethics in AI, Non-English NLP

Abstract

In this paper, we examined a Portuguese word embedding model, seeking to identify gender biases from diverse analytical perspectives using the SC-WEAT and RIPA metrics, which are widely used for English. Our inquiry focused on three primary dimensions: (1) the frequency-based association of words with feminine and masculine terms; (2) the identification of disparities between grammatical classes with respect to the gender sets; and (3) the categorisation and grouping of feminine and masculine words, including their distinctive attributes. Regarding frequency groups, our investigation revealed a pervasive negative association of words with feminine terms in most subsets, indicating a pronounced inclination of the model’s vocabulary towards masculine references. Notably, among the 100 most frequent words, 89 exhibited a stronger association with masculine terms. In the analysis of grammatical classes, adjectives were predominantly associated with feminine references, underscoring a tendency towards supplementary description when referring to women. Furthermore, participle verbs were conspicuously more prevalent among words associated with feminine terms than among those associated with masculine terms, a phenomenon that requires further expert attention to be properly explained. The categorisation process further evidenced gender bias: words in the domains of sport, finance, and science were associated with masculine terms, while words related to feelings, home furniture, and entertainment were associated with feminine terms. These findings are significant in fostering a discourse on gender analysis within non-English models, such as Portuguese ones, thereby encouraging the Brazilian community to actively investigate biases in NLP models.
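
As a pointer for readers unfamiliar with the two metrics, the Python sketch below illustrates their core computations, assuming word vectors are already loaded as NumPy arrays (for instance, from the Portuguese embeddings of Hartmann et al., 2017, cited below). The function names, the attribute sets A and B, and the pair-averaged gender direction in ripa are illustrative simplifications, not the authors' exact pipeline.

import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def sc_weat_effect_size(w, A, B):
    """Single-Category WEAT (Caliskan et al., 2022): effect size of the
    association of target vector `w` with attribute sets A and B, e.g.,
    vectors of feminine terms (A) and masculine terms (B).
    Positive values indicate a stronger association with A."""
    sims_a = np.array([cosine(w, a) for a in A])
    sims_b = np.array([cosine(w, b) for b in B])
    pooled = np.concatenate([sims_a, sims_b])
    return (sims_a.mean() - sims_b.mean()) / pooled.std(ddof=1)

def ripa(w, pairs):
    """Relational Inner Product Association (Ethayarajh et al., 2019):
    inner product of `w` with a unit gender-relation vector. Here the
    relation vector is averaged over normalised difference vectors of
    gendered pairs such as (mulher, homem) -- a common simplification."""
    diffs = [(a - b) / np.linalg.norm(a - b) for a, b in pairs]
    b_vec = np.mean(diffs, axis=0)
    b_hat = b_vec / np.linalg.norm(b_vec)
    return float(np.dot(w, b_hat))

Read this way, a word whose SC-WEAT effect size (or RIPA score) is positive leans towards the feminine attribute set and a negative score towards the masculine set; sweeping either function over frequency-ranked slices of the vocabulary reproduces the kind of frequency-based analysis summarised above.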

References

Assi, F. and Caseli, H. (2024). Biases in GPT-3.5 Turbo model: a case study regarding gender and language. In Anais do XV Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pages 294–305, Porto Alegre, RS, Brasil. SBC. DOI: https://doi.org/10.5753/stil.2024.245358.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proc. of ACM-FAccT, page 610–623, Canada. Association for Computing Machinery. DOI: https://doi.org/10.1145/3442188.3445922.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29. DOI: https://doi.org/10.5555/3157382.3157584.

Caliskan, A., Ajay, P. P., Charlesworth, T., Wolfe, R., and Banaji, M. R. (2022). Gender bias in word embeddings: A comprehensive analysis of frequency, syntax, and semantics. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’22, page 156–170, New York, NY, USA. Association for Computing Machinery. DOI: https://doi.org/10.1145/3514094.3534162.

Caliskan, A., Bryson, J. J., and Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186. DOI: https://doi.org/10.1126/science.aal4230.

Chaloner, K. and Maldonado, A. (2019). Measuring gender bias in word embeddings across domains and discovering new gender bias word categories. In Costa-jussà, M. R., Hardmeier, C., Radford, W., and Webster, K., editors, Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 25–32, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W19-3804.

Chen, X., Li, M., Yan, R., Gao, X., and Zhang, X. (2022). Unsupervised mitigating gender bias by character components: A case study of Chinese word embedding. In Proc. of GeBNLP, pages 121–128, Seattle, Washington. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2022.gebnlp-1.14.

Cunha, C. and Cintra, L. (2016). Nova Gramática do Português Contemporâneo. Lexicon, NY, USA, 7th edition.

de Lima, L. F. and de Araujo, R. (2023). A call for a research agenda on fair NLP for Portuguese. In Anais do XIV Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pages 187–192, Porto Alegre, RS, Brasil. SBC. DOI: https://doi.org/10.5753/stil.2023.233763.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T., editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN, USA. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N19-1423.

Ethayarajh, K., Duvenaud, D., and Hirst, G. (2019). Understanding undesirable word embedding associations. In Korhonen, A., Traum, D., and Màrquez, L., editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1696–1705, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1166.

Falcão, C. (2021). Lentes racistas: Rui Costa está transformando a Bahia em um laboratório de vigilância com reconhecimento facial. [link]. Last access: 07-09-2025 (in Portuguese).

Firmino, V., Lopes, J., and Reis, V. (2024). Identificando padrões de sexismo na música brasileira através do processamento de linguagem natural. In Anais do V Workshop sobre as Implicações da Computação na Sociedade, pages 59–69, Porto Alegre, RS, Brasil. SBC. DOI: https://doi.org/10.5753/wics.2024.2968.

Fonseca, E., Rosa, J. L. G., and Aluísio, S. (2015). Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. Journal of the Brazilian Computer Society, 21(2). DOI: https://doi.org/10.1186/s13173-014-0020-x.

Freitas, C. and Martins, F. (2023). Bela, recatada e do lar: o que a mineração de textos literários nos diz sobre a caracterização de personagens femininas e masculinas. Fórum Linguístico, 20(3). DOI: https://doi.org/10.5007/1984-8412.2023.e86749.

Freitas, C., Rocha, P., and Bick, E. (2008). Floresta sintá(c)tica: Bigger, thicker and easier. In Teixeira, A., de Lima, V. L. S., de Oliveira, L. C., and Quaresma, P., editors, Computational Processing of the Portuguese Language, pages 216–219, Berlin, Heidelberg. Springer Berlin Heidelberg. DOI: https://doi.org/10.1007/978-3-540-85980-2_23.

Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644. DOI: https://doi.org/10.1073/pnas.1720347115.

Greenwald, A. G., McGhee, D. E., and Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74(6):1464–1480. DOI: https://doi.org/10.1037/0022-3514.74.6.1464.

Hartmann, N., Fonseca, E., Shulby, C., Treviso, M., Silva, J., and Aluísio, S. (2017). Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In Paetzold, G. H. and Pinheiro, V., editors, Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, pages 122–131, Uberlândia, Brazil. Sociedade Brasileira de Computação. [link].

IBGE (2024). Estatísticas de Gênero: Indicadores sociais das mulheres no Brasil. Available at: [link]. Last access: 07-09-2025.

Kurpicz-Briki, M. (2020). Cultural differences in bias? Origin and gender bias in pre-trained German and French word embeddings. In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) 16th Conference on Natural Language Processing (KONVENS), volume 2624, Zurich, Switzerland. CEUR Workshop Proceedings. DOI: https://doi.org/10.24451/arbor.11922.

LESFEM (2024). Monitor de Feminicídios no Brasil. Available at: [link]. Last access: 07-09-2025.

Noble, S. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. New York University Press, NY, USA. DOI: https://doi.org/10.2307/j.ctt1pwt9w5.

Omrani Sabbaghi, S. and Caliskan, A. (2022). Measuring gender bias in word embeddings of gendered languages requires disentangling grammatical gender signals. In Proc. of AIES, page 518–531, New York, NY, USA. Association for Computing Machinery. DOI: https://doi.org/10.1145/3514094.3534176.

Park, J. H., Shin, J., and Fung, P. (2018). Reducing gender bias in abusive language detection. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J., editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1302.

Raymakers, T. (2020). Gender bias in word embeddings of different languages. Bachelor’s thesis, Delft University of Technology. [link]. Last access: 07-13-2025.

Salles, I. and Pappa, G. (2021). Viés de gênero em biografias da Wikipédia em português. In Anais do X Brazilian Workshop on Social Network Analysis and Mining, pages 211–216, Porto Alegre, RS, Brasil. SBC. DOI: https://doi.org/10.5753/brasnam.2021.16142.

Silva, T. (2022). Linha do tempo do racismo algorítmico. [link]. Last access: 07-09-2025 (in Portuguese).

Sogancioglu, G., Mijsters, F., van Uden, A., and Peperzak, J. (2022). Gender bias in (non)-contextual clinical word embeddings for stereotypical medical categories. arXiv preprint arXiv:2208.01341. DOI: https://doi.org/10.48550/arXiv.2208.01341.

Sun, T., Gaut, A., Tang, S., Huang, Y., ElSherief, M., Zhao, J., Mirza, D., Belding, E., Chang, K.-W., and Wang, W. Y. (2019). Mitigating gender bias in natural language processing: Literature review. In Korhonen, A., Traum, D., and Màrquez, L., editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1630–1640, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1159.

Suresh, H. and Guttag, J. (2021). A framework for understanding sources of harm throughout the machine learning life cycle. In Proc. of EAAMO, volume 17, pages 1–9, NY, USA. ACM Press. DOI: https://doi.org/10.1145/3465416.3483305.

Taso, F., Reis, V., and Martinez, F. (2023a). Algorithmic gender discrimination: Case study and analysis in the Brazilian context. In Proc. of WICS, pages 13–25, João Pessoa, PB, Brazil. SBC. (in Portuguese). DOI: https://doi.org/10.5753/wics.2023.229980.

Taso, F., Reis, V., and Martinez, F. (2023b). Sexismo no Brasil: análise de um Word Embedding por meio de testes baseados em associação implícita. In Anais do XIV Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pages 53–62, Porto Alegre, RS, Brasil. SBC. DOI: https://doi.org/10.5753/stil.2023.233845.

Torres Berrú, Y., Batista, V., and Zhingre, L. (2023). A data mining approach to detecting bias and favoritism in public procurement. Intelligent Automation & Soft Computing, 36(3):3501–3516. DOI: https://doi.org/10.32604/iasc.2023.035367.

Trainotti Rabonato, R., Milios, E., and Berton, L. (2025). Gender-neutral English to Portuguese machine translator: Promoting inclusive language. In Paes, A. and Verri, F. A. N., editors, Intelligent Systems, pages 180–195, Cham. Springer Nature Switzerland. DOI: https://doi.org/10.1007/978-3-031-79038-6_13.

United Nations Development Programme (UNDP) (2023). Breaking down gender biases: Shifting social norms towards gender equality. [link]. Last access: 07-09-2025.

United Nations (2024). Sustainable Development Goals. Available at: [link]. Last access: 07-09-2025.

Wagner, J. and Zarrieß, S. (2022). Do gender neutral affixes naturally reduce gender bias in static word embeddings? In Proc. of KONVENS, pages 88–97, Potsdam, Germany. KONVENS 2022 Organizers. [link].

Wairagala, E. P., Mukiibi, J., Tusubira, J. F., Babirye, C., Nakatumba-Nabende, J., Katumba, A., and Ssenkungu, I. (2022). Gender bias evaluation in Luganda-English machine translation. In Proc. of AMTA, pages 274–286, Orlando, USA. AMTA. [link].

Werneck, A. (2019). Reconhecimento facial falha em segundo dia, e mulher inocente é confundida com criminosa já presa. [link]. Last access: 07-09-2025 (in Portuguese).

Yee, K., Tantipongpipat, U., and Mishra, S. (2021). Image cropping on twitter: Fairness metrics, their limitations, and the importance of representation, design, and agency. Proc. of HCI, 5:1–24. DOI: https://doi.org/10.1145/3479594.

Zajonc, R. (2001). Mere exposure: A gateway to the subliminal. Current Directions in Psychological Science, 10(6):224–228. DOI: https://doi.org/10.1111/1467-8721.00154.

Published

2025-07-14

How to Cite

TASO, F. T. de S.; REIS, V. Q. dos; MARTINEZ, F. V. Analyzing Discourses in Portuguese Word Embeddings: A Case of Gender Bias Outside the English-Speaking World. Journal on Interactive Systems, Porto Alegre, RS, v. 16, n. 1, p. 532–543, 2025. DOI: 10.5753/jis.2025.5958. Available at: https://journals-sol.sbc.org.br/index.php/jis/article/view/5958. Accessed: 5 Dec. 2025.

Issue

Section

Regular Paper
