Rewriting Stories with LLMs: Gender Bias in Generated Portuguese-language Narratives

Authors

Silva, M. O., Brandão, M. A., and Moro, M. M.

DOI:

https://doi.org/10.5753/jbcs.2025.5799

Keywords:

gender bias, large language models, narrative generation, Portuguese-language text, bias mitigation

Abstract

Gender bias in Large Language Models (LLMs) has been widely documented, yet its impact on Portuguese-language text generation remains underexplored. In this study, we investigate gender bias in storytelling by prompting instruction-tuned LLMs to generate narrative continuations from masked sentences extracted from 840 public domain literary works. We analyze the gender distribution of the generated characters and apply word association tests to quantify bias in word embeddings trained on the generated texts. Our findings reveal that both Mistral-7B-Instruct and LLaMA 3.2-3B tend to perpetuate, and in some cases amplify, existing gender imbalances: male characters are overrepresented and associated with cognitive and professional domains, while female characters are underrepresented and linked to emotional and domestic roles. We also explore prompt engineering as a bias mitigation strategy, finding that although it increases gender-neutral descriptions, it also introduces greater uncertainty into gender inference. Our results highlight the challenges of addressing bias in LLMs and emphasize the need for more robust evaluation and mitigation strategies for Portuguese-language LLMs.
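
As a concrete illustration of the word association test mentioned above, the minimal sketch below computes a WEAT-style effect size (Caliskan et al., 2017) from a word-to-vector mapping, such as one obtained from embeddings trained on the generated stories with fastText (Bojanowski et al., 2017). The Portuguese word lists and the random stand-in vectors are hypothetical placeholders, not the paper's actual test sets or embeddings.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, emb):
    # s(w, A, B): mean similarity of w to attribute set A minus to set B.
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    # Cohen's-d-style effect size over target sets X, Y and attribute sets A, B.
    s_x = [association(x, A, B, emb) for x in X]
    s_y = [association(y, A, B, emb) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)

# Hypothetical Portuguese word lists (illustrative, not the paper's test sets).
X = ["ele", "homem", "pai"]              # male target words
Y = ["ela", "mulher", "mãe"]             # female target words
A = ["trabalho", "ciência", "carreira"]  # professional/cognitive attributes
B = ["casa", "família", "cuidado"]       # domestic/emotional attributes

# In practice, emb would come from embeddings trained on the generated texts
# (e.g., with fastText); random vectors stand in here only to make the sketch run.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=100) for w in X + Y + A + B}

print(f"WEAT effect size: {weat_effect_size(X, Y, A, B, emb):.3f}")
```

A positive effect size would indicate that the male target words sit closer to the professional/cognitive attributes than the female target words do, mirroring the direction of the associations reported in the abstract.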


References

Alhussain, A. I. and Azmi, A. M. (2021). Automatic Story Generation: A Survey of Approaches. ACM Comput. Surv., 54(5):103:1-103:38. DOI: 10.1145/3453156.

Assi, F. M. and Caseli, H. d. M. (2024). Biases in GPT-3.5 Turbo model: a case study regarding gender and language. In Simp. Bras. de Tecnologia da Informação e da Linguagem Humana, STIL, pages 294-305. SBC. DOI: 10.5753/stil.2024.245358.

Bauer, L., Mehrabi, N., Goyal, P., Chang, K.-W., Galstyan, A., and Gupta, R. (2024). BELIEVE: Belief-enhanced instruction generation and augmentation for zero-shot bias mitigation. In Proceedings of the Workshop on Trustworthy Natural Language Processing, TrustNLP, pages 239-251. ACL. DOI: 10.18653/v1/2024.trustnlp-1.20.

Bender, E. M., Gebru, T., McMillan-Major, A., et al. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, FAccT, pages 610-623. ACM. DOI: 10.1145/3442188.3445922.

Bojanowski, P., Grave, E., Joulin, A., et al. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135-146. DOI: 10.1162/tacl_a_00051.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., et al. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems, volume 29 of NIPS. Available online [link].

Brown, T. et al. (2020). Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, volume 33 of NIPS, pages 1877-1901. Available online [link].

Caliskan, A., Bryson, J. J., and Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183-186. DOI: 10.1126/science.aal4230.

Carvalho, F., Junior, F. P., Ogasawara, E., et al. (2024). Evaluation of the Brazilian Portuguese version of Linguistic Inquiry and Word Count 2015 (BP-LIWC2015). Language Resources and Evaluation, 58(1):203-222. DOI: 10.1007/s10579-023-09647-2.

Carvalho, F., Rodrigues, R., Santos, G., et al. (2019). Avaliação da versão em português do LIWC lexicon 2015 com análise de sentimentos em redes sociais. In Anais do VIII Brazilian Workshop on Social Network Analysis and Mining, BraSNAM, pages 24-34. SBC. DOI: 10.5753/brasnam.2019.6545.

Chakrabarty, T., Padmakumar, V., Brahman, F., et al. (2024). Creativity Support in the Age of Large Language Models: An Empirical Study Involving Professional Writers. In Proceedings of the 16th Conference on Creativity & Cognition, pages 132-155. ACM. DOI: 10.1145/3635636.3656201.

Chakrabarty, T., Padmakumar, V., He, H., et al. (2023). Creative Natural Language Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, EMNLP, pages 34-40. ACL. DOI: 10.18653/v1/2023.emnlp-tutorial.6.

Cheng, J. (2020). Fleshing Out Models of Gender in English-Language Novels (1850–2000). Journal of Cultural Analytics, 5(1). DOI: 10.22148/001c.11652.

Ding, Y., Zhao, J., Jia, C., Wang, Y., et al. (2025). Gender Bias in Large Language Models across Multiple Languages: A Case Study of ChatGPT. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 552-579, Albuquerque, New Mexico. ACL. Available online [link].

Dong, C., Li, Y., Gong, H., et al. (2022). A Survey of Natural Language Generation. ACM Comput. Surv., 55(8):173:1-173:38. DOI: 10.1145/3554727.

Ethayarajh, K., Duvenaud, D., and Hirst, G. (2019). Understanding Undesirable Word Embedding Associations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1696-1705. ACL. DOI: 10.18653/v1/P19-1166.

Freitas, C. and Santos, D. (2023). Gender depiction in Portuguese. In Conference Reader: 2nd Annual Conference of Computational Literary Studies, CCLS, pages 4-30. DOI: 10.48694/jcls.3576.

Gallegos, I. O., Rossi, R. A., Barrow, J., et al. (2024). Bias and Fairness in Large Language Models: A Survey. Computational Linguistics, 50(3):1097-1179. DOI: 10.1162/coli_a_00524.

Ganguli, D., Askell, A., Schiefer, N., et al. (2023). The Capacity for Moral Self-Correction in Large Language Models. arXiv preprint. DOI: 10.48550/arXiv.2302.07459.

Gonen, H. and Goldberg, Y. (2019). Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. In Proceedings of the 2019 Workshop on Widening NLP, pages 60-63, Florence, Italy. ACL. DOI: 10.48550/arXiv.1903.03862.

Gonçalo Oliveira, H. (2024). Automatic generation of creative text in Portuguese: an overview. Language Resources and Evaluation, 58(1):7-41. DOI: 10.1007/s10579-023-09646-3.

Guo, Y., Guo, M., Su, J., et al. (2024). Bias in Large Language Models: Origin, Evaluation, and Mitigation. arXiv preprint. DOI: 10.48550/arXiv.2411.10915.

Hartmann, N. S. et al. (2017). Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks. In Simp. Bras. de Tecnologia da Informação e da Linguagem Humana, STIL, pages 122-131. SBC. DOI: 10.48550/arXiv.1708.06025.

Hu, T., Kyrychenko, Y., Rathje, S., et al. (2025). Generative language models exhibit social identity biases. Nature Computational Science, 5(1):65-75. DOI: 10.1038/s43588-024-00741-1.

Huang, H., Tang, T., Zhang, D., et al. (2023). Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting. arXiv preprint. DOI: 10.48550/arXiv.2305.07004.

Huang, T., Brahman, F., Shwartz, V., et al. (2021). Uncovering Implicit Gender Bias in Narratives through Commonsense Inference. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3866-3873. ACL. DOI: 10.18653/v1/2021.findings-emnlp.326.

Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). Mistral 7B. arXiv preprint. DOI: 10.48550/arXiv.2310.06825.

Lima, L. F. F. P. d. and Araujo, R. M. d. (2023). A call for a research agenda on fair NLP for Portuguese. In Simp. Bras. de Tecnologia da Informação e da Linguagem Humana, STIL, pages 187-192. SBC. DOI: 10.5753/stil.2023.233763.

Lucy, L. and Bamman, D. (2021). Gender and representation bias in GPT-3 generated stories. In Proceedings of the Third Workshop on Narrative Understanding, pages 48-55. ACL. DOI: 10.18653/v1/2021.nuse-1.5.

Luo, K., Mao, Y., Zhang, B., et al. (2024). Reflecting the Male Gaze: Quantifying Female Objectification in 19th and 20th Century Novels. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024, pages 13803-13812. ELRA and ICCL. DOI: 10.48550/arXiv.2403.17158.

Mikolov, T., Sutskever, I., Chen, K., et al. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc. Available online [link].

Mondshine, I., Paz-Argaman, T., and Tsarfaty, R. (2025). Beyond English: The impact of prompt translation strategies across languages and tasks in multilingual LLMs. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1331-1354, Albuquerque, New Mexico. ACL. DOI: 10.48550/arXiv.2502.09331.

Musacchio, E., Siciliani, L., Basile, P., et al. (2024). Adapting Large Language Models to Narrative Content. In Proceedings of the Workshop on Artificial Intelligence and Creativity co-located with the 27th European Conference on Artificial Intelligence, ECAI, pages 17-29. CEUR-WS. Available online [link].

Navigli, R., Conia, S., and Ross, B. (2023). Biases in Large Language Models: Origins, Inventory, and Discussion. ACM Journal of Data and Information Quality. DOI: 10.1145/3597307.

Oba, D., Kaneko, M., and Bollegala, D. (2024). In-Contextual Gender Bias Suppression for Large Language Models. In Findings of the Association for Computational Linguistics, EACL, pages 1722-1742. ACL. Available online [link].

Omrani Sabbaghi, S. and Caliskan, A. (2022). Measuring Gender Bias in Word Embeddings of Gendered Languages Requires Disentangling Grammatical Gender Signals. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, AIES, pages 518-531. ACM. DOI: 10.1145/3514094.3534176.

Peeperkorn, M., Kouwenhoven, T., Brown, D., et al. (2024). Is temperature the creativity parameter of large language models? arXiv preprint. DOI: 10.48550/arXiv.2405.00492.

Petrov, A., Malfa, E. L., Torr, P. H., et al. (2023). Language model tokenizers introduce unfairness between languages. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, pages 36963-36990. Curran Associates Inc. DOI: 10.48550/arXiv.2305.15425.

Qiu, H., Xu, Y., Qiu, M., et al. (2025). DR.GAP: Mitigating Bias in Large Language Models using Gender-Aware Prompting with Demonstration and Reasoning. arXiv preprint. DOI: 10.48550/arXiv.2502.11603.

Santana, B. S., Woloszyn, V., and Wives, L. K. (2018). Is there Gender bias and stereotype in Portuguese Word Embeddings? arXiv preprint. DOI: 10.48550/arXiv.1810.04528.

Santos, D. (2021). Portuguese Novel Corpus (ELTeC-por): April 2021 release. Zenodo. DOI: 10.5281/zenodo.4288235.

Santos, D., Freitas, C., and Bick, E. (2018). Obras: a fully annotated and partially human-revised corpus of Brazilian literary works in public domain. CorLex, September 24, 2018. Available online [link].

Shi, F., Suzgun, M., Freitag, M., et al. (2022). Language Models are Multilingual Chain-of-Thought Reasoners. arXiv preprint. DOI: 10.48550/arXiv.2210.03057.

Silva, M. O., Brandão, M. A., and Moro, M. M. (2025). Rewriting Stories with LLMs: Gender Bias in Generated Portuguese-language Narratives. Zenodo. DOI: 10.5281/zenodo.15756454.

Silva, M. and Moro, M. (2024). NLP Pipeline for Gender Bias Detection in Portuguese Literature. In Anais do LI Seminário Integrado de Software e Hardware, SEMISH, pages 169-180. SBC. DOI: 10.5753/semish.2024.2914.

Silva, M. O., de Melo-Gomes, L., and Moro, M. M. (2024). From words to gender: Quantitative analysis of body part descriptions within literature in Portuguese. Information Processing & Management, 61(3):103647. DOI: 10.1016/j.ipm.2024.103647.

Silva, M. O., Scofield, C., de Melo-Gomes, L., et al. (2022). Cross-collection dataset of public domain Portuguese-language works. Journal of Information and Data Management, 13(1). DOI: 10.5753/jidm.2022.2349.

Silva, M. O., Scofield, C., and Moro, M. M. (2021). PPORTAL: Public domain Portuguese-language literature Dataset. In Anais do III Dataset Showcase Workshop, pages 77-88. SBC. DOI: 10.5753/dsw.2021.17416.

Stanczak, K. and Augenstein, I. (2021). A Survey on Gender Bias in Natural Language Processing. arXiv preprint. DOI: 10.48550/arXiv.2112.14168.

Stuhler, O. (2024). The gender agency gap in fiction writing (1850 to 2010). Proceedings of the National Academy of Sciences, 121(29):e2319514121. DOI: 10.1073/pnas.2319514121.

Sun, T., Gaut, A., Tang, S., et al. (2019). Mitigating Gender Bias in Natural Language Processing: Literature Review. arXiv preprint. DOI: 10.48550/arXiv.1906.08976.

Taso, F. T. d. S., Reis, V. Q., and Martinez, F. V. (2023). Sexismo no Brasil: análise de um Word Embedding por meio de testes baseados em associação implícita. In Simp. Bras. de Tecnologia da Informação e da Linguagem Humana, STIL, pages 53-62. SBC. DOI: 10.5753/stil.2023.233845.

Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint. DOI: 10.48550/arXiv.2302.13971.

Xu, H., Zhang, Z., Wu, L., et al. (2019). The Cinderella Complex: Word embeddings reveal gender stereotypes in movies and books. PLOS ONE, 14(11):e0225385. DOI: 10.1371/journal.pone.0225385.

Yang, K., Tian, Y., Peng, N., et al. (2022). Re3: Generating Longer Stories With Recursive Reprompting and Revision. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 4393-4479. ACL. DOI: 10.18653/v1/2022.emnlp-main.296.

Zampieri, M. and Becker, M. (2013). Colonia: Corpus of historical Portuguese. ZSM Studien, Special Volume on Non-Standard Data Sources in Corpus-Based Research, 5. Available online [link].

Zhao, J., Wang, T., Yatskar, M., et al. (2018). Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15-20. ACL. DOI: 10.18653/v1/N18-2003.

Zhao, J., Wang, T., Yatskar, M., et al. (2019). Gender Bias in Contextualized Word Embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 629-634. ACL. DOI: 10.18653/v1/N19-1064.

Zhou, P. et al. (2019). Examining gender bias in languages with grammatical gender. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, pages 5276-5284. ACL. DOI: 10.18653/v1/D19-1531.

Published

2025-10-21

How to Cite

Silva, M. O., Brandão, M. A., & Moro, M. M. (2025). Rewriting Stories with LLMs: Gender Bias in Generated Portuguese-language Narratives. Journal of the Brazilian Computer Society, 31(1), 1120–1136. https://doi.org/10.5753/jbcs.2025.5799

Section

Articles