Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset

Authors

G. Bromonschenkel, A. L. Koerich, T. M. Paixão, and H. T. A. de Oliveira

DOI:

https://doi.org/10.5753/jbcs.2026.5857

Keywords:

Image Captioning, Transformers, Brazilian Portuguese, Vision Encoder-Decoder, Multi-Modal Evaluation, Attention Maps, CLIP-Score, Vision-Language Models

Abstract

Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. To mitigate this resource scarcity, several studies create datasets by automatically translating existing ones. This work addresses the gap by proposing a cross-native-translated evaluation of Transformer-based vision-and-language models for Brazilian Portuguese IC. We use a version of Flickr30K whose captions were manually created by native Brazilian Portuguese speakers and compare it to a version whose captions were automatically translated from English to Portuguese. The experiments include a cross-context approach, in which models trained on one dataset are tested on the other to assess the impact of translation. Additionally, we use attention maps to interpret model inference and the CLIP-Score metric to evaluate image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms the other models, demonstrating strong generalization across datasets. ViTucano, a vision-language model pre-trained for Brazilian Portuguese, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) on traditional text-based evaluation metrics, while the GPT-4 models achieve the highest CLIP-Score, indicating stronger image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies.
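
The abstract evaluates image-description alignment with CLIP-Score. As a reference point, the following is a minimal sketch of that metric as defined by Hessel et al. (2021), w * max(cos(c, v), 0) with w = 2.5, computed with the Hugging Face transformers library. The openai/clip-vit-base-patch32 checkpoint and the clip_score helper are illustrative assumptions, not the authors' setup; scoring Portuguese captions would in practice require a multilingual or Portuguese-adapted CLIP such as CAPIVARA (cited below).

    # Hedged sketch of CLIP-Score (Hessel et al., 2021): w * max(cos(c, v), 0), w = 2.5.
    # Assumes Hugging Face transformers; the English CLIP checkpoint is a stand-in,
    # not the model used in the paper.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    CHECKPOINT = "openai/clip-vit-base-patch32"  # illustrative choice
    model = CLIPModel.from_pretrained(CHECKPOINT).eval()
    processor = CLIPProcessor.from_pretrained(CHECKPOINT)

    def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
        """Reference-free alignment score between one image and one caption."""
        inputs = processor(text=[caption], images=image,
                           return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            out = model(**inputs)
        # L2-normalize the projected embeddings, then take their cosine similarity.
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        cosine = (img @ txt.T).item()
        return w * max(cosine, 0.0)

    # Usage: clip_score(Image.open("photo.jpg"), "um cachorro correndo na praia")

In corpus-level evaluation, this per-pair score is averaged over all generated caption-image pairs of the test set.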

References

Abdelaal, A., ELshafey, N. F., Abdalah, N. W., Shaaban, N. H., Okasha, S. A., Yasser, T., Fathi, M., Fouad, K. M., and Abdelbaky, I. (2024). Image captioning using vision encoder decoder model. In 2024 International Conference on Machine Intelligence and Smart Innovation (ICMISI), pages 101-106. IEEE. DOI: 10.1109/ICMISI61517.2024.10580628.

Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan, A. A., Bach, N., Bahree, A., Bakhtiari, A., Bao, J., Behl, H., et al. (2024). Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219. DOI: 10.48550/arXiv.2404.14219.

Barbosa Junior, A. F. (2024). distilbert-portuguese-cased (revision df1fa7a). DOI: 10.57967/hf/3041.

Banerjee, S. and Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72. Available at: [link].

Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., et al. (2024). PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726. DOI: 10.48550/arXiv.2407.07726.

Bromonschenkel, G., Oliveira, H., and Paixão, T. M. (2024). A comparative evaluation of transformer-based vision encoder-decoder models for Brazilian Portuguese image captioning. In 2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pages 1-6. IEEE. DOI: 10.1109/SIBGRAPI62404.2024.10716325.

Chan, D., Petryk, S., Gonzalez, J., Darrell, T., and Canny, J. (2023). CLAIR: Evaluating image captions with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13638-13646. DOI: 10.18653/v1/2023.emnlp-main.841.

Corrêa, N. K., Sen, A., Falk, S., and Fatimah, S. (2024). ViTucano: A Portuguese Vision Assistant. Available at: [link].

de Alencar, R. S., Castañeda, W. A. C., and Amadeus, M. (2024). Image captioning for Brazilian Portuguese using GRIT model. arXiv preprint arXiv:2402.05106. DOI: 10.48550/arXiv.2402.05106.

dos Santos, G. O., Colombini, E. L., and Avila, S. (2022). #PraCegoVer: A large dataset for image captioning in Portuguese. Data, 7(2). DOI: 10.3390/data7020013.

dos Santos, G. O., Moreira, D. A. B., Ferreira, A. I., Silva, J., Pereira, L., Bueno, P., Sousa, T., Maia, H., da Silva, N., Colombini, E., et al. (2023). CAPIVARA: Cost-efficient approach for improving multilingual CLIP performance on low-resource languages. In Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), pages 184-207. DOI: 10.18653/v1/2023.mrl-1.15.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. DOI: 10.48550/arXiv.2010.11929.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. DOI: 10.48550/arXiv.2407.21783.

Ghandi, T., Pourreza, H., and Mahyar, H. (2023). Deep learning approaches on image captioning: A review. ACM Computing Surveys, 56(3):1-39. DOI: 10.1145/3617592.

Gondim, J., Claro, D. B., and Souza, M. (2022). Towards image captioning for the Portuguese language: Evaluation on a translated dataset. In ICEIS (1), pages 384-393. DOI: 10.5220/001108000000317.

Guillou, P. (2020). GPorTuguese-2 (Portuguese GPT-2 small): a language model for Portuguese text generation (and more NLP tasks...). Available at: [link].

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. (2021). CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514-7528. DOI: 10.18653/v1/2021.emnlp-main.595.

Hirota, Y., Nakashima, Y., and Garcia, N. (2023). Model-agnostic gender debiased image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15191-15200. DOI: 10.1109/CVPR52729.2023.01458.

Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. (2024). GPT-4o system card. arXiv preprint arXiv:2410.21276. DOI: 10.48550/arXiv.2410.21276.

Ishan, T. I., Al Noman, A., Rokib, R., Masum, M. I., Ahmed, S., and Shah, F. M. (2023). Bengali image captioning using vision encoder-decoder model. In 2023 26th International Conference on Computer and Information Technology (ICCIT), pages 1-6. DOI: 10.1109/ICCIT60459.2023.10441125.

Jnaini, A., Shirazi, H., and Homayouni, H. (2024). Synergy of GPT-3 summarization and vision-encoder-decoder for chest X-ray captioning. In 2024 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), pages 476-482. IEEE. DOI: 10.1109/CCECE59415.2024.10667261.

Kilickaya, M., Erdem, A., Ikizler-Cinbis, N., and Erdem, E. (2017). Re-evaluating automatic metrics for image captioning. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 199-209. DOI: 10.18653/v1/e17-1019.

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81. Available at: [link].

Liu, W., Chen, S., Guo, L., Zhu, X., and Liu, J. (2021a). CPTR: Full transformer network for image captioning. arXiv preprint arXiv:2101.10804. DOI: 10.48550/arXiv.2101.10804.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021b). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012-10022. DOI: 10.1109/ICCV48922.2021.00986.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. DOI: 10.3115/1073083.1073135.

Sharma, H. and Padha, D. (2023). A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues. Artificial Intelligence Review, pages 1-43. DOI: 10.1007/s10462-023-10488-2.

Silva Barbon, R. and Akabane, A. T. (2022). Towards transfer learning techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for automatic text classification from different languages: a case study. Sensors, 22(21):8184. DOI: 10.3390/s22218184.

Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20-23, 2020, Proceedings, Part I 9, pages 403-417. Springer. DOI: 10.1007/978-3-030-61377-8_28.

Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., and Cucchiara, R. (2022). From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):539-559. DOI: 10.1109/TPAMI.2022.3148210.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347-10357. PMLR.

Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566-4575. DOI: 10.1109/CVPR.2015.7299087.

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156-3164. DOI: 10.1109/CVPR.2015.7298935.

Viridiano, M., Lorenzi, A., Torrent, T. T., Matos, E. E., Pagano, A. S., Sigiliano, N. S., Gamonal, M., de Andrade Abreu, H., Dutra, L. V., Samagaio, M., et al. (2024). Framed Multi30K: A frame-based multimodal-multilingual dataset. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7438-7449. DOI: 10.63317/2urtgtf4vshk.

Wang, N., Xie, J., Wu, J., Jia, M., and Li, L. (2023). Controllable image captioning via prompting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2617-2625. DOI: 10.1609/aaai.v37i2.25360.

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675. DOI: 10.48550/arXiv.1904.09675.

Published

2026-04-15

How to Cite

Bromonschenkel, G., Koerich, A. L., Paixão, T. M., & de Oliveira, H. T. A. (2026). Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset. Journal of the Brazilian Computer Society, 32(1), 663–676. https://doi.org/10.5753/jbcs.2026.5857

Issue

Vol. 32 No. 1 (2026)

Section

Regular Issue