A bilingual analysis of multi-head attention mechanism for image captioning based on morphosyntactic information

Authors

Gondim, J., Claro, D. B., and Souza, M.

DOI:

https://doi.org/10.5753/jbcs.2025.5792

Keywords:

Image Captioning, Attention, Morphosyntax

Abstract

Image Captioning is the task of describing the information conveyed by an image, i.e., its visual content, in natural language. Most current research makes use of the encoder-decoder architecture to create relations between images (the input) and text (the output). These relations originate from the attention mechanisms present in the Transformer model and can be leveraged to understand how the image-text relationship is encoded during training and inference. This work investigates the hypothesis that the attention mechanism behaves analogously for words that share morphosyntactic labels within texts. To this end, the attention weights for each predicted word, interpreted as the "focus" placed on the image at each step, are gathered, averaged, and inspected. The analysis is performed on one model trained with English captions and another trained with Portuguese captions, thereby comparing two languages with different morphological organization. Our results show that words with the same function in the sentence, e.g., those prone to similar inflections, usually share the same focal point in the image. Our work sheds light on the importance of linguistic studies for the vision-language area, reinforcing the benefits of including language-aware knowledge during training.
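To make the kind of analysis described above concrete, the minimal Python/NumPy sketch below (not the authors' code; all names, shapes, and tags are illustrative assumptions) groups per-word cross-attention maps by part-of-speech label and averages them over heads and occurrences. In the actual study, the attention tensors would come from the trained Transformer decoder rather than from random placeholders.

# Minimal sketch: average decoder cross-attention maps per POS tag,
# assuming one (num_heads, num_image_regions) array per generated word.
import numpy as np
from collections import defaultdict

def average_attention_by_pos(attn_per_token, pos_tags):
    """attn_per_token: list of arrays of shape (num_heads, num_regions).
    pos_tags: POS labels aligned with the generated words.
    Returns a dict mapping each POS label to its mean attention map."""
    grouped = defaultdict(list)
    for attn, tag in zip(attn_per_token, pos_tags):
        # Collapse the heads first, keeping the per-region distribution.
        grouped[tag].append(attn.mean(axis=0))
    return {tag: np.mean(np.stack(maps), axis=0) for tag, maps in grouped.items()}

# Toy usage: random weights stand in for a real decoder's attention.
rng = np.random.default_rng(0)
caption = ["a", "dog", "runs", "on", "grass"]
tags = ["DET", "NOUN", "VERB", "ADP", "NOUN"]
attn = [rng.random((8, 64)) for _ in caption]   # 8 heads, 64 image regions
focus_by_pos = average_attention_by_pos(attn, tags)
print({tag: m.shape for tag, m in focus_by_pos.items()})

Under this sketch, comparing the resulting per-POS maps (e.g., NOUN vs. VERB) between an English-trained and a Portuguese-trained model would mirror the bilingual comparison described in the abstract.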


References

Aker, A. and Gaizauskas, R. (2010). Generating image descriptions using dependency relational patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, page 1250–1258, USA. Association for Computational Linguistics. Available online [link].

Al-Qatf, M., Hawbani, A., Wang, X., Abdusallam, A., Zhao, L., Alsamhi, S. H., and Curry, E. (2024). NPoSC-A3: A novel part of speech clues-aware adaptive attention mechanism for image captioning. Engineering Applications of Artificial Intelligence, 131:107732. DOI: 10.1016/j.engappai.2023.107732.

Banerjee, S. and Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, Ann Arbor, Michigan. Association for Computational Linguistics. Available online [link].

Barry, A. M. S. (1997). Visual intelligence: Perception, image, and manipulation in visual communication. State University of New York Press. Book.

Chen, G., Hou, L., Chen, Y., Dai, W., Shang, L., Jiang, X., Liu, Q., Pan, J., and Wang, W. (2023). mCLIP: Multilingual CLIP via Cross-lingual Transfer. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13028-13043, Toronto, Canada. Association for Computational Linguistics. DOI: 10.18653/v1/2023.acl-long.728.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv. DOI: 10.48550/ARXIV.1406.1078.

Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). What does BERT look at? an analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276-286, Florence, Italy. Association for Computational Linguistics. DOI: 10.18653/v1/W19-4828.

Cornia, M., Baraldi, L., and Cucchiara, R. (2022). Explaining transformer-based image captioning models: An empirical analysis. AI Commun., 35(2):111–129. DOI: 10.3233/AIC-210172.

Cornia, M., Stefanini, M., Baraldi, L., and Cucchiara, R. (2020). Meshed-Memory Transformer for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. DOI: 10.1109/cvpr42600.2020.01059.

Deshpande, A., Aneja, J., Wang, L., Schwing, A. G., and Forsyth, D. (2019). Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10687-10696, Los Alamitos, CA, USA. IEEE Computer Society. DOI: 10.1109/CVPR.2019.01095.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. DOI: 10.48550/arxiv.1810.04805.

dos Santos, G. O., Colombini, E. L., and Avila, S. (2022). #PraCegoVer: A large dataset for image captioning in Portuguese. Data, 7(2). DOI: 10.3390/data7020013.

Farhadi, A., Hejrati, M., Sadeghi, A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. (2010). Every picture tells a story: Generating sentences from images. In Proceedings of the European Conference on Computer Vision (ECCV'10), volume 6314, pages 15-29. DOI: 10.1007/978-3-642-15561-1_2.

Gatt, A. and Krahmer, E. (2018). Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J. Artif. Int. Res., 61(1):65–170. DOI: 10.1613/jair.5477.

Gautam, T. (2021). Implementation of attention mechanism for caption generation on transformers using tensorflow. Website. Available online [link].

Gondim, J., Claro, D. B., and Souza, M. (2022). Towards image captioning for the portuguese language: Evaluation on a translated dataset. In Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS, pages 384-393. INSTICC, SciTePress. DOI: 10.5220/0011080000003179.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(25):723-773. Available online [link].

He, X., Shi, B., Bai, X., Xia, G.-S., Zhang, Z., and Dong, W. (2017). Image caption generation with part of speech guidance. Pattern Recognition Letters, 119. DOI: 10.1016/j.patrec.2017.10.018.

Hodosh, M., Young, P., and Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Int. Res., 47(1):853–899. DOI: 10.1613/jair.3994.

Hossain, M. Z., Sohel, F., Shiratuddin, M. F., and Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):1-36. DOI: 10.1145/3295748.

Hu, X., Gan, Z., Wang, J., Yang, Z., Liu, Z., Lu, Y., and Wang, L. (2021). Scaling up vision-language pre-training for image captioning. arXiv. DOI: 10.48550/ARXIV.2111.12233.

Huang, L., Wang, W., Chen, J., and Wei, X.-Y. (2019). Attention on attention for image captioning. arXiv. DOI: 10.48550/ARXIV.1908.06954.

Indurkhya, N. and Damerau, F. J. (2010). Handbook of Natural Language Processing. Chapman & Hall/CRC, 2nd edition. DOI: 10.1201/9781420085938.

Karpathy, A. and Fei-Fei, L. (2014). Deep visual-semantic alignments for generating image descriptions. arXiv. DOI: 10.48550/ARXIV.1412.2306.

Karpathy, A., Joulin, A., and Li, F.-F. (2014). Deep fragment embeddings for bidirectional image sentence mapping. arXiv. DOI: 10.48550/ARXIV.1406.5679.

Li, C., Xu, H., Tian, J., Wang, W., Yan, M., Bi, B., Ye, J., Chen, H., Xu, G., Cao, Z., et al. (2022). mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005. DOI: 10.18653/v1/2022.emnlp-main.488.

Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., and Gao, J. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. arXiv. DOI: 10.48550/ARXIV.2004.06165.

Liu, F., Bugliarello, E., Ponti, E. M., Reddy, S., Collier, N., and Elliott, D. (2021). Visually grounded reasoning across languages and cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467-10485. DOI: 10.18653/v1/2021.emnlp-main.818.

Lu, J., Xiong, C., Parikh, D., and Socher, R. (2017). Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/cvpr.2017.345.

Moriarty, S. E. (2002). The symbiotics of semiotics and visual communication. Journal of Visual Literacy, 22(1):19-28. DOI: 10.1080/23796529.2002.11674579.

Nain, A. K. (2021). Image captioning. Website. Available online [link].

Ordonez, V., Kulkarni, G., and Berg, T. L. (2011). Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS). Available online [link].

Pan, J.-Y., Yang, H.-J., Duygulu, P., and Faloutsos, C. (2004). Automatic image captioning. In 2004 IEEE International Conference on Multimedia and Expo (ICME), volume 3, pages 1987-1990. DOI: 10.1109/ICME.2004.1394652.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, page 311–318, USA. Association for Computational Linguistics. DOI: 10.3115/1073083.1073135.

Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. (2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2641-2649. DOI: 10.1109/ICCV.2015.303.

Reale-Nosei, G., Amador-Domínguez, E., and Serrano, E. (2024). From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation. Medical Image Analysis, 97:103264. DOI: 10.1016/j.media.2024.103264.

Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (2016). Self-critical sequence training for image captioning. arXiv. DOI: 10.48550/ARXIV.1612.00563.

Rosa, G. M., Bonifacio, L. H., Souza, L. R. d., Lotufo, R., and Nogueira, R. (2021). A cost-benefit analysis of cross-lingual transfer methods. arXiv. DOI: 10.48550/arXiv.2105.06813.

Salgotra, G., Abrol, P., and Selwal, A. (2024). A Survey on Automatic Image Captioning Approaches: Contemporary Trends and Future Perspectives. Archives of Computational Methods in Engineering. DOI: 10.1007/s11831-024-10190-8.

Santos, G. O. d., Moreira, D. A. B., Ferreira, A. I., Silva, J., Pereira, L., Bueno, P., Sousa, T., Maia, H., Silva, N. D., Colombini, E., Pedrini, H., and Avila, S. (2023). CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages. arXiv. DOI: 10.48550/arXiv.2310.13683.

Sharif, N., Nadeem, U., Shah, S., Bennamoun, M., and Liu, W. (2020). Vision to Language: Methods, Metrics and Datasets, pages 9-62. Springer International Publishing. DOI: 10.1007/978-3-030-49724-8_2.

Sharma, H. and Padha, D. (2023). A comprehensive survey on image captioning: From handcrafted to deep learning-based techniques, a taxonomy and open research issues. Artificial Intelligence Review, 56(11):13619-13661. DOI: 10.1007/s10462-023-10488-2.

Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., and Cucchiara, R. (2023). From Show to Tell: A Survey on Deep Learning-Based Image Captioning. IEEE Transactions on Pattern Analysis & Machine Intelligence, 45(1):539-559. DOI: 10.1109/TPAMI.2022.3148210.

Tan, M. and Le, Q. V. (2021). EfficientNetV2: Smaller models and faster training. arXiv. DOI: 10.48550/arxiv.2104.00298.

Tensorflow (2022). Image captioning with visual attention. Website. Available online [link].

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, page 6000–6010, Red Hook, NY, USA. Curran Associates Inc. DOI: 10.48550/arxiv.1706.03762.

Vig, J. (2019). A multiscale visualization of attention in the transformer model. arXiv. DOI: 10.48550/ARXIV.1906.05714.

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/cvpr.2015.7298935.

Wang, D., Liu, B., Zhou, Y., Liu, M., Liu, P., and Yao, R. (2022a). Separate syntax and semantics: Part-of-speech-guided transformer for image captioning. Applied Sciences, 12(23). DOI: 10.3390/app122311875.

Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. (2022b). Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. CoRR, abs/2202.03052. Available online [link].

Web Accessibility Initiative (2022). Introduction to web accessibility. Available online [link].

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2016). Show, attend and tell: Neural image caption generation with visual attention. Available online [link].

Yang, Y., Teo, C., Daumé III, H., and Aloimonos, Y. (2011). Corpus-guided sentence generation of natural images. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 444-454, Edinburgh, Scotland, UK. Association for Computational Linguistics. Available online [link].

Yao, B. Z., Yang, X., Lin, L., Lee, M. W., and Zhu, S.-C. (2010). I2t: Image parsing to text description. Proceedings of the IEEE, 98(8):1485-1508. DOI: 10.1109/JPROC.2010.2050411.

Young, P., Lai, A., Hodosh, M., and Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67-78. DOI: 10.1162/tacl_a_00166.

Zhang, J., Mei, K., Zheng, Y., and Fan, J. (2021a). Integrating part of speech guidance for image captioning. IEEE Transactions on Multimedia, 23:92-104. DOI: 10.1109/TMM.2020.2976552.

Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. (2021b). VinVL: Revisiting visual representations in vision-language models. arXiv. DOI: 10.48550/ARXIV.2101.00529.

Published

2025-10-17

How to Cite

Gondim, J., Claro, D. B., & Souza, M. (2025). A bilingual analysis of multi-head attention mechanism for image captioning based on morphosyntactic information. Journal of the Brazilian Computer Society, 31(1). https://doi.org/10.5753/jbcs.2025.5792

Issue

Vol. 31 No. 1 (2025)

Section

Articles