A Hybrid Approach to Optical Music Recognition With Object Detection and Multimodal LLMs

Gustavo Henrique Romão; Hygor Santiago Lara; Jesuliana Nascimento Ulysses; Jorge Nei Brito

doi:10.5753/reic.2026.6891

Authors

Gustavo Henrique Romão, Universidade Federal de São João del Rei
Hygor Santiago Lara, Universidade Estadual de Campinas
Jesuliana Nascimento Ulysses, Universidade Federal de São João del Rei
Jorge Nei Brito, Universidade Federal de São João del Rei

Keywords:

Optical music recognition, Deep learning, Object detection, Multimodal LLMs, YOLO, Music transcription

Abstract

This work introduces a hybrid method for Optical Music Recognition (OMR) that combines multimodal large language models (LLMs) with modern object detectors. Gemini 2.0 Flash identifies the clef, drawing on its visual and contextual interpretation capabilities, while YOLOv8 and YOLOv11 handle pitch and rhythm detection. Dividing the task this way reduces the complexity of the object detection problem, letting the YOLO models concentrate on precise localization and classification of musical symbols. The method showed promising results on digital monophonic scores: YOLOv11 reached a mAP50 of 0.995 in the pitch detection network when clef detection was delegated to the LLM.
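
To illustrate why separating clef identification from symbol detection can simplify the detector, the following is a minimal sketch and not the authors' implementation. It assumes, hypothetically, that the detector reports each notehead as a clef-independent staff position (0 for the bottom line, 1 for the first space, and so on, with negative values below the staff) and that the clef arrives as a plain text label from a separate model. The function name `pitch_from_position` and the clef labels are illustrative choices; the abstract does not describe the actual label scheme.

```python
# Hypothetical sketch: resolve pitch names from a clef label plus
# clef-independent staff positions. Not the method used in the paper.

LETTERS = "CDEFGAB"  # octave numbers change at C in scientific pitch notation

# Note sitting on the bottom staff line for each clef (standard notation).
BOTTOM_LINE = {
    "treble": ("E", 4),
    "bass": ("G", 2),
    "alto": ("F", 3),
}


def pitch_from_position(clef: str, staff_position: int) -> str:
    """Return the natural pitch name for a staff position under a clef.

    staff_position counts diatonic steps from the bottom staff line:
    0 = bottom line, 1 = first space, 2 = second line, and so on.
    Accidentals and key signatures are ignored in this sketch.
    """
    letter, octave = BOTTOM_LINE[clef]
    # Diatonic index: seven steps per octave, counted from C0.
    index = octave * 7 + LETTERS.index(letter) + staff_position
    return f"{LETTERS[index % 7]}{index // 7}"


if __name__ == "__main__":
    # The same detected positions read differently under each clef.
    positions = [-2, 0, 4, 8]
    for clef in ("treble", "bass", "alto"):
        print(clef, [pitch_from_position(clef, p) for p in positions])
```

Under this assumption the detector only needs one class per staff position regardless of clef, and the clef label decides how those positions map to pitch names.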
