A Hybrid Approach to Optical Music Recognition With Object Detection and Multimodal LLMs
DOI:
https://doi.org/10.5753/reic.2026.6891
Keywords:
Optical music recognition, Deep learning, Object detection, Multimodal LLMs, YOLO, Music transcription
Abstract
This research introduces a hybrid methodology for Optical Music Recognition (OMR) that combines multimodal large language models (LLMs) with contemporary object detection. Gemini 2.0 Flash handles clef identification, drawing on its visual and contextual interpretation capabilities, while YOLOv8 and YOLOv11 handle pitch and rhythm detection. This division of tasks reduces the complexity of the object detection problem, letting the YOLO models concentrate on precise localization and classification of musical symbols. The methodology showed promising results on digital monophonic scores, with YOLOv11 achieving a mAP50 of 0.995 in the pitch detection network when clef detection is delegated to the LLM.
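One way to picture why delegating the clef simplifies detection: if the detector only has to report a clef-independent staff position for each note head, the absolute pitch can be resolved afterwards from the clef label. The sketch below illustrates that idea under stated assumptions. The staff-position encoding, the `pitch_from_position` and `transcribe` helpers, and the detection tuple layout are all hypothetical and are not taken from the paper, which does not describe its class scheme here.

```python
# Illustrative sketch only: names and encodings are assumptions, not the authors' method.

NOTE_NAMES = ["C", "D", "E", "F", "G", "A", "B"]

# Pitch of the bottom staff line for each clef, as (note letter, octave).
BOTTOM_LINE = {
    "treble": ("E", 4),
    "bass": ("G", 2),
    "alto": ("F", 3),
}


def pitch_from_position(clef: str, position: int) -> str:
    """Map a clef-independent staff position to a pitch name.

    position 0 is the bottom line, 1 the first space, 2 the second line,
    and so on; negative values lie below the staff.
    """
    letter, octave = BOTTOM_LINE[clef]
    # Diatonic step index counted upward from C0.
    index = octave * 7 + NOTE_NAMES.index(letter) + position
    return f"{NOTE_NAMES[index % 7]}{index // 7}"


def transcribe(clef: str, detections: list[tuple[float, int, str]]) -> list[tuple[str, str]]:
    """Combine a clef label with detections into (pitch, duration) pairs.

    Each detection is a hypothetical (x_center, staff_position, duration)
    tuple; notes are read left to right by x_center.
    """
    ordered = sorted(detections, key=lambda d: d[0])
    return [(pitch_from_position(clef, pos), dur) for _, pos, dur in ordered]


if __name__ == "__main__":
    # The clef would come from the LLM step, the tuples from the detectors.
    notes = [(120.0, 2, "quarter"), (40.0, -2, "half")]
    print(transcribe("treble", notes))
```

Under this encoding the same detection set yields different pitches for different clefs, so the detector's label space does not need to grow with the number of clefs.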
License
Copyright (c) 2026 The authors

This work is licensed under a Creative Commons Attribution 4.0 International License.
