OneTrack-M: A Multitask Approach for Transformer-Based MOT Models

Authors

de Araujo, L. C. S. and Figueiredo, C. M. S.
DOI:

https://doi.org/10.5753/jbcs.2026.4636

Keywords:

MOT, Multiple Object Tracking, Transformers, Fast Tracking, End2End, Unified Model

Abstract

Multi-Object Tracking (MOT) is a critical problem in computer vision, essential for understanding how objects move and interact in videos. The field faces significant challenges, such as occlusions and complex environmental dynamics, that impact model accuracy and efficiency. While traditional approaches have relied on Convolutional Neural Networks (CNNs), the introduction of transformers has brought substantial advancements. This work introduces OneTrack-M, a transformer-based MOT model that improves both computational efficiency and tracking accuracy. Our approach adopts a transformer encoder as the model backbone, significantly reducing processing time and increasing inference speed. Additionally, we employ novel data preprocessing and multitask training techniques to address occlusions and diverse tracking objectives within a single set of weights. Experimental results demonstrate that OneTrack-M achieves at least 25% faster inference than state-of-the-art models in the literature while maintaining or improving tracking accuracy metrics. These improvements highlight the potential of the proposed solution for real-time applications such as autonomous vehicles, surveillance systems, and robotics, where rapid response is crucial for system effectiveness.
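As context for the "tracking accuracy metrics" mentioned in the abstract, the CLEAR MOT accuracy score (MOTA), introduced by Bernardin and Stiefelhagen (2008) and cited below, is a simple error ratio over the whole sequence. A minimal sketch (the function name and input layout are illustrative, not from the paper):

```python
def mota(frames):
    """CLEAR MOT accuracy (Bernardin and Stiefelhagen, 2008).

    frames: iterable of per-frame count tuples
        (false_negatives, false_positives, id_switches, ground_truth_objects).
    Returns 1 - (total errors) / (total ground-truth objects); the score
    can be negative when errors outnumber the ground-truth objects.
    """
    fn = sum(f[0] for f in frames)
    fp = sum(f[1] for f in frames)
    idsw = sum(f[2] for f in frames)
    gt = sum(f[3] for f in frames)
    return 1.0 - (fn + fp + idsw) / gt


# Example: two frames with 20 ground-truth objects each,
# 3 misses, 1 false positive, and 1 identity switch in total.
score = mota([(2, 1, 0, 20), (1, 0, 1, 20)])  # 1 - 5/40 = 0.875
```

Note that MOTA aggregates counts over all frames before dividing, rather than averaging per-frame ratios, which is why a single bad frame cannot dominate a long sequence.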


References

Aharon, N., Orfaig, R., and Bobrovsky, B.-Z. (2022). BoT-SORT: Robust associations multi-pedestrian tracking. DOI: 10.48550/arxiv.2206.14651.

Bashar, M., Islam, S., Hussain, K. K., Hasan, M. B., Rahman, A. B. M. A., and Kabir, M. H. (2022). Multiple object tracking in recent times: A literature review.

Bernardin, K. and Stiefelhagen, R. (2008). Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008. DOI: 10.1155/2008/246309.

Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B. (2016). Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464-3468. DOI: 10.1109/ICIP.2016.7533003.

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. DOI: 10.1007/978-3-030-58452-8_13.

Caruana, R. (1997). Multitask learning. Machine Learning, 28(1):41-75. DOI: 10.1007/978-1-4615-5529-2_5.

Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., and Leal-Taixé, L. (2020). MOT20: A benchmark for multi object tracking in crowded scenes. DOI: 10.48550/arxiv.2003.09003.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. DOI: 10.48550/arxiv.2010.11929.

Du, Y., Zhao, Z., Song, Y., Zhao, Y., Su, F., Gong, T., and Meng, H. (2023). StrongSORT: Make DeepSORT great again. DOI: 10.1109/tmm.2023.3240881.

Jaward, M., Mihaylova, L., Canagarajah, N., and Bull, D. (2006). Multiple object tracking using particle filters. In 2006 IEEE Aerospace Conference, 8 pp. DOI: 10.1109/AERO.2006.1655926.

Luiten, J. and Hoffhues, A. (2020). TrackEval. Available at: [link].

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021a). Swin transformer: Hierarchical vision transformer using shifted windows. DOI: 10.1109/iccv48922.2021.00986.

Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and Hu, H. (2021b). Video swin transformer. DOI: 10.1109/cvpr52688.2022.00320.

Luiten, J., Ošep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., and Leibe, B. (2020). HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129(2):548-578. DOI: 10.1007/s11263-020-01375-2.

Luo, R., Song, Z., Ma, L., Wei, J., Yang, W., and Yang, M. (2024). DiffusionTrack: Diffusion model for multi-object tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5):3991-3999. DOI: 10.1609/aaai.v38i5.28192.

Meinhardt, T., Kirillov, A., Leal-Taixe, L., and Feichtenhofer, C. (2022). TrackFormer: Multi-object tracking with transformers. DOI: 10.1109/cvpr52688.2022.00864.

Milan, A., Leal-Taixe, L., Reid, I., Roth, S., and Schindler, K. (2016). MOT16: A benchmark for multi-object tracking. DOI: 10.48550/arxiv.1603.00831.

Mostafa, R., Baraka, H., and Bayoumi, A. (2022). LMOT: Efficient light-weight detection and tracking in crowds. IEEE Access, 10:83085-83095. DOI: 10.1109/ACCESS.2022.3197157.

Murad, A. and Pyun, J.-Y. (2017). Deep recurrent neural networks for human activity recognition. Sensors, 17:2556. DOI: 10.3390/s17112556.

Ristani, E., Solera, F., Zou, R., Cucchiara, R., and Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In Hua, G. and Jégou, H., editors, Computer Vision - ECCV 2016 Workshops, pages 17-35, Cham. Springer International Publishing. DOI: 10.1007/978-3-319-48881-3_2.

Su, H., Chen, Y., Tong, S., and Zhao, D. (2019). Real-time multiple object tracking based on optical flow. In 2019 9th International Conference on Information Science and Technology (ICIST), pages 350-356. DOI: 10.1109/ICIST.2019.8836764.

Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., and Luo, P. (2021). TransTrack: Multiple object tracking with transformer. DOI: 10.48550/arxiv.2012.15460.

Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. DOI: 10.1109/cvpr52729.2023.00721.

Wojke, N., Bewley, A., and Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. DOI: 10.1109/icip.2017.8296962.

Yang, M., Han, G., Yan, B., Zhang, W., Qi, J., Lu, H., and Wang, D. (2024). Hybrid-SORT: Weak cues matter for online multi-object tracking. DOI: 10.1609/aaai.v38i7.28471.

Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., and Wei, Y. (2022). MOTR: End-to-end multiple-object tracking with transformer. DOI: 10.1007/978-3-031-19812-0_38.

Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., and Wang, X. (2022). ByteTrack: Multi-object tracking by associating every detection box. DOI: 10.1007/978-3-031-20047-2_1.

Zhang, Y., Wang, C., Wang, X., Zeng, W., and Liu, W. (2021). FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129:3069-3087. DOI: 10.1007/s11263-021-01513-4.

Zhang, Y., Wang, T., and Zhang, X. (2023). MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. DOI: 10.1109/cvpr52729.2023.02112.

Zhang, Y. and Yang, Q. (2017). A survey on multi-task learning. arXiv preprint arXiv:1707.08114. DOI: 10.1109/tkde.2021.3070203.

Zheng, Y. (2024). CuesTrack: Multi-object tracking based on weak clues and trajectory prediction. In 2024 5th International Conference on Machine Learning and Computer Application (ICMLCA), pages 321-325. DOI: 10.1109/ICMLCA63499.2024.10754464.

Zhou, X., Koltun, V., and Krähenbühl, P. (2020). Tracking objects as points. ECCV. DOI: 10.1007/978-3-030-58548-8_28.

Zhou, X., Wang, D., and Krähenbühl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850. DOI: 10.48550/arXiv.1904.07850.

Published

2026-03-27

How to Cite

de Araujo, L. C. S., & Figueiredo, C. M. S. (2026). OneTrack-M: A Multitask Approach for Transformer-Based MOT Models. Journal of the Brazilian Computer Society, 32(1), 555–567. https://doi.org/10.5753/jbcs.2026.4636

Issue

Section

Regular Issue