OneTrack - An End2End approach to enhance MOT with Transformers
DOI:
https://doi.org/10.5753/jisa.2024.3914
Keywords:
Multiple Object Tracking, transformers, efficient tracking, joint detection and tracking, deep learning
Abstract
This paper introduces OneTrack, an innovative end-to-end transformer-based model for Multiple Object Tracking (MOT), focusing on enhancing efficiency without significantly compromising accuracy. Addressing the challenges inherent in MOT, such as occlusions, varied object sizes, and motion prediction, OneTrack leverages the power of vision transformers and attention layers, optimizing them for real-time applications. Utilizing a unique Object Sequence Patch Input and a Vision Transformer Encoder, the model simplifies the standard transformer approach by employing only the encoder component, significantly reducing computational costs. This approach is validated using the MOT17 dataset, a benchmark in the field, ensuring a comprehensive evaluation against established metrics like MOTA, HOTA, and IDF1. The experimental results demonstrate OneTrack's capability to outperform other transformer-based models in inference speed, with a marginal trade-off in accuracy metrics. The model's inherent design limitations, such as a maximum of 100 objects per window, are adjustable to suit specific applications, offering flexibility in various scenarios. The conclusion highlights the model's potential as a lightweight solution for MOT tasks, suggesting future work directions that include exploring alternative data representations and encoders, and developing a dedicated loss function to further enhance detection and tracking capabilities. OneTrack presents a promising step towards efficient and effective MOT solutions, catering to the demands of real-time applications.
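For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how MOTA and IDF1 are conventionally computed from aggregate error counts. The function names and the example counts are illustrative assumptions and are not taken from the paper; HOTA is omitted because it requires per-threshold matching of detections and associations rather than a closed-form ratio of counts.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, summed over all frames.

    num_gt is the total number of ground-truth boxes in the sequence.
    The score can be negative when errors outnumber ground-truth boxes.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN).

    The counts come from the identity-level matching between
    predicted and ground-truth trajectories.
    """
    denominator = 2 * idtp + idfp + idfn
    return 2 * idtp / denominator if denominator else 0.0


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_mota = mota(false_negatives=100, false_positives=50,
                    id_switches=10, num_gt=1000)
example_idf1 = idf1(idtp=800, idfp=100, idfn=200)
print(f"MOTA = {example_mota:.3f}, IDF1 = {example_idf1:.3f}")
```

MOTA mostly reflects detection quality, since identity switches are usually a small part of the numerator, while IDF1 rewards keeping the same identity over a whole trajectory. That difference is why the two are typically reported together.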
License
Copyright (c) 2024 Journal of Internet Services and Applications
This work is licensed under a Creative Commons Attribution 4.0 International License.