Memorizing Features Efficiently for Self-supervised Video Object Segmentation
DOI: https://doi.org/10.5753/jbcs.2026.5904
Keywords: Video Object Segmentation, Superpixel Segmentation, Metric Learning
Abstract
Video object segmentation (VOS) involves consistently identifying and classifying object pixels in video sequences, a task that traditionally depends on extensive, manually annotated datasets. In this work, we present SHLS (Superfeatures in a Highly Compressed Latent Space), a self-supervised VOS method that reduces reliance on both annotations and large training datasets. SHLS employs a metric learning framework combining superpixels and deep learning features, enabling effective training with just 10,000 unlabeled still images. Utilizing an efficient memory clustering mechanism, SHLS generates ultra-compact representations called superfeatures, which efficiently store and classify object information across video sequences. Experiments on the DAVIS dataset demonstrate SHLS's strong performance in multi-object scenarios, underscoring its potential as a robust and efficient alternative in self-supervised VOS.
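As a rough illustration of the idea of combining superpixels with deep features, the sketch below average-pools per-pixel embeddings inside each superpixel to obtain one compact vector per region, then labels each region by its most similar entry in a small memory. This is a minimal sketch under stated assumptions, not the SHLS implementation: the function names `pool_superfeatures` and `classify_by_memory`, the use of plain mean pooling, and nearest-neighbor matching by cosine similarity are all hypothetical choices made for illustration; the abstract does not specify how superfeatures are computed or matched.

```python
import numpy as np


def pool_superfeatures(features, labels):
    """Average per-pixel features inside each superpixel (illustrative only).

    features: (H, W, C) float array of per-pixel embeddings.
    labels:   (H, W) int array of superpixel ids in 0..K-1.
    Returns a (K, C) array with one L2-normalized vector per superpixel.
    """
    h, w, c = features.shape
    flat_feats = features.reshape(-1, c)
    flat_labels = labels.reshape(-1)
    k = int(flat_labels.max()) + 1

    # Sum the embeddings of all pixels that share a superpixel id.
    sums = np.zeros((k, c), dtype=np.float64)
    np.add.at(sums, flat_labels, flat_feats)

    # Divide by the pixel count of each superpixel to get the mean.
    counts = np.bincount(flat_labels, minlength=k).astype(np.float64)
    means = sums / np.maximum(counts, 1.0)[:, None]

    # Normalize so that dot products below are cosine similarities.
    norms = np.linalg.norm(means, axis=1, keepdims=True)
    return means / np.maximum(norms, 1e-12)


def classify_by_memory(superfeatures, memory, memory_labels):
    """Assign each superfeature the label of its most similar memory entry.

    Assumes `memory` rows are L2-normalized (hypothetical memory layout).
    """
    sims = superfeatures @ memory.T
    return memory_labels[np.argmax(sims, axis=1)]


# Toy example: a 2x2 image with two superpixels and 2-D embeddings.
features = np.array([[[1.0, 0.0], [1.0, 0.0]],
                     [[0.0, 2.0], [0.0, 4.0]]])
labels = np.array([[0, 0],
                   [1, 1]])
superfeatures = pool_superfeatures(features, labels)

memory = np.array([[1.0, 0.0], [0.0, 1.0]])  # one stored vector per object
memory_labels = np.array([5, 7])             # arbitrary object ids
predicted = classify_by_memory(superfeatures, memory, memory_labels)
```

In this toy setup the memory holds one vector per object, so storage grows with the number of regions kept rather than with the number of pixels, which is the kind of compactness the abstract attributes to superfeatures.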
License
Copyright (c) 2026 Marcelo Mendonça, Luciano Oliveira

This work is licensed under a Creative Commons Attribution 4.0 International License.

