Salience prediction methods for video cropping in sidewalk footage

DOI:

https://doi.org/10.5753/jbcs.2026.5895

Keywords:

Salience Prediction, Sidewalk, Tactile Paving, Video Cropping

Abstract

The condition of urban infrastructure is an important factor in ensuring the safety and well-being of pedestrians. This is especially true around public health facilities, such as the sidewalks surrounding hospitals. Computational tools have already demonstrated their potential in this context, including surface material classification and obstacle detection; however, most solutions require labeled data, which is costly and time-consuming to produce. To address this gap, we propose two strategies for salience prediction in videos that reduce the dependence on manual labeling. The first leverages human visual attention, converting user clicks into attention maps. The second employs the SAM2 model to generate labeled video data more efficiently. The outputs of this process are used to train specialized saliency detectors that identify general cracks, surface defects, and key sections of tactile paving, such as directional changes. We also apply these saliency models to video cropping to highlight the most relevant areas within each frame. This approach enables content-aware video retargeting, supports object-focused attention, and facilitates sidewalk condition analysis by emphasizing defects and potential hazards. This work presents the following contributions: (1) development of a click-based video annotation tool, (2) development of two saliency detection strategies for sidewalk video cropping, (3) training and evaluation of saliency models for sidewalk structure analysis, and (4) successful application of the introduced methods to video cropping. Our experimental results show that the saliency models were able to highlight relevant information in urban environments, achieving an AUC of 0.582 in the best case for human-based attention and 0.914 for tactile-based attention, thereby enhancing assistive technologies for visually impaired individuals.
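As an illustrative sketch only (not the authors' implementation), the two ideas summarized above — converting user clicks into attention maps, and using a saliency map to choose a crop window — can be approximated in a few lines of NumPy. Function names and parameters here are hypothetical: clicks are placed as Gaussian blobs to form a heatmap, and a fixed-size crop is chosen by maximizing total saliency via an integral image.

```python
import numpy as np

def clicks_to_attention_map(clicks, shape, sigma=15.0):
    """Accumulate a Gaussian blob at each (x, y) click into a normalized heatmap."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float64)
    for (x, y) in clicks:
        heat += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    if heat.max() > 0:
        heat /= heat.max()  # normalize to [0, 1]
    return heat

def crop_window(saliency, crop_w, crop_h):
    """Return the top-left corner of the crop_h x crop_w window that
    captures the most total saliency, using an integral image."""
    h, w = saliency.shape
    # Pad one zero row/column so ii[y, x] is the sum over saliency[:y, :x].
    ii = np.pad(saliency, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    best, best_xy = -1.0, (0, 0)
    for y in range(h - crop_h + 1):
        for x in range(w - crop_w + 1):
            s = (ii[y + crop_h, x + crop_w] - ii[y, x + crop_w]
                 - ii[y + crop_h, x] + ii[y, x])
            if s > best:
                best, best_xy = s, (x, y)
    return best_xy
```

Applied per frame, the returned corner defines a content-aware crop; in practice a temporal smoothing step (e.g., a Kalman filter, as cited in the references) would be needed to avoid jitter between frames.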

Published

2026-04-15

How to Cite

Costa, S. M., Damaceno, R. J. P., Morimitsu, H., & Cesar-Jr, R. M. (2026). Salience prediction methods for video cropping in sidewalk footage. Journal of the Brazilian Computer Society, 32(1), 649–662. https://doi.org/10.5753/jbcs.2026.5895

Issue

Section

Regular Issue