A Framework for Semantic and Musical Hyperlapses

DOI:

https://doi.org/10.5753/jbcs.2025.5845

Keywords:

Video Summarization, Semantic Fast-Forward, First-Person Videos, Hyperlapse, Loudness

Abstract

With the growing prevalence of portable cameras such as smartphones, action cameras, and smart glasses, recording first-person videos of daily activities has become increasingly common. However, these recordings often suffer from shaky footage caused by the wearer's continuous movement, which makes them uncomfortable to watch, and they often contain repetitive or irrelevant segments that make them tedious. To address these challenges, hyperlapse methods fast-forward first-person videos while stabilizing camera motion, and semantic hyperlapse methods additionally preserve the most important segments. Although audio is an important part of the viewing experience, it is often overlooked in hyperlapse creation, leaving the choice of soundtrack to the user. In this work, we introduce a multimodal hyperlapse algorithm that jointly optimizes semantic content retention, visual stability, and the alignment of playback speed with the loudness of a user-chosen song. Specifically, the hyperlapse slows down during quiet parts of the song to highlight important frames and speeds up during louder parts to de-emphasize less critical content. We also propose strategies for selecting songs that best complement the hyperlapse. Our experiments show that this approach outperforms existing methods in semantic retention and loudness-speed correlation, while maintaining comparable camera stability and temporal continuity.
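The loudness-to-speed coupling described in the abstract can be illustrated with a minimal sketch: quiet song segments map to low speed-up factors (more frames kept) and loud segments to high ones. This is an assumption-laden toy example, not the authors' actual optimization; the function name, the linear mapping, and the speed bounds are all illustrative choices.

```python
import numpy as np

def speedup_from_loudness(loudness_db, min_speed=1.0, max_speed=8.0):
    """Map per-segment song loudness (e.g., in dB or LUFS) to a playback
    speed-up factor: quiet segments play slowly, loud segments play fast."""
    loudness_db = np.asarray(loudness_db, dtype=float)
    lo, hi = loudness_db.min(), loudness_db.max()
    # Normalize loudness to [0, 1] over the song (constant songs map to 0).
    t = (loudness_db - lo) / (hi - lo) if hi > lo else np.zeros_like(loudness_db)
    # Linear interpolation: quietest segment -> min_speed, loudest -> max_speed.
    return min_speed + t * (max_speed - min_speed)

# Three segments: quiet, medium, loud.
speeds = speedup_from_loudness([-30.0, -20.0, -10.0])
print(speeds)  # [1.  4.5 8. ]
```

A real system would also trade this term off against semantic importance and camera stability, as the paper's joint optimization does.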

Published

2025-10-03

How to Cite

Nepomuceno, R. C. S., Ferreira, L. de S., & da Silva, M. M. (2025). A Framework for Semantic and Musical Hyperlapses. Journal of the Brazilian Computer Society, 31(1), 840–857. https://doi.org/10.5753/jbcs.2025.5845

Section

Articles