Logical Operators for Multimodal Fusion in Temporal Video Scene Segmentation
DOI: https://doi.org/10.5753/reic.2025.5535

Keywords: Multimodal Fusion, Fusion Operators, Video Scene Segmentation, Video Analysis

Abstract
Early fusion techniques in content analysis aim to improve efficacy by generating compact data models that retain semantic clues from multimodal data. Initial attempts applied fusion operators in the low-level feature space, which compromised data representativeness and led to the development of complex operations that are inseparable from the processing of multimodal semantic clues. Previous studies showed that simple arithmetic-based operators can be as effective as such complex operations when applied in the mid-level feature space, highlighting an unexplored opportunity: assessing the efficacy of logical operators. This paper investigates the application of the logical fusion operators And, Or, and Xor in the mid-level feature space for Temporal Video Scene Segmentation. A comparative analysis demonstrates that the Or and Xor operators are viable alternatives for this content analysis task.
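The core idea of the abstract, element-wise logical fusion of mid-level feature vectors from two modalities, can be sketched as follows. This is an illustrative reading only: the threshold, the binarization step, and the example vectors are assumptions for the sketch, not the authors' actual pipeline or data.

```python
# Hypothetical sketch: early fusion with logical operators (And, Or, Xor)
# applied element-wise in the mid-level feature space. The threshold and
# the example "visual"/"aural" vectors are illustrative assumptions.

def binarize(features, threshold=0.5):
    """Map a real-valued mid-level feature vector to binary activations."""
    return [1 if v >= threshold else 0 for v in features]

def fuse(a, b, op):
    """Element-wise logical fusion of two equal-length binary vectors."""
    ops = {
        "and": lambda x, y: x & y,  # keep clues present in both modalities
        "or":  lambda x, y: x | y,  # keep clues present in either modality
        "xor": lambda x, y: x ^ y,  # keep clues exclusive to one modality
    }
    f = ops[op]
    return [f(x, y) for x, y in zip(a, b)]

# Example: mid-level descriptors of one video shot in two modalities.
visual = binarize([0.9, 0.2, 0.7, 0.1])  # -> [1, 0, 1, 0]
aural  = binarize([0.8, 0.6, 0.1, 0.0])  # -> [1, 1, 0, 0]

fused_or  = fuse(visual, aural, "or")   # -> [1, 1, 1, 0]
fused_and = fuse(visual, aural, "and")  # -> [1, 0, 0, 0]
fused_xor = fuse(visual, aural, "xor")  # -> [0, 1, 1, 0]
```

The fused binary vector would then replace the two per-modality descriptors as the input to the segmentation stage, which is what makes this an early (rather than late) fusion scheme.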
Copyright (c) 2025 The authors

This work is licensed under a Creative Commons Attribution 4.0 International License.
