Querying large video datasets: a systematic literature review

DOI:

https://doi.org/10.5753/jisa.2025.5495

Keywords:

Video Query Processing Methods, Video Query Languages, Datasets

Abstract

Querying large-scale video datasets differs from querying short videos because of the inherent challenges of volume, velocity, and variety. Over the last decade, this area has emerged thanks to effective deep learning methods, new graphics processing units, new video databases, and advances in distributed computing, among other factors. The main goal of querying video streams is to find the best balance between available hardware, software resources, and query latency, taking into account quality goals, constraints, and video configurations. In response to these challenges, many development methods, frameworks, and evaluation metrics have been proposed. This systematic literature review surveys them and addresses a gap in the current body of knowledge. It covers ten years, from 2014 to 2024, and 4,248 papers, of which 99 were identified as relevant and used to answer the research questions on (i) processing methods, hardware architecture, and software, (ii) query languages, (iii) evaluation metrics, and (iv) available datasets. The review also shows that this niche is promising and focused on the rational use of available resources. Among the results, the following stand out: cheap detection models are very popular, smart IoT devices are very useful, distributed computing for video query applications is complex, system latency is essential, and there is no standard video query language. Current trends include the development of a standard video query language, in-memory computing, processing data where it is produced, low-latency processing, and active learning for labeling objects. This work offers a perspective on the domain, identifies problems and opportunities, and provides directions for future studies.


References

Aafaq, N., Mian, A., Liu, W., Gilani, S. Z., and Shah, M. (2019). Video description: A survey of methods, datasets, and evaluation metrics. ACM Computing Surveys, 52(6):1-37. DOI: 10.1145/3355390.

Agarwal, N. and Netravali, R. (2023). Boggart: Towards General-Purpose acceleration of retrospective video analytics. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 933-951, Boston, MA. USENIX Association. Available online [link].

Alam, A., Khan, M. N., Khan, J., and Lee, Y.-K. (2020a). Intellibvr-intelligent large-scale video retrieval for objects and events utilizing distributed deep-learning and semantic approaches. In 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 28-35. IEEE. DOI: 10.1109/BigComp48618.2020.0-103.

Alam, A., Ullah, I., and Lee, Y.-K. (2020b). Video big data analytics in the cloud: A reference architecture, survey, opportunities, and open research issues. IEEE Access, 8:152377-152422. DOI: 10.1109/access.2020.3017135.

Anderson, M. R., Cafarella, M., Ros, G., and Wenisch, T. F. (2019). Physical representation-based predicate optimization for a visual analytics database. In Proceedings of the International Conference on Data Engineering, volume 2019-April, pages 1466-1477. DOI: 10.1109/icde.2019.00132.

Andreas Meier, M. K. (2019). Sql & nosql databases: Models, languages, consistency options and architectures for big data management. Springer. DOI: 10.1007/978-3-658-24549-8.

Bastani, F., He, S., Balasingam, A., Gopalakrishnan, K., Alizadeh, M., Balakrishnan, H., Cafarella, M., Kraska, T., and Madden, S. (2020). Miris: fast object track queries in video. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 1907-1921. DOI: 10.1145/3318464.3389692.

Bochkovskiy, A., Wang, C., and Liao, H. M. (2020). Yolov4: Optimal speed and accuracy of object detection. CoRR, abs/2004.10934. DOI: 10.48550/arXiv.2004.10934.

Caltech (2016). Computational vision at caltech. Availabe at:[link].

Canel, C., Kim, T., Zhou, G., Li, C., Lim, H., Andersen, D. G., Kaminsky, M., and Dulloor, S. (2019). Scaling video analytics on constrained edge nodes. In Talwalkar, A., Smith, V., and Zaharia, M., editors, Proceedings of Machine Learning and Systems, volume 1, pages 406-417. Available online [link].

Chao, D., Chen, K., and Koudas, N. (2023). SVQ-ACT: Querying for Actions over Videos. In IEEE 39th International Conference on Data Engineering (ICDE), pages 3599-3602. DOI: 10.1109/ICDE55515.2023.00277.

Chao, D., Koudas, N., and Xarchakos, I. (2020). SVQ++: Querying for Object Interactions in Video Streams. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD '20, page 2769–2772, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3318464.3384701.

Chaudhary, S., Taneja, A., Singh, A., Roy, P., Sikdar, S., Maity, M., and Bhattacharya, A. (2024). {TileClipper}: Lightweight selection of regions of interest from videos for traffic surveillance. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 967-984. Available online [link].

Chen, T. Y. H., Ravindranath, L., Deng, S., Bahl, P., and Balakrishnan, H. (2015). Glimpse: Continuous, real-time object recognition on mobile devices. In Proc. of the 13th ACM Conference on Embedded Networked Sensor Systems, pages 155-168. DOI: 10.1145/2809695.2809711.

Chen, Y., Yu, X., and Koudas, N. (2022). Ranked window query retrieval over video repositories. In IEEE International Conference on Data Engineering, pages 2776-2791. DOI: 10.1109/ICDE53745.2022.00253.

Chen, Y., Yu, X., Koudas, N., and Yu, Z. (2021). Evaluating temporal queries over video feeds. In Proc. of the 2021 International Conference on Management of Data, pages 287-299. DOI: 10.1145/3448016.3452803.

Chollet, F. (2021). Deep learning with Python. Simon and Schuster, Shelter Island, NY, USA. Book.

Chunduri, P., Bang, J., Lu, Y., and Arulraj, J. (2022). Zeus: Efficiently localizing actions in videos using reinforcement learning. In Proc. of the 2022 International Conference on Management of Data, page 545–558, New York, NY, USA. ACM. DOI: 10.1145/3514221.3526181.

Collins, Z. (2020). Active database interface for video search. Master of engineering thesis, Massachusetts Institute of Technology. Available online [link].

Cugola, G. and Margara, A. (2012). Processing flows of information: From data stream to complex event processing. ACM Computing Surveys (CSUR), 44(3):1-62. DOI: 10.1145/2187671.2187677.

da Silveira, T. B. N. (2023). Semantic-Related Challenges in Computational Intelligence: a Transdisciplinary Approach. Phd thesis, Universidade Tecnológica Federal do Paraná, Curitiba, PR, Brazil. Available online [link].

Dai, X., Yang, P., Zhang, X., Dai, Z., and Yu, L. (2022). Respire: Reducing spatial–temporal redundancy for efficient edge-based industrial video analytics. IEEE Transactions on Industrial Informatics, 18(12):9324-9334. DOI: 10.1109/TII.2022.3162598.

Dai, X., Zhang, Z., Yang, P., Xu, Y., Liu, X., and Lui, J. C. (2024). Axiomvision: Accuracy-guaranteed adaptive visual model selection for perspective-aware video analytics. In Proceedings of the 32nd ACM International Conference on Multimedia, MM '24, page 7229–7238, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3664647.3681269.

Daum, M., Haynes, B., He, D., Mazumdar, A., and Balazinska, M. (2021). TASM: A tile-based storage manager for video analytics. In Proc. of IEEE 37th International Conference on Data Engineering, pages 1775-1786. DOI: 10.1109/icde51399.2021.00156.

de Boer, M. H. T., Escher, C., and Schutte, K. (2017). Modelling temporal structures in video event retrieval using an AND-OR graph. In Proc. of the Ninth International Conferences on Advances in Multimedia, pages 85-88, Venice, Italy. Available online [link].

de Lausanne, V. (2020). Place de la palud. Available online [link], Date: 2023-07-01.

Dong, S., Wang, P., and Abbas, K. (2021). A survey on deep learning and its applications. Computer Science Reviews, 40(C). DOI: 10.1016/j.cosrev.2021.100379.

Du, K., Pervaiz, A., Yuan, X., Chowdhery, A., Zhang, Q., Hoffmann, H., and Jiang, J. (2020). Server-driven video streaming for deep learning inference. In Proc. of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, technologies, architectures, and Protocols for Computer Communication, pages 557-570. DOI: 10.1145/3387514.3405887.

Farhadi, A. and Redmon, J. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 1804:1-6. DOI: 10.48550/arxiv.1804.02767.

Fu, D. Y., Crichton, W., Hong, J., Yao, X., Zhang, H., Truong, A., Narayan, A., Agrawala, M., Ré, C., and Fatahalian, K. (2019). Rekall: Specifying video events using compositions of spatiotemporal labels. CoRR, abs/1910.02993. DOI: https://doi.org/10.48550/arXiv.1910.02993.

Furht, B. and Villanustre, F. (2016). Introduction to big data. Big data technologies and applications, pages 3-11. DOI: 10.1007/978-3-319-44550-2_1.

Giatrakos, N., Alevizos, E., Artikis, A., Deligiannakis, A., and Garofalakis, M. (2020). Complex event recognition in the big data era: a survey. The VLDB Journal, 29(1):313-352. DOI: 10.1007/s00778-019-00557-w.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. DOI: 10.1038/nature14539.

Grulich, P. M. and Nawab, F. (2018). Collaborative edge and cloud neural networks for real-time video processing. Proceedings of the VLDB Endowment, 11(12):2046-2049. DOI: 10.14778/3229863.3236256.

Guo, H., Yao, S., Yang, Z., Zhou, Q., and Nahrstedt, K. (2021). Crossroi: cross-camera region of interest optimization for efficient real time video analytics at scale. In Proceedings of the 12th ACM Multimedia Systems Conference, MMSys '21, page 186–199, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3458305.3463381.

Gutoski, M., Lazzaretti, A. E., and Lopes, H. S. (2021). Deep metric learning for open-set human action recognition in videos. Neural Computing and Applications, 33:1207-1220. DOI: 10.1007/s00521-020-05009-z.

Han, S., Shen, H., Philipose, M., Agarwal, S., Wolman, A., and Krishnamurthy, A. (2016). MCDNN: An approximation-based execution framework for deep stream processing under resource constraints. In Proc. of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 123-136, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/2906388.2906396.

Haynes, B., Daum, M., He, D., Mazumdar, A., Balazinska, M., Cheung, A., and Ceze, L. (2021). Vss: A storage system for video analytics. In Proc. of International Conference on Management of Data, SIGMOD '21, pages 685-696, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3448016.3459242.

Haynes, B., Daum, M., Mazumdar, A., Balazinska, M., Cheung, A., and Ceze, L. (2020). VisualWorldDB: A DBMS for the Visual World. In Proc. of Conference on Innovative Data Systems Research. Available online [link].

Haynes, B., Mazumdar, A., Alaghi, A., Balazinska, M., Ceze, L., and Cheung, A. (2018). LightDB: A DBMS for Virtual Reality Video. Proceedings of the VLDB Endowment, 11(10):1192–1205. Available online [link].

He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961-2969. DOI: 10.1109/iccv.2017.322.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778. DOI: 10.1109/cvpr.2016.90.

Hirsch, J. and Buela-Casal, G. (2014). The meaning of the h-index. International Journal of Clinical and Health Psychology, 14(2):161-164. DOI: 10.1016/S1697-2600(14)70050-X.

Hole, S. J. (2018). Jackson hole, wyoming - town square live cam. Available online [link].

Honarparvar, S., Ashena, Z. B., Saeedi, S., and Liang, S. (2024). A systematic review of event-matching methods for complex event detection in video streams. Sensors, 24(22). DOI: 10.3390/s24227238.

HÖnig, R., Ackermann, J., and Chi, M. (2023). Bi-encoder cascades for efficient image search. In IEEE/CVF International Conference on Computer Vision Workshops, pages 1350-1355. Available online [link].

Hsieh, K. (2019). Machine Learning Systems for Highly-Distributed and Rapidly-Growing Data. PhD thesis, Pittsburgh, USA. Available online [link].

Hsieh, K., Ananthanarayanan, G., Bodik, P., Venkataraman, S., Bahl, P., Philipose, M., Gibbons, P. B., and Mutlu, O. (2018). Focus: Querying large video datasets with low latency and low cost. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, pages 269-286, Carlsbad, CA, USA. Available online [link].

Hung, C. C., Ananthanarayanan, G., Bodik, P., Golubchik, L., Yu, M., Bahl, P., and Philipose, M. (2018). VideoEdge: Processing camera streams using hierarchical clusters. In Proceedings - 2018 3rd ACM/IEEE Symposium on Edge Computing, SEC 2018, pages 115-131. IEEE. DOI: 10.1109/sec.2018.00016.

Hwang, J., Kim, M., Kim, D., Nam, S., Kim, Y., Kim, D., Sharma, H., and Park, J. (2022). CoVA: Exploiting Compressed-Domain analysis to accelerate video analytics. In USENIX Annual Technical Conference, pages 707-722. USENIX Association. Available online [link].

Inacio, A. S., Gutoski, M., Lazzaretti, A. E., and Lopes, H. S. (2021). OSVidCap: a framework for the simultaneous recognition and description of concurrent actions in videos in an open-set scenario. IEEE Access, 9:137029-137041. DOI: 10.1109/ACCESS.2021.3116882.

Jain, S., Ananthanarayanan, G., Jiang, J., Shu, Y., and Gonzalez, J. (2019). Scaling video analytics systems to large camera deployments. In Proceedings of the 20th International Workshop on Mobile Computing Systems and Applications, HotMobile '19, page 9–14, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3301293.3302366.

Jain, S., Zhang, X., Zhou, Y., Ananthanarayanan, G., Jiang, J., Shu, Y., Bahl, P., and Gonzalez, J. (2020). Spatula: Efficient cross-camera video analytics on large camera networks. In 2020 IEEE/ACM Symposium on Edge Computing (SEC), pages 110-124. DOI: 10.1109/SEC50012.2020.00016.

Jiang, J., Ananthanarayanan, G., Bodik, P., et al. (2018). Chameleon: Scalable adaptation of video analytics. In Proceedings of Conference of the ACM Special Interest Group on Data Communication, pages 253-266. DOI: 10.1145/3230543.3230574.

Jodoin, J.-P., Bilodeau, G.-A., and Saunier, N. (2014). Urban tracker: Multiple object tracking in urban mixed traffic. In Proc. of IEEE Winter Conference on Applications of Computer Vision, pages 885-892. DOI: 10.1109/wacv.2014.6836010.

Kabukicho, S. (2020). Shinjuku kabukicho live camera. Available online [link].

Kakkar, G. T., Cao, J., Chunduri, P., et al. (2023). Eva: An end-to-end exploratory video analytics system. In Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning, New York, NY, USA. ACM. DOI: 10.1145/3595360.3595858.

Kang, D. (2022). Efficient and Accurate Systems for Querying Unstructured Data. Phd thesis, Stanford University, Palo Alto, USA. Available online [link].

Kang, D., Bailis, P., and Zaharia, M. (2019a). Blazeit: Optimizing declarative aggregation and limit queries for neural network-based video analytics. Proceedings of VLDB Endowment, 13(4):533–546. DOI: 10.48550/arxiv.1805.01046.

Kang, D., Bailis, P., and Zaharia, M. (2019b). Challenges and opportunities in DNN-based video analytics: A demonstration of the blazeit video query engine. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research, Asilomar, California, USA. Available online [link].

Kang, D., Emmons, J., Abuzaid, F., Bailis, P., and Zaharia, M. (2017). Noscope: Optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597. DOI: 10.48550/arxiv.1703.02529.

Kang, D., Gan, E., Bailis, P., Hashimoto, T., and Zaharia, M. (2020a). Approximate selection with guarantees using proxies. Proceedings of the VLDB Endowment, 13(12):1990–2003. DOI: 10.14778/3407790.3407804.

Kang, D., Guibas, J., Bailis, P., Hashimoto, T., Sun, Y., and Zaharia, M. (2021). Accelerating approximate aggregation queries with expensive predicates. Proceedings of the VLDB Endowment, 14(11):2341–2354. DOI: 10.14778/3476249.3476285.

Kang, D., Mathur, A., Veeramacheneni, T., Bailis, P., and Zaharia, M. (2020b). Jointly optimizing preprocessing and inference for dnn-based visual analytics. volume 14, pages 87-100. DOI: 10.14778/3425879.3425881.

Kang, D., Romero, F., Bailis, P., Kozyrakis, C., and Zaharia, M. (2022). VIVA: An end-to-end system for interactive video analytics. In Proceedings of the 12th Conference on Innovative Data Systems Research (CIDR), Chaminade, USA. Available online [link].

Khani, M., Ananthanarayanan, G., Hsieh, K., Jiang, J., Netravali, R., Shu, Y., Alizadeh, M., and Bahl, V. (2023). RECL: Responsive Resource-Efficient continuous learning for video analytics. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 917-932, Boston, MA. USENIX Association. Available online [link].

Kossoski, C. (2024). NOP Query: a new notification-based method for processing video queries on the fly. PhD thesis, Federal University of Technology - Paraná (UTFPR), Curitiba, PR, Brazil. Available online [link].

Kossoski, C., Simão, J. M., and Lopes, H. S. (2024). Modeling and performance analysis of a notification-based method for processing video queries on the fly. Applied Sciences, 14(9):3566. DOI: 10.3390/app14093566.

Koudas, N., Li, R., and Xarchakos, I. (2022). Video monitoring queries. IEEE Transactions on Knowledge and Data Engineering, 34(10):5023–5036. DOI: 10.1109/icde48307.2020.00115.

Kraft, P., Kang, D., Narayanan, D., Palkar, S., Bailis, P., and Zaharia, M. (2020). A demonstration of willump: A statistically-aware end-to-end optimizer for machine learning inference. Proceedings of the VLDB Endowment, 13(12):2833–2836. DOI: 10.14778/3415478.3415487.

Krishnan, S., Dziedzic, A., and Elmore, A. J. (2018). Deeplens: Towards a visual data management system. In Proc. of 9th Biennial Conference on Innovative Data Systems Research, pages 1-10. DOI: 10.48550/arXiv.1812.07607.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25. DOI: 10.1145/3065386.

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011). HMDB: a large video database for human motion recognition. In 2011 International conference on computer vision, pages 2556-2563. IEEE. DOI: 10.1109/iccv.2011.6126543.

Lai, Z., Han, C., Liu, C., Zhang, P., Lo, E., and Kao, B. (2021). Top-k deep video analytics: A probabilistic approach. SIGMOD '21, page 1037–1050, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3448016.3452786.

Laskaridis, S., Venieris, S. I., Almeida, M., Leontiadis, I., and Lane, N. D. (2020). Spinn: synergistic progressive inference of neural networks over device and cloud. In Proc. of the 26th Annual International Conference on Mobile Computing and Networking, pages 1-15. DOI: 10.1145/3372224.3419194.

Latif, A., Rasheed, A., Sajid, U., Ahmed, J., Ali, N., Ratyal, N. I., Zafar, B., Dar, S. H., Sajid, M., and Khalil, T. (2019). Content-based image retrieval and feature extraction: a comprehensive review. Mathematical Problems in Engineering, 2019. DOI: 10.1155/2019/9658350.

Lemmer, W. (2019). Binnenhaven lemmer. Available online [link], Date: 2023-07-01.

Li, J., Liu, L., Xu, H., Wu, S., and Xue, C. J. (2023a). Cross-camera inference on the constrained edge. In IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, pages 1-10. DOI: 10.1109/INFOCOM53939.2023.10229045.

Li, J. Z., Ozsu, M. T., Szafron, D., and Oria, V. (1997). MOQL: A multimedia object query language. In Proc. of the 3rd International Workshop on Multimedia Information Systems, pages 19-28. Available online [link].

Li, Y., Padmanabhan, A., Zhao, P., Wang, Y., Xu, G. H., and Netravali, R. (2020a). Reducto: On-camera filtering for resource-efficient real-time video analytics. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '20, page 359–376, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3387514.3405874.

Li, Z., Katsifodimos, A., Bozzon, A., and Houben, G. J. (2020b). Complex event processing on real-time video streams. In CEUR Workshop Proceedings, page 2652. Available online [link].

Li, Z., Schönfeld, M., Hai, R., Bozzon, A., and Katsifodimos, A. (2023b). Optimizing machine learning inference queries for multiple objectives. In 2023 IEEE 39th International Conference on Data Engineering Workshops (ICDEW), pages 74-78. DOI: 10.1109/ICDEW58674.2023.00017.

Liu, M., Wang, X., Nie, L., Tian, Q., Chen, B., and Chua, T.-S. (2018). Cross-modal moment localization in videos. In Proceedings of the 26th ACM International Conference on Multimedia, MM '18, page 843–851, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3240508.3240549.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. Lecture Notes in Computer Science, 9905:21-37. DOI: 10.1007/978-3-319-46448-0_2.

Liu, X., Ghosh, P., Ulutan, O., Manjunath, B. S., Chan, K., and Govindan, R. (2019). Caesar: cross-camera complex activity recognition. In Proceedings of the 17th Conference on Embedded Networked Sensor Systems, SenSys '19, page 232–244, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3356250.3360041.

Lu, C., Liu, M., and Wu, Z. (2015). SVQL: A SQL extended query language for video databases. International Journal of Database Theory and Application, 8(3):235-248. Available online [link].

Lu, Y., Chowdhery, A., and Kandula, S. (2016). Optasia: A Relational Platform for Efficient Large-Scale Video Analytics. SoCC '16, page 57–70, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/2987550.2987564.

Lu, Y., Chowdhery, A., Kandula, S., and Chaudhuri, S. (2018). Accelerating machine learning inference with probabilistic predicates. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, page 1493–1508, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3183713.3183751.

Lv, X., Wang, Q., Yu, C., and Jin, H. (2023). A feedback-driven dnn inference acceleration system for edge-assisted video analytics. IEEE Transactions on Computers, 72(10):2902-2912. DOI: 10.1109/TC.2023.3275094.

Madden, S., Cafarella, M., Franklin, M., and Kraska, T. (2024). Databases unbound: Querying all of the world's bytes with ai. Proc. VLDB Endow., 17(12):4546–4554. DOI: 10.14778/3685800.3685916.

Mao, H., Kong, T., et al. (2019). CaTDet: Cascaded tracked detector for efficient object detection from video. In Proc. of Conference on Machine Learning and Systems, volume 1, pages 201-211. Available online [link].

Moll, O., Bastani, F., Madden, S., Stonebraker, M., Gadepally, V., and Kraska, T. (2022). Exsample: Efficient searches on video repositories through adaptive sampling. In Proc. of IEEE 38th International Conference on Data Engineering (ICDE), pages 3065-3077. DOI: 10.1109/ICDE53745.2022.00266.

MOT2016 (2022). Multiple object tracking benchmark. Available online [link].

Mullapudi, R. T., Chen, S., Zhang, K., Ramanan, D., and Fatahalian, K. (2019). Online model distillation for efficient video inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3573-3582. DOI: 10.1109/iccv.2019.00367.

Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In 27th International Confer- ence on Machine Learning (ICML), Haifa, Israel. Available online [link].

National Institute of Standards and Technology (2016). Trec video retrieval evaluation (trecvid). Available online [link].

Neves, F. S. (2021). Framework PON C++ 4.0: Contribuição para concepção de aplicações no paradigma orientado a notificações por meio de programação genérica [in portuguese]. Master's thesis, Federal University of Technology - Paraná (UTFPR), Curitiba, PR, Brazil. Available online [link].

Nguyen-Duc, M., Le-Tuan, A., Hauswirth, M., and Le-Phuoc, D. (2021). Towards autonomous semantic stream fusion for distributed video streams. In Proc. of the 15th ACM International Conference on Distributed and Event-based Systems, pages 172-175. DOI: 10.1145/3465480.3467837.

Nielsen, M. A. (2015). Neural networks and deep learning, volume 25. Determination Press, San Francisco, CA, USA. Available online [link].

of Auburn, C. (2023). Toomer's corner webcam 1. Available online [link], Date: 2023-07-01.

Ogle, V. E. and Stonebraker, M. (1995). Chabot: Retrieval from a Relational Database of Images. Computer, 28(9):40-48. DOI: 10.1109/2.410150.

Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.-C., Lee, J. T., Mukherjee, S., Aggarwal, J. K., Lee, H., Davis, L., and others (2011). A large-scale benchmark dataset for event recognition in surveillance video. In Proc. of IEEE Computer Vision and Pattern Recognition Conference, pages 3153-3160. DOI: 10.1109/cvpr.2011.5995586.

Olatunji, I. E. and Cheng, C.-H. (2019). Video analytics for visual surveillance and applications: An overview and survey, pages 475-515. Springer. DOI: 10.1007/978-3-030-15628-2_15.

Oussous, A., Benjelloun, F.-Z., Lahcen, A. A., and Belfkih, S. (2018). Big data technologies: A survey. Journal of King Saud University-Computer and Information Sciences, 30(4):431-448. DOI: 10.1016/j.jksuci.2017.06.001.

Pagani, R. N., Kovaleski, J. L., and Resende, L. M. (2015). Methodi Ordinatio: a proposed methodology to select and rank relevant scientific papers encompassing the impact factor, number of citation, and year of publication. Scientometrics, 105(3):2109-2135. DOI: 10.1007/s11192-015-1744-x.

Pakha, C., Chowdhery, A., and Jiang, J. (2018). Reinventing video streaming for distributed vision analytics. In Proceedings of 10th USENIX Workshop on Hot Topics in Cloud Computing, Boston, MA, USA. USENIX Association. Available online [link].

Patino, L., Cane, T., Vallee, A., and Ferryman, J. (2016). Pets 2016: Dataset and challenge. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8. DOI: 10.1109/cvprw.2016.157.

Pexels (2021). Pexels - free professional videos. Available online [link].

Poms, A., Crichton, W., Hanrahan, P., and Fatahalian, K. (2018a). Scanner: Efficient video analysis at scale. ACM Transactions on Graphics, 37(4):1-13. DOI: 10.1145/3197517.3201394.

Poms, A., Crichton, W., Hanrahan, P., and Fatahalian, K. (2018b). Scanner: Efficient video analysis at scale. ACM Transactions on Graphics, 37(4). DOI: 10.1145/3197517.3201394.

Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M. P., Shyu, M.-L., Chen, S.-C., and Iyengar, S. S. (2018a). A survey on deep learning: Algorithms, techniques, and applications. ACM Computing Surveys, 51(5). DOI: 10.1145/3234150.

Pouyanfar, S., Yang, Y., Chen, S.-C., Shyu, M.-L., and Iyengar, S. S. (2018b). Multimedia big data analytics: A survey. ACM Computing Surveys, 51(1). DOI: 10.1145/3150226.

Punchihewa, A. and Bailey, D. (2020). A review of emerging video codecs: Challenges and opportunities. In 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), pages 1-6, Wellington, New Zealand. IEEE. DOI: 10.1109/ivcnz51579.2020.9290536.

Qin, A., Xiao, M., Wu, Y., Huang, X., and Zhang, X. (2021). Mixer: efficiently understanding and retrieving visual content at web-scale. Proceedings of the VLDB Endowment, 14(12):2906-2917. DOI: 10.14778/3476311.3476371.

Rahmanian, A., Ali-Eldin, A., Tesfatsion, S. K., Skubic, B., Gustafsson, H., Shenoy, P., and Elmroth, E. (2023). Ravas: Interference-aware model selection and resource allocation for live edge video analytics. In 2023 IEEE/ACM Symposium on Edge Computing (SEC), pages 27-39. DOI: 10.1145/3583740.3628443.

Rahmanian, A., Amin, S., Gustafsson, H., and Ali-Eldin, A. (2024). Cvf: Cross-video filtration on the edge. In Proceedings of the 15th ACM Multimedia Systems Conference, MMSys '24, page 231–242, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3625468.3647627.

Ramachandra, B. and Jones, M. (2020). Street Scene: A new dataset and evaluation protocol for video anomaly detection. In Proc. of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2569-2578. DOI: 10.1109/wacv45572.2020.9093457.

Ran, X., Chen, H., Zhu, X., Liu, Z., and Chen, J. (2018). Deepdecision: A mobile deep learning framework for edge video analytics. In Proceedings of the IEEE Conference on Computer Communications, pages 1421-1429. DOI: 10.1109/infocom.2018.8485905.

Ristani, E., Solera, F., Zou, R., Cucchiara, R., and Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In Proc. of European Conference on Computer Vision Workshop on Benchmarking Multi-Target Tracking, Cham, Germany. DOI: 10.1007/978-3-319-48881-3_2.

Romero, F., Hauswald, J., Partap, A., Kang, D., Zaharia, M., and Kozyrakis, C. (2022). Optimizing video analytics with declarative model relationships. Proceedings of the VLDB Endowment, 16(3):447–460. DOI: 10.14778/3570690.3570695.

Rosebrock, A. (2016). Practical python and OpenCV - An introductory, example driven guide to image processing and computer vision. Pyimagesearch, Ebook. Book.

Rosebroke, A. (2017). Deep Learning for Computer Vision with Python. PyImageSearch, Ebook. Book.

Ryoo, M. S., Chen, C.-C., Aggarwal, J. K., and Roy-Chowdhury, A. (2010). An overview of contest on semantic description of human activities (SDHA) 2010. In International Conference on Pattern Recognition, pages 270-285. Springer. DOI: 10.1007/978-3-642-17711-8_28.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L. C. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 4510-4520. DOI: 10.1109/cvpr.2018.00474.

School, O. M. (2018). Webcam from the oxford martin school on broad street. Available online [link].

Shen, H., Han, S., Philipose, M., and Krishnamurthy, A. (2017). Fast video classification via adaptive cascading of deep models. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3646-3654. IEEE. DOI: 10.1109/cvpr.2017.236.

Shen, H., Philipose, M., Agarwal, S., and Wolman, A. (2014). MCDNN: An Execution Framework for Deep Neural Networks on Resource-Constrained Devices. In Proc. of the 14th Annual International Conference on Mobile Systems, Applications, and Services, number December, pages 123-136. Available online [link].

Sipser, A. (2020). Video ingress system for surveillance video querying. Master of engineering thesis, Massachusetts Institute of Technology. Available online [link], Date: 2023-07-01.

Soomro, K., Zamir, A. R., and Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. ArXiv preprint. DOI: 10.48550/arXiv.1212.0402.

Spolaôr, N., Lee, H. D., Takaki, W. S. R., Ensina, L. A., Coy, C. S. R., and Wu, F. C. (2020). A systematic review on content-based video retrieval. Engineering Applications of Artificial Intelligence, 90:103557. DOI: 10.1016/j.engappai.2020.103557.

Stonebraker, M., Bhargava, B., Cafarella, M., Collins, Z., McClellan, J., Sipser, A., Sun, T., Nesen, A., Solaiman, K., Mani, G., et al. (2020). Surveillance video querying with a human-in-the-loop. In Proceedings of the Workshop on Human-in-the-Loop Data Analytics, pages 14-19, Portland, OR, USA. Available online [link], Date: 2023-07-01.

Sun, L., Wang, W., Yuan, T., Mi, L., Dai, H., Liu, Y., and Fu, X. (2024). BiSwift: Bandwidth orchestrator for multi-stream video analytics on edge. In IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, pages 1181-1190. DOI: 10.1109/INFOCOM52122.2024.10621392.

Technologies, S. (2019). Netcamlive 2 Taiwan New Taipei City 720p. Available online [link].

Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 6450-6459. Available online [link].

Usman, M., Jan, M. A., He, X., and Chen, J. (2019). A survey on big multimedia data processing and management in smart cities. ACM Computing Surveys, 52(3):1-29. DOI: 10.1145/3323334.

Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2021). Scaled-YOLOv4: Scaling cross stage partial network. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13029-13038. IEEE. DOI: 10.1109/cvpr46437.2021.01283.

Wang, L., Lu, K., Zhang, N., Qu, X., Wang, J., Wan, J., Li, G., and Xiao, J. (2023). Shoggoth: Towards efficient edge-cloud collaborative real-time video inference via adaptive online learning. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pages 1-6. DOI: 10.1109/DAC56929.2023.10247821.

Wang, L., Qu, X., Wang, J., Li, G., Wan, J., Zhang, N., Guo, S., and Xiao, J. (2024). Gecko: Resource-efficient and accurate queries in real-time video streams at the edge. In IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, pages 481-490. DOI: 10.1109/INFOCOM52122.2024.10621399.

Wang, W., Gao, J., Zhang, M., Wang, S., Chen, G., Ng, T. K., Ooi, B. C., Shao, J., and Reyad, M. (2018). Rafiki: Machine learning as an analytics service system. Proceedings of the VLDB Endowment, 12(2):128-140. DOI: 10.48550/arxiv.1804.06087.

Wen, L., Du, D., Cai, Z., Lei, Z., Chang, M.-C., Qi, H., Lim, J., Yang, M.-H., and Lyu, S. (2020). UA-DETRAC: A New Benchmark and Protocol for Multi-Object Detection and Tracking. Computer Vision and Image Understanding, 193:102907. DOI: 10.1016/j.cviu.2020.102907.

Wen, Q., Zhou, J., Chen, R., Luo, Z., Tyson, G., Li, W., Wang, J., Pan, H., and Xu, Z. (2024). From limited resources to powerful insights: Empowering low-cost cameras for efficient retrospective querying. IEEE Internet of Things Journal, pages 1-1. DOI: 10.1109/JIOT.2024.3480089.

Wu, D., Zhang, D., Zhang, M., Zhang, R., Wang, F., and Cui, S. (2023). ILCAS: Imitation learning-based configuration-adaptive streaming for live video analytics with cross-camera collaboration. IEEE Transactions on Mobile Computing, pages 1-15. DOI: 10.1109/TMC.2023.3327097.

Xarchakos, I. and Koudas, N. (2019). SVQ: Streaming video queries. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, pages 2013-2016, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3299869.3320230.

Xiph (2021). Xiph.org video test media. Available online [link].

Xu, R., Razavi, S., and Zheng, R. (2023). Edge video analytics: A survey on applications, systems and enabling techniques. IEEE Communications Surveys & Tutorials, 25(4):2951-2982. DOI: 10.1109/COMST.2023.3323091.

Xu, T., Botelho, L. M., and Lin, F. X. (2019). VStore: A data store for analytics on large videos. In Proceedings of the 14th EuroSys Conference, pages 1-17, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3302424.3303971.

Yadav, P. (2019). High-performance complex event processing framework to detect event patterns over video streams. In Proceedings of the 20th International Middleware Conference Doctoral Symposium, pages 47-50, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3366624.3368169.

Yadav, P. (2021). Query-aware adaptive windowing for spatiotemporal complex video event processing for internet of multimedia things. PhD thesis, University of Galway, Ireland.

Yadav, P. and Curry, E. (2019a). VEKG: Video event knowledge graph to represent video streams for complex event pattern matching. In Proceedings of the IEEE First International Conference on Graph Computing, pages 13-20, Laguna Hills, CA, USA. IEEE. DOI: 10.1109/GC46384.2019.00011.

Yadav, P. and Curry, E. (2019b). VidCEP: Complex event processing framework to detect spatiotemporal patterns in video streams. In Proceedings of the IEEE International Conference on Big Data, pages 2513-2522. IEEE. DOI: 10.1109/BigData47090.2019.9006018.

Yadav, P., Salwala, D., and Curry, E. (2021). VID-WIN: Fast video event matching with query-aware windowing at the edge for the internet of multimedia things. IEEE Internet of Things Journal. DOI: 10.1109/jiot.2021.3075336.

Yadav, P., Salwala, D., Das, D. P., and Curry, E. (2020). Knowledge graph driven approach to represent video streams for spatiotemporal event pattern matching in complex event processing. International Journal of Semantic Computing, 14(03):423-455. DOI: 10.1142/s1793351x20500051.

Yang, K., Liu, J., Yang, D., Wang, H., Sun, P., Zhang, Y., Liu, Y., and Song, L. (2023). A novel efficient multi-view traffic-related object detection framework. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1-5. DOI: 10.1109/ICASSP49357.2023.10095027.

Yang, P., Lyu, F., Wu, W., Zhang, N., Yu, L., and Shen, X. S. (2020). Edge Coordinated Query Configuration for Low-Latency and Accurate Video Analytics. IEEE Transactions on Industrial Informatics, 16(7):4855-4864. DOI: 10.1109/tii.2019.2949347.

Yang, Z., Wang, Z., Huang, Y., Lu, Y., Li, C., and Wang, X. S. (2022). Optimizing machine learning inference queries with correlative proxy models. Proceedings of the VLDB Endowment, 15(10):2032-2044. DOI: 10.14778/3547305.3547310.

Yi, S., Hao, Z., Zhang, Q. Q., Zhang, Q. Q., Shi, W., and Li, Q. (2017). LAVEA: Latency-Aware video analytics on edge computing platform. In Proceedings of the International Conference on Distributed Computing Systems, pages 2573-2574. IEEE. DOI: 10.1109/icdcs.2017.182.

Yun, K., Honorio, J., Chattopadhyay, D., Berg, T. L., and Samaras, D. (2012). Two-person interaction detection using body-pose features and multiple instance learning. In Proc. of IEEE Computer Vision and Pattern Recognition Conference, pages 28-35. DOI: 10.1109/cvprw.2012.6239234.

Zhang, H., Ananthanarayanan, G., Bodik, P., Philipose, M., Bahl, P., and Freedman, M. J. (2017). Live video analytics at scale with approximation and delay-tolerance. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation, pages 377-392, Boston, MA. USENIX Association. Available online [link].

Zhang, Q., Sun, H., Wu, X., and Zhong, H. (2019). Edge video analytics for public safety: A review. Proceedings of the IEEE, 107(8):1675-1696. DOI: 10.1109/jproc.2019.2925910.

Zhang, R.-X., Li, C., Wu, C., Huang, T., and Sun, L. (2023). Owl: A pre- and post-processing framework for video analytics in low-light surroundings. In IEEE Conference on Computer Communications, pages 1-10. DOI: 10.1109/INFOCOM53939.2023.10229059.

Zhang, Y. and Kumar, A. (2019). Panorama: a data system for unbounded vocabulary querying over video. Proceedings of the VLDB Endowment, 13(4):477-491. DOI: 10.14778/3372716.3372721.

Zhang, Y., Zhang, X., Ananthanarayanan, G., Iyer, A., Shu, Y., Bahl, V., Mao, Z. M., and Chowdhury, M. (2024). Vulcan: Automatic query planning for live ML analytics. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1385-1402, Santa Clara, CA. USENIX Association. Available online [link].

Zhen, L., Hu, P., Wang, X., and Peng, D. (2019). Deep supervised cross-modal retrieval. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10394-10403. DOI: 10.1109/cvpr.2019.01064.

Published

2025-10-02

How to Cite

Kossoski, C., Lopes, H. S., & Simão, J. M. (2025). Querying large video datasets: a systematic literature review. Journal of Internet Services and Applications, 16(1), 566–595. https://doi.org/10.5753/jisa.2025.5495

Section

Research article