Evaluating Methods for Violence Classification and Firearm Detection in Indoor CCTV Environment

Arnaldo V. Barros da Silva; Luis F. Alves Pereira

doi:10.5753/jbcs.2024.3282

Authors

Arnaldo V. Barros da Silva Universidade Federal do Agreste de Pernambuco https://orcid.org/0000-0002-0300-3341
Luis F. Alves Pereira Universidade Federal do Agreste de Pernambuco https://orcid.org/0000-0002-4624-6714

Keywords:

Violence Detection, Firearm Detection, Image Features, Deep Learning

Abstract

The adoption of computer-vision-based security systems for violence detection has the potential to significantly improve safety in a wide range of public and private properties. However, developing these systems can be extremely challenging. We can use classification models to identify violence in images, object detection models to identify firearms (which may indicate a robbery), or both. Additionally, systems focused on private environments face specific challenges, such as obtaining appropriate datasets to train the algorithms. Many publicly available violence detection datasets consist of outdoor images containing elements such as streets and cars, which do not adequately reflect the characteristics of private properties. In this work, we evaluate both learned and handcrafted features for classifying videos as 'violence' or 'non-violence' across a variety of datasets, including a new dataset composed exclusively of closed-circuit television (CCTV) images. We also propose a new dataset for firearm detection in CCTV images and conduct experiments using YOLOv8. In this way, we hope to provide clearer insight into the design decisions involved in developing a security system for indoor environments.
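The abstract does not say which handcrafted features were evaluated. To illustrate what a handcrafted pipeline can look like, here is a minimal sketch of a bag-of-visual-words (BoVW) encoder. It assumes that local descriptors (for example SIFT) have already been extracted from sampled video frames. The function names `build_codebook` and `encode` are hypothetical, and the k-means step is a plain NumPy implementation rather than any library's. This is an illustration under those assumptions and should not be read as the authors' method.

```python
import numpy as np

# Hypothetical BoVW sketch: not the paper's implementation.
# Descriptor extraction (e.g. SIFT on sampled frames) is assumed to happen elsewhere.


def build_codebook(descriptors: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Cluster local descriptors (N x D) into k visual words using plain k-means."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct descriptors chosen at random (copied as floats).
    centroids = descriptors[rng.choice(len(descriptors), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each descriptor to its nearest centroid (squared Euclidean distance).
        dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; leave it unchanged if its cluster is empty.
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


def encode(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map the descriptors of one sample (image or video) to an L1-normalised word histogram."""
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Every video becomes a fixed-length histogram, no matter how many descriptors it produced. That vector can then go to a conventional classifier such as an SVM or a random forest. A learned-feature pipeline would instead take the vector from the activations of a pretrained CNN.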
