Deep Learning Applied to Imbalanced Malware Datasets Classification

Marcelo Palma Salas; Paulo Lício de Geus

doi:10.5753/jisa.2024.3907

Authors

Marcelo Palma Salas, Universidade de Campinas (UNICAMP), https://orcid.org/0000-0001-6821-0002
Paulo Lício de Geus, Universidade de Campinas (UNICAMP), https://orcid.org/0000-0002-6540-8686

DOI:

https://doi.org/10.5753/jisa.2024.3907

Keywords:

malware, classification, CNN, MobileNet, interpolation, Big 2015, Malimg, MaleVis, Fusion, dataset

Abstract

Today, the evolution and exponential proliferation of malware rely on modifying and camouflaging its structure through techniques such as obfuscation, polymorphism, metamorphism, and encryption. With advances in deep learning, methods such as convolutional neural networks (CNNs) have emerged as potent tools for deciphering intricate patterns in this malicious software. The present research uses the capacity of CNNs to learn the global structure of code converted to an RGB or grayscale image and to decipher the patterns present in the malware datasets generated from these images. The study explores fine-tuning techniques, including bicubic interpolation, ReduceLROnPlateau, and class weight estimation, to help the model generalize and to reduce the risk of overfitting on malware that uses evasion techniques against classification. Taking advantage of transfer learning and the MobileNet architecture, we created a MobileNet fine-tuning (FT) model. Applied to four datasets (Microsoft Big 2015, Malimg, MaleVis, and a new Fusion dataset), this model achieved 98.71%, 99.08%, 96.04%, and 98.04% accuracy, respectively, which underscores its robustness. The Fusion dataset combines the first three datasets and consists of 32,601 known malware image files from 59 different families. Despite this success, the study reveals that performance deteriorates as the number of malware families increases, highlighting the need to further explore the limits of CNNs in malware classification.
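
The abstract names class weight estimation and the conversion of code to grayscale images but does not give their details. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the common "balanced" heuristic (weight = n_samples / (n_classes * class_count)) and a simple row-wise byte layout with zero padding. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: the 'balanced' heuristic; the abstract does not state
    which estimation formula the study uses.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def bytes_to_grayscale_rows(data, width):
    """Lay raw bytes out as rows of `width` pixels (values 0-255).

    Assumption: the last row is zero-padded to full width.
    """
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Hypothetical toy labels: family "A" is six times more frequent than "C".
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
weights = balanced_class_weights(labels)

# Five hypothetical bytes laid out as a 4-pixel-wide grayscale image.
image = bytes_to_grayscale_rows(b"\x00\x10\x20\x30\x40", width=4)
```

Under this heuristic, rarer families receive larger weights, so misclassifying them costs more during training. A dictionary of this shape, keyed by integer class index, is the form that the `class_weight` argument of Keras `Model.fit` accepts.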
