Synthetic Data: AI's New Weapon Against Android Malware
DOI:
https://doi.org/10.5753/jbcs.2026.5655
Keywords:
Android Malware, Machine learning (ML), Artificial Intelligence (AI), Conditional Generative Adversarial Networks (cGANs), Synthetic data
Abstract
The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection methods. Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. Although machine learning has shown promise in malware classification, its success relies heavily on the availability of up-to-date, high-quality datasets. The scarcity of real malware samples and the high cost of obtaining and labeling them present significant challenges in developing robust detection models. In this paper, we propose MalSynGen, a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers. We evaluated the effectiveness of this approach using various datasets and metrics that assess the fidelity of the generated data, its utility in classification, and the computational efficiency of the process. Our experiments demonstrate that MalSynGen can generalize across different datasets, providing a viable solution to the problems of obsolete and low-quality data in malware detection.
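The abstract does not list the specific fidelity metrics used, so the following is only a minimal sketch of one plausible check, not MalSynGen's own evaluation code. It assumes binary feature vectors (for example, permission or API-call flags) and compares how often each feature is set in real versus synthetic samples, separately per class label, since a cGAN generates samples conditioned on that label. The function names `feature_means`, `mean_abs_marginal_gap`, and `per_class_gap` are hypothetical.

```python
from typing import Dict, List, Sequence


def feature_means(rows: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of samples in which each binary feature is set."""
    if not rows:
        raise ValueError("need at least one sample")
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def mean_abs_marginal_gap(real: Sequence[Sequence[int]],
                          synthetic: Sequence[Sequence[int]]) -> float:
    """Mean absolute difference between real and synthetic feature frequencies.

    0.0 means the per-feature marginals match exactly; larger is worse.
    This ignores correlations between features, so it is a coarse check.
    """
    real_freq = feature_means(real)
    syn_freq = feature_means(synthetic)
    if len(real_freq) != len(syn_freq):
        raise ValueError("real and synthetic data must share the same features")
    return sum(abs(r - s) for r, s in zip(real_freq, syn_freq)) / len(real_freq)


def per_class_gap(real_x: Sequence[Sequence[int]], real_y: Sequence[int],
                  syn_x: Sequence[Sequence[int]], syn_y: Sequence[int]) -> Dict[int, float]:
    """Compute the marginal gap separately for each class label."""
    gaps = {}
    for label in sorted(set(real_y)):
        real_rows = [x for x, y in zip(real_x, real_y) if y == label]
        syn_rows = [x for x, y in zip(syn_x, syn_y) if y == label]
        gaps[label] = mean_abs_marginal_gap(real_rows, syn_rows)
    return gaps


if __name__ == "__main__":
    # Toy data (made up for illustration): 3 binary features, label 1 = malware.
    real_x = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
    real_y = [1, 1, 0, 0]
    syn_x = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
    syn_y = [1, 1, 0, 0]
    print(per_class_gap(real_x, real_y, syn_x, syn_y))
```

A low gap for both classes is a necessary but not sufficient sign of fidelity. The utility side of the evaluation would additionally train a classifier on synthetic data and test it on held-out real data.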
License
Copyright (c) 2026 Angelo Gaspar Diniz Nogueira, Kayua Oleques Paim, Hendrio Bragança, Rodrigo Brandão Mansilha, Diego Kreutz

This work is licensed under a Creative Commons Attribution 4.0 International License.

