Limitless Feature Selection: Revolutionizing Evaluation with MH-FSF

Authors

Rocha, V., Kreutz, D., Bragança, H., and Feitosa, E.

DOI:

https://doi.org/10.5753/jbcs.2026.5646

Keywords:

Feature Selection, Android Malware Detection, Benchmarking, Reproducibility, Evaluation, Data Preprocessing, Machine Learning, Artificial Intelligence

Abstract

Feature selection plays a crucial role in developing effective predictive models by reducing dimensionality and emphasizing the most relevant attributes. However, current research in this area often lacks comprehensive benchmarking and frequently depends on proprietary datasets. These limitations hinder reproducibility and may lead to inconsistent or suboptimal model performance. To address them, we introduce MH-FSF, a comprehensive, modular, and extensible framework designed to facilitate the reproduction and implementation of feature selection methods. Developed through collaborative research, MH-FSF provides implementations of 17 methods (11 classical, 6 domain-specific) and enables systematic evaluation on 10 publicly available Android malware datasets. Our results reveal performance variations across both balanced and imbalanced datasets, highlighting the need for data preprocessing and selection criteria that account for these asymmetries. We demonstrate the importance of a unified platform for comparing diverse feature selection techniques, which fosters methodological consistency and rigor. By providing this framework, we aim to broaden the existing literature and open new research directions in feature selection, particularly for Android malware detection.

Published

2026-02-06

How to Cite

Rocha, V., Kreutz, D., Bragança, H., & Feitosa, E. (2026). Limitless Feature Selection: Revolutionizing Evaluation with MH-FSF. Journal of the Brazilian Computer Society, 32(1), 73–84. https://doi.org/10.5753/jbcs.2026.5646

Issue

Section

Regular Issue