Advancing Test Data Selection by Leveraging Decision Tree Structures: An Investigation into Decision Tree Coverage and Mutation Analysis
DOI: https://doi.org/10.5753/jserd.2025.4084

Keywords: Software Testing, Machine Learning, Decision Tree, Testing Criterion, Mutation Testing

Abstract
Over the past decade, interest in applying machine learning (ML) across various tasks has surged significantly, and the adoption of ML-based systems has gone mainstream. Consequently, it is imperative to conduct thorough software testing on these systems to ensure that they behave as expected. However, ML-based systems present unique challenges for software testers striving to enhance the quality and reliability of these solutions. To cope with these testing challenges, we propose novel test adequacy criteria centered on decision tree models. Our criteria diverge from the conventional method of manually collecting and labeling data; instead, they rely on the inherent structure of decision tree models to inform the selection of test inputs. Specifically, we introduce decision tree coverage (DTC) and boundary value analysis (BVA) as approaches to systematically guide the creation of effective test data that exercises key structural elements of a given decision tree model. Additionally, we propose a mutation-based criterion to support the validation of ML-based systems. Essentially, this approach applies mutation analysis to the decision tree structure; the resulting mutated trees are then used as a reference for selecting test data that can effectively identify incorrect classifications in ML models. To evaluate these criteria, we carried out an experiment using 16 datasets, measuring the effectiveness of test inputs in terms of the difference in the model's behavior between the test input and the training data. According to the results of the experiment, our criteria can be used to improve test data selection for ML applications by guiding the generation of diversified test data that negatively impact the prediction performance of models.
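The mutation-based criterion described above can be illustrated with a minimal sketch, which is not the authors' implementation: a trained scikit-learn decision tree is copied, a split threshold in the copy is perturbed (a simple stand-in for a threshold-mutation operator), and test inputs on which the mutant's predictions diverge from the original model's are the ones considered effective at "killing" the mutant.

```python
# Hedged sketch (assumes scikit-learn): mutate a decision tree's structure
# and use the mutant to flag test inputs whose predictions diverge from
# the original model -- the idea behind the mutation-based criterion.
import copy

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Mutation operator (illustrative): shift the root node's split threshold.
# sklearn exposes the node thresholds as a writable view via tree_.threshold.
mutant = copy.deepcopy(model)
mutant.tree_.threshold[0] = np.inf  # exaggerated shift: every input now takes the left branch

# Inputs on which the mutant disagrees with the original model "kill" it;
# such inputs are good candidates for the test set.
killing = model.predict(X) != mutant.predict(X)
print(f"{killing.sum()} of {len(X)} inputs kill this mutant")  # → 100 of 150
```

In practice, the perturbation would be drawn from a set of mutation operators (threshold shifts, feature swaps, label changes at leaves), and test data would be selected to maximize the number of mutants killed.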
References
Agrawal, H., Demillo, R. A., Hathaway, B., Hsu, W., Hsu, W., Krauser, E. W., Martin, R. J., Mathur, A. P., and Spafford, E. H. (1989). Design Of Mutant Operators For The C Programming Language. W. Lafayette, IN 47907, Software Engineering Research Center Department of Computer Sciences Purdue University.
Ahmed, Z. and Makedonski, P. (2024). Exploring the fundamentals of mutations in deep neural networks. In Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, MODELS Companion ’24, page 227–233, New York, NY, USA. Association for Computing Machinery.
Ammann, P. and Offutt, J. (2016). Introduction to software testing. Cambridge University Press, USA, 2nd edition.
Aniche, M., Maziero, E., Durelli, R., and Durelli, V. H. S. (2022). The effectiveness of supervised machine learning algorithms in predicting software refactoring. IEEE Transactions on Software Engineering, 48(4):1432–1450.
Braiek, H. B. and Khomh, F. (2020). On testing machine learning programs. Journal of Systems and Software, 164:110542.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1983). Classification and Regression Trees. The Wadsworth Statistics/Probability Series. Wadsworth International Group, Belmont, CA, 1st edition.
DeMillo, R., Lipton, R., and Sayward, F. (1978). Hints on test data selection: Help for the practicing programmer. Computer, 11(4):34–41.
Durelli, V. H. S., Durelli, R. S., Borges, S. S., Endo, A. T., Eler, M. M., Dias, D. R. C., and Guimarães, M. P. (2019). Machine learning applied to software testing: A systematic mapping study. IEEE Transactions on Reliability, 68(3):1189–1212.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188.
Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly, 2nd edition.
Hu, Q., Ma, L., Xie, X., Yu, B., Liu, Y., and Zhao, J. (2019). Deepmutation++: A mutation testing framework for deep learning systems. In 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1157–1161.
Humbatova, N., Jahangirova, G., and Tonella, P. (2021). Deepcrime: Mutation testing of deep learning systems based on real faults. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2021, page 67–78, New York, NY, USA. Association for Computing Machinery.
Humbatova, N., Jahangirova, G., and Tonella, P. (2023). Deepcrime: from real faults to mutation testing tool for deep learning. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pages 68–72.
Jahangirova, G. and Tonella, P. (2020). An empirical evaluation of mutation operators for deep learning systems. In 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST), pages 74–84.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics.
Kim, Y.-Y., Cho, Y., Jang, J., Na, B., Kim, Y., Song, K., Kang, W., and Moon, I.-C. (2023). Saal: sharpness-aware active learning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
Li, T., Guo, Q., Liu, A., Du, M., Li, Z., and Liu, Y. (2023). Fairer: fairness as decision rationale alignment. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
Li, Z., Ma, X., Xu, C., and Cao, C. (2019). Structural coverage criteria for neural networks could be misleading. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), pages 89–92. IEEE.
Lu, Y., Shao, K., Sun, W., and Sun, M. (2022). Mtul: Towards mutation testing of unsupervised learning systems. In Dependable Software Engineering. Theories, Tools, and Applications: 8th International Symposium, SETTA 2022, Beijing, China, October 27-29, 2022, Proceedings, page 22–40, Berlin, Heidelberg. Springer-Verlag.
Ma, L., Zhang, F., Sun, J., Xue, M., Li, B., Juefei-Xu, F., Xie, C., Li, L., Liu, Y., Zhao, J., and Wang, Y. (2018). Deepmutation: Mutation testing of deep learning systems. In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), pages 100–111.
Müller, A. C. and Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists. O’Reilly Media.
Myers, G. J., Sandler, C., and Badgett, T. (2011). The Art of Software Testing. Wiley, 3rd edition.
Ogrizović, M., Drašković, D., and Bojić, D. (2024). Quality assurance strategies for machine learning applications in big data analytics: an overview. Journal of Big Data, 11(156):1–48.
Panichella, A. and Liem, C. C. S. (2021). What are we really testing in mutation testing for machine learning? a critical reflection. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), pages 66–70.
Pei, K., Cao, Y., Yang, J., and Jana, S. (2019). Deepxplore: Automated whitebox testing of deep learning systems. Communications of the ACM, 62(11):137–145.
Riccio, V., Humbatova, N., Jahangirova, G., and Tonella, P. (2022). Deepmetis: Augmenting a deep learning test set to increase its mutation score. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering, ASE ’21, pages 355–367. IEEE Press.
Riccio, V., Jahangirova, G., Stocco, A., Humbatova, N., Weiss, M., and Tonella, P. (2020). Testing machine learning based systems: a systematic mapping. Empirical Software Engineering, 25:1573–7616.
Rittler, N. and Chaudhuri, K. (2023). A two-stage active learning algorithm for k-nearest neighbors. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
Santos, S., Silveira, B., Durelli, V. H. S., Durelli, R., Souza, S., and Delamaro, M. (2021). On using decision tree coverage criteria for testing machine learning models. In Proceedings of the 6th Brazilian Symposium on Systematic and Automated Software Testing, SAST ’21, pages 1–9. ACM.
Shen, W., Wan, J., and Chen, Z. (2018). Munn: Mutation analysis of neural networks. In 2018 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C), pages 108–115.
Sherin, S., Khan, M. U., and Iqbal, M. Z. (2019). A systematic mapping study on testing of machine learning programs.
Silveira, B., Durelli, V. H. S., Santos, S., Durelli, R., Delamaro, M., and Souza, S. (2023). Test data selection based on applying mutation testing to decision tree models. In Proceedings of the 8th Brazilian Symposium on Systematic and Automated Software Testing, SAST ’23, pages 38–46. ACM.
Tambon, F., Khomh, F., and Antoniol, G. (2023). A probabilistic framework for mutation testing in deep neural networks. Inf. Softw. Technol., 155(C).
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., and Wesslén, A. (2012). Experimentation in Software Engineering. Springer.
Zhang, J. M., Harman, M., Ma, L., and Liu, Y. (2022). Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering, 48(1):1–36.
License
Copyright (c) 2025 Beatriz N. C. Silveira, Vinicius H. S. Durelli, Sebastião H. N. Santos, Rafael S. Durelli, Marcio E. Delamaro, Simone R. S. Souza

This work is licensed under a Creative Commons Attribution 4.0 International License.