Reducing Manual Efforts in Equivalence Analysis in Mutation Testing




mutation testing, equivalent mutant, automated testing


Mutation testing has attracted considerable interest because of its reputation as a powerful adequacy criterion for test suites and its ability to guide test case generation. However, the presence of equivalent mutants hinders its adoption in industry. The Equivalent Mutant Problem has been proven undecidable, and manually detecting equivalent mutants is an error-prone and time-consuming task, so even partial solutions can help reduce this cost. To mitigate this problem, we introduce an approach to suggest equivalent mutants. Our approach is based on automated behavioral testing, in which test cases are derived from the behavior of the original program. We use static analysis to automatically generate tests for the entities impacted by the mutation. For each mutant analyzed, our approach suggests whether the mutant is equivalent or non-equivalent. For non-equivalent mutants, it provides a test case capable of killing the mutant. For mutants suggested as equivalent, it also provides a ranking that indicates whether each mutant has a strong or weak chance of being truly equivalent. In our previous work, we evaluated the approach against a set of 1,542 mutants manually classified in prior work as equivalent or non-equivalent. The approach suggested equivalent mutants effectively, reaching more than 96% accuracy in five of the eight subjects studied. Compared with manual analysis of the surviving mutants, our approach takes a third of the time to suggest equivalent mutants and is 25 times faster at indicating non-equivalent ones. This extended article examines our evaluation in more depth. We focus on the specific characteristics of mutants that our approach erroneously classified as equivalent, that is, the false positives. We also analyze the mutation operators comprehensively, providing insights for practitioners seeking to improve the accuracy of equivalent mutant detection and reduce the associated costs.
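The idea of suggesting a mutant as equivalent or non-equivalent by comparing behavior can be illustrated with a small sketch. It is not the authors' tool: it swaps in seeded random inputs for the static-analysis-driven test generation described above, and every name in it (`original`, `mutant_ge`, `mutant_lt`, `suggest_equivalence`) is hypothetical. A mutant is reported as non-equivalent as soon as one input makes it behave differently from the original, and that input is returned as the killing test. Otherwise the mutant is only suggested as equivalent, since testing cannot prove equivalence.

```python
import random
from typing import Callable, Optional, Tuple

IntPair = Tuple[int, int]


def original(a: int, b: int) -> int:
    """Original program: return the larger of two integers."""
    return a if a > b else b


def mutant_ge(a: int, b: int) -> int:
    """Mutant with `>` replaced by `>=`.

    Equivalent to the original: the branches differ only when a == b,
    and in that case both branches return the same value.
    """
    return a if a >= b else b


def mutant_lt(a: int, b: int) -> int:
    """Mutant with `>` replaced by `<`; it returns the smaller value."""
    return a if a < b else b


def suggest_equivalence(
    orig: Callable[[int, int], int],
    mutant: Callable[[int, int], int],
    trials: int = 1000,
    seed: int = 0,
) -> Tuple[str, Optional[IntPair]]:
    """Compare original and mutant on seeded random inputs.

    Returns ("non-equivalent", killing_input) when some input exposes a
    behavioral difference, otherwise ("equivalent (suggested)", None).
    """
    rng = random.Random(seed)
    for _ in range(trials):
        args = (rng.randint(-50, 50), rng.randint(-50, 50))
        if orig(*args) != mutant(*args):
            return "non-equivalent", args
    return "equivalent (suggested)", None


if __name__ == "__main__":
    for name, mutant in [("mutant_ge", mutant_ge), ("mutant_lt", mutant_lt)]:
        verdict, killing_input = suggest_equivalence(original, mutant)
        print(name, verdict, killing_input)
```

In this sketch the `>=` mutant survives every trial and is suggested as equivalent, while the `<` mutant is killed by the first input whose two values differ. A suggestion of equivalence from such a procedure can still be a false positive when the generated inputs never reach the behavior that distinguishes the mutant.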








How to Cite

Amorim, S., Fernandes, L., Ribeiro, M., Gheyi, R., Delamaro, M., Guimarães, M., & Santos, A. (2024). Reducing Manual Efforts in Equivalence Analysis in Mutation Testing. Journal of Software Engineering Research and Development, 12(1), 3:1 – 3:17.



Research Article
