The evolution of CRISP-DM for Data Science: Methods, Processes and Frameworks

Authors

A. M. Shimaoka, R. C. Ferreira, A. Goldman

DOI:

https://doi.org/10.5753/reviews.2024.3757

Keywords:

Data Science, Process Model, CRISP-DM, Agile

Abstract

The expansion of Data Science projects in organizations has been driven by three factors: the growth in the amount of data generated, the evolution of storage capacity, and the increase in computational capabilities. However, most of these projects fail to deliver the expected value: 82% of teams do not use any process model. Despite the popularity of Agile Methods, their adoption in Data Science projects is still scarce. Most existing research focuses on algorithms, and studies on agility in Data Science are lacking. This Systematic Literature Review (SLR) was performed to identify and evaluate 16 studies that can answer how to adapt and apply CRISP-DM using different approaches (methods, frameworks, or process models). In addition, it shows how CRISP-DM has evolved over the last few decades, with derivations ranging from rigid processes to agile methods. This research then analyzes the 16 tailored models and examines the similarities and differences between CRISP-DM derivatives. As a result, it summarizes the CRISP-DM adaptation patterns identified, such as phase addition, phase modification, addition of features and tools, and integration with other approaches. Consequently, this SLR shows that CRISP-DM is a robust, flexible, and highly adaptable model that can be extended to different business domains. Finally, it proposes a theoretical guide to modify and customize CRISP-DM for Data Science projects.

Published

2024-10-24

How to Cite

Shimaoka, A. M., Ferreira, R. C., & Goldman, A. (2024). The evolution of CRISP-DM for Data Science: Methods, Processes and Frameworks. SBC Reviews on Computer Science, 4(1), 28–43. https://doi.org/10.5753/reviews.2024.3757

Issue

Vol. 4 No. 1 (2024)

Section

Articles