Machine Learning Model Explainability supported by Data Explainability: a Provenance-Based Approach


  • Rosana Leandro de Oliveira Instituto Militar de Engenharia | Centro de Análises de Sistemas Navais
  • Julio Cesar Duarte Instituto Militar de Engenharia
  • Kelli de Faria Cordeiro Instituto Militar de Engenharia | Ministério da Defesa



Data Pre-Processing, Machine Learning, Data Provenance, Explainability


Explaining the results of Machine Learning (ML) predictive models has become critically important, given the need to improve the reliability of those results. Several techniques have been used to explain the predictions of ML models, and some research works explore the use of data provenance across the phases of the ML cycle. However, there is a gap in relating provenance data to the model explainability provided by Explainable Artificial Intelligence (XAI) techniques. To address this issue, this work presents an approach that captures provenance data, mainly in the pre-processing phase, and relates it to the results of explainability techniques. To support this, a relational data model is also proposed, which forms the basis of our concept of data explainability. Furthermore, a graphic visualization was developed to better present the improved technique. The experimental results showed that the improvement of ML explainability techniques was achieved mainly through the understanding, enabled by data explainability, of the derivation of the attributes that built the model.
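The core idea of the abstract — recording how each model attribute was derived during pre-processing, then relating those records to a feature-level explanation — can be sketched as follows. This is an illustrative outline only, not the authors' implementation or data model; all attribute names, operations, and importance values are hypothetical:

```python
# Sketch: provenance records captured during pre-processing, later used
# to annotate a feature-importance explanation ("data explainability").

provenance = []  # each record: derived attribute, source attribute, operation

def log_derivation(derived, source, operation):
    """Capture one pre-processing step as a provenance record."""
    provenance.append({"derived": derived, "source": source, "op": operation})

# Hypothetical pre-processing steps (Titanic-style attributes, illustrative)
log_derivation("age_imputed", "age", "median imputation")
log_derivation("sex_male", "sex", "one-hot encoding")
log_derivation("fare_log", "fare", "log transform")

def explain_attribute(derived):
    """Data explainability: trace a model attribute back to its derivation."""
    for rec in provenance:
        if rec["derived"] == derived:
            return f"{derived} was derived from '{rec['source']}' via {rec['op']}"
    return f"{derived} has no recorded derivation"

# Relate the provenance to a (mock) feature-importance explanation, so each
# important feature is shown together with how it was built:
importances = {"sex_male": 0.41, "age_imputed": 0.23, "fare_log": 0.12}
for feat, score in importances.items():
    print(f"{score:.2f}  {explain_attribute(feat)}")
```

In the paper's setting the importance scores would come from an XAI technique such as SHAP (Lundberg and Lee, 2017) and the provenance records from the proposed relational data model; the sketch only shows the linkage between the two.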




Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., and Herrera, F. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32. DOI: 10.1023/A:1010933404324.

Chapman, A., Lauro, L., Missier, P., and Torlone, R. (2022). Dpds: Assisting data science with data provenance. Proc. VLDB Endow., 15(12):3614–3617. DOI: 10.14778/3554821.3554857.

Chapman, A., Missier, P., Simonelli, G., and Torlone, R. (2020). Capturing and querying fine-grained provenance of preprocessing pipelines in data science. Proc. VLDB Endow., 14(4):507–520. DOI: 10.14778/3436905.3436911.

Davidson, S. and Freire, J. (2008). Provenance and scientific workflows: Challenges and opportunities. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1345–1350. DOI: 10.1145/1376616.1376772.

Secretaría de Salud (2020). Datos abiertos dirección general de epidemiología. [link]. (Accessed on 04/08/2021).

Franklin, M. R. (2020). "Kaggle: Mexico COVID-19 clinical data". [link]. (Accessed on 02/10/2021).

Freire, J., Koop, D., Santos, E., and Silva, C. T. (2008). Provenance for computational tasks: A survey. Computing in Science Engineering, 10(3):11–21. DOI: 10.1109/MCSE.2008.79.

Hartley, M. and Olsson, T. S. (2020). dtoolAI: Reproducibility for deep learning. Patterns, 1(5):100073.

Herschel, M., Diestelkämper, R., and Ben Lahmar, H. (2017). A survey on provenance: What for? what form? what from? The VLDB Journal, 26(6):881–906. DOI: 10.1007/s00778-017-0486-1.

Jaigirdar, F. T., Rudolph, C., Oliver, G., Watts, D., and Bain, C. (2020). What information is required for explainable ai? : A provenance-based research agenda and future challenges. In 2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC), pages 177–183. DOI: 10.1109/CIC50333.2020.00030.

Jentzsch, S. F. and Hochgeschwender, N. (2019). Don’t forget your roots! using provenance data for transparent and explainable development of machine learning models. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW), pages 37–40. DOI: 10.1109/ASEW.2019.00025.

Kaggle (2022). Titanic - machine learning from disaster. [link]. (Accessed on 26 February 2022).

Linardatos, P., Papastefanopoulos, V., and Kotsiantis, S. (2021). Explainable AI: A review of machine learning interpretability methods. Entropy, 23(1). DOI: 10.3390/e23010018.

Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 4768–4777, Red Hook, NY, USA. Curran Associates Inc. DOI: 10.5555/3295222.3295230.

Moura, L. d. A. L., da Silva, M. A. A., Cordeiro, K. d. F., and Cavalcanti, M. C. R. (2021). A well-founded ontology to support the preparation of training and test datasets. In Proceedings of the 23rd International Conference on Enterprise Information Systems, ICEIS 2021, pages 99–110. SCITEPRESS. DOI: 10.5220/0010460000990110.

Muhammad, L., Algehyne, E. A., Usman, S. S., Ahmad, A., Chakraborty, C., and Mohammed, I. A. (2021). Supervised machine learning models for prediction of covid-19 infection using epidemiology dataset. SN computer science, 2(1):1–13. DOI: 10.1007/s42979-020-00394-7.

Namaki, M. H., Floratou, A., Psallidas, F., Krishnan, S., Agrawal, A., Wu, Y., Zhu, Y., and Weimer, M. (2020). Vamsa: Automated provenance tracking in data science scripts. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. DOI: 10.1145/3394486.3403205.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830. DOI: 10.5555/1953048.2078195.

Rupprecht, L., Davis, J. C., Arnold, C., Gur, Y., and Bhagwat, D. (2020). Improving reproducibility of data science pipelines through transparent provenance capture. Proc. VLDB Endow., 13(12):3354–3368. DOI: 10.14778/3415478.3415556.

Scherzinger, S., Seifert, C., and Wiese, L. (2019). The best of both worlds: Challenges in linking provenance and explainability in distributed machine learning. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pages 1620–1629. DOI: 10.1109/ICDCS.2019.00161.

Souza, R., Azevedo, L., Lourenço, V., Soares, E., Thiago, R., Brandão, R., Civitarese, D., Brazil, E., Moreno, M., Valduriez, P., et al. (2019). Provenance data in the machine learning lifecycle in computational science and engineering. In 2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS), pages 1–10. IEEE. DOI: 10.1109/WORKS49585.2019.00006.

Stonebraker, M., Rowe, L., and Hirohama, M. (1990). The implementation of postgres. IEEE Transactions on Knowledge and Data Engineering, 2(1):125–142. DOI: 10.1109/69.50912.

Tsakalakis, N., Stalla-Bourdillon, S., Carmichael, L., Huynh, T. D., Moreau, L., and Helal, A. (2021). The dual function of explanations: Why it is useful to compute explanations. Computer Law & Security Review, 41:105527. DOI: 10.1016/j.clsr.2020.105527.

van Rossum, G. (1995). Python reference manual.

W3C (2013). The PROV data model. [link]. (Accessed on 01/11/2021).

Wollenstein-Betech, S., Cassandras, C. G., and Paschalidis, I. C. (2020). Personalized predictive models for symptomatic COVID-19 patients using basic preconditions: Hospitalizations, mortality, and the need for an ICU or ventilator. International Journal of Medical Informatics, 142:104258.




How to Cite

de Oliveira, R. L., Duarte, J. C., & de Faria Cordeiro, K. (2024). Machine Learning Model Explainability supported by Data Explainability: a Provenance-Based Approach. Journal of Information and Data Management, 15(1), 93–102.



Regular Papers