Provenance Support for Containerized Workflow Analyses in High-Performance Computing Environments
DOI:
https://doi.org/10.5753/jidm.2026.5723
Keywords:
Container, Provenance, Workflows, Machine Learning
Abstract
Deploying scientific workflows in High-Performance Computing (HPC) environments presents several challenges due to variations in computational infrastructure, execution environments, and resource availability. Containers offer a way to ease workflow deployment and foster reproducibility. However, effective use of containers requires more than access to container images. Understanding container provenance is essential, as it provides detailed information on image creation, configuration, and execution history, which is critical when deploying workflows across different architectures and container engines. Existing provenance support focuses on tracking container actions and standalone processes, but does not relate this information to the provenance of workflows. To address this limitation, we represent container metadata as provenance and relate it to the provenance captured during workflow execution. This approach enables workflow deployment with multiple container configurations in HPC environments while remaining compliant with the W3C-PROV standard for structured container provenance. The proposed model was evaluated on a real scientific machine-learning workflow. The evaluation assessed how provenance data can improve traceability, support workflow reproducibility, and facilitate containerized workflow analyses.
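To make the idea of relating container provenance to workflow provenance more concrete, the following is a minimal sketch, not the model proposed in the article. It assumes that container images and container instances are recorded as PROV entities, that a workflow step is a PROV activity, and that the two sides are linked by a `used` relation. The class `ProvDocument`, the helper functions, and all identifiers and attribute values are hypothetical and chosen only for illustration; only the relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`) come from W3C PROV.

```python
from dataclasses import dataclass, field


@dataclass
class ProvDocument:
    """Tiny in-memory store for PROV-style records (illustrative only)."""
    entities: dict = field(default_factory=dict)
    activities: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (relation, subject, object)

    def entity(self, ident, **attrs):
        self.entities[ident] = attrs
        return ident

    def activity(self, ident, **attrs):
        self.activities[ident] = attrs
        return ident

    def relate(self, relation, subject, obj):
        self.relations.append((relation, subject, obj))

    def objects_of(self, relation, subject):
        """Return every object linked to `subject` by `relation`."""
        return [o for r, s, o in self.relations if r == relation and s == subject]


def build_example():
    doc = ProvDocument()
    # Container-side provenance: image and container instance (hypothetical values).
    image = doc.entity("image:train-env", engine="singularity")
    container = doc.entity("container:run-01", architecture="x86_64")
    doc.relate("wasDerivedFrom", container, image)
    # Workflow-side provenance: one training activity and the data it produced.
    train = doc.activity("activity:train-model", workflow="ml-workflow")
    model = doc.entity("data:model-v1")
    doc.relate("wasGeneratedBy", model, train)
    # Assumed link between the two sides: the activity used the container.
    doc.relate("used", train, container)
    return doc


def images_behind(doc, data_entity):
    """Trace a data entity back to the image(s) it was produced with."""
    images = []
    for act in doc.objects_of("wasGeneratedBy", data_entity):
        for used in doc.objects_of("used", act):
            images.extend(doc.objects_of("wasDerivedFrom", used))
    return images


if __name__ == "__main__":
    print(images_behind(build_example(), "data:model-v1"))
```

With such a link in place, a question like "which image produced this model?" becomes a short traversal over the recorded relations, which is the kind of containerized workflow analysis the abstract refers to.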

