Polyflow: a Polystore-compliant Mechanism to Provide Interoperability to Heterogeneous Provenance Graphs

Authors

  • Yan Mendes, Federal University of Juiz de Fora
  • Daniel de Oliveira, Fluminense Federal University
  • Victor Ströele, Federal University of Juiz de Fora

DOI:

https://doi.org/10.5753/jidm.2020.2017

Keywords:

Polystore, Syntactic interoperability, Semantic interoperability

Abstract

Many scientific experiments are modeled as workflows, which usually output massive amounts of data. To guarantee reproducibility, workflows are usually orchestrated by Workflow Management Systems (WfMSs), which capture provenance data. Provenance represents the lineage of a data fragment throughout its transformations by the activities of a workflow. Provenance traces are usually represented as graphs, which allow scientists to analyze and evaluate the results produced by a workflow. However, each WfMS stores provenance in a proprietary format and captures it at a different level of granularity. A challenge therefore arises in more complex scenarios, in which a scientist needs to interpret provenance graphs generated by multiple WfMSs and workflows. To understand the research landscape, we first conduct a Systematic Literature Mapping, assessing existing solutions from several perspectives. With a clearer understanding of the state of the art, we propose a tool called Polyflow. It is based on the concept of Polystore systems and integrates several databases of heterogeneous origin by adopting a global ProvONE schema. Polyflow allows scientists to query multiple provenance graphs in an integrated way. Polyflow was evaluated by experts using provenance data collected from real experiments that generate phylogenetic trees through workflows. The experimental results suggest that Polyflow is a viable solution for interoperating heterogeneous provenance data generated by different WfMSs, from both a usability and a performance standpoint.
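
The idea of querying heterogeneous provenance through one global schema can be illustrated with a minimal sketch. It is not Polyflow's implementation: the two native record layouts (STORE_A, STORE_B), the field names, and the helper functions are hypothetical, and the Execution record only loosely mirrors the ProvONE notion of an execution that used and generated data. Each source gets a small mapping function into the shared record type, and a lineage query then runs over the integrated view without knowing which WfMS produced each record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    """Global-schema record, loosely modeled on a ProvONE execution (assumption)."""
    source: str       # which WfMS store the record came from
    program: str      # name of the executed activity
    used: tuple       # identifiers of input data
    generated: tuple  # identifiers of output data


# Hypothetical native rows from two WfMS provenance stores with different layouts.
STORE_A = [{"task": "align", "in": ["seqs.fasta"], "out": ["aln.phy"]}]
STORE_B = [{"activity_name": "build_tree", "consumed": "aln.phy", "produced": "tree.nwk"}]


def from_a(row):
    """Map a store-A row (list-valued inputs/outputs) to the global schema."""
    return Execution("A", row["task"], tuple(row["in"]), tuple(row["out"]))


def from_b(row):
    """Map a store-B row (single-valued inputs/outputs) to the global schema."""
    return Execution("B", row["activity_name"], (row["consumed"],), (row["produced"],))


def integrated_view():
    """Union of all sources, expressed in the global schema."""
    return [from_a(r) for r in STORE_A] + [from_b(r) for r in STORE_B]


def lineage(data_id, executions):
    """Return the programs that contributed to data_id, upstream first."""
    for ex in executions:
        if data_id in ex.generated:
            upstream = []
            for inp in ex.used:
                upstream.extend(lineage(inp, executions))
            return upstream + [ex.program]
    return []  # data_id is a raw input with no recorded producer


if __name__ == "__main__":
    print(lineage("tree.nwk", integrated_view()))
```

In this toy setting the lineage of tree.nwk crosses both stores, which is the kind of integrated question the abstract describes; a real polystore would push the per-source mapping down to each database rather than loading rows into memory.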

Published

2021-05-04

How to Cite

Mendes, Y., de Oliveira, D., & Ströele, V. (2021). Polyflow: a Polystore-compliant Mechanism to Provide Interoperability to Heterogeneous Provenance Graphs. Journal of Information and Data Management, 11(3). https://doi.org/10.5753/jidm.2020.2017

Section

SBBD 2019