A Systematic Review of FAIR-compliant Big Data Software Reference Architectures

Authors

João Pedro de Carvalho Castro, Maria Júlia Soares De Grandi, Cristina Dutra de Aguiar

DOI:

https://doi.org/10.5753/jidm.2025.4263

Keywords:

FAIR Principles, Open Science, Software Reference Architecture, SRA, Big Data, Systematic Review

Abstract

To meet the standards of the Open Science movement, the FAIR Principles emphasize the importance of making scientific data Findable, Accessible, Interoperable, and Reusable. Yet creating a repository that adheres to these principles presents significant challenges: large volumes of diverse research data and metadata, often generated rapidly, must be managed carefully. This need has led to the development of Software Reference Architectures (SRAs) that guide the implementation of FAIR-compliant repositories. This article presents a systematic review of research efforts focused on architectural solutions for such repositories. We detail our methodology, covering all activities undertaken in the planning and execution phases of the review. We analyze 323 references from reputable sources and expert recommendations, identifying 7 studies on general-purpose big data SRAs, 13 pipelines implementing the FAIR Principles in specific contexts, and 3 FAIR-compliant big data SRAs. We provide a thorough description of their key features and assess whether the research questions posed in the planning phase were adequately addressed. Additionally, we discuss the limitations of the retrieved studies and identify trends and opportunities for further research.

Published

2025-03-18

How to Cite

João Pedro de Carvalho Castro, Maria Júlia Soares De Grandi, & Cristina Dutra de Aguiar. (2025). A Systematic Review of FAIR-compliant Big Data Software Reference Architectures. Journal of Information and Data Management, 16(1), 136–150. https://doi.org/10.5753/jidm.2025.4263

Section

SBBD 2023 Full papers - Extended papers