ORBITER: a Lightweight Framework for Automatic Deployment of Big Data Applications on Serverless Architectures
DOI:
https://doi.org/10.5753/jidm.2023.3233
Keywords:
Serverless, FaaS, Big Data Applications
Abstract
The Serverless Computing paradigm has become a reality, with various public cloud environments offering serverless infrastructure and functionalities for general use. One of the main advantages of deploying applications in serverless architectures is that developers can focus on implementing the application and business logic while avoiding the complexities of deployment. However, despite these advantages, there are still challenges to consider, such as the delay in building the architecture of the distributed application. In this article, we introduce a lightweight framework designed for deploying big data applications in a serverless architecture, specifically following the Function-as-a-Service (FaaS) model. The proposed framework uses open-source tools and was thoroughly evaluated through the implementation of two practical applications for Crime Hot Spot Analysis and Rainfall Analysis in the Microsoft Azure cloud. The results of our evaluation demonstrated the potential of the proposed framework in streamlining the deployment process with acceptable overhead.