How Does Software Configuration Parameters Impact Job's Execution Time in Spark?
DOI:
https://doi.org/10.5753/jisa.2025.4904
Keywords:
Big Data, Design of Experiments, Apache Spark, Time Performance
Abstract
Traditional centralized systems cannot cope with big data workloads. Distributed computing platforms such as Apache Spark have been widely adopted, but configuring their parameters is challenging given the number of factors and their interactions. This work employs Design of Experiments (DoE) techniques to screen the most relevant factors affecting the execution time of a distributed Naïve Bayes machine learning task on a subset of the PT7 Web Corpus, which has 14.88 GB of data. Employing a fractional factorial design with 192 experimental units and linear regression with backward elimination, we obtained (i) the most relevant factors based on statistical significance and (ii) a model capable of predicting execution time from the parameters' values in the analyzed context. Our results also include a visualization technique based on Inselberg's Parallel Coordinates that helps show how performance varies across configuration possibilities.
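To make the screening idea concrete, the following is a minimal sketch of a two-level fractional factorial design and the estimation of main effects from it. It is not the design used in this work: the factor names are hypothetical, the 8-run 2^(4-1) layout is far smaller than the 192 experimental units reported above, the response values are synthetic, and ranking by effect size stands in for the regression with backward elimination.

```python
from itertools import product

# Hypothetical factor names; the factors actually studied are not listed in the abstract.
FACTORS = ["executor_memory", "executor_cores", "shuffle_partitions", "compression"]


def fractional_factorial_2_4_1():
    """Build a 2^(4-1) design: full factorial in A, B, C with generator D = A*B*C."""
    runs = []
    for a, b, c in product((-1, 1), repeat=3):
        runs.append((a, b, c, a * b * c))
    return runs


def main_effects(design, responses):
    """Main effect of each factor: mean response at +1 minus mean response at -1."""
    half = len(design) / 2  # balanced design: half of the runs sit at each level
    effects = {}
    for j, name in enumerate(FACTORS):
        contrast = sum(run[j] * y for run, y in zip(design, responses))
        effects[name] = contrast / half
    return effects


def synthetic_time(run):
    """Made-up execution time in seconds, for illustration only (not measured data)."""
    a, b, c, d = run
    return 100.0 - 12.0 * a - 5.0 * b + 0.5 * c + 0.0 * d


design = fractional_factorial_2_4_1()
responses = [synthetic_time(run) for run in design]
effects = main_effects(design, responses)

# Rank factors by absolute effect size; the largest ones are screening candidates.
ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, effect in ranked:
    print(f"{name:>20}: {effect:+.1f} s")
```

Because the design columns are balanced and mutually orthogonal, each main effect can be read from a simple contrast without fitting a full regression model. In a real study, replicated runs would be needed to judge statistical significance.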
References
Ahmed, N., Barczak, A. L. C., Susnjak, T., and Rashid, M. A. (2020). A comprehensive performance analysis of apache hadoop and apache spark for large scale data sets using hibench. J. Big Data, 7(1):110. DOI: 10.1186/s40537-020-00388-5.
Amato, A. (2017). On the Role of Distributed Computing in Big Data Analytics, pages 1-10. Springer International Publishing, Cham. DOI: 10.1007/978-3-319-59834-5_1.
Chen, Q., Wang, K., Bian, Z., Cremer, I., Xu, G., and Guo, Y. (2016). Simulating spark cluster for deployment planning, evaluation and optimization. In 2016 6th International Conference on Simulation and Modeling Methodologies, Technologies and Applications (SIMULTECH), pages 1-11. Available online [link].
Dietzsch, J., Heinrich, J., Nieselt, K., and Bartz, D. (2009). Spray: A visual analytics approach for gene expression data. In 2009 IEEE Symposium on Visual Analytics Science and Technology, pages 179-186. IEEE. DOI: 10.1109/VAST.2009.5333911.
Fisher, R. A. (1936). Design of experiments, volume 1. BMJ Publishing Group. Book.
Gounaris, A. and Torres, J. (2018). A methodology for spark parameter tuning. Big Data Research, 11:22-32. DOI: 10.1016/j.bdr.2017.05.001.
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., and Ullah Khan, S. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47:98-115. DOI: 10.1016/j.is.2014.07.006.
Heinrich, J. and Weiskopf, D. (2013). State of the art of parallel coordinates. Eurographics (state of the art reports), pages 95-116. Available online [link].
Inselberg, A. (1985). The plane with parallel coordinates. The visual computer, 1:69-91. DOI: 10.1007/BF01898350.
Inselberg, A. and Dimsdale, B. (2009). Parallel coordinates. Human-Machine Interactive Systems, pages 199-233. DOI: 10.1007/978-1-4684-5883-1.
Kutner, M. (2005). Applied Linear Statistical Models. McGraw-Hill international edition. McGraw-Hill Irwin. Book.
Laney, D. (2001). 3D data management: Controlling data volume, velocity, and variety. Technical report, META Group. Available online [link].
Lenth, R. V. (2009). Response-surface methods in r, using rsm. Journal of Statistical Software, 32(7):1–17. DOI: 10.18637/jss.v032.i07.
Lujan-Moreno, G. A., Howard, P. R., Rojas, O. G., and Montgomery, D. C. (2018). Design of experiments and response surface methodology to tune machine learning hyperparameters, with a random forest case-study. Expert Systems with Applications, 109:195-205. DOI: 10.1016/j.eswa.2018.05.024.
Montgomery, D. and Runger, G. (2003). Estatística aplicada e probabilidade para engenheiros. Livros Técnicos e Científicos. Book.
Montgomery, D. C. (2017). Design and analysis of experiments. John Wiley & Sons. Book.
Nguyen, N., Maifi Hasan Khan, M., and Wang, K. (2018). Towards automatic tuning of apache spark configuration. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 417-425. DOI: 10.1109/CLOUD.2018.00059.
Petridis, P., Gounaris, A., and Torres, J. (2017). Spark parameter tuning via trial-and-error. In Angelov, P., Manolopoulos, Y., Iliadis, L., Roy, A., and Vellasco, M., editors, Advances in Big Data, pages 226-237, Cham. Springer International Publishing. DOI: 10.1007/978-3-319-47898-2_24.
Rodrigues, J., Vasconcelos, G., and Maciel, P. (2020). Pt7 web, an annotated portuguese language corpus.
Rodrigues, J., Vasconcelos, G., and Maciel, P. (2021). Screening hardware and volume factors in distributed machine learning algorithms on spark: A design of experiments (doe) based approach. Computing, 103. DOI: 10.1007/s00607-021-00965-3.
Rodrigues, J. B. (2020). Análise de fatores relevantes no desempenho de plataformas para processamento de Big Data: uma abordagem baseada em projeto de experimentos. Tese de doutorado, Universidade Federal de Pernambuco, Recife, PE, Brasil. Available online [link].
Rummukainen, H., Hörhammer, H., Kuusela, P., Kilpi, J., Sirviö, J., and Mäkelä, M. (2024). Traditional or adaptive design of experiments? a pilot-scale comparison on wood delignification. Heliyon, 10(2). Available online [link].
Simonet, A., Fedak, G., and Ripeanu, M. (2015). Active data: A programming model to manage data life cycle across heterogeneous systems and infrastructures. Future Generation Computer Systems, 53:25-42. DOI: 10.1016/j.future.2015.05.015.
Steed, C. A., Swan, J. E., Jankun-Kelly, T., and Fitzpatrick, P. J. (2009). Guided analysis of hurricane trends using statistical processes integrated with interactive parallel coordinates. In 2009 IEEE symposium on visual analytics science and technology, pages 19-26. IEEE. DOI: 10.1109/VAST.2009.5332586.
Unwin, A., Volinsky, C., and Winkler, S. (2003). Parallel coordinates for exploratory modelling analysis. Computational Statistics & Data Analysis, 43(4):553-564. DOI: 10.1016/S0167-9473(02)00292-X.
Wang, G., Xu, J., and He, B. (2016). A novel method for tuning configuration parameters of spark based on machine learning. In 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 586-593. DOI: 10.1109/HPCC-SmartCity-DSS.2016.0088.
Wegman, E. J. (1990). Hyperdimensional data analysis using parallel coordinates. Journal of the american statistical association, 85(411):664-675. Available online [link].
Wegman, E. J. and Luo, Q. (1997). High dimensional clustering using parallel coordinates and the grand tour. In Classification and Knowledge Organization: Proceedings of the 20th Annual Conference of the Gesellschaft für Klassifikation eV, University of Freiburg, March 6-8, 1996, pages 93-101. Springer. DOI: 10.1007/978-3-642-59051-1_10.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. (2010). Spark: Cluster computing with working sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10). Available online [link].
License
Copyright (c) 2025 Journal of Internet Services and Applications

This work is licensed under a Creative Commons Attribution 4.0 International License.

