On limits of Machine Learning techniques in the learning of scheduling policies

Authors

DOI:

https://doi.org/10.5753/reic.2023.3419

Keywords:

Scheduling Heuristics, High Performance Computing, Machine Learning, Linear Regression

Abstract

This undergraduate research (scientific initiation) work explores the emerging relationship between resource management on high-performance computing (HPC) platforms and the use of regression-derived scheduling heuristics to optimize performance. Recent research has shown that machine learning (ML) techniques can generate scheduling heuristics that are both simple and efficient. This work proposes an alternative approach that uses polynomial functions to generate scheduling heuristics. The simplest polynomial turned out to be one of the most efficient heuristics. We also evaluated the resilience of the regression-derived heuristics over time. This research resulted in two papers published in peer-reviewed national and international workshops (Qualis B3/B4).
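To make the idea concrete, the sketch below (purely illustrative, not the heuristic learned in this work) shows how a regression-obtained polynomial can act as a scheduling policy: every waiting job is scored by a small polynomial of its features, and the job with the best (lowest) score is started next. The feature names p (estimated runtime), q (requested cores), and r (waiting time), as well as the terms and coefficients, are hypothetical placeholders; in a regression-based approach they would be fitted over simulated schedules.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Job:
        """A waiting job described by features commonly used in
        regression-obtained heuristics (names are illustrative)."""
        name: str
        p: float  # estimated processing time (seconds)
        q: int    # number of requested cores
        r: float  # current waiting time (seconds)

    def polynomial_score(job: Job, coeffs: Dict[str, float]) -> float:
        """Score a job with a small polynomial of its features.
        Lower score means higher scheduling priority. The terms and
        coefficients are placeholders, not learned values."""
        return (coeffs.get("p", 0.0) * job.p
                + coeffs.get("q", 0.0) * job.q
                + coeffs.get("pq", 0.0) * job.p * job.q
                - coeffs.get("r", 0.0) * job.r)

    def pick_next(queue: List[Job], coeffs: Dict[str, float]) -> Job:
        """Choose the next job to start: the one with the smallest score."""
        return min(queue, key=lambda j: polynomial_score(j, coeffs))

    if __name__ == "__main__":
        queue = [Job("a", p=3600, q=16, r=120),
                 Job("b", p=600, q=4, r=900),
                 Job("c", p=7200, q=64, r=30)]
        # Hypothetical coefficients; a regression step would fit these.
        coeffs = {"pq": 1.0, "r": 0.5}
        print(pick_next(queue, coeffs).name)  # "b": small p*q, long wait

A simple linear term such as p*q already yields a compact policy, which matches the abstract's observation that the simplest polynomial was among the most efficient heuristics evaluated.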


References

Alin, A. (2010). Multicollinearity. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3):370–374.

Amvrosiadis, G., Kuchnik, M., Park, J. W., Cranor, C., Ganger, G. R., Moore, E., and DeBardeleben, N. (2018). The Atlas cluster trace repository. USENIX ;login:, 43(4).

Brucker, P. (2007). Scheduling Algorithms. Springer, 5th edition.

Carastan-Santos, D., Camargo, R. Y. D., Trystram, D., and Zrigui, S. (2019). One can only gain by replacing EASY backfilling: A simple scheduling policies case study. In 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE.

Carastan-Santos, D. and de Camargo, R. Y. (2017). Obtaining dynamic scheduling policies with simulation and machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM.

Casanova, H., Giersch, A., Legrand, A., Quinson, M., and Suter, F. (2014). Versatile, scalable, and accurate simulation of distributed applications and platforms. Journal of Parallel and Distributed Computing, 74(10):2899–2917.

Fan, Y., Lan, Z., Childers, T., Rich, P., Allcock, W., and Papka, M. E. (2021). Deep reinforcement agent for scheduling in HPC. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 807–816.

Feitelson, D. G. (2001). Metrics for parallel job scheduling and their convergence. In Job Scheduling Strategies for Parallel Processing, pages 188–205. Springer Berlin Heidelberg.

Feitelson, D. G., Rudolph, L., Schwiegelshohn, U., Sevcik, K. C., and Wong, P. (1997). Theory and practice in parallel job scheduling. In Job Scheduling Strategies for Parallel Processing: IPPS '97 Workshop, Geneva, Switzerland, April 5, 1997, Proceedings, pages 1–34. Springer.

Feitelson, D. G., Tsafrir, D., and Krakov, D. (2014). Experience with using the parallel workloads archive. Journal of Parallel and Distributed Computing, 74(10):2967–2982.

García, C. G., Gómez, R. S., and Pérez, J. G. (2022). A review of ridge parameter selection: minimization of the mean squared error vs. mitigation of multicollinearity. Communications in Statistics - Simulation and Computation, pages 1–13.

Legrand, A., Trystram, D., and Zrigui, S. (2019). Adapting batch scheduling to workload characteristics: What can we expect from online learning? In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.

Li, J., Zhang, X., Han, L., Ji, Z., Dong, X., and Hu, C. (2021). OKCM: improving parallel task scheduling in high-performance computing systems using online learning. The Journal of Supercomputing, 77(6):5960–5983.

Lublin, U. and Feitelson, D. G. (2003). The workload on parallel supercomputers: modeling the characteristics of rigid jobs. Journal of Parallel and Distributed Computing, 63(11):1105–1122.

Mu'alem, A. W. and Feitelson, D. G. (2001). Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems, 12(6):529–543.

Rosa, L., Carastan-Santos, D., and Goldman, A. (2023). An experimental analysis of regression-obtained HPC scheduling heuristics. In Job Scheduling Strategies for Parallel Processing. Springer-Verlag. To be published.

Rosa, L. and Goldman, A. (2022). In search of efficient scheduling heuristics from simulations and machine learning. In Anais Estendidos do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho, pages 17–24, Porto Alegre, RS, Brasil. SBC.

Shalf, J. (2020). The future of computing beyond Moore's law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 378(2166):20190061.

Tang, W., Lan, Z., Desai, N., and Buettner, D. (2009). Fault-aware, utility-based job scheduling on Blue Gene/P systems. In 2009 IEEE International Conference on Cluster Computing and Workshops. IEEE.

Zhang, D., Dai, D., He, Y., Bao, F. S., and Xie, B. (2020). RLScheduler: An automated HPC batch job scheduler using reinforcement learning. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15.

Zrigui, S., de Camargo, R. Y., Legrand, A., and Trystram, D. (2022). Improving the performance of batch schedulers using online job runtime classification. Journal of Parallel and Distributed Computing, 164:83–95.

Published

2023-08-05

How to Cite

de Sousa Rosa, L., Carastan-Santos, D., Goldman, A., & Trystram, D. (2023). On limits of Machine Learning techniques in the learning of scheduling policies. Electronic Journal of Undergraduate Research on Computing, 21(2), 61–70. https://doi.org/10.5753/reic.2023.3419