HighP5: Programming using Partitioned Parallel Processing Spaces

Authors

DOI:

https://doi.org/10.5753/jbcs.2024.4345

Keywords:

Parallel Programming, Programming Language, Type-Architecture, Declarative Programming, Portability, Productive Computing

Abstract

HighP5 is a new high-level parallel programming language designed to help software developers achieve three objectives simultaneously: programmer productivity, program portability, and superior program performance. HighP5 enables this by fostering a new programming paradigm that we call hardware-cognizant parallel programming. The paradigm uses a uniform hardware abstraction and a declarative programming syntax to let programmers write efficient, hardware-feature-sensitive programs without delving into the details of how those features are implemented. This paper is the first comprehensive description of HighP5's design rationale, language grammar, and core features. It also discusses the runtime behavior of HighP5 programs. In addition, the paper presents preliminary performance results from HighP5 compilers on three different architectural platforms: shared-memory multiprocessors, distributed-memory multi-computers, and hybrid GPU/multi-computers.


Author Biography

Andrew Grimshaw, University of Virginia

Andrew Grimshaw is the chief designer and architect of Mentat and Legion. In 1999 he co-founded Avaki Corporation and served as its Chairman and Chief Technical Officer until 2005, when Avaki was acquired by Sybase. In 2003 he won the Frost & Sullivan Technology Innovation Award. Andrew is a member of the Global Grid Forum (GGF) Steering Committee and the Architecture Area Director in the GGF. He has served on the National Partnership for Advanced Computational Infrastructure (NPACI) Executive Committee, the DoD MSRC Programming Environments and Training (PET) Executive Committee, the CESDIS Science Council, the NRC Review Panel for Information Technology, and the Board on Assessment of NIST Programs. He is the author or co-author of over 50 publications and book chapters.



Published

2024-12-17

How to Cite

Yanhaona, M. N., Grimshaw, A., & Mickey, S. H. (2024). HighP5: Programming using Partitioned Parallel Processing Spaces. Journal of the Brazilian Computer Society, 30(1), 653–687. https://doi.org/10.5753/jbcs.2024.4345

Issue

Section

Articles