The reign and modern challenges of the Message Passing Interface (MPI): A discussion with Dr. Torsten Hoefler.

A few years ago, while I was a graduate student in Greece, I was preparing slides for my talk at the SIAM Parallel Processing 2012 conference. When I showed them to one of my colleagues, one of his comments was: “All good, but why do you guys doing numerical linear algebra and parallel computing always use the Message Passing Interface to communicate between the processors?”. Having read* the book review of Beresford Parlett in [1], I did have the wit to imitate Marvin Minsky and reply “Is there any other way?”. Nowadays, this question is even more interesting, and my answer would certainly be longer (perhaps too long!). Executing programs in distributed computing environments requires communication between the processors. It is then natural to ask by what protocols and guidelines the processors should communicate with each other. This is the question to which the Message Passing Interface (MPI) has been the answer for more than 25 years.

A logo of the Message Passing Interface. Photo credits: http://mpi-forum.org/

MPI is a standardized and portable message-passing system designed for communication between processors in cluster computing environments (or even within today’s desktops and laptops, which average just two to four compute cores each). MPI is designed to function on a wide variety of parallel computing architectures, and each vendor can provide its own implementation, the only requirement being adherence to the MPI specification. Since its first version, MPI has been widely adopted by researchers in academia and industry.

But Computer Science and High-Performance Computing (HPC) are not the same as they used to be. From an applications viewpoint, until a few years ago the HPC community was busy solving the large-scale linear systems and eigenvalue problems that arise from discretizing the Partial Differential Equations used to model real-world phenomena (simulations). Nowadays, there is huge interest in exploiting the massive amounts of data generated every day in order to infer patterns and trends. Analyzing such humongous volumes of data requires computers with enormous compute capabilities, and these are built by combining thousands of processors, each with multiple cores (we tend to call these supercomputers).

MPI has long been the de facto language by which the different processors in supercomputers communicate, and it is not an overstatement to say that current progress in supercomputing is owed largely to its existence. But is MPI the right communication protocol for new disciplines such as data analytics? Should MPI undergo major changes, especially now that scientists with a very limited background in Computer Science are also getting interested in solving large-scale problems requiring supercomputers?

Cover picture of Professor Torsten Hoefler on the occasion of his receiving the Latsis Prize of ETH Zurich in 2015. Photo credits: ETH Zurich / Peter Rüegg.

To shed some light on these questions, we contacted Dr. Hoefler, a Professor at ETH Zurich and a world-renowned expert in High-Performance Computing. Torsten is an Assistant Professor of Computer Science at ETH Zürich, Switzerland. Before joining ETH, he led the performance modeling and simulation efforts of parallel petascale applications for the NSF-funded Blue Waters project at NCSA/UIUC. He is also a key member of the Message Passing Interface (MPI) Forum, where he chairs the “Collective Operations and Topologies” working group. Torsten won best paper awards at the ACM/IEEE Supercomputing Conference 2010 (SC10), EuroMPI 2013, ACM/IEEE Supercomputing Conference 2013 (SC13), and other conferences. He has published numerous peer-reviewed scientific conference and journal articles and authored chapters of the MPI-2.2 and MPI-3.0 standards. For his work, Torsten received the SIAM SIAG/Supercomputing Junior Scientist Prize in 2012 and the IEEE TCSC Young Achievers in Scalable Computing Award in 2013. Following his Ph.D., he received the Young Alumni Award 2014 from Indiana University. Torsten was elected to the first steering committee of ACM’s SIGHPC in 2013. His research interests revolve around the central topic of “Performance-centric Software Development” and include scalable networks, parallel programming techniques, and performance modeling. Additional information about Torsten can be found on his homepage at htor.inf.ethz.ch.

(The biographical note of Torsten was copied from “http://ppopp17.sigplan.org/profile/torstenhoefler”)

XRDS: MPI began about 25 years ago and has since been, undoubtedly, the “King” of HPC. What were the characteristics of MPI that made it the de facto language of HPC?

Torsten: If you ask me, I would say that MPI’s major strength is due to its clear organization around a relatively small number of orthogonal concepts. These include communication contexts (communicators), blocking/nonblocking, datatypes, collective communications, remote memory access, and some more. These concepts really work like Lego blocks and can be combined into powerful functions. For example, the transpose of a parallel Fast Fourier Transform (FFT) can be implemented with a single call to a nonblocking MPI_Alltoall when using datatypes [2]. Of course, big parts of MPI’s success also stem from how simple it makes implementing libraries and applications. MPI’s library interface requires no compiler support to be added to most languages and can thus easily be implemented.
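To make the transpose example above a little more concrete, here is a minimal sketch of the data movement behind an all-to-all exchange. It is a pure-Python simulation on a single process, not MPI code: the function names `alltoall` and `distributed_transpose` are hypothetical, and the "ranks" are just list indices. The hand-written loops that slice and reassemble blocks play the role that MPI datatypes would play in a real implementation such as the one described in [2].

```python
def alltoall(send_buffers):
    """Simulate an all-to-all exchange among p ranks.

    send_buffers[src][dst] is the block that rank `src` sends to rank `dst`.
    The result recv[dst][src] is the block that rank `dst` received from `src`.
    """
    p = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(p)] for dst in range(p)]


def distributed_transpose(matrix, p):
    """Transpose a square matrix whose rows are split evenly over p ranks."""
    n = len(matrix)
    assert n % p == 0, "this sketch assumes p divides the matrix size"
    b = n // p  # number of rows owned by each rank

    # Rank r owns rows r*b .. (r+1)*b - 1.
    local_rows = [matrix[r * b:(r + 1) * b] for r in range(p)]

    # Each rank cuts its rows into p column blocks; block j is sent to rank j.
    send = [
        [[row[j * b:(j + 1) * b] for row in local_rows[r]] for j in range(p)]
        for r in range(p)
    ]

    recv = alltoall(send)  # the single collective exchange

    # Each rank transposes the b-by-b blocks it received and lays them out
    # side by side, which yields its rows of the transposed matrix.
    result = []
    for r in range(p):
        for c in range(b):
            new_row = []
            for src in range(p):
                new_row.extend(recv[r][src][i][c] for i in range(b))
            result.append(new_row)
    return result


if __name__ == "__main__":
    m = [[4 * i + j for j in range(4)] for i in range(4)]
    for row in distributed_transpose(m, 2):
        print(row)
```

In this sketch all communication happens in the one `alltoall` call; everything before and after it is local reordering of data.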

XRDS: What are the new features we should expect in the future? Are there any specific challenges that the standard will have to consider on the road to exascale computing?

Torsten: I would conjecture that the first exascale application will use MPI, even if it’s not extended until then. MPI’s latest version 3 is scalable and has very basic support for fault tolerance making it a great candidate for early exascale systems. However, among the things that could benefit very large-scale execution are improved fault-tolerance support, better interaction with accelerators, and task-based executions.

XRDS: There has been a lot of discussion recently regarding the exascale computing era and the extreme computing power it will make available. Given the increasing gap between the speed of processors and the speed of bringing data from memory, how should we proceed? Do we need algorithms that communicate less, or more efficient implementations on the programmer’s side?

Torsten: Exactly, minimizing and scheduling communication efficiently will be one of the major challenges for exascale. Very large-scale machines already have comparatively low communication bandwidth. While minimizing communications is a problem for algorithm developers, communication scheduling is managed by the MPI implementation. MPI-3 has added several interfaces to enable more powerful communication scheduling, for example nonblocking collective operations and neighborhood collective operations. The implementation of these interfaces spawns interesting research questions that are tackled by the community right now.
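The idea behind nonblocking collectives can be sketched as a start/compute/wait pattern. The following is a minimal simulation using only Python's standard library, not MPI itself: `fake_allreduce`, `local_work`, and `overlapped` are hypothetical names, a background thread with a sleep stands in for the network, and the three steps loosely mirror starting an operation such as `MPI_Iallreduce`, doing independent computation, and then calling `MPI_Wait`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_allreduce(values, delay):
    """Stand-in for a collective sum; the sleep imitates network latency."""
    time.sleep(delay)
    return sum(values)


def local_work(n):
    """Computation that does not depend on the result of the collective."""
    return sum(i * i for i in range(n))


def overlapped(values, n, delay=0.2):
    """Start the 'collective', compute while it is in flight, then wait."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        request = pool.submit(fake_allreduce, values, delay)  # start
        partial = local_work(n)                               # overlap
        total = request.result()                              # wait
    return total, partial


if __name__ == "__main__":
    print(overlapped([1, 2, 3], 1000))
```

When the local work takes about as long as the simulated communication, the total run time approaches the longer of the two rather than their sum, which is the latency hiding that better communication scheduling aims for.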

XRDS: With the rise of data analytics, new parallel computing paradigms have come to the foreground, e.g., Spark, Hadoop, and others. How do you see MPI competing against those paradigms in big data applications, where other considerations matter, such as the length of the code, fault tolerance, and accessibility to scientists who are not CS majors?

Torsten: MPI originated from within the HPC community that runs large simulation codes that have been developed for decades on very large systems. Much of the big data community moved from single nodes to parallel and distributed computing to process larger amounts of data using relatively short-lived programs and scripts. Thus, programmer productivity only played a minor role in MPI/HPC while it was one of the major requirements for big data analytics. While MPI codes are often orders of magnitude faster than many big data codes, they also take much longer to develop. And that is most often a good trade-off. I would not want to spend weeks writing a data-analytics application to peek at some large dataset. Yet, when I am using tens of thousands of cores running the same code, I would probably like to invest in improving the execution speed. So I don’t think the models are competing; they’re simply designed for different uses and can certainly learn much from each other!

XRDS: MPI supports I/O operations but many applications of interest nowadays lie in the “big data” regime, where large chunks of data have to be read/written from/to the disk. What is the current status of MPI-I/O? Is there anything new coming in the next release?

Torsten: MPI I/O was introduced nearly two decades ago to improve the handling of large datasets in parallel settings. It is successfully used in many large applications and I/O libraries such as HDF5. I believe it could also be used to improve large block I/O in MapReduce applications, but that needs to be shown in practice. I do not know of any major new innovations in I/O planned for MPI-4.

XRDS: Compute accelerators such as GPUs play an important role in modern data analytics and machine learning. How is MPI reacting to this change?

Torsten: MPI predates the time when the use of accelerators became commonplace. However, when these accelerators are used in distributed-memory settings such as computer clusters, then MPI is the common way to program them. The current model, often called MPI+X (e.g., MPI+CUDA), combines traditional MPI with accelerator programming models (e.g., CUDA, OpenACC, OpenMP, etc.) in a simple way. In this model, MPI communication is performed by the CPU. Yet, this can be inefficient and inconvenient, and recently we have proposed a programming model called distributed CUDA (dCUDA) to perform communication from within a CUDA compute kernel [3]. This allows the powerful GPU warp scheduler to be used for hiding communication latency. In general, integrating accelerators and communication functions is an interesting research topic. Even a model called “MPI+MPI”, combining MPI at different hierarchy levels, seems very useful in practice, depending of course on the application [4].

1. SIAM REVIEW, Vol. 52, Issue 4, pp. 771–791.

2. T. Hoefler and S. Gottlieb, “Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient using MPI datatypes”. Proceedings of the 17th European MPI Users’ Group Meeting Conference on Recent Advances in the Message Passing Interface, pp. 131–141, 2010.

3. T. Gysi, J. Baer, T. Hoefler, “dCUDA: Hardware Supported Overlap of Computation and Communication”. In Proceedings of the ACM/IEEE Supercomputing Conference (SC16), 2016.

4. T. Hoefler, J. Dinan, D. Buntinas, P. Balaji, B. Barrett, R. Brightwell, et al., “MPI+MPI: A new hybrid approach to parallel programming with MPI plus shared memory”. Computing, 95, pp. 1121–1136, 2013.

Acknowledgments: VK would like to thank Dr. Geoffrey Dillon, Professor Torsten Hoefler and Dr. Pedro Lopes for their comments on this post. Of course, any omissions and/or errors are the sole responsibility of the main author :)

*: It was Efstratios Gallopoulos who brought the book review of Parlett to my attention.
