Welcome to this mini-series on application performance profiling. The aim of these blogs is to give an overview of the process of application profiling for performance optimization. The topics we will cover in this series are:
- Performance profiling concepts: Parallelism, the critical path, and Amdahl's argument.
- The right tool for the job: Tracing, sampling and instrumenting profilers.
- Interpreting profiles and diagnosing issues: An example with MPI and I/O.
We recently gave a free, in-depth webinar covering these topics which you can watch at https://www.youtube.com/watch?v=t7yBoq6183A
NAG provides an extensive range of HPC performance optimization consultancy services. Contact us at firstname.lastname@example.org to start the conversation.
What Are We Trying To Measure?
Typically, when we talk about profiling for performance optimization, we are looking for ways to make our application run faster and more efficiently. This normally means refactoring and optimizing the code, so the key question we want to answer is, where should I spend my refactoring effort for maximum performance gain? Performance profiling performs two key roles for us here; it provides a baseline measurement to compare our refactored code against, and it can guide our choice of where to concentrate our efforts.
The first step in profiling is to build an overall picture of what our application spends its time doing and if it is doing it efficiently. For example, does it spend the majority of time in a single function or is the time spread evenly over many functions? Does it make consistently good use of the CPU with a consistently high IPC (instructions per cycle) count or are there periods where the utilisation is poor? And, if the code is parallel, is it making good use of all the available processors? Armed with this information, we can then decide where best to spend our available effort.
The primary factor in deciding if an application region is worth optimizing is its contribution to the overall runtime of the application. The larger the fraction of application runtime that section represents, the greater the potential for speedup from your optimization efforts. Take, for example, an application that has three distinct regions, A, B and C, which take 30%, 60% and 10% of the runtime respectively, perhaps representing the setup, computation and finalisation stages of a computation.
Now, suppose region B is optimized to run 25% faster (i.e., in 20% less time), this reduces the overall runtime of the application by $60% times 20% = 12%$
To achieve the same performance reduction by optimizing part A would require it to run in $12% / 60% = 40%$ less time, or 50% faster. Since part C is only 10% of the total runtime to start with, even it were optimized to run instantly, it would still not provide the same overall speedup as a 25% improvement in region B.
This argument, known as Amdahl's argument (sometimes Amdahl's law) can also be extended to a discussion of parallelisation. In this case, it provides an example of the potential gain which can be achieved by parallelising certain sections of the program. Taking our application from above as the example again, suppose that the computation section B were parallelised using 6 threads, giving a speedup of 500%. This would reduce the runtime of this section by 83% and of the overall program by 50%. However, increasing the parallelism to 12 threads (1100%) would only decrease the overall runtime by a further 5% because at this point the parallel region B no longer dominates the runtime and further increasing the parallelism brings diminishing returns.
This is particularly relevant for applications which spend time performing non-computation tasks such as I/O and communication. It is important to consider the contribution of these regions to the overall code performance. For example, if a code spends 90% of its time writing data to disk, there is very little overall performance to be gained in optimizing the 10% computation time compared to the potential gain from improving the I/O performance of the code.
Overall then, this simply provides a structured demonstration of the straightforward idea that it is best to concentrate our optimization efforts on the largest and most time-costly areas of the application first.
The Critical Path
Parallel applications come with an additional level of complexity due to the need to synchronise and communicate data between parallel tasks. These communications can be explicit messages sent between processes (e.g., MPI) or implicit communication through shared memory access, but in either case, synchronisation points are introduced into the code to manage the communication and ensure data safety. Often this management is done using a "barrier" in the code, at which the communicating processes wait until all participants are ready, the communication then occurs, possibly followed by another barrier to ensure communication is complete before computation continues.
The result of this need for synchronisation is the potential for inefficiency as processes wait for each other at barriers. Consider the code timeline below, representing an SPMD (single program multiple data) parallel application with 4 processes. Each process performs computation on some portion of the data before taking part in a global communication and receiving new data, this is used in further computation followed by another global communication.
In this example process 1 spends more time performing computation 1 than the other three processes and as a result, arrives "late" at communication 1, then process 4 spends the longest time in computation 2 and arrives "late" at communication 2. In this case, the only optimizations we can make that will improve the runtime of the application are to reduce the time waiting for process 1 to finish computation 1, and the time waiting for process 4 to finish computation 2. We, therefore, have a "critical path" through the application which represents what is "holding up" the application at any given point.
Though the example given here is for SPMD, the concept of a critical path is just as relevant for MPMD applications where different processes may be performing entirely different tasks but require some future communication. In this case, it is key to understand the details of the task interdependencies to allow structuring of the program for the best concurrency. Many of the available profiling tools support critical path analysis of parallel programs, something that will be discussed in more detail in the next part of this series.
Strategy: Target Expensive Functions On The Critical Path
The critical path is a crucial consideration for parallel programs. Since any optimization not affecting the critical path will not improve the overall runtime of the application. Combined with Amdahl's argument for targeting the functions which represent the most runtime, this means that the key areas to target for optimization are expensive functions which contribute to the critical path.
In the next part, we will look at what profiling tools are available for this task, as well as those suitable for more specialised analyses needed during the optimization process.