This is the final installment of this mini-series on application performance profiling. The aim of these blogs is to give an overview of the process of application profiling for performance optimization. The topics we cover in this series are:

1. Performance profiling concepts: Parallelism, the critical path, and Amdahl's argument.
2. The right tool for the job: Tracing, sampling and instrumenting profilers.
3. Interpreting profiles and diagnosing issues: An example with MPI and I/O.

In this blog, we will use the concepts introduced in the previous two parts to analyze an example application, first at a high level, and then looking into a specific area of performance.

## The Example Code

For this blog post, we will use the open-source EPOCH plasma physics code as our example. This is a high-performance, MPI-parallel application, capable of scaling efficiently to many thousands of processors on HPC clusters. However, in this case we will run it on just a few processes and look at its efficiency when we want to write a large number of output files during the simulation. This might be required, for example, to capture fine-grained time-dependent behaviour of the plasma over the simulation period.

## Building A Picture of the Application

As we saw in part one, it is important to understand the overall application behaviour in order to determine what the code spends its time doing. For the case of a pure MPI code with no accelerators, and no OpenMP regions, we can cleanly divide the runtime into time spent doing computation, communication, and file I/O. Since EPOCH is an MPI parallel code, using MPI I/O to write data files, we can use a tracing profiler to efficiently gather this information. For this example, the data was collected with the open source Extrae tracing profiler. In this case, the code is configured to save its output after every 20 simulation steps, running for 100 steps in total.
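To make the compute/communication/I/O split concrete, the bucketing a trace analysis performs can be sketched as below. This is a toy illustration only: the event names and durations are made up for the example and are not real Extrae output.

```python
# Toy classification of trace events into compute / communication / I/O time.
# In a real trace each MPI call appears as a timed event per process.
MPI_IO_CALLS = {"MPI_File_open", "MPI_File_write_all", "MPI_File_close"}
MPI_COMM_CALLS = {"MPI_Sendrecv", "MPI_Allreduce", "MPI_Bcast"}

def classify(events):
    """Sum (name, duration-in-seconds) event pairs into three buckets."""
    buckets = {"compute": 0.0, "communication": 0.0, "io": 0.0}
    for name, duration in events:
        if name in MPI_IO_CALLS:
            buckets["io"] += duration
        elif name in MPI_COMM_CALLS:
            buckets["communication"] += duration
        else:
            # Anything that is not an MPI call counts as computation.
            buckets["compute"] += duration
    return buckets

# Illustrative events for a single process, one output step:
events = [
    ("push_particles", 4.0),
    ("MPI_Sendrecv", 0.5),
    ("MPI_File_write_all", 20.0),
]
print(classify(events))
```

With a full trace, summing these buckets across all processes gives the high-level breakdown discussed above.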

The resulting trace timeline is shown below.

From this, we can see that the simulation runs for roughly 150 seconds and appears to make balanced use of all the cores. However, when we look at what the code is actually doing, we find the runtime is dominated by the I/O routines writing the output files; the simulation itself accounts for only a tiny fraction of the time.

## Investigating the Hotspot

Now we know that writing the output files is the major bottleneck, we can investigate in more detail to try and work out what exactly the code is doing, and whether the performance can be improved.

To investigate the I/O issues in more detail we can now switch to a tool specifically designed for investigating I/O behaviour. In this case we use Darshan, another open-source profiler. It is a tracing profiler designed to capture detailed statistics about the application's use of multiple types of I/O, including MPI I/O and POSIX I/O.

As with the high-level profiling, the first information to extract is the summary performance data:

```
# *******************************************************
# POSIX module data
# *******************************************************
# performance
# -----------
# total_bytes: 7686
#
# I/O timing for unique files (seconds): 0.019487
# I/O timing for shared files (seconds): 3.654749
# Aggregate performance, including both shared and unique files (MiB/s): 0.001995

# *******************************************************
# MPI-IO module data
# *******************************************************
# performance
# -----------
# total_bytes: 2466400536
#
# I/O timing for unique files (seconds): 0.000000
# I/O timing for shared files (seconds): 124.677478
# Aggregate performance, including both shared and unique files (MiB/s): 18.865820
```


This confirms that EPOCH almost exclusively performs MPI parallel I/O and, as reported by Extrae, spends roughly 82% of its time performing this I/O. The 7 kB of POSIX I/O is performed by the master process to read the configuration file at the start of the execution. (Note that there is some overlap between the data gathered by Darshan and Extrae; however, as a dedicated I/O profiler, Darshan is capable of gathering richer data and automatically calculating useful figures of merit.)
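The headline MPI-IO rate can be sanity-checked directly from the raw counters in the summary above, since the aggregate figure is just total bytes divided by the shared-file write time:

```python
# Values taken from the Darshan MPI-IO module summary above.
total_bytes = 2466400536       # total_bytes
shared_time_s = 124.677478     # I/O timing for shared files (seconds)

total_mib = total_bytes / 2**20          # bytes -> MiB
rate_mib_s = total_mib / shared_time_s   # aggregate write rate

print(f"{total_mib:.1f} MiB written at {rate_mib_s:.2f} MiB/s")
```

This reproduces Darshan's reported aggregate of roughly 18.87 MiB/s.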

The next question we should ask is, "is this what we expect?" Checking the six output files shows they are each about 400 MB in size, consistent with the reported total I/O size. What is concerning, however, is that the code takes roughly 124 seconds to write just 2.4 GB of data, an average performance of just 18 MB/s. Given that this data was captured on an HPC system with a high-performance GPFS filesystem, we would expect at least an order of magnitude better I/O performance: the BT-IO benchmark, for example, achieves a write rate of over 1000 MB/s on the same system.
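Before digging further, it is worth estimating the payoff using Amdahl's argument from part one. With I/O taking roughly 82% of the runtime, a sketch of the achievable overall speedup for a given I/O improvement might look like:

```python
def amdahl_speedup(fraction, improvement):
    """Overall speedup when `fraction` of the runtime is sped up by
    `improvement`x and the remaining (1 - fraction) is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / improvement)

io_fraction = 0.82  # share of runtime spent in MPI-IO, from the profiles

# A 10x I/O improvement (plausible given the benchmarked filesystem rate)
# would shrink the total runtime by almost a factor of four:
print(amdahl_speedup(io_fraction, 10))

# Even an effectively unlimited I/O improvement is capped by the
# remaining 18% of compute and communication time:
print(amdahl_speedup(io_fraction, 50))
```

The 82% figure comes from the profiles above; the improvement factors are assumptions for illustration. The exercise shows that I/O is by far the most valuable optimization target here.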

To understand more about the behaviour of EPOCH we can use the Darshan results to view the type of writes that EPOCH is performing. The histogram below shows the number and size of writes performed, as well as whether they are independent writes (i.e. by a single process) or collective writes (i.e. all processes write some data at once). For MPI I/O, collective writes are typically much more efficient than independent writes, as the library can make performance-optimizing decisions as well as writing data in parallel. Writing large amounts of data in a single write is also much more efficient than writing in multiple segments, as there are large latency and overhead costs associated with writing to the filesystem, managing metadata, etc.

From this we can see that EPOCH is performing many thousands of very small (< 100 byte) independent writes which are used to populate metadata fields in the output file. This suggests that the cause of the poor write performance is due to large overhead costs being incurred by writing the data as small blocks through many calls. Modifying the code to merge as many of these small writes as possible should improve the performance of the code, something that can be quickly checked by repeating the analysis above with a modified version and comparing the results.
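The aggregation idea can be sketched in a filesystem-agnostic way: buffer the many small metadata writes in memory and flush them to the (expensive) backend in a few large writes. This is an illustrative sketch, not EPOCH's actual code; the class and field names are invented for the example.

```python
import io

class BufferedWriter:
    """Sketch: coalesce many small writes into a few larger ones.
    `backend` stands in for the real, high-latency file write target."""

    def __init__(self, backend, threshold=4096):
        self.backend = backend
        self.threshold = threshold
        self.buffer = bytearray()
        self.flushes = 0          # number of actual backend writes issued

    def write(self, data: bytes):
        # Small writes only touch the in-memory buffer...
        self.buffer += data
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        # ...and the backend sees one large write per flush.
        if self.buffer:
            self.backend.write(bytes(self.buffer))
            self.buffer.clear()
            self.flushes += 1

backend = io.BytesIO()
writer = BufferedWriter(backend, threshold=4096)
for i in range(1000):  # 1000 small (~24-byte) metadata writes
    writer.write(b"metadata_field_%04d=0.0\n" % i)
writer.flush()

# A handful of large backend writes replace 1000 small ones,
# paying the per-write latency cost only a few times.
print(writer.flushes)
```

The same principle applies to MPI I/O: gathering the metadata fields into one buffer and issuing a single (ideally collective) write amortizes the per-call overhead that dominates here.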

In the above example, we were able to use two tools to determine the primary cause of inefficiency in an example code. By first considering the overall application behaviour we found that the application spent the majority of its time performing I/O, and that this I/O performance was unexpectedly poor. This motivated the use of a second tool to provide detailed information on the code's I/O behaviour, which highlighted a large number of very small, independent writes, a common inefficiency pattern in MPI-IO codes.

## Conclusions

A huge number of different profiling and performance analysis tools are available in the software development ecosystem, and knowing which tool to choose, and where to begin analysing performance, is not straightforward.

In the previous parts of the series we considered the types of tools that are available and strategies for efficiently locating the causes of performance inefficiencies. In this final entry in the series we used the EPOCH PIC code as an example of this approach. Initially looking at an overview of the trace showed that the majority of the time was spent in I/O, making it the optimization target with the greatest potential for improvement.

Looking deeper at the I/O with a dedicated I/O profiler showed an average write speed of under 20 MB/s, compared with the benchmarked value of over 1 GB/s, suggesting that the code's I/O performance could potentially be improved by an order of magnitude or more. Using the same tool to look at the actual writes made, it became clear that the code is performing many very small writes. This is a known inefficient pattern which can be improved by aggregating many small writes into one larger one.

## Example Files

The trace files used in this blog and the accompanying webinar, along with the original EPOCH configuration file are available to download here. Please read the included instructions for detailed information on how to open and explore them.
