This is the second installment of this mini-series on application performance profiling. The aim of these blogs is to give an overview of the process of application profiling for performance optimization. The topics we will cover in this series are:
- Performance profiling concepts: Parallelism, the critical path, and Amdahl's law.
- The right tool for the job: Tracing, sampling and instrumenting profilers.
- Interpreting profiles and diagnosing issues: An example with MPI and I/O.
In this blog we will use the concepts introduced in the first part to understand how to choose the best profiling tools for different situations.
What Can a Profiler Tell Us?
Pretty much anything, but most commonly:
- Parallel calls (and other library calls)
- User functions
- Memory usage
- Timing information
- Hardware events, e.g. CPU instructions, total CPU cycles, and cache misses (see the sketch after this list)
- Other detailed information such as per-line profiles, vectorization analyses, etc.
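As a concrete illustration of the last two points, the sketch below reads two hardware event counters around a region of code using the PAPI low-level API (assuming PAPI is installed and these preset events are available on the machine; link with -lpapi):

```c
#include <stdio.h>
#include <papi.h>

int main(void) {
    /* Two common preset events: total instructions and total cycles */
    int events[2] = {PAPI_TOT_INS, PAPI_TOT_CYC};
    long long values[2];
    int eventset = PAPI_NULL;

    /* Initialize the library, then build an event set for our counters */
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_events(eventset, events, 2);

    PAPI_start(eventset);
    volatile double x = 0.0;            /* ...region of code to measure... */
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;
    PAPI_stop(eventset, values);

    printf("instructions: %lld  cycles: %lld\n", values[0], values[1]);
    return 0;
}
```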
Available Profiling Tools
There are a huge number of profiling tools available, ranging from simple utilities to complex analysis toolkits, developed both as open source projects and by commercial companies. Such a wide range of available tools gives developers the ability to analyze any aspect of their code but also complicates finding the correct tool for a given situation.
Profiling tools often modify the code or libraries which are used to run the application, either by making changes to the code as it is compiled, by intercepting calls to library functions, or by having the developer manually insert instrumentation calls into the code. In addition, some features or optimizations in the application may conflict with the profiling tool, causing crashes or invalid results. Therefore, to choose the correct tool it is important to have an understanding of the way different types of tools operate, and the effects this has on the data collected and on the operation of the application being profiled.
Types of Profiler
Broadly speaking, there are two ways to profile an application: tracing or sampling. Tracing involves inserting extra code into the application, which outputs the relevant information about the application's progress. Sampling uses the operating system of the machine to periodically inspect the application state and return information. Both of these methods have advantages and disadvantages depending on the application we are profiling and the information we want to collect.
Tracing Profilers
Tracing profilers operate by intercepting calls to functions in the application and running some additional profiling code. This extra code captures information such as which function was called, when it was called, and how long it took to execute. Typically this information is written to a trace file to be post-processed and displayed.
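To make this concrete, here is a minimal sketch of manual instrumentation in C: we timestamp a function's entry and exit and emit one trace record per call. The function name and trace format are illustrative, not those of any particular tool.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Wall-clock time in seconds via the POSIX monotonic clock */
static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* An illustrative function we want to trace */
static void compute(long n) {
    volatile double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += i;
}

int main(void) {
    /* Manual instrumentation: timestamp entry and exit, then emit
       one trace record with the name, start time and duration */
    double t0 = now();
    compute(1000000);
    double t1 = now();
    fprintf(stderr, "TRACE compute start=%.9f duration=%.9f\n", t0, t1 - t0);
    return 0;
}
```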
For example, consider an application which contains three functions A, B and C. To capture information about these functions using a tracing profiler, the application must first be prepared by instrumenting the code. This can be done in different ways depending on whether the functions are in our application code or in an external library. If the functions are in our application code, we can instrument them directly, either by using an instrumenting compiler provided by the profiling tool, such as Score-P, or by manually adding instrumentation using the profiler API. If the functions are in an external library, the calls can instead be instrumented by interception: an additional library containing the instrumentation functions is created, the application is linked against this library, and the instrumentation library is in turn linked against the library containing the functions to be profiled. This approach is commonly used to trace calls to parallelism runtimes such as MPI and OpenMP, and parallel profilers will often automate the library interception process at runtime.
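MPI makes the interception route especially straightforward, because the MPI standard defines a profiling interface: every MPI function is also callable under a PMPI_ name, so a wrapper library can define its own MPI_Send, record whatever it needs, and forward to the real implementation. A minimal sketch of such a wrapper (the trace output format is again illustrative):

```c
#include <stdio.h>
#include <mpi.h>

/* This wrapper replaces MPI_Send when the instrumentation library is
   linked ahead of (or preloaded over) the MPI library. It records the
   destination, message size and duration, then forwards the call to
   the real implementation through its PMPI_ name. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    int type_size;
    PMPI_Type_size(datatype, &type_size);   /* bytes per element */

    double t0 = PMPI_Wtime();
    int err = PMPI_Send(buf, count, datatype, dest, tag, comm);
    double t1 = PMPI_Wtime();

    fprintf(stderr, "TRACE MPI_Send dest=%d bytes=%d duration=%.9f\n",
            dest, count * type_size, t1 - t0);
    return err;
}
```

Linking the application against a library of such wrappers, or preloading it at runtime (e.g. with LD_PRELOAD on Linux), is exactly the interception step described above.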

Now, when the instrumented application is run, the profiler code executes every time an instrumented function is called and records the information we need. However, because we have inserted this additional code into the application, we have increased the total amount of work our application does, and so we have increased the runtime. This additional time is known as the profiling overhead. This is shown pictorially below, where the addition of instrumentation adds overhead (orange) to each function call.

It is important to be aware of the overhead and how large it is, because this overhead time can distort the view of how long different parts of the application take to run. This is a particular issue if there are many calls to very short functions, e.g. inside loops, because the overheads can be comparable to or even larger than the amount of time spent in the function itself. For example, if a function body takes 100 ns per call and the instrumentation adds 50 ns, the measured time per call is inflated by 50%.
Although the overheads can potentially be large, tracing is an extremely powerful technique because the profiling code executes inside the application. This means that almost any information available inside the program can be gathered; e.g. for MPI it is very useful to know how much data is communicated and the sender and receiver of the data. The tracing approach is commonly used for profiling parallel code by tracing calls to parallel libraries. For example, it is the method used by the Extrae, Scalasca and Intel ITAC profiling tools, amongst others, to trace MPI and OpenMP behaviour.
Sampling Profilers
Sampling profilers do not alter the application code and instead rely on methods provided by the operating system to periodically inspect the application being profiled. Sampling in this manner requires operating system support but, unlike tracing, needs no additional instrumentation, and so does not require any specific library or language support to insert tracing code. The profiler instead makes periodic requests to the operating system kernel to return information about the application under investigation, such as the function currently being executed and the hardware resources used since the previous query. This is typically a lighter-weight approach than tracing: it gathers less detailed information in exchange for reduced overhead.
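The mechanism is easiest to see in miniature. The toy sketch below, for a POSIX system, uses an interval timer to deliver a SIGPROF signal at a fixed rate; the handler counts how often the program is caught in each phase of the work. (A real sampling profiler instead records the interrupted program counter and maps it back to a function, but the statistics work the same way.)

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

/* Which phase the program is currently in, plus per-phase sample counts.
   A real profiler reads the interrupted program counter instead and maps
   it back to a function using the debugging symbols. */
static volatile sig_atomic_t current = 0;
static volatile sig_atomic_t samples[2] = {0, 0};

static void on_sample(int sig) {
    (void)sig;
    samples[current]++;   /* attribute this sample to the current phase */
}

static void spin(long n) {
    volatile double x = 0.0;
    for (long i = 0; i < n; i++)
        x += i;
}

int main(void) {
    /* Deliver SIGPROF every 10 ms of CPU time consumed by the process */
    struct sigaction sa = {0};
    sa.sa_handler = on_sample;
    sigaction(SIGPROF, &sa, NULL);
    struct itimerval timer = {{0, 10000}, {0, 10000}};
    setitimer(ITIMER_PROF, &timer, NULL);

    current = 0; spin(300000000);   /* phase 0: ~3x the work of phase 1 */
    current = 1; spin(100000000);   /* phase 1 */

    printf("phase 0: %d samples, phase 1: %d samples\n",
           (int)samples[0], (int)samples[1]);
    return 0;
}
```

Since phase 0 does roughly three times the work of phase 1, it should receive roughly three times as many samples, while very short-lived activity may be missed entirely.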

Although sampling does not require modification of the code, it will still normally require the application to be compiled with debugging symbols (e.g. with the -g flag on GCC or Clang). This enables the profiler to express the information using human-readable function names rather than raw addresses.
Use of a sampling profiler will still incur some overhead, as the kernel must interrupt the application in order to inspect its state. However, compared to a tracing profiler, the sampling overhead is essentially fixed by the choice of sampling rate rather than being dependent on application behaviour, and sampling can be used on all code types. The downside is that sampling can only provide a statistical picture of the application behaviour, rather than a deterministic one. An example of this behaviour is given below: notice that certain function calls are never sampled (orange lines), while other function calls are sampled twice. This means the profiler cannot construct a complete timeline of function executions. Instead it counts the number of times it finds the code in any given function and provides information on the relative amount of time spent in each of these functions. For example, at 100 samples per second, a function that appears in 300 of the 6,000 samples collected over a one-minute run is estimated to account for roughly 5% of the runtime.

Notice that there is still some overhead from the sampling, because the application must be interrupted by the OS kernel. However, this will typically be lower than the overhead due to tracing, both because the interruption mechanism is handled by the kernel and because the overhead is independent of the number of function calls the application makes.
Hybrid Profilers
When performing analysis of HPC code it is common to combine tracing and sampling methods in a "hybrid" profiling approach. Consider, for example, a code parallelized with MPI or OpenMP: there are likely to be relatively few calls to the parallelization library, so it is useful to trace these calls and build a complete picture of the parallel behaviour. At the same time, the user code may be a complex C++ code with many small function calls that cannot be traced without large overhead, so sampling can be used to profile it instead. This is the approach used by, for example, Intel VTune to simultaneously profile parallel library calls and user code behaviour while minimizing overhead costs.
Choosing A Tool
So what does this all mean for actually choosing a tool? In short, we want to capture as much relevant information as we can, as quickly and accurately as possible. This means balancing the level of detail we capture against the overheads we will incur in doing the profiling.
In Part 1 we looked at how the application structure affects where it is most effective to focus the optimization effort and the importance of understanding the critical path through the application. As a result, for initial profiling, we will likely want to use hybrid tools that can perform lightweight sampling of the whole application, with tracing of parallel library calls if necessary. This whole-application overview lets us build an understanding of where the bottlenecks are in the code, although it may not show what is causing them. We can then target these regions with more specific tools designed to investigate specific aspects of the code performance, such as I/O, communication or vectorization. These tools often come with a larger overhead, but we can use the information we have gained from the whole-application profiling to target specific code regions, thus improving the efficiency of our analyses.
Next Time
So far in this series, we have looked at how different types of profiling tools work, how they can interact with different application types and regions, and what to consider when choosing a profiling tool.
In the final part of this series, we will look at how to understand and interpret the tracing results using examples. We will begin with an application-overview trace, use this to determine the location of an inefficiency in the code, and then investigate the hotspot in detail to understand and fix the cause of the problem.