NAG's HPC team was chosen to perform the following work because of their experience in working with large HPC centers on specialist, business-focused projects. The work was carried out as part of NAG's involvement in the Performance Optimisation and Productivity Centre of Excellence.
Shearwater Reveal is a seismic processing code that does time and depth analysis for land and maritime applications. Its land processing tools cover all aspects of land processing, from refraction statistics to final time and depth imaging.
The Shearwater Reveal application was assessed, and the recommended changes were tested in a Proof of Concept (PoC). Wadud Miah, an HPC Application Analyst at NAG, performed the work. During the assessment phase he identified reasonable load balance, good Amdahl’s efficiency and good computational efficiency. However, the OpenMP efficiency was low due to a high overhead.
The cause of the high overhead was found to be an OpenMP critical region that protected file read/write operations from race conditions. In the PoC, Wadud modified the code so that the I/O was moved out of the OpenMP region, which allowed the critical region to be removed. In addition, the loop schedule was switched to the OpenMP dynamic schedule. The parallel scalability graph below shows the changed PoC code with both the static and dynamic schedules.
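The restructuring described above can be sketched in C as follows. This is a minimal illustration and not Shearwater Reveal code: the function names `read_traces`, `process_traces` and `write_traces`, the kernel `process_sample`, and the flat "array of traces" layout are all assumptions made for the example. The point is only the pattern: all file access happens sequentially before and after the parallel loop, so the loop body touches no shared file handle and needs no critical region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-sample kernel standing in for the real computation. */
static float process_sample(float x)
{
    return 2.0f * x + 1.0f;
}

/* Step 1 (sequential): read all traces into a temporary buffer, outside
 * any OpenMP region. Returns the number of complete traces read. */
size_t read_traces(FILE *in, float *buf, size_t n_traces, size_t trace_len)
{
    assert(in != NULL && buf != NULL && trace_len > 0);
    return fread(buf, sizeof(float) * trace_len, n_traces, in);
}

/* Step 2 (parallel): each iteration works on its own trace in memory.
 * There is no file access inside the loop, so no critical region is
 * required. schedule(dynamic) hands out iterations to threads on demand. */
void process_traces(float *buf, size_t n_traces, size_t trace_len)
{
    assert(buf != NULL || n_traces == 0);

    #pragma omp parallel for schedule(dynamic)
    for (long t = 0; t < (long)n_traces; ++t) {
        float *trace = buf + (size_t)t * trace_len;
        for (size_t i = 0; i < trace_len; ++i) {
            trace[i] = process_sample(trace[i]);
        }
    }
}

/* Step 3 (sequential): write the results back, again outside the
 * parallel region. Returns the number of complete traces written. */
size_t write_traces(FILE *out, const float *buf, size_t n_traces, size_t trace_len)
{
    assert(out != NULL && buf != NULL && trace_len > 0);
    return fwrite(buf, sizeof(float) * trace_len, n_traces, out);
}
```

With GCC or Clang the sketch would be built with `-fopenmp`; without that flag the pragma is ignored and the loop runs serially with the same result. Note the trade-off in this pattern: steps 1 and 3 run sequentially and the temporary buffer has to be allocated and freed by the caller.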
The PoC code with the dynamic schedule shows a performance gain only at 18 and 24 threads. This is due to the increase in sequential execution caused by the file I/O and by the memory allocation and deallocation needed to store the temporary data.
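The effect of sequential execution on scaling can be made concrete with Amdahl's law. The formula is standard; the symbols $s$ (fraction of run time that is sequential) and $p$ (number of threads) and the numbers in the worked example are illustrative assumptions, not measurements from this study.

```latex
S(p) = \frac{1}{\,s + \dfrac{1 - s}{p}\,}
\qquad\text{e.g. } s = 0.05,\; p = 24:\quad
S(24) = \frac{1}{0.05 + \dfrac{0.95}{24}} \approx 11.2
```

Under these assumed values, even a 5% sequential share would cap the speedup on 24 threads at roughly 11 times, which is why added sequential I/O and memory management can offset much of the gain from removing a critical region.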
To investigate the potential performance gains, the memory allocation and file I/O were excluded from the analysis, even though removing them on their own leads to an incorrect solution. The resulting scaling graph is shown below, with the linear and 80%-of-linear reference lines scaled to account for CPU frequency reductions.
As a result of these changes, the Shearwater Reveal PoC showed a performance improvement of up to 44%. The yellow line shows the parallel scaling with the I/O and memory allocation/deallocation removed; its performance approaches linear scaling.
Wadud also recommended aggregating the I/O into larger read/write sizes and increasing the re-use of data once it has been read from disk.
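One way to picture the I/O aggregation recommendation is a reader that fetches many traces per call instead of one trace at a time. The sketch below is an assumption-laden illustration only: `read_traces_chunked`, its parameters and the trace layout are hypothetical and are not taken from Shearwater Reveal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Read up to n_traces traces of trace_len floats each into dst, issuing
 * one fread per chunk of traces_per_chunk traces rather than one per
 * trace. Returns the number of complete traces read. */
size_t read_traces_chunked(FILE *in, float *dst, size_t n_traces,
                           size_t trace_len, size_t traces_per_chunk)
{
    assert(in != NULL && dst != NULL);
    assert(trace_len > 0 && traces_per_chunk > 0);

    size_t done = 0;
    while (done < n_traces) {
        /* Size of this request: a full chunk, or whatever remains. */
        size_t want = n_traces - done;
        if (want > traces_per_chunk) {
            want = traces_per_chunk;
        }

        /* One large request; fread returns the count of whole traces. */
        size_t got = fread(dst + done * trace_len,
                           sizeof(float) * trace_len, want, in);
        done += got;

        if (got < want) {
            break; /* end of file or read error: stop early */
        }
    }
    return done;
}
```

A larger `traces_per_chunk` means fewer, larger requests; the best value depends on the file system and available memory and would need to be measured rather than assumed.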
Changes were made based on our analysis and recommendations. For a full-scale production run, the computational cost was nearly halved.
For more information about any of this work please contact us.