Performance Improvements on Suite of Computational Chemistry HPC Codes
NAG was chosen to perform the following work because of its involvement in the Performance Optimisation and Productivity (POP) Centre of Excellence.
The ADF Modeling Suite is a powerful computational chemistry package produced by SCM (Software for Chemistry and Materials), a company based in the Netherlands. It tackles chemistry, materials science and engineering challenges and comprises six applications.
We performed a broad investigation of the ADF Modeling Suite, which included assessing the performance of new functionality, locating and fixing bottlenecks, and testing multiple approaches on three of the most widely used applications: ADF, BAND and DFTB. To analyse the performance of these applications, we used the tools Extrae, Paraver and Dimemas from the Barcelona Supercomputing Center, along with Score-P and Scalasca.
ADF is a computational chemistry application which uses density functional theory calculations to predict the structure and reactivity of molecules. Sally Bridgwater, a Technical Consultant at NAG, assessed a calculation of particular interest to the developers, one that exercised new functionality: a medium-sized molecule with a hybrid exchange-correlation functional.
Initial investigations identified load imbalance as the main cause of inefficiency: some processes sat idle waiting for others, so the processors were not fully utilised. A secondary issue was that the time spent transferring data grew as the core count increased.
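One common way to put a number on load imbalance, used in the Performance Optimisation and Productivity methodology, is the ratio of the average to the maximum useful computation time across processes. The sketch below is only an illustration: the function name and the timings are hypothetical and are not measurements from ADF.

```python
def load_balance_efficiency(useful_times):
    """Average useful compute time divided by the maximum, in (0, 1].

    1.0 means every process did the same amount of useful work;
    lower values mean more time spent waiting for the slowest process.
    """
    if not useful_times:
        raise ValueError("need at least one process")
    average = sum(useful_times) / len(useful_times)
    return average / max(useful_times)


# Hypothetical useful compute times (seconds) for four processes.
times = [10.0, 4.0, 5.0, 5.0]
efficiency = load_balance_efficiency(times)  # 0.6 for these numbers
idle_fraction = 1.0 - efficiency             # share of the run lost to waiting
```

With these made-up numbers, the slowest process does 10 s of work while the average is 6 s, so roughly 40% of the available processor time is spent idle.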
ADF is parallelised with MPI and shared POSIX array buffers within a node. These shared buffers are not captured automatically by the profiling tools and so had to be manually instrumented using the Extrae API. The locking and unlocking of these shared arrays was found to exacerbate the load imbalance, leading to more idle time than initially thought.
The long waiting times were found to be caused by the load balancing algorithm not distributing work frequently enough, a problem made worse by the imbalanced input used in this specific case. Sally predicted that distributing the work more frequently and tuning the chunk size could cut the runtime by a factor of two. This gave SCM a clear idea of the potential improvement in performance.
SCM adapted their load balancing scheme to provide work more frequently by using a dedicated process to manage load balance, as well as initially dividing the work up into smaller chunks. SCM made these changes quickly and the estimated improvement of a factor of two was achieved.
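The effect of handing out work more often, in smaller chunks, can be illustrated with a toy simulation. This is a sketch under stated assumptions, not SCM's scheme: the function names, task costs and worker count are hypothetical, and the cost of fetching a chunk (communication and locking) is ignored, which is exactly the overhead that makes the chunk size worth tuning in practice.

```python
import heapq


def makespan_static(costs, n_workers):
    """Each worker gets one contiguous block of tasks, assigned once up front."""
    block = -(-len(costs) // n_workers)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def makespan_dynamic(costs, n_workers, chunk):
    """Whenever a worker becomes idle it fetches the next `chunk` tasks."""
    free_at = [0.0] * n_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for i in range(0, len(costs), chunk):
        start = heapq.heappop(free_at)  # earliest idle worker takes the chunk
        heapq.heappush(free_at, start + sum(costs[i:i + chunk]))
    return max(free_at)


# Hypothetical imbalanced input: four expensive tasks followed by 28 cheap ones.
costs = [8.0] * 4 + [1.0] * 28

static_time = makespan_static(costs, n_workers=4)            # 36.0
coarse_time = makespan_dynamic(costs, n_workers=4, chunk=4)  # 32.0
fine_time = makespan_dynamic(costs, n_workers=4, chunk=1)    # 15.0
```

In this toy case the total work is 60 time units, so 15 is the best possible finish time on four workers. Only the fine-grained distribution reaches it, because the expensive tasks end up spread across all workers instead of landing on one.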
Timelines for the original version (top) and the version after the improvements were made (bottom), showing the large reduction in idle time (red) and the halving of the total time taken.
“I can honestly say your analysis gave us a new insight into performance of one of the newer features available in ADF. What is more important, it clearly showed us the limitations of the current implementation and pointed us to the ways to improve it.” – Alexei Yakovlev, Software developer SCM.
The initial assessment of the BAND application by Jonathan Boyle, HPC Application Analyst at NAG, found profiling difficult because of the long runtimes and the large amount of data produced for a moderately sized system. Initial findings, however, showed that load imbalance was the main bottleneck and that this was exacerbated by poor computational scalability.
Two more detailed assessments investigated sub-components of BAND. The first of these looked at computation of the overlap matrix. The performance observed ranged from reasonable to good, depending on the system being computed. Jonathan found that the main issue was reduced computational scaling, with contributions from low instruction scalability and low IPC (instructions per cycle) scalability. The routines responsible for the largest increases in exclusive time were identified for further investigation by the code developers.
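In the Performance Optimisation and Productivity methodology, computational scaling is commonly split into three multiplicative factors: instruction scalability, IPC scalability and frequency scalability. The sketch below shows that decomposition. The class, the function and all counter values are hypothetical and are not BAND measurements; each counter is assumed to be summed over all processes and to cover useful computation only.

```python
from dataclasses import dataclass


@dataclass
class Totals:
    """Counters summed over all processes, for useful computation only."""
    instructions: float
    cycles: float
    seconds: float


def computational_scaling(ref, run):
    """Compare a run at scale (`run`) against a reference run (`ref`)."""
    # More instructions at scale (e.g. replicated work) lowers this factor.
    instr = ref.instructions / run.instructions
    # Fewer instructions completed per cycle at scale lowers this factor.
    ipc = (run.instructions / run.cycles) / (ref.instructions / ref.cycles)
    # A lower effective clock frequency at scale lowers this factor.
    freq = (run.cycles / run.seconds) / (ref.cycles / ref.seconds)
    return {
        "instructions": instr,
        "ipc": ipc,
        "frequency": freq,
        # The product equals ref.seconds / run.seconds.
        "computation": instr * ipc * freq,
    }


# Hypothetical counters for a small reference run and a larger run.
reference = Totals(instructions=1.0e12, cycles=1.0e12, seconds=400.0)
scaled = Totals(instructions=1.25e12, cycles=1.5625e12, seconds=625.0)
result = computational_scaling(reference, scaled)
```

With these made-up counters the instruction and IPC factors are each about 0.8 and the frequency factor is 1.0, giving a computational scaling of about 0.64. Splitting the figure this way indicates whether the extra time comes from doing more work or from doing the same work more slowly.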
The second assessment focused on the performance of the complex matrix-matrix multiplications, in which each complex matrix is held as two shared-memory real arrays that are replicated on each compute node. This was found to have significant room for improvement, so a Proof of Concept study followed on from this work. The main bottlenecks identified were a reduction in IPC at scale and an increasing amount of time spent in MPI data transfer. Most of the computational work takes place within a dgemm call, which therefore became a target for optimisation.
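To see why real matrix products such as dgemm carry most of the work when a complex matrix is stored as separate real and imaginary arrays, note that (Ar + i Ai)(Br + i Bi) = (Ar Br - Ai Bi) + i (Ar Bi + Ai Br). The sketch below only demonstrates that identity on small lists; it is not BAND's implementation, the function names and matrices are hypothetical, and the plain Python `matmul` merely stands in for a BLAS dgemm call.

```python
def matmul(a, b):
    """Plain real matrix product; stands in for a BLAS dgemm call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b, sign=1.0):
    """Element-wise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def complex_matmul_split(ar, ai, br, bi):
    """Complex product using four real products on split real/imaginary parts."""
    cr = add(matmul(ar, br), matmul(ai, bi), sign=-1.0)  # real part
    ci = add(matmul(ar, bi), matmul(ai, br))             # imaginary part
    return cr, ci


# Hypothetical 2x2 complex matrices, stored as separate real and imaginary parts.
ar = [[1.0, 3.0], [0.0, 2.0]]
ai = [[2.0, -1.0], [1.0, 0.0]]
br = [[2.0, 0.0], [1.0, 3.0]]
bi = [[-1.0, 1.0], [1.0, 0.0]]

cr, ci = complex_matmul_split(ar, ai, br, bi)
```

Written this way, one complex product costs four real products plus two element-wise additions, so the speed of the underlying real multiplication dominates the runtime.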
In the Proof of Concept work, Jonathan first tested several approaches to improving performance and then implemented the beneficial ones in the source code.
The optimisations implemented were:
- overlapping computation with communication
- improved use of BLAS, which doubled the speed of computation
- reorganising the algorithm to reduce the amount of data communicated via MPI.
The optimised subroutine showed a fourfold speedup over the original code on eight 36-core compute nodes.
Speedup plot showing the large improvement in scalability of the new subroutine.
A case study of the work carried out on this application by Nick Dingle, Technical Consultant at NAG, can be read here.