NAMD is a popular choice of HPC benchmark, and AMD, Intel and Nvidia have all made recent claims of high performance. Intel recently contributed patches that accelerate NAMD using AVX-512 and claim these optimizations improve performance by up to 1.8x, outperforming the latest AMD hardware. Intel achieved the 1.8x improvement by implementing a “tile” algorithm that makes highly efficient use of the AVX-512 vector units and large caches of the latest generation of Intel Xeon CPUs. This is a port of the same tiling algorithm used by the CUDA-enabled version of NAMD.
We were curious to try these patches to verify the performance for ourselves and to compare the gains against both AMD and Nvidia hardware. In particular, we wanted to see what this means for running NAMD in a cloud environment, where the question is often not only “how fast will it run?” but also “how much will it cost?”.
To explore this we ran benchmarks on Microsoft Azure using both AMD and Intel powered HPC-class VMs.
The instance types we used for benchmarking were:
The specs of these VMs reflect the different approaches taken by AMD and Intel: the AMD EPYC processors provide a large number of cores with lower clock speeds while the Intel Xeon CPUs provide fewer, faster cores with more advanced vectorization hardware. (The prices shown here are for the West US 2 region when this was published, but pricing will vary with region and over time.)
Benchmarking instances were provisioned with CentOS 8.2 images. The NAMD 2.15a1 package and its dependencies were installed using the Spack HPC package manager.
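For readers who want to reproduce a similar setup, the installation amounts to a few Spack commands. This is a sketch, not our exact provisioning script; the available versions and variants depend on the package recipes in your Spack checkout, and NAMD's license means Spack cannot always fetch the source tarball automatically.

```shell
# Clone Spack and activate its shell integration (paths are illustrative)
git clone https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh

# NAMD's license may require you to download the source tarball yourself
# and place it where Spack can find it (e.g. a local Spack mirror)
# before this will succeed.
spack install namd@2.15a1

# Make the installed package available in the current shell
spack load namd
```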
We chose the ApoA1 (92,224 atoms) and STMV (1,066,628 atoms) benchmarks as the most suitable for single-node benchmarking. Typically, NAMD benchmark results are reported in units of nanoseconds of simulation time per day of computation time (ns/day). However, since we are interested in cost-to-solution, it is more useful to calculate performance in terms of compute cost per nanosecond of simulated time ($/ns). Each benchmark was run 10 times in each configuration to capture run-to-run variation.
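The conversion between the two metrics is simple arithmetic: at a throughput of R ns/day, each simulated nanosecond takes 24/R hours of compute, so the cost is the hourly price times 24/R. A minimal sketch (the throughput and price figures below are illustrative, not our measured results):

```python
def cost_per_ns(price_per_hour: float, ns_per_day: float) -> float:
    """Convert a VM's hourly price and NAMD throughput (ns/day)
    into compute cost per simulated nanosecond ($/ns)."""
    hours_per_ns = 24.0 / ns_per_day
    return price_per_hour * hours_per_ns

# Illustrative numbers only -- not our benchmark results.
print(round(cost_per_ns(price_per_hour=3.06, ns_per_day=2.0), 2))  # prints 36.72
```

Note that this metric rewards a VM that is, say, half as fast but a third of the price, which is exactly the trade-off we want to capture.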
The results show a roughly 1.5x speedup over NAMD 2.13 with the latest Intel optimizations for both benchmarks, giving Intel a performance and cost advantage over AMD for the ApoA1 benchmark. However, things are not so clear-cut for STMV. At the larger problem size, AMD retain the performance advantage over Intel, although the lower hourly price of the HC44rs VM means that Intel remains the more cost-efficient choice. The stronger AMD performance may be due to the higher memory bandwidth available on EPYC-based systems, and would be worth investigating further.
NAMD simulation performance (Intel vs. AMD)
NAMD simulation cost-efficiency (Intel vs. AMD)
These days it is very popular to run NAMD with GPU acceleration. This gives excellent performance, but GPUs in the cloud can be pricey. We were curious to see how the new Intel optimizations stack up against the GPU version in both performance and cost-efficiency.
Our choice of GPU VM was the NC6s v3 type costing $3.06/hour. This is one of the cheapest GPU VMs offered by Azure, with a single NVIDIA V100 card, 112GB RAM and 6 Xeon E5-2690 v3 (Haswell) cores. It comes at a comparable price to the CPU-based VMs we used, but trades CPU and RAM resources for the GPU - this may hinder performance if there are still CPU bottlenecks in the CUDA version of NAMD.
NAMD simulation performance (Intel vs. AMD vs. NVIDIA)
NAMD simulation cost-efficiency (Intel vs. AMD vs. NVIDIA)
Even with the latest CPU optimizations, NAMD on GPU is still significantly faster than on CPU, and more cost-efficient. As with the AVX-512 optimizations, the performance advantage of the GPU version is significantly smaller for the larger STMV benchmark, suggesting that the larger computation is either CPU-bound or memory-bandwidth-bound on the GPU-equipped VM. Since the STMV benchmark is still relatively small compared to many NAMD simulations, it would be very useful to look at still larger simulations. This would let us understand the best hardware choices for performance or cost-to-solution at different scales.
Looking to the future, it is worth a quick look at some of the GPU performance improvements being developed for NAMD version 3. These are intended to port all the remaining computation to the GPU, and preliminary benchmarks show up to a 3x performance improvement over the current version.
NAMD simulation performance (Latest CUDA preview)
Despite Intel’s optimizations, NAMD GPU performance on these benchmarks is still significantly higher, and the upcoming NAMD 3 improvements should widen the gap by a further 2-3x. So, for NAMD simulations on the scale of these benchmarks, GPU-enabled compute offers significant performance and cost advantages over CPU-only instances. That said, there is a noticeable trend in favour of high-memory-bandwidth CPU-based computation as simulations grow larger, so if you want to run very large NAMD simulations, you might find better performance with CPU-only hardware. As always, the best advice is to run benchmarks to find the best solution for your needs.