One of the primary drivers for Cloud computing is access to architectures and systems which may not be readily available in-house.  One example of this is AWS’s somewhat recent introduction of their own custom-designed Graviton 2 processor. This processor is based on the ARM architecture, rather than the x86-based architectures from Intel and AMD. We have had a number of clients enquire about how viable ARM is for their HPC needs. While there are a handful of published benchmarks available, I decided to take an afternoon and try it for myself.

For this small exercise, I decided to benchmark the weather code WRF v3.9.1.1. There are two "traditional" benchmarks for WRFv3, representing two different resolutions (12km and 2.5km). Both benchmarks run for 3 simulated hours. The smaller benchmark (12km resolution) typically scales well to a few hundred cores, and the larger benchmark (2.5km resolution) will scale to a few thousand cores. However, for this project, I ran the benchmarks on only a single node, and as this exercise was only to satisfy my own curiosity, I did not re-run the benchmarks multiple times, as we would normally do to capture statistical variation.

For the benchmarking hardware, I wanted to compare offerings from Intel, AMD and ARM. AWS is currently the only major cloud provider to offer ARM-based instances with their AWS Graviton 2 processor. This led me to the C5, C5a and C6g instance types on AWS, of which I selected the largest instance size available, in order to get a “full node.” The benchmarked systems all used Amazon Linux 2 as the OS, and I used Spack to install GCC 9.3.0 and to build WRF’s dependencies[1], which made building for ARM no more difficult than for Intel or AMD. When building WRF itself, the only modifications to the default compilation configuration were to add the ‘aarch64’ architecture to the GNU/Linux configuration section and to add tuning parameters to optimize for the target platform (-march=native -mtune=native).
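One simple way to apply those tuning flags is to patch the generated configure.wrf after running ./configure. Below is a minimal sketch, assuming the GNU build carries its optimization flags in the FCOPTIM and CFLAGS_LOCAL lines (check your own configure.wrf and adjust the variable names accordingly):

```python
# Sketch: append native-tuning flags to a generated configure.wrf.
# Assumes the GNU stanza keeps its flags in FCOPTIM / CFLAGS_LOCAL;
# verify against your own configure.wrf before relying on this.
from pathlib import Path

EXTRA_FLAGS = "-march=native -mtune=native"

def add_native_flags(path="configure.wrf"):
    lines = Path(path).read_text().splitlines()
    patched = []
    for line in lines:
        if line.lstrip().startswith(("FCOPTIM", "CFLAGS_LOCAL")) and EXTRA_FLAGS not in line:
            line = f"{line} {EXTRA_FLAGS}"
        patched.append(line)
    Path(path).write_text("\n".join(patched) + "\n")

if __name__ == "__main__":
    add_native_flags()
```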

 

Instance Name   Processor ID                  # Sockets   # vCPU   # Physical Cores   RAM     Price/Hr
c5.24xlarge     Intel Xeon Platinum 8275CL    2           96       48                 192GB   $4.08
c5a.24xlarge    AMD EPYC 7R32                 1           96       48                 192GB   $3.696
c6g.16xlarge    AWS Graviton 2                1           64       64                 128GB   $2.176

 
Already, it is clear that this will not quite be an apples-to-apples comparison. The C5 (Intel) instance is a dual-socket configuration, whereas both the C5a (AMD) and C6g (ARM) instances are single-socket. Both the C5 and C5a have SMT (aka HyperThreading) enabled, whereas the C6g does not. If we instead compare the performance of each instance offering as a whole, rather than the underlying CPU configurations, the comparison becomes much more straightforward.

Results

SMT / Hyperthreading

Many HPC applications demonstrate a performance degradation when SMT is in use, so many HPC centers disable it. I wanted to double-check whether that is the case for this benchmark. A quick look at Figure 1 shows that, in this case, there is a performance advantage to using SMT for WRF on both the Intel and AMD systems (i.e., using all 96 threads), but the difference is minor. We can also see that the dual-socket Intel system significantly out-performs the single-socket AMD system on the larger benchmark, most likely due to the higher overall system memory bandwidth[2]. For the remainder of this post, “full instance” will refer to using all of the vCPUs available to the instance.

Figure 1: Comparing using SMT vs not – shorter is better

Comparing Performance for ARM vs Intel vs AMD for WRF

Figure 2 shows the total compute time (not including startup time or writing the results to disk) for WRF running both benchmarks across the three architectures. It is plain to see that AWS's Graviton 2 chip performs quite competitively. While it is the slowest of the three for the smaller benchmark (12km resolution), it out-performs AMD's offering during the larger scale benchmark (2.5km resolution). The Intel-based system shows a non-trivial performance advantage over both ARM and AMD.

Figure 2: Compute time using full instance (all vCPU) – shorter is better
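The compute times reported here exclude initialization and output. One common way to extract such a number is to sum the per-timestep "Timing for main" lines that WRF writes to its rsl logs; a short sketch of that bookkeeping is below (it assumes the standard rsl.error.0000 format, so adjust the path and pattern for your own run):

```python
# Sketch: total WRF compute time from an rsl log, summing the per-timestep
# "Timing for main" lines (this excludes initialization and history writes).
import re
import sys

TIMING_RE = re.compile(r"Timing for main.*:\s+([\d.]+)\s+elapsed seconds")

def total_compute_seconds(rsl_path="rsl.error.0000"):
    total = 0.0
    with open(rsl_path) as log:
        for line in log:
            match = TIMING_RE.search(line)
            if match:
                total += float(match.group(1))
    return total

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "rsl.error.0000"
    print(f"Total compute time: {total_compute_seconds(path):.2f} s")
```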

I expect that the higher memory speeds available on the Graviton 2 processor are the main reason it out-performs the AMD system on the larger-scale benchmark. If AWS introduces a dual-socket AMD Rome instance type, this should, of course, be revisited. The Intel processor’s higher clock speed, combined with the increased memory bandwidth from having two sockets, gives it a sizable performance advantage here.

Comparing Costs of ARM vs Intel vs AMD for WRF

With this study taking place in "the Cloud," it is imperative to also consider costs when benchmarking.  AWS has priced their Graviton 2 offerings extremely competitively. Recall that the instances being benchmarked cost (as of today, in the US-EAST-2 region, with On-Demand pricing) $4.08/hr for the Intel system, $3.70/hr for the AMD system, and only $2.18/hr for the Graviton 2 system.

We multiply the hourly price by our runtime to get our cost-to-solution numbers. As can be seen in Figure 3, while the Intel-based instance has much higher performance than the ARM-based instance, once prices are factored in, the Graviton 2 gives us a lower cost-to-solution, despite taking longer to reach the solution.

Figure 3: Cost comparisons - shorter is better
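To make the arithmetic concrete, here is the cost-to-solution calculation for the 2.5km runs, using the On-Demand prices quoted above and the compute times from the table below; it is simply the runtime converted to hours, multiplied by the hourly price:

```python
# Cost-to-solution = hourly instance price x runtime in hours.
# Prices and runtimes are the CONUS 2.5km numbers quoted in this post.
RUNS = {
    "c5 (Xeon 8275CL)":  {"price_per_hr": 4.08,  "runtime_s": 3395.74},
    "c5a (EPYC 7R32)":   {"price_per_hr": 3.696, "runtime_s": 4799.63},
    "c6g (Graviton 2)":  {"price_per_hr": 2.176, "runtime_s": 4384.16},
}

for name, run in RUNS.items():
    cost = run["price_per_hr"] * run["runtime_s"] / 3600.0
    print(f"{name}: ${cost:.2f} to solution")  # ~$3.85, ~$4.93, ~$2.65
```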

There is a distinct tradeoff between performance and cost for WRF on these platforms. For the 2.5km benchmark, it would be worth following up with a scaling study, to see whether an ARM-based solution scaled across more nodes can match the performance of an Intel-based solution at a lower overall cost.

Figure 4: Cost and Performance comparisons - shorter is better

System Name          Benchmark     Compute Time (s)   Benchmark Cost ($)
c6g (Graviton 2)     CONUS 12km    75.33              $0.046
c5a (EPYC 7R32)      CONUS 12km    68.26              $0.070
c5 (Xeon 8275CL)     CONUS 12km    59.00              $0.067
c6g (Graviton 2)     CONUS 2.5km   4384.16            $2.65
c5a (EPYC 7R32)      CONUS 2.5km   4799.63            $4.93
c5 (Xeon 8275CL)     CONUS 2.5km   3395.74            $3.85

As the AMD EPYC and ARM HPC ecosystems mature, we can hope to see increased performance from compilers which are more targeted at these architectures (e.g., AOCC from AMD and ARM’s Allinea Studio), as well as other LLVM-based compilers. In the past, we have seen that the Intel compiler does a better job than gfortran at optimizing WRF for Intel processors. It would be interesting to revisit this benchmark study with additional compilers.

Summary

In response to questions about the suitability of ARM processors for HPC today, I ran one popular HPC benchmark, WRF v3, on three different “compute-optimized” AWS instance types. We found that AWS’s custom ARM-based offering, while not the fastest processor available for this benchmark, provides a very cost-efficient solution for WRF, with performance that is competitive with other, more traditional HPC processors.

If you’re interested in more in-depth benchmarking and performance analysis of various HPC hardware solutions, please get in contact.


[1] I first attempted to use GCC 10, but gfortran 10 and WRF do not appear to get along well.  WRF would crash at runtime due to routines in libgfortran.

[2] If AWS introduces a dual-socket AMD Rome instance size, such as are available on other cloud providers, the performance profile should change significantly, and this will be worth revisiting.
