Cost of Solution for Cloud HPC

I recently gave a talk to the Society of HPC Professionals about a hypothetical Cloud HPC migration, and how to go about choosing the “best” cloud resources for that particular HPC workload. One aspect of the talk, which has garnered some good discussions, was about the concept of “Cost of Solution” as a driving metric. In this series of blog posts, my colleagues and I will explore various aspects of this concept, and how to best use this metric to guide choices for Cloud HPC.

We will define “Cost of Solution” to be the total of the cloud costs incurred by running the workload. Combine the hourly prices for the compute and storage with the amount of time those resources are provisioned for the workload, mix in any other associated costs, such as data transfers or use of other cloud services, and you have the total cost needed to run the HPC workload. For today’s discussion, we will not include costs such as long-term storage of input or results. While they are important costs to consider while calculating TCO, these costs tend to be independent of how the HPC workload is run.
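The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a pricing tool: the `Resource` class, the `cost_of_solution` function, and every number in the example are hypothetical placeholders rather than real cloud prices.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """One billed cloud resource (hypothetical structure for illustration)."""
    name: str
    hourly_price_usd: float   # price per provisioned hour
    hours_provisioned: float  # how long the resource is held for the workload
    count: int = 1            # number of identical resources


def cost_of_solution(resources, other_costs_usd=0.0):
    """Sum of price x time for every resource, plus any other associated costs."""
    metered = sum(r.hourly_price_usd * r.hours_provisioned * r.count
                  for r in resources)
    return metered + other_costs_usd


# Placeholder numbers only; these are not real prices.
example = [
    Resource("compute", hourly_price_usd=2.00, hours_provisioned=10, count=4),
    Resource("scratch storage", hourly_price_usd=0.50, hours_provisioned=12),
]
total = cost_of_solution(example, other_costs_usd=5.00)  # e.g. data transfer
print(f"Cost of Solution: ${total:.2f}")
```

Note that the storage is provisioned for longer than the compute in this example: what matters is how long each resource is held, not only how long the application runs.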

Compare this to some other metrics typically used with HPC:

• “Time to Solution” – A measure for how long it takes a workload to complete.  This is the typical driver of HPC optimizations, and factors critically into the calculation of “Cost of Solution.” The longer an application takes to run, the more costly the workload will be.
• “Total Cost of Ownership” (TCO) – This metric reflects the costs to “own” and operate an HPC resource over some period of time, whether a year or the lifetime of the equipment. It includes not just the purchase price of the hardware, but also operations, maintenance, power, cooling, etc. For a traditional on-prem environment, one way to calculate the “Cost of Solution” of a workload is to convert the TCO into an hourly cost, and pro-rate that hourly cost by the percentage of the resource being used by the workload.
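
The on-prem pro-rating described in the TCO bullet can be sketched as follows. The function name, the example figures, and the assumption that the system is available for all 8,760 hours of the year are illustrative only; a real calculation would account for downtime and actual utilization.

```python
def on_prem_cost_of_solution(annual_tco_usd, fraction_of_resource,
                             runtime_hours, hours_per_year=8760.0):
    """Pro-rate an annual TCO down to the share used by one workload.

    Assumes (for illustration) that the system is available every hour
    of the year, so the hourly cost is simply TCO / hours_per_year.
    """
    hourly_cost = annual_tco_usd / hours_per_year
    return hourly_cost * fraction_of_resource * runtime_hours


# Placeholder figures: a system with an $876,000 annual TCO, a job that
# uses a quarter of it for 40 hours.
cost = on_prem_cost_of_solution(876_000, 0.25, 40)
print(f"On-prem Cost of Solution: ${cost:,.2f}")
```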

What makes “Cost of Solution” worth talking about?

Looking at the cost to run a workload, or solve some problem, seems like an obvious thing to do. However, it doesn’t always come naturally to the world of HPC. Often, the users of an HPC resource are not exposed to the costs incurred by the HPC provider to procure, operate and maintain the hardware. In some instances, users or projects may be charged some price per CPU-hour, but inside many organizations, the users do not directly see those costs at all and have little incentive to think about price. This often changes when moving to the cloud, as it becomes more natural to associate the hourly costs of resources directly with the users of those resources. Cloud HPC users are regularly exposed to the per-hour cost of their workloads and may be pressured to select resources that have a low hourly price.

In addition, most HPC centers do not have as wide a variety of HPC resources as cloud providers can offer. There is often only a single storage solution and maybe only one or two processor generations available to users. Accelerators such as GPUs may not even be available. Even if the hourly rates for using the HPC center are made available to users, there are so few choices that picking the best system configuration for the workload is almost trivial.

Finally, cloud resources are billed on a per-hour basis (or per-second basis, as the case may be), and the prices are listed for all to see. There is a temptation to gravitate toward the cheapest $/hr options as a cost-saving measure. As we will shortly see, this can lead you down a path that looks less expensive at first but ends up costing more.

Optimizing for Cost of Solution

In the talk I gave for the Society of HPC Professionals, we discussed an Oil & Gas workload: an internally developed Reverse-Time-Migration (RTM) benchmark, a common algorithm for seismic processing. This workload consists of many independent problems (tens of thousands), with each individual problem requiring a non-trivial amount of computation as well as temporary storage (about 50TB). We looked at a variety of ways in which this calculation could be done and compared them to a straight “Lift & Shift” implementation, where we attempt to replicate our “on-prem” system in Amazon’s cloud. In a later blog post, we will go into more detail about how these calculations were made, and how we came up with the resulting prices, but for now, let’s explore some of the solutions we came up with, and how they affect the cost to run our hypothetical workload.

## Approach 1 – Replicating the original on-premises solution

Our “Lift & Shift” architecture uses Amazon’s FSx for Lustre managed service to provide an 8PB Lustre filesystem for high performance storage. We replicated our on-prem nodes by provisioning 480 c5n.18xlarge instances. These are AWS’s workhorse HPC instances, combining a dual-socket Skylake Xeon with AWS’s HPC network, known as EFA. Each of our individual workload problems, or “shots”, is run as an MPI job across 4 instances, allowing us to reconstruct 120 shots at a time.

Running our 10,000 individual shots will take this cluster about 47 days (each shot takes approximately 13.5 hours to reconstruct, and we can reconstruct 120 shots simultaneously), and cost just shy of $4M USD (using On-Demand pricing).
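
The 47-day figure follows directly from the numbers above. A short sketch of the arithmetic, assuming an idealized schedule in which every slot is kept busy and there is no startup or queueing overhead (the function name is illustrative):

```python
def campaign_days(total_shots, concurrent_shots, hours_per_shot):
    """Idealized wall-clock days to process every shot.

    Assumes all concurrent slots stay busy for the whole campaign and
    ignores startup, teardown, and scheduling gaps.
    """
    total_hours = total_shots / concurrent_shots * hours_per_shot
    return total_hours / 24


INSTANCES = 480
INSTANCES_PER_SHOT = 4
concurrent_shots = INSTANCES // INSTANCES_PER_SHOT  # shots in flight at once

lift_and_shift_days = campaign_days(10_000, concurrent_shots, 13.5)
print(f"{concurrent_shots} concurrent shots, "
      f"about {lift_and_shift_days:.0f} days")
```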

Figure 1: Lift and Shift Architecture

One of the driving factors for the price of this application is the large shared filesystem. It is primarily being used in this application as temporary scratch space. Each shot requires approximately 50TB of scratch space. As each “shot” is a computationally independent problem, there is no requirement that the scratch space come from a shared filesystem. It was done that way on our on-prem system only because the shared filesystem was the one place available to store that much data.

## Approach 2 – Switching to fat VMs with large local storage. One instance per shot.

AWS has a storage-oriented VM instance type known as ‘i3en’. The ‘i3en.24xlarge’ system has not only 96 vCPU and 768GB of RAM, but also eight 7.5TB NVMe drives directly attached to the instance. This 60TB of fast, local storage is more than enough for our 50TB of scratch data. We can now conceivably reconstruct a single shot using just the resources available on a single instance. We can launch as many of these instances as are available at the time, and work through the 10,000 shots we need to process. This avoids the large shared filesystem, as well as any need to deal with an HPC-focused network, and the decoupling gives us much more flexibility to schedule the work as instances become available. With On-Demand pricing, our workload will cost us only $1.64M USD, a significant savings over our “Lift & Shift” example! Each shot will take approximately 15 hours to reconstruct. If we stick with reconstructing 120 shots simultaneously, it will take 52 days to complete. We can speed this up by doing more simultaneous shots (up to the availability of instances), and the price will remain constant.
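
The reason the price stays constant is that the total number of instance-hours is fixed by the number of shots and the time per shot; concurrency only changes the wall-clock time. The sketch below uses only the figures quoted in this post. The per-instance-hour rate is back-computed from the rounded $1.64M total, so treat it as an implied value rather than a published price, and note that startup time and billing granularity are ignored.

```python
TOTAL_SHOTS = 10_000
HOURS_PER_SHOT = 15          # one i3en.24xlarge per shot
TOTAL_COST_USD = 1_640_000   # On-Demand total quoted in this post

# Total work is fixed, no matter how many shots run at once.
instance_hours = TOTAL_SHOTS * HOURS_PER_SHOT

# Implied rate, derived from the rounded total above (not a list price).
implied_rate_usd = TOTAL_COST_USD / instance_hours


def wall_clock_days(concurrent_shots):
    """Idealized campaign length for a given number of simultaneous shots."""
    return instance_hours / concurrent_shots / 24


for concurrent in (120, 240, 480):
    cost = instance_hours * implied_rate_usd  # unchanged by concurrency
    print(f"{concurrent:4d} shots at once: "
          f"{wall_clock_days(concurrent):5.1f} days, ${cost:,.0f}")
```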

Figure 2: 1 instance per shot, using i3en.24xlarge

Going with the single instance per shot does have some downsides. Each instance only has 96 vCPU, whereas we were previously using 288 vCPU (4 instances x 72 vCPU / instance). While our costs have dropped significantly, we’re spending a lot more time doing the calculations. As compute has replaced I/O as our bottleneck to increased performance, perhaps we can add some resources to increase the amount of compute we have available.

## Approach 3 – Adding more resources to bring costs down

AWS recently introduced their Graviton2 processors, a 64-core ARM-based CPU. These processors perform HPC tasks well and come at a price discount compared to x86 processors. We can try increasing the amount of compute available by combining a handful of ARM processor instances with our “i3en” instance from the previous example, which now acts as a “per-shot” NFS filesystem.

Figure 3: Combine per-shot NFS with Arm processors

In Figure 3, we can see that each shot is now using a set of 4 ARM-based C6g instances with an i3en instance to provide the scratch-space storage for the computation. As we’ve added more compute capability, addressing the bottleneck of our workload, the runtime drops dramatically (per-shot reconstruction time drops to 6 hours, and at 120 simultaneous shots, the entire job can be finished in 21 days), and our overall Cost of Solution drops by almost half a million dollars! We’ve added cloud resources and reduced our total cost!
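
One way to see why spending more per hour can still save money is to ask how expensive the per-shot bundle of instances could become before Approach 3 stops being cheaper than Approach 2. The sketch below uses only the totals and runtimes quoted in this post; it does not use actual instance prices, and the variable names are illustrative.

```python
TOTAL_SHOTS = 10_000
APPROACH2_TOTAL_USD = 1_640_000  # total quoted for Approach 2
APPROACH2_HOURS_PER_SHOT = 15
APPROACH3_HOURS_PER_SHOT = 6

# Per-shot cost and implied hourly spend per shot for Approach 2.
a2_cost_per_shot = APPROACH2_TOTAL_USD / TOTAL_SHOTS
a2_hourly_per_shot = a2_cost_per_shot / APPROACH2_HOURS_PER_SHOT

# Highest hourly spend per shot at which Approach 3 merely breaks even.
a3_break_even_hourly = a2_cost_per_shot / APPROACH3_HOURS_PER_SHOT
headroom = a3_break_even_hourly / a2_hourly_per_shot

print(f"Approach 2: ${a2_cost_per_shot:.2f} per shot, "
      f"${a2_hourly_per_shot:.2f}/hr per shot")
print(f"Approach 3 breaks even at ${a3_break_even_hourly:.2f}/hr per shot "
      f"({headroom:.1f}x the hourly spend)")
```

Because the runtime falls from 15 hours to 6, the per-shot resources can cost up to 2.5 times as much per hour before the total goes up. Any bundle priced below that threshold lowers the Cost of Solution.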

Wrap-up

In this series of blog posts, we’re going to talk about a number of different factors that go into understanding, calculating, and optimizing for the “Cost of Solution” of HPC workloads in the cloud. Today we introduced the metric “Cost of Solution” and gave a short example showing how optimizing for this metric can lead to some non-intuitive situations, such as how adding resources (spending more per hour) can sometimes save money.

To learn more about the Cost of Solution metric, and how NAG can help your organization make the most efficient use of Cloud HPC, check out our Cloud HPC Migration Service or schedule a talk with a Cloud expert.
