As we have discussed in previous blogs, choosing a suitable storage solution and optimizing it appropriately is crucial for getting the best value from your cloud HPC spend. So far, we have looked at two different ML workloads, Mask R-CNN and CosmoFlow, but now we will take a more general look at the available options and the optimization process, specifically for read-intensive workloads. We will start by looking at the generally available Azure Storage options and how their performance stacks up for HPC workloads. Armed with this knowledge you will then be able to choose optimized solutions for both the cost and performance of your own cloud HPC workloads.
Before we consider the relative I/O performance of the various options we must first ensure that they provide sufficient performance for the workload and do not cause bottlenecks for the compute hardware. This is particularly critical for the ND series of virtual machines with NVIDIA GPUs. To ensure cost-effectiveness, the powerful GPUs must be fully supplied with data at all times. We demonstrated this in our CosmoFlow Tutorial where the I/O performance of different storage solutions caused up to a factor of 10x variation in the total cost to train the model. Choosing an optimal storage solution needs an understanding of both the requirements of the workload and the capabilities of the various storage services.
Several different metrics can be used to quantify I/O performance, each being a potential limiting factor for different classes of workload.
Total Read Bandwidth
Total read bandwidth is the key performance driver for data-intensive workloads which consume large blocks of data. Accessing this type of data typically requires loading contiguous chunks of data from a small number of relatively large files. This incurs very few metadata operations and allows efficient streaming of data from the underlying storage medium. As a result, the overall performance will normally be limited by the bulk data transfer capacity of either the network layer or the underlying storage devices, rather than by the performance of metadata operations or transaction rate limits.
For workloads that must access many small files or perform small random reads into large files, the limiting factor for performance will usually be the rate of I/O operations per second (IOPs). This can also be the case for workloads that perform large numbers of metadata operations. Storage services typically limit the maximum IOP rate at different scopes (file, client, storage instance), which caps the total data throughput achievable with small transactions. On the pricing side, note that some tiers charge for IOPs/transactions in addition to the at-rest data size.
An important note at this point is that not all IOPs are the same. Because of the way different technology layers handle data, what appears to be a single IOP to a client application may be viewed as multiple IOPs by the storage service. This is a result of how the data requests are translated by the filesystem driver or network protocol into hardware level operations.
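To make the relationship between an IOP limit and achievable throughput concrete, here is a minimal sketch of the arithmetic. The limit of 10,000 IOPs and the request sizes are hypothetical illustration values, not figures for any Azure service, and `service_ops_per_request` is an assumed parameter that models one client request being counted as several operations by the storage service.

```python
def throughput_ceiling_mib_s(iops_limit: float,
                             request_size_kib: float,
                             service_ops_per_request: int = 1) -> float:
    """Upper bound on throughput (MiB/s) when the IOP limit is the bottleneck.

    service_ops_per_request models the case where one client-side request
    is seen as several operations by the storage service (an assumption
    for illustration; the real factor depends on driver and protocol).
    """
    client_requests_per_s = iops_limit / service_ops_per_request
    return client_requests_per_s * request_size_kib / 1024


if __name__ == "__main__":
    # Hypothetical limit of 10,000 IOPs, small versus large requests.
    for size_kib in (4, 64, 1024):
        ceiling = throughput_ceiling_mib_s(10_000, size_kib)
        print(f"{size_kib:>5} KiB requests: at most {ceiling:.1f} MiB/s")
```

With small requests the IOP limit caps throughput far below what the network could carry, which is why small-file workloads are usually IOP-bound rather than bandwidth-bound.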
The final figure of merit to consider is the latency to complete an I/O operation. In contrast to local HDD or SSD storage where latencies can be measured in microseconds, for Azure network attached storage services we can expect latencies from about 10 milliseconds for small files on high-performance services to seconds for larger files or slow/cold storage tiers. The effect of this will vary depending on the workload and how sensitive it is to I/O latency. Machine learning workloads in particular can often be improved by using data-prefetching to ensure that data is loaded into CPU or GPU RAM ahead of time so that it is immediately ready when required by the application.
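The prefetching idea can be sketched with a background thread that loads files into a bounded queue while the application consumes earlier items. This is a minimal illustration using only the Python standard library, not the mechanism of any particular ML framework; the names `prefetch`, `read_file` and `depth` are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator


def read_file(path: str) -> bytes:
    """Example loader: read a whole file into memory."""
    with open(path, "rb") as f:
        return f.read()


def prefetch(paths: Iterable[str],
             load: Callable[[str], bytes] = read_file,
             depth: int = 4) -> Iterator[bytes]:
    """Yield load(path) for each path, loading up to `depth` items ahead."""
    buffer: queue.Queue = queue.Queue(maxsize=depth)
    sentinel = object()

    def worker() -> None:
        # Runs in the background; put() blocks once `depth` items are waiting.
        try:
            for path in paths:
                buffer.put((None, load(path)))
        except Exception as exc:
            buffer.put((exc, None))  # forward loader errors to the consumer
        else:
            buffer.put((None, sentinel))

    threading.Thread(target=worker, daemon=True).start()
    while True:
        error, item = buffer.get()
        if error is not None:
            raise error
        if item is sentinel:
            return
        yield item
```

A training loop would iterate `for data in prefetch(file_list): ...`, so that the storage latency of the next files overlaps with compute on the current one. The sketch uses a single loader thread and assumes the consumer reads to the end; a production loader would typically use several workers and handle early termination.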
Broadly speaking there are two types of Azure Storage that we can use for HPC applications, Azure Files and Azure Blob Storage. These come with different performance tiers, and, in the case of Azure Blob, also different access methods. The Azure platform also provides the Azure NetApp Files (ANF) service, however at the time of writing this blog it is not a generally available service, requiring a separate application process and waitlist for access. Because of this, we have not included it in this study.
The table below gives a summary of the maximum performance capability of the services we have considered:
The Azure Files service is designed to provide serverless file shares accessible via either the SMB or NFS (in preview) protocols. The service is available at two performance tiers - Premium and Standard - with the Premium tier charging based on provisioned storage size and the Standard tier charging based on the actual stored data size plus the number of transactions (read, write, delete etc.) performed on the file share. The Premium tier offers improved per-file bandwidth, IOPs and latency performance compared to the Standard tier, at a correspondingly higher cost.
Performance limits for Premium File Shares scale with the total provisioned storage, with maximum performance reached for 100 TiB of provisioned storage. The Standard tier is further divided into Hot and Cool price and performance tiers, with an additional “Transaction Optimized” pricing model for the Hot tier. Pricing tiers are designed such that at-rest storage costs decrease moving from Transaction Optimized to Hot and then Cool tiers, while per-transaction costs increase. Therefore, the Transaction Optimized tier is most cost-efficient for small, frequently accessed data stores, while the Cool tier is most cost-efficient for rarely accessed larger datasets. Note also that the Cool tier can incur significantly higher latencies to read rarely accessed files due to the nature of the storage service.
The Azure Blob service is designed to be a high-performance object store accessed via a REST API over HTTP(S). To use Azure Blob for HPC I/O, two methods can be used to mount it as a traditional filesystem on HPC compute nodes: the BlobFuse userspace driver and the newly released Azure Blob NFS service. While neither of these presents a complete POSIX-compliant filesystem, they can still be used to provide high-performance storage to HPC applications that do not require advanced filesystem features such as file locking.
It is important to also understand that BlobFuse, due to its implementation, does not provide consistency guarantees for files that are interacted with by multiple clients, and only promises eventual write consistency. These loose guarantees on consistency allow the BlobFuse driver to cache aggressively, both for repeated reads of the same file and for writing to files. This means that BlobFuse can only safely be used for workloads where there is no modification of shared files.
We collected best-effort performance benchmarks using the ElBencho storage benchmarking tool in multi-file mode with clusters of 4 nodes mounting the same storage service as a shared filesystem. The benchmark first writes multiple files in multiple directories onto the shared storage, then reads that data back from storage on different nodes to avoid node-local caching.
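The write-then-read pattern described above can be approximated on a single node with a few lines of Python. This is only a rough sketch of the idea, not a reproduction of ElBencho or of our multi-node setup: the function names are hypothetical, and because it reads the files back on the same node that wrote them, the results will be inflated by the local page cache unless the mount is configured to bypass it.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_files(directory, count, size_bytes):
    """Write `count` files of `size_bytes` random bytes and return their paths."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"file_{i:04d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size_bytes))
        paths.append(path)
    return paths


def timed_read(path):
    """Read one file fully; return (bytes read, seconds from open to data in RAM)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        nbytes = len(f.read())
    return nbytes, time.perf_counter() - start


def read_benchmark(paths, threads):
    """Read all files with a thread pool and summarize bandwidth and latency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(timed_read, paths))
    elapsed = time.perf_counter() - start
    total_bytes = sum(n for n, _ in results)
    return {
        "files": len(results),
        "total_bytes": total_bytes,
        "bandwidth_mib_s": total_bytes / elapsed / 2**20,
        "mean_latency_s": sum(t for _, t in results) / len(results),
    }


if __name__ == "__main__":
    # Point `dir=` at the mounted storage service to test it instead of local disk.
    with tempfile.TemporaryDirectory() as target:
        files = write_files(target, count=32, size_bytes=1 << 20)
        for threads in (1, 4, 16):
            print(threads, read_benchmark(files, threads))
```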
First, let’s look at the best-effort read bandwidth achieved by the different services for a variety of file sizes. The benchmark values given here are for the storage system – that is to say, the sum over the throughput of all the nodes in the cluster.
Best Effort Read Bandwidth
This shows the diverse range of performance that the various services can provide in different regimes, with the broad theme that Azure Files performs best for small file access, while Azure Blob services excel for larger files. However, even for small file workloads, the total throughput capability of the Premium Blob tiers exceeds that of Premium Files. Therefore, unless it lacks a required feature, we recommend Blob-based NFS over Azure Files for its improved performance and lower cost. We also find that the BlobFuse mounting method outperforms the NFS mounting method, with the performance difference growing as file sizes increase. This is likely due to the difference in network protocol behaviour, where SMB and NFS make many small (a few kB) RPC calls to download a file as many chunks, while BlobFuse makes a single REST transaction to download the entire file.
Now let’s look at the latency behaviour. ElBencho measures the latency to complete the request and read the total file into RAM. This represents the delay between an application making an initial open() call and the file data being available for use. This latency will typically also be significantly longer than the time taken to complete a single request to the storage service – especially for large files. It is also important to note that, unlike the IOPs measurement, latency is not directly correlated with throughput. The level of thread parallelism is also a factor because the total available read throughput is spread across the files being read and over-parallelization can increase the time a single file read takes to complete.
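The effect of over-parallelization on per-file latency can be illustrated with a deliberately simple model: assume the total read bandwidth is shared equally between all concurrent reads, so each file sees only a fraction of it. The equal-sharing assumption and all of the numbers below are hypothetical; real services will deviate from this.

```python
def per_file_read_time_s(file_size_mib: float,
                         total_bandwidth_mib_s: float,
                         concurrent_reads: int,
                         request_latency_s: float = 0.0) -> float:
    """Estimated time to load one file when bandwidth is shared equally.

    Simplified model: each of the concurrent reads gets an equal share of
    the total bandwidth, plus a fixed per-request latency.
    """
    per_stream_bandwidth = total_bandwidth_mib_s / concurrent_reads
    return request_latency_s + file_size_mib / per_stream_bandwidth


if __name__ == "__main__":
    # Hypothetical 1000 MiB/s service, 100 MiB files, 10 ms request latency.
    for n in (1, 4, 16, 64):
        t = per_file_read_time_s(100, 1000, n, request_latency_s=0.010)
        print(f"{n:>3} concurrent reads: {t:.2f} s per file")
```

In this model the aggregate throughput stays constant once the service is saturated, while the time to complete any single file grows linearly with the number of concurrent reads.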
Best Effort Latency (average)
As would be expected given the underlying hardware, the Premium tiers of both Azure Files and Azure Blob offer better performance than the Standard tiers. However, there is also a significant performance penalty for using SMB and NFS filesystems when compared to BlobFuse, which is to be expected given the differences in how BlobFuse is implemented as compared to a traditional network filesystem. BlobFuse uses fewer, larger transactions and performs less strict checking on data consistency due to aggressive caching optimizations. This means that less time is spent on metadata operations and communication overhead – reducing the total time required to load files from the storage service.
Now that we have seen the best-effort performance of these services, let’s look at their potential cost and the choices that we can make to optimize the price/performance tradeoff for any particular application.
Choosing the best service and price is complicated by the fact that the different services have very different pricing models, and in the case of Azure Premium Files, performance scales with the allocated storage size. To get maximum performance from an Azure Premium Files share it must be scaled up to a full 100TB provisioned size. All the other services provide maximum performance at all times and charge based on the actual amount of data stored ("at rest"). The Hot tier services additionally charge for operations performed, at around $0.01 per 10,000 read operations and $0.11 per 10,000 write operations for the service tiers we have benchmarked here.
The graph below shows the cost for each service to store 100TB of data (the threshold for maximum Azure Files performance) and perform read transactions at the maximum supported rate (the real transaction rate is likely to be lower but using the maximum rate provides the “worst-case” cost).
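The worst-case cost calculation can be sketched as an at-rest charge plus a transaction charge at a sustained read rate. The read price of $0.01 per 10,000 operations is the approximate figure quoted above; the at-rest price per TB, the read rate, and the 730 hours-per-month conversion are hypothetical placeholders, so substitute the current values for your region and tier.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used to convert monthly prices


def hourly_cost_usd(stored_tb: float,
                    price_per_tb_month: float,
                    reads_per_second: float = 0.0,
                    price_per_10k_reads: float = 0.0) -> float:
    """Hourly cost = at-rest storage charge + read transaction charge.

    For a provisioned service such as Premium Files, pass the provisioned
    size as `stored_tb` and leave the transaction arguments at zero.
    """
    at_rest = stored_tb * price_per_tb_month / HOURS_PER_MONTH
    transactions = reads_per_second * 3600 / 10_000 * price_per_10k_reads
    return at_rest + transactions


if __name__ == "__main__":
    # Hypothetical Hot tier example: 100 TB at a placeholder $20 per TB-month,
    # read continuously at a placeholder 5,000 operations per second.
    cost = hourly_cost_usd(100, 20.0, reads_per_second=5_000,
                           price_per_10k_reads=0.01)
    print(f"worst-case cost: ${cost:.2f}/hour")
```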
Maximum Service Costs
In this case, the costs of Premium Files and Premium Blob are very similar. However, we need to consider what performance we are getting for this cost. Based on our throughput benchmarks, we can calculate this for the 100TB stored data case. The results are shown in the graph as MB/s of throughput obtained per $/hour of storage costs paid.
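The value metric is simply measured throughput divided by hourly cost. As a minimal sketch, the helper below ranks a set of services by that ratio; the service names and numbers in the example are made-up placeholders, not our benchmark results.

```python
def throughput_per_dollar(throughput_mb_s: float, hourly_cost_usd: float) -> float:
    """MB/s of throughput obtained per $/hour of storage cost."""
    if hourly_cost_usd <= 0:
        raise ValueError("hourly cost must be positive")
    return throughput_mb_s / hourly_cost_usd


def rank_by_value(services: dict) -> list:
    """Return (name, MB/s per $/hour) pairs, best value first.

    `services` maps a service name to (throughput in MB/s, cost in $/hour).
    """
    scored = [(name, throughput_per_dollar(tput, cost))
              for name, (tput, cost) in services.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder numbers purely for illustration.
    example = {
        "service A": (2000.0, 20.0),
        "service B": (1500.0, 5.0),
        "service C": (800.0, 16.0),
    }
    for name, value in rank_by_value(example):
        print(f"{name}: {value:.0f} MB/s per $/hour")
```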
Service performance/cost 100TB data stored
Not unexpectedly given its low hourly cost, Hot tier Blob storage consistently provides the best cost/performance value. This margin grows with both thread parallelism and stored file size as the Hot tier Blob becomes increasingly performant. The Premium tiers are most competitive for small file sizes, a result of their lower-latency SSD-based nature and higher maximum transaction rates. For large file transfers, however, Hot tier Blob is a highly attractive solution.
The size of the storage is an important dimension to consider when looking at cost, and so far, we have assumed a very large dataset size of 100TB. But what happens to the cost/performance of the various services if we reduce the size of the stored dataset to 10TB? To maintain maximum performance, we will need to retain the 100TB provisioned size of the Premium Files share, but the other services should now be significantly cheaper since we pay only 1/10th of the at-rest data costs. We will also still be paying transaction costs at the same rate on the Hot tier services, meaning that they will become more expensive relative to the Premium Blob service as dataset size is reduced.
Service performance/cost 10TB data stored
At this scale, the provisioned-performance nature of Azure Files penalizes it heavily compared to the other Azure storage services in terms of cost/performance. Hot tier Blob storage is still the best value, but Premium tier blob is significantly more competitive. This is a direct result of the different pricing models of the Hot and Premium tiers - while both tiers have reduced their storage at rest costs, the Hot tier still has transaction costs which are fixed in this example where we are running the service at maximum transaction rate.
With all this performance and cost data in mind, how do we actually work out what the optimal solution is for our specific application? As is always the case when it comes to HPC performance questions, there is simply no substitute for benchmarking. You should benchmark your application in a situation where network bandwidth is not an issue e.g., using large local NVMe for dataset storage, and calculate what the bandwidth and IOP requirements of the application are. From there, using the benchmarking data we have gathered here you can determine what strategies are likely to provide the required performance at a reasonable cost.
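Once you have run the application against fast local storage and recorded how much data it read, how many read operations it issued, and how long the I/O-bound phase took, the requirements follow from simple division. The sketch below assumes you have obtained those three counters from your own profiling or tracing tool; the function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IoRequirements:
    bandwidth_mib_s: float   # sustained read bandwidth the application consumed
    iops: float              # read operations per second it issued
    mean_request_kib: float  # average size of a single read


def io_requirements(bytes_read: int, read_ops: int, elapsed_s: float) -> IoRequirements:
    """Derive bandwidth and IOP requirements from counters of a benchmark run."""
    if read_ops <= 0 or elapsed_s <= 0:
        raise ValueError("read_ops and elapsed_s must be positive")
    return IoRequirements(
        bandwidth_mib_s=bytes_read / elapsed_s / 2**20,
        iops=read_ops / elapsed_s,
        mean_request_kib=bytes_read / read_ops / 1024,
    )


if __name__ == "__main__":
    # Hypothetical run: 50 GiB read in 400,000 operations over 120 seconds.
    print(io_requirements(50 * 2**30, 400_000, 120.0))
```

A small mean request size combined with a high operation rate points towards an IOP-bound workload, while large requests point towards a bandwidth-bound one, which in turn indicates which of the benchmark results above is most relevant.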
Cloud storage services can be very different from on-premises storage. Our benchmarking has shown that Azure Blob Storage provides the highest bandwidth and lowest latency options. However, it does not operate as a traditional filesystem, so optimizing your application for the cloud may involve making changes to allow Blob storage to be used.
The cloud storage service landscape is complex, with a wide range of services designed for optimal performance in different scenarios. Leveraging these optimally for HPC workloads is a major challenge when migrating to the cloud, and care must be taken to find a solution that is optimal for both performance and overall cost – if storage is too slow then expensive compute time can be wasted waiting for data to be loaded, while the highest performance storage options can rapidly become a dominant part of the overall cost for some workloads.
Finding the optimal solution requires a deep understanding of both the workload requirements and the capabilities of the various cloud storage services. The benchmarks and discussion in this article should provide a good starting point for you to design your own optimal HPC storage solution.
Our results here have shown that Blob storage is the most performant and cost-efficient storage solution offered by Azure. However, it is very different from the traditional networked filesystems used on-premises. The new Blob NFS service provides an attractive solution for using Blob storage with workloads that require traditional filesystem features, while BlobFuse provides even greater performance at the expense of data consistency guarantees. Even so, some workloads may still require adaptations to be able to make efficient use of Azure Blob or other non-traditional storage services.
Want to know more about how to optimize your HPC workload for the best cost and performance? Get in touch with NAG to find out how we can put our cloud expertise to work for you!
The work demonstrated here was funded by Microsoft in partnership with NVIDIA. The authors would like to thank Microsoft and NVIDIA employees for their contributions to this article.