NAGnews 156

In this issue:


NAG's HPC team improves ease of use in the Cloud for Volkswagen

NAG has been a partner in the Performance Optimisation and Productivity (POP) and Fortissimo projects for some time. Given NAG's experience of working with large HPC centers on specialist, business-focused projects, our HPC team was a natural choice to deliver performance and ease-of-use improvements for users of OpenFOAM in the Cloud within the Fortissimo project.

OpenFOAM is a C++ toolbox, parallelized with MPI, for the development of customized numerical solvers, and pre-/post-processing utilities for the solution of continuum mechanics problems, including computational fluid dynamics.

The scope of the Volkswagen Fortissimo project was to move some of Volkswagen's OpenFOAM simulations onto cloud-based HPC platforms. Ning Li, Senior Technical Consultant at NAG, and partners from RWTH Aachen began by studying the overall architecture and software environment of the hosting clusters and supporting partners. They then streamlined the software deployment process (an OpenFOAM environment is difficult to set up without specialized knowledge), performed parallel scaling studies of the test cases, and helped develop a cost analysis model to demonstrate the benefit to Volkswagen of switching to a cloud-based solution.

As a result of the project work, Volkswagen were able to make well-informed business decisions on their best path forward and then, when the move took place, were able to quickly get started using their chosen cloud-based solution with help from NAG's HPC team.

This image shows the geometry of the Volkswagen Polo Mk V car body with its mesh decomposed over 192 cores. NAG performed external aerodynamics simulations of this real car model on the ARCHER supercomputer to validate the software deployment and perform scaling studies.

More information on NAG's HPC Services and Consulting.


What to do when Julia's lufact runs out of memory

Reid Atcheson, Accelerator Software Engineer at NAG, has blogged about 'What to do when Julia's lufact runs out of memory' - see the blog snippet below.

Direct sparse linear solvers have a huge problem: fill-in. As we increase problem size the amount of memory it takes to compute a basic factorization grows rapidly. Traditionally the solution is to scale out and use more nodes, likely mediated through a message passing library such as MPI. That is an expensive solution, and I was curious if we could instead try pushing direct solvers further on a single compute node.

With some of the growth and improvement in I/O technologies, such as the emergence of burst buffers, I decided to look at out-of-core solvers. These are solvers which make intermittent reads and writes to a large storage solution so as to reduce the amount of main memory required.

In this blog post I show a simple sparse linear system involving about 2 million unknown variables that I could not solve using Julia's built-in "lufact" (which interfaces to UMFPACK). I show how to write a simple out-of-core solver that makes periodic use of disk to store very large precomputed LU factors. This simple change enabled me to solve systems that otherwise could not be solved on a single node due to memory constraints. Read the blog here.
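The blog works with sparse matrices and Julia's UMFPACK interface; as a rough illustration of the same out-of-core principle, here is a small, dense Python/NumPy sketch in which the LU factors live in disk-backed memmaps and the triangular solves stream them back one block of rows at a time. The matrix, block size and file layout are all illustrative, not the blog's actual solver:

```python
import os
import tempfile

import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(0)
n, blk = 400, 100
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)

# Factor once (A = p @ l @ u), then stash L and U on disk instead of in RAM.
p, l, u = sla.lu(A)
d = tempfile.mkdtemp()
for name, m in (("l", l), ("u", u)):
    mm = np.memmap(os.path.join(d, name), dtype=m.dtype, mode="w+", shape=m.shape)
    mm[:] = m
    del mm  # flush to disk

l_mm = np.memmap(os.path.join(d, "l"), dtype=l.dtype, mode="r", shape=(n, n))
u_mm = np.memmap(os.path.join(d, "u"), dtype=u.dtype, mode="r", shape=(n, n))

# Blocked forward substitution L y = p.T b: only blk rows of L are resident.
y = p.T @ b
for i in range(0, n, blk):
    rows = np.array(l_mm[i:i + blk])            # one block read from disk
    y[i:i + blk] -= rows[:, :i] @ y[:i]
    y[i:i + blk] = sla.solve_triangular(rows[:, i:i + blk], y[i:i + blk], lower=True)

# Blocked backward substitution U x = y, last block first.
x = y.copy()
for i in range(n - blk, -1, -blk):
    rows = np.array(u_mm[i:i + blk])
    x[i:i + blk] -= rows[:, i + blk:] @ x[i + blk:]
    x[i:i + blk] = sla.solve_triangular(rows[:, i:i + blk], x[i:i + blk])

print(np.allclose(A @ x, b))
```

The same pattern - factor once, spill to fast storage, stream blocks back during the solve - is what makes burst buffers attractive for direct solvers.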


Latest NAG Library (Mark 26.1) now available for Apple Mac OS X

The latest mathematical and statistical functionality introduced at Mark 26.1 of the NAG Library for both C and Fortran has been made available for users of Apple Mac OS X. Mark 26.1 adds two new optimization routines to the NAG Optimization Modelling Suite - Derivative-free Optimization for Data Fitting and an Interior Point Method for Large Scale Linear Programming Problems - along with a further 20 numerical routines new to the Library at this Mark. Learn more about the new functionality.
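To give a flavour of what derivative-free data fitting means, here is a loose, hypothetical stand-in using SciPy's Nelder-Mead method - not NAG's routine, and with entirely made-up data - fitting a two-parameter model using only objective values, never gradients:

```python
import numpy as np
from scipy.optimize import minimize

# Noisy observations of y = a * exp(b * t); recover (a, b) without derivatives.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)
y = 2.0 * np.exp(-1.5 * t) + 0.01 * rng.standard_normal(t.size)

def sum_of_squares(p):
    a, b = p
    return np.sum((a * np.exp(b * t) - y) ** 2)

# Nelder-Mead uses only function evaluations, so the model need not be
# differentiable (or its derivatives may simply be unavailable).
res = minimize(sum_of_squares, x0=[1.0, 0.0], method="Nelder-Mead")
print(res.x)  # close to the true parameters 2.0 and -1.5
```

Dedicated derivative-free data-fitting solvers, such as the one added at Mark 26.1, exploit the least squares structure of the objective rather than treating it as a black box.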

Software downloads of the NAG Library can be found here.


Upcoming learning opportunities: webinars and courses

Webinar: How to identify causes of poor OpenMP parallel performance using the Intel VTune Amplifier | Wednesday 11 July 2018 | 15:00 GMT
This webinar is aimed at anyone who wants an introduction to using VTune to understand the causes of OpenMP underperformance. It describes a systematic way of using Intel's VTune Amplifier to identify the sources of parallel inefficiency in OpenMP code, e.g. load imbalance, serial execution, OpenMP overheads and slowdown in processor throughput. Register for the webinar.

Training course: HPC Leadership Training Institute | Texas Advanced Computing Center | 11-13 September 2018
The High Performance Computing (HPC) Leadership Institute is specifically tailored to managers and decision makers who are using, or considering using, HPC within their organizations. It is also applicable to those with a real opportunity to make this career step in the near future. Topics covered will include procurement considerations, pricing and capital expenditures, operating expenditures, and cost/benefit analysis of adding HPC to a company's or institution's R&D portfolio. A broad scope of HPC is covered from department scale clusters to the largest supercomputers, modelling and simulation to non-traditional use cases, and more. We encourage attendees from diverse backgrounds and under-represented communities. Click here for more information and to register.

Webinar: Verification and modernization of Fortran Codes using the NAG Fortran Compiler | Thursday 13 September 2018 | 15:00 GMT
This webinar will show how the NAG Fortran Compiler can be used to write correct, performance-portable code, which is not always possible with other compilers. Because the NAG Fortran Compiler strictly adheres to the language standard, code verified with it remains portable to other compilers. Register for the webinar.


Technical Report: Batched Least Squares of Tall Skinny Matrices on GPU

NAG has produced a highly efficient batched least squares solver for NVIDIA GPUs. The solver allows matrices in a batch to have different sizes and content. The code is optimized for tall skinny matrices, which frequently arise in data fitting problems (e.g. XVA in finance) and are typically not easy to parallelize. The code is 20x to 40x faster than building a batched GPU least squares solver from the NVIDIA libraries (cuBLAS, cuSolver).

This gives a pronounced speedup for applications where the matrices are already in GPU memory. For CPU-only applications, the cost of transferring the matrices from CPU memory to the GPU can dominate; even so, we observed speedups of between 1.5x and 12x, including all transfer costs, for large enough problem sizes. Hence CPU-only applications can see a healthy speedup by adding a GPU and using NAG's software. NAG can provide code to benchmark users' systems to determine likely benefits and minimum batch sizes.

If the matrices have structure (e.g. polynomial basis functions), then much less data needs to be transferred. Evaluating the basis functions on the GPU means current CPU-only applications can see a 10x to 20x speedup for large enough problems. NAG can help users write the small amount of additional GPU code needed to do this. Read the technical report here.
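For readers unfamiliar with the problem shape, the following small NumPy sketch sets up the kind of batch the solver targets: tall skinny matrices of different sizes, each solved in a least squares sense. The sizes and data are illustrative; NAG's solver performs the whole batch in fused GPU kernels, whereas this reference loop is plain CPU NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

# A batch of tall skinny systems of *different* sizes, as in data fitting:
# each matrix has far more rows (observations) than columns (parameters).
shapes = [(1000, 8), (1500, 12), (800, 6)]
batch = [(rng.standard_normal(s), rng.standard_normal(s[0])) for s in shapes]

# Reference CPU solution: one QR-based least squares solve per matrix.
solutions = [np.linalg.lstsq(A, b, rcond=None)[0] for A, b in batch]

# Sanity check via the normal equations: A^T A x = A^T b at the minimiser.
for (A, b), x in zip(batch, solutions):
    assert np.allclose(A.T @ A @ x, A.T @ b)
```

The per-matrix loop is exactly what parallelizes poorly: each solve is too small to saturate a GPU on its own, which is why batching many of them into one computation pays off.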


Out & About with NAG

Exhibitions, Conferences, Trade Shows and Webinars

Webinar: How to identify causes of poor OpenMP parallel performance using the Intel VTune Amplifier Online, 11 July 2018

Advanced Risk & Portfolio Management (ARPM) Bootcamp New York City, 13-18 August 2018 or online, anytime.
To register with a discounted affiliate rate: go to the Packages page → under Courses click on 'ARPM Bootcamp' → in the Type list select 'Affiliate' → in the Affiliation list select 'NAG' → click on 'Next Step.'

HPC Leadership Training Institute Texas Advanced Computing Center 11-13 September 2018

Webinar: Verification and Modernization of Fortran Codes using the NAG Fortran Compiler Online, 13 September 2018

CppCon Bellevue, 23-29 September 2018

4th Quantitative Finance Conference Nice, 26-28 September 2018