Batched Least Squares of Tall Skinny Matrices on GPU

NAG has produced a highly efficient batched least squares solver for NVIDIA GPUs. The code is optimized for tall skinny matrices, which frequently arise in data fitting problems such as XVA in finance and are typically difficult to parallelize. It is 20x to 40x faster than a batched GPU least squares solver built from NVIDIA libraries (cuBLAS, cuSOLVER), which gives a pronounced speedup for applications where the matrices are already in GPU memory.
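To make the problem concrete, here is a minimal CPU reference sketch in plain Python of what "batched least squares" means: many independent small problems min ||A x - b||, each with a tall skinny matrix A. It is not NAG's GPU algorithm. The function names are hypothetical, and the normal-equations approach was chosen only for brevity (it is less numerically robust than a QR-based solve).

```python
# Reference sketch only: each problem min ||A x - b||_2 is solved independently
# via the normal equations (A^T A) x = A^T b. This is NOT NAG's GPU method.

def solve_normal_equations(A, b):
    """Least squares solution for one tall skinny matrix A (m rows, n columns)."""
    m, n = len(A), len(A[0])
    # Form the small n-by-n Gram matrix G = A^T A and right-hand side r = A^T b.
    G = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    r = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    # Gaussian elimination with partial pivoting on G x = r.
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(G[row][col]))
        G[col], G[pivot] = G[pivot], G[col]
        r[col], r[pivot] = r[pivot], r[col]
        for row in range(col + 1, n):
            factor = G[row][col] / G[col][col]
            for j in range(col, n):
                G[row][j] -= factor * G[col][j]
            r[row] -= factor * r[col]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(G[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (r[i] - s) / G[i][i]
    return x


def batched_least_squares(batch):
    """Solve every (A, b) pair in the batch; the problems are independent."""
    return [solve_normal_equations(A, b) for A, b in batch]
```

Because the problems in a batch do not depend on one another, a GPU can work on many of them at once. The difficulty is that each individual matrix is too small to keep the device busy on its own.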

For CPU-only applications, the cost of transferring the matrices from CPU memory to the GPU can dominate. Even with all transfer costs included, we observed speedups of between 1.5x and 12x for large enough problem sizes. Hence CPU-only applications can see a healthy speedup by adding a GPU and using NAG's software. NAG can provide code to benchmark users' systems to determine likely benefits and minimum batch sizes.

If the matrices have structure (e.g. polynomial basis functions) then much less data needs to be transferred, because the matrices can be built on the GPU from the inputs to the basis functions. Evaluating the basis functions on the GPU means current CPU-only applications can see a 10x to 20x speedup for large enough problems. NAG can help users write the small amount of additional GPU code needed to do this.
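As a rough illustration of why structure reduces transfers, the sketch below assumes a one-dimensional monomial basis 1, x, x^2, ..., x^d and a separate set of m sample points for each matrix in the batch. It is written in plain Python to show the arithmetic only, and the function names are hypothetical. In practice the basis evaluation would run in a GPU kernel.

```python
def polynomial_basis_matrix(points, degree):
    """Build the m-by-(degree + 1) matrix with rows [1, x, x^2, ..., x^degree]."""
    rows = []
    for x in points:
        row = [1.0]
        for _ in range(degree):
            row.append(row[-1] * x)  # next power of x from the previous one
        rows.append(row)
    return rows


def transfer_sizes(batch_size, m, degree, bytes_per_value=8):
    """Bytes of matrix data to send: full matrices vs. sample points only.

    Assumes double precision (8 bytes per value) by default. The right-hand
    sides must be sent in both cases and are not counted here.
    """
    n = degree + 1
    full = batch_size * m * n * bytes_per_value
    points_only = batch_size * m * bytes_per_value
    return full, points_only
```

Under these assumptions, sending only the sample points cuts the matrix data by a factor of n = d + 1. For example, `transfer_sizes(1000, 512, 3)` returns `(16384000, 4096000)`, a fourfold reduction for cubic polynomials.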

Read the full report by Jacques Du Toit and Tim Schmielau, 'Batched Least Squares of Tall Skinny Matrices'.

Getting Access to the Code

The code is available to trial. To arrange access, or if you have any questions, please contact  
