In multithreaded applications, each thread in a team processes instructions independently while sharing the same memory address space. For these applications to operate correctly any functions called from them must be thread safe. That is, any global variables they contain are guaranteed not to be accessed simultaneously by different threads, as this can compromise results. This can be ensured through appropriate synchronization, such as that found in OpenMP.
When a function is described as thread safe we are considering its behaviour when it is called by multiple threads. It is worth noting that a thread unsafe function can still, itself, be multithreaded. A team of threads can be created inside the function to share the workload as described in Section 2.
The NAG CL Interface is thread safe by design: its functions do not use global variables and all communication between them is via argument lists, so they can be safely called simultaneously by multiple threads in your program.

### 1.1 Functions with Function Arguments

Some Library functions require you to supply a function and to pass the name of the function as an actual argument in the call to the Library function. For many of these Library functions, the supplied function interface includes an array parameter (called comm) specifically for you to pass information to the supplied function without the need for global variables.
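The pattern described above can be sketched as follows. This is a minimal illustration only: `demo_comm`, `demo_fun` and `demo_integrate` are hypothetical stand-ins invented for this example and are not NAG types or functions. The point it shows is that the supplied function reads its data from the communication argument, so no global variable is needed.

```c
/* Hypothetical stand-in for a library communication structure (not a NAG type). */
typedef struct {
    double *user;   /* caller-owned data, passed through to the supplied function */
} demo_comm;

/* Signature that the hypothetical library expects for the supplied function. */
typedef double (*demo_fun)(double x, demo_comm *comm);

/* Hypothetical library function: composite trapezoidal rule on [a, b]
   with n >= 1 subintervals. It only forwards comm to f. */
double demo_integrate(demo_fun f, double a, double b, int n, demo_comm *comm)
{
    double h = (b - a) / n;
    double sum = 0.5 * (f(a, comm) + f(b, comm));
    for (int i = 1; i < n; ++i)
        sum += f(a + i * h, comm);
    return h * sum;
}

/* User-supplied function: takes its coefficient from comm, not from a global. */
double scaled_line(double x, demo_comm *comm)
{
    return comm->user[0] * x;
}

/* Each caller builds its own comm on its own stack, so concurrent
   calls from different threads share no mutable state. */
double integrate_scaled_line(double scale, double a, double b)
{
    double data[1] = { scale };
    demo_comm comm = { data };
    return demo_integrate(scaled_line, a, b, 8, &comm);
}
```

Because the data travels with each call, two threads can call `integrate_scaled_line` with different values of `scale` at the same time without interfering with each other.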
If you need to provide your supplied function with more information than can be given via the interface argument list, then you are advised to check, in the relevant Chapter Introduction, whether the Library function you intend to call has an equivalent reverse communication interface. These have been designed specifically for problems where user-supplied function interfaces are not flexible enough for a given problem, and their use should eliminate the need to provide data through global variables. Where reverse communication interfaces are not available, it is usual to use global variables containing the required data that is accessible from both the supplied function and from the calling program. It is thread safe to do this only if any global data referenced is made threadprivate by OpenMP or is updated using appropriate synchronisation, thus avoiding the possibility of simultaneous modification by different threads.
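Where a global variable cannot be avoided, the OpenMP `threadprivate` directive gives each thread its own copy. The sketch below assumes an OpenMP-enabled compiler (without one the pragmas are ignored and the loop runs serially); `g_scale`, `scaled` and `run_cases` are hypothetical names chosen for illustration.

```c
/* Global data read by the supplied function. The threadprivate directive
   gives every OpenMP thread its own copy of g_scale. */
static double g_scale;
#pragma omp threadprivate(g_scale)

/* Stand-in for a user-supplied function that has no argument
   through which extra data could be passed. */
static double scaled(double x)
{
    return g_scale * x;
}

/* Evaluate scaled(2.0) for several scale factors in parallel. */
void run_cases(const double *scales, double *out, int ncases)
{
    #pragma omp parallel for
    for (int i = 0; i < ncases; ++i) {
        g_scale = scales[i];   /* writes only this thread's private copy */
        out[i] = scaled(2.0);  /* reads the same private copy */
    }
}
```

Without the `threadprivate` directive, the assignment to `g_scale` in one thread could overwrite the value that another thread is about to read, which is exactly the simultaneous modification described above.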
Thread safety of user-supplied functions is also an issue with a number of functions in multithreaded implementations of the NAG Library, which may internally parallelize around the calls to the user-supplied functions. This issue affects not just global variables but also how the comm array may be used. In these cases, synchronisation may be needed to ensure thread safety. Chapter X06 provides functions which can be used in your supplied function to determine whether it is being called from within an OpenMP parallel region. If you are in doubt over the thread safety of your program you are advised to contact NAG for assistance.

### 1.2 Functions with Handle Arguments

Some Library functions have arguments described as handles, which are pointers to internal data structures; see, for example, Section 4.1 in the E04 Chapter Introduction or Section 2.1 in the G22 Chapter Introduction. The internal data structures referenced by the handles should always be considered as Input/Output arguments, i.e., their data may be freely read from and written to. As such, when calling a function that has a handle argument in a multithreaded region, each thread must have its own copy of the handle, either initialized directly on the thread or explicitly copied via a call to a relevant Library function where one exists.
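The "one handle per thread" rule can be sketched as below. The handle type and the functions `demo_handle_init` and `demo_square` are hypothetical and exist only for this illustration; real handles are created and destroyed by the functions named in the relevant Chapter Introduction. The sketch assumes an OpenMP-enabled compiler.

```c
/* Hypothetical internal state behind a handle (not a NAG type). */
typedef struct {
    double scratch;   /* workspace that every call overwrites */
    int ncalls;       /* bookkeeping that every call updates */
} demo_state;

typedef demo_state *demo_handle;

/* Hypothetical initialization of the state behind a handle. */
void demo_handle_init(demo_handle h)
{
    h->scratch = 0.0;
    h->ncalls = 0;
}

/* Hypothetical library call: both reads and writes the handle's state. */
double demo_square(demo_handle h, double x)
{
    h->scratch = x * x;
    h->ncalls += 1;
    return h->scratch;
}

void square_all(const double *x, double *out, int n)
{
    #pragma omp parallel
    {
        /* Declared inside the parallel region, so each thread owns one. */
        demo_state state;
        demo_handle h = &state;
        demo_handle_init(h);

        #pragma omp for
        for (int i = 0; i < n; ++i)
            out[i] = demo_square(h, x[i]);
    }
}
```

Had a single handle been created outside the parallel region and shared, the writes to `scratch` and `ncalls` from different threads would race.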

### 1.3 Input/Output

When using the NAG CL Interface in multithreaded applications, we recommend that the output of its error mechanism is switched off (by setting fail.print = Nag_FALSE).

### 1.4 Implementation Issues

In very rare cases we are unable to guarantee the thread safety of a specific implementation. Note also that in some implementations, the Library is linked with one or more vendor libraries to provide, for example, efficient BLAS functions. NAG cannot guarantee that any such vendor library is thread safe. Please consult the Users' Note for your implementation for any additional implementation-specific information.

## 2 Parallelism

### 2.1 Introduction

The time taken to execute a function from the NAG Library has traditionally depended, to a large degree, on the serial performance capabilities of the processor being used. In an effort to go beyond the performance limitations of a single core processor, multithreaded implementations of the NAG Library are available. These implementations divide the computational workload of some functions between multiple cores and execute the resulting tasks in parallel. Traditionally, parallel systems consisted of a small number of processors, each with a single core. Improvements in the performance of these processors came in line with increases in clock frequency. However, this increase reached a limit, which meant that processor designers had to find another way to improve performance; this led to the development of multicore processors, which are now ubiquitous. Instead of a single compute core, a multicore processor contains two or more, each of which typically comprises at least a Central Processing Unit and a small cache. Making effective use of parallelism, wherever possible, has thus become imperative in order to maximize the performance potential of modern hardware resources and of the multithreaded implementations.
The effectiveness of parallelism can be measured by how much faster a parallel program is compared to an equivalent serial program. This is called the parallel speedup. If a serial program has been parallelized then the speedup of the parallel implementation of the program is defined by dividing the time taken by the original serial program on a given problem by the time taken by the parallel program using $n$ cores to compute the same problem. Ideal speedup is obtained when this value is $n$ (i.e., when the parallel program takes $\frac{1}{n}$th the time of the original serial program). If speedup of the parallel program is close to ideal for increasing values of $n$ then we say the program has good scalability.
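As a worked illustration of this definition, write ${T}_{1}$ for the run time of the serial program and ${T}_{n}$ for the run time of the parallel program on $n$ cores. These symbols are introduced here for convenience, and the timings below are made-up numbers rather than measurements.

$${S}_{n}=\frac{{T}_{1}}{{T}_{n}},\qquad \text{ideal speedup: } {S}_{n}=n$$

For instance, if ${T}_{1}=120$ seconds and ${T}_{8}=20$ seconds, then ${S}_{8}=120/20=6$, which is three quarters of the ideal value of 8.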
The scalability of a parallel program may be less than the ideal value because of two factors:
1. the overheads introduced as part of the parallel implementation, and
2. inherently serial parts of the program.
Overheads include communication and synchronisation as well as any extra setup required to allow parallelism. Such overheads depend on the efficiency of the compiler and operating system libraries and the underlying hardware. The impact on performance of inherently serial fractions of a program is explained theoretically (i.e., assuming an idealised system in which overheads are zero) by Amdahl's law. Amdahl's law places an upper bound on the speedup of a parallel program with a given inherently serial fraction. If $r$ is the parallelizable fraction of a program and $s=1-r$ is the inherently serial fraction then the speedup using $n$ sub-tasks, ${S}_{n}$, satisfies the following:
$${S}_{n}\le \frac{1}{s+\frac{r}{n}}$$
Thus, for example, a program with a serial fraction of one quarter can never achieve a speedup greater than 4, since the bound tends to $\frac{1}{s}=4$ as $n\to \infty$.
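The bound is easy to tabulate. The following is a minimal sketch that evaluates Amdahl's law for a given serial fraction $s$ and number of sub-tasks $n$; the function name `amdahl_bound` is chosen here for illustration only.

```c
/* Upper bound on speedup from Amdahl's law: 1 / (s + r/n), with r = 1 - s.
   s is the inherently serial fraction (0 <= s <= 1) and
   n >= 1 is the number of sub-tasks. Overheads are assumed to be zero. */
double amdahl_bound(double s, int n)
{
    double r = 1.0 - s;   /* parallelizable fraction */
    return 1.0 / (s + r / n);
}
```

With $s=0.25$, this gives bounds of about 2.29, 3.37 and 3.82 for $n=4$, 16 and 64 respectively, approaching but never reaching 4.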
Parallelism may be utilised on two classes of systems: shared memory and distributed memory machines, which require different programming techniques. Distributed memory machines are composed of processors located in multiple components which each have their own memory space and are connected by a network. Communication and synchronisation between these components is explicit. Shared memory machines have multiple processors (or a single multicore processor) which can all access the same memory space, and this shared memory is used for communication and synchronisation. The NAG Library makes use of shared memory parallelism using OpenMP as described in Section 2.2.
Something to be aware of for multithreaded programs, compared to serial ones, is that identical results cannot be guaranteed, nor should be expected. Identical results are often impossible in a parallel program since using different numbers of threads may cause floating-point arithmetic to be evaluated in a different (but equally valid) order, thus changing the accumulation of rounding errors. For a more in-depth discussion of reproducibility of results see Section 8 in How to Use the NAG Library.
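The effect of evaluation order can be seen even without threads. The sketch below sums the same three numbers in two different orders; it assumes IEEE 754 double-precision arithmetic with no extended-precision intermediates, and the function names are illustrative.

```c
/* Sum x[0..n-1] from the first element to the last. */
double sum_forward(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Sum the same elements from the last to the first. */
double sum_reverse(const double *x, int n)
{
    double s = 0.0;
    for (int i = n - 1; i >= 0; --i)
        s += x[i];
    return s;
}
```

For the data `{1.0e16, -1.0e16, 1.0}`, the forward sum cancels the two large terms first and then adds 1, whereas the reverse sum adds 1 to a number so large that the 1 is lost to rounding before the cancellation takes place. A parallel reduction that splits a sum between threads changes the order of additions in a similar way, which is why results may differ slightly from one thread count to another.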

### 2.2 How is Parallelism Used in the NAG Library?

The multithreaded implementations differ from the serial implementations of the NAG Library in that they make use of multithreading through OpenMP, a portable specification for shared memory programming that is available in many different compilers on a wide range of hardware platforms (see The OpenMP API Specification for Parallel Programming).
Note that not all functions are parallelized; you should check Section 8 of the function documents to find details about parallelism and performance of functions of interest.
There are two situations in which a call to a function in the NAG Library makes use of multithreading:
1. The function being called is a NAG-specific function that has been threaded using OpenMP, or that internally calls another NAG-specific function that is threaded. This applies to multithreaded implementations of the NAG Library only.
2. The function being called calls through to BLAS or LAPACK functions. The vendor library recommended for use with your implementation of the NAG Library (whether the NAG Library is threaded or not) may be threaded. Please consult the Users' Note for further information.
A complete list of all the functions in the NAG Library, together with their threaded status, is given in Section 3.
It is useful to understand how OpenMP is used within the Library in order to avoid the potential pitfalls which lead to making inefficient use of the Library.
If you are calling multithreaded NAG functions from within another threading mechanism you need to be aware of whether or not this threading mechanism is compatible with the OpenMP compiler runtime used to build the multithreaded implementation of the NAG Library on your platform(s) of choice. The Users' Note document for each of the implementations in question will include some guidance on this, and you should contact NAG for further advice if required.
Parallelism is used in many places throughout the NAG Library since, although many functions have not been the focus of parallel development by NAG, they may benefit by calling functions that have, and/or by calling parallel vendor functions (e.g., BLAS, LAPACK). Thus, the performance improvement due to multithreading, if any, will vary depending upon which function is called, problem sizes and other parameters, system design and operating system configuration. If you frequently call a function with similar data sizes and other parameters, it may be worthwhile to experiment with different numbers of threads to determine the choice that gives optimal performance. Please contact NAG for further advice if required.
As a general guide, many key functions in the following areas are known to benefit from shared memory parallelism:
• Dense and Sparse Linear Algebra
• FFTs
• Random Number Generators
• Partial Differential Equations
• Interpolation
• Curve and Surface Fitting
• Correlation and Regression Analysis
• Multivariate Methods
• Time Series Analysis
• Financial Option Pricing
• Global Optimization
• Wavelets