# NAG FL Interface: Multithreading


## 1 Thread Safety

In multithreaded applications, each thread in a team processes instructions independently while sharing the same memory address space. For these applications to operate correctly, any routines called from them must be thread safe: that is, any global variables they contain must be guaranteed not to be accessed simultaneously by different threads, as this can compromise results. This can be ensured through appropriate synchronisation, such as that provided by OpenMP.
When a routine is described as thread safe we are considering its behaviour when it is called by multiple threads. It is worth noting that a thread unsafe routine can still, itself, be multithreaded. A team of threads can be created inside the routine to share the workload as described in Section 2.
Most routines in the NAG FL Interface are thread safe, however there remain some older routines, listed in the document Thread Unsafe Routines, that are not thread safe as they use unsynchronised global variables (such as module variables, common blocks or variables with the SAVE attribute). These routines should not be called by multiple threads in a user program. Please consult Section 8 of each routine document for further information.
In the NAG FL Interface there are some pairs of routines which share the same five-character root name, for example the routines e04ucf/e04uca. Each routine in the pair has exactly the same functionality, except that one of them has additional parameters that make it thread safe. The thread safe routine is distinguished by a different last character in the name (typically ‘a’ instead of the usual ‘f’). Such pairs are documented in a single routine document and are listed in the individual Chapter Contents.

### 1.1 Routines with Routine Arguments

Some Library routines require you to supply a routine and to pass the name of the routine as an actual argument in the call to the Library routine. For many of these Library routines, the supplied routine interface includes array arguments (called iuser and ruser) specifically for you to pass information to the supplied routine without the need for global variables.
In the NAG FL Interface, if the interfaces of a pair of thread safe (ending ‘a’) and non-thread safe (ending ‘f’) routines contain a user-supplied routine argument then the ‘a’ routine will contain the additional array arguments iuser and ruser (possibly plus others for internal use). In some cases the ‘a’ routine may need to be initialized by a separate initialization routine; this requirement will be clearly documented.
From Mark 26.1 onwards, newly added routines with routine arguments also contain the argument cpuser, which is of Type (c_ptr) (from the iso_c_binding module). This allows more complicated data structures to be passed easily to the user-supplied routine in cases where iuser and ruser would be inconvenient. The following code fragment shows how this can be used.
```fortran
Module mymodule
  Use iso_c_binding, Only: c_f_pointer, c_ptr
  Use nag_library, Only: nag_wp
  Private
  Public                            :: myfun
  Type, Public                      :: mydata
    Integer                         :: nx
    Real (Kind=nag_wp), Allocatable :: x(:)
  End Type mydata
Contains
  Subroutine myfun(...,iuser,ruser,cpuser)
    Type (c_ptr), Intent (In) :: cpuser
    Type (mydata), Pointer    :: md
    Call c_f_pointer(cpuser,md)
    ... Use md%x and md%nx ...
  End Subroutine myfun
End Module mymodule
...
Program myprog
  Use mymodule, Only: myfun, mydata
  Use iso_c_binding, Only: c_loc, c_ptr
  Type (c_ptr)          :: cpuser
  Type (mydata), Target :: md
  ...
  md%nx = 1000
  Allocate (md%x(md%nx))
  cpuser = c_loc(md)
  ...
  Call nagroutine(...,myfun,cpuser,iuser,ruser,ifail)
  ...
End Program myprog
```
This mechanism is used, for example, in Section 10 in e04stf.
If you need to provide your supplied routine with more information than can be given via the interface argument list, you are advised to check, in the relevant Chapter Introduction, whether the Library routine you intend to call has an equivalent reverse communication interface. These have been designed specifically for problems where user-supplied routine interfaces are not flexible enough, and their use should eliminate the need to provide data through global variables. Where reverse communication interfaces are not available, it is usual to hold the required data in global variables that are accessible from both the supplied routine and the calling program. Doing so is thread safe only if any global data referenced is made threadprivate by OpenMP or is updated using appropriate synchronisation, thus avoiding the possibility of simultaneous modification by different threads.
Thread safety of user-supplied routines is also an issue with a number of routines in multithreaded implementations of the NAG Library, which may internally parallelize around the calls to the user-supplied routines. This issue affects not just global variables but also how the iuser and ruser arrays, and any data structures pointed to by cpuser, may be used. In these cases, synchronisation may be needed to ensure thread safety. Chapter X06 provides routines which can be used in your supplied routine to determine whether it is being called from within an OpenMP parallel region. If you are in doubt over the thread safety of your program you are advised to contact NAG for assistance.

### 1.2 Routines with Handle Arguments

Some Library routines have arguments described as handles, which are pointers to internal data structures; see, for example, Section 3.1 in the E04 Chapter Introduction or Section 2.1 in the G22 Chapter Introduction. The internal data structures referenced by the handles should always be considered as Input/Output arguments, i.e., their data may be freely read from and written to. Consequently, when calling a routine that has a handle argument in a multithreaded region, each thread must have its own copy of the handle, either initialized directly on the thread or explicitly copied via a call to a relevant library routine where one exists.

### 1.3 Input/Output

The Library contains routines for setting the current error and advisory message unit numbers (x04aaf and x04abf). These routines use the SAVE statement to retain the values of the current unit numbers between calls; it is therefore not advisable for different threads of a multithreaded program to set the message unit numbers to different values. Furthermore, error or advisory messages output simultaneously by different threads may become garbled, and in any event there is no indication of which thread produces which message. You are therefore advised always to select the ‘soft failure’ mechanism without any error message ($\mathbf{ifail}=+1$, see Section 4 in the Introduction to the NAG Library FL Interface) on entry to each NAG Library routine called from a multithreaded application; it is then essential that the value of ifail be tested on return to the application.

### 1.4 Implementation Issues

In very rare cases we are unable to guarantee the thread safety of a particular implementation. Note also that in some implementations the Library is linked with one or more vendor libraries to provide, for example, efficient BLAS functions. NAG cannot guarantee that any such vendor library is thread safe. Please consult the Users' Note for your implementation for any additional implementation-specific information.

## 2 Parallelism

### 2.1 Introduction

The time taken to execute a routine from the NAG Library has traditionally depended, to a large degree, on the serial performance capabilities of the processor being used. Traditionally, computing systems consisted of a small number of processors, each with a single core, and performance improved in line with increases in clock frequency. However, this increase reached a limit, which meant that processor designers had to find another way to improve performance; this led to the development of multicore processors, which are now ubiquitous. Instead of a single compute core, multicore processors consist of two or more cores, each typically comprising at least a Central Processing Unit and a small cache. Making effective use of parallelism, wherever possible, has therefore become imperative in order to maximize the performance potential of modern hardware. To go beyond the performance limitations of a single core, multithreaded implementations of the NAG Library are available; these divide the computational workload of some routines between multiple cores and execute the resulting tasks in parallel.
The effectiveness of parallelism can be measured by how much faster a parallel program is compared to an equivalent serial program. This is called the parallel speedup. If a serial program has been parallelized, the speedup of the parallel implementation is defined as the time taken by the original serial program on a given problem divided by the time taken by the parallel program using $n$ cores to compute the same problem. Ideal speedup is obtained when this value is $n$ (i.e., when the parallel program takes $\frac{1}{n}$th the time of the original serial program). If the speedup of the parallel program is close to ideal for increasing values of $n$ then we say the program has good scalability.
The scalability of a parallel program may be less than the ideal value because of two factors:
1. the overheads introduced as part of the parallel implementation, and
2. inherently serial parts of the program.
Overheads include communication and synchronisation as well as any extra setup required to allow parallelism. Such overheads depend on the efficiency of the compiler and operating system libraries and the underlying hardware. The impact on performance of inherently serial fractions of a program is explained theoretically (i.e., assuming an idealised system in which overheads are zero) by Amdahl's law. Amdahl's law places an upper bound on the speedup of a parallel program with a given inherently serial fraction. If $r$ is the parallelizable fraction of a program and $s=1-r$ is the inherently serial fraction then the speedup using $n$ sub-tasks, ${S}_{n}$, satisfies the following:
 ${S}_{n}\le \frac{1}{s+\frac{r}{n}}$
Thus, for example, a program with a serial fraction of one quarter can never achieve a speedup greater than 4, since ${S}_{n}\le \frac{1}{s}=4$ for all $n$.
Parallelism may be utilised on two classes of systems: shared memory and distributed memory machines, which require different programming techniques. Distributed memory machines are composed of processors located in multiple components which each have their own memory space and are connected by a network. Communication and synchronisation between these components is explicit. Shared memory machines have multiple processors (or a single multicore processor) which can all access the same memory space, and this shared memory is used for communication and synchronisation. The NAG Library makes use of shared memory parallelism using OpenMP as described in Section 2.2.
Something to be aware of for multithreaded programs, compared to serial ones, is that identical results cannot be guaranteed, nor should they be expected. Identical results are often impossible in a parallel program since using different numbers of threads may cause floating-point arithmetic to be evaluated in a different (but equally valid) order, thus changing the accumulation of rounding errors. For a more in-depth discussion of reproducibility of results see Section 8 in How to Use the NAG Library.

### 2.2 How is Parallelism Used in the NAG Library?

The multithreaded implementations differ from the serial implementations of the NAG Library in that they make use of multithreading through OpenMP, a portable specification for shared memory programming that is available in many different compilers on a wide range of hardware platforms (see The OpenMP API Specification for Parallel Programming).
Note that not all routines are parallelized; you should check Section 8 of the routine documents to find details about parallelism and performance of routines of interest.
There are two situations in which a call to a routine in the NAG Library makes use of multithreading:
1. The routine being called is a NAG-specific routine that has been threaded using OpenMP, or that internally calls another NAG-specific routine that is threaded. This applies to multithreaded implementations of the NAG Library only.
2. The routine being called calls through to BLAS or LAPACK routines. The vendor library recommended for use with your implementation of the NAG Library (whether the NAG Library is threaded or not) may be threaded. Please consult the Users' Note for further information.
A complete list of all the routines in the NAG Library, and their threaded status, is given in Section 3.
It is useful to understand how OpenMP is used within the Library in order to avoid potential pitfalls that lead to inefficient use of the Library.
If you are calling multithreaded NAG routines from within another threading mechanism you need to be aware of whether or not this threading mechanism is compatible with the OpenMP compiler runtime used to build the multithreaded implementation of the NAG Library on your platform(s) of choice. The Users' Note document for each of the implementations in question will include some guidance on this, and you should contact NAG for further advice if required.
Parallelism is used in many places throughout the NAG Library since, although many routines have not been the focus of parallel development by NAG, they may benefit by calling routines that have, and/or by calling parallel vendor routines (e.g., BLAS, LAPACK). Thus, the performance improvement due to multithreading, if any, will vary depending upon which routine is called, problem sizes and other parameters, system design and operating system configuration. If you frequently call a routine with similar data sizes and other parameters, it may be worthwhile to experiment with different numbers of threads to determine the choice that gives optimal performance. Please contact NAG for further advice if required.
As a general guide, many key routines in the following areas are known to benefit from shared memory parallelism:
• Dense and Sparse Linear Algebra
• FFTs
• Random Number Generators
• Partial Differential Equations
• Interpolation
• Curve and Surface Fitting
• Correlation and Regression Analysis
• Multivariate Methods
• Time Series Analysis
• Financial Option Pricing
• Global Optimization
• Wavelets