NAG Library
Introduction to the NAG Library for the Xeon Phi Coprocessor
1 What is the NAG Library for the Xeon Phi Coprocessor?
The NAG Library for the Xeon Phi Coprocessor is a Fortran library of numerical routines intended for use on heterogeneous machines consisting of a host system and one or more Xeon Phi coprocessors. The library aims to make use of parallelism available in both components to provide superior performance compared to either the serial NAG Fortran Library or NAG Library for SMP & Multicore.
1.1 What is parallelism?
The time taken to execute a routine from the NAG Fortran Library depends, to a large degree, on the serial performance capabilities of the processor being used.
In an effort to go beyond the performance limitations of a single core processor the NAG Library for SMP & Multicore was introduced.
The NAG Library for SMP & Multicore divides the computational workload of some routines between multiple cores and executes these tasks in parallel.
Traditionally, such systems consisted of a small number of processors each with a single core.
Improvements in the performance capabilities of these processors had until recently happened in line with increases in clock frequencies.
However, this increase reached a limit which meant that processor designers had to find another way in which to improve performance; this led to the development of multicore processors, which are now ubiquitous.
Instead of consisting of a single compute core, multicore processors consist of two or more cores, each of which typically comprises at least a Central Processing Unit and a small cache.
Thus, making effective use of parallelism, wherever possible, has become imperative in order to maximise the performance potential of modern hardware resources, and the importance of the NAG Library for SMP & Multicore has grown.
As an extension of the multicore trend, manycore processors, i.e. processors with high numbers (tens or even hundreds) of cores, are also now becoming available.
The NAG Library for the Xeon Phi Coprocessor is the first version of the NAG library which aims to make effective use of the extra parallelism potential of a manycore processor, such as the Intel Xeon Phi coprocessor.
The effectiveness of parallelism can be measured by how much faster a parallel program is compared to an equivalent serial program.
This is called the parallel speedup.
If a serial program has been parallelised then the speedup of the parallel implementation of the program is defined by dividing the time taken by the original serial program on a given problem by the time taken by the parallel program using n cores to compute the same problem.
Ideal speedup is obtained when this value is n (i.e. when the parallel program takes 1/n of the time of the original serial program).
If speedup of the parallel program is close to ideal for increasing values of n then we say the program has good scalability.
The scalability of a parallel program may be less than the ideal value because of two factors: (a) the overheads introduced as part of the parallel implementation, and (b) inherently serial parts of the program.
Overheads include communication and synchronisation as well as any extra setup required to allow parallelism.
Such overheads can depend on efficiency of implementation and use of Application Programming Interfaces (APIs), and can vary depending on underlying hardware.
The impact on performance of inherently serial fractions of a program is explained theoretically (i.e. assuming an idealised system in which overheads are zero) by
Amdahl's law.
Amdahl's law places an upper bound on the speedup of a parallel program with a given inherently serial fraction. If r is the parallelizable fraction of a program and s = 1 - r is the inherently serial fraction, then the speedup using n sub-tasks, S_n, satisfies the following:
S_n ≤ 1/(s + r/n).
Thus, for example, this says that a program with a serial fraction of one quarter (s = 1/4) can only ever achieve a speedup of 4, since S_n ≤ 4 as n → ∞; even with n = 16 sub-tasks the bound is only 1/(0.25 + 0.75/16) ≈ 3.4.
Parallelism may be utilised on two classes of systems: shared memory and distributed memory machines, which require different programming techniques.
Distributed memory machines are composed of processors located in multiple components which each have their own memory space and are connected by a network.
Communication and synchronisation between these components is explicit.
Shared memory machines have multiple processors (or a single multicore processor) which can all access the same memory space, and this shared memory is used for communication and synchronisation.
The NAG Library for the Xeon Phi Coprocessor makes use of shared memory parallelism using the OpenMP API as described in
Section 1.2.
Parallel programs which use OpenMP create (or "fork") a number of
threads from a single process when required at run-time.
(Programs which make use of shared memory parallelism are also called
multithreaded programs.)
Once the parallel work has been completed the threads return control to the parent process and become inactive (or "join") until the next region of parallel work.
The threads share the same memory address space, i.e. that of the parent process, and this shared memory is used for communication and synchronisation.
OpenMP provides some mechanisms for access control so that, as well as allowing all threads to access shared variables, it is possible for each thread to have private copies of other variables that only it can access.
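As a minimal, library-independent sketch, the following fragment shows the fork-join pattern: a team of threads forks at the parallel directive, each thread works on part of the shared array X using its own private loop index I, and the threads join at the end of the construct.
      PROGRAM FORK_JOIN_EXAMPLE
!     A minimal fork-join illustration (not NAG code): X and SCALE are
!     shared between the threads, the loop index I is private to each.
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100000
      DOUBLE PRECISION   :: X(N), SCALE
      INTEGER            :: I
      X = 1.0D0
      SCALE = 2.0D0
!$OMP PARALLEL DO SHARED(X,SCALE) PRIVATE(I)
      DO I = 1, N
         X(I) = SCALE*X(I)
      END DO
!$OMP END PARALLEL DO
      WRITE(*,*) 'X(1) after the parallel region = ', X(1)
      END PROGRAM FORK_JOIN_EXAMPLE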
For shared variables, thread safety is an issue.
A program is deemed to be "thread safe" if it can be executed using two or more threads without compromising results.
Thread safe programs should return equally valid results no matter how many threads are used in the parallel regions.
However, that is not to say that identical results can be guaranteed, or should be expected.
Identical results are often impossible in a parallel program since using different numbers of threads may cause floating point arithmetic to be evaluated in a different (but equally valid) order, thus changing the accumulation of rounding errors.
For a more in-depth discussion of reproducibility of results see the
Essential Introduction document.
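As a simple illustration of why bitwise-identical results should not be expected, the following fragment (not NAG-specific) shows that floating point addition is not associative; summing the same values in a different order, as can happen when the number of threads changes, alters the last bits of the result.
      PROGRAM ORDERING_EXAMPLE
!     Both sums below are equally valid, but they differ in the final bit.
      IMPLICIT NONE
      DOUBLE PRECISION :: A, B, C
      A = 1.0D0
      B = 1.0D-16
      C = 1.0D-16
      WRITE(*,*) '(A+B)+C = ', (A+B)+C    ! prints 1.0000000000000000
      WRITE(*,*) 'A+(B+C) = ', A+(B+C)    ! prints approximately 1.0000000000000002
      END PROGRAM ORDERING_EXAMPLE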
1.2 How is parallelism used in the NAG Library for the Xeon Phi Coprocessor?
The NAG Library for the Xeon Phi Coprocessor contains the full functionality currently available in the NAG Fortran Library, and is based on the NAG Library for SMP & Multicore; users are encouraged to familiarise themselves with the
Essential Introduction for a general overview of the structure of these products.
Routine interfaces are identical to those of the NAG Fortran Library.
This makes the migration from using the NAG Fortran Library to using the NAG Library for the Xeon Phi Coprocessor trivial.
However, there are a number of thread safety issues which users should be aware of, as discussed in
Thread Safety.
The NAG Library for the Xeon Phi Coprocessor, like the NAG Library for SMP & Multicore, differs from the serial NAG Fortran Library in that it makes use of multithreading through the OpenMP API (version 3.0), which is a portable specification for shared memory programming that is available in many different compilers on a wide range of different hardware platforms (see
OpenMP).
Note that not all routines are parallelised; users should check Section 8 of the routine documents to find details about parallelism and performance of routines of interest.
There are three situations in which a call to a routine in the NAG Library for the Xeon Phi Coprocessor makes use of multithreading:
- The routine being called is a NAG-specific routine that has been threaded using OpenMP. These are termed "directly threaded" routines.
- The routine being called calls through to the vendor library (i.e. the Intel Math Kernel Library MKL). This happens if the routine is not specific to the NAG library, and the vendor library offers superior parallel performance and equivalent numerical properties. For example, most BLAS and LAPACK routines fall into this category.
- The routine being called is not directly threaded itself (i.e. the work specific to this routine has not been threaded), but part of the computation involves calls to routines of type 1 or 2 above. These are termed "indirectly threaded" routines.
It is informative for users to understand how OpenMP is used within the library in order to avoid the potential pitfalls which lead to making inefficient use of the library.
A call to a threaded NAG-specific routine may, depending on input and at one or more points during execution, use OpenMP to create a team of threads for a parallel region of work.
The team of threads will fork at the start of the parallel region before joining at the end of the parallel region.
Both the fork and the join will happen internally within the routine call (although there are situations in which the teams of threads may be made available to orphaned directives in user code via user-supplied subprograms, see
Section 2.2.3).
Furthermore, OpenMP constructs within NAG routines bind to teams of threads created within the NAG code (i.e. there are no orphaned directives).
For threaded NAG-specific routines all thread management is performed by the OpenMP run-time and NAG does not provide any extra threading controls or options.
Thus all OpenMP environment variables and function settings apply equally to calls to these NAG routines and to users' own parallel regions.
In particular, users should take care when calling these NAG routines from within their own parallel regions, since if nested parallelism is enabled (it is disabled by default) the NAG routine will fork-and-join a team of threads for each calling thread, which may lead to contention on system resources and very poor performance.
Poor performance due to contention can also occur if the number of threads requested exceeds that which the hardware is capable of supporting, or if some hardware resources are busy executing other processes (which may belong to other users in a shared system).
For these reasons users should be aware of the maximum number of threads supported in hardware and the workload of their machine, and use this information in selecting a number of threads which minimises contention on resources.
Please read the Users' Note for advice about setting the number of threads to use, or contact
NAG for advice.
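For example, one way to keep the thread count within the capabilities of the hardware is to query the OpenMP run-time from within the program; the sketch below uses standard OpenMP 3.0 functions, and setting the OMP_NUM_THREADS environment variable (as described in the Users' Note) achieves the same effect.
      SUBROUTINE CHOOSE_THREADS()
!     Sketch: request no more threads than the number of logical
!     processors reported by the OpenMP run-time. This setting applies
!     both to the caller's parallel regions and to threaded NAG routines.
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER :: NPROCS
      NPROCS = OMP_GET_NUM_PROCS()
      CALL OMP_SET_NUM_THREADS(NPROCS)
      END SUBROUTINE CHOOSE_THREADS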
The vendor library may handle parallelism slightly differently to the OpenMP specification.
This is true of MKL where, even though it is parallelised using OpenMP, MKL provides some mechanisms to override OpenMP and has different default behaviour.
For a full discussion of how MKL handles parallelism users are referred to the section "Improving Performance with Threading" in
Intel MKL.
In particular users should be aware that, by default (since MKL_DYNAMIC is true), if an MKL routine is called and the requested number of threads exceeds the number of physical cores, MKL will scale down the number of threads as appropriate for the system being used (this will be the number of physical cores for the host system, or the maximum number of threads supported in hardware for Xeon Phi coprocessors).
Furthermore, if an MKL routine is called from a parallel region and OMP_NESTED has been set to true, nested parallelism will not be enabled, since MKL settings override OpenMP settings.
In both cases it is possible to change this behaviour by setting MKL_DYNAMIC to false.
Thus a routine in the NAG library that calls through to MKL may use a different number of threads to a threaded NAG-specific routine in the same environment.
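As a sketch, MKL's dynamic adjustment can be switched off from within a program using the MKL service routines shown below (the MKL_DYNAMIC and MKL_NUM_THREADS environment variables have the same effect); the exact behaviour should be checked against the MKL documentation for the version in use.
      SUBROUTINE FIX_MKL_THREADS(NT)
!     Sketch only: MKL_SET_DYNAMIC and MKL_SET_NUM_THREADS are Intel MKL
!     service routines; see the Intel MKL documentation for details.
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: NT
      CALL MKL_SET_DYNAMIC(0)       ! same effect as MKL_DYNAMIC=false
      CALL MKL_SET_NUM_THREADS(NT)  ! same effect as MKL_NUM_THREADS=NT
      END SUBROUTINE FIX_MKL_THREADS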
Parallelism is used in many places throughout the library since, although many routines have not been the focus of parallel development by NAG, they may benefit by calling routines in the library that have, and/or by calling parallel vendor routines (e.g. BLAS, LAPACK).
Thus, the performance improvement due to multithreading, if any, will vary depending upon which routine is called, problem sizes and other parameters, system design and operating system configuration.
If you frequently call a routine with similar data sizes and other parameters, it may be worthwhile to experiment with different numbers of threads to determine the choice that gives optimal performance.
Please contact
NAG for further advice if required.
As a general guide, many key routines in the following areas are known to benefit from shared memory parallelism:
- Dense and Sparse Linear Algebra
- FFTs
- Random Number Generators
- Quadrature
- Partial Differential Equations
- Interpolation
- Curve and Surface Fitting
- Correlation and Regression Analysis
- Multivariate Methods
- Time Series Analysis
- Financial Option Pricing
- Global Optimization
- Wavelets
1.3 What kind of systems is the NAG Library for the Xeon Phi Coprocessor for?
As discussed in
Section 1 the NAG Library for the Xeon Phi Coprocessor is intended for use on a system consisting of two components: a host system and one or more Xeon Phi coprocessors.
The host system is a traditional shared memory machine, as targeted by the NAG Library for SMP & Multicore. The Xeon Phi coprocessor is a manycore coprocessor, a device which is separate to the host system with its own memory address space.
Programming an application to make use of the whole heterogeneous system (host system and Xeon Phi coprocessor together) can be achieved using Intel's Language Extensions for Offload (LEO), which are directives and run-time functions understood by the Intel compilers.
Under this heterogeneous model of execution an OpenMP program may be run as a single process on the host system, and certain sections of highly parallel work (code and data) are transferred automatically at run-time to the Xeon Phi coprocessor.
OpenMP may be used to express parallelism in the same way for both the host system and Xeon Phi coprocessor.
The transfer of data and work is called an
offload.
The LEO directives are used to mark which parts of a program should be offloaded.
It is possible to separate the offload of data from work, and to execute asynchronously on the host system whilst offloaded work is executing on the Xeon Phi coprocessor.
Users are referred to
Intel for details about LEO directives.
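As a minimal sketch (not taken from the library), the fragment below marks a block for offload to coprocessor 0; X is copied to the device, the OpenMP loop inside the block forks threads on the device, and Y is copied back to the host when the block completes.
      PROGRAM SIMPLE_OFFLOAD
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100000
      DOUBLE PRECISION   :: X(N), Y(N)
      INTEGER            :: I
      X = 1.0D0
!DIR$ OFFLOAD BEGIN TARGET(MIC:0) IN(X) OUT(Y)
!$OMP PARALLEL DO
      DO I = 1, N
         Y(I) = 2.0D0*X(I)
      END DO
!$OMP END PARALLEL DO
!DIR$ END OFFLOAD
      WRITE(*,*) 'Y(1) = ', Y(1)
      END PROGRAM SIMPLE_OFFLOAD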
It is also possible to use a Xeon Phi coprocessor as a stand-alone compute node.
This is called native execution.
Under this model binaries are cross-compiled on the host with the -mmic compiler flag, and will execute on the Xeon Phi coprocessor architecture only.
1.4 How does the NAG Library for the Xeon Phi Coprocessor make use of the Xeon Phi coprocessor?
The NAG Library for the Xeon Phi Coprocessor is an extension of NAG Library for SMP & Multicore with many routines specially tuned to make use of the extended parallelism available in the Xeon Phi coprocessor.
Other routines in the Library benefit from this increased performance by calling one or more of the tuned routines.
All routines in the NAG Library for the Xeon Phi Coprocessor can be executed on the host system or the Xeon Phi coprocessor.
Furthermore, for some routines, under certain conditions, calls made from the host will run heterogeneously: part of the internal computation is performed on the host system and some highly parallel work is automatically offloaded to the Xeon Phi coprocessor using LEO directives, returning to the host before the routine returns control to the user.
It is expected that the NAG Library for the Xeon Phi Coprocessor is being used in conjunction with a version of the Intel Fortran compiler that incorporates LEO directives for offloading to the Xeon Phi coprocessor.
For specific compiler version details see the Users' Note document shipped with the product.
Users who are either offloading using mechanisms different to Intel's LEO directives, or using a threading API other than OpenMP need to be aware of whether or not these are compatible with LEO and OpenMP. Users should contact
NAG for further advice if required.
For scalable, highly parallel regions of work within a routine it is decided at run-time whether or not the Xeon Phi coprocessor should be used for that work based on a measure of the problem size.
Since there is an overhead associated with offloading to the Xeon Phi coprocessor it is not profitable to offload small problems, but there may be a threshold problem size above which performance on the Xeon Phi coprocessor, taking the offload cost into account, is better than can be achieved on the host system.
See
Section 2.2 for more information about how these thresholds have been set and how to change the default behaviour.
2 How to Use the NAG Library for the Xeon Phi Coprocessor
2.1 Linking and Executing Your Code
If your code currently contains calls to NAG Fortran Library routines then for most parallelized routines it is a simple matter of re-linking your code to the NAG Library for the Xeon Phi Coprocessor (in place of the NAG Fortran Library) to benefit from the optimized performance (assuming use of the correct version of the Intel compiler, as discussed in Section 1.4).
Exceptions to this general rule, where code changes may be necessary, are noted in
Section 2.2.3 below.
Commands for linking to the NAG Library for the Xeon Phi Coprocessor may be found in the Users' Note.
Running heterogeneous executables linked to the NAG Library for the Xeon Phi Coprocessor is the same as running a host-only application.
However, it is important to set some environment variables, e.g. OMP_NUM_THREADS, separately for the host system and the Xeon Phi coprocessor.
For native executables only one set of environment variables is required on the Xeon Phi coprocessor, although it may be necessary to first copy the executable and any dependent shared libraries to the coprocessor before running.
The Users' Note contains specific advice for system settings and instructions on how to copy across native executables.
2.2 How to Maximize the Performance of Your Application
2.2.1 NAG Performance Parameters module
In order to take advantage of some of the scalable, highly parallel functionality in the NAG Library for the Xeon Phi Coprocessor work may be offloaded automatically from the host system to the Xeon Phi coprocessor.
For routines where this is possible some measure of problem size is used in order to decide at run-time if running on the host system or Xeon Phi coprocessor would produce the best performance.
The decision is based on a threshold problem size: if, in a call to a routine, the problem size is greater than the threshold, then the Xeon Phi coprocessor will be used.
The threshold problem sizes have been determined on a host system with a single Xeon processor and are contained in a module called nag_performance_parameters.
An alternative version of nag_performance_parameters based on a host system with two of the same Xeon processors is also provided, and may be linked-to instead.
Since this alternative nag_performance_parameters is based on a more powerful host the values of the threshold problem size parameters are in general larger than the defaults, and so offloading automatically will only happen for larger problems.
See the Users' Note documentation for more details about the particular processors used to set the values in the modules and how to link to the alternative module.
It should be noted that there is a higher cost associated with a program's first offload to the Xeon Phi coprocessor than later offloads.
The threshold problem sizes assume that the offload being performed within the NAG routine is not a program's first offload.
Since the threshold problem sizes are sensitive to the performance capabilities of the particular host system being used, it may be useful for users to be able to override the default behaviour for particular routines.
This is possible by setting default-kind integer
switch variables contained in the nag_performance_parameters module.
If a routine is capable of offloading, the names of any applicable switches are listed in Section 8 of the routine's documentation.
Each switch corresponds to a different potential offload within the NAG library.
If the value of a switch is 0 offloading at that point within the NAG library is always disabled.
If the value of a switch is 1 offloading is always enabled at that point.
If a switch has any other value offloading is automatic, based on the threshold problem sizes defined in the nag_performance_parameters module that has been linked-to.
By default all switches have the value -1, enabling automatic offloading.
Users are encouraged to benchmark routines of interest on their own systems with offloading both enabled and disabled in order to find out which option gives the best performance.
To change the values of switches USE the nag_performance_parameters module in your source code and assign to the appropriate integer switches documented for the routines you are using.
Please contact
NAG for further advice if required.
It is possible to disable ALL offloads within the NAG Library for the Xeon Phi Coprocessor by setting the default-kind integer variable nag_host_only (from nag_performance_parameters) to a value other than zero.
By default nag_host_only is zero.
If a NAG routine is called from within an OpenMP parallel region no offload will take place, irrespective of the value of routines' offload switches.
This avoids multiple threads all offloading to a single Xeon Phi coprocessor at the same time.
The conditions for offloading within the NAG library are therefore as follows:
- The routine called contains an offloadable region (check Section 8 of routine documentation), AND
- the routine has been called from the host system, AND
- the routine has not been called from an OpenMP parallel region, AND
- the nag_host_only variable is zero, AND
- the routine-specific offload switch is 1, OR (the routine-specific switch is not zero AND the size of the problem in the call to the routine is greater than the threshold problem size set by NAG in the nag_performance_parameters module).
As mentioned in
Section 1.2 a routine that is not specific to the NAG library may call through to the vendor library, MKL.
This is true of almost all of the BLAS and LAPACK routines in chapters F06, F07 and F08.
Routines in these chapters may be called using either the standard BLAS/LAPACK names (e.g. DGETRF) or the equivalent NAG name (e.g. F07ADF).
The only difference between using these two names is that, for some of the performance-critical routines in these chapters, the NAG name will use the same offloading scheme as described above for general NAG routines and there is a corresponding nag_performance_parameters offloading switch.
A call to the BLAS/LAPACK name will bypass the NAG library and is a direct call into MKL.
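For example, the LU factorization of a real matrix may be requested through either name with the same argument list; the sketch below (declarations and matrix set-up abbreviated) assumes a square N by N matrix.
!     Sketch: the NAG name may offload according to nag_performance_parameters;
!     the LAPACK name bypasses the NAG library and calls MKL directly.
      INTEGER, PARAMETER :: N = 2000, LDA = N
      DOUBLE PRECISION   :: A(LDA,N)
      INTEGER            :: IPIV(N), INFO
      ...
      CALL F07ADF(N, N, A, LDA, IPIV, INFO)    ! NAG name for DGETRF
!     CALL DGETRF(N, N, A, LDA, IPIV, INFO)    ! direct call into MKL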
Users should note that MKL provides a similar offloading scheme called Automatic Offload for a selection of routines.
This feature must be explicitly enabled by setting the environment variable MKL_MIC_ENABLE=1.
A key difference between MKL Automatic Offload and offloads within NAG routines is that it is possible for MKL to divide work between the host system and Xeon Phi coprocessor asynchronously (i.e. the host and Xeon Phi coprocessor can be used concurrently), whereas offloads within NAG routines are synchronous (i.e. the host system waits for any computation on the Xeon Phi coprocessor before continuing).
Users are referred to
Intel for more details of using Automatic Offload.
The nag_performance_parameters module also contains a default-kind integer variable called nag_mic_target, which identifies the Xeon Phi coprocessor to use for offloads in NAG routines in a multi-device setup.
Xeon Phi coprocessors are numbered from zero, and this value will be taken modulo the number of devices in the system.
This variable is provided in case users have their own offload regions that are running asynchronously with a call to a NAG routine, in order to allow users to ensure that the two potentially concurrent offloads are sent to different devices.
A consequence of offloading to the device identified by nag_mic_target is that, if the library is used on a system which does not have any Xeon Phi coprocessors, or the coprocessors are disabled, offloads internal to the library will fail at run-time.
In order to guard against this, users should set nag_host_only to a non-zero value on systems where offloading is not possible.
Note that all variables in nag_performance_parameters discussed above are global, which means that using different threads to change their values can lead to race conditions, and therefore incorrect results.
However, it is expected that if these variables are to be modified then assignments will take place during the initial (serial) phases of a program and will remain unchanged through the rest of the program (e.g. the variables could be set in a boilerplate section during initialization).
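For example, the controls described in this section might be assigned once in the serial initialization phase of a program, as in the sketch below; NAG_EXAMPLE_SWITCH is a hypothetical name standing in for one of the routine-specific switches listed in Section 8 of the relevant routine documents.
!     Sketch of setting offload controls during serial initialization.
      USE NAG_PERFORMANCE_PARAMETERS, ONLY: NAG_HOST_ONLY, NAG_MIC_TARGET
      ...
!     Either run everything on the host (disable all offloads in the library):
      NAG_HOST_ONLY = 1
!     or keep offloading enabled and direct NAG offloads to device 1:
!     NAG_HOST_ONLY = 0
!     NAG_MIC_TARGET = 1
!     A routine-specific switch (hypothetical name) could also be forced on:
!     NAG_EXAMPLE_SWITCH = 1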
2.2.2 Data persistence and asynchronous use of the Xeon Phi coprocessor
All routines in the NAG Library for the Xeon Phi Coprocessor can be called either on the host system or on the Xeon Phi coprocessor.
Thus, in a heterogeneous program a routine may be called on the Xeon Phi coprocessor from within a user-created offload region.
Using the Intel LEO directives opens the possibility of using the host system and Xeon Phi coprocessor asynchronously and minimising the number of data transfers.
For example:
1 USE NAG_PERFORMANCE_PARAMETERS, ONLY: NAG_MIC_TARGET
2 INTEGER :: S1, S2
3 ...
4 !DIR$ OFFLOAD_TRANSFER TARGET(MIC:NAG_MIC_TARGET) IN(A: ALLOC_IF(.TRUE.) FREE_IF(.FALSE.)) SIGNAL(S1)
5 CALL HOST_WORK1(X,Y,Z)
6 !DIR$ OFFLOAD BEGIN TARGET(MIC:NAG_MIC_TARGET) IN(M,N,LDA,NWCM) OUT(CA,CH,CV,CD) INOUT(ICOMM,IFAIL) NOCOPY(A: ALLOC_IF(.FALSE.) FREE_IF(.FALSE.)) WAIT(S1) SIGNAL(S2)
7 ...
8 CALL C09EAF(M,N,A,LDA,CA,NWCM,CH,NWCM,CV,NWCM,CD,NWCM,ICOMM,IFAIL)
9 ...
10 !DIR$ END OFFLOAD
11 CALL HOST_WORK2(X,Y,Z)
12 !DIR$ OFFLOAD_WAIT TARGET(MIC:NAG_MIC_TARGET) WAIT(S2)
13 !DIR$ OFFLOAD BEGIN TARGET(MIC:NAG_MIC_TARGET) IN(X,Y,Z) OUT(A: ALLOC_IF(.FALSE.) FREE_IF(.TRUE.))
14 ...
15 !DIR$ END OFFLOAD
Here the array A is copied to the Xeon Phi coprocessor on line 4.
The directive specifies that memory is allocated for A on the device and not freed at the end of the transfer.
The transfer is also asynchronous due to the use of a signal clause, which means that the call to host_work1 on line 5 can happen concurrently with the transfer.
The offload region between lines 6 and 10 involves a call to a NAG routine.
This offload transfers some data, but not the array A.
It waits for the offload_transfer of A to complete and is itself asynchronous with the host system, using signal s2, which means that the call to host_work2 on line 11 may run concurrently on the host system.
Once host_work2 has returned the host system waits for the offload region to complete on line 12 and then on line 13 initiates another offload that uses A.
This offload is synchronous (note the absence of a signal clause) and on return A is transferred back to the host and the memory used for it on the device is freed.
For more details about the use of LEO directives see
Intel.
2.2.3 Advice by Chapter
In addition to the above advice there are specific areas of the NAG Library for the Xeon Phi Coprocessor that require further guidance, please see the following sections.
2.2.3.1 Fast Fourier Transforms (Chapter C06)
Where possible the FFT routines in Chapter C06 use the Intel MKL FFT routines for optimal performance.
2.2.3.2 Quadrature (Chapter D01)
The performance of the quadrature routines in Chapter D01 depends upon the nature of the user-supplied function that calculates the value of the integrand at a given point, and on other problem parameters such as the relative accuracy required. Parallelism may not be beneficial for all problems; in particular, the parallelism in D01GAF is only suitable for problems with a large number of data points.
2.2.3.3 Partial Differential Equations (Chapter D03)
D03RAF and D03RBF require a user-supplied routine PDEDEF to evaluate the functions Fj, for j = 1, 2, ..., NPDE. The parallelism within D03RAF and D03RBF will be more efficient if PDEDEF can also be parallelized. This is often the case, but you must add some OpenMP directives to your version of PDEDEF to implement the parallelism. For example, the body of code from the first test case in the document for D03RAF is
DO 20 I = 1, NPTS
RES(I,1) = UT(I,1) - DIFF*(UXX(I,1)+UYY(I,1)) - &
D*(1.0D0+ALPHA-U(I,1))*EXP(-DELTA/U(I,1))
20 CONTINUE
This example can be parallelized, as the updating of RES in each iteration of the loop over I = 1, ..., NPTS is independent of every other iteration. Thus this should be parallelized in OpenMP as follows
C$OMP DO
DO 20 I = 1, NPTS
RES(I,1) = UT(I,1) - DIFF*(UXX(I,1)+UYY(I,1)) - &
D*(1.0D0+ALPHA-U(I,1))*EXP(-DELTA/U(I,1))
20 CONTINUE
C$OMP END DO
Note that the OpenMP PARALLEL directive must not be specified, as the OpenMP DO directive will bind to the PARALLEL region within the D03RAF or D03RBF code. Also note that this assumes the default OpenMP behaviour that all variables are SHARED, except for loop indices, which are PRIVATE.
To avoid problems for existing library users, who will not have specified any OpenMP directives in their PDEDEF routine, the default assumption of D03RAF and D03RBF is that PDEDEF has not been parallelized, and they execute calls to PDEDEF in serial mode. If you have parallelized PDEDEF you must indicate this by adding 10 to the normal value of the argument IND passed to D03RAF or D03RBF. Thus, in the NAG Library for the Xeon Phi Coprocessor only, the following values may be specified for IND:
- IND=0
- Starts the integration in time. PDEDEF is assumed to be serial.
- IND=1
- Continues the integration after an earlier exit from the routine. In this case, only the following parameters may be reset between calls to D03RAF or D03RBF: TOUT, DT, TOLS, TOLT, OPTI, OPTR, ITRACE and IFAIL. PDEDEF is assumed to be serial.
- IND=10
- Starts the integration in time. PDEDEF is assumed to have been parallelized by you, as described above. In all other respects, this is equivalent to IND=0.
- IND=11
- Continues the integration after an earlier exit from the routine. In this case, only the following parameters may be reset between calls to D03RAF or D03RBF: TOUT, DT, TOLS, TOLT, OPTI, OPTR, ITRACE and IFAIL. PDEDEF is assumed to have been parallelized by you, as described above. In all other respects, this is equivalent to IND=1.
Constraint: 0 ≤ IND ≤ 1 or 10 ≤ IND ≤ 11.
On exit: IND = 1 if IND on input was 0 or 1, or IND = 11 if IND on input was 10 or 11.
If the code within PDEDEF cannot be parallelized, you must not add any OpenMP directives to your code, and must not set IND to 10 or 11. If IND is set to 10 or 11 and PDEDEF has not been parallelized, results on multiple threads will be unpredictable and may give rise to incorrect results and/or program crashes or deadlocks. Please contact NAG for advice if required. Overloading IND in this manner is not entirely satisfactory; consequently it is likely that replacement interfaces for D03RAF and D03RBF will be included in a future NAG Library release.
Modified example programs for
D03RAF and
D03RBF, which include parallel versions of the PDEDEF routines, are included in the distribution material for each implementation of the NAG Library for the Xeon Phi Coprocessor.
2.2.3.4 Global Optimization of a Function (Chapter E05)
Users of the Particle Swarm Optimization (PSO) routines in the NAG Library for the Xeon Phi Coprocessor need to be aware of several additional parallel-specific options that they must set; in particular, they must state whether or not they have ensured that the user functions they provide to these routines are implemented in a thread-safe manner. See the routine documents for
E05SAF and
E05SBF for details.
2.2.3.5 BLAS (Chapter F06)
Where possible the BLAS routines in Chapter F06 use the Intel MKL BLAS routines for optimal performance. Use the NAG names for the level 3 BLAS routines to allow offloading based on the values of variables in nag_performance_parameters, as discussed in Section 2.2. Use the standard BLAS names to call MKL directly, optionally making use of Automatic Offload.
The MKL implementation of the standard BLAS error handling routine, XERBLA, does not stop when an input parameter has an invalid value, and this behaviour is inherited by the routines in Chapter F06. Users should note that not stopping is counter to the recommended BLAS behaviour and to that documented in the F06 Chapter Introduction.
2.2.3.6 LAPACK, Linear Equations (Chapter F07)
Where possible the LAPACK routines in
Chapter F07 use the Intel MKL LAPACK routines for optimal performance. Use the NAG names for LAPACK routines to allow offloading based on the values of variables in nag_performance_parameters as discussed in
Section 2.2. Use the standard LAPACK names to call MKL directly, optionally making use of Automatic Offload.
2.2.3.7 LAPACK, Least Squares and Eigenvalue Problems (Chapter F08)
Where possible the LAPACK routines in
Chapter F08 use the Intel MKL LAPACK routines for optimal performance. Use the NAG names for LAPACK routines to allow offloading based on the values of variables in nag_performance_parameters as discussed in
Section 2.2. Use the standard LAPACK names to call MKL directly, optionally making use of Automatic Offload.
2.2.3.8 Sparse Iterative Solvers (Chapter F11)
When running the sparse iterative solvers with preconditioning on multiple processors, it may be beneficial to reduce the action of the preconditioner, e.g., by decreasing
LFILL, or by increasing
DTOL with
LFILL<0 in
F11DAF or
F11JAF. This will tend to increase the number of iterations required to obtain a converged solution, but will also allow a greater percentage of the computational work to be spent in the parallelized iterative solvers, resulting in a lower overall time to solution. There is unfortunately no choice of the various preconditioner parameters which is optimal for all types of matrix, and all numbers of processors, and some experimentation will generally be required for each new type of matrix encountered.
2.2.3.9 Quasi-random number generators (Chapter G05)
The Sobol, Sobol (A659) and Niederreiter quasi-random number generators in
G05YMF have been parallelized, but require quite large problem sizes to see any significant performance gain. Parallelism is only enabled for the
RCORD=2 option.
3 References
OpenMP
The OpenMP Specification for Parallel Programming http://www.openmp.org