Periodically a new **Mark** of the NAG Library is released: new routines are added, corrections and/or improvements are made to existing routines; and, occasionally, routines are withdrawn or deprecated if they have been superseded by improved routines. A point release such as 26.1, only differs from a new Mark (e.g., 27) in that no routines are withdrawn or deprecated.

You must know **which implementation**, and **which mark and revision** of the Library you are using or intend to use. To find out these Library details, you can run a program which calls the NAG Library NAG FL Interface routine a00aaf or the NAG CL Interface routine a00aac.

In Fortran the program could be:

```
Use nag_library, Only: a00aaf
Call a00aaf
End
```

In C, using the NAG FL Interface, the program could be:

```
#include <nag.h>
void main(void) {
a00aaf_();
}
```

or, using the NAG CL Interface, the program could be:
```
#include <nag.h>
void main(void)
{
a00aac();
}
```

Alternatively, the example program for a00aaf or a00aac can be run using the nag_example scripts supplied with your implementation (see the Users' Note for details).

An example of the output is:

*** Start of NAG Library implementation details *** Implementation title: Linux, 64-bit, GNU C or NAG Fortran Precision: double precision Product Code: NLL6I27D9L Mark: 27.0.0 (self-contained) *** End of NAG Library implementation details ***

Details can also be found by a call to a00adf.

The Library is divided into **chapters**, each chapter is devoted to a branch of numerical analysis or statistics and contains any number of individual routines.

Each chapter has a three-character name (where the second and third characters are digits) and a title,
e.g.,

- Chapter D01 –
**Quadrature**

Exceptionally, Chapters H and S have one-character names. The chapters and their names are based on the ACM modified SHARE classification index (see ACM (1960–1976)).

Each chapter also has an associated descriptive single word that is used as the second word in a routine long name. For example Chapter C05 – **Roots of One or More Transcendental Equations** also has the long name **roots**.

All documented routines in the Library have a lower case five-character name, beginning with the characters of the chapter name and suffixed with with a single character or string of characters to identify their interface/variant, e.g.,

- c05ayf – NAG FL Interface
- c05ayc – NAG CL Interface
- c05ay_a1w_f – NAG AD Library

Each routine also has a more meaningful, longer name, e.g.,

- nagf_roots_contfn_brent – NAG FL Interface
- nag_roots_contfn_brent – NAG CL Interface
- nagf_roots_contfn_brent_a1w – NAG AD Library

Each long name begins with the root, e.g., **nagf_** or **nag_** and consists of an underscore separated list of words. The long name naming scheme has been chosen so that the long names group like routines together and group routines within a suite together.

The long name for each routine in a chapter is listed in the respective Chapter Contents page. The second word in the long name uses the chapter long name (see Section 3.1.2).

Prior to Mark 26, the NAG CL Interface used a different form of the long name that did not always use the same prefix for all functions in a chapter. These older names are still defined via the C header file nag.h, and documented in the specification section of the routine documentation if this differs from the current naming convention. See for example g02aac which has the older name **nag_nearest_correlation** in addition to the name following the convention above, **nag_correg_corrmat_nearest**.

Note that the long names of BLAS and LAPACK routines, such as nagf_blas_dgemm, will not take advantage of machine-specific versions of BLAS and LAPACK. As mentioned in Section 3.2.3 you are recommended to use the plain BLAS or LAPACK name (in this case dgemm) for performance reasons.

In the NAG FL Interface, an ‘a’ version of a routine is always paired with an ‘f’ routine, the ‘a’ version being safe to use in a multithreaded environment, but otherwise having identical functionality to the ‘f’ version. For those chapters that have both ‘a’ and ‘f’ versions of a routine, the long name for the ‘f’ version is the same as that of the ‘a’ version, but with an additional last word (**old** – signifying that the ‘f’ version predates the ‘a’ version).

It should be noted that the long names are implemented by the use of aliasing in the NAG Library interface block modules, for Fortran, or in the C header files, and so when using Fortran, long names are only accessible when calling the NAG Library from a Fortran program that uses nag_library.mod.

Please refer to Section 3.1 in the Introduction to the NAG Library FL Interface for advice on supplying alternative routine names and, possibly, simplified routine interfaces.

The documention may show either long or short names, see Section 5 in the Guide to the NAG Library Documentation.

Chapter F06 (Linear Algebra Support Routines) contains all the Basic Linear Algebra Subprograms, BLAS (Dongarra et al. (1988) and Dongarra et al. (1990)), with NAG-style names, e.g., f06paf, as well as the actual BLAS names, e.g., dgemv. Chapter F16 contains some of the routines specified in the BLAS Technical Form (The BLAS Technical Forum Standard and Blackford et al. (2002)) and also some additional routines for integer valued vectors that are not in the standard. Some of the routines in Chapter F16 have both NAG-style names and BLAS names. Chapter F07 (Linear Equations (LAPACK)) and Chapter F08 (Least Squares and Eigenvalue Problems (LAPACK)) contain routines derived from the LAPACK project (Anderson et al. (1999)); also, Chapter F01 (Matrix Operations, Including Inversion) contains storage conversion routines derived from the LAPACK project. Like the BLAS, these routines have NAG-style names, e.g., f07adf, as well as LAPACK names, e.g., dgetrf. Details regarding these alternative names can be found in the relevant Chapter Introduction.

In order to take full advantage of machine-specific versions of BLAS and LAPACK routines provided by some computer hardware vendors, you are encouraged to use the BLAS and LAPACK names (e.g., dgemv and dgetrf) rather than the corresponding NAG-style names (e.g., f06paf and f07adf) wherever possible in your programs.

Some routines in the Library may be classified as experimental. These routines will be flagged by a note at the top of the documentation.

Routines may be classified as experimental if:

- (a)The interface and/or functionality of the routine may change between Marks of the Library. The routine will have been designed in such a way as to minimize the number of such changes.
- (b)The complexity of the routine, or suite of routines it is part of, is such that comprehensive testing is not practical. Routines classified as experimental have gone through the same testing and review processes as a normal Library routine, however it is recommended that additional care is taken when using the routine.

A routine classified as experimental may be reclassified as no longer being experimental, at which point the interface will become fixed as per a normal Library routine.

A NAG Library routine **cannot** be guaranteed to return meaningful results irrespective of the data supplied to it. Care and thought **must** be exercised in:

- (a)formulating the problem;
- (b)programming the use of Library routines;
- (c)assessing the significance of the results.

The documentation supporting the available interfaces is designed with the assumption that you will write a calling program in the appropriate langauge and that you have sufficient knowledge of the programming language to be able to do so.

When programming a call to a routine, read the routine document carefully, especially the description of the **arguments**. This states clearly which arguments must have values assigned to them on entry to the routine, and which return useful values on exit.

The most common types of programming error in using the Library are:

- incorrect arguments in a call to a Library routine;
- calling the Library from a single precision program.

The Introduction to the NAG Library FL Interface, the Introduction to the NAG AD Library and the Introduction to the NAG Library CL Interface provide programming advice which is interface specific.

Routines in the Library that require a user-supplied function may be classified as either direct communication or reverse communication.

Direct communication routines require a user-supplied subroutine to be provided as an actual argument to the NAG Library routine. You must write this subroutine using a very rigid interface as specified in the relevant routine document. For the majority of applications this is the simplest and most convenient usage. Sometimes however this approach can be restrictive:

- (i)when the required format of the subroutine does not allow useful information to be passed conveniently to and from your calling program;
- (ii)when the direct communication routine is being called from another computer language which does not fully support procedure arguments in a way that is compatible with the Library.

These restrictions can be removed by using a reverse communication routine. Instead of obtaining the solution in one call, reverse communication routines perform one step of the solution process before returning to the calling program with an appropriate flag (**irevcm**) set. The value of **irevcm** determines whether the process has finished or whether fresh information is required. In the latter case the required information must be calculated before re-entering the reverse communication routine. Thus you have the responsibility for providing an iterative loop. Although reverse communication routines will typically be more complicated to use than direct communication equivalents they do provide greater flexibility for the evaluation of the function.

The results obtained when calling a NAG Library routine depend not only on the algorithm used to solve the problem, but also on the compiler used to build the Library, compiler run-time libraries and also the arithmetic properties of the machine on which the code is run.

Historically, different kinds of computer hardware tended to have different kinds of arithmetic. Some machines would store floating-point numbers using a base 16 significand and exponent system, others would use base 2, and some even used base 8 or 10. Such differences caused major headaches for software library providers because code that worked well on one arithmetic system might not behave in exactly the same way on another. This meant that great care had to be taken to make the Library code **portable**.

In addition, it was not unheard of for machine arithmetic to have flaws or errors where basic operations such as multiplication or division could sometimes give incorrect results, especially on numbers that were in some way ‘extreme’, such as being very large or small.

After the first of the IEEE standards for floating-point arithmetic (ANSI/IEEE (1985)) was introduced in the 1980s, the situation improved greatly. Nowadays most significant hardware, and certainly most hardware that NAG libraries run on, will use IEEE-style base 2 arithmetic. This makes production of portable code easier, but there are still problems, partly due to the latitude allowed by the IEEE standards. For example, hardware which uses extra-precise 80-bit internal registers for arithmetic, as originally introduced in the Intel 8087 coprocessor in the 1980s, behaves slightly differently from hardware that uses 64-bit registers, particularly if a compiler generates optimized code which holds arithmetic subexpressions in the extra-precise registers.

Since, for performance reasons, computer arithmetic is generally finite precision (as is certainly the case for IEEE standard arithmetic) most of the numerical methods implemented by NAG Library routines can only return an approximation to the true solution, simply due to the accumulation of rounding errors.

It should therefore be clear that running a program which calls a NAG Library routine with the same data on two different machines can give different results, due to compiler, hardware and run-time library considerations. Usually these differences are small – it may be that a result computed on one machine differs only in the last few significant bits from the same result computed on another machine – for example, when solving a well-conditioned set of linear equations on two different machines. Occasionally small differences may be magnified, for example if a conditional test depends on an imprecise result. A routine that searches for a mininum of an optimization problem may converge to a different local minimum, but in general, so long as the routine's documentation doesn't claim that the **same** local minimum will always be obtained, this should be acceptable. Even if an algorithm converges to the same local minimum, arithmetic differences may mean that a different number of iterations is taken to get there.

Modern hardware and optimizing compilers have introduced further scope for arithmetic quirks. An example is in the use of **Streaming SIMD Extension (SSE)** instructions. These low-level machine instructions allow hardware to operate on more than one number in parallel, if your compiler is smart enough to generate and use them correctly, or if you hand-code your own assembly language routines.

SSE instructions enable low-level parallelism of floating-point arithmetic operations. For example, a 128-bit SSE register can hold two 64-bit double precision (or four 32-bit single precision) numbers at the same time, and operate on them all simultaneously. This can lead to big time savings when working on large amounts of data.

But this may come at a price. Efficient use of SSE instructions can sometimes depend on exactly how the memory used to store data is aligned. Some SSE instructions for moving data to and from memory need memory to be aligned on a 16-byte boundary. If it happens that the memory (for example, a pointer to an array of numbers) that a NAG routine uses is **not** aligned nicely, then it may not be possible to use those SSE instructions.
An optimizing compiler might well generate two instruction streams, one for when it detects that memory is aligned and one for when it is not.

An example should serve to make things clearer. Suppose we wish to compute the inner product of two vectors, **x** and **y**, each of length **n**. The inner product (or dot product) of two vectors is computed by multiplying together corresponding elements of the two vectors, and summing the individual products to get the result. A routine compiled by a good optimizing compiler would load numbers two or four at a time, multiply them together two or four at a time, and accumulate the results into the final result.

But if the memory is not nicely aligned – and it may well not be – the compiler needs to generate a different code path to deal with the situation. Here the result will take longer to get because the products must be computed and accumulated one at a time. At run-time, the code checks whether it can take the fast path or not, and works appropriately.

The problem is that by altering the order of the accumulations, we are quite possibly changing the final result, simply due to rounding differences when working with finite precision computer arithmetic. Instead of getting the inner product

we may get

$$s={x}_{1}\times {y}_{1}+{x}_{2}\times {y}_{2}+{x}_{3}\times {y}_{3}+\cdots +{x}_{n}\times {y}_{n}$$ |

$$s=\left({x}_{1}\times {y}_{1}+{x}_{3}\times {y}_{3}\right)+\left({x}_{2}\times {y}_{2}+{x}_{4}\times {y}_{4}\right)+\cdots \text{.}$$ |

It is likely that the result will be just as accurate either way – neither result will be precise due to finite arithmetic – but they may differ by a tiny amount. And if that tiny difference leads to a different decision being made by the code that called the inner product routine, the difference may be magnified.

Furthermore, it is possible that the same program running with bitwise identical data on the same machine may give different results when run twice in a row simply because, when the program is loaded, by chance some piece of memory may or may not be aligned on a particular boundary. Such non-deterministic results can be frustrating if you depend on always getting identical results for the same data.

On even newer hardware, **AVX** instructions use 256-bit and 512-bit registers, and can therefore operate on more numbers at a time. For AVX instructions, memory may need to be 32-byte aligned.

Some memory used by NAG Library routines is allocated inside the NAG Library. In order to minimize differences due to effects like that described above, we can try to make sure the memory is always aligned nicely – for example, by use of more controllable memory allocation routines where available – but that is not always possible since it partly depends on the support of the compiler.

Of course, no Library routine has control over memory you have allocated before being passed to the routine. If you do observe non-deterministic results which you suspect are due to memory considerations, and you are unable to accept this variation, then you are advised to make sure that any memory you allocate is aligned nicely; unfortunately, precisely how you do this is dependent on your system, but you may be able to get advice through NAG's usual support channels (see NAG Technical Support Service
).

Parallelism, coming from a multithreaded implementation of the NAG Library and/or a multithreaded vendor library is another potential source of non-determinism in numerical results. Some routines may give different results when run on different numbers of cores, or even different results when a calculation is repeated on the same number of cores. Where reproducibility of results is vital, a purely serial NAG Library, without parallelism in either NAG routines or calls to parallel vendor library routines, will generally be available in an appropriate implementation, and may be the best choice. You are advised to contact NAG (see NAG Technical Support Service
) for advice.

Mathematical operations on fixed-length floating point numbers (e.g., 32-bit floats or 64-bit doubles) are not associative. This means that a computer may produce different results for $a+\left(b+c\right)$ and $\left(a+b\right)+c$. For example, an IEEE 754 32-bit floating point number has a mantissa of $23$ bits. Therefore in this number format ${2}^{24}+1={2}^{24}$, which means that for instance $\left({2}^{24}+1\right)-{2}^{24}=0$ while ${2}^{24}+\left(1-{2}^{24}\right)=1$. BWR is a term which refers to the case in which a given computer program (e.g., a set of source codes) produces bit-for-bit the exact same answer in different computing environments such as

- 1.Different operating systems (e.g., answers produced on Windows vs answers produced on Linux).
- 2.Different CPU architectures (e.g., Intel vs AMD or Intel Sandy Bridge vs Intel Ivy Bridge etc.).
- 3.Different compiler versions.
- 4.Different numbers of threads.

Users often desire BWR, however, it is extremely difficult to achieve. Typically you should ensure that:

- (a)Instructions are always executed in exactly the same order.
- (b)No advanced CPU features are used which may not be available on other processors (e.g., SSE3, SSE4, AVX).
- (c)A fixed number of threads is always used.

Often condition (a) is equivalent to compiling with no (or very limited) compiler optimizations, since newer versions of compilers typically improve their code optimization algorithms, which means one version of a compiler may optimize a set of operations one way while the next version may optimize it a different way. Condition (b) typically means that only basic SSE instructions are allowed, such as are supported across the widest range of processors and the enhanced SIMD instructions present in newer processors are not exploited.

The result is that to achieve BWR across a wide range of computing environments you often have to sacrifice a lot of performance.

An implementation of the NAG Library that is not self-contained will make calls to an appropriate vendor library containing, in particular, high performance linear algebra routines. The NAG Library has no direct control over BWR with respect to results obtained from calls to the vendor library. However, for at least one such vendor library, CBWR has been introduced such that if an environment variable is set and a set of conditions adhered to in the code calling the vendor library then BWR can be forced. Where CBWR is available for a vendor library used by an implementation of the NAG Library, details will be given in the Users' Note for that implementation.

It should be noted that many NAG routines do not adhere to the conditions set out by vendor library CBWR and so it may not be possible to ensure BWR for all NAG Library routines across different CPU architectures for implementations that are not self-contained.

ACM (1960–1976) Collected algorithms from ACM index by subject to algorithms

Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J J, Du Croz J J, Greenbaum A, Hammarling S, McKenney A and Sorensen D (1999) *LAPACK Users' Guide* (3rd Edition) SIAM, Philadelphia https://www.netlib.org/lapack/lug

ANSI/IEEE (1985) IEEE standard for binary floating-point arithmetic *Std 754-1985* IEEE, New York

Dongarra J J, Du Croz J J, Hammarling S and Hanson R J (1988) An extended set of FORTRAN basic linear algebra subprograms *ACM Trans. Math. Software* **14** 1–32