Using the NAG Toolbox for MATLAB - Part 3

Previous articles in this series are [site:url]IndustryArticles/usingtoolboxmatlab.asp and [site:url]IndustryArticles/usingtoolboxmatlabpart2.asp; in this note, we continue our exploration of the Toolbox, demonstrating how it allows users to call any NAG Library routine from within MATLAB, and use MATLAB's plotting facilities to view the results.

Note: The code examples in this article have been extracted from demo scripts, and will not necessarily work properly if cut and pasted from this page into MATLAB. The full version of the scripts, which have been used to make the figures in this article, are available in this archive.

The G13 chapter - Time series analysis

A time series is a set of observations of some time-dependent process, collected at various points in time. The G13 chapter of the NAG Library contains several routines for investigating and modelling the statistical structure of time series; the models constructed by these routines may then be used to better understand the data, or to create forecasts (i.e. predictions of future behaviour) from the series. For example, a so-called autoregressive integrated moving average (ARIMA) model can be fitted to the series - see below.

One way of initially characterising a time series is to calculate its autocorrelation function, which describes the correlation (or degree of dependence) that exists between the behaviour of the underlying process at different points in time. The separation between the different times is called the lag, and the autocorrelation function is usually expressed as a set of autocorrelation coefficients, for different values of the lag. The routine g13ab can be used to compute this, along with more elementary statistical quantities such as the mean and variance. Here's the code:

   
% Here's the time series data (this comes from
% sunspot readings).
x = [5;  11;  16;  23;  36;
    58;  29;  20;  10;   8;
     3;   0;   0;   2;  11;
    27;  47;  63;  60;  39;
    28;  26;  22;  11;  21;
    40;  78; 122; 103;  73;
    47;  35;  11;   5;  16;
    34;  70;  81; 111; 101;
    73;  40;  20;  16;   5;
    11;  22;  40;  60;   80.9];

% nk is the number of lags for which the autocorrelations
% are required.
nk = nag_int(40);

% Call the NAG routine.
[xm, xv, coeff, stat, ifail] = g13ab(x, nk);

Because lag is a discrete variable, the autocorrelation function is best displayed as a histogram (sometimes called an autocorrelogram in this context), as in this picture:


Figure 1: Computing the autocorrelation function of a time series.

The autocorrelation function contains both quantitative and qualitative information about the time-dependence of the underlying process; in this example, the period of the oscillations indicates a seasonality of around 11 units. In addition, the shape of the autocorrelation plot can be used to give some indication of suitable model parameters when fitting an ARIMA model to the time series. The curve should tail off quickly to zero; failure to do so, as in Figure 1, may indicate that the series is non-stationary, which necessitates further treatment. If the correlation is high for the first few lags and then quickly tails off, it suggests a so-called moving average (MA) series, whilst a sinusoidal shape is often associated with an autoregressive (AR) series. In many cases, a full ARIMA model (i.e. one with both AR and MA components) is required to fit the series.

Besides the autocorrelation function, additional insight may be obtained from a plot of the partial autocorrelation function; this can be produced by calling g13ac in place of g13ab in the code fragment above.

The MATLAB script for this demo is available as the file NAGToolboxDemos/Time_series_analysis/g13ab_demo.m, distributed in this archive.

The D01 chapter - Quadrature

The numerical evaluation of definite integrals in one or more dimensions is a commonly encountered task in analysis. Routines from the D01 chapter provide a variety of algorithms for use in this area; the applicability of individual routines depends on the form of the integrand. If its functional form is known analytically, then d01aj is most generally suited, and can be used when the integrand contains singularities, especially of an algebraic or logarithmic type. Other, more specialized, routines include d01ak if the integrand is oscillatory, and d01al if it exhibits discontinuities at known points.

The algorithm implemented by d01aj is adaptive - that is, it divides the interval over which the function is to be integrated into a set of subintervals, which are in turn subdivided until some accuracy condition (specified by the variables epsabs and epsrel in the fragment below) is met.

Here is the code which calls the routine to calculate the integral.

   
epsabs = 0.0;
epsrel = 0.000001;

% lw & liw are the dimensions of the w and iw output arrays.  lw
% should be between 800 & 2000; liw should be a quarter of that.
lw = nag_int(1000);
liw = lw/4;
iw = zeros(liw, 'int32');
iw = nag_int(iw);
w = zeros(lw, 'double');

% Call the NAG routine, and check for errors.
[result, abserr, w, iw, ifail] = d01aj(func, a, b, epsabs, epsrel);
if ifail ~= 0
    disp([' Non-zero ifail after d01aj call ', num2str(ifail)]);
    return;
end;

In this fragment, the variable func, passed to d01aj, is a string containing the name of a MATLAB m-file; this file must contain a function which returns the value of the integrand at a given point. Here are the contents of an example file of this type:

   
function [result] = myfunc(x)
    result = x^2*abs(sin(5*x));

(this is the integrand used in figure 2, below). In addition to the approximation of the integral value (result), d01aj's output contains the specification of the the final set of subintervals, along with the error associated with each one (some manipulation is required to get these arrays into a form that MATLAB's plotting routines can use). Figure 2 shows results for the contributions from each subinterval to the integral, and the associated errors. It can be seen that, for this integrand, the algorithm has used narrower subintervals in regions where the function is changing rapidly; the subintervals associated with (relatively) large errors are those whose width is very small indeed (near where the function goes to zero).


Figure 2: Calculating an approximation to the integral of a function.

The MATLAB script for this demo is available as the file NAGToolboxDemos/Quadrature/d01aj_demo.m, distributed in this archive.

The G03 chapter - Multivariate Methods

A multivariate data set contains several variables measured for a number of objects. Examples of such data sets arise in all branches of science, and the routines in the G03 chapter of the NAG Library can be used to study them. For example, environmental scientists who want to find out how many species of water vole (genus Arvicola) there are in the UK have made observations of 300 vole skulls, looking at the presence or absence of 13 characteristic measurements. Each observation was made in one of 14 geographical regions, distributed between the UK and continental Europe. The data from Europe is already classified into two species (A. terrestis and A. sapidus); and the scientist's task is to determine to which species the UK data belongs.

The treatment of the data starts by averaging the measurements within each region, giving 14 observations, each of 13 variables. This can be thought of as 14 points in 13-dimensional space, and a standard way of analysing such a data set is principal components analysis (as offered by, for example, the g03aa routine). This is a method to reduce the dimensionality of the data set to some smaller value; the structure of the derived points within this lower dimensional space can be considered in place of the original set of points, as long as they are adequately represented by the derived set. For the vole data set however, consideration of the first three principal components only explains 65% of the variance in the original data, which is insufficient.

An alternative technique which can be used in such circumstances is known as metric scaling. Here, the first step is to construct a 14 by 14 dissimilarity matrix whose elements are the distances between each pair of points in the original 13-dimensional space. The g03ea routine calculates the dissimilarity matrix. Here's the code:

   
% These are the region labels.
label = char( 'Surrey', 'Shropshire', 'Yorkshire', ...
'Perthshire', 'Aberdeen', 'Eilean Gamhna', 'Alps', ...
'Yugoslavia', 'Germany', 'Norway', 'Pyrenees I', ...
'Pyrenees II', 'North Spain', 'South Spain' );

% This is the vole data.  14 rows (one for each region),
% 13 columns (one for each variable).
data0 = [
   [48.5  89.2   7.9  42.1  92.1 100.0 100.0 ...
    35.3  11.4 100.0  71.9  31.6   2.8];
   [67.7  67.0   2.0  23.0  93.0 100.0  86.0 ...
    44.0  14.0  99.0  97.0  31.0  17.0];
   [51.7  84.8   0.0  31.3  88.2 100.0  94.1 ...
    18.8  25.0 100.0  83.3  33.3   5.9];
   [42.9  50.0   0.0  50.0  77.3 100.0  90.9 ...
    36.4  59.1 100.0 100.0  38.9   0.0];
   [18.1  79.6   4.1  44.9  79.6 100.0  77.6 ...
    16.7  37.1 100.0  90.4   9.8   0.0];
   [65.0  81.8   9.1  31.8  81.8 100.0  59.1 ...
    20.0  30.0 100.0 100.0   5.0   9.1];
   [57.1  76.2  21.4  38.1  66.7  97.6  14.3 ...
    23.5   9.5 100.0  91.4  11.8  17.5];
   [26.7  53.1  23.5  38.2  44.1  94.1  11.8 ...
    11.8  18.2 100.0  94.9  12.5   5.9];
   [38.5  67.9  17.9  21.4  82.1 100.0  60.7 ...
    35.7  24.0 100.0  91.7  37.5   0.0];
   [33.3  83.3  27.8  29.4  86.1 100.0  63.9 ...
    53.8  18.8 100.0  83.3   8.3  34.3];
   [47.6  92.9  26.7  10.0  36.7 100.0  50.0 ...
    14.3   7.4 100.0  86.4  90.9   3.3];
   [60.0  90.9  13.6  68.2  40.9 100.0  18.2 ...
    100.0  5.0  80.0  90.0  50.0   0.0];
   [53.8  88.1   7.1  33.3  88.1 100.0  19.0 ...
    85.7   9.8  73.8  72.2  73.7   2.4];
   [29.2  74.0  16.0  46.0  86.0 100.0  18.0 ...
    88.0  16.3  72.0  80.4  69.6   4.0];
];

% Invoke this standard transform to remove the effect
% of unequal variances between ranges.
data1 = asin(1.0 - 0.02*data0);

% Initialze D to zero before adding distances to D.
update = 'I';

% Distances are squared Euclidean.
dist = 'S';

% No scaling for the variables.
scal = 'U';

% Number of observations (regions in which data is gathered).
nobs = nag_int(14);

% Number of variables (measurements in each region).
nvar = nag_int(13);

% Which variables are to be included in the distance computation?
isx = ones(1, nvar, 'int32');
isx = nag_int(isx);

% Scaling parameter for each variable (not used, but must
% be supplied).
s = ones(1, nvar, 'double');

% Input distance matrix (not used, but must be supplied).
d = zeros(nobs*(nobs-1)/2, 1, 'double');

% Call the NAG routine to calculate the dissimilarity matrix.
% The string at the end of the parameter list signals an optional
% parameter of this name (m), whose value (nvar) follows it.
[sOut, dOut, ifail] = g03ea(update, dist, scal, nobs, data1, ...
    isx, s, d, 'm', nvar);

The second step is to find a new matrix which represents the distances amongst a new set of points in a space of lower dimensionality (say, three); the projection of the old matrix onto the new one is done using metric scaling. Here's the continuation of the code, in which the g03fa routine is invoked to perform the metric scaling:

   
% Standardize the matrix by the number of variables.
dOut = dOut ./ double(nvar);

% Now correct for the bias caused by the different number of
% observations in each region.
ns = [19 50 17 11 49 11 21 17 14 18 16 11 21 25];

jend = nobs-1;
for j = 1:jend
    istart = j+1;
    for i = istart:nobs
        index = (i-1)*(i-2)/2 + j;
        dOut(index) = abs(dOut(index) - (1.0/ns(i) + 1.0/ns(j)));
    end
end

% Now prepare for the metric scaling from nvar to ndim dimensions.
roots = 'L';
ndim = nag_int(3);
[x, eval, ifail] = g03fa(roots, nobs, dOut, ndim);

Finally, the new set of points in 3D can be displayed in a scatter plot:

   
% Do the scatter plot of the 3D data.
scatter3(x(1:nobs), x(nobs+1:2*nobs), x(2*nobs+1:3*nobs), ...
    40.0, colours, 'filled');

% Add the labels, coloured according to species.
dims = size(label);
for i = 1 : nobs
    text(x(i), x(nobs+i), x(2*nobs+i), label(i,1:dims(2)), ...
        'Color', [colours(i), colours(nobs+i), colours(2*nobs+i)], ...
        'FontSize', 13);
end

The resulting display is shown in Figure 3, from which it can be seen that the UK data (black points) are closer to the blue points (A. terrestis) than the red (A. sapidus), implying that those voles belong to that species.


Figure 3: Scatter plot from metric scaling of vole species data set.

The MATLAB script for this demo is available as the file NAGToolboxDemos/Multivariate_methods/g03fa_demo.m, distributed in this archive.

The G05 chapter - Random Number Generators

The reliable generation of a sequence of random numbers is a task that is found in many areas of computation - for example, in Monte Carlo simulation. The G05 chapter contains numerous routines for doing this; we illustrate the use of two of them here.

A fundamental distinction in this area is the one between pseudo and quasi random numbers. The former are numbers whose statistical properties are as close as possible to those of "true" random numbers - i.e. those obtained from an intrinsically random physical process (such as the time elapsed between clicks of a Geiger counter placed next to a radioactive sample). For example, consecutive numbers in a pseudo-random sequence have negligible correlation between them. Quasi-random numbers, by contrast, do not have this property - instead, they are designed to give a more even distribution in space; this property makes them well-suited for Monte Carlo methods where, for a given sequence length, they yield more accurate estimates than pseudo-random numbers.

Our example should make this distinction clear. Here is the code for the generation of two pseudo-random sequences of numbers, using the g05sq routine:

   
% The number of (x,y) pairs to generate.
n = nag_int(10000);

% Generate pseudo-random numbers.
a = 0;
b = 1;
[state, x, fail] = g05sq(n, a, b, state);
[state, y, fail] = g05sq(n, a, b, state);

And this is the code for generating a quasi-random 2D sequence, via g05ym, which uses the so-called Faure method:

   
% The generator creates generate vectors of random numbers;
% first, specify the length of each vector.
idim = nag_int(2);

% Initialize the Faure quasi-random number generator.
genid = nag_int(4);
iskip = nag_int(1);
[iref, ifail] = g05yl(genid, idim, iskip);

% Now generate n 2D vectors of them.
[quasi, irefOut, ifail] = g05ym(n, iref);

If each sequence is interpreted as a collection of points in 2D space, it can be displayed as a scatter plot (see Figure 4), which clearly shows the difference between the two types of generator. Though they both appear to fill the 2D space randomly, the quasi-random sequence fills it in a more uniform fashion (technically, this is because each number in the sequence is designed to be maximally avoiding of the others).


Figure 4: Two types of random number generation.

The MATLAB script for this demo is available as the file NAGToolboxDemos/Random_number_generation/g05fa_demo.m, distributed in this archive.