Using the NAG Toolbox for MATLAB - Part 3
Using the NAG Toolbox for MATLAB® - Part 3
Previous articles in this series are http://www.nag.co.uk/IndustryArticles/usingtoolboxmatlab.asp and http://www.nag.co.uk/IndustryArticles/usingtoolboxmatlabpart2.asp; in this note, we continue our exploration of the Toolbox, demonstrating how it allows users to call any NAG Library routine from within MATLAB, and use MATLAB's plotting facilities to view the results.
Note: The code examples in this article have been extracted from demo scripts, and will not necessarily work properly if cut and pasted from this page into MATLAB. The full version of the scripts, which have been used to make the figures in this article, are available in this archive.
A time series is a set of observations of some time-dependent process, collected at various points in time. The G13 chapter of the NAG Library contains several routines for investigating and modelling the statistical structure of time series; the models constructed by these routines may then be used to better understand the data, or to create forecasts (i.e. predictions of future behaviour) from the series. For example, a so-called autoregressive integrated moving average (ARIMA) model can be fitted to the series - see below.
One way of initially characterising a time series is to calculate its autocorrelation function, which describes the correlation (or degree of dependence) that exists between the behaviour of the underlying process at different points in time. The separation between the different times is called the lag, and the autocorrelation function is usually expressed as a set of autocorrelation coefficients, for different values of the lag. The routine g13ab can be used to compute this, along with more elementary statistical quantities such as the mean and variance. Here's the code:
Because lag is a discrete variable, the autocorrelation function is best displayed as a histogram (sometimes called an autocorrelogram in this context), as in this picture:
Figure 1: Computing the autocorrelation function of a time series.
The autocorrelation function contains both quantitative and qualitative information about the time-dependence of the underlying process; in this example, the period of the oscillations indicates a seasonality of around 11 units. In addition, the shape of the autocorrelation plot can be used to give some indication of suitable model parameters when fitting an ARIMA model to the time series. The curve should tail off quickly to zero; failure to do so, as in Figure 1, may indicate that the series is non-stationary, which necessitates further treatment. If the correlation is high for the first few lags and then quickly tails off, it suggests a so-called moving average (MA) series, whilst a sinusoidal shape is often associated with an autoregressive (AR) series. In many cases, a full ARIMA model (i.e. one with both AR and MA components) is required to fit the series.
Besides the autocorrelation function, additional insight may be obtained from a plot of the partial autocorrelation function; this can be produced by calling g13ac in place of g13ab in the code fragment above.
The MATLAB script for this demo is available as the file NAGToolboxDemos/Time_series_analysis/g13ab_demo.m, distributed in this archive.
The numerical evaluation of definite integrals in one or more dimensions is a commonly encountered task in analysis. Routines from the D01 chapter provide a variety of algorithms for use in this area; the applicability of individual routines depends on the form of the integrand. If its functional form is known analytically, then d01aj is most generally suited, and can be used when the integrand contains singularities, especially of an algebraic or logarithmic type. Other, more specialized, routines include d01ak if the integrand is oscillatory, and d01al if it exhibits discontinuities at known points.
The algorithm implemented by d01aj is adaptive - that is, it divides the interval over which the function is to be integrated into a set of subintervals, which are in turn subdivided until some accuracy condition (specified by the variables epsabs and epsrel in the fragment below) is met.
Here is the code which calls the routine to calculate the integral.
In this fragment, the variable func, passed to d01aj, is a string containing the name of a MATLAB m-file; this file must contain a function which returns the value of the integrand at a given point. Here are the contents of an example file of this type:
(this is the integrand used in figure 2, below). In addition to the approximation of the integral value (result), d01aj's output contains the specification of the the final set of subintervals, along with the error associated with each one (some manipulation is required to get these arrays into a form that MATLAB's plotting routines can use). Figure 2 shows results for the contributions from each subinterval to the integral, and the associated errors. It can be seen that, for this integrand, the algorithm has used narrower subintervals in regions where the function is changing rapidly; the subintervals associated with (relatively) large errors are those whose width is very small indeed (near where the function goes to zero).
Figure 2: Calculating an approximation to the integral of a function.
The MATLAB script for this demo is available as the file NAGToolboxDemos/Quadrature/d01aj_demo.m, distributed in this archive.
A multivariate data set contains several variables measured for a number of objects. Examples of such data sets arise in all branches of science, and the routines in the G03 chapter of the NAG Library can be used to study them. For example, environmental scientists who want to find out how many species of water vole (genus Arvicola) there are in the UK have made observations of 300 vole skulls, looking at the presence or absence of 13 characteristic measurements. Each observation was made in one of 14 geographical regions, distributed between the UK and continental Europe. The data from Europe is already classified into two species (A. terrestis and A. sapidus); and the scientist's task is to determine to which species the UK data belongs.
The treatment of the data starts by averaging the measurements within each region, giving 14 observations, each of 13 variables. This can be thought of as 14 points in 13-dimensional space, and a standard way of analysing such a data set is principal components analysis (as offered by, for example, the g03aa routine). This is a method to reduce the dimensionality of the data set to some smaller value; the structure of the derived points within this lower dimensional space can be considered in place of the original set of points, as long as they are adequately represented by the derived set. For the vole data set however, consideration of the first three principal components only explains 65% of the variance in the original data, which is insufficient.
An alternative technique which can be used in such circumstances is known as metric scaling. Here, the first step is to construct a 14 by 14 dissimilarity matrix whose elements are the distances between each pair of points in the original 13-dimensional space. The g03ea routine calculates the dissimilarity matrix. Here's the code:
The second step is to find a new matrix which represents the distances amongst a new set of points in a space of lower dimensionality (say, three); the projection of the old matrix onto the new one is done using metric scaling. Here's the continuation of the code, in which the g03fa routine is invoked to perform the metric scaling:
Finally, the new set of points in 3D can be displayed in a scatter plot:
The resulting display is shown in Figure 3, from which it can be seen that the UK data (black points) are closer to the blue points (A. terrestis) than the red (A. sapidus), implying that those voles belong to that species.
Figure 3: Scatter plot from metric scaling of vole species data set.
The MATLAB script for this demo is available as the file NAGToolboxDemos/Multivariate_methods/g03fa_demo.m, distributed in this archive.
The reliable generation of a sequence of random numbers is a task that is found in many areas of computation - for example, in Monte Carlo simulation. The G05 chapter contains numerous routines for doing this; we illustrate the use of two of them here.
A fundamental distinction in this area is the one between pseudo and quasi random numbers. The former are numbers whose statistical properties are as close as possible to those of "true" random numbers - i.e. those obtained from an intrinsically random physical process (such as the time elapsed between clicks of a Geiger counter placed next to a radioactive sample). For example, consecutive numbers in a pseudo-random sequence have negligible correlation between them. Quasi-random numbers, by contrast, do not have this property - instead, they are designed to give a more even distribution in space; this property makes them well-suited for Monte Carlo methods where, for a given sequence length, they yield more accurate estimates than pseudo-random numbers.
Our example should make this distinction clear. Here is the code for the generation of two pseudo-random sequences of numbers, using the g05fa routine:
And this is the code for generating a quasi-random 2D sequence, via g05yd, which uses the so-called Faure method:
If each sequence is interpreted as a collection of points in 2D space, it can be displayed as a scatter plot (see Figure 4), which clearly shows the difference between the two types of generator. Though they both appear to fill the 2D space randomly, the quasi-random sequence fills it in a more uniform fashion (technically, this is because each number in the sequence is designed to be maximally avoiding of the others).
Figure 4: Two types of random number generation.
The MATLAB script for this demo is available as the file NAGToolboxDemos/Random_number_generation/g05fa_demo.m, distributed in this archive.