
# NAG Toolbox: nag_mv_prin_comp (g03aa)

## Purpose

nag_mv_prin_comp (g03aa) performs a principal component analysis on a data matrix; both the principal component loadings and the principal component scores are returned.

## Syntax

[s, e, p, v, ifail] = g03aa(matrix, std, x, isx, s, nvar, 'n', n, 'm', m, 'wt', wt)
[s, e, p, v, ifail] = nag_mv_prin_comp(matrix, std, x, isx, s, nvar, 'n', n, 'm', m, 'wt', wt)
Note: the interface to this routine has changed since earlier releases of the toolbox:
Mark 22: n has been made optional.
Mark 24: weight has been dropped and wt has been made optional.

## Description

Let $X$ be an $n$ by $p$ data matrix of $n$ observations on $p$ variables ${x}_{1},{x}_{2},\dots ,{x}_{p}$ and let the $p$ by $p$ variance-covariance matrix of ${x}_{1},{x}_{2},\dots ,{x}_{p}$ be $S$. A vector ${a}_{1}$ of length $p$ is found such that:
${a}_{1}^{T}S{a}_{1}\text{ is maximized subject to }{a}_{1}^{T}{a}_{1}=1.$
The variable ${z}_{1}=\sum _{i=1}^{p}{a}_{1i}{x}_{i}$ is known as the first principal component and gives the linear combination of the variables with the maximum variation. A second principal component, ${z}_{2}=\sum _{i=1}^{p}{a}_{2i}{x}_{i}$, is found such that:
${a}_{2}^{T}S{a}_{2}\text{ is maximized subject to }{a}_{2}^{T}{a}_{2}=1\text{ and }{a}_{2}^{T}{a}_{1}=0.$
This gives the linear combination of variables that is orthogonal to the first principal component and has the maximum variation. Further principal components are derived in a similar way.
The vectors ${a}_{1},{a}_{2},\dots ,{a}_{p}$ are the eigenvectors of the matrix $S$, and associated with each eigenvector is the eigenvalue ${\lambda }_{i}^{2}$. The value of ${\lambda }_{i}^{2}/\sum {\lambda }_{i}^{2}$ gives the proportion of variation explained by the $i$th principal component. Alternatively, the ${a}_{i}$'s can be considered as the right singular vectors in a singular value decomposition, with singular values ${\lambda }_{i}$, of the data matrix centred about its mean and scaled by $1/\sqrt{\left(n-1\right)}$, ${X}_{s}$. This latter approach is used in nag_mv_prin_comp (g03aa), with
${X}_{s}=V\Lambda {P}^{\prime }$
where $\Lambda$ is a diagonal matrix with elements ${\lambda }_{i}$, $P$ is the $p$ by $p$ matrix with columns ${a}_{i}$ and $V$ is an $n$ by $p$ matrix with ${V}^{\prime }V=I$, which gives the principal component scores.
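The decomposition above can be illustrated with a minimal sketch in base MATLAB. This is not the routine's implementation; it only mirrors the description, using the data from the Example section. The variable names (`xs`, `lambda2`, `prop`, `residual`) are illustrative, and the signs of the columns of `P` and `V` returned by `svd` may differ from those returned by nag_mv_prin_comp (g03aa).

```matlab
% Data from the Example section: 10 observations on 3 variables.
x = [7 4 3; 4 1 8; 6 3 5; 8 6 1; 8 5 7; 7 2 9; 5 3 3; 9 5 8; 7 4 5; 8 2 2];
[n, p] = size(x);

% Centre each column about its mean and scale by 1/sqrt(n-1) to form Xs.
xs = (x - ones(n, 1) * mean(x)) / sqrt(n - 1);

% Economy-size SVD: xs = V * Lambda * P'.
[V, Lambda, P] = svd(xs, 0);

lambda  = diag(Lambda);            % singular values lambda_i
lambda2 = lambda .^ 2;             % eigenvalues of S
prop    = lambda2 / sum(lambda2);  % proportion of variation per component

% Cross-check against the eigenvalue route: S * P should equal P * diag(lambda2).
S = cov(x);
residual = norm(S * P - P * diag(lambda2));
```

For this data `lambda2` should agree with the first column of `e` in the Example section to the printed precision, and `residual` should be at rounding-error level.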
Principal component analysis is often used to reduce the dimension of a dataset, replacing a large number of correlated variables with a smaller number of orthogonal variables that still contain most of the information in the original dataset.
The choice of the number of dimensions required is usually based on the amount of variation accounted for by the leading principal components. If $k$ principal components are selected, then a test of the equality of the remaining $p-k$ eigenvalues is
$\left(n-\left(2p+5\right)/6\right)\left\{-\sum _{i=k+1}^{p}\mathrm{log}\left({\lambda }_{i}^{2}\right)+\left(p-k\right)\mathrm{log}\left(\sum _{i=k+1}^{p}{\lambda }_{i}^{2}/\left(p-k\right)\right)\right\}$
which has, asymptotically, a ${\chi }^{2}$-distribution with $\frac{1}{2}\left(p-k-1\right)\left(p-k+2\right)$ degrees of freedom.
Equality of the remaining eigenvalues indicates that if any more principal components are to be considered then they all should be considered.
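As a rough illustration of the test statistic and its degrees of freedom, the sketch below evaluates them for $k=0$ and $k=1$ from the eigenvalues printed in the Example section. The leading multiplier is held in a hypothetical variable `nu`. This is an observation from the printed output, not a statement about the routine's internals: with `nu = n - 1` the printed values in column 4 of `e` are reproduced, whereas `nu = n` gives roughly 9.81 for $k=0$.

```matlab
% Eigenvalues as printed in e(:,1) of the Example section (4 decimal places).
lambda2 = [8.2739; 3.6761; 0.7499];
p = numel(lambda2);
n = 10;                      % number of observations in the example

% ASSUMPTION: n - 1 is used because it matches the printed example output;
% set nu = n to follow the formula exactly as written in the text.
nu = n - 1;

stat = zeros(p - 1, 1);
df   = zeros(p - 1, 1);
for k = 0:p-2
    rest = lambda2(k+1:p);   % the remaining p - k eigenvalues
    q    = p - k;
    stat(k+1) = (nu - (2*p + 5)/6) * (-sum(log(rest)) + q * log(sum(rest) / q));
    df(k+1)   = (q - 1) * (q + 2) / 2;
end
```

Here `stat(1)` and `df(1)` correspond to $k=0$ (all eigenvalues equal), and `stat(2)` and `df(2)` to $k=1$.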
Instead of the variance-covariance matrix, the correlation matrix, the sums of squares and cross-products matrix or a standardized sums of squares and cross-products matrix may be used. In the last case $S$ is replaced by ${\sigma }^{-\frac{1}{2}}S{\sigma }^{-\frac{1}{2}}$ for a diagonal matrix $\sigma$ with positive elements. If the correlation matrix is used, the ${\chi }^{2}$ approximation for the statistic given above is not valid.
The principal component scores, $F$, are the values of the principal component variables for the observations. These can be standardized so that the variance of these scores for each principal component is $1.0$ or equal to the corresponding eigenvalue.
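The different scalings of the scores can be sketched in base MATLAB for the unweighted case, starting from an SVD of the centred and scaled data. The names `F_orth`, `F_unstd`, `F_unit` and `F_eig` are illustrative only. The factor `sqrt(n - 1)` is an assumption that variances are computed with an $n-1$ divisor; it is consistent with the scores printed in the Example section, up to the sign of each column.

```matlab
% Data from the Example section.
x = [7 4 3; 4 1 8; 6 3 5; 8 6 1; 8 5 7; 7 2 9; 5 3 3; 9 5 8; 7 4 5; 8 2 2];
[n, p] = size(x);
xs = (x - ones(n, 1) * mean(x)) / sqrt(n - 1);
[V, Lambda, P] = svd(xs, 0);
lambda2 = diag(Lambda) .^ 2;

F_orth  = V;                         % orthonormal scores: F'F = I
F_unstd = V * Lambda;                % unstandardized scores: F = Xs * P
F_unit  = sqrt(n - 1) * V;           % each column has variance 1
F_eig   = sqrt(n - 1) * V * Lambda;  % column variances equal the eigenvalues
```

Under these assumptions `F_eig` is simply the mean-centred data multiplied by the loadings, `(x - ones(n,1)*mean(x)) * P`.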
Weights can be used with the analysis, in which case the matrix $X$ is first centred about the weighted means, then each row is scaled by an amount $\sqrt{{w}_{i}}$, where ${w}_{i}$ is the weight for the $i$th observation.
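The weighting step can be sketched as follows. The data matrix and weights are hypothetical, chosen so that the result can be checked against an unweighted calculation on replicated rows; this illustrates the centring and scaling only, not the routine's full weighted analysis.

```matlab
x = [1 2; 3 1; 5 9; 4 4];    % hypothetical 4-by-2 data matrix
w = [1; 2; 0; 1];            % hypothetical weights; zero drops observation 3
[n, m] = size(x);

m_w = (w' * x) / sum(w);                 % weighted column means
x_c = x - ones(n, 1) * m_w;              % centre about the weighted means
x_w = (sqrt(w) * ones(1, m)) .* x_c;     % scale row i by sqrt(w_i)

sscp_w = x_w' * x_w;         % weighted sums of squares and cross-products

% With integer weights this matches the unweighted result on replicated rows.
x_rep  = [x(1,:); x(2,:); x(2,:); x(4,:)];
xr_c   = x_rep - ones(size(x_rep, 1), 1) * mean(x_rep);
sscp_r = xr_c' * xr_c;
```

The sum of the weights, 4 here, plays the role of the effective number of observations.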

## References

Chatfield C and Collins A J (1980) Introduction to Multivariate Analysis Chapman and Hall
Cooley W C and Lohnes P R (1971) Multivariate Data Analysis Wiley
Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20(3) 2–25
Kendall M G and Stuart A (1969) The Advanced Theory of Statistics (Volume 1) (3rd Edition) Griffin
Morrison D F (1967) Multivariate Statistical Methods McGraw–Hill

## Parameters

### Compulsory Input Parameters

1:     matrix – string (length ≥ 1)
Indicates for which type of matrix the principal component analysis is to be carried out.
${\mathbf{matrix}}=\text{'C'}$
It is for the correlation matrix.
${\mathbf{matrix}}=\text{'S'}$
It is for a standardized matrix, with standardizations given by s.
${\mathbf{matrix}}=\text{'U'}$
It is for the sums of squares and cross-products matrix.
${\mathbf{matrix}}=\text{'V'}$
It is for the variance-covariance matrix.
Constraint: ${\mathbf{matrix}}=\text{'C'}$, $\text{'S'}$, $\text{'U'}$ or $\text{'V'}$.
2:     std – string (length ≥ 1)
Indicates if the principal component scores are to be standardized.
${\mathbf{std}}=\text{'S'}$
The principal component scores are standardized so that ${F}^{\prime }F=I$, i.e., $F={X}_{s}P{\Lambda }^{-1}=V$.
${\mathbf{std}}=\text{'U'}$
The principal component scores are unstandardized, i.e., $F={X}_{s}P=V\Lambda$.
${\mathbf{std}}=\text{'Z'}$
The principal component scores are standardized so that they have unit variance.
${\mathbf{std}}=\text{'E'}$
The principal component scores are standardized so that they have variance equal to the corresponding eigenvalue.
Constraint: ${\mathbf{std}}=\text{'E'}$, $\text{'S'}$, $\text{'U'}$ or $\text{'Z'}$.
3:     x(ldx,m) – double array
ldx, the first dimension of the array, must satisfy the constraint $\mathit{ldx}\ge {\mathbf{n}}$.
${\mathbf{x}}\left(\mathit{i},\mathit{j}\right)$ must contain the $\mathit{i}$th observation for the $\mathit{j}$th variable, for $\mathit{i}=1,2,\dots ,n$ and $\mathit{j}=1,2,\dots ,m$.
4:     isx(m) – int64, int32 or nag_int array
m, the dimension of the array, must satisfy the constraint ${\mathbf{m}}\ge 1$.
${\mathbf{isx}}\left(j\right)$ indicates whether or not the $j$th variable is to be included in the analysis.
If ${\mathbf{isx}}\left(\mathit{j}\right)>0$, the variable contained in the $\mathit{j}$th column of x is included in the principal component analysis, for $\mathit{j}=1,2,\dots ,m$.
Constraint: ${\mathbf{isx}}\left(j\right)>0$ for nvar values of $j$.
5:     s(m) – double array
m, the dimension of the array, must satisfy the constraint ${\mathbf{m}}\ge 1$.
The standardizations to be used, if any.
If ${\mathbf{matrix}}=\text{'S'}$, the first $m$ elements of s must contain the standardization coefficients, the diagonal elements of $\sigma$.
Constraint: if ${\mathbf{isx}}\left(j\right)>0$, ${\mathbf{s}}\left(\mathit{j}\right)>0.0$, for $\mathit{j}=1,2,\dots ,m$.
6:     nvar – int64, int32 or nag_int scalar
$p$, the number of variables in the principal component analysis.
Constraint: $1\le {\mathbf{nvar}}\le \mathrm{min}\phantom{\rule{0.125em}{0ex}}\left({\mathbf{n}}-1,{\mathbf{m}}\right)$.

### Optional Input Parameters

1:     n – int64, int32 or nag_int scalar
Default: The first dimension of the array x.
$n$, the number of observations.
Constraint: ${\mathbf{n}}\ge 2$.
2:     m – int64, int32 or nag_int scalar
Default: The dimension of the arrays isx, s and the second dimension of the array x. (An error is raised if these dimensions are not equal.)
$m$, the number of variables in the data matrix.
Constraint: ${\mathbf{m}}\ge 1$.
3:     wt(:) – double array
Note: the dimension of the array wt must be at least ${\mathbf{n}}$ if $\mathit{weight}=\text{'W'}$, and at least $1$ otherwise.
If $\mathit{weight}=\text{'W'}$, the first $n$ elements of wt must contain the weights to be used in the principal component analysis.
If ${\mathbf{wt}}\left(i\right)=0.0$, the $i$th observation is not included in the analysis. The effective number of observations is the sum of the weights.
If $\mathit{weight}=\text{'U'}$, wt is not referenced and the effective number of observations is $n$.
Constraints:
• ${\mathbf{wt}}\left(\mathit{i}\right)\ge 0.0$, for $\mathit{i}=1,2,\dots ,n$;
• the sum of weights $\ge {\mathbf{nvar}}+1$.
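The constraints listed for the input parameters can be checked before calling the routine. The following is a minimal sketch in base MATLAB with hypothetical inputs and illustrative flag names (`ok_dims`, `ok_nvar`, and so on); the routine performs its own checks and reports failures through ifail, so this is only a convenience.

```matlab
% Hypothetical inputs for illustration.
x    = reshape(1:40, 10, 4);     % 10 observations on 4 variables
isx  = int64([1; 0; 1; 1]);      % include variables 1, 3 and 4
s    = [1.5; 0; 2.0; 0.5];       % only used when matrix = 'S'
nvar = int64(3);
wt   = ones(10, 1);

[n, m] = size(x);
ok_dims = (numel(isx) == m) && (numel(s) == m);    % isx, s match columns of x
ok_nvar = (nvar >= 1) && (nvar <= min(n - 1, m));  % 1 <= nvar <= min(n-1,m)
ok_isx  = (sum(isx > 0) == nvar);                  % exactly nvar selected
ok_s    = all(s(isx > 0) > 0);                     % needed for matrix = 'S'
ok_wt   = all(wt >= 0) && (sum(wt) >= double(nvar) + 1);

x_sel = x(:, isx > 0);           % columns that enter the analysis
```

Note that `s(2)` may be zero here because the second variable is excluded by `isx`.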

### Input Parameters Omitted from the MATLAB Interface

weight ldx lde ldp ldv wk

### Output Parameters

1:     s(m) – double array
If ${\mathbf{matrix}}=\text{'S'}$, s is unchanged on exit.
If ${\mathbf{matrix}}=\text{'C'}$, s contains the variances of the selected variables. ${\mathbf{s}}\left(j\right)$ contains the variance of the variable in the $j$th column of x if ${\mathbf{isx}}\left(j\right)>0$.
If ${\mathbf{matrix}}=\text{'U'}$ or $\text{'V'}$, s is not referenced.
2:     e(lde,6) – double array
$\mathit{lde}\ge {\mathbf{nvar}}$.
The statistics of the principal component analysis.
${\mathbf{e}}\left(i,1\right)$
The eigenvalue associated with the $\mathit{i}$th principal component, ${\lambda }_{\mathit{i}}^{2}$, for $\mathit{i}=1,2,\dots ,p$.
${\mathbf{e}}\left(i,2\right)$
The proportion of variation explained by the $\mathit{i}$th principal component, for $\mathit{i}=1,2,\dots ,p$.
${\mathbf{e}}\left(\mathit{i},3\right)$
The cumulative proportion of variation explained by the first $\mathit{i}$ principal components, for $\mathit{i}=1,2,\dots ,p$.
${\mathbf{e}}\left(\mathit{i},4\right)$
The ${\chi }^{2}$ statistics, for $i=1,2,\dots ,p$.
${\mathbf{e}}\left(i,5\right)$
The degrees of freedom for the ${\chi }^{2}$ statistics, for $i=1,2,\dots ,p$.
If ${\mathbf{matrix}}\ne \text{'C'}$, ${\mathbf{e}}\left(\mathit{i},6\right)$ contains the significance level for the ${\chi }^{2}$ statistic, for $\mathit{i}=1,2,\dots ,p$.
If ${\mathbf{matrix}}=\text{'C'}$, ${\mathbf{e}}\left(i,6\right)$ is returned as zero.
3:     p(ldp,nvar) – double array
$\mathit{ldp}\ge {\mathbf{nvar}}$.
The first nvar columns of p contain the principal component loadings, ${a}_{i}$. The $j$th column of p contains the nvar coefficients for the $j$th principal component.
4:     v(ldv,nvar) – double array
$\mathit{ldv}\ge {\mathbf{n}}$.
The first nvar columns of v contain the principal component scores. The $j$th column of v contains the n scores for the $j$th principal component.
If $\mathit{weight}=\text{'W'}$, any rows for which ${\mathbf{wt}}\left(i\right)$ is zero will be set to zero.
5:     ifail – int64, int32 or nag_int scalar
${\mathrm{ifail}}={\mathbf{0}}$ unless the function detects an error (see [Error Indicators and Warnings]).
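The relationship between the columns of e can be illustrated with the values printed in the Example section. The sketch below recomputes columns 2, 3 and 6 from columns 1, 4 and 5 using base MATLAB only. Treating the significance level as the upper-tail ${\chi }^{2}$ probability is an assumption, though it agrees with the printed output; the variable names are illustrative.

```matlab
% Columns 1, 4 and 5 of e as printed in the Example section.
lambda2 = [8.2739; 3.6761; 0.7499];   % e(:,1)
stat    = [8.6127; 4.1183];           % e(1:2,4)
df      = [5; 2];                     % e(1:2,5)

prop    = lambda2 / sum(lambda2);     % e(:,2): proportion of variation
cumprop = cumsum(prop);               % e(:,3): cumulative proportion

% Upper-tail chi-square probability via the regularized incomplete gamma
% function: P(X > stat) = gammainc(stat/2, df/2, 'upper').
siglev  = gammainc(stat / 2, df / 2, 'upper');   % e(1:2,6)
```

The last row of e in the example has zero degrees of freedom and is therefore left out of the significance calculation.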

## Error Indicators and Warnings

Errors or warnings detected by the function:

Cases prefixed with W are classified as warnings and do not generate an error of type NAG:error_n. See nag_issue_warnings.

${\mathbf{ifail}}=1$
 On entry, ${\mathbf{m}}<1$, or ${\mathbf{n}}<2$, or ${\mathbf{nvar}}<1$, or ${\mathbf{nvar}}>{\mathbf{m}}$, or ${\mathbf{nvar}}\ge {\mathbf{n}}$, or $\mathit{ldx}<{\mathbf{n}}$, or $\mathit{ldv}<{\mathbf{n}}$, or $\mathit{ldp}<{\mathbf{nvar}}$, or $\mathit{lde}<{\mathbf{nvar}}$, or ${\mathbf{matrix}}\ne \text{'C'}$, $\text{'S'}$, $\text{'U'}$ or $\text{'V'}$, or ${\mathbf{std}}\ne \text{'S'}$, $\text{'U'}$, $\text{'Z'}$ or $\text{'E'}$, or $\mathit{weight}\ne \text{'U'}$ or $\text{'W'}$.
${\mathbf{ifail}}=2$
 On entry, $\mathit{weight}=\text{'W'}$ and a value of ${\mathbf{wt}}<0.0$.
${\mathbf{ifail}}=3$
 On entry, there are not nvar values of ${\mathbf{isx}}>0$, or $\mathit{weight}=\text{'W'}$ and the effective number of observations is less than ${\mathbf{nvar}}+1$.
${\mathbf{ifail}}=4$
 On entry, ${\mathbf{s}}\left(j\right)\le 0.0$ for some $j=1,2,\dots ,m$, when ${\mathbf{matrix}}=\text{'S'}$ and ${\mathbf{isx}}\left(j\right)>0$.
${\mathbf{ifail}}=5$
The singular value decomposition has failed to converge. This is an unlikely error exit.
W ${\mathbf{ifail}}=6$
All eigenvalues/singular values are zero. This will be caused by all the variables being constant.

## Accuracy

As nag_mv_prin_comp (g03aa) uses a singular value decomposition of the data matrix, it will be less affected by ill-conditioned problems than traditional methods using the eigenvalue decomposition of the variance-covariance matrix.

## Further Comments

None.

## Example

```
function nag_mv_prin_comp_example
matrix = 'V';
std = 'E';
x = [7, 4, 3;
4, 1, 8;
6, 3, 5;
8, 6, 1;
8, 5, 7;
7, 2, 9;
5, 3, 3;
9, 5, 8;
7, 4, 5;
8, 2, 2];
isx = [int64(1);1;1];
% s is not referenced when matrix = 'V', so these values are arbitrary placeholders.
s = [-5.04677090184712e-39;
-5.04512289241806e-39;
-1.790699005126953];
nvar = int64(3);
[sOut, e, p, v, ifail] = nag_mv_prin_comp(matrix, std, x, isx, s, nvar)
```
```

sOut =

-0.0000
-0.0000
-1.7907

e =

8.2739    0.6515    0.6515    8.6127    5.0000    0.1255
3.6761    0.2895    0.9410    4.1183    2.0000    0.1276
0.7499    0.0590    1.0000         0         0         0

p =

-0.1376    0.6990   -0.7017
-0.2505    0.6609    0.7075
0.9583    0.2731    0.0842

v =

-2.1514   -0.1731    0.1068
3.8042   -2.8875    0.5104
0.1532   -0.9869    0.2694
-4.7065    1.3015    0.6517
1.2938    2.2791    0.4492
4.0993    0.1436   -0.8031
-1.6258   -2.2321    0.8028
2.1145    3.2512   -0.1684
-0.2348    0.3730    0.2751
-2.7464   -1.0689   -2.0940

ifail =

0

```
```
function g03aa_example
matrix = 'V';
std = 'E';
x = [7, 4, 3;
4, 1, 8;
6, 3, 5;
8, 6, 1;
8, 5, 7;
7, 2, 9;
5, 3, 3;
9, 5, 8;
7, 4, 5;
8, 2, 2];
isx = [int64(1);1;1];
% s is not referenced when matrix = 'V', so these values are arbitrary placeholders.
s = [-5.04677090184712e-39;
-5.04512289241806e-39;
-1.790699005126953];
nvar = int64(3);
[sOut, e, p, v, ifail] = g03aa(matrix, std, x, isx, s, nvar)
```
```

sOut =

-0.0000
-0.0000
-1.7907

e =

8.2739    0.6515    0.6515    8.6127    5.0000    0.1255
3.6761    0.2895    0.9410    4.1183    2.0000    0.1276
0.7499    0.0590    1.0000         0         0         0

p =

-0.1376    0.6990   -0.7017
-0.2505    0.6609    0.7075
0.9583    0.2731    0.0842

v =

-2.1514   -0.1731    0.1068
3.8042   -2.8875    0.5104
0.1532   -0.9869    0.2694
-4.7065    1.3015    0.6517
1.2938    2.2791    0.4492
4.0993    0.1436   -0.8031
-1.6258   -2.2321    0.8028
2.1145    3.2512   -0.1684
-0.2348    0.3730    0.2751
-2.7464   -1.0689   -2.0940

ifail =

0

```