g03aa: Multivariate Methods (NAG Toolbox)

Let $X$ be an $n \times p$ data matrix of $n$ observations on $p$ variables $x_{1}, x_{2}, \dots, x_{p}$ and let the $p \times p$ variance-covariance matrix of $x_{1}, x_{2}, \dots, x_{p}$ be $S$. A vector $a_{1}$ of length $p$ is found such that:

$$a_{1}^{T} S a_{1} \text{ is maximized subject to } a_{1}^{T} a_{1} = 1.$$

The variable $z_{1} = \sum_{i = 1}^{p} a_{1 i} x_{i}$ is known as the first principal component; it is the linear combination of the variables that has the maximum variation. A second principal component, $z_{2} = \sum_{i = 1}^{p} a_{2 i} x_{i}$, is found such that:

$$a_{2}^{T} S a_{2} \text{ is maximized subject to } a_{2}^{T} a_{2} = 1 \text{ and } a_{2}^{T} a_{1} = 0.$$

This is the linear combination of the variables that has the maximum variation among those orthogonal to the first principal component. Further principal components are derived in a similar way.
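The two constrained maximizations above can be checked numerically. The following is a minimal sketch using only base MATLAB functions (`cov`, `eig`) on the data from the Example section; it is not a description of how g03aa itself works, and the random comparison vector `b` is an illustrative assumption.

```matlab
% Data from the Example section: ten observations on three variables.
x = [7 4 3; 4 1 8; 6 3 5; 8 6 1; 8 5 7; 7 2 9; 5 3 3; 9 5 8; 7 4 5; 8 2 2];

S = cov(x);                          % p-by-p variance-covariance matrix
S = (S + S') / 2;                    % enforce exact symmetry before eig
[A, D] = eig(S);                     % columns of A are unit-length eigenvectors
[lambda2, order] = sort(diag(D), 'descend');
A = A(:, order);                     % reorder so that a_1 is the first column

a1 = A(:, 1);
a2 = A(:, 2);
var_z1 = a1' * S * a1;               % variance of the first principal component
var_z2 = a2' * S * a2;               % variance of the second principal component

% Illustrative check: an arbitrary unit vector cannot exceed a1'*S*a1.
b = randn(3, 1);
b = b / norm(b);
var_b = b' * S * b;

fprintf('a1''*S*a1 = %.4f, a2''*S*a2 = %.4f\n', var_z1, var_z2);
fprintf('a1''*a1 = %.4f, a2''*a1 = %.1e\n', a1' * a1, a2' * a1);
fprintf('variance along a random unit vector = %.4f\n', var_b);
```

The signs of `a1` and `a2` are arbitrary, so they may differ from the loadings printed in the Example section.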

The vectors $a_{1}, a_{2}, \dots, a_{p}$ are the eigenvectors of the matrix $S$, and associated with each eigenvector $a_{i}$ is the eigenvalue $\lambda_{i}^{2}$. The value of $\lambda_{i}^{2} / \sum \lambda_{i}^{2}$ gives the proportion of variation explained by the $i$th principal component. Alternatively, the $a_{i}$'s can be considered as the right singular vectors, with singular values $\lambda_{i}$, in a singular value decomposition of $X_{s}$, the data matrix centred about its mean and scaled by $1 / \sqrt{n - 1}$. This latter approach is used in nag_mv_prin_comp (g03aa), with

$$X_{s} = V \Lambda P^{T}$$

where $\Lambda$ is a diagonal matrix with elements $\lambda_{i}$, $P$ is the $p \times p$ matrix with columns $a_{i}$, and $V$ is an $n \times p$ matrix with $V^{T} V = I$, which gives the principal component scores.
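The equivalence between the eigenvalue and singular value formulations can be illustrated as follows. This is a sketch using the base MATLAB `svd` function on the Example data, assuming unweighted observations; it shows the decomposition described above, not the internal implementation of g03aa.

```matlab
% Data from the Example section: ten observations on three variables.
x = [7 4 3; 4 1 8; 6 3 5; 8 6 1; 8 5 7; 7 2 9; 5 3 3; 9 5 8; 7 4 5; 8 2 2];
[n, p] = size(x);

xc = x - repmat(mean(x, 1), n, 1);   % centre each column about its mean
xs = xc / sqrt(n - 1);               % X_s: centred data scaled by 1/sqrt(n-1)

[V, Lambda, P] = svd(xs, 'econ');    % xs = V * Lambda * P'
lambda  = diag(Lambda);              % singular values lambda_i
lambda2 = lambda .^ 2;               % eigenvalues lambda_i^2 of S
prop    = lambda2 / sum(lambda2);    % proportion of variation per component

% The same eigenvalues obtained directly from the variance-covariance matrix.
lambda2_eig = sort(eig(cov(x)), 'descend');

% Columns: eigenvalue (SVD), eigenvalue (eig), proportion, cumulative proportion.
disp([lambda2, lambda2_eig, prop, cumsum(prop)]);
```

The first, third and fourth columns displayed should agree with the Eigenvalues, Percentage variation and Cumulative variation columns of the example results.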

Principal component analysis is often used to reduce the dimension of a dataset, replacing a large number of correlated variables with a smaller number of orthogonal variables that still contain most of the information in the original dataset.

The choice of the number of dimensions required is usually based on the amount of variation accounted for by the leading principal components. If $k$ principal components are selected, then a test of the equality of the remaining $p - k$ eigenvalues is

$$\left(n - (2 p + 5) / 6\right) \left\{- \sum_{i = k + 1}^{p} \log (\lambda_{i}^{2}) + (p - k) \log \left(\sum_{i = k + 1}^{p} \lambda_{i}^{2} / (p - k)\right)\right\}$$

which has, asymptotically, a $\chi^{2}$-distribution with $\frac{1}{2} (p - k - 1) (p - k + 2)$ degrees of freedom.
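As a worked illustration, the statistic and its degrees of freedom can be evaluated for the eigenvalues printed in the Example section with $k = 0$. The sketch below uses only base MATLAB functions; the variable names are illustrative. Comparing with the Chisq column of the example results, the printed values are reproduced to rounding when the multiplier uses $n - 1$ in place of $n$. That is an observation about the printed numbers, not a documented statement, so both variants are computed here.

```matlab
lambda2 = [8.2739; 3.6761; 0.7499];  % eigenvalues from the example results
n = 10;                              % number of observations
p = 3;                               % number of variables
k = 0;                               % number of components retained

tail    = lambda2(k+1:p);            % the remaining p - k eigenvalues
bracket = -sum(log(tail)) + (p - k) * log(sum(tail) / (p - k));

stat_n   = (n - (2*p + 5)/6) * bracket;      % multiplier as written in the text
stat_nm1 = (n - 1 - (2*p + 5)/6) * bracket;  % assumption: n - 1 in place of n

df  = (p - k - 1) * (p - k + 2) / 2;         % degrees of freedom
% Upper tail of the chi-square distribution via the incomplete gamma function.
sig = gammainc(stat_nm1 / 2, df / 2, 'upper');

fprintf('statistic (n)   = %.4f\n', stat_n);
fprintf('statistic (n-1) = %.4f, df = %.1f, sig = %.4f\n', stat_nm1, df, sig);
```

Setting `k = 1` gives the quantities for the second row of the table in the same way.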

Instead of the variance-covariance matrix, the correlation matrix, the sums of squares and cross-products matrix, or a standardized sums of squares and cross-products matrix may be used. In the last case $S$ is replaced by $\sigma^{- \frac{1}{2}} S \sigma^{- \frac{1}{2}}$ for a diagonal matrix $\sigma$ with positive elements. If the correlation matrix is used, the $\chi^{2}$ approximation for the statistic given above is not valid.
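To make the standardization concrete, the sketch below applies $\sigma^{- \frac{1}{2}} S \sigma^{- \frac{1}{2}}$ to the variance-covariance matrix of the Example data, assuming (for illustration only) that $\sigma$ holds the variances of the variables. With that particular choice the result is the correlation matrix, which is checked against the base MATLAB function `corrcoef`.

```matlab
% Data from the Example section: ten observations on three variables.
x = [7 4 3; 4 1 8; 6 3 5; 8 6 1; 8 5 7; 7 2 9; 5 3 3; 9 5 8; 7 4 5; 8 2 2];

S = cov(x);                          % variance-covariance matrix
sigma_diag = diag(S);                % assumed choice: sigma = diag of variances
Dinv = diag(1 ./ sqrt(sigma_diag));  % sigma^(-1/2)

R     = Dinv * S * Dinv;             % sigma^(-1/2) * S * sigma^(-1/2)
R_ref = corrcoef(x);                 % correlation matrix for comparison

% Eigenvalues of the standardized matrix; they sum to p for a correlation matrix.
lambda2_R = sort(eig((R + R') / 2), 'descend');

disp(R);
disp(lambda2_R');
```

Because standardization changes the relative weight of each variable, the eigenvalues and loadings obtained this way generally differ from those of the variance-covariance analysis.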

The principal component scores, $F$, are the values of the principal component variables for the observations. These can be standardized so that the variance of these scores for each principal component is $1.0$ or equal to the corresponding eigenvalue.
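The two standardizations of the scores can be sketched as follows, again with base MATLAB functions on the Example data and unweighted observations. The names `F_eig` and `F_unit` are illustrative and are not part of the g03aa interface.

```matlab
% Data from the Example section: ten observations on three variables.
x = [7 4 3; 4 1 8; 6 3 5; 8 6 1; 8 5 7; 7 2 9; 5 3 3; 9 5 8; 7 4 5; 8 2 2];
[n, p] = size(x);

xc = x - repmat(mean(x, 1), n, 1);             % centred data
[V, Lambda, P] = svd(xc / sqrt(n - 1), 'econ');
lambda2 = diag(Lambda) .^ 2;                   % eigenvalues

F_eig  = xc * P;                               % column i has variance lambda2(i)
F_unit = F_eig * diag(1 ./ sqrt(lambda2));     % every column has variance 1

disp(var(F_eig));                              % compare with the eigenvalues
disp(var(F_unit));                             % all ones, up to rounding
```

Up to the sign of each column, `F_eig` should correspond to the principal component scores printed in the Example section, where the scores are standardized to have variance equal to the eigenvalue.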

References

Parameters

Compulsory Input Parameters

Optional Input Parameters

Output Parameters

Error Indicators and Warnings

Cases prefixed with W are classified as warnings and do not generate an error of type NAG:error_n. See nag_issue_warnings.

Accuracy

Further Comments

Example

A dataset is taken from Cooley and Lohnes (1971); it consists of ten observations on three variables. The unweighted principal components based on the variance-covariance matrix are computed and the principal component scores are requested. The principal component scores are standardized so that they have variance equal to the corresponding eigenvalue.

function g03aa_example


fprintf('g03aa example results\n\n');

x    = [7, 4, 3;
	4, 1, 8;
	6, 3, 5;
	8, 6, 1;
	8, 5, 7;
	7, 2, 9;
	5, 3, 3;
	9, 5, 8;
	7, 4, 5;
	8, 2, 2];
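m    = size(x,2);           % number of variables (columns of x)

matrix = 'V';               % analyse the variance-covariance matrix
std  = 'E';                 % scores scaled to variance equal to the eigenvalue
isx  = ones(m,1,'int64');   % positive entries: include every variable
s    = zeros(m,1);          % standardization values passed to g03aa
nvar = int64(m);            % number of variables in the analysis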

[s, e, p, v, ifail] = g03aa( ...
			     matrix, std, x, isx, s, nvar);

fprintf('Eigenvalues  Percentage  Cumulative     Chisq      DF     Sig\n');
fprintf('              variation   variation\n\n');
fprintf('%11.4f%12.4f%12.4f%10.4f%8.1f%8.4f\n',e');
fprintf('\n');

mtitle = 'Principal component loadings';
matrix = 'General';
diag   = ' ';

[ifail] = x04ca( ...
                 matrix, diag, p, mtitle);

fprintf('\n');
mtitle = 'Principal component scores';
[ifail] = x04ca( ...
                 matrix, diag, v, mtitle);

fig1 = figure;
subplot(1,2,1);
xlabel('PC 1');
ylabel('PC 2');
title({'Observation numbers', 'for PC 1 and PC 2'});
axis([-5 5 -3 4]);
for j = 1:size(x,1)
  ch = sprintf('%d',j);
  text(v(j,1),v(j,2),ch);
end
subplot(1,2,2);
bar(e(:,2));
ax = gca;
ax.XTickLabel = {'PC 1','PC 2','PC 3'};
xlabel('Principal component');
ylabel('Percentage variation');
title('Scree plot');

g03aa example results

Eigenvalues  Percentage  Cumulative     Chisq      DF     Sig
              variation   variation

     8.2739      0.6515      0.6515    8.6127     5.0  0.1255
     3.6761      0.2895      0.9410    4.1183     2.0  0.1276
     0.7499      0.0590      1.0000    0.0000     0.0  0.0000

 Principal component loadings
          1       2       3
 1  -0.1376  0.6990 -0.7017
 2  -0.2505  0.6609  0.7075
 3   0.9583  0.2731  0.0842

 Principal component scores
              1          2          3
  1     -2.1514    -0.1731     0.1068
  2      3.8042    -2.8875     0.5104
  3      0.1532    -0.9869     0.2694
  4     -4.7065     1.3015     0.6517
  5      1.2938     2.2791     0.4492
  6      4.0993     0.1436    -0.8031
  7     -1.6258    -2.2321     0.8028
  8      2.1145     3.2512    -0.1684
  9     -0.2348     0.3730     0.2751
 10     -2.7464    -1.0689    -2.0940

At Mark 24:	weight was removed from the interface; wt was made optional
At Mark 22:	n was made optional

On entry,	$m < 1$,
or	$n < 2$,
or	$nvar < 1$,
or	$nvar > m$,
or	$nvar \geq n$,
or	$ldx < n$,
or	$ldv < n$,
or	$ldp < nvar$,
or	$lde < nvar$,
or	$matrix \neq$ 'C', 'S', 'U' or 'V',
or	$std \neq$ 'S', 'U', 'Z' or 'E',
or	$weight \neq$ 'U' or 'W'.

On entry,	there are not nvar values of $isx > 0$,
or	$weight =$ 'W' and the effective number of observations is less than $nvar + 1$.

NAG Toolbox: nag_mv_prin_comp (g03aa)
