## Purpose

nag_mv_canon_corr (g03ad) performs canonical correlation analysis upon input data matrices.

## Syntax

[e, ncv, cvx, cvy, ifail] = g03ad(z, isz, nx, ny, mcv, tol, 'n', n, 'm', m, 'wt', wt)
[e, ncv, cvx, cvy, ifail] = nag_mv_canon_corr(z, isz, nx, ny, mcv, tol, 'n', n, 'm', m, 'wt', wt)
Note: the interface to this routine has changed since earlier releases of the toolbox:
Mark 22: n has been made optional.
Mark 24: weight has been dropped from the interface and wt has been made optional.

## Description

Let there be two sets of variables, $x$ and $y$. For a sample of $n$ observations on $n_x$ variables in a data matrix $X$ and $n_y$ variables in a data matrix $Y$, canonical correlation analysis seeks to find a small number of linear combinations of each set of variables in order to explain or summarise the relationships between them. The variables thus formed are known as canonical variates.
Let the variance-covariance matrix of the two datasets be
$$\begin{pmatrix} S_{xx} & S_{xy} \\ S_{yx} & S_{yy} \end{pmatrix}$$
and let
$$\Sigma = S_{yy}^{-1} S_{yx} S_{xx}^{-1} S_{xy}$$
then the canonical correlations can be calculated from the eigenvalues of the matrix $\Sigma$. However, nag_mv_canon_corr (g03ad) calculates the canonical correlations by means of a singular value decomposition (SVD) of a matrix $V$. If the rank of the data matrix $X$ is $k_x$ and the rank of the data matrix $Y$ is $k_y$, and both $X$ and $Y$ have had variable (column) means subtracted, then the $k_x$ by $k_y$ matrix $V$ is given by:
$$V = Q_x^{\mathrm{T}} Q_y,$$
where $Q_x$ is the first $k_x$ columns of the orthogonal matrix $Q$ either from the $QR$ decomposition of $X$ if $X$ is of full column rank, i.e., $k_x = n_x$:
$$X = Q_x R_x$$
or from the SVD of $X$ if $k_x < n_x$:
$$X = Q_x D_x P_x^{\mathrm{T}}.$$
Similarly $Q_y$ is the first $k_y$ columns of the orthogonal matrix $Q$ either from the $QR$ decomposition of $Y$ if $Y$ is of full column rank, i.e., $k_y = n_y$:
$$Y = Q_y R_y$$
or from the SVD of $Y$ if $k_y < n_y$:
$$Y = Q_y D_y P_y^{\mathrm{T}}.$$
Let the SVD of $V$ be:
$$V = U_x \Delta U_y^{\mathrm{T}}$$
then the nonzero elements of the diagonal matrix $\Delta$, $\delta_i$, for $i = 1,2,\dots,l$, are the $l$ canonical correlations associated with the $l$ canonical variates, where $l = \min(k_x,k_y)$.
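The construction above can be sketched in plain MATLAB, without the NAG Toolbox. This is an illustrative sketch rather than the routine's implementation: it assumes unit weights and that both centred matrices have full column rank, so only the $QR$ branch is exercised. The data are borrowed from the Example section, and the names `Xc`, `Yc`, `Qx`, `Qy` and `delta` are chosen here for illustration.

```matlab
% Data from the Example section: nine observations on four variables.
z = [80, 58.4, 14,   21;
     75, 59.2, 15,   27;
     78, 60.3, 15,   27;
     75, 57.4, 13,   22;
     79, 59.5, 14,   26;
     78, 58.1, 14.5, 26;
     75, 58,   12.5, 23;
     64, 55.5, 11,   22;
     80, 59.2, 12.5, 22];
isz = [-1; 1; 1; -1];          % > 0 marks x variables, < 0 marks y variables

X = z(:, isz > 0);
Y = z(:, isz < 0);

% Subtract the column means from each data matrix.
Xc = X - repmat(mean(X, 1), size(X, 1), 1);
Yc = Y - repmat(mean(Y, 1), size(Y, 1), 1);

% Economy-size QR: Qx and Qy have orthonormal columns spanning the
% column spaces of Xc and Yc (full column rank is assumed here).
[Qx, ~] = qr(Xc, 0);
[Qy, ~] = qr(Yc, 0);

% The singular values of V = Qx' * Qy are the canonical correlations.
V = Qx' * Qy;
delta = svd(V);
disp(delta.');
```

For this data set, `delta` should agree with the first column of `e` printed in the Example section (0.9570 and 0.3624 to four decimal places).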
The eigenvalues, $\lambda_i^2$, of the matrix $\Sigma$ are given by:
$$\lambda_i^2 = \delta_i^2.$$
The value of $\pi_i = \lambda_i^2 / \sum_{j=1}^{l} \lambda_j^2$ gives the proportion of variation explained by the $i$th canonical variate. The values of the $\pi_i$'s give an indication as to how many canonical variates are needed to adequately describe the data, i.e., the dimensionality of the problem.
To test for a significant dimensionality greater than $i$ the $\chi^2$ statistic:
$$-\left(n - \tfrac{1}{2}\left(k_x + k_y + 3\right)\right) \sum_{j=i+1}^{l} \log\left(1 - \delta_j^2\right)$$
can be used. This is asymptotically distributed as a $\chi^2$-distribution with $(k_x - i)(k_y - i)$ degrees of freedom. If the test for $i = k_{\min}$ is not significant, then the remaining tests for $i > k_{\min}$ should be ignored.
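As a worked illustration of the test, the sketch below evaluates the statistic, its degrees of freedom and its upper-tail probability for $i = 0, 1, \dots, l-1$. The leading minus sign makes the statistic positive, since every $\log(1 - \delta_j^2)$ term is negative. The inputs are the rounded canonical correlations printed in the Example section, so the results only match the routine's output to about two decimal places; the use of `gammainc` for the $\chi^2$ tail is a choice made here to avoid any toolbox dependency.

```matlab
% Canonical correlations as printed (four decimals) in the Example section.
delta = [0.9570; 0.3624];
n  = 9;    % number of observations (unit weights assumed)
kx = 2;    % rank of X
ky = 2;    % rank of Y
l  = min(kx, ky);

stat = zeros(l, 1);
df   = zeros(l, 1);
sig  = zeros(l, 1);
for i = 0:l-1
    % The test for dimensionality greater than i uses delta(i+1), ..., delta(l).
    stat(i+1) = -(n - 0.5*(kx + ky + 3)) * sum(log(1 - delta(i+1:l).^2));
    df(i+1)   = (kx - i) * (ky - i);
    % Upper tail of the chi-square distribution via the regularised
    % incomplete gamma function: Q(df/2, stat/2).
    sig(i+1)  = gammainc(stat(i+1)/2, df(i+1)/2, 'upper');
end
disp([stat, df, sig]);
```

Row $i+1$ of the displayed matrix corresponds to the test for dimensionality greater than $i$, which is how the degrees of freedom (4, then 1) in the Example output appear to be arranged.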
The loadings for the canonical variates are calculated from the matrices $U_x$ and $U_y$ respectively. These matrices are scaled so that the canonical variates have unit variance.

## References

Hastings N A J and Peacock J B (1975) Statistical Distributions Butterworth
Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin
Morrison D F (1967) Multivariate Statistical Methods McGraw–Hill

## Parameters

### Compulsory Input Parameters

1:     z(ldz,m) – double array
ldz, the first dimension of the array, must satisfy the constraint $\mathit{ldz}\ge \mathbf{n}$.
$\mathbf{z}(i,j)$ must contain the $i$th observation for the $j$th variable, for $i=1,2,\dots,n$ and $j=1,2,\dots,m$.
Both $x$ and $y$ variables are to be included in z, the indicator array, isz, being used to assign the variables in z to the $x$ or $y$ sets as appropriate.
2:     isz(m) – int64/int32/nag_int array
m, the dimension of the array, must satisfy the constraint $\mathbf{m}\ge \mathbf{nx}+\mathbf{ny}$.
$\mathbf{isz}(j)$ indicates whether or not the $j$th variable is included in the analysis and to which set of variables it belongs.
$\mathbf{isz}(j)>0$
The variable contained in the $j$th column of z is included as an $x$ variable in the analysis.
$\mathbf{isz}(j)<0$
The variable contained in the $j$th column of z is included as a $y$ variable in the analysis.
$\mathbf{isz}(j)=0$
The variable contained in the $j$th column of z is not included in the analysis.
Constraint: exactly nx elements of isz must be $>0$ and exactly ny elements of isz must be $<0$.
3:     nx – int64/int32/nag_int scalar
The number of $x$ variables in the analysis, $n_x$.
Constraint: $\mathbf{nx}\ge 1$.
4:     ny – int64/int32/nag_int scalar
The number of $y$ variables in the analysis, $n_y$.
Constraint: $\mathbf{ny}\ge 1$.
5:     mcv – int64/int32/nag_int scalar
An upper limit to the number of canonical variates.
Constraint: $\mathbf{mcv}\ge \min(\mathbf{nx},\mathbf{ny})$.
6:     tol – double scalar
The value of tol is used to decide if the variables are of full rank and, if not, what is the rank of the variables. The smaller the value of tol the stricter the criterion for selecting the singular value decomposition. If a non-negative value of tol less than machine precision is entered, the square root of machine precision is used instead.
Constraint: $\mathbf{tol}\ge 0.0$.
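The relationship between isz, nx and ny described above can be checked before calling the routine. The following is a small sketch using the column assignment from the Example section; `xcols` and `ycols` are illustrative names, not part of the interface.

```matlab
% Four columns in z: columns 2 and 3 are x variables, columns 1 and 4 are y.
isz = int64([-1; 1; 1; -1]);

% The counts implied by isz; these must equal the nx and ny that are passed in.
nx = int64(sum(isz > 0));
ny = int64(sum(isz < 0));

xcols = find(isz > 0).';   % columns of z used as x variables
ycols = find(isz < 0).';   % columns of z used as y variables

fprintf('nx = %d, ny = %d\n', nx, ny);
```

Deriving nx and ny from isz in this way avoids the mismatch that the routine reports as ifail = 3.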

### Optional Input Parameters

1:     n – int64/int32/nag_int scalar
Default: The dimension of the array wt and the first dimension of the array z. (An error is raised if these dimensions are not equal.)
$n$, the number of observations.
Constraint: $\mathbf{n}>\mathbf{nx}+\mathbf{ny}$.
2:     m – int64/int32/nag_int scalar
Default: The dimension of the array isz and the second dimension of the array z. (An error is raised if these dimensions are not equal.)
$m$, the total number of variables.
Constraint: $\mathbf{m}\ge \mathbf{nx}+\mathbf{ny}$.
3:     wt(:) – double array
Note: the dimension of the array wt must be at least $\mathbf{n}$ if $\mathit{weight}=\text{'W'}$, and at least $1$ otherwise.
If $\mathit{weight}=\text{'W'}$, the first $n$ elements of wt must contain the weights to be used in the analysis.
If $\mathbf{wt}(i)=0.0$, the $i$th observation is not included in the analysis. The effective number of observations is the sum of weights.
If $\mathit{weight}=\text{'U'}$, wt is not referenced and the effective number of observations is $n$.
Constraints:
• $\mathbf{wt}(i)\ge 0.0$, for $i=1,2,\dots,n$;
• $\text{sum of weights}\ge \mathbf{nx}+\mathbf{ny}+1$.
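The two constraints on wt can be verified ahead of the call. The sketch below assumes nine observations and two variables in each set, as in the Example section, and uses a hypothetical weight vector that excludes one observation; `nEff` and `ok` are illustrative names.

```matlab
n  = 9;
nx = 2;
ny = 2;

wt = ones(n, 1);
wt(8) = 0;                 % a zero weight excludes observation 8

nEff = sum(wt);            % effective number of observations
ok = all(wt >= 0) && (nEff >= nx + ny + 1);

% With the NAG Toolbox available, the weights would be supplied as an
% optional name/value pair, following the Syntax section:
% [e, ncv, cvx, cvy, ifail] = nag_mv_canon_corr(z, isz, nx, ny, mcv, tol, 'wt', wt);
```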

### Input Parameters Omitted from the MATLAB Interface

weight ldz lde ldcvx ldcvy wk iwk

### Output Parameters

1:     e(lde,6) – double array
$\mathit{lde}\ge \min(\mathbf{nx},\mathbf{ny})$.
The statistics of the canonical variate analysis.
$\mathbf{e}(i,1)$
The canonical correlations, $\delta_i$, for $i=1,2,\dots,l$.
$\mathbf{e}(i,2)$
The eigenvalues of $\Sigma$, $\lambda_i^2$, for $i=1,2,\dots,l$.
$\mathbf{e}(i,3)$
The proportion of variation explained by the $i$th canonical variate, for $i=1,2,\dots,l$.
$\mathbf{e}(i,4)$
The $\chi^2$ statistic for the $i$th canonical variate, for $i=1,2,\dots,l$.
$\mathbf{e}(i,5)$
The degrees of freedom for the $\chi^2$ statistic for the $i$th canonical variate, for $i=1,2,\dots,l$.
$\mathbf{e}(i,6)$
The significance level for the $\chi^2$ statistic for the $i$th canonical variate, for $i=1,2,\dots,l$.
2:     ncv – int64/int32/nag_int scalar
The number of canonical correlations, $l$. This will be the minimum of the rank of $X$ and the rank of $Y$.
3:     cvx(ldcvx,mcv) – double array
$\mathit{ldcvx}\ge \mathbf{nx}$.
The canonical variate loadings for the $x$ variables. $\mathbf{cvx}(i,j)$ contains the loading coefficient for the $i$th $x$ variable on the $j$th canonical variate.
4:     cvy(ldcvy,mcv) – double array
$\mathit{ldcvy}\ge \mathbf{ny}$.
The canonical variate loadings for the $y$ variables. $\mathbf{cvy}(i,j)$ contains the loading coefficient for the $i$th $y$ variable on the $j$th canonical variate.
5:     ifail – int64/int32/nag_int scalar
$\mathbf{ifail}=0$ unless the function detects an error (see [Error Indicators and Warnings]).
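As a sketch of how the columns of e might be read, the snippet below takes the matrix printed in the Example section and applies the sequential rule from the Description: tests are examined in order and everything after the first non-significant one is ignored. The 5% level and the names `alpha`, `nsig` and `cumprop` are assumptions made for illustration.

```matlab
% e as printed in the Example section (one row per canonical variate).
e = [0.9570, 0.9159, 0.8746, 14.3914, 4.0000, 0.0061;
     0.3624, 0.1313, 0.1254,  0.7744, 1.0000, 0.3789];

alpha = 0.05;                       % assumed significance level

% Index of the first non-significant test, minus one, gives the number of
% canonical variates retained; the appended 1 handles the all-significant case.
nsig = find([e(:, 6); 1] > alpha, 1) - 1;

% Cumulative proportion of variation explained (column 3).
cumprop = cumsum(e(:, 3));

fprintf('variates retained: %d\n', nsig);
```

With these values only the first test is significant at the 5% level, so a single canonical variate is retained; by column 3 it accounts for about 87% of the variation.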

## Error Indicators and Warnings

Errors or warnings detected by the function:

Cases prefixed with W are classified as warnings and do not generate an error of type NAG:error_n. See nag_issue_warnings.

$\mathbf{ifail}=1$
 On entry,
 • $\mathbf{nx}<1$,
 • or $\mathbf{ny}<1$,
 • or $\mathbf{m}<\mathbf{nx}+\mathbf{ny}$,
 • or $\mathbf{n}\le \mathbf{nx}+\mathbf{ny}$,
 • or $\mathbf{mcv}<\min(\mathbf{nx},\mathbf{ny})$,
 • or $\mathit{ldz}<\mathbf{n}$,
 • or $\mathit{ldcvx}<\mathbf{nx}$,
 • or $\mathit{ldcvy}<\mathbf{ny}$,
 • or $\mathit{lde}<\min(\mathbf{nx},\mathbf{ny})$,
 • or $\mathbf{nx}\ge \mathbf{ny}$ and $\mathit{iwk}<\mathbf{n}\times \mathbf{nx}+\mathbf{nx}+\mathbf{ny}+\max\left(5\times (\mathbf{nx}-1)+\mathbf{nx}\times \mathbf{nx},\ \mathbf{n}\times \mathbf{ny}\right)$,
 • or $\mathbf{nx}<\mathbf{ny}$ and $\mathit{iwk}<\mathbf{n}\times \mathbf{ny}+\mathbf{nx}+\mathbf{ny}+\max\left(5\times (\mathbf{ny}-1)+\mathbf{ny}\times \mathbf{ny},\ \mathbf{n}\times \mathbf{nx}\right)$,
 • or $\mathit{weight}\ne \text{'U'}$ or $\text{'W'}$,
 • or $\mathbf{tol}<0.0$.
$\mathbf{ifail}=2$
 On entry, $\mathit{weight}=\text{'W'}$ and a value of $\mathbf{wt}<0.0$.
$\mathbf{ifail}=3$
 On entry, the number of $x$ variables to be included in the analysis as indicated by isz is not equal to nx, or the number of $y$ variables to be included in the analysis as indicated by isz is not equal to ny.
$\mathbf{ifail}=4$
 On entry, the effective number of observations is less than $\mathbf{nx}+\mathbf{ny}+1$.
$\mathbf{ifail}=5$
 A singular value decomposition has failed to converge. See nag_eigen_real_triang_svd (f02wu). This is an unlikely error exit.
W $\mathbf{ifail}=6$
 A canonical correlation is equal to $1$. This will happen if the $x$ and $y$ variables are perfectly correlated.
W $\mathbf{ifail}=7$
 On entry, the rank of the $X$ matrix or the rank of the $Y$ matrix is $0$. This will happen if all the $x$ or $y$ variables are constants.
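A caller that receives ifail may want to separate the warning exits (6 and 7) from the remaining values. The following is a minimal sketch of that classification only; the value assigned to `ifail` is hypothetical, and `status` is an illustrative name rather than anything defined by the toolbox.

```matlab
ifail = int64(6);      % hypothetical return value, for illustration only

% Values 6 and 7 are the cases marked W in the list above.
isWarning = any(double(ifail) == [6, 7]);

if ifail == 0
    status = 'ok';
elseif isWarning
    status = 'warning';
else
    status = 'error';
end

fprintf('ifail = %d (%s)\n', ifail, status);
```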

## Accuracy

As the computation involves the use of orthogonal matrices and a singular value decomposition rather than the traditional computing of a sum of squares matrix and the use of an eigenvalue decomposition, nag_mv_canon_corr (g03ad) should be less affected by ill-conditioned problems.

## Further Comments

None.

## Example

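```
function nag_mv_canon_corr_example
% Nine observations on four variables.
z = [80, 58.4, 14,   21;
     75, 59.2, 15,   27;
     78, 60.3, 15,   27;
     75, 57.4, 13,   22;
     79, 59.5, 14,   26;
     78, 58.1, 14.5, 26;
     75, 58,   12.5, 23;
     64, 55.5, 11,   22;
     80, 59.2, 12.5, 22];
% Columns 2 and 3 are x variables (> 0); columns 1 and 4 are y variables (< 0).
isz = [int64(-1);1;1;-1];
nx = int64(2);    % number of x variables
ny = int64(2);    % number of y variables
mcv = int64(2);   % upper limit on the number of canonical variates
tol = 1e-06;      % tolerance used in the rank decisions
% No trailing semicolon, so all outputs are displayed.
[e, ncv, cvx, cvy, ifail] = nag_mv_canon_corr(z, isz, nx, ny, mcv, tol)
```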
```

e =

0.9570    0.9159    0.8746   14.3914    4.0000    0.0061
0.3624    0.1313    0.1254    0.7744    1.0000    0.3789

ncv =

2

cvx =

-0.4261    1.0337
-0.3444   -1.1136

cvy =

-0.1415    0.1504
-0.2384   -0.3424

ifail =

0

```
```
function g03ad_example
% Same data and arguments as above, calling the routine by its short name.
z = [80, 58.4, 14,   21;
     75, 59.2, 15,   27;
     78, 60.3, 15,   27;
     75, 57.4, 13,   22;
     79, 59.5, 14,   26;
     78, 58.1, 14.5, 26;
     75, 58,   12.5, 23;
     64, 55.5, 11,   22;
     80, 59.2, 12.5, 22];
isz = [int64(-1);1;1;-1];
nx = int64(2);
ny = int64(2);
mcv = int64(2);
tol = 1e-06;
[e, ncv, cvx, cvy, ifail] = g03ad(z, isz, nx, ny, mcv, tol)
```