PDF version (NAG web site
, 64bit version, 64bit version)
NAG Toolbox Chapter Introduction
G03 — Multivariate Methods
Scope of the Chapter
This chapter is concerned with methods for studying multivariate data. A multivariate dataset consists of several variables recorded on a number of objects or individuals. Multivariate methods can be classified as those that seek to examine the relationships between the variables (e.g., principal components), known as variabledirected methods, and those that seek to examine the relationships between the objects (e.g., cluster analysis), known as individualdirected methods.
Multiple regression is not included in this chapter as it involves the relationship of a single variable, known as the response variable, to the other variables in the dataset, the explanatory variables. Routines for multiple regression are provided in
Chapter G02.
Background to the Problems
Variabledirected Methods
Let the n$n$ by p$p$ data matrix consist of p$p$ variables, x_{1},x_{2}, … ,x_{p}${x}_{1},{x}_{2},\dots ,{x}_{p}$, observed on n$n$ objects or individuals. Variabledirected methods seek to examine the linear relationships between the p$p$ variables with the aim of reducing the dimensionality of the problem. There are different methods depending on the structure of the problem. Principal component analysis and factor analysis examine the relationships between all the variables. If the individuals are classified into groups, then canonical variate analysis examines the between group structure. If the variables can be considered as coming from two sets, then canonical correlation analysis examines the relationships between the two sets of variables. All four methods are based on an eigenvalue decomposition or a singular value decomposition (SVD) of an appropriate matrix.
The above methods may reduce the dimensionality of the data from the original p$p$ variables to a smaller number, k$k$, of derived variables that adequately represent the data. In general, these k$k$ derived variables will be unique only up to an orthogonal rotation. Therefore, it may be useful to see if there exists suitable rotations of these variables that lead to a simple interpretation of the new variables in terms of the original variables.
Principal component analysis
Principal component analysis finds new variables which are linear combinations of the p$p$ observed variables so that they have maximum variation and are orthogonal (uncorrelated).
Let
S$S$ be the
p$p$ by
p$p$ variancecovariance matrix of the
n$n$ by
p$p$ data matrix. A vector
a_{1}${a}_{1}$ of length
p$p$ is found such that
The variable
z_{1} = ∑ _{i = 1}^{p}a_{1i}x_{i}${z}_{1}={\displaystyle \sum _{i=1}^{p}}{a}_{1i}{x}_{i}$ is known as the first principal component and gives the linear combination of the variables that gives the maximum variation. A second principal component,
z_{2} = ∑ _{i = 1}^{p}a_{2i}x_{i}${z}_{2}={\displaystyle \sum _{i=1}^{p}}{a}_{2i}{x}_{i}$, is found such that
This gives the linear combination of variables, orthogonal to the first principal component, that gives the maximum variation. Further principal components are derived in a similar way.
The vectors a_{i}${a}_{\mathit{i}}$, for i = 1,2, … ,p$\mathit{i}=1,2,\dots ,p$, are the eigenvectors of the matrix S$S$ and associated with each eigenvector is the eigenvalue, γ_{i}^{2}${\gamma}_{i}^{2}$. The value of γ_{i}^{2} / ∑ γ_{i}^{2}${\gamma}_{i}^{2}/\sum {\gamma}_{i}^{2}$ gives the proportion of variation explained by the i$i$th principal component. Alternatively, the a_{i}${a}_{i}$ can be considered as the right singular vectors in a SVD of a scaled meancentred data matrix. The singular values of the SVD are the γ_{i}${\gamma}_{i}$values.
Often fewer than p$p$ dimensions (principal components) are needed to represent most of the variation in the data. A test on the smaller eigenvalues can be used to investigate the number of dimensions needed.
The values of the principal component variables for the individuals are known as the principal component scores. These can be standardized so that the variance of these scores for each principal component is 1.0$1.0$ or equal to the corresponding eigenvalue. The principal component scores correspond to the lefthand singular vectors in the SVD.
Factor analysis
Let the
p$p$ variables have variancecovariance matrix
Σ$\Sigma $. The aim of factor analysis is to account for the covariances in these
p$p$ variables in terms of a smaller number,
k$k$, of hypothetical variables or factors,
f_{1},f_{2}, … ,f_{k}${f}_{1},{f}_{2},\dots ,{f}_{k}$. These are assumed to be independent and to have unit variance. The relationship between the observed variables and the factors is given by the model
where
λ_{ij}${\lambda}_{\mathit{i}\mathit{j}}$, for
i = 1,2, … ,p$\mathit{i}=1,2,\dots ,p$ and
j = 1,2, … ,k$\mathit{j}=1,2,\dots ,k$, are the factor loadings and
e_{i}${e}_{\mathit{i}}$, for
i = 1,2, … ,p$\mathit{i}=1,2,\dots ,p$, are independent random variables with variances
ψ_{i}${\psi}_{i}$. These represent the unique component of the variation of each observed variable. The proportion of variation for each variable accounted for by the factors is known as the communality.
The model for the variancecovariance matrix,
Σ$\Sigma $, can then be written as
where
Λ$\Lambda $ is the matrix of the factor loadings,
λ_{ij}${\lambda}_{ij}$, and
Ψ$\Psi $ is a diagonal matrix of the unique variances
ψ_{i}${\psi}_{i}$.
If it is assumed that both the
k$k$ factors and the
e_{i}${e}_{i}$ follow independent Normal distributions then the parameters of the model,
Λ$\Lambda $ and
Ψ$\Psi $, can be estimated by maximum likelihood, as described by
Lawley and Maxwell (1971). The computation of the maximum likelihood estimates is an iterative procedure which involves computing the eigenvalues and eigenvectors of the matrix
where
S$S$ is the sample variancecovariance matrix. Alternatively, the SVD of the matrix
RΨ^{ − 1 / 2}$R{\Psi}^{1/2}$ can be used, where
R^{T}R = S${R}^{\mathrm{T}}R=S$. When convergence has been achieved, the estimates
Λ̂$\hat{\Lambda}$, of
Λ$\Lambda $, are obtained by scaling the eigenvectors of
S^{*}${S}^{*}$. The use of maximum likelihood estimation means that likelihood ratio tests can be constructed to test for the number of factors required.
Having found the estimates of the parameters of the model, the estimates of the values of the factors for the individuals, the
factor
scores, can be computed. These involve the calculation of the
factor score coefficients. Two common methods of computing factor score coefficients are the regression method and Bartlett's method. Bartlett's method gives unbiased estimates of the factor scores while the estimates from the regression method are biased but have smaller variance; see
Lawley and Maxwell (1971).
Canonical variate analysis
If the individuals can be classified into one of g$g$ groups, then canonical variate analysis finds the linear combinations of the p$p$ variables that maximize the ratio of the betweengroup variation to the withingroup variation. These variables are known as canonical variates. As the canonical variates provide discrimination between the groups, the method is also known as canonical discrimination.
The canonical variates can be calculated from the eigenvectors of the withingroup sums of squares and crossproducts matrix or from the SVD of the matrix
where
Q_{g}${Q}_{g}$ is an orthogonal matrix that defines the groups and
Q_{x}${Q}_{x}$ is the first
p$p$ columns of the orthogonal matrix
Q$Q$ from the
QR$QR$ decomposition of the data matrix with the variable means subtracted. If the data matrix is not of full rank, the
Q_{x}${Q}_{x}$ matrix can be obtained from a SVD. If the SVD of
V$V$ is
then the nonzero elements (
δ_{i} > 0${\delta}_{i}>0$) of the diagonal matrix
Δ$\Delta $ are the canonical correlations. The largest
δ_{i}${\delta}_{i}$ is called the
first canonical correlation and associated with it is the first canonical variate.
The eigenvalues,
γ_{i}^{2}${\gamma}_{i}^{2}$, of the withingroup sums of squares matrix are given by
The value of
π_{i} = γ_{i}^{2} / ∑ γ_{i}^{2}${\pi}_{i}={\gamma}_{i}^{2}/\sum {\gamma}_{i}^{2}$ gives the proportion of variation explained by the
i$i$th canonical variate. The values of the
π_{i}${\pi}_{i}$ give an indication as to how many canonical variates are needed to adequately describe the data, i.e., the dimensionality of the problem. The number of dimensions can be investigated by means of a test on the smaller canonical correlations.
The canonical variate loadings and the relationship between the original variables and the canonical variates are calculated from the matrix U_{x}${U}_{x}$. This matrix is scaled so that the canonical variates have unit variance.
Canonical correlation analysis
If the
p$p$ variables can be considered as coming from two sets then canonical correlation analysis finds linear combinations of the variables in each set, known as canonical variates, such that the correlations between corresponding canonical variates for the two sets are maximized. Let the two sets of variables be denoted by
x$x$ and
y$y$, with
p_{x}${p}_{x}$ and
p_{y}${p}_{y}$ variables in each set respectively. Let the variancecovariance of the dataset be
and let
then the canonical correlations can be calculated from the eigenvalues of the matrix
Σ$\Sigma $. Alternatively, the canonical correlations can be calculated by means of a SVD of the matrix
where
Q_{x}${Q}_{x}$ is the first
p_{x}${p}_{x}$ columns of the orthogonal matrix
Q$Q$ from the
QR$QR$ decomposition of the
x$x$variables in the data matrix, and
Q_{y}${Q}_{y}$ is the first
p_{y}${p}_{y}$ columns of the
Q$Q$ matrix of the
QR$QR$ decomposition of the
y$y$variables in the data matrix. In both cases, the variable means are subtracted before the
QR$QR$ decomposition is computed. If either set of variables is not of full rank, an SVD can be used instead of the
QR$QR$ decomposition. If the SVD of
V$V$ is
then the nonzero elements (
δ_{i} > 0${\delta}_{i}>0$) of the diagonal matrix
Δ$\Delta $ are the canonical correlations. The largest
δ_{i}${\delta}_{i}$ is called the
first canonical correlation and associated with it is the first canonical variate. The eigenvalues,
γ_{i}^{2}${\gamma}_{i}^{2}$, of the matrix
Σ$\Sigma $ are given by
The value of
π_{i} = γ_{i}^{2} / ∑ γ_{i}^{2}${\pi}_{i}={\gamma}_{i}^{2}/\sum {\gamma}_{i}^{2}$ gives the proportion of variation explained by the
i$i$th canonical variate. The values of the
π_{i}${\pi}_{i}$ give an indication as to how many canonical variates are needed to adequately describe the data, i.e., the dimensionality of the problem; this can also be investigated by means of a test on the smaller values of the
γ_{i}^{2}${\gamma}_{i}^{2}$.
The relationship between the canonical variables and the original variables, the canonical variate loadings, can be computed from the U_{x}${U}_{x}$ and U_{y}${U}_{y}$ matrices.
Rotations
There are two principal reasons for using rotations: either
(a) 
simplifying the structure to aid interpretation of derived variables, or 
(b) 
comparing two or more datasets or sets of derived variables. 
The most common type of rotations used for
(a) are
orthogonal rotations. If
Λ$\Lambda $ is the
p$p$ by
k$k$ loading matrix from a variabledirected multivariate method, then the rotations are selected such that the elements,
λ_{ij}^{ * }${\lambda}_{ij}^{*}$, of the rotated loading matrix,
Λ^{*}${\Lambda}^{*}$, are either relatively large or small. The rotations may be found by minimizing the criterion
where the constant,
γ$\gamma $, gives a family of rotations, with
γ = 1$\gamma =1$ giving
varimax rotations and
γ = 0$\gamma =0$ giving
quartimax rotations.
Given an orthogonal rotation matrix X$X$, a solution may be further simplified by removing the orthogonality restriction with an oblique ProMax rotation. Let Y$Y$ denote the matrix defined by a power transformation of X$X$, designed to increase high values in X$X$ and decrease low values. Then the ProMax solution is based on a least squares fit of X$X$ to Y$Y$.
For
(b) Procrustes rotations are used. Let
A$A$ and
B$B$ be two
l$l$ by
m$m$ matrices, which can be considered as representing
l$l$ points in
m$m$ dimensions. One example is if
A$A$ is the loading matrix from a variabledirected multivariate method and
B$B$ is a hypothesised pattern matrix. In order to try to match the points in
A$A$ and
B$B$ there are three steps:
(i) 
translate so that centroids of both matrices are at the origin, 
(ii) 
find a rotation that minimizes the sum of squared distances between corresponding points of the matrices, 
(iii) 
scale the matrices. 
For a more detailed description, see
Krzanowski (1990).
Individualdirected Methods
While dealing with the same
n$n$ by
p$p$ data matrix as variabledirected methods, the emphasis is the
n$n$ objects or individuals rather than the
p$p$ variables. The methods are generally based on an
n$n$ by
n$n$ distance or dissimilarity matrix such that the (
k,j$k,j$)th element gives a measure of how ‘far apart’ the individuals
k$k$ and
j$j$ are. Alternatively, a similarity matrix can be used which measures how ‘close’ individuals are. The form of the measure of distance or similarity will depend upon the form of the
p$p$ variables. For continuous variables it is usually assumed that some form of Euclidean distance is suitable. That is, for
x_{ki}${x}_{ki}$ and
x_{ji}${x}_{ji}$ measured for individuals
k$k$ and
j$j$ on variable
i$i$ respectively, the contribution to distance between individuals
k$k$ and
j$j$ from variable
i$i$ is given by
Often there will be a need to scale the variables to produce satisfactory distances. For discrete variables, there are various measures of similarity or distance that can easily be computed. For example, for binary data a measure of similarity could be
 1$1$ – if the individuals take the same value,
 0$0$ – otherwise.
Given a measure of distance between individuals, there are three basic tasks that can be performed.
(i) 
Group the individuals; that is, collect the individuals into groups so that those within a group are closer to each other than they are to members of another group. 
(ii) 
Classify individuals; that is, if some individuals are known to come from certain groups, allocate individuals whose group membership is unknown, to the nearest group. 
(iii) 
Map the individuals; that is, produce a multidimensional diagram in which the distances on the diagram represent the distances between the individuals. 
In the above,
(i) leads to cluster analysis,
(ii) leads to discriminant analysis and
(iii) leads to scaling methods.
Hierarchical cluster analysis
Approaches for cluster analysis can be classified into two types: hierarchical and nonhierarchical. Hierarchical cluster analysis produces a series of overlapping groups or clusters ranging from separate individuals to one single cluster. For example, five individuals could be hierarchically clustered as follows.
Step 1 
(1)$\left(1\right)$ 
(2)$\left(2\right)$ 
(3)$\left(3\right)$ 
(4)$\left(4\right)$ 
(5)$\left(5\right)$ 
Step 2 
(1,2)$(1,2)$ 
(3)$\left(3\right)$ 
(4)$\left(4\right)$ 
(5)$\left(5\right)$ 
Step 3 
(1,2)$(1,2)$ 
(3,4)$(3,4)$ 
(5)$\left(5\right)$ 
Step 4 
(1,2)$(1,2)$ 
(3,4,5)$(3,4,5)$ 
Step 5 
(1,2,3,4,5)$(1,2,3,4,5)$ 
The clusters at a level are constructed from the clusters at a previous level. There are two basic approaches to hierarchical cluster analysis: agglomerative methods which build up clusters starting from individuals until there is only one cluster, or divisive methods which start with a single cluster and split clusters until the individual level is reached. This chapter contains the more common agglomerative methods.
The stages in a hierarchical cluster analysis are usually as follows.
(i) 
form a distance matrix; 
(ii) 
use selected criterion to form hierarchy; 
(iii) 
print cluster information in the form of a dendrogram or use information to form a set of clusters. 
These three stages will be considered in turn.
(i) 
Form a distance matrix
For the n$n$ by p$p$ data matrix X$X$, a general measure of the distance between object j$j$ and object k$k$, d_{jk}${d}_{jk}$, is
where x_{ji}${x}_{ji}$ and x_{ki}${x}_{ki}$ are the (j,i)$(j,i)$th and (k,i)$(k,i)$th elements of X$X$, s_{i}${s}_{i}$ is a standardization for the i$i$th variable and D(u,v)$D(u,v)$ is a suitable function. Three common distances for continuous variables are:
(a) 
Euclidean distance: D(u,v) = (u − v)^{2}$D(u,v)={(uv)}^{2}$ and α = (1/2)
$\alpha =\frac{1}{2}$. 
(b) 
Euclidean squared distance: D(u,v) = (u − v)^{2}$D(u,v)={(uv)}^{2}$ and α = 1$\alpha =1$. 
(c) 
Absolute distance (city block metric): D(u,v) = u − v$D(u,v)=uv$ and α = 1$\alpha =1$. 
The common standardizations are the standard deviation and the range. For dichotomous variables there are a number of different measures (see Krzanowski (1990) and Everitt (1974)); these are usually easy to compute. If the individuals in a cluster analysis are themselves variables, then a suitable distance measure will be based on the correlation coefficient for continuous variables and contingency table statistics for discrete data.

(ii) 
Form Hierarchy
Given a distance matrix for the n$n$ individuals, an agglomerative clustering method produces a hierarchical tree by starting with n$n$ clusters, each with a single individual and then at each of n − 1$n1$ stages, merging two clusters to form a larger cluster until all individuals are in a single cluster. At each stage, the two clusters that are nearest are merged to form a new cluster and a new distance matrix is computed for the reduced number of clusters.
Methods differ as to how the distances between the new cluster and other clusters are computed. For three clusters i$i$, j$j$ and k$k$, let n_{i}${n}_{i}$, n_{j}${n}_{j}$ and n_{k}${n}_{k}$ be the number of objects in each cluster, and let d_{ij}${d}_{ij}$, d_{ik}${d}_{ik}$ and d_{jk}${d}_{jk}$ be the distances between the clusters. If clusters j$j$ and k$k$, are to be merged to give cluster jk$jk$, then the distance from cluster i$i$ to cluster jk$jk$, d_{i . jk}${d}_{i.jk}$, can be computed in the following ways.
(a) 
Single link or nearest neighbour: d_{i . jk} = min (d_{ij},d_{ik})${d}_{i.jk}=\mathrm{min}\phantom{\rule{0.125em}{0ex}}({d}_{ij},{d}_{ik})$. 
(b) 
Complete link or furthest neighbour: d_{i . jk} = max (d_{ij},d_{ik})${d}_{i.jk}=\mathrm{max}\phantom{\rule{0.125em}{0ex}}({d}_{ij},{d}_{ik})$. 
(c) 
Group average: d_{i . jk} = (n_{j})/(n_{j} + n_{k})d_{ij} + (n_{k})/(n_{j} + n_{k})d_{ik}${d}_{i.jk}=\frac{{n}_{j}}{{n}_{j}+{n}_{k}}{d}_{ij}+\frac{{n}_{k}}{{n}_{j}+{n}_{k}}{d}_{ik}$. 
(d) 
Centroid: d_{i . jk} = (n_{j})/(n_{j} + n_{k})d_{ij} + (n_{k})/(n_{j} + n_{k})d_{ik} − (n_{j}n_{k})/((n_{j} + n_{k})^{2})d_{jk}${d}_{i.jk}=\frac{{n}_{j}}{{n}_{j}+{n}_{k}}{d}_{ij}+\frac{{n}_{k}}{{n}_{j}+{n}_{k}}{d}_{ik}\frac{{n}_{j}{n}_{k}}{{({n}_{j}+{n}_{k})}^{2}}{d}_{jk}$. 
(e) 
Median: d_{i . jk} = (1/2)d_{ij} + (1/2)d_{ik} − (1/4)d_{jk}${d}_{i.jk}=\frac{1}{2}{d}_{ij}+\frac{1}{2}{d}_{ik}\frac{1}{4}{d}_{jk}$. 
(f) 
Minimum variance: d_{i . jk} = [(n_{i} + n_{j})d_{ij} + (n_{i} + n_{k})d_{ik} − n_{i}d_{jk}] / (n_{i} + n_{j} + n_{k})${d}_{i.jk}=[({n}_{i}+{n}_{j}){d}_{ij}+({n}_{i}+{n}_{k}){d}_{ik}{n}_{i}{d}_{jk}]/({n}_{i}+{n}_{j}+{n}_{k})$. 

(iii) 
Produce Dendrogram and Clusters
Hierarchical cluster analysis can be represented by a tree that shows at which distance the clusters merge. Such a tree is known as a dendrogram; see Everitt (1974) and Krzanowski (1990).
A simple example is
Figure 1
The end points of the dendrogram represent the individuals that have been clustered.
Alternatively, the information from the tree can be used to produce either a chosen number of clusters or the clusters that exist at a given distance. The latter is equivalent to taking the dendrogram and drawing a line across at a given distance to produce clusters.

Nonhierarchical clustering
Nonhierarchical cluster analysis usually forms a given number of clusters from the data. There is no requirement that if first k − 1$k1$ and then k$k$ clusters were requested then the k − 1$k1$ clusters would be formed from the k$k$ clusters.
Most nonhierarchical methods of cluster analysis seek to partition the set of individuals into a number of clusters so as to optimize a criterion. The number of clusters is usually specified prior to the analysis. One commonly used criterion is the withincluster sum of squares. Given
n$n$ individuals with
p$p$ variables measured on each individual,
x_{ij}${x}_{\mathit{i}\mathit{j}}$, for
i = 1,2, … ,n$\mathit{i}=1,2,\dots ,n$ and
j = 1,2, … ,p$\mathit{j}=1,2,\dots ,p$, the withincluster sum of squares for
K$K$ clusters is
where
S_{k}${S}_{k}$ is the set of objects in the
k$k$th cluster and
x_{kj}${\stackrel{}{x}}_{kj}$ is the mean for the variable
j$j$ over cluster
k$k$. Starting with an initial allocation of individuals to clusters, the method then seeks to minimize
SS_{c}${\mathrm{SS}}_{c}$ by a series of reallocations. This is often known as
K$K$means clustering.
In the
K$K$means case individuals belong to a single cluster and are excluded from all remaining clusters. Alternatively, probabilities of cluster membership can be estimated and each cluster can have its own distributional properties. For example, given an initial set of probabilities, the Normal (Gaussian) mixture model uses the E–M method of
Dempster et al. (1977) to maximize the sum of loglikelihoods over
K$K$ clusters for a given covariance model ranging from pooled variance to individual covariance matrices.
Discriminant analysis
Discriminant analysis is concerned with the
allocation of objects to
n_{g}${n}_{g}$ groups on the basis of observations on those objects using an allocation rule. This rule is computed from observations coming from a
training set in which group membership is known. The allocation rule is based on the distance between the object and an estimate of the location of the groups. If
p$p$ variables are observed and the vector of means for the
j$j$th group in the training set are
x_{j}${\stackrel{}{x}}_{j}$ then the usual measure of the distance of an observation,
x_{k}${x}_{k}$, from the
j$j$th group mean is given by Mahalanobis squared distance
where
S_{*}${S}_{*}$ is either the withingroup variancecovariance matrix,
S_{j}${S}_{j}$, for the
n_{j}${n}_{j}$ objects in the
j$j$th group, or a pooled variancecovariance matrix,
S$S$, computed from all
n$n$ objects from all groups where
If the withingroup variancecovariance matrices can be assumed to be equal then the pooled variancecovariance matrix can be used. This assumption can be tested using the test statistic
where
For large
n$n$,
G$G$ is approximately distributed as a
χ^{2}${\chi}^{2}$ variable with
(1/2)p(p + 1)(n_{g} − 1)$\frac{1}{2}p(p+1)({n}_{g}1)$ degrees of freedom; see
Morrison (1967).
In addition to the distances, a set of prior probabilities of group membership, π_{j}${\pi}_{\mathit{j}}$, for j = 1,2, … ,n_{g}$\mathit{j}=1,2,\dots ,{n}_{g}$, may be used. The prior probabilities reflect your view as to the likelihood of the objects coming from the different groups.
It is generally assumed that the
p$p$ variables follow a multivariate Normal distribution with, for the
j$j$th group, mean
μ_{j}${\mu}_{j}$ and variancecovariance matrix
Σ_{j}${\Sigma}_{j}$. If
p
(
x_{k} ∣
μ_{j}
,
Σ_{j}
)
$p({x}_{k}\mid {\mu}_{j},{\Sigma}_{j})$ is the probability of observing the observation
x_{k}${x}_{k}$ from group
j$j$, then the posterior probability of belonging to group
j$j$ is
An observation is allocated to the group with the highest posterior probability.
In the estimative approach to discrimination, the parameters μ_{j}${\mu}_{j}$ and Σ_{j}${\Sigma}_{j}$ in p(j ∣ x_{k},μ_{j},Σ_{j})$p(j\mid {x}_{k},{\mu}_{j},{\Sigma}_{j})$ are replaced by their estimates calculated from the training set. If it is assumed that the withingroup variancecovariance matrices are equal then the linear discriminant function is obtained; otherwise if it is assumed that the variancecovariance matrices are unequal then the quadratic discriminant function is obtained.
In the Bayesian
predictive approach, a noninformative prior distribution is used for the parameters giving the posterior distribution for the parameters from the training set,
X_{t}${X}_{t}$, of,
p
(μ_{j},
Σ_{j} ∣
X_{t}
)
$p({\mu}_{j},{\Sigma}_{j}\mid {X}_{t})$. A predictive distribution is then obtained by integrating
p
(j ∣ x_{k},μ_{j},Σ_{j})
p
(μ_{j},
Σ_{j} ∣
X)
$p(j\mid {x}_{k},{\mu}_{j},{\Sigma}_{j})p({\mu}_{j},{\Sigma}_{j}\mid X)$ over the parameter space. This predictive distribution,
p
(x_{k} ∣ X_{t})
$p({x}_{k}\mid {X}_{t})$, then replaces
p
(
x_{k} ∣
μ_{j}
,Σ_{j})
$p({x}_{k}\mid {\mu}_{j},{\Sigma}_{j})$ to give
In addition to allocating the objects to groups, an atypicality index for each object and for each group can be computed. This represents the probability of obtaining an observation more typical of the group than that observed. A high value of the atypicality index for all groups indicates that the observation may in fact come from a group not represented in the training set.
Alternative approaches to discrimination are the use of canonical variates and logistic discrimination. Canonical variate analysis is described above and as it seeks to find the directions that best discriminate between groups these directions can also be used to allocate further observations. This can be viewed as an extension of Fisher's linear discriminant function. This approach does not assume that the data is Normally distributed, but Fisher's linear discriminant function may not perform well on nonNormal data. In the case of two groups, logistic regression can be performed with the response variable indicating the group allocation and the variables in the discriminant analysis being the explanatory variables. Allocation can then be made on the basis of the fitted response value. This is known as logistic discrimination and can be shown to be valid for a wide range of distributional assumptions.
Scaling methods
Scaling methods seek to represent the observed dissimilarities or distances between objects as distances between points in Euclidean space. For example if the distances between objects A, B and C were 3$3$, 4$4$ and 5$5$, the distances could be represented exactly by three points in twodimensional space. Only their relative positions would be important, the whole configuration of points could be rotated or shifted without effecting the distances between the points. If a onedimensional representation was required, the ‘best’ representation might give distances of 2(1/3),3(1/3)
$2\frac{1}{3},3\frac{1}{3}$ and 5(2/3)
$5\frac{2}{3}$, which may be an adequate representation. If the distances were 3$3$, 4$4$ and 8$8$ then these distances could not be exactly represented in Euclidean space, even in two dimensions, the best representation being the three points in a straight line giving distances 3$3$, 4$4$ and 7$7$.
In practice, the use of scaling methods has to decide upon the number of dimensions in which the data is to be represented. The smaller the number the easier it will be to assimilate the information. The chosen number of dimensions needs to give an adequate representation of the data but will often not give an exact representation because either the number of chosen dimensions is too small or the data cannot be represented in Euclidean space.
Two basic methods are available depending on the nature of the dissimilarities or distances being analysed. If the distances can be assumed to satisfy the metric inequality
then the distances can be represented exactly by points in Euclidean space and the technique known as metric scaling, classical scaling or principal coordinate analysis can be used. This technique involves the computing of the eigenvalues of a matrix derived from the distance matrix. The eigenvectors corresponding to the
k$k$ largest positive eigenvalues gives the best
k$k$ dimensions in which to represent the objects. If there are negative eigenvalues then the distance matrix cannot be represented in Euclidean space.
Instead of the above approach of requiring the distances from the points to match the distances from the objects as closely as possible, sometimes only a rank order equivalence is required. That is, the
i$i$th largest distance between objects should, as far as possible, be represented by the
i$i$th largest distance between points. This would be appropriate when the dissimilarities are based on subjective rankings. For example, if the objects were foods then a number of judges rank the foods for different qualities such as taste and texture, the resulting distances would not necessarily obey the metric inequality, but the rank order would be significant. Alternatively, by relaxing the requirement from matching distances to rank order equivalence only, the number of dimensions required to represent the distance matrix may be decreased. The requirement of rank order equivalence leads to nonmetric or ordinal multidimensional scaling. The criterion used to measure the closeness of the fitted distance matrix to the observed distance matrix is known as STRESS, which is given by
where
d_{ij}^{2}^{^}$\hat{{d}_{ij}^{2}}$ is the Euclidean squared distance between the computed points
i$i$ and
j$j$, and
d_{ij}^{~}$\stackrel{~}{{d}_{ij}}$ is the fitted distance obtained when
d_{ij}^{^}$\hat{{d}_{ij}}$ is monotonically regressed on the observed distances
d_{ij}${d}_{ij}$; that is,
d_{ij}^{~}$\stackrel{~}{{d}_{ij}}$ is monotonic relative to
d_{ij}${d}_{ij}$ and is obtained from
d_{ij}^{^}$\hat{{d}_{ij}}$ with the smallest number of changes. Thus STRESS is a measure of by how much the set of points preserve the order of the distances in the original distance matrix, and nonmetric multidimensional scaling seeks to find the set of points that minimize the STRESS.
Recommendations on Choice and Use of Available Functions
See
Section [Functionality Index] for a list of functions available in this Chapter.
Note also that
nag_correg_glm_binomial (g02gb) will fit a logistic regression model and can be used for logistic discrimination.
Functionality Index
References
Chatfield C and Collins A J (1980) Introduction to Multivariate Analysis Chapman and Hall
Dempster A P, Laird N M and Rubin D B (1977) Maximum likelihood from incomplete data via the EM$EM$ algorithm (with discussion) J. Roy. Statist. Soc. Ser. B 39 1–38
Everitt B S (1974) Cluster Analysis Heinemann
Gnanadesikan R (1977) Methods for Statistical Data Analysis of Multivariate Observations Wiley
Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20(3) 2–25
Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press
Lawley D N and Maxwell A E (1971) Factor Analysis as a Statistical Method (2nd Edition) Butterworths
Morrison D F (1967) Multivariate Statistical Methods McGraw–Hill
PDF version (NAG web site
, 64bit version, 64bit version)
© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2013