Integer type:  int32  int64  nag_int  show int32  show int32  show int64  show int64  show nag_int  show nag_int

Chapter Contents
Chapter Introduction
NAG Toolbox

NAG Toolbox: nag_mv_cluster_hier (g03ec)

Purpose

nag_mv_cluster_hier (g03ec) performs hierarchical cluster analysis.

Syntax

[d, ilc, iuc, cd, iord, dord, ifail] = g03ec(method, n, d)
[d, ilc, iuc, cd, iord, dord, ifail] = nag_mv_cluster_hier(method, n, d)

Description

Given a distance or dissimilarity matrix for n$n$ objects (see nag_mv_distance_mat (g03ea)), cluster analysis aims to group the n$n$ objects into a number of more or less homogeneous groups or clusters. With agglomerative clustering methods, a hierarchical tree is produced by starting with n$n$ clusters, each with a single object and then at each of n1$n-1$ stages, merging two clusters to form a larger cluster, until all objects are in a single cluster. This process may be represented by a dendrogram (see nag_mv_cluster_hier_dendrogram (g03eh)).
At each stage, the clusters that are nearest are merged, methods differ as to how the distances between the new cluster and other clusters are computed. For three clusters i$i$, j$j$ and k$k$ let ni${n}_{i}$, nj${n}_{j}$ and nk${n}_{k}$ be the number of objects in each cluster and let dij${d}_{ij}$, dik${d}_{ik}$ and djk${d}_{jk}$ be the distances between the clusters. Let clusters j$j$ and k$k$ be merged to give cluster jk$jk$, then the distance from cluster i$i$ to cluster jk$jk$, di . jk${d}_{i.jk}$ can be computed in the following ways.
1. Single link or nearest neighbour : di . jk = min (dij,dik) ${d}_{i.jk}=\mathrm{min}\phantom{\rule{0.125em}{0ex}}\left({d}_{ij},{d}_{ik}\right)$.
2. Complete link or furthest neighbour : di . jk = max (dij,dik)${d}_{i.jk}=\mathrm{max}\phantom{\rule{0.125em}{0ex}}\left({d}_{ij},{d}_{ik}\right)$.
3. Group average : di . jk = (nj)/(nj + nk)dij + (nk)/(nj + nk)dik${d}_{i.jk}=\frac{{n}_{j}}{{n}_{j}+{n}_{k}}{d}_{ij}+\frac{{n}_{k}}{{n}_{j}+{n}_{k}}{d}_{ik}$.
4. Centroid : di . jk = (nj)/(nj + nk)dij + (nk)/(nj + nk)dik(njnk)/((nj + nk)2)djk${d}_{i.jk}=\frac{{n}_{j}}{{n}_{j}+{n}_{k}}{d}_{ij}+\frac{{n}_{k}}{{n}_{j}+{n}_{k}}{d}_{ik}-\frac{{n}_{j}{n}_{k}}{{\left({n}_{j}+{n}_{k}\right)}^{2}}{d}_{jk}$.
5. Median : di . jk = (1/2)dij + (1/2)dik(1/4)djk${d}_{i.jk}=\frac{1}{2}{d}_{ij}+\frac{1}{2}{d}_{ik}-\frac{1}{4}{d}_{jk}$.
6. Minimum variance : di . jk = {(ni + nj)dij + (ni + nk)diknidjk} / (ni + nj + nk)${d}_{i.jk}=\left\{\left({n}_{i}+{n}_{j}\right){d}_{ij}+\left({n}_{i}+{n}_{k}\right){d}_{ik}-{n}_{i}{d}_{jk}\right\}/\left({n}_{i}+{n}_{j}+{n}_{k}\right)$.
For further details see Everitt (1974) or Krzanowski (1990).
If the clusters are numbered 1,2,,n$1,2,\dots ,n$ then, for convenience, if clusters j$j$ and k$k$, j < k$j, merge then the new cluster will be referred to as cluster j$j$. Information on the clustering history is given by the values of j$j$, k$k$ and djk${d}_{jk}$ for each of the n1$n-1$ clustering steps. In order to produce a dendrogram, the ordering of the objects such that the clusters that merge are adjacent is required. This ordering is computed so that the first element is 1$1$. The associated distances with this ordering are also computed.

References

Everitt B S (1974) Cluster Analysis Heinemann
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press

Parameters

Compulsory Input Parameters

1:     method – int64int32nag_int scalar
Indicates which clustering method is used.
method = 1${\mathbf{method}}=1$
method = 2${\mathbf{method}}=2$
method = 3${\mathbf{method}}=3$
Group average.
method = 4${\mathbf{method}}=4$
Centroid.
method = 5${\mathbf{method}}=5$
Median.
method = 6${\mathbf{method}}=6$
Minimum variance.
Constraint: method = 1${\mathbf{method}}=1$, 2$2$, 3$3$, 4$4$, 5$5$ or 6$6$.
2:     n – int64int32nag_int scalar
n$n$, the number of objects.
Constraint: n2${\mathbf{n}}\ge 2$.
3:     d(n × (n1) / 2${\mathbf{n}}×\left({\mathbf{n}}-1\right)/2$) – double array
The strictly lower triangle of the distance matrix. D$D$ must be stored packed by rows, i.e., d((i1)(i2) / 2 + j)${\mathbf{d}}\left(\left(i-1\right)\left(i-2\right)/2+j\right)$, i > j$i>j$ must contain dij${d}_{ij}$.
Constraint: d(i)0.0${\mathbf{d}}\left(\mathit{i}\right)\ge 0.0$, for i = 1,2,,n(n1) / 2$\mathit{i}=1,2,\dots ,n\left(n-1\right)/2$.

None.

iwk

Output Parameters

1:     d(n × (n1) / 2${\mathbf{n}}×\left({\mathbf{n}}-1\right)/2$) – double array
Is overwritten.
2:     ilc(n1${\mathbf{n}}-1$) – int64int32nag_int array
ilc(l)${\mathbf{ilc}}\left(\mathit{l}\right)$ contains the number, j$j$, of the cluster merged with cluster k$k$ (see iuc), j < k$j, at step l$\mathit{l}$, for l = 1,2,,n1$\mathit{l}=1,2,\dots ,n-1$.
3:     iuc(n1${\mathbf{n}}-1$) – int64int32nag_int array
iuc(l)${\mathbf{iuc}}\left(\mathit{l}\right)$ contains the number, k$k$, of the cluster merged with cluster j$j$, j < k$j, at step l$\mathit{l}$, for l = 1,2,,n1$\mathit{l}=1,2,\dots ,n-1$.
4:     cd(n1${\mathbf{n}}-1$) – double array
cd(l)${\mathbf{cd}}\left(\mathit{l}\right)$ contains the distance djk${d}_{jk}$, between clusters j$j$ and k$k$, j < k$j, merged at step l$\mathit{l}$, for l = 1,2,,n1$\mathit{l}=1,2,\dots ,n-1$.
5:     iord(n) – int64int32nag_int array
The objects in dendrogram order.
6:     dord(n) – double array
The clustering distances corresponding to the order in iord. dord(l)${\mathbf{dord}}\left(\mathit{l}\right)$ contains the distance at which cluster iord(l)${\mathbf{iord}}\left(\mathit{l}\right)$ and iord(l + 1)${\mathbf{iord}}\left(\mathit{l}+1\right)$ merge, for l = 1,2,,n1$\mathit{l}=1,2,\dots ,n-1$. dord(n)${\mathbf{dord}}\left(n\right)$ contains the maximum distance.
7:     ifail – int64int32nag_int scalar
${\mathrm{ifail}}={\mathbf{0}}$ unless the function detects an error (see [Error Indicators and Warnings]).

Error Indicators and Warnings

Errors or warnings detected by the function:
ifail = 1${\mathbf{ifail}}=1$
 On entry, method ≠ 1${\mathbf{method}}\ne 1$, 2$2$, 3$3$, 4$4$, 5$5$ or 6$6$, or n < 2${\mathbf{n}}<2$.
ifail = 2${\mathbf{ifail}}=2$
 On entry, d(i) < 0.0${\mathbf{d}}\left(i\right)<0.0$ for some i = 1,2, … ,n(n − 1) / 2$i=1,2,\dots ,n\left(n-1\right)/2$.
ifail = 3${\mathbf{ifail}}=3$
A true dendrogram cannot be formed because the distances at which clusters have merged are not increasing for all steps, i.e., cd(l) < cd(l1)${\mathbf{cd}}\left(l\right)<{\mathbf{cd}}\left(l-1\right)$ for some l = 2,3,,n1$l=2,3,\dots ,n-1$. This can occur for the median and centroid methods.

Accuracy

For method3${\mathbf{method}}\ge 3$ slight rounding errors may occur in the calculations of the updated distances. These would not normally significantly affect the results, however there may be an effect if distances are (almost) equal.
If at a stage, two distances dij${d}_{ij}$ and dkl${d}_{kl}$, (i < k$i) or (i = k$i=k$), and j < l$j, are equal then clusters k$k$ and l$l$ will be merged rather than clusters i$i$ and j$j$. For single link clustering this choice will only affect the order of the objects in the dendrogram. However, for other methods the choice of kl$kl$ rather than ij$ij$ may affect the shape of the dendrogram. If either of the distances dij${d}_{ij}$ and dkl${d}_{kl}$ is affected by rounding errors then their equality, and hence the dendrogram, may be affected.

The dendrogram may be formed using nag_mv_cluster_hier_dendrogram (g03eh). Groupings based on the clusters formed at a given distance can be computed using nag_mv_cluster_hier_indicator (g03ej).

Example

```function nag_mv_cluster_hier_example
method = int64(5);
n = int64(5);
d = [17;
2;
13;
16;
1;
10;
4;
17;
10;
20];
[dOut, ilc, iuc, cd, iord, dord, ifail] = nag_mv_cluster_hier(method, n, d)
```
```

dOut =

14.1250
2.0000
11.2500
16.0000
1.0000
10.0000
6.5000
18.2500
10.0000
20.0000

ilc =

2
1
1
1

iuc =

4
3
5
2

cd =

1.0000
2.0000
6.5000
14.1250

iord =

1
3
5
2
4

dord =

2.0000
6.5000
14.1250
1.0000
14.1250

ifail =

0

```
```function g03ec_example
method = int64(5);
n = int64(5);
d = [17;
2;
13;
16;
1;
10;
4;
17;
10;
20];
[dOut, ilc, iuc, cd, iord, dord, ifail] = g03ec(method, n, d)
```
```

dOut =

14.1250
2.0000
11.2500
16.0000
1.0000
10.0000
6.5000
18.2500
10.0000
20.0000

ilc =

2
1
1
1

iuc =

4
3
5
2

cd =

1.0000
2.0000
6.5000
14.1250

iord =

1
3
5
2
4

dord =

2.0000
6.5000
14.1250
1.0000
14.1250

ifail =

0

```