hide long namesshow long names
hide short namesshow short names
Integer type:  int32  int64  nag_int  show int32  show int32  show int64  show int64  show nag_int  show nag_int

PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

NAG Toolbox: nag_mv_cluster_hier (g03ec)

Purpose

nag_mv_cluster_hier (g03ec) performs hierarchical cluster analysis.

Syntax

[d, ilc, iuc, cd, iord, dord, ifail] = g03ec(method, n, d)
[d, ilc, iuc, cd, iord, dord, ifail] = nag_mv_cluster_hier(method, n, d)

Description

Given a distance or dissimilarity matrix for nn objects (see nag_mv_distance_mat (g03ea)), cluster analysis aims to group the nn objects into a number of more or less homogeneous groups or clusters. With agglomerative clustering methods, a hierarchical tree is produced by starting with nn clusters, each with a single object and then at each of n1n-1 stages, merging two clusters to form a larger cluster, until all objects are in a single cluster. This process may be represented by a dendrogram (see nag_mv_cluster_hier_dendrogram (g03eh)).
At each stage, the clusters that are nearest are merged, methods differ as to how the distances between the new cluster and other clusters are computed. For three clusters ii, jj and kk let nini, njnj and nknk be the number of objects in each cluster and let dijdij, dikdik and djkdjk be the distances between the clusters. Let clusters jj and kk be merged to give cluster jkjk, then the distance from cluster ii to cluster jkjk, di . jkdi.jk can be computed in the following ways.
  1. Single link or nearest neighbour : di . jk = min (dij,dik) di.jk = min(dij,dik) .
  2. Complete link or furthest neighbour : di . jk = max (dij,dik)di.jk=max(dij,dik).
  3. Group average : di . jk = (nj)/(nj + nk)dij + (nk)/(nj + nk)dikdi.jk= njnj+nk dij+ nknj+nk dik.
  4. Centroid : di . jk = (nj)/(nj + nk)dij + (nk)/(nj + nk)dik(njnk)/((nj + nk)2)djkdi.jk= njnj+nk dij+ nknj+nk dik- njnk (nj+nk) 2djk.
  5. Median : di . jk = (1/2)dij + (1/2)dik(1/4)djkdi.jk=12dij+12dik-14djk.
  6. Minimum variance : di . jk = {(ni + nj)dij + (ni + nk)diknidjk} / (ni + nj + nk)di.jk={(ni+nj)dij+(ni+nk)dik-nidjk}/(ni+nj+nk).
For further details see Everitt (1974) or Krzanowski (1990).
If the clusters are numbered 1,2,,n1,2,,n then, for convenience, if clusters jj and kk, j < kj<k, merge then the new cluster will be referred to as cluster jj. Information on the clustering history is given by the values of jj, kk and djkdjk for each of the n1n-1 clustering steps. In order to produce a dendrogram, the ordering of the objects such that the clusters that merge are adjacent is required. This ordering is computed so that the first element is 11. The associated distances with this ordering are also computed.

References

Everitt B S (1974) Cluster Analysis Heinemann
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press

Parameters

Compulsory Input Parameters

1:     method – int64int32nag_int scalar
Indicates which clustering method is used.
method = 1method=1
Single link.
method = 2method=2
Complete link.
method = 3method=3
Group average.
method = 4method=4
Centroid.
method = 5method=5
Median.
method = 6method=6
Minimum variance.
Constraint: method = 1method=1, 22, 33, 44, 55 or 66.
2:     n – int64int32nag_int scalar
nn, the number of objects.
Constraint: n2n2.
3:     d(n × (n1) / 2n×(n-1)/2) – double array
The strictly lower triangle of the distance matrix. DD must be stored packed by rows, i.e., d((i1)(i2) / 2 + j)d((i-1)(i-2)/2+j), i > ji>j must contain dijdij.
Constraint: d(i)0.0di0.0, for i = 1,2,,n(n1) / 2i=1,2,,n(n-1)/2.

Optional Input Parameters

None.

Input Parameters Omitted from the MATLAB Interface

iwk

Output Parameters

1:     d(n × (n1) / 2n×(n-1)/2) – double array
Is overwritten.
2:     ilc(n1n-1) – int64int32nag_int array
ilc(l)ilcl contains the number, jj, of the cluster merged with cluster kk (see iuc), j < kj<k, at step ll, for l = 1,2,,n1l=1,2,,n-1.
3:     iuc(n1n-1) – int64int32nag_int array
iuc(l)iucl contains the number, kk, of the cluster merged with cluster jj, j < kj<k, at step ll, for l = 1,2,,n1l=1,2,,n-1.
4:     cd(n1n-1) – double array
cd(l)cdl contains the distance djkdjk, between clusters jj and kk, j < kj<k, merged at step ll, for l = 1,2,,n1l=1,2,,n-1.
5:     iord(n) – int64int32nag_int array
The objects in dendrogram order.
6:     dord(n) – double array
The clustering distances corresponding to the order in iord. dord(l)dordl contains the distance at which cluster iord(l)iordl and iord(l + 1)iordl+1 merge, for l = 1,2,,n1l=1,2,,n-1. dord(n)dordn contains the maximum distance.
7:     ifail – int64int32nag_int scalar
ifail = 0ifail=0 unless the function detects an error (see [Error Indicators and Warnings]).

Error Indicators and Warnings

Errors or warnings detected by the function:
  ifail = 1ifail=1
On entry,method1method1, 22, 33, 44, 55 or 66,
orn < 2n<2.
  ifail = 2ifail=2
On entry,d(i) < 0.0di<0.0 for some i = 1,2,,n(n1) / 2i=1,2,,n(n-1)/2.
  ifail = 3ifail=3
A true dendrogram cannot be formed because the distances at which clusters have merged are not increasing for all steps, i.e., cd(l) < cd(l1)cdl<cdl-1 for some l = 2,3,,n1l=2,3,,n-1. This can occur for the median and centroid methods.

Accuracy

For method3method3 slight rounding errors may occur in the calculations of the updated distances. These would not normally significantly affect the results, however there may be an effect if distances are (almost) equal.
If at a stage, two distances dijdij and dkldkl, (i < ki<k) or (i = ki=k), and j < lj<l, are equal then clusters kk and ll will be merged rather than clusters ii and jj. For single link clustering this choice will only affect the order of the objects in the dendrogram. However, for other methods the choice of klkl rather than ijij may affect the shape of the dendrogram. If either of the distances dijdij and dkldkl is affected by rounding errors then their equality, and hence the dendrogram, may be affected.

Further Comments

The dendrogram may be formed using nag_mv_cluster_hier_dendrogram (g03eh). Groupings based on the clusters formed at a given distance can be computed using nag_mv_cluster_hier_indicator (g03ej).

Example

function nag_mv_cluster_hier_example
method = int64(5);
n = int64(5);
d = [17;
     2;
     13;
     16;
     1;
     10;
     4;
     17;
     10;
     20];
[dOut, ilc, iuc, cd, iord, dord, ifail] = nag_mv_cluster_hier(method, n, d)
 

dOut =

   14.1250
    2.0000
   11.2500
   16.0000
    1.0000
   10.0000
    6.5000
   18.2500
   10.0000
   20.0000


ilc =

                    2
                    1
                    1
                    1


iuc =

                    4
                    3
                    5
                    2


cd =

    1.0000
    2.0000
    6.5000
   14.1250


iord =

                    1
                    3
                    5
                    2
                    4


dord =

    2.0000
    6.5000
   14.1250
    1.0000
   14.1250


ifail =

                    0


function g03ec_example
method = int64(5);
n = int64(5);
d = [17;
     2;
     13;
     16;
     1;
     10;
     4;
     17;
     10;
     20];
[dOut, ilc, iuc, cd, iord, dord, ifail] = g03ec(method, n, d)
 

dOut =

   14.1250
    2.0000
   11.2500
   16.0000
    1.0000
   10.0000
    6.5000
   18.2500
   10.0000
   20.0000


ilc =

                    2
                    1
                    1
                    1


iuc =

                    4
                    3
                    5
                    2


cd =

    1.0000
    2.0000
    6.5000
   14.1250


iord =

                    1
                    3
                    5
                    2
                    4


dord =

    2.0000
    6.5000
   14.1250
    1.0000
   14.1250


ifail =

                    0



PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2013