NAG C Library Function Document

nag_mv_hierar_cluster_analysis (g03ecc)

1
Purpose

nag_mv_hierar_cluster_analysis (g03ecc) performs hierarchical cluster analysis.

2
Specification

#include <nag.h>
#include <nagg03.h>
void  nag_mv_hierar_cluster_analysis (Nag_ClusterMethod method, Integer n, double d[], Integer ilc[], Integer iuc[], double cd[], Integer iord[], double dord[], NagError *fail)

3
Description

Given a distance or dissimilarity matrix for n  objects (see nag_mv_distance_mat (g03eac)), cluster analysis aims to group the n  objects into a number of more or less homogeneous groups or clusters. With agglomerative clustering methods, a hierarchical tree is produced by starting with n  clusters, each with a single object and then at each of n-1  stages, merging two clusters to form a larger cluster, until all objects are in a single cluster. This process may be represented by a dendrogram (see nag_mv_dendrogram (g03ehc)).
At each stage, the two nearest clusters are merged; methods differ in how the distance between the new cluster and the other clusters is computed. For three clusters $i$, $j$ and $k$, let $n_i$, $n_j$ and $n_k$ be the number of objects in each cluster and let $d_{ij}$, $d_{ik}$ and $d_{jk}$ be the distances between the clusters. Let clusters $j$ and $k$ be merged to give cluster $jk$; then the distance from cluster $i$ to cluster $jk$, $d_{i.jk}$, can be computed in the following ways:
1. Single link or nearest neighbour: $d_{i.jk} = \min(d_{ij}, d_{ik})$.
2. Complete link or furthest neighbour: $d_{i.jk} = \max(d_{ij}, d_{ik})$.
3. Group average: $d_{i.jk} = \frac{n_j}{n_j + n_k} d_{ij} + \frac{n_k}{n_j + n_k} d_{ik}$.
4. Centroid: $d_{i.jk} = \frac{n_j}{n_j + n_k} d_{ij} + \frac{n_k}{n_j + n_k} d_{ik} - \frac{n_j n_k}{(n_j + n_k)^2} d_{jk}$.
5. Median: $d_{i.jk} = \frac{1}{2} d_{ij} + \frac{1}{2} d_{ik} - \frac{1}{4} d_{jk}$.
6. Minimum variance: $d_{i.jk} = \{(n_i + n_j) d_{ij} + (n_i + n_k) d_{ik} - n_i d_{jk}\} / (n_i + n_j + n_k)$.
For further details see Everitt (1974) or Krzanowski (1990).
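The six update rules listed above can be sketched in C as follows. This is an illustration only, not the library's implementation: the enum SketchMethod and the function updated_distance are hypothetical names introduced here, and plain int is used for the cluster sizes where the library would use Integer.

```c
#include <assert.h>

/* Hypothetical enum for this sketch; the library's own type is
   Nag_ClusterMethod. */
typedef enum {
    SK_SINGLE, SK_COMPLETE, SK_GROUP_AVERAGE,
    SK_CENTROID, SK_MEDIAN, SK_MIN_VARIANCE
} SketchMethod;

/* Distance from cluster i to the merged cluster jk, given the three
   pairwise distances and the cluster sizes ni, nj, nk. */
double updated_distance(SketchMethod m, double dij, double dik, double djk,
                        int ni, int nj, int nk)
{
    double s = (double)(nj + nk);

    assert(ni > 0 && nj > 0 && nk > 0);
    switch (m) {
    case SK_SINGLE:
        return dij < dik ? dij : dik;
    case SK_COMPLETE:
        return dij > dik ? dij : dik;
    case SK_GROUP_AVERAGE:
        return (nj / s) * dij + (nk / s) * dik;
    case SK_CENTROID:
        return (nj / s) * dij + (nk / s) * dik
               - ((double)nj * nk / (s * s)) * djk;
    case SK_MEDIAN:
        return 0.5 * dij + 0.5 * dik - 0.25 * djk;
    case SK_MIN_VARIANCE:
        return ((ni + nj) * dij + (ni + nk) * dik - ni * djk)
               / (double)(ni + nj + nk);
    }
    return -1.0; /* unreachable for valid enum values */
}
```

For example, with three singleton clusters and distances 2.0, 3.0 and 1.0 for the pairs ij, ik and jk, the group average rule gives 2.5 and the median rule gives 2.25.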
If the clusters are numbered $1, 2, \ldots, n$ then, for convenience, if clusters $j$ and $k$, $j < k$, merge then the new cluster will be referred to as cluster $j$. Information on the clustering history is given by the values of $j$, $k$ and $d_{jk}$ for each of the $n-1$ clustering steps. In order to produce a dendrogram, the ordering of the objects such that the clusters that merge are adjacent is required. This ordering is computed so that the first element is 1. The associated distances with this ordering are also computed.

4
References

Everitt B S (1974) Cluster Analysis Heinemann
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press

5
Arguments

1:     method  Nag_ClusterMethod  Input
On entry: indicates which clustering method is used.
method=Nag_SingleLink
Single link.
method=Nag_CompleteLink
Complete link.
method=Nag_GroupAverage
Group average.
method=Nag_Centroid
Centroid.
method=Nag_Median
Median.
method=Nag_MinVariance
Minimum variance.
Constraint: method=Nag_SingleLink, Nag_CompleteLink, Nag_GroupAverage, Nag_Centroid, Nag_Median or Nag_MinVariance.
2:     n  Integer  Input
On entry: the number of objects, n.
Constraint: n ≥ 2.
3:     d[n×(n-1)/2]  double  Input/Output
On entry: the strictly lower triangle of the distance matrix, D. D must be stored packed by rows, i.e., d[(i-1)(i-2)/2 + j - 1], i > j, must contain $d_{ij}$.
On exit: is overwritten.
Constraint: d[i-1] ≥ 0.0, for i = 1, 2, ..., n(n-1)/2.
4:     ilc[n-1]  Integer  Output
On exit: ilc[l-1] contains the number, j, of the cluster merged with cluster k (see iuc), j < k, at step l, for l = 1, 2, ..., n-1.
5:     iuc[n-1]  Integer  Output
On exit: iuc[l-1] contains the number, k, of the cluster merged with cluster j, j < k, at step l, for l = 1, 2, ..., n-1.
6:     cd[n-1]  double  Output
On exit: cd[l-1] contains the distance $d_{jk}$ between clusters j and k, j < k, merged at step l, for l = 1, 2, ..., n-1.
7:     iord[n]  Integer  Output
On exit: the objects in dendrogram order.
8:     dord[n]  double  Output
On exit: the clustering distances corresponding to the order in iord. dord[l-1] contains the distance at which clusters iord[l-1] and iord[l] merge, for l = 1, 2, ..., n-1. dord[n-1] contains the maximum distance.
9:     fail  NagError *  Input/Output
The NAG error argument (see Section 3.7 in How to Use the NAG Library and its Documentation).
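The packed storage required for d can be illustrated with a short C sketch. The helper names packed_index, packed_length and pack_lower are hypothetical and are not part of the library; size_t is used here for indices where a caller would normally work with Integer. The sketch assumes the full distance matrix is available as a row-major n by n array.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based offset of d_ij (1-based indices, i > j) in the packed
   strictly lower triangle, stored by rows. */
size_t packed_index(size_t i, size_t j)
{
    assert(i > j && j >= 1);
    return (i - 1) * (i - 2) / 2 + j - 1;
}

/* Number of elements of d needed for n objects. */
size_t packed_length(size_t n)
{
    return n * (n - 1) / 2;
}

/* Copy the strictly lower triangle of a full row-major n x n matrix
   into the packed array d (which must hold packed_length(n) values). */
void pack_lower(const double *full, size_t n, double *d)
{
    for (size_t i = 2; i <= n; ++i)
        for (size_t j = 1; j < i; ++j)
            d[packed_index(i, j)] = full[(i - 1) * n + (j - 1)];
}
```

With this layout the distance between objects 2 and 1 sits at d[0], the distances from object 3 at d[1] and d[2], and so on; five objects need ten elements.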

6
Error Indicators and Warnings

NE_ALLOC_FAIL
Dynamic memory allocation failed.
NE_BAD_PARAM
On entry, argument method had an illegal value.
NE_DENDROGRAM
A true dendrogram cannot be formed because the distances at which clusters have merged are not increasing for all steps, i.e., cd[i-1] < cd[i-2] for some i = 2, 3, ..., n-1. This can occur for method=Nag_Centroid and method=Nag_Median.
NE_INT_ARG_LT
On entry, n=value.
Constraint: n ≥ 2.
NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
NE_REALARR
On entry, d[value] = value.
Constraint: d[i-1] ≥ 0.0, for i = 1, 2, ..., n×(n-1)/2.
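The condition reported by NE_DENDROGRAM can be checked on the returned merge distances with a few lines of C. This is a sketch under stated assumptions: first_inversion is a hypothetical helper, not a library function, and it uses int where the library uses Integer.

```c
#include <assert.h>

/* Returns the first 1-based step i (2 <= i <= n-1) at which
   cd[i-1] < cd[i-2], or 0 if the merge distances never decrease.
   cd holds the n-1 merge distances for n objects. */
int first_inversion(int n, const double cd[])
{
    assert(n >= 2);
    for (int i = 2; i <= n - 1; ++i)
        if (cd[i - 1] < cd[i - 2])
            return i;
    return 0;
}
```

A nonzero return value identifies the step at which the merge distance dropped below that of the previous step.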

7
Accuracy

For methods other than method=Nag_SingleLink or Nag_CompleteLink, slight rounding errors may occur in the calculation of the updated distances. These would not normally affect the results significantly; however, there may be an effect if distances are (almost) equal.
If, at some stage, two distances $d_{ij}$ and $d_{kl}$ (with $i < k$, or $i = k$ and $j < l$) are equal, then clusters $k$ and $l$ will be merged rather than clusters $i$ and $j$. For single link clustering this choice will only affect the order of the objects in the dendrogram. However, for other methods the choice of $kl$ rather than $ij$ may affect the shape of the dendrogram. If either of the distances $d_{ij}$ or $d_{kl}$ is affected by rounding errors then their equality, and hence the dendrogram, may be affected.

8
Parallelism and Performance

nag_mv_hierar_cluster_analysis (g03ecc) is not threaded in any implementation.

9
Further Comments

The dendrogram may be formed using nag_mv_dendrogram (g03ehc). Groupings based on the clusters formed at a given distance can be computed using nag_mv_cluster_indicator (g03ejc).
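To show how the merge history in ilc, iuc and cd determines a grouping at a given distance, the following C sketch replays the merges up to a threshold. It is a simplified stand-in and not the algorithm used by nag_mv_cluster_indicator (g03ejc): the function clusters_at_level and its argument dlevel are hypothetical, int replaces Integer, and the merge distances are assumed to be non-decreasing (a true dendrogram).

```c
#include <assert.h>

/* Sketch only: label[i] receives the cluster number of object i+1 after
   applying every merge whose distance does not exceed dlevel.
   Assumes cd[] is non-decreasing and cluster numbers are 1-based. */
void clusters_at_level(int n, const int ilc[], const int iuc[],
                       const double cd[], double dlevel, int label[])
{
    assert(n >= 2);
    for (int i = 0; i < n; ++i)
        label[i] = i + 1;          /* start: object i+1 is cluster i+1 */
    for (int l = 0; l < n - 1 && cd[l] <= dlevel; ++l) {
        /* clusters j < k merge; the merged cluster keeps the number j */
        for (int i = 0; i < n; ++i)
            if (label[i] == iuc[l])
                label[i] = ilc[l];
    }
}
```

For four objects with merges (1,2) at distance 1.0, (3,4) at 2.0 and (1,3) at 5.0, a threshold of 2.5 leaves two clusters, numbered 1 and 3.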

10
Example

Data consisting of three variables on five objects are read in. Euclidean squared distances based on two variables are computed using nag_mv_distance_mat (g03eac), the objects are clustered using nag_mv_hierar_cluster_analysis (g03ecc) and the dendrogram computed using nag_mv_dendrogram (g03ehc). The dendrogram is then printed.

10.1
Program Text

Program Text (g03ecce.c)

10.2
Program Data

Program Data (g03ecce.d)

10.3
Program Results

Program Results (g03ecce.r)