NAG Library Routine Document
G03ECF performs hierarchical cluster analysis.
SUBROUTINE G03ECF (METHOD, N, D, ILC, IUC, CD, IORD, DORD, IWK, IFAIL)
INTEGER METHOD, N, ILC(N-1), IUC(N-1), IORD(N), IWK(2*N), IFAIL
REAL (KIND=nag_wp) D(N*(N-1)/2), CD(N-1), DORD(N)
Given a distance or dissimilarity matrix for $n$ objects (see G03EAF), cluster analysis aims to group the $n$ objects into a number of more or less homogeneous groups or clusters. With agglomerative clustering methods, a hierarchical tree is produced by starting with $n$ clusters, each containing a single object, and then, at each of $n-1$ stages, merging two clusters to form a larger cluster until all objects are in a single cluster. This process may be represented by a dendrogram (see G03EHF).
At each stage, the clusters that are nearest are merged; methods differ as to how the distances between the new cluster and other clusters are computed. For three clusters $i$, $j$ and $k$, let $n_i$, $n_j$ and $n_k$ be the number of objects in each cluster and let $d_{ij}$, $d_{ik}$ and $d_{jk}$ be the distances between the clusters. Let clusters $j$ and $k$ be merged to give cluster $jk$; then the distance from cluster $i$ to cluster $jk$, $d_{i.jk}$, can be computed in the following ways.
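- Single link or nearest neighbour: $d_{i.jk} = \min(d_{ij}, d_{ik})$.
- Complete link or furthest neighbour: $d_{i.jk} = \max(d_{ij}, d_{ik})$.
- Group average: $d_{i.jk} = \frac{n_j}{n_j+n_k} d_{ij} + \frac{n_k}{n_j+n_k} d_{ik}$.
- Centroid: $d_{i.jk} = \frac{n_j}{n_j+n_k} d_{ij} + \frac{n_k}{n_j+n_k} d_{ik} - \frac{n_j n_k}{(n_j+n_k)^2} d_{jk}$.
- Median: $d_{i.jk} = \frac{1}{2} d_{ij} + \frac{1}{2} d_{ik} - \frac{1}{4} d_{jk}$.
- Minimum variance: $d_{i.jk} = \frac{(n_i+n_j) d_{ij} + (n_i+n_k) d_{ik} - n_i d_{jk}}{n_i+n_j+n_k}$.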
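As an illustration only, the six update rules can be written as one small Python function using their standard Lance-Williams forms. This is a sketch, not the routine's implementation: the function name `update_distance` is hypothetical, and the numbering of the methods from 1 to 6 in the order of the list above is an assumption of the sketch.

```python
def update_distance(method, d_ij, d_ik, d_jk, n_i, n_j, n_k):
    """Distance from cluster i to the cluster formed by merging j and k.

    method is assumed to be numbered 1..6 in the order of the list above.
    """
    n_jk = n_j + n_k
    if method == 1:    # single link
        return min(d_ij, d_ik)
    if method == 2:    # complete link
        return max(d_ij, d_ik)
    if method == 3:    # group average
        return (n_j * d_ij + n_k * d_ik) / n_jk
    if method == 4:    # centroid
        return (n_j * d_ij + n_k * d_ik) / n_jk - n_j * n_k * d_jk / n_jk**2
    if method == 5:    # median
        return 0.5 * d_ij + 0.5 * d_ik - 0.25 * d_jk
    if method == 6:    # minimum variance
        return ((n_i + n_j) * d_ij + (n_i + n_k) * d_ik - n_i * d_jk) / (n_i + n_j + n_k)
    raise ValueError("method must be 1, 2, 3, 4, 5 or 6")
```

For example, with three single-object clusters and $d_{ij} = 4$, $d_{ik} = 8$, $d_{jk} = 4$, the single link update gives 4, the complete link update gives 8, the group average gives 6, and both the centroid and median updates give 5.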
If the clusters are numbered $1, 2, \ldots, n$ then, for convenience, if clusters $j$ and $k$, $j < k$, merge then the new cluster will be referred to as cluster $j$. Information on the clustering history is given by the values of $j$, $k$ and $d_{jk}$ for each of the $n-1$ clustering steps. In order to produce a dendrogram, an ordering of the objects is required such that the clusters that merge are adjacent. This ordering is computed so that the first element is 1. The distances associated with this ordering are also computed.
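The numbering convention and the clustering history can be illustrated with a naive Python sketch. It is not the routine's algorithm: the function `cluster_history` and its arguments are hypothetical, the update callable only receives the two distances (which is enough for single link and complete link), and ties are resolved by taking the first closest pair found, which need not match the routine's behaviour.

```python
def cluster_history(dist, n, update=min):
    """Naive agglomerative clustering on a dict of pairwise distances.

    dist maps (j, k) with j < k (1-based object numbers) to a distance.
    update(d_ij, d_ik) gives the distance to the merged cluster; the
    default, min, is single link.  Returns one (j, k, d_jk) per step.
    """
    d = dict(dist)                          # work on a copy
    active = list(range(1, n + 1))          # clusters still in play
    history = []

    def pair(a, b):
        return (a, b) if a < b else (b, a)

    for _ in range(n - 1):
        # closest pair of active clusters (ties: first pair found)
        j, k = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: d[p])
        history.append((j, k, d[(j, k)]))
        # distances from every other cluster to the merged cluster
        for i in active:
            if i != j and i != k:
                d[pair(i, j)] = update(d[pair(i, j)], d[pair(i, k)])
        active.remove(k)                    # merged cluster keeps the number j
    return history


dist = {(1, 2): 1.0, (1, 3): 5.0, (1, 4): 6.0,
        (2, 3): 4.0, (2, 4): 7.0, (3, 4): 2.0}
print(cluster_history(dist, 4))
```

For these four objects the single link history is (1, 2) at distance 1, then (3, 4) at distance 2, then (1, 3) at distance 4; passing `update=max` (complete link) changes the last step to distance 7.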
Everitt B S (1974) Cluster Analysis Heinemann
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press
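- 1: METHOD – INTEGER Input
On entry: indicates which clustering method is used.
- METHOD = 1: Single link.
- METHOD = 2: Complete link.
- METHOD = 3: Group average.
- METHOD = 4: Centroid.
- METHOD = 5: Median.
- METHOD = 6: Minimum variance.
Constraint: METHOD = 1, 2, 3, 4, 5 or 6.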
- 2: N – INTEGER Input
On entry: $n$, the number of objects.
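- 3: D(N*(N-1)/2) – REAL (KIND=nag_wp) array Input/Output
On entry: the strictly lower triangle of the distance matrix. D must be stored packed by rows, i.e., D($(i-1)(i-2)/2+j$), $i > j$, must contain $d_{ij}$.
On exit: D is overwritten.
Constraint: D($i$) $\ge 0.0$, for $i = 1, 2, \ldots, n(n-1)/2$.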
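The packed row-wise storage of a strictly lower triangle can be checked with a short Python sketch. Row $i$ of the triangle holds $i-1$ elements, so $(i-1)(i-2)/2$ elements precede it. The helper names `packed_index` and `pack_lower` are hypothetical and are not part of the library.

```python
def packed_index(i, j):
    """1-based position in D of d_ij for the strictly lower triangle (i > j)."""
    if not i > j >= 1:
        raise ValueError("requires i > j >= 1")
    return (i - 1) * (i - 2) // 2 + j


def pack_lower(full):
    """Pack the strictly lower triangle of a square matrix by rows."""
    n = len(full)
    return [full[i][j] for i in range(1, n) for j in range(i)]


full = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 3.0],
        [2.0, 3.0, 0.0]]
d = pack_lower(full)                 # [1.0, 2.0, 3.0]
print(d[packed_index(3, 2) - 1])     # d_32, stored in the third position
```

The subtraction of 1 in the last line only converts the 1-based Fortran position to a 0-based Python list index.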
- 4: ILC(N-1) – INTEGER array Output
On exit: ILC($l$) contains the number, $j$, of the cluster merged with cluster $k$ (see IUC), $j < k$, at step $l$, for $l = 1, 2, \ldots, n-1$.
- 5: IUC(N-1) – INTEGER array Output
On exit: IUC($l$) contains the number, $k$, of the cluster merged with cluster $j$, $j < k$, at step $l$, for $l = 1, 2, \ldots, n-1$.
- 6: CD(N-1) – REAL (KIND=nag_wp) array Output
On exit: CD($l$) contains the distance $d_{jk}$ between clusters $j$ and $k$, $j < k$, merged at step $l$, for $l = 1, 2, \ldots, n-1$.
- 7: IORD(N) – INTEGER array Output
On exit: the objects in dendrogram order.
- 8: DORD(N) – REAL (KIND=nag_wp) array Output
On exit: the clustering distances corresponding to the order in IORD. DORD($l$) contains the distance at which clusters IORD($l$) and IORD($l+1$) merge, for $l = 1, 2, \ldots, n-1$. DORD($n$) contains the maximum distance.
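One way to obtain an ordering with the stated properties from a merge history is sketched below in Python. This is an assumption about how such an ordering can be built, not a description of the routine's internals, and the ordering the routine returns may differ where several valid orderings exist. The function `dendrogram_order` is hypothetical; `history` is a list of steps $(j, k, d_{jk})$ with $j < k$, where the merged cluster keeps the number $j$.

```python
def dendrogram_order(n, history):
    """Return (iord, dord) from merge steps (j, k, d) with j < k."""
    members = {c: [c] for c in range(1, n + 1)}  # objects of each cluster, in order
    gaps = {c: [] for c in range(1, n + 1)}      # merge distance between neighbours
    for j, k, d in history:
        # place cluster k directly after cluster j; the junction merges at d
        gaps[j] = gaps[j] + [d] + gaps.pop(k)
        members[j] = members[j] + members.pop(k)
    iord = members[1]                            # cluster 1 absorbs all the others
    dord = gaps[1] + [max(d for _, _, d in history)]
    return iord, dord


history = [(2, 3, 1.0), (1, 4, 2.0), (1, 2, 3.0)]
print(dendrogram_order(4, history))
```

For this history the sketch gives the order 1, 4, 2, 3 with distances 2, 3, 1 and a final entry of 3: objects 1 and 4 merge at 2, the clusters ending in 4 and starting with 2 merge at 3, and objects 2 and 3 merge at 1.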
- 9: IWK(2*N) – INTEGER array Workspace
- 10: IFAIL – INTEGER Input/Output
On entry: IFAIL must be set to 0, -1 or 1. If you are unfamiliar with this parameter you should refer to Section 3.3 in the Essential Introduction for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value -1 or 1 is recommended. If the output of error messages is undesirable, then the value 1 is recommended. Otherwise, if you are not familiar with this parameter, the recommended value is 0. When the value -1 or 1 is used it is essential to test the value of IFAIL on exit.
On exit: IFAIL = 0 unless the routine detects an error or a warning has been flagged (see Section 6).
6 Error Indicators and Warnings
If on entry IFAIL = 0 or -1, explanatory error messages are output on the current error message unit (as defined by X04AAF).
Errors or warnings detected by the routine:
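IFAIL = 1
On entry, METHOD $\ne$ 1, 2, 3, 4, 5 or 6,
or N $< 2$.
IFAIL = 2
On entry, D($i$) $< 0.0$ for some $i = 1, 2, \ldots, n(n-1)/2$.
IFAIL = 3
A true dendrogram cannot be formed because the distances at which clusters have merged are not increasing for all steps, i.e., CD($l$) $<$ CD($l-1$) for some $l = 2, 3, \ldots, n-1$. This can occur for the median and centroid methods.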
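The condition behind this warning, a merge distance smaller than the one at the previous step, is easy to test on a list of merge distances. The Python sketch below is illustrative only; the function name `first_inversion` is hypothetical, and `cd` is assumed to hold the merge distances in step order.

```python
def first_inversion(cd):
    """Return the 1-based step l at which cd[l] < cd[l-1], or None."""
    for l in range(1, len(cd)):
        if cd[l] < cd[l - 1]:
            return l + 1        # convert 0-based index to 1-based step
    return None


print(first_inversion([1.0, 2.0, 4.0]))   # None: distances never decrease
print(first_inversion([1.0, 3.0, 2.5]))   # 3: step 3 merges below step 2
```

Equal consecutive distances are not reported, since the test is for a strict decrease.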
For METHOD $\ge 3$ slight rounding errors may occur in the calculation of the updated distances. These would not normally significantly affect the results; however, there may be an effect if distances are (almost) equal.
If, at a stage, two distances $d_{ij}$ and $d_{kl}$, with $i < k$, or with $i = k$ and $j < l$, are equal then clusters $k$ and $l$ will be merged rather than clusters $i$ and $j$. For single link clustering this choice will only affect the order of the objects in the dendrogram. However, for other methods the choice of $kl$ rather than $ij$ may affect the shape of the dendrogram. If either of the distances $d_{ij}$ and $d_{kl}$ is affected by rounding errors then their equality, and hence the dendrogram, may be affected.
The dendrogram may be formed using G03EHF. Groupings based on the clusters formed at a given distance can be computed using G03EJF.
Data consisting of three variables on five objects are read in. Euclidean squared distances based on two variables are computed using G03EAF, the objects are clustered using G03ECF and the dendrogram is computed using G03EHF. The dendrogram is then printed.
9.1 Program Text
Program Text (g03ecfe.f90)
9.2 Program Data
Program Data (g03ecfe.d)
9.3 Program Results
Program Results (g03ecfe.r)