# NAG Toolbox: nag_mv_cluster_kmeans (g03ef)

## Purpose

nag_mv_cluster_kmeans (g03ef) performs $K$-means cluster analysis.

## Syntax

[cmeans, inc, nic, css, csw, ifail] = g03ef(x, isx, cmeans, 'n', n, 'm', m, 'nvar', nvar, 'k', k, 'wt', wt, 'maxit', maxit)
[cmeans, inc, nic, css, csw, ifail] = nag_mv_cluster_kmeans(x, isx, cmeans, 'n', n, 'm', m, 'nvar', nvar, 'k', k, 'wt', wt, 'maxit', maxit)
Note: the interface to this routine has changed since earlier releases of the toolbox:
 At Mark 24: weight was removed from the interface; wt was made optional
 At Mark 22: n and k were made optional

## Description

Given $n$ objects with $p$ variables measured on each object, ${x}_{\mathit{i}\mathit{j}}$, for $\mathit{i}=1,2,\dots ,n$ and $\mathit{j}=1,2,\dots ,p$, nag_mv_cluster_kmeans (g03ef) allocates each object to one of $K$ groups or clusters to minimize the within-cluster sum of squares:
 $\sum_{k=1}^{K}\sum_{i\in {S}_{k}}\sum_{j=1}^{p}{\left({x}_{ij}-{\stackrel{-}{x}}_{kj}\right)}^{2},$
where ${S}_{k}$ is the set of objects in the $k$th cluster and ${\stackrel{-}{x}}_{kj}$ is the mean for the variable $j$ over cluster $k$. This is often known as $K$-means clustering.
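As a rough illustration of this criterion, the following sketch evaluates the within-cluster sum of squares for a fixed allocation in plain MATLAB. The data `x` and the allocation `inc` are made up for the illustration, and the code does not call or reproduce the NAG routine.

```matlab
% Illustrative only: a tiny made-up data set, not the soil data.
x   = [1 2; 2 1; 8 9; 9 8];   % n = 4 objects, p = 2 variables
inc = [1; 1; 2; 2];           % hypothetical allocation of objects to clusters
K   = 2;

css = zeros(K, 1);
for k = 1:K
    xk     = x(inc == k, :);                    % objects in cluster k (the set S_k)
    xbar   = mean(xk, 1);                       % mean of each variable over cluster k
    d      = xk - repmat(xbar, size(xk, 1), 1); % deviations from the cluster mean
    css(k) = sum(d(:).^2);                      % sum of squares for cluster k
end
total = sum(css);   % the quantity that K-means seeks to minimize

disp(css');    % 1 1
disp(total);   % 2
```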
In addition to the data matrix, a $K$ by $p$ matrix giving the initial cluster centres for the $K$ clusters is required. The objects are then initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure is to iteratively search for the $K$-partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.
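The initial allocation step can be sketched in plain MATLAB as follows. This assumes that "nearest" means smallest squared Euclidean distance, which is consistent with the sum-of-squares criterion but is an assumption about the routine's internals. The data and the initial centres are made up for the illustration.

```matlab
% Illustrative only: made-up data and initial centres.
x      = [1 2; 2 1; 8 9; 9 8];   % n = 4 objects, p = 2 variables
cmeans = [0 0; 10 10];           % K = 2 initial cluster centres (K by p)

n  = size(x, 1);
K  = size(cmeans, 1);
d2 = zeros(n, K);                % squared distance from each object to each centre
for k = 1:K
    diffk    = x - repmat(cmeans(k, :), n, 1);
    d2(:, k) = sum(diffk.^2, 2);
end
[~, inc] = min(d2, [], 2);       % index of the nearest centre for each object

disp(inc');   % 1 1 2 2
```

The iterative transfer of points between clusters that follows this step is not shown here.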
Optionally, weights for each object, ${w}_{i}$, can be used so that the clustering is based on within-cluster weighted sums of squares:
 $\sum_{k=1}^{K}\sum_{i\in {S}_{k}}\sum_{j=1}^{p}{w}_{i}{\left({x}_{ij}-{\stackrel{~}{x}}_{kj}\right)}^{2},$
where ${\stackrel{~}{x}}_{kj}$ is the weighted mean for variable $j$ over cluster $k$.
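A small worked sketch of the weighted criterion in plain MATLAB, using made-up data, weights and allocation. The names `css` and `csw` are chosen to mirror the output parameters described later, but the code is an illustration of the formula, not the routine itself.

```matlab
% Illustrative only: made-up data, weights and allocation.
x   = [1 2; 3 4; 10 10];   % n = 3 objects, p = 2 variables
wt  = [1; 3; 2];           % object weights w_i
inc = [1; 1; 2];           % hypothetical allocation of objects to clusters
K   = 2;

css = zeros(K, 1);         % within-cluster weighted sums of squares
csw = zeros(K, 1);         % within-cluster sums of weights
for k = 1:K
    idx    = (inc == k);
    wk     = wt(idx);
    xk     = x(idx, :);
    csw(k) = sum(wk);
    xtil   = (wk' * xk) / csw(k);               % weighted mean of each variable
    d      = xk - repmat(xtil, size(xk, 1), 1); % deviations from the weighted mean
    css(k) = sum(wk' * (d.^2));                 % weighted sum of squares
end

disp(csw');   % 4 2
disp(css');   % 6 0
```

For cluster 1 the weighted mean is (2.5, 3.5), so the deviations are (-1.5, -1.5) with weight 1 and (0.5, 0.5) with weight 3, giving 2(2.25) + 3(2)(0.25) = 6.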
The function is based on the algorithm of Hartigan and Wong (1979).

## References

Everitt B S (1974) Cluster Analysis Heinemann
Hartigan J A and Wong M A (1979) Algorithm AS 136: A K-means clustering algorithm Appl. Statist. 28 100–108
Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press

## Parameters

### Compulsory Input Parameters

1:     $\mathrm{x}\left(\mathit{ldx},{\mathbf{m}}\right)$ – double array
ldx, the first dimension of the array, must satisfy the constraint $\mathit{ldx}\ge {\mathbf{n}}$.
${\mathbf{x}}\left(\mathit{i},\mathit{j}\right)$ must contain the value of the $\mathit{j}$th variable for the $\mathit{i}$th object, for $\mathit{i}=1,2,\dots ,n$ and $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
2:     $\mathrm{isx}\left({\mathbf{m}}\right)$ – int64 array
${\mathbf{isx}}\left(\mathit{j}\right)$ indicates whether or not the $\mathit{j}$th variable is to be included in the analysis. If ${\mathbf{isx}}\left(\mathit{j}\right)>0$, the variable contained in the $\mathit{j}$th column of x is included, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
Constraint: ${\mathbf{isx}}\left(j\right)>0$ for nvar values of $j$.
3:     $\mathrm{cmeans}\left(\mathit{ldc},{\mathbf{nvar}}\right)$ – double array
ldc, the first dimension of the array, must satisfy the constraint $\mathit{ldc}\ge {\mathbf{k}}$.
${\mathbf{cmeans}}\left(\mathit{i},\mathit{j}\right)$ must contain the value of the $\mathit{j}$th variable for the $\mathit{i}$th initial cluster centre, for $\mathit{i}=1,2,\dots ,K$ and $\mathit{j}=1,2,\dots ,p$.
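The following sketch shows one way the three compulsory inputs might be prepared when only some variables are included. It uses four rows of the soil data from the Example section. The variable `full_centres` is a hypothetical helper, and the assumption that the columns of cmeans correspond to the included variables in order follows from the dimensions given above rather than from a tested run. The integer type int64 matches the Example program and may need to be changed for other installations. The call itself is guarded so that the script still runs where the toolbox is not installed.

```matlab
% Four rows taken from the soil data in the Example section.
x = [77.3, 13.0,  9.7, 1.5, 6.4;
     82.5, 10.0,  7.5, 1.5, 6.5;
     47.2, 33.8, 19.0, 2.8, 5.8;
     47.8, 36.5, 15.7, 2.3, 7.2];

isx  = int64([1; 1; 1; 0; 0]);   % include variables 1 to 3, exclude 4 and 5
nvar = nnz(isx > 0);             % number of included variables (p)

% Hypothetical helper: centres given for all m variables, then reduced
% to the included ones so that cmeans is K by nvar.
full_centres = [82.5, 10.0,  7.5, 1.5, 6.5;
                47.8, 36.5, 15.7, 2.3, 7.2];
cmeans = full_centres(:, isx > 0);

if exist('g03ef', 'file')        % only call when the NAG Toolbox is available
    [cmeans_out, inc, nic, css, csw, ifail] = g03ef(x, isx, cmeans);
end
```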

### Optional Input Parameters

1:     $\mathrm{n}$ – int64 scalar
Default: the first dimension of the array x.
$n$, the number of objects.
Constraint: ${\mathbf{n}}>1$.
2:     $\mathrm{m}$ – int64 scalar
Default: the dimension of the array isx and the second dimension of the array x. (An error is raised if these dimensions are not equal.)
The total number of variables in array x.
Constraint: ${\mathbf{m}}\ge {\mathbf{nvar}}$.
3:     $\mathrm{nvar}$ – int64 scalar
Default: the second dimension of the array cmeans.
$p$, the number of variables included in the sums of squares calculations.
Constraint: $1\le {\mathbf{nvar}}\le {\mathbf{m}}$.
4:     $\mathrm{k}$ – int64 scalar
Default: the first dimension of the array cmeans.
$K$, the number of clusters.
Constraint: ${\mathbf{k}}\ge 2$.
5:     $\mathrm{wt}\left(:\right)$ – double array
The dimension of the array wt must be at least ${\mathbf{n}}$ if $\mathit{weight}=\text{'W'}$, and at least $1$ otherwise.
If $\mathit{weight}=\text{'W'}$, the first $n$ elements of wt must contain the weights to be used.
If ${\mathbf{wt}}\left(i\right)=0.0$, the $i$th observation is not included in the analysis. The effective number of observations is the sum of the weights.
If $\mathit{weight}=\text{'U'}$, wt is not referenced and the effective number of observations is $n$.
Constraint: if $\mathit{weight}=\text{'W'}$, ${\mathbf{wt}}\left(\mathit{i}\right)\ge 0.0$ and ${\mathbf{wt}}\left(\mathit{i}\right)>0.0$ for at least two values of $\mathit{i}$, for $\mathit{i}=1,2,\dots ,n$.
6:     $\mathrm{maxit}$ – int64 scalar
Default: $10$
The maximum number of iterations allowed in the analysis.
Constraint: ${\mathbf{maxit}}>0$.
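As a sketch of how the optional parameters might be supplied, the following passes wt and maxit as name/value pairs in the form shown in the Syntax section. The data are six rows of the soil data from the Example section; the particular weights and the value of maxit are arbitrary choices for the illustration. The integer type int64 follows the Example program, and the call is guarded so that the script still runs where the toolbox is not installed.

```matlab
% Six rows taken from the soil data in the Example section.
x = [77.3, 13.0,  9.7, 1.5, 6.4;
     82.5, 10.0,  7.5, 1.5, 6.5;
     83.3, 10.0,  6.7, 2.2, 7.0;
     47.2, 33.8, 19.0, 2.8, 5.8;
     47.8, 36.5, 15.7, 2.3, 7.2;
     48.6, 37.1, 14.3, 2.1, 7.2];
n      = size(x, 1);
isx    = ones(size(x, 2), 1, 'int64');   % include every variable
cmeans = x([2, 5], :);                   % two data rows as initial centres (K = 2)

wt    = ones(n, 1);
wt(3) = 0.0;          % zero weight: object 3 is not included in the analysis
maxit = int64(25);    % arbitrary value above the default of 10

% Checks mirroring the documented constraints on wt and maxit.
assert(all(wt >= 0) && nnz(wt > 0) >= 2, 'wt violates its constraint');
assert(maxit > 0, 'maxit must be positive');

if exist('g03ef', 'file')                % only call when the NAG Toolbox is available
    [cmeans_out, inc, nic, css, csw, ifail] = ...
        g03ef(x, isx, cmeans, 'wt', wt, 'maxit', maxit);
end
```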

### Output Parameters

1:     $\mathrm{cmeans}\left(\mathit{ldc},{\mathbf{nvar}}\right)$ – double array
${\mathbf{cmeans}}\left(\mathit{i},\mathit{j}\right)$ contains the value of the $\mathit{j}$th variable for the $\mathit{i}$th computed cluster centre, for $\mathit{i}=1,2,\dots ,K$ and $\mathit{j}=1,2,\dots ,p$.
2:     $\mathrm{inc}\left({\mathbf{n}}\right)$ – int64 array
${\mathbf{inc}}\left(\mathit{i}\right)$ contains the cluster to which the $\mathit{i}$th object has been allocated, for $\mathit{i}=1,2,\dots ,n$.
3:     $\mathrm{nic}\left({\mathbf{k}}\right)$ – int64 array
${\mathbf{nic}}\left(\mathit{i}\right)$ contains the number of objects in the $\mathit{i}$th cluster, for $\mathit{i}=1,2,\dots ,K$.
4:     $\mathrm{css}\left({\mathbf{k}}\right)$ – double array
${\mathbf{css}}\left(\mathit{i}\right)$ contains the within-cluster (weighted) sum of squares of the $\mathit{i}$th cluster, for $\mathit{i}=1,2,\dots ,K$.
5:     $\mathrm{csw}\left({\mathbf{k}}\right)$ – double array
${\mathbf{csw}}\left(\mathit{i}\right)$ contains the within-cluster sum of weights of the $\mathit{i}$th cluster, for $\mathit{i}=1,2,\dots ,K$. If $\mathit{weight}=\text{'U'}$, the sum of weights is the number of objects in the cluster.
6:     $\mathrm{ifail}$ – int64 scalar
${\mathbf{ifail}}={\mathbf{0}}$ unless the function detects an error (see Error Indicators and Warnings).

## Error Indicators and Warnings

Errors or warnings detected by the function:
${\mathbf{ifail}}=1$
 On entry, $\mathit{weight}\ne \text{'W'}$ or $\text{'U'}$, or ${\mathbf{n}}<2$, or ${\mathbf{nvar}}<1$, or ${\mathbf{m}}<{\mathbf{nvar}}$, or ${\mathbf{k}}<2$, or $\mathit{ldx}<{\mathbf{n}}$, or $\mathit{ldc}<{\mathbf{k}}$, or ${\mathbf{maxit}}\le 0$.
${\mathbf{ifail}}=2$
 On entry, $\mathit{weight}=\text{'W'}$ and a value of ${\mathbf{wt}}\left(i\right)<0.0$ for some $i$, or $\mathit{weight}=\text{'W'}$ and ${\mathbf{wt}}\left(i\right)=0.0$ for all or all but one values of $i$.
${\mathbf{ifail}}=3$
 On entry, the number of positive values in isx does not equal nvar.
${\mathbf{ifail}}=4$
On entry, at least one cluster is empty after the initial assignment. Try a different set of initial cluster centres in cmeans and also consider decreasing the value of k. The empty clusters may be found by examining the values in nic.
${\mathbf{ifail}}=5$
Convergence has not been achieved within the maximum number of iterations given by maxit. Try increasing maxit and, if possible, use the returned values in cmeans as the initial cluster centres.
${\mathbf{ifail}}=-99$
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
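The advice given for ${\mathbf{ifail}}=4$ and ${\mathbf{ifail}}=5$ might be followed along the lines of the sketch below. It assumes that, when ifail is requested as an output, the call returns control with ifail set rather than raising a MATLAB error; this behaviour is not stated above and should be checked for the installed release. The data are six rows of the soil data from the Example section, the starting value of maxit and the limit of three attempts are arbitrary, and the call is guarded so that the script still runs where the toolbox is not installed.

```matlab
% Six rows taken from the soil data in the Example section.
x = [77.3, 13.0,  9.7, 1.5, 6.4;
     82.5, 10.0,  7.5, 1.5, 6.5;
     83.3, 10.0,  6.7, 2.2, 7.0;
     47.2, 33.8, 19.0, 2.8, 5.8;
     47.8, 36.5, 15.7, 2.3, 7.2;
     48.6, 37.1, 14.3, 2.1, 7.2];
isx    = ones(size(x, 2), 1, 'int64');
cmeans = x([2, 5], :);     % two data rows as initial centres (K = 2)
maxit  = int64(1);         % deliberately small, so that ifail = 5 is plausible

if exist('g03ef', 'file')  % only call when the NAG Toolbox is available
    for attempt = 1:3
        % Reuse the returned cmeans as the next set of initial centres.
        [cmeans, inc, nic, css, csw, ifail] = ...
            g03ef(x, isx, cmeans, 'maxit', maxit);
        if ifail ~= 5
            break;
        end
        maxit = 2 * maxit; % not converged: allow more iterations
    end
    if ifail == 4
        % Empty clusters show up as zero counts in nic.
        fprintf('Empty clusters: %s\n', mat2str(find(nic == 0)'));
    end
end
```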

## Accuracy

nag_mv_cluster_kmeans (g03ef) produces clusters that are locally optimal; the within-cluster sum of squares may not be decreased by transferring a point from one cluster to another, but different partitions may have the same or smaller within-cluster sum of squares.

The time per iteration is approximately proportional to $\mathit{np}K$.

## Example

The data consists of observations of five variables on twenty soils (see Hartigan and Wong (1979)). The data is read in, the $K$-means clustering performed and the results printed.
```
function g03ef_example

fprintf('g03ef example results\n\n');

x = [77.3, 13.0,  9.7, 1.5, 6.4;
82.5, 10.0,  7.5, 1.5, 6.5;
66.9, 20.6, 12.5, 2.3, 7.0;
47.2, 33.8, 19.0, 2.8, 5.8;
65.3, 20.5, 14.2, 1.9, 6.9;
83.3, 10.0,  6.7, 2.2, 7.0;
81.6, 12.7,  5.7, 2.9, 6.7;
47.8, 36.5, 15.7, 2.3, 7.2;
48.6, 37.1, 14.3, 2.1, 7.2;
61.6, 25.5, 12.9, 1.9, 7.3;
58.6, 26.5, 14.9, 2.4, 6.7;
69.3, 22.3,  8.4, 4.0, 7.0;
61.8, 30.8,  7.4, 2.7, 6.4;
67.7, 25.3,  7.0, 4.8, 7.3;
57.2, 31.2, 11.6, 2.4, 6.5;
67.2, 22.7, 10.1, 3.3, 6.2;
59.2, 31.2,  9.6, 2.4, 6.0;
80.2, 13.2,  6.6, 2.0, 5.8;
82.2, 11.1,  6.7, 2.2, 7.2;
69.7, 20.7,  9.6, 3.1, 5.9];
% n objects (rows) and m variables (columns), as in the parameter descriptions
[n,m] = size(x);
isx   = ones(m,1,'int64');

% Cluster centres
cmeans = [82.5, 10.0,  7.5, 1.5, 6.5;
47.8, 36.5, 15.7, 2.3, 7.2;
67.2, 22.7, 10.1, 3.3, 6.2];

% perform k means clustering
[cmeans, inc, nic, css, csw, ifail] = ...
g03ef(x, isx, cmeans);

disp(' The cluster each point belongs to');
fprintf('%6d%6d%6d%6d%6d%6d%6d%6d%6d%6d\n',inc);
disp(' The number of points in each cluster');
disp(nic');
disp(' The within-cluster sum of weights of each cluster');
disp(csw');
disp(' The within-cluster sum of squares of each cluster');
disp(css')
mtitle = 'The final cluster centres';
matrix = 'General';
diag   = ' ';
[ifail] = x04ca( ...
matrix, diag, cmeans, mtitle);

```
```
g03ef example results

The cluster each point belongs to
1     1     3     2     3     1     1     2     2     3
3     3     3     3     3     3     3     1     1     3
The number of points in each cluster
6                    3                   11

The within-cluster sum of weights of each cluster
6     3    11

The within-cluster sum of squares of each cluster
46.5717   20.3800  468.8964

The final cluster centres
            1          2          3          4          5
1     81.1833    11.6667     7.1500     2.0500     6.6000
2     47.8667    35.8000    16.3333     2.4000     6.7333
3     64.0455    25.2091    10.7455     2.8364     6.6545
```