# NAG Library Routine Document

## 1Purpose

g03eff performs $K$-means cluster analysis.

## 2Specification

Fortran Interface
 Subroutine g03eff ( n, m, x, ldx, isx, nvar, k, ldc, wt, inc, nic, css, csw, iwk, wk,
 Integer, Intent (In) :: n, m, ldx, isx(m), nvar, k, ldc, maxit Integer, Intent (Inout) :: ifail Integer, Intent (Out) :: inc(n), nic(k), iwk(n+3*k) Real (Kind=nag_wp), Intent (In) :: x(ldx,m), wt(*) Real (Kind=nag_wp), Intent (Inout) :: cmeans(ldc,nvar) Real (Kind=nag_wp), Intent (Out) :: css(k), csw(k), wk(n+2*k) Character (1), Intent (In) :: weight
#include nagmk26.h
 void g03eff_ (const char *weight, const Integer *n, const Integer *m, const double x[], const Integer *ldx, const Integer isx[], const Integer *nvar, const Integer *k, double cmeans[], const Integer *ldc, const double wt[], Integer inc[], Integer nic[], double css[], double csw[], const Integer *maxit, Integer iwk[], double wk[], Integer *ifail, const Charlen length_weight)

## 3Description

Given $n$ objects with $p$ variables measured on each object, ${x}_{\mathit{i}\mathit{j}}$, for $\mathit{i}=1,2,\dots ,n$ and $\mathit{j}=1,2,\dots ,p$, g03eff allocates each object to one of $K$ groups or clusters to minimize the within-cluster sum of squares:
 $∑k=1K∑i∈Sk∑j=1p xij-x-kj 2,$
where ${S}_{k}$ is the set of objects in the $k$th cluster and ${\stackrel{-}{x}}_{kj}$ is the mean for the variable $j$ over cluster $k$. This is often known as $K$-means clustering.
In addition to the data matrix, a $K$ by $p$ matrix giving the initial cluster centres for the $K$ clusters is required. The objects are then initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure is to iteratively search for the $K$-partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.
Optionally, weights for each object, ${w}_{i}$, can be used so that the clustering is based on within-cluster weighted sums of squares:
 $∑k=1K∑i∈Sk∑j=1pwi xij-x~kj 2,$
where ${\stackrel{~}{x}}_{kj}$ is the weighted mean for variable $j$ over cluster $k$.
The routine is based on the algorithm of Hartigan and Wong (1979).
Everitt B S (1974) Cluster Analysis Heinemann
Hartigan J A and Wong M A (1979) Algorithm AS 136: A K-means clustering algorithm Appl. Statist. 28 100–108
Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press

## 5Arguments

1:     $\mathbf{weight}$ – Character(1)Input
On entry: indicates if weights are to be used.
${\mathbf{weight}}=\text{'U'}$
No weights are used.
${\mathbf{weight}}=\text{'W'}$
Weights are used and must be supplied in wt.
Constraint: ${\mathbf{weight}}=\text{'U'}$ or $\text{'W'}$.
2:     $\mathbf{n}$ – IntegerInput
On entry: $n$, the number of objects.
Constraint: ${\mathbf{n}}>1$.
3:     $\mathbf{m}$ – IntegerInput
On entry: the total number of variables in array x.
Constraint: ${\mathbf{m}}\ge {\mathbf{nvar}}$.
4:     $\mathbf{x}\left({\mathbf{ldx}},{\mathbf{m}}\right)$ – Real (Kind=nag_wp) arrayInput
On entry: ${\mathbf{x}}\left(\mathit{i},\mathit{j}\right)$ must contain the value of the $\mathit{j}$th variable for the $\mathit{i}$th object, for $\mathit{i}=1,2,\dots ,n$ and $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
5:     $\mathbf{ldx}$ – IntegerInput
On entry: the first dimension of the array x as declared in the (sub)program from which g03eff is called.
Constraint: ${\mathbf{ldx}}\ge {\mathbf{n}}$.
6:     $\mathbf{isx}\left({\mathbf{m}}\right)$ – Integer arrayInput
On entry: ${\mathbf{isx}}\left(\mathit{j}\right)$ indicates whether or not the $\mathit{j}$th variable is to be included in the analysis. If ${\mathbf{isx}}\left(\mathit{j}\right)>0$, the variable contained in the $\mathit{j}$th column of x is included, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
Constraint: ${\mathbf{isx}}\left(j\right)>0$ for nvar values of $j$.
7:     $\mathbf{nvar}$ – IntegerInput
On entry: $p$, the number of variables included in the sums of squares calculations.
Constraint: $1\le {\mathbf{nvar}}\le {\mathbf{m}}$.
8:     $\mathbf{k}$ – IntegerInput
On entry: $K$, the number of clusters.
Constraint: ${\mathbf{k}}\ge 2$.
9:     $\mathbf{cmeans}\left({\mathbf{ldc}},{\mathbf{nvar}}\right)$ – Real (Kind=nag_wp) arrayInput/Output
On entry: ${\mathbf{cmeans}}\left(\mathit{i},\mathit{j}\right)$ must contain the value of the $\mathit{j}$th variable for the $\mathit{i}$th initial cluster centre, for $\mathit{i}=1,2,\dots ,K$ and $\mathit{j}=1,2,\dots ,p$.
On exit: ${\mathbf{cmeans}}\left(\mathit{i},\mathit{j}\right)$ contains the value of the $\mathit{j}$th variable for the $\mathit{i}$th computed cluster centre, for $\mathit{i}=1,2,\dots ,K$ and $\mathit{j}=1,2,\dots ,p$.
10:   $\mathbf{ldc}$ – IntegerInput
On entry: the first dimension of the array cmeans as declared in the (sub)program from which g03eff is called.
Constraint: ${\mathbf{ldc}}\ge {\mathbf{k}}$.
11:   $\mathbf{wt}\left(*\right)$ – Real (Kind=nag_wp) arrayInput
Note: the dimension of the array wt must be at least ${\mathbf{n}}$ if ${\mathbf{weight}}=\text{'W'}$, and at least $1$ otherwise.
On entry: if ${\mathbf{weight}}=\text{'W'}$, the first $n$ elements of wt must contain the weights to be used.
If ${\mathbf{wt}}\left(i\right)=0.0$, the $i$th observation is not included in the analysis. The effective number of observation is the sum of the weights.
If ${\mathbf{weight}}=\text{'U'}$, wt is not referenced and the effective number of observations is $n$.
Constraint: if ${\mathbf{weight}}=\text{'W'}$, ${\mathbf{wt}}\left(\mathit{i}\right)\ge 0.0$ and ${\mathbf{wt}}\left(\mathit{i}\right)>0.0$ for at least two values of $\mathit{i}$, for $\mathit{i}=1,2,\dots ,n$.
12:   $\mathbf{inc}\left({\mathbf{n}}\right)$ – Integer arrayOutput
On exit: ${\mathbf{inc}}\left(\mathit{i}\right)$ contains the cluster to which the $\mathit{i}$th object has been allocated, for $\mathit{i}=1,2,\dots ,n$.
13:   $\mathbf{nic}\left({\mathbf{k}}\right)$ – Integer arrayOutput
On exit: ${\mathbf{nic}}\left(\mathit{i}\right)$ contains the number of objects in the $\mathit{i}$th cluster, for $\mathit{i}=1,2,\dots ,K$.
14:   $\mathbf{css}\left({\mathbf{k}}\right)$ – Real (Kind=nag_wp) arrayOutput
On exit: ${\mathbf{css}}\left(\mathit{i}\right)$ contains the within-cluster (weighted) sum of squares of the $\mathit{i}$th cluster, for $\mathit{i}=1,2,\dots ,K$.
15:   $\mathbf{csw}\left({\mathbf{k}}\right)$ – Real (Kind=nag_wp) arrayOutput
On exit: ${\mathbf{csw}}\left(\mathit{i}\right)$ contains the within-cluster sum of weights of the $\mathit{i}$th cluster, for $\mathit{i}=1,2,\dots ,K$. If ${\mathbf{weight}}=\text{'U'}$, the sum of weights is the number of objects in the cluster.
16:   $\mathbf{maxit}$ – IntegerInput
On entry: the maximum number of iterations allowed in the analysis.
Suggested value: ${\mathbf{maxit}}=10$.
Constraint: ${\mathbf{maxit}}>0$.
17:   $\mathbf{iwk}\left({\mathbf{n}}+3×{\mathbf{k}}\right)$ – Integer arrayWorkspace
18:   $\mathbf{wk}\left({\mathbf{n}}+2×{\mathbf{k}}\right)$ – Real (Kind=nag_wp) arrayWorkspace
19:   $\mathbf{ifail}$ – IntegerInput/Output
On entry: ifail must be set to $0$, $-1\text{​ or ​}1$. If you are unfamiliar with this argument you should refer to Section 3.4 in How to Use the NAG Library and its Documentation for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value $-1\text{​ or ​}1$ is recommended. If the output of error messages is undesirable, then the value $1$ is recommended. Otherwise, if you are not familiar with this argument, the recommended value is $0$. When the value $-\mathbf{1}\text{​ or ​}\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit: ${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see Section 6).

## 6Error Indicators and Warnings

If on entry ${\mathbf{ifail}}=0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
${\mathbf{ifail}}=1$
 On entry, ${\mathbf{weight}}\ne \text{'W'}$ or $\text{'U'}$, or ${\mathbf{n}}<2$, or ${\mathbf{nvar}}<1$, or ${\mathbf{m}}<{\mathbf{nvar}}$, or ${\mathbf{k}}<2$, or ${\mathbf{ldx}}<{\mathbf{n}}$, or ${\mathbf{ldc}}<{\mathbf{k}}$, or ${\mathbf{maxit}}\le 0$.
${\mathbf{ifail}}=2$
 On entry, ${\mathbf{weight}}=\text{'W'}$ and a value of ${\mathbf{wt}}\left(i\right)<0.0$ for some $i$, or ${\mathbf{weight}}=\text{'W'}$ and ${\mathbf{wt}}\left(i\right)=0.0$ for all or all but one values of $i$.
${\mathbf{ifail}}=3$
 On entry, the number of positive values in isx does not equal nvar.
${\mathbf{ifail}}=4$
On entry, at least one cluster is empty after the initial assignment. Try a different set of initial cluster centres in cmeans and also consider decreasing the value of k. The empty clusters may be found by examining the values in nic.
${\mathbf{ifail}}=5$
Convergence has not been achieved within the maximum number of iterations given by maxit. Try increasing maxit and, if possible, use the returned values in cmeans as the initial cluster centres.
${\mathbf{ifail}}=-99$
See Section 3.9 in How to Use the NAG Library and its Documentation for further information.
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See Section 3.8 in How to Use the NAG Library and its Documentation for further information.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See Section 3.7 in How to Use the NAG Library and its Documentation for further information.

## 7Accuracy

g03eff produces clusters that are locally optimal; the within-cluster sum of squares may not be decreased by transferring a point from one cluster to another, but different partitions may have the same or smaller within-cluster sum of squares.

## 8Parallelism and Performance

g03eff is not threaded in any implementation.

The time per iteration is approximately proportional to $\mathit{np}K$.

## 10Example

The data consists of observations of five variables on twenty soils (see Hartigan and Wong (1979)). The data is read in, the $K$-means clustering performed and the results printed.

### 10.1Program Text

Program Text (g03effe.f90)

### 10.2Program Data

Program Data (g03effe.d)

### 10.3Program Results

Program Results (g03effe.r)

© The Numerical Algorithms Group Ltd, Oxford, UK. 2017