# NAG FL Interfaceg03eff (cluster_​kmeans)

## ▸▿ Contents

Settings help

FL Name Style:

FL Specification Language:

## 1Purpose

g03eff performs $K$-means cluster analysis.

## 2Specification

Fortran Interface
 Subroutine g03eff ( n, m, x, ldx, isx, nvar, k, ldc, wt, inc, nic, css, csw, iwk, wk,
 Integer, Intent (In) :: n, m, ldx, isx(m), nvar, k, ldc, maxit Integer, Intent (Inout) :: ifail Integer, Intent (Out) :: inc(n), nic(k), iwk(n+3*k) Real (Kind=nag_wp), Intent (In) :: x(ldx,m), wt(*) Real (Kind=nag_wp), Intent (Inout) :: cmeans(ldc,nvar) Real (Kind=nag_wp), Intent (Out) :: css(k), csw(k), wk(n+2*k) Character (1), Intent (In) :: weight
#include <nag.h>
 void g03eff_ (const char *weight, const Integer *n, const Integer *m, const double x[], const Integer *ldx, const Integer isx[], const Integer *nvar, const Integer *k, double cmeans[], const Integer *ldc, const double wt[], Integer inc[], Integer nic[], double css[], double csw[], const Integer *maxit, Integer iwk[], double wk[], Integer *ifail, const Charlen length_weight)
The routine may be called by the names g03eff or nagf_mv_cluster_kmeans.

## 3Description

Given $n$ objects with $p$ variables measured on each object, ${x}_{\mathit{i}\mathit{j}}$, for $\mathit{i}=1,2,\dots ,n$ and $\mathit{j}=1,2,\dots ,p$, g03eff allocates each object to one of $K$ groups or clusters to minimize the within-cluster sum of squares:
 $∑k=1K∑i∈Sk∑j=1p (xij-x¯kj) 2,$
where ${S}_{k}$ is the set of objects in the $k$th cluster and ${\overline{x}}_{kj}$ is the mean for the variable $j$ over cluster $k$. This is often known as $K$-means clustering.
In addition to the data matrix, a $K×p$ matrix giving the initial cluster centres for the $K$ clusters is required. The objects are then initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure is to iteratively search for the $K$-partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.
Optionally, weights for each object, ${w}_{i}$, can be used so that the clustering is based on within-cluster weighted sums of squares:
 $∑k=1K∑i∈Sk∑j=1pwi (xij-x~kj) 2,$
where ${\stackrel{~}{x}}_{kj}$ is the weighted mean for variable $j$ over cluster $k$.
The routine is based on the algorithm of Hartigan and Wong (1979).

## 4References

Everitt B S (1974) Cluster Analysis Heinemann
Hartigan J A and Wong M A (1979) Algorithm AS 136: A K-means clustering algorithm Appl. Statist. 28 100–108
Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press

## 5Arguments

1: $\mathbf{weight}$Character(1) Input
On entry: indicates if weights are to be used.
${\mathbf{weight}}=\text{'U'}$
No weights are used.
${\mathbf{weight}}=\text{'W'}$
Weights are used and must be supplied in wt.
Constraint: ${\mathbf{weight}}=\text{'U'}$ or $\text{'W'}$.
2: $\mathbf{n}$Integer Input
On entry: $n$, the number of objects.
Constraint: ${\mathbf{n}}>1$.
3: $\mathbf{m}$Integer Input
On entry: the total number of variables in array x.
Constraint: ${\mathbf{m}}\ge {\mathbf{nvar}}$.
4: $\mathbf{x}\left({\mathbf{ldx}},{\mathbf{m}}\right)$Real (Kind=nag_wp) array Input
On entry: ${\mathbf{x}}\left(\mathit{i},\mathit{j}\right)$ must contain the value of the $\mathit{j}$th variable for the $\mathit{i}$th object, for $\mathit{i}=1,2,\dots ,n$ and $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
5: $\mathbf{ldx}$Integer Input
On entry: the first dimension of the array x as declared in the (sub)program from which g03eff is called.
Constraint: ${\mathbf{ldx}}\ge {\mathbf{n}}$.
6: $\mathbf{isx}\left({\mathbf{m}}\right)$Integer array Input
On entry: ${\mathbf{isx}}\left(\mathit{j}\right)$ indicates whether or not the $\mathit{j}$th variable is to be included in the analysis. If ${\mathbf{isx}}\left(\mathit{j}\right)>0$, the variable contained in the $\mathit{j}$th column of x is included, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
Constraint: ${\mathbf{isx}}\left(j\right)>0$ for nvar values of $j$.
7: $\mathbf{nvar}$Integer Input
On entry: $p$, the number of variables included in the sums of squares calculations.
Constraint: $1\le {\mathbf{nvar}}\le {\mathbf{m}}$.
8: $\mathbf{k}$Integer Input
On entry: $K$, the number of clusters.
Constraint: ${\mathbf{k}}\ge 2$.
9: $\mathbf{cmeans}\left({\mathbf{ldc}},{\mathbf{nvar}}\right)$Real (Kind=nag_wp) array Input/Output
On entry: ${\mathbf{cmeans}}\left(\mathit{i},\mathit{j}\right)$ must contain the value of the $\mathit{j}$th variable for the $\mathit{i}$th initial cluster centre, for $\mathit{i}=1,2,\dots ,K$ and $\mathit{j}=1,2,\dots ,p$.
On exit: ${\mathbf{cmeans}}\left(\mathit{i},\mathit{j}\right)$ contains the value of the $\mathit{j}$th variable for the $\mathit{i}$th computed cluster centre, for $\mathit{i}=1,2,\dots ,K$ and $\mathit{j}=1,2,\dots ,p$.
10: $\mathbf{ldc}$Integer Input
On entry: the first dimension of the array cmeans as declared in the (sub)program from which g03eff is called.
Constraint: ${\mathbf{ldc}}\ge {\mathbf{k}}$.
11: $\mathbf{wt}\left(*\right)$Real (Kind=nag_wp) array Input
Note: the dimension of the array wt must be at least ${\mathbf{n}}$ if ${\mathbf{weight}}=\text{'W'}$, and at least $1$ otherwise.
On entry: if ${\mathbf{weight}}=\text{'W'}$, the first $n$ elements of wt must contain the weights to be used.
If ${\mathbf{wt}}\left(i\right)=0.0$, the $i$th observation is not included in the analysis. The effective number of observation is the sum of the weights.
If ${\mathbf{weight}}=\text{'U'}$, wt is not referenced and the effective number of observations is $n$.
Constraint: if ${\mathbf{weight}}=\text{'W'}$, ${\mathbf{wt}}\left(\mathit{i}\right)\ge 0.0$ and ${\mathbf{wt}}\left(\mathit{i}\right)>0.0$ for at least two values of $\mathit{i}$, for $\mathit{i}=1,2,\dots ,n$.
12: $\mathbf{inc}\left({\mathbf{n}}\right)$Integer array Output
On exit: ${\mathbf{inc}}\left(\mathit{i}\right)$ contains the cluster to which the $\mathit{i}$th object has been allocated, for $\mathit{i}=1,2,\dots ,n$.
13: $\mathbf{nic}\left({\mathbf{k}}\right)$Integer array Output
On exit: ${\mathbf{nic}}\left(\mathit{i}\right)$ contains the number of objects in the $\mathit{i}$th cluster, for $\mathit{i}=1,2,\dots ,K$.
14: $\mathbf{css}\left({\mathbf{k}}\right)$Real (Kind=nag_wp) array Output
On exit: ${\mathbf{css}}\left(\mathit{i}\right)$ contains the within-cluster (weighted) sum of squares of the $\mathit{i}$th cluster, for $\mathit{i}=1,2,\dots ,K$.
15: $\mathbf{csw}\left({\mathbf{k}}\right)$Real (Kind=nag_wp) array Output
On exit: ${\mathbf{csw}}\left(\mathit{i}\right)$ contains the within-cluster sum of weights of the $\mathit{i}$th cluster, for $\mathit{i}=1,2,\dots ,K$. If ${\mathbf{weight}}=\text{'U'}$, the sum of weights is the number of objects in the cluster.
16: $\mathbf{maxit}$Integer Input
On entry: the maximum number of iterations allowed in the analysis.
Suggested value: ${\mathbf{maxit}}=10$.
Constraint: ${\mathbf{maxit}}>0$.
17: $\mathbf{iwk}\left({\mathbf{n}}+3×{\mathbf{k}}\right)$Integer array Workspace
18: $\mathbf{wk}\left({\mathbf{n}}+2×{\mathbf{k}}\right)$Real (Kind=nag_wp) array Workspace
19: $\mathbf{ifail}$Integer Input/Output
On entry: ifail must be set to $0$, $-1$ or $1$ to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of $0$ causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of $-1$ means that an error message is printed while a value of $1$ means that it is not.
If halting is not appropriate, the value $-1$ or $1$ is recommended. If message printing is undesirable, then the value $1$ is recommended. Otherwise, the value $0$ is recommended. When the value $-\mathbf{1}$ or $\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit: ${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see Section 6).

## 6Error Indicators and Warnings

If on entry ${\mathbf{ifail}}=0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
${\mathbf{ifail}}=1$
On entry, ${\mathbf{k}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{k}}\ge 2$.
On entry, ${\mathbf{ldc}}=⟨\mathit{\text{value}}⟩$ and ${\mathbf{k}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{ldc}}\ge {\mathbf{k}}$.
On entry, ${\mathbf{ldx}}=⟨\mathit{\text{value}}⟩$ and ${\mathbf{n}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{ldx}}\ge {\mathbf{n}}$.
On entry, ${\mathbf{m}}=⟨\mathit{\text{value}}⟩$ and ${\mathbf{nvar}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{m}}\ge {\mathbf{nvar}}$.
On entry, ${\mathbf{maxit}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{maxit}}>0$.
On entry, ${\mathbf{n}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{n}}>1$.
On entry, ${\mathbf{nvar}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{nvar}}\ge 1$.
On entry, ${\mathbf{weight}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{weight}}=\text{'U'}$ or $\text{'W'}$.
${\mathbf{ifail}}=2$
On entry, $i=⟨\mathit{\text{value}}⟩$ and ${\mathbf{wt}}\left(i\right)<0.0$.
Constraint: ${\mathbf{wt}}\left(i\right)\ge 0.0$.
On entry, wt has less than two positive values.
${\mathbf{ifail}}=3$
On entry, ${\mathbf{nvar}}=⟨\mathit{\text{value}}⟩$ and $⟨\mathit{\text{value}}⟩$ values of ${\mathbf{isx}}>0$.
Constraint: exactly nvar elements of ${\mathbf{isx}}>0$.
${\mathbf{ifail}}=4$
At least one cluster is empty after the initial assignment. Try a different set of initial cluster centres in cmeans and also consider decreasing the value of k. The empty clusters may be found by examining the values in nic.
${\mathbf{ifail}}=5$
Convergence has not been achieved within the maximum number of iterations ${\mathbf{maxit}}=⟨\mathit{\text{value}}⟩$. Try increasing maxit and, if possible, use the returned values in cmeans as the initial cluster centres.
${\mathbf{ifail}}=-99$
See Section 7 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.

## 7Accuracy

g03eff produces clusters that are locally optimal; the within-cluster sum of squares may not be decreased by transferring a point from one cluster to another, but different partitions may have the same or smaller within-cluster sum of squares.

## 8Parallelism and Performance

g03eff is not threaded in any implementation.

The time per iteration is approximately proportional to $\mathit{np}K$.

## 10Example

The data consists of observations of five variables on twenty soils (see Hartigan and Wong (1979)). The data is read in, the $K$-means clustering performed and the results printed.

### 10.1Program Text

Program Text (g03effe.f90)

### 10.2Program Data

Program Data (g03effe.d)

### 10.3Program Results

Program Results (g03effe.r)