hide long namesshow long names
hide short namesshow short names
Integer type:  int32  int64  nag_int  show int32  show int32  show int64  show int64  show nag_int  show nag_int

PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

NAG Toolbox: nag_mv_cluster_kmeans (g03ef)

Purpose

nag_mv_cluster_kmeans (g03ef) performs KK-means cluster analysis.

Syntax

[cmeans, inc, nic, css, csw, ifail] = g03ef(x, isx, cmeans, 'n', n, 'm', m, 'nvar', nvar, 'k', k, 'wt', wt, 'maxit', maxit)
[cmeans, inc, nic, css, csw, ifail] = nag_mv_cluster_kmeans(x, isx, cmeans, 'n', n, 'm', m, 'nvar', nvar, 'k', k, 'wt', wt, 'maxit', maxit)
Note: the interface to this routine has changed since earlier releases of the toolbox:
Mark 22: n, k have been made optional
Mark 24: drop weight, wt optional
.

Description

Given nn objects with pp variables measured on each object, xijxij, for i = 1,2,,ni=1,2,,n and j = 1,2,,pj=1,2,,p, nag_mv_cluster_kmeans (g03ef) allocates each object to one of KK groups or clusters to minimize the within-cluster sum of squares:
K p
iSk(xijxkj)2,
k = 1 j = 1
k=1KiSkj=1p (xij-x-kj) 2,
where SkSk is the set of objects in the kkth cluster and xkjx-kj is the mean for the variable jj over cluster kk. This is often known as KK-means clustering.
In addition to the data matrix, a KK by pp matrix giving the initial cluster centres for the KK clusters is required. The objects are then initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure is to iteratively search for the KK-partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.
Optionally, weights for each object, wiwi, can be used so that the clustering is based on within-cluster weighted sums of squares:
K p
iSkwi(xijkj)2,
k = 1 j = 1
k=1KiSkj=1pwi (xij-x~kj) 2,
where kjx~kj is the weighted mean for variable jj over cluster kk.
The function is based on the algorithm of Hartigan and Wong (1979).

References

Everitt B S (1974) Cluster Analysis Heinemann
Hartigan J A and Wong M A (1979) Algorithm AS 136: A K-means clustering algorithm Appl. Statist. 28 100–108
Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press

Parameters

Compulsory Input Parameters

1:     x(ldx,m) – double array
ldx, the first dimension of the array, must satisfy the constraint ldxnldxn.
x(i,j)xij must contain the value of the jjth variable for the iith object, for i = 1,2,,ni=1,2,,n and j = 1,2,,mj=1,2,,m.
2:     isx(m) – int64int32nag_int array
m, the dimension of the array, must satisfy the constraint mnvarmnvar.
isx(j)isxj indicates whether or not the jjth variable is to be included in the analysis. If isx(j) > 0isxj>0, the variable contained in the jjth column of x is included, for j = 1,2,,mj=1,2,,m.
Constraint: isx(j) > 0isxj>0 for nvar values of jj.
3:     cmeans(ldc,nvar) – double array
ldc, the first dimension of the array, must satisfy the constraint ldckldck.
cmeans(i,j)cmeansij must contain the value of the jjth variable for the iith initial cluster centre, for i = 1,2,,Ki=1,2,,K and j = 1,2,,pj=1,2,,p.

Optional Input Parameters

1:     n – int64int32nag_int scalar
Default: The first dimension of the array x.
nn, the number of objects.
Constraint: n > 1n>1.
2:     m – int64int32nag_int scalar
Default: The dimension of the array isx and the second dimension of the array x. (An error is raised if these dimensions are not equal.)
The total number of variables in array x.
Constraint: mnvarmnvar.
3:     nvar – int64int32nag_int scalar
Default: The second dimension of the array cmeans.
pp, the number of variables included in the sums of squares calculations.
Constraint: 1nvarm1nvarm.
4:     k – int64int32nag_int scalar
Default: The first dimension of the array cmeans.
KK, the number of clusters.
Constraint: k2k2.
5:     wt( : :) – double array
Note: the dimension of the array wt must be at least nn if weight = 'W'weight='W', and at least 11 otherwise.
If weight = 'W'weight='W', the first nn elements of wt must contain the weights to be used.
If wt(i) = 0.0wti=0.0, the iith observation is not included in the analysis. The effective number of observation is the sum of the weights.
If weight = 'U'weight='U', wt is not referenced and the effective number of observations is nn.
Constraint: if weight = 'W'weight='W', wt(i)0.0wti0.0 and wt(i) > 0.0wti>0.0 for at least two values of ii, for i = 1,2,,ni=1,2,,n.
6:     maxit – int64int32nag_int scalar
The maximum number of iterations allowed in the analysis.
Default: 1010
Constraint: maxit > 0maxit>0.

Input Parameters Omitted from the MATLAB Interface

weight ldx ldc iwk wk

Output Parameters

1:     cmeans(ldc,nvar) – double array
ldckldck.
cmeans(i,j)cmeansij contains the value of the jjth variable for the iith computed cluster centre, for i = 1,2,,Ki=1,2,,K and j = 1,2,,pj=1,2,,p.
2:     inc(n) – int64int32nag_int array
inc(i)inci contains the cluster to which the iith object has been allocated, for i = 1,2,,ni=1,2,,n.
3:     nic(k) – int64int32nag_int array
nic(i)nici contains the number of objects in the iith cluster, for i = 1,2,,Ki=1,2,,K.
4:     css(k) – double array
css(i)cssi contains the within-cluster (weighted) sum of squares of the iith cluster, for i = 1,2,,Ki=1,2,,K.
5:     csw(k) – double array
csw(i)cswi contains the within-cluster sum of weights of the iith cluster, for i = 1,2,,Ki=1,2,,K. If weight = 'U'weight='U', the sum of weights is the number of objects in the cluster.
6:     ifail – int64int32nag_int scalar
ifail = 0ifail=0 unless the function detects an error (see [Error Indicators and Warnings]).

Error Indicators and Warnings

Errors or warnings detected by the function:
  ifail = 1ifail=1
On entry,weight'W'weight'W' or 'U''U',
orn < 2n<2,
ornvar < 1nvar<1,
orm < nvarm<nvar,
ork < 2k<2,
orldx < nldx<n,
orldc < kldc<k,
ormaxit0maxit0.
  ifail = 2ifail=2
On entry,weight = 'W'weight='W' and a value of wt(i) < 0.0wti<0.0 for some ii,
orweight = 'W'weight='W' and wt(i) = 0.0wti=0.0 for all or all but one values of ii.
  ifail = 3ifail=3
On entry,the number of positive values in isx does not equal nvar.
  ifail = 4ifail=4
On entry, at least one cluster is empty after the initial assignment. Try a different set of initial cluster centres in cmeans and also consider decreasing the value of k. The empty clusters may be found by examining the values in nic.
  ifail = 5ifail=5
Convergence has not been achieved within the maximum number of iterations given by maxit. Try increasing maxit and, if possible, use the returned values in cmeans as the initial cluster centres.

Accuracy

nag_mv_cluster_kmeans (g03ef) produces clusters that are locally optimal; the within-cluster sum of squares may not be decreased by transferring a point from one cluster to another, but different partitions may have the same or smaller within-cluster sum of squares.

Further Comments

The time per iteration is approximately proportional to npKnpK.

Example

function nag_mv_cluster_kmeans_example
x = [77.3, 13, 9.7, 1.5, 6.4;
     82.5, 10, 7.5, 1.5, 6.5;
     66.9, 20.6, 12.5, 2.3, 7;
     47.2, 33.8, 19, 2.8, 5.8;
     65.3, 20.5, 14.2, 1.9, 6.9;
     83.3, 10, 6.7, 2.2, 7;
     81.6, 12.7, 5.7, 2.9, 6.7;
     47.8, 36.5, 15.7, 2.3, 7.2;
     48.6, 37.1, 14.3, 2.1, 7.2;
     61.6, 25.5, 12.9, 1.9, 7.3;
     58.6, 26.5, 14.9, 2.4, 6.7;
     69.3, 22.3, 8.4, 4, 7;
     61.8, 30.8, 7.4, 2.7, 6.4;
     67.7, 25.3, 7, 4.8, 7.3;
     57.2, 31.2, 11.6, 2.4, 6.5;
     67.2, 22.7, 10.1, 3.3, 6.2;
     59.2, 31.2, 9.6, 2.4, 6;
     80.2, 13.2, 6.6, 2, 5.8;
     82.2, 11.1, 6.7, 2.2, 7.2;
     69.7, 20.7, 9.6, 3.1, 5.9];
isx = [int64(1);1;1;1;1];
cmeans = [82.5, 10, 7.5, 1.5, 6.5;
     47.8, 36.5, 15.7, 2.3, 7.2;
     67.2, 22.7, 10.1, 3.3, 6.2];
[cmeansOut, inc, nic, css, csw, ifail] = nag_mv_cluster_kmeans(x, isx, cmeans)
 

cmeansOut =

   81.1833   11.6667    7.1500    2.0500    6.6000
   47.8667   35.8000   16.3333    2.4000    6.7333
   64.0455   25.2091   10.7455    2.8364    6.6545


inc =

                    1
                    1
                    3
                    2
                    3
                    1
                    1
                    2
                    2
                    3
                    3
                    3
                    3
                    3
                    3
                    3
                    3
                    1
                    1
                    3


nic =

                    6
                    3
                   11


css =

   46.5717
   20.3800
  468.8964


csw =

     6
     3
    11


ifail =

                    0


function g03ef_example
x = [77.3, 13, 9.7, 1.5, 6.4;
     82.5, 10, 7.5, 1.5, 6.5;
     66.9, 20.6, 12.5, 2.3, 7;
     47.2, 33.8, 19, 2.8, 5.8;
     65.3, 20.5, 14.2, 1.9, 6.9;
     83.3, 10, 6.7, 2.2, 7;
     81.6, 12.7, 5.7, 2.9, 6.7;
     47.8, 36.5, 15.7, 2.3, 7.2;
     48.6, 37.1, 14.3, 2.1, 7.2;
     61.6, 25.5, 12.9, 1.9, 7.3;
     58.6, 26.5, 14.9, 2.4, 6.7;
     69.3, 22.3, 8.4, 4, 7;
     61.8, 30.8, 7.4, 2.7, 6.4;
     67.7, 25.3, 7, 4.8, 7.3;
     57.2, 31.2, 11.6, 2.4, 6.5;
     67.2, 22.7, 10.1, 3.3, 6.2;
     59.2, 31.2, 9.6, 2.4, 6;
     80.2, 13.2, 6.6, 2, 5.8;
     82.2, 11.1, 6.7, 2.2, 7.2;
     69.7, 20.7, 9.6, 3.1, 5.9];
isx = [int64(1);1;1;1;1];
cmeans = [82.5, 10, 7.5, 1.5, 6.5;
     47.8, 36.5, 15.7, 2.3, 7.2;
     67.2, 22.7, 10.1, 3.3, 6.2];
[cmeansOut, inc, nic, css, csw, ifail] = g03ef(x, isx, cmeans)
 

cmeansOut =

   81.1833   11.6667    7.1500    2.0500    6.6000
   47.8667   35.8000   16.3333    2.4000    6.7333
   64.0455   25.2091   10.7455    2.8364    6.6545


inc =

                    1
                    1
                    3
                    2
                    3
                    1
                    1
                    2
                    2
                    3
                    3
                    3
                    3
                    3
                    3
                    3
                    3
                    1
                    1
                    3


nic =

                    6
                    3
                   11


css =

   46.5717
   20.3800
  468.8964


csw =

     6
     3
    11


ifail =

                    0



PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2013