NAG Library Routine Document

g03dcf (discrim_group)

1
Purpose

g03dcf allocates observations to groups according to selected rules. It is intended for use after g03daf.

2
Specification

Fortran Interface
Subroutine g03dcf ( typ, equal, priors, nvar, ng, nig, gmn, ldgmn, gc, det, nobs, m, isx, x, ldx, prior, p, ldp, iag, atiq, ati, wk, ifail)
Integer, Intent (In):: nvar, ng, nig(ng), ldgmn, nobs, m, isx(m), ldx, ldp
Integer, Intent (Inout):: ifail
Integer, Intent (Out):: iag(nobs)
Real (Kind=nag_wp), Intent (In):: gmn(ldgmn,nvar), gc((ng+1)*nvar*(nvar+1)/2), det(ng), x(ldx,m)
Real (Kind=nag_wp), Intent (Inout):: prior(ng), p(ldp,ng), ati(ldp,*)
Real (Kind=nag_wp), Intent (Out):: wk(2*nvar)
Logical, Intent (In):: atiq
Character (1), Intent (In):: typ, equal, priors
C Header Interface
#include <nagmk26.h>
void  g03dcf_ (const char *typ, const char *equal, const char *priors, const Integer *nvar, const Integer *ng, const Integer nig[], const double gmn[], const Integer *ldgmn, const double gc[], const double det[], const Integer *nobs, const Integer *m, const Integer isx[], const double x[], const Integer *ldx, double prior[], double p[], const Integer *ldp, Integer iag[], const logical *atiq, double ati[], double wk[], Integer *ifail, const Charlen length_typ, const Charlen length_equal, const Charlen length_priors)

3
Description

Discriminant analysis is concerned with the allocation of observations to groups using information from other observations whose group membership is known, $X_t$; these are called the training set. Consider $p$ variables observed on $n_g$ populations or groups. Let $\bar{x}_j$ be the sample mean and $S_j$ the within-group variance-covariance matrix for the $j$th group; these are calculated from a training set of $n$ observations with $n_j$ observations in the $j$th group. Let $x_k$ be the $k$th observation from the set of observations to be allocated to the $n_g$ groups. The observation can be allocated to a group according to a selected rule. The allocation rule or discriminant function will be based on the distance of the observation from an estimate of the location of the groups, usually the group means. A measure of the distance of the observation from the $j$th group mean is given by the Mahalanobis distance, $D_{kj}$:
$$D_{kj}^2 = (x_k - \bar{x}_j)^T S_j^{-1} (x_k - \bar{x}_j). \tag{1}$$
If the pooled estimate of the variance-covariance matrix $S$ is used rather than the within-group variance-covariance matrices, then the distance is:
$$D_{kj}^2 = (x_k - \bar{x}_j)^T S^{-1} (x_k - \bar{x}_j). \tag{2}$$
Instead of using the variance-covariance matrices $S$ and $S_j$, g03dcf uses the upper triangular matrices $R$ and $R_j$ supplied by g03daf such that $S = R^T R$ and $S_j = R_j^T R_j$. $D_{kj}^2$ can then be calculated as $z^T z$, where $R_j^T z = (x_k - \bar{x}_j)$ or $R^T z = (x_k - \bar{x}_j)$ as appropriate.
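As an illustration of this calculation, the following Python sketch solves the lower triangular system by forward substitution and sums the squares of the solution. It is not the routine's implementation: the function name `mahalanobis_sq` is hypothetical, the matrix is held as a dense list of rows rather than in the packed form used by gc, and the numerical values are invented for the example.

```python
def mahalanobis_sq(r_upper, x, mean):
    """Return D^2 = z^T z, where z solves R^T z = x - mean.

    r_upper is a dense p-by-p upper triangular matrix (list of rows) with
    S = R^T R. R^T is lower triangular, so forward substitution suffices.
    """
    p = len(x)
    d = [x[i] - mean[i] for i in range(p)]
    z = [0.0] * p
    for i in range(p):
        # Row i of R^T is column i of R: (R^T)[i][k] = R[k][i].
        s = d[i] - sum(r_upper[k][i] * z[k] for k in range(i))
        z[i] = s / r_upper[i][i]
    return sum(v * v for v in z)


# Hypothetical 2-variable example: S = R^T R = [[4, 2], [2, 5]].
R = [[2.0, 1.0],
     [0.0, 2.0]]
d2 = mahalanobis_sq(R, x=[3.0, 1.0], mean=[1.0, 0.0])
```

Working with the triangular factor avoids forming or inverting the variance-covariance matrix explicitly; the division by each diagonal element is why those elements must be nonzero.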
In addition to the distances, a set of prior probabilities of group membership, $\pi_j$, for $j = 1, 2, \ldots, n_g$, may be used, with $\sum \pi_j = 1$. The prior probabilities reflect your view as to the likelihood of the observations coming from the different groups. Two common cases for prior probabilities are $\pi_1 = \pi_2 = \cdots = \pi_{n_g}$, that is, equal prior probabilities, and $\pi_j = n_j / n$, for $j = 1, 2, \ldots, n_g$, that is, prior probabilities proportional to the number of observations in the groups in the training set.
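The two common choices of prior probabilities can be sketched in a few lines of Python. The helper name `group_priors` and the group sizes are assumptions made for this illustration only; the mode letters follow the values accepted by the argument priors.

```python
def group_priors(nig, mode):
    """Return prior probabilities for mode 'E' (equal) or 'P' (proportional).

    nig holds the training-set group sizes n_j.
    """
    ng = len(nig)
    if mode == "E":
        return [1.0 / ng] * ng
    if mode == "P":
        n = sum(nig)
        return [nj / n for nj in nig]
    raise ValueError("mode must be 'E' or 'P'")


# Hypothetical group sizes for three groups.
equal_priors = group_priors([6, 10, 5], "E")
prop_priors = group_priors([6, 10, 5], "P")
```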
g03dcf uses one of four allocation rules. In all four rules the $p$ variables are assumed to follow a multivariate Normal distribution with mean $\mu_j$ and variance-covariance matrix $\Sigma_j$ if the observation comes from the $j$th group. The different rules depend on whether or not the within-group variance-covariance matrices are assumed equal, i.e., $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_{n_g}$, and whether a predictive or estimative approach is used. If $p(x_k \mid \mu_j, \Sigma_j)$ is the probability of observing the observation $x_k$ from group $j$, then the posterior probability of belonging to group $j$ is:
$$p(j \mid x_k, \mu_j, \Sigma_j) \propto p(x_k \mid \mu_j, \Sigma_j)\, \pi_j. \tag{3}$$
In the estimative approach, the parameters $\mu_j$ and $\Sigma_j$ in (3) are replaced by their estimates calculated from $X_t$. In the predictive approach, a non-informative prior distribution is used for the parameters and a posterior distribution for the parameters, $p(\mu_j, \Sigma_j \mid X_t)$, is found. A predictive distribution is then obtained by integrating $p(x_k \mid \mu_j, \Sigma_j)\, p(\mu_j, \Sigma_j \mid X_t)$ over the parameter space. This predictive distribution then replaces $p(x_k \mid \mu_j, \Sigma_j)$ in (3). See Aitchison and Dunsmore (1975), Aitchison et al. (1977) and Moran and Murphy (1979) for further details.
The observation is allocated to the group with the highest posterior probability. Denoting the posterior probabilities, $p(j \mid x_k, \mu_j, \Sigma_j)$, by $q_j$, the four allocation rules are:
(i) Estimative with equal variance-covariance matrices – Linear Discrimination
$$\log q_j \propto -\tfrac{1}{2} D_{kj}^2 + \log \pi_j$$
(ii) Estimative with unequal variance-covariance matrices – Quadratic Discrimination
$$\log q_j \propto -\tfrac{1}{2} D_{kj}^2 + \log \pi_j - \tfrac{1}{2} \log |S_j|$$
(iii) Predictive with equal variance-covariance matrices
$$q_j^{-1} \propto \left( (n_j + 1)/n_j \right)^{p/2} \left\{ 1 + \frac{n_j}{(n - n_g)(n_j + 1)} D_{kj}^2 \right\}^{(n + 1 - n_g)/2}$$
(iv) Predictive with unequal variance-covariance matrices
$$q_j^{-1} \propto C \left\{ \left( (n_j^2 - 1)/n_j \right) |S_j| \right\}^{p/2} \left\{ 1 + \frac{n_j}{n_j^2 - 1} D_{kj}^2 \right\}^{n_j/2},$$
where
$$C = \frac{\Gamma\left( \tfrac{1}{2}(n_j - p) \right)}{\Gamma\left( \tfrac{1}{2} n_j \right)}.$$
In the above the appropriate value of $D_{kj}^2$ from (1) or (2) is used. The values of the $q_j$ are standardized so that
$$\sum_{j=1}^{n_g} q_j = 1.$$
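The two estimative rules, (i) and (ii), together with the final standardization, can be sketched in Python as follows. This is an illustrative sketch rather than the routine's code: the function name `estimative_posteriors`, the distances and the priors are hypothetical, and the subtraction of the largest score before exponentiating is a common numerical safeguard chosen here, not something the document states g03dcf does. The predictive rules (iii) and (iv) are not covered.

```python
import math


def estimative_posteriors(d2, priors, log_det=None):
    """Posterior probabilities q_j under rule (i), or rule (ii) if log_det is given.

    d2[j]      : squared Mahalanobis distance of the observation from group j
    priors[j]  : prior probability pi_j
    log_det[j] : log|S_j| (used for rule (ii) only)
    """
    scores = []
    for j, (d, pi) in enumerate(zip(d2, priors)):
        s = -0.5 * d + math.log(pi)
        if log_det is not None:
            s -= 0.5 * log_det[j]
        scores.append(s)
    # Subtract the maximum before exponentiating to avoid underflow,
    # then standardize so that the q_j sum to 1.
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    return [v / total for v in w]


# Hypothetical distances to three groups with equal priors, rule (i).
q = estimative_posteriors([1.0, 4.0, 9.0], [1 / 3, 1 / 3, 1 / 3])
allocated = max(range(len(q)), key=q.__getitem__) + 1  # 1-based group number
```

With equal priors and a common variance-covariance matrix, the allocated group is simply the one with the smallest distance.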
Moran and Murphy (1979) show the similarity between the predictive methods and methods based upon likelihood ratio tests.
In addition to allocating the observation to a group, g03dcf computes an atypicality index, $I_j(x_k)$. The predictive atypicality index is returned, irrespective of the value of the argument typ. This represents the probability of obtaining an observation more typical of group $j$ than the observed $x_k$ (see Aitchison and Dunsmore (1975) and Aitchison et al. (1977)). The atypicality index is computed for unequal within-group variance-covariance matrices as:
$$I_j(x_k) = P\left( B \le z : \tfrac{1}{2} p, \tfrac{1}{2}(n_j - p) \right)$$
where $P(B \le \beta : a, b)$ is the lower tail probability from a beta distribution and
$$z = D_{kj}^2 / \left( D_{kj}^2 + (n_j^2 - 1)/n_j \right),$$
and for equal within-group variance-covariance matrices as:
$$I_j(x_k) = P\left( B \le z : \tfrac{1}{2} p, \tfrac{1}{2}(n - n_g - p + 1) \right),$$
with
$$z = D_{kj}^2 / \left( D_{kj}^2 + (n - n_g)(n_j + 1)/n_j \right).$$
If $I_j(x_k)$ is close to 1 for all groups it indicates that the observation may come from a grouping not represented in the training set. Moran and Murphy (1979) provide a frequentist interpretation of $I_j(x_k)$.
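For the special case of two variables the atypicality index has a simple closed form, because the first beta parameter is then 1 and the lower tail probability with parameters 1 and b reduces to 1 - (1 - z)^b. The Python sketch below covers only that special case; the function name `atypicality_p2` and the numerical inputs are hypothetical, and a general implementation would need a regularized incomplete beta function instead.

```python
def atypicality_p2(d2, nj, n=None, ng=None):
    """Predictive atypicality index for p = 2 variables only.

    For p = 2 the first beta parameter is a = p/2 = 1, so the lower tail
    probability is P(B <= z : 1, b) = 1 - (1 - z)**b.
    Pass n and ng for the equal-covariance form; omit both for the unequal form.
    """
    p = 2
    if n is None:
        # Unequal within-group variance-covariance matrices.
        b = 0.5 * (nj - p)
        z = d2 / (d2 + (nj * nj - 1) / nj)
    else:
        # Equal (pooled) variance-covariance matrix.
        b = 0.5 * (n - ng - p + 1)
        z = d2 / (d2 + (n - ng) * (nj + 1) / nj)
    return 1.0 - (1.0 - z) ** b


# Hypothetical inputs: group of 6 observations, squared distance 35/6.
index_unequal = atypicality_p2(35 / 6, nj=6)
index_equal = atypicality_p2(21.0, nj=6, n=21, ng=3)
```

An observation lying exactly at the group mean has squared distance 0 and therefore index 0; the index approaches 1 as the distance grows.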

4
References

Aitchison J and Dunsmore I R (1975) Statistical Prediction Analysis Cambridge
Aitchison J, Habbema J D F and Kay J W (1977) A critical comparison of two methods of statistical discrimination Appl. Statist. 26 15–25
Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press
Moran M A and Murphy B J (1979) A closer look at two alternative methods of statistical discrimination Appl. Statist. 28 223–232
Morrison D F (1967) Multivariate Statistical Methods McGraw–Hill

5
Arguments

1:     typ – Character(1)    Input
On entry: whether the estimative or predictive approach is used.
typ='E'
The estimative approach is used.
typ='P'
The predictive approach is used.
Constraint: typ='E' or 'P'.
2:     equal – Character(1)    Input
On entry: indicates whether or not the within-group variance-covariance matrices are assumed to be equal and the pooled variance-covariance matrix used.
equal='E'
The within-group variance-covariance matrices are assumed equal and the matrix $R$ stored in the first $p(p+1)/2$ elements of gc is used.
equal='U'
The within-group variance-covariance matrices are assumed to be unequal and the matrices $R_i$, for $i = 1, 2, \ldots, n_g$, stored in the remainder of gc are used.
Constraint: equal='E' or 'U'.
3:     priors – Character(1)    Input
On entry: indicates the form of the prior probabilities to be used.
priors='E'
Equal prior probabilities are used.
priors='P'
Prior probabilities proportional to the group sizes in the training set, $n_j$, are used.
priors='I'
The prior probabilities are input in prior.
Constraint: priors='E', 'I' or 'P'.
4:     nvar – Integer    Input
On entry: $p$, the number of variables in the variance-covariance matrices.
Constraint: nvar ≥ 1.
5:     ng – Integer    Input
On entry: the number of groups, $n_g$.
Constraint: ng ≥ 2.
6:     nig(ng) – Integer array    Input
On entry: the number of observations in each group in the training set, $n_j$.
Constraints:
  • if equal='E', nig(j) > 0, for $j = 1, 2, \ldots, n_g$, and $\sum_{j=1}^{n_g}$ nig(j) > ng + nvar;
  • if equal='U', nig(j) > nvar, for $j = 1, 2, \ldots, n_g$.
7:     gmn(ldgmn,nvar) – Real (Kind=nag_wp) array    Input
On entry: the $j$th row of gmn contains the means of the $p$ variables for the $j$th group, for $j = 1, 2, \ldots, n_g$. These are returned by g03daf.
8:     ldgmn – Integer    Input
On entry: the first dimension of the array gmn as declared in the (sub)program from which g03dcf is called.
Constraint: ldgmn ≥ ng.
9:     gc((ng+1)×nvar×(nvar+1)/2) – Real (Kind=nag_wp) array    Input
On entry: the first $p(p+1)/2$ elements of gc should contain the upper triangular matrix $R$ and the next $n_g$ blocks of $p(p+1)/2$ elements should contain the upper triangular matrices $R_j$.
All matrices must be stored packed by column. These matrices are returned by g03daf. If equal='E', only the first $p(p+1)/2$ elements are referenced; if equal='U', only the elements $p(p+1)/2 + 1$ to $(n_g + 1)p(p+1)/2$ are referenced.
Constraints:
  • if equal='E', the diagonal elements of $R$ must be ≠ 0.0;
  • if equal='U', the diagonal elements of the $R_j$ must be ≠ 0.0, for $j = 1, 2, \ldots, n_g$.
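The layout of gc can be illustrated with a short Python sketch that expands one packed block into a dense upper triangle. The helper names `packed_index` and `unpack_block` and the array contents are hypothetical; the sketch assumes the usual packed-by-column convention for an upper triangle, in which element (i, j) with i ≤ j occupies position i + j(j-1)/2 (1-based) within its block.

```python
def packed_index(i, j):
    """0-based offset within a block of element (i, j), where i and j are
    1-based and i <= j, for an upper triangle packed by column."""
    return i - 1 + j * (j - 1) // 2


def unpack_block(gc, p, block):
    """Return block number `block` of gc as a dense p-by-p upper triangle.

    block = 0 selects the pooled R; block = j (1..ng) selects R_j.
    """
    size = p * (p + 1) // 2
    start = block * size
    dense = [[0.0] * p for _ in range(p)]
    for j in range(1, p + 1):
        for i in range(1, j + 1):
            dense[i - 1][j - 1] = gc[start + packed_index(i, j)]
    return dense


# Hypothetical gc for p = 2 and ng = 2: blocks R, R_1, R_2 of 3 elements each.
gc = [2.0, 1.0, 2.0,
      3.0, 4.0, 5.0,
      6.0, 7.0, 8.0]
R = unpack_block(gc, 2, 0)
R2 = unpack_block(gc, 2, 2)
```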
10:   det(ng) – Real (Kind=nag_wp) array    Input
On entry: if equal='U', the logarithms of the determinants of the within-group variance-covariance matrices as returned by g03daf. Otherwise det is not referenced.
11:   nobs – Integer    Input
On entry: the number of observations in x which are to be allocated.
Constraint: nobs ≥ 1.
12:   m – Integer    Input
On entry: the number of variables in the data array x.
Constraint: m ≥ nvar.
13:   isx(m) – Integer array    Input
On entry: isx(l) indicates if the $l$th variable in x is to be included in the distance calculations.
If isx(l) > 0, the $l$th variable is included, for $l = 1, 2, \ldots, m$; otherwise the $l$th variable is not referenced.
Constraint: isx(l) > 0 for nvar values of $l$.
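The selection performed by isx amounts to keeping the flagged columns of each observation and checking that exactly nvar of them are flagged. The following Python sketch shows the idea; the function name `select_variables` and the data are hypothetical and are not part of the library.

```python
def select_variables(row, isx, nvar):
    """Return the entries of `row` whose isx flag is positive.

    Raises ValueError if the number of selected variables is not nvar,
    mirroring the constraint on isx.
    """
    if len(row) != len(isx):
        raise ValueError("row and isx must both have length m")
    selected = [value for value, flag in zip(row, isx) if flag > 0]
    if len(selected) != nvar:
        raise ValueError("isx must be positive for exactly nvar variables")
    return selected


# Hypothetical observation with m = 3 variables, of which 2 are used.
used = select_variables([1.0, 2.0, 3.0], isx=[1, 0, 1], nvar=2)
```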
14:   x(ldx,m) – Real (Kind=nag_wp) array    Input
On entry: x(k,l) must contain the $k$th observation for the $l$th variable, for $k = 1, 2, \ldots, \mathrm{nobs}$ and $l = 1, 2, \ldots, m$.
15:   ldx – Integer    Input
On entry: the first dimension of the array x as declared in the (sub)program from which g03dcf is called.
Constraint: ldx ≥ nobs.
16:   prior(ng) – Real (Kind=nag_wp) array    Input/Output
On entry: if priors='I', the prior probabilities for the $n_g$ groups.
Constraint: if priors='I', prior(j) > 0.0, for $j = 1, 2, \ldots, n_g$, and $\left| 1 - \sum_{j=1}^{n_g} \mathrm{prior}(j) \right| \le 10 \times$ machine precision.
On exit: if priors='P', the computed prior probabilities in proportion to group sizes for the $n_g$ groups.
If priors='I', the input prior probabilities will be unchanged.
If priors='E', prior is not set.
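A check of user-supplied priors along the lines of the constraint above might look like the following Python sketch. The function name `check_input_priors` is hypothetical, and `sys.float_info.epsilon` is used only as a stand-in for machine precision; the library's own definition of machine precision may differ from it by a constant factor.

```python
import sys


def check_input_priors(prior):
    """Return True if every prior is positive and the priors sum to 1
    within 10 times a machine-precision stand-in."""
    eps = sys.float_info.epsilon  # assumption: stand-in for machine precision
    if any(pj <= 0.0 for pj in prior):
        return False
    return abs(1.0 - sum(prior)) <= 10.0 * eps


ok = check_input_priors([0.25, 0.25, 0.5])
bad_sum = check_input_priors([0.5, 0.6])
bad_sign = check_input_priors([1.2, -0.2])
```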
17:   p(ldp,ng) – Real (Kind=nag_wp) array    Output
On exit: p(k,j) contains the posterior probability $p_{kj}$ for allocating the $k$th observation to the $j$th group, for $k = 1, 2, \ldots, \mathrm{nobs}$ and $j = 1, 2, \ldots, n_g$.
18:   ldp – Integer    Input
On entry: the first dimension of the array ati and the first dimension of the array p as declared in the (sub)program from which g03dcf is called.
Constraint: ldp ≥ nobs.
19:   iag(nobs) – Integer array    Output
On exit: the groups to which the observations have been allocated.
20:   atiq – Logical    Input
On entry: atiq must be .TRUE. if atypicality indices are required. If atiq is .FALSE. the array ati is not set.
21:   ati(ldp,*) – Real (Kind=nag_wp) array    Output
Note: the second dimension of the array ati must be at least ng if atiq=.TRUE., and at least 1 otherwise.
On exit: if atiq is .TRUE., ati(k,j) will contain the predictive atypicality index for the $k$th observation with respect to the $j$th group, for $k = 1, 2, \ldots, \mathrm{nobs}$ and $j = 1, 2, \ldots, n_g$.
If atiq is .FALSE., ati is not set.
22:   wk(2×nvar) – Real (Kind=nag_wp) array    Workspace
23:   ifail – Integer    Input/Output
On entry: ifail must be set to 0, -1 or 1. If you are unfamiliar with this argument you should refer to Section 3.4 in How to Use the NAG Library and its Documentation for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value -1 or 1 is recommended. If the output of error messages is undesirable, then the value 1 is recommended. Otherwise, if you are not familiar with this argument, the recommended value is 0. When the value -1 or 1 is used it is essential to test the value of ifail on exit.
On exit: ifail=0 unless the routine detects an error or a warning has been flagged (see Section 6).

6
Error Indicators and Warnings

If on entry ifail=0 or -1, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
ifail=1
On entry, nvar < 1,
or ng < 2,
or nobs < 1,
or m < nvar,
or ldgmn < ng,
or ldx < nobs,
or ldp < nobs,
or typ ≠ 'E' or 'P',
or equal ≠ 'E' or 'U',
or priors ≠ 'E', 'I' or 'P'.
ifail=2
On entry, the number of variables indicated by isx is not equal to nvar,
or equal='E' and nig(j) ≤ 0, for some $j$,
or equal='E' and $\sum_{j=1}^{n_g}$ nig(j) ≤ ng + nvar,
or equal='U' and nig(j) ≤ nvar for some $j$.
ifail=3
On entry, priors='I' and prior(j) ≤ 0.0 for some $j$,
or priors='I' and $\sum_{j=1}^{n_g}$ prior(j) is not within 10 × machine precision of 1.
ifail=4
On entry, equal='E' and a diagonal element of $R$ is zero,
or equal='U' and a diagonal element of $R_j$ for some $j$ is zero.
ifail=-99
An unexpected error has been triggered by this routine. Please contact NAG.
See Section 3.9 in How to Use the NAG Library and its Documentation for further information.
ifail=-399
Your licence key may have expired or may not have been installed correctly.
See Section 3.8 in How to Use the NAG Library and its Documentation for further information.
ifail=-999
Dynamic memory allocation failed.
See Section 3.7 in How to Use the NAG Library and its Documentation for further information.

7
Accuracy

The accuracy of the returned posterior probabilities will depend on the accuracy of the input $R$ or $R_j$ matrices. The atypicality index should be accurate to four significant places.

8
Parallelism and Performance

g03dcf makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.
Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.

9
Further Comments

The distances $D_{kj}^2$ can be computed using g03dbf if other forms of discrimination are required.

10
Example

The data, taken from Aitchison and Dunsmore (1975), is concerned with the diagnosis of three ‘types’ of Cushing's syndrome. The variables are the logarithms of the urinary excretion rates (mg/24hr) of two steroid metabolites. Observations for a total of 21 patients are input and the group means and $R$ matrices are computed by g03daf. A further six observations of unknown type are input and allocations made using the predictive approach and under the assumption that the within-group covariance matrices are not equal. The posterior probabilities of group membership, $q_j$, and the atypicality index are printed along with the allocated group. The atypicality index shows that observations 5 and 6 do not seem to be typical of the three types present in the initial 21 observations.

10.1
Program Text

Program Text (g03dcfe.f90)

10.2
Program Data

Program Data (g03dcfe.d)

10.3
Program Results

Program Results (g03dcfe.r)

© The Numerical Algorithms Group Ltd, Oxford, UK. 2017