PDF version (NAG web site
, 64bit version, 64bit version)
NAG Toolbox Chapter Introduction
G11 — Contingency Table Analysis
Scope of the Chapter
The functions in this chapter are for the analysis of discrete multivariate data. One suite of functions computes tables, while other functions are for the analysis of two-way contingency tables, conditional logistic models and one-factor analysis of binary data.
Functions in
Chapter G02 may be used to fit generalized linear models to discrete data including binary data and contingency tables.
Background to the Problems
Discrete Data
Discrete variables can be defined as variables which take a limited range of values. Discrete data can be usefully categorized into three types.
 Binary data. The variables can take one of two values: for example, yes or no. The data may be grouped: for example, the number of yes responses in ten questions.
 Categorical data. The variables can take one of two or more values or levels, but the values are not considered to have any ordering: for example, the values may be red, green, blue or brown.
 Ordered categorical data. This is similar to categorical data but an ordering can be placed on the levels: for example, poor, average or good.
Data containing discrete variables can be analysed by computing summaries and measures of association and by fitting models.
Tabulation
The basic summary for multivariate discrete data is the multidimensional table in which each dimension is specified by a discrete variable. If the cells of the table are the number of observations with the corresponding values of the discrete variables then it is a contingency table. The discrete variables that can be used to classify a table are known as factors. For example, the factor sex would have the levels male and female. These can be coded as $1$ and $2$ respectively. Given several factors a multi-way table can be constructed such that each cell of the table represents one level from each factor. For example, a sample of $120$ observations with the two factors sex and habitat, habitat having three levels (inner-city, suburban and rural), would give the $2\times 3$ contingency table:
Sex       Habitat
          Inner-city   Suburban   Rural
Male      32           27         15
Female    21           19         6
If the sample also contains continuous variables such as age, the average for the observations in each cell, or another summary statistic, could be computed:
Sex       Habitat
          Inner-city   Suburban   Rural
Male      25.5         30.3       35.6
Female    23.2         29.1       30.4
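As a minimal sketch of this kind of tabulation, the following Python fragment builds a table of counts and a table of cell means from raw records. The records are hypothetical and are not the data behind the tables above; the function name `tabulate` is likewise an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical records (sex, habitat, age); not the data behind the tables above.
records = [
    ("male", "inner-city", 24), ("male", "inner-city", 28),
    ("male", "rural", 36), ("female", "suburban", 30),
    ("female", "suburban", 27), ("female", "rural", 31),
]

def tabulate(records):
    """Return per-cell counts and per-cell mean ages keyed by (sex, habitat)."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for sex, habitat, age in records:
        counts[(sex, habitat)] += 1
        sums[(sex, habitat)] += age
    means = {cell: sums[cell] / counts[cell] for cell in counts}
    return dict(counts), means

counts, means = tabulate(records)
for cell in sorted(counts):
    print(cell, counts[cell], means[cell])
```

Cells with no observations are simply absent from the dictionaries; a full table would need a convention for such empty cells.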
Given a table, the totals or means for rows, columns etc. may be required. Thus the above contingency table with marginal totals is
Sex       Habitat
          Inner-city   Suburban   Rural   Total
Male      32           27         15      74
Female    21           19         6       46
Total     53           46         21      120
Note that the column of marginal totals (the totals for each row) is itself a $2\times 1$ table. Other summary statistics, such as means or medians, could also be used to produce the marginal tables. Having computed the marginal tables, the cells of the original table may be expressed in terms of the margins; for example, in the above table the cells could be expressed as percentages of the column totals.
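The marginal totals and the column percentages described above can be sketched in a few lines of Python. The counts are those of the sex-by-habitat table; the helper names `margins` and `column_percentages` are assumptions made for this illustration and are not functions of this chapter.

```python
# Counts from the sex-by-habitat contingency table above.
rows = ["Male", "Female"]
cols = ["Inner-city", "Suburban", "Rural"]
counts = [[32, 27, 15],
          [21, 19, 6]]

def margins(table):
    """Return (row totals, column totals, grand total) of a two-way table."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    return row_totals, col_totals, sum(row_totals)

def column_percentages(table):
    """Express each cell as a percentage of its column total."""
    _, col_totals, _ = margins(table)
    return [[100.0 * cell / col_totals[j] for j, cell in enumerate(r)]
            for r in table]

row_totals, col_totals, grand_total = margins(counts)
percentages = column_percentages(counts)
for name, r in zip(rows, percentages):
    print(name, [round(p, 1) for p in r])
```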
Discrete Response Variables and Logistic Regression
A second important categorization in addition to that given in
Section [Discrete Data] is whether one of the discrete variables can be considered as a response variable or whether it is just the association between the discrete variables that is being considered.
If the response variable is binary, for example, success or failure, then a logistic or probit regression model can be used. The logistic regression model relates the logarithm of the odds to a linear model. So if $p_i$ is the probability of success, the model relates $\log(p_i/(1-p_i))$ to the explanatory variables. If the responses are independent then these models are special cases of the generalized linear model with binomial errors. However, there are cases when the binomial model is not suitable. For example, in a case-control study a number of cases (successes) and a number of controls (failures) are chosen for each of a number of case-control sets. In this situation a conditional logistic analysis is required.
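The link between the probability of success and the linear model can be illustrated with a short Python sketch. The coefficients `beta0` and `beta1` and the single explanatory variable `x` are hypothetical values chosen only to show the transformation; no model is fitted here.

```python
import math

def logit(p):
    """Log-odds log(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Probability of success implied by a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for a single explanatory variable x.
beta0, beta1 = -1.0, 0.5
for x in (0.0, 2.0, 4.0):
    eta = beta0 + beta1 * x          # linear predictor
    print(x, round(inverse_logit(eta), 3))
```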
Handling a categorical or ordered categorical response variable is more complex; for a discussion of the appropriate models see McCullagh and Nelder (1983). These models generally use a Poisson distribution.
Note that if the response variable is a continuous variable and it is only the explanatory variables that are discrete then the regression models described in
Chapter G02 should be used.
Contingency Tables
If there is no response variable then to investigate the association between discrete variables a contingency table can be computed and a suitable test performed on the table. The simplest case is the two-way table formed when considering two discrete variables. For a dataset of $n$ observations classified by the two variables with $r$ and $c$ levels respectively, a two-way table of frequencies or counts with $r$ rows and $c$ columns can be computed.
If $p_{ij}$ is the probability of an observation in cell $ij$ then the model which assumes no association between the two variables is the model
$$p_{ij} = p_{i.}\,p_{.j}$$
where $p_{i.}$ is the marginal probability for the row variable and $p_{.j}$ is the marginal probability for the column variable, the marginal probability being the probability of observing a particular value of the variable ignoring all other variables. The appropriateness of this model can be assessed by two commonly used statistics: the Pearson $\chi^2$ statistic
$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(n_{ij}-f_{ij})^2}{f_{ij}}$$
and the likelihood ratio test statistic
$$2\sum_{i=1}^{r}\sum_{j=1}^{c} n_{ij}\log\left(\frac{n_{ij}}{f_{ij}}\right).$$
Here $n_{ij}$ is the observed count in cell $ij$. The $f_{ij}$ are the fitted values from the model; these values are the expected cell frequencies and are given by
$$f_{ij} = n\,\hat{p}_{i.}\,\hat{p}_{.j} = \frac{n_{i.}\,n_{.j}}{n}$$
where $n_{i.}$ and $n_{.j}$ are the row and column totals.
Under the hypothesis of no association between the two classification variables, both these statistics have, approximately, a $\chi^2$-distribution with $(c-1)(r-1)$ degrees of freedom. This distribution is arrived at under the assumption that the expected cell frequencies, $f_{ij}$, are not too small.
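The Pearson and likelihood ratio statistics can be sketched directly in Python, taking the expected frequency in each cell as (row total × column total) / n, the usual estimate under no association. The function name `chi_square_statistics` is an assumption for this illustration; it is not the interface of any function in this chapter, and it returns only the statistics and degrees of freedom, not a significance level.

```python
import math

def chi_square_statistics(table):
    """Pearson and likelihood ratio statistics for an r-by-c table of counts."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    # Expected frequencies under no association: (row total * column total) / n.
    fitted = [[ri * cj / n for cj in col_totals] for ri in row_totals]
    pearson = 0.0
    likelihood_ratio = 0.0
    for obs_row, fit_row in zip(table, fitted):
        for n_ij, f_ij in zip(obs_row, fit_row):
            pearson += (n_ij - f_ij) ** 2 / f_ij
            if n_ij > 0:  # a zero count contributes nothing to the sum
                likelihood_ratio += 2.0 * n_ij * math.log(n_ij / f_ij)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return pearson, likelihood_ratio, df, fitted

# Sex-by-habitat counts from the Tabulation section.
counts = [[32, 27, 15], [21, 19, 6]]
pearson, likelihood_ratio, df, fitted = chi_square_statistics(counts)
print(round(pearson, 3), round(likelihood_ratio, 3), df)
```

Both statistics would then be referred to a $\chi^2$-distribution with `df` degrees of freedom; that step needs a distribution function and is left out of this sketch.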
In the case of the $2\times 2$ table, i.e., $c=2$ and $r=2$, the $\chi^2$ approximation can be improved by using Yates's continuity correction factor. This decreases the absolute value of $(n_{ij}-f_{ij})$ by $1/2$. For $2\times 2$ tables with small values of $n$ the exact probabilities can be computed; this is known as Fisher's exact test.
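Both refinements for the $2\times 2$ table can be sketched in Python. Two choices below are assumptions of this sketch rather than statements about how any function in this chapter behaves: the corrected difference is clamped at zero, and the two-sided exact probability is taken as the total probability of all tables with the same margins that are no more probable than the observed one. The example table is hypothetical.

```python
from math import comb

def yates_chi_square(table):
    """Pearson statistic for a 2x2 table with |n_ij - f_ij| reduced by 1/2."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, n_ij in enumerate(obs_row):
            f_ij = rows[i] * cols[j] / n
            # Clamp at zero so the correction never increases the statistic.
            stat += max(abs(n_ij - f_ij) - 0.5, 0.0) ** 2 / f_ij
    return stat

def fisher_exact_two_sided(table):
    """Two-sided exact p-value from the hypergeometric distribution."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):  # probability that the top-left cell equals x, margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

small = [[3, 1], [1, 3]]  # hypothetical 2x2 table with n = 8
print(yates_chi_square(small), fisher_exact_two_sided(small))
```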
An alternative approach, which can easily be generalized to more than two variables, is to use log-linear models. A log-linear model for two variables can be written as
$$\log(p_{ij}) = \log(p_{i.}) + \log(p_{.j}).$$
A model like this can be fitted as a generalized linear model with Poisson error with the cell counts, $n_{ij}$, as the response variable.
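For the two-variable model without an interaction term, the fitted cell frequencies can be written down from the margins, so the additive structure on the log scale can be checked without an iterative fit. The sketch below does only that, using the sex-by-habitat counts from the Tabulation section; it is not a general Poisson generalized linear model fit, and the names `row_effect`, `col_effect` and `log_fitted` are assumptions made for illustration.

```python
import math

# Sex-by-habitat counts from the Tabulation section.
counts = [[32, 27, 15], [21, 19, 6]]
row_totals = [sum(r) for r in counts]
col_totals = [sum(c) for c in zip(*counts)]
n = sum(row_totals)

# Additive terms on the log scale.
row_effect = [math.log(ri / n) for ri in row_totals]  # log of estimated p_i.
col_effect = [math.log(cj / n) for cj in col_totals]  # log of estimated p_.j

def log_fitted(i, j):
    """log(f_ij) under the log-linear model with no interaction term."""
    return math.log(n) + row_effect[i] + col_effect[j]

fitted = [[math.exp(log_fitted(i, j)) for j in range(len(col_totals))]
          for i in range(len(row_totals))]
for r in fitted:
    print([round(f, 2) for f in r])
```

With more than two variables, or with interaction terms, the fitted values generally have no such closed form and an iterative fit is needed.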
Latent Variable Models
Latent variable models play an important role in the analysis of multivariate data. They have arisen in response to practical needs in many sciences, especially in psychology, educational testing and other social sciences.
Large-scale statistical enquiries, such as social surveys, generate much more information than can be easily absorbed without condensation. Elementary statistical methods help to summarise the data by looking at individual variables or the relationship between a small number of variables. However, with many variables it may still be difficult to see any pattern of interrelationships. Our ability to visualize relationships is limited to two or three dimensions, putting us under strong pressure to reduce the dimensionality of the data and yet preserve as much of the structure as possible. The question is thus one of how to replace the many variables with which we start by a much smaller number, with as little loss of information as possible.
One approach to the problem is to set up a model in which the dependence between the observed variables is accounted for by one or more latent variables. Such a model links the large number of observable variables with a much smaller number of latent variables.
Factor analysis, as described in Chapter G03, is based on a linear model of this kind when the observed variables are continuous. Here we consider the case where the observed variables are binary (e.g., coded $0/1$ or true/false) and where there is one latent variable. In educational testing this is known as latent trait analysis, but, more generally, as factor analysis of binary data.
A variety of methods and models have been proposed for this problem. The models used here are derived from the general approach of
Bartholomew (1980) and
Bartholomew (1984). You are referred to
Bartholomew (1980) for further information on the models and to
Bartholomew (1987) for details of the method and application.
Recommendations on Choice and Use of Available Functions
Tabulation
The following functions can be used to perform the tabulation of discrete data:
Analysis of Contingency Tables
nag_contab_chisq (g11aa) computes the Pearson and likelihood ratio $\chi^2$ statistics for a two-way contingency table. For $2\times 2$ tables Yates's correction factor is used and for small samples, $n\le 40$, Fisher's exact test is used.
In addition, nag_correg_glm_poisson (g02gc) can be used to fit a log-linear model to a contingency table.
Binary data
The following functions can be used to analyse binary data:
In addition,
nag_correg_glm_binomial (g02gb) fits generalized linear models to binary data.
Functionality Index
Multiway tables from set of classification factors,   
References
Bartholomew D J (1980) Factor analysis for categorical data (with Discussion) J. Roy. Statist. Soc. Ser. B 42 293–321
Bartholomew D J (1984) The foundations of factor analysis Biometrika 71 221–232
Bartholomew D J (1987) Latent Variable Models and Factor Analysis Griffin
Everitt B S (1977) The Analysis of Contingency Tables Chapman and Hall
Kendall M G and Stuart A (1969) The Advanced Theory of Statistics (Volume 1) (3rd Edition) Griffin
Kendall M G and Stuart A (1973) The Advanced Theory of Statistics (Volume 2) (3rd Edition) Griffin
McCullagh P and Nelder J A (1983) Generalized Linear Models Chapman and Hall
© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2013