NAG C Library Function Document
nag_full_step_regsn (g02efc)
1
Purpose
nag_full_step_regsn (g02efc) calculates a full stepwise selection from $p$ variables by using Clarke's sweep algorithm on the correlation matrix of a design and data matrix, $Z$. The (weighted) variance-covariance, (weighted) means and sum of weights of $Z$ must be supplied.
2
Specification
#include <nag.h> 
#include <nagg02.h> 
void 
nag_full_step_regsn (Integer m,
Integer n,
const double wmean[],
const double c[],
double sw,
Integer isx[],
double fin,
double fout,
double tau,
double b[],
double se[],
double *rsq,
double *rms,
Integer *df,
Integer monlev,
void (*monfun)(Nag_FullStepwise flag, Integer var, double val, Nag_Comm *comm),
Nag_Comm *comm,
NagError *fail) 

3
Description
The general multiple linear regression model is defined by
$$y={\beta}_{0}+X\beta +\epsilon ,$$
where
 $y$ is a vector of $n$ observations on the dependent variable,
 ${\beta}_{0}$ is an intercept coefficient,
 $X$ is an $n$ by $p$ matrix of $p$ explanatory variables,
 $\beta $ is a vector of $p$ unknown coefficients, and
 $\epsilon $ is a vector of length $n$ of unknown, Normally distributed, random errors.
nag_full_step_regsn (g02efc) employs a full stepwise regression to select a subset of explanatory variables from the $p$ available variables (the intercept is included in the model) and computes regression coefficients and their standard errors, and various other statistical quantities, by minimizing the sum of squares of residuals. The method applies repeatedly a forward selection step followed by a backward elimination step and halts when neither step updates the current model.
The criterion used to update a current model is the variance ratio of residual sums of squares. Let ${s}_{1}$ and ${s}_{2}$ be the residual sums of squares of the current model and this model after undergoing a single update, with degrees of freedom ${q}_{1}$ and ${q}_{2}$, respectively. The variance ratio formed from these quantities must exceed ${f}_{1}$ if a variable $k$ is to be considered for entry to the current model, and must be less than ${f}_{2}$ if a variable $k$ is to be considered for removal from the current model, where ${f}_{1}$ and ${f}_{2}$ are user-supplied values and ${f}_{2}\le {f}_{1}$.
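As a rough illustration of a variance-ratio criterion of this kind, the sketch below computes the usual partial $F$ ratio for a pair of nested least squares fits. The function name `variance_ratio`, its argument names and the exact form of the ratio are assumptions made for illustration; they are not taken from the g02efc implementation.

```c
#include <math.h>

/* Hypothetical helper (not part of the NAG Library): partial F ratio for two
 * nested least squares fits.
 *   rss_without, df_without: residual sum of squares and residual degrees of
 *                            freedom of the model WITHOUT the candidate
 *   rss_with, df_with:       the same quantities for the model WITH it
 * Returns NAN when the ratio is undefined. */
double variance_ratio(double rss_without, int df_without,
                      double rss_with, int df_with)
{
    if (df_without <= df_with || df_with <= 0 || rss_with <= 0.0)
        return NAN;
    /* mean square accounted for by the extra variable(s) */
    double ms_extra = (rss_without - rss_with) / (double)(df_without - df_with);
    /* residual mean square of the larger model */
    double ms_resid = rss_with / (double)df_with;
    return ms_extra / ms_resid;
}
```

Under this assumed form the same function serves both steps: for entry the current model is the one without the candidate, and for removal it is the one with it.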
In the entry step the entry statistic is computed for each variable not in the current model. If no variable is associated with a test value that exceeds ${f}_{1}$ then this step is terminated; otherwise the variable associated with the largest value for the entry statistic is entered into the model.
In the removal step the removal statistic is computed for each variable in the current model. If no variable is associated with a test value less than ${f}_{2}$ then this step is terminated; otherwise the variable associated with the smallest value for the removal statistic is removed from the model.
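The entry and removal rules just described can be sketched as two small selection loops. The names `pick_entry` and `pick_removal`, and the arrays `f` and `in_model`, are hypothetical; the sketch assumes the test statistics have already been computed and ignores variables that are forced in or out of the model.

```c
/* Hypothetical helpers illustrating the entry and removal rules.
 * f[j] holds the test statistic for variable j and in_model[j] is nonzero
 * when variable j is in the current model. */

/* Entry step: index of the variable outside the model with the largest
 * statistic exceeding f1, or -1 if no variable qualifies. */
int pick_entry(const double f[], const int in_model[], int p, double f1)
{
    int best = -1;
    for (int j = 0; j < p; ++j) {
        if (in_model[j])
            continue;
        if (f[j] > f1 && (best < 0 || f[j] > f[best]))
            best = j;
    }
    return best;
}

/* Removal step: index of the variable in the model with the smallest
 * statistic below f2, or -1 if no variable qualifies. */
int pick_removal(const double f[], const int in_model[], int p, double f2)
{
    int worst = -1;
    for (int j = 0; j < p; ++j) {
        if (!in_model[j])
            continue;
        if (f[j] < f2 && (worst < 0 || f[j] < f[worst]))
            worst = j;
    }
    return worst;
}
```

A return value of -1 from both helpers corresponds to the halting condition, in which neither step updates the current model.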
The data values $X$ and $y$ are not provided as input to the function. Instead, summary statistics of the design and data matrix $Z=\left(X\mid y\right)$ are required.
Explanatory variables are entered into and removed from the current model by using sweep operations on the correlation matrix $R$ of $Z$, given by:
$$R=\left(\begin{array}{cccc}{r}_{11}& \dots & {r}_{1p}& {r}_{1y}\\ \vdots & \ddots & \vdots & \vdots \\ {r}_{p1}& \dots & {r}_{pp}& {r}_{py}\\ {r}_{y1}& \dots & {r}_{yp}& {r}_{yy}\end{array}\right),$$
where ${r}_{\mathit{i}\mathit{j}}$ is the correlation between the explanatory variables $\mathit{i}$ and $\mathit{j}$, for $\mathit{i}=1,2,\dots ,p$ and $\mathit{j}=1,2,\dots ,p$, and ${r}_{yi}$ (and ${r}_{iy}$) is the correlation between the response variable $y$ and the $\mathit{i}$th explanatory variable, for $\mathit{i}=1,2,\dots ,p$.
A sweep operation on the $k$th row and column ($k\le p$) of $R$ pivots on the diagonal element ${r}_{kk}$ and replaces the elements of $R$ with updated values. The $k$th explanatory variable is eligible for entry into the current model if it satisfies the collinearity tests: ${r}_{kk}>\tau $ and a second condition, checked for each index $i$ running over the explanatory variables in the current model, for a user-supplied value ($>0$) of $\tau $. The sweep operation is its own inverse; therefore pivoting on an explanatory variable $k$ in the current model has the effect of removing it from the model.
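To make the self-inverse property concrete, here is a minimal sketch of one sweep variant that has it. The function name `sweep`, the dense row-major storage and the particular update formulas are assumptions chosen for illustration; g02efc works on packed symmetric storage and its exact update rules are not reproduced here.

```c
#include <math.h>

/* Sketch of one self-inverse sweep variant on a dense symmetric matrix a of
 * order n stored row-major; k is the 0-based pivot index.
 * Returns 0 on success and -1 if the pivot is zero. */
int sweep(double *a, int n, int k)
{
    double d = a[k * n + k];
    if (d == 0.0)
        return -1;
    double ad = fabs(d);

    /* elements outside the pivot row and column: a_ij -= a_ik * a_kj / d */
    for (int i = 0; i < n; ++i) {
        if (i == k)
            continue;
        for (int j = 0; j < n; ++j) {
            if (j == k)
                continue;
            a[i * n + j] -= a[i * n + k] * a[k * n + j] / d;
        }
    }
    /* pivot row and column are scaled by |d| */
    for (int i = 0; i < n; ++i) {
        if (i == k)
            continue;
        a[i * n + k] /= ad;
        a[k * n + i] /= ad;
    }
    /* pivot element changes sign, which marks the variable as swept in */
    a[k * n + k] = -1.0 / d;
    return 0;
}
```

For a 2 by 2 correlation matrix with off-diagonal element $r$, sweeping on the explanatory variable leaves $1-{r}^{2}$ in the response position; sweeping on the same index a second time restores the original matrix.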
Once the stepwise model selection procedure is finished, the function calculates:
(a) the least squares estimate for the $i$th explanatory variable included in the fitted model;
(b) standard error estimates for each coefficient in the final model;
(c) the square root of the mean square of residuals and its degrees of freedom;
(d) the multiple correlation coefficient.
The function makes use of the symmetry of the sweep operations and correlation matrix, which reduces by almost one half the storage and computation required by the sweep algorithm; see Clarke (1981) for details.
4
References
Clarke M R B (1981) Algorithm AS 178: the Gauss–Jordan sweep operator with detection of collinearity Appl. Statist. 31 166–169
Dempster A P (1969) Elements of Continuous Multivariate Analysis Addison–Wesley
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
5
Arguments
 1:
$\mathbf{m}$ – Integer (Input)

On entry: the number of explanatory variables available in the design matrix, $Z$.
Constraint: ${\mathbf{m}}>1$.
 2:
$\mathbf{n}$ – Integer (Input)

On entry: the number of observations used in the calculations.
Constraint: ${\mathbf{n}}>1$.
 3:
$\mathbf{wmean}\left[{\mathbf{m}}+1\right]$ – const double (Input)

On entry: the mean of the design matrix, $Z$.
 4:
$\mathbf{c}\left[\mathit{dim}\right]$ – const double (Input)

Note: the dimension, dim, of the array c must be at least $\left({\mathbf{m}}+1\right)\times \left({\mathbf{m}}+2\right)/2$.
On entry: the upper-triangular variance-covariance matrix packed by column for the design matrix, $Z$. Because the function computes the correlation matrix $R$ from c, the variance-covariance matrix need only be supplied up to a scaling factor.
 5:
$\mathbf{sw}$ – double (Input)

On entry: if weights were used to calculate c then sw is the sum of positive weight values; otherwise sw is the number of observations used to calculate c.
Constraint: ${\mathbf{sw}}>1.0$.
 6:
$\mathbf{isx}\left[{\mathbf{m}}\right]$ – Integer (Input/Output)

On entry: the value of ${\mathbf{isx}}\left[\mathit{j}-1\right]$ determines the set of variables used to perform full stepwise model selection, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
 ${\mathbf{isx}}\left[\mathit{j}-1\right]=-1$
 To exclude the variable corresponding to the $j$th column of $X$ from the final model.
 ${\mathbf{isx}}\left[\mathit{j}-1\right]=1$
 To consider the variable corresponding to the $j$th column of $X$ for selection in the final model.
 ${\mathbf{isx}}\left[\mathit{j}-1\right]=2$
 To force the inclusion of the variable corresponding to the $j$th column of $X$ in the final model.
Constraint: ${\mathbf{isx}}\left[\mathit{j}-1\right]=-1$, $1$ or $2$, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
On exit: the value of ${\mathbf{isx}}\left[\mathit{j}-1\right]$ indicates the status of the $j$th explanatory variable in the model.
 ${\mathbf{isx}}\left[\mathit{j}-1\right]=-1$
 Forced exclusion.
 ${\mathbf{isx}}\left[\mathit{j}-1\right]=0$
 Excluded.
 ${\mathbf{isx}}\left[\mathit{j}-1\right]=1$
 Selected.
 ${\mathbf{isx}}\left[\mathit{j}-1\right]=2$
 Forced selection.
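As a small illustration of reading isx on exit, the sketch below collects the variables that ended up in the model, that is, those marked as selected or forced selection. The helper name `model_vars` is hypothetical, and plain `int` stands in for the NAG Integer type so that the fragment compiles on its own.

```c
/* Hypothetical helper: store in out[] the 1-based indices j of the variables
 * whose exit status is 1 (selected) or 2 (forced selection) and return how
 * many there are. out[] must have room for m entries. */
int model_vars(const int isx[], int m, int out[])
{
    int count = 0;
    for (int j = 0; j < m; ++j) {
        if (isx[j] == 1 || isx[j] == 2)
            out[count++] = j + 1;
    }
    return count;
}
```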
 7:
$\mathbf{fin}$ – double (Input)

On entry: the value of the variance ratio which an explanatory variable must exceed to be included in a model.
Suggested value: ${\mathbf{fin}}=4.0$.
Constraint: ${\mathbf{fin}}>0.0$.
 8:
$\mathbf{fout}$ – double (Input)

On entry: the explanatory variable in a model with the lowest variance ratio value is removed from the model if its value is less than fout. fout is usually set equal to the value of fin; a value less than fin is occasionally preferred.
Suggested value: ${\mathbf{fout}}={\mathbf{fin}}$.
Constraint: $0.0\le {\mathbf{fout}}\le {\mathbf{fin}}$.
 9:
$\mathbf{tau}$ – double (Input)

On entry: the tolerance, $\tau $, for detecting collinearities between variables when adding or removing an explanatory variable from a model. Explanatory variables deemed to be collinear are excluded from the final model.
Suggested value: ${\mathbf{tau}}=1.0\times {10}^{-6}$.
Constraint: ${\mathbf{tau}}>0.0$.
 10:
$\mathbf{b}\left[{\mathbf{m}}+1\right]$ – double (Output)

On exit: ${\mathbf{b}}\left[0\right]$ contains the estimate for the intercept term in the fitted model. If ${\mathbf{isx}}\left[j-1\right]\ne 0$ then ${\mathbf{b}}\left[j\right]$ contains the estimate for the $j$th explanatory variable in the fitted model; otherwise ${\mathbf{b}}\left[j\right]=0$.
 11:
$\mathbf{se}\left[{\mathbf{m}}+1\right]$ – double (Output)

On exit: ${\mathbf{se}}\left[\mathit{j}-1\right]$ contains the standard error for the estimate of ${\mathbf{b}}\left[\mathit{j}-1\right]$, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}+1$.
 12:
$\mathbf{rsq}$ – double * (Output)

On exit: the ${R}^{2}$-statistic for the fitted regression model.
 13:
$\mathbf{rms}$ – double * (Output)

On exit: the mean square of residuals for the fitted regression model.
 14:
$\mathbf{df}$ – Integer * (Output)

On exit: the number of degrees of freedom for the sum of squares of residuals.
 15:
$\mathbf{monlev}$ – Integer (Input)

On entry: if a subfunction is provided by you to monitor the model selection process, set monlev to $1$; otherwise set monlev to $0$.
Constraint: ${\mathbf{monlev}}=0$ or $1$.
 16:
$\mathbf{monfun}$ – function, supplied by the user (External Function)

You may define your own function or specify the NAG defined default function nag_full_step_regsn_monfun (g02efg).
If this facility is not required then the NAG defined null function macro NULLFN can be substituted.
The specification of monfun is:
void 
monfun (Nag_FullStepwise flag,
Integer var,
double val,
Nag_Comm *comm)


 1:
$\mathbf{flag}$ – Nag_FullStepwise (Input)

On entry: the value of flag indicates the stage of the stepwise selection of explanatory variables.
 ${\mathbf{flag}}=\mathrm{Nag\_AddVar}$
 Variable var was added to the current model.
 ${\mathbf{flag}}=\mathrm{Nag\_BeginBackward}$
 Beginning the backward elimination step.
 ${\mathbf{flag}}=\mathrm{Nag\_ColinearVar}$
 Variable var failed the collinearity test and is excluded from the model.
 ${\mathbf{flag}}=\mathrm{Nag\_DropVar}$
 Variable var was dropped from the current model.
 ${\mathbf{flag}}=\mathrm{Nag\_BeginForward}$
 Beginning the forward selection step.
 ${\mathbf{flag}}=\mathrm{Nag\_NoRemoveVar}$
 Backward elimination did not remove any variables from the current model.
 ${\mathbf{flag}}=\mathrm{Nag\_BeginStepwise}$
 Starting stepwise selection procedure.
 ${\mathbf{flag}}=\mathrm{Nag\_VarianceRatio}$
 The variance ratio for variable var takes the value val.
 ${\mathbf{flag}}=\mathrm{Nag\_FinishStepwise}$
 Finished stepwise selection procedure.
 2:
$\mathbf{var}$ – Integer (Input)

On entry: the index of the explanatory variable in the design matrix $Z$ to which flag pertains.
 3:
$\mathbf{val}$ – double (Input)

On entry: if ${\mathbf{flag}}=\mathrm{Nag\_VarianceRatio}$, val is the variance ratio value for the coefficient associated with explanatory variable index var.
 4:
$\mathbf{comm}$ – Nag_Comm *
Pointer to structure of type Nag_Comm; the following members are relevant to monfun.
 user – double *
 iuser – Integer *
 p – Pointer
The type Pointer will be void *. Before calling nag_full_step_regsn (g02efc) you may allocate memory and initialize these pointers with various quantities for use by monfun when called from nag_full_step_regsn (g02efc) (see Section 3.3.1.1 in How to Use the NAG Library and its Documentation).
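A user-written monitor might look like the following sketch, which prints a line for a few of the flag values and keeps running counts in the iuser member of comm. The name `my_monfun` and the use of `iuser[0]` and `iuser[1]` as counters are hypothetical. So that the fragment compiles without the NAG headers, it declares stand-ins for Integer, Nag_FullStepwise and Nag_Comm; these stand-ins are assumptions, and with the real library they should be deleted in favour of including nag.h and nagg02.h.

```c
#include <stdio.h>

/* Stand-in declarations (assumptions, for a self-contained sketch only).
 * The enumerator values are placeholders, not those used by NAG. */
typedef int Integer;
typedef enum {
    Nag_AddVar, Nag_BeginBackward, Nag_ColinearVar, Nag_DropVar,
    Nag_BeginForward, Nag_NoRemoveVar, Nag_BeginStepwise,
    Nag_VarianceRatio, Nag_FinishStepwise
} Nag_FullStepwise;
typedef struct { double *user; Integer *iuser; void *p; } Nag_Comm;

/* Hypothetical monitor: counts additions in comm->iuser[0] and removals in
 * comm->iuser[1], and reports selected events on standard output. */
void my_monfun(Nag_FullStepwise flag, Integer var, double val, Nag_Comm *comm)
{
    switch (flag) {
    case Nag_AddVar:
        if (comm && comm->iuser)
            comm->iuser[0] += 1;
        printf("added variable %d\n", (int)var);
        break;
    case Nag_DropVar:
        if (comm && comm->iuser)
            comm->iuser[1] += 1;
        printf("dropped variable %d\n", (int)var);
        break;
    case Nag_VarianceRatio:
        printf("variable %d: variance ratio %g\n", (int)var, val);
        break;
    default:
        /* other stages are ignored in this sketch */
        break;
    }
}
```

The caller would point iuser at an array of at least two Integer counters, initialized to zero, before the monitor is first invoked.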
 17:
$\mathbf{comm}$ – Nag_Comm *

The NAG communication argument (see Section 3.3.1.1 in How to Use the NAG Library and its Documentation).
 18:
$\mathbf{fail}$ – NagError * (Input/Output)

The NAG error argument (see Section 3.7 in How to Use the NAG Library and its Documentation).
6
Error Indicators and Warnings
 NE_ALLOC_FAIL

Dynamic memory allocation failed.
See Section 2.3.1.2 in How to Use the NAG Library and its Documentation for further information.
 NE_BAD_PARAM

On entry, argument $\langle \mathit{value}\rangle $ had an illegal value.
 NE_FREE_VARS

On entry, ${\mathbf{isx}}\left[i-1\right]\ne 1$, for all $i=1,2,\dots ,{\mathbf{m}}$.
Constraint: there must be at least one free variable.
 NE_INT

On entry, ${\mathbf{m}}=\langle \mathit{value}\rangle $.
Constraint: ${\mathbf{m}}>1$.
On entry, ${\mathbf{n}}=\langle \mathit{value}\rangle $.
Constraint: ${\mathbf{n}}>1$.
 NE_INT_ARRAY_ELEM_CONS

On entry, ${\mathbf{isx}}\left[\langle \mathit{value}\rangle \right]=\langle \mathit{value}\rangle $.
Constraint: ${\mathbf{isx}}\left[\mathit{j}-1\right]=-1$, $1$ or $2$, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
 NE_INTERNAL_ERROR

An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
See Section 2.7.6 in How to Use the NAG Library and its Documentation for further information.
 NE_MODEL_INFEASIBLE

All variables are collinear, no model to select.
 NE_NO_LICENCE

Your licence key may have expired or may not have been installed correctly.
See Section 2.7.5 in How to Use the NAG Library and its Documentation for further information.
 NE_NOT_POS_DEF

The design and data matrix $Z$ is not positive definite, results may be inaccurate. All output is returned as documented.
 NE_REAL

On entry, ${\mathbf{fin}}=\langle \mathit{value}\rangle $.
Constraint: ${\mathbf{fin}}>0.0$.
On entry, ${\mathbf{sw}}=\langle \mathit{value}\rangle $.
Constraint: ${\mathbf{sw}}>1.0$.
On entry, ${\mathbf{tau}}=\langle \mathit{value}\rangle $.
Constraint: ${\mathbf{tau}}>0.0$.
 NE_REAL_2

On entry, ${\mathbf{fout}}=\langle \mathit{value}\rangle $; ${\mathbf{fin}}=\langle \mathit{value}\rangle $.
Constraint: $0.0\le {\mathbf{fout}}\le {\mathbf{fin}}$.
 NE_ZERO_DIAG

On entry, at least one diagonal element of ${\mathbf{c}}\le 0.0$.
Constraint: c must be positive definite.
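The scalar and isx constraints reported by these error indicators can also be checked before the call, as in the sketch below. The function `check_args` and its return codes are hypothetical, plain `int` stands in for the NAG Integer type, and the permitted entry values of isx are assumed to be -1, 1 and 2. The positive definiteness of c is not tested here.

```c
/* Hypothetical pre-call check of the documented entry constraints.
 * Returns 0 if all hold, otherwise a positive code identifying the first
 * violated constraint. */
int check_args(int m, int n, double sw, const int isx[],
               double fin, double fout, double tau)
{
    if (m <= 1) return 1;                           /* m > 1 */
    if (n <= 1) return 2;                           /* n > 1 */
    if (!(sw > 1.0)) return 3;                      /* sw > 1.0 */
    if (!(fin > 0.0)) return 4;                     /* fin > 0.0 */
    if (!(fout >= 0.0 && fout <= fin)) return 5;    /* 0.0 <= fout <= fin */
    if (!(tau > 0.0)) return 6;                     /* tau > 0.0 */

    int free_vars = 0;
    for (int j = 0; j < m; ++j) {
        if (isx[j] != -1 && isx[j] != 1 && isx[j] != 2)
            return 7;                               /* illegal isx value */
        if (isx[j] == 1)
            ++free_vars;
    }
    if (free_vars == 0) return 8;                   /* no free variable */
    return 0;
}
```

The negated comparisons are written so that a NaN in any of the real arguments is also rejected.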
7
Accuracy
nag_full_step_regsn (g02efc) returns a warning if the design and data matrix is not positive definite.
8
Parallelism and Performance
nag_full_step_regsn (g02efc) is not threaded in any implementation.
9
Further Comments
Although the condition for removing or adding a variable to the current model is based on a ratio of variances, these values should not be interpreted as $F$-statistics with the usual interpretation of significance unless the probability levels are adjusted to account for correlations between variables under consideration and the number of possible updates (see, e.g., Draper and Smith (1985)).
nag_full_step_regsn (g02efc) allocates internally $\mathcal{O}\left(4\times {\mathbf{m}}+\left({\mathbf{m}}+1\right)\times \left({\mathbf{m}}+2\right)/2+2\right)$ elements of double storage.
10
Example
This example calculates a full stepwise model selection for the Hald data described in Dempster (1969). Means, the upper-triangular variance-covariance matrix and the sum of weights are calculated by nag_sum_sqs (g02buc). The NAG defined default monitor function nag_full_step_regsn_monfun (g02efg) is used to print information at each step of the model selection process.
10.1
Program Text
Program Text (g02efce.c)
10.2
Program Data
Program Data (g02efce.d)
10.3
Program Results
Program Results (g02efce.r)