# NAG FL Interface g02eff (linregm_fit_stepwise)

## 1 Purpose

g02eff calculates a full stepwise selection from $p$ variables by using Clarke's sweep algorithm on the correlation matrix of a design and data matrix, $Z$. The (weighted) variance-covariance matrix, (weighted) means and sum of weights of $Z$ must be supplied.

## 2 Specification

Fortran Interface
 Subroutine g02eff ( m, n, wmean, c, sw, isx, fin, fout, tau, b, se, rsq, rms, df, monlev, monfun, iuser, ruser, ifail)
 Integer, Intent (In) :: m, n, monlev
 Integer, Intent (Inout) :: isx(m), iuser(*), ifail
 Integer, Intent (Out) :: df
 Real (Kind=nag_wp), Intent (In) :: wmean(m+1), c((m+1)*(m+2)/2), sw, fin, fout, tau
 Real (Kind=nag_wp), Intent (Inout) :: ruser(*)
 Real (Kind=nag_wp), Intent (Out) :: b(m+1), se(m+1), rsq, rms
 External :: monfun
C Header Interface
#include <nag.h>
 void g02eff_ (const Integer *m, const Integer *n, const double wmean[], const double c[], const double *sw, Integer isx[], const double *fin, const double *fout, const double *tau, double b[], double se[], double *rsq, double *rms, Integer *df, const Integer *monlev, void (NAG_CALL *monfun)(const char *flag, const Integer *var, const double *val, Integer iuser[], double ruser[], const Charlen length_flag),Integer iuser[], double ruser[], Integer *ifail)
The routine may be called by the names g02eff or nagf_correg_linregm_fit_stepwise.

## 3 Description

The general multiple linear regression model is defined by
 $y = \beta_0 + X\beta + \epsilon,$
where
• $y$ is a vector of $n$ observations on the dependent variable,
• ${\beta }_{0}$ is an intercept coefficient,
• $X$ is an $n×p$ matrix of $p$ explanatory variables,
• $\beta$ is a vector of $p$ unknown coefficients, and
• $\epsilon$ is a vector of length $n$ of unknown, Normally distributed, random errors.
g02eff employs a full stepwise regression to select a subset of explanatory variables from the $p$ available variables (the intercept is included in the model) and computes regression coefficients and their standard errors, and various other statistical quantities, by minimizing the sum of squares of residuals. The method applies repeatedly a forward selection step followed by a backward elimination step and halts when neither step updates the current model.
The criterion used to update a current model is the variance ratio of residual sum of squares. Let ${s}_{1}$ and ${s}_{2}$ be the residual sum of squares of the current model and this model after undergoing a single update, with degrees of freedom ${q}_{1}$ and ${q}_{2}$, respectively. Then the condition:
 $\frac{(s_2 - s_1)/(q_2 - q_1)}{s_1/q_1} > f_1,$
must be satisfied if a variable $k$ will be considered for entry to the current model, and the condition:
 $\frac{(s_1 - s_2)/(q_1 - q_2)}{s_1/q_1} < f_2,$
must be satisfied if a variable $k$ will be considered for removal from the current model, where ${f}_{1}$ and ${f}_{2}$ are user-supplied values and ${f}_{2}\le {f}_{1}$.
In the entry step the entry statistic is computed for each variable not in the current model. If no variable is associated with a test value that exceeds ${f}_{1}$ then this step is terminated; otherwise the variable associated with the largest value for the entry statistic is entered into the model.
In the removal step the removal statistic is computed for each variable in the current model. If no variable is associated with a test value less than ${f}_{2}$ then this step is terminated; otherwise the variable associated with the smallest value for the removal statistic is removed from the model.
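The entry and removal tests above can be sketched in a few lines of C. This is an illustration only, not the routine's internal code: the function names `variance_ratio`, `passes_entry_test` and `passes_removal_test` are hypothetical, and the thresholds are named after the arguments fin and fout, which play the roles of $f_1$ and $f_2$.

```c
#include <stdbool.h>

/* Variance ratio for a single candidate update, as defined above:
 *   ((s2 - s1) / (q2 - q1)) / (s1 / q1)
 * s1, q1: residual sum of squares and degrees of freedom of the current model
 * s2, q2: the same quantities after the candidate update
 * The removal form ((s1 - s2) / (q1 - q2)) / (s1 / q1) has the same value,
 * so one function serves both tests. Requires q1 != q2, q1 != 0, s1 != 0. */
double variance_ratio(double s1, int q1, double s2, int q2)
{
    return ((s2 - s1) / (double)(q2 - q1)) / (s1 / (double)q1);
}

/* A variable is a candidate for entry when its ratio exceeds f1 (fin). */
bool passes_entry_test(double ratio, double fin)
{
    return ratio > fin;
}

/* A variable is a candidate for removal when its ratio is below f2 (fout). */
bool passes_removal_test(double ratio, double fout)
{
    return ratio < fout;
}
```

For instance, a current model with $s_1 = 100$ on $q_1 = 10$ degrees of freedom and a candidate that reduces the residual sum of squares to $s_2 = 60$ on $q_2 = 9$ gives a ratio of $4$, which does not strictly exceed a threshold of $4.0$.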
The data values $X$ and $y$ are not provided as input to the routine. Instead, summary statistics of the design and data matrix $Z=\left(X\mid y\right)$ are required.
Explanatory variables are entered into and removed from the current model by using sweep operations on the correlation matrix $R$ of $Z$, given by:
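 $R = \begin{pmatrix} 1 & \dots & r_{1p} & r_{1y} \\ \vdots & \ddots & \vdots & \vdots \\ r_{p1} & \dots & 1 & r_{py} \\ r_{y1} & \dots & r_{yp} & 1 \end{pmatrix},$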
where ${r}_{\mathit{i}\mathit{j}}$ is the correlation between the explanatory variables $\mathit{i}$ and $\mathit{j}$, for $\mathit{i}=1,2,\dots ,p$ and $\mathit{j}=1,2,\dots ,p$, and ${r}_{yi}$ (and ${r}_{iy}$) is the correlation between the response variable $y$ and the $\mathit{i}$th explanatory variable, for $\mathit{i}=1,2,\dots ,p$.
A sweep operation on the $k$th row and column ($k\le p$) of $R$ replaces:
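 $\begin{array}{lll} r_{kk} & \text{by} & -1/r_{kk}; \\ r_{ik} & \text{by} & r_{ik}/|r_{kk}|, \quad i=1,2,\dots,p+1 \ (i\ne k); \\ r_{kj} & \text{by} & r_{kj}/|r_{kk}|, \quad j=1,2,\dots,p+1 \ (j\ne k); \\ r_{ij} & \text{by} & r_{ij} - r_{ik}r_{kj}/|r_{kk}|, \quad i=1,2,\dots,p+1 \ (i\ne k); \ j=1,2,\dots,p+1 \ (j\ne k). \end{array}$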
The $k$th explanatory variable is eligible for entry into the current model if it satisfies the collinearity tests: ${r}_{kk}>\tau$ and
 $\left(r_{ii} - \frac{r_{ik} r_{ki}}{r_{kk}}\right)\tau \le 1,$
for a user-supplied value ($>0$) of $\tau$ and where the index $i$ runs over explanatory variables in the current model. The sweep operation is its own inverse; pivoting on an explanatory variable $k$ that is already in the current model therefore has the effect of removing it from the model.
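As an illustration, the replacement rules for a sweep can be transcribed into C as written. This is a sketch under simplifying assumptions, not the routine's implementation: it works on a dense row-major copy of $R$ rather than the packed symmetric storage described in Clarke (1981), it omits the collinearity tests, and the name `sweep` and the 0-based pivot index are choices made here. Only the entry direction (a pivot with $r_{kk}>0$) is exercised in the usage note.

```c
/* Apply one sweep on pivot k (0-based) to a dense n-by-n matrix r stored
 * row-major, following the four replacement rules stated above.
 * The caller must ensure r[k*n + k] is non-zero. */
void sweep(double *r, int n, int k)
{
    const double pivot = r[k * n + k];
    const double abs_pivot = pivot < 0.0 ? -pivot : pivot;

    /* r_ij <- r_ij - r_ik * r_kj / |r_kk| for i != k, j != k.
     * Done first so the original pivot row and column are still intact. */
    for (int i = 0; i < n; ++i) {
        if (i == k) continue;
        for (int j = 0; j < n; ++j) {
            if (j == k) continue;
            r[i * n + j] -= r[i * n + k] * r[k * n + j] / abs_pivot;
        }
    }

    /* r_ik <- r_ik / |r_kk| and r_kj <- r_kj / |r_kk| for i, j != k. */
    for (int i = 0; i < n; ++i) {
        if (i == k) continue;
        r[i * n + k] /= abs_pivot;
        r[k * n + i] /= abs_pivot;
    }

    /* r_kk <- -1 / r_kk. */
    r[k * n + k] = -1.0 / pivot;
}
```

With two explanatory variables and a response whose correlations are $r_{12}=0.5$, $r_{1y}=0.6$ and $r_{2y}=0.7$, sweeping on the first variable leaves $1-0.6^2=0.64$ in the $(y,y)$ position and $0.7-0.5\times 0.6=0.4$ in the $(2,y)$ position.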
Once the stepwise model selection procedure is finished, the routine calculates:
1. the least squares estimate for the $i$th explanatory variable included in the fitted model;
2. standard error estimates for each coefficient in the final model;
3. the square root of the mean square of residuals and its degrees of freedom;
4. the multiple correlation coefficient.
The routine makes use of the symmetry of the sweep operations and of the correlation matrix, which reduces by almost one half the storage and computation required by the sweep algorithm; see Clarke (1981) for details.

## 4 References

Clarke M R B (1981) Algorithm AS 178: the Gauss–Jordan sweep operator with detection of collinearity Appl. Statist. 31 166–169
Dempster A P (1969) Elements of Continuous Multivariate Analysis Addison–Wesley
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley

## 5 Arguments

1: $\mathbf{m}$ Integer Input
On entry: the number of explanatory variables available in the design matrix, $Z$.
Constraint: ${\mathbf{m}}>1$.
2: $\mathbf{n}$ Integer Input
On entry: the number of observations used in the calculations.
Constraint: ${\mathbf{n}}>1$.
3: $\mathbf{wmean}\left({\mathbf{m}}+1\right)$ Real (Kind=nag_wp) array Input
On entry: the mean of the design matrix, $Z$.
4: $\mathbf{c}\left(\left({\mathbf{m}}+1\right)×\left({\mathbf{m}}+2\right)/2\right)$ Real (Kind=nag_wp) array Input
On entry: the upper-triangular variance-covariance matrix packed by column for the design matrix, $Z$. Because the routine computes the correlation matrix $R$ from c, the variance-covariance matrix need only be supplied up to a scaling factor.
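The packed layout of c can be illustrated with a small index helper. This sketch assumes the conventional column-packed upper-triangular scheme, in which element $(i,j)$ with $i\le j$ (1-based) is stored at position $i+j(j-1)/2$ of the packed array; the function names are hypothetical and the returned offset is 0-based for use with C arrays.

```c
#include <stddef.h>

/* 0-based offset into a column-packed upper triangle of the element in
 * row i, column j (both 1-based, with i <= j).
 * Assumes the conventional layout: (1,1), (1,2), (2,2), (1,3), (2,3), ... */
size_t packed_upper_index(size_t i, size_t j)
{
    return (i - 1) + j * (j - 1) / 2;
}

/* Number of elements needed to hold the packed matrix for m explanatory
 * variables plus the response, i.e. (m+1)*(m+2)/2 as in the argument list. */
size_t packed_length(size_t m)
{
    return (m + 1) * (m + 2) / 2;
}
```

For $m=4$ the packed array has 15 elements, and the variance of the response (row and column $m+1=5$) is the last of them.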
5: $\mathbf{sw}$ Real (Kind=nag_wp) Input
On entry: if weights were used to calculate c then sw is the sum of positive weight values; otherwise sw is the number of observations used to calculate c.
Constraint: ${\mathbf{sw}}>1.0$.
6: $\mathbf{isx}\left({\mathbf{m}}\right)$ Integer array Input/Output
On entry: the value of ${\mathbf{isx}}\left(\mathit{j}\right)$ determines the set of variables used to perform full stepwise model selection, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
${\mathbf{isx}}\left(\mathit{j}\right)=-1$
To exclude the variable corresponding to the $j$th column of $X$ from the final model.
${\mathbf{isx}}\left(\mathit{j}\right)=1$
To consider the variable corresponding to the $j$th column of $X$ for selection in the final model.
${\mathbf{isx}}\left(\mathit{j}\right)=2$
To force the inclusion of the variable corresponding to the $j$th column of $X$ in the final model.
Constraint: ${\mathbf{isx}}\left(\mathit{j}\right)=-1,1\text{​ or ​}2$, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
On exit: the value of ${\mathbf{isx}}\left(\mathit{j}\right)$ indicates the status of the $j$th explanatory variable in the model.
${\mathbf{isx}}\left(\mathit{j}\right)=-1$
Forced exclusion.
${\mathbf{isx}}\left(\mathit{j}\right)=0$
Excluded.
${\mathbf{isx}}\left(\mathit{j}\right)=1$
Selected.
${\mathbf{isx}}\left(\mathit{j}\right)=2$
Forced selection.
7: $\mathbf{fin}$ Real (Kind=nag_wp) Input
On entry: the value of the variance ratio which an explanatory variable must exceed to be included in a model.
Suggested value: ${\mathbf{fin}}=4.0$.
Constraint: ${\mathbf{fin}}>0.0$.
8: $\mathbf{fout}$ Real (Kind=nag_wp) Input
On entry: the explanatory variable in a model with the lowest variance ratio value is removed from the model if its value is less than fout. fout is usually set equal to the value of fin; a value less than fin is occasionally preferred.
Suggested value: ${\mathbf{fout}}={\mathbf{fin}}$.
Constraint: $0.0\le {\mathbf{fout}}\le {\mathbf{fin}}$.
9: $\mathbf{tau}$ Real (Kind=nag_wp) Input
On entry: the tolerance, $\tau$, for detecting collinearities between variables when adding or removing an explanatory variable from a model. Explanatory variables deemed to be collinear are excluded from the final model.
Suggested value: ${\mathbf{tau}}=1.0×{10}^{-6}$.
Constraint: ${\mathbf{tau}}>0.0$.
10: $\mathbf{b}\left({\mathbf{m}}+1\right)$ Real (Kind=nag_wp) array Output
On exit: ${\mathbf{b}}\left(1\right)$ contains the estimate for the intercept term in the fitted model. If ${\mathbf{isx}}\left(j\right)\ne 0$, then ${\mathbf{b}}\left(j+1\right)$ contains the estimate for the $j$th explanatory variable in the fitted model; otherwise ${\mathbf{b}}\left(j+1\right)=0$.
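Because unselected variables receive a zero coefficient, a fitted value can be formed from b without consulting isx. The following is a minimal sketch of that use; the function name `fitted_value` is hypothetical, and the arrays are 0-based, so `b[0]` corresponds to ${\mathbf{b}}(1)$.

```c
/* Fitted value for one observation x[0..m-1] from the coefficient array
 * b[0..m], where b[0] is the intercept and b[j+1] multiplies x[j].
 * Coefficients of variables not in the model are zero, so they drop out. */
double fitted_value(const double *b, const double *x, int m)
{
    double yhat = b[0];
    for (int j = 0; j < m; ++j) {
        yhat += b[j + 1] * x[j];
    }
    return yhat;
}
```

For example, with coefficients $(1, 2, 0, 3)$ and an observation $(1, 5, 2)$, the second variable is not in the model and the fitted value is $1 + 2\times 1 + 3\times 2 = 9$.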
11: $\mathbf{se}\left({\mathbf{m}}+1\right)$ Real (Kind=nag_wp) array Output
On exit: ${\mathbf{se}}\left(\mathit{j}\right)$ contains the standard error for the estimate of ${\mathbf{b}}\left(\mathit{j}\right)$, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}+1$.
12: $\mathbf{rsq}$ Real (Kind=nag_wp) Output
On exit: the ${R}^{2}$-statistic for the fitted regression model.
13: $\mathbf{rms}$ Real (Kind=nag_wp) Output
On exit: the mean square of residuals for the fitted regression model.
14: $\mathbf{df}$ Integer Output
On exit: the number of degrees of freedom for the sum of squares of residuals.
15: $\mathbf{monlev}$ Integer Input
On entry: if a subroutine is provided by you to monitor the model selection process, set monlev to $1$; otherwise set monlev to $0$.
Constraint: ${\mathbf{monlev}}=0$ or $1$.
16: $\mathbf{monfun}$ Subroutine, supplied by the NAG Library or the user. External Procedure
You may define your own function or specify the NAG defined default function g02efh.
If ${\mathbf{monlev}}=0$, monfun is not referenced; otherwise its specification is:
Fortran Interface
 Subroutine monfun ( flag, var, val, iuser, ruser)
 Integer, Intent (In) :: var
 Integer, Intent (Inout) :: iuser(*)
 Real (Kind=nag_wp), Intent (In) :: val
 Real (Kind=nag_wp), Intent (Inout) :: ruser(*)
 Character (1), Intent (In) :: flag
 void monfun (const char *flag, const Integer *var, const double *val, Integer iuser[], double ruser[], const Charlen length_flag)
1: $\mathbf{flag}$ Character(1) Input
On entry: the value of flag indicates the stage of the stepwise selection of explanatory variables.
${\mathbf{flag}}=\text{'A'}$
Variable var was added to the current model.
${\mathbf{flag}}=\text{'B'}$
Beginning the backward elimination step.
${\mathbf{flag}}=\text{'C'}$
Variable var failed the collinearity test and is excluded from the model.
${\mathbf{flag}}=\text{'D'}$
Variable var was dropped from the current model.
${\mathbf{flag}}=\text{'F'}$
Beginning the forward selection step.
${\mathbf{flag}}=\text{'K'}$
Backward elimination did not remove any variables from the current model.
${\mathbf{flag}}=\text{'S'}$
Starting stepwise selection procedure.
${\mathbf{flag}}=\text{'V'}$
The variance ratio for variable var takes the value val.
${\mathbf{flag}}=\text{'X'}$
Finished stepwise selection procedure.
2: $\mathbf{var}$ Integer Input
On entry: the index of the explanatory variable in the design matrix $Z$ to which flag pertains.
3: $\mathbf{val}$ Real (Kind=nag_wp) Input
On entry: if ${\mathbf{flag}}=\text{'V'}$, val is the variance ratio value for the coefficient associated with explanatory variable index var.
4: $\mathbf{iuser}\left(*\right)$ Integer array User Workspace
5: $\mathbf{ruser}\left(*\right)$ Real (Kind=nag_wp) array User Workspace
monfun is called with the arguments iuser and ruser as supplied to g02eff. You should use the arrays iuser and ruser to supply information to monfun.
monfun must either be a module subprogram USEd by, or declared as EXTERNAL in, the (sub)program from which g02eff is called. Arguments denoted as Input must not be changed by this procedure.
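A user-written monitor might use iuser and ruser to accumulate a summary of the selection process instead of printing each step. The sketch below follows the C prototype given in the specification, with loud assumptions: `Integer` and `Charlen` are stand-in typedefs so that the fragment compiles on its own (in a real program they, and the `NAG_CALL` qualifier omitted here, come from `nag.h`), the name `count_monitor` is hypothetical, and the caller is assumed to have zeroed `iuser[0]`, `iuser[1]` and `ruser[0]` before calling g02eff.

```c
/* Stand-ins for the NAG types; in real code include <nag.h> instead. */
typedef int Integer;
typedef int Charlen;

/* Hypothetical monitor: counts additions and drops, and records the
 * largest variance ratio reported.
 *   iuser[0]: number of 'A' (variable added) events
 *   iuser[1]: number of 'D' (variable dropped) events
 *   ruser[0]: largest val seen with flag 'V' */
void count_monitor(const char *flag, const Integer *var, const double *val,
                   Integer iuser[], double ruser[], const Charlen length_flag)
{
    (void)var;          /* not needed for this summary */
    (void)length_flag;  /* flag is a single character */

    switch (flag[0]) {
    case 'A':
        iuser[0] += 1;
        break;
    case 'D':
        iuser[1] += 1;
        break;
    case 'V':
        if (*val > ruser[0]) {
            ruser[0] = *val;
        }
        break;
    default:
        break;          /* other stages are ignored */
    }
}
```

After the call to g02eff returns, the caller can read the counts and the maximum ratio back from the same iuser and ruser arrays that were passed in.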
17: $\mathbf{iuser}\left(*\right)$ Integer array User Workspace
18: $\mathbf{ruser}\left(*\right)$ Real (Kind=nag_wp) array User Workspace
iuser and ruser are not used by g02eff, but are passed directly to monfun and may be used to pass information to this routine.
19: $\mathbf{ifail}$ Integer Input/Output
On entry: ifail must be set to $0$, $-1$ or $1$ to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of $0$ causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of $-1$ means that an error message is printed while a value of $1$ means that it is not.
If halting is not appropriate, the value $-1$ or $1$ is recommended. If message printing is undesirable, then the value $1$ is recommended. Otherwise, the value $0$ is recommended. When the value $-\mathbf{1}$ or $\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit: ${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see Section 6).

## 6 Error Indicators and Warnings

If on entry ${\mathbf{ifail}}=0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
${\mathbf{ifail}}=1$
On entry, ${\mathbf{fin}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{fin}}>0.0$.
On entry, ${\mathbf{fout}}=⟨\mathit{\text{value}}⟩$; ${\mathbf{fin}}=⟨\mathit{\text{value}}⟩$.
Constraint: $0.0\le {\mathbf{fout}}\le {\mathbf{fin}}$.
On entry, ${\mathbf{m}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{m}}>1$.
On entry, ${\mathbf{monlev}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{monlev}}=0$ or $1$.
On entry, ${\mathbf{n}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{n}}>1$.
On entry, ${\mathbf{sw}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{sw}}>1.0$.
On entry, ${\mathbf{tau}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{tau}}>0.0$.
${\mathbf{ifail}}=2$
On entry, ${\mathbf{isx}}\left(⟨\mathit{\text{value}}⟩\right)=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{isx}}\left(\mathit{j}\right)=-1$, $1$ or $2$, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
On entry, ${\mathbf{isx}}\left(i\right)\ne 1$, for all $i=1,2,\dots ,{\mathbf{m}}$.
Constraint: there must be at least one free variable.
On entry at least one diagonal element of ${\mathbf{c}}\le 0.0$.
Constraint: c must be positive definite.
${\mathbf{ifail}}=3$
The design and data matrix $Z$ is not positive definite, results may be inaccurate. All output is returned as documented.
${\mathbf{ifail}}=4$
All variables are collinear, no model to select.
${\mathbf{ifail}}=-99$
See Section 7 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.

## 7 Accuracy

g02eff returns a warning if the design and data matrix is not positive definite.

## 8 Parallelism and Performance

g02eff is not threaded in any implementation.

## 9 Further Comments

Although the condition for removing or adding a variable to the current model is based on a ratio of variances, these values should not be interpreted as $F$-statistics with the usual interpretation of significance unless the probability levels are adjusted to account for correlations between variables under consideration and the number of possible updates (see, e.g., Draper and Smith (1985)).
g02eff allocates internally $\mathcal{O}\left(4×{\mathbf{m}}+\left({\mathbf{m}}+1\right)×\left({\mathbf{m}}+2\right)/2+2\right)$ elements of real storage.

## 10 Example

This example calculates a full stepwise model selection for the Hald data described in Dempster (1969). Means, the upper-triangular variance-covariance matrix and the sum of weights are calculated by g02buf. The NAG defined default monitor function g02efh is used to print information at each step of the model selection process.

### 10.1 Program Text

Program Text (g02effe.f90)

### 10.2 Program Data

Program Data (g02effe.d)

### 10.3 Program Results

Program Results (g02effe.r)