NAG Library Routine Document
g02eef (linregm_fit_onestep)
1
Purpose
g02eef carries out one step of a forward selection procedure in order to enable the ‘best’ linear regression model to be found.
2
Specification
Fortran Interface
Subroutine g02eef (istep, mean, weight, n, m, x, ldx, vname, isx, maxip, y, wt, fin, addvar, newvar, chrss, f, model, nterm, rss, idf, ifr, free, exss, q, ldq, p, wk, ifail)
Integer, Intent (In) :: n, m, ldx, isx(m), maxip, ldq
Integer, Intent (Inout) :: istep, nterm, idf, ifr, ifail
Real (Kind=nag_wp), Intent (In) :: x(ldx,m), y(n), wt(*), fin
Real (Kind=nag_wp), Intent (Inout) :: rss, q(ldq,maxip+2), p(maxip+1)
Real (Kind=nag_wp), Intent (Out) :: chrss, f, exss(maxip), wk(2*maxip)
Logical, Intent (Out) :: addvar
Character (*), Intent (In) :: vname(m)
Character (*), Intent (Inout) :: model(maxip), free(maxip)
Character (*), Intent (Out) :: newvar
Character (1), Intent (In) :: mean, weight

C Header Interface
#include <nagmk26.h>
void 
g02eef_ (Integer *istep, const char *mean, const char *weight, const Integer *n, const Integer *m, const double x[], const Integer *ldx, const char vname[], const Integer isx[], const Integer *maxip, const double y[], const double wt[], const double *fin, logical *addvar, char *newvar, double *chrss, double *f, char model[], Integer *nterm, double *rss, Integer *idf, Integer *ifr, char free[], double exss[], double q[], const Integer *ldq, double p[], double wk[], Integer *ifail, const Charlen length_mean, const Charlen length_weight, const Charlen length_vname, const Charlen length_newvar, const Charlen length_model, const Charlen length_free) 

3
Description
One method of selecting a linear regression model from a given set of independent variables is by forward selection. The following procedure is used:
(i) 
Select the best fitting independent variable, i.e., the independent variable which gives the smallest residual sum of squares. If the $F$-test statistic for this variable is greater than a chosen critical value, ${F}_{\mathrm{c}}$, then include the variable in the model; otherwise stop. 
(ii) 
Find the independent variable that leads to the greatest reduction in the residual sum of squares when added to the current model. 
(iii) 
If the $F$-test statistic for this variable is greater than a chosen critical value, ${F}_{\mathrm{c}}$, then include the variable in the model and go to (ii); otherwise stop. 
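The three steps above can be sketched in plain Python. This is an illustrative sketch only, not the routine's implementation: it assumes a mean term is always fitted, that there are no forced variables or weights, that no column is a linear combination of others, and that the test statistic is the usual partial $F$ ratio. The names `dot`, `sweep` and `forward_select` are hypothetical. Orthogonalizing the remaining columns against each selected variable plays a role loosely analogous to updating ${Q}^{\mathrm{T}}{X}_{\mathrm{free}}$.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))


def sweep(target, pivot):
    """Return target with its component along pivot removed."""
    coef = dot(pivot, target) / dot(pivot, pivot)
    return [t - coef * p for t, p in zip(target, pivot)]


def forward_select(columns, y, f_crit):
    """Forward selection with a mean term (illustrative assumptions only).

    columns: dict mapping a variable name to its list of observations.
    Returns (model, rss, idf), where model is a list of (name, F) pairs.
    """
    n = len(y)
    ones = [1.0] * n
    # Fit the mean first: centre y and every candidate column.
    resid = sweep(y, ones)
    free = {name: sweep(col, ones) for name, col in columns.items()}
    rss, idf = dot(resid, resid), n - 1
    model = []
    while free and idf > 1:
        # Extra sum of squares each free variable would contribute.
        exss = {name: dot(col, resid) ** 2 / dot(col, col)
                for name, col in free.items()}
        best = max(exss, key=exss.get)
        new_rss = rss - exss[best]
        # Assumed partial F ratio for adding a single term.
        f_stat = exss[best] / (new_rss / (idf - 1))
        if f_stat <= f_crit:
            break  # no variable exceeds the critical value: stop
        pivot = free.pop(best)
        resid = sweep(resid, pivot)
        free = {name: sweep(col, pivot) for name, col in free.items()}
        rss, idf = new_rss, idf - 1
        model.append((best, f_stat))
    return model, rss, idf
```

Calling `forward_select({"x1": [...], "x2": [...]}, y, 2.0)` returns the selected names in order of entry, each with the $F$ value at which it entered, together with the final residual sum of squares and its degrees of freedom.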
At any step the variables not in the model are known as the free terms.
g02eef allows you to specify some independent variables that must be in the model; these are known as forced variables.
The computational procedure involves the use of $QR$ decompositions, the $R$ and the $Q$ matrices being updated as each new variable is added to the model. In addition the matrix ${Q}^{\mathrm{T}}{X}_{\mathrm{free}}$, where ${X}_{\mathrm{free}}$ is the matrix of variables not included in the model, is updated.
g02eef computes one step of the forward selection procedure at a call. The results produced at each step may be printed or used as inputs to
g02ddf, in order to compute the regression coefficients for the model fitted at that step. Repeated calls to
g02eef should be made until
$F<{F}_{\mathrm{c}}$ is indicated.
4
References
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
Weisberg S (1985) Applied Linear Regression Wiley
5
Arguments
Note: after the initial call to
g02eef with
${\mathbf{istep}}=0$ all arguments except
fin must not be changed by you between calls.
 1: $\mathbf{istep}$ – Integer, Input/Output

On entry: indicates which step in the forward selection process is to be carried out.
 ${\mathbf{istep}}=0$
 The process is initialized.
Constraint:
${\mathbf{istep}}\ge 0$.
On exit: is incremented by $1$.
 2: $\mathbf{mean}$ – Character(1), Input

On entry: indicates if a mean term is to be included.
 ${\mathbf{mean}}=\text{'M'}$
 A mean term, intercept, will be included in the model.
 ${\mathbf{mean}}=\text{'Z'}$
 The model will pass through the origin, zero-point.
Constraint:
${\mathbf{mean}}=\text{'M'}$ or $\text{'Z'}$.
 3: $\mathbf{weight}$ – Character(1), Input

On entry: indicates if weights are to be used.
 ${\mathbf{weight}}=\text{'U'}$
 Least squares estimation is used.
 ${\mathbf{weight}}=\text{'W'}$
 Weighted least squares is used and weights must be supplied in array wt.
Constraint:
${\mathbf{weight}}=\text{'U'}$ or $\text{'W'}$.
 4: $\mathbf{n}$ – Integer, Input

On entry: $n$, the number of observations.
Constraint:
${\mathbf{n}}\ge 2$.
 5: $\mathbf{m}$ – Integer, Input

On entry: $m$, the total number of independent variables in the dataset.
Constraint:
${\mathbf{m}}\ge 1$.
 6: $\mathbf{x}\left({\mathbf{ldx}},{\mathbf{m}}\right)$ – Real (Kind=nag_wp) array, Input

On entry: ${\mathbf{x}}\left(\mathit{i},\mathit{j}\right)$ must contain the $\mathit{i}$th observation for the $\mathit{j}$th independent variable, for $\mathit{i}=1,2,\dots ,{\mathbf{n}}$ and $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
 7: $\mathbf{ldx}$ – Integer, Input

On entry: the first dimension of the array
x as declared in the (sub)program from which
g02eef is called.
Constraint:
${\mathbf{ldx}}\ge {\mathbf{n}}$.
 8: $\mathbf{vname}\left({\mathbf{m}}\right)$ – Character(*) array, Input

On entry:
${\mathbf{vname}}\left(\mathit{j}\right)$ must contain the name of the independent variable in column
$\mathit{j}$ of
x, for
$\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
 9: $\mathbf{isx}\left({\mathbf{m}}\right)$ – Integer array, Input

On entry: indicates which independent variables could be considered for inclusion in the regression.
 ${\mathbf{isx}}\left(j\right)\ge 2$
 The variable contained in the
$\mathit{j}$th column of x is automatically included in the regression model, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
 ${\mathbf{isx}}\left(j\right)=1$
 The variable contained in the
$\mathit{j}$th column of x is considered for inclusion in the regression model, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
 ${\mathbf{isx}}\left(j\right)=0$
 The variable in the
$\mathit{j}$th column is not considered for inclusion in the model, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
Constraint:
${\mathbf{isx}}\left(\mathit{j}\right)\ge 0$ and at least one value of ${\mathbf{isx}}\left(\mathit{j}\right)=1$, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
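As an illustration of how the three kinds of isx value partition the variables, the following Python sketch sorts names into forced, free and excluded groups and applies the stated constraint. The helper name `classify_variables` is hypothetical and is not part of the NAG Library; the variable names in the usage note are made up.

```python
def classify_variables(vname, isx):
    """Split variable names into (forced, free, excluded) lists using isx.

    isx >= 2: forced into the model; isx == 1: candidate for selection;
    isx == 0: ignored. Raises ValueError if the documented constraint fails.
    """
    if len(vname) != len(isx):
        raise ValueError("vname and isx must have the same length")
    if any(flag < 0 for flag in isx):
        raise ValueError("every isx value must be >= 0")
    forced = [name for name, flag in zip(vname, isx) if flag >= 2]
    free = [name for name, flag in zip(vname, isx) if flag == 1]
    excluded = [name for name, flag in zip(vname, isx) if flag == 0]
    if not free:
        raise ValueError("at least one isx value must equal 1")
    return forced, free, excluded
```

For example, `classify_variables(["A", "B", "C", "D"], [2, 1, 0, 1])` treats `A` as forced, `B` and `D` as free, and `C` as excluded.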
 10: $\mathbf{maxip}$ – Integer, Input

On entry: the maximum number of independent variables to be included in the model.
Constraints:
 if ${\mathbf{mean}}=\text{'M'}$, ${\mathbf{maxip}}\ge 1+$ number of values of ${\mathbf{isx}}>0$;
 if ${\mathbf{mean}}=\text{'Z'}$, ${\mathbf{maxip}}\ge$ number of values of ${\mathbf{isx}}>0$.
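The smallest value of maxip that satisfies these constraints can be computed directly from mean and isx. The Python helper below is a hypothetical convenience, not a NAG routine; it simply restates the two constraints.

```python
def min_maxip(mean, isx):
    """Smallest maxip allowed by the documented constraints."""
    if mean not in ("M", "Z"):
        raise ValueError("mean must be 'M' or 'Z'")
    # Count forced (isx >= 2) and free (isx == 1) variables together.
    terms = sum(1 for flag in isx if flag > 0)
    # A mean term occupies one extra slot.
    return terms + 1 if mean == "M" else terms
```

With `isx = [2, 1, 0, 1]`, three variables have isx greater than zero, so the minimum is 4 when a mean term is fitted and 3 otherwise.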
 11: $\mathbf{y}\left({\mathbf{n}}\right)$ – Real (Kind=nag_wp) array, Input

On entry: the dependent variable.
 12: $\mathbf{wt}\left(*\right)$ – Real (Kind=nag_wp) array, Input

Note: the dimension of the array
wt
must be at least
${\mathbf{n}}$ if
${\mathbf{weight}}=\text{'W'}$.
On entry: if
${\mathbf{weight}}=\text{'W'}$,
wt must contain the weights to be used in the weighted regression,
$W$.
If ${\mathbf{wt}}\left(i\right)=0.0$, the $i$th observation is not included in the model, in which case the effective number of observations is the number of observations with nonzero weights.
If
${\mathbf{weight}}=\text{'U'}$,
wt is not referenced and the effective number of observations is
n.
Constraint:
if ${\mathbf{weight}}=\text{'W'}$, ${\mathbf{wt}}\left(\mathit{i}\right)\ge 0.0$, for $\mathit{i}=1,2,\dots ,{\mathbf{n}}$.
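The rule for the effective number of observations can be sketched as follows. This is a hypothetical Python helper that mirrors the description above; the function name and the use of `ValueError` are assumptions made for illustration.

```python
def effective_observations(weight, n, wt=None):
    """Effective number of observations for the given weighting option."""
    if weight == "U":
        return n  # wt is not referenced
    if weight != "W":
        raise ValueError("weight must be 'U' or 'W'")
    if wt is None or len(wt) < n:
        raise ValueError("wt must hold at least n weights when weight = 'W'")
    if any(w < 0.0 for w in wt[:n]):
        raise ValueError("weights must be non-negative")
    # Observations with zero weight are dropped from the fit.
    return sum(1 for w in wt[:n] if w != 0.0)
```

For instance, with `n = 5` and weights `[1.0, 0.0, 2.0, 0.0, 0.5]`, three observations carry nonzero weight, so the effective number of observations is 3.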
 13: $\mathbf{fin}$ – Real (Kind=nag_wp), Input

On entry: the critical value of the $F$ statistic for the term to be included in the model, ${F}_{\mathrm{c}}$.
Suggested value:
$2.0$ is a commonly used value in exploratory modelling.
Constraint:
${\mathbf{fin}}\ge 0.0$.
 14: $\mathbf{addvar}$ – Logical, Output

On exit: indicates if a variable has been added to the model.
 ${\mathbf{addvar}}=\mathrm{.TRUE.}$
 A variable has been added to the model.
 ${\mathbf{addvar}}=\mathrm{.FALSE.}$
 No variable had an $F$ value greater than ${F}_{\mathrm{c}}$ and none were added to the model.
 15: $\mathbf{newvar}$ – Character(*), Output

On exit: if
${\mathbf{addvar}}=\mathrm{.TRUE.}$,
newvar contains the name of the variable added to the model.
Constraint:
the declared size of
newvar must be greater than or equal to the declared size of
vname.
 16: $\mathbf{chrss}$ – Real (Kind=nag_wp), Output

On exit: if
${\mathbf{addvar}}=\mathrm{.TRUE.}$,
chrss contains the change in the residual sum of squares due to adding variable
newvar.
 17: $\mathbf{f}$ – Real (Kind=nag_wp), Output

On exit: if
${\mathbf{addvar}}=\mathrm{.TRUE.}$,
f contains the
$F$ statistic for the inclusion of the variable in
newvar.
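For orientation, the statistic is presumably the usual partial $F$ ratio for adding a single term. This is an assumption based on standard forward selection and is not stated explicitly in this document. Writing rss and idf for the values returned on exit, that is, for the model that includes newvar:

$$F=\frac{\mathbf{chrss}}{\mathbf{rss}/\mathbf{idf}}$$

Under this assumption the variable is added when $F>{F}_{\mathrm{c}}$, where ${F}_{\mathrm{c}}$ is the value supplied in fin.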
 18: $\mathbf{model}\left({\mathbf{maxip}}\right)$ – Character(*) array, Input/Output

On entry: if
${\mathbf{istep}}=0$,
model need not be set.
If
${\mathbf{istep}}\ne 0$,
model must contain the values returned by the previous call to
g02eef.
Constraint:
the declared size of
model must be greater than or equal to the declared size of
vname.
On exit: the names of the variables in the current model.
 19: $\mathbf{nterm}$ – Integer, Input/Output

On entry: if
${\mathbf{istep}}=0$,
nterm need not be set.
If
${\mathbf{istep}}\ne 0$,
nterm must contain the value returned by the previous call to
g02eef.
Constraint:
if ${\mathbf{istep}}\ne 0$, ${\mathbf{nterm}}>0$.
On exit: the number of independent variables in the current model, not including the mean, if any.
 20: $\mathbf{rss}$ – Real (Kind=nag_wp), Input/Output

On entry: if
${\mathbf{istep}}=0$,
rss need not be set.
If
${\mathbf{istep}}\ne 0$,
rss must contain the value returned by the previous call to
g02eef.
Constraint:
if ${\mathbf{istep}}\ne 0$, ${\mathbf{rss}}>0.0$.
On exit: the residual sum of squares for the current model.
 21: $\mathbf{idf}$ – Integer, Input/Output

On entry: if
${\mathbf{istep}}=0$,
idf need not be set.
If
${\mathbf{istep}}\ne 0$,
idf must contain the value returned by the previous call to
g02eef.
On exit: the degrees of freedom for the residual sum of squares for the current model.
 22: $\mathbf{ifr}$ – Integer, Input/Output

On entry: if
${\mathbf{istep}}=0$,
ifr need not be set.
If
${\mathbf{istep}}\ne 0$,
ifr must contain the value returned by the previous call to
g02eef.
On exit: the number of free independent variables, i.e., the number of variables not in the model that are still being considered for selection.
 23: $\mathbf{free}\left({\mathbf{maxip}}\right)$ – Character(*) array, Input/Output

On entry: if
${\mathbf{istep}}=0$,
free need not be set.
If
${\mathbf{istep}}\ne 0$,
free must contain the values returned by the previous call to
g02eef.
Constraint:
the declared size of
free must be greater than or equal to the declared size of
vname.
On exit: the first
ifr values of
free contain the names of the free variables.
 24: $\mathbf{exss}\left({\mathbf{maxip}}\right)$ – Real (Kind=nag_wp) array, Output

On exit: the first
ifr values of
exss contain what would be the change in regression sum of squares if the free variables had been added to the model, i.e., the extra sum of squares for the free variables.
${\mathbf{exss}}\left(i\right)$ contains what would be the change in regression sum of squares if the variable
${\mathbf{free}}\left(i\right)$ had been added to the model.
 25: $\mathbf{q}\left({\mathbf{ldq}},{\mathbf{maxip}}+2\right)$ – Real (Kind=nag_wp) array, Input/Output

On entry: if
${\mathbf{istep}}=0$,
q need not be set.
If
${\mathbf{istep}}\ne 0$,
q must contain the values returned by the previous call to
g02eef.
On exit: the results of the
$QR$ decomposition for the current model:
 the first column of q contains $c={Q}^{\mathrm{T}}y$ (or ${Q}^{\mathrm{T}}{W}^{\frac{1}{2}}y$ where $W$ is the vector of weights if used);
 the upper triangular part of columns $2$ to $p+1$ contains the $R$ matrix;
 the strictly lower triangular part of columns $2$ to $p+1$ contains details of the $Q$ matrix;
 the remaining $p+1$ to $p+{\mathbf{ifr}}$ columns of q contain ${Q}^{\mathrm{T}}{X}_{\mathit{free}}$ (or ${Q}^{\mathrm{T}}{W}^{\frac{1}{2}}{X}_{\mathit{free}}$),
where
$p={\mathbf{nterm}}$, or
$p={\mathbf{nterm}}+1$ if
${\mathbf{mean}}=\text{'M'}$.
 26: $\mathbf{ldq}$ – Integer, Input

On entry: the first dimension of the array
q as declared in the (sub)program from which
g02eef is called.
Constraint:
${\mathbf{ldq}}\ge {\mathbf{n}}$.
 27: $\mathbf{p}\left({\mathbf{maxip}}+1\right)$ – Real (Kind=nag_wp) array, Input/Output

On entry: if
${\mathbf{istep}}=0$,
p need not be set.
If
${\mathbf{istep}}\ne 0$,
p must contain the values returned by the previous call to
g02eef.
On exit: the first
$p$ elements of
p contain details of the
$QR$ decomposition, where
$p={\mathbf{nterm}}$, or
$p={\mathbf{nterm}}+1$ if
${\mathbf{mean}}=\text{'M'}$.
 28: $\mathbf{wk}\left(2\times {\mathbf{maxip}}\right)$ – Real (Kind=nag_wp) array, Workspace

 29: $\mathbf{ifail}$ – Integer, Input/Output

On entry:
ifail must be set to
$0$,
$-1\text{ or }1$. If you are unfamiliar with this argument you should refer to
Section 3.4 in How to Use the NAG Library and its Documentation for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value
$-1\text{ or }1$ is recommended. If the output of error messages is undesirable, then the value
$1$ is recommended. Otherwise, if you are not familiar with this argument, the recommended value is
$0$.
When the value $-\mathbf{1}\text{ or }\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit:
${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see
Section 6).
6
Error Indicators and Warnings
If on entry
${\mathbf{ifail}}=0$ or
$-1$, explanatory error messages are output on the current error message unit (as defined by
x04aaf).
Errors or warnings detected by the routine:
 ${\mathbf{ifail}}=1$

On entry, ${\mathbf{fin}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{fin}}\ge 0.0$.
On entry, ${\mathbf{istep}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{istep}}\ge 0$.
On entry, ${\mathbf{istep}}=\langle \mathit{\text{value}}\rangle$ and ${\mathbf{nterm}}=\langle \mathit{\text{value}}\rangle$.
Constraint: if ${\mathbf{istep}}\ne 0$, ${\mathbf{nterm}}>0$.
On entry, ${\mathbf{ldq}}=\langle \mathit{\text{value}}\rangle$ and ${\mathbf{n}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{ldq}}\ge {\mathbf{n}}$.
On entry, ${\mathbf{ldx}}=\langle \mathit{\text{value}}\rangle$ and ${\mathbf{n}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{ldx}}\ge {\mathbf{n}}$.
On entry, ${\mathbf{m}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{m}}\ge 1$.
On entry, ${\mathbf{mean}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{mean}}=\text{'M'}$ or $\text{'Z'}$.
On entry, ${\mathbf{n}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{n}}\ge 2$.
On entry, ${\mathbf{rss}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{rss}}>0.0$.
On entry, ${\mathbf{weight}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{weight}}=\text{'W'}$ or $\text{'U'}$.
 ${\mathbf{ifail}}=2$

On entry, ${\mathbf{wt}}\left(\langle \mathit{\text{value}}\rangle \right)<0.0$.
Constraint: ${\mathbf{wt}}\left(i\right)\ge 0.0$, for $i=1,2,\dots ,n$.
 ${\mathbf{ifail}}=3$

Degrees of freedom for error will equal $0$ if a new variable is added, i.e., the number of variables in the model plus $1$ is equal to the effective number of observations.
On entry, number of forced variables $\ge {\mathbf{n}}$.
 ${\mathbf{ifail}}=4$

On entry, ${\mathbf{isx}}\left(\langle \mathit{\text{value}}\rangle \right)<0$.
Constraint: ${\mathbf{isx}}\left(i\right)\ge 0$, for $i=1,2,\dots ,{\mathbf{m}}$.
On entry,
${\mathbf{isx}}\left(i\right)=0$, for all
$i=1,2,\dots ,{\mathbf{m}}$.
Constraint: at least one value of
isx must be nonzero.
On entry,
${\mathbf{maxip}}=\langle \mathit{\text{value}}\rangle$.
Constraint:
maxip must be large enough to accommodate the number of terms given by
isx.
 ${\mathbf{ifail}}=5$

On entry, the variables forced into the model are not of full rank, i.e., some of these variables are linear combinations of others.
 ${\mathbf{ifail}}=6$

There are no free variables, i.e., no element of ${\mathbf{isx}}=1$.
 ${\mathbf{ifail}}=7$

The value of the change in the sum of squares is greater than the input value of
rss. This may occur due to rounding errors if the true residual sum of squares for the new model is small relative to the residual sum of squares for the previous model.
 ${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please
contact
NAG.
See
Section 3.9 in How to Use the NAG Library and its Documentation for further information.
 ${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See
Section 3.8 in How to Use the NAG Library and its Documentation for further information.
 ${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See
Section 3.7 in How to Use the NAG Library and its Documentation for further information.
7
Accuracy
As g02eef uses a $QR$ transformation, the results will often be more accurate than those from traditional algorithms based on the cross-products of the dependent and independent variables.
8
Parallelism and Performance
g02eef is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
g02eef makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.
Please consult the
X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the
Users' Note for your implementation for any additional implementation-specific information.
9
Further Comments
None.
10
Example
The data, from an oxygen uptake experiment, is given by
Weisberg (1985). The names of the variables are as given in
Weisberg (1985). The independent and dependent variables are read and
g02eef is repeatedly called until
${\mathbf{addvar}}=\mathrm{.FALSE.}$. At each step the
$F$ statistic, the free variables and their extra sum of squares are printed; also, except for when
${\mathbf{addvar}}=\mathrm{.FALSE.}$, the new variable, the change in the residual sum of squares and the terms in the model are printed.
10.1
Program Text
Program Text (g02eefe.f90)
10.2
Program Data
Program Data (g02eefe.d)
10.3
Program Results
Program Results (g02eefe.r)