NAG Toolbox: nag_correg_linregm_fit_onestep (g02ee)

Purpose

nag_correg_linregm_fit_onestep (g02ee) carries out one step of a forward selection procedure in order to enable the ‘best’ linear regression model to be found.

Syntax

[istep, addvar, newvar, chrss, f, model, nterm, rss, idf, ifr, free, exss, q, p, ifail] = g02ee(istep, mean_p, x, vname, isx, y, model, nterm, rss, idf, ifr, free, q, p, 'n', n, 'm', m, 'maxip', maxip, 'wt', wt, 'fin', fin)
[istep, addvar, newvar, chrss, f, model, nterm, rss, idf, ifr, free, exss, q, p, ifail] = nag_correg_linregm_fit_onestep(istep, mean_p, x, vname, isx, y, model, nterm, rss, idf, ifr, free, q, p, 'n', n, 'm', m, 'maxip', maxip, 'wt', wt, 'fin', fin)

Description

One method of selecting a linear regression model from a given set of independent variables is by forward selection. The following procedure is used:
(i) Select the best fitting independent variable, i.e., the independent variable which gives the smallest residual sum of squares. If the $F$-test for this variable is greater than a chosen critical value, ${F}_{\mathrm{c}}$, then include the variable in the model, else stop.
(ii) Find the independent variable that leads to the greatest reduction in the residual sum of squares when added to the current model.
(iii) If the $F$-test for this variable is greater than a chosen critical value, ${F}_{\mathrm{c}}$, then include the variable in the model and go to (ii), otherwise stop.
At any step the variables not in the model are known as the free terms.
nag_correg_linregm_fit_onestep (g02ee) allows you to specify some independent variables that must be in the model; these are known as forced variables.
The computational procedure involves the use of $QR$ decompositions, the $R$ and the $Q$ matrices being updated as each new variable is added to the model. In addition the matrix ${Q}^{\mathrm{T}}{X}_{\mathrm{free}}$, where ${X}_{\mathrm{free}}$ is the matrix of variables not included in the model, is updated.
nag_correg_linregm_fit_onestep (g02ee) computes one step of the forward selection procedure at a call. The results produced at each step may be printed or used as inputs to nag_correg_linregm_update (g02dd), in order to compute the regression coefficients for the model fitted at that step. Repeated calls to nag_correg_linregm_fit_onestep (g02ee) should be made until $F<{F}_{\mathrm{c}}$ is indicated.
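
The procedure in steps (i) to (iii) can be sketched in plain MATLAB, without calling the NAG routine. This is only an illustrative sketch under several assumptions: the data are synthetic, the variable names (`selected`, `free`, `Fc`, and so on) are hypothetical, a mean term is always included, there are no forced variables, and the $QR$ decomposition is recomputed at every step rather than updated as the routine is described as doing.

```matlab
% Sketch of forward selection with an F-to-enter test (not the NAG routine).
% Synthetic, illustrative data: y depends on x1 only.
x1 = (1:8)';
x2 = [1 -1 1 -1 1 -1 1 -1]';
x3 = [1 1 -1 -1 1 1 -1 -1]';
X  = [x1 x2 x3];
y  = 2 + 3*x1 + 0.1*[1 -1 -1 1 -1 1 1 -1]';
Fc = 2.0;                          % critical value F_c

n = size(X,1);
selected = [];                     % indices of variables in the model
free = 1:size(X,2);                % indices of the free terms

while ~isempty(free)
  A = [ones(n,1) X(:,selected)];   % current model, including the mean
  [Q,~] = qr(A,0);                 % economy-size QR (recomputed for simplicity)
  ry  = y - Q*(Q'*y);              % residuals of y for the current model
  rss = ry'*ry;                    % residual sum of squares
  Xf  = X(:,free);
  Rx  = Xf - Q*(Q'*Xf);            % free variables orthogonalised against the model
  exss = (Rx'*ry).^2 ./ sum(Rx.^2,1)';   % extra sum of squares per free variable
  [best,k] = max(exss);            % greatest reduction in rss
  df = n - size(A,2) - 1;          % residual d.f. if the variable is added
  F  = best / ((rss - best)/df);   % F statistic for the best candidate
  if ~(F > Fc)
    break;                         % no candidate exceeds F_c: stop
  end
  selected(end+1) = free(k);       % add the variable to the model
  free(k) = [];                    % it is no longer a free term
end

disp(selected);
```

Recomputing the factorization keeps the sketch short; updating $R$, $Q$ and ${Q}^{\mathrm{T}}{X}_{\mathrm{free}}$ in place, as described above, avoids that repeated work.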

References

Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
Weisberg S (1985) Applied Linear Regression Wiley

Parameters

Note:  after the initial call to nag_correg_linregm_fit_onestep (g02ee) with ${\mathbf{istep}}=0$ all arguments except fin must not be changed by you between calls.

Compulsory Input Parameters

1:     $\mathrm{istep}$ – int64, int32 or nag_int scalar
Indicates which step in the forward selection process is to be carried out.
${\mathbf{istep}}=0$
The process is initialized.
Constraint: ${\mathbf{istep}}\ge 0$.
2:     $\mathrm{mean\_p}$ – string (length ≥ 1)
Indicates if a mean term is to be included.
${\mathbf{mean\_p}}=\text{'M'}$
A mean term (intercept) will be included in the model.
${\mathbf{mean\_p}}=\text{'Z'}$
The model will pass through the origin (zero-point).
Constraint: ${\mathbf{mean\_p}}=\text{'M'}$ or $\text{'Z'}$.
3:     $\mathrm{x}\left(\mathit{ldx},{\mathbf{m}}\right)$ – double array
ldx, the first dimension of the array, must satisfy the constraint $\mathit{ldx}\ge {\mathbf{n}}$.
${\mathbf{x}}\left(\mathit{i},\mathit{j}\right)$ must contain the $\mathit{i}$th observation for the $\mathit{j}$th independent variable, for $\mathit{i}=1,2,\dots ,{\mathbf{n}}$ and $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
4:     $\mathrm{vname}\left({\mathbf{m}}\right)$ – cell array of strings
${\mathbf{vname}}\left(\mathit{j}\right)$ must contain the name of the independent variable in column $\mathit{j}$ of x, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
5:     $\mathrm{isx}\left({\mathbf{m}}\right)$ – int64, int32 or nag_int array
Indicates which independent variables could be considered for inclusion in the regression.
${\mathbf{isx}}\left(j\right)\ge 2$
The variable contained in the $\mathit{j}$th column of x is automatically included in the regression model, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
${\mathbf{isx}}\left(j\right)=1$
The variable contained in the $\mathit{j}$th column of x is considered for inclusion in the regression model, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
${\mathbf{isx}}\left(j\right)=0$
The variable in the $\mathit{j}$th column is not considered for inclusion in the model, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
Constraint: ${\mathbf{isx}}\left(\mathit{j}\right)\ge 0$ and at least one value of ${\mathbf{isx}}\left(\mathit{j}\right)=1$, for $\mathit{j}=1,2,\dots ,{\mathbf{m}}$.
6:     $\mathrm{y}\left({\mathbf{n}}\right)$ – double array
The dependent variable.
7:     $\mathrm{model}\left({\mathbf{maxip}}\right)$ – cell array of strings
If ${\mathbf{istep}}=0$, model need not be set.
If ${\mathbf{istep}}\ne 0$, model must contain the values returned by the previous call to nag_correg_linregm_fit_onestep (g02ee).
Constraint: the declared size of model must be greater than or equal to the declared size of vname.
8:     $\mathrm{nterm}$ – int64, int32 or nag_int scalar
If ${\mathbf{istep}}=0$, nterm need not be set.
If ${\mathbf{istep}}\ne 0$, nterm must contain the value returned by the previous call to nag_correg_linregm_fit_onestep (g02ee).
Constraint: if ${\mathbf{istep}}\ne 0$, ${\mathbf{nterm}}>0$.
9:     $\mathrm{rss}$ – double scalar
If ${\mathbf{istep}}=0$, rss need not be set.
If ${\mathbf{istep}}\ne 0$, rss must contain the value returned by the previous call to nag_correg_linregm_fit_onestep (g02ee).
Constraint: if ${\mathbf{istep}}\ne 0$, ${\mathbf{rss}}>0.0$.
10:   $\mathrm{idf}$ – int64, int32 or nag_int scalar
If ${\mathbf{istep}}=0$, idf need not be set.
If ${\mathbf{istep}}\ne 0$, idf must contain the value returned by the previous call to nag_correg_linregm_fit_onestep (g02ee).
11:   $\mathrm{ifr}$ – int64, int32 or nag_int scalar
If ${\mathbf{istep}}=0$, ifr need not be set.
If ${\mathbf{istep}}\ne 0$, ifr must contain the value returned by the previous call to nag_correg_linregm_fit_onestep (g02ee).
12:   $\mathrm{free}\left({\mathbf{maxip}}\right)$ – cell array of strings
If ${\mathbf{istep}}=0$, free need not be set.
If ${\mathbf{istep}}\ne 0$, free must contain the values returned by the previous call to nag_correg_linregm_fit_onestep (g02ee).
Constraint: the declared size of free must be greater than or equal to the declared size of vname.
13:   $\mathrm{q}\left(\mathit{ldq},{\mathbf{maxip}}+2\right)$ – double array
ldq, the first dimension of the array, must satisfy the constraint $\mathit{ldq}\ge {\mathbf{n}}$.
If ${\mathbf{istep}}=0$, q need not be set.
If ${\mathbf{istep}}\ne 0$, q must contain the values returned by the previous call to nag_correg_linregm_fit_onestep (g02ee).
14:   $\mathrm{p}\left({\mathbf{maxip}}+1\right)$ – double array
If ${\mathbf{istep}}=0$, p need not be set.
If ${\mathbf{istep}}\ne 0$, p must contain the values returned by the previous call to nag_correg_linregm_fit_onestep (g02ee).

Optional Input Parameters

1:     $\mathrm{n}$ – int64, int32 or nag_int scalar
Default: the dimension of the array y and the first dimension of the arrays x, q. (An error is raised if these dimensions are not equal.)
$n$, the number of observations.
Constraint: ${\mathbf{n}}\ge 2$.
2:     $\mathrm{m}$ – int64, int32 or nag_int scalar
Default: the second dimension of the array x and the dimension of the arrays vname, isx. (An error is raised if these dimensions are not equal.)
$m$, the total number of independent variables in the dataset.
Constraint: ${\mathbf{m}}\ge 1$.
3:     $\mathrm{maxip}$ – int64, int32 or nag_int scalar
Default: the dimension of the arrays model, free. (An error is raised if these dimensions are not equal.)
The maximum number of independent variables to be included in the model.
Constraints:
• if ${\mathbf{mean\_p}}=\text{'M'}$, ${\mathbf{maxip}}\ge 1+\text{number of values of }{\mathbf{isx}}>0$;
• if ${\mathbf{mean\_p}}=\text{'Z'}$, ${\mathbf{maxip}}\ge \text{number of values of }{\mathbf{isx}}>0$.
4:     $\mathrm{wt}\left(:\right)$ – double array
The dimension of the array wt must be at least ${\mathbf{n}}$ if $\mathit{weight}=\text{'W'}$.
If $\mathit{weight}=\text{'W'}$, wt must contain the weights to be used in the weighted regression, $W$.
If ${\mathbf{wt}}\left(i\right)=0.0$, the $i$th observation is not included in the model, in which case the effective number of observations is the number of observations with nonzero weights.
If $\mathit{weight}=\text{'U'}$, wt is not referenced and the effective number of observations is n.
Constraint: if $\mathit{weight}=\text{'W'}$, ${\mathbf{wt}}\left(\mathit{i}\right)\ge 0.0$, for $\mathit{i}=1,2,\dots ,{\mathbf{n}}$.
5:     $\mathrm{fin}$ – double scalar
Default: $2.0$, a commonly used value in exploratory modelling.
The critical value of the $F$ statistic for the term to be included in the model, ${F}_{\mathrm{c}}$.
Constraint: ${\mathbf{fin}}\ge 0.0$.
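
As a quick check of the isx coding and the maxip constraints above, the following sketch counts forced and free variables and derives the smallest permitted maxip. The isx and mean_p values are those used in the Example section; the names `nforced`, `nfree` and `min_maxip` are hypothetical and used only for illustration.

```matlab
% isx coding from the Example: DAY excluded, four candidates, COD forced.
mean_p = 'M';
isx = int64([0 1 1 1 1 2]);

nforced = sum(isx >= 2);     % variables automatically included in the model
nfree   = sum(isx == 1);     % variables considered for inclusion

% Smallest maxip allowed: number of isx > 0, plus one if a mean term is used.
min_maxip = sum(isx > 0) + strcmp(mean_p, 'M');

fprintf('forced = %d, free = %d, minimum maxip = %d\n', ...
        nforced, nfree, min_maxip);   % forced = 1, free = 4, minimum maxip = 6
```

Running a check of this kind before the first call may help avoid the ${\mathbf{ifail}}=4$ condition caused by a maxip that is too small.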

Output Parameters

1:     $\mathrm{istep}$ – int64, int32 or nag_int scalar
Is incremented by $1$.
2:     $\mathrm{addvar}$ – logical scalar
Indicates if a variable has been added to the model.
${\mathbf{addvar}}=\mathit{true}$
A variable has been added to the model.
${\mathbf{addvar}}=\mathit{false}$
No variable had an $F$ value greater than ${F}_{\mathrm{c}}$ and none were added to the model.
3:     $\mathrm{newvar}$ – string
If ${\mathbf{addvar}}=\mathit{true}$, newvar contains the name of the variable added to the model.
4:     $\mathrm{chrss}$ – double scalar
If ${\mathbf{addvar}}=\mathit{true}$, chrss contains the change in the residual sum of squares due to adding variable newvar.
5:     $\mathrm{f}$ – double scalar
If ${\mathbf{addvar}}=\mathit{true}$, f contains the $F$ statistic for the inclusion of the variable in newvar.
6:     $\mathrm{model}\left({\mathbf{maxip}}\right)$ – cell array of strings
The names of the variables in the current model.
7:     $\mathrm{nterm}$ – int64, int32 or nag_int scalar
The number of independent variables in the current model, not including the mean, if any.
8:     $\mathrm{rss}$ – double scalar
The residual sum of squares for the current model.
9:     $\mathrm{idf}$ – int64, int32 or nag_int scalar
The degrees of freedom for the residual sum of squares for the current model.
10:   $\mathrm{ifr}$ – int64, int32 or nag_int scalar
The number of free independent variables, i.e., the number of variables not in the model that are still being considered for selection.
11:   $\mathrm{free}\left({\mathbf{maxip}}\right)$ – cell array of strings
The first ifr values of free contain the names of the free variables.
12:   $\mathrm{exss}\left({\mathbf{maxip}}\right)$ – double array
The first ifr values of exss contain what would be the change in regression sum of squares if the free variables had been added to the model, i.e., the extra sum of squares for the free variables. ${\mathbf{exss}}\left(i\right)$ contains what would be the change in regression sum of squares if the variable ${\mathbf{free}}\left(i\right)$ had been added to the model.
13:   $\mathrm{q}\left(\mathit{ldq},{\mathbf{maxip}}+2\right)$ – double array
The results of the $QR$ decomposition for the current model:
• the first column of q contains $c={Q}^{\mathrm{T}}y$ (or ${Q}^{\mathrm{T}}{W}^{\frac{1}{2}}y$ where $W$ is the vector of weights if used);
• the upper triangular part of columns $2$ to $p+1$ contains the $R$ matrix;
• the strictly lower triangular part of columns $2$ to $p+1$ contains details of the $Q$ matrix;
• the remaining $p+1$ to $p+{\mathbf{ifr}}$ columns of q contain ${Q}^{\mathrm{T}}{X}_{\mathit{free}}$ (or ${Q}^{\mathrm{T}}{W}^{\frac{1}{2}}{X}_{\mathit{free}}$),
where $p={\mathbf{nterm}}$, or $p={\mathbf{nterm}}+1$ if ${\mathbf{mean\_p}}=\text{'M'}$.
14:   $\mathrm{p}\left({\mathbf{maxip}}+1\right)$ – double array
The first $p$ elements of p contain details of the $QR$ decomposition, where $p={\mathbf{nterm}}$, or $p={\mathbf{nterm}}+1$ if ${\mathbf{mean\_p}}=\text{'M'}$.
15:   $\mathrm{ifail}$ – int64, int32 or nag_int scalar
${\mathbf{ifail}}={\mathbf{0}}$ unless the function detects an error (see Error Indicators and Warnings).
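
The outputs chrss, rss, idf and f are related. The source does not state how f is formed, so the sketch below assumes the usual $F$-to-enter form, chrss divided by rss/idf, where rss and idf are the values returned after the variable has been added. With the figures printed at Step 1 of the Example section, this assumed form reproduces the reported $F$ statistic. The name `f_check` is hypothetical.

```matlab
% Values copied from Step 1 of the example output in this document.
chrss = 4.7126e-01;   % change in residual sum of squares
rss   = 1.0850e+00;   % residual sum of squares for the current model
idf   = 17;           % degrees of freedom for rss

% Assumed form of the F statistic (not stated explicitly in the source).
f_check = chrss / (rss / idf);
fprintf('F = %7.2f\n', f_check);   % prints "F =    7.38"
```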

Error Indicators and Warnings

Errors or warnings detected by the function:
${\mathbf{ifail}}=1$
On entry, one of the following holds:
• ${\mathbf{n}}<1$;
• ${\mathbf{m}}<1$;
• $\mathit{ldx}<{\mathbf{n}}$;
• $\mathit{ldq}<{\mathbf{n}}$;
• ${\mathbf{istep}}<0$;
• ${\mathbf{istep}}\ne 0$ and ${\mathbf{nterm}}=0$;
• ${\mathbf{istep}}\ne 0$ and ${\mathbf{rss}}\le 0.0$;
• ${\mathbf{fin}}<0.0$;
• ${\mathbf{mean\_p}}\ne \text{'M'}$ or $\text{'Z'}$;
• $\mathit{weight}\ne \text{'U'}$ or $\text{'W'}$.
${\mathbf{ifail}}=2$
 On entry, $\mathit{weight}=\text{'W'}$ and a value of ${\mathbf{wt}}<0.0$.
${\mathbf{ifail}}=3$
On entry, the degrees of freedom will be zero if a variable is selected, i.e., the number of variables in the model plus $1$ is equal to the effective number of observations.
${\mathbf{ifail}}=4$
On entry, a value of ${\mathbf{isx}}<0$, or there are no forced or free variables, i.e., no element of ${\mathbf{isx}}>0$, or the value of maxip is too small for the number of variables indicated by isx.
${\mathbf{ifail}}=5$
On entry, the variables forced into the model are not of full rank, i.e., some of these variables are linear combinations of others.
${\mathbf{ifail}}=6$
On entry, there are no free variables, i.e., no element of ${\mathbf{isx}}=1$.
${\mathbf{ifail}}=7$
The value of the change in the sum of squares is greater than the input value of rss. This may occur due to rounding errors if the true residual sum of squares for the new model is small relative to the residual sum of squares for the previous model.
${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please contact NAG.
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.

Accuracy

As nag_correg_linregm_fit_onestep (g02ee) uses a $QR$ transformation, the results will often be more accurate than those of traditional algorithms based on the cross-products of the dependent and independent variables.
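
The accuracy argument can be illustrated with a small, deliberately ill-conditioned matrix. This is a generic sketch, not taken from the routine: forming the cross-product matrix squares the condition number, so information that $QR$ on the original matrix retains can be lost to rounding. The matrix and the names `XtX`, `rank_X` and `rank_XtX` are chosen only for illustration.

```matlab
% Two columns that are nearly, but not exactly, linearly dependent.
e = 1e-8;
X = [1 1; e 0; 0 e];

XtX = X' * X;            % cross-products: 1 + e^2 rounds to 1 in double precision
[~, R] = qr(X, 0);       % economy-size QR applied to X directly

rank_X   = rank(X);      % numerical rank of X
rank_XtX = rank(XtX);    % numerical rank of the cross-product matrix
fprintf('rank(X) = %d, rank(X''*X) = %d, |R(2,2)| = %.3e\n', ...
        rank_X, rank_XtX, abs(R(2,2)));
```

Here the cross-product matrix is numerically singular, whereas the second diagonal element of $R$ remains of order `e`, so the second column is still distinguishable in the $QR$ factors.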

Further Comments

None.

Example

The data, from an oxygen uptake experiment, is given by Weisberg (1985). The names of the variables are as given in Weisberg (1985). The independent and dependent variables are read and nag_correg_linregm_fit_onestep (g02ee) is repeatedly called until ${\mathbf{addvar}}=\mathit{false}$. At each step the $F$ statistic, the free variables and their extra sum of squares are printed; also, except for when ${\mathbf{addvar}}=\mathit{false}$, the new variable, the change in the residual sum of squares and the terms in the model are printed.
```
function g02ee_example

fprintf('g02ee example results\n\n');

% Independent variables: one row per observation, one column per variable
x = [  0, 1125, 232, 7160, 85.9, 8905;
       7,  920, 268, 8804, 86.5, 7388;
      15,  835, 271, 8108, 85.2, 5348;
      22, 1000, 237, 6370, 83.8, 8056;
      29, 1150, 192, 6441, 82.1, 6960;
      37,  990, 202, 5154, 79.2, 5690;
      44,  840, 184, 5896, 81.2, 6932;
      58,  650, 200, 5336, 80.6, 5400;
      65,  640, 180, 5041, 78.4, 3177;
      72,  583, 165, 5012, 79.3, 4461;
      80,  570, 151, 4825, 78.7, 3901;
      86,  570, 171, 4391, 78.0, 5002;
      93,  510, 243, 4320, 72.3, 4665;
     100,  555, 147, 3709, 74.9, 4642;
     107,  460, 286, 3969, 74.4, 4840;
     122,  275, 198, 3558, 72.5, 4479;
     129,  510, 196, 4361, 57.7, 4200;
     151,  165, 210, 3301, 71.8, 3410;
     171,  244, 327, 2964, 72.5, 3360;
     220,   79, 334, 2777, 71.9, 2599];
% Dependent variable
y = [ 1.5563;  0.8976;  0.7482;  0.7160;  0.3010;
      0.3617;  0.1139;  0.1139; -0.2218; -0.1549;
      0.0000;  0.0000; -0.0969; -0.2218; -0.3979;
     -0.1549; -0.2218; -0.3979; -0.5229; -0.0458];
[n,m] = size(x);

mean_p = 'M';              % include a mean (intercept) term
isx = ones(m,1,'int64');   % by default every variable is a candidate
isx(1) = 0;                % DAY is not considered
isx(m) = 2;                % COD is forced into the model
vname = {'DAY'; 'BOD'; 'TKN'; 'TS '; 'TVS'; 'COD'};

% Arguments that need not be set on the first call (istep = 0)
nzero = int64(0);
model = {'   '; '   '; '   '; '   '; '   '; '   '};
nterm = nzero;
rss = 0;
idf = nzero;
ifr = nzero;
free = model;
q = zeros(n,m+2);
p = zeros(m+1,1);

% Loop attempting to add each variable in turn
istep = nzero;
while true
  [istep, addvar, newvar, chrss, f, model, nterm, ...
   rss, idf, ifr, free, exss, q, p, ifail] = ...
    g02ee( ...
      istep, mean_p, x, vname, isx, y, model, nterm, ...
      rss, idf, ifr, free, q, p);

  % Display the results at each step
  fprintf('Step %3d\n', istep);
  if ~addvar
    fprintf('No further variables added max  F = %7.2f\n', f);
  else
    fprintf('Change in residual sum of squares = %13.4e\n', chrss);
    fprintf('F Statistic                       = %7.2f\n\n', f);
    fprintf('Variables in model              :');
    fprintf('     %s', model{1:nterm,1});
    fprintf('\n\nResidual sum of squares           = %13.4e\n', rss);
    fprintf('Degrees of freedom                = %2d\n\n', idf);
  end
  if ifr==0
    fprintf('No free variables remaining\n');
  else
    fprintf('Free variables                  :')
    fprintf('     %s', free{1:ifr,1});
    fprintf('\n');
    fprintf('%8.4f', exss(1:ifr));
    fprintf('\n\n');
  end

  % Stop once no variable is added or no free variables remain
  if ~addvar || ifr==0
    break;
  end
end

```
```
g02ee example results

Step   1
Change in residual sum of squares =    4.7126e-01
F Statistic                       =    7.38

Variables in model              :     COD     TS

Residual sum of squares           =    1.0850e+00
Degrees of freedom                = 17

Free variables                  :     TKN     BOD     TVS
0.1175  0.0600  0.2276

Step   2
No further variables added max  F =    1.59
Free variables                  :     TKN     BOD     TVS