Purpose

nag_correg_linregm_rssq_stat (g02ec) calculates ${R}^{2}$ and ${C}_{p}$-values from the residual sums of squares for a series of linear regression models.

Syntax

[rsq, cp, ifail] = g02ec(mean_p, n, sigsq, tss, nterms, rss, 'nmod', nmod)
[rsq, cp, ifail] = nag_correg_linregm_rssq_stat(mean_p, n, sigsq, tss, nterms, rss, 'nmod', nmod)

Description

When selecting a linear regression model for a set of $n$ observations, a balance has to be found between the number of independent variables in the model and the fit, as measured by the residual sum of squares. The more variables are included, the smaller the residual sum of squares will be. Two statistics can help in selecting the best model.
(a) ${R}^{2}$ represents the proportion of variation in the dependent variable that is explained by the independent variables.
 ${R}^{2}=\frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}},$
 where $\text{Total Sum of Squares}={\mathbf{tss}}=\sum {\left(y-\bar{y}\right)}^{2}$ (if the mean is fitted, otherwise ${\mathbf{tss}}=\sum {y}^{2}$) and $\text{Regression Sum of Squares}=\text{RegSS}={\mathbf{tss}}-{\mathbf{rss}}$, where ${\mathbf{rss}}=\text{residual sum of squares}=\sum {\left(y-\hat{y}\right)}^{2}$.
The ${R}^{2}$-values can be examined to find a model with a high ${R}^{2}$-value but with a small number of independent variables.
(b) ${C}_{p}$ statistic.
 ${C}_{p}=\frac{{\mathbf{rss}}}{{\hat{\sigma}}^{2}}-\left(n-2p\right),$
where $p$ is the number of parameters (including the mean) in the model and ${\hat{\sigma}}^{2}$ is an estimate of the true variance of the errors. This can often be obtained from fitting the full model.
A well fitting model will have ${C}_{p}\simeq p$. ${C}_{p}$ is often plotted against $p$ to see which models are closest to the ${C}_{p}=p$ line.
nag_correg_linregm_rssq_stat (g02ec) may be called after nag_correg_linregm_rssq (g02ea) which calculates the residual sums of squares for all possible linear regression models.
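The two formulas above can be illustrated with a minimal plain-MATLAB sketch. It is not the NAG implementation: the data values and the variable `has_mean` are made up for illustration, and ${\hat{\sigma}}^{2}$ is assumed to be estimated from the largest model as its residual sum of squares divided by $n-p$.

```matlab
% Plain-MATLAB sketch of the R^2 and C_p formulas (not the NAG routine).
% All numbers below are invented for illustration.
n        = 20;                     % number of observations
tss      = 10.0;                   % total sum of squares
rss      = [10.0; 4.0; 2.5; 2.4];  % residual sum of squares, one per model
nterms   = [0; 1; 2; 3];           % independent variables per model (mean excluded)
has_mean = true;                   % hypothetical flag, plays the role of mean_p = 'M'

p = nterms + double(has_mean);     % number of parameters, including the mean

% Assumed variance estimate: residual mean square of the largest model
sigsq = rss(end) / (n - p(end));

rsq = (tss - rss) ./ tss;          % R^2 = RegSS / Total Sum of Squares
cp  = rss ./ sigsq - (n - 2 .* p); % C_p = rss / sigsq - (n - 2p)

disp([p, rsq, cp]);
```

With this choice of `sigsq`, the largest model always has ${C}_{p}=p$, so it is the smaller models that are informative when compared against the ${C}_{p}=p$ line.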

References

Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
Weisberg S (1985) Applied Linear Regression Wiley

Parameters

Compulsory Input Parameters

1:     $\mathrm{mean\_p}$ – string (length ≥ 1)
Indicates if a mean term is to be included.
${\mathbf{mean\_p}}=\text{'M'}$
A mean term, intercept, will be included in the model.
${\mathbf{mean\_p}}=\text{'Z'}$
The model will pass through the origin, zero-point.
Constraint: ${\mathbf{mean\_p}}=\text{'M'}$ or $\text{'Z'}$.
2:     $\mathrm{n}$ – int64, int32 or nag_int scalar
$n$, the number of observations used in the regression model.
Constraint: ${\mathbf{n}}$ must be greater than $2\times {p}_{\mathrm{max}}$, where ${p}_{\mathrm{max}}$ is the largest number of independent variables fitted (including the mean if fitted).
3:     $\mathrm{sigsq}$ – double scalar
The best estimate of the true variance of the errors, ${\hat{\sigma}}^{2}$.
Constraint: ${\mathbf{sigsq}}>0.0$.
4:     $\mathrm{tss}$ – double scalar
The total sum of squares for the regression model.
Constraint: ${\mathbf{tss}}>0.0$.
5:     $\mathrm{nterms}\left({\mathbf{nmod}}\right)$ – int64, int32 or nag_int array
${\mathbf{nterms}}\left(\mathit{i}\right)$ must contain the number of independent variables (not counting the mean) fitted to the $\mathit{i}$th model, for $\mathit{i}=1,2,\dots ,{\mathbf{nmod}}$.
6:     $\mathrm{rss}\left({\mathbf{nmod}}\right)$ – double array
${\mathbf{rss}}\left(i\right)$ must contain the residual sum of squares for the $i$th model.
Constraint: ${\mathbf{rss}}\left(\mathit{i}\right)\le {\mathbf{tss}}$, for $\mathit{i}=1,2,\dots ,{\mathbf{nmod}}$.

Optional Input Parameters

1:     $\mathrm{nmod}$ – int64, int32 or nag_int scalar
Default: the dimension of the arrays nterms, rss. (An error is raised if these dimensions are not equal.)
The number of regression models.
Constraint: ${\mathbf{nmod}}>0$.

Output Parameters

1:     $\mathrm{rsq}\left({\mathbf{nmod}}\right)$ – double array
${\mathbf{rsq}}\left(\mathit{i}\right)$ contains the ${R}^{2}$-value for the $\mathit{i}$th model, for $\mathit{i}=1,2,\dots ,{\mathbf{nmod}}$.
2:     $\mathrm{cp}\left({\mathbf{nmod}}\right)$ – double array
${\mathbf{cp}}\left(\mathit{i}\right)$ contains the ${C}_{p}$-value for the $\mathit{i}$th model, for $\mathit{i}=1,2,\dots ,{\mathbf{nmod}}$.
3:     $\mathrm{ifail}$ – int64, int32 or nag_int scalar
${\mathbf{ifail}}={\mathbf{0}}$ unless the function detects an error (see Error Indicators and Warnings).

Error Indicators and Warnings

Errors or warnings detected by the function:
${\mathbf{ifail}}=1$
 On entry, ${\mathbf{nmod}}<1$, or ${\mathbf{sigsq}}\le 0.0$, or ${\mathbf{tss}}\le 0.0$, or ${\mathbf{mean\_p}}\ne \text{'M'}$ or $\text{'Z'}$.
${\mathbf{ifail}}=2$
On entry, the number of parameters for a model is too large for the number of observations, i.e., $2\times p\ge n$.
${\mathbf{ifail}}=3$
On entry, ${\mathbf{rss}}\left(i\right)>{\mathbf{tss}}$, for some $i=1,2,\dots ,{\mathbf{nmod}}$.
${\mathbf{ifail}}=4$
A value of ${C}_{p}$ is less than $0.0$. This may occur if sigsq is too large or if rss, n or nterms are incorrect.
${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please contact NAG.
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
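As a rough illustration of the input conditions behind ${\mathbf{ifail}}=1$, $2$ and $3$, the following plain-MATLAB sketch checks them before a call is made. It is not part of the NAG Toolbox: the flag names `ok_basic`, `ok_size` and `ok_rss` and the data values are hypothetical, and the sketch assumes that only the first character of mean_p is significant.

```matlab
% Hypothetical pre-call checks mirroring ifail = 1, 2 and 3 (illustration only).
mean_p = 'M';
n      = 20;
sigsq  = 0.15;
tss    = 10.0;
nterms = [0; 1; 2; 3];
rss    = [10.0; 4.0; 2.5; 2.4];

% The documented default for nmod requires nterms and rss to have equal length
same_len = numel(nterms) == numel(rss);

% Parameters per model: independent variables plus the mean, if fitted
p = double(nterms) + double(mean_p(1) == 'M');

% Conditions whose violation is reported as ifail = 1
ok_basic = ismember(mean_p(1), 'MZ') && sigsq > 0 && tss > 0 && numel(rss) >= 1;
% Condition whose violation is reported as ifail = 2 (2*p >= n)
ok_size  = all(2 .* p < n);
% Condition whose violation is reported as ifail = 3 (rss(i) > tss)
ok_rss   = all(rss <= tss);

if same_len && ok_basic && ok_size && ok_rss
    disp('Inputs satisfy the documented constraints.');
else
    disp('At least one documented constraint is violated.');
end
```

The condition behind ${\mathbf{ifail}}=4$ (a negative ${C}_{p}$) depends on the computed results, so it is not covered by these input checks.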

Accuracy

Accuracy is sufficient for all practical purposes.

Further Comments

None.

Example

The data, from an oxygen uptake experiment, is given by Weisberg (1985). The independent and dependent variables are read and the residual sums of squares for all possible models computed using nag_correg_linregm_rssq (g02ea). The values of ${R}^{2}$ and ${C}_{p}$ are then computed and printed along with the names of variables in the models.
```
function g02ec_example

fprintf('g02ec example results\n\n');

x = [  0, 1125, 232, 7160, 85.9, 8905;
7,  920, 268, 8804, 86.5, 7388;
15,  835, 271, 8108, 85.2, 5348;
22, 1000, 237, 6370, 83.8, 8056;
29, 1150, 192, 6441, 82.1, 6960;
37,  990, 202, 5154, 79.2, 5690;
44,  840, 184, 5896, 81.2, 6932;
58,  650, 200, 5336, 80.6, 5400;
65,  640, 180, 5041, 78.4, 3177;
72,  583, 165, 5012, 79.3, 4461;
80,  570, 151, 4825, 78.7, 3901;
86,  570, 171, 4391, 78.0, 5002;
93,  510, 243, 4320, 72.3, 4665;
100,  555, 147, 3709, 74.9, 4642;
107,  460, 286, 3969, 74.4, 4840;
122,  275, 198, 3558, 72.5, 4479;
129,  510, 196, 4361, 57.7, 4200;
151,  165, 210, 3301, 71.8, 3410;
171,  244, 327, 2964, 72.5, 3360;
220,   79, 334, 2777, 71.9, 2599];
y = [ 1.5563;  0.8976;  0.7482;  0.7160;  0.3010;
0.3617;  0.1139;  0.1139; -0.2218; -0.1549;
0.0000;  0.0000; -0.0969; -0.2218; -0.3979;
-0.1549; -0.2218; -0.3979; -0.5229; -0.0458];
[n,m] = size(x);

mean_p = 'M';
isx = ones(m,1,'int64');
isx(1) = 0;
vname = {'DAY'; 'BOD'; 'TKN'; 'TS '; 'TVS'; 'COD'};

% Calculate residual sums of squares for all possible models
[nmod, model, rss, nterms, mrank, ifail] = ...
g02ea(mean_p, x, vname, isx, y);

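% Estimate the error variance from the full model (last entry of rss) and
% take the total sum of squares from the mean-only model (first entry)
sigsq = rss(nmod)/(double(n) - double(nterms(nmod)) - 1);
tss = rss(1);

% Calculate R^2 and Mallows Cp
[rsq, cp, ifail] = g02ec( ...
    mean_p, int64(n), sigsq, tss, nterms, rss);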

% Display results
fprintf(' Parameters     Cp    R^2      model\n');
for j = 1:nmod
  fprintf('%8d%11.2f%8.4f  ', nterms(j), cp(j), rsq(j));
  fprintf(' %s', model{j,:});
  fprintf('\n');
end

```
```
g02ec example results

Parameters     Cp    R^2      model
0      55.45  0.0000
1      56.84  0.0082   TKN
1      20.33  0.5054   TVS
1      13.50  0.5983   BOD
1       6.57  0.6926   COD
1       6.29  0.6965   TS
2      21.36  0.5185   TKN TVS
2      11.33  0.6551   BOD TVS
2       9.09  0.6856   BOD TKN
2       7.70  0.7045   BOD COD
2       7.33  0.7095   TKN TS
2       7.16  0.7119   TS  TVS
2       6.88  0.7157   BOD TS
2       6.87  0.7158   TKN COD
2       5.27  0.7376   TVS COD
2       1.74  0.7857   TS  COD
3       8.68  0.7184   BOD TKN TVS
3       8.16  0.7255   TKN TS  TVS
3       8.15  0.7256   BOD TS  TVS
3       7.15  0.7392   BOD TVS COD
3       6.51  0.7479   BOD TKN COD
3       6.25  0.7515   BOD TKN TS
3       5.67  0.7595   TKN TVS COD
3       3.44  0.7898   BOD TS  COD
3       3.42  0.7900   TS  TVS COD
3       2.32  0.8050   TKN TS  COD
4       7.70  0.7591   BOD TKN TS  TVS
4       6.78  0.7716   BOD TKN TVS COD
4       5.07  0.7948   BOD TS  TVS COD
4       4.32  0.8050   BOD TKN TS  COD
4       4.00  0.8094   TKN TS  TVS COD
5       6.00  0.8094   BOD TKN TS  TVS COD
```
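One way to read these results against the ${C}_{p}\simeq p$ guideline from the Description is to rank models by their distance from the ${C}_{p}=p$ line. The sketch below does this in plain MATLAB for a few rows copied from the output above. The variable names and the choice to skip the full model are assumptions made for illustration: the full model shows ${C}_{p}=6.00=p$ here, as it would whenever ${\hat{\sigma}}^{2}$ is estimated from the full model, so it carries no information for the comparison.

```matlab
% Rank a few models from the example output by their distance from C_p = p.
% Values are copied from rows of the printed results; names are illustrative.
names  = {'TS'; 'TS COD'; 'TKN TS COD'; 'TKN TS TVS COD'; 'BOD TKN TS TVS COD'};
nterms = [1; 2; 3; 4; 5];
cp     = [6.29; 1.74; 2.32; 4.00; 6.00];

p    = nterms + 1;           % mean_p = 'M', so the mean adds one parameter
dist = abs(cp - p);          % distance from the C_p = p line

% Leave the full model out of the ranking (assumption explained in the lead-in)
is_full = nterms == max(nterms);
dist(is_full) = Inf;

[~, order] = sort(dist);     % smallest distance first
for k = order(1:3)'
    fprintf('%-16s p = %d  Cp = %5.2f  |Cp - p| = %4.2f\n', ...
            names{k}, p(k), cp(k), dist(k));
end
```

Distance from the line is only one criterion. The ${R}^{2}$ column and the number of variables in each model still need to be weighed when choosing between candidates.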