## 1 Purpose

g02ecf calculates ${R}^{2}$ and ${C}_{p}$-values from the residual sums of squares for a series of linear regression models.

## 2 Specification

Fortran Interface
 Subroutine g02ecf (mean, n, sigsq, tss, nmod, nterms, rss, rsq, cp, ifail)
 Integer, Intent (In) :: n, nmod, nterms(nmod)
 Integer, Intent (Inout) :: ifail
 Real (Kind=nag_wp), Intent (In) :: sigsq, tss, rss(nmod)
 Real (Kind=nag_wp), Intent (Out) :: rsq(nmod), cp(nmod)
 Character (1), Intent (In) :: mean
C Header Interface
 #include <nag.h>
 void g02ecf_ (const char *mean, const Integer *n, const double *sigsq, const double *tss, const Integer *nmod, const Integer nterms[], const double rss[], double rsq[], double cp[], Integer *ifail, const Charlen length_mean)
The routine may be called by the names g02ecf or nagf_correg_linregm_rssq_stat.

## 3 Description

When selecting a linear regression model for a set of $n$ observations a balance has to be found between the number of independent variables in the model and fit as measured by the residual sum of squares. The more variables included the smaller will be the residual sum of squares. Two statistics can help in selecting the best model.
(a) ${R}^{2}$ represents the proportion of variation in the dependent variable that is explained by the independent variables.
 ${R}^{2}=\frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}},$
 where $\text{Total Sum of Squares}={\mathbf{tss}}=\sum {\left(y-\overline{y}\right)}^{2}$ (if the mean is fitted, otherwise ${\mathbf{tss}}=\sum {y}^{2}$) and $\text{Regression Sum of Squares}=\text{RegSS}={\mathbf{tss}}-{\mathbf{rss}}$, where ${\mathbf{rss}}=\text{residual sum of squares}=\sum {\left(y-\hat{y}\right)}^{2}$.
 The ${R}^{2}$-values can be examined to find a model with a high ${R}^{2}$-value but with a small number of independent variables.
(b) ${C}_{p}$ statistic.
 ${C}_{p}=\frac{\mathbf{rss}}{{\hat{\sigma }}^{2}}-\left(n-2p\right),$
where $p$ is the number of parameters (including the mean) in the model and ${\hat{\sigma }}^{2}$ is an estimate of the true variance of the errors. This can often be obtained from fitting the full model.
A well fitting model will have ${C}_{p}\simeq p$. ${C}_{p}$ is often plotted against $p$ to see which models are closest to the ${C}_{p}=p$ line.
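A brief justification of this rule of thumb, under the assumptions that the $p$-parameter model is unbiased and that ${\hat{\sigma }}^{2}$ is close to the true error variance ${\sigma }^{2}$: the residual sum of squares then has expectation $\left(n-p\right){\sigma }^{2}$, so
 ${C}_{p}\approx \frac{\left(n-p\right){\sigma }^{2}}{{\sigma }^{2}}-\left(n-2p\right)=p.$
Values of ${C}_{p}$ well above $p$ therefore suggest that the model omits important terms.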
g02ecf may be called after g02eaf, which calculates the residual sums of squares for all possible linear regression models.
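The two formulas above can be illustrated with a short, self-contained Python sketch. It is not the library's implementation: the function name `rsq_and_cp` and the input values are hypothetical, chosen only so that the arithmetic is easy to follow, and the sketch performs no checking of its inputs.

```python
def rsq_and_cp(mean, n, sigsq, tss, nterms, rss):
    """Return (rsq, cp) lists for a series of models, using the formulas of Section 3."""
    intercept = 1 if mean == "M" else 0      # 'M' fits a mean term, 'Z' does not
    rsq, cp = [], []
    for k, r in zip(nterms, rss):
        p = k + intercept                    # number of parameters, including the mean
        rsq.append((tss - r) / tss)          # R^2 = (tss - rss) / tss
        cp.append(r / sigsq - (n - 2 * p))   # C_p = rss / sigsq - (n - 2p)
    return rsq, cp


# Hypothetical inputs (not taken from any data set): three nested models
# fitted to n = 20 observations, with sigsq taken from the largest model,
# 32.0 / (20 - 4) = 2.0.
rsq, cp = rsq_and_cp("M", n=20, sigsq=2.0, tss=100.0,
                     nterms=[1, 2, 3], rss=[60.0, 36.0, 32.0])
for k, r2, c in zip([1, 2, 3], rsq, cp):
    print(f"nterms={k}  p={k + 1}  R^2={r2:.2f}  Cp={c:.1f}")
```

With these made-up numbers the largest model gives ${C}_{p}=p=4$ exactly, as expected when ${\hat{\sigma }}^{2}$ is estimated from that same model, while the one-variable model has ${C}_{p}=14$ against $p=2$.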

## 4 References

Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
Weisberg S (1985) Applied Linear Regression Wiley

## 5 Arguments

1: $\mathbf{mean}$ Character(1) Input
On entry: indicates if a mean term is to be included.
${\mathbf{mean}}=\text{'M'}$
A mean term, intercept, will be included in the model.
${\mathbf{mean}}=\text{'Z'}$
The model will pass through the origin, zero-point.
Constraint: ${\mathbf{mean}}=\text{'M'}$ or $\text{'Z'}$.
2: $\mathbf{n}$ Integer Input
On entry: $n$, the number of observations used in the regression model.
Constraint: ${\mathbf{n}}$ must be greater than $2\times {p}_{\mathrm{max}}$, where ${p}_{\mathrm{max}}$ is the largest number of independent variables fitted (including the mean if fitted).
3: $\mathbf{sigsq}$ Real (Kind=nag_wp) Input
On entry: the best estimate of the true variance of the errors, ${\hat{\sigma }}^{2}$.
Constraint: ${\mathbf{sigsq}}>0.0$.
4: $\mathbf{tss}$ Real (Kind=nag_wp) Input
On entry: the total sum of squares for the regression model.
Constraint: ${\mathbf{tss}}>0.0$.
5: $\mathbf{nmod}$ Integer Input
On entry: the number of regression models.
Constraint: ${\mathbf{nmod}}>0$.
6: $\mathbf{nterms}\left({\mathbf{nmod}}\right)$ Integer array Input
On entry: ${\mathbf{nterms}}\left(\mathit{i}\right)$ must contain the number of independent variables (not counting the mean) fitted to the $\mathit{i}$th model, for $\mathit{i}=1,2,\dots ,{\mathbf{nmod}}$.
7: $\mathbf{rss}\left({\mathbf{nmod}}\right)$ Real (Kind=nag_wp) array Input
On entry: ${\mathbf{rss}}\left(i\right)$ must contain the residual sum of squares for the $i$th model.
Constraint: ${\mathbf{rss}}\left(\mathit{i}\right)\le {\mathbf{tss}}$, for $\mathit{i}=1,2,\dots ,{\mathbf{nmod}}$.
8: $\mathbf{rsq}\left({\mathbf{nmod}}\right)$ Real (Kind=nag_wp) array Output
On exit: ${\mathbf{rsq}}\left(\mathit{i}\right)$ contains the ${R}^{2}$-value for the $\mathit{i}$th model, for $\mathit{i}=1,2,\dots ,{\mathbf{nmod}}$.
9: $\mathbf{cp}\left({\mathbf{nmod}}\right)$ Real (Kind=nag_wp) array Output
On exit: ${\mathbf{cp}}\left(\mathit{i}\right)$ contains the ${C}_{p}$-value for the $\mathit{i}$th model, for $\mathit{i}=1,2,\dots ,{\mathbf{nmod}}$.
10: $\mathbf{ifail}$ Integer Input/Output
On entry: ifail must be set to $0$, $-1$ or $1$ to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of $0$ causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of $-1$ means that an error message is printed while a value of $1$ means that it is not.
If halting is not appropriate, the value $-1$ or $1$ is recommended. If message printing is undesirable, then the value $1$ is recommended. Otherwise, the value $0$ is recommended. When the value $-\mathbf{1}$ or $\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit: ${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see Section 6).

## 6 Error Indicators and Warnings

If on entry ${\mathbf{ifail}}=0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
${\mathbf{ifail}}=1$
On entry, ${\mathbf{mean}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{mean}}=\text{'M'}$ or $\text{'Z'}$.
On entry, ${\mathbf{nmod}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{nmod}}>0$.
On entry, ${\mathbf{sigsq}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{sigsq}}>0.0$.
On entry, ${\mathbf{tss}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{tss}}>0.0$.
${\mathbf{ifail}}=2$
On entry, the number of parameters, $p$, is $〈\mathit{\text{value}}〉$ and ${\mathbf{n}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{n}}\ge 2p$.
${\mathbf{ifail}}=3$
On entry, ${\mathbf{rss}}\left(〈\mathit{\text{value}}〉\right)=〈\mathit{\text{value}}〉$ and ${\mathbf{tss}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{rss}}\left(i\right)\le {\mathbf{tss}}$, for all $i$.
${\mathbf{ifail}}=4$
A value of ${C}_{p}$ is less than $0.0$. This may occur if sigsq is too large or if rss, n or nterms are incorrect.
${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.
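As a rough illustration of how the input constraints map to the values ${\mathbf{ifail}}=1$ to $4$ listed above, the following Python sketch mirrors the checks in plain code. It is an assumption-laden sketch rather than the routine's actual logic: the function name `check_inputs` is hypothetical, the order in which the checks are made is assumed, nmod is taken as the length of `rss`, and the test for ${\mathbf{ifail}}=2$ uses the form ${\mathbf{n}}\ge 2p$ from the error message (Section 5 words the constraint as "greater than").

```python
def check_inputs(mean, n, sigsq, tss, nterms, rss):
    """Return the ifail value (0 = no error) suggested by Section 6 for these inputs."""
    nmod = len(rss)
    # ifail = 1: invalid mean, nmod, sigsq or tss.
    if mean not in ("M", "Z") or nmod <= 0 or sigsq <= 0.0 or tss <= 0.0:
        return 1
    intercept = 1 if mean == "M" else 0
    p_max = max(nterms) + intercept          # largest number of parameters fitted
    # ifail = 2: too few observations (assumed form: n >= 2p must hold).
    if n < 2 * p_max:
        return 2
    # ifail = 3: a residual sum of squares exceeds the total sum of squares.
    if any(r > tss for r in rss):
        return 3
    # ifail = 4: some C_p value is negative.
    if any(r / sigsq - (n - 2 * (k + intercept)) < 0.0
           for k, r in zip(nterms, rss)):
        return 4
    return 0
```

For example, `check_inputs("M", 20, 2.0, 100.0, [1, 2, 3], [60.0, 36.0, 32.0])` passes every check, whereas raising `sigsq` to `100.0` makes each ${C}_{p}$ negative and yields `4`.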

## 7 Accuracy

Accuracy is sufficient for all practical purposes.

## 8 Parallelism and Performance

g02ecf is not threaded in any implementation.

## 9 Further Comments

None.

## 10 Example

The data, from an oxygen uptake experiment, is given by Weisberg (1985). The independent and dependent variables are read and the residual sums of squares for all possible models computed using g02eaf. The values of ${R}^{2}$ and ${C}_{p}$ are then computed and printed along with the names of variables in the models.

### 10.1 Program Text

Program Text (g02ecfe.f90)

### 10.2 Program Data

Program Data (g02ecfe.d)

### 10.3 Program Results

Program Results (g02ecfe.r)