PDF version (NAG web site, 64-bit version, 64-bit version)
NAG Toolbox: nag_correg_linregm_coeffs_const (g02cg)
Purpose
nag_correg_linregm_coeffs_const (g02cg) performs a multiple linear regression on a set of variables whose means, sums of squares and cross-products of deviations from means, and Pearson product-moment correlation coefficients are given.
Syntax
[result, coef, con, rinv, c, ifail] = g02cg(n, xbar, ssp, r, 'k1', k1)
[result, coef, con, rinv, c, ifail] = nag_correg_linregm_coeffs_const(n, xbar, ssp, r, 'k1', k1)
Note: the interface to this routine has changed since earlier releases of the toolbox:
At Mark 23: k was removed from the interface.
Description
nag_correg_linregm_coeffs_const (g02cg) fits a curve of the form
$y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
to the data points
$\left(x_{1i}, x_{2i}, \dots, x_{ki}, y_i\right), \quad i = 1, 2, \dots, n,$
such that
$y_i = a + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki} + e_i, \quad i = 1, 2, \dots, n.$
The function calculates the regression coefficients,
$b_1, b_2, \dots, b_k$, the regression constant,
$a$, and various other statistical quantities by minimizing
$\sum_{i=1}^{n} e_i^2.$
The actual data values
$\left(x_{1i}, x_{2i}, \dots, x_{ki}, y_i\right)$ are not provided as input to the function. Instead, input consists of:
(i) 
The number of cases, $n$, on which the regression is based. 
(ii) 
The total number of variables, dependent and independent, in the regression, $\left(k+1\right)$. 
(iii) 
The number of independent variables in the regression, $k$. 
(iv) 
The means of all $k+1$ variables in the regression, both the independent variables $\left(x_1, x_2, \dots, x_k\right)$ and the dependent variable $\left(y\right)$, which is the $\left(k+1\right)$th variable: i.e., $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k, \bar{y}$. 
(v) 
The $\left(k+1\right)$ by $\left(k+1\right)$ matrix [${S}_{ij}$] of sums of squares and cross-products of deviations from means of all the variables in the regression; the terms involving the dependent variable, $y$, appear in the $\left(k+1\right)$th row and column. 
(vi) 
The $\left(k+1\right)$ by $\left(k+1\right)$ matrix [${R}_{ij}$] of the Pearson product-moment correlation coefficients for all the variables in the regression; the correlations involving the dependent variable, $y$, appear in the $\left(k+1\right)$th row and column. 
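The six input quantities above can be formed from a raw data matrix in a few lines. The sketch below (a NumPy illustration, not the NAG toolbox; the data matrix is invented for this example) builds $n$, $k$, the means, the sums-of-squares-and-cross-products matrix, and the correlation matrix, with the dependent variable as the last column:

```python
# Build the inputs (i)-(vi) from a raw data matrix.  The data below are
# invented for illustration; the dependent variable y is the last column.
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 5.0, 4.0],
                 [4.0, 4.0, 2.0],
                 [3.0, 1.0, 5.0],
                 [5.0, 3.0, 1.0]])

n, k1 = data.shape        # (i) number of cases, (ii) total variables, k + 1
k = k1 - 1                # (iii) number of independent variables
xbar = data.mean(axis=0)  # (iv) means of all k + 1 variables
dev = data - xbar
ssp = dev.T @ dev         # (v) sums of squares and cross-products of deviations
d = np.sqrt(np.diag(ssp))
r = ssp / np.outer(d, d)  # (vi) Pearson product-moment correlation matrix
```

Matrices formed this way, from complete data with no missing values, are symmetric and, barring exact collinearity, give a positive definite independent-variable partition.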
The quantities calculated are:
(a) 
The inverse of the $k$ by $k$ partition of the matrix of correlation coefficients, [${R}_{ij}$], involving only the independent variables. The inverse is obtained using an accurate method which assumes that this submatrix is positive definite. 
(b) 
The modified inverse matrix, $C=\left[{c}_{ij}\right]$, where
$c_{ij} = \frac{R_{ij} \, r_{ij}}{S_{ij}}, \quad i, j = 1, 2, \dots, k,$
where ${r}_{ij}$ is the $\left(i,j\right)$th element of the inverse matrix of [${R}_{ij}$] as described in (a) above. Each element of $C$ is thus the corresponding element of the matrix of correlation coefficients multiplied by the corresponding element of the inverse of this matrix, divided by the corresponding element of the matrix of sums of squares and cross-products of deviations from means. 
(c) 
The regression coefficients:
$b_j = \sum_{i=1}^{k} c_{ji} S_{i\left(k+1\right)}, \quad j = 1, 2, \dots, k,$
where ${S}_{j\left(k+1\right)}$ is the sum of cross-products of deviations from means for the independent variable ${x}_{j}$ and the dependent variable $y$. 
(d) 
The sum of squares attributable to the regression, $SSR$, the sum of squares of deviations about the regression, $SSD$, and the total sum of squares, $SST$:
$SST = S_{\left(k+1\right)\left(k+1\right)}$, the sum of squares of deviations from the mean for the dependent variable, $y$;
$SSR = \sum_{j=1}^{k} b_j S_{j\left(k+1\right)}; \quad SSD = SST - SSR.$
(e) 
The degrees of freedom attributable to the regression, $DFR$, the degrees of freedom of deviations about the regression, $DFD$, and the total degrees of freedom, $DFT$:
$DFR = k; \quad DFD = n - k - 1; \quad DFT = n - 1.$
(f) 
The mean square attributable to the regression, $MSR$, and the mean square of deviations about the regression, $MSD$:
$MSR = SSR/DFR; \quad MSD = SSD/DFD.$
(g) 
The $F$ values for the analysis of variance:
$F = MSR/MSD.$
(h) 
The standard error estimate:
$s = \sqrt{MSD}.$
(i) 
The coefficient of multiple correlation, $R$, the coefficient of multiple determination, ${R}^{2}$, and the coefficient of multiple determination corrected for the degrees of freedom, $\bar{R}^2$:
$R^2 = SSR/SST; \quad \bar{R}^2 = 1 - \left(1 - R^2\right)\frac{DFT}{DFD}; \quad R = \sqrt{R^2}.$
(j) 
The standard error of the regression coefficients:
$se\left(b_j\right) = \sqrt{c_{jj} \, MSD}, \quad j = 1, 2, \dots, k.$
(k) 
The $t$ values for the regression coefficients:
$t\left(b_j\right) = \frac{b_j}{se\left(b_j\right)}, \quad j = 1, 2, \dots, k.$
(l) 
The regression constant, $a$, its standard error, $se\left(a\right)$, and its $t$ value, $t\left(a\right)$:
$a = \bar{y} - \sum_{j=1}^{k} b_j \bar{x}_j; \quad se\left(a\right) = \sqrt{MSD\left(\frac{1}{n} + \sum_{i=1}^{k}\sum_{j=1}^{k} \bar{x}_i c_{ij} \bar{x}_j\right)}; \quad t\left(a\right) = \frac{a}{se\left(a\right)}.$
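As a cross-check on the formulae in (a)–(l), the following NumPy sketch (an illustration only, not the NAG implementation) reproduces all the quantities from the summary statistics used in the Example section:

```python
# Reproduce quantities (a)-(l) from the summary statistics of the Example
# section.  NumPy illustration only -- not the NAG implementation.
import numpy as np

n, k = 5, 2
xbar = np.array([5.4, 5.8, 2.8])                 # means; y is last
ssp = np.array([[99.2, 57.6, 6.4],
                [57.6, 102.8, 29.2],
                [6.4, 29.2, 14.8]])
r = np.array([[1.0, 0.5704, 0.167],
              [0.5704, 1.0, 0.7486],
              [0.167, 0.7486, 1.0]])

rinv = np.linalg.inv(r[:k, :k])       # (a) inverse of the independent block
c = r[:k, :k] * rinv / ssp[:k, :k]    # (b) modified inverse, element-wise
b = c @ ssp[:k, k]                    # (c) regression coefficients
sst = ssp[k, k]                       # (d) total sum of squares
ssr = b @ ssp[:k, k]                  #     sum of squares due to regression
ssd = sst - ssr                       #     sum of squares about regression
dfr, dfd, dft = k, n - k - 1, n - 1   # (e) degrees of freedom
msr, msd = ssr / dfr, ssd / dfd       # (f) mean squares
f_val = msr / msd                     # (g) F value
s = np.sqrt(msd)                      # (h) standard error estimate
r2 = ssr / sst                        # (i) coefficient of determination
r2_adj = 1.0 - (1.0 - r2) * dft / dfd #     corrected for degrees of freedom
se_b = np.sqrt(np.diag(c) * msd)      # (j) standard errors of coefficients
t_b = b / se_b                        # (k) t values of coefficients
a = xbar[k] - b @ xbar[:k]            # (l) regression constant
se_a = np.sqrt(msd * (1.0 / n + xbar[:k] @ c @ xbar[:k]))
t_a = a / se_a
```

The values computed here agree with the program results shown in the Example section to the printed precision.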
References
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
Parameters
Compulsory Input Parameters
 1:
$\mathrm{n}$ – int64/int32/nag_int scalar

The number of cases $n$, used in calculating the sums of squares and cross-products and correlation coefficients.
 2:
$\mathrm{xbar}\left({\mathbf{k1}}\right)$ – double array

${\mathbf{xbar}}\left(\mathit{i}\right)$ must be set to $\bar{x}_{\mathit{i}}$, the mean value of the $\mathit{i}$th variable, for $\mathit{i}=1,2,\dots ,k+1$; the mean of the dependent variable must be contained in ${\mathbf{xbar}}\left(k+1\right)$.
 3:
$\mathrm{ssp}\left(\mathit{ldssp},{\mathbf{k1}}\right)$ – double array

ldssp, the first dimension of the array, must satisfy the constraint
$\mathit{ldssp}\ge {\mathbf{k1}}$.
${\mathbf{ssp}}\left(\mathit{i},\mathit{j}\right)$ must be set to ${S}_{\mathit{i}\mathit{j}}$, the sum of cross-products of deviations from means for the $\mathit{i}$th and $\mathit{j}$th variables, for $\mathit{i}=1,2,\dots ,k+1$ and $\mathit{j}=1,2,\dots ,k+1$; terms involving the dependent variable appear in row $k+1$ and column $k+1$.
 4:
$\mathrm{r}\left(\mathit{ldr},{\mathbf{k1}}\right)$ – double array

ldr, the first dimension of the array, must satisfy the constraint
$\mathit{ldr}\ge {\mathbf{k1}}$.
${\mathbf{r}}\left(\mathit{i},\mathit{j}\right)$ must be set to ${R}_{\mathit{i}\mathit{j}}$, the Pearson product-moment correlation coefficient for the $\mathit{i}$th and $\mathit{j}$th variables, for $\mathit{i}=1,2,\dots ,k+1$ and $\mathit{j}=1,2,\dots ,k+1$; terms involving the dependent variable appear in row $k+1$ and column $k+1$.
Optional Input Parameters
 1:
$\mathrm{k1}$ – int64/int32/nag_int scalar

Default:
the dimension of the array
xbar and the first dimension of the arrays
ssp,
r and the second dimension of the arrays
ssp,
r. (An error is raised if these dimensions are not equal.)
The total number of variables, independent and dependent, $\left(k+1\right)$, in the regression.
Constraint:
$2\le {\mathbf{k1}}<{\mathbf{n}}$.
Output Parameters
 1:
$\mathrm{result}\left(13\right)$ – double array

The following information:
${\mathbf{result}}\left(1\right)$  $SSR$, the sum of squares attributable to the regression; 
${\mathbf{result}}\left(2\right)$  $DFR$, the degrees of freedom attributable to the regression; 
${\mathbf{result}}\left(3\right)$  $MSR$, the mean square attributable to the regression; 
${\mathbf{result}}\left(4\right)$  $F$, the $F$ value for the analysis of variance; 
${\mathbf{result}}\left(5\right)$  $SSD$, the sum of squares of deviations about the regression; 
${\mathbf{result}}\left(6\right)$  $DFD$, the degrees of freedom of deviations about the regression; 
${\mathbf{result}}\left(7\right)$  $MSD$, the mean square of deviations about the regression; 
${\mathbf{result}}\left(8\right)$  $SST$, the total sum of squares; 
${\mathbf{result}}\left(9\right)$  $DFT$, the total degrees of freedom; 
${\mathbf{result}}\left(10\right)$  $s$, the standard error estimate; 
${\mathbf{result}}\left(11\right)$  $R$, the coefficient of multiple correlation; 
${\mathbf{result}}\left(12\right)$  ${R}^{2}$, the coefficient of multiple determination; 
${\mathbf{result}}\left(13\right)$  $\bar{R}^2$, the coefficient of multiple determination corrected for the degrees of freedom. 
 2:
$\mathrm{coef}\left(\mathit{ldcoef},3\right)$ – double array

For
$i=1,2,\dots ,k$, the following information:
 ${\mathbf{coef}}\left(i,1\right)$
 ${b}_{i}$, the regression coefficient for the $i$th variable.
 ${\mathbf{coef}}\left(i,2\right)$
 $se\left({b}_{i}\right)$, the standard error of the regression coefficient for the $i$th variable.
 ${\mathbf{coef}}\left(i,3\right)$
 $t\left({b}_{i}\right)$, the $t$ value of the regression coefficient for the $i$th variable.
 3:
$\mathrm{con}\left(3\right)$ – double array

The following information:
${\mathbf{con}}\left(1\right)$  $a$, the regression constant; 
${\mathbf{con}}\left(2\right)$  $se\left(a\right)$, the standard error of the regression constant; 
${\mathbf{con}}\left(3\right)$  $t\left(a\right)$, the $t$ value for the regression constant. 
 4:
$\mathrm{rinv}\left(\mathit{ldrinv},\mathit{k}\right)$ – double array

$\mathit{k}={\mathbf{k1}}-1$.
The inverse of the matrix of correlation coefficients for the independent variables; that is, the inverse of the matrix consisting of the first
$k$ rows and columns of
r.
 5:
$\mathrm{c}\left(\mathit{ldc},\mathit{k}\right)$ – double array

$\mathit{k}={\mathbf{k1}}-1$.
The modified inverse matrix, where
 ${\mathbf{c}}\left(\mathit{i},\mathit{j}\right)={\mathbf{r}}\left(\mathit{i},\mathit{j}\right)\times {\mathbf{rinv}}\left(\mathit{i},\mathit{j}\right)/{\mathbf{ssp}}\left(\mathit{i},\mathit{j}\right)$, for $\mathit{i}=1,2,\dots ,k$ and $\mathit{j}=1,2,\dots ,k$.
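Because $S_{ij} = R_{ij}\sqrt{S_{ii}S_{jj}}$, the element-wise definition of c above makes it (up to rounding in the supplied correlations) the ordinary matrix inverse of the independent-variable partition of ssp. A quick NumPy check, using the 2 by 2 partition from the Example section (an illustration, not NAG code):

```python
import numpy as np

# 2 x 2 independent-variable partitions of ssp and r from the Example section.
ssp = np.array([[99.2, 57.6],
                [57.6, 102.8]])
r = np.array([[1.0, 0.5704],
              [0.5704, 1.0]])

rinv = np.linalg.inv(r)   # inverse of the correlation partition
c = r * rinv / ssp        # element-wise "modified inverse"
```

Inverting the well-scaled correlation matrix and rescaling afterwards is exactly the accuracy device described in the Accuracy section.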
 6:
$\mathrm{ifail}$ – int64/int32/nag_int scalar
${\mathbf{ifail}}={\mathbf{0}}$ unless the function detects an error (see
Error Indicators and Warnings).
Error Indicators and Warnings
Errors or warnings detected by the function:
 ${\mathbf{ifail}}=1$

On entry,  ${\mathbf{k1}}<2$. 
 ${\mathbf{ifail}}=2$

On entry,  ${\mathbf{k1}}\ne \left(\mathit{k}+1\right)$. 
 ${\mathbf{ifail}}=3$

On entry,  ${\mathbf{n}}\le {\mathbf{k1}}$. 
 ${\mathbf{ifail}}=4$

On entry,  $\mathit{ldssp}<{\mathbf{k1}}$, 
or  $\mathit{ldr}<{\mathbf{k1}}$, 
or  $\mathit{ldcoef}<\mathit{k}$, 
or  $\mathit{ldrinv}<\mathit{k}$, 
or  $\mathit{ldc}<\mathit{k}$, 
or  $\mathit{ldwkz}<\mathit{k}$. 
 ${\mathbf{ifail}}=5$

The $k$ by $k$ partition of the matrix $R$ which is to be inverted is not positive definite.
 ${\mathbf{ifail}}=6$

The refinement following the actual inversion fails, indicating that the
$k$ by
$k$ partition of the matrix
$R$, which is to be inverted, is ill-conditioned. The use of
nag_correg_linregm_fit (g02da), which employs a different numerical technique, may avoid this difficulty (an extra ‘variable’ representing the constant term must be introduced for
nag_correg_linregm_fit (g02da)).
 ${\mathbf{ifail}}=7$

 ${\mathbf{ifail}}=99$
An unexpected error has been triggered by this routine. Please
contact
NAG.
 ${\mathbf{ifail}}=399$
Your licence key may have expired or may not have been installed correctly.
 ${\mathbf{ifail}}=999$
Dynamic memory allocation failed.
Accuracy
The accuracy of any regression function is almost entirely dependent on the accuracy of the matrix inversion method used. In
nag_correg_linregm_coeffs_const (g02cg), it is the matrix of correlation coefficients rather than that of the sums of squares and cross-products of deviations from means that is inverted; this means that all terms in the matrix for inversion are of a similar order, and reduces the scope for computational error. For details on absolute accuracy, the relevant section of the document describing the inversion function used,
nag_linsys_real_posdef_solve_ref (f04ab), should be consulted.
nag_correg_linregm_fit (g02da) uses a different method, based on
nag_linsys_real_gen_lsqsol (f04am), and that function may well prove more reliable numerically. It does not handle missing values, nor does it provide the same output as this function. (In particular it is necessary to include explicitly the constant in the regression equation as another ‘variable’.)
If, in calculating
$F$,
$t\left(a\right)$, or any of the
$t\left({b}_{i}\right)$
(see
Description), the numbers involved are such that the result would be outside the range of numbers which can be stored by the machine, then the answer is set to the largest quantity which can be stored as a double variable, by means of a call to
nag_machine_real_largest (x02al).
Further Comments
The time taken by nag_correg_linregm_coeffs_const (g02cg) depends on $k$.
This function assumes that the matrix of correlation coefficients for the independent variables in the regression is positive definite; it fails if this is not the case.
This correlation matrix will in fact be positive definite whenever the correlation matrix and the sums of squares and cross-products (of deviations from means) matrix have been formed either without regard to missing values, or by eliminating
completely any cases involving missing values for any variable. If, however, these matrices are formed by eliminating cases with missing values from only those calculations involving the variables for which the values are missing, no such statement can be made, and the correlation matrix may or may not be positive definite. You should be aware of the possible dangers of using correlation matrices formed in this way (see the
G02 Chapter Introduction), but if you nevertheless wish to carry out regression using such matrices, this function is capable of handling the inversion of such matrices provided they are positive definite.
It should be noted that in forming the sums of squares and cross-products matrix and the correlation matrix a column of constants should not be added to the data as an additional ‘variable’ in order to obtain a constant term in the regression. This function automatically calculates the regression constant, $a$, and any attempt to insert such a ‘dummy variable’ is likely to cause the function to fail.
It should also be noted that the function requires the dependent variable to be the last of the
$k+1$ variables whose statistics are provided as input to the function. If this variable is not correctly positioned in the original data, the means, standard deviations, sums of squares and cross-products of deviations from means, and correlation coefficients can be manipulated by using
nag_correg_linregm_service_select (g02ce) or
nag_correg_linregm_service_reorder (g02cf) to reorder the variables as necessary.
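The reordering performed by nag_correg_linregm_service_reorder (g02cf) can also be sketched directly with index arrays (a NumPy illustration, not the NAG routine; the statistics here are the Example data with variables 2 and 3 hypothetically swapped so the dependent variable sits second):

```python
import numpy as np

# Hypothetical layout: the dependent variable is stored second of three.
xbar = np.array([5.4, 2.8, 5.8])
ssp = np.array([[99.2, 6.4, 57.6],
                [6.4, 14.8, 29.2],
                [57.6, 29.2, 102.8]])

order = np.array([0, 2, 1])           # move the dependent variable to the end
xbar2 = xbar[order]                   # reordered means
ssp2 = ssp[np.ix_(order, order)]      # rows and columns permuted together
```

The correlation matrix is permuted with the same `np.ix_(order, order)` indexing; permuting rows and columns together preserves symmetry and positive definiteness.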
Example
This example reads in the means, sums of squares and cross-products of deviations from means, and correlation coefficients for three variables. A multiple linear regression is then performed with the third and final variable as the dependent variable. Finally the results are printed.
function g02cg_example
fprintf('g02cg example results\n\n');
n = int64(5);
k = 2;
xbar = [ 5.4;  5.8;  2.8];
ssp  = [99.2,  57.6,  6.4;
        57.6, 102.8, 29.2;
         6.4,  29.2, 14.8];
r    = [1,      0.5704, 0.167;
        0.5704, 1,      0.7486;
        0.167,  0.7486, 1];
disp('Means:');
disp(xbar);
disp('Sums of squares and cross-products about means:');
disp(ssp);
disp('Correlation coefficients:');
disp(r);
[result, coeff, con, rinv, c, ifail] = ...
    g02cg(n, xbar, ssp, r);
fprintf(' Variable       Coef    Std err    t-value\n');
disp([[1:k]' coeff]);
fprintf('Analysis of regression table :\n\n');
fprintf('      Source     Sum of squares  DF  Mean square  F-value\n');
fprintf('Due to regression %11.3f%8d%13.3f%12.3f\n', result(1:4));
fprintf('About regression  %11.3f%8d%13.3f\n', result(5:7));
fprintf('Total             %11.3f%8d\n\n', result(8:9));
fprintf('Standard error of estimate = %8.4f\n', result(10));
fprintf('Multiple correlation (R)   = %8.4f\n', result(11));
fprintf('Determination (R squared)  = %8.4f\n', result(12));
fprintf('Corrected R squared        = %8.4f\n\n', result(13));
disp('Inverse of correlation matrix of independent variables:');
disp(rinv);
disp('Modified inverse matrix:');
disp(c);
g02cg example results
Means:
5.4000
5.8000
2.8000
Sums of squares and cross-products about means:
99.2000 57.6000 6.4000
57.6000 102.8000 29.2000
6.4000 29.2000 14.8000
Correlation coefficients:
1.0000 0.5704 0.1670
0.5704 1.0000 0.7486
0.1670 0.7486 1.0000
 Variable       Coef    Std err    t-value
    1.0000    -0.1488     0.1937    -0.7683
    2.0000     0.3674     0.1903     1.9309
Analysis of regression table :
      Source     Sum of squares  DF  Mean square  F-value
Due to regression 9.777 2 4.888 1.946
About regression 5.023 2 2.512
Total 14.800 4
Standard error of estimate = 1.5848
Multiple correlation (R) = 0.8128
Determination (R squared) = 0.6606
Corrected R squared = 0.3212
Inverse of correlation matrix of independent variables:
    1.4823   -0.8455
   -0.8455    1.4823
Modified inverse matrix:
    0.0149   -0.0084
   -0.0084    0.0144
© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2015