g02cgf performs a multiple linear regression on a set of variables whose means, sums of squares and cross-products of deviations from means, and Pearson product-moment correlation coefficients are given.
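The routine calculates the regression coefficients, ${b}_{1},{b}_{2},\dots ,{b}_{k}$, the regression constant, $a$, and various other statistical quantities by minimizing
$$\sum _{i=1}^{n}{e}_{i}^{2}$$
in the model
$${y}_{i}=a+{b}_{1}{x}_{1i}+{b}_{2}{x}_{2i}+\cdots +{b}_{k}{x}_{ki}+{e}_{i}\text{, }\quad i=1,2,\dots ,n\text{,}$$
where ${e}_{i}$ is the residual for the $i$th case.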
The actual data values $({x}_{1i},{x}_{2i},\dots ,{x}_{ki},{y}_{i})$ are not provided as input to the routine. Instead, input consists of:
(i) The number of cases, $n$, on which the regression is based.
(ii) The total number of variables, dependent and independent, in the regression, $(k+1)$.
(iii) The number of independent variables in the regression, $k$.
(iv) The means of all $k+1$ variables in the regression, both the independent variables $({x}_{1},{x}_{2},\dots ,{x}_{k})$ and the dependent variable $\left(y\right)$, which is the $(k+1)$th variable: i.e., ${\overline{x}}_{1},{\overline{x}}_{2},\dots ,{\overline{x}}_{k},\overline{y}$.
(v) The $(k+1)\times (k+1)$ matrix [${S}_{ij}$] of sums of squares and cross-products of deviations from means of all the variables in the regression; the terms involving the dependent variable, $y$, appear in the $(k+1)$th row and column.
(vi) The $(k+1)\times (k+1)$ matrix [${R}_{ij}$] of the Pearson product-moment correlation coefficients for all the variables in the regression; the correlations involving the dependent variable, $y$, appear in the $(k+1)$th row and column.
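The inputs listed above can all be derived from raw data before the routine is called. The following is a minimal sketch in plain Python, not part of the NAG Library: the function name `summary_statistics` and the data values are made up for illustration, and the dependent variable is assumed to be the last column.

```python
import math

def summary_statistics(columns):
    """Return (n, means, ssp, corr) for a list of equal-length data columns.

    Hypothetical helper: the dependent variable is assumed to be the last column.
    """
    n = len(columns[0])
    p = len(columns)
    means = [sum(col) / n for col in columns]
    # Deviations from the mean for every variable.
    dev = [[v - m for v in col] for col, m in zip(columns, means)]
    # S_ij: sums of squares and cross-products of deviations from means.
    ssp = [[sum(a * b for a, b in zip(dev[i], dev[j])) for j in range(p)]
           for i in range(p)]
    # R_ij = S_ij / sqrt(S_ii * S_jj): Pearson product-moment correlations.
    corr = [[ssp[i][j] / math.sqrt(ssp[i][i] * ssp[j][j]) for j in range(p)]
            for i in range(p)]
    return n, means, ssp, corr

# Made-up illustrative data: two independent variables, then y (last).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 10.1]
n, means, ssp, corr = summary_statistics([x1, x2, y])
```

Every case is used for every variable here, so the resulting matrices are formed consistently across all variables.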
The quantities calculated are:
(a) The inverse of the $k\times k$ partition of the matrix of correlation coefficients, [${R}_{ij}$], involving only the independent variables. The inverse is obtained using an accurate method which assumes that this sub-matrix is positive definite.
(b) The modified inverse matrix, $C=\left[{c}_{ij}\right]$, where
$${c}_{ij}=\frac{{R}_{ij}{r}_{ij}}{{S}_{ij}}\text{, }\quad i,j=1,2,\dots ,k\text{,}$$
where ${r}_{ij}$ is the $(i,j)$th element of the inverse matrix of [${R}_{ij}$] as described in (a) above. Each element of $C$ is thus the corresponding element of the matrix of correlation coefficients multiplied by the corresponding element of the inverse of this matrix, divided by the corresponding element of the matrix of sums of squares and cross-products of deviations from means.
(c) The regression coefficients:
$${b}_{i}=\sum _{j=1}^{k}{c}_{ij}{S}_{j(k+1)}\text{, }\quad i=1,2,\dots ,k\text{,}$$
where ${S}_{j(k+1)}$ is the sum of cross-products of deviations from means for the independent variable ${x}_{j}$ and the dependent variable $y$.
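(d) The sum of squares attributable to the regression, $SSR$, the sum of squares of deviations about the regression, $SSD$, and the total sum of squares, $SST$:
$SST={S}_{(k+1)(k+1)}$, the sum of squares of deviations from the mean for the dependent variable, $y$;
$SSR=\sum _{j=1}^{k}{b}_{j}{S}_{j(k+1)}$; $SSD=SST-SSR$.
(e) The degrees of freedom attributable to the regression, $DFR$, the degrees of freedom of deviations about the regression, $DFD$, and the total degrees of freedom, $DFT$:
$DFR=k$; $DFD=n-k-1$; $DFT=n-1$.
(f) The mean square attributable to the regression, $MSR$, and the mean square of deviations about the regression, $MSD$:
$MSR=SSR/DFR$; $MSD=SSD/DFD$.
(g) The $F$ value for the analysis of variance:
$F=MSR/MSD$.
(h) The standard error estimate:
$s=\sqrt{MSD}$.
(i) The coefficient of multiple correlation, $R$, the coefficient of multiple determination, ${R}^{2}$ and the coefficient of multiple determination corrected for the degrees of freedom, ${\overline{R}}^{2}$:
$R=\sqrt{1-\frac{SSD}{SST}}$; ${R}^{2}=1-\frac{SSD}{SST}$; ${\overline{R}}^{2}=1-\frac{SSD\times DFT}{SST\times DFD}$.
(j) The standard error of the regression coefficients:
$se\left({b}_{i}\right)=\sqrt{MSD\times {c}_{ii}}$, for $i=1,2,\dots ,k$.
(k) The $t$ values for the regression coefficients:
$t\left({b}_{i}\right)=\frac{{b}_{i}}{se\left({b}_{i}\right)}$, for $i=1,2,\dots ,k$.
(l) The regression constant, $a$, its standard error, $se\left(a\right)$, and its $t$ value, $t\left(a\right)$:
$a=\overline{y}-\sum _{i=1}^{k}{b}_{i}{\overline{x}}_{i}$; $se\left(a\right)=\sqrt{MSD\times \left(\frac{1}{n}+\sum _{i=1}^{k}\sum _{j=1}^{k}{\overline{x}}_{i}{c}_{ij}{\overline{x}}_{j}\right)}$; $t\left(a\right)=\frac{a}{se\left(a\right)}$.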
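The following plain-Python sketch illustrates how quantities of this kind can be obtained from the summary statistics alone. It is not the routine's implementation. The function names are hypothetical and the input numbers are made up. The inversion uses ordinary Gauss-Jordan elimination rather than the routine's method, and the mean squares, $F$ value, standard errors and $t$ values follow the standard least squares definitions, which are assumed here. The modified inverse is formed as ${r}_{ij}/\sqrt{{S}_{ii}{S}_{jj}}$, which equals ${R}_{ij}{r}_{ij}/{S}_{ij}$ whenever ${S}_{ij}\ne 0$ and avoids a $0/0$ otherwise.

```python
import math

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination (partial pivoting)."""
    m = len(a)
    aug = [list(row) + [float(i == j) for j in range(m)]
           for i, row in enumerate(a)]
    for col in range(m):
        piv = max(range(col, m), key=lambda row: abs(aug[row][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for row in range(m):
            if row != col:
                f = aug[row][col]
                aug[row] = [v - f * w for v, w in zip(aug[row], aug[col])]
    return [row[m:] for row in aug]

def regress_from_summary(n, k, xbar, ssp, r):
    """Regress the last variable on the first k, using summary statistics only."""
    rinv = invert([row[:k] for row in r[:k]])   # inverse of the k-by-k partition
    # Modified inverse: rinv_ij / sqrt(S_ii * S_jj), i.e. R_ij * rinv_ij / S_ij.
    c = [[rinv[i][j] / math.sqrt(ssp[i][i] * ssp[j][j]) for j in range(k)]
         for i in range(k)]
    b = [sum(c[i][j] * ssp[j][k] for j in range(k)) for i in range(k)]
    sst = ssp[k][k]
    ssr = sum(b[j] * ssp[j][k] for j in range(k))
    ssd = sst - ssr
    dfr, dfd, dft = k, n - k - 1, n - 1
    msr, msd = ssr / dfr, ssd / dfd
    se_b = [math.sqrt(msd * c[i][i]) for i in range(k)]
    a = xbar[k] - sum(b[i] * xbar[i] for i in range(k))
    quad = sum(xbar[i] * c[i][j] * xbar[j] for i in range(k) for j in range(k))
    se_a = math.sqrt(msd * (1.0 / n + quad))
    return {
        "b": b, "se_b": se_b, "t_b": [bi / si for bi, si in zip(b, se_b)],
        "a": a, "se_a": se_a, "t_a": a / se_a,
        "SSR": ssr, "SSD": ssd, "SST": sst,
        "F": msr / msd, "s": math.sqrt(msd), "R2": 1.0 - ssd / sst,
        "R2_adj": 1.0 - (ssd * dft) / (sst * dfd),
    }

# Made-up summary statistics: n = 5 cases, k = 2 independent variables, y last.
n, k = 5, 2
xbar = [3.0, 3.0, 6.42]
ssp = [[10.0, 8.0, 17.9],
       [8.0, 10.0, 16.5],
       [17.9, 16.5, 33.428]]
r = [[ssp[i][j] / math.sqrt(ssp[i][i] * ssp[j][j]) for j in range(3)]
     for i in range(3)]
fit = regress_from_summary(n, k, xbar, ssp, r)
print(fit["b"], fit["a"])
```

The dictionary keys are illustrative and do not correspond to the layout of the routine's output arrays.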
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
5 Arguments
1: $\mathbf{n}$ – Integer, Input
On entry: the number of cases $n$, used in calculating the sums of squares and cross-products and correlation coefficients.
2: $\mathbf{k1}$ – Integer, Input
On entry: k1 is no longer required by g02cgf but is retained for backwards compatibility.
3: $\mathbf{k}$ – Integer, Input
On entry: $k$, the number of independent variables in the regression.
Constraint:
$1\le {\mathbf{k}}<{\mathbf{n}}-1$.
4: $\mathbf{xbar}\left({\mathbf{k}}+1\right)$ – Real (Kind=nag_wp) array, Input
On entry: ${\mathbf{xbar}}\left(\mathit{i}\right)$ must be set to ${\overline{x}}_{\mathit{i}}$, the mean value of the $\mathit{i}$th variable, for $\mathit{i}=1,2,\dots ,k+1$; the mean of the dependent variable must be contained in ${\mathbf{xbar}}\left(k+1\right)$.
5: $\mathbf{ssp}({\mathbf{ldssp}},{\mathbf{k}}+1)$ – Real (Kind=nag_wp) array, Input
On entry: ${\mathbf{ssp}}(\mathit{i},\mathit{j})$ must be set to ${S}_{\mathit{i}\mathit{j}}$, the sum of cross-products of deviations from means for the $\mathit{i}$th and $\mathit{j}$th variables, for $\mathit{i}=1,2,\dots ,k+1$ and $\mathit{j}=1,2,\dots ,k+1$; terms involving the dependent variable appear in row $k+1$ and column $k+1$.
6: $\mathbf{ldssp}$ – Integer, Input
On entry: the first dimension of the array ssp as declared in the (sub)program from which g02cgf is called.
Constraint:
${\mathbf{ldssp}}\ge {\mathbf{k}}+1$.
7: $\mathbf{r}({\mathbf{ldr}},{\mathbf{k}}+1)$ – Real (Kind=nag_wp) array, Input
On entry: ${\mathbf{r}}(\mathit{i},\mathit{j})$ must be set to ${R}_{\mathit{i}\mathit{j}}$, the Pearson product-moment correlation coefficient for the $\mathit{i}$th and $\mathit{j}$th variables, for $\mathit{i}=1,2,\dots ,k+1$ and $\mathit{j}=1,2,\dots ,k+1$; terms involving the dependent variable appear in row $k+1$ and column $k+1$.
8: $\mathbf{ldr}$ – Integer, Input
On entry: the first dimension of the array r as declared in the (sub)program from which g02cgf is called.
Constraint:
${\mathbf{ldr}}\ge {\mathbf{k}}+1$.
9: $\mathbf{result}\left(13\right)$ – Real (Kind=nag_wp) array, Output
On exit: the following information:
${\mathbf{result}}\left(1\right)$
$SSR$, the sum of squares attributable to the regression;
${\mathbf{result}}\left(2\right)$
$DFR$, the degrees of freedom attributable to the regression;
${\mathbf{result}}\left(3\right)$
$MSR$, the mean square attributable to the regression;
${\mathbf{result}}\left(4\right)$
$F$, the $F$ value for the analysis of variance;
${\mathbf{result}}\left(5\right)$
$SSD$, the sum of squares of deviations about the regression;
${\mathbf{result}}\left(6\right)$
$DFD$, the degrees of freedom of deviations about the regression;
${\mathbf{result}}\left(7\right)$
$MSD$, the mean square of deviations about the regression;
${\mathbf{result}}\left(8\right)$
$SST$, the total sum of squares;
${\mathbf{result}}\left(9\right)$
$DFT$, the total degrees of freedom;
${\mathbf{result}}\left(10\right)$
$s$, the standard error estimate;
${\mathbf{result}}\left(11\right)$
$R$, the coefficient of multiple correlation;
${\mathbf{result}}\left(12\right)$
${R}^{2}$, the coefficient of multiple determination;
${\mathbf{result}}\left(13\right)$
${\overline{R}}^{2}$, the coefficient of multiple determination corrected for the degrees of freedom.
10: $\mathbf{coef}({\mathbf{ldcoef}},3)$ – Real (Kind=nag_wp) array, Output
On exit: for $i=1,2,\dots ,k$, the following information:
${\mathbf{coef}}(i,1)$
${b}_{i}$, the regression coefficient for the $i$th variable.
${\mathbf{coef}}(i,2)$
$se\left({b}_{i}\right)$, the standard error of the regression coefficient for the $i$th variable.
${\mathbf{coef}}(i,3)$
$t\left({b}_{i}\right)$, the $t$ value of the regression coefficient for the $i$th variable.
11: $\mathbf{ldcoef}$ – Integer, Input
On entry: the first dimension of the array coef as declared in the (sub)program from which g02cgf is called.
Constraint:
${\mathbf{ldcoef}}\ge {\mathbf{k}}$.
12: $\mathbf{con}\left(3\right)$ – Real (Kind=nag_wp) array, Output
On exit: the following information:
${\mathbf{con}}\left(1\right)$
$a$, the regression constant;
${\mathbf{con}}\left(2\right)$
$se\left(a\right)$, the standard error of the regression constant;
${\mathbf{con}}\left(3\right)$
$t\left(a\right)$, the $t$ value for the regression constant.
13: $\mathbf{rinv}({\mathbf{ldrinv}},{\mathbf{k}})$ – Real (Kind=nag_wp) array, Output
On exit: the inverse of the matrix of correlation coefficients for the independent variables; that is, the inverse of the matrix consisting of the first $k$ rows and columns of r.
14: $\mathbf{ldrinv}$ – Integer, Input
On entry: the first dimension of the array rinv as declared in the (sub)program from which g02cgf is called.
Constraint:
${\mathbf{ldrinv}}\ge {\mathbf{k}}$.
15: $\mathbf{c}({\mathbf{ldc}},{\mathbf{k}})$ – Real (Kind=nag_wp) array, Output
On exit: the modified inverse matrix, where
${\mathbf{c}}(\mathit{i},\mathit{j})={\mathbf{r}}(\mathit{i},\mathit{j})\times {\mathbf{rinv}}(\mathit{i},\mathit{j})/{\mathbf{ssp}}(\mathit{i},\mathit{j})$, for $\mathit{i}=1,2,\dots ,k$ and $\mathit{j}=1,2,\dots ,k$.
16: $\mathbf{ldc}$ – Integer, Input
On entry: the first dimension of the array c as declared in the (sub)program from which g02cgf is called.
Constraint:
${\mathbf{ldc}}\ge {\mathbf{k}}$.
17: $\mathbf{wkz}({\mathbf{ldwkz}},{\mathbf{k}})$ – Real (Kind=nag_wp) array, Workspace
18: $\mathbf{ldwkz}$ – Integer, Input
On entry: the first dimension of the array wkz as declared in the (sub)program from which g02cgf is called.
Constraint:
${\mathbf{ldwkz}}\ge {\mathbf{k}}$.
19: $\mathbf{ifail}$ – Integer, Input/Output
On entry: ifail must be set to $0$, $-1$ or $1$ to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of $0$ causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of $-1$ means that an error message is printed while a value of $1$ means that it is not.
If halting is not appropriate, the value $-1$ or $1$ is recommended. If message printing is undesirable, then the value $1$ is recommended. Otherwise, the value $0$ is recommended. When the value $-\mathbf{1}$ or $\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit: ${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see Section 6).
6 Error Indicators and Warnings
If on entry ${\mathbf{ifail}}=0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
${\mathbf{ifail}}=1$
On entry, ${\mathbf{k}}=\langle \mathit{\text{value}}\rangle $.
Constraint: ${\mathbf{k}}\ge 1$.
${\mathbf{ifail}}=3$
On entry, ${\mathbf{n}}=\langle \mathit{\text{value}}\rangle $ and ${\mathbf{k}}=\langle \mathit{\text{value}}\rangle $.
Constraint: ${\mathbf{n}}>{\mathbf{k}}+1$.
${\mathbf{ifail}}=4$
On entry, ${\mathbf{ldc}}=\langle \mathit{\text{value}}\rangle $ and ${\mathbf{k}}=\langle \mathit{\text{value}}\rangle $.
Constraint: ${\mathbf{ldc}}\ge {\mathbf{k}}$.
On entry, ${\mathbf{ldcoef}}=\langle \mathit{\text{value}}\rangle $ and ${\mathbf{k}}=\langle \mathit{\text{value}}\rangle $.
Constraint: ${\mathbf{ldcoef}}\ge {\mathbf{k}}$.
On entry, ${\mathbf{ldr}}=\langle \mathit{\text{value}}\rangle $ and ${\mathbf{k}}=\langle \mathit{\text{value}}\rangle $.
Constraint: ${\mathbf{ldr}}\ge {\mathbf{k}}+1$.
On entry, ${\mathbf{ldrinv}}=\langle \mathit{\text{value}}\rangle $ and ${\mathbf{k}}=\langle \mathit{\text{value}}\rangle $.
Constraint: ${\mathbf{ldrinv}}\ge {\mathbf{k}}$.
On entry, ${\mathbf{ldssp}}=\langle \mathit{\text{value}}\rangle $ and ${\mathbf{k}}=\langle \mathit{\text{value}}\rangle $.
Constraint: ${\mathbf{ldssp}}\ge {\mathbf{k}}+1$.
On entry, ${\mathbf{ldwkz}}=\langle \mathit{\text{value}}\rangle $ and ${\mathbf{k}}=\langle \mathit{\text{value}}\rangle $.
Constraint: ${\mathbf{ldwkz}}\ge {\mathbf{k}}$.
${\mathbf{ifail}}=5$
The $k\times k$ partition of r which requires inversion is not positive definite.
${\mathbf{ifail}}=6$
The refinement following inversion has failed.
This indicates that the $k\times k$ partition of the matrix $R$, which is to be inverted, is ill-conditioned. The use of g02daf, which employs a different numerical technique, may avoid this difficulty (an extra ‘variable’ representing the constant term must be introduced for g02daf).
${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.
7 Accuracy
The accuracy of g02cgf is almost entirely dependent on the accuracy of the matrix inversion method used. As g02cgf works with the matrix of correlation coefficients rather than that of the sums of squares and cross-products of deviations from means, all terms in the matrix being inverted are of a similar order and, therefore, the scope for computational error is reduced. An alternative, and potentially more numerically reliable, routine is g02daf. g02daf works directly with the data matrix and, therefore, avoids explicitly performing a matrix inversion. However, g02daf does not handle missing values, nor does it provide the same output as this routine.
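The link between the two matrices can be made explicit. The short derivation below is a sketch: the diagonal matrix $D$ is notation introduced here, and $S$ and $R$ denote only the $k\times k$ partitions for the independent variables. It shows why inverting the correlation partition is enough to recover the inverse of the sums of squares and cross-products partition.

$$D=\mathrm{diag}\left(\sqrt{{S}_{11}},\sqrt{{S}_{22}},\dots ,\sqrt{{S}_{kk}}\right)\text{, }\quad {R}_{ij}=\frac{{S}_{ij}}{\sqrt{{S}_{ii}{S}_{jj}}}\quad \Rightarrow \quad S=DRD\text{,}$$
$${S}^{-1}={D}^{-1}{R}^{-1}{D}^{-1}\quad \Rightarrow \quad {\left({S}^{-1}\right)}_{ij}=\frac{{r}_{ij}}{\sqrt{{S}_{ii}{S}_{jj}}}=\frac{{R}_{ij}{r}_{ij}}{{S}_{ij}}\quad \left({S}_{ij}\ne 0\right)\text{,}$$

where ${r}_{ij}$ is the $(i,j)$th element of ${R}^{-1}$. The right-hand side has the same form as the modified inverse matrix returned in c. Every diagonal element of $R$ is $1$ and every off-diagonal element lies in $[-1,1]$, whereas the elements of $S$ scale with the units of the variables.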
If, in calculating $F$, $t\left(a\right)$, or any of the $t\left({b}_{i}\right)$ (see Section 3), the numbers involved are such that the result would be outside the range of numbers which can be stored by the machine, then the answer is set to the largest quantity which can be stored as a real variable, by means of a call to x02alf.
8 Parallelism and Performance
g02cgf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
g02cgf makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.
Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.
9 Further Comments
The time taken by g02cgf depends on $k$.
This routine assumes that the matrix of correlation coefficients for the independent variables in the regression is positive definite; it fails if this is not the case.
This correlation matrix will in fact be positive definite whenever the correlation matrix and the sums of squares and cross-products (of deviations from means) matrix have been formed either without regard to missing values, or by eliminating completely any cases involving missing values, for any variable. If, however, these matrices are formed by eliminating cases with missing values from only those calculations involving the variables for which the values are missing, no such statement can be made, and the correlation matrix may or may not be positive definite. You should be aware of the possible dangers of using correlation matrices formed in this way (see the G02 Chapter Introduction), but if you nevertheless wish to carry out regression using such matrices, this routine is capable of handling the inversion of such matrices provided they are positive definite.
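Positive definiteness can be checked before the routine is called. The sketch below attempts a Cholesky factorization in plain Python. The function name `is_positive_definite` and both matrices are hypothetical, and the second matrix is a made-up example of the kind of inconsistent correlation matrix that pairwise elimination of missing values could produce. This is not the test the routine itself applies.

```python
import math

def is_positive_definite(a):
    """Attempt a Cholesky factorization; succeed only if every pivot is > 0."""
    m = len(a)
    low = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][p] * low[j][p] for p in range(j))
            if i == j:
                if s <= 0.0:
                    return False        # non-positive pivot: not positive definite
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True

# Made-up correlation matrix for two independent variables.
consistent = [[1.0, 0.8],
              [0.8, 1.0]]
# Made-up matrix whose entries are each valid correlations but are
# mutually inconsistent, so the matrix is not positive definite.
pairwise = [[1.0, 0.9, -0.9],
            [0.9, 1.0, 0.9],
            [-0.9, 0.9, 1.0]]

print(is_positive_definite(consistent))
print(is_positive_definite(pairwise))
```

A strict test for exact positivity is used here; in floating-point arithmetic a nearly singular matrix may pass or fail depending on rounding.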
If a matrix is positive definite, its subsequent re-organisation by either g02cef or g02cff will not affect this property, and the new matrix can safely be used in this routine. Thus correlation matrices produced by any of g02baf, g02bbf, g02bgf or g02bhf, even if subsequently modified by either g02cef or g02cff, can be handled by this routine.
It should be noted that in forming the sums of squares and cross-products matrix and the correlation matrix a column of constants should not be added to the data as an additional ‘variable’ in order to obtain a constant term in the regression. This routine automatically calculates the regression constant, $a$, and any attempt to insert such a ‘dummy variable’ is likely to cause the routine to fail.
It should also be noted that the routine requires the dependent variable to be the last of the $k+1$ variables whose statistics are provided as input to the routine. If this variable is not correctly positioned in the original data, the means, standard deviations, sums of squares and cross-products of deviations from means, and correlation coefficients can be manipulated by using g02cef or g02cff to reorder the variables as necessary.
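The reordering amounts to applying the same permutation to the vector of means and to the rows and columns of both matrices. The following plain-Python sketch shows the idea only. The helper `move_dependent_last` is hypothetical, the numbers are made up, and it does not reproduce the interfaces of the library's reordering routines.

```python
def move_dependent_last(dep, means, ssp, corr):
    """Reorder statistics so that variable `dep` (0-based index) comes last."""
    p = len(means)
    order = [i for i in range(p) if i != dep] + [dep]

    def permute(m):
        # Apply the same permutation to rows and columns.
        return [[m[i][j] for j in order] for i in order]

    return [means[i] for i in order], permute(ssp), permute(corr)

# Made-up statistics with the dependent variable stored first.
means = [6.42, 3.0, 3.0]
ssp = [[33.428, 17.9, 16.5],
       [17.9, 10.0, 8.0],
       [16.5, 8.0, 10.0]]
corr = [[ssp[i][j] / (ssp[i][i] * ssp[j][j]) ** 0.5 for j in range(3)]
        for i in range(3)]

new_means, new_ssp, new_corr = move_dependent_last(0, means, ssp, corr)
```

After the call, the terms involving the dependent variable sit in the last row and column, which is the layout the routine expects.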
10 Example
This example reads in the means, sums of squares and cross-products of deviations from means, and correlation coefficients for three variables. A multiple linear regression is then performed with the third and final variable as the dependent variable. Finally the results are printed.