# NAG Library Function Document

## 1 Purpose

nag_regsn_mult_linear (g02dac) performs a general multiple linear regression when the independent variables may be linearly dependent. Parameter estimates, standard errors, residuals and influence statistics are computed. nag_regsn_mult_linear (g02dac) may be used to perform a weighted regression.

## 2 Specification

```c
#include <nag.h>
#include <nagg02.h>

void nag_regsn_mult_linear (Nag_IncludeMean mean, Integer n, const double x[],
             Integer tdx, Integer m, const Integer sx[], Integer ip,
             const double y[], const double wt[], double *rss, double *df,
             double b[], double se[], double cov[], double res[], double h[],
             double q[], Integer tdq, Nag_Boolean *svd, Integer *rank,
             double p[], double tol, double com_ar[], NagError *fail)
```

## 3 Description

The general linear regression model is defined by
 $y = X\beta + \epsilon$
where
• $y$ is a vector of $n$ observations on the dependent variable,
• $X$ is an $n$ by $p$ matrix of the independent variables of column rank $k$,
• $\beta$ is a vector of length $p$ of unknown arguments, and
• $\epsilon$ is a vector of length $n$ of unknown random errors such that $\mathrm{var}\epsilon =V{\sigma }^{2}$, where $V$ is a known diagonal matrix.
Note: the $p$ independent variables may be selected from a set of $m$ potential independent variables.
If $V=I$, the identity matrix, then least squares estimation is used.
If $V\ne I$, then for a given weight matrix $W\propto {V}^{-1}$, weighted least squares estimation is used.
The least squares estimates $\stackrel{^}{\beta }$ of the arguments $\beta$ minimize ${\left(y-X\beta \right)}^{\mathrm{T}}\left(y-X\beta \right)$ while the weighted least squares estimates minimize ${\left(y-X\beta \right)}^{\mathrm{T}}W\left(y-X\beta \right)$.
nag_regsn_mult_linear (g02dac) finds a $QR$ decomposition of $X$ (or ${W}^{1/2}X$ in the weighted case), i.e.,
 $X = QR^* \quad \text{or} \quad W^{1/2} X = QR^*$
where ${R}^{*}=\left(\begin{array}{c}R\\ 0\end{array}\right)$ and $R$ is a $p$ by $p$ upper triangular matrix and $Q$ is an $n$ by $n$ orthogonal matrix.
If $R$ is of full rank, then $\stackrel{^}{\beta }$ is the solution to
 $R \hat{\beta} = c_1$
where $c={Q}^{\mathrm{T}}y$ (or ${Q}^{\mathrm{T}}{W}^{1/2}y$) and ${c}_{1}$ is the first $p$ elements of $c$.
If $R$ is not of full rank a solution is obtained by means of a singular value decomposition (SVD) of $R$,
 $R = Q_* \left(\begin{array}{cc} D & 0 \\ 0 & 0 \end{array}\right) P^{\mathrm{T}}$
where $D$ is a $k$ by $k$ diagonal matrix with nonzero diagonal elements, $k$ being the rank of $R$ and ${Q}_{*}$ and $P$ are $p$ by $p$ orthogonal matrices. This gives the solution
 $\hat{\beta} = P_1 D^{-1} Q_{*_1}^{\mathrm{T}} c_1$
${P}_{1}$ being the first $k$ columns of $P$, i.e., $P=\left({P}_{1}{P}_{0}\right)$ and ${Q}_{{*}_{1}}$ being the first $k$ columns of ${Q}_{*}$.
Details of the SVD are made available, in the form of the matrix ${P}^{*}$:
 $P^* = \left(\begin{array}{c} D^{-1} P_1^{\mathrm{T}} \\ P_0^{\mathrm{T}} \end{array}\right) .$
This will be only one of the possible solutions. Other estimates may be obtained by applying constraints to the arguments. These solutions can be obtained by using nag_regsn_mult_linear_tran_model (g02dkc) after using nag_regsn_mult_linear (g02dac). Only certain linear combinations of the arguments will have unique estimates; these are known as estimable functions.
The fit of the model can be examined by considering the residuals, ${r}_{i}={y}_{i}-{\hat{y}}_{i}$, where $\hat{y}=X\hat{\beta}$ are the fitted values. The fitted values can be written as $Hy$ for an $n$ by $n$ matrix $H$. The $i$th diagonal element of $H$, ${h}_{i}$, gives a measure of the influence of the $i$th value of the independent variables on the fitted regression model. The values ${h}_{i}$ are sometimes known as leverages. Both ${r}_{i}$ and ${h}_{i}$ are provided by nag_regsn_mult_linear (g02dac).
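For intuition about the leverages, in the special case of a single predictor plus an intercept, ${h}_{i}$ has a well-known closed form, $h_i = 1/n + {\left(x_i - \bar{x}\right)}^{2}/\sum_j{\left(x_j - \bar{x}\right)}^{2}$; observations far from the mean of $x$ have high leverage. A self-contained sketch (illustrative only; for the general model use the h array returned by g02dac):

```c
/* Leverage of observation i in a simple one-predictor regression with
   intercept: h_i = 1/n + (x_i - xbar)^2 / Sxx. */
double simple_leverage(const double x[], int n, int i)
{
    double xbar = 0.0, sxx = 0.0;
    for (int k = 0; k < n; k++) xbar += x[k];
    xbar /= n;
    for (int k = 0; k < n; k++) sxx += (x[k] - xbar) * (x[k] - xbar);
    return 1.0 / n + (x[i] - xbar) * (x[i] - xbar) / sxx;
}
```

The leverages always sum to the number of fitted parameters, which gives a quick sanity check on any computed $H$.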
The output of nag_regsn_mult_linear (g02dac) also includes $\stackrel{^}{\beta }$, the residual sum of squares and associated degrees of freedom, $\left(n-k\right)$, the standard errors of the parameter estimates and the variance-covariance matrix of the parameter estimates.
In many linear regression models the first term is taken as a mean term or an intercept, i.e., ${X}_{\mathit{i},1}=1$, for $\mathit{i}=1,2,\dots ,n$. This is provided as an option. Also note that not all the potential independent variables need to be included in a model; a facility to select variables to be included in the model is provided.
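The mean option and the variable-selection facility together determine which columns enter the model. A minimal sketch of how a design matrix could be assembled under those conventions (illustrative only; g02dac performs this selection internally from x, sx and mean):

```c
/* Build an n-by-ip design matrix from the n-by-m data matrix x
   (row-major, stride tdx): an optional leading column of ones for the
   mean term, followed by the columns j of x with sx[j] > 0, in order.
   design must have room for n rows of ip entries, where ip is the
   count of selected columns plus one if include_mean is nonzero. */
void build_design_matrix(int n, int m, const double x[], int tdx,
                         const int sx[], int include_mean,
                         double design[])
{
    int ip = include_mean ? 1 : 0;
    for (int j = 0; j < m; j++)
        if (sx[j] > 0) ip++;
    for (int i = 0; i < n; i++) {
        int col = 0;
        if (include_mean)
            design[i * ip + col++] = 1.0;   /* intercept column of ones */
        for (int j = 0; j < m; j++)
            if (sx[j] > 0)
                design[i * ip + col++] = x[i * tdx + j];
    }
}
```

This also shows why ip counts the intercept: the column of ones is a parameter of the model like any other.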
Details of the $QR$ decomposition and, if used, the SVD, are made available. These allow the regression to be updated by adding or deleting an observation using nag_regsn_mult_linear_addrem_obs (g02dcc), adding or deleting a variable using nag_regsn_mult_linear_add_var (g02dec) and nag_regsn_mult_linear_delete_var (g02dfc) or estimating and testing an estimable function using nag_regsn_mult_linear_est_func (g02dnc).

## 4 References

Cook R D and Weisberg S (1982) Residuals and Influence in Regression Chapman and Hall
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
Golub G H and Van Loan C F (1996) Matrix Computations (3rd Edition) Johns Hopkins University Press, Baltimore
Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20(3) 2–25
McCullagh P and Nelder J A (1983) Generalized Linear Models Chapman and Hall
Searle S R (1971) Linear Models Wiley

## 5 Arguments

1:    $\mathbf{mean}$Nag_IncludeMeanInput
On entry: indicates if a mean term is to be included.
${\mathbf{mean}}=\mathrm{Nag_MeanInclude}$
A mean term, (intercept), will be included in the model.
${\mathbf{mean}}=\mathrm{Nag_MeanZero}$
The model will pass through the origin (zero point).
Constraint: ${\mathbf{mean}}=\mathrm{Nag_MeanInclude}$ or $\mathrm{Nag_MeanZero}$.
2:    $\mathbf{n}$IntegerInput
On entry: the number of observations, $n$.
Constraint: ${\mathbf{n}}\ge 2$.
3:    $\mathbf{x}\left[{\mathbf{n}}×{\mathbf{tdx}}\right]$const doubleInput
On entry: ${\mathbf{x}}\left[\mathit{i}×{\mathbf{tdx}}+\mathit{j}\right]$ must contain the $\mathit{i}$th observation for the $\mathit{j}$th potential independent variable, for $\mathit{i}=0,1,\dots ,n-1$ and $\mathit{j}=0,1,\dots ,m-1$.
4:    $\mathbf{tdx}$IntegerInput
On entry: the stride separating matrix column elements in the array x.
Constraint: ${\mathbf{tdx}}\ge {\mathbf{m}}$.
5:    $\mathbf{m}$IntegerInput
On entry: the total number of independent variables in the dataset, $m$.
Constraint: ${\mathbf{m}}\ge 1$.
6:    $\mathbf{sx}\left[{\mathbf{m}}\right]$const IntegerInput
On entry: indicates which of the potential independent variables are to be included in the model. If ${\mathbf{sx}}\left[j\right]>0$, then the variable contained in the corresponding column of x is included in the regression model.
Constraints:
• ${\mathbf{sx}}\left[\mathit{j}\right]\ge 0$, for $\mathit{j}=0,1,\dots ,m-1$;
• if ${\mathbf{mean}}=\mathrm{Nag_MeanInclude}$, then exactly ${\mathbf{ip}}-1$ values of sx must be $>0$;
• if ${\mathbf{mean}}=\mathrm{Nag_MeanZero}$, then exactly ip values of sx must be $>0$.
7:    $\mathbf{ip}$IntegerInput
On entry: the number $p$ of independent variables in the model, including the mean or intercept if present.
Constraints:
• if ${\mathbf{mean}}=\mathrm{Nag_MeanInclude}$, $1\le {\mathbf{ip}}\le {\mathbf{m}}+1$;
• if ${\mathbf{mean}}=\mathrm{Nag_MeanZero}$, $1\le {\mathbf{ip}}\le {\mathbf{m}}$.
8:    $\mathbf{y}\left[{\mathbf{n}}\right]$const doubleInput
On entry: observations on the dependent variable, $y$.
9:    $\mathbf{wt}\left[{\mathbf{n}}\right]$const doubleInput
On entry: optionally, the weights to be used in the weighted regression.
If ${\mathbf{wt}}\left[i-1\right]=0.0$, then the $i$th observation is not included in the model, in which case the effective number of observations is the number of observations with nonzero weights. The values of res and h will be set to zero for observations with zero weights.
If weights are not provided then wt must be set to NULL and the effective number of observations is n.
Constraint: if ${\mathbf{wt}}\phantom{\rule{0.25em}{0ex}}\text{is not}\phantom{\rule{0.25em}{0ex}}\mathbf{NULL}$, ${\mathbf{wt}}\left[\mathit{i}-1\right]\ge 0.0$, for $\mathit{i}=1,2,\dots ,n$.
10:  $\mathbf{rss}$double *Output
On exit: the residual sum of squares for the regression.
11:  $\mathbf{df}$double *Output
On exit: the degrees of freedom associated with the residual sum of squares.
12:  $\mathbf{b}\left[{\mathbf{ip}}\right]$doubleOutput
On exit: ${\mathbf{b}}\left[\mathit{i}\right]$, for $\mathit{i}=0,1,\dots ,{\mathbf{ip}}-1$, contain the least squares estimates of the arguments of the regression model, $\stackrel{^}{\beta }$.
If ${\mathbf{mean}}=\mathrm{Nag_MeanInclude}$, then ${\mathbf{b}}\left[0\right]$ will contain the estimate of the mean argument and ${\mathbf{b}}\left[i\right]$ will contain the coefficient of the variable contained in column $j$ of x, where ${\mathbf{sx}}\left[j\right]$ is the $i$th positive value in the array sx.
If ${\mathbf{mean}}=\mathrm{Nag_MeanZero}$, then ${\mathbf{b}}\left[i-1\right]$ will contain the coefficient of the variable contained in column $j$ of x, where ${\mathbf{sx}}\left[j\right]$ is the $i$th positive value in the array sx.
13:  $\mathbf{se}\left[{\mathbf{ip}}\right]$doubleOutput
On exit: ${\mathbf{se}}\left[\mathit{i}\right]$, for $\mathit{i}=0,1,\dots ,{\mathbf{ip}}-1$, contains the standard errors of the ip parameter estimates given in b.
14:  $\mathbf{cov}\left[{\mathbf{ip}}×\left({\mathbf{ip}}+1\right)/2\right]$doubleOutput
On exit: the first ${\mathbf{ip}}×\left({\mathbf{ip}}+1\right)/2$ elements of cov contain the upper triangular part of the variance-covariance matrix of the ip parameter estimates given in b. They are stored packed by column, i.e., the covariance between the parameter estimate given in ${\mathbf{b}}\left[i\right]$ and the parameter estimate given in ${\mathbf{b}}\left[\mathit{j}\right]$, $\mathit{j}\ge \mathit{i}$, is stored in ${\mathbf{cov}}\left[\mathit{j}\left(\mathit{j}+1\right)/2+\mathit{i}\right]$, for $\mathit{i}=0,1,\dots ,{\mathbf{ip}}-1$ and $\mathit{j}=i,\dots ,{\mathbf{ip}}-1$.
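The packed-by-column storage of cov is easy to get wrong, so an index helper is worth spelling out. A sketch of the indexing rule stated above (illustrative only, not part of the NAG interface):

```c
/* Return the covariance of the parameter estimates b[i] and b[j] from
   the packed upper-triangular array cov produced by g02dac, where the
   (i, j) entry with j >= i is stored at cov[j*(j+1)/2 + i]. */
double cov_entry(const double cov[], int i, int j)
{
    if (i > j) { int t = i; i = j; j = t; }  /* covariance is symmetric */
    return cov[j * (j + 1) / 2 + i];
}
```

For ip = 3 the packed order is (0,0), (0,1), (1,1), (0,2), (1,2), (2,2); the diagonal entries cov_entry(cov, i, i) are the squared standard errors in se.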
15:  $\mathbf{res}\left[{\mathbf{n}}\right]$doubleOutput
On exit: the (weighted) residuals, ${r}_{i}$.
16:  $\mathbf{h}\left[{\mathbf{n}}\right]$doubleOutput
On exit: the diagonal elements of $H$, ${h}_{i}$, the leverages.
17:  $\mathbf{q}\left[{\mathbf{n}}×{\mathbf{tdq}}\right]$doubleOutput
Note: the $\left(i,j\right)$th element of the matrix $Q$ is stored in ${\mathbf{q}}\left[\left(i-1\right)×{\mathbf{tdq}}+j-1\right]$.
On exit: the results of the $QR$ decomposition: the first column of q contains $c$, the upper triangular part of columns 2 to ${\mathbf{ip}}+1$ contain the $R$ matrix, the strictly lower triangular part of columns 2 to ${\mathbf{ip}}+1$ contain details of the $Q$ matrix.
18:  $\mathbf{tdq}$IntegerInput
On entry: the stride separating matrix column elements in the array q.
Constraint: ${\mathbf{tdq}}\ge {\mathbf{ip}}+1$.
19:  $\mathbf{svd}$Nag_Boolean *Output
On exit: if a singular value decomposition has been performed then svd will be Nag_TRUE, otherwise svd will be Nag_FALSE.
20:  $\mathbf{rank}$Integer *Output
On exit: the rank of the independent variables.
If ${\mathbf{svd}}=\mathrm{Nag_FALSE}$, ${\mathbf{rank}}={\mathbf{ip}}$.
If ${\mathbf{svd}}=\mathrm{Nag_TRUE}$, rank is an estimate of the rank of the independent variables. rank is calculated as the number of singular values greater than ${\mathbf{tol}}×\left(\text{largest singular value}\right)$. It is possible for the SVD to be carried out but rank to be returned as ip.
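The rank criterion just described is a simple threshold count over the singular values. A self-contained sketch (illustrative only; g02dac applies this internally):

```c
/* Estimate the rank of a matrix from its p singular values sv[],
   counting those greater than tol times the largest singular value,
   as described for the rank argument.  sv entries are non-negative. */
int estimate_rank(const double sv[], int p, double tol)
{
    double smax = 0.0;
    for (int i = 0; i < p; i++)
        if (sv[i] > smax) smax = sv[i];
    int rank = 0;
    for (int i = 0; i < p; i++)
        if (sv[i] > tol * smax) rank++;
    return rank;
}
```

With a relative threshold like this, rescaling all the independent variables by a common factor does not change the estimated rank.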
21:  $\mathbf{p}\left[2×{\mathbf{ip}}+{\mathbf{ip}}×{\mathbf{ip}}\right]$doubleOutput
On exit: details of the $QR$ decomposition and SVD if used.
If ${\mathbf{svd}}=\mathrm{Nag_FALSE}$, only the first ip elements of p are used, these will contain details of the Householder vector in the $QR$ decomposition (see Sections 2.2.1 and 3.3.6 in the f08 Chapter Introduction).
If ${\mathbf{svd}}=\mathrm{Nag_TRUE}$, the first ip elements of p will contain details of the Householder vector in the $QR$ decomposition and the next ip elements of p contain singular values. The following ip by ip elements contain the matrix ${P}^{*}$ stored by rows.
22:  $\mathbf{tol}$doubleInput
On entry: the value of tol is used to decide the rank of the independent variables. The smaller the value of tol, the stricter the criterion for selecting the singular value decomposition. If ${\mathbf{tol}}=0.0$, the singular value decomposition will never be used; this may cause run-time errors or inaccurate results if the independent variables are not of full rank.
Suggested value: ${\mathbf{tol}}=0.000001$.
Constraint: ${\mathbf{tol}}\ge 0.0$.
23:  $\mathbf{com_ar}\left[{\mathbf{ip}}+{\mathbf{ip}}×{\mathbf{ip}}\right]$doubleOutput
On exit: if ${\mathbf{svd}}=\mathrm{Nag_TRUE}$, com_ar contains information which is needed by nag_regsn_mult_linear_newyvar (g02dgc).
24:  $\mathbf{fail}$NagError *Input/Output
The NAG error argument (see Section 3.7 in How to Use the NAG Library and its Documentation).

## 6 Error Indicators and Warnings

NE_2_INT_ARG_LT
On entry, ${\mathbf{n}}=〈\mathit{\text{value}}〉$ while ${\mathbf{ip}}=〈\mathit{\text{value}}〉$. These arguments must satisfy ${\mathbf{n}}\ge {\mathbf{ip}}$.
On entry, ${\mathbf{tdq}}=〈\mathit{\text{value}}〉$ while ${\mathbf{ip}}+1=〈\mathit{\text{value}}〉$. These arguments must satisfy ${\mathbf{tdq}}\ge {\mathbf{ip}}+1$.
On entry, ${\mathbf{tdx}}=〈\mathit{\text{value}}〉$ while ${\mathbf{m}}=〈\mathit{\text{value}}〉$. These arguments must satisfy ${\mathbf{tdx}}\ge {\mathbf{m}}$.
NE_ALLOC_FAIL
Dynamic memory allocation failed.
NE_BAD_PARAM
On entry, argument mean had an illegal value.
NE_IP_INCOMP_SX
Either a value of sx is $<0$, or ip is incompatible with mean and sx, or ${\mathbf{ip}}>$ the effective number of observations.
NE_INT_ARG_LT
On entry, ${\mathbf{ip}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{ip}}\ge 1$.
On entry, ${\mathbf{m}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{m}}\ge 1$.
On entry, ${\mathbf{n}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{n}}\ge 2$.
On entry, ${\mathbf{sx}}\left[〈\mathit{\text{value}}〉\right]$ must not be less than 0: ${\mathbf{sx}}\left[〈\mathit{\text{value}}〉\right]=〈\mathit{\text{value}}〉$.
NE_REAL_ARG_LT
On entry, tol must not be less than 0.0: ${\mathbf{tol}}=〈\mathit{\text{value}}〉$.
On entry, ${\mathbf{wt}}\left[〈\mathit{\text{value}}〉\right]$ must not be less than 0.0: ${\mathbf{wt}}\left[〈\mathit{\text{value}}〉\right]=〈\mathit{\text{value}}〉$.
NE_SVD_NOT_CONV
The singular value decomposition has failed to converge.
NE_ZERO_DOF_RESID
The degrees of freedom for the residuals are zero, i.e., the number of arguments in the model equals the effective number of observations. In this case the parameter estimates will be returned along with the diagonal elements of $H$, but neither standard errors nor the variance-covariance matrix will be calculated.

## 7 Accuracy

The accuracy of this function is closely related to the accuracy of the $QR$ decomposition.

## 8 Parallelism and Performance

nag_regsn_mult_linear (g02dac) is not threaded in any implementation.

## 9 Further Comments

Function nag_regsn_std_resid_influence (g02fac) can be used to compute standardized residuals and further measures of influence; it requires, in particular, the results stored in res and h by nag_regsn_mult_linear (g02dac).

### 9.1 Internal Changes

Internal changes have been made to this function as follows:
• At Mark 26.1: the documented minimum length of the array argument com_ar was too large. It was given as ${\mathbf{ip}}×{\mathbf{ip}}×5×\left({\mathbf{ip}}-1\right)$, but the actual minimum length is ${\mathbf{ip}}×{\mathbf{ip}}+{\mathbf{ip}}$, which is much smaller for non-trivial cases (${\mathbf{ip}}>1$).
In addition, the example programs that called nag_regsn_mult_linear (g02dac) allocated lengths of ${\mathbf{ip}}×{\mathbf{ip}}+5×\left({\mathbf{ip}}-1\right)$ for this argument, which was also larger than necessary for non-trivial problems.
This document now states the actual minimum length requirement for com_ar, and the example programs that call nag_regsn_mult_linear (g02dac) have been updated to allocate only that length.
For details of all known issues which have been reported for the NAG Library please refer to the Known Issues list.

## 10 Example

For this function two examples are presented. There is a single example program for nag_regsn_mult_linear (g02dac), consisting of a main program; the code that solves the two example problems is given in the functions ex1 and ex2.
Example 1 (ex1)
Data from an experiment with four treatments and three observations per treatment are read in. The treatments are represented by dummy $\left(0-1\right)$ variables. An unweighted model is fitted with a mean included in the model.
Example 2 (ex2)
This example program uses nag_regsn_mult_linear (g02dac) to find the coefficients of the degree-$n$ polynomial
 $p(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0$
that fits the data points $\left(x_i, y_i\right)$ in a least squares sense.
In this example nag_regsn_mult_linear (g02dac) is called with both ${\mathbf{mean}}=\mathrm{Nag_MeanInclude}$ and ${\mathbf{mean}}=\mathrm{Nag_MeanZero}$. The polynomial degree, the number of data points and the tolerance can be modified using the example data file.
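Polynomial fitting is ordinary multiple regression on the powers of $x$: each observation contributes a row $\left(1, x_i, x_i^2, \dots, x_i^n\right)$ to the design matrix, with the leading column of ones supplied by the mean term when ${\mathbf{mean}}=\mathrm{Nag_MeanInclude}$. A sketch of building one such row (illustrative only; the example program's actual code is in g02dace.c):

```c
/* Fill one row of the polynomial design matrix with the powers
   x^0, x^1, ..., x^deg.  row must have room for deg + 1 entries. */
void poly_row(double row[], double x, int deg)
{
    row[0] = 1.0;
    for (int j = 1; j <= deg; j++)
        row[j] = row[j - 1] * x;   /* x^j from x^(j-1), avoiding pow() */
}
```

For closely spaced or large $x$ values such a matrix can be nearly rank-deficient, which is exactly the situation the tol argument and the SVD path of g02dac are designed to handle.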

### 10.1 Program Text

Program Text (g02dace.c)

### 10.2 Program Data

Program Data (g02dace.d)

### 10.3 Program Results

Program Results (g02dace.r)

© The Numerical Algorithms Group Ltd, Oxford, UK. 2017