NAG Library Function Document
nag_regsn_ridge_opt (g02kac)
1 Purpose
nag_regsn_ridge_opt (g02kac) calculates a ridge regression, optimizing the ridge parameter according to one of four prediction error criteria.
2 Specification
#include <nag.h>
#include <nagg02.h>

void nag_regsn_ridge_opt (Nag_OrderType order, Integer n, Integer m,
        const double x[], Integer pdx, const Integer isx[], Integer ip,
        double tau, const double y[], double *h, Nag_PredictError opt,
        Integer *niter, double tol, double *nep, Nag_EstimatesOption orig,
        double b[], double vif[], double res[], double *rss, Integer *df,
        Nag_OptionLOO optloo, double perr[], NagError *fail)
3 Description
A linear model has the form:
$$y = c + X\beta + \epsilon,$$
where
- $y$ is an $n$ by $1$ matrix of values of a dependent variable;
- $c$ is a scalar intercept term;
- $X$ is an $n$ by $m$ matrix of values of independent variables;
- $\beta$ is an $m$ by $1$ matrix of unknown values of parameters;
- $\epsilon$ is an $n$ by $1$ matrix of unknown random errors such that variance of $\epsilon = \sigma^2 I$.
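Let $\tilde{X}$ be the mean-centred $X$ and $\tilde{y}$ the mean-centred $y$. Furthermore, $\tilde{X}$ is scaled such that the diagonal elements of the cross product matrix $\tilde{X}^T\tilde{X}$ are one. The linear model now takes the form:
$$\tilde{y} = \tilde{X}\tilde{\beta} + \epsilon.$$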
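Ridge regression estimates the parameters $\tilde{\beta}$ in a penalised least squares sense by finding the $\tilde{b}$ that minimizes
$$\|\tilde{X}\tilde{b} - \tilde{y}\|^2 + h\|\tilde{b}\|^2, \quad h > 0,$$
where $\|\cdot\|$ denotes the $\ell_2$-norm and $h$ is a scalar regularisation or ridge parameter. For a given value of $h$, the parameter estimates $\tilde{b}$ are found by evaluating
$$\tilde{b} = (\tilde{X}^T\tilde{X} + hI)^{-1}\tilde{X}^T\tilde{y}.$$
Note that if $h = 0$ the ridge regression solution is equivalent to the ordinary least squares solution.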
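As a small illustration of the closed-form estimate $(\tilde{X}^T\tilde{X} + hI)^{-1}\tilde{X}^T\tilde{y}$, the sketch below solves the regularised normal equations for a model with two standardized independent variables. It is not how nag_regsn_ridge_opt (g02kac) computes the estimates (the function uses an SVD); the helper name `ridge_solve_2x2` and its argument layout are assumptions made for this example.

```c
#include <math.h>

/* Hypothetical helper: solve (A + h*I) b = c for a symmetric 2x2 matrix A.
 * a11, a12, a22 are the entries of A (the cross product of the standardized
 * data matrix); c1, c2 are the entries of the cross product of the
 * standardized data matrix with the centred dependent variable.
 * Returns 0 on success, -1 if the regularised matrix is singular. */
int ridge_solve_2x2(double a11, double a12, double a22,
                    double c1, double c2, double h, double b[2])
{
    double p = a11 + h;               /* regularised diagonal entries */
    double q = a22 + h;
    double det = p * q - a12 * a12;   /* determinant of A + h*I */

    if (fabs(det) < 1e-300)
        return -1;
    b[0] = (q * c1 - a12 * c2) / det; /* Cramer's rule for the 2x2 system */
    b[1] = (p * c2 - a12 * c1) / det;
    return 0;
}
```

With uncorrelated variables (a12 = 0) and unit diagonal, setting h = 1 halves each least squares estimate, which shows the shrinkage effect of the ridge parameter.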
Rather than calculate the inverse of ($\tilde{X}^T\tilde{X} + hI$) directly, nag_regsn_ridge_opt (g02kac) uses the singular value decomposition (SVD) of $\tilde{X}$. After decomposing $\tilde{X}$ into $UDV^T$, where $U$ and $V$ are orthogonal matrices and $D$ is a diagonal matrix, the parameter estimates become
$$\tilde{b} = V(D^TD + hI)^{-1}DU^T\tilde{y}.$$
A consequence of introducing the ridge parameter is that the effective number of parameters, $\gamma$, in the model is given by the sum of diagonal elements of
$$D^TD(D^TD + hI)^{-1},$$
see Moody (1992) for details.
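Because the matrices involved are diagonal, the effective number of parameters reduces to a sum over the singular values $d_i$ of the standardized data matrix: $\gamma = \sum_i d_i^2/(d_i^2 + h)$. The following is a minimal sketch of that sum; the function name `effective_params` is hypothetical and the singular values are assumed to have been computed elsewhere.

```c
#include <stddef.h>

/* Hypothetical helper: effective number of parameters for ridge parameter h,
 * given the k singular values d[0..k-1] of the standardized data matrix.
 * Terms with d[i] == 0 and h == 0 are skipped to avoid dividing 0 by 0. */
double effective_params(const double d[], size_t k, double h)
{
    double gamma = 0.0;
    size_t i;

    for (i = 0; i < k; ++i) {
        double d2 = d[i] * d[i];
        if (d2 + h > 0.0)
            gamma += d2 / (d2 + h);
    }
    return gamma;
}
```

With h = 0 and k nonzero singular values the result is k, the ordinary least squares parameter count; increasing h reduces it.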
Any multi-collinearity in the design matrix $X$ may be highlighted by calculating the variance inflation factors for the fitted model. The $j$th variance inflation factor, $v_j$, is a scaled version of the multiple correlation coefficient between independent variable $j$ and the other independent variables, $R_j$, and is given by
$$v_j = \frac{1}{1 - R_j^2}, \quad j = 1,2,\ldots,m.$$
The $m$ variance inflation factors are calculated as the diagonal elements of the matrix:
$$(\tilde{X}^T\tilde{X} + hI)^{-1}\tilde{X}^T\tilde{X}(\tilde{X}^T\tilde{X} + hI)^{-1},$$
which, using the SVD of $\tilde{X}$, is equivalent to the diagonal elements of the matrix:
$$V(D^TD + hI)^{-1}D^TD(D^TD + hI)^{-1}V^T.$$
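To make the matrix expression concrete, consider two standardized independent variables with correlation $r$, so that the cross product matrix is $A = [[1, r], [r, 1]]$. The diagonal of $(A + hI)^{-1}A(A + hI)^{-1}$ then has the closed form $(p^2 - 2pr^2 + r^2)/(p^2 - r^2)^2$ with $p = 1 + h$, which reduces to the familiar $1/(1 - r^2)$ when $h = 0$. The sketch below evaluates this; `ridge_vif_2` is a hypothetical name and the two-variable restriction is an assumption of the example.

```c
#include <math.h>

/* Hypothetical helper: common variance inflation factor of two standardized
 * variables with correlation r, for ridge parameter h.
 * Returns 0 on success, -1 if (A + h*I) is singular. */
int ridge_vif_2(double r, double h, double *vif)
{
    double p = 1.0 + h;           /* diagonal of A + h*I */
    double det = p * p - r * r;   /* determinant of A + h*I */

    if (fabs(det) < 1e-300)
        return -1;
    /* diagonal element of (A + hI)^{-1} A (A + hI)^{-1} */
    *vif = (p * p - 2.0 * p * r * r + r * r) / (det * det);
    return 0;
}
```

For r = 0.5 the factor is 4/3 at h = 0 and falls below 1 once h = 1, illustrating how the ridge parameter damps the inflation caused by collinearity.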
Although parameter estimates $\tilde{b}$ are calculated by using $\tilde{X}$, it is usual to report the parameter estimates $b$ associated with $X$. These are calculated from $\tilde{b}$, and the means and scalings of $X$. Optionally, either $\tilde{b}$ or $b$ may be calculated.
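One way to picture the conversion from standardized to original-scale estimates follows directly from the centring and scaling described earlier: if column $j$ of $X$ has mean $\bar{x}_j$ and scaling $s_j$, each slope is divided by $s_j$ and the intercept is recovered as $\bar{y} - \sum_j b_j\bar{x}_j$. The sketch below is derived from that algebra only; it is not taken from the library's internals, and the name `back_transform` and its argument layout are assumptions.

```c
#include <stddef.h>

/* Hypothetical helper: convert k standardized estimates bs[0..k-1] into
 * original-scale estimates. mean[] and scale[] hold the column means and
 * scalings of the data matrix; ybar is the mean of the dependent variable.
 * On success b[0] holds the intercept and b[1..k] the slopes.
 * Returns 0 on success, -1 if any scaling is zero. */
int back_transform(const double bs[], const double mean[],
                   const double scale[], double ybar, size_t k, double b[])
{
    double c = ybar;
    size_t j;

    for (j = 0; j < k; ++j) {
        if (scale[j] == 0.0)
            return -1;
        b[j + 1] = bs[j] / scale[j];   /* undo the column scaling */
        c -= b[j + 1] * mean[j];       /* undo the centring */
    }
    b[0] = c;
    return 0;
}
```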
The method can adopt one of four criteria to minimize while calculating a suitable value for $h$:
(a) Generalised cross-validation (GCV):
$$\frac{ns}{(n - \gamma)^2};$$
(b) Unbiased estimate of variance (UEV):
$$\frac{s}{n - \gamma};$$
(c) Future prediction error (FPE):
$$\frac{1}{n}\left(s + \frac{2\gamma s}{n - \gamma}\right);$$
(d) Bayesian information criterion (BIC):
$$\frac{1}{n}\left(s + \frac{\log(n)\gamma s}{n - \gamma}\right);$$
where $s$ is the sum of squares of residuals. However, the function returns all four of the above prediction errors regardless of the one selected to minimize the ridge parameter, $h$. Furthermore, the function will optionally return the leave-one-out cross-validation (LOOCV) prediction error.
4 References
Hastie T, Tibshirani R and Friedman J (2003) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics
Moody J E (1992) The effective number of parameters: an analysis of generalisation and regularisation in nonlinear learning systems. In: Neural Information Processing Systems (eds J E Moody, S J Hanson and R P Lippmann) 4 847–854. Morgan Kaufmann, San Mateo, CA
5 Arguments
- 1: order – Nag_OrderType, Input
On entry: the order argument specifies the two-dimensional storage scheme being used, i.e., row-major ordering or column-major ordering. C language defined storage is specified by order = Nag_RowMajor. See Section 3.2.1.3 in the Essential Introduction for a more detailed explanation of the use of this argument.
Constraint: order = Nag_RowMajor or Nag_ColMajor.
- 2: n – Integer, Input
On entry: $n$, the number of observations.
Constraint: n > 1.
- 3: m – Integer, Input
On entry: the number of independent variables available in the data matrix $X$.
Constraint: m ≤ n.
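- 4: x[] – const double, Input
Note: the dimension, dim, of the array x must be at least
- max(1, pdx × m) when order = Nag_ColMajor;
- max(1, n × pdx) when order = Nag_RowMajor.
The (i, j)th element of the matrix $X$ is stored in
- x[(j-1) × pdx + i-1] when order = Nag_ColMajor;
- x[(i-1) × pdx + j-1] when order = Nag_RowMajor.
On entry: the values of independent variables in the data matrix $X$.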
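The two storage schemes differ only in how a one-based matrix position (i, j) maps to a zero-based offset in x. The sketch below shows that mapping, assuming the usual convention that column-major storage places element (i, j) at offset (j-1) × pdx + (i-1) and row-major storage at (i-1) × pdx + (j-1). The enum `StorageOrder` and the function `x_index` are hypothetical stand-ins, not library types.

```c
#include <stddef.h>

/* Hypothetical stand-in for the library's storage order argument. */
typedef enum { ROW_MAJOR, COL_MAJOR } StorageOrder;

/* Offset in x of the (i, j)th matrix element, with i and j counted from 1
 * and pdx the stride between consecutive rows (row-major) or columns
 * (column-major). */
size_t x_index(StorageOrder order, size_t i, size_t j, size_t pdx)
{
    return order == COL_MAJOR ? (j - 1) * pdx + (i - 1)
                              : (i - 1) * pdx + (j - 1);
}
```

For a 3 by 2 matrix stored without padding, element (2, 2) sits at offset 4 in column-major order (pdx = 3) and at offset 3 in row-major order (pdx = 2).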
- 5: pdx – Integer, Input
On entry: the stride separating row or column elements (depending on the value of order) in the array x.
Constraints:
- if order = Nag_ColMajor, pdx ≥ n;
- if order = Nag_RowMajor, pdx ≥ m.
- 6: isx[m] – const Integer, Input
On entry: indicates which m independent variables are included in the model.
- isx[j-1] = 1: the jth variable in x will be included in the model.
- isx[j-1] = 0: variable j is excluded.
Constraint: isx[j-1] = 0 or 1, for j = 1,2,…,m.
- 7: ip – Integer, Input
On entry: the number of independent variables in the model.
Constraints:
- 1 ≤ ip ≤ m;
- Exactly ip elements of isx must be equal to 1.
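Since ip and isx must agree, it can be useful to check them before the call. The following is a minimal sketch of such a check, assuming that included variables are flagged with 1 and excluded ones with 0. The function name `isx_consistent` is hypothetical, and `long` stands in for the library's Integer type.

```c
#include <stddef.h>

/* Hypothetical pre-call check: returns 1 if every element of isx[0..m-1]
 * is 0 or 1, exactly ip of them equal 1, and 1 <= ip <= m; otherwise 0.
 * `long` is a stand-in for the library's Integer type. */
int isx_consistent(const long isx[], size_t m, long ip)
{
    long count = 0;
    size_t j;

    if (ip < 1 || (size_t)ip > m)
        return 0;
    for (j = 0; j < m; ++j) {
        if (isx[j] != 0 && isx[j] != 1)
            return 0;
        count += isx[j];
    }
    return count == ip;
}
```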
- 8: tau – double, Input
On entry: singular values less than tau of the SVD of the data matrix $X$ will be set equal to zero.
Suggested value: tau = 0.0.
Constraint: tau ≥ 0.0.
- 9: y[n] – const double, Input
On entry: the values of the dependent variable $y$.
- 10: h – double *, Input/Output
On entry: an initial value for the ridge regression parameter $h$; used as a starting point for the optimization.
Constraint: h > 0.0.
On exit: h is the optimized value of the ridge regression parameter $h$.
- 11: opt – Nag_PredictError, Input
On entry: the measure of prediction error used to optimize the ridge regression parameter $h$. The value of opt must be set equal to one of:
- opt = Nag_GCV: Generalised cross-validation (GCV);
- opt = Nag_UEV: Unbiased estimate of variance (UEV);
- opt = Nag_FPE: Future prediction error (FPE);
- opt = Nag_BIC: Bayesian information criterion (BIC).
Constraint: opt = Nag_GCV, Nag_UEV, Nag_FPE or Nag_BIC.
- 12: niter – Integer *, Input/Output
On entry: the maximum number of iterations allowed to optimize the ridge regression parameter $h$.
Constraint: niter ≥ 1.
On exit: the number of iterations used to optimize the ridge regression parameter $h$ within tol.
- 13: tol – double, Input
On entry: iterations of the ridge regression parameter $h$ will halt when consecutive values of $h$ lie within tol.
Constraint: tol > 0.0.
- 14: nep – double *, Output
On exit: the number of effective parameters, $\gamma$, in the model.
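- 15: orig – Nag_EstimatesOption, Input
On entry: if orig = Nag_EstimatesOrig, the parameter estimates $b$ are calculated for the original data; otherwise orig = Nag_EstimatesStand and the parameter estimates $\tilde{b}$ are calculated for the standardized data.
Constraint: orig = Nag_EstimatesOrig or Nag_EstimatesStand.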
- 16: b[] – double, Output
On exit: contains the intercept and parameter estimates for the fitted ridge regression model in the order indicated by isx. The first element of b contains the estimate for the intercept; b[j] contains the parameter estimate for the jth independent variable in the model, for j = 1,2,…,ip.
- 17: vif[ip] – double, Output
On exit: the variance inflation factors in the order indicated by isx. For the jth independent variable in the model, vif[j-1] is the value of $v_j$, for j = 1,2,…,ip.
- 18: res[n] – double, Output
On exit: res[i-1] is the value of the ith residual for the fitted ridge regression model, for i = 1,2,…,n.
- 19: rss – double *, Output
On exit: the sum of squares of residual values.
- 20: df – Integer *, Output
On exit: the degrees of freedom for the residual sum of squares rss.
- 21: optloo – Nag_OptionLOO, Input
On entry: if optloo = Nag_WantLOO, the leave-one-out cross-validation estimate of prediction error is calculated; otherwise no such estimate is calculated and optloo = Nag_NoLOO.
Constraint: optloo = Nag_NoLOO or Nag_WantLOO.
- 22: perr[] – double, Output
On exit: the first four elements contain, in this order, the measures of prediction error: GCV, UEV, FPE and BIC.
If optloo = Nag_WantLOO, perr[4] is the LOOCV estimate of prediction error; otherwise perr[4] is not referenced.
- 23: fail – NagError *, Input/Output
The NAG error argument (see Section 3.6 in the Essential Introduction).
6 Error Indicators and Warnings
- NE_2_INT_ARG_CONS
On entry, ; .
Constraint: .
- NE_ALLOC_FAIL
Dynamic memory allocation failed.
- NE_BAD_PARAM
On entry, argument ⟨value⟩ had an illegal value.
- NE_INT
On entry, .
Constraint: .
On entry, .
Constraint: .
On entry, .
Constraint: .
- NE_INT_2
On entry, and .
Constraint: .
On entry, ; .
Constraint: .
On entry, and .
Constraint: .
- NE_INT_ARG_CONS
On entry, .
Constraint: .
- NE_INT_ARRAY_VAL_1_OR_2
On entry, .
Constraint: or .
- NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact
NAG for assistance.
- NE_REAL
On entry, h = ⟨value⟩.
Constraint: h > 0.0.
On entry, tau = ⟨value⟩.
Constraint: tau ≥ 0.0.
On entry, tol = ⟨value⟩.
Constraint: tol > 0.0.
- NE_SVD_FAIL
SVD failed to converge.
- NW_TOO_MANY_ITER
Maximum number of iterations used.
7 Accuracy
Not applicable.
8 Further Comments
nag_regsn_ridge_opt (g02kac) allocates its double precision workspace internally.
9 Example
This example reads in data from an experiment to model body fat, and a ridge regression is calculated that optimizes GCV prediction error.
9.1 Program Text
Program Text (g02kace.c)
9.2 Program Data
Program Data (g02kace.d)
9.3 Program Results
Program Results (g02kace.r)