nag_lars_xtx (g02mbc) implements the LARS algorithm of Efron et al. (2004) as well as the modifications needed to perform forward stagewise linear regression and fit LASSO and positive LASSO models.
Given a vector of $n$ observed values, $y = \{y_i : i = 1, 2, \ldots, n\}$, and an $n \times p$ design matrix $X$, where the $j$th column of $X$, denoted $x_j$, is a vector of length $n$ representing the $j$th independent variable, standardized such that $\sum_{i=1}^{n} x_{ij} = 0$ and $\sum_{i=1}^{n} x_{ij}^2 = 1$, and a set of model parameters $\beta$ to be estimated from the observed values, the LARS algorithm can be summarised as:
Set $k = 1$ and all coefficients to zero, that is $\beta = 0$.
Find the variable most correlated with $y$, say $x_{j_1}$. Add $x_{j_1}$ to the ‘most correlated’ set $\mathcal{A}$. If $p = 1$ go to 8.
Take the largest possible step in the direction of $x_{j_1}$ (i.e., increase the magnitude of $\beta_{j_1}$) until some other variable, say $x_{j_2}$, has the same correlation with the current residual, $y - x_{j_1}\beta_{j_1}$.
Proceed in the ‘least angle direction’, that is, the direction which is equiangular between all variables in $\mathcal{A}$, altering the magnitude of the parameter estimates of those variables in $\mathcal{A}$, until the $k$th variable, $x_{j_k}$, has the same correlation with the current residual.
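The first two moves of the algorithm (picking the most correlated variable, then stepping until a second variable ties with it) can be sketched in plain C. This is an illustrative sketch only, not the code used by nag_lars_xtx (g02mbc): the function names `most_correlated` and `first_step_size` are hypothetical, the design matrix is assumed to be stored column-major with unit-norm columns, and the step-size rule is the one given in Efron et al. (2004) specialised to a single active variable.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Dot product of two length-n vectors. */
static double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* Index of the column of x (n-by-p, column-major) most correlated with y.
   The absolute value of that correlation is written to *cmax. */
size_t most_correlated(const double *x, const double *y,
                       size_t n, size_t p, double *cmax)
{
    assert(n > 0 && p > 0 && cmax != NULL);
    size_t best = 0;
    *cmax = -1.0;
    for (size_t j = 0; j < p; ++j) {
        double c = fabs(dot(x + j * n, y, n));
        if (c > *cmax) {
            *cmax = c;
            best = j;
        }
    }
    return best;
}

/* Largest step along column j1 before some other column has the same
   absolute correlation with the residual. Assumes unit-norm columns.
   If no other column ties first, the full least squares step is returned. */
double first_step_size(const double *x, const double *y,
                       size_t n, size_t p, size_t j1)
{
    assert(n > 0 && j1 < p);
    const double *x1 = x + j1 * n;
    double c1 = dot(x1, y, n);
    double s = (c1 < 0.0) ? -1.0 : 1.0;   /* sign of the active correlation */
    double cmax = fabs(c1);
    double gamma = cmax;                  /* step that zeroes the correlation */
    const double tiny = 1e-12;

    for (size_t j = 0; j < p; ++j) {
        if (j == j1)
            continue;
        double cj = dot(x + j * n, y, n);        /* correlation with y */
        double aj = s * dot(x + j * n, x1, n);   /* correlation with direction */
        double num[2] = { cmax - cj, cmax + cj };
        double den[2] = { 1.0 - aj, 1.0 + aj };
        for (int k = 0; k < 2; ++k) {
            if (den[k] > tiny) {
                double g = num[k] / den[k];
                if (g > tiny && g < gamma)
                    gamma = g;
            }
        }
    }
    return gamma;
}
```

With two orthonormal columns and y = (3, 1), the first column is selected and the step size is 2, after which both columns have correlation 1 with the residual (1, 1).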
As well as being a model selection process in its own right, with a small number of modifications the LARS algorithm can be used to fit the LASSO model of Tibshirani (1996), a positive LASSO model, where the independent variables enter the model in their defined direction, forward stagewise linear regression (Hastie et al. (2001)) and forward selection (Weisberg (1985)). Details of the required modifications in each of these cases are given in Efron et al. (2004).
On exit: the actual number of steps carried out in the model fitting process.
Note: the dimension, dim, of the array b
must be at least
On exit: the parameter estimates, with , the parameter estimate for the th variable, at the th step of the model fitting process, .
By default, when the parameter estimates are rescaled prior to being returned. If the parameter estimates are required on the normalized scale, then this can be overridden via ropt.
The values held in the remaining part of b depend on the type of preprocessing performed.
On entry: the stride separating row elements in the two-dimensional data stored in the array b.
, where is the number of parameter estimates as described in ip.
On exit: summaries of the model fitting process. When
, the sum of the absolute values of the parameter estimates for the th step of the model fitting process. If , the scaled parameter estimates are used in the summation.
, the residual sums of squares for the th step, where .
, approximate degrees of freedom for the th step.
, a -type statistic for the th step, where .
, correlation between the residual at step and the most correlated variable not yet in the active set , where the residual at step is .
, the step size used at step .
, the residual sums of squares for the null model, where .
, the degrees of freedom for the null model, where if and otherwise.
, a -type statistic for the null model, where .
, where and .
Although the statistics described above are still returned when the function exits with NW_LIMIT_REACHED, they may not be meaningful because the estimate is not based on the saturated model.
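As an illustration of how summaries of this kind can be formed from cross-products alone, the following sketch computes the sum of absolute parameter estimates, a residual sum of squares, and a $C_p$-type statistic. It is a sketch under stated assumptions, not the library's implementation: the function names are hypothetical, the cross-product matrix is assumed to be held in full (unpacked) storage, and the $C_p$ form (residual sum of squares divided by a variance estimate, minus the number of observations, plus twice the degrees of freedom) is the one from Efron et al. (2004), assumed here to correspond to the statistic described above.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Sum of the absolute values of the p parameter estimates. */
double l1_norm(const double *beta, size_t p)
{
    double s = 0.0;
    for (size_t j = 0; j < p; ++j)
        s += fabs(beta[j]);
    return s;
}

/* Residual sum of squares computed from cross-products only:
   RSS = y'y - 2 * beta'(D'y) + beta'(D'D)beta.
   dtd is the full p-by-p symmetric matrix D'D (assumed layout). */
double rss_from_crossprod(const double *dtd, const double *dty, double yty,
                          const double *beta, size_t p)
{
    assert(p > 0);
    double lin = 0.0, quad = 0.0;
    for (size_t i = 0; i < p; ++i) {
        lin += beta[i] * dty[i];
        for (size_t j = 0; j < p; ++j)
            quad += beta[i] * dtd[i * p + j] * beta[j];
    }
    return yty - 2.0 * lin + quad;
}

/* Cp-type statistic in the form used by Efron et al. (2004):
   RSS / sigma2 - n + 2 * df. */
double cp_statistic(double rss, double sigma2, size_t n, double df)
{
    assert(sigma2 > 0.0);
    return rss / sigma2 - (double)n + 2.0 * df;
}
```

Because the residual sum of squares is obtained here by subtraction rather than from the residuals themselves, inconsistent or badly rounded cross-products can produce a negative value, which is why such a result points to a problem with the supplied inputs.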
ropt – const double Input
On entry: optional arguments to control various aspects of the LARS algorithm.
The default value will be used for if , therefore setting will use the default values for all optional arguments and ropt need not be set and may be NULL. The default value will also be used if an invalid value is supplied for a particular argument, for example, setting will use the default value for argument .
On entry, all values of isx are zero. Constraint: at least one value of isx must be nonzero.
On entry, .
Constraint: or for all .
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
An unexpected error has been triggered by this function. Please contact NAG.
See Section 3.6.6 in the Essential Introduction for further information.
On entry, . Constraint: .
On entry, . Constraint: diagonal elements of must be positive.
On entry, and . Constraint: diagonal elements of must be positive.
A negative value for the residual sums of squares was obtained. Check the values of dtd, dty and yty.
Your licence key may have expired or may not have been installed correctly.
See Section 3.6.5 in the Essential Introduction for further information.
On entry, . Constraint: .
The cross-product matrix supplied in dtd is not symmetric.
Fitting process did not finish in mnstep steps. Try increasing the size of mnstep and supplying larger output arrays. All output is returned as documented, up to step mnstep; however, some of the returned statistics may not be meaningful.
, therefore sigma has been set to a large value. Output is returned as documented.
is approximately zero and hence the -type criterion cannot be calculated. All other output is returned as documented.
Degenerate model, no variables added and . Output is returned as documented.
8 Further Comments
The solution path of the LARS, LASSO and stagewise regression analyses is continuous and piecewise linear. nag_lars_xtx (g02mbc) returns the parameter estimates at various points along this path. nag_lars_param (g02mcc) can be used to obtain estimates at different points along the path.
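Because the path is piecewise linear, estimates between two consecutive returned steps can be obtained by linear interpolation. The sketch below illustrates the idea only and is not how nag_lars_param (g02mcc) is implemented: the function name `interpolate_path` is hypothetical, and it assumes the position along the path is measured by the sum of the absolute parameter estimates, which is taken to vary linearly within a segment.

```c
#include <assert.h>
#include <stddef.h>

/* Linearly interpolate between two consecutive points on the path.
   b0, b1 : parameter estimates (length p) at the two steps
   t0, t1 : path positions of the two steps, with t0 < t1
   target : requested position, expected to lie in [t0, t1]
   out    : receives the interpolated estimates (length p)
   Returns 0 on success, -1 if target lies outside [t0, t1]. */
int interpolate_path(const double *b0, double t0,
                     const double *b1, double t1,
                     double target, size_t p, double *out)
{
    assert(b0 != NULL && b1 != NULL && out != NULL);
    assert(t1 > t0);
    if (target < t0 || target > t1)
        return -1;
    double w = (target - t0) / (t1 - t0);   /* fraction of the segment covered */
    for (size_t j = 0; j < p; ++j)
        out[j] = b0[j] + w * (b1[j] - b0[j]);
    return 0;
}
```

For example, halfway between estimates (2, 0) at position 2 and (3, 1) at position 4, the interpolated estimates are (2.5, 0.5).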
If you have the raw data values, that is $D$ and $y$, then nag_lars (g02mac) can be used instead of nag_lars_xtx (g02mbc).
This example performs a LARS on a simulated dataset with observations and independent variables.
The example uses nag_sum_sqs (g02buc) to get the cross-products of the augmented matrix $[D \; y]$. With $p$ independent variables, the first $p(p+1)/2$ elements of the (column packed) cross-products matrix returned therefore contain the elements of $D^{\mathrm{T}}D$, the next $p$ elements contain $D^{\mathrm{T}}y$ and the last element contains $y^{\mathrm{T}}y$.
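The packed layout can be illustrated with a small stand-alone sketch. It does not call nag_sum_sqs (g02buc) and ignores any mean correction or weighting that function may apply. The names `packed_crossprod`, `dty_offset` and `yty_offset` are hypothetical, and the data matrix is assumed to be stored column-major.

```c
#include <assert.h>
#include <stddef.h>

/* Column j of the augmented matrix [D y]: columns 0..p-1 come from d
   (n-by-p, column-major) and column p is y. */
static const double *aug_col(const double *d, const double *y,
                             size_t n, size_t p, size_t j)
{
    return (j < p) ? d + j * n : y;
}

/* Fill c, of length (p+1)(p+2)/2, with the upper triangle of
   [D y]'[D y], packed column by column. */
void packed_crossprod(const double *d, const double *y,
                      size_t n, size_t p, double *c)
{
    assert(n > 0 && p > 0 && c != NULL);
    size_t idx = 0;
    for (size_t j = 0; j <= p; ++j) {
        const double *zj = aug_col(d, y, n, p, j);
        for (size_t i = 0; i <= j; ++i) {
            const double *zi = aug_col(d, y, n, p, i);
            double s = 0.0;
            for (size_t r = 0; r < n; ++r)
                s += zi[r] * zj[r];
            c[idx++] = s;
        }
    }
}

/* Position of the first element of D'y within the packed array;
   everything before it is the packed upper triangle of D'D. */
size_t dty_offset(size_t p)
{
    return p * (p + 1) / 2;
}

/* Position of y'y, the final element of the packed array. */
size_t yty_offset(size_t p)
{
    return p * (p + 1) / 2 + p;
}
```

The three pieces can then be passed on as pointers into the same array: the packed cross-product matrix starts at element 0, the cross-products with y start at `dty_offset(p)`, and the sum of squares of y is the single value at `yty_offset(p)`.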