naginterfaces.library.correg.linregm_fit_stepwise

naginterfaces.library.correg.linregm_fit_stepwise(n, wmean, c, sw, isx, fin=4.0, fout=None, tau=1e-06, monlev=0, monfun=None, data=None, io_manager=None)

linregm_fit_stepwise performs a full stepwise selection from $p$ variables by applying Clarke’s sweep algorithm to the correlation matrix of a design and data matrix, $Z$. The (weighted) variance-covariance matrix, the (weighted) means and the sum of weights of $Z$ must be supplied.

For full information, please refer to the NAG Library document for g02ef:

https://www.nag.com/numeric/nl/nagdoc_29.3/flhtml/g02/g02eff.html

Parameters

n : int

The number of observations used in the calculations.

wmean : float, array-like, shape $(m+1)$

The mean of the design matrix, $Z$.

c : float, array-like, shape $((m+1) \times (m+2)/2)$

The upper-triangular variance-covariance matrix packed by column for the design matrix, $Z$. Because the function computes the correlation matrix $R$ from $c$, the variance-covariance matrix need only be supplied up to a scaling factor.

sw : float

If weights were used to calculate $c$ then $\mathit{sw}$ is the sum of positive weight values; otherwise $\mathit{sw}$ is the number of observations used to calculate $c$.

isx : int, array-like, shape $(m)$

The value of $\mathit{isx}[j-1]$ determines the set of variables used to perform full stepwise model selection, for $j = 1, 2, \dots, m$.

$\mathit{isx}[j-1] = -1$

To exclude the variable corresponding to the $j$th column of $X$ from the final model.

$\mathit{isx}[j-1] = 1$

To consider the variable corresponding to the $j$th column of $X$ for selection in the final model.

$\mathit{isx}[j-1] = 2$

To force the inclusion of the variable corresponding to the $j$th column of $X$ in the final model.

fin : float, optional

The value of the variance ratio which an explanatory variable must exceed to be included in a model.

fout : None or float, optional

Note: if this argument is None, the value of $\mathit{fin}$ is used as the default.

The explanatory variable in a model with the lowest variance ratio value is removed from the model if its value is less than $\mathit{fout}$. $\mathit{fout}$ is usually set equal to the value of $\mathit{fin}$; a value less than $\mathit{fin}$ is occasionally preferred.

tau : float, optional

The tolerance, $τ$, for detecting collinearities between variables when adding or removing an explanatory variable from a model. Explanatory variables deemed to be collinear are excluded from the final model.

monlev : int, optional

A value of $1$ for $\mathit{monlev}$ enables monitoring of the model selection process; a value of $0$ disables it.

monfun : None or callable monfun(flag, var, val, data=None), optional

Note: if this argument is None then a NAG-supplied facility will be used.

The function for monitoring the model selection process.

Parameters

flag : str, length 1

The value of $\mathit{flag}$ indicates the stage of the stepwise selection of explanatory variables.

$\mathit{flag} = \text{'A'}$

Variable $\mathit{var}$ was added to the current model.

$\mathit{flag} = \text{'B'}$

Beginning the backward elimination step.

$\mathit{flag} = \text{'C'}$

Variable $\mathit{var}$ failed the collinearity test and is excluded from the model.

$\mathit{flag} = \text{'D'}$

Variable $\mathit{var}$ was dropped from the current model.

$\mathit{flag} = \text{'F'}$

Beginning the forward selection step.

$\mathit{flag} = \text{'K'}$

Backward elimination did not remove any variables from the current model.

$\mathit{flag} = \text{'S'}$

Starting stepwise selection procedure.

$\mathit{flag} = \text{'V'}$

The variance ratio for variable $\mathit{var}$ takes the value $\mathit{val}$.

$\mathit{flag} = \text{'X'}$

Finished stepwise selection procedure.

var : int

The index of the explanatory variable in the design matrix $Z$ to which $\mathit{flag}$ pertains.

val : float

If $\mathit{flag} = \text{'V'}$, $\mathit{val}$ is the variance ratio value for the coefficient associated with explanatory variable index $\mathit{var}$.

data : arbitrary, optional, modifiable in place

User-communication data for callback functions.

data : arbitrary, optional

User-communication data for callback functions.

io_manager : FileObjManager, optional

Manager for I/O in this routine.
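A monitoring callback can be sketched as follows. This is only an illustration that follows the documented signature monfun(flag, var, val, data=None); the FLAG_TEXT table, the message wording and the choice of a list for data are assumptions made for this example, not part of the library.

```python
# Short descriptions for the single-character stage flags listed above.
# The wording is illustrative; only the flag letters come from the documentation.
FLAG_TEXT = {
    'A': 'variable {var} added to the current model',
    'B': 'beginning the backward elimination step',
    'C': 'variable {var} failed the collinearity test and is excluded',
    'D': 'variable {var} dropped from the current model',
    'F': 'beginning the forward selection step',
    'K': 'backward elimination removed no variables',
    'S': 'starting stepwise selection procedure',
    'V': 'variance ratio for variable {var} is {val:.4g}',
    'X': 'finished stepwise selection procedure',
}


def monfun(flag, var, val, data=None):
    """Append a readable message for each monitoring event to ``data``."""
    if flag in FLAG_TEXT:
        message = FLAG_TEXT[flag].format(var=var, val=val)
    else:
        message = 'unrecognised flag %r' % (flag,)
    if data is not None:
        data.append(message)  # ``data`` is assumed here to be a list


# Simulate three events by calling the callback directly.
log = []
monfun('S', 0, 0.0, data=log)
monfun('V', 3, 5.25, data=log)
monfun('A', 3, 0.0, data=log)
```

Going by the parameter descriptions above, such a callback would presumably be supplied with monlev=1, monfun=monfun and data=log, so that the list is filled in place during the selection.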

Returns

isx : int, ndarray, shape $(m)$

The value of $\mathit{isx}[j-1]$ indicates the status of the $j$th explanatory variable in the model.

$\mathit{isx}[j-1] = -1$

Forced exclusion.

$\mathit{isx}[j-1] = 0$

Excluded.

$\mathit{isx}[j-1] = 1$

Selected.

$\mathit{isx}[j-1] = 2$

Forced selection.

b : float, ndarray, shape $(m+1)$

$b[0]$ contains the estimate for the intercept term in the fitted model. If $\mathit{isx}[j-1] \neq 0$, then $b[j]$ contains the estimate for the $j$th explanatory variable in the fitted model; otherwise $b[j] = 0$.

se : float, ndarray, shape $(m+1)$

$\mathit{se}[j-1]$ contains the standard error for the estimate of $b[j-1]$, for $j = 1, 2, \dots, m+1$.

rsq : float

The $R^{2}$-statistic for the fitted regression model.

rms : float

The mean square of residuals for the fitted regression model.

df : int

The number of degrees of freedom for the sum of squares of residuals.
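The returned isx and b arrays can be read together, as in the sketch below. The helper names selected_variables and predict are hypothetical, and the numeric values are made up for illustration; only the status codes and the layout of b (intercept first, then one entry per explanatory variable) come from the descriptions above.

```python
def selected_variables(isx):
    """Return 1-based indices j whose status is selected (1) or forced in (2)."""
    return [j for j, status in enumerate(isx, start=1) if status in (1, 2)]


def predict(b, x_row):
    """Evaluate b[0] + sum_j b[j] * x_j for one row of explanatory values."""
    # Variables that are not in the model contribute nothing when b[j] is 0.
    return b[0] + sum(coef * x for coef, x in zip(b[1:], x_row))


# Hypothetical return values for m = 4, invented for this illustration.
isx = [1, 0, 2, -1]
b = [1.0, 2.0, 0.0, -0.5, 0.0]

in_model = selected_variables(isx)
y_hat = predict(b, [3.0, 10.0, 4.0, 7.0])
```

Here variables 1 and 3 are in the model, so the prediction uses only the intercept and those two coefficients.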

Raises

NagValueError

(errno $1$)

On entry, $m = \langle\mathit{value}\rangle$.

Constraint: $m > 1$.

(errno $1$)

On entry, $n = \langle\mathit{value}\rangle$.

Constraint: $n > 1$.

(errno $1$)

On entry, $\mathit{sw} = \langle\mathit{value}\rangle$.

Constraint: $\mathit{sw} > 1.0$.

(errno $1$)

On entry, $\mathit{fin} = \langle\mathit{value}\rangle$.

Constraint: $\mathit{fin} > 0.0$.

(errno $1$)

On entry, $\mathit{fout} = \langle\mathit{value}\rangle$; $\mathit{fin} = \langle\mathit{value}\rangle$.

Constraint: $0.0 \leq \mathit{fout} \leq \mathit{fin}$.

(errno $1$)

On entry, $\mathit{tau} = \langle\mathit{value}\rangle$.

Constraint: $\mathit{tau} > 0.0$.

(errno $1$)

On entry, $\mathit{monlev} = \langle\mathit{value}\rangle$.

Constraint: $\mathit{monlev} = 0$ or $1$.

(errno $2$)

On entry, at least one diagonal element of $c \leq 0.0$.

Constraint: $c$ must be positive definite.

(errno $2$)

On entry, $\mathit{isx}[\langle\mathit{value}\rangle] = \langle\mathit{value}\rangle$.

Constraint: $\mathit{isx}[j-1] = -1$, $1$ or $2$, for $j = 1, 2, \dots, m$.

(errno $2$)

On entry, $\mathit{isx}[i-1] \neq 1$, for all $i = 1, 2, \dots, m$.

Constraint: there must be at least one free variable.

(errno $4$)

All variables are collinear, no model to select.

Warns

NagAlgorithmicWarning

(errno $3$): Matrix not positive definite, results may be inaccurate.
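The scalar and isx constraints listed above can be checked before calling the routine. The following is a minimal sketch; check_inputs is a hypothetical helper, its messages are invented, and it does not reproduce the library's own checks on $c$ (positive diagonal, positive definiteness) or the collinearity detection.

```python
def check_inputs(m, n, sw, isx, fin=4.0, fout=None, tau=1e-6, monlev=0):
    """Return a list of violated constraints (empty if none are violated)."""
    if fout is None:
        fout = fin  # documented default: fout takes the value of fin
    problems = []
    if m <= 1:
        problems.append('m must be > 1')
    if n <= 1:
        problems.append('n must be > 1')
    if sw <= 1.0:
        problems.append('sw must be > 1.0')
    if fin <= 0.0:
        problems.append('fin must be > 0.0')
    if not 0.0 <= fout <= fin:
        problems.append('fout must satisfy 0 <= fout <= fin')
    if tau <= 0.0:
        problems.append('tau must be > 0.0')
    if monlev not in (0, 1):
        problems.append('monlev must be 0 or 1')
    if len(isx) != m:
        problems.append('isx must have length m')
    if any(value not in (-1, 1, 2) for value in isx):
        problems.append('isx entries must be -1, 1 or 2')
    if 1 not in isx:
        problems.append('isx must contain at least one free variable (value 1)')
    return problems


ok = check_inputs(3, 20, 20.0, [1, 2, -1])
bad = check_inputs(3, 20, 20.0, [2, 2, -1], fout=5.0)
```

The first call reports no problems; the second reports that fout exceeds fin and that no variable is free for selection.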

Notes

The general multiple linear regression model is defined by

$$y = β_{0} + X β + ϵ,$$

where

$y$ is a vector of $n$ observations on the dependent variable,

$β_{0}$ is an intercept coefficient,

$X$ is an $n \times p$ matrix of $p$ explanatory variables,

$β$ is a vector of $p$ unknown coefficients, and

$ϵ$ is a vector of length $n$ of unknown, Normally distributed, random errors.

linregm_fit_stepwise employs a full stepwise regression to select a subset of explanatory variables from the $p$ available variables (the intercept is included in the model) and computes regression coefficients and their standard errors, and various other statistical quantities, by minimizing the sum of squares of residuals. The method applies repeatedly a forward selection step followed by a backward elimination step and halts when neither step updates the current model.

The criterion used to update a current model is the variance ratio of residual sum of squares. Let $s_{1}$ and $s_{2}$ be the residual sum of squares of the current model and this model after undergoing a single update, with degrees of freedom $q_{1}$ and $q_{2}$, respectively. Then the condition:

$$\frac{(s_{2} - s_{1}) / (q_{2} - q_{1})}{s_{1} / q_{1}} > f_{1},$$

must be satisfied if a variable $k$ will be considered for entry to the current model, and the condition:

$$\frac{(s_{1} - s_{2}) / (q_{1} - q_{2})}{s_{1} / q_{1}} < f_{2},$$

must be satisfied if a variable $k$ will be considered for removal from the current model, where $f_{1}$ and $f_{2}$ are user-supplied values and $f_{2} \leq f_{1}$.

In the entry step the entry statistic is computed for each variable not in the current model. If no variable is associated with a test value that exceeds $f_{1}$ then this step is terminated; otherwise the variable associated with the largest value for the entry statistic is entered into the model.

In the removal step the removal statistic is computed for each variable in the current model. If no variable is associated with a test value less than $f_{2}$ then this step is terminated; otherwise the variable associated with the smallest value for the removal statistic is removed from the model.
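The entry and removal tests can be illustrated with a few lines of Python. This sketch evaluates the variance ratio exactly as written in the two conditions above (the two left-hand sides are the same expression, since both differences change sign together). The function names and the numbers are made up for illustration, and the default thresholds simply reuse the documented default of fin.

```python
def variance_ratio(s1, q1, s2, q2):
    """Variance ratio between the current model (s1, q1) and an updated one (s2, q2)."""
    if q1 == q2:
        raise ValueError('the update must change the degrees of freedom')
    return ((s2 - s1) / (q2 - q1)) / (s1 / q1)


def may_enter(s1, q1, s2, q2, f1=4.0):
    """Entry test: the ratio must exceed f1."""
    return variance_ratio(s1, q1, s2, q2) > f1


def may_be_removed(s1, q1, s2, q2, f2=4.0):
    """Removal test: the ratio must be less than f2."""
    return variance_ratio(s1, q1, s2, q2) < f2


# Hypothetical numbers: adding a variable lowers the residual sum of squares
# from 100 (10 degrees of freedom) to 60 (9 degrees of freedom).
ratio = variance_ratio(100.0, 10, 60.0, 9)
```

With these numbers the ratio equals the threshold 4.0 exactly, so the strict inequality means the variable would not enter; a slightly smaller threshold would admit it.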

The data values $X$ and $y$ are not provided as input to the function. Instead, summary statistics of the design and data matrix $Z = (X | y)$ are required.
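As a rough illustration of the required summary statistics, the sketch below derives n, wmean, c and sw from unweighted rows of $Z = (X | y)$ using only the standard library. The helper name summarize, the data and the divisor $n - 1$ are assumptions of this example; since the variance-covariance matrix need only be supplied up to a scaling factor, another divisor should serve equally well.

```python
def summarize(z_rows):
    """Return (n, wmean, c, sw) for unweighted rows of Z = (X | y)."""
    n = len(z_rows)
    width = len(z_rows[0])  # m + 1 columns: m explanatory variables, then y
    wmean = [sum(row[j] for row in z_rows) / n for j in range(width)]
    c = []
    # Upper triangle packed by column: (0,0), (0,1), (1,1), (0,2), (1,2), (2,2), ...
    for j in range(width):
        for i in range(j + 1):
            cross = sum((row[i] - wmean[i]) * (row[j] - wmean[j]) for row in z_rows)
            c.append(cross / (n - 1))
    # Without weights, sw is the number of observations.
    return n, wmean, c, float(n)


# Made-up data: two explanatory variables and a response in the last column.
z = [
    [1.0, 4.0, 2.0],
    [2.0, 0.0, 4.0],
    [3.0, 2.0, 9.0],
]
n, wmean, c, sw = summarize(z)
```

For $m = 2$ the packed array has $(m+1)(m+2)/2 = 6$ entries, which matches the documented shape of c.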

Explanatory variables are entered into and removed from the current model by using sweep operations on the correlation matrix $R$ of $Z$, given by:

$$
R = \begin{pmatrix}
1 & \dots & r_{1p} & r_{1y} \\
\vdots & \ddots & \vdots & \vdots \\
r_{p1} & \dots & 1 & r_{py} \\
r_{y1} & \dots & r_{yp} & 1
\end{pmatrix},
$$

where $r_{ij}$ is the correlation between the explanatory variables $i$ and $j$, for $j = 1, 2, \dots, p$, for $i = 1, 2, \dots, p$, and $r_{yi}$ (and $r_{iy}$) is the correlation between the response variable $y$ and the $i$th explanatory variable, for $i = 1, 2, \dots, p$.
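The relation between a variance-covariance matrix and the correlation matrix $R$ can be sketched as follows. This is an illustration on a full (unpacked) square matrix, not the routine's internal code; covariance_to_correlation is a hypothetical helper, and the example also shows why a constant scaling of the covariance matrix leaves $R$ unchanged.

```python
import math


def covariance_to_correlation(cov):
    """Convert a full square covariance matrix into a correlation matrix."""
    size = len(cov)
    if any(cov[i][i] <= 0.0 for i in range(size)):
        raise ValueError('every diagonal element must be positive')
    sd = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)]
            for i in range(size)]


cov = [[4.0, 2.0],
       [2.0, 9.0]]
corr = covariance_to_correlation(cov)

# Multiplying the covariance matrix by a constant gives the same correlations.
scaled = [[10.0 * value for value in row] for row in cov]
corr_scaled = covariance_to_correlation(scaled)
```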

A sweep operation on the $k$th row and column ($k \leq p$) of $R$ replaces:

$$
\begin{array}{lll}
r_{kk} & \text{by} & -1 / r_{kk}; \\
r_{ik} & \text{by} & r_{ik} / |r_{kk}|, \quad i = 1, 2, \dots, p+1 \ (i \neq k); \\
r_{kj} & \text{by} & r_{kj} / |r_{kk}|, \quad j = 1, 2, \dots, p+1 \ (j \neq k); \\
r_{ij} & \text{by} & r_{ij} - r_{ik} r_{kj} / |r_{kk}|, \quad i = 1, 2, \dots, p+1 \ (i \neq k); \ j = 1, 2, \dots, p+1 \ (j \neq k).
\end{array}
$$

The $k$th explanatory variable is eligible for entry into the current model if it satisfies the collinearity tests: $r_{kk} > τ$ and

$$\left(r_{ii} - \frac{r_{ik} r_{ki}}{r_{kk}}\right) τ \leq 1,$$

for a user-supplied value ($> 0$) of $τ$ and where the index $i$ runs over explanatory variables in the current model. The sweep operation is its own inverse; therefore, pivoting on an explanatory variable $k$ in the current model has the effect of removing it from the model.
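A forward (entry) sweep can be sketched in plain Python as below. This is a simplified illustration on a full matrix, restricted to positive pivots, for which $|r_{kk}| = r_{kk}$; it does not use the packed symmetric storage of Clarke's algorithm, performs no collinearity test, and does not cover removal. The function name sweep_in and the correlation values are made up for this example.

```python
def sweep_in(r, k):
    """Return a copy of r after sweeping on row/column k (positive pivot only)."""
    pivot = r[k][k]
    if pivot <= 0.0:
        raise ValueError('this sketch only handles entry sweeps with r[k][k] > 0')
    size = len(r)
    out = [row[:] for row in r]
    # Elements outside row and column k: r_ij - r_ik * r_kj / r_kk.
    for i in range(size):
        for j in range(size):
            if i != k and j != k:
                out[i][j] = r[i][j] - r[i][k] * r[k][j] / pivot
    # Row and column k are divided by the pivot.
    for i in range(size):
        if i != k:
            out[i][k] = r[i][k] / pivot
            out[k][i] = r[k][i] / pivot
    out[k][k] = -1.0 / pivot
    return out


# Hypothetical correlations for two explanatory variables and the response y.
r = [[1.0, 0.5, 0.6],
     [0.5, 1.0, 0.7],
     [0.6, 0.7, 1.0]]
one_var = sweep_in(r, 0)          # variable 1 enters
two_vars = sweep_in(one_var, 1)   # variable 2 enters as well
```

After each sweep, the entries of the last column for the swept variables hold the standardized regression coefficients, and the bottom-right element holds $1 - R^{2}$ for the current model.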

Once the stepwise model selection procedure is finished, the function calculates:

the least squares estimate for the $i$ th explanatory variable included in the fitted model;
standard error estimates for each coefficient in the final model;
the square root of the mean square of residuals and its degrees of freedom;
the multiple correlation coefficient.

The function makes use of the symmetry of the sweep operations and correlation matrix which reduces by almost one half the storage and computation required by the sweep algorithm, see Clarke (1981) for details.

References

Clarke, M R B, 1981, Algorithm AS 178: the Gauss–Jordan sweep operator with detection of collinearity, Appl. Statist. (31), 166–169

Dempster, A P, 1969, Elements of Continuous Multivariate Analysis, Addison–Wesley

Draper, N R and Smith, H, 1985, Applied Regression Analysis, (2nd Edition), Wiley
