NAG Library Routine Document
g22yaf (lm_formula)
1
Purpose
g22yaf parses a text string containing a formula specifying a linear model and outputs a G22 handle to an internal data structure. This G22 handle can then be passed to various routines in
Chapter G22. In particular, the G22 handle can be passed to
g22ycf to produce a design matrix or
g22ydf to produce a vector of column inclusion flags suitable for use with routines in
Chapter G02.
2
Specification
Fortran Interface
Subroutine g22yaf (hform, formula, ifail)
Integer, Intent (Inout)  ::  ifail
Character (*), Intent (In)  ::  formula
Type (c_ptr), Intent (Inout)  ::  hform

3
Description
3.1
Background
Let $D$ denote a data matrix with $n$ observations on ${m}_{d}$ independent variables, denoted ${V}_{1},{V}_{2},\dots ,{V}_{{m}_{d}}$. Let $y$ denote a vector of $n$ observations on a dependent variable.
A linear model, $\mathcal{M}$, as the term is used in this routine, expresses a relationship between the independent variables, ${V}_{j}$, and the dependent variable. This relationship can be expressed as a series of additive terms ${T}_{1}+{T}_{2}+\dots $, with each term, ${T}_{t}$, representing either a single independent variable ${V}_{j}$, called the main effect of ${V}_{j}$, or the interaction between two or more independent variables. An interaction term, denoted here using the $.$ operator, allows the effect of an independent variable on the dependent variable to depend on the value of one or more other independent variables. As an example, the three-way interaction between ${V}_{1},{V}_{2}$ and ${V}_{3}$ is denoted ${V}_{1}.{V}_{2}.{V}_{3}$ and describes a situation where the effect of one of these three variables is influenced by the value of the other two.
This routine takes a description of
$\mathcal{M}$, supplied as a text string containing a formula, and outputs a G22 handle to an internal data structure. This G22 handle can then be passed to
g22ycf to produce a design matrix for use in analysis routines from other chapters, for example the regression routines of
Chapter G02.
A more detailed description of what is meant by a G22 handle can be found in
Section 2.1 in the G22 Chapter Introduction.
3.2
Syntax
In its most verbose form $\mathcal{M}$ can be described by one or more variable names, ${V}_{j}$, and the two operators, $+$ and $.$. In order to allow a wide variety of models to be specified compactly this syntax is extended to six operators ($+$, $.$, $*$, $-$, $:$, $^$) and parentheses.
A formula describing the model is supplied to g22yaf via a character string which must obey the following rules:
1. 
Variables can be denoted by arbitrary names, as long as
(i) 
The names used are a subset of those supplied to g22ybf when describing $D$. 
(ii) 
The names do not contain any of the characters in $+.*-:^\left(\right)@$. 

2. 
The $.$ operator denotes an interaction between two or more variables or terms, with ${V}_{1}.{V}_{2}.{V}_{3}$ denoting the three-way interaction between ${V}_{1}$, ${V}_{2}$ and ${V}_{3}$. 
3. 
A term in $\mathcal{M}$ can contain one or more variable names, separated using the $.$ operator, i.e., a term can be either a main effect or an interaction term between two or more variables.
(i) 
If a variable appears in an interaction term more than once, all subsequent appearances, after the first, are ignored, therefore ${V}_{1}.{V}_{2}.{V}_{1}$ is the same as ${V}_{1}.{V}_{2}$. 
(ii) 
The ordering of the variables in an interaction term is ignored when comparing terms, therefore ${V}_{1}.{V}_{2}$ is the same as ${V}_{2}.{V}_{1}$. This ordering may have an effect when the resulting G22 handle is passed to another routine, for example g22ycf. 
(iii) 
Applying the $.$ operator to two terms appends one to the other, for example, if ${T}_{1}={V}_{1}.{V}_{2}$ and ${T}_{2}={V}_{3}.{V}_{4}$, ${T}_{1}.{T}_{2}={V}_{1}.{V}_{2}.{V}_{3}.{V}_{4}$. 

4. 
The $+$ operator allows additional terms to be included in $\mathcal{M}$, therefore ${T}_{1}+{T}_{2}$ is a model that includes terms ${T}_{1}$ and ${T}_{2}$.
(i) 
If a term is added to $\mathcal{M}$ more than once, all subsequent appearances, after the first, are ignored, therefore ${T}_{1}+{T}_{2}+{T}_{1}$ is the same as ${T}_{1}+{T}_{2}$. 
(ii) 
The ordering of the terms is ignored whilst parsing the formula, therefore ${T}_{1}+{T}_{2}$ is the same as ${T}_{2}+{T}_{1}$. This ordering may have an effect when the resulting G22 handle is passed to another routine, for example g22ycf. 
(iii) 
Internally, the terms are reordered so that all main effects come first, followed by two-way interactions, then three-way interactions, etc. The ordering within each of these categories is preserved. 

5. 
The $*$ operator can be used as a shorthand notation denoting the main effects and all interactions between the variables involved. Therefore, ${T}_{1}*{T}_{2}$ is equivalent to ${T}_{1}+{T}_{2}+{T}_{1}.{T}_{2}$ and ${T}_{1}*{T}_{2}*{T}_{3}$ is equivalent to ${T}_{1}+{T}_{2}+{T}_{3}+{T}_{1}.{T}_{2}+{T}_{1}.{T}_{3}+{T}_{2}.{T}_{3}+{T}_{1}.{T}_{2}.{T}_{3}$. 
6. 
The $-$ operator removes a term from $\mathcal{M}$, therefore ${T}_{1}*{T}_{2}*{T}_{3}-{T}_{1}.{T}_{2}.{T}_{3}$ is equivalent to ${T}_{1}+{T}_{2}+{T}_{3}+{T}_{1}.{T}_{2}+{T}_{1}.{T}_{3}+{T}_{2}.{T}_{3}$ as the three-way interaction, ${T}_{1}.{T}_{2}.{T}_{3}$, usually present due to ${T}_{1}*{T}_{2}*{T}_{3}$, has been removed. 
7. 
The $:$ operator is a shorthand way of specifying a series of variables, with ${V}_{1}:{V}_{j}$ being equivalent to ${V}_{1}+{V}_{2}+\dots +{V}_{j}$.
(i) 
This operator can only be used if the variable names end in a numeric, therefore $\text{VAR2}:\text{VAR4}$ would be valid, but $\text{FVAR}:\text{LVAR}$ would not. 
(ii) 
The root part of both variable names (i.e., the part before the trailing numeric, so $\text{VAR}$ in the valid example above) must be the same. 
(iii) 
The trailing numeric parts of the two variable names must be in ascending order. 

8. 
The $^$ operator is a shorthand notation for a series of $*$ operators. $\left({T}_{1}+{T}_{2}+{T}_{3}\right)^2$ is equivalent to $\left({T}_{1}+{T}_{2}+{T}_{3}\right)*\left({T}_{1}+{T}_{2}+{T}_{3}\right)$ which in turn is equivalent to ${T}_{1}+{T}_{2}+{T}_{3}+{T}_{1}.{T}_{2}+{T}_{1}.{T}_{3}+{T}_{2}.{T}_{3}$.
(i) 
This notation is present primarily for use with the $:$ operator in examples of the form $\left({V}_{1}:{V}_{5}\right)^3$, which specifies a model containing the main effects for variables ${V}_{1}$ to ${V}_{5}$ as well as all two- and three-way interactions. 
(ii) 
Using the $^$ operator on a single term has no effect, therefore ${T}_{2}^2$ is the same as ${T}_{2}$. 
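The rules above can be illustrated with a small Python sketch. It is not the NAG implementation and does not call the library: it simply models a term as a tuple of variable names and a model as a list of terms, so that the expansions quoted in rules 3 to 8 can be reproduced. All function names are hypothetical, and the treatment of equal trailing numbers in `colon` is an assumption.

```python
# Illustrative model of the formula operators; this is not NAG code.
# A term is a tuple of variable names, a model is a list of terms.
import re


def add(m1, m2):
    """'+': keep the first appearance of each term (variable order ignored)."""
    out, seen = [], set()
    for term in list(m1) + list(m2):
        key = frozenset(term)
        if key not in seen:
            seen.add(key)
            out.append(term)
    return out


def dot(m1, m2):
    """'.': append each term of m2 to each term of m1, skipping repeated variables."""
    pairs = [a + tuple(v for v in b if v not in a) for a in m1 for b in m2]
    return add([], pairs)


def star(m1, m2):
    """'*': main effects plus interactions, i.e. m1 + m2 + m1.m2."""
    return add(add(m1, m2), dot(m1, m2))


def remove(m1, m2):
    """'-': drop from m1 every term that also appears in m2."""
    drop = {frozenset(term) for term in m2}
    return [term for term in m1 if frozenset(term) not in drop]


def colon(first, last):
    """':': VAR2:VAR4 becomes VAR2 + VAR3 + VAR4."""
    a = re.fullmatch(r"(.*?)(\d+)", first)
    b = re.fullmatch(r"(.*?)(\d+)", last)
    if a is None or b is None or a.group(1) != b.group(1):
        raise ValueError("names must share a root and end in a number")
    lo, hi = int(a.group(2)), int(b.group(2))
    if lo > hi:  # assumption: equal trailing numbers are accepted here
        raise ValueError("trailing numbers must be in ascending order")
    return [(f"{a.group(1)}{k}",) for k in range(lo, hi + 1)]


def power(m, p):
    """'^': m^p is m * m * ... * m with p factors."""
    out = m
    for _ in range(p - 1):
        out = star(out, m)
    return out


def show(m):
    """Main effects first, then two-way interactions, and so on.
    sorted() is stable, so the order within each category is preserved."""
    return " + ".join(".".join(term) for term in sorted(m, key=len))


T1, T2, T3 = [("T1",)], [("T2",)], [("T3",)]
full = star(star(T1, T2), T3)
print(show(full))                                   # T1*T2*T3
print(show(remove(full, [("T1", "T2", "T3")])))     # T1*T2*T3 - T1.T2.T3
print(show(power(add(add(T1, T2), T3), 2)))         # (T1+T2+T3)^2
print(show(colon("VAR2", "VAR4")))                  # VAR2:VAR4
```

Comparing terms through `frozenset` mirrors the statement that variable order is ignored when terms are compared, while the tuples themselves retain the order in which the variables were written.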

3.2.1
Precedence
Each operator has an associated default precedence, but this can be overridden through the use of parentheses. The default precedence is:
1. 
The $:$ operator, with the resulting expression treated as if it were surrounded by parentheses. Therefore, ${V}_{1}+{V}_{3}:{V}_{6}*{V}_{7}$ is equivalent to ${V}_{1}+\left({V}_{3}+{V}_{4}+{V}_{5}+{V}_{6}\right)*{V}_{7}$. 
2. 
The $^$ operator, with the resulting expression treated as if it were surrounded by parentheses. Therefore, $\left({T}_{1}+{T}_{2}+{T}_{3}\right)^2.{T}_{4}$ is equivalent to $\left(\left({T}_{1}+{T}_{2}+{T}_{3}\right)^2\right).{T}_{4}$, which is equivalent to ${T}_{1}.{T}_{4}+{T}_{2}.{T}_{4}+{T}_{3}.{T}_{4}+{T}_{1}.{T}_{2}.{T}_{4}+{T}_{1}.{T}_{3}.{T}_{4}+{T}_{2}.{T}_{3}.{T}_{4}$. 
3. 
The $.$ operator, so ${T}_{1}*{T}_{2}.{T}_{3}$ is equivalent to ${T}_{1}*\left({T}_{2}.{T}_{3}\right)$. 
4. 
The $*$ operator.
(i) 
When using parentheses with the $*$ or $.$ operators the usual rules of multiplication apply, therefore $\left({T}_{1}+{T}_{3}.{T}_{4}\right).\left({T}_{5}+{T}_{7}\right)$ is equivalent to ${T}_{1}.{T}_{5}+{T}_{1}.{T}_{7}+{T}_{3}.{T}_{4}.{T}_{5}+{T}_{3}.{T}_{4}.{T}_{7}$ and $\left({T}_{1}+{T}_{3}.{T}_{4}\right)*\left({T}_{5}+{T}_{7}\right)$ is equivalent to ${T}_{1}+{T}_{5}+{T}_{7}+{T}_{3}.{T}_{4}+{T}_{1}.{T}_{5}+{T}_{1}.{T}_{7}+{T}_{3}.{T}_{4}.{T}_{5}+{T}_{3}.{T}_{4}.{T}_{7}$. 
(ii) 
Syntax of the following form is invalid: ${T}_{1}o\left({T}_{2}\right)o{T}_{3}$, where $o$ indicates an operator, unless one or more of those operators are $+$ and/or $-$. Therefore, ${T}_{1}.\left({T}_{2}+{T}_{3}\right)*{T}_{4}$ is invalid, whilst ${T}_{1}.\left({T}_{2}+{T}_{3}\right)+{T}_{4}$ is valid. 

5. 
The $+$ and $-$ operators have equal precedence.
(i) 
If the terms associated with a $-$ operator do not occur in the current expression they are ignored, therefore ${T}_{1}+\left({T}_{2}-{T}_{1}\right)$ is equivalent to ${T}_{1}+{T}_{2}$; the $\left({T}_{2}-{T}_{1}\right)$ part of the expression is calculated first and results in ${T}_{2}$, as the ${T}_{1}$ term does not exist in this particular subexpression and so cannot be removed. 
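The precedence rules can be made concrete with a small recursive-descent parser, sketched below in Python. It is an illustration only, not the parser used by g22yaf: the function `expand` and its helpers are hypothetical, variable names are assumed to be simple identifiers, the mean indicator and the $@$ operator are not handled, no error checking is performed, and the restriction described in item 4(ii) is not enforced. Each grammar level corresponds to one precedence level, from $:$ (binds tightest) down to $+$ and $-$.

```python
# Sketch of the default operator precedence; this is not NAG code.
import re


def _add(a, b):
    out, seen = [], set()
    for t in a + b:
        if frozenset(t) not in seen:
            seen.add(frozenset(t))
            out.append(t)
    return out


def _dot(a, b):
    return _add([], [x + tuple(v for v in y if v not in x) for x in a for y in b])


def _star(a, b):
    return _add(_add(a, b), _dot(a, b))


def _colon(first, last):
    root, lo = re.fullmatch(r"(.*?)(\d+)", first).groups()
    hi = re.fullmatch(r"(.*?)(\d+)", last).group(2)
    return [(f"{root}{k}",) for k in range(int(lo), int(hi) + 1)]


def expand(formula):
    """Expand a formula into '+' and '.' only, main effects first."""
    toks = re.findall(r"[A-Za-z_]\w*|\d+|[-+.*:^()]", formula)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def take():
        nonlocal pos
        pos += 1
        return toks[pos - 1]

    def atom():                      # parentheses, a name, or name:name
        if peek() == "(":
            take()
            m = expr()
            take()                   # closing ')', not validated in this sketch
            return m
        name = take()
        if peek() == ":":
            take()
            return _colon(name, take())
        return [(name,)]

    def power():                     # '^' binds next
        m = atom()
        if peek() == "^":
            take()
            base = m
            for _ in range(int(take()) - 1):
                m = _star(m, base)
        return m

    def inter():                     # then '.'
        m = power()
        while peek() == ".":
            take()
            m = _dot(m, power())
        return m

    def prod():                      # then '*'
        m = inter()
        while peek() == "*":
            take()
            m = _star(m, inter())
        return m

    def expr():                      # '+' and '-' share the lowest precedence
        m = prod()
        while peek() in ("+", "-"):
            if take() == "+":
                m = _add(m, prod())
            else:
                drop = {frozenset(t) for t in prod()}
                m = [t for t in m if frozenset(t) not in drop]
        return m

    return " + ".join(".".join(t) for t in sorted(expr(), key=len))


print(expand("V1 + V3:V6*V7"))
print(expand("(T1+T2+T3)^2.T4"))
print(expand("(T1+T3.T4)*(T5+T7)"))
print(expand("T1 + (T2 - T1)"))
```

Because a parenthesised subexpression is evaluated on its own before being combined with its neighbours, the last call reproduces the behaviour described for the removal of a term that is absent from the current subexpression.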

3.2.2
Mean Effect / Intercept Term
A mean effect (or intercept term) can be explicitly added to a formula by specifying $1$ and can be explicitly excluded from the formula by specifying $-1$. For example, $1+{V}_{1}+{V}_{2}$ indicates a model with the main effects of two variables and a mean effect, whereas ${V}_{1}+{V}_{2}-1$ denotes the same model, but without the mean effect. The mean indicator can appear anywhere in the formula string as long as it is not contained within parentheses.
If the mean effect is not explicitly mentioned in the model formula, the model is assumed to include a mean effect.
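As a rough illustration of these rules, the following Python sketch reads the mean indicator off a formula string. It is not NAG code: the function name is hypothetical, variable names are assumed to be simple identifiers, and the choice that the last indicator wins when more than one is present is an assumption not stated in this document.

```python
# Sketch of the mean-effect rules; this is not NAG code.
import re


def includes_mean(formula):
    """Return True if the model described by `formula` has a mean effect."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[-+.*:^()]", formula)
    include = True   # default when the mean is not mentioned
    depth = 0        # current nesting level of parentheses
    prev = None
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        elif tok == "1" and prev != "^":   # a '1' after '^' is an exponent
            if depth > 0:
                raise ValueError("mean indicator inside parentheses")
            include = prev != "-"          # assumption: last indicator wins
        prev = tok
    return include


print(includes_mean("1 + V1 + V2"))
print(includes_mean("V1 + V2 - 1"))
print(includes_mean("V1 + V2"))
```

A variable such as V1 is read as a single name, so its trailing digit is never mistaken for the mean indicator.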
4
References
None.
5
Arguments
 1: $\mathbf{hform}$ – Type (c_ptr)Input/Output

On entry: must be set to
c_null_ptr.
As an alternative, an existing G22 handle may be supplied in which case this routine will destroy the supplied G22 handle as if
g22zaf had been called.
On exit: holds a G22 handle to the internal data structure containing a description of the model
$\mathcal{M}$ as specified in
formula. You
must not change the G22 handle other than through routines in
Chapter G22.
 2: $\mathbf{formula}$ – Character(*)Input

On entry: a string containing the formula specifying
$\mathcal{M}$. See
Section 3 for details on the allowed model syntax.
 3: $\mathbf{ifail}$ – IntegerInput/Output

On entry:
ifail must be set to
$0$,
$-1\text{ or }1$. If you are unfamiliar with this argument you should refer to
Section 3.4 in How to Use the NAG Library and its Documentation for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value
$-1\text{ or }1$ is recommended. If the output of error messages is undesirable, then the value
$1$ is recommended. Otherwise, if you are not familiar with this argument, the recommended value is
$0$.
When the value $-\mathbf{1}\text{ or }\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit:
${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see
Section 6).
6
Error Indicators and Warnings
If on entry
${\mathbf{ifail}}=0$ or
$-1$, explanatory error messages are output on the current error message unit (as defined by
x04aaf).
Errors or warnings detected by the routine:
 ${\mathbf{ifail}}=11$

On entry,
hform is not
c_null_ptr or a recognised G22 handle.
 ${\mathbf{ifail}}=21$

The formula contained a mismatched parenthesis.
The position in the formula string of the error is $\langle\mathit{\text{value}}\rangle$.
 ${\mathbf{ifail}}=22$

An operator was missing.
The position in the formula string of the error is $\langle\mathit{\text{value}}\rangle$.
 ${\mathbf{ifail}}=23$

Invalid use of an operator.
The position in the formula string of the error is $\langle\mathit{\text{value}}\rangle$.
 ${\mathbf{ifail}}=24$

Invalid specification for the power operator.
The position in the formula string of the error is $\langle\mathit{\text{value}}\rangle$.
 ${\mathbf{ifail}}=25$

Invalid specification for the colon operator.
The position in the formula string of the error is $\langle\mathit{\text{value}}\rangle$.
 ${\mathbf{ifail}}=26$

Invalid specification for the mean.
The position in the formula string of the error is $\langle\mathit{\text{value}}\rangle$.
 ${\mathbf{ifail}}=27$

Invalid variable name.
The position in the formula string of the error is $\langle\mathit{\text{value}}\rangle$.
 ${\mathbf{ifail}}=28$

Missing variable name.
The position in the formula string of the error is $\langle\mathit{\text{value}}\rangle$.
 ${\mathbf{ifail}}=29$

After processing, the model contains no terms.
 ${\mathbf{ifail}}=30$

An invalid contrast specifier has been supplied.
The position in the formula string of the error is $\langle\mathit{\text{value}}\rangle$.
 ${\mathbf{ifail}}=31$

A term contained a repeated variable with a different contrast specifier.
 ${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please
contact
NAG.
See
Section 3.9 in How to Use the NAG Library and its Documentation for further information.
 ${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See
Section 3.8 in How to Use the NAG Library and its Documentation for further information.
 ${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See
Section 3.7 in How to Use the NAG Library and its Documentation for further information.
7
Accuracy
Not applicable.
8
Parallelism and Performance
g22yaf is not threaded in any implementation.
9
Further Comments
None.
10
Example
This example reads in and parses a formula specifying a model,
$\mathcal{M}$, and displays the processed formula. A data matrix,
$D$, is then read in and a design matrix constructed from
$D$ and
$\mathcal{M}$ using
g22ycf.
The design matrix includes an explicit term for the mean effect.
See also the examples for
g22ybf,
g22ycf and
g22ydf.
10.1
Program Text
Program Text (g22yafe.f90)
10.2
Program Data
Program Data (g22yafe.d)
10.3
Program Results
Program Results (g22yafe.r)
11
Optional Parameters
As well as the optional parameters common to all G22 handles described in
g22zmf and
g22znf, a number of additional optional parameters can be specified for a G22 handle holding the description of a model, as returned by
g22yaf in
hform.
Each writeable optional parameter has an associated default value; to set any of them to a non-default value, use
g22zmf. The value of any optional parameter can be queried using
g22znf.
The remainder of this section can be skipped if you wish to use the default values for all optional parameters.
The optional parameters available are Contrast, Explicit Mean and Storage Order, together with a parameter that returns the expanded model formula. A full description of each optional parameter is provided in
Section 11.1.
All routines that make use of the G22 handle returned by g22yaf combine it with a description of a data matrix, $D$, to construct a design matrix, $X$.
11.1
Description of the Optional Parameters
For each option, we give a summary line, a description of the optional parameter and details of constraints.
The summary line contains:
 a parameter value,
where the letters $a$, $i$ and $r$ denote options that take character, integer and real values respectively;
 the default value.
Keywords and character values are case and white space insensitive.
Contrast  $a$  Default $\text{}=\mathrm{FIRST}$ 
This parameter controls the default contrasts used for the categorical independent variables appearing in the model. Six types of contrast, as well as dummy variables, are available:
 $\mathrm{FIRST}$
 Treatment contrasts relative to the first level of the variable will be used.
 $\mathrm{LAST}$
 Treatment contrasts relative to the last level of the variable will be used.
 $\mathrm{SUM\; FIRST}$
 Sum contrasts relative to the first level of the variable will be used.
 $\mathrm{SUM\; LAST}$
 Sum contrasts relative to the last level of the variable will be used.
 $\mathrm{HELMERT}$
 Helmert contrasts will be used.
 $\mathrm{POLYNOMIAL}$
 Polynomial contrasts will be used.
 $\mathrm{DUMMY}$
 Dummy variables will be used rather than a contrast.
See
g22ycf for more information on contrasts, their effect on the design matrix and how they are constructed.
This parameter may have an
instance identifier associated with it (see
g22zmf and
g22znf). The
instance identifier must be the name of one of the variables appearing in the model supplied in
formula when the G22 handle was created. For example,
CONTRAST : VAR1 = HELMERT would set Helmert contrasts for the variable named
VAR1.
If no instance identifier is specified, the default contrast for all categorical variables in the model is changed, otherwise only the default contrast for the named variable is changed.
In some situations it might be necessary for a variable to use a different contrast, depending on where it appears in the model formula. In order to allow contrasts to be specified on a term by term basis the $@$ operator can be used in the model formula. The syntax for this operator is ${V}_{j}@c$, where $c$ is one of: F, L, SF, SL, H, P or D, corresponding to treatment contrasts relative to the first and last levels, sum contrasts relative to the first and last levels, Helmert contrasts, polynomial contrasts or dummy variables respectively.
If the contrast has not been explicitly specified via the
$@$ operator, the value obtained from the optional parameter
Contrast is used.
For example, setting
formula to
VAR1 + VAR1@H.VAR2@P + VAR2@H.VAR3 specifies that the variable named
VAR1 should use the default contrasts in the first term and Helmert contrasts in the second term. The variable named
VAR2 should use polynomial contrasts in the second term and Helmert contrasts in the third term. The variable named
VAR3 should use the default contrasts in the third term.
Constraint:
${\mathbf{Contrast}}=\mathrm{FIRST}$, $\mathrm{LAST}$, $\mathrm{SUM\; FIRST}$, $\mathrm{SUM\; LAST}$, $\mathrm{HELMERT}$, $\mathrm{POLYNOMIAL}$ or $\mathrm{DUMMY}$.
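The way the $@$ operator interacts with the Contrast optional parameter can be sketched in Python as follows. This is an illustration rather than NAG code: the function `term_contrasts` and its arguments are hypothetical, and the sketch only handles formulas that have already been reduced to the $+$ and $.$ operators. The `per_variable` argument stands in for values set with an instance identifier and `default` for the value set without one.

```python
# Sketch of how a contrast is chosen for each variable in each term;
# this is not NAG code.
CODES = {
    "F": "FIRST", "L": "LAST", "SF": "SUM FIRST", "SL": "SUM LAST",
    "H": "HELMERT", "P": "POLYNOMIAL", "D": "DUMMY",
}


def term_contrasts(formula, default="FIRST", per_variable=None):
    """Return, for each term, a list of (variable, contrast) pairs."""
    per_variable = per_variable or {}
    result = []
    for term in formula.split("+"):
        parsed = []
        for item in term.split("."):
            name, _, code = item.strip().partition("@")
            name = name.strip()
            if code:
                # explicit V@c specifier in the formula takes priority
                contrast = CODES[code.strip()]
            else:
                # otherwise fall back to the Contrast optional parameter
                contrast = per_variable.get(name, default)
            parsed.append((name, contrast))
        result.append(parsed)
    return result


for term in term_contrasts("VAR1 + VAR1@H.VAR2@P + VAR2@H.VAR3"):
    print(term)
```

Calling `term_contrasts(..., per_variable={"VAR3": "DUMMY"})` would change only the entry for VAR3 in the third term, which corresponds to setting the option with an instance identifier.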
Explicit Mean  $a$  Default $\text{}=\mathrm{NO}$ 
If ${\mathbf{Explicit\; Mean}}=\mathrm{YES}$, any mean effect included in the model will be explicitly added to the design matrix, $X$, as a column of $1$s.
If
${\mathbf{Explicit\; Mean}}=\mathrm{NO}$, it is assumed that the routine to which
$X$ will be passed treats the mean effect as a special case, see
mean in
g02daf for example.
Constraint:
${\mathbf{Explicit\; Mean}}=\mathrm{YES}$ or $\mathrm{NO}$.
Formula  $a$ 
This parameter returns a verbose version of the model formula specified in
formula, expanded and simplified to only contain variable names, the operators
$+$ and
$.$ and any contrast identifiers present.
Storage Order  $a$  Default $\text{}=\mathrm{OBSVAR}$ 
This optional parameter controls how the design matrix,
$X$, should be stored in its output array and only has an effect if the design matrix is being constructed using
g22ycf.
If ${\mathbf{Storage\; Order}}=\mathrm{OBSVAR}$, ${X}_{ij}$, the value for the $j$th variable of the $i$th observation of the design matrix is stored in ${\mathbf{x}}\left(i,j\right)$.
If ${\mathbf{Storage\; Order}}=\mathrm{VAROBS}$, ${X}_{ij}$, the value for the $j$th variable of the $i$th observation of the design matrix is stored in ${\mathbf{x}}\left(j,i\right)$.
Here,
x is the output parameter of the same name in
g22ycf.
Constraint:
${\mathbf{Storage\; Order}}=\mathrm{OBSVAR}$ or $\mathrm{VAROBS}$.
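The two storage orders are transposes of one another, as the following Python sketch shows. It is an illustration only: the function `store` is hypothetical, nested lists stand in for the Fortran array x, and Python indices start at 0 whereas the indices in the description above start at 1.

```python
# Sketch of the two storage orders; this is not NAG code.
def store(design, order="OBSVAR"):
    """`design` holds one inner list per observation of the design matrix X."""
    if order == "OBSVAR":
        # x(i, j): observation i, variable j
        return [list(row) for row in design]
    if order == "VAROBS":
        # x(j, i): variable j, observation i (the transpose)
        return [list(col) for col in zip(*design)]
    raise ValueError("order must be OBSVAR or VAROBS")


X = [[1.0, 2.0, 3.0],   # observation 1
     [4.0, 5.0, 6.0]]   # observation 2

by_obs = store(X, "OBSVAR")
by_var = store(X, "VAROBS")
print(by_obs[1][2], by_var[2][1])  # the same element of X in both layouts
```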