# NAG FL Interface g07gaf (outlier_peirce_1var)

## 1 Purpose

g07gaf identifies outlying values using Peirce's criterion.

## 2 Specification

Fortran Interface
 Subroutine g07gaf (n, p, y, mean, var, iout, niout, ldiff, diff, llamb, ifail)
 Integer, Intent (In) :: n, p, ldiff
 Integer, Intent (Inout) :: ifail
 Integer, Intent (Out) :: iout(n), niout
 Real (Kind=nag_wp), Intent (In) :: y(n), mean, var
 Real (Kind=nag_wp), Intent (Out) :: diff(ldiff), llamb(ldiff)
C Header Interface
 #include <nag.h>
 void g07gaf_ (const Integer *n, const Integer *p, const double y[], const double *mean, const double *var, Integer iout[], Integer *niout, const Integer *ldiff, double diff[], double llamb[], Integer *ifail)
The routine may be called by the names g07gaf or nagf_univar_outlier_peirce_1var.

## 3 Description

g07gaf flags outlying values in data using Peirce's criterion. Let
• $y$ denote a vector of $n$ observations (for example the residuals) obtained from a model with $p$ parameters,
• $m$ denote the number of potential outlying values,
• $\mu$ and ${\sigma }^{2}$ denote the mean and variance of $y$ respectively,
• $\tilde{y}$ denote a vector of length $n-m$ constructed by dropping the $m$ values from $y$ with the largest value of $\left|{y}_{i}-\mu \right|$,
• ${\tilde{\sigma }}^{2}$ denote the (unknown) variance of $\tilde{y}$,
• $\lambda$ denote the ratio of $\tilde{\sigma }$ and $\sigma$ with $\lambda =\frac{\tilde{\sigma }}{\sigma }$.
Peirce's method flags ${y}_{i}$ as a potential outlier if $\left|{y}_{i}-\mu \right|\ge x$, where $x={\sigma }^{2}z$ and $z$ is obtained from the solution of
 $R^{m}=\lambda^{m-n}\frac{m^{m}\left(n-m\right)^{n-m}}{n^{n}}$ (1)
where
 $R=2\exp\left(\frac{z^{2}-1}{2}\right)\left(1-\Phi\left(z\right)\right)$ (2)
and $\Phi$ is the cumulative distribution function for the standard Normal distribution.
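Equation (2) can be evaluated directly with the complementary error function. The following is a minimal Python sketch, not part of the NAG Library; the name `peirce_r` is hypothetical, and it relies on the identity $1-\Phi(z)=\frac{1}{2}\mathrm{erfc}\left(z/\sqrt{2}\right)$.

```python
import math


def peirce_r(z):
    """Evaluate R = 2 exp((z^2 - 1) / 2) (1 - Phi(z)) from equation (2)."""
    # 1 - Phi(z) written as erfc(z / sqrt(2)) / 2, which avoids the
    # cancellation that 1 - Phi(z) suffers when z is large.
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))
    return 2.0 * math.exp(0.5 * (z * z - 1.0)) * upper_tail
```

At $z=0$ the upper tail is $\frac{1}{2}$, so `peirce_r(0.0)` reduces to $\exp(-\frac{1}{2})$, and $R$ decreases as $z$ grows.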
As ${\tilde{\sigma }}^{2}$ is unknown, an assumption is made that the relationship between ${\tilde{\sigma }}^{2}$ and ${\sigma }^{2}$, and hence $\lambda$, depends only on the sum of squares of the rejected observations, and the ratio is estimated as
 $\lambda^{2}=\frac{n-p-mz^{2}}{n-p-m}$
which gives
 $z^{2}=1+\frac{n-p-m}{m}\left(1-\lambda^{2}\right)$ (3)
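As a check on the algebra, equation (3) follows from the estimated ratio by clearing the denominator and solving for $z^{2}$, using only the symbols defined above:
 $\lambda^{2}\left(n-p-m\right)=n-p-mz^{2}$
 $mz^{2}=\left(n-p\right)-\lambda^{2}\left(n-p-m\right)=m+\left(n-p-m\right)\left(1-\lambda^{2}\right)$
 $z^{2}=1+\frac{n-p-m}{m}\left(1-\lambda^{2}\right)$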
A value for the cutoff $x$ is calculated iteratively. An initial value of $R=0.2$ is used and a value of $\lambda$ is estimated using equation (1). Equation (3) is then used to obtain an estimate of $z$ and then equation (2) is used to get a new estimate for $R$. This process is then repeated until the relative change in $z$ between consecutive iterations is at most $\sqrt{\epsilon }$, where $\epsilon$ is machine precision.
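The iteration described above might be sketched as follows in Python. This is an illustrative reading of equations (1) to (3), not the g07gaf source: the function name `peirce_z`, the iteration cap `max_iter`, the `None` return for failure, and the use of `sys.float_info.epsilon` for $\epsilon$ are all assumptions.

```python
import math
import sys


def peirce_z(n, p, m, r0=0.2, max_iter=200):
    """Fixed-point iteration for z with n observations, p parameters, m outliers.

    Returns None if equation (3) gives a non-positive z^2 or the iteration
    does not settle within max_iter steps (both are choices of this sketch).
    """
    tol = math.sqrt(sys.float_info.epsilon)
    # log of m^m (n-m)^(n-m) / n^n, the constant factor in equation (1)
    log_q = m * math.log(m) + (n - m) * math.log(n - m) - n * math.log(n)
    r = r0
    z_old = None
    for _ in range(max_iter):
        # Equation (1) rearranged: lambda^(n-m) = [m^m (n-m)^(n-m) / n^n] / R^m
        lam2 = math.exp(2.0 * (log_q - m * math.log(r)) / (n - m))
        # Equation (3): z^2 = 1 + (n-p-m)/m * (1 - lambda^2)
        z2 = 1.0 + (n - p - m) / m * (1.0 - lam2)
        if z2 <= 0.0:
            return None
        z = math.sqrt(z2)
        # Stop when the relative change in z is at most sqrt(eps)
        if z_old is not None and abs(z - z_old) <= tol * abs(z):
            return z
        # Equation (2), using 2 (1 - Phi(z)) = erfc(z / sqrt(2))
        r = math.exp(0.5 * (z2 - 1.0)) * math.erfc(z / math.sqrt(2.0))
        z_old = z
    return None
```

For example, `peirce_z(10, 1, 1)` gives a value of roughly 1.88, and `peirce_z(10, 1, 2)` gives a smaller value, consistent with the cutoff decreasing as $m$ grows.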
By construction, the cutoff for testing for $m+1$ potential outliers is less than the cutoff for testing for $m$ potential outliers. Peirce's criterion is therefore applied sequentially: the existence of a single potential outlier is investigated first; if one is found, the existence of two potential outliers is investigated, and so on.
If one of a duplicate series of observations is flagged as an outlier, then all of them are flagged as outliers.
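The sequential use of the criterion, together with the treatment of duplicates, can be sketched as below. This is a hedged illustration rather than the routine's actual logic: `flag_sequentially` is a hypothetical name, `cutoff` is a hypothetical callable returning the cutoff $x$ for $m$ potential outliers, and the point at which duplicates are added is an assumption. Indices are 0-based here, whereas g07gaf returns 1-based indices.

```python
def flag_sequentially(y, mu, cutoff, max_outliers):
    """Return 0-based indices of flagged values, largest deviation first."""
    # Order the observations by descending absolute deviation from the mean.
    order = sorted(range(len(y)), key=lambda i: -abs(y[i] - mu))
    niout = 0
    # Test for m = 1, 2, ... potential outliers until a test fails.
    while niout < min(max_outliers, len(y)):
        m = niout + 1
        candidate = order[m - 1]
        if abs(y[candidate] - mu) < cutoff(m):
            break
        niout = m
    # Assumed duplicate rule: any observation equal to a flagged value
    # is flagged as well.
    flagged_values = {y[i] for i in order[:niout]}
    return [i for i in order if y[i] in flagged_values]
```

With the made-up data `y = [0.1, -0.2, 5.0, 0.0, 5.0, 0.3]`, `mu = 0.0` and a toy decreasing cutoff `lambda m: 2.0 - 0.1 * m`, both observations equal to 5.0 are flagged and the test for a third outlier fails.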

## 4 References

Gould B A (1855) On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application The Astronomical Journal 45
Peirce B (1852) Criterion for the rejection of doubtful observations The Astronomical Journal 45

## 5 Arguments

1: $\mathbf{n}$ Integer Input
On entry: $n$, the number of observations.
Constraint: ${\mathbf{n}}\ge 3$.
2: $\mathbf{p}$ Integer Input
On entry: $p$, the number of parameters in the model used in obtaining the $y$. If $y$ is an observed set of values, as opposed to the residuals from fitting a model with $p$ parameters, then $p$ should be set to $1$, i.e., as if a model just containing the mean had been used.
Constraint: $1\le {\mathbf{p}}\le {\mathbf{n}}-2$.
3: $\mathbf{y}\left({\mathbf{n}}\right)$ Real (Kind=nag_wp) array Input
On entry: $y$, the data being tested.
4: $\mathbf{mean}$ Real (Kind=nag_wp) Input
On entry: if ${\mathbf{var}}>0.0$, mean must contain $\mu$, the mean of $y$, otherwise mean is not referenced and the mean is calculated from the data supplied in y.
5: $\mathbf{var}$ Real (Kind=nag_wp) Input
On entry: if ${\mathbf{var}}>0.0$, var must contain ${\sigma }^{2}$, the variance of $y$, otherwise the variance is calculated from the data supplied in y.
6: $\mathbf{iout}\left({\mathbf{n}}\right)$ Integer array Output
On exit: the indices of the values in y sorted in descending order of the absolute difference from the mean, so that $\left|{\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}-1\right)\right)-\mu \right|\ge \left|{\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}\right)\right)-\mu \right|$, for $\mathit{i}=2,3,\dots ,{\mathbf{n}}$.
7: $\mathbf{niout}$ Integer Output
On exit: the number of potential outliers. The indices for these potential outliers are held in the first niout elements of iout. By construction there can be at most ${\mathbf{n}}-{\mathbf{p}}-1$ values flagged as outliers.
8: $\mathbf{ldiff}$ Integer Input
On entry: the maximum number of values to be returned in arrays diff and llamb.
If ${\mathbf{ldiff}}\le 0$, arrays diff and llamb are not referenced.
9: $\mathbf{diff}\left({\mathbf{ldiff}}\right)$ Real (Kind=nag_wp) array Output
On exit: ${\mathbf{diff}}\left(\mathit{i}\right)$ holds $\left|y-\mu \right|-{\sigma }^{2}z$ for observation ${\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}\right)\right)$, for $\mathit{i}=1,2,\dots ,\min\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
10: $\mathbf{llamb}\left({\mathbf{ldiff}}\right)$ Real (Kind=nag_wp) array Output
On exit: ${\mathbf{llamb}}\left(\mathit{i}\right)$ holds $\mathrm{log}\left({\lambda }^{2}\right)$ for observation ${\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}\right)\right)$, for $\mathit{i}=1,2,\dots ,\min\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
11: $\mathbf{ifail}$ Integer Input/Output
On entry: ifail must be set to $0$, $-1$ or $1$ to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of $0$ causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of $-1$ means that an error message is printed while a value of $1$ means that it is not.
If halting is not appropriate, the value $-1$ or $1$ is recommended. If message printing is undesirable, then the value $1$ is recommended. Otherwise, the value $0$ is recommended. When the value $-\mathbf{1}$ or $\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit: ${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see Section 6).
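To make the relationship between the output arguments concrete, the following Python sketch pairs each returned entry with the observation it refers to. It does not call the NAG Library; `summarize` is a hypothetical helper that takes arrays shaped like y, iout and diff, and converts the 1-based indices held in iout to Python's 0-based indexing.

```python
def summarize(y, iout, niout, diff):
    """Return (index, value, diff, flagged) tuples for the reported entries."""
    rows = []
    for k, d in enumerate(diff):
        idx = iout[k]            # 1-based index into y, as returned in iout
        flagged = k < niout      # the first niout entries of iout are flagged
        rows.append((idx, y[idx - 1], d, flagged))
    return rows
```

For instance, with made-up values (not output from g07gaf) `y = [0.5, -3.0, 0.1, 2.5]`, `iout = [2, 4, 1, 3]`, `niout = 1` and `diff = [0.4, -0.2]`, the first row describes the flagged observation `y(2) = -3.0`, and the second row shows by how much `y(4) = 2.5` fell short of the cutoff.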

## 6 Error Indicators and Warnings

If on entry ${\mathbf{ifail}}=0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
${\mathbf{ifail}}=1$
On entry, ${\mathbf{n}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{n}}\ge 3$.
${\mathbf{ifail}}=2$
On entry, ${\mathbf{p}}=\langle \mathit{\text{value}}\rangle$ and ${\mathbf{n}}=\langle \mathit{\text{value}}\rangle$.
Constraint: $1\le {\mathbf{p}}\le {\mathbf{n}}-2$.
${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.

## 7 Accuracy

Not applicable.

## 8 Parallelism and Performance

g07gaf is not threaded in any implementation.

## 9 Further Comments

One problem with Peirce's algorithm as implemented in g07gaf is the assumed relationship between ${\sigma }^{2}$, the variance using the full dataset, and ${\tilde{\sigma }}^{2}$, the variance with the potential outliers removed. In some cases, for example if the data $y$ were the residuals from a linear regression, this assumption may not hold, as the regression line may change significantly when outlying values have been dropped, resulting in a radically different set of residuals. In such cases g07gbf should be used instead.

## 10 Example

This example reads in a series of data and flags any potential outliers.
The dataset used is from Peirce's original paper and consists of fifteen observations on the vertical semidiameter of Venus.

### 10.1 Program Text

Program Text (g07gafe.f90)

### 10.2 Program Data

Program Data (g07gafe.d)

### 10.3 Program Results

Program Results (g07gafe.r)