# NAG Library Function Document: nag_outlier_peirce (g07gac)

## 1  Purpose

nag_outlier_peirce (g07gac) identifies outlying values using Peirce's criterion.

## 2  Specification

 #include <nag.h>
 #include <nagg07.h>
 void nag_outlier_peirce (Integer n, Integer p, const double y[], double mean, double var, Integer iout[], Integer *niout, Integer ldiff, double diff[], double llamb[], NagError *fail)

## 3  Description

nag_outlier_peirce (g07gac) flags outlying values in data using Peirce's criterion. Let
• $y$ denote a vector of $n$ observations (for example the residuals) obtained from a model with $p$ parameters,
• $m$ denote the number of potential outlying values,
• $\mu$ and ${\sigma }^{2}$ denote the mean and variance of $y$ respectively,
• $\stackrel{~}{y}$ denote a vector of length $n-m$ constructed by dropping the $m$ values from $y$ with the largest value of $\left|{y}_{i}-\mu \right|$,
• ${\stackrel{~}{\sigma }}^{2}$ denote the (unknown) variance of $\stackrel{~}{y}$,
• $\lambda$ denote the ratio of $\stackrel{~}{\sigma }$ and $\sigma$ with $\lambda =\frac{\stackrel{~}{\sigma }}{\sigma }$.
Peirce's method flags ${y}_{i}$ as a potential outlier if $\left|{y}_{i}-\mu \right|\ge x$, where $x={\sigma }^{2}z$ and $z$ is obtained from the solution of
 ${R}^{m}={\lambda }^{m-n}\frac{{m}^{m}{\left(n-m\right)}^{n-m}}{{n}^{n}}$ (1)
where
 $R=2\mathrm{exp}\left(\frac{{z}^{2}-1}{2}\right)\left(1-\Phi \left(z\right)\right)$ (2)
and $\Phi$ is the cumulative distribution function for the standard Normal distribution.
As ${\stackrel{~}{\sigma }}^{2}$ is unknown, an assumption is made that the relationship between ${\stackrel{~}{\sigma }}^{2}$ and ${\sigma }^{2}$, and hence $\lambda$, depends only on the sum of squares of the rejected observations, and the ratio is estimated as
 ${\lambda }^{2}=\frac{n-p-m{z}^{2}}{n-p-m}$
which gives
 ${z}^{2}=1+\frac{n-p-m}{m}\left(1-{\lambda }^{2}\right)$ (3)
A value for the cutoff $x$ is calculated iteratively. An initial value of $R=0.2$ is used and a value of $\lambda$ is estimated using equation (1). Equation (3) is then used to obtain an estimate of $z$, and equation (2) is used to get a new estimate for $R$. This process is repeated until the relative change in $z$ between consecutive iterations is $\le \sqrt{\epsilon }$, where $\epsilon$ is machine precision.
By construction, the cutoff for testing for $m+1$ potential outliers is less than the cutoff for testing for $m$ potential outliers. Peirce's criterion is therefore applied in sequence, with the existence of a single potential outlier being investigated first. If one is found, the existence of two potential outliers is investigated, and so on.
If one of a duplicate series of observations is flagged as an outlier, then all of them are flagged as outliers.
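The sequential use of the criterion, together with the rule for duplicates, might be organised along the following lines. This is a sketch under stated assumptions rather than the library's code: the helper name `count_flagged` is hypothetical, the caller is assumed to supply the absolute deviations already sorted in descending order and a cutoff for each value of $m$, and duplicates are detected here simply as equal absolute deviations (which is weaker than requiring equal observations).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. dev[0..n-1] holds |y_i - mu| in descending order and
   cutoff[m-1] holds the cutoff x when testing for m potential outliers,
   for m = 1..max_m. Returns the number of values flagged. */
size_t count_flagged(size_t n, const double dev[],
                     const double cutoff[], size_t max_m)
{
    assert(dev != NULL && cutoff != NULL);

    size_t flagged = 0;
    while (flagged < max_m && flagged < n) {
        size_t m = flagged + 1;        /* now test for m potential outliers */
        if (dev[m - 1] < cutoff[m - 1]) {
            break;                     /* m-th candidate is not flagged: stop */
        }
        flagged = m;
        /* assumption: equal deviations are treated as duplicates and
           flagged together */
        while (flagged < n && dev[flagged] == dev[flagged - 1]) {
            ++flagged;
        }
    }
    return flagged;
}
```

Stopping at the first value of $m$ that fails the test mirrors the description above, in which two outliers are only investigated once one has been found.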

## 4  References

Gould B A (1855) On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application The Astronomical Journal 45
Peirce B (1852) Criterion for the rejection of doubtful observations The Astronomical Journal 45

## 5  Arguments

1:     n – Integer, Input
On entry: $n$, the number of observations.
Constraint: ${\mathbf{n}}\ge 3$.
2:     p – Integer, Input
On entry: $p$, the number of parameters in the model used in obtaining the $y$. If $y$ is an observed set of values, as opposed to the residuals from fitting a model with $p$ parameters, then $p$ should be set to $1$, i.e., as if a model just containing the mean had been used.
Constraint: $1\le {\mathbf{p}}\le {\mathbf{n}}-2$.
3:     y[n] – const double, Input
On entry: $y$, the data being tested.
4:     mean – double, Input
On entry: if ${\mathbf{var}}>0.0$, mean must contain $\mu$, the mean of $y$; otherwise mean is not referenced and the mean is calculated from the data supplied in y.
5:     var – double, Input
On entry: if ${\mathbf{var}}>0.0$, var must contain ${\sigma }^{2}$, the variance of $y$; otherwise the variance is calculated from the data supplied in y.
6:     iout[n] – Integer, Output
On exit: the indices of the values in y sorted in descending order of the absolute difference from the mean, therefore $\left|{\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-2\right]-1\right]-\mu \right|\ge \left|{\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]-\mu \right|$, for $\mathit{i}=2,3,\dots ,{\mathbf{n}}$.
7:     niout – Integer *, Output
On exit: the number of potential outliers. The indices for these potential outliers are held in the first niout elements of iout. By construction there can be at most ${\mathbf{n}}-{\mathbf{p}}-1$ values flagged as outliers.
8:     ldiff – Integer, Input
On entry: the maximum number of values to be returned in arrays diff and llamb.
If ${\mathbf{ldiff}}\le 0$, arrays diff and llamb are not referenced and both diff and llamb may be NULL.
9:     diff[ldiff] – double, Output
On exit: if diff is not NULL then ${\mathbf{diff}}\left[\mathit{i}-1\right]$ holds $\left|y-\mu \right|-{\sigma }^{2}z$ for observation ${\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]$, for $\mathit{i}=1,2,\dots ,\mathrm{min}\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
10:   llamb[ldiff] – double, Output
On exit: if llamb is not NULL then ${\mathbf{llamb}}\left[\mathit{i}-1\right]$ holds $\mathrm{log}\left({\lambda }^{2}\right)$ for observation ${\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]$, for $\mathit{i}=1,2,\dots ,\mathrm{min}\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
11:   fail – NagError *, Input/Output
The NAG error argument (see Section 3.6 in the Essential Introduction).
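
To illustrate the convention described for iout (1-based indices into y, ordered by decreasing absolute difference from the mean), the following sketch builds such an ordering. It is not the library's implementation: the helper name `rank_by_abs_deviation` is hypothetical, plain `int` is used in place of the NAG Integer type, and a simple insertion sort is chosen for brevity.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: fill order[0..n-1] with 1-based indices into y so
   that |y[order[k]-1] - mean| is non-increasing in k. Ties keep their
   original relative order. */
void rank_by_abs_deviation(size_t n, const double y[], double mean,
                           int order[])
{
    assert(y != NULL && order != NULL);

    for (size_t i = 0; i < n; ++i) {
        double dev = fabs(y[i] - mean);
        size_t k = i;
        /* insertion sort: shift entries with a smaller deviation right */
        while (k > 0 && fabs(y[order[k - 1] - 1] - mean) < dev) {
            order[k] = order[k - 1];
            --k;
        }
        order[k] = (int)i + 1;   /* store a 1-based index */
    }
}
```

With an ordering of this form, the observation with the k-th largest deviation is `y[order[k-1]-1]`, matching the indexing used in the descriptions of iout, diff and llamb.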

## 6  Error Indicators and Warnings

NE_BAD_PARAM
On entry, argument $⟨\mathit{\text{value}}⟩$ had an illegal value.
NE_INT
On entry, ${\mathbf{n}}=⟨\mathit{\text{value}}⟩$.
Constraint: ${\mathbf{n}}\ge 3$.
NE_INT_2
On entry, ${\mathbf{p}}=⟨\mathit{\text{value}}⟩$ and ${\mathbf{n}}=⟨\mathit{\text{value}}⟩$.
Constraint: $1\le {\mathbf{p}}\le {\mathbf{n}}-2$.
NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.

## 7  Accuracy

Not applicable.

## 8  Parallelism and Performance

Not applicable.

## 9  Further Comments

One problem with Peirce's algorithm as implemented in nag_outlier_peirce (g07gac) is the assumed relationship between ${\sigma }^{2}$, the variance using the full dataset, and ${\stackrel{~}{\sigma }}^{2}$, the variance with the potential outliers removed. In some cases, for example if the data $y$ were the residuals from a linear regression, this assumption may not hold, because the regression line may change significantly once outlying values have been dropped, resulting in a radically different set of residuals. In such cases nag_outlier_peirce_two_var (g07gbc) should be used instead.

## 10  Example

This example reads in a series of data and flags any potential outliers.
The dataset used is from Peirce's original paper and consists of fifteen observations on the vertical semidiameter of Venus.

### 10.1  Program Text

Program Text (g07gace.c)

### 10.2  Program Data

Program Data (g07gace.d)

### 10.3  Program Results

Program Results (g07gace.r)