Integer type:  int32  int64  nag_int  show int32  show int32  show int64  show int64  show nag_int  show nag_int

Chapter Contents
Chapter Introduction
NAG Toolbox

# NAG Toolbox: nag_univar_outlier_peirce_1var (g07ga)

## Purpose

nag_univar_outlier_peirce_1var (g07ga) identifies outlying values using Peirce's criterion.

## Syntax

[iout, niout, dif, llamb, ifail] = g07ga(p, y, ldiff, 'n', n, 'mean_p', mean_p, 'var', var)
[iout, niout, dif, llamb, ifail] = nag_univar_outlier_peirce_1var(p, y, ldiff, 'n', n, 'mean_p', mean_p, 'var', var)

## Description

nag_univar_outlier_peirce_1var (g07ga) flags outlying values in data using Peirce's criterion. Let
• $y$ denote a vector of $n$ observations (for example the residuals) obtained from a model with $p$ parameters,
• $m$ denote the number of potential outlying values,
• $\mu$ and ${\sigma }^{2}$ denote the mean and variance of $y$ respectively,
• $\stackrel{~}{y}$ denote a vector of length $n-m$ constructed by dropping the $m$ values from $y$ with the largest value of $\left|{y}_{i}-\mu \right|$,
• ${\stackrel{~}{\sigma }}^{2}$ denote the (unknown) variance of $\stackrel{~}{y}$,
• $\lambda$ denote the ratio of $\stackrel{~}{\sigma }$ and $\sigma$ with $\lambda =\frac{\stackrel{~}{\sigma }}{\sigma }$.
Peirce's method flags ${y}_{i}$ as a potential outlier if $\left|{y}_{i}-\mu \right|\ge x$, where $x={\sigma }^{2}z$ and $z$ is obtained from the solution of
 $Rm = λ m-n mm n-m n-m nn$ (1)
where
 $R = 2 exp z2 - 1 2 1- Φz$ (2)
and $\Phi$ is the cumulative distribution function for the standard Normal distribution.
As ${\stackrel{~}{\sigma }}^{2}$ is unknown an assumption is made that the relationship between ${\stackrel{~}{\sigma }}^{2}$ and ${\sigma }^{2}$, hence $\lambda$, depends only on the sum of squares of the rejected observations and the ratio estimated as
 $λ2 = n-p-m z2 n-p-m$
which gives
 $z2 = 1+ n-p-m m 1-λ2$ (3)
A value for the cutoff $x$ is calculated iteratively. An initial value of $R=0.2$ is used and a value of $\lambda$ is estimated using equation (1). Equation (3) is then used to obtain an estimate of $z$ and then equation (2) is used to get a new estimate for $R$. This process is then repeated until the relative change in $z$ between consecutive iterations is $\text{}\le \sqrt{\epsilon }$, where $\epsilon$ is machine precision.
By construction, the cutoff for testing for $m+1$ potential outliers is less than the cutoff for testing for $m$ potential outliers. Therefore Peirce's criterion is used in sequence with the existence of a single potential outlier being investigated first. If one is found, the existence of two potential outliers is investigated etc.
If one of a duplicate series of observations is flagged as an outlier, then all of them are flagged as outliers.

## References

Gould B A (1855) On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application The Astronomical Journal 45
Peirce B (1852) Criterion for the rejection of doubtful observations The Astronomical Journal 45

## Parameters

### Compulsory Input Parameters

1:     $\mathrm{p}$int64int32nag_int scalar
$p$, the number of parameters in the model used in obtaining the $y$. If $y$ is an observed set of values, as opposed to the residuals from fitting a model with $p$ parameters, then $p$ should be set to $1$, i.e., as if a model just containing the mean had been used.
Constraint: $1\le {\mathbf{p}}\le {\mathbf{n}}-2$.
2:     $\mathrm{y}\left({\mathbf{n}}\right)$ – double array
$y$, the data being tested.
3:     $\mathrm{ldiff}$int64int32nag_int scalar
The maximum number of values to be returned in arrays dif and llamb.
If ${\mathbf{ldiff}}\le 0$, arrays dif and llamb are not referenced.

### Optional Input Parameters

1:     $\mathrm{n}$int64int32nag_int scalar
Default: the dimension of the array y.
$n$, the number of observations.
Constraint: ${\mathbf{n}}\ge 3$.
2:     $\mathrm{mean_p}$ – double scalar
Default: $0.0$
If ${\mathbf{var}}>0.0$, mean_p must contain $\mu$, the mean of $y$, otherwise mean_p is not referenced and the mean is calculated from the data supplied in y.
3:     $\mathrm{var}$ – double scalar
Default: $0.0$
If ${\mathbf{var}}>0.0$, var must contain ${\sigma }^{2}$, the variance of $y$, otherwise the variance is calculated from the data supplied in y.

### Output Parameters

1:     $\mathrm{iout}\left({\mathbf{n}}\right)$int64int32nag_int array
The indices of the values in y sorted in descending order of the absolute difference from the mean, therefore $\left|{\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}-1\right)\right)-\mu \right|\ge \left|{\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}\right)\right)-\mu \right|$, for $\mathit{i}=2,3,\dots ,{\mathbf{n}}$.
2:     $\mathrm{niout}$int64int32nag_int scalar
The number of potential outliers. The indices for these potential outliers are held in the first niout elements of iout. By construction there can be at most ${\mathbf{n}}-{\mathbf{p}}-1$ values flagged as outliers.
3:     $\mathrm{dif}\left({\mathbf{ldiff}}\right)$ – double array
${\mathbf{dif}}\left(\mathit{i}\right)$ holds $\left|y-\mu \right|-{\sigma }^{2}z$ for observation ${\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}\right)\right)$, for $\mathit{i}=1,2,\dots ,\mathrm{min}\phantom{\rule{0.125em}{0ex}}\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
4:     $\mathrm{llamb}\left({\mathbf{ldiff}}\right)$ – double array
${\mathbf{llamb}}\left(\mathit{i}\right)$ holds $\mathrm{log}\left({\lambda }^{2}\right)$ for observation ${\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}\right)\right)$, for $\mathit{i}=1,2,\dots ,\mathrm{min}\phantom{\rule{0.125em}{0ex}}\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
5:     $\mathrm{ifail}$int64int32nag_int scalar
${\mathbf{ifail}}={\mathbf{0}}$ unless the function detects an error (see Error Indicators and Warnings).

## Error Indicators and Warnings

Errors or warnings detected by the function:
${\mathbf{ifail}}=1$
Constraint: ${\mathbf{n}}\ge 3$.
${\mathbf{ifail}}=2$
Constraint: $1\le {\mathbf{p}}\le {\mathbf{n}}-2$.
${\mathbf{ifail}}=-99$
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.

## Accuracy

Not applicable.

One problem with Peirce's algorithm as implemented in nag_univar_outlier_peirce_1var (g07ga) is the assumed relationship between ${\sigma }^{2}$, the variance using the full dataset, and ${\stackrel{~}{\sigma }}^{2}$, the variance with the potential outliers removed. In some cases, for example if the data $y$ were the residuals from a linear regression, this assumption may not hold as the regression line may change significantly when outlying values have been dropped resulting in a radically different set of residuals. In such cases nag_univar_outlier_peirce_2var (g07gb) should be used instead.

## Example

This example reads in a series of data and flags any potential outliers.
The dataset used is from Peirce's original paper and consists of fifteen observations on the vertical semidiameter of Venus.
```function g07ga_example

fprintf('g07ga example results\n\n');

y = [-0.30;  0.48;  0.63; -0.22; 0.18;
-0.44; -0.24; -0.13; -0.05; 0.39;
1.01;  0.06; -1.40;  0.20; 0.10];

p     = int64(2);
ldiff = int64(1);

% Get a list of potential outliers
[iout, niout, dif, llamb, ifail] = ...
g07ga(p, y, ldiff);

% Display results
fprintf('Number of potential outliers: %2d\n',niout);
fprintf('  No.  Index    Value');
if ldiff > 0
fprintf('       Diff    ln(lambda^2)');
end
fprintf('\n');

for i=1:niout
fprintf(' %4d %4d %10.2f', i, iout(i), y(iout(i)));
if i <= ldiff
fprintf(' %10.2f %10.2f', dif(i), llamb(i));
end
fprintf('\n');
end

```
```g07ga example results

Number of potential outliers:  2
No.  Index    Value       Diff    ln(lambda^2)
1   13      -1.40       0.31      -0.30
2   11       1.01
```