Integer type:  int32  int64  nag_int  show int32  show int32  show int64  show int64  show nag_int  show nag_int

Chapter Contents
Chapter Introduction
NAG Toolbox

# NAG Toolbox: nag_univar_outlier_peirce_1var (g07ga)

## Purpose

nag_univar_outlier_peirce_1var (g07ga) identifies outlying values using Peirce's criterion.

## Syntax

[iout, niout, diff, llamb, ifail] = g07ga(p, y, ldiff, 'n', n, 'mean', mean, 'var', var)
[iout, niout, diff, llamb, ifail] = nag_univar_outlier_peirce_1var(p, y, ldiff, 'n', n, 'mean', mean, 'var', var)

## Description

nag_univar_outlier_peirce_1var (g07ga) flags outlying values in data using Peirce's criterion. Let
• y$y$ denote a vector of n$n$ observations (for example the residuals) obtained from a model with p$p$ parameters,
• m$m$ denote the number of potential outlying values,
• μ$\mu$ and σ2${\sigma }^{2}$ denote the mean and variance of y$y$ respectively,
• $\stackrel{~}{y}$ denote a vector of length nm$n-m$ constructed by dropping the m$m$ values from y$y$ with the largest value of |yiμ|$|{y}_{i}-\mu |$,
• σ̃2${\stackrel{~}{\sigma }}^{2}$ denote the (unknown) variance of $\stackrel{~}{y}$,
• λ$\lambda$ denote the ratio of σ̃$\stackrel{~}{\sigma }$ and σ$\sigma$ with λ = (σ̃)/σ $\lambda =\frac{\stackrel{~}{\sigma }}{\sigma }$.
Peirce's method flags yi${y}_{i}$ as a potential outlier if |yiμ|x$|{y}_{i}-\mu |\ge x$, where x = σ2z$x={\sigma }^{2}z$ and z$z$ is obtained from the solution of
 Rm = λm − n ( mm (n − m)n − m )/(nn) $Rm = λ m-n mm (n-m) n-m nn$ (1)
where
 R = 2 exp((( z2 − 1 )/2)(1 − Φ(z))) $R = 2 exp( ( z2 - 1 2 ) ( 1- Φ(z) ) )$ (2)
and Φ$\Phi$ is the cumulative distribution function for the standard Normal distribution.
As σ̃2${\stackrel{~}{\sigma }}^{2}$ is unknown an assumption is made that the relationship between σ̃2${\stackrel{~}{\sigma }}^{2}$ and σ2${\sigma }^{2}$, hence λ$\lambda$, depends only on the sum of squares of the rejected observations and the ratio estimated as
 λ2 = ( n − p − m z2 )/(n − p − m) $λ2 = n-p-m z2 n-p-m$
which gives
 z2 = 1 + (n − p − m)/m (1 − λ2) $z2 = 1+ n-p-m m ( 1-λ2 )$ (3)
A value for the cutoff x$x$ is calculated iteratively. An initial value of R = 0.2$R=0.2$ is used and a value of λ$\lambda$ is estimated using equation (1). Equation (3) is then used to obtain an estimate of z$z$ and then equation (2) is used to get a new estimate for R$R$. This process is then repeated until the relative change in z$z$ between consecutive iterations is sqrt(ε)$\text{}\le \sqrt{\epsilon }$, where ε$\epsilon$ is machine precision.
By construction, the cutoff for testing for m + 1$m+1$ potential outliers is less than the cutoff for testing for m$m$ potential outliers. Therefore Peirce's criterion is used in sequence with the existence of a single potential outlier being investigated first. If one is found, the existence of two potential outliers is investigated etc.
If one of a duplicate series of observations is flagged as an outlier, then all of them are flagged as outliers.

## References

Gould B A (1855) On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application The Astronomical Journal 45
Peirce B (1852) Criterion for the rejection of doubtful observations The Astronomical Journal 45

## Parameters

### Compulsory Input Parameters

1:     p – int64int32nag_int scalar
p$p$, the number of parameters in the model used in obtaining the y$y$. If y$y$ is an observed set of values, as opposed to the residuals from fitting a model with p$p$ parameters, then p$p$ should be set to 1$1$, i.e., as if a model just containing the mean had been used.
Constraint: 1pn2$1\le {\mathbf{p}}\le {\mathbf{n}}-2$.
2:     y(n) – double array
n, the dimension of the array, must satisfy the constraint n3${\mathbf{n}}\ge 3$.
y$y$, the data being tested.
3:     ldiff – int64int32nag_int scalar
The maximum number of values to be returned in arrays diff and llamb.
If ldiff0${\mathbf{ldiff}}\le 0$, arrays diff and llamb are not referenced.

### Optional Input Parameters

1:     n – int64int32nag_int scalar
Default: The dimension of the array y.
n$n$, the number of observations.
Constraint: n3${\mathbf{n}}\ge 3$.
2:     mean – double scalar
If var > 0.0${\mathbf{var}}>0.0$, mean must contain μ$\mu$, the mean of y$y$, otherwise mean is not referenced and the mean is calculated from the data supplied in y.
Default: 0.0$0.0$
3:     var – double scalar
If var > 0.0${\mathbf{var}}>0.0$, var must contain σ2${\sigma }^{2}$, the variance of y$y$, otherwise the variance is calculated from the data supplied in y.
Default: 0.0$0.0$

None.

### Output Parameters

1:     iout(n) – int64int32nag_int array
The indices of the values in y sorted in descending order of the absolute difference from the mean, therefore |y(iout(i1))μ| |y(iout(i))μ| $|{\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}-1\right)\right)-\mu |\ge |{\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}\right)\right)-\mu |$, for i = 2,3,,n$\mathit{i}=2,3,\dots ,{\mathbf{n}}$.
2:     niout – int64int32nag_int scalar
The number of potential outliers. The indices for these potential outliers are held in the first niout elements of iout. By construction there can be at most np1${\mathbf{n}}-{\mathbf{p}}-1$ values flagged as outliers.
3:     diff(ldiff ) – double array
diff(i)${\mathbf{diff}}\left(\mathit{i}\right)$ holds |yμ|σ2z$|y-\mu |-{\sigma }^{2}z$ for observation y(iout(i))${\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}\right)\right)$, for i = 1,2,,min (ldiff,niout + 1,np1)$\mathit{i}=1,2,\dots ,\mathrm{min}\phantom{\rule{0.125em}{0ex}}\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
4:     llamb(ldiff) – double array
llamb(i)${\mathbf{llamb}}\left(\mathit{i}\right)$ holds log(λ2)$\mathrm{log}\left({\lambda }^{2}\right)$ for observation y(iout(i))${\mathbf{y}}\left({\mathbf{iout}}\left(\mathit{i}\right)\right)$, for i = 1,2,,min (ldiff,niout + 1,np1)$\mathit{i}=1,2,\dots ,\mathrm{min}\phantom{\rule{0.125em}{0ex}}\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
5:     ifail – int64int32nag_int scalar
${\mathrm{ifail}}={\mathbf{0}}$ unless the function detects an error (see [Error Indicators and Warnings]).

## Error Indicators and Warnings

Errors or warnings detected by the function:
ifail = 1${\mathbf{ifail}}=1$
Constraint: n3${\mathbf{n}}\ge 3$.
ifail = 2${\mathbf{ifail}}=2$
Constraint: 1pn2$1\le {\mathbf{p}}\le {\mathbf{n}}-2$.

## Accuracy

Not applicable.

One problem with Peirce's algorithm as implemented in nag_univar_outlier_peirce_1var (g07ga) is the assumed relationship between σ2${\sigma }^{2}$, the variance using the full dataset, and σ̃2${\stackrel{~}{\sigma }}^{2}$, the variance with the potential outliers removed. In some cases, for example if the data y$y$ were the residuals from a linear regression, this assumption may not hold as the regression line may change significantly when outlying values have been dropped resulting in a radically different set of residuals. In such cases nag_univar_outlier_peirce_2var (g07gb) should be used instead.

## Example

function nag_univar_outlier_peirce_1var_example
y = [-0.30; 0.48; 0.63; -0.22; 0.18; -0.44; -0.24; -0.13; -0.05; 0.39; ...
1.01; 0.06; -1.40; 0.20; 0.10];
p = int64(2);
ldiff = int64(1);

% Get a list of potential outliers
[iout, niout, dif, llamb, ifail] = nag_univar_outlier_peirce_1var(p, y, ldiff);

% Display results
if ifail == 0
if ldiff > 0
fprintf('\n  No.  Index    Value       Diff    ln(lambda^2)\n');
else
fprintf('\n  No.  Index    Value\n');
end
for i=1:double(niout)
if i > ldiff
fprintf(' %4d %4d %10.2f\n', i, iout(i), y(iout(i)));
else
fprintf(' %4d %4d %10.2f %10.2f %10.2f\n', ...
i, iout(i), y(iout(i)), dif(i), llamb(i));
end
end
end

No.  Index    Value       Diff    ln(lambda^2)
1   13      -1.40       0.31      -0.30
2   11       1.01

function g07ga_example
y = [-0.30; 0.48; 0.63; -0.22; 0.18; -0.44; -0.24; -0.13; -0.05; 0.39; ...
1.01; 0.06; -1.40; 0.20; 0.10];
p = int64(2);
ldiff = int64(1);

% Get a list of potential outliers
[iout, niout, dif, llamb, ifail] = g07ga(p, y, ldiff);

% Display results
if ifail == 0
if ldiff > 0
fprintf('\n  No.  Index    Value       Diff    ln(lambda^2)\n');
else
fprintf('\n  No.  Index    Value\n');
end
for i=1:double(niout)
if i > ldiff
fprintf(' %4d %4d %10.2f\n', i, iout(i), y(iout(i)));
else
fprintf(' %4d %4d %10.2f %10.2f %10.2f\n', ...
i, iout(i), y(iout(i)), dif(i), llamb(i));
end
end
end

No.  Index    Value       Diff    ln(lambda^2)
1   13      -1.40       0.31      -0.30
2   11       1.01