

# NAG Toolbox: nag_nonpar_test_ks_1sample (g08cb)

## Purpose

nag_nonpar_test_ks_1sample (g08cb) performs the one sample Kolmogorov–Smirnov test, using one of the standard distributions provided.

## Syntax

```
[par, d, z, p, sx, ifail] = g08cb(x, dist, par, estima, ntype, 'n', n)
[par, d, z, p, sx, ifail] = nag_nonpar_test_ks_1sample(x, dist, par, estima, ntype, 'n', n)
```

## Description

The data consist of a single sample of $n$ observations denoted by $x_1, x_2, \dots, x_n$. Let $S_n(x_{(i)})$ and $F_0(x_{(i)})$ represent the sample cumulative distribution function and the theoretical (null) cumulative distribution function respectively at the point $x_{(i)}$, where $x_{(i)}$ is the $i$th smallest sample observation.

The Kolmogorov–Smirnov test provides a test of the null hypothesis $H_0$: the data are a random sample of observations from a theoretical distribution specified by you, against one of the following alternative hypotheses:

 (i) $H_1$: the data cannot be considered to be a random sample from the specified null distribution.

 (ii) $H_2$: the data arise from a distribution which dominates the specified null distribution. In practical terms, this would be demonstrated if the values of the sample cumulative distribution function $S_n(x)$ tended to exceed the corresponding values of the theoretical cumulative distribution function $F_0(x)$.

 (iii) $H_3$: the data arise from a distribution which is dominated by the specified null distribution. In practical terms, this would be demonstrated if the values of the theoretical cumulative distribution function $F_0(x)$ tended to exceed the corresponding values of the sample cumulative distribution function $S_n(x)$.

One of the following test statistics is computed depending on the particular alternative hypothesis specified (see the description of the parameter ntype in Section [Parameters]).
For the alternative hypothesis $H_1$:
• $D_n$ – the largest absolute deviation between the sample cumulative distribution function and the theoretical cumulative distribution function. Formally, $D_n = \max\{D_n^+, D_n^-\}$.

For the alternative hypothesis $H_2$:
• $D_n^+$ – the largest positive deviation between the sample cumulative distribution function and the theoretical cumulative distribution function. Formally, $D_n^+ = \max\{S_n(x_{(i)}) - F_0(x_{(i)}), 0\}$ for both discrete and continuous null distributions.

For the alternative hypothesis $H_3$:
• $D_n^-$ – the largest positive deviation between the theoretical cumulative distribution function and the sample cumulative distribution function. Formally, if the null distribution is discrete then $D_n^- = \max\{F_0(x_{(i)}) - S_n(x_{(i)}), 0\}$, and if the null distribution is continuous then $D_n^- = \max\{F_0(x_{(i)}) - S_n(x_{(i-1)}), 0\}$.
The standardized statistic $Z = D\sqrt{n}$ is also computed, where $D$ may be $D_n$, $D_n^+$ or $D_n^-$ depending on the choice of the alternative hypothesis. This is the standardized value of $D$ with no correction for continuity applied, and the distribution of $Z$ converges asymptotically to a limiting distribution, first derived by Kolmogorov (1933) and then tabulated by Smirnov (1948). The asymptotic distributions for the one-sided statistics were obtained by Smirnov (1933).
The probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as that observed, is computed. If $n \le 100$, an exact method given by Conover (1980) is used. Note that the method used is only exact for continuous theoretical distributions and does not include Conover's modification for discrete distributions. This method computes the one-sided probabilities; the two-sided probabilities are estimated by doubling the one-sided probability. This is a good estimate for small $p$, that is $p \le 0.10$, but it becomes very poor for larger $p$. If $n > 100$ then $p$ is computed using the Kolmogorov–Smirnov limiting distributions; see Feller (1948), Kendall and Stuart (1973), Kolmogorov (1933), Smirnov (1933) and Smirnov (1948).
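For illustration only, the statistics defined above are easy to compute directly. The following Python sketch (not part of the NAG interface; the function name and `cdf` argument are hypothetical) evaluates $D_n^+$, $D_n^-$, $D_n$ and $Z$ for a continuous null distribution, using the data and $U(0,2)$ null distribution of the Example section:

```python
import math

def ks_statistics(x, cdf):
    # D_n+, D_n- and D_n for a sample against a continuous null CDF F_0.
    xs = sorted(x)
    n = len(xs)
    d_plus = d_minus = 0.0
    for i, xi in enumerate(xs, start=1):
        f = cdf(xi)
        d_plus = max(d_plus, i / n - f)          # S_n(x_(i)) - F_0(x_(i))
        d_minus = max(d_minus, f - (i - 1) / n)  # F_0(x_(i)) - S_n(x_(i-1))
    d = max(d_plus, d_minus)
    return d_plus, d_minus, d, d * math.sqrt(n)  # last value is Z

# The data from the Example section, tested against U(0,2) (CDF t/2 on [0,2]):
sample = [0.01, 0.3, 0.2, 0.9, 1.2, 0.09, 1.3, 0.18, 0.9, 0.48,
          1.98, 0.03, 0.5, 0.07, 0.7, 0.6, 0.95, 1, 0.31, 1.45,
          1.04, 1.25, 0.15, 0.75, 0.85, 0.22, 1.56, 0.81, 0.57, 0.55]
dp, dm, d, z = ks_statistics(sample, lambda t: t / 2.0)
# d = 0.28 and z ≈ 1.5336, matching the Example section's output.
```

This reproduces the $D_n$ and $Z$ values of the Example, but it is only a sketch: the routine itself also handles discrete null distributions and parameter estimation.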

## References

Conover W J (1980) Practical Nonparametric Statistics Wiley
Feller W (1948) On the Kolmogorov–Smirnov limit theorems for empirical distributions Ann. Math. Statist. 19 179–181
Kendall M G and Stuart A (1973) The Advanced Theory of Statistics (Volume 2) (3rd Edition) Griffin
Kolmogorov A N (1933) Sulla determinazione empirica di una legge di distribuzione Giornale dell' Istituto Italiano degli Attuari 4 83–91
Siegel S (1956) Non-parametric Statistics for the Behavioral Sciences McGraw–Hill
Smirnov N (1933) Estimate of deviation between empirical distribution functions in two independent samples Bull. Moscow Univ. 2(2) 3–16
Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions Ann. Math. Statist. 19 279–281

## Parameters

### Compulsory Input Parameters

1:     x(n) – double array
n, the dimension of the array, must satisfy the constraint $n \ge 3$.
The sample observations $x_1, x_2, \dots, x_n$.
Constraint: the sample observations supplied must be consistent, in the usual manner, with the null distribution chosen, as specified by the parameters dist and par. For further details see Section [Further Comments].
2:     dist – string
The theoretical (null) distribution from which it is suspected the data may arise.
The distributions available are:
• the uniform distribution over $(a,b)$, $U(a,b)$;
• the Normal distribution with mean $\mu$ and variance $\sigma^2$, $N(\mu,\sigma^2)$;
• the gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$, where the mean $= \alpha\beta$;
• the beta distribution with shape parameters $\alpha$ and $\beta$, where the mean $= \alpha/(\alpha+\beta)$;
• the binomial distribution with the number of trials, $m$, and the probability of a success, $p$;
• the exponential distribution with parameter $\lambda$, where the mean $= 1/\lambda$;
• the Poisson distribution with parameter $\mu$, where the mean $= \mu$.
Any number of characters may be supplied as the actual parameter; however, only the characters (at most two) required to uniquely identify the distribution are referenced.
Constraint: dist must uniquely identify one of the distributions listed above.
3:     par(2) – double array
If estima = 'S', par must contain the known values of the parameter(s) of the null distribution as follows:
• if a uniform distribution is used, par(1) and par(2) must contain the boundaries $a$ and $b$ respectively;
• if a Normal distribution is used, par(1) and par(2) must contain the mean, $\mu$, and the variance, $\sigma^2$, respectively;
• if a gamma distribution is used, par(1) and par(2) must contain the parameters $\alpha$ and $\beta$ respectively;
• if a beta distribution is used, par(1) and par(2) must contain the parameters $\alpha$ and $\beta$ respectively;
• if a binomial distribution is used, par(1) and par(2) must contain the parameters $m$ and $p$ respectively;
• if an exponential distribution is used, par(1) must contain the parameter $\lambda$;
• if a Poisson distribution is used, par(1) must contain the parameter $\mu$.
If estima = 'E', par need not be set except when the null distribution requested is the binomial distribution, in which case par(1) must contain the parameter $m$.
Constraints:
• for the uniform distribution, par(1) < par(2);
• for the Normal distribution, par(2) > 0.0;
• for the gamma distribution, par(1) > 0.0 and par(2) > 0.0;
• for the beta distribution, par(1) > 0.0, par(2) > 0.0, par(1) $\le 10^6$ and par(2) $\le 10^6$;
• for the binomial distribution, par(1) ≥ 1.0, 0.0 < par(2) < 1.0, par(1) × par(2) × (1.0 − par(2)) $\le 10^6$ and par(1) < 1/eps, where eps = machine precision; see nag_machine_precision (x02aj);
• for the exponential distribution, par(1) > 0.0;
• for the Poisson distribution, par(1) > 0.0 and par(1) $\le 10^6$.
4:     estima – string (length ≥ 1)
estima must specify whether the values of the parameters of the null distribution are known or are to be estimated from the data.
estima = 'S'
Values of the parameters will be supplied in the array par described above.
estima = 'E'
Parameters are to be estimated from the data, except when the null distribution requested is the binomial distribution, in which case the first parameter, $m$, must be supplied in par(1) and only the second parameter, $p$, is estimated from the data.
Constraint: estima = 'S' or 'E'.
5:     ntype – int64/int32/nag_int scalar
The test statistic to be calculated, i.e., the choice of alternative hypothesis.
ntype = 1
Computes $D_n$, to test $H_0$ against $H_1$.
ntype = 2
Computes $D_n^+$, to test $H_0$ against $H_2$.
ntype = 3
Computes $D_n^-$, to test $H_0$ against $H_3$.
Constraint: ntype = 1, 2 or 3.

### Optional Input Parameters

1:     n – int64/int32/nag_int scalar
Default: the dimension of the array x.
$n$, the number of observations in the sample.
Constraint: $n \ge 3$.

### Input Parameters Omitted from the MATLAB Interface

None.

### Output Parameters

1:     par(2) – double array
If estima = 'S', par is unchanged. If estima = 'E', par(1) and par(2) are set to the values estimated from the data.
2:     d – double scalar
The Kolmogorov–Smirnov test statistic ($D_n$, $D_n^+$ or $D_n^-$ according to the value of ntype).
3:     z – double scalar
A standardized value, $Z$, of the test statistic, $D$, without any correction for continuity.
4:     p – double scalar
The probability, $p$, associated with the observed value of $D$, where $D$ may be $D_n$, $D_n^+$ or $D_n^-$ depending on the value of ntype (see Section [Description]).
5:     sx(n) – double array
The sample observations, $x_1, x_2, \dots, x_n$, sorted in ascending order.
6:     ifail – int64/int32/nag_int scalar
ifail = 0 unless the function detects an error (see [Error Indicators and Warnings]).

## Error Indicators and Warnings

Errors or warnings detected by the function:

Cases prefixed with W are classified as warnings and do not generate an error of type NAG:error_n. See nag_issue_warnings.

ifail = 1
On entry, n < 3.
ifail = 2
On entry, an invalid code for dist has been specified.
ifail = 3
On entry, ntype ≠ 1, 2 or 3.
ifail = 4
On entry, estima ≠ 'S' or 'E'.
ifail = 5
On entry, the parameters supplied for the specified null distribution are out of range (see Section [Parameters]). Apart from a check on the first parameter for the binomial distribution, this error will only occur if estima = 'S'.
ifail = 6
The data supplied in x could not arise from the chosen null distribution, as specified by the parameters dist and par. For further details see Section [Further Comments].
W ifail = 7
The whole sample is constant, i.e., the variance is zero. This error may only occur if estima = 'E'.
ifail = 8
The variance of the binomial distribution is too large, that is, $mp(1-p) > 10^6$.
ifail = 9
When the gamma distribution is used, the computation of the incomplete gamma function by nag_specfun_gamma_incomplete (s14ba) fails to converge: the Taylor series or Legendre continued fraction fails within 600 iterations. This is an unlikely error exit.

## Accuracy

The approximation for $p$, given when $n > 100$, has a relative error of at most 2.5% for most cases. The two-sided probability is approximated by doubling the one-sided probability. This is a good approximation only for small $p$ (i.e., $p < 0.10$) and very poor for large $p$. The error is always on the conservative side, that is, the tail probability, $p$, is overestimated.
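The limiting distributions referred to above have simple series forms, which makes the doubling behaviour easy to see. The following Python sketch (an illustration of the standard formulas, not the routine's actual implementation) evaluates the two-sided Kolmogorov limiting tail probability and the one-sided Smirnov form $e^{-2z^2}$:

```python
import math

def kolmogorov_sf(z, terms=100):
    # Two-sided limiting tail probability P(Z > z) (Kolmogorov, 1933):
    # 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 z^2)
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                     for k in range(1, terms + 1))

def smirnov_sf(z):
    # One-sided limiting tail probability P(Z > z) (Smirnov, 1933).
    return math.exp(-2.0 * z * z)

# In the tails the two-sided probability is close to twice the one-sided one,
# but the doubling estimate degrades badly for small z (i.e., large p):
p_tail = kolmogorov_sf(2.0)   # very close to 2 * smirnov_sf(2.0)
p_mid = kolmogorov_sf(0.5)    # below 1, whereas 2 * smirnov_sf(0.5) exceeds 1
```

Evaluating at the familiar 5% point, `kolmogorov_sf(1.36)` is roughly 0.049, consistent with the tabulated two-sided critical value near $z = 1.36$.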

## Further Comments

The time taken by nag_nonpar_test_ks_1sample (g08cb) increases with $n$ until $n > 100$, at which point it drops and then increases slowly with $n$. The time may also depend on the choice of null distribution and on whether or not the parameters are to be estimated.
The data supplied in the parameter x must be consistent with the chosen null distribution as follows:
• for the uniform distribution, par(1) $\le x_i \le$ par(2), for $i = 1, 2, \dots, n$;
• for the Normal distribution, there are no constraints on the $x_i$;
• for the gamma distribution, $x_i \ge 0.0$, for $i = 1, 2, \dots, n$;
• for the beta distribution, $0.0 \le x_i \le 1.0$, for $i = 1, 2, \dots, n$;
• for the binomial distribution, $0.0 \le x_i \le$ par(1), for $i = 1, 2, \dots, n$;
• for the exponential distribution, $x_i \ge 0.0$, for $i = 1, 2, \dots, n$;
• for the Poisson distribution, $x_i \ge 0.0$, for $i = 1, 2, \dots, n$.
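These rules can be pre-checked before calling the routine. The sketch below is a hypothetical Python helper that mirrors the bullet list above (the function name and argument conventions are illustrative only, not part of the NAG interface):

```python
def check_consistency(x, dist, par):
    # Sketch of the data/null-distribution consistency rules listed above.
    d = dist.lower()
    if d.startswith('be'):      # beta: 0 <= x_i <= 1
        return all(0.0 <= xi <= 1.0 for xi in x)
    if d.startswith('bi'):      # binomial: 0 <= x_i <= par(1) (= m)
        return all(0.0 <= xi <= par[0] for xi in x)
    if d.startswith('u'):       # uniform: par(1) <= x_i <= par(2)
        return all(par[0] <= xi <= par[1] for xi in x)
    if d.startswith('n'):       # Normal: no constraints on the x_i
        return True
    # gamma, exponential and Poisson all require x_i >= 0
    return all(xi >= 0.0 for xi in x)
```

Data failing such a check would cause the routine to exit with ifail = 6.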

## Example

```function nag_nonpar_test_ks_1sample_example
x = [0.01; 0.3;  0.2;  0.9;  1.2;  0.09; 1.3;  0.18; 0.9;  0.48;
     1.98; 0.03; 0.5;  0.07; 0.7;  0.6;  0.95; 1;    0.31; 1.45;
     1.04; 1.25; 0.15; 0.75; 0.85; 0.22; 1.56; 0.81; 0.57; 0.55];
dist = 'Uniform';
par = [0; 2];
estima = 'Supplied';
ntype = int64(1);
[parOut, d, z, p, sx, ifail] = nag_nonpar_test_ks_1sample(x, dist, par, estima, ntype)
```
```

parOut =

0
2

d =

0.2800

z =

1.5336

p =

0.0143

sx =

0.0100
0.0300
0.0700
0.0900
0.1500
0.1800
0.2000
0.2200
0.3000
0.3100
0.4800
0.5000
0.5500
0.5700
0.6000
0.7000
0.7500
0.8100
0.8500
0.9000
0.9000
0.9500
1.0000
1.0400
1.2000
1.2500
1.3000
1.4500
1.5600
1.9800

ifail =

0

```