NAG Library Routine Document

Integer, Intent (In)	::	n, ntype
Integer, Intent (Inout)	::	ifail
Real (Kind=nag_wp), Intent (In)	::	x(n)
Real (Kind=nag_wp), Intent (Inout)	::	par(2)
Real (Kind=nag_wp), Intent (Out)	::	d, z, p, sx(n)
Character (*), Intent (In)	::	dist
Character (1), Intent (In)	::	estima

C Header Interface

#include <nagmk26.h>

void	g08cbf_ (const Integer *n, const double x[], const char *dist, double par[], const char *estima, const Integer *ntype, double *d, double *z, double *p, double sx[], Integer *ifail, const Charlen length_dist, const Charlen length_estima)

3 Description

The data consist of a single sample of $n$ observations denoted by $x_{1}, x_{2}, \dots, x_{n}$. Let $S_{n} (x_{(i)})$ and $F_{0} (x_{(i)})$ represent the sample cumulative distribution function and the theoretical (null) cumulative distribution function respectively at the point $x_{(i)}$, where $x_{(i)}$ is the $i$th smallest sample observation.

The Kolmogorov–Smirnov test provides a test of the null hypothesis $H_{0}$: the data are a random sample of observations from a theoretical distribution specified by you, against one of the following alternative hypotheses:

(i)	$H_{1}$ : the data cannot be considered to be a random sample from the specified null distribution.
(ii)	$H_{2}$ : the data arise from a distribution which dominates the specified null distribution. In practical terms, this would be demonstrated if the values of the sample cumulative distribution function $S_{n} (x)$ tended to exceed the corresponding values of the theoretical cumulative distribution function $F_{0} (x)$ .
(iii)	$H_{3}$ : the data arise from a distribution which is dominated by the specified null distribution. In practical terms, this would be demonstrated if the values of the theoretical cumulative distribution function $F_{0} (x)$ tended to exceed the corresponding values of the sample cumulative distribution function $S_{n} (x)$ .

One of the following test statistics is computed depending on the particular alternative hypothesis specified (see the description of the argument ntype in Section 5).

For the alternative hypothesis $H_{1}$: $D_{n}$ – the largest absolute deviation between the sample cumulative distribution function and the theoretical cumulative distribution function. Formally $D_{n} = \max \{D_{n}^{+}, D_{n}^{-}\}$.

For the alternative hypothesis $H_{2}$: $D_{n}^{+}$ – the largest positive deviation between the sample cumulative distribution function and the theoretical cumulative distribution function. Formally $D_{n}^{+} = \max \{S_{n} (x_{(i)}) - F_{0} (x_{(i)}), 0\}$ for both discrete and continuous null distributions.

For the alternative hypothesis $H_{3}$: $D_{n}^{-}$ – the largest positive deviation between the theoretical cumulative distribution function and the sample cumulative distribution function. Formally if the null distribution is discrete then $D_{n}^{-} = \max \{F_{0} (x_{(i)}) - S_{n} (x_{(i)}), 0\}$ and if the null distribution is continuous then $D_{n}^{-} = \max \{F_{0} (x_{(i)}) - S_{n} (x_{(i - 1)}), 0\}$.

The standardized statistic $Z = D \times \sqrt{n}$ is also computed, where $D$ may be $D_{n}$, $D_{n}^{+}$ or $D_{n}^{-}$ depending on the choice of the alternative hypothesis. This is the standardized value of $D$ with no correction for continuity applied, and the distribution of $Z$ converges asymptotically to a limiting distribution, first derived by Kolmogorov (1933), and then tabulated by Smirnov (1948). The asymptotic distributions for the one-sided statistics were obtained by Smirnov (1933).
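The definitions above can be sketched in a few lines of Python. This is an illustration only, not the routine's implementation: it assumes a continuous null distribution with no tied observations, so that $S_{n} (x_{(i)}) = i / n$ and $S_{n} (x_{(i - 1)}) = (i - 1) / n$. The names `ks_statistic` and `uniform_cdf` and the four-point sample are hypothetical.

```python
import math


def uniform_cdf(a, b):
    """Return the CDF of the uniform distribution over (a, b)."""
    def cdf(x):
        return min(1.0, max(0.0, (x - a) / (b - a)))
    return cdf


def ks_statistic(sample, cdf, ntype):
    """Return (D, Z) for a continuous null CDF.

    ntype = 1 gives D_n, ntype = 2 gives D_n^+, ntype = 3 gives D_n^-.
    Assumes no ties, so S_n(x_(i)) = i/n and S_n(x_(i-1)) = (i-1)/n.
    """
    xs = sorted(sample)
    n = len(xs)
    d_plus = 0.0   # largest value of S_n - F_0, floored at 0
    d_minus = 0.0  # largest value of F_0 - S_n, floored at 0
    for i, x in enumerate(xs, start=1):
        f0 = cdf(x)
        d_plus = max(d_plus, i / n - f0)
        d_minus = max(d_minus, f0 - (i - 1) / n)
    if ntype == 1:
        d = max(d_plus, d_minus)
    elif ntype == 2:
        d = d_plus
    elif ntype == 3:
        d = d_minus
    else:
        raise ValueError("ntype must be 1, 2 or 3")
    # Standardized statistic with no continuity correction.
    return d, d * math.sqrt(n)


# Made-up sample, tested against U(0, 1).
sample = [0.1, 0.4, 0.7, 0.9]
d, z = ks_statistic(sample, uniform_cdf(0.0, 1.0), 1)
print(d, z)
```

For this made-up sample the largest deviation occurs at the third ordered observation, where $F_{0}$ exceeds $S_{n} (x_{(2)})$ by about $0.2$, so $D_{n} \approx 0.2$ and $Z \approx 0.4$.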

The probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as that observed, is computed. If $n \leq 100$ an exact method given by Conover (1980) is used. Note that the method used is only exact for continuous theoretical distributions and does not include Conover's modification for discrete distributions. This method computes the one-sided probabilities. The two-sided probabilities are estimated by doubling the one-sided probability. This is a good estimate for small $p$, that is $p \leq 0.10$, but it becomes very poor for larger $p$. If $n > 100$ then $p$ is computed using the Kolmogorov–Smirnov limiting distributions; see Feller (1948), Kendall and Stuart (1973), Kolmogorov (1933), Smirnov (1933) and Smirnov (1948).
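As a rough illustration of the large-sample case, the classical limiting distributions give the tail probability $2 \sum_{k \geq 1} (-1)^{k - 1} e^{-2 k^{2} z^{2}}$ for the two-sided statistic and $e^{-2 z^{2}}$ for either one-sided statistic. The Python sketch below evaluates these textbook formulas; it is not taken from the routine, whose implementation may differ in detail. The function name `ks_limiting_p` and the cut-off at $z < 0.2$ (below which the two-sided probability is $1$ to within rounding) are choices made for this example.

```python
import math


def ks_limiting_p(z, ntype):
    """Asymptotic tail probability for the standardized statistic z.

    ntype = 1 uses the two-sided Kolmogorov limiting distribution;
    ntype = 2 or 3 uses the one-sided limit exp(-2 z^2).
    """
    if ntype not in (1, 2, 3):
        raise ValueError("ntype must be 1, 2 or 3")
    if z <= 0.0:
        return 1.0
    if ntype != 1:
        return math.exp(-2.0 * z * z)
    if z < 0.2:
        # The series converges slowly here and the result is 1 to within rounding.
        return 1.0
    total = 0.0
    for k in range(1, 1001):
        term = math.exp(-2.0 * k * k * z * z)
        total += term if k % 2 == 1 else -term
        if term < 1e-16:
            break
    return min(1.0, max(0.0, 2.0 * total))


print(ks_limiting_p(1.36, 1))  # close to the familiar 5% two-sided level
print(ks_limiting_p(1.0, 2))   # exp(-2)
```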

4 References

Conover W J (1980) Practical Nonparametric Statistics Wiley

Feller W (1948) On the Kolmogorov–Smirnov limit theorems for empirical distributions Ann. Math. Statist. 19 179–181

Kendall M G and Stuart A (1973) The Advanced Theory of Statistics (Volume 2) (3rd Edition) Griffin

Kolmogorov A N (1933) Sulla determinazione empirica di una legge di distribuzione Giornale dell' Istituto Italiano degli Attuari 4 83–91

Siegel S (1956) Non-parametric Statistics for the Behavioral Sciences McGraw–Hill

Smirnov N (1933) Estimate of deviation between empirical distribution functions in two independent samples Bull. Moscow Univ. 2(2) 3–16

Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions Ann. Math. Statist. 19 279–281

5 Arguments

1: $n$ – Integer Input

On entry: $n$, the number of observations in the sample.

Constraint: $n \geq 3$.

2: $x (n)$ – Real (Kind=nag_wp) array Input

On entry: the sample observations $x_{1}, x_{2}, \dots, x_{n}$.

Constraint: the sample observations supplied must be consistent, in the usual manner, with the null distribution chosen, as specified by the arguments dist and par. For further details see Section 9.

3: $dist$ – Character(*) Input

On entry: the theoretical (null) distribution from which it is suspected the data may arise.

$dist ='U'$: The uniform distribution over $(a, b)$ .
$dist ='N'$: The Normal distribution with mean $μ$ and variance $σ^{2}$ .
$dist ='G'$: The gamma distribution with shape parameter $α$ and scale parameter $β$ , where the mean $= α β$ .
$dist ='BE'$: The beta distribution with shape parameters $α$ and $β$ , where the mean $= α / (α + β)$ .
$dist ='BI'$: The binomial distribution with the number of trials, $m$ , and the probability of a success, $p$ .
$dist ='E'$: The exponential distribution with parameter $λ$ , where the mean $= 1 / λ$ .
$dist ='P'$: The Poisson distribution with parameter $μ$ , where the mean $= μ$ .
$dist ='NB'$: The negative binomial distribution with the number of trials, $m$ , and the probability of success, $p$ .
$dist ='GP'$: The generalized Pareto distribution with shape parameter $ξ$ and scale $β$ .

Any number of characters may be supplied as the actual parameter; however, only the characters (at most $2$) required to uniquely identify the distribution are referenced.

Constraint: $dist ='U'$, $'N'$, $'G'$, $'BE'$, $'BI'$, $'E'$, $'P'$, $'NB'$ or $'GP'$.

4: $par (2)$ – Real (Kind=nag_wp) array Input/Output

On entry: if $estima ='S'$, par must contain the known values of the parameter(s) of the null distribution as follows.

If a uniform distribution is used, $par (1)$ and $par (2)$ must contain the boundaries $a$ and $b$ respectively.
If a Normal distribution is used, $par (1)$ and $par (2)$ must contain the mean, $μ$, and the variance, $σ^{2}$, respectively.
If a gamma distribution is used, $par (1)$ and $par (2)$ must contain the parameters $α$ and $β$ respectively.
If a beta distribution is used, $par (1)$ and $par (2)$ must contain the parameters $α$ and $β$ respectively.
If a binomial distribution is used, $par (1)$ and $par (2)$ must contain the parameters $m$ and $p$ respectively.
If an exponential distribution is used, $par (1)$ must contain the parameter $λ$.
If a Poisson distribution is used, $par (1)$ must contain the parameter $μ$.
If a negative binomial distribution is used, $par (1)$ and $par (2)$ must contain the parameters $m$ and $p$ respectively.
If a generalized Pareto distribution is used, $par (1)$ and $par (2)$ must contain the parameters $ξ$ and $β$ respectively.

If $estima ='E'$, par need not be set except when the null distribution requested is either the binomial or the negative binomial distribution, in which case $par (1)$ must contain the parameter $m$.

On exit: if $estima ='S'$, par is unchanged; if $estima ='E'$ and $dist ='BI'$ or $dist ='NB'$, then $par (2)$ is estimated from the data; otherwise $par (1)$ and $par (2)$ are estimated from the data.

Constraints:

if $dist ='U'$ , $par (1) < par (2)$ ;
if $dist ='N'$ , $par (2) > 0.0$ ;
if $dist ='G'$ , $par (1) > 0.0$ and $par (2) > 0.0$ ;
if $dist ='BE'$ , $par (1) > 0.0$ and $par (2) > 0.0$ and $par (1) \leq 10^{6}$ and $par (2) \leq 10^{6}$ ;
if $dist ='BI'$ , $par (1) \geq 1.0$ and $0.0 < par (2) < 1.0$ and $par (1) \times par (2) \times (1.0 - par (2)) \leq 10^{6}$ and $par (1) < 1 / eps$ , where $eps = machine precision$ , see x02ajf;
if $dist ='E'$ , $par (1) > 0.0$ ;
if $dist ='P'$ , $par (1) > 0.0$ and $par (1) \leq 10^{6}$ ;
if $dist ='NB'$ , $par (1) \geq 1.0$ and $0.0 < par (2) < 1.0$ and $par (1) \times (1.0 - par (2)) / (par (2) \times par (2)) \leq 10^{6}$ and $par (1) < 1 / eps$ , where $eps = machine precision$ , see x02ajf;
if $dist ='GP'$ , $par (2) > 0$ .
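The constraints listed above can be checked before calling the routine. The Python sketch below is one way to express them for supplied parameters; the helper name `par_is_valid` is hypothetical, and `sys.float_info.epsilon` is used only as a stand-in for the machine precision returned by x02ajf, which may be defined differently. The sketch expects the full one- or two-character code rather than a longer string.

```python
import sys

# Stand-in for the machine precision from x02ajf (an assumption for this sketch).
EPS = sys.float_info.epsilon


def par_is_valid(dist, par):
    """Check supplied parameters (par[0], par[1]) against the listed constraints."""
    p1, p2 = par
    code = dist.upper()
    if code == "U":
        return p1 < p2
    if code == "N":
        return p2 > 0.0
    if code == "G":
        return p1 > 0.0 and p2 > 0.0
    if code == "BE":
        return 0.0 < p1 <= 1.0e6 and 0.0 < p2 <= 1.0e6
    if code == "BI":
        return (1.0 <= p1 < 1.0 / EPS and 0.0 < p2 < 1.0
                and p1 * p2 * (1.0 - p2) <= 1.0e6)
    if code == "E":
        return p1 > 0.0
    if code == "P":
        return 0.0 < p1 <= 1.0e6
    if code == "NB":
        return (1.0 <= p1 < 1.0 / EPS and 0.0 < p2 < 1.0
                and p1 * (1.0 - p2) / (p2 * p2) <= 1.0e6)
    if code == "GP":
        return p2 > 0.0
    raise ValueError("unrecognized distribution code: " + dist)


print(par_is_valid("U", (0.0, 2.0)))
print(par_is_valid("N", (0.0, -1.0)))
```

For the distributions with a single parameter ('E' and 'P'), the second element of `par` is ignored, mirroring the fact that only $par (1)$ is constrained for them.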

5: $estima$ – Character(1) Input

On entry: estima must specify whether values of the parameters of the null distribution are known or are to be estimated from the data.

$estima ='S'$: Values of the parameters will be supplied in the array par described above.
$estima ='E'$: Parameters are to be estimated from the data except when the null distribution requested is the binomial distribution or the negative binomial distribution in which case the first parameter, $m$ , must be supplied in $par (1)$ and only the second parameter, $p$ , is estimated from the data.

Constraint: $estima ='S'$ or $'E'$.

6: $ntype$ – Integer Input

On entry: the test statistic to be calculated, i.e., the choice of alternative hypothesis.

$ntype = 1$: Computes $D_{n}$ , to test $H_{0}$ against $H_{1}$ ,
$ntype = 2$: Computes $D_{n}^{+}$ , to test $H_{0}$ against $H_{2}$ ,
$ntype = 3$: Computes $D_{n}^{-}$ , to test $H_{0}$ against $H_{3}$ .

Constraint: $ntype = 1$, $2$ or $3$.

7: $d$ – Real (Kind=nag_wp) Output

On exit: the Kolmogorov–Smirnov test statistic ($D_{n}$, $D_{n}^{+}$ or $D_{n}^{-}$ according to the value of ntype).

8: $z$ – Real (Kind=nag_wp) Output

On exit: a standardized value, $Z$, of the test statistic, $D$, without any correction for continuity.

9: $p$ – Real (Kind=nag_wp) Output

On exit: the probability, $p$, associated with the observed value of $D$, where $D$ may be $D_{n}$, $D_{n}^{+}$ or $D_{n}^{-}$ depending on the value of ntype (see Section 3).

10: $sx (n)$ – Real (Kind=nag_wp) array Output

On exit: the sample observations, $x_{1}, x_{2}, \dots, x_{n}$, sorted in ascending order.

11: $ifail$ – Integer Input/Output

On entry: ifail must be set to $0$, $- 1$ or $1$. If you are unfamiliar with this argument you should refer to Section 3.4 in How to Use the NAG Library and its Documentation for details.

For environments where it might be inappropriate to halt program execution when an error is detected, the value $- 1$ or $1$ is recommended. If the output of error messages is undesirable, then the value $1$ is recommended. Otherwise, if you are not familiar with this argument, the recommended value is $0$. When the value $- 1$ or $1$ is used it is essential to test the value of ifail on exit.

On exit: $ifail = 0$ unless the routine detects an error or a warning has been flagged (see Section 6).

6 Error Indicators and Warnings

If on entry $ifail = 0$ or $- 1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).

Errors or warnings detected by the routine:

$ifail = 1$: On entry, $n = 〈value〉$ .
Constraint: $n \geq 3$ .

$ifail = 2$: On entry, $dist = 〈value〉$ was an illegal value.

$ifail = 3$: On entry, $ntype = 〈value〉$ .
Constraint: $ntype = 1$ , $2$ or $3$ .

$ifail = 4$: On entry, $estima = 〈value〉$ was an illegal value.

$ifail = 5$: On entry, $dist ='BI'$ and $m = par (1) = 〈value〉$ .
Note that $m$ must always be supplied.
Constraint: for the binomial distribution, $1 \leq par (1) < 1 / eps$ , where $eps = machine precision$ , see x02ajf.

On entry, $dist ='NB'$ and $m = par (1) = 〈value〉$ .
Note that $m$ must always be supplied.
Constraint: for the negative binomial distribution, $1 \leq par (1) < 1 / eps$ , where $eps = machine precision$ , see x02ajf.

On entry, $estima ='S'$ and $par (1) = 〈value〉$ ; $par (2) = 〈value〉$ .
Constraint: for the beta distribution, $0 < par (1) \leq 1000000$ and $0 < par (2) \leq 1000000$ .

On entry, $estima ='S'$ and $par (1) = 〈value〉$ ; $par (2) = 〈value〉$ .
Constraint: for the gamma distribution, $par (1)$ and $par (2) > 0$ .

On entry, $estima ='S'$ and $par (1) = 〈value〉$ ; $par (2) = 〈value〉$ .
Constraint: for the generalized Pareto distribution with $par (1) < 0$ , $0 \leq x (i) \leq - par (2) / par (1)$ , for $i = 1, 2, \dots, n$ .

On entry, $estima ='S'$ and $par (1) = 〈value〉$ ; $par (2) = 〈value〉$ .
Constraint: for the uniform distribution, $par (1) < par (2)$ .

On entry, $estima ='S'$ and $par (1) = 〈value〉$ .
Constraint: for the exponential distribution, $par (1) > 0$ .

On entry, $estima ='S'$ and $par (1) = 〈value〉$ .
Constraint: for the Poisson distribution, $0 < par (1) < 1000000$ .

On entry, $estima ='S'$ and $par (2) = 〈value〉$ .
Constraint: for the binomial distribution, $0 < par (2) < 1$ .

On entry, $estima ='S'$ and $par (2) = 〈value〉$ .
Constraint: for the generalized Pareto distribution, $par (2) > 0$ .

On entry, $estima ='S'$ and $par (2) = 〈value〉$ .
Constraint: for the negative binomial distribution, $0 < par (2) < 1$ .

On entry, $estima ='S'$ and $par (2) = 〈value〉$ .
Constraint: for the Normal distribution, $par (2) > 0$ .

$ifail = 6$: On entry, $dist ='U'$ and at least one observation is illegal.
Constraint: $par (1) \leq x (i) \leq par (2)$ , for $i = 1, 2, \dots, n$ .

On entry, $dist ='G'$ , $'E'$ , $'P'$ , $'NB'$ or $'GP'$ and at least one observation is negative.
Constraint: $x (i) \geq 0$ , for $i = 1, 2, \dots, n$ .

On entry, $dist ='BE'$ and at least one observation is illegal.
Constraint: $0 \leq x (i) \leq 1$ , for $i = 1, 2, \dots, n$ .

On entry, $dist ='BI'$ and all observations are zero or $m$ .
Constraint: at least one $0.0 < x (i) < par (1)$ , for $i = 1, 2, \dots, n$ .

On entry, $dist ='BI'$ and at least one observation is illegal.
Constraint: $0 \leq x (i) \leq par (1)$ , for $i = 1, 2, \dots, n$ .

On entry, $dist ='E'$ or $'P'$ and all observations are zero.
Constraint: at least one $x (i) > 0$ , for $i = 1, 2, \dots, n$ .

On entry, $dist ='GP'$ and $estima ='E'$ .
The parameter estimates are invalid; the data may not be from the generalized Pareto distribution.

$ifail = 7$: On entry, $dist ='U'$ , $'N'$ , $'G'$ , $'BE'$ or $'GP'$ , $estima ='E'$ and the whole sample is constant. Thus the variance is zero.

$ifail = 8$: On entry, $dist ='BI'$ , $par (1) = 〈value〉$ , $par (2) = 〈value〉$ .
The variance $par (1) \times par (2) \times (1 - par (2))$ exceeds 1000000.

On entry, $dist ='NB'$ , $par (1) = 〈value〉$ , $par (2) = 〈value〉$ .
The variance $par (1) \times (1 - par (2)) / (par (2) \times par (2))$ exceeds 1000000.

$ifail = 9$: On entry, $dist ='G'$ and in the computation of the incomplete gamma function by s14baf the convergence of the Taylor series or Legendre continued fraction fails within $600$ iterations.

$ifail = - 99$: An unexpected error has been triggered by this routine. Please contact NAG.
See Section 3.9 in How to Use the NAG Library and its Documentation for further information.

$ifail = - 399$: Your licence key may have expired or may not have been installed correctly.
See Section 3.8 in How to Use the NAG Library and its Documentation for further information.

$ifail = - 999$: Dynamic memory allocation failed.
See Section 3.7 in How to Use the NAG Library and its Documentation for further information.

7 Accuracy

The approximation for $p$, given when $n > 100$, has a relative error of at most $2.5$% for most cases. The two-sided probability is approximated by doubling the one-sided probability. This is only good for small $p$, i.e., $p < 0.10$, but very poor for large $p$. The error is always on the conservative side, that is the tail probability, $p$, is overestimated.

8 Parallelism and Performance

g08cbf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.

Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.

9 Further Comments

The time taken by g08cbf increases with $n$ until $n > 100$, at which point it drops and then increases slowly with $n$. The time may also depend on the choice of null distribution and on whether or not the parameters are to be estimated.

The data supplied in the argument x must be consistent with the chosen null distribution as follows:

when $dist ='U'$ , then $par (1) \leq x_{i} \leq par (2)$ , for $i = 1, 2, \dots, n$ ;
when $dist ='N'$ , then there are no constraints on the $x_{i}$ 's;
when $dist ='G'$ , then $x_{i} \geq 0.0$ , for $i = 1, 2, \dots, n$ ;
when $dist ='BE'$ , then $0.0 \leq x_{i} \leq 1.0$ , for $i = 1, 2, \dots, n$ ;
when $dist ='BI'$ , then $0.0 \leq x_{i} \leq par (1)$ , for $i = 1, 2, \dots, n$ ;
when $dist ='E'$ , then $x_{i} \geq 0.0$ , for $i = 1, 2, \dots, n$ ;
when $dist ='P'$ , then $x_{i} \geq 0.0$ , for $i = 1, 2, \dots, n$ ;
when $dist ='NB'$ , then $x_{i} \geq 0.0$ , for $i = 1, 2, \dots, n$ ;
when $dist ='GP'$ and $par (1) \geq 0.0$ , then $x_{i} \geq 0.0$ , for $i = 1, 2, \dots, n$ ;
when $dist ='GP'$ and $par (1) < 0.0$ , then $0.0 \leq x_{i} \leq - par (2) / par (1)$ , for $i = 1, 2, \dots, n$ .
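The consistency rules in the list above translate directly into a pre-call check. The following Python sketch is illustrative only; the function name `sample_is_consistent` is hypothetical, and it expects the full one- or two-character distribution code with `par` already holding valid parameter values.

```python
def sample_is_consistent(dist, x, par):
    """Return True if every observation in x is admissible for the null distribution."""
    code = dist.upper()
    if code == "U":
        return all(par[0] <= xi <= par[1] for xi in x)
    if code == "N":
        return True  # no constraints on the observations
    if code in ("G", "E", "P", "NB"):
        return all(xi >= 0.0 for xi in x)
    if code == "BE":
        return all(0.0 <= xi <= 1.0 for xi in x)
    if code == "BI":
        return all(0.0 <= xi <= par[0] for xi in x)
    if code == "GP":
        if par[0] >= 0.0:
            return all(xi >= 0.0 for xi in x)
        # Negative shape parameter: the support is bounded above.
        upper = -par[1] / par[0]
        return all(0.0 <= xi <= upper for xi in x)
    raise ValueError("unrecognized distribution code: " + dist)


# Made-up observations checked against U(0, 2).
print(sample_is_consistent("U", [0.5, 1.5], (0.0, 2.0)))
```

This covers only the range checks listed in this section; the additional conditions reported under $ifail = 6$, such as not all observations being zero for 'E' or 'P', are not included.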

10 Example

The following example program reads in a set of data consisting of 30 observations. The Kolmogorov–Smirnov test is then applied twice, firstly to test whether the sample is taken from a uniform distribution, $U (0, 2)$, and secondly to test whether the sample is taken from a Normal distribution where the mean and variance are estimated from the data. In both cases we are testing against $H_{1}$; that is, we are doing a two-tailed test. The values of d, z and p are printed for each case.

g08cbf (test_ks_1sample)
