NAG Library Routine Document

Fortran Interface

Subroutine g08ccf (n, x, cdf, ntype, d, z, p, sx, ifail)

Integer, Intent (In) :: n, ntype
Integer, Intent (Inout) :: ifail
Real (Kind=nag_wp), External :: cdf
Real (Kind=nag_wp), Intent (In) :: x(n)
Real (Kind=nag_wp), Intent (Out) :: d, z, p, sx(n)

C Header Interface

#include <nagmk26.h>

void g08ccf_ (const Integer *n, const double x[], double (NAG_CALL *cdf)(const double *x), const Integer *ntype, double *d, double *z, double *p, double sx[], Integer *ifail)

3 Description

The data consist of a single sample of $n$ observations, denoted by $x_{1}, x_{2}, \dots, x_{n}$. Let $S_{n} (x_{(i)})$ and $F_{0} (x_{(i)})$ represent the sample cumulative distribution function and the theoretical (null) cumulative distribution function respectively at the point $x_{(i)}$, where $x_{(i)}$ is the $i$th smallest sample observation.

The Kolmogorov–Smirnov test provides a test of the null hypothesis $H_{0}$: the data are a random sample of observations from a theoretical distribution specified by you (in cdf) against one of the following alternative hypotheses.

(i)	$H_{1}$: the data cannot be considered to be a random sample from the specified null distribution.
(ii)	$H_{2}$: the data arise from a distribution which dominates the specified null distribution. In practical terms, this would be demonstrated if the values of the sample cumulative distribution function $S_{n} (x)$ tended to exceed the corresponding values of the theoretical cumulative distribution function $F_{0} (x)$.
(iii)	$H_{3}$: the data arise from a distribution which is dominated by the specified null distribution. In practical terms, this would be demonstrated if the values of the theoretical cumulative distribution function $F_{0} (x)$ tended to exceed the corresponding values of the sample cumulative distribution function $S_{n} (x)$.

One of the following test statistics is computed depending on the particular alternative hypothesis specified (see the description of the argument ntype in Section 5).

For the alternative hypothesis $H_{1}$:

$D_{n}$ – the largest absolute deviation between the sample cumulative distribution function and the theoretical cumulative distribution function. Formally $D_{n} = \max \{D_{n}^{+}, D_{n}^{-}\}$.

For the alternative hypothesis $H_{2}$:

$D_{n}^{+}$ – the largest positive deviation between the sample cumulative distribution function and the theoretical cumulative distribution function. Formally $D_{n}^{+} = \max \{S_{n} (x_{(i)}) - F_{0} (x_{(i)}), 0\}$.

For the alternative hypothesis $H_{3}$:

$D_{n}^{-}$ – the largest positive deviation between the theoretical cumulative distribution function and the sample cumulative distribution function. Formally $D_{n}^{-} = \max \{F_{0} (x_{(i)}) - S_{n} (x_{(i - 1)}), 0\}$. This is only true for continuous distributions. See Section 9 for comments on discrete distributions.
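The three statistics can be illustrated with a short, self-contained C sketch. It is not the NAG implementation: the names `ks_statistics`, `cmp_double` and `unit_uniform_cdf` are hypothetical, and the code assumes a continuous null distribution, taking $S_{n} (x_{(i)}) = i / n$ and $S_{n} (x_{(i - 1)}) = (i - 1) / n$ with the maximum taken over $i = 1, \dots, n$.

```c
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

/* Hypothetical helper (not a NAG routine). Copies x[0..n-1] into sx,
 * sorts sx in ascending order and computes D_n^+, D_n^- and D_n for a
 * continuous null CDF f0. Returns 0 on success, 1 if n < 1. */
int ks_statistics(int n, const double x[], double (*f0)(const double *),
                  double sx[], double *dplus, double *dminus, double *d)
{
    int i;
    if (n < 1) return 1;
    for (i = 0; i < n; ++i) sx[i] = x[i];
    qsort(sx, (size_t)n, sizeof(double), cmp_double);
    *dplus = 0.0;
    *dminus = 0.0;
    for (i = 0; i < n; ++i) {
        double f = f0(&sx[i]);
        double up = (double)(i + 1) / n - f; /* S_n(x_(i)) - F_0(x_(i)) */
        double down = f - (double)i / n;     /* F_0(x_(i)) - S_n(x_(i-1)) */
        if (up > *dplus) *dplus = up;
        if (down > *dminus) *dminus = down;
    }
    *d = (*dplus > *dminus) ? *dplus : *dminus;
    return 0;
}

/* Example null distribution for illustration: U(0,1). */
double unit_uniform_cdf(const double *x)
{
    if (*x <= 0.0) return 0.0;
    if (*x >= 1.0) return 1.0;
    return *x;
}
```

For the sample $0.6, 0.1, 0.4$ tested against $U (0, 1)$, this sketch gives $D_{n}^{+} = 0.4$, $D_{n}^{-} = 0.1$ and hence $D_{n} = 0.4$.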

The standardized statistic, $Z = D \times \sqrt{n}$, is also computed, where $D$ may be $D_{n}$, $D_{n}^{+}$ or $D_{n}^{-}$ depending on the choice of the alternative hypothesis. This is the standardized value of $D$ with no continuity correction applied and the distribution of $Z$ converges asymptotically to a limiting distribution, first derived by Kolmogorov (1933), and then tabulated by Smirnov (1948). The asymptotic distributions for the one-sided statistics were obtained by Smirnov (1933).

The probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as that observed, is computed. If $n \leq 100$, an exact method given by Conover (1980) is used. Note that the method used is only exact for continuous theoretical distributions and does not include Conover's modification for discrete distributions. This method computes the one-sided probabilities. The two-sided probabilities are estimated by doubling the one-sided probability. This is a good estimate for small $p$, that is $p \leq 0.10$, but it becomes very poor for larger $p$. If $n > 100$ then $p$ is computed using the Kolmogorov–Smirnov limiting distributions; see Feller (1948), Kendall and Stuart (1973), Kolmogorov (1933), Smirnov (1933) and Smirnov (1948).

4 References

Conover W J (1980) Practical Nonparametric Statistics Wiley

Feller W (1948) On the Kolmogorov–Smirnov limit theorems for empirical distributions Ann. Math. Statist. 19 179–181

Kendall M G and Stuart A (1973) The Advanced Theory of Statistics (Volume 2) (3rd Edition) Griffin

Kolmogorov A N (1933) Sulla determinazione empirica di una legge di distribuzione Giornale dell' Istituto Italiano degli Attuari 4 83–91

Siegel S (1956) Non-parametric Statistics for the Behavioral Sciences McGraw–Hill

Smirnov N (1933) Estimate of deviation between empirical distribution functions in two independent samples Bull. Moscow Univ. 2(2) 3–16

Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions Ann. Math. Statist. 19 279–281

5 Arguments

1: $n$ – Integer, Input

On entry: $n$, the number of observations in the sample.

Constraint: $n \geq 1$.

2: $x (n)$ – Real (Kind=nag_wp) array, Input

On entry: the sample observations, $x_{1}, x_{2}, \dots, x_{n}$.

3: $cdf$ – Real (Kind=nag_wp) Function, supplied by the user, External Procedure

cdf must return the value of the theoretical (null) cumulative distribution function for a given value of its argument.

The specification of cdf is:

Fortran Interface

Function cdf (x)

Real (Kind=nag_wp) :: cdf
Real (Kind=nag_wp), Intent (In) :: x

C Header Interface

#include <nagmk26.h>

double cdf (const double *x)

1: $x$ – Real (Kind=nag_wp), Input

On entry: the argument for which cdf must be evaluated.

cdf must either be a module subprogram USEd by, or declared as EXTERNAL in, the (sub)program from which g08ccf is called. Arguments denoted as Input must not be changed by this procedure.

Note: cdf should not return floating-point NaN (Not a Number) or infinity values, since these are not handled by g08ccf. If your code inadvertently does return any NaNs or infinities, g08ccf is likely to produce unexpected results.

Constraint: cdf must always return a value in the range $[0.0, 1.0]$ and cdf must always satisfy the condition that $cdf (x_{1}) \leq cdf (x_{2})$ for any $x_{1} \leq x_{2}$.

4: $ntype$ – Integer, Input

On entry: the statistic to be calculated, i.e., the choice of alternative hypothesis.

$ntype = 1$: Computes $D_{n}$, to test $H_{0}$ against $H_{1}$.
$ntype = 2$: Computes $D_{n}^{+}$, to test $H_{0}$ against $H_{2}$.
$ntype = 3$: Computes $D_{n}^{-}$, to test $H_{0}$ against $H_{3}$.

Constraint: $ntype = 1$, $2$ or $3$.

5: $d$ – Real (Kind=nag_wp), Output

On exit: the Kolmogorov–Smirnov test statistic ($D_{n}$, $D_{n}^{+}$ or $D_{n}^{-}$ according to the value of ntype).

6: $z$ – Real (Kind=nag_wp), Output

On exit: a standardized value, $Z$, of the test statistic, $D$, without the continuity correction applied.

7: $p$ – Real (Kind=nag_wp), Output

On exit: the probability, $p$, associated with the observed value of $D$, where $D$ may be $D_{n}$, $D_{n}^{+}$ or $D_{n}^{-}$ depending on the value of ntype (see Section 3).

8: $sx (n)$ – Real (Kind=nag_wp) array, Output

On exit: the sample observations, $x_{1}, x_{2}, \dots, x_{n}$, sorted in ascending order.

9: $ifail$ – Integer, Input/Output

On entry: ifail must be set to $0$, $-1$ or $1$. If you are unfamiliar with this argument you should refer to Section 3.4 in How to Use the NAG Library and its Documentation for details.

For environments where it might be inappropriate to halt program execution when an error is detected, the value $-1$ or $1$ is recommended. If the output of error messages is undesirable, then the value $1$ is recommended. Otherwise, if you are not familiar with this argument, the recommended value is $0$. When the value $-1$ or $1$ is used it is essential to test the value of ifail on exit.

On exit: $ifail = 0$ unless the routine detects an error or a warning has been flagged (see Section 6).

6 Error Indicators and Warnings

If on entry $ifail = 0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).

Errors or warnings detected by the routine:

$ifail = 1$: On entry, $n = 〈value〉$.
Constraint: $n \geq 1$.

$ifail = 2$: On entry, $ntype = 〈value〉$.
Constraint: $ntype = 1$, $2$ or $3$.

$ifail = 3$: On entry, at $x = 〈value〉$, $F_{0} (x) = 〈value〉$.
Constraint: $0.0 \leq F_{0} (x) \leq 1$, where $F_{0}$ is supplied in cdf.

$ifail = 4$: On entry, at $x = 〈value〉$, $F_{0} (x) = 〈value〉$ and at $y = 〈value〉$, $F_{0} (y) = 〈value〉$.
Constraint: when $x < y$, $F_{0} (x) \leq F_{0} (y)$, where $F_{0}$ is supplied in cdf.

$ifail = -99$: An unexpected error has been triggered by this routine. Please contact NAG.
See Section 3.9 in How to Use the NAG Library and its Documentation for further information.

$ifail = -399$: Your licence key may have expired or may not have been installed correctly.
See Section 3.8 in How to Use the NAG Library and its Documentation for further information.

$ifail = -999$: Dynamic memory allocation failed.
See Section 3.7 in How to Use the NAG Library and its Documentation for further information.

7 Accuracy

For most cases the approximation for $p$ given when $n > 100$ has a relative error of less than $0.01$. The two-sided probability is approximated by doubling the one-sided probability. This is only good for small $p$, that is $p < 0.10$, but very poor for large $p$. The error is always on the conservative side.

8 Parallelism and Performance

g08ccf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.

Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.

9 Further Comments

The time taken by g08ccf increases with $n$ until $n > 100$ at which point it drops and then increases slowly.

For a discrete theoretical cumulative distribution function $F_{0} (x)$, $D_{n}^{-} = \max \{F_{0} (x_{(i)}) - S_{n} (x_{(i)}), 0\}$. Thus if you wish to provide a discrete distribution function the following adjustment needs to be made:

for $D_{n}^{+}$, return $F (x)$ at $x$ as usual;
for $D_{n}^{-}$, return $F (x - d)$ at $x$ where $d$ is the discrete jump in the distribution. For example $d = 1$ for the Poisson or binomial distributions.

10 Example

The following example performs the one sample Kolmogorov–Smirnov test to test whether a sample of $30$ observations arises firstly from a uniform distribution $U (0, 1)$ or secondly from a Normal distribution with mean $0.75$ and standard deviation $0.5$. The two-sided test statistic, $D_{n}$, the standardized test statistic, $Z$, and the upper tail probability, $p$, are computed and then printed for each test.

g08ccf (test_ks_1sample_user)
