NAG Library Routine Document
g08cdf (test_ks_2sample)
1
Purpose
g08cdf performs the two sample Kolmogorov–Smirnov distribution test.
2
Specification
Fortran Interface
Subroutine g08cdf (n1, x, n2, y, ntype, d, z, p, sx, sy, ifail)
Integer, Intent (In) :: n1, n2, ntype
Integer, Intent (Inout) :: ifail
Real (Kind=nag_wp), Intent (In) :: x(n1), y(n2)
Real (Kind=nag_wp), Intent (Out) :: d, z, p, sx(n1), sy(n2)

C Header Interface
#include <nagmk26.h>
void g08cdf_ (const Integer *n1, const double x[], const Integer *n2, const double y[], const Integer *ntype, double *d, double *z, double *p, double sx[], double sy[], Integer *ifail)

3
Description
The data consists of two independent samples, one of size ${n}_{1}$, denoted by ${x}_{1},{x}_{2},\dots ,{x}_{{n}_{1}}$, and the other of size ${n}_{2}$ denoted by ${y}_{1},{y}_{2},\dots ,{y}_{{n}_{2}}$. Let $F\left(x\right)$ and $G\left(x\right)$ represent their respective, unknown, distribution functions. Also let ${S}_{1}\left(x\right)$ and ${S}_{2}\left(x\right)$ denote the values of the sample cumulative distribution functions at the point $x$ for the two samples respectively.
The Kolmogorov–Smirnov test provides a test of the null hypothesis
${H}_{0}$:
$F\left(x\right)=G\left(x\right)$ against one of the following alternative hypotheses:
(i) 
${H}_{1}$: $F\left(x\right)\ne G\left(x\right)$. 
(ii) 
${H}_{2}$: $F\left(x\right)>G\left(x\right)$. This alternative hypothesis is sometimes stated as, ‘The $x$'s tend to be smaller than the $y$'s’, i.e., it would be demonstrated in practical terms if the values of ${S}_{1}\left(x\right)$ tended to exceed the corresponding values of ${S}_{2}\left(x\right)$. 
(iii) 
${H}_{3}$: $F\left(x\right)<G\left(x\right)$. This alternative hypothesis is sometimes stated as, ‘The $x$'s tend to be larger than the $y$'s’, i.e., it would be demonstrated in practical terms if the values of ${S}_{2}\left(x\right)$ tended to exceed the corresponding values of ${S}_{1}\left(x\right)$. 
One of the following test statistics is computed, depending on the alternative hypothesis specified (see the description of the argument ntype in Section 5).
For the alternative hypothesis ${H}_{1}$:
 ${D}_{{n}_{1},{n}_{2}}$ – the largest absolute deviation between the two sample cumulative distribution functions.
For the alternative hypothesis ${H}_{2}$:
 ${D}_{{n}_{1},{n}_{2}}^{+}$ – the largest positive deviation between the sample cumulative distribution function of the first sample, ${S}_{1}\left(x\right)$, and the sample cumulative distribution function of the second sample, ${S}_{2}\left(x\right)$. Formally ${D}_{{n}_{1},{n}_{2}}^{+}=\mathrm{max}\phantom{\rule{0.125em}{0ex}}\left\{{S}_{1}\left(x\right)-{S}_{2}\left(x\right),0\right\}$.
For the alternative hypothesis ${H}_{3}$:
 ${D}_{{n}_{1},{n}_{2}}^{-}$ – the largest positive deviation between the sample cumulative distribution function of the second sample, ${S}_{2}\left(x\right)$, and the sample cumulative distribution function of the first sample, ${S}_{1}\left(x\right)$. Formally ${D}_{{n}_{1},{n}_{2}}^{-}=\mathrm{max}\phantom{\rule{0.125em}{0ex}}\left\{{S}_{2}\left(x\right)-{S}_{1}\left(x\right),0\right\}$.
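The three statistics can be illustrated with a short C sketch. This is not the NAG implementation; the names ks2_statistic and cmp_double are hypothetical and chosen only for this illustration. It assumes the usual definition of the sample cumulative distribution function (the proportion of observations less than or equal to a point) and evaluates the deviation at every observed value, stepping past tied values together.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

/*
 * Hypothetical helper (not part of the NAG Library).
 * Returns D (ntype = 1), D+ (ntype = 2) or D- (ntype = 3).
 * sx and sy receive the observations sorted in ascending order.
 */
double ks2_statistic(int n1, const double x[], int n2, const double y[],
                     int ntype, double sx[], double sy[])
{
    assert(n1 >= 1 && n2 >= 1);
    assert(ntype >= 1 && ntype <= 3);

    for (int k = 0; k < n1; k++) sx[k] = x[k];
    for (int k = 0; k < n2; k++) sy[k] = y[k];
    qsort(sx, (size_t)n1, sizeof(double), cmp_double);
    qsort(sy, (size_t)n2, sizeof(double), cmp_double);

    double dplus = 0.0, dminus = 0.0;
    int i = 0, j = 0;
    while (i < n1 || j < n2) {
        /* t is the next point at which either step function jumps. */
        double t;
        if (j >= n2 || (i < n1 && sx[i] <= sy[j])) t = sx[i];
        else t = sy[j];

        /* Step past all observations equal to t (handles ties). */
        while (i < n1 && sx[i] <= t) i++;
        while (j < n2 && sy[j] <= t) j++;

        /* S1(t) - S2(t) at the current point. */
        double diff = (double)i / n1 - (double)j / n2;
        if (diff > dplus) dplus = diff;
        if (-diff > dminus) dminus = -diff;
    }

    if (ntype == 2) return dplus;
    if (ntype == 3) return dminus;
    return dplus > dminus ? dplus : dminus; /* largest absolute deviation */
}
```

For example, with first sample (1, 3) and second sample (2, 4) the sketch gives a two-sided statistic of 0.5, with the deviation arising entirely from points where the first sample's function exceeds the second's.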
g08cdf also returns the standardized statistic
$Z=\sqrt{\frac{{n}_{1}+{n}_{2}}{{n}_{1}{n}_{2}}}\times D$, where
$D$ may be
${D}_{{n}_{1},{n}_{2}}$,
${D}_{{n}_{1},{n}_{2}}^{+}$ or
${D}_{{n}_{1},{n}_{2}}^{-}$ depending on the choice of the alternative hypothesis. The distribution of this statistic converges asymptotically to a distribution given by Smirnov as
${n}_{1}$ and
${n}_{2}$ increase; see
Feller (1948),
Kendall and Stuart (1973),
Kim and Jenrich (1973),
Smirnov (1933) or
Smirnov (1948).
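As a worked illustration of the standardized statistic as defined above, take the sample sizes used in Section 10, ${n}_{1}=100$ and ${n}_{2}=50$, together with a hypothetical statistic value $D=0.18$ (assumed here purely for the arithmetic, not taken from the example results):
$Z=\sqrt{\frac{100+50}{100\times 50}}\times 0.18=\sqrt{0.03}\times 0.18\approx 0.0312$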
The probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as that observed, is computed. If
$\mathrm{max}\phantom{\rule{0.125em}{0ex}}\left({n}_{1},{n}_{2}\right)\le 2500$ and
${n}_{1}{n}_{2}\le 10000$ then an exact method given by Kim and Jenrich (see
Kim and Jenrich (1973)) is used. Otherwise
$p$ is computed using the approximations suggested by
Kim and Jenrich (1973). Note that the method used is only exact for continuous theoretical distributions. This method computes the two-sided probability. The one-sided probabilities are estimated by halving the two-sided probability. This is a good estimate for small
$p$, that is
$p\le 0.10$, but it becomes very poor for larger
$p$.
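The size limits and the halving rule described above can be summarised in a small C sketch. Both function names are hypothetical and are not NAG routines; the sketch only encodes the stated conditions and does not compute any probability itself.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true when both size limits for the exact
 * method are met, i.e. max(n1, n2) <= 2500 and n1 * n2 <= 10000.
 */
bool ks2_uses_exact_method(long long n1, long long n2)
{
    assert(n1 >= 1 && n2 >= 1);
    long long larger = n1 > n2 ? n1 : n2;
    return larger <= 2500 && n1 * n2 <= 10000;
}

/*
 * Hypothetical helper: one-sided estimate obtained by halving the
 * two-sided probability. The text above notes that this estimate is
 * good only for small p (p <= 0.10).
 */
double ks2_one_sided_estimate(double p_two_sided)
{
    assert(p_two_sided >= 0.0 && p_two_sided <= 1.0);
    return 0.5 * p_two_sided;
}
```

With samples of size 100 and 50 the product is 5000, so the exact method applies; with sizes 100 and 101 the product is 10100 and an approximation is used instead.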
4
References
Conover W J (1980) Practical Nonparametric Statistics Wiley
Feller W (1948) On the Kolmogorov–Smirnov limit theorems for empirical distributions Ann. Math. Statist. 19 179–181
Kendall M G and Stuart A (1973) The Advanced Theory of Statistics (Volume 2) (3rd Edition) Griffin
Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two sample Kolmogorov–Smirnov criterion ${D}_{mn}\left(m<n\right)$ Selected Tables in Mathematical Statistics 1 80–129 American Mathematical Society
Siegel S (1956) Nonparametric Statistics for the Behavioral Sciences McGraw–Hill
Smirnov N (1933) Estimate of deviation between empirical distribution functions in two independent samples Bull. Moscow Univ. 2(2) 3–16
Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions Ann. Math. Statist. 19 279–281
5
Arguments
 1: $\mathbf{n1}$ – Integer, Input

On entry: the number of observations in the first sample, ${n}_{1}$.
Constraint:
${\mathbf{n1}}\ge 1$.
 2: $\mathbf{x}\left({\mathbf{n1}}\right)$ – Real (Kind=nag_wp) array, Input

On entry: the observations from the first sample, ${x}_{1},{x}_{2},\dots ,{x}_{{n}_{1}}$.
 3: $\mathbf{n2}$ – Integer, Input

On entry: the number of observations in the second sample, ${n}_{2}$.
Constraint:
${\mathbf{n2}}\ge 1$.
 4: $\mathbf{y}\left({\mathbf{n2}}\right)$ – Real (Kind=nag_wp) array, Input

On entry: the observations from the second sample, ${y}_{1},{y}_{2},\dots ,{y}_{{n}_{2}}$.
 5: $\mathbf{ntype}$ – Integer, Input

On entry: the statistic to be computed, i.e., the choice of alternative hypothesis.
 ${\mathbf{ntype}}=1$
 Computes ${D}_{{n}_{1},{n}_{2}}$, to test against ${H}_{1}$.
 ${\mathbf{ntype}}=2$
 Computes ${D}_{{n}_{1},{n}_{2}}^{+}$, to test against ${H}_{2}$.
 ${\mathbf{ntype}}=3$
 Computes ${D}_{{n}_{1},{n}_{2}}^{-}$, to test against ${H}_{3}$.
Constraint:
${\mathbf{ntype}}=1$, $2$ or $3$.
 6: $\mathbf{d}$ – Real (Kind=nag_wp), Output

On exit: the Kolmogorov–Smirnov test statistic (${D}_{{n}_{1},{n}_{2}}$, ${D}_{{n}_{1},{n}_{2}}^{+}$ or ${D}_{{n}_{1},{n}_{2}}^{-}$ according to the value of ntype).
 7: $\mathbf{z}$ – Real (Kind=nag_wp), Output

On exit: a standardized value, $Z$, of the test statistic, $D$, without any correction for continuity.
 8: $\mathbf{p}$ – Real (Kind=nag_wp), Output

On exit: the tail probability associated with the observed value of $D$, where $D$ may be ${D}_{{n}_{1},{n}_{2}}$, ${D}_{{n}_{1},{n}_{2}}^{+}$ or ${D}_{{n}_{1},{n}_{2}}^{-}$ depending on the value of ntype (see Section 3).
 9: $\mathbf{sx}\left({\mathbf{n1}}\right)$ – Real (Kind=nag_wp) array, Output

On exit: the observations from the first sample sorted in ascending order.
 10: $\mathbf{sy}\left({\mathbf{n2}}\right)$ – Real (Kind=nag_wp) array, Output

On exit: the observations from the second sample sorted in ascending order.
 11: $\mathbf{ifail}$ – Integer, Input/Output

On entry: ifail must be set to $0$, $-1\text{ or }1$. If you are unfamiliar with this argument you should refer to Section 3.4 in How to Use the NAG Library and its Documentation for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value $-1\text{ or }1$ is recommended. If the output of error messages is undesirable, then the value $1$ is recommended. Otherwise, if you are not familiar with this argument, the recommended value is $0$.
When the value $-\mathbf{1}\text{ or }\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit:
${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see
Section 6).
6
Error Indicators and Warnings
If on entry ${\mathbf{ifail}}=0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
 ${\mathbf{ifail}}=1$

On entry,  ${\mathbf{n1}}<1$, 
or  ${\mathbf{n2}}<1$. 
 ${\mathbf{ifail}}=2$

On entry,  ${\mathbf{ntype}}\ne 1$, $2$ or $3$. 
 ${\mathbf{ifail}}=3$

The iterative procedure used in the approximation of the probability for large ${n}_{1}$ and ${n}_{2}$ did not converge. For the two-sided test, $p=1$ is returned. For the one-sided test, $p=0.5$ is returned.
 ${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please
contact
NAG.
See
Section 3.9 in How to Use the NAG Library and its Documentation for further information.
 ${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See
Section 3.8 in How to Use the NAG Library and its Documentation for further information.
 ${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See
Section 3.7 in How to Use the NAG Library and its Documentation for further information.
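The entry and exit conventions for ifail can be summarised in a short C sketch. Both helpers are hypothetical and are not part of the NAG Library; they only restate the values documented in Sections 5 and 6, with the generic errors taken as the negative values $-99$, $-399$ and $-999$.

```c
#include <assert.h>

/*
 * Hypothetical helper: nonzero when the entry value of ifail requests
 * explanatory messages on the current error message unit (0 or -1).
 */
int g08cdf_prints_messages(int ifail_on_entry)
{
    /* The only documented entry values are 0, -1 and 1. */
    assert(ifail_on_entry == 0 || ifail_on_entry == -1 ||
           ifail_on_entry == 1);
    return ifail_on_entry == 0 || ifail_on_entry == -1;
}

/*
 * Hypothetical helper: short description of an exit value of ifail.
 * The wording is a paraphrase, not the library's own message text.
 */
const char *g08cdf_ifail_text(int ifail_on_exit)
{
    switch (ifail_on_exit) {
    case 0:    return "no error";
    case 1:    return "n1 < 1 or n2 < 1";
    case 2:    return "ntype is not 1, 2 or 3";
    case 3:    return "approximation for large n1 and n2 did not converge";
    case -99:  return "unexpected error; contact NAG";
    case -399: return "licence key expired or not installed correctly";
    case -999: return "dynamic memory allocation failed";
    default:   return "unrecognised ifail value";
    }
}
```

A caller that sets ifail to $-1$ or $1$ on entry would test the value on exit and, for ${\mathbf{ifail}}=3$, treat the returned $p$ as the fallback value ($1$ for the two-sided test, $0.5$ for a one-sided test) and not as a computed probability.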
7
Accuracy
The large sample distributions used as approximations to the exact distribution should have a relative error of less than 5% for most cases.
8
Parallelism and Performance
g08cdf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
Please consult the
X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the
Users' Note for your implementation for any additional implementation-specific information.
9
Further Comments
The time taken by g08cdf increases with ${n}_{1}$ and ${n}_{2}$, until ${n}_{1}{n}_{2}>10000$ or $\mathrm{max}\phantom{\rule{0.125em}{0ex}}\left({n}_{1},{n}_{2}\right)>2500$, the limits beyond which the exact method of Section 3 is no longer used. At this point one of the approximations is used and the time decreases significantly. The time then increases again modestly with ${n}_{1}$ and ${n}_{2}$.
10
Example
This example computes the two-sided Kolmogorov–Smirnov test statistic for two independent samples of size $100$ and $50$ respectively. The first sample is from a uniform distribution $U\left(0,2\right)$. The second sample is from a uniform distribution $U\left(0.25,2.25\right)$. The test statistic, ${D}_{{n}_{1},{n}_{2}}$, the standardized test statistic, $Z$, and the tail probability, $p$, are computed and printed.
10.1
Program Text
Program Text (g08cdfe.f90)
10.2
Program Data
Program Data (g08cdfe.d)
10.3
Program Results
Program Results (g08cdfe.r)