NAG Library Routine Document
G08CCF
1 Purpose
G08CCF performs the one sample Kolmogorov–Smirnov distribution test, using a user-specified distribution.
2 Specification
SUBROUTINE G08CCF (N, X, CDF, NTYPE, D, Z, P, SX, IFAIL)
INTEGER             N, NTYPE, IFAIL
REAL (KIND=nag_wp)  X(N), CDF, D, Z, P, SX(N)
EXTERNAL            CDF
3 Description
The data consist of a single sample of $n$ observations, denoted by $x_1, x_2, \ldots, x_n$. Let $S_n(x_{(i)})$ and $F_0(x_{(i)})$ represent the sample cumulative distribution function and the theoretical (null) cumulative distribution function, respectively, at the point $x_{(i)}$, where $x_{(i)}$ is the $i$th smallest sample observation.
The Kolmogorov–Smirnov test provides a test of the null hypothesis
$H_0$: the data are a random sample of observations from a theoretical distribution specified by you (in CDF),
against one of the following alternative hypotheses.
(i) $H_1$: the data cannot be considered to be a random sample from the specified null distribution.
(ii) $H_2$: the data arise from a distribution which dominates the specified null distribution. In practical terms, this would be demonstrated if the values of the sample cumulative distribution function $S_n(x)$ tended to exceed the corresponding values of the theoretical cumulative distribution function $F_0(x)$.
(iii) $H_3$: the data arise from a distribution which is dominated by the specified null distribution. In practical terms, this would be demonstrated if the values of the theoretical cumulative distribution function $F_0(x)$ tended to exceed the corresponding values of the sample cumulative distribution function $S_n(x)$.
One of the following test statistics is computed, depending on the alternative hypothesis specified (see the description of the parameter NTYPE in Section 5).
For the alternative hypothesis $H_1$:
- $D_n$: the largest absolute deviation between the sample cumulative distribution function and the theoretical cumulative distribution function. Formally, $D_n = \max\{D_n^+, D_n^-\}$.
For the alternative hypothesis $H_2$:
- $D_n^+$: the largest positive deviation between the sample cumulative distribution function and the theoretical cumulative distribution function. Formally, $D_n^+ = \max\{S_n(x_{(i)}) - F_0(x_{(i)}), 0\}$.
For the alternative hypothesis $H_3$:
- $D_n^-$: the largest positive deviation between the theoretical cumulative distribution function and the sample cumulative distribution function. Formally, $D_n^- = \max\{F_0(x_{(i)}) - S_n(x_{(i-1)}), 0\}$. This is only true for continuous distributions. See Section 8 for comments on discrete distributions.
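The three statistics defined above can be sketched in a few lines of Python. This is an illustrative sketch only, not the routine's implementation: the function names `ks_statistics` and `uniform_cdf` are hypothetical, the null distribution is assumed continuous, and $S_n(x_{(i)})$ is taken as $i/n$ with $S_n(x_{(0)}) = 0$.

```python
def ks_statistics(sample, cdf):
    """Return (Dn, Dn+, Dn-) for a continuous null CDF (illustrative sketch)."""
    xs = sorted(sample)              # x_(1) <= x_(2) <= ... <= x_(n)
    n = len(xs)
    d_plus = 0.0                     # the trailing 0 in max{..., 0}
    d_minus = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = cdf(x)                  # F_0(x_(i))
        d_plus = max(d_plus, i / n - f0)          # S_n(x_(i)) - F_0(x_(i))
        d_minus = max(d_minus, f0 - (i - 1) / n)  # F_0(x_(i)) - S_n(x_(i-1))
    return max(d_plus, d_minus), d_plus, d_minus


def uniform_cdf(x):
    """CDF of U(0,1), used here as a hypothetical null distribution."""
    return min(max(x, 0.0), 1.0)


sample = [0.4, 0.1, 0.7]
d_n, d_plus, d_minus = ks_statistics(sample, uniform_cdf)
print(d_n, d_plus, d_minus)
```

For this small sample the largest gap above the null CDF occurs at the largest observation, so the two-sided statistic equals the one-sided statistic $D_n^+$.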
The standardized statistic, $Z = D \times \sqrt{n}$, is also computed, where $D$ may be $D_n$, $D_n^+$ or $D_n^-$ depending on the choice of the alternative hypothesis. This is the standardized value of $D$ with no continuity correction applied. The distribution of $Z$ converges asymptotically to a limiting distribution, first derived by Kolmogorov (1933) and then tabulated by Smirnov (1948). The asymptotic distributions for the one-sided statistics were obtained by Smirnov (1933).
The probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as that observed is computed. If $n \le 100$, an exact method given by Conover (1980) is used. Note that the method is exact only for continuous theoretical distributions and does not include Conover's modification for discrete distributions. This method computes the one-sided probabilities. The two-sided probabilities are estimated by doubling the one-sided probability. This is a good estimate for small $p$, that is $p \le 0.10$, but it becomes very poor for larger $p$. If $n > 100$ then $p$ is computed using the Kolmogorov–Smirnov limiting distributions; see Feller (1948), Kendall and Stuart (1973), Kolmogorov (1933), Smirnov (1933) and Smirnov (1948).
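To make the large-sample case concrete, the following Python sketch evaluates the classical limiting tail probabilities at $z = D\sqrt{n}$: $2\sum_{k \ge 1} (-1)^{k-1} e^{-2k^2z^2}$ for the two-sided statistic and $e^{-2z^2}$ for the one-sided statistics. It is a sketch under assumptions, not the routine's actual code: the function name `ks_limiting_p`, the number of series terms and the cut-off for small $z$ are all choices made for this illustration.

```python
import math


def ks_limiting_p(d, n, two_sided=True, terms=100):
    """Limiting tail probability for a KS statistic d from n observations.

    Illustrative only: uses the classical asymptotic forms, which may
    differ in detail from what G08CCF does internally.
    """
    z = d * math.sqrt(n)             # standardized statistic Z
    if not two_sided:
        return math.exp(-2.0 * z * z)
    if z < 0.2:
        # The alternating series converges slowly for tiny z, where the
        # true probability is indistinguishable from 1 (assumed cut-off).
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
    return min(1.0, max(0.0, 2.0 * total))


p_two = ks_limiting_p(0.05, 400)                   # z = 1.0
p_one = ks_limiting_p(0.05, 400, two_sided=False)  # z = 1.0
print(p_two, p_one)
```

At $z = 1$ the one-sided value is $e^{-2} \approx 0.135$ and the two-sided value is about 0.27, just under twice the one-sided value. For smaller $p$ the two agree more closely with the doubling rule, which is consistent with the remark above that doubling is a good estimate only for small $p$.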
4 References
Conover W J (1980) Practical Nonparametric Statistics Wiley
Feller W (1948) On the Kolmogorov–Smirnov limit theorems for empirical distributions Ann. Math. Statist. 19 179–181
Kendall M G and Stuart A (1973) The Advanced Theory of Statistics (Volume 2) (3rd Edition) Griffin
Kolmogorov A N (1933) Sulla determinazione empirica di una legge di distribuzione Giornale dell' Istituto Italiano degli Attuari 4 83–91
Siegel S (1956) Non-parametric Statistics for the Behavioral Sciences McGraw–Hill
Smirnov N (1933) Estimate of deviation between empirical distribution functions in two independent samples Bull. Moscow Univ. 2(2) 3–16
Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions Ann. Math. Statist. 19 279–281
5 Parameters
- 1: N – INTEGER (Input)
On entry: $n$, the number of observations in the sample.
Constraint: N≥1.
- 2: X(N) – REAL (KIND=nag_wp) array (Input)
On entry: the sample observations, $x_1, x_2, \ldots, x_n$.
- 3: CDF – REAL (KIND=nag_wp) FUNCTION, supplied by the user (External Procedure)
CDF must return the value of the theoretical (null) cumulative distribution function for a given value of its argument.
The specification of CDF is:
- 1: X – REAL (KIND=nag_wp) (Input)
On entry: the argument for which CDF must be evaluated.
CDF must either be a module subprogram USEd by, or declared as EXTERNAL in, the (sub)program from which G08CCF is called. Parameters denoted as Input must not be changed by this procedure.
Constraint: CDF must always return a value in the range $[0.0, 1.0]$, and CDF must always satisfy the condition that $\mathrm{CDF}(x_1) \le \mathrm{CDF}(x_2)$ for any $x_1 \le x_2$.
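The two constraints on CDF (values in $[0.0, 1.0]$ and nondecreasing) can be checked on a set of points before a user-supplied function is passed to the routine. The Python sketch below is illustrative only: `normal_cdf` and `check_cdf` are hypothetical names, and the return codes 3 and 4 are chosen here merely to mirror the IFAIL values that Section 6 associates with each violation.

```python
import math


def normal_cdf(x, mean=0.75, sd=0.5):
    """Normal CDF via the error function (example null distribution)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def check_cdf(cdf, points):
    """Return 0 if cdf looks valid on the given points, else 3 or 4.

    3: a value outside [0, 1] was returned.
    4: the function decreased between two increasing arguments.
    """
    previous = None
    for x in sorted(points):
        f = cdf(x)
        if f < 0.0 or f > 1.0:
            return 3
        if previous is not None and f < previous:
            return 4
        previous = f
    return 0


points = [-1.0, 0.0, 0.5, 0.75, 1.0, 2.5]
print(check_cdf(normal_cdf, points))
```

A check on finitely many points cannot prove that a function is a valid distribution function everywhere; it only catches violations at the points tested.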
- 4: NTYPE – INTEGER (Input)
On entry: the statistic to be calculated, i.e., the choice of alternative hypothesis.
- NTYPE=1: computes $D_n$, to test $H_0$ against $H_1$.
- NTYPE=2: computes $D_n^+$, to test $H_0$ against $H_2$.
- NTYPE=3: computes $D_n^-$, to test $H_0$ against $H_3$.
Constraint: NTYPE=1, 2 or 3.
- 5: D – REAL (KIND=nag_wp) (Output)
On exit: the Kolmogorov–Smirnov test statistic ($D_n$, $D_n^+$ or $D_n^-$ according to the value of NTYPE).
- 6: Z – REAL (KIND=nag_wp) (Output)
On exit: a standardized value, $Z$, of the test statistic, $D$, without the continuity correction applied.
- 7: P – REAL (KIND=nag_wp) (Output)
On exit: the probability, $p$, associated with the observed value of $D$, where $D$ may be $D_n$, $D_n^+$ or $D_n^-$ depending on the value of NTYPE (see Section 3).
- 8: SX(N) – REAL (KIND=nag_wp) array (Output)
On exit: the sample observations, $x_1, x_2, \ldots, x_n$, sorted in ascending order.
- 9: IFAIL – INTEGER (Input/Output)
On entry: IFAIL must be set to 0, -1 or 1. If you are unfamiliar with this parameter you should refer to Section 3.3 in the Essential Introduction for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value -1 or 1 is recommended. If the output of error messages is undesirable, then the value 1 is recommended. Otherwise, if you are not familiar with this parameter, the recommended value is 0. When the value -1 or 1 is used it is essential to test the value of IFAIL on exit.
On exit: IFAIL=0 unless the routine detects an error or a warning has been flagged (see Section 6).
6 Error Indicators and Warnings
If on entry IFAIL=0 or -1, explanatory error messages are output on the current error message unit (as defined by X04AAF).
Errors or warnings detected by the routine:
- IFAIL=1
On entry, N<1.
- IFAIL=2
On entry, NTYPE≠1, 2 or 3.
- IFAIL=3
The supplied theoretical cumulative distribution function returns a value less than 0.0 or greater than 1.0, thereby violating the definition of the cumulative distribution function.
- IFAIL=4
The supplied theoretical cumulative distribution function is not a nondecreasing function, thereby violating the definition of a cumulative distribution function; that is, $F_0(x) > F_0(y)$ for some $x < y$.
7 Accuracy
For most cases the approximation for $p$ given when $n > 100$ has a relative error of less than 0.01. The two-sided probability is approximated by doubling the one-sided probability. This approximation is good only for small $p$, that is $p < 0.10$, and is very poor for large $p$. The error is always on the conservative side.
8 Further Comments
The time taken by G08CCF increases with $n$ until $n > 100$, at which point it drops and then increases slowly.
For a discrete theoretical cumulative distribution function $F_0(x)$, $D_n^- = \max\{F_0(x_{(i)}) - S_n(x_{(i)}), 0\}$. Thus if you wish to provide a discrete distribution function, the following adjustment needs to be made:
- for $D_n^+$, return $F(x)$ at $x$ as usual;
- for $D_n^-$, return $F(x - d)$ at $x$, where $d$ is the discrete jump in the distribution. For example, $d = 1$ for the Poisson or binomial distributions.
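The adjustment for a discrete distribution can be illustrated with a Poisson null distribution. In the Python sketch below, all function names and the rate parameter `lam=2.0` are hypothetical choices for the example. The point is only that the function supplied when $D_n^-$ is requested returns the distribution function shifted by the jump $d = 1$.

```python
import math


def poisson_cdf(x, lam):
    """P(X <= x) for a Poisson variable with mean lam."""
    if x < 0:
        return 0.0
    k = int(math.floor(x))
    term = math.exp(-lam)            # P(X = 0)
    total = term
    for j in range(1, k + 1):
        term *= lam / j              # P(X = j) from P(X = j - 1)
        total += term
    return min(total, 1.0)


def cdf_for_d_plus(x, lam=2.0):
    """Function to supply when D_n^+ is requested: F(x) as usual."""
    return poisson_cdf(x, lam)


def cdf_for_d_minus(x, lam=2.0, d=1.0):
    """Function to supply when D_n^- is requested: F(x - d), with d = 1."""
    return poisson_cdf(x - d, lam)


print(cdf_for_d_plus(3.0), cdf_for_d_minus(3.0))
```

With the shift, the value returned at an observation $x$ is the probability of a strictly smaller outcome, so the value at the smallest possible outcome, 0, is zero.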
9 Example
The following example performs the one-sample Kolmogorov–Smirnov test to test whether a sample of 30 observations arises firstly from a uniform distribution $U(0,1)$ or secondly from a Normal distribution with mean 0.75 and standard deviation 0.5. The two-sided test statistic, $D_n$, the standardized test statistic, $Z$, and the upper tail probability, $p$, are computed and then printed for each test.
9.1 Program Text
Program Text (g08ccfe.f90)
9.2 Program Data
Program Data (g08ccfe.d)
9.3 Program Results
Program Results (g08ccfe.r)