hide long namesshow long names
hide short namesshow short names
Integer type:  int32  int64  nag_int  show int32  show int32  show int64  show int64  show nag_int  show nag_int

PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

NAG Toolbox: nag_stat_quantiles_stream_arbitrary (g01ap)

Purpose

nag_stat_quantiles_stream_arbitrary (g01ap) finds approximate quantiles from a large arbitrary-sized data stream using an out-of-core algorithm.

Syntax

[ind, np, qv, rcomm, icomm, ifail] = g01ap(ind, rv, eps, q, rcomm, icomm, 'nb', nb, 'nq', nq, 'lrcomm', lrcomm, 'licomm', licomm)
[ind, np, qv, rcomm, icomm, ifail] = nag_stat_quantiles_stream_arbitrary(ind, rv, eps, q, rcomm, icomm, 'nb', nb, 'nq', nq, 'lrcomm', lrcomm, 'licomm', licomm)

Description

A quantile is a value which divides a frequency distribution such that there is a given proportion of data values below the quantile. For example, the median of a dataset is the 0.50.5 quantile because half the values are less than or equal to it.
nag_stat_quantiles_stream_arbitrary (g01ap) uses a slightly modified version of an algorithm described in a paper by Zhang and Wang (2007) to determine εε-approximate quantiles of a large arbitrary-sized data stream of real values, where εε is a user-defined approximation factor. Let mm denote the number of data elements processed so far then, given any quantile q[0.0,1.0]q[0.0,1.0], an εε-approximate quantile is defined as an element in the data stream whose rank falls within [(qε)m,(q + ε)m] [(q-ε)m,(q+ε)m] . In case of more than one εε-approximate quantile being available, the one closest to qmqm is used.

References

Zhang Q and Wang W (2007) A fast algorithm for approximate quantiles in high speed data streams Proceedings of the 19th International Conference on Scientific and Statistical Database Management IEEE Computer Society 29

Parameters

Compulsory Input Parameters

1:     ind – int64int32nag_int scalar
On initial entry: must be set to 00.
Indicates the action required in the current call to nag_stat_quantiles_stream_arbitrary (g01ap).
ind = 0ind=0
Initialize the communication arrays and attempt to process the first nb values from the data stream. eps, rv and nb must be set and licomm must be at least 1010.
ind = 1ind=1
Attempt to process the next block of nb values from the data stream. The calling program must update rv and (if required) nb, and re-enter nag_stat_quantiles_stream_arbitrary (g01ap) with all other parameters unchanged.
ind = 2ind=2
Continue calculation following the reallocation of either or both of the communication arrays rcomm and icomm.
ind = 3ind=3
Calculate the nq εε-approximate quantiles specified in q. The calling program must set q and nq and re-enter nag_stat_quantiles_stream_arbitrary (g01ap) with all other parameters unchanged. This option can be chosen only when npexp(1.0) / epsnpexp(1.0)/eps.
Constraint: ind = 0ind=0, 11, 22 or 33.
2:     rv( : :) – double array
Note: the dimension of the array rv must be at least nbnb if ind = 0ind=0, 11 or 22.
If ind = 0ind=0, 11 or 22, the vector containing the current block of data, otherwise rv is not referenced.
3:     eps – double scalar
Approximation factor εε.
Constraint: eps > 0.0 ​ and ​eps1.0eps>0.0 ​ and ​eps1.0.
4:     q( : :) – double array
Note: the dimension of the array q must be at least nqnq if ind = 3ind=3.
If ind = 3ind=3, the quantiles to be calculated, otherwise q is not referenced. Note that q(i) = 0.0qi=0.0, corresponds to the minimum value and q(i) = 1.0qi=1.0 to the maximum value.
Constraint: if ind = 3ind=3, 0.0q(i)1.00.0qi1.0, for i = 1,2,,nqi=1,2,,nq.
5:     rcomm(lrcomm) – double array
lrcomm, the dimension of the array, must satisfy the constraint
If ind = 1ind=1 or 22 then the first ll elements of rcomm as supplied to nag_stat_quantiles_stream_arbitrary (g01ap) must be identical to the first ll elements of rcomm returned from the last call to nag_stat_quantiles_stream_arbitrary (g01ap), where ll is the value of lrcomm used in the last call. In other words, the contents of rcomm must not be altered between calls to this function. If rcomm needs to be reallocated then its contents must be preserved. If ind = 0ind=0 then rcomm need not be set.
6:     icomm(licomm) – int64int32nag_int array
licomm, the dimension of the array, must satisfy the constraint
If ind = 1ind=1 or 22 then the first ll elements of icomm as supplied to nag_stat_quantiles_stream_arbitrary (g01ap) must be identical to the first ll elements of icomm returned from the last call to nag_stat_quantiles_stream_arbitrary (g01ap), where ll is the value of licomm used in the last call. In other words, the contents of icomm must not be altered between calls to this function. If icomm needs to be reallocated then its contents must be preserved. If ind = 0ind=0 then icomm need not be set.

Optional Input Parameters

1:     nb – int64int32nag_int scalar
Default: The dimension of the array rv.
If ind = 0ind=0, 11 or 22, the size of the current block of data. The size of blocks of data in array rv can vary; therefore nb can change between calls to nag_stat_quantiles_stream_arbitrary (g01ap).
Constraint: if ind = 0ind=0, 11 or 22, nb > 0nb>0.
2:     nq – int64int32nag_int scalar
Default: The dimension of the array q.
If ind = 3ind=3, the number of quantiles requested, otherwise nq is not referenced.
Constraint: if ind = 3ind=3, nq > 0nq>0.
3:     lrcomm – int64int32nag_int scalar
Default: The dimension of the array rcomm.
The dimension of the array rcomm as declared in the (sub)program from which nag_stat_quantiles_stream_arbitrary (g01ap) is called.
Constraints:
4:     licomm – int64int32nag_int scalar
Default: The dimension of the array icomm.
The dimension of the array icomm as declared in the (sub)program from which nag_stat_quantiles_stream_arbitrary (g01ap) is called.
Constraints:

Input Parameters Omitted from the MATLAB Interface

None.

Output Parameters

1:     ind – int64int32nag_int scalar
Indicates output from the call.
ind = 1ind=1
nag_stat_quantiles_stream_arbitrary (g01ap) has processed np data points and expects to be called again with additional data.
ind = 2ind=2
Either one or more of the communication arrays rcomm and icomm is too small. The new minimum lengths of rcomm and icomm have been returned in icomm(1)icomm1 and icomm(2)icomm2 respectively. If the new minimum length is greater than the current length then the corresponding communication array needs to be reallocated, its contents preserved and nag_stat_quantiles_stream_arbitrary (g01ap) called again with all other parameters unchanged.
If there is more data to be processed, it is recommended that lrcomm and licomm are made significantly bigger than the minimum to limit the number of reallocations.
ind = 3ind=3
nag_stat_quantiles_stream_arbitrary (g01ap) has returned the requested εε-approximate quantiles in qv. These quantiles are based on np data points.
2:     np – int64int32nag_int scalar
mm, the number of elements processed so far.
3:     qv( : :) – double array
Note: the dimension of the array qv must be at least nqnq if ind = 3ind=3.
If ind = 3ind=3, qv(i)qvi contains the εε-approximate quantiles specified by the value provided in q(i)qi.
4:     rcomm(lrcomm) – double array
rcomm holds information required by subsequent calls to nag_stat_quantiles_stream_arbitrary (g01ap)
5:     icomm(licomm) – int64int32nag_int array
icomm(1)icomm1 holds the minimum required length for rcomm and icomm(2)icomm2 holds the minimum required length for icomm. The remaining elements of icomm are used for communication between subsequent calls to nag_stat_quantiles_stream_arbitrary (g01ap).
6:     ifail – int64int32nag_int scalar
ifail = 0ifail=0 unless the function detects an error (see [Error Indicators and Warnings]).
As an out-of-core function nag_stat_quantiles_stream_arbitrary (g01ap) will only perform certain parameter checks when a data checkpoint (including completion of data input) is signaled. As such it will usually be inappropriate to halt program execution when an error is detected since any errors may be subsequently resolved without losing any processing already carried out. Therefore setting ifail to a value of 1​ or ​1-1​ or ​1 is recommended. If the output of error messages is undesirable, then the value 11 is recommended. When the value 1​ or ​1-1​ or ​1 is used it is essential to test the value of ifail on exit.

Error Indicators and Warnings

Errors or warnings detected by the function:
  ifail = 1ifail=1
Constraint: ind = 0ind=0, 11, 22 or 33.
  ifail = 2ifail=2
Constraint: 0.0 < eps1.00.0<eps1.0.
  ifail = 3ifail=3
Constraint: if ind = 0ind=0, 11 or 22 then nb > 0nb>0.
  ifail = 4ifail=4
Constraint: licomm10licomm10.
  ifail = 5ifail=5
Constraint: lrcomm1lrcomm1.
  ifail = 6ifail=6
The contents of icomm have been altered between calls to this function.
  ifail = 7ifail=7
The contents of rcomm have been altered between calls to this function.
  ifail = 8ifail=8
Number of data elements streamed, __ is not sufficient for a quantile query when .
Supply more data or reprocess the data with a higher eps value.
  ifail = 9ifail=9
Constraint: if ind = 3ind=3 then nq > 0nq>0.
  ifail = 10ifail=10
Constraint: if ind = 3ind=3 then 0.0q(i)1.00.0qi1.0 for all ii.
  ifail = 999ifail=-999
Dynamic memory allocation failed.

Accuracy

Not applicable.

Further Comments

The average time taken by nag_stat_quantiles_stream_arbitrary (g01ap) scales as np log(1 / εlog(εnp)) np log( 1/ε log(εnp) ) .
It is not possible to determine in advance the final size of the communication arrays rcomm and icomm without knowing the size of the dataset. However, if a rough size (nn) is known, the speed of the computation can be increased if the sizes of the communication arrays are not smaller than
lrcomm = (log2(n × eps + 1.0)2) × 1.0 / eps + 1 + x + 2 × min (x, x / 2.0 + 1 ) × y + 1
licomm = (log2(n × eps + 1.0)2) × (2 × (1.0 / eps + 1) + 1) +
2 × (x + 2 × min (x, x / 2.0 + 1 ) × y) + y + 11
lrcomm = ( log2( n×eps+1.0 ) - 2 ) × 1.0/eps +1+x+ 2× min(x, x/2.0 +1 ) × y +1 licomm = ( log2( n×eps+1.0 ) - 2 ) × ( 2 × ( 1.0/eps +1 ) + 1 ) + 2 × ( x+2× min(x, x/2.0 +1 ) × y ) + y + 11
where
x = max (1,log (eps × n) / eps )
y = log2(n / x + 1.0) + 1 .
x= max(1, log (eps×n) / eps ) y = log2(n/x+1.0) +1 .

Example

function nag_stat_quantiles_stream_arbitrary_example
n = 0;
eps = 0.2;
q = [0.25; 0.5; 1.0];
nb = 20;
% For this example we are using a string as the source of data.
datastream = strcat('34.01 57.95 44.88 22.04 28.84 4.43 0.32 20.82 20.53 ', ...
' 13.08  7.99 54.03 23.21 26.73 39.72  0.97 39.05 38.78 19.38 51.34 24.08 ', ...
' 12.41 58.11 35.90 40.38 27.41 19.80  6.02 45.33 36.34 43.14 53.84 39.49 ', ...
'  9.04 36.74 58.72 59.95 15.41 33.05 39.54 33.24 58.67 54.12 39.48 43.73 ', ...
' 24.15 55.72  8.87 40.47 46.18 20.36  6.95 36.86 49.24 56.83 43.87 29.86 ', ...
' 22.49 25.29 33.17');
rcomm = zeros(100, 1);
icomm = zeros(400, 1, 'int64');
ind = int64(0);
repeat = true;
pos    = 0;
while (repeat)
  if (ind==0 || ind==1)
    [rv, new_pos] = textscan(datastream(pos+1:end), '%f', nb);
    pos = pos+new_pos;

    nd = numel(rv{1});
    if nd == 0
      break;
    elseif nd < nb
      nb = nd;
      repeat = false;
    elseif pos == numel(datastream)
      repeat = false;
    end
    n = n+nb;
  end

  [ind, np, qv, rcomm, icomm, ifail] = ...
    nag_stat_quantiles_stream_arbitrary(ind, rv{1}, eps, q, rcomm, icomm);

  if ifail ~= 0
    break;
  end

  % If ind=2, the communication arrays are too small, so extend them
  % and call the routine again with the same rv
  if ind == 2
    if numel(rcomm) < icomm(1)
      rcomm(icomm(1)) = 0;
    end
    if numel(icomm) < icomm(2)
      icomm(icomm(2)) = 0;
    end
  end
end

if ifail == 0
  % Call NAG routine again to calculate quantiles specified in vector q
  ind = int64(3);
  [ind, np, qv, rcomm, icomm, ifail] = ...
    nag_stat_quantiles_stream_arbitrary(ind, 0, eps, q, rcomm, icomm);
else
  fprintf('\nAn error occurred processing the observations\n');
  fprintf('ifail = %d\n', ifail);
end

% Print the results
if ifail == 0
  fprintf('\nInput Data:\n %d observations\n eps = %5.2f\n\n', n, eps);
  fprintf('Quantile     Result\n');
  for i = 1:numel(q)
    fprintf('%7.2f     %7.2f\n', q(i), qv(i));
  end
else
 fprintf('\nAn error occurred calculating the quantiles, ifail = %d\n', ifail);
end
 

Input Data:
 60 observations
 eps =  0.20

Quantile     Result
   0.25       22.49
   0.50       39.54
   1.00       59.95

function g01ap_example
n = 0;
eps = 0.2;
q = [0.25; 0.5; 1.0];
nb = 20;
% For this example we are using a string as the source of data.
datastream = strcat('34.01 57.95 44.88 22.04 28.84 4.43 0.32 20.82 20.53 ', ...
' 13.08  7.99 54.03 23.21 26.73 39.72  0.97 39.05 38.78 19.38 51.34 24.08 ', ...
' 12.41 58.11 35.90 40.38 27.41 19.80  6.02 45.33 36.34 43.14 53.84 39.49 ', ...
'  9.04 36.74 58.72 59.95 15.41 33.05 39.54 33.24 58.67 54.12 39.48 43.73 ', ...
' 24.15 55.72  8.87 40.47 46.18 20.36  6.95 36.86 49.24 56.83 43.87 29.86 ', ...
' 22.49 25.29 33.17');
rcomm = zeros(100, 1);
icomm = zeros(400, 1, 'int64');
ind = int64(0);
repeat = true;
pos    = 0;
while (repeat)
  if (ind==0 || ind==1)
    [rv, new_pos] = textscan(datastream(pos+1:end), '%f', nb);
    pos = pos+new_pos;

    nd = numel(rv{1});
    if nd == 0
      break;
    elseif nd < nb
      nb = nd;
      repeat = false;
    elseif pos == numel(datastream)
      repeat = false;
    end
    n = n+nb;
  end

  [ind, np, qv, rcomm, icomm, ifail] = g01ap(ind, rv{1}, eps, q, rcomm, icomm);

  if ifail ~= 0
    break;
  end

  % If ind=2, the communication arrays are too small, so extend them
  % and call the routine again with the same rv
  if ind == 2
    if numel(rcomm) < icomm(1)
      rcomm(icomm(1)) = 0;
    end
    if numel(icomm) < icomm(2)
      icomm(icomm(2)) = 0;
    end
  end
end

if ifail == 0
  % Call NAG routine again to calculate quantiles specified in vector q
  ind = int64(3);
  [ind, np, qv, rcomm, icomm, ifail] = g01ap(ind, 0, eps, q, rcomm, icomm);
else
  fprintf('\nAn error occurred processing the observations\n');
  fprintf('ifail = %d\n', ifail);
end

% Print the results
if ifail == 0
  fprintf('\nInput Data:\n %d observations\n eps = %5.2f\n\n', n, eps);
  fprintf('Quantile     Result\n');
  for i = 1:numel(q)
    fprintf('%7.2f     %7.2f\n', q(i), qv(i));
  end
else
 fprintf('\nAn error occurred calculating the quantiles, ifail = %d\n', ifail);
end
 

Input Data:
 60 observations
 eps =  0.20

Quantile     Result
   0.25       22.49
   0.50       39.54
   1.00       59.95


PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2013