# NAG Toolbox: nag_rand_subsamp_xyw (g05pw)

## Purpose

nag_rand_subsamp_xyw (g05pw) generates a dataset suitable for use with repeated random sub-sampling validation.

## Syntax

[state, sx, sy, sw, errbuf, ifail] = g05pw(nt, x, state, 'n', n, 'm', m, 'sordx', sordx, 'y', y, 'w', w, 'sordsx', sordsx)
[state, sx, sy, sw, errbuf, ifail] = nag_rand_subsamp_xyw(nt, x, state, 'n', n, 'm', m, 'sordx', sordx, 'y', y, 'w', w, 'sordsx', sordsx)

## Description

Let ${X}_{o}$ denote a matrix of $n$ observations on $m$ variables and ${y}_{o}$ and ${w}_{o}$ each denote a vector of length $n$. For example, ${X}_{o}$ might represent a matrix of independent variables, ${y}_{o}$ the dependent variable and ${w}_{o}$ the associated weights in a weighted regression.
nag_rand_subsamp_xyw (g05pw) generates a series of training datasets, denoted by the matrix, vector, vector triplet $\left({X}_{t},{y}_{t},{w}_{t}\right)$ of ${n}_{t}$ observations, and validation datasets, denoted $\left({X}_{v},{y}_{v},{w}_{v}\right)$ with ${n}_{v}$ observations. These training and validation datasets are generated by randomly assigning each observation to either the training dataset or the validation dataset.
The resulting datasets are suitable for use with repeated random sub-sampling validation.
One of the initialization functions nag_rand_init_repeat (g05kf) (for a repeatable sequence if computed sequentially) or nag_rand_init_nonrepeat (g05kg) (for a non-repeatable sequence) must be called prior to the first call to nag_rand_subsamp_xyw (g05pw).
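The idea of a random training/validation split can be sketched in plain MATLAB. The snippet below is an illustration only: it uses MATLAB's built-in `randperm` rather than a NAG generator, the sizes and data are made up, and it is not claimed to reproduce the assignments that nag_rand_subsamp_xyw (g05pw) would return.

```matlab
% Illustrative sketch only: plain MATLAB, not the NAG implementation
n  = 10;  m = 3;          % hypothetical numbers of observations and variables
nt = 7;                   % hypothetical training size, so nv = n - nt = 3
xo = rand(n, m);          % stand-in for X_o, one observation per row
yo = rand(n, 1);          % stand-in for y_o
wo = ones(n, 1);          % stand-in for w_o

% Shuffle the observation indices once and reuse the same ordering for
% X, y and w so that each observation keeps its own response and weight
perm = randperm(n);
sx = xo(perm, :);
sy = yo(perm);
sw = wo(perm);

% The first nt shuffled observations form the training dataset and the
% remaining n - nt form the validation dataset
xt = sx(1:nt, :);      yt = sy(1:nt);      wt = sw(1:nt);
xv = sx(nt+1:n, :);    yv = sy(nt+1:n);    wv = sw(nt+1:n);

fprintf('training: %d observations, validation: %d observations\n', ...
        size(xt, 1), size(xv, 1));
```

Applying one permutation to all three arrays is the design point here: the rows of the data, the responses and the weights stay aligned after the shuffle.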

## References

None.

## Parameters

### Compulsory Input Parameters

1:     $\mathrm{nt}$ – int64/int32/nag_int scalar
${n}_{t}$, the number of observations in the training dataset.
Constraint: $1\le {\mathbf{nt}}\le {\mathbf{n}}$.
2:     $\mathrm{x}\left(\mathit{ldx},:\right)$ – double array
The first dimension, $\mathit{ldx}$, of the array x must satisfy
• if ${\mathbf{sordx}}=2$, $\mathit{ldx}\ge {\mathbf{m}}$;
• otherwise $\mathit{ldx}\ge {\mathbf{n}}$.
The second dimension of the array x must be at least ${\mathbf{m}}$ if ${\mathbf{sordx}}=1$ and at least ${\mathbf{n}}$ if ${\mathbf{sordx}}=2$.
The way the data is stored in x is defined by sordx.
If ${\mathbf{sordx}}=1$, ${\mathbf{x}}\left(\mathit{i},\mathit{j}\right)$ contains the $\mathit{i}$th observation for the $\mathit{j}$th variable, for $i=1,2,\dots ,{\mathbf{n}}$ and $j=1,2,\dots ,{\mathbf{m}}$.
If ${\mathbf{sordx}}=2$, ${\mathbf{x}}\left(\mathit{j},\mathit{i}\right)$ contains the $\mathit{i}$th observation for the $\mathit{j}$th variable, for $i=1,2,\dots ,{\mathbf{n}}$ and $j=1,2,\dots ,{\mathbf{m}}$.
${X}_{o}$, the values of $X$ for the original dataset. This may be the array returned in sx by a previous call to nag_rand_subsamp_xyw (g05pw).
3:     $\mathrm{state}\left(:\right)$ – int64/int32/nag_int array
Note: the actual argument supplied must be the array state supplied to the initialization routines nag_rand_init_repeat (g05kf) or nag_rand_init_nonrepeat (g05kg).
Contains information on the selected base generator and its current state.

### Optional Input Parameters

1:     $\mathrm{n}$ – int64/int32/nag_int scalar
Default:
• if ${\mathbf{sordx}}=2$, the second dimension of the array x;
• otherwise the first dimension of the array x.
$n$, the number of observations.
Constraint: ${\mathbf{n}}\ge 1$.
2:     $\mathrm{m}$ – int64/int32/nag_int scalar
Default:
• if ${\mathbf{sordx}}=2$, the first dimension of the array x;
• otherwise the second dimension of the array x.
$m$, the number of variables.
Constraint: ${\mathbf{m}}\ge 1$.
3:     $\mathrm{sordx}$ – int64/int32/nag_int scalar
Default: $1$
Determines how variables are stored in x.
Constraint: ${\mathbf{sordx}}=1$ or $2$.
4:     $\mathrm{y}\left(\mathit{ly}\right)$ – double array
Optionally, ${y}_{o}$, the values of $y$ for the original dataset. This may be the vector returned in sy by a previous call to nag_rand_subsamp_xyw (g05pw).
5:     $\mathrm{w}\left(\mathit{lw}\right)$ – double array
Optionally, ${w}_{o}$, the values of $w$ for the original dataset. This may be the vector returned in sw by a previous call to nag_rand_subsamp_xyw (g05pw).
6:     $\mathrm{sordsx}$ – int64/int32/nag_int scalar
Default: ${\mathbf{sordx}}$
Determines how variables are stored in sx.
Constraint: ${\mathbf{sordsx}}=1$ or $2$.
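As a small illustration of the two storage layouts, the following plain MATLAB sketch builds the same made-up dataset both ways. The variable names `x1`, `x2`, `n_in` and `m_in` are hypothetical, and reading $n$ and $m$ off the dimensions of x is an assumption based on the dimension requirements stated for x.

```matlab
n = 5;  m = 2;                       % hypothetical sizes
x1 = reshape(1:n*m, n, m);           % sordx = 1 layout: x1(i,j), n-by-m
x2 = x1.';                           % sordx = 2 layout: x2(j,i), m-by-n

% Observation i = 3, variable j = 2 is the same value in both layouts
i = 3;  j = 2;
same_value = (x1(i, j) == x2(j, i));

% Recover n and m from the array dimensions for a given sordx
sordx = 2;
if sordx == 2
    m_in = size(x2, 1);  n_in = size(x2, 2);
else
    n_in = size(x1, 1);  m_in = size(x1, 2);
end
fprintf('same value: %d, n = %d, m = %d\n', same_value, n_in, m_in);
```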

### Output Parameters

1:     $\mathrm{state}\left(:\right)$ – int64/int32/nag_int array
Contains updated information on the state of the generator.
2:     $\mathrm{sx}\left(\mathit{ldsx},:\right)$ – double array
The first dimension, $\mathit{ldsx}$, of the array sx will be
• if ${\mathbf{sordsx}}=1$, $\mathit{ldsx}={\mathbf{n}}$;
• if ${\mathbf{sordsx}}=2$, $\mathit{ldsx}={\mathbf{m}}$.
The second dimension of the array sx will be ${\mathbf{m}}$ if ${\mathbf{sordsx}}=1$ and ${\mathbf{n}}$ otherwise.
The way the data is stored in sx is defined by sordsx.
If ${\mathbf{sordsx}}=1$, ${\mathbf{sx}}\left(\mathit{i},\mathit{j}\right)$ contains the $\mathit{i}$th observation for the $\mathit{j}$th variable, for $i=1,2,\dots ,{\mathbf{n}}$ and $j=1,2,\dots ,{\mathbf{m}}$.
If ${\mathbf{sordsx}}=2$, ${\mathbf{sx}}\left(\mathit{j},\mathit{i}\right)$ contains the $\mathit{i}$th observation for the $\mathit{j}$th variable, for $i=1,2,\dots ,{\mathbf{n}}$ and $j=1,2,\dots ,{\mathbf{m}}$.
sx holds the values of $X$ for the training and validation datasets, with ${X}_{t}$ held in observations $1$ to ${\mathbf{nt}}$ and ${X}_{v}$ in observations ${\mathbf{nt}}+1$ to ${\mathbf{n}}$.
3:     $\mathrm{sy}\left(\mathit{lsy}\right)$ – double array
If y is supplied then sy holds the values of $y$ for the training and validation datasets, with ${y}_{t}$ held in elements $1$ to ${\mathbf{nt}}$ and ${y}_{v}$ in elements ${\mathbf{nt}}+1$ to ${\mathbf{n}}$.
4:     $\mathrm{sw}\left(\mathit{lsw}\right)$ – double array
If w is supplied then sw holds the values of $w$ for the training and validation datasets, with ${w}_{t}$ held in elements $1$ to ${\mathbf{nt}}$ and ${w}_{v}$ in elements ${\mathbf{nt}}+1$ to ${\mathbf{n}}$.
5:     $\mathrm{errbuf}$ – string (length ≥ 200)
6:     $\mathrm{ifail}$ – int64/int32/nag_int scalar
${\mathbf{ifail}}={\mathbf{0}}$ unless the function detects an error (see Error Indicators and Warnings).
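Given the layout described above, the training and validation parts can be picked out of the outputs by indexing. The sketch below uses made-up stand-ins for sx, sy and sw (no NAG call is made) and shows the slicing for both values of sordsx; the names `xt`, `xv`, `yt`, `yv`, `wt` and `wv` are illustrative.

```matlab
n = 6;  m = 2;  nt = 4;              % hypothetical sizes
sordsx = 2;                          % observations stored as columns
sx = rand(m, n);                     % stand-in for sx with sordsx = 2
sy = rand(n, 1);                     % stand-in for sy
sw = ones(n, 1);                     % stand-in for sw

if sordsx == 1
    xt = sx(1:nt, :);    xv = sx(nt+1:n, :);   % observations are rows
else
    xt = sx(:, 1:nt);    xv = sx(:, nt+1:n);   % observations are columns
end
yt = sy(1:nt);    yv = sy(nt+1:n);   % responses follow the same split
wt = sw(1:nt);    wv = sw(nt+1:n);   % and so do the weights
```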

## Error Indicators and Warnings

Errors or warnings detected by the function:
${\mathbf{ifail}}=11$
Constraint: $1\le {\mathbf{nt}}\le {\mathbf{n}}$.
${\mathbf{ifail}}=21$
Constraint: ${\mathbf{n}}\ge 1$.
${\mathbf{ifail}}=31$
Constraint: ${\mathbf{m}}\ge 1$.
${\mathbf{ifail}}=41$
Constraint: ${\mathbf{sordx}}=1$ or $2$.
${\mathbf{ifail}}=61$
Constraint: if ${\mathbf{sordx}}=1$, $\mathit{ldx}\ge {\mathbf{n}}$.
${\mathbf{ifail}}=62$
Constraint: if ${\mathbf{sordx}}=2$, $\mathit{ldx}\ge {\mathbf{m}}$.
${\mathbf{ifail}}=111$
On entry, state vector has been corrupted or not initialized.
${\mathbf{ifail}}=141$
Constraint: ${\mathbf{sordsx}}=1$ or $2$.
${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please contact NAG.
${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
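The argument constraints listed above can be checked before calling the function. The following is a hypothetical pre-check written in plain MATLAB: it maps each documented constraint to the corresponding ifail value, but the order of the checks is an assumption and it does not cover the state, licence or memory errors.

```matlab
% Hypothetical pre-check mirroring the documented constraints; the order
% of the checks is an assumption, not taken from the NAG implementation
nt = 32;  n = 40;  m = 4;            % sizes matching the example data
sordx = 1;  sordsx = 1;
ldx = 40;                            % first dimension of x

ifail = 0;
if n < 1
    ifail = 21;                      % n >= 1
elseif nt < 1 || nt > n
    ifail = 11;                      % 1 <= nt <= n
elseif m < 1
    ifail = 31;                      % m >= 1
elseif ~(sordx == 1 || sordx == 2)
    ifail = 41;                      % sordx = 1 or 2
elseif sordx == 1 && ldx < n
    ifail = 61;                      % ldx >= n when sordx = 1
elseif sordx == 2 && ldx < m
    ifail = 62;                      % ldx >= m when sordx = 2
elseif ~(sordsx == 1 || sordsx == 2)
    ifail = 141;                     % sordsx = 1 or 2
end
fprintf('ifail = %d\n', ifail);
```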

## Accuracy

Not applicable.

## Further Comments

nag_rand_subsamp_xyw (g05pw) will be computationally more efficient if each observation in x is contiguous, that is ${\mathbf{sordx}}=2$ and ${\mathbf{sordsx}}=2$.
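MATLAB stores arrays in column-major order, so when observations are stored as columns each one occupies a contiguous run of $m$ elements. The sketch below illustrates this with made-up data by listing the linear (memory-order) indices of one observation in each layout; it says nothing about how large the performance difference is in practice.

```matlab
n = 4;  m = 3;                       % hypothetical sizes
x1 = reshape(1:n*m, n, m);           % sordx = 1: one observation per row
x2 = x1.';                           % sordx = 2: one observation per column

i = 2;                               % pick an observation
% Linear indices of observation i in each layout
idx2 = sub2ind(size(x2), 1:m, i*ones(1, m));   % consecutive: 4 5 6
idx1 = sub2ind(size(x1), i*ones(1, m), 1:m);   % stride n:    2 6 10

disp(idx2);
disp(idx1);
```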

## Example

This example uses nag_rand_subsamp_xyw (g05pw) to facilitate repeated random sub-sampling cross-validation.
A set of simulated data is randomly split into training and validation datasets. nag_correg_glm_binomial (g02gb) is used to fit a logistic regression model to each training dataset and then nag_correg_glm_predict (g02gp) is used to predict the response for the observations in the validation dataset. This process is repeated $10$ times.
The counts of true and false positives and negatives along with the sensitivity and specificity are then reported.
```
function g05pw_example

fprintf('g05pw example results\n\n');

% Fit a logistic regression model using g02gb and predict values using g02gp
% (binomial error, logistic link, with an intercept)
mean = 'M';
errfn = 'B';

% Not using the predicted standard errors
vfobs = false;

% Independent variables
x = [ 0.0 -0.1  0.0  1.0;   0.4 -1.1  1.0  1.0;  -0.5  0.2  1.0  0.0;
0.6  1.1  1.0  0.0;  -0.3 -1.0  1.0  1.0;   2.8 -1.8  0.0  1.0;
0.4 -0.7  0.0  1.0;  -0.4 -0.3  1.0  0.0;   0.5 -2.6  0.0  0.0;
-1.6 -0.3  1.0  1.0;   0.4  0.6  1.0  0.0;  -1.6  0.0  1.0  1.0;
0.0  0.4  1.0  1.0;  -0.1  0.7  1.0  1.0;  -0.2  1.8  1.0  1.0;
-0.9  0.7  1.0  1.0;  -1.1 -0.5  1.0  1.0;  -0.1 -2.2  1.0  1.0;
-1.8 -0.5  1.0  1.0;  -0.8 -0.9  0.0  1.0;   1.9 -0.1  1.0  1.0;
0.3  1.4  1.0  1.0;   0.4 -1.2  1.0  0.0;   2.2  1.8  1.0  0.0;
1.4 -0.4  0.0  1.0;   0.4  2.4  1.0  1.0;  -0.6  1.1  1.0  1.0;
1.4 -0.6  1.0  1.0;  -0.1 -0.1  0.0  0.0;  -0.6 -0.4  0.0  0.0;
0.6 -0.2  1.0  1.0;  -1.8 -0.3  1.0  1.0;  -0.3  1.6  1.0  1.0;
-0.6  0.8  0.0  1.0;   0.3 -0.5  0.0  0.0;   1.6  1.4  1.0  1.0;
-1.1  0.6  1.0  1.0;  -0.3  0.6  1.0  1.0;  -0.6  0.1  1.0  1.0;
1.0  0.6  1.0  1.0];

% Dependent variable
y = [0;1;0;0;0;0;1;1;1;0;0;1;1;0;0;0;0;1;1;1;
1;0;1;1;1;0;0;1;0;0;1;1;0;0;1;0;0;0;0;1];

% Each observation represents a single trial
t = ones(size(x,1),1);

% Include all independent variables in the model
isx = int64(ones(size(x,2),1));
ip = int64(sum(isx) + (upper(mean(1:1)) == 'M'));

% In order to use cross-validation we need to initialise the random
% number generator (using L'Ecuyer's MRG32k3a and a repeatable sequence)
seed = int64(42321);
genid = int64(6);
subid = int64(0);
[state,ifail] = g05kf(genid,subid,seed);

% generate 10 random sub-samples
nsamp = int64(10);

% size of sub-samples to generate (size of training dataset)
nt = int64(32);
% size of validation dataset is n - nt

% Some of the routines used in this example issue warnings, but return
% sensible results, so save current warning state and turn warnings on
warn_state = nag_issue_warnings();
nag_issue_warnings(true);

tn = 0;
fn = 0;
fp = 0;
tp = 0;

%  Loop over each sample
for i = 1:nsamp

% Split the data into training and validation datasets
[state,x,y,t,ifail] = g05pw( ...
nt,x,state,'y',y,'w',t);
if (ifail~=0)
break
end

% Call routine to fit generalized linear model, with Binomial errors
% to training data (the first nt values in x)
[~,~,b,~,~,cov,~,ifail] = g02gb( ...
if (ifail~=0 & ifail < 6)
break
end

% Predict the response for the observations in the validation dataset
[~,~,pred,~,ifail] = g02gp( ...
vfobs, 't',t(nt+1:end));
if (ifail~=0)
break
end

% Cross-tabulate the observed and predicted values
obs_val = ceil(y(nt+1:end) + 0.5);
pred_val = (pred >= 0.5) + 1;
count = zeros(2,2);
for k = 1:size(pred_val,1)
count(pred_val(k),obs_val(k)) = count(pred_val(k),obs_val(k)) + 1;
end

% Extract the true/false negatives/positives
tn = tn + count(1,1);
fn = fn + count(1,2);
fp = fp + count(2,1);
tp = tp + count(2,2);
end

% Reset the warning state to its initial value
nag_issue_warnings(warn_state);

np = tp + fn;
nn = fp + tn;

fprintf('                       Observed\n');
fprintf('             --------------------------\n');
fprintf(' Predicted | Negative  Positive   Total\n');
fprintf(' --------------------------------------\n');
fprintf(' Negative  | %5d     %5d     %5d\n', tn, fn, tn + fn);
fprintf(' Positive  | %5d     %5d     %5d\n', fp, tp, fp + tp);
fprintf(' Total     | %5d     %5d     %5d\n', nn, np, nn + np);
fprintf('\n');

if (np~=0)
fprintf(' True Positive Rate (Sensitivity): %4.2f\n', tp / np);
else
fprintf(' True Positive Rate (Sensitivity): No positives in data\n');
end
if (nn~=0)
fprintf(' True Negative Rate (Specificity): %4.2f\n', tn / nn);
else
fprintf(' True Negative Rate (Specificity): No negatives in data\n');
end

```
```
g05pw example results

                       Observed
             --------------------------
 Predicted | Negative  Positive   Total
 --------------------------------------
 Negative  |    38        20        58
 Positive  |     8        14        22
 Total     |    46        34        80

 True Positive Rate (Sensitivity): 0.41
 True Negative Rate (Specificity): 0.83
```
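The reported rates follow directly from the table. As a quick check, the sketch below recomputes them from the four counts copied from the output above, using the same variable names as the example program.

```matlab
% Counts copied from the example output
tn = 38;  fn = 20;  fp = 8;  tp = 14;

np = tp + fn;                 % observed positives: 34
nn = fp + tn;                 % observed negatives: 46

sensitivity = tp / np;        % 14/34
specificity = tn / nn;        % 38/46

fprintf(' True Positive Rate (Sensitivity): %4.2f\n', sensitivity);  % 0.41
fprintf(' True Negative Rate (Specificity): %4.2f\n', specificity);  % 0.83
```

The total of 80 validation observations is consistent with 10 repetitions of a validation set of size $n-{n}_{t}=40-32=8$.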