Intel® Math Kernel Library Cookbook

Multiple simple random sampling without replacement

13

Goal

Generate K>>1 simple random length-M samples without replacement from a population of size N (1 ≤M≤N).

Solution

For exact definitions and more details of the problem, see [

SRSWOR

].

Use the following implementation of a partial Fisher-Yates Shuffle algorithm [ KnuthV2

] and Intel MKL random number generators (RNG) to generate each sample:

Partial Fisher-Yates Shuffle algorithm

A2.1: (Initialization step) let PERMUT_BUF contain natural numbers 1, 2, ..., N

A2.2: for i from 1 to M do:

A2.3: generate random integer X uniform on {i,...,N}

A2.4: interchange PERMUT_BUF[i] and PERMUT_BUF[X]

A2.5: (Copy step) for i from 1 to M do: RESULTS_ARRAY[i]=PERMUT_BUF[i]

End.

The program that implements the algorithm conducts 11 969 664 experiments. Each experiment, which generates a sequence of M unique random natural numbers from 1 to N, is actually a partial length-M random shuffle of the whole population of N elements. Because the main loop of the algorithm works as a real lottery, each experiment is called "lottery M of N" in the program.

The program uses M=6 and N=49, stores result samples (sequences of length M) in a single array

RESULTS_ARRAY

, and uses all available parallel threads.

Source code: see the lottery6of49

folder in the samples archive available at http://software.intel.com/enus/mkl_cookbook_samples.

Parallelization

#pragma omp parallel

{

thr = omp_get_thread_num(); /* the thread index */

VSLStreamStatePtr stream;

/* RNG stream initialization in this thread */

vslNewStream( &stream, VSL_BRNG_MT2203+thr, seed );

... /* Generation of experiment samples (in thread number thr) */

vslDeleteStream( &stream );

}

The code exploits all CPUs with all available processor cores by using the OpenMP*

#pragma parallel directive. The array of experiment results

RESULTS_ARRAY

is broken down into

THREADS_NUM

portions, where

THREADS_NUM

is the number of available CPU threads, and each thread (parallel region) processes its own portion of the array.

77

13

Intel

®

Math Kernel Library Cookbook

Intel MKL basic random number generators with the

VSL_BRNG_MT2203

parameter easily support a parallel independent stream in each thread.

Generation of experiment samples

/* A2.1: Initialization step */

/* Let PERMUT_BUF contain natural numbers 1, 2, ..., N */

for( i=0; i<N; i++ ) PERMUT_BUF[i]=i+1; /* using the set {1,...,N} */

for( sample_num=0; sample_num<EXPERIM_NUM/THREADS_NUM; sample_num++ ){

/* Generate next lottery sample (steps

A2.2

, A2.3

, and A2.4

): */

Fisher_Yates_shuffle(...);

/* A2.5: Copy step

*/

for(i=0; i<M; i++)

RESULTS_ARRAY[thr*ONE_THR_PORTION_SIZE + sample_num*M + i] = PERMUT_BUF[i];

}

This code implements the partial Fisher-Yates Shuffle algorithm in each thread.

In the case of simulating many experiments, the

Initialization step

is only needed once because at the

beginning of each experiment, the order of natural numbers 1...N in the

PERMUT_BUF

array does not matter

(like in a real lottery).

Fisher_Yates_shuffle function

/* A2.2

: for i from 0 to M-1 do */

Fisher_Yates_shuffle (...)

{

for(i=0; i<M; i++) {

/* A2.3

: generate random natural number X from {i,...,N-1} */

j = Next_Uniform_Int(...);

/* A2.4

: interchange PERMUT_BUF[i] and PERMUT_BUF[X] */

tmp = PERMUT_BUF[i];

PERMUT_BUF[i] = PERMUT_BUF[j];

PERMUT_BUF[j] = tmp;

}

}

Each iteration of the loop

A2.2

works as a real lottery step: it extracts a random item

X

from the bin with remaining items

PERMUT_BUF[i], ..., PERMUT_BUF[N]

and puts the item

X

at the end of the results row

PERMUT_BUF[1],...,PERMUT_BUF[i]

. The algorithm is partial because it does not generate the full permutation of length N, but only a part of length M.

NOTE

Unlike the pseudocode that describes the algorithm, the program uses zero-based arrays.

Discussion

In step

A2.3

, the program calls the

Next_Uniform_Int

function to generate the next random integer

X

, uniform on {i, ..., N-1} (see the source code for details). To exploit the full power of vectorized RNGs from

Intel MKL, but to minimize vectorization overheads, the generator must generate a sufficiently large vector

D_UNIFORM01_BUF

of size

RNGBUFSIZE

that fits the L1 cache. Each thread uses its own buffer

D_UNIFORM01_BUF

and the index

D_UNIFORM01_IDX

pointing to after the last used random number from that buffer. In the first call to

Next_Uniform_Int

function (or in the case all random numbers from the buffer have been used), the full buffer of random numbers is regenerated again by calling the vdRngUniform function with the length

RNGBUFSIZE

and the index

D_UNIFORM01_IDX

set to zero (earlier in the program): vdRngUniform( ... RNGBUFSIZE, D_UNIFORM01_BUF ... );

78

Multiple simple random sampling without replacement

13

Because Intel MKL only provides generators of random values with the same distribution, but step

A2.3

requires random integers on different intervals, the buffer is filled with double-precision random numbers uniformly distributed on [0;1) and then, in the

Integer scaling step, these double-precision values are

converted to fit the needed integer intervals: number 0 distributed on {0,...,N-1} = 0 + {0,...,N-1} number 1 distributed on {1,...,N-1} = 1 + {0,...,N-2}

...

number M-1 distributed on {M-1,...,N-1} = M-1 + {0,...,N-M}

(then repeat previous M steps) number M distributed on: see (0) number M+1 distributed on: see (1)

...

number 2*M-1 distributed on: see (M-1)

(then again repeat previous M steps)

...

and so on

Integer scaling

/* Integer scaling step */ for(i=0;i<RNGBUFSIZE/M;i++)

for(k=0;k<M;k++)

I_RNG_BUF[i*M+k] =

k + (unsigned int)(D_UNIFORM01_BUF[i*M+k] * (double)(N-k));

Here

RNGBUFSIZE

is a multiple of M.

See [ SRSWOR

] for performance notes related to this code.

Routines Used

Task

Creates and initializes an RNG stream.

Generates double-precision numbers uniformly distributed over the interval [0;1).

Deletes an RNG stream.

Allocates memory buffers aligned on 64-byte boundaries for the results and population.

Frees memory allocated by mkl_malloc

.

Routine

vslNewStream vdRngUniform vslDeleteStream mkl_malloc mkl_free

79

13

Intel

®

Math Kernel Library Cookbook

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and

SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessordependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

80

Intel® Math Kernel Library Cookbook

Multiple simple random sampling without replacement

Related manuals

Table of contents

Intel® Math Kernel Library Cookbook

Multiple simple random sampling without replacement

Related manuals

APTECH

GAUSS 4

Intel

MKL999LSGE01

Table of contents