Pagina's

2013/12/30

MWC58 Fast RNG small period part 4

/*
"MWC" stands for "Multiply With Carry", "58" stands for the period: > 2^58 
Actually the period is larger than 2^60, the first version had a smaller period.
A big drawback: The usable period is a lot smaller. Despite this drawback,
MWC58 might be usefull:
-128 non correlated sequences of  random numbers,
 each sequence contains at least 590 million numbers,
 sequences are repeatable, initialisation is fast.
-seed another RNG

It's  based on: Marsaglia's message(1998)
    
MWC58:
  
    private static uint mwc58()
    {
        z0 = m0 * (z0 & 65535) + (z0 >> 16);   //  both (m0<<15)-1 and (m0<<16)-1 are prime
        z1 = m1 * (z1 & 65535) + (z1 >> 16);   //  both (m1<<15)-1 and (m1<<16)-1 are prime
        return z0 + (z1 << 16);
    }

Why is this fast? From a publication by Mark A. Overton(2011):
 
    Fast, High-Quality, Parallel Random Number Generators
    
Two independent generators are used: z0 = m0 * ... , z1 = m1 * ...   
The compiler places instructions in an order that causes the processor to 
pipeline the computations of the generators. As a result of this parallelizing, 
the generators execute even faster than expected. The second multiply, z1 = m1 * ...
only adds one more clock cycle.  

The period "p0" of the first generator is: p0 = m0 * 2^15 ,
with a seed value "s0": 0 < s0 <= 2 * p0 - 2 .
The period of MWC58: m0 * m1 * 2^30 .
The usable period is a lot smaller. The 16 lsb's (least significant bits) are
the 16 lsb's of the first generator: z0 , with period: p0 = m0 * 2^15.
The most significant bits depend much on the second generator: z1 ,
where the highest bits have the largest dependencies.
For the smallest value of m0: 18030, MWC58 generates 18030 * 2^15 (~5.9 * 10^8)
random numbers in ~2.25 seconds. In other words: When 5.9 * 10^8 random numbers
are needed, MWC58 generates them fast.

With 18030 <= m <= 65184 , there are 256 valid multipliers:
    ushort[] m = { 18030 ... 65184 };
With fixed multipliers it is possible to use a 32 bits seed value,
from which valid values for s0 and s1 can be dereived.
When we want different sequences, using different seed values,
to be more divergent (less correlated), it is better to use the seed for
indices in the "ushort[] m" array.

Off topic: The k-th (ordered) combination of { 0, 1, 2, 3, ... , n } , or
the k-th 2-element subset of the set { 0, 1, 2, 3, ... , n }
Triangular numbers and their inverse are involved.


 
    private static uint[] tri_com(uint k, uint n) // n > 0                //     n = 3   
    {                                                                     //   k | i | j 
        uint t, i, j;                                                     //  ---|---|---
        t = n * (n + 1) / 2;                                              //   0 | 0 | 1 
        k %= t;                                                           //   1 | 0 | 2 
        i = n - 1 - (uint)((Math.Sqrt(8 * (t - k - 1) + 1) - 1) / 2);     //   2 | 0 | 3 
        j = (i * (i + 1) / 2 + k) % n + 1;                                //   3 | 1 | 2 
        uint[] ij = { i, j };                                             //   4 | 1 | 3 
        return ij;                                                        //   5 | 2 | 3 
    }                                                                     //   6 | 0 | 1 
                                                                       
When combined indices are used like: 0,3 1,2 (2 combinations)
random sequences generated by those combinations are less (not?) correlated.
With 256 multipliers there are 128 combinations.
 
The scaled version:

"mwc_uint(u)" returns a value x ( 0 <= x <= u ).
Before I used a trick to save random bits, which is not necessary any longer.
An example: mwc_uint(1) returns 0 or 1.
Under the hood 32 random bits are/were generated, they were put in a buffer,
from which the bits were taken out, one by one.
That required some code: buffer empty? refill, etc.
Some code takes some time.
Now one bit is returned, 31 remaining bits are not used,
it is faster to generate 32 new bits, and to use just one of them.
The rather small period is a bottleneck, a solution: CMR63

Results:
    
    10^7 times rand.Next() takes: 107 ms (~11 ns per random number)
    10^7 times rand.Next(int someValue) takes: 236 ms (~24 ns ...)
    
    |--------|--------------------------------------------------------|      
    |        |                        ms 10^7 *                       |      rand.Next()    11 ns      
    |        |-------------|--------------|-------------|-------------|          mwc58()     4 ns
    |      u | mwc_uint(u) | well_uint(u) | gen_uint(u) | g_uint_c(u) |     rand.Next(i)    24 ns                 
    |--------|------|------|-------|------|-------|-----|------|------|      mwc_uint(u)    18 ns, mean  ~12 ns
    |      0 |   20 |      |    24 |      |    24 |     |   24 |      |     well_uint(u)    44 ns, mean  ~32 ns  
    |      1 |      |  51  |       |   74 |       |  73 |      |  139 |      gen_uint(u)    50 ns, mean  ~35 ns   
    |      2 |   96 |      |   121 |      |   124 |     |  296 |      |      g_uint_c(u)   280 ns, mean ~250 ns
    |      3 |      |  51  |       |   80 |       |  78 |      |  211 |       
    |      4 |  133 |      |   222 |      |   242 |     |  518 |      |                 
    |      7 |      |  51  |       |   82 |       |  83 |      |  282 |               
    |      8 |  153 |      |   272 |      |   290 |     |  680 |      |     
    |  2^7-1 |      |  51  |       |   98 |       | 102 |      |  562 |     
    |  2^7-0 |  172 |      |   344 |      |   380 |     | 1079 |      |           
    | 2^15-1 |      |  51  |       |  131 |       | 130 |      | 1123 |               
    | 2^15-0 |  174 |      |   379 |      |   414 |     | 1634 |      |               
    | 2^30-1 |      |  51  |       |  193 |       | 193 |      | 2164 |  
    | 2^30-0 |  175 |      |   431 |      |   494 |     | 2672 |      |  
    | 2^31-1 |      |  51  |       |  195 |       | 210 |      | 2251 |  
    | 2^31-0 |  175 |      |   427 |      |   461 |     | 2747 |      |  
    | 2^32-1 |      |  51  |       |  196 |       | 211 |      | 2307 |  
    |--------|------|------|-------|------|-------|-----|------|------| 

http://www.cacert.at/cgi-bin/rngresults  (code: find and hover mouse over "MWC58")
 
|---------|---------|--------|----------|----------|--------|---------|----------|----------|----------|----------|
|  Gene-  |  Speed  | Upload | Entropy  | Birthday | Matrix | 6*8 Ma- | Minimum  | Random   | Squeeze  | Overlap- |
|  rator  | (MiB/s) |  Size  | (->8)    | Spacing  | Ranks  | trix R  | Distance | Spheres  |          | ping Sums|
|-------- |---------|--------|----------|----------|--------|---------|----------|----------|----------|----------|
|   mwc58 |  1000   | 19 MiB | 7.999990 | 0.161056 |  0.504 |  0.920  | 0.877503 | 0.432728 | 0.677887 | 0.458630 |
| well512 |   400   | 19 MiB | 7.999992 | 0.446363 |  0.636 |  0.879  | 0.967495 | 0.690789 | 0.942252 | 0.005987 |
|  Random |   370   | 17 MiB | 7.999989 | 0.000000 |  0.392 |  0.136  | 0.063920 | 0.905097 | 0.098440 | 0.005467 |
|     CSP |    34   | 15 MiB | 7.999987 | 0.032102 |  0.084 |  0.942  | 0.894277 | 0.036221 | 0.706538 | 0.023523 |
|-------- |---------|--------|----------|----------|--------|---------|----------|----------|----------|----------|

*/

using System;
class mwc
{
    private static uint m0 = 18030, z0 = 325080900, m1 = 65184, z1 = 4248953856;

    private static uint mwc58()
    {
        z0 = m0 * (z0 & 65535) + (z0 >> 16);
        z1 = m1 * (z1 & 65535) + (z1 >> 16);
        return z0 + (z1 << 16);
    }

    private static uint g_u = 0;
    private static int g_s = 0;

    private static uint mwc_uint(uint u)
    {
        uint x;
        if (u == 0) return 0;
        if (u != g_u)
        {
            g_u = u;
            g_s = 32 - nob(u);
        }
        do
        {
            z0 = m0 * (z0 & 65535) + (z0 >> 16);
            z1 = m1 * (z1 & 65535) + (z1 >> 16);
            x = (z0 + (z1 << 16)) >> g_s;
        }
        while (x > u);
        return x;
    }

    private static void init(uint seed)
    {
        ushort[] m = {
            18030, 18273, 18513, 18879, 19074, 19098, 19164, 19215, 19584, 19599,
            19950, 20088, 20508, 20544, 20664, 20814, 20970, 21153, 21243, 21423,
            21723, 21954, 22125, 22188, 22293, 22860, 22938, 22965, 22974, 23109,
            23124, 23163, 23208, 23508, 23520, 23553, 23658, 23865, 24114, 24219,
            24660, 24699, 24864, 24948, 25023, 25308, 25443, 26004, 26088, 26154,
            26550, 26679, 26838, 27183, 27258, 27753, 27795, 27810, 27834, 27960,
            28320, 28380, 28689, 28710, 28794, 28854, 28959, 28980, 29013, 29379,
            29889, 30135, 30345, 30459, 30714, 30903, 30963, 31059, 31083, 31215,
            31353, 31488, 31743, 32430, 32718, 33105, 33189, 33249, 33375, 33378,
            33663, 33768, 33858, 33894, 34158, 34323, 34383, 34590, 34653, 34890,
            35355, 35523, 35643, 36309, 36594, 36804, 36969, 37698, 37935, 37959,
            38079, 38223, 38283, 38484, 38568, 38610, 38649, 38733, 38850, 39444,
            39618, 39690, 39948, 40833, 40995, 41019, 41064, 41289, 41628, 41793,
            41874, 42153, 42444, 42513, 42594, 42633, 42699, 42819, 42903, 42975,
            43038, 43155, 43473, 43563, 43995, 44019, 44568, 44574, 44994, 45723,
            45729, 45780, 45789, 45915, 45939, 46515, 47088, 47529, 48015, 48033,
            48195, 48204, 48393, 49209, 49248, 49299, 49458, 50034, 50223, 50580,
            50589, 50694, 50853, 50988, 51198, 51558, 51618, 51729, 51744, 51813,
            51873, 51933, 52023, 52215, 52275, 52509, 52743, 52950, 53130, 53199,
            53529, 53709, 53898, 53934, 53958, 54144, 54168, 54399, 54474, 54564,
            54885, 55044, 55074, 55179, 55254, 55680, 55809, 55848, 55869, 56205,
            56538, 56604, 56790, 56859, 57039, 57204, 57225, 57525, 57603, 57774,
            57780, 57918, 58149, 58368, 58443, 58758, 59253, 59325, 59775, 60009,
            60060, 60489, 60735, 60990, 61140, 61578, 61914, 62505, 62634, 62778,
            62790, 62865, 62874, 62904, 63129, 63273, 63444, 63663, 63765, 63885,
            64185, 64314, 64455, 64545, 64860, 65184 };

        seed &= 127; m0 = m[seed]; z0 = m0 * m0;
        seed ^= 255; m1 = m[seed]; z1 = m1 * m1;
    }

    private static int nob(uint u) // number of bits, position of the highest set bit, u > 0  
    {
        return
        u < 1u << 16 ? u < 1u << 08 ? u < 1u << 04 ? u < 1u << 02 ? u < 1u << 01 ? 01 : 02 :
                                                                    u < 1u << 03 ? 03 : 04 :
                                                     u < 1u << 06 ? u < 1u << 05 ? 05 : 06 :
                                                                    u < 1u << 07 ? 07 : 08 :
                                      u < 1u << 12 ? u < 1u << 10 ? u < 1u << 09 ? 09 : 10 :
                                                                    u < 1u << 11 ? 11 : 12 :
                                                     u < 1u << 14 ? u < 1u << 13 ? 13 : 14 :
                                                                    u < 1u << 15 ? 15 : 16 :
                       u < 1u << 24 ? u < 1u << 20 ? u < 1u << 18 ? u < 1u << 17 ? 17 : 18 :
                                                                    u < 1u << 19 ? 19 : 20 :
                                                     u < 1u << 22 ? u < 1u << 21 ? 21 : 22 :
                                                                    u < 1u << 23 ? 23 : 24 :
                                      u < 1u << 28 ? u < 1u << 26 ? u < 1u << 25 ? 25 : 26 :
                                                                    u < 1u << 27 ? 27 : 28 :
                                                     u < 1u << 30 ? u < 1u << 29 ? 29 : 30 :
                                                                    u < 1u << 31 ? 31 : 32;
    }

    static void Main()
    {
        // test ( part 1 ? )
    }    
}

No comments:

Post a Comment