With Stéphane Tufféry, we were working this week on a chapter of a book, entitled *Statistical Learning in Actuarial Science*. The chapter should be based on R functions, and we wanted to reproduce some outputs he previously obtained with SAS. The good thing is that even complex functions (logistic regression, regression trees, etc) produce the same kind of outputs. But we found a problem that we could not fix: generating identical training subsets of observations… Out of 1,000 lines, we subsample about 600 lines. The problem is that we could not generate the same sets of indexes with R, and SAS. Even using similar random generators… (execpt if we want to extract 1 or 2 lines, no more).

Let us try to explain what’s going on (based on code produced by Stéphane). According to Eubank (2010), there are (at least) two generators of random numbers in SAS,

- Fishman and Moore (1982) used for function RANUNI
- Mersenne-Twister used for the RAND function, based on Matsumoto and Nishimura (1997)

For instance, for the RAND function, if we generate a Gaussian sample – with Mersenne-Twister algorithm – the code should be

%LET SEED =6; %LET NREP=10; DATA TESTRANDOM ; CALL STREAMINIT(&SEED); DO REP = 1 TO &NREP; X = RAND ('NORMAL'); OUTPUT; END; RUN; PROC PRINT DATA = TESTRANDOM ; RUN ;

And we get here

Obs. |
REP |
X |

1 |
1 | 2.10680 |

2 |
2 | -0.25604 |

3 |
3 | 0.28692 |

4 |
4 | -0.22806 |

5 |
5 | 1.34569 |

6 |
6 | 0.16341 |

7 |
7 | -0.27788 |

8 |
8 | 0.02133 |

9 |
9 | 1.24050 |

10 |
10 | 1.01054 |

If we want a Uniform sample, it should be

%LET SEED =6; %LET NREP=10; DATA TESTRANDM ; CALL STREAMINIT(&SEED); DO REP = 1 TO &NREP; X = RAND ('UNIFORM'); OUTPUT; END; RUN; PROC PRINT DATA = TESTRANDOM ; RUN ;

Obs. |
REP |
X |

1 |
1 | 0.66097 |

2 |
2 | 0.48044 |

3 |
3 | 0.87849 |

4 |
4 | 0.19916 |

5 |
5 | 0.04838 |

6 |
6 | 0.19966 |

7 |
7 | 0.81353 |

8 |
8 | 0.53807 |

9 |
9 | 0.01105 |

10 |
10 | 0.53753 |

On good thing (so far) about the latest, is that Mersenne-Twister has been coded in R, in the RNGkind function

> RNGkind("Mersenne") > set.seed(6) > runif(10) [1] 0.64357277 0.91590888 0.09523258 0.29537280 [5] 0.76993170 0.25589353 0.51789573 0.67784993 [9] 0.14722782 0.70052604

But the output is different, even if we’re supposed to start, here, with the same seed. Now, if we want to make sure about what is done here, let us write our own codes of the Fishman and Moore (1982) algorithm (in order to reproduce the SAS output). The R version of

> a = 397204094 # RANUNI multiplier > seed = 123 # seed > n = 10 # sample size > m = (2^31) - 1 # period > for (i in (1:n-1)) { + seed = (a*seed)%%m + u = seed / m + print(u) + } [1] 0.7503961 [1] 0.3209121 [1] 0.3453204 [1] 0.2683455 [1] 0.241798 [1] 0.9888181 [1] 0.3943279 [1] 0.9710172 [1] 0.001632214 [1] 0.921537

Let us now run a similar code with SAS,

%LET SEED =123; %LET NREP=10; DATA FRANUNI (KEEP = x) ; seed = &SEED ; DO REP = 1 TO &NREP; CALL RANUNI(seed, x); OUTPUT; END; RUN; PROC PRINT DATA = FRANUNI ; RUN ;

and we get the following output

Obs. |
x |

1 |
0.75040 |

2 |
0.32091 |

3 |
0.17839 |

4 |
0.90603 |

5 |
0.35712 |

6 |
0.22111 |

7 |
0.78644 |

8 |
0.39808 |

9 |
0.12467 |

10 |
0.18769 |

It looks like here, indeed, we start with the same seed, since the first two numbers generated are similar. But then, it looks like we really have random numbers… If we change the seed, the first two numbers are similar, but that’s all.

We might be missing something trivial here, but we did not see it. So if anyone has a clue about reproducibility issues when generating random samples, with R and SAS, we are interested !

Bonjour,

En fait, moi je travaille sur un projet qui porte sur la création d’une base de données pour des profils assurés.

Je vous explique le principe:

Dans la base il y a la variable formule qui a 3 observations (F1, F2, F3), sexe (M, F), Département (75, 78, 92) et salaire (3000, 2500, 1600).

Ainsi, dans la base au niveau de la formule variable, on dit que par exemple: la F1 représente 20% des lignes de la base, F2 30% et F3 50% des lignes de la base sur la variable formule.

Pour le sexe, les hommes (M) sont de 60% et les femmes (F) 40% des lignes de la base sur la variable sexe.

Pour le Département, sur un 75 est de 50%, le 78 est 25% et le 92, 25% des lignes de la base sur la variable de la variable Département.

Enfin pour le salaire: ceux qui gagnent 3000 $ sont de 40%, 2500 $ sont de 15% et 1600 $ de 35%.

Le nombre d’observation souhaité est de 5000 obs.

I suspect your problem is somewhere around the treatment of floating point precision. Processors, languages and libraries are all different in how they handle excess precision provided by hardware. This is not normally a problem until you try to do computation that is chaotic, in the sense that a small input perturation requires a large output perturbation.

I suspect you would find some intelligent commentary in the R source code, or failing that, the mailing list.

There are two separate issues here, I think.

1) given a specific algorithm the results from different programming languages should produce the same answers. That we’re seeing different results is disturbing and the core team probably should know as i’m fairly sure that the random number generator for base rnorm, etc is based on Mersenne-Twister. It’s also interesting to see that SAS and Python behave similarly and I wouldn’t be surprised if Christian’s suggestion around treatment of floating point figures is to blame.

2) separately, there is the issue of repeatability in the context of your book, where you want the reader to get the same results. Part of this is presumably resolved by passing a set.seed command before each examples. The only remaining issue (notwithstanding point 1) is if different machines/versions of R produce a different sequance of random numbers, then we’re all lost…

Keep up the good work Arthur, looking forward to reading your book in due course.

D

Hi

The answer to your problem is quite simple: the equivalent of your R program in SAS is the following:

data test;

seed=123;

a=397204094;

m=2**31-1;

do i=1 to 10;

seed=mod(a*seed,m);

u=seed/m;

output;

end;

run;

and the u you’ll obtain will be:

u X

0.7503960881151200 0.7503960881151200

0.3209120548893200 0.3209120250870000

0.0159941157400581 0.1783896489899500

0.2518609036932900 0.9060333813103000

0.0655178083411900 0.3571170775020100

0.7030306559535800 0.2211140195006100

0.7522681647689400 0.7864382973808900

0.8320894962326100 0.3980819067908800

0.4779906470691700 0.1246652235857500

0.9095850893806500 0.1876858469041400

as you can see, I add the X serie obtained with your program using the CALL RANUNI routine.

If u=X for the first observation, this is not exactly the case for the second one.

why? because you can’t reproduce the results of the CALL RANUNI routine with a data step.

The maximum integer you can store in a 8 bytes long SAS numerical variable is 9 007 1999 254 740 992 (=2**53 in a windows environment). (I don’t know the maximum integer you can store with R but I guess that the problem is the same – by using double precision numbers, may be you can fix your problem…).

Starting from :

seed=123;

a=397204094;

m=2**31-1;

For the first value of U, you first need the modulus of :

123*397 204 094 /2 147 483 647 = 1 611 463 328

your first U will be :

1 611 463 328 /2 147 483 647 = 0.7504

note that 123*397204094= 48 856 103 562 < 9 007 1999 254 740 992

for the second observation, you first need to get the modulus of :

1 611 463 328 * 397 204 094 / 2 147 483 647

but 1 611 463 328 * 397 204 094 is superior to 9 007 1999 254 740 992 and SAS will use 640 079 831 212 464 000 but the exact value is 640 079 831 212 464 832.

CALL RANUNI is using the exact modulus (689 153 326) while the SAS program is using a proxy (689 153 390).

By using the exact modulus for the third observation, you'll obtain the value given by the CALL RANUNI routine, by using the proxy, you'll obtain the one I'm having with my SAS program.

et voilà !

more infos here : http://www.sascommunity.org/wiki/How_the_SAS_Random_Number_Generators_Work

You can't reproduce the functioning of the CALL RANUNI subroutine with a data step but there is a second problem…

run this program :

data test;

seed=123;

a=397204094;

m=2**31-1;

do i=1 to 1000000;

seed=mod(a*seed,m);

u=seed/m;

output;

end;

run;

proc means ;

var u;

run;

the mean is 0.0093406… (after the 18616th observation, the seed is equal to zero… U too…)

run this one :

%LET SEED =123;

%LET NREP=1000000;

DATA FRANUNI (KEEP = x) ;

seed = &SEED ;

DO REP = 1 TO &NREP;

CALL RANUNI(seed, x);

OUTPUT;

END;

RUN;

proc means ;

var x;

run;

the mean is 0.4996268

the values given by the SAS program are wrong but I'm having a solution: the CALL RANUNI routine.

What about the ones you obtain with R ?

First of all, I got a different result from R 2.15.3 (32bit on Vista) when I ran the R code with minor modification to print both seed and u values.

“`

a = 397204094 # RANUNI multiplier

seed = 123 # seed

n = 10 # sample size

m = (2^31) – 1 # period

for (i in (1:n-1)) {

seed = (a*seed)%%m

u = seed / m

cat(‘seed:’, seed, “, u:”, u, “\n”)

}

“`

“`

seed: 1611463328 , u: 0.7503961

seed: 689153390 , u: 0.3209121

seed: 34347102 , u: 0.01599412

seed: 540867172 , u: 0.2518609

seed: 140698422 , u: 0.06551781

seed: 1509746837 , u: 0.7030307

seed: 1615483582 , u: 0.7522682

seed: 1786898586 , u: 0.8320895

seed: 1026477098 , u: 0.4779906

seed: 1953319105 , u: 0.9095851

“`

Then, I translated the R code into Python 3.2.3 (32 bit on Vista) and the result is almost identical to that of SAS. Since I am a Python novice, there may be some mistakes, though.

“`

a = 397204094 # RANUNI multiplier

seed = 123 # seed

n = 10 # sample size

m = (2**31) – 1 # period

for i in range(n):

seed = (a*seed)%m

u = seed / m

print(“seed: {}, u: {}”.format(seed, round(u, 7)))

“`

“`

seed: 1611463328, u: 0.7503961

seed: 689153326, u: 0.320912

seed: 383088854, u: 0.1783896

seed: 1945691870, u: 0.9060334

seed: 766903084, u: 0.3571171

seed: 474838741, u: 0.221114

seed: 1688863383, u: 0.7864383

seed: 854874385, u: 0.3980819

seed: 267716529, u: 0.1246652

seed: 403052287, u: 0.1876858

“`

It seems that the problem is the calculation of ‘seed’ value. From the third iteration, calculated seed values are quite different in two languages. Maybe it has to do with precision. The variable ‘seed’ is ‘numeric’ in R and ‘integer’ in Python. Hopefully someone with the knowledge of internal workings of R can clarify this.

I am excited to hear of the book! Can we have some details?

give me a couple of weeks, and I’ll try to post some more information…

Maybe it is due to the way R and SAS are coding floating (i.e. the number of bytes)?

A solution (but not satisfactory) could be to generate random numbers with R (or with SAS) and to save it in file. Then you just have to load the corresponding file… But it is not an answer to your question!

This is what we plan to do for the book: explain how to generate (randomly) a sample, and then also use indexes produced by sas, the list being in a separate text file