Random number streams in SAS: How do they work?
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
I previously showed how to generate random numbers in SAS by using the RAND function in the DATA step or by using the RANDGEN subroutine in SAS/IML software. These functions generate a stream of random numbers. (In statistics, the random numbers are usually a sample from a distribution such as the uniform or the normal distribution.) You can control the stream by setting the seed for the random numbers. The random number seed is set by using the STREAMINIT subroutine in the DATA step or the RANDSEED subroutine in the SAS/IML language.
A random number seed enables you to generate the same set of random numbers every time that you run the program. This seems like an oxymoron: if they are the same every time, then how can they be random? The resolution to this paradox is that the numbers that we call “random” should more accurately be called “pseudorandom numbers.” Pseudorandom numbers are generated by an algorithm, but have statistical properties of randomness. A good algorithm generates pseudorandom numbers that are indistinguishable from truly random numbers. The random number generator used in SAS is the Mersenne-Twister random number generator (Matsumoto and Nishimura, 1998), which is known to have excellent statistical properties.
Why would you want a reproducible sequence of random numbers? Documentation and testing are two important reasons. When I write SAS code and publish it on this blog, in a book, or in SAS documentation, it is important that SAS customers be able to run the code and obtain the same results.
Random number streams in the DATA step
The STREAMINIT subroutine is used to set the random number seed for the RAND function in the DATA step. The seed value controls the sequence of random numbers. Syntactically, you should call the STREAMINIT subroutine one time per DATA step, prior to the first invocation of the RAND function. This ensures that when you run the DATA step later, it produces the same pseudorandom numbers.
If you start a new DATA step, you can specify a new seed value. If you use a seed value of 0, or if you do not specify a seed value, then the system time is used to determine the seed value. In this case, the random number stream is not reproducible.
To see how random number streams work, each of the following DATA step creates five random observations. The first and third data sets use the same random number seed (123), so the random numbers are identical. The second and fourth variables both use the system time (at the time that the RAND function is first called) to set the seed. Consequently, those random number streams are different. The last data set contains random numbers generated by a different seed (456). This stream of numbers is different from the other streams.
data A(drop=i); call streaminit(123); do i = 1 to 5; x123 = rand("Uniform"); output; end; run; data B(drop=i); call streaminit(0); do i = 1 to 5; x0 = rand("Uniform"); output; end; run; data C(drop=i); call streaminit(123); do i = 1 to 5; x123_2 = rand("Uniform"); output; end; run; data D(drop=i); /* no call to streaminit */ do i = 1 to 5; x0_2 = rand("Uniform"); output; end; run; data E(drop=i); call streaminit(456); do i = 1 to 5; x456 = rand("Uniform"); output; end; run; data AllRand; merge A B C D E; run; /* concatenate */ proc print data=AllRand; run;
Notice that the STREAMINIT subroutine, if called, is called exactly one time at the beginning of the DATA step. It does not make sense to call STREAMINIT multiple times within the same DATA step; subsequent calls are ignored. In the one DATA step (D) that does not call STREAMINIT, the first call to the RAND function implicitly calls STREAMINIT with 0 as an argument.
If a single program contains multiple DATA steps that generate random numbers (as above), use a different seed in each DATA step or else the streams will not be independent. This is also important if you are writing a macro function that generates random numbers. Do not hard-code a seed value. Rather, enable the user to specify the seed value in the syntax of the function.
Random number streams in PROC IML
So that it is easier to compare random numbers generated in SAS/IML with random numbers generated by the SAS DATA step, I display the table of SAS/IML results first:
These numbers are generated by the RANDGEN and RANDSEED subroutines in PROC IML. The numbers are generated by five procedure calls, and the random number seeds are identical to those used in the DATA step example. The first and third variables were generated from the seed value 123, the second and fourth variables were generated by using the system time, and the last variable was generated by using the seed 456. The following program generates the data sets, which are then concatenated together.
proc iml; call randseed(123); x = j(5,1); call randgen(x, "Uniform"); create A from x[colname="x123"]; append from x; proc iml; call randseed(0); x = j(5,1); call randgen(x, "Uniform"); create B from x[colname="x0"]; append from x; proc iml; call randseed(123); x = J(5,1); call randgen(x, "Uniform"); create C from x[colname="x123_2"]; append from x; proc iml; /* no call to randseed */ x = J(5,1); call randgen(x, "Uniform"); create D from x[colname="x0_2"]; append from x; proc iml; call randseed(456); x = J(5,1); call randgen(x, "Uniform"); create E from x[colname="x456"]; append from x; quit; data AllRandgen; merge A B C D E; run; proc print data=AllRandgen; run;
Notice that the numbers in the two tables are identical for columns 1, 3, and 5. The DATA step and PROC IML use the same algorithm to generate random numbers, so they produce the same stream of random values when given the same seed.
Summary
- To generate random numbers, use the RAND function (for the DATA step) and the RANDGEN call (for PROC IML).
- To create a reproducible stream of random numbers, call the STREAMINIT (for the DATA step) or the RANDSEED (for PROC IML) subroutine prior to calling RAND or RANDGEN. Pass a positive value (called the seed) to the routines.
- To initialize a stream of random numbers that is not reproducible, call STREAMINIT or RANDSEED with the seed value 0.
- To insure independed streams within a single program, use a different seed value in each DATA step or procedure.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |