Sampling from a Bayesian Posterior Distribution in SAS

This post was kindly contributed by Confounded by Confounding » SAS - go there to comment and to read the full post.

One of the things that frequently comes up in my research is the need to estimate a parameter from data, and then randomly draw samples from that parameter’s distribution to plug into another model. If you have a regular estimate from something like PROC LOGISTIC or PROC GENMOD, this is easy as pie, as SAS (and indeed most stats programs and programming languages) has a wide variety of random number generators.
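As a sketch of that easy case, here is the same idea in Python rather than SAS (the estimate and standard error below are made-up illustrative values, not output from any real model): draw from a normal approximation of an estimated coefficient and plug the draws into whatever comes next.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical estimate from a regression model: a log-odds ratio and its SE.
beta_hat, se_hat = 1.10, 0.15

# Draw 1,000 values from the parameter's approximate sampling distribution.
draws = rng.normal(loc=beta_hat, scale=se_hat, size=1000)

# Each draw can then feed a downstream model, e.g. exponentiated to an odds ratio.
or_draws = np.exp(draws)
```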

But what if you couple this with the new BAYES statement in SAS 9.2? Is it possible to generate random samples from the posterior distribution of the parameter, which doesn’t have a nice clean form you can feed into a random number generator? The answer: yes.

This method exploits the fact that the MCMC algorithm used to generate your results in the first place is based around generating random samples from the posterior distribution. So, instead of generating your own, you can just grab those. There are some caveats to this approach:

  • It’s computationally expensive. You have to have the MCMC algorithm in SAS generate a tremendous number of iterations to begin with, more than you probably need just to get good results, and this takes a long time. For example, the code below takes ~ 9 minutes and 45 seconds to run on a 3.2 GHz Xeon quad-core processor – almost all of it the MCMC component. Comparatively, generating the same number of random numbers from a parametric distribution would have been…very very fast.
  • You can’t draw an arbitrary number of samples. Or you can, because you’re sampling with replacement, but there’s an upper limit to the number of values you can get, since the distribution is a collection of values, rather than a continuous function. How big of a deal this is probably depends on your question.

All that aside, here’s one way to do it. If folks have a more computationally efficient way, I’d be happy to hear it. The how-to is broken down into steps, with commentary:

data work.tinkering;
    do id = 1 to 20000;
        /* One roll for disease status, one for exposure status */
        disroll = ranuni(68273);
        yran = ranuni(102842);
        /* Exposure y present in roughly a third of the cohort */
        if yran <= 0.333 then y = 1;
        else y = 0;
        /* Baseline risk of 0.10 (exp(-2.30258)), tripled (exp(1.0986) = 3) when exposed */
        pdis = exp(-2.30258 + (1.0986 * y));
        /* Subject becomes a case if their roll falls below their risk */
        if disroll <= pdis then case = 1;
        else case = 0;
        output;
    end;
run;

This bit just creates a simulated data set, as a proof-of-concept cohort. We’ve got an exposure y present in a third of the population, with a true relative risk of 3.00. We then generate everyone’s probability of disease: a baseline of 0.10, multiplied by the relative risk conferred by their value of y. If another randomly generated number (the geeks among us know this as a saving throw) is below the probability of disease, the randomly generated subject becomes a case.
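The same cohort logic can be sketched in a few lines of Python, purely as an illustration of the data step above (numpy in place of SAS’s ranuni; the empirical relative risk should land near 3):

```python
import numpy as np

rng = np.random.default_rng(68273)
n = 20000

# Exposure y present in roughly a third of the cohort.
y = (rng.random(n) <= 0.333).astype(int)

# Baseline risk 0.10 (exp(-2.30258)), multiplied by a relative risk of 3
# (exp(1.0986)) among the exposed.
pdis = np.exp(-2.30258 + 1.0986 * y)

# The "saving throw": subjects whose roll falls below their risk become cases.
case = (rng.random(n) <= pdis).astype(int)

# Empirical relative risk, exposed vs. unexposed.
rr = case[y == 1].mean() / case[y == 0].mean()
```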

ods graphics on;

proc genmod data=work.tinkering descending;
    model case = y / dist=binomial link=log;
    /* 2000 burn-in iterations, 10000 posterior samples, saved to work.out */
    bayes nbi=2000 nmc=10000 plots=all outpost=work.out;
run;

ods graphics off;

This bit is the core of the thing. We turn on ODS graphics so we get pretty diagnostic plots, and run a generalized linear model on our simulated dataset, modeling the relationship between being a case and our exposure y. We use the BAYES statement to do Bayesian inference, with 2000 burn-in iterations (the SAS default) to let the MCMC algorithm start to properly converge, and then 10,000 iterations to flesh out the posterior distribution. outpost=work.out is key – this outputs the samples from the MCMC posterior distribution as a new dataset. Note it will give you samples for all your covariates, plus some other information. I’m going to ignore these, and just look at the results for y.

proc surveyselect data=work.out method=urs outhits n=1000 out=work.SampledPrior;
run;

One line of code enters. 1000 random samples of the posterior dataset, with replacement, leave. And there you have your sampled posterior distribution. I did a graphical comparison of the two:
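Conceptually, PROC SURVEYSELECT with METHOD=URS is just unrestricted sampling with replacement. The equivalent step in Python, assuming the posterior draws are already in an array (faked here with random values purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the outpost dataset's column of posterior draws for y.
posterior_draws = rng.normal(loc=1.1, scale=0.1, size=10000)

# 1,000 samples with replacement -- the URS (unrestricted random sampling) step.
resampled = rng.choice(posterior_draws, size=1000, replace=True)
```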

Full Posterior Distribution

1000 Samples from the Posterior Distribution

Overall, I’d say not bad, though as I mentioned previously, somewhat slow. If you were doing a more intensive analysis, this is one of those “leave it alone to run and go have dinner” situations.

Filed under: Epidemiology, SAS, Simulation
