Bayesian Computation with SAS (1).

This post was kindly contributed by SAS Programming for Data Mining Applications - go there to comment and to read the full post.


The book “Bayesian Computation with R” by Jim Albert is an easy to read entry level book on applied Bayesian Statistics. While the book was written for R users, it is not difficult to translate the languages between R and SAS and I believe it is a good way to show Bayesian capability of SAS. In the next several months, I am going to translate all R code in the book into SAS. Readers are encouraged to buy this book to understand what’s behind the code. The posts here will only spend minimum effects to explain the statistics underlying while most resource will still be denoted to SAS coding.

This post will cover Chapter 1.

We first follow section 1.2.2, using PROC IMPORT to read in TAB deliminated file, which corresponds to read.table(file, sep=’\t’, header=T)  in R. We also designated PDF as Output Destination.


/* Chapter 1 of Bayesian Computation with R */

%let folder=C:\Documents\SAS for Bayesian Compu with R;
%let datafolder=&folder\data;

libname data "&datafolder";
options missing='.' formchar = "|----|+|---+=|-/\<>*"  errors=2 fullstimer;

/*@@@@@@ Section 1.2.2 @@@@@*/
/*--> Read Tab Deliminated Data into SAS */
proc import datafile="&datafolder/studentdata.txt"  out=student dbms=dlm   replace;
      datarow=2;       /* redundent with GETNAMES= statement */
   guessingrows=2;  /* Only use first 2 rows of data to guess the type of data */
      delimiter='09'x;  /* this is redudent if we specify DBMS=TAB in PROC IMPORT option */
   getnames=YES;  /* This is the same as 'header=True' in R's table.read function */
run;

ods pdf file="&folder/Chapter1.pdf";
/*-->  To see Variable Names and print first row of this data 
   Use SPLIT="*" to specify the split character being "*", which controls 
   line breaks in column headings
*/
proc print data=student(obs=1)  split="*"; run;

SAS output looks like:

Section 1.2.3, summarize frequency tables and use Barplot to visualize the counts.

PROC FREQ directly corresponds to table() function in R, but provides much richer functionality. PROC GBARLINE is used to answer what barplot() does in R. The LABEL= statement in SAS is a very handy tool for annotation purpose for PRINT or GRAPH.

PROC MEANS somehow corresponds to summary() function to obtain summary statistics for quantitative variables. You can sepecify MAXDEC= to keep desired number of decimals in print out.
Histogram can be obtained in SAS via either PROC UNIVARIATE or newly introduced PROC SGPLOT. PROC UNIVARIATE provides more control though. As shown below, you can specify desired mid-points while you can’t do so in SGPLOT. On the other hand, SGPLOT is very convenient if you want to overlay density curve, either a normal fit or kernel one.



/*@@@@ section 1.2.3 @@@*/
/*--> To summarize and graph a single Batch */
proc freq data=student;
     table Drink;
run;

/*--> Barplot on single variable */
title "Original idea, summarize and plot";
proc freq data=student  noprint;
      table Drink /out=t;
run;
goptions reset=all noborder;
proc gbarline data=t;
      bar  Drink /discrete space=4  sumvar=count;
run;quit;
title;
/*--> Actually, in SAS, it is better to directly use 
    PROC GBARLINE */
goptions reset=all noborder;
title "In SAS, PROC GBARLINE directly summarize";
proc gbarline data=student;
      bar Drink /discrete space=3 ;
run;quit;
title;

data student;
      set student;
   hours_of_sleep=WakeUp-ToSleep;
   /* Label can be useful for annotation purpose */
   label hours_of_sleep="Hours Of Sleep"; 
run;

proc means data=student 
           /* Request same statistics as in the book */
           min q1 median q3 max  nmiss             
           /* MAXDEC= specifies keeping 2 decimals */
           maxdec=2;                          
    var hours_of_sleep;
run;

/*--> Histogram using PROC UNIVARIATE */
proc univariate data=student  noprint;
      var hours_of_sleep;
   histogram /midpoints=2.5 to 12.5 by 1;
run;

/* new SGPLOT is a handy way to draw histogram and density */
proc sgplot data=student;
      histogram hours_of_sleep/scale=count;     
      density   hours_of_sleep/scale=count  type=normal;
run;

Compare SAS outputs:

Boxplot to visualize distribution of  quantitative variables by some classification variable. Here, formula such as var1*var2 style corresponds to var1~var2 style in R.



/*@@@ section 1.2.4 @@@@*/
/*---> Compare Batches using Boxplot */
proc sort data=student out=studentsorted; by Gender; run;
proc boxplot data=studentsorted;
      plot  hours_of_sleep*Gender;
run;   

proc means data=student  
            min q1 median q3 max nmiss  
            nway  maxdec=2;
     class Gender;
  var  Haircut;
run;

SAS output looks like:

R has built-in jitter function, however, we have to make our own in SAS. This verision doesn’t apply Fuzzy yet, only for demo purpose.



%macro jitter(var, newname, data=last, factor=1, amount=);
/* follow the JITTER function in R, no fuzzy applied yet . 
   if amount is given, then use this value for disturbance
   else if amount is 0, then use the range of &var for disturbance
   else if amount is NULL, then use the smallest difference among
        distinct values of &var for disturbance. if all values are 
        the same, then use the value obtained from range

   if range of &var is 0, then use the lower range for disturbance
   otherwise 
*/
%let blank=%str( );
%if &data=' ' %then %let data=last;
%local fid;

data _null_;
     fid=round(ranuni(0)*10000, 1);
  call symput('fid', compress(fid));
run;
proc means data=&data  noprint;
     var &var;
  output  out=_util&fid  
          range(&var)=range  
          min(&var)=min  
          max(&var)=max;
run;
data _util&fid;
     set _util&fid;
  if range^=0 then z=range; else z=min;
  if z=0 then z=1;
run;

%if %eval(&amount=&blank) %then %do;
    proc sort data=&data.(keep=&var  where=(&var^=.))   out=_xuti&fid  nodupkey;
       by &var;
 run;
 data _duti&fid;
      set _xuti&fid  nobs=ntotal  end=eof;      
   array _x{2} _temporary_;
   if ntotal=1 then do;
      amount=&factor/50*abs(&var);
   keep amount;
   output;
   stop;
   end;
   else do;
      if _n_=1 then do; 
         _x[1]=&var; _x[2]=constant('BIG'); 
      end;
   else do;
               _x[2]=min(_x[2], &var - _x[1]);
      _x[1]=&var;
      if eof then do;
         amount=&factor/5*abs(_x[2]);
      keep amount;
      output;
      end;
   end;
   end;
  run;
    
%end;
%else %if %eval(&amount=0) %then %do;
    data _duti&fid;
      set _util&fid;
   amount=&factor*z/50;
   keep amount;
   output;
 run;     
%end;
%else %do;
    data _duti&fid;
      amount=&amount;
   keep amount;
   output;
 run;
%end;

proc sql noprint;
     select name into :keepvars separated by ' '
  from   sashelp.vcolumn
  where  libname='WORK' 
    and  memname=%upcase("&data") 
    and  memtype="DATA"
  ;
quit;
data &data;     
  array _x{1} _temporary_;
     if _n_=1 then do;
     set _duti&fid;
  _x[1]=amount;
  end;
     set &data;
  &newname=&var + ranuni(0)*(2*_x[1])-_x[1];
  label &newname="jitter(&var)";
  keep &keepvars  &newname;
run;
proc datasets library=work nolist;
     delete _duti&fid _xuti&fid _util&fid;
run;quit;
%mend; 

Now we can apply jitter function to the two variables studied in the book and check them out. For graphics, we use axis statement to define how we want the two axes to appear in print out, and use symbol statement to define the appearance of symbols for data. Here we ask for black circles, not connected (interpol=none).



%jitter(hours_of_sleep, xhos, data=student, factor=1, amount=); 
%jitter(ToSleep, xts, data=student, factor=1, amount=);

goptions border  reset=all;
axis1  label=("jitter(ToSleep)")    
       major=(height=2  number=5)   minor=none;
axis2  label=(angle=90  "jitter(Hours of Sleep")     
       major=(height=2  number=5)  minor=none;
symbol1 value=circle  interpol=none  color=black;
proc gplot data=student;
      plot xhos*xts/haxis=axis1  vaxis=axis2;
run;quit;      

In order to overlay a fitted regression line, we have multiple ways to do it in SAS, and the Statistical Plot procedures are particularly handy to generate this type of figures. Compare the code and corresponding figures (in the same order of LINEPRINTER, ODS GRAPHICS, Manually Drawn, SGPLOT). For manually drawn figures, since we ask for connected dots in order to obtain a line overlay on circles, we have to sort the data by X-axis variable to make the line as straight as possible. The manually drawn figure is also the one closest to the R figure in the book.



/*1.) Old school LINEPRINTER option*/
proc reg data=student  lineprinter;
      title "Line Printer Style";
      model   Hours_of_Sleep=ToSleep /noprint;
   var     xhos  xts;   
   plot pred.*xts='X'  xhos*xts/overlay  symbol='o';
run;quit;
title;

/*2.)Using ODS GRAPHICS */
ods select none; /* to suppress PROC REG output */
title "ODS Graphics";
ods graphics on;
proc reg data=student ;
      model   Hours_of_Sleep=ToSleep;
   var     xhos  xts;
   output  out=student  pred=p;  
      ods select FitPlot; 
run;quit;
ods graphics off;
title;
ods select all;

/*3.) Following standard R approach in the book, 
   more tedious in SAS 
*/
proc sort data=student  out=studentsorted;
      by ToSleep;
run;
goptions border  reset=all;
axis1  label=("jitter(ToSleep)")      
       order=(-2 to 6 by 2)  
       major=(height=2  number=5)   minor=none;
axis2  label=(angle=90  "jitter(Hours of Sleep")    
       offset=(, 2 pct)  
       major=(height=2  number=5)  minor=none;
symbol1 value=circle  interpol=none  color=black;
symbol2 value=none  interpol=line  color=black;
proc gplot data=studentsorted;
      plot xhos*xts  p*xts/overlay  haxis=axis1  vaxis=axis2;   
run;quit;      

/*4.) New SGPLOT has build in regression line capability: 
       REG, SPLINE, LOESS 
*/
proc sgplot data=student;
      scatter  x=xts  y=xhos;
   reg  x=ToSleep  y=Hours_of_Sleep;
run;

Now, to study the robustness of T-Statistics which relies on several assumptions, such as normality, homoskedasticity, etc. First, we need to define a “function” to calculate T-statistics like in the book.



/* section 1.3 Robustness of T-Statistics */
/* section 1.3.2 Write a function to compute T-Statistics */
%macro tstatistics(vars, dsn=last, outdsn=, class=);
%if &class='' %then %let classstmt=;
%else %let classstmt=class &class;

%let nvars=%sysfunc(count(&vars, ' '));

%let var1=%scan(&vars, 1, ' ');
%let var2=%scan(&vars, 2, ' ');
proc means data=&dsn noprint  nway;
     &classstmt;
  var &vars;
  output  out=tdata(where=(_STAT_ in ('STD', 'MEAN', 'N')));
run;
data &outdsn;
     set tdata; by &class _TYPE_;
  retain  m  n  mean1 mean2  std:;
  select(compress(_STAT_));
          when('N') do;
        m=&var1;
     n=&var2;
    end;
    when('MEAN') do;
              mean1=&var1;
     mean2=&var2;
    end;
          when('STD') do;
        std1=&var1;
     std2=&var2;
    end;
    otherwise;
  end;
     if last._TYPE_ then do;
        sp=sqrt(((m-1)*std1**2 + (n-1)*std2**2)/(m+n-2));
     t= (mean1-mean2)/(sp * sqrt(1/m + 1/n));
  keep &class _TYPE_ m  n std:  mean:  sp t;
  output;
  end;
run;
%mend;

Just like the “source(file)” command in R, in SAS, we have a counterpart called %include(or in abv. %inc). Suppose you have a SAS file called “tstatistics.sas” in some folder &folder
, the following statement will read in the source code from this file. Since this is a SAS file, suffix “.sas” is not necessary.


filename BCR "&folder";
%inc BCR(tstatistics);

The following code pieces conduct the four simulate scenarios illustrated in the book and draw the last figure.


data test;
      input x y;
cards;
1  5
4  4
3  7
6  6
5  10
;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=);

/* section 1.3.3 Monte Carlo study on the Robustness of T-Statistics */
%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
      retain seed1 99787  seed2 76568;
   do iter=1 to &Niter;
      do j=1 to max(&n, &m);
            if j<=&m then call rannor(seed1, x); else x=.;
   if j<=&n then call rannor(seed2, y); else y=.;
   keep iter x y;
   output;
   end;
    end;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);

data _null_;
      set tdata2  end=eof;
   retain n_reject 0;
   if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
   if eof then do;
      call symput('n_reject', n_reject/_n_);
   end;
run;
%put &n_reject;


/* section 1.3.4 Study the robustness of T-Statistics */

/* 1.) both are standard normal */

%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
      retain seed1 987687  seed2 76568;
   do iter=1 to &Niter;
      do j=1 to max(&n, &m);
            if j<=&m then call rannor(seed1, x); else x=.;
   if j<=&n then call rannor(seed2, y); else y=.;
   keep iter x y;
   output;
   end;
    end;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);

data _null_;
      set tdata2  end=eof;
   retain n_reject 0;
   if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
   if eof then do;
      call symput('n_reject', n_reject/_n_);
   end;
run;
%put &n_reject;

/* 2.) both are normal with std=1 and 10 */

%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
      retain seed1 99787  seed2 76568;
   do iter=1 to &Niter;
      do j=1 to max(&n, &m);
            if j<=&m then call rannor(seed1, x); else x=.;
   if j<=&n then call rannor(seed2, y); else y=.;
   y=y*10;
   keep iter x y;
   output;
   end;
    end;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);

data _null_;
      set tdata2  end=eof;
   retain n_reject 0;
   if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
   if eof then do;
      call symput('n_reject', n_reject/_n_);
   end;
run;
%put &n_reject;


/* 2.) both are normal with std=1 and 10 */

%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
      retain seed1 99787  seed2 76568;
   do iter=1 to &Niter;
      do j=1 to max(&n, &m);
            if j<=&m then call rannor(seed1, x); else x=.;
   if j<=&n then call rannor(seed2, y); else y=.;
   y=y*10;
   keep iter x y;
   output;
   end;
    end;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);

data _null_;
      set tdata2  end=eof;
   retain n_reject 0;
   if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
   if eof then do;
      call symput('n_reject', n_reject/_n_);
   end;
run;
%put &n_reject;


/* 3.) both are T-distribution with df=4 */

%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
      retain seed1 99787  seed2 76568;
   call streaminit(seed1);
   do iter=1 to &Niter;
      do j=1 to max(&n, &m);
            if j<=&m then x=rand('T', 4); else x=.;
   if j<=&n then y=rand('T', 4); else y=.;   
   keep iter x y;
   output;
   end;
    end;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);

data _null_;
      set tdata2  end=eof;
   retain n_reject 0;
   if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
   if eof then do;
      call symput('n_reject', n_reject/_n_);
   end;
run;
%put &n_reject;


/* 4.) one normal with mu=10 and std=2 while the other is 
       exponential with rate=0.10 */

%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
      retain seed1 99787  seed2 76568;
   do iter=1 to &Niter;
      do j=1 to max(&n, &m);
            if j<=&m then call rannor(seed1, x); else x=.;
   if j<=&n then call ranexp(seed2, y); else y=.;
   x=10+x*2;
   y=y/0.1;
   keep iter x y;
   output;
   end;
    end;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);

data _null_;
      set tdata2  end=eof;
   retain n_reject 0;
   if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
   if eof then do;
      call symput('n_reject', n_reject/_n_);
   end;
run;
%put &n_reject;

/* draw the overlay density curves of empirical T-statistics and T-distribution */
proc sort data=tdata2;
      by t;
run;
data tdata2;
      set tdata2;
   dt=pdf('T', t, 18);
run;

ods select none;
ods output Controls=Controls;
ods output UnivariateStatistics=UniStat;
proc kde data=tdata2  ;
      univar  t /out=density  method=snr  unistats;
run;
ods select all;

data density;
      set density;
   dt=pdf('T', value, 18);
run;

data _null_;
      set UniStat;
   if Descr='Bandwidth' then call symput ('bw', compress(round(t, 0.0001)));
run;

goptions reset=all border;

title "Comparing Empirical and Theoretical Densities";
legend label=none   position=(top right inside)  mode=share;
axis1  label=("N=&Niter Bandwidth=&bw") 
       order=(-4 to 8 by 2) offset=(1,1)  
       major=(height=2) minor=none;
axis2  label=(angle=90  "Density")     
       order=(0 to 0.4 by 0.1)  offset=(0, 0.1) 
       major=(height=2)  minor=none;
symbol1  interpol=join  color=black  value=none  width=3;
symbol2  interpol=join  color=grey   value=none  width=1;
proc gplot data=density;
      plot density*value  dt*value/overlay  legend=legend
                                   haxis=axis1  vaxis=axis2  ;
      label dt='t(18)' density='Exact';   
run;quit;

ods pdf close;

Bayesian Computation with R (Use R)

This post was kindly contributed by SAS Programming for Data Mining Applications - go there to comment and to read the full post.