Benchmark Regression Procedures using OLS Regression

This post was kindly contributed by SAS Programming for Data Mining Applications - go there to comment and to read the full post.


In his blog, Rick Wicklin compared the performance of the SOLVE() and INV() functions in SAS/IML for solving a linear system.
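For context, that kind of comparison can be sketched in SAS/IML as below. This is only a minimal illustration, not the code from that blog post: the matrix size (n = 500), the seed, and all variable names are assumptions chosen for the example.

```sas
proc iml;
   call randseed(12345);          /* assumed seed, for reproducibility */
   n = 500;                       /* assumed problem size */
   A = j(n, n, .);                /* allocate an n x n matrix */
   b = j(n, 1, .);                /* allocate the right-hand side */
   call randgen(A, "Normal");     /* fill A with N(0,1) draws */
   call randgen(b, "Normal");     /* fill b with N(0,1) draws */

   /* Solve A*x = b directly */
   t0 = time();
   x_solve = solve(A, b);
   t_solve = time() - t0;

   /* Solve the same system by forming the inverse explicitly */
   t0 = time();
   x_inv = inv(A) * b;
   t_inv = time() - t0;

   /* The two solutions should agree up to rounding error */
   max_diff = max(abs(x_solve - x_inv));
   print t_solve t_inv max_diff;
quit;
```

The elapsed times depend on the machine and on n, so the sketch prints them rather than assuming which approach wins.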

Regression analysis is an integral part of SAS applications, and many procedures in SAS/STAT are capable of conducting various regression analyses. It is therefore interesting to benchmark their relative performance on OLS regression, the most fundamental regression analysis of all.

The analysis compares REG, GLMSELECT, GENMOD, MIXED, GLIMMIX, and HPMIXED on 10 OLS regressions with 100 to 1000 variables, in increments of 100. The number of observations is always twice the number of variables, to avoid possible numerical issues. A macro wraps them together:



%macro wrap;
/* Redirect the log to a file; NEW overwrites a log left by an earlier run */
proc printto log='c:\testlog.txt' new; run;
%do i=1 %to 10;
    /* Despite its name, &nobs is the number of regressors;
       each data set has 2*&nobs observations */
    %let nobs=%sysevalf(&i*100);
    /* Suppress notes so the DATA step timing is not written to the log */
    options nonotes;
    data _temp;
         array x{&nobs};
         do i=1 to 2*&nobs;
            do j=1 to &nobs;
               x[j]=rannor(0);
            end;
            y=rannor(0);
            drop i j;
            output;
         end;
    run;
    options notes;
    /* Suppress printed output; only the timing notes are of interest */
    ods select none;
    proc reg data=_temp;
         model y = x1-x&nobs;
    run;quit;

    proc genmod data=_temp;
         model y = x1-x&nobs /dist=normal;
    run;

    proc glmselect data=_temp;
         model y = x1-x&nobs /selection=none;
    run;

    proc glimmix data=_temp;
         model y = x1-x&nobs /dist=normal;
    run;

    proc mixed data=_temp;
         model y = x1-x&nobs;
    run;

    proc hpmixed data=_temp;
         model y = x1-x&nobs;
    run;
    ods select all;
%end;
/* Restore the default log destination */
proc printto; run;
%mend;
%wrap;

After all iterations have run, the SAS log is parsed to extract each procedure name together with its real time and CPU time. The following SAS code does the job:



data proc_compare;
     infile "c:\testlog.txt";
     input;
     length procedure $12 realtime cputime $24;
     retain procedure realtime realtime2;
     keep id size procedure cputime2 realtime2;

     /* "NOTE: PROCEDURE xxx used ..." names the procedure that the next
        timing lines belong to. REG runs first in every iteration, so it
        marks the start of a new problem size (id+1 retains id). */
     if index(_infile_, 'PROCEDURE')>0 then do;
        procedure=scan(_infile_, 3);
        if procedure="REG" then id+1;
     end;

     /* Times are written as "s.ss seconds" or, past one minute, "m:ss.ss" */
     if index(_infile_, 'real time')>0 then do;
        _t1=index(_infile_, 'real time');
        _t2=index(_infile_, 'seconds');
        /* No unit word: read through the last character of the line */
        if _t2=0 then _t2=length(_infile_)+1;
        realtime=substr(_infile_, _t1+9, _t2-_t1-9);
        if index(realtime, ':')>0 then do;
           sec=input(substr(realtime, index(realtime, ':')+1), best.);
           realtime2=input(compress(scan(realtime, 1, ':')), best.)*60+sec;
        end;
        else realtime2=input(compress(realtime), best.);
     end;

     /* The cpu time line closes a timing block, so one row is output here */
     if index(_infile_, 'cpu time')>0 then do;
        _t1=index(_infile_, 'cpu time');
        _t2=index(_infile_, 'seconds');
        if _t2=0 then _t2=length(_infile_)+1;
        cputime=substr(_infile_, _t1+8, _t2-_t1-8);
        if index(cputime, ':')>0 then do;
           sec=input(substr(cputime, index(cputime, ':')+1), best.);
           cputime2=input(compress(scan(cputime, 1, ':')), best.)*60+sec;
        end;
        else cputime2=input(compress(cputime), best.);
        size=id*100;
        /* Skip the timing notes written by PROC PRINTTO itself */
        if compress(procedure)^="PRINTTO" then output;
     end;
run;
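The time conversion used in the parser can be checked in isolation. The sketch below applies the same idea to two made-up log lines, one in plain seconds and one in minutes:seconds form. The lines and the variable names (line, t, secs) are hypothetical and exist only for this illustration.

```sas
data _null_;
   length line $80 t $24;
   /* Two hypothetical log lines: plain seconds, and minutes:seconds */
   do line = "      real time           6.79 seconds",
             "      real time           1:05.30";
      /* Keep the text after the 'real time' label and drop the unit word */
      t = strip(tranwrd(substr(line, index(line, 'real time') + 9),
                        'seconds', ' '));
      /* m:ss.ss becomes 60*m + ss.ss; otherwise the value is already seconds */
      if index(t, ':') > 0 then
         secs = 60 * input(scan(t, 1, ':'), best12.)
                   + input(scan(t, 2, ':'), best12.);
      else
         secs = input(t, best12.);
      put t= secs=;
   end;
run;
```

The log should show secs=6.79 for the first line and secs=65.3 for the second. Like the parser above, this sketch does not handle times that include an hours field.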

We then visualize the results using the following code:



title "Benchmark Regression PROCs using OLS";
proc sgpanel data=proc_compare;
     panelby procedure;
     series y=cputime2 x=size
            / lineattrs=(thickness=2);
     label cputime2="CPU Time (sec)"
           size="Problem Size";
run;
title;

title "A Closer Look at REG vs. GLMSELECT";
proc sgplot data=proc_compare uniform=group;
     where procedure in ("GLMSELECT", "REG");
     series x=size y=cputime2
            / group=procedure curvelabel lineattrs=(thickness=2);
     label cputime2="CPU Time (sec)"
           size="Problem Size";
run;
title;

PROC REG and PROC GLMSELECT beat all the other procedures by a large margin, while HPMIXED is the slowest because it uses sparse matrix techniques, which are inefficient for a large dense matrix. Note that GLMSELECT in turn outperforms REG by a large margin. This is despite the fact that, judging from the real time and CPU time, REG used multiple cores on my laptop (its CPU time exceeds its real time) while GLMSELECT did not:

************ Partial LOG of the last iteration ********
NOTE: PROCEDURE REG used (Total process time):
real time 6.79 seconds
cpu time 9.36 seconds

NOTE: There were 2000 observations read from the data set WORK._TEMP.
NOTE: PROCEDURE GLMSELECT used (Total process time):
real time 3.06 seconds
cpu time 2.96 seconds
********************************************************
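The multi-core observation can be made explicit by dividing CPU time by real time, using the four numbers from the log excerpt above. A ratio clearly above 1 suggests that work was spread over more than one core. This is a small sketch; the data set name thread_check and the variable cpu_to_real are made up for the illustration.

```sas
data thread_check;
   /* Values copied from the partial log of the last iteration */
   input procedure :$12. realtime cputime;
   cpu_to_real = cputime / realtime;
   format cpu_to_real 5.2;
   datalines;
REG        6.79 9.36
GLMSELECT  3.06 2.96
;
run;

proc print data=thread_check noobs;
run;
```

The ratio is roughly 1.38 for REG and 0.97 for GLMSELECT, which is consistent with REG running multi-threaded and GLMSELECT running on a single core.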

As far as I know, both REG and GLMSELECT are developed by the same group of developers at SAS.
