Obtain Trace of the Projection Matrix in a Linear Regression

This post was kindly contributed by SAS Programming for Data Mining Applications - go there to comment and to read the full post.


Recently, I am working on coding in SAS for a set of regularized regressions and need to compute trace of the projection matrix:
S=X(X’X + \lambda I)^{-1}X’.

Trace of the project matrix S is a key concept in modern regression analysis. For example, the effective degree of freedom of the model in a regularized linear regression is trace(S)/N [1]. For another example, in approximating leave-one-out-cross-validation using GCV, trace(S)/N is a key component (formula 7.52 of [1]).

Check the reference and the cited works therein for more information.

Ref:
1. Hastie et al., The Elements of Statistical Learning, 2nd Ed.


/* Use the SocEcon example data created in
   Example A.1: A TYPE=CORR Data Set Produced by PROC CORR
   on page 8153 of SAS/STAT 9.22 User Doc
*/
data SocEcon;
input Pop School Employ Services House;
datalines;
5700 12.8 2500 270 25000
1000 10.9 600 10 10000
3400 8.8 1000 10 9000
3800 13.6 1700 140 25000
4000 12.8 1600 140 25000
8200 8.3 2600 60 12000
1200 11.4 400 10 16000
9100 11.5 3300 60 14000
9900 12.5 3400 180 18000
9600 13.7 3600 390 25000
9600 9.6 3300 80 12000
9400 11.4 4000 100 13000
;
run;

%let depvar=HOUSE;
%let covars=pop school employ services;
%let lambda=1;
/* Below is the way to obtain trace(S), where S is the project matrix in a (regularized) linar regression. 
   For further information, check pp.68, pp.153 of Elements of Statistical Learning,2nd Ed.
*/

/* For details about TYPE=SSCP special SAS data, consult:
  Appendix A: Special SAS Data Sets, pp.8159 of SAS/STAT 9.22 User's Guide
*/
proc corr data=SocEcon sscp out=xtx(where=(_TYPE_='SSCP'))  noprint;
     var &covars;
run;


data xtx2;
     set xtx;
  array _n{*} _numeric_;
  array _i{*} i1-i5 (5*0);
  do j=1 to 5;
     if j=_n_ then _i[_n_]=λ
  else _i[j]=0;
  end;
  _n[_n_]=_n[_n_]+1;
  drop j;
run;

/* Obtain the inverse of (XTX+\lambda*I)
   Note that we explicitly specified Intercept term in the 
   covariate list and fit a model without implicit intercept 
   term in the model.
*/
proc reg data=xtx2  
         outest=S0(type=SSCP
                   drop=i1-i5 _MODEL_  _DEPVAR_  _RMSE_)
         singular=1E-17;
     model i1-i5 = Intercept &covars / noint   noprint;
run;quit;

data S0;
     set S0; 
  length _NAME_ $8;
  _NAME_=cats('X_', _n_);
run;

proc score data=SocEcon  score=S0  out=XS0(keep=X_:)  type=parms;
     var &covars;
run;     

data XS0X;
     merge XS0  X;
  array _x1{*} X_:;
  array _x0{*} intercept pop school employ services;
  do i=1 to dim(_X1);
     _x1[i]=_x1[i]*_x0[i];
  end;
  rowSum=sum(of _x1[*]);
  keep rowSum;
run;
proc means data=XS0X  noprint;
     var rowSum;
  output out=trace  sum(rowSum)=Trace;
run;

Verify the result using R:


> socecon<-read.csv('c:/socecon.csv', header=T)
> x<-as.matrix(cbind(1, socecon[,-5]))
> xtx<-t(x)%*%x
> phi<-xtx+diag(rep(1, 5))
> 
> # method 1. 
> S<-x%*%solve(phi)*x
> sum(S)
[1] 4.077865
> # method 2. 
> S<-(x%*%solve(phi))%*%t(x)
> sum(diag(S))
[1] 4.077865
>

This post was kindly contributed by SAS Programming for Data Mining Applications - go there to comment and to read the full post.