AUC calculation using Wilcoxon Rank Sum Test

This post was kindly contributed by SAS Programming for Data Mining Applications - go there to comment and to read the full post.


Accurately Calculating AUC (Area Under the Curve) in SAS for Data Rank-Ordered by a Binary Classifier

To calculate the AUC for a SAS data set that has already been rank ordered by a binary classifier (such as linear logistic regression), we need the binary outcome Y and a rank-order measurement, P_0 or P_1 (for class 0 and class 1, respectively). PROC NPAR1WAY produces the Wilcoxon rank sum statistics for that measurement, and from these we can obtain an accurate AUC for the data.

The relationship between the AUC and the Wilcoxon rank sum test statistics is:

AUC = (W - W0) / (N1 * N0) + 0.5

where N1 and N0 are the numbers of observations in class 1 and class 0, W is the Wilcoxon rank sum, and W0 is the expected sum of ranks under H0 (the observations are randomly ordered).
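The short derivation below is a sketch of why this relationship holds. It assumes that W is the rank sum of class 1 when observations are ranked in ascending order of the class 1 score, and it introduces U (the Mann-Whitney statistic) as a helper symbol that does not appear in the original post.

```latex
\begin{align*}
W_0 &= \frac{N_1 (N_1 + N_0 + 1)}{2}
  && \text{expected rank sum of class 1 under } H_0 \\
U   &= W - \frac{N_1 (N_1 + 1)}{2}
  && \text{pairs in which class 1 outranks class 0 (ties count } \tfrac12\text{)} \\
\mathrm{AUC} &= \frac{U}{N_1 N_0} \\
\frac{W - W_0}{N_1 N_0} + \frac12
  &= \frac{W - \frac{N_1 (N_1 + 1)}{2} - \frac{N_1 N_0}{2}}{N_1 N_0} + \frac12
   = \frac{U}{N_1 N_0} - \frac12 + \frac12
   = \mathrm{AUC}
\end{align*}
```

As a small check by hand: with class 0 scores 0.10 and 0.40 and class 1 scores 0.35, 0.80 and 0.90, the class 1 ranks are 2, 4 and 5, so W = 11, W0 = 3*6/2 = 9, and AUC = (11 - 9)/(3*2) + 0.5 = 0.8333. Counting pairs directly gives the same answer, since class 1 outranks class 0 in 5 of the 6 pairs.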

In one application, PROC LOGISTIC reported c = 0.911960, while this method calculated AUC = 0.9119491555.



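%macro AUC(dsn, Target, score);
/* Capture the Wilcoxon rank sums from PROC NPAR1WAY without printing them */
ods select none;
ods output WilcoxonScores=WilcoxonScore;
proc npar1way wilcoxon data=&dsn;
     where &Target^=.;        /* drop observations with a missing outcome */
     class &Target;
     var &score;
run;
ods select all;

/* WilcoxonScore holds one row per class */
data AUC;
    set WilcoxonScore end=eof;
    retain v1 v2 1;
    /* |W - W0| from the first class; abs() makes the sign of the score irrelevant */
    if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
    /* running product of the class sizes, equal to N1*N0 after the last row */
    v2=N*v2;
    if eof then do;
       d=v1/v2;
       Gini=d*2;
       AUC=d+0.5;
       put AUC= Gini=;
       keep AUC Gini;
       output;
    end;
run;
%mend;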
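
Before running the macro on real data, it can help to try it on a data set small enough to check by hand. The sketch below assumes the %AUC macro above has already been compiled; the data set name toy and its five observations are made up for illustration.

```sas
/* Hypothetical toy data: 2 observations in class 0, 3 in class 1 */
data toy;
    input y p_1;
    datalines;
0 0.10
1 0.35
0 0.40
1 0.80
1 0.90
;
run;

/* Class 1 outranks class 0 in 5 of the 6 pairs, so AUC should be 5/6 */
%AUC(toy, y, p_1);
```

By hand, the class 1 rank sum is 2 + 4 + 5 = 11 against an expected 9, so the log should show an AUC of about 0.8333 and a Gini of about 0.6667. Because the macro takes the absolute value of the difference, it always reports an AUC of at least 0.5, regardless of the direction of the score.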

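/* Simulate 10,000 observations in which y is more likely to be 1 as x increases */
data test;
  do i = 1 to 10000;
     x = ranuni(1);
     y = (x + rannor(2315)*0.2 > 0.35);
     output;
  end;
run;

/* Fit the logistic model, save the association statistics, and score the data */
ods select none;
ods output Association=Asso;
proc logistic data = test desc;
    model y = x;
    score data = test out = predicted;
run;
ods select all;

/* Write the c-statistic reported by PROC LOGISTIC to the log */
data _null_;
     set Asso;
     if Label2='c' then put 'c-stat=' nValue2;
run;

/* Compute the AUC from the Wilcoxon rank sums for comparison */
%AUC( predicted, y, p_0);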

NPAR1WAY gives AUC = 0.91766634744
LOGISTIC reports c-statistic = 0.917659

So, which one is more accurate? I would say NPAR1WAY, because we can use yet another procedure, PROC FREQ, to verify the Gini value, which is 2*(AUC-0.5). The Gini index is called Somers' D in PROC FREQ. Here, the Gini value calculated from NPAR1WAY is 0.83533269487, the same as the Somers' D C|R (since the column variable is the predictor) reported by PROC FREQ:



proc freq data=test noprint;
     tables y*x/ measures;
     output out=_measures measures;
run;

data _null_;
     set _measures;
     put _SMDCR_=;
run; 
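
One possible explanation for the small gap, offered as an assumption rather than something the post verifies: depending on the SAS release, PROC LOGISTIC may group the predicted probabilities into bins before computing its association statistics. The sketch below uses the BINWIDTH= option of the MODEL statement to request no binning. The data set name AssoExact is made up for illustration.

```sas
/* Refit the same model, asking for association statistics without binning */
ods select none;
ods output Association=AssoExact;
proc logistic data = test desc;
    model y = x / binwidth=0;
run;
ods select all;

/* Print the c-statistic with more digits for comparison with the macro */
data _null_;
     set AssoExact;
     if Label2='c' then put 'c-stat (no binning)=' nValue2 best16.;
run;
```

If binning is indeed the cause, this c-statistic should move closer to the value from NPAR1WAY. If it does not change, the difference comes from somewhere else.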
