Using SAS to find the best k for k-Nearest Neighbor classification

This post was kindly contributed by SAS ANALYSIS - go there to comment and to read the full post.

Least-square (regression) and nearest-neighbor are the most fundamental methodologies for supervised classification [Ref. 1]. Even though they are pretty old, they are still popular and widely used in academia and industry. There is a trade-off in comparison of the two methods: least-square brings low variance but high bias; nearest-neighbor leads into low bias but high variance. k-NN does not ask for many stringent presumptions. And it is easy to develop and apply: you just need to find some nearest neighbors for the unclassified point to and give the density estimation.

Two classic textbooks, ESL [Ref. 1] for statistics students and Pattern Recognition [Ref. 2] for engineering students, both introduced k-NN at their first chapters, by using either R or Matlab. For SAS, there is no official document about implementing k-NN so far. However, Liang [Ref. 3] discovered that some procedures such as, Proc Discrim and Proc Loess, can be tweaked for k-NN classification and eventually scoring. Since k-NN is one of the underlying division rules for some of the clustering procedures, such as Proc Modeclus and Proc Discrim, in SAS/STAT, those clustering procedures can be extended from unsupervised clustering to supervised modeling.

In k-NN, if with very large N(like infinite), a larger k usually gives better performance. However, for small values of N(like the real-world data size), a larger k is not always the best choice [Ref. 2]. Thus, the number of how many neighbors are selected is crucial to implement k-NN predicative modeling. In the example below, two SAS 9.2’s help datasets were used: SASHELP.IRIS to predict species and SASHELP.CARS to predict origin. The first one is a well-known dataset for data mining exercises, which is easily classified. The second one is a difficult dataset that is not a testing data and hard to predict by any means. A macro for proportion partition was created first to separate train and score datasets. User-defined scoring function was applied based on Proc Discrim for iteration and generating misclassification rate. In general, k-NN has the satisfying classification result, at least not worse than other methods. For the ‘easy’ dataset, the best k could be from between 13 to 19. For the ‘difficult’ one, though the best k is 23, the performance of 23-nearest-neighbor is almost like that of 1-nearest-neighbor. As ESL [Ref. 1] mentioned, for difficult datasets, more k would not give more inches for improvement and 1-nearest-neighbor can be an economic solution. Even though cross-validation is needed to finalize the best value of such a tuning parameter, the result shows that SAS could be a productive platform to apply k-NN for predicative modeling.

References:
1. Trevor Hastie, Robert Tibshirani, Jerome H. Friedma. The elements of statistical learning: data mining, inference, and prediction, 5th. Springer, 2011
2. Sergios Theodoridis, Konstantinos Koutroumbas. Pattern recognition. Elsevier, 2009
3. Liang Xie. ‘How to implement K-Nearest-Neighbor in SAS’. http://www.sascommunity.org/wiki/

/*******************READ ME*********************************************
* - USING SAS TO FIND THE BEST K FOR K-NEAREST NEIGHBOR CLASSIFICATION -
*
* VERSION:     SAS 9.2(ts2m0), windows 64bit
* DATE:        07apr2011
* AUTHOR:      hchao8@gmail.com
*
****************END OF READ ME*****************************************/

****************(1) MODULE-BUILDING STEP******************;
******(1.1) CREATE A SAMPLING MACRO*************;
%macro partbyprop(dsin = , targetv = , samprate = , seed = );
   /***********************************************************
   *  MACRO:      partbyprop()
   *  OBJECTIVE:  partition dataset by proportion of targetv
   *  PARAMETERS: dsin = input dataset
   *              targetv = target variable
   *              seed = random seed for sampling
   ***********************************************************/
   title "Test on SAS dataset: &dsin";
   proc sort data = &dsin out = _tmp1;
      by &targetv;
   run;

   proc surveyselect data = _tmp1 samprate = &samprate  
      out = _tmp2 seed = &seed outall noprint;
      strata &targetv / alloc = prop;
   run;

   data train score;   
      set _tmp2;
      if selected = 0 then output train;
      else output score;
   run;

   proc datasets nolist;
      delete _:;
   quit;
%mend;

******(1.2) CREATE A MACRO FOR KNN CLASSIFICATION*************;
option mstored sasmstore = sasuser;
%macro knn_macro / store source;
   /***********************************************************
   *  MACRO:      knn_macro()
   *  OBJECTIVE:  a knn classification macro by proc discrim
   *              which also calculates misclassification rate   
   *  PARAMETERS: k = number of nearest neighbours
   *              targetv = target variable
   ***********************************************************/
   %let k = %sysfunc(dequote(&k));
   %let targetv = %sysfunc(dequote(&targetv));
   %let input = %sysfunc(dequote(&input));

   %let error = 0; 
      %if %length(&k) = 0 %then %do; 
         %put ERROR: Value for K is missing ; 
         %let error = 1; 
      %end; 
      %if %length(&targetv) = 0 %then %do; 
         %put ERROR: Value for targetv is missing ; 
         %let error = 1; 
      %end; 
         %if %length(&input) = 0 %then %do; 
         %put ERROR: Value for INPUT is missing ; 
         %let error = 1; 
      %end; 
      %if %sysfunc(exist(train)) = 0 %then %do; 
         %put ERROR: Dataset TRAIN does not exist ; 
         %let error = 1; 
      %end; 
      %if %sysfunc(exist(score)) = 0 %then %do; 
         %put ERROR: Dataset SCORE does not exist ; 
         %let error = 1; 
      %end; 
   %if &error = 1 %then %goto finish; 

   proc discrim data = train test = score testout = _scored
         method = npar k = &k noprint; 
         class &targetv; 
         var &input; 
   run; 

   data _null_;
      set _scored nobs = nobs end = eof;
      retain count; 
      if &targetv ne _into_ then count + 1;
      if eof then do;
         misc = count / nobs;
         call symput ('misc', misc);
      end;
   run;

   %finish: ; 
%mend;

******(1.3) COMPILE FUNCTION TO EMBED THE MACRO ABOVE*********;
proc fcmp outlib = sasuser.knn.funcs;
   function knn(k, targetv $, input $);
   /***********************************************************
   *   FUNCTION:    misc = knn(k, targetv, input)
   *   OBJECTIVE:   pass values to macro and get return
   *      INPUT:    k = number of nearest neighbours 
   *                targetv = target variable
   *                input = input variable   
   *     OUTPUT:    misc = misclassification rate 
   ***********************************************************/
      rc = run_macro('knn_macro', k, targetv, input, misc);
      if rc eq 0 then return(misc);
      else return(.);
   endsub;
run;

******(1.4) CREATE A TESTING MACRO*************;
%macro findk(targetv = , input = , maxiter = , plotpath = );
   /***********************************************************
   *  MACRO:      findk()
   *  OBJECTIVE:  iterate and plot 
   *  PARAMETERS: targetv = target variable
   *              input = input variable
   *              maxiter = maximum number of k value
   *              plotpath = output path for ods images
   ***********************************************************/
   option cmplib = (sasuser.knn) mstored sasmstore = sasuser;
   data _test;
      targetv = "&targetv";
      input = "&input";
      do k = 1 to &maxiter;
         misc_rate = knn(k, targetv, input);
         output;
      end;
   run;
   
   proc sql noprint;
      select min(misc_rate) into: min_misc
      from _test;
      select k into: bestk separated by ', '
      from _test
      having misc_rate = min(misc_rate);
   ;quit;
 
   ods html style = harvest gpath = "&plotpath";
   proc sgplot data = _test;
      series x = k y = misc_rate;
      xaxis grid values = ( 1 to &maxiter by 2) 
            label = 'kth neareast neighbours';
      yaxis grid values = ( 0 to 0.5 by 0.1)  
            label = 'Misclassification rate';
      refline &min_misc / transparency = 0.3
            label = "k = &bestk";
   run;
   ods html close;
%mend;

****************END OF STEP (1)******************;

****************(2) TESTING STEP******************;
******(1.1) TEST THE DIFFICULT DATASET*************;
%partbyprop(dsin = sashelp.cars, targetv = origin, samprate = 0.5, seed = 20110406);
%findk(targetv = origin, input = invoice wheelbase length , maxiter = 40,
       plotpath = h:\);

******(1.2) TEST THE EASY DATASET*************;
%partbyprop(dsin = sashelp.iris, targetv = species, samprate = 0.5, seed = 20110406);
%findk(targetv = species, input = petallength petalwidth sepallength sepalwidth , 
       maxiter = 40, plotpath = h:\);

****************END OF STEP (2)******************;

****************(3) VALIDATION STEP******************;
proc modeclus data = sashelp.iris m = 1 k = 13 out = _test1;
    var petallength petalwidth sepallength sepalwidth;
run;

ods html style = navy gpath = 'h:\';
proc sgplot data = _test1;
    scatter y = petallength  x = petalwidth 
            / group = species markerchar = cluster;
run;
ods html close;

****************END OF STEP (3)******************;
****************END OF ALL CODING***************************************;

This post was kindly contributed by SAS ANALYSIS - go there to comment and to read the full post.

Post navigation