Using SAS to find the best k for k-Nearest Neighbor classification

This post was kindly contributed by SAS ANALYSIS - go there to comment and to read the full post.

Least-square (regression) and nearest-neighbor are the most fundamental methodologies for supervised classification [Ref. 1]. Even though they are pretty old, they are still popular and widely used in academia and industry. There is a trade-off in comparison of the two methods: least-square brings low variance but high bias; nearest-neighbor leads into low bias but high variance. k-NN does not ask for many stringent presumptions. And it is easy to develop and apply: you just need to find some nearest neighbors for the unclassified point to and give the density estimation.

Two classic textbooks, ESL [Ref. 1] for statistics students and Pattern Recognition [Ref. 2] for engineering students, both introduced k-NN at their first chapters, by using either R or Matlab. For SAS, there is no official document about implementing k-NN so far. However, Liang [Ref. 3] discovered that some procedures such as, Proc Discrim and Proc Loess, can be tweaked for k-NN classification and eventually scoring. Since k-NN is one of the underlying division rules for some of the clustering procedures, such as Proc Modeclus and Proc Discrim, in SAS/STAT, those clustering procedures can be extended from unsupervised clustering to supervised modeling.

In k-NN, if with very large N(like infinite), a larger k usually gives better performance. However, for small values of N(like the real-world data size), a larger k is not always the best choice [Ref. 2]. Thus, the number of how many neighbors are selected is crucial to implement k-NN predicative modeling. In the example below, two SAS 9.2’s help datasets were used: SASHELP.IRIS to predict species and SASHELP.CARS to predict origin. The first one is a well-known dataset for data mining exercises, which is easily classified. The second one is a difficult dataset that is not a testing data and hard to predict by any means. A macro for proportion partition was created first to separate train and score datasets. User-defined scoring function was applied based on Proc Discrim for iteration and generating misclassification rate. In general, k-NN has the satisfying classification result, at least not worse than other methods. For the ‘easy’ dataset, the best k could be from between 13 to 19. For the ‘difficult’ one, though the best k is 23, the performance of 23-nearest-neighbor is almost like that of 1-nearest-neighbor. As ESL [Ref. 1] mentioned, for difficult datasets, more k would not give more inches for improvement and 1-nearest-neighbor can be an economic solution. Even though cross-validation is needed to finalize the best value of such a tuning parameter, the result shows that SAS could be a productive platform to apply k-NN for predicative modeling.

References:
1. Trevor Hastie, Robert Tibshirani, Jerome H. Friedma. The elements of statistical learning: data mining, inference, and prediction, 5th. Springer, 2011
2. Sergios Theodoridis, Konstantinos Koutroumbas. Pattern recognition. Elsevier, 2009
3. Liang Xie. ‘How to implement K-Nearest-Neighbor in SAS’. http://www.sascommunity.org/wiki/

/*******************READ ME*********************************************
* - USING SAS TO FIND THE BEST K FOR K-NEAREST NEIGHBOR CLASSIFICATION -
*
* VERSION: SAS 9.2(ts2m0), windows 64bit
* DATE: 07apr2011
* AUTHOR: hchao8@gmail.com
*
****************END OF READ ME*****************************************/

****************(1) MODULE-BUILDING STEP******************;
******(1.1) CREATE A SAMPLING MACRO*************;
%macro partbyprop(dsin = , targetv = , samprate = , seed = );
/***********************************************************
* MACRO: partbyprop()
* OBJECTIVE: partition dataset by proportion of targetv
* PARAMETERS: dsin = input dataset
* targetv = target variable
* seed = random seed for sampling
***********************************************************/
title "Test on SAS dataset: &dsin";
proc sort data = &dsin out = _tmp1;
by &targetv;
run;

proc surveyselect data = _tmp1 samprate = &samprate
out = _tmp2 seed = &seed outall noprint;
strata &targetv / alloc = prop;
run;

data train score;
set _tmp2;
if selected = 0 then output train;
else output score;
run;

proc datasets nolist;
delete _:;
quit;
%mend;

******(1.2) CREATE A MACRO FOR KNN CLASSIFICATION*************;
option mstored sasmstore = sasuser;
%macro knn_macro / store source;
/***********************************************************
* MACRO: knn_macro()
* OBJECTIVE: a knn classification macro by proc discrim
* which also calculates misclassification rate
* PARAMETERS: k = number of nearest neighbours
* targetv = target variable
***********************************************************/
%let k = %sysfunc(dequote(&k));
%let targetv = %sysfunc(dequote(&targetv));
%let input = %sysfunc(dequote(&input));

%let error = 0;
%if %length(&k) = 0 %then %do;
%put ERROR: Value for K is missing ;
%let error = 1;
%end;
%if %length(&targetv) = 0 %then %do;
%put ERROR: Value for targetv is missing ;
%let error = 1;
%end;
%if %length(&input) = 0 %then %do;
%put ERROR: Value for INPUT is missing ;
%let error = 1;
%end;
%if %sysfunc(exist(train)) = 0 %then %do;
%put ERROR: Dataset TRAIN does not exist ;
%let error = 1;
%end;
%if %sysfunc(exist(score)) = 0 %then %do;
%put ERROR: Dataset SCORE does not exist ;
%let error = 1;
%end;
%if &error = 1 %then %goto finish;

proc discrim data = train test = score testout = _scored
method = npar k = &k noprint;
class &targetv;
var &input;
run;

data _null_;
set _scored nobs = nobs end = eof;
retain count;
if &targetv ne _into_ then count + 1;
if eof then do;
misc = count / nobs;
call symput ('misc', misc);
end;
run;

%finish: ;
%mend;

******(1.3) COMPILE FUNCTION TO EMBED THE MACRO ABOVE*********;
proc fcmp outlib = sasuser.knn.funcs;
function knn(k, targetv $, input $);
/***********************************************************
* FUNCTION: misc = knn(k, targetv, input)
* OBJECTIVE: pass values to macro and get return
* INPUT: k = number of nearest neighbours
* targetv = target variable
* input = input variable
* OUTPUT: misc = misclassification rate
***********************************************************/
rc = run_macro('knn_macro', k, targetv, input, misc);
if rc eq 0 then return(misc);
else return(.);
endsub;
run;

******(1.4) CREATE A TESTING MACRO*************;
%macro findk(targetv = , input = , maxiter = , plotpath = );
/***********************************************************
* MACRO: findk()
* OBJECTIVE: iterate and plot
* PARAMETERS: targetv = target variable
* input = input variable
* maxiter = maximum number of k value
* plotpath = output path for ods images
***********************************************************/
option cmplib = (sasuser.knn) mstored sasmstore = sasuser;
data _test;
targetv = "&targetv";
input = "&input";
do k = 1 to &maxiter;
misc_rate = knn(k, targetv, input);
output;
end;
run;

proc sql noprint;
select min(misc_rate) into: min_misc
from _test;
select k into: bestk separated by ', '
from _test
having misc_rate = min(misc_rate);
;quit;

ods html style = harvest gpath = "&plotpath";
proc sgplot data = _test;
series x = k y = misc_rate;
xaxis grid values = ( 1 to &maxiter by 2)
label = 'kth neareast neighbours';
yaxis grid values = ( 0 to 0.5 by 0.1)
label = 'Misclassification rate';
refline &min_misc / transparency = 0.3
label = "k = &bestk";
run;
ods html close;
%mend;

****************END OF STEP (1)******************;

****************(2) TESTING STEP******************;
******(1.1) TEST THE DIFFICULT DATASET*************;
%partbyprop(dsin = sashelp.cars, targetv = origin, samprate = 0.5, seed = 20110406);
%findk(targetv = origin, input = invoice wheelbase length , maxiter = 40,
plotpath = h:\);

******(1.2) TEST THE EASY DATASET*************;
%partbyprop(dsin = sashelp.iris, targetv = species, samprate = 0.5, seed = 20110406);
%findk(targetv = species, input = petallength petalwidth sepallength sepalwidth ,
maxiter = 40, plotpath = h:\);

****************END OF STEP (2)******************;

****************(3) VALIDATION STEP******************;
proc modeclus data = sashelp.iris m = 1 k = 13 out = _test1;
var petallength petalwidth sepallength sepalwidth;
run;

ods html style = navy gpath = 'h:\';
proc sgplot data = _test1;
scatter y = petallength x = petalwidth
/ group = species markerchar = cluster;
run;
ods html close;

****************END OF STEP (3)******************;
****************END OF ALL CODING***************************************;

This post was kindly contributed by SAS ANALYSIS - go there to comment and to read the full post.