PROC ARBORETUM: a secret weapon for decision trees

This post was kindly contributed by SAS Analysis - go there to comment and to read the full post.

Introduction: Decision trees, such as CHAID and CART, are powerful predictive tools in statistical learning and business intelligence. Starting with SAS® 9.1, the ARBORETUM procedure provides facilities to interactively build and deploy decision trees. Although it is still an experimental procedure, it has comprehensive features for classification and prediction. It is also the foundation of the Decision Tree node in SAS Enterprise Miner.
Method: A common SAS dataset, sashelp.cars, was divided into three parts of roughly equal size: training, validation, and scoring (the dataset named test in the code below). Two models were fitted: one with the target variable Origin treated as nominal, and one with the target variable MSRP treated as interval.
Result: The code below shows how to use PROC ARBORETUM to train, validate, and score datasets with a decision tree. The generated DATA step scoring code is stored in two flat text files.
Conclusion: The ARBORETUM procedure is quick and versatile for applying decision trees to datasets of any size. It is really a secret weapon in the procedure stockpile of SAS.

Reference: Xiangxiang Meng. Using the SGSCATTER Procedure to Create High-Quality Scatter Plots. SAS Global Forum 2010.

/*DIVIDE THE ORIGINAL DATA INTO 3 PARTS: 1:1:1*/
data cars;
set sashelp.cars;
_index=_n_;
run;
proc sort data=cars;by origin;run;
proc surveyselect data=cars samprate=0.3333 out=train;
strata origin /alloc=prop ;
run;
proc sql;
create table cars2 as
select * from cars
where _index not in ( select _index from train)
order by origin /*KEEP THE DATA SORTED FOR THE STRATA STATEMENT BELOW*/
;quit;
proc surveyselect data=cars2 samprate=0.5 out=validation;
strata origin /alloc=prop ;
run;
proc sql;
create table test as
select * from cars2
where _index not in ( select _index from validation)
;quit;
proc datasets library=work nolist;
delete cars2 cars;
run;
quit;
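
/*OPTIONAL CHECK, ADDED AS A SKETCH (NOT PART OF THE ORIGINAL ANALYSIS):
  THE SPLIT IS MEANT TO BE 1:1:1 AND STRATIFIED BY ORIGIN, SO THE THREE
  FREQUENCY TABLES SHOULD SHOW SIMILAR COUNTS FOR EACH LEVEL OF ORIGIN*/
proc freq data=train;
tables origin;
run;
proc freq data=validation;
tables origin;
run;
proc freq data=test;
tables origin;
run;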

/*TARGET VARIABLE: NOMINAL*/
filename code_1 'C:\code_1.txt';
proc arboretum data=train;
target origin / level=nominal;
input MSRP Cylinders Length Wheelbase MPG_City MPG_Highway Invoice Weight Horsepower/ level=interval;
input EngineSize/level=ordinal;
input DriveTrain Type /level=nominal;
assess validata=validation;
code file=code_1;
score data=test out=scorecard outfit=scorefit;
save IMPORTANCE=imp1 MODEL=mymodel NODESTATS=nodstat1 RULES=rul1 SEQUENCE=seq1 SIM=sim1 STATSBYNODE= statb1 SUM=sum1
;
run;
quit;
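
/*OPTIONAL INSPECTION, ADDED AS A SKETCH: THE SAVE AND SCORE STATEMENTS ABOVE
  WRITE ORDINARY SAS DATASETS, SO THEY CAN BE PRINTED LIKE ANY OTHER TABLE.
  THE COLUMNS THEY CONTAIN ARE NOT DESCRIBED IN THIS POST, SO NO VAR
  STATEMENT IS ASSUMED HERE*/
proc print data=imp1;
run;
proc print data=scorefit;
run;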

/*TARGET VARIABLE: INTERVAL*/
filename code_2 'C:\code_2.txt';
proc arboretum data=train;
target MSRP / level=interval;
input Cylinders Length Wheelbase MPG_City MPG_Highway Weight Horsepower/ level=interval;
input EngineSize/level=ordinal;
input DriveTrain Type origin /level=nominal;
assess validata=validation;
code file=code_2;
score data=test out=scorecard2 outfit=scorefit2;
save IMPORTANCE=imp2 MODEL=mymodel2 NODESTATS=nodstat2 RULES=rul2 SEQUENCE=seq2 SIM=sim2 STATSBYNODE= statb2 SUM=sum2
;
run;
quit;
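
/*DEPLOYMENT SKETCH, ADDED FOR ILLUSTRATION: THE CODE STATEMENT WRITES DATA
  STEP SCORING CODE TO A TEXT FILE, WHICH CAN BE INCLUDED INSIDE A DATA STEP
  TO SCORE A TABLE WITHOUT RERUNNING PROC ARBORETUM.
  ASSUMPTIONS: THE FILEREFS CODE_1 AND CODE_2 DEFINED ABOVE ARE STILL
  ASSIGNED; SCORED_1 AND SCORED_2 ARE HYPOTHETICAL OUTPUT DATASET NAMES*/
data scored_1;
set test;
%include code_1;
run;
data scored_2;
set test;
%include code_2;
run;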
