How to split one data set into many

This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post.

Back in the day when the prison system forced inmates to perform “hard labor”, folks would say (of someone in prison): “He’s busy making little ones out of big ones.” This evokes the cliché image of inmates who are chained together, forced to swing a chisel to break large rocks into smaller rocks. (Yes, it seems like a pointless chore. Here’s a Johnny Cash/Tony Orlando collaboration that sets it to music.)

SAS programmers are often asked to break large data sets into smaller ones. Conventional wisdom says that this is also a pointless chore, since you can usually achieve what you want (that is, process a certain subset of data) by applying a WHERE= option or a FIRSTOBS=/OBS= combination. Splitting a data set creates more files, which occupy more disk space and force more I/O operations. I/O and disk access are often the most expensive parts of your SAS processing, performance-wise.
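As a minimal sketch of that conventional wisdom, the two steps below process a subset of SASHELP.CARS in place, without writing any new tables. The choice of PROC MEANS, the MSRP variable, and the 101-200 observation range are only illustrative assumptions.

```sas
/* Subset by value: only rows where Origin is 'Asia' reach the procedure */
proc means data=sashelp.cars(where=(origin='Asia'));
 var msrp;
run;

/* Subset by position: FIRSTOBS= is the first row read, OBS= is the last, */
/* so this prints observations 101 through 200                            */
proc print data=sashelp.cars(firstobs=101 obs=200);
run;
```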

But if the boss asks for broken-up data sets, you might as well spend the least possible effort on the task. Let’s suppose that you need to break up a single data set into many based on the value of one of the data columns. For example, if you need to break SASHELP.CARS into different tables based on the value of Origin, the SAS program would look like:

DATA out_Asia;
 set sashelp.cars(where=(origin='Asia'));
run;
DATA out_Europe;
 set sashelp.cars(where=(origin='Europe'));
run;
DATA out_USA;
 set sashelp.cars(where=(origin='USA'));
run;

I’m going to admit right now that this isn’t the most efficient or elegant method, but it’s something that most beginning SAS programmers could easily come up with.
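For comparison, here is one common alternative, sketched under the assumption that the three Origin values are known in advance: a single DATA step that names all three output tables and reads SASHELP.CARS only once.

```sas
/* One pass over the input, three output tables */
data out_Asia out_Europe out_USA;
 set sashelp.cars;
 select (origin);
   when ('Asia')   output out_Asia;
   when ('Europe') output out_Europe;
   when ('USA')    output out_USA;
   otherwise;      /* silently skip any unexpected value */
 end;
run;
```

The trade-off is the same as before: every value still has to be typed by hand, once in the DATA statement and once in a WHEN clause.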

Writing the above program is easy, especially since there are only 3 different values for Origin and I’ve memorized their values. But if there are more distinct values for the “split-by” column, the task could involve much more typing and carries a high risk of error. This is when I usually use PROC SQL to generate the code for me.

If you’ve read my article about implementing BY processing for an entire SAS program, you know that you can use PROC SQL and SELECT INTO to place data values from a data set into a macro variable. For example, consider this simple program:

proc sql;
 select distinct ORIGIN into :valList separated by ',' from SASHELP.CARS;
quit;

It creates a macro variable VALLIST that contains the comma-separated list: “Asia,Europe,USA”.
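To check what was captured, you can write the macro variable to the log. The sketch below also builds a quoted variant of the list, which can be dropped into an IN operator. The name quotedList is hypothetical, and the example assumes the PROC SQL step above has already run.

```sas
proc sql noprint;
 /* STRIP removes padding, then QUOTE wraps each value in double quotes */
 select distinct quote(strip(ORIGIN)) into :quotedList separated by ','
   from SASHELP.CARS;
quit;

/* Write both lists to the log for inspection */
%put valList=&valList;
%put quotedList=&quotedList;

/* Use the quoted list in a WHERE= filter; OBS=5 just keeps the listing short */
proc print data=SASHELP.CARS(where=(ORIGIN in (&quotedList)) obs=5);
run;
```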

But we can use SAS functions to embellish that output, and create additional code statements that weave the data values into SAS program logic. For example, we can use the CAT function to combine the values that we query from the data set with SAS keywords. The results are complete program statements, which can then be referenced/executed in a SAS macro program. I’ll share my final program, and then I’ll break it down a little bit for you. Here it is:

/* define which libname.member table, and by which column */
%let TABLE=sashelp.cars;
%let COLUMN=origin;
 
proc sql noprint;
/* build a mini program for each value */
/* create a table with valid chars from data value */
select distinct 
   cat("DATA out_",compress(&COLUMN.,,'kad'),
   "; set &TABLE.(where=(&COLUMN.='", &COLUMN.,
   "')); run;") into :allsteps separated by ';' 
  from &TABLE.;
quit;
 
/* macro that includes the program we just generated */
%macro runSteps;
 &allsteps.;
%mend;
 
/* and...run the macro when ready */
%runSteps;

Here are the highlights from the PROC SQL portion of the program:

SELECT DISTINCT ensures that the results include just one record for each unique value of the variable.
The CAT function concatenates a set of string values together. Note that CATX and CATS and CATT — other variations of this function — will trim out white space from the various string elements. In this case I want to keep any blank characters that occur in the data values because we’re using those values in an equality check.
The program calculates a name for each output data set by using each data value as a suffix (“OUT_dataValue”). SAS data set names can contain only letters, digits, and underscores (and cannot begin with a digit), so I use the COMPRESS function to purge any other characters from the data value. The ‘kad’ modifiers on COMPRESS tell it to keep only alphabetic and digit characters.
The resulting program statements all end up in the &ALLSTEPS macro variable. I could just reference the &ALLSTEPS variable in the body of the SAS program, and SAS would run it as-is. Instead I chose to wrap it in the macro %runSteps. This makes it a little bit easier to control the scope and placement of the executable SAS program statements.
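Two quick checks can make the generated code easier to trust. The sketch below first shows what COMPRESS with the 'kad' modifiers does to a made-up value (the string 'Land Rover (UK)' is purely illustrative), and then inspects the generated statements. It assumes the program above has already run, so that ALLSTEPS and %runSteps exist.

```sas
/* COMPRESS with 'kad' keeps only letters and digits */
data _null_;
 name = compress('Land Rover (UK)',,'kad');
 put name=;   /* writes name=LandRoverUK to the log */
run;

/* Print the generated statements without resolving or executing them */
%put %superq(allsteps);

/* Echo the macro-generated code to the log as it executes */
options mprint;
%runSteps;
options nomprint;
```

Note that calling %runSteps again re-creates the output tables, so skip that part if you only want to look at the code.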

“By each value of a variable” is just one criterion that you might use for splitting a data set. I’ve seen cases where people want to split the data based on other rules, such as:

Quantity of observations (split a 3-million-record table into three 1-million-record tables)
Rank or percentiles (based on some measure, put the top 20% in its own data set)
Time span (break up a data set by year or month, assuming the data records contain a date or datetime variable)
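As one possible adaptation for the time-span case, here is a minimal sketch that splits a table by calendar year. It assumes the SASHELP.STOCKS sample table with its DATE variable is available. The names DATEVAR, yearSteps, and runYearSteps are hypothetical, chosen only for this illustration.

```sas
/* Hypothetical adaptation: one output table per calendar year */
%let TABLE=sashelp.stocks;
%let DATEVAR=date;

proc sql noprint;
/* CAT formats the numeric year and strips its blanks, */
/* giving names such as out_1986                       */
select distinct
   cat("DATA out_", year(&DATEVAR.),
   "; set &TABLE.(where=(year(&DATEVAR.)=", year(&DATEVAR.),
   ")); run;") into :yearSteps separated by ' '
  from &TABLE.;
quit;

/* wrap the generated steps in a macro, then run it */
%macro runYearSteps;
 &yearSteps.
%mend;
%runYearSteps;
```

Because the year is numeric, the generated WHERE clause compares it without quotes. As with the original program, all of the generated text has to fit in a single macro variable, so this approach suits a modest number of distinct years.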

With a small modification, my example program can be adapted to serve any of these purposes. What about you? Are you ever asked to split up SAS data sets, and if so, based on what criteria? Leave a comment and tell us about it.

