Splitting a data set into smaller data sets

This post was kindly contributed by SAS Users - go there to comment and to read the full post.

In his blog post, How to split one data set into many, Chris Hemedinger showed how to subset or split SAS data sets based on the values of categorical variables. For example, based on a value of variable REGION you may split a data set MARKETING into MARKETING_ASIA, MARKETING_AMERICA, MARKETING_EUROPE, and so on.

In some cases, however, we need to split a large data set into many – not by a subsetting variable values, but by a number of observations in order to produce smaller, better manageable data sets. Such an approach can be dictated by restrictions on the data set size imposed by hardware (memory size, transmission channel bandwidth etc.), processing time, or user interface convenience (e.g. search results displayed by pages).

We might need to split a data set into smaller tables of K observations or less each; or to split a data set into S equal (or approximately equal) pieces.

We might need to split a data set into sequentially selected subsets where the first K observations go into the first data set, the second K observations go into the second data set, and so on. Alternatively, we might need to randomly select observations from a data set while splitting it into smaller tables.

This blog post provides possible coding solutions for such scenarios.

Splitting a data set into smaller data sets sequentially

Let’s say we need to split a data set SASHELP.CARS (number of observation N=428) into several smaller datasets. We will consider the following two sequential observation selection scenarios:

Each smaller data set should have maximum of K observations.
There should be S smaller data sets of approximately same size.

Ideally, we would like to split a data set into K observations each, but it is not always possible to do as the quotient of dividing the number of observations in the original dataset N by K is not always going to be a whole number. Therefore, we will split it into several smaller data sets of K observations each, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

Similarly, with the scenario 2, we will split the source data set into several smaller data sets of the same size, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

Below is a SAS macro code that covers both these scenarios. Parameter’s SPLIT_DEF value defines the type of scenario (SPLIT_DEF=NOBS means that the source data set is split into several smaller data sets with the number of observations not to exceed SPLIT_NUM value; SPLIT_DEF=SETS means that the source data set is split into SPLIT_NUM=value of smaller data sets).

%macro split (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
/* Parameters:
/*   SRC_DATASET - name of the source data set     */
/*   OUT_PREFIX - prefix of the output data sets   */
/*   SPLIT_NUM - split number                      */
/*   SPLIT_DEF - split definition (=SETS or =NOBS) */
 
   %local I K S TLIST;
 
   /* number of observations &K, number of smaller datasets &S */
   data _null_;
      if 0 then set &SRC_DATASET nobs=N;
      if upcase("&SPLIT_DEF")='NOBS' then
         do;
            call symputx('K',&SPLIT_NUM); 
            call symputx('S',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
         end;
         else if upcase("&SPLIT_DEF")='SETS' then
         do;
            call symputx('S',&SPLIT_NUM); 
            call symputx('K',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
        end;
         else put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
      stop; 
   run;
 
   /* terminate macro if nothing to split */
   %if (&K le 0) or (&S le 0) %then %return;
 
    /* generate list of smaller dataset names */
   %do I=1 %to &S;
      %let TLIST = &TLIST &OUT_PREFIX._&I;
   %end;
 
   /* split source dataset into smaller datasets */
   data &TLIST;
      set &SRC_DATASET;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;
 
%mend split;

The following are examples of the macro invocations:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);

These invocations will produce the following SAS logs:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
***MACRO SPLIT: Incorrect SPLIT_DEF=SET value. Must be either SETS or NOBS.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
***MACRO SPLIT: Splitting into datasets of no more than 100 observations
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_4 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_5 has 28 observations and 15 variables.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);
***MACRO SPLIT: Splitting into 3 datasets
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 142 observations and 15 variables.

Splitting a data set into smaller data sets randomly

For randomly splitting a data set into many smaller data sets we can use the same approach as above with a slight modification. In essence, we are going to randomly shuffle observations of our source data set first, and then apply the sequential splitting.

In order to implement this, we just need to replace the last data step in the above macro with the following 3 steps:

   /* generate random numbers, R */
   data;
      set &SRC_DATASET;
      call streaminit(1234);
      R = rand('uniform');
   run;
 
   /* sort data in R order */
   proc sort;
      by R;
   run;
 
   /* split source dataset into smaller datasets */
   data &TLIST (drop=R);
      set;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;

This modified code will produce similar results (with the same information in the SAS log), however, smaller data sets will have their observations randomly selected from the source data set.

DATAn naming convention

You may have noticed that in this random splitting code I have not specified data set names neither in the DATA statement of the first DATA step, nor in the PROC SORT and not even in the SET statement of the last DATA step. Not only these shortcuts possible due to SAS’ DATAn naming convention, but it is a very robust way of dynamically assigning temporary data set names. This method is especially useful and appropriate for SAS macros as it guarantees that you do not accidentally overwrite a data set with the same name in SAS program that invokes your macro. Think about it: if you are a macro developer you need to make sure that whatever temporary data sets you create within your macro their names must be unique for a SAS session in order not to interfere with any data sets that may be created in the calling SAS program outside of your macro.

Here are defaults in SAS’ DATAn naming convention:

If you do not specify a name for the output data set in a DATA statement, SAS automatically assigns the default names WORK.DATA1, WORK.DATA2, and so on, to each successive data set that you create.
If you do not specify a name for the input data set in a SET statement, SAS automatically uses the last data set that was created. SAS keeps track of the most recently created data set through the reserved name _LAST_. When you execute a DATA or PROC step without specifying an input data set, by default, SAS uses the _LAST_ data set.

For more information on this useful SAS coding technique see special data set names and examples and warning on using special data set names.

Your thoughts?

Do you find this post useful? Have you ever split data sets into smaller ones based on a number of observations? Do you use special data set names and DATAn naming convention in your SAS coding? Please share your thoughts in the comments section below.

Splitting a data set into smaller data sets was published on SAS Users.