Two macros for detecting data errors

This post was kindly contributed by SAS Users - go there to comment and to read the full post.

Last year, I wrote a blog demonstrating how to use the %Auto_Outliers macro to automatically identify possible data errors. This blog demonstrates a different approach—one that is useful for variables for which you can identify reasonable ranges of values for each variable. For example, you would not expect resting heart rates below 40 or over 100 or adult heights below 45 inches or above 84 inches. Although values outside those ranges might be valid values, it would be a good idea to check those out-of-range values to see if they are real or data errors.

In the third edition of my book, Cody’s Data Cleaning Techniques, I present two macros: %Errors and %Report. These two macros provide a consolidated error report.

To demonstrate these two macros, I created a data set called Patients. A listing of the first 10 observations is shown below:

Notice that the unique identifier is called Patno, and you see three variables HR (heart rate), SBP (systolic blood pressure), and DBP (diastolic blood pressure).

The calling arguments for the %Errors macro are:

  • VAR=     A variable for which you have pre-defined bounds
  • Low=     The lowest reasonable value for this variable
  • High=    The highest reasonable value for this variable
  • Missing=Error or Missing=Ignore. The default for this argument is IGNORE, but it is still good practice to include it in the call so that anyone reading your program understands how missing values are being handled.

Because you might be calling this macro for many variables, the values of two macro variables &Dsn (data set name) and &IDVar (the identifying variable such as Patno or Subj) are assigned values once, using two %Let statements. You can then call the %Errors macro for each variable of interest. When you are finished, call the %Report macro to see a consolidated report of your possible errors.

Here is an example:

/* Set values for &Dsn and %IDVar with %LET statements */
%let Dsn = Clean.Patients;   
%let Idvar = Patno;
 
%Errors(Var=HR, Low=40, High=100, Missing=error)
%Errors(Var=SBP, Low=80, High=200, Missing=ignore)
%Errors(Var=DBP, Low=60, High=120)   
 
/* When you are finished selecting variables, create the report */
%Report

You are reporting all heart rates below 40 or above 100 (and considering missing values as errors); values of SBP below 80 or above 200 (and ignoring missing values); and values of DBP below 60 or above 120 (also ignoring missing values—using the default value of IGNORE).

Here is the result:

Notice that several patients with missing values for HR are flagged as errors. I should point out that I have violated a cardinal rule of macro programming: never write a macro that has no arguments—it should be a program. However, I liked the idea of calling %Errors and then %Report. Shown below are the two macros:

/****************************************************************
| PROGRAM NAME: ERRORS.SAS  in c:\Books\Cleans\Patients         |
| PURPOSE: Accumulates errors for numeric variables in a SAS    |
|          data set for later reporting.                        |
|          This macro can be called several times with a        |
|          different variable each time. The resulting errors   |
|          are accumulated in a temporary SAS data set called   |
|          Errors.                                              |
| ARGUMENTS: Dsn=    - SAS data set name (assigned with a %LET) |
|            Idvar=  - Id variable (assigned with a %LET)       |
|                                                               |
|            Var     = The variable name to test                |
|            Low     = Lowest valid value                       |
|            High    = Highest valid value                      |
|            Missing = IGNORE (default) Ignore missing values   |
|                      ERROR Missing values flagged as errors   |
|                                                               |
| EXAMPLE: %let Dsn = Clean.Patients;                           |
|          %let Idvar = Patno;                                  |
|                                                               |
|          %Errors(Var=HR, Low=40, High=100, Missing=error)     |
|          %Errors(Var=SBP, Low=80, High=200, Missing=ignore)   |
|          %Errors(Var=DBP, Low=60, High=120)                   |
|          Test the numeric variables HR, SBP, and DBP in data  |
|          set Clean.patients for data outside the ranges       |
|          40 to 100, 80 to 200, and 60 to 120 respectively.    |
|          The ID variable is PATNO and missing values are to   |
|          be flagged as invalid for HR but not for SBP or DBP. |
****************************************************************/
%macro Errors(Var=,    /* Variable to test     */
              Low=,    /* Low value            */
              High=,   /* High value           */
              Missing=IGNORE 
                       /* How to treat missing values         */
                       /* Ignore is the default.  To flag     */
                       /* missing values as errors set        */
                       /* Missing=error                       */);
data Tmp;
   set &Dsn(keep=&Idvar &Var);
   length Reason $ 10 Variable $ 32;
   Variable = "&Var";
   Value = &Var;
   if &Var lt &Low and not missing(&Var) then do;
      Reason='Low';
      output;
   end;
 
   %if %upcase(&Missing) ne IGNORE %then %do;
      else if missing(&Var) then do;
         Reason='Missing';
         output;
      end;
   %end;
 
   else if &Var gt &High then do;
      Reason='High';
      output;
      end;
      drop &Var;
   run;
 
   proc append base=Errors data=Tmp;
   run;
 
%mend Errors;

The basic idea for the %Errors macro is to test each variable and, if it is a possible error, use PROC APPEND to add it to a data set called Errors. When the first error is detected, PROC APPEND creates the data set Errors. From then on, each observation in data set Tmp is added to data set Errors.

Most of this macro is straightforward. For those readers who are not that comfortable with macro programming, the third section (beginning with %if %upcase(&Missing)) is executed only when the value of the macro variable &Missing is not equal to IGNORE.

Below is a listing of the %Report macro:

%macro Report;
   proc sort data=Errors;
      by &Idvar;
   run;
 
   proc print data=Errors;
      title "Error Report for Data Set &Dsn";
      id &Idvar;
      var Variable Value Reason;
   run;
 
   proc delete data=Errors Tmp;
   run;
 
%mend Report;

The %Report macro is mainly a PROC PRINT of the temporary data set Errors. I added PROC DELETE to delete the two temporary data sets Error and Tmp.

You can cut and paste these macros, or you can download all of the macros, programs, and data sets from Cody’s Data Cleaning Techniques Using SAS®, Third Edition, by going to support.sas.com/Cody, search for the book, then click Example Code and Data. You do not have to buy the book to download all the files (although I would be delighted if you did). This is true for all of my books published by SAS Press.

Comments and/or corrections are always welcome.

Two macros for detecting data errors was published on SAS Users.

This post was kindly contributed by SAS Users - go there to comment and to read the full post.