Unexpected results from missing values with PROC SQL

This post was kindly contributed by SAS Users - go there to comment and to read the full post.

SAS SQL handles missing values differently than the ANSI standard for SQL. PROC SQL follows the SAS convention for handling missing values: numerical missing values are always interpreted as less or smaller than all nonmissing values. My first blog showed that missing values can be troublemakers in non-grouped descriptive statistics. This blog now focuses on how to handle tables if they contain missing values.

In a single table

Missing values can lead to unexpected results while working with only one table, for example, in WHERE conditions, when sorting data using ORDER BY, and when grouping data using GROUP BY. (See the following examples.)

WHERE

If a column contains missing values, a WHERE condition can lead to undesirable results under certain circumstances. If you want to query all S0666 values smaller than 1.5 in the example code below, then a simple WHERE would also include missing values.

To keep S0666 values less than 1.5, but without missing values, you can extend the WHERE condition by adding IS NOT MISSING

Example: WHERE

proc sql ;
  select * 
  from SASHELP.SYR1001
  where S0666 &lt; 1.5 
            and S0666 is not missing ;
quit ;

ORDER BY

When sorting data in ascending order, PROC SQL places missing values before data of type numeric or string. If that ORDER BY does not contain any other column to sort by, the values in other columns are only sorted if you expressly do so. If your data contain missing values, you may have to add further sorting variables to ORDER BY after the column containing missing values, for example, T and S0502 in the example below.

Example: ORDER BY

proc sql ;
  select * 
  from SASHELP.SYR1001
  order by S0666, T, S0502 ;
quit ;

GROUP BY

Missing values have subtle effects that may lead to fake statistics. The following example demonstrates the effect that missing values have on a seemingly simple calculation of means. To demonstrate, I replaced the missing values in column S0666 with 0s and saved the new column as S0666_0. The means of both S0666 and S0666_0 are calculated based on the sum of all available values divided by the number of nonmissing values data rows.

Example: GROUP BY

proc sql ;
       select T, S0666, 
	       avg(S0666) as S0666_MEAN,  
                case
		    when S0666 is missing then 0
                         else S0666 
                  end as S0666_0, 
          avg(calculated S0666_0) as S0666_0_MEAN 
	from SASHELP.SYR1001 ;
quit ;

Output

                   S0666_              S0666_0_
 T      S0666       mean     S0666_0       mean       

 52    1.52000    1.75444    1.52000    0.90229
 53    1.84000    1.75444    1.84000    0.90229
 54    1.96000    1.75444    1.96000    0.90229
 55     .         1.75444    0.00000    0.90229
 56     .         1.75444    0.00000    0.90229
 57     .         1.75444    0.00000    0.90229
 58     .          ...

Because the denominator for S0666_0_mean is larger, the resulting mean is (not surprisingly) lower in the output. However, S0666_mean also returns means for rows that did not contribute values to its calculation (only for the denominator, if you will).

When joining tables

If you are about to join tables, you need to check whether the keys of the tables to be joined contain missing values. And if they contain missing values, you need to check if they are of the same type. SAS defines missing values in different ways. Unfortunately, a numerical column may contain different types of missing values at the same time. Missing numerical values and missing strings are handled completely differently.

Even if your data contain only one type of numerical or string missing value, you may be in for a big surprise. In the following example, the data sets ONE and TWO are identical except for the names of the columns in TWO. Each second and third row have missing values in the key; the following example highlights its effect.

Data ONE	Data TWO
data ONE ; input ID A B C ; datalines; 01 1 2 3 . 1 2 3 . 99 99 99 05 1 2 3 ; run ;	data TWO ; input ID D E F; datalines; 01 1 2 3 . 1 2 3 . 99 99 99 05 1 2 3 ; run ;

Example: Join

proc sql ; 
create table FJ as select
a.ID, A,B,C,D,E,F
from ONE a full join TWO b
on a.ID=b.ID
order by ID ;
quit ;

Output

ID     A     B     C     D     E     F
 .     1     2     3    99    99    99
 .    99    99    99     1     2     3
 .    99    99    99    99    99    99
 .     1     2     3     1     2     3
 1     1     2     3     1     2     3
 5     1     2     3     1     2     3

The missing values in the key of ONE and TWO generate a Cartesian product. Each missing value of ONE is combined with each missing value of TWO (the same effect applies to non-missing values as well). With 2 times 2 missing values, two missing values are enough to multiply the associated data rows. Please note also the limited effect of ORDER BY in the presence of missing values. Beware if you may have more than one missing value in keys and are about to join one or more tables.

Have you had unexpected results from missing values in your data when using PROC SQL? You can learn more about effects caused by missing values in Chapter 2 of my book Advanced SQL with SAS.

Unexpected results from missing values with PROC SQL was published on SAS Users.