This post was kindly contributed by SAS Users - go there to comment and to read the full post. |
SAS SQL handles missing values differently than the ANSI standard for SQL. PROC SQL follows the SAS convention for handling missing values: numerical missing values are always interpreted as less or smaller than all nonmissing values. My first blog showed that missing values can be troublemakers in non-grouped descriptive statistics. This blog now focuses on how to handle tables if they contain missing values.
In a single table
Missing values can lead to unexpected results while working with only one table, for example, in WHERE conditions, when sorting data using ORDER BY, and when grouping data using GROUP BY. (See the following examples.)
WHERE
If a column contains missing values, a WHERE condition can lead to undesirable results under certain circumstances. If you want to query all S0666 values smaller than 1.5 in the example code below, then a simple WHERE would also include missing values.
To keep S0666 values less than 1.5, but without missing values, you can extend the WHERE condition by adding IS NOT MISSING
Example: WHERE
proc sql ; select * from SASHELP.SYR1001 where S0666 < 1.5 and S0666 is not missing ; quit ;
ORDER BY
When sorting data in ascending order, PROC SQL places missing values before data of type numeric or string. If that ORDER BY does not contain any other column to sort by, the values in other columns are only sorted if you expressly do so. If your data contain missing values, you may have to add further sorting variables to ORDER BY after the column containing missing values, for example, T and S0502 in the example below.
Example: ORDER BY
proc sql ; select * from SASHELP.SYR1001 order by S0666, T, S0502 ; quit ;
GROUP BY
Missing values have subtle effects that may lead to fake statistics. The following example demonstrates the effect that missing values have on a seemingly simple calculation of means. To demonstrate, I replaced the missing values in column S0666 with 0s and saved the new column as S0666_0. The means of both S0666 and S0666_0 are calculated based on the sum of all available values divided by the number of nonmissing values data rows.
Example: GROUP BY
proc sql ; select T, S0666, avg(S0666) as S0666_MEAN, case when S0666 is missing then 0 else S0666 end as S0666_0, avg(calculated S0666_0) as S0666_0_MEAN from SASHELP.SYR1001 ; quit ; |
Output
S0666_ S0666_0_ T S0666 mean S0666_0 mean 52 1.52000 1.75444 1.52000 0.90229 53 1.84000 1.75444 1.84000 0.90229 54 1.96000 1.75444 1.96000 0.90229 55 . 1.75444 0.00000 0.90229 56 . 1.75444 0.00000 0.90229 57 . 1.75444 0.00000 0.90229 58 . ...
Because the denominator for S0666_0_mean is larger, the resulting mean is (not surprisingly) lower in the output. However, S0666_mean also returns means for rows that did not contribute values to its calculation (only for the denominator, if you will).
When joining tables
If you are about to join tables, you need to check whether the keys of the tables to be joined contain missing values. And if they contain missing values, you need to check if they are of the same type. SAS defines missing values in different ways. Unfortunately, a numerical column may contain different types of missing values at the same time. Missing numerical values and missing strings are handled completely differently.
Even if your data contain only one type of numerical or string missing value, you may be in for a big surprise. In the following example, the data sets ONE and TWO are identical except for the names of the columns in TWO. Each second and third row have missing values in the key; the following example highlights its effect.
Data ONE | Data TWO |
---|---|
data ONE ; input ID A B C ; datalines; 01 1 2 3 . 1 2 3 . 99 99 99 05 1 2 3 ; run ; |
data TWO ; input ID D E F; datalines; 01 1 2 3 . 1 2 3 . 99 99 99 05 1 2 3 ; run ; |
Example: Join
proc sql ; create table FJ as select a.ID, A,B,C,D,E,F from ONE a full join TWO b on a.ID=b.ID order by ID ; quit ;
Output
ID A B C D E F . 1 2 3 99 99 99 . 99 99 99 1 2 3 . 99 99 99 99 99 99 . 1 2 3 1 2 3 1 1 2 3 1 2 3 5 1 2 3 1 2 3
The missing values in the key of ONE and TWO generate a Cartesian product. Each missing value of ONE is combined with each missing value of TWO (the same effect applies to non-missing values as well). With 2 times 2 missing values, two missing values are enough to multiply the associated data rows. Please note also the limited effect of ORDER BY in the presence of missing values. Beware if you may have more than one missing value in keys and are about to join one or more tables.
Have you had unexpected results from missing values in your data when using PROC SQL? You can learn more about effects caused by missing values in Chapter 2 of my book Advanced SQL with SAS.
Unexpected results from missing values with PROC SQL was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post. |