This post was kindly contributed by DATA ANALYSIS - go there to comment and to read the full post. |
This is a following post after my previous post about SAS/SQL.
SAS’s SQL procedure has a basic SQL syntax. I found that the most challenging work is to use PROC SQL to solve the TOP N (or TOP N by Group) questions. Comparing with other modern database systems, PROC SQL is lack of –
-
The ranking functions such as
RANK()
or theSELECT TOP
clause such asselect TOP 3 *
from class
; -
The
partition by
clause such asselect sex, name, weight
from (select sex, name, max(weight) over(partition by sex) max_weight
from class)
where weight = max_weight
;
However, there are always some alternative solutions in SAS. I list a few question from an ascending difficulty below to explore the possibilities.
Prepare the data
First a SASHELP.CLASS dataset is used as a demo (availabe for every SAS copy). It is a small weight and height dataset from a faked class of 19 children. Now I only keep the
weight
variable as target column.data class;
set sashelp.class;
keep name sex weight;
run;
proc sort;
by descending weight;
run;
Name | Sex | Age | Weight |
---|---|---|---|
Philip | M | 16 | 150 |
Ronald | M | 15 | 133 |
Robert | M | 12 | 128 |
Alfred | M | 14 | 112.5 |
Janet | F | 15 | 112.5 |
Mary | F | 15 | 112 |
William | M | 15 | 112 |
Carol | F | 14 | 102.5 |
Henry | M | 14 | 102.5 |
John | M | 12 | 99.5 |
Barbara | F | 13 | 98 |
Judy | F | 14 | 90 |
Thomas | M | 11 | 85 |
Jane | F | 12 | 84.5 |
Alice | F | 13 | 84 |
Jeffrey | M | 13 | 84 |
James | M | 12 | 83 |
Louise | F | 12 | 77 |
Joyce | F | 11 | 50.5 |
1. Select highest value
It is straightforward to use the
outobs
option at the begining to single out the highest weight. title "Select highest weight overall";
proc sql outobs = 1;
select name, weight
from class
order by weight desc
;quit;
Name | Weight |
---|---|
Philip | 150 |
2. Select second highest value
How about the second highest weight? The logic is simple — if we remove the highest weight first, then the second highest weight will take the first row.
title "Select second highest weight overall";
proc sql outobs = 1;
select name, weight
from class
where weight not in (select max(weight) from class)
order by weight desc
;quit;
Name | Weight |
---|---|
Ronald | 133 |
3. Select Nth highest value
Now it comes to the hard part. How about the Nth highest value, say, the fourth highest weight? Now we have to do a self-joining to let the distinct value point to 3. Since there are two children with the weight 112.5, the query returns the two tied names.
title "Select Nth highest weight";
%let n = 4;
proc sql;
select distinct a.name, a.weight
from class as a
where (select count(distinct b.weight)
from class as b
where b.weight > a.weight
) = &n - 1;
quit;
Name | Weight |
---|---|
Alfred | 112.5 |
Janet | 112.5 |
4. Select highest values by group
There are two groups Male and Female in the class, and the easiest way to find the highest weight for each category is
select max for female union select max for male
. However, a more scalable solution is to use the group by
clause that fits more than two groups. title "Select highest weights by group";
proc sql;
select sex, name, weight
from class
group by sex
having weight = max(weight)
;quit;
Sex | Name | Weight |
---|---|---|
F | Janet | 112.5 |
M | Philip | 150 |
5. Rank all values
The ultimate solution to solve all the question above is to derive a
rank
column for the target. There are two solutions: the first one use a subquery in the select
clause, while the second one utilizes a subquery in the where
clause. The subquery in the first solution is independent to the main query, which uses less codes and is easier to recall in practice. The second one is actually a self-joining that is faster than the first solution.
/* Solution I */
proc sql;
select name, weight, (select count(distinct b.weight)
from class as b where b.weight >= a.weight) as Rank
from class as a
order by rank
;quit;
/* Solution II */
proc sql;
select a.name, a.weight, count(b.weight) as rank
from class as a, (select distinct weight
from class
) as b
where a.weight <= b.weight
group by a.name, a.weight
order by a.weight desc
;quit;
Name | Weight | Rank |
---|---|---|
Philip | 150 | 1 |
Ronald | 133 | 2 |
Robert | 128 | 3 |
Alfred | 112.5 | 4 |
Janet | 112.5 | 4 |
Mary | 112 | 5 |
William | 112 | 5 |
Henry | 102.5 | 6 |
Carol | 102.5 | 6 |
John | 99.5 | 7 |
Barbara | 98 | 8 |
Judy | 90 | 9 |
Thomas | 85 | 10 |
Jane | 84.5 | 11 |
Alice | 84 | 12 |
Jeffrey | 84 | 12 |
James | 83 | 13 |
Louise | 77 | 14 |
Joyce | 50.5 | 15 |
6. Select top N values by group
Once with the
rank
column at hand, many perplexing problems could be easily solved. For example, we can use it to find the top 3 heaviest people for each category of male and female. And it is also scalable to more than two groups. title "Select Top N weights by group";
proc sql;
select a.sex, a.name, a.weight, (select count(distinct b.weight)
from class as b where b.weight >= a.weight and a.sex = b.sex ) as rank
from class as a
where calculated rank <= 3
order by sex, rank
;quit;
Sex | Name | Weight | rank |
---|---|---|---|
F | Janet | 112.5 | 1 |
F | Mary | 112 | 2 |
F | Carol | 102.5 | 3 |
M | Philip | 150 | 1 |
M | Ronald | 133 | 2 |
M | Robert | 128 | 3 |