Top 10 most powerful functions for Proc SQL

This post was kindly contributed by SAS ANALYSIS - go there to comment and to read the full post.


Proc SQL is actually not a standard SAS procedure but a distinctive subsystem with all features from SQL (structured query language). Equipped with it, SAS upgrades to a full-fledging relational database management system. In addition, Proc SQL always provides alternative ways to manage data, besides the traditional Data Step and procedures. SAS also supplies some goodies, such as its functions, to further strengthen SQL operation by Proc SQL.

Here are ten powerful functions for Proc SQL. The corresponding illustrations are given below for each function. Out of the four datasets used for demonstration, two are simulated and the other two are from SAS’s help files.

(1)MONOTONIC(): an equivalent part as the internal variable _n_ in Data Step. In the example, it allows us to freely specify any particular rows. And if needed, it can be transformed to a column representing observation numbers.
(2)COUNT()/N()/FREQ()/NMISS(): those aggregation functions are especially useful in data cleaning. With them, we would be able to see how many missing values exist for each variable, in just one screen.
(3)UNIQUE(): brings the unique values for individual variables. Combing it with Count(), the level number of each variable would be clearly displayed.
(4)COALESCE(): a magic function to combine multiple rows into a single row. In this example, there are two rows about SSNs: normally they should be identical though some of them are missing. To obtain a most complete record, Coalesce() overlaps the two rows together and maximizes the SSN information;
(5)MISSING(): returns a Boolean value(0 when non-missing; 1 when missing). With it, we could easily indentify the overall situation or an observation’s performance regarding missing values.
(6)SPEDIS()/SOUNDEX(): my favorite functions to fulfill fuzzy matching. In the first example, with Spedis(), three pairs of similar SSN are discovered: they may belong to the same persons due to typo or other mistakes. Here 25 means the difference the function will detect. the In the second example, we attempts to find dubious identical names. As the result, John and Jane turn out to be quite similar phonically. If necessary, we can always double-check the other variables, such as sex and age, to confirm that they are indeed two persons.
(7)RANUNI(): makes random sampling easy in Proc SQL. Usually random sampling in Data Step needs a lot of coding. And most programmers prefer the quicker way using Proc Surveyselect. By applying Ranuni() in the Order clause of a SQL query, a small sample of 30 SSNs is produced. Another good thing to use this method is that we can therefore generate more batches of unduplicated samples from the raw data by simply excluding what we have previously sampled.
(8)MAX(): executives the so called largest observation carried forward. In healthcare or pharmaceutical industries, largest observation carried forward is a routine work, which means that in sequential treatments the largest value would remain while other are ignored. In Data Step, array has to be used to answer such a request. Simply implementing the Max() function, the maximum value would be found and shown.
(9)IFC()/IFN(): avoids the conditional statements involving if/then/else or case/when .The difference between them is that Ifc() returns characters and Ifn() returns numbers. While a binary selection is desired, using the two functions saves code and increases flexibility.
(10) PUT(): aggregates and filters data in the Where clause. In this example, we first create a customized format by Proc Format, and later exercising it with Put() filtered out the results we tend to avoid.

Overall, most SAS’s functions can be used in the clauses of Proc SQL. In many cases, combing them together makes Proc SQL more effective and efficient.

*********(0) DATA INPUT**************;
*****(0-1) Simulation to generate a dataset for SSN;
data SSNdata;
do i=1 to 1000;
var1=ceil((ranuni(4321)*1E9));
var2=var1;
format var1 var2 ssn11.;
if var1 le ceil((ranuni(1000)*1E9)) then call missing(var1);
if var2 le ceil((rannor(2000)*1E9)) then call missing(var2);
drop i;
output;
end;
run;
*****(0-2) Simulation to generate a dataset for patients' visit;
data patient;
input id visit Treat1 Treat2 Treat3 Cost ;
cards;
1 1 0 0 0 520
1 2 1 0 0 320
1 3 0 1 0 650
2 1 1 0 0 560
2 2 1 0 0 360
3 1 1 0 0 500
3 2 0 0 1 350
;
run;
*****(0-3) SAS's help dataset since 9.2: sashelp.cars;
*****(0-4) SAS's help dataset since 8.2: sashelp.class;
****************END OF DATA INPUT************;

********(1)MONOTONIC()--SPECIFY ROW NUMBER*******;
proc sql;
title 'Monotonic()--Specify row number';
select monotonic(var1) as Obs_num, *
from SSNdata
where monotonic() between 501 and 1000
;quit;

*********(2) COUNT()/N()/FREQ()/NMISS()--FIND MISSING VALUES ***********;
******* NOTE: n() or freq() has to use an argument;
proc sql;
title'Count()/N()/Freq()/Nmiss()--Find the number of missing values';
select count(*) as total, count(var1) as non_miss_var1,
count(var2) as non_miss_var2, nmiss(var1) as miss_var1,
nmiss(var2) as miss_var2
from SSNdata
/* select N(monotonic()) as total, N(var1) as num_var1, N(var2) as num_var2*/
/* from SSNdata;*/
;quit;

*********(3)UNIQUE() -- FIND THE LEVELS OF CATEGORICAL VARIABLES*********;
proc sql;
title'Unique() -- Find the levels of categorical variables';
select count(unique(make)) as L_make, count(unique(Origin)) as L_origin,
count(unique(type)) as L_type
from sashelp.cars
;quit;

********(4)COALESCE() -- COMBINE COLUMN VALUES************;
proc sql;
title 'Coalesce() -- Combine column values';
select Monotonic() as obs, coalesce (var1, var2) as ssn format=ssn11.
from SSNdata
;quit;

********(5)MISSING() -- RETURN BOOLEAN VALUE FOR ANY MISSING OBS.******;
proc sql ;
title 'Missing() -- Return Boolean value for any missing obs.';
select (case sum(missing(var1), missing(var2)) when 0 then ' no missing'
when 1 then 'one missing '
else 'both missing'
end) as outcome,
count(calculated outcome) as number
from SSNdata
group by calculated outcome
;quit;

********(6)SPEDIS()/SOUNDEX() -- FUZZ MATCHING EVERYWHERE************;
********(6-1) Spedis() finds spelling mistakes;
proc sql;
title 'Spedis()-- List possible duplicates of SSN in a column';
select a.var1 as x , b.var1 as y
from ssndata as a, ssndata as b
where (x gt y) and (spedis(put( x, z11.), put( y, z11.) ) le 25)
;quit;

********(6-2) Soundex() finds phonic mistakes;
proc sql;
title 'Soundex() -- List possible duplicates of names in a column';
select a.name as name1, a.sex as sex1, b.name as name2, b.sex as sex2
from sashelp.class as a, sashelp.class as b
where soundex(name1)=soundex(name2) and (name1 gt name2)
;quit;

**********(7)RANUNI() -- SIMPLE RANDOM SAMPLING***********;
proc sql outobs=30;
title 'Ranuni() -- Simple random sampling';
select coalesce (var1, var2) as sample format=ssn11.
from SSNdata
where sum(missing(var1), missing(var2)) ne 2
order by ranuni(20110126)
;quit;

********(8)MAX() -- LARGEST OBSERVATION CARRIED FORWARD********;
proc sql;
title 'Max() -- largest observation carried forward';
select id, max(treat1) as Effect1, max(treat2) as Effect2, max(treat3) as Effect3
from patient
group by id
;quit;

*********(9)IFC()/IFN() -- BINARY SELECTION FOR CHARATERS AND NUMBERS*********;
proc sql;
title '(9)Ifc() and Ifn() -- binary selection for charaters and numbers';
select id, ifc(max(treat1)=1, 'Yes', 'No') as Effect1 length=3,
avg(cost) as meancost format=dollar8.2,
ifn(calculated meancost ge 450, calculated meancost*0.95,
calculated meancost*1) as discount_cost format=dollar8.2
from patient
group by id
;quit;

***********(10) PUT() -- FILTING IN WHERE STATEMENT WITH FORMAT**************;
proc format;
value price 40000-high='High '
26000 -<40000='Medium'
other ='Low ';
run;

proc sql;
title 'Put() -- Filting in where statement with format';
select MSRP, MSRP as price_range format=price.
from sashelp.cars
where put(MSRP, price.) in ('High ', 'Medium')
order by MSRP;
;quit;

*************END***********TESTED AT 27JAN2011***********;

This post was kindly contributed by SAS ANALYSIS - go there to comment and to read the full post.