Tag: SAS

What is SAS Studio? RIP Enterprise Guide?

by Andrew Ratcliffe • June 23, 2015 • Comments Off

In recent weeks I’ve mentioned SAS Studio in passing but it’s a strategic product for SAS and it merits a proper description.It’s strategic because there is a clear trend for SAS to produce “thin” web-based user interfaces rather than “thick” applicati…

Two alternative ways to query large dataset in SAS

by Charlie H • June 19, 2015 • Comments Off

I really appreciate those wonderful comments on my SAS posts by the readers (1, 2, 3). They gave me a lot of inspirations. Due to SAS or SQL’s inherent limitation, recently I feel difficult in deal with some extremely large SAS datasets (it means that I exhausted all possible traditional ways). Here I conclude two alternative solutions in these extreme cases as a follow-up to the comments.

Read Directly
- Use a scripting language such as Python to Reading SAS datasets directly
Code Generator
- Use SAS or other scripting languages to generate SAS/SQL codes

The examples still use sashelp.class, which has 19 rows. The target variable is weight.

*In SAS
data class;
    set sashelp.class;
    row = _n_;
run;

Example 1: Find the median

SQL Query

In the comment, Anders SköllermoFebruary wrote

Hi! About 1. Calculate the median of a variable:

If you look at the details in the SQL code for calculation the median, then you find that the intermediate file is of size N*N obs, where N is the number of obs in the SAS data set.

So this is OK for very small files. But for a file with 10000 obs, you have an intermediate file of size 100 million obs. / Br Anders Anders Sköllermo Ph.D., Reuma and Neuro Data Analyst

The SQL query below is simple and pure, so that it can be ported to any other SQL platform. However, just like what Anders said, it is just way too expensive.

*In SAS
proc sql;
   select avg(weight) as Median
   from (select e.weight
       from class e, class d
       group by e.weight
       having sum(case when e.weight = d.weight then 1 else 0 end)
          >= abs(sum(sign(e.weight - d.weight)))
    );
quit;

PROC UNIVARIATE

In the comment, Anonymous wrote:

I noticed the same thing – we tried this on one of our ‘smaller’ datasets (~2.9 million records), and it took forever.

Excellent solution, but maybe PROC UNIVARIATE will get you there faster on a large dataset.

Indeed PROC UNIVARIATE is the best solution in SAS to find the median, which utilizes SAS’s built-in powers.

Read Directly

When the extreme cases come, say SAS cannot even open the entire dataset, we may have to use the streaming method to Reading the sas7bdat file line by line. The sas7bdat format has been decoded by Java, R and Python. Theoretically we don’t need to have SAS to query a SAS dataset.

Heap is an interesting data structure, which easily finds a min or a max. ream the values, we could build a max heap and a min heap to cut the incoming stream into half in Python. The algorithm looks like a heap sorting. The good news is that it only Reading one variable each time and thus saves a lot of space.

#In Python
import heapq
from sas7bdat import SAS7BDAT
class MedianStream(object):
    def __init__(self):
        self.first_half = [] # will be a max heap
        self.second_half = [] # will be a min heap, 1/2 chance has one more element
        self.N = 0

    def insert(self, x):
        heapq.heappush(self.first_half, -x)
        self.N += 1
        if len(self.second_half) == len(self.first_half):
            to_second, to_first = map(heapq.heappop, [self.first_half, self.second_half])
            heapq.heappush(self.second_half, -to_second)
            heapq.heappush(self.first_half, -to_first)
        else:
            to_second = heapq.heappop(self.first_half)
            heapq.heappush(self.second_half, -to_second)

    def show_median(self):
        if self.N == 0:
            raise IOError('please use the insert method first')
        elif self.N % 2 == 0:
            return (-self.first_half[0] + self.second_half[0]) / 2.0
        return -self.first_half[0]

if __name__ == "__main__": 
    stream = MedianStream()
    with SAS7BDAT('class.sas7bdat') as infile:
        for i, line in enumerate(infile):
            if i == 0:
                continue
            stream.insert(float(line[-1]))
    print stream.show_median()

99.5

Example 2: Find top K by groups

SQL Query

This query below is very expensive. We have a self-joining O(N^2) and a sorting O(NlogN), and the total time complexity is a terrible O(N^2 + Nlog(N)).

* In SAS
proc sql; 
    select a.sex, a.name, a.weight, (select count(distinct b.weight) 
            from class as b where b.weight >= a.weight and a.sex = b.sex ) as rank 
    from class as a
    where calculated rank <= 3
    order by sex, rank
;quit;

Code Generator

The overall thought is break-and-conquer. If we synthesize SAS codes from a scripting tool such as Python, we essentially get many small SAS codes segments. For example, the SQL code below is just about sorting. So the time comlexity is largely decreased to O(Nlog(N)).

# In Python
def create_sql(k, candidates):
    template = """
    proc sql outobs = {0};
    select *
    from {1}
    where sex = '{2}' 
    order by weight desc
    ;
    quit;"""
    for x in candidates:
        current = template.format(k, 'class', x)
        print current
if __name__ == "__main__":
    create_sql(3, ['M', 'F'])


    proc sql outobs = 3;
    select *
    from class
    where sex = 'M' 
    order by weight desc
    ;
    quit;

    proc sql outobs = 3;
    select *
    from class
    where sex = 'F' 
    order by weight desc
    ;
    quit;

Read Directly

This time we use the data structure of heap again in Python. To find the k top rows for each group, we just need to prepare the min heaps with the k size for each group. With the smaller values popped out everytime, we finally get the top k values for each group. The optimized time complexity is O(Nlog(k))

#In Python
from sas7bdat import SAS7BDAT
from heapq import heappush, heappop

def get_top(k, sasfile):
    minheaps = [[], []]
    sexes = ['M', 'F']
    with SAS7BDAT(sasfile) as infile:
        for i, row in enumerate(infile):
            if i == 0:
                continue
            sex, weight = row[1], row[-1]
            i = sexes.index(sex)
            current = minheaps[i]
            heappush(current, (weight, row))
            if len(current) > k:
                heappop(current)
    for x in minheaps:
        for _, y in x:
            print y

if __name__ == "__main__":
    get_top(3, 'class.sas7bdat')


[u'Robert', u'M', 12.0, 64.8, 128.0]
[u'Ronald', u'M', 15.0, 67.0, 133.0]
[u'Philip', u'M', 16.0, 72.0, 150.0]
[u'Carol', u'F', 14.0, 62.8, 102.5]
[u'Mary', u'F', 15.0, 66.5, 112.0]
[u'Janet', u'F', 15.0, 62.5, 112.5]

Example 3: Find Moving Window Maxium

At the daily work, I always want to find three statistics for a moving window: mean, max, and min. The sheer data size poses challenges.

In his blog post, Liang Xie showed three advanced approaches to calculated the moving averages, including `PROC EXPAND`, `DATA STEP` and `PROC SQL`. Apparently `PROC EXPAND` is the winner throughout the comparison. As conclusion, self-joining is very expensive and always O(N^2) and we should avoid it as much as possible.

The question to find the max or the min is somewhat different other than to find the mean, since for the mean only the mean is memorized, while for the max/min the locations of the past min/max should also be memorized.

Code Generator

The strategy is very straightforward: we choose three rows from the table sequentially and calculate the means. The time complexity is O(k*N). The generated SAS code is very lengthy, but the machine should feel comfortable to Reading it.

In addition, if we want to save the results, we could insert those maximums to an empty table.

# In Python
def create_sql(k, N):
    template = """
    select max(weight)
    from class
    where row in ({0})
    ;"""
    SQL = ""
    for x in range(1, N - k + 2):
        current = map(str, range(x, x + 3))
        SQL += template.format(','.join(current))
    print "proc sql;" + SQL + "quit;"
if __name__ == "__main__":
    create_sql(3, 19)


proc sql;
    select max(weight)
    from class
    where row in (1,2,3)
    ;
    select max(weight)
    from class
    where row in (2,3,4)
    ;
    select max(weight)
    from class
    where row in (3,4,5)
    ;
    select max(weight)
    from class
    where row in (4,5,6)
    ;
    select max(weight)
    from class
    where row in (5,6,7)
    ;
    select max(weight)
    from class
    where row in (6,7,8)
    ;
    select max(weight)
    from class
    where row in (7,8,9)
    ;
    select max(weight)
    from class
    where row in (8,9,10)
    ;
    select max(weight)
    from class
    where row in (9,10,11)
    ;
    select max(weight)
    from class
    where row in (10,11,12)
    ;
    select max(weight)
    from class
    where row in (11,12,13)
    ;
    select max(weight)
    from class
    where row in (12,13,14)
    ;
    select max(weight)
    from class
    where row in (13,14,15)
    ;
    select max(weight)
    from class
    where row in (14,15,16)
    ;
    select max(weight)
    from class
    where row in (15,16,17)
    ;
    select max(weight)
    from class
    where row in (16,17,18)
    ;
    select max(weight)
    from class
    where row in (17,18,19)
    ;quit;

Read Directly

Again, if we want to further decrease the time complexity, say O(N), we have to use better data structure, such as queue. SAS doesn’t have queue, so we may switch to Python. Actually it has two loops which adds up to O(2N). However, it is still better than any other methods.

# In Python
from sas7bdat import SAS7BDAT
from collections import deque

def maxSlidingWindow(A, w):
    N = len(A)
    ans =[0] * (N - w + 1)
    myqueue = deque()
    for i in range(w):
        while myqueue and A[i] >= A[myqueue[-1]]:
            myqueue.pop()
        myqueue.append(i)
    for i in range(w, N):
        ans[i - w] = A[myqueue[0]]
        while myqueue and A[i] >= A[myqueue[-1]]:
            myqueue.pop()
        while myqueue and myqueue[0] <= i-w:
            myqueue.popleft()
        myqueue.append(i)
    ans[-1] = A[myqueue[0]]
    return ans

if __name__ == "__main__":
    weights = []
    with SAS7BDAT('class.sas7bdat') as infile:
        for i, row in enumerate(infile):
            if i == 0:
                continue
            weights.append(float(row[-1]))

    print maxSlidingWindow(weights, 3)

[112.5, 102.5, 102.5, 102.5, 102.5, 112.5, 112.5, 112.5, 99.5, 99.5, 90.0, 112.0, 150.0, 150.0, 150.0, 133.0, 133.0]

Conclusion

While data is expanding, we should more and more consider three things –

Time complexity: we don’t want run data for weeks.

Space complexity: we don’t want the memory overflow.

Clean codes: the colleagues should easily Reading and modify the codes.

SAS Increasingly Embracing Virtuality & Cloud (and Hadoop) #SASGF15

by Andrew Ratcliffe • June 17, 2015 • Comments Off

Last week’s UK SAS Forum encouraged me to dig-out my notes from this year’s SAS Global Forum and type them up into some blog posts.One of the notable trends at this year’s SAS Global Forum was cloud and virtuality, both providing SAS sites with increas…

Using Data Science for the Greater Good #SASGF15

by Andrew Ratcliffe • June 15, 2015 • Comments Off

One of the keynote sessions at this year’s SAS Global Forum (SGF) was by Jake Porway on the subject of using data for social good. Jake is CEO of DataKind – a charitable organisation established in 2011 to harness the power of data science for the bene…

Estimating Analytics Software Market Share by Counting Books

by Bob Muenchen • June 9, 2015 • Comments Off

Below is the latest update to The Popularity of Data Analysis Software. Books The number of books published on each software package or language reflects its relative popularity. Amazon.com offers an advanced search method which works well for all the software except R … Continue reading →

Indian Baby Population in US by State

by CN Sharad • June 8, 2015 • Comments Off

Folks, I was going thru few visualizations on Tableau’s website today and I came across this visualization Exploring the SSA Baby Names Dataset by one of the acclaimed Tableau professional….It made me thinking to explore of How many of those SSA baby names are of Indian American (Desi) descent…

I found a website online that had a list of popular Indian baby names…I read the data into SAS and made a Tableau Story out of it… Please take a few moments to play with this interesting viz…I hope you like it….

Exploring Indian Baby Names from SSA Data

Here’s the SAS code that went into the prep of the data…

/*Read Indian Baby Names by parsing the http URL */

libnamesharad “C:UsersSharadDesktopnamesbystate”;

%macroloop(type);

proc sql; drop table name; quit;

%doi=1 %to 60;

filename foo url

“http://www.modernindianbabynames.com/modern_baby_name/starting_with/ANY/MF/Sikh/1560/&i.”;

data _null;

retain start recind recst recend hier;

length SN Name Meaning Gender Origin $ 100;

retain SN Name Meaning Gender Origin;

infile foo length=len;

input record $varying200.len;

put record $varying200.len;

if index(record,‘) then start=1;

if index(record,‘

‘) then start=0;

if index(record,‘) then delete;

if index(record,‘
‘
) then do; recvalst=1; hier+1;; end;

if index(record,‘

‘) then do; recvalend=1; delete; end;

if index(record,‘ ‘ ) then do; recst=1; hier=0;delete; end;

if index(record,‘

‘) then do; recst=0; hier=0;; end;

if hier=1then do; record=tranwrd(record,‘
‘
,”); SN=strip(record); end;

else if hier=2then do; record=tranwrd(record,‘
‘
,”); Name=strip(record); end;

else if hier=3then do; record=tranwrd(record,‘
‘
,”); Meaning=strip(record); end;

else if hier=4then do; record=tranwrd(record,‘
‘
,”); Gender=strip(record); end;

else if hier=5then do; record=tranwrd(record,‘
‘
,”); Origin=strip(record); end;

record=tranwrd(record,‘
‘
,”);

record=tranwrd(record,‘

‘,”);

record=strip(record);

if index(record,‘

‘) and start then do; recend=1; hier=0; output; end;

else delete;

keep SN Name Meaning Gender Origin;

run;

OPTION SPOOL;

proc append data=_null base=sharad.&type force; run;

%end;

%mendloop;

%loop(Hindi);

Make a list of Indian Names that definetly sound Indian or Closely Indian

Y – Yes

P – Indian Possibility

dataSharad.Def_IndiaNames;

infilecards4 dlm=’09’xmissover;

lengthName $ 100 IndianorNot $ 1;

inputName IndianorNot;

Name=strip(propcase(Name));

cards;

Name Indian

Tina P

Tanya P

Maya P

Trisha Y

Nadia P

Amir P

Aisha P

Tanisha P

Chandra P

Chaya P

Rohan Y

—-and 1000’s of other records—

;

run;

Join all available Indian Names

dataSharad.ALLNames;

setSharad.telugu

sharad.bengali sharad.hindi sharad.sikh;

Name=translate(Name,”,“‘”);

ifcompress(Name)=” thendelete;

dropSNO SN;

run;

Remove Dups

procsort data=Sharad.ALLNames noduprecs; by Name; run;

Re-purpose the data a bit

dataSharad.IndianNames(rename=(dMeaning=Meaning dGender=IGender dOrigin=Origin));

lengthdMeaning $ 100 dGender $15dOrigin $ 100;

retaindMeaning dGender dOrigin;

setSharad.ALLNames;

byName;

if first.name then

do;

dMeaning=”;

dOrigin=”;

dGender=”;

end;

ifindex(strip(dMeaning),strip(Meaning)) eq 0then dMeaning=catx(‘ OR ‘,strip(dMeaning),strip(Meaning));

ifindex(strip(dOrigin),strip(Origin)) eq 0then dOrigin=catx(‘ ,’,strip(dOrigin),strip(Origin));;

ifindex(strip(dGender),strip(Gender)) eq 0then dGender=catx(‘ OR ‘,strip(dGender),strip(Gender));;

ifdGender in (“Boy OR Girl”,“Girl OR Boy”) thendGender=“Boy OR Girl”;

dGender=strip(dGender);

if last.name thenoutput;

keepName dMeaning dGender dOrigin;

run;

Read US Gov SSA Baby Names data fields

filenameallst “C:UsersSharadDesktopnamesbystateallallstates.txt”;

dataSharad.USNames;

infileallst dlm=‘,’dsd missoverfirstobs=2;

lengthState $ 2 Gender $1Year $4 Name $ 50;

inputState Gender Year Name Occurences;

run;

Merge US Gov SSA Baby Names data with Indian Names Data

procsql;

createtable sharad.IndNames as

selectA.*,IGender,Meaning,Origin

fromSharad.USNames A

leftjoin Sharad.IndianNames B

onA.name=B.name

orderby A.name;

quit;

Merge US Gov SSA Baby Names data with Hand picked Indian Data

procsql;

createtable sharad.DefinitelyIndian as

selectA.*,

case

whenA.name=B.name and IndianorNot=‘Y’then ‘Indian Name’

whenA.name=B.name and IndianorNot=‘P’then ‘Likely an Indian Name’

else‘Non-Indian Name’

endas IndianDescent length=10

fromsharad.IndNames A

leftjoin Sharad.Def_IndiaNames B

onA.name=B.name

;

quit;

9.4, What’s New – Becoming Newer Month-by-Month

by Andrew Ratcliffe • June 8, 2015 • Comments Off

If you’re attending SAS Forum UK in London this week (or Manchester later this month), you might not be expecting to use it as a means to find out what’s new in V9.4, after all V9.4 was released in summer 2013, so we can’t really call it “new” any…

saslib: a simple Python tool to lookup SAS metadata

by Charlie H • June 3, 2015 • Comments Off

saslib is an HTML report generator to lookup the metadata (or the head information) like PROC CONTENTS in SAS.

It reads the sas7bdat files directly and quickly, and does not need SAS installed.
Emulate PROC CONTENTS by jQuery and DataTables.
Extract the meta data from all SAS7bdat files under the specified directory.
Support IE(>=10), firefox, chrome and any other modern browser.

Installation

pip install saslib

saslib requires sas7bdat and jinjia2.

Usage

The module is very simple to use. For example, the SAS data sets under the SASHELP library could be viewed —

from saslib import PROCcontents

sasdata = PROCcontents('c:/Program Files/SASHome/SASFoundation/9.3/core/sashelp')
sasdata.show()

The resulting HTML file from the codes above will be like here.

Page 20 of 115

« 1 … 18 19 20 21 22 … 115 »

Example 1: Find the median

SQL Query

PROC UNIVARIATE

Read Directly

Example 2: Find top K by groups

SQL Query

Code Generator

Read Directly

Example 3: Find Moving Window Maxium

Code Generator

Read Directly

Conclusion

While data is expanding, we should more and more consider three things – Time complexity: we don’t want run data for weeks. Space complexity: we don’t want the memory overflow. Clean codes: the colleagues should easily Reading and modify the codes.

Installation

Usage

While data is expanding, we should more and more consider three things –

Time complexity: we don’t want run data for weeks.

Space complexity: we don’t want the memory overflow.

Clean codes: the colleagues should easily Reading and modify the codes.