3 Tips for Testing Large Data on Slow Connections or Escaping Upper Hell

This post was kindly contributed by Business Intelligence Notes for SAS® BI Users - go there to comment and to read the full post.

Do you ever find yourself wondering if you are paying for sins from a past life while you wait for SAS Enterprise Guide to show your data?  If you review Dante’s Circles of Hell, I’m pretty sure this situation was one of the things he was trying to describe in ancient language.  There is no way the agony of waiting for data to appear with a looming deadline did not make his list! Often it seems the closer the deadline, the slower the connection.  When working with data locally, the data appears in moments.  However, when you are connected over a slow network connection (maybe sprinkle a little VPN on top of it), it may take minutes to see your results. Add those minutes up over an entire day and that is a lot of time spent staring at the screen.  So here are some techniques I have used to help overcome the issue.

[Image: suffering slow connections]

Tip 1: Use Less Data

Depending on where you are in the coding process, sometimes you just need to verify that a join worked correctly or that the code does not drop any rows when it goes through the program.  There are several ways to limit the observations:

Use Related Data

Limit the dataset to certain related rows.  For instance, all order IDs for a certain day, all account numbers that start with 411, or only Canadian customers.  This is especially useful if you are testing a join or want to watch a record go through the process.  Once you know the value the target record is supposed to return, it is easier to check for the correct results, and with less data your program will process more quickly.

Sometimes I set up a macro variable that contains a WHERE statement.  For instance, it might contain the orders I want to follow through the data.  The advantage of this method is that I can use the macro variable in several different files in case I am checking the orders from different datasets.  You can set the macro variable to nothing (%LET wherestmt=;) without causing an error.  This is useful when you need to test between different runs and want to ensure you are always looking at the same orders.

[Figure: macro variable containing a WHERE statement]
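A minimal sketch of the idea (the dataset and column names here are illustrative, not from the original post):

```sas
/* Store a reusable WHERE statement in a macro variable */
%let wherestmt = where order_id in (1001, 1002, 1003);

/* Use it in PROC SQL ... */
proc sql;
   select *
   from work.orders
   &wherestmt;
quit;

/* ... and reuse the identical subset in a DATA step against another dataset */
data work.details_subset;
   set work.order_details;
   &wherestmt;
run;

/* Clear the filter between runs without causing an error */
%let wherestmt = ;
```

Because both steps expand the same macro variable, you are guaranteed to be following the same orders through every dataset you check.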

Limit the Observations (Rows)

Maybe you just need to see the process and don’t really care about a particular category.  There are several ways to limit the amount of data without adding a where clause.  One easy way to limit the data is to set the number of rows that process using the OBS option. [SAS documentation: OBS Option]

Here are some of the ways I use OBS:

  • Issue an OPTIONS statement for the number of rows I want.  (Tip: Use OBS=MAX to reset the option.)
  • Add OUTOBS to PROC SQL   
  • Add OBS to the PROC SQL input dataset  
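The three approaches can be sketched against SASHELP.CARS, a sample dataset that ships with SAS:

```sas
/* 1. Global option: every following step reads at most 100 observations */
options obs=100;
proc print data=sashelp.cars;
run;
options obs=max;   /* reset so later steps see all rows again */

/* 2. OUTOBS= limits the rows PROC SQL outputs (logs a warning when it stops early) */
proc sql outobs=100;
   select make, model, msrp
   from sashelp.cars;
quit;

/* 3. OBS= dataset option limits the rows read from that one input dataset */
proc sql;
   select make, model, msrp
   from sashelp.cars(obs=100);
quit;
```

Note the difference in scope: #1 affects every step until you reset it, #2 caps only the output, and #3 caps the input before any joins or summaries happen.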

[Figure: log output showing where OBS takes effect]

Each of these methods has its limitations and uses.  I used a screen capture of the log to help you understand that it makes a difference where the OBS is used. For instance, with #2 I get a warning in the log that makes it seem like I’ve broken a secret SAS rule. However, it is just making me aware that it did not return all rows, which is exactly what I told it to do.

Also notice that when the OBS is placed on the input dataset (#3), it limits the number of records used from the beginning.  This could impact your testing if you want to see a join work; there may not be enough rows to complete the join, depending on what you are doing.  On the other hand, if the dataset has 500k rows, you could limit the OBS to 1000 and still verify the code works.

Tip 2: Don’t Create – Select!

In SAS Enterprise Guide, I notice that when I click on the Data tab, it can take forever for the data to return.  However, if I want to see Output, it’s lightning fast.  SAS EG shows output as soon as the job finishes, while showing data requires retrieving it over the connection.  So why create a dataset if you really just need to see results?

I started using PROC SQL without the CREATE TABLE statement so I could see the output or explore the data more quickly.  By removing the CREATE TABLE statement, the results are simply returned to the Results tab as a report.  This method has its limits as well; for instance, you cannot display eleventy-million rows without SAS complaining.  So use one of the methods from Tip 1 to help out.

[Figure: CREATE TABLE vs. plain SELECT in PROC SQL]
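Here is a sketch of the difference, again using the SASHELP.CARS sample dataset (the MSRP cutoff is arbitrary):

```sas
/* Creating a table means EG must retrieve the data to display it */
proc sql;
   create table work.pricey_cars as
   select make, model, msrp
   from sashelp.cars
   where msrp > 60000;
quit;

/* Dropping CREATE TABLE sends the rows straight to the Results tab,
   and OUTOBS= (from Tip 1) keeps SAS from complaining about volume */
proc sql outobs=50;
   select make, model, msrp
   from sashelp.cars
   where msrp > 60000
   order by msrp desc;
quit;
```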

Tip 3: Use PROC FREQ to Judge Output

Similar to the above tip, learn to use SAS procedures such as PROC FREQ, PROC MEANS, and PROC TABULATE to see results.  This method works when you need to understand how the data was categorized.  The added advantage is that the results are returned quickly (much like Tip 2).

Ensuring Variables Are Populated (Without the Scroll, Scroll, Scroll and Wait)

Maybe you just want to ensure that all of the years were captured in the data.  Imagine if you had to scroll through a large dataset to determine if there was an entry for each week!  If you have done it, then you are familiar with the dance.  It’s something like drag the slider down so you can see more data and then wait for the data to catch up.  Or if you scroll too far, it takes even longer, and I have not found a way to break out of it, so I just practice my cursing. [May the fleas of a 1,000 camels … ]

But here’s a quicker method. If you use PROC FREQ and format the date variable to year, you can quickly see a result that verifies data exists for each year.  You can expand this method to check several categories at a time. [More PROC FREQ help from Guido!]

[Figure: PROC FREQ output with the date formatted to year]
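A sketch of the technique, using the SASHELP.STOCKS sample dataset that ships with SAS (monthly closing prices for a few companies):

```sas
/* Count observations per year by formatting the date on the fly;
   no new dataset or variable is created */
proc freq data=sashelp.stocks;
   tables date / nocum;
   format date year4.;   /* YEAR4. rolls the monthly dates up into years */
run;

/* Crossing two variables checks several categories at once */
proc freq data=sashelp.stocks;
   tables stock*date / nocum nopercent;
   format date year4.;
run;
```

A gap in the year column, or a zero cell in the crosstab, jumps out immediately instead of hiding somewhere in a scroll of raw rows.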

Find Missing Data

PROC FORMAT allows you to do some really handy testing. Sometimes you just need to verify that a value is populated.  In the example, all we care about is whether a stock price was posted for the month.  We don’t care what the stock price was; we just need to know that one is there.   Use PROC FORMAT to quickly identify the missing values.  In the following example, any value from 0 on up is captured by the LOW – HIGH range, and likewise anything missing falls under Missing.  Notice in the following figure how the format changes the output: now you quickly know you need to verify what happened in February to Big Blue. [Excellent paper from PharmaSUG 2006.]

[Figure: PROC FREQ output using the custom format to flag missing stock prices]
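A sketch of the approach against SASHELP.STOCKS (the format name CHKMISS is made up for this example; note that for numeric formats the LOW keyword does not include missing values, so missing needs its own range):

```sas
/* Collapse every actual value into 'Present' and flag missing separately */
proc format;
   value chkmiss
      low - high = 'Present'
      .          = 'Missing';
run;

/* Crosstab of month by present/missing; the price itself never appears */
proc freq data=sashelp.stocks;
   tables date*close / nopercent nocum;
   format close chkmiss. date monyy7.;
run;
```

Any month whose row shows a count under 'Missing' (or no count under 'Present') is the one to go investigate.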

Other Tips

You have to keep these techniques in your toolbox and use them when it makes sense.  Let me know any techniques you regularly use in the Comments below.

 

Image:  Gustave Doré’s illustrations to Dante’s Inferno, Plate LXV: Canto XXXI 

