Tag: data cleaning

Two macros for detecting data errors

Last year, I wrote a blog post demonstrating how to use the %Auto_Outliers macro to automatically identify possible data errors. This post demonstrates a different approach, one that is useful for variables for which you can identify a reasonable range of values. For example, you would not expect resting heart […]

Two macros for detecting data errors was published on SAS Users.
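
The range-based idea in the excerpt above can be sketched in a few lines of Base SAS. This is only an illustration, not either of the macros from the post: the data set VITALS, the variable names, and the 40 to 100 limits are all hypothetical.

```sas
/* Hypothetical sample data, for illustration only */
data vitals;
   input patno $ heart_rate;
   datalines;
001 72
002 210
003 .
004 35
;
run;

/* Keep only nonmissing values outside the assumed 40-100 range */
data out_of_range;
   set vitals;
   if not missing(heart_rate) and (heart_rate < 40 or heart_rate > 100);
run;

proc print data=out_of_range;
run;
```

Missing values are skipped here on purpose; whether a missing value counts as an error is a separate decision.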

Finding Possible Data Errors Using the %Auto_Outliers Macro

One of the first and most important steps in analyzing data, whether for descriptive or inferential statistical tasks, is to check for possible errors in your data. In my book, Cody’s Data Cleaning Techniques Using SAS, Third Edition, I describe a macro called %Auto_Outliers. This macro allows you to search […]

Finding Possible Data Errors Using the %Auto_Outliers Macro was published on SAS Users.

Breaking Bad (data) with SAS!

In the Breaking Bad TV series, Walter White has an impressive lab where he secretly makes the illegal drug methamphetamine (meth). Wouldn’t it be cool to use SAS to show the locations of all the clandestine meth labs in the US?!? Let’s do it!… In thi…

5D visualization: from SAS to Google Motion Chart

Three dimensions are usually regarded as the maximum for data presentation. With the opening of ODS in SAS 9.2 and its Graph Template Language, 3D graphing is no longer a perplexing problem for SAS programmers. Nowadays, however, large volumes of multidimensional data call for a more vivid and simpler way to be displayed.

The emergence of Google Motion Chart provides a sound solution for visualizing data in more than three dimensions. This web-based analytical technology originated from Dr. Hans Rosling’s innovation. Dr. Rosling and his Gapminder foundation invented a technology that demonstrates the relationship among multiple dimensions with animated bubbles. They developed many bubble plots on Gapminder’s website to extract knowledge from a bulk of public information, especially for regional and national comparisons. It soon attracted Google’s attention. In 2008, after an agreement between Dr. Rosling and Google’s two founders, Google launched its Motion Chart gadget. People could create a motion chart by using Google Docs, an online alternative to Microsoft Office.

Combining SAS with Google Motion Chart offers a handy and cheap way to visualize data in up to five dimensions. Motion Chart supports five variables in a single plot. Commonly, the data structure requires time (animation), var1 (X axis), var2 (Y axis), var3 (color) and var4 (bubble size). Some correlation among var1 through var4 is expected: usually the bubbles, changing color and size, tend to move along the diagonal line. Overall, a 5D visualization can be rendered within a single plot. In this example, the SAS help data set SASHELP.SHOES is used. The data set has several regions that can be compared with each other. Logged sales are on the X axis and logged returns on the Y axis. A series of virtual dates is assigned to each region, with inventory as bubble size and the number of stores as color. With SAS, the data structure for Motion Chart can be prepared quickly. Once the CSV file is uploaded to Google Docs, a motion chart is ready to be published on any webpage. OK, it’s time to sit back and discover some interesting tendencies…

Reference:
1. ‘Show me: New ways of visualising data’. The Economist. Feb 25th 2010.
2. ‘Making data dance’. The Economist. Dec 11th 2010.
3. Google Docs online help center. 2010.

*********(1) Extract data from SASHELP.SHOES***********;
proc sql;
   create table test as
   select region, Sales, Inventory, Returns, Stores
   from sashelp.shoes
   order by region, sales desc;
quit;
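********(2) Create a sequential virtual date for time************;
data test1;
   /* DOW loop: i restarts at 1 for each region */
   do i=1 by 1 until (last.region);
      set test;
      by region;
      /* Count back one day per row, starting from today */
      time=today()-i+1;
      mytime=put(time, mmddyy8.);
      output;
   end;
   drop i;
run;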
********(3) Transform some variables with log**********;
proc sql;
   create table test2 as
   select region, mytime, log(sales) as logsales, log(returns) as logreturn,
          Stores as storenum, Inventory
   from test1
   order by region, mytime;
quit;
********(4) Export data as CSV***************;
proc export data=test2 outfile='C:\Users\Yanyi\Desktop\test.csv'
   dbms=csv replace;
run;
*******(5) Upload CSV to Google Docs************;
******(6) Create Google Motion Chart manually**********;

**********END*********TEST PASSED 12DEC2010****************************;
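
The expectation that the plotted variables are correlated can be checked before uploading anything. The following is a minimal sketch and not part of the original program; the data set name SHOES_LOG is hypothetical, and the log transforms simply mirror step (3).

```sas
/* Rebuild the logged variables directly from SASHELP.SHOES */
data shoes_log;
   set sashelp.shoes;
   logsales  = log(sales);
   logreturn = log(returns);
run;

/* Pairwise Pearson correlations among the four plotted variables */
proc corr data=shoes_log;
   var logsales logreturn stores inventory;
run;
```

If the correlations are strong, the bubbles should indeed drift along the diagonal; if they are weak, the chart will look more scattered.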

How to predict unemployment rate in US by SAS/ETS

The raw data: here
FILENAME data URL "http://research.stlouisfed.org/fred2/data/UNRATE.txt" DEBUG LRECL=100;
data one;
   infile data;
   input @1 year 4. @6 month 2. @9 day 2. uer;
   DATE = MDY(MONTH,DAY,YEAR);
   FORMAT DATE monyy7.;
   drop year month day;
   if date=. th…

How to generate ICD9 table

data icd;
   infile 'http://within.dhfs.state.wi.us/helpfiles/dlookupbrowse.html' truncover;
   input star code $5. description $100.;
   if anyalpha(substr(code, 5, 1))=1 then addon=substr(code, 5, 1);
run;
data icd1;
   set icd nobs=nobs;
   code=compress(code, addo…

The top12 pharms

1. Data import: input/format/label/keep+array
2. GTL can definitely be reused;
data top12;
   input rank company & $20. country & $15. revenue comma12.2 netincome comma12.2 employee;
   personearn=(netincome/employee)*1000000;
   format revenue dolla…