Tag: Statistics

Data Science Software Popularity Update

I have recently updated my extensive analysis of the popularity of data science software. This update covers perhaps the most important section, the one that measures popularity based on the number of job advertisements. I repeat it here as a blog post, so you don’t have to read the entire article. Job Advertisements One of …

The post Data Science Software Popularity Update first appeared on r4stats.com.

Summarizing data

Because it is near the end of the year, I thought a blog about “Summarizing” data might be in order. For these examples, I am going to use a simulated data set called Drug_Study, containing some categorical and numerical variables. For those interested readers, the SAS code that I used […]

Summarizing data was published on SAS Users.

Estimating birth date from age

This code demonstrates an algorithm for estimating a birth date from an age recorded on a known date. We cannot recover the exact birth date, but we can get close: the maximum error is about half a year, and the average error is about one quarter of a year.
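The two error figures follow from a short calculation, sketched here under the same assumption the simulation makes (births spread evenly over the year). The symbols B, m, and E are introduced only for this illustration and do not appear in the code.

```latex
\begin{align*}
B &\sim \mathrm{Uniform}(m,\; m+1)
  && \text{true birth date, in years, within a one-year window} \\
\hat{B} &= m + \tfrac{1}{2}
  && \text{estimate: the middle of the window} \\
E &= \lvert B - \hat{B} \rvert \sim \mathrm{Uniform}\!\left(0,\; \tfrac{1}{2}\right)
  && \text{absolute error} \\
\max E &= \tfrac{1}{2}\ \text{year} \approx 183\ \text{days},
  \qquad
  \mathbb{E}[E] = \operatorname{median}(E) = \tfrac{1}{4}\ \text{year} \approx 91\ \text{days}
\end{align*}
```

Because the error is uniform on that interval, its mean and median coincide at one quarter of a year.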



/* The %age macro was taken from the Internet---maybe from here http://support.sas.com/kb/24/808.html ? */
%macro age(date,birth);
floor ((intck('month',&birth,&date) - (day(&date) < day(&birth))) / 12)
%mend age;

/*
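Generate 10000 fake people with random birth dates and random perspective dates
on which their age was measured. Then, calculate age as of that perspective date.
In reality, births are somewhat seasonal (e.g., more births in July), but
here we assume births are equally likely on every day of the year.
Note: %randbetween and %runquit are helper macros that are not defined in this post;
presumably they return a random integer in a range and end the step, respectively.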
*/
data person;
format birth_date submit_date yymmdd10.;
do i = 1 to 10000;
birth_date = %randbetween(19000,20500);
submit_date = birth_date + %randbetween(0,100*365);
age = %age(submit_date, birth_date);
output;
end;
drop i;
%runquit;

/* Work in reverse from age to estimated birth date. */
data reverse;
set person;
format birth_date_min birth_date_max yymmdd10.;
birth_date_min = intnx('years', submit_date, -1 * (age+1), 's') - 1;
birth_date_max = intnx('years',birth_date_min,1,'s') + 1;

/* flag errors: 1 means the true birth date falls outside the estimated range */
min_error = (birth_date < birth_date_min);
max_error = (birth_date > birth_date_max);

/* estimate birth date as the middle of the range */
birth_date_avg = mean(birth_date_min, birth_date_max);

/* calculate the absolute error of the estimate, in days */
abs_days_error = abs(birth_date - birth_date_avg);
%runquit;

/* Both errors should always be zero. */
proc freq data=reverse;
table min_error max_error;
run;

/* Errors of the estimates range from 0 to 183.5 days, with a median of 92 and a mean of 91. */
proc means data=reverse n nmiss min median mean max;
var abs_days_error;
run;

/* Distribution of errors is uniform */
proc sgplot data=reverse;
histogram abs_days_error;
run;

Tested with SAS 9.4M6
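To make the reverse step concrete, here is a minimal single-person sketch. The person (age 30 on 2020-06-15) and the data set name example are hypothetical. Unlike the program above, this sketch uses the tight bounds with no extra day of padding on either side. It is an illustration and is not part of the tested code.

```sas
/* Hypothetical example: one person who was 30 years old on 2020-06-15. */
data example;
  format submit_date birth_date_min birth_date_max birth_date_avg yymmdd10.;
  submit_date = '15JUN2020'd;
  age = 30;

  /* Earliest birth date: this person turns age+1 the day after submit_date. */
  birth_date_min = intnx('year', submit_date, -(age + 1), 's') + 1;

  /* Latest birth date: this person turns age exactly on submit_date. */
  birth_date_max = intnx('year', submit_date, -age, 's');

  /* Estimate the birth date as the middle of the range. */
  birth_date_avg = mean(birth_date_min, birth_date_max);

  /* Write the three dates to the log. */
  put birth_date_min= birth_date_max= birth_date_avg=;
run;
```

For this input, the log should show a range of 1989-06-16 to 1990-06-15 and an estimate of 1989-12-15, which is 182 days from either end of the range.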

For more posts like this, see Heuristic Andrew.

Thomas Bayes’ theorem and “inverse probability”

The following is an excerpt from Cautionary Tales in Designed Experiments by David Salsburg. This book is available to download for free from SAS Press. The book aims to explain statistical design of experiments (DOE) to readers with minimal mathematical knowledge and skills. In this excerpt, you will learn about […]

Thomas Bayes’ theorem and “inverse probability” was published on SAS Users.

Finding Possible Data Errors Using the %Auto_Outliers Macro

One of the first and most important steps in analyzing data, whether for descriptive or inferential statistical tasks, is to check for possible errors in your data. In my book, Cody’s Data Cleaning Techniques Using SAS, Third Edition, I describe a macro called %Auto_Outliers. This macro allows you to search […]

Finding Possible Data Errors Using the %Auto_Outliers Macro was published on SAS Users.

Summarization in CASL

Summarizing numeric data is an important step in analyzing your data. CASL provides multiple actions that generate summary statistics. This blog provides a quick overview of three of those actions: SIMPLE.SUMMARY, AGGREGATION.AGGREGATE, and DATAPREPROCESS.RUSTATS.

Summarization in CASL was published on SAS Users.