Has learning R been driving you a bit crazy? If so, it may be that you’re “lost in translation.” On April 21 and 23, I’ll be teaching a webinar, R for SAS, SPSS and Stata Users. With each R concept, … Continue reading →
Tag: SAS
Analytics Software Popularity Update: Counting Blogs, Simplifying Job Searches
My latest update to The Popularity of Data Analysis Software is an attempt to use blog counts to estimate the popularity of analytics software. While I was able to greatly broaden the coverage of packages when studying job data, I … Continue reading →
SAS Global Forum 2014 for Platform Administrators
SAS Global Forum 2014 is just around the corner and I’m getting really excited about it. There is so much content for SAS platform administrators this year. I had been thinking of posting a list of admin-related papers I was keen to see, but then I saw that Greg Nelson from ThotWave wrote a […]
Type I error rates in test of normality by simulation
This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.
First, we run a simulation in R. Notice that the simulation has no explicit “for” loops cluttering the code: replicate() handles the repetition.
# type I error
alpha <- 0.05
# number of simulations
n.simulations <- 10000
# number of observations in each simulation
n.obs <- 100
# a vector of test results
type.one.error <- replicate(n.simulations, shapiro.test(rnorm(n.obs))$p.value)<alpha
# type I error for the whole simulation
mean(type.one.error)
# Store cumulative results in data frame for plotting
sim <- data.frame(
n.simulations = 1:n.simulations,
type.one.error.rate = cumsum(type.one.error) /
seq_along(type.one.error))
# plot type I error as function of the number of simulations
plot(sim, xlab="number of simulations",
ylab="cumulative type I error rate")
# a line for the true error rate
abline(h=alpha, col="red")
# alternative plot using ggplot2
library(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
geom_line() +
xlab('number of simulations') +
ylab('cumulative type I error rate') +
ggtitle('Simulation of type I error in Shapiro-Wilk test') +
geom_abline(intercept = alpha, slope=0, col='red') +
theme_bw()
As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.
It’s elegant that the whole simulation can be condensed to 60 characters:
mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)
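To try other values of alpha and other sample sizes, the same simulation can be wrapped in a small function. This is only a sketch: the function name simulate.type.one.error and the seed are my own choices for illustration, not part of the original code.

```r
# Hypothetical helper (not in the original post): estimate the type I
# error rate of the Shapiro-Wilk test for a given alpha and sample size.
simulate.type.one.error <- function(alpha = 0.05, n.obs = 100,
                                    n.simulations = 10000) {
  # shapiro.test() only accepts between 3 and 5000 observations
  stopifnot(n.obs >= 3, n.obs <= 5000)
  # one p-value per simulated sample drawn from a true normal distribution
  p.values <- replicate(n.simulations, shapiro.test(rnorm(n.obs))$p.value)
  # proportion of samples in which normality is (wrongly) rejected
  mean(p.values < alpha)
}

set.seed(1)  # arbitrary seed, only so the run can be repeated
simulate.type.one.error(alpha = 0.01, n.obs = 30, n.simulations = 2000)
simulate.type.one.error(alpha = 0.10, n.obs = 500, n.simulations = 2000)
```

In each case the returned rate should land near the alpha that was passed in, with more simulations giving a tighter estimate.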
Likewise, we now do a similar simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster using a BY statement.
data normal;
length simulation 4 i 3; /* save space and time */
do simulation = 1 to 10000;
do i = 1 to 100;
x = rand('normal');
output;
end;
end;
run;
proc univariate data=normal noprint ;
by simulation;
var x;
output out=univariate n=n mean=mean std=std NormalTest=NormalTest probn=probn;
run;
data univariate;
set univariate;
type_one_error = probn<0.05;
run;
/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
table type_one_error/nocum;
run;
In my SAS simulation the type I error rate was 5.21%.
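A rate of 5.21% rather than exactly 5% is what simulation noise would lead us to expect. As a rough check, assuming the 10,000 simulations are independent, the standard error of a simulated proportion can be computed in R (the variable names below are mine, chosen for illustration):

```r
# Values taken from the post: nominal alpha, number of simulations,
# and the rate observed in the SAS run.
alpha <- 0.05
n.simulations <- 10000
observed.rate <- 0.0521

# Standard error of a simulated proportion when the true rate is alpha
monte.carlo.se <- sqrt(alpha * (1 - alpha) / n.simulations)

# Distance of the observed rate from alpha, in standard errors
z <- (observed.rate - alpha) / monte.carlo.se

monte.carlo.se  # roughly 0.0022
z               # just under 1
```

The observed rate is less than one standard error away from alpha, so it is consistent with a true type I error rate of 5%.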
Tested with R 3.0.2 and SAS 9.3 on Windows 7.
For more posts like this, see Heuristic Andrew.
Job Trends in the Analytics Market: New, Improved, now Fortified with C, Java, MATLAB, Python, Julia and Many More!
I’m expanding the coverage of my article, The Popularity of Data Analysis Software. This is the first installment, which includes a new opening and a greatly expanded analysis of the analytics job market. Here it is, from the abstract onward … Continue reading →
A SAS Note for Length Limit of Strings in CDISC Datasets
Clinical programmers are very familiar with the length limits of strings in CDISC-compliant datasets, such as: #1: variable names: <= 8 characters; #2: variable labels: <= 40 characters; #3: data set labels: <= 40 characters; #4: data value of a single variable: <= 200 characters; and they are due to the limitations of SAS […]
NOWINDOWS Goes Out the Window
I just discovered by accident a new “feature” in SAS 9.4, one that apparently wasn’t glamorous enough to be mentioned back at SAS Global Forum last spring. Starting with SAS 9.4, by default, PROC REPORT runs in noninteractive mode. In other words, you no longer have to specify the NOWINDOW option in order to avoid […]
SAS Format Viewer and Others (EG)
Today I tried to install a SAS format viewer, one of my favorite SAS AF utilities, in a virtual machine, and there was bad news and good news. The bad news is that the website I used to download the SAS Format Viewer (by Frank Poppe) is no longer accessible (LESSON LEARNED: make […]