I’ve been tracking The Popularity of Data Analysis Software for many years now, and a clear trend is the decline of the market share of the bigger analytics firms, notably SAS and SPSS. Many people have interpreted my comments as implying … Continue reading →
Tag: Statistics
Updated: Why R is Hard to Learn
I’ve updated one of my most widely read blog posts, Why R is Hard to Learn. It focuses on the aspects of R which tend to trip up beginners. The new version is over twice as long as the original … Continue reading →
Read sas7bdat files in R with GGASoftware Parso library
… using the new R package sas7bdat.parso. The software company GGASoftware has extended the work that I and others did on the sas7bdat R package by developing a Java library called Parso, which also reads sas7bdat files. They have worked out most of the remaining kinks. For example, the Parso library reads sas7bdat files with compressed […]
R Passes SPSS in Scholarly Use, Stata Growing Rapidly
by Robert A. Muenchen Here is my latest update to The Popularity of Data Analysis Software. To save you the trouble of reading all 25 pages of that article, the new section is below. The two most interesting nuggets it contains are: … Continue reading →
R Continues Its Rapid Growth
I’ve just updated the section below from The Popularity of Data Analysis Software. Note that the overall article is still…
SAS, SPSS, Stata Users: Learn R from Home April 21
Has learning R been driving you a bit crazy? If so, it may be that you’re “lost in translation.” On April 21 and 23, I’ll be teaching a webinar, R for SAS, SPSS and Stata Users. With each R concept, … Continue reading →
Analytics Software Popularity Update: Counting Blogs, Simplifying Job Searches
My latest update to The Popularity of Data Analysis Software is an attempt to use blog counts to estimate the popularity of analytics software. While I was able to greatly broaden the coverage of packages when studying job data, I … Continue reading →
Type I error rates in test of normality by simulation
This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.
First, we run a simulation in R. Notice the simulation is vectorized: there are no “for” loops that clutter the code and slow the simulation.
# type I error
alpha <- 0.05
# number of simulations
n.simulations <- 10000
# number of observations in each simulation
n.obs <- 100
# a vector of test results
type.one.error <- replicate(n.simulations,
shapiro.test(rnorm(n.obs))$p.value) < alpha
# type I error for the whole simulation
mean(type.one.error)
# Store cumulative results in data frame for plotting
sim <- data.frame(
n.simulations = 1:n.simulations,
type.one.error.rate = cumsum(type.one.error) /
seq_along(type.one.error))
# plot type I error as function of the number of simulations
plot(sim, xlab="number of simulations",
ylab="cumulative type I error rate")
# a line for the true error rate
abline(h=alpha, col="red")
# alternative plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
geom_line() +
xlab('number of simulations') +
ylab('cumulative type I error rate') +
ggtitle('Simulation of type I error in Shapiro-Wilk test') +
geom_abline(intercept = 0.05, slope=0, col='red') +
theme_bw()
As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.
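For example, here is a minimal sketch of how you might wrap the simulation in a function so that alpha, the sample size, and the number of simulations can all be varied (the function name and defaults are my own, not part of the original post):
# hypothetical helper: estimate the type I error rate of the
# Shapiro-Wilk test for a chosen alpha and sample size
simulate.type.one.error <- function(alpha = 0.05, n.obs = 100,
n.simulations = 10000) {
p.values <- replicate(n.simulations,
shapiro.test(rnorm(n.obs))$p.value)
mean(p.values < alpha)
}
# examples: a stricter alpha, then a smaller sample size
simulate.type.one.error(alpha = 0.01)
simulate.type.one.error(n.obs = 20)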
It’s elegant that the whole simulation can be condensed to 60 characters:
mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)
Next, we run a similar simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: using a BY statement makes the simulation simpler and faster.
data normal;
length simulation 4 i 3; /* save space and time */
do simulation = 1 to 10000;
do i = 1 to 100;
x = rand('normal');
output;
end;
end;
run;
proc univariate data=normal noprint;
by simulation;
var x;
output out=univariate n=n mean=mean std=std NormalTest=NormalTest probn=probn;
run;
data univariate;
set univariate;
type_one_error = probn < 0.05;
run;
/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
table type_one_error/nocum;
run;
In my SAS simulation the type I error rate was 5.21%.
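A quick back-of-the-envelope check in R (my own addition, not part of the original post) shows that this is consistent with the nominal 5%: with 10,000 independent simulations and a true rate of 0.05, the Monte Carlo standard error of the estimated rate is about 0.22 percentage points, so 5.21% sits within one standard error of 5%.
# Monte Carlo standard error of the estimated type I error rate,
# assuming 10,000 independent simulations and a true rate of 0.05
n.simulations <- 10000
se <- sqrt(0.05 * 0.95 / n.simulations)
se                    # about 0.00218
(0.0521 - 0.05) / se  # about 0.96 standard errors from nominal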
Tested with R 3.0.2 and SAS 9.3 on Windows 7.
For more posts like this, see Heuristic Andrew.