Tag: simulation

Creating Simulated Data Sets

There are times when it is useful to simulate data. One of the reasons I use simulated data sets is to demonstrate statistical techniques such as multiple or logistic regression. By using SAS random functions and some DATA step logic, you can create variables that follow certain distributions or are […]
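As a rough sketch of the idea (in R rather than SAS, with a sample size and coefficients invented for illustration), data for a logistic regression demonstration might be simulated like this:

# simulate predictors and a binary outcome from a known logistic model
set.seed(1)
n <- 500
age <- rnorm(n, mean = 50, sd = 10)
smoker <- rbinom(n, size = 1, prob = 0.3)

# outcome probability follows a logistic model with chosen coefficients
logit <- -5 + 0.06 * age + 0.9 * smoker
y <- rbinom(n, size = 1, prob = plogis(logit))

# the fitted model should roughly recover the coefficients used above
summary(glm(y ~ age + smoker, family = binomial))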

Creating Simulated Data Sets was published on SAS Users.

Type I error rates in two-sample t-test by simulation

What do you do when analyzing data is fun, but you don’t have any new data? You make it up.

This simulation tests the type I error rates of the two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminds the reader not to take the result of any single hypothesis test as gospel truth. That is, there is always a risk of a false positive (or a false negative), so establishing the truth requires more than one research study.

A type I error is a false positive. That is, it happens when a hypothesis test rejects the null hypothesis when in fact the null hypothesis is true. In this simulation the null hypothesis is true by design, though in the real world we can never be sure whether the null hypothesis is true; this is why we write that we “fail to reject the null hypothesis” rather than that we “accept” it. If hypothesis tests made no errors, this simulation would never reject the null hypothesis, but by design we expect it to reject at a rate close to alpha, the significance level. The de facto standard for alpha is 0.05, so out of 10,000 simulated tests we expect roughly 500 false positives.

R

First, we run a simulation in R by repeatedly comparing randomly generated sets of normally distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no “for” loops that clutter the code and slow the simulation.
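This digest truncates the post before its code, but a minimal sketch of such a vectorized simulation might look like the following (the group size of 100 per sample, the 10,000 replications, and the default Welch test are assumptions for illustration):

# significance level
alpha <- 0.05

# each replicate draws two independent samples from the same normal
# distribution, so the null hypothesis of equal means is true by design
p.values <- replicate(10000,
    t.test(rnorm(100), rnorm(100))$p.value)

# proportion of false positives; this should be close to alpha
mean(p.values < alpha)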

For more posts like this, see Heuristic Andrew.

Type I error rates in test of normality by simulation

This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.

First, we run a simulation in R. Notice the simulation is vectorized: there are no “for” loops that clutter the code and slow the simulation.


# type I error
alpha <- 0.05

# number of simulations
n.simulations <- 10000

# number of observations in each simulation
n.obs <- 100

# a vector of test results
type.one.error <- replicate(n.simulations,
    shapiro.test(rnorm(n.obs))$p.value) < alpha

# type I error for the whole simulation
mean(type.one.error)

# Store cumulative results in data frame for plotting
sim <- data.frame(
    n.simulations = 1:n.simulations,
    type.one.error.rate = cumsum(type.one.error) /
        seq_along(type.one.error))

# plot type I error as function of the number of simulations
plot(sim, xlab="number of simulations",
    ylab="cumulative type I error rate")

# a line for the true error rate
abline(h=alpha, col="red")

# alternative plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
    geom_line() +
    xlab('number of simulations') +
    ylab('cumulative type I error rate') +
    ggtitle('Simulation of type I error in Shapiro-Wilk test') +
    geom_abline(intercept = 0.05, slope = 0, col = 'red') +
    theme_bw()

As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.

It’s elegant that the whole simulation can be condensed to 60 characters:


mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)

Likewise, we now run the same simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster using a BY statement.


data normal;
    length simulation 4 i 3; /* save space and time */
    do simulation = 1 to 10000;
        do i = 1 to 100;
            x = rand('normal');
            output;
        end;
    end;
run;

proc univariate data=normal noprint;
    by simulation;
    var x;
    output out=univariate n=n mean=mean std=std
        NormalTest=NormalTest probn=probn;
run;

data univariate;
    set univariate;
    type_one_error = (probn < 0.05); /* flag each false positive */
run;

/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
    tables type_one_error / nocum;
run;

In my SAS simulation the type I error rate was 5.21%.

Tested with R 3.0.2 and SAS 9.3 on Windows 7.

For more posts like this, see Heuristic Andrew.

On Unpublished Software

I ran across this post at The Tree of Life entitled “Interesting new metagenomics paper w/ one big big big caveat – critical software not available”. The long and short of it? Paper appears in Science, has fancy new methodology, lacks the software for someone else to use their methodology. Blog author understandably annoyed. But I […]

StackExchange and CrossValidated: An Epidemiologist’s Review

This seems like as good a day as any to review CrossValidated, and the whole StackExchange constellation of websites. It’s been exactly a month since I joined, and today I crossed the 1,000 reputation threshold on the site. So why not give my impressions of it? First, how I got there in the first […]

Sampling from a Bayesian Posterior Distribution in SAS

One of the things that frequently comes up in my research is the need to estimate a parameter from data, and then randomly draw samples from that parameter’s distribution to plug into another model. If you have a regular estimate from something like PROC LOGISTIC or PROC GENMOD, this is easy as pie, as SAS […]
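The SAS code is truncated in this digest; as a rough R sketch of the general idea (the estimate, standard error, and normal approximation below are illustrative assumptions, not the post’s actual values):

# hypothetical point estimate and standard error of a log odds ratio,
# as a procedure like PROC LOGISTIC or PROC GENMOD might report
beta.hat <- 0.8
beta.se <- 0.25

# draw 1,000 samples from the approximate (normal) sampling
# distribution of the parameter
beta.samples <- rnorm(1000, mean = beta.hat, sd = beta.se)

# plug each draw into a downstream quantity, e.g., an odds ratio
or.samples <- exp(beta.samples)
summary(or.samples)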

Proc Fcmp(4): Binomial tree vs. Black-Scholes model

The truth is that SAS has limited financial functions. Thanks to SAS Institute, they finally added some option pricing functions to the Base module in SAS 9.2, such as the Black-Scholes put/call functions, the Garman-Kohlhagen put/call functions, etc. Thu…
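For reference, here is a minimal R sketch of the Black-Scholes call price computed directly from the formula (not SAS’s built-in function; the example inputs are made up):

# Black-Scholes price of a European call option
# S: spot price, K: strike, r: risk-free rate,
# sigma: volatility, tau: time to maturity in years
black_scholes_call <- function(S, K, r, sigma, tau) {
    d1 <- (log(S / K) + (r + sigma^2 / 2) * tau) / (sigma * sqrt(tau))
    d2 <- d1 - sigma * sqrt(tau)
    S * pnorm(d1) - K * exp(-r * tau) * pnorm(d2)
}

# example: at-the-money call with one year to maturity
black_scholes_call(S = 100, K = 100, r = 0.05, sigma = 0.2, tau = 1)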

Find the ‘right’ SAS functions

How many functions does SAS have? Well, it sounds like a job interview question. For SAS 9.2, by querying the system dictionary (sashelp.vfunc or dictionary.functions), the exact answer is 946, including all functions and CALL routines. There are two types -…