Tag: r-project

Type I error rates in two-sample t-test by simulation

What do you do when analyzing data is fun, but you don’t have any new data? You make it up.

This simulation tests the type I error rates of two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminders the reader not to take the result of any single hypothesis test as gospel truth. That is, there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.

A type I error is a false positive. That is, it happens when a hypothesis test rejects the null hypothesis when in fact it is not true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we “fail to reject the null hypothesis” rather than “we accept it.” If there were no errors in the hypothesis tests in this simulation, we would never reject the null hypothesis, but by design it is normal to reject it according to alpha, the significance level. The de facto standard for alpha is 0.05.

R

First, we run a simulation in R by repeatedly comparing randomly-generated sets of normally-distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no “for” loops that clutter the code and slow the simulation.

Read more »

For more posts like this, see Heuristic Andrew.

Fibonacci sequence in R and SAS

Because the Fibonacci sequence is simply defined by recursion, it makes for an elegant programming exercise. Here is one way to do it in SAS, and another way to do it in R. I’ve also included unit testing code to check that it works.

Fibonacci sequence in SAS using a recursive macro:


%macro fib(n);
%if &n = 1 %then 1; * first seed value;
%else %if &n = 2 %then 1; * second seed value;
%else %eval(%fib(%eval(&n-1))+%fib(%eval(&n-2))); * use recursion;
%mend;

* show values 1-5;
%put %fib(1);
%put %fib(2);
%put %fib(3);
%put %fib(4);
%put %fib(5);

* check values 1-10;
%macro check_fib;
%if %fib(1) ne 1 %then %abort;
%if %fib(2) ne 1 %then %abort;
%if %fib(3) ne 2 %then %abort;
%if %fib(4) ne 3 %then %abort;
%if %fib(5) ne 5 %then %abort;
%if %fib(6) ne 8 %then %abort;
%if %fib(7) ne 13 %then %abort;
%if %fib(8) ne 21 %then %abort;
%if %fib(9) ne 34 %then %abort;
%if %fib(10) ne 55 %then %abort;
%put NOTE: OK!;
%mend;
%check_fib;

Fibonacci sequence in R using a recursive function that supports either single integers or a vector of integers:


fib {
if (length(n) > 1) return(sapply(n, fib)) # accept a numeric vector
if (n == 1) return(1) # first seed value
if (n == 2) return(1) # second seed value
return(fib(n-1)+fib(n-2)) # use recursion
}

# print first five Fibonacci numbers
fib(1)
fib(2)
fib(3)
fib(4)
fib(5)

# verify the Fibonacci sequence 1 through 10
(actual (expected all.equal(actual,expected)

For alternative implements, see SAS and R: Example 7.1: Create a Fibonacci sequence. In SAS, Nick Horton calculates the Fibonacci sequence using a DATA STEP, and in R he uses a FOR loop.

Adam Rich responded with his post Fibonacci Sequence in R with Memoization which gives a performance boost by caching the results.

In the comments below, Rick Wicklin referred to his SAS/IML solution that generates the Fibonacci sequence iteratively and Matrices, eigenvalues, Fibonacci, and the golden ratio.

This post first appeared on Heuristic Andrew.

For more posts like this, see Heuristic Andrew.

Type I error rates in test of normality by simulation

This simulation tests the type I error rates of the Shapiro-Wilk test of normality in R and SAS.

First, we run a simulation in R. Notice the simulation is vectorized: there are no “for” loops that clutter the code and slow the simulation.


# type I error
alpha <- 0.05

# number of simulations
n.simulations <- 10000

# number of observations in each simulation
n.obs <- 100

# a vector of test results
type.one.error shapiro.test(rnorm(n.obs))$p.value)<alpha

# type I error for the whole simulation
mean(type.one.error)

# Store cumulative results in data frame for plotting
sim <- data.frame(
n.simulations = 1:n.simulations,
type.one.error.rate = cumsum(type.one.error) /
seq_along(type.one.error))

# plot type I error as function of the number of simulations
plot(sim, xlab="number of simulations",
ylab="cumulative type I error rate")

# a line for the true error rate
abline(h=alpha, col="red")

# alternative plot using ggplot2
require(ggplot2)
ggplot(sim, aes(x=n.simulations, y=type.one.error.rate)) +
geom_line() +
xlab('number of simulations') +
ylab('cumulative type I error rate') +
ggtitle('Simulation of type I error in Shapiro-Wilk test') +
geom_abline(intercept = 0.05, slope=0, col='red') +
theme_bw()

As the number of simulations increases, the type I error rate approaches alpha. Try it in R with any value of alpha and any number of observations per simulation.

It’s elegant the whole simulation can be condensed to 60 characters:


mean(replicate(10000,shapiro.test(rnorm(100))$p.value)<0.05)

Likewise, we now do a similar simulation of the Shapiro-Wilk test in SAS. Notice there are no macro loops: the simulation is simpler and faster using a BY statement.


data normal;
length simulation 4 i 3; /* save space and time */
do simulation = 1 to 10000;
do i = 1 to 100;
x = rand('normal');
output;
end;
end;
run;

proc univariate data=normal noprint ;
by simulation;
var x;
output out=univariate n=n mean=mean std=std NormalTest=NormalTest probn=probn;
run;

data univariate;
set univariate;
type_one_error = probnrun;

/* Summarize the type I error rates for this simulation */
proc freq data=univariate;
table type_one_error/nocum;
run;

In my SAS simulation the type I error rate was 5.21%.

Tested with R 3.0.2 and SAS 9.3 on Windows 7.

For more posts like this, see Heuristic Andrew.

Calculate RMSE and MAE in R and SAS

Here is code to calculate RMSE and MAE in R and SAS. RMSE (root mean squared error), also called RMSD (root mean squared deviation), and MAE (mean absolute error) are both used to evaluate models. MAE gives equal weight to all errors, while RMSE gives…

Train neural network in R, predict in SAS

This R code fits an artificial neural network in R and generates Base SAS code, so new records can be scored entirely in Base SAS. This is intended to be a simple, elegant, fast solution. You don’t need SAS Enterprise Miner, IML, or any other spe…