Example 9.1: Scatterplots with binning for large datasets
This post was kindly contributed by SAS and R - go there to comment and to read the full post. |
Scatterplots can get very hard to interpret when displaying large datasets, as points inevitably overplot and can’t be individually discerned. A number of approaches have been crafted to help with this problem. One approach uses binning. This approach is also sometimes called a heat map, and can be thought of as a two-dimensional histogram, where shades of the bins take the place of the heights of the bars. Any regular tessellation of the plane can be used, but there is some attraction to using hexagons. Why? In the vignettes for the hexbin package, author Nicholas Lewin-Koh notes:
There are many reasons for using hexagons, at least over squares. Hexagons have symmetry of nearest neighbors which is lacking in square bins. Hexagons are the maximum number of sides a polygon can have for a regular tesselation of the plane, so in terms of packing a hexagon is 13% more efficient for covering the plane than squares. This property translates into better sampling efficiency at least for elliptical shapes. Lastly hexagons are visually less biased for displaying densities than other regular tesselations.
On the other hand, it’s unclear whether these advantages are relevant here or whether they outweigh the simplicity of the square and the constant x and y values accompanying it.
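To make the "two-dimensional histogram" idea concrete, here is a minimal base R sketch of the square-bin version. The simulated data, the seed, the choice of 30 bins per axis, and the nine gray shades are all assumptions made for illustration; this is not the code used for the plots in this entry.

```r
set.seed(1)                       # arbitrary seed, only so the sketch is repeatable
n <- 10000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)           # illustrative correlated data (an assumption)

nbins <- 30                       # assumed number of square bins per axis
xbreaks <- seq(min(x), max(x), length.out = nbins + 1)
ybreaks <- seq(min(y), max(y), length.out = nbins + 1)

# Bin index (1..nbins) for each point; the maximum falls in the last bin
xbin <- findInterval(x, xbreaks, rightmost.closed = TRUE)
ybin <- findInterval(y, ybreaks, rightmost.closed = TRUE)

# Count the points in each (x bin, y bin) cell, keeping empty cells
counts <- table(factor(xbin, levels = 1:nbins),
                factor(ybin, levels = 1:nbins))

# Draw the counts as shaded squares; empty cells are set to NA and left blank
z <- matrix(as.numeric(counts), nrow = nbins)
z[z == 0] <- NA
image(xbreaks, ybreaks, z,
      col = gray(seq(0.9, 0.1, length.out = 9)),  # light = few, dark = many
      xlab = "x", ylab = "y")
```

Because the bins are squares, every cell shares constant x and y boundaries, which is the simplicity referred to above.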
In this entry, we demonstrate the use of a binned scatterplot for data from a sample of 10,000 generated bivariate normal random variables (section 1.10.6).
R
In R, we use the hexbin package to generate our plot, after generating our bivariate normals with correlation approximately 0.52.
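library(MASS)    # provides mvrnorm()
library(hexbin)  # provides hexbin() and its plot method
mu = c(1, -1)                      # mean vector
Sigma = matrix(c(3, 2,
                 2, 5), nrow=2)    # covariance matrix
xvals = mvrnorm(10000, mu, Sigma)  # 10,000 bivariate normal draws
Sigma[1,2]/sqrt(Sigma[1,1]*Sigma[2,2]) # correlation implied by Sigma, about 0.52
plot(hexbin(xvals[,1], xvals[,2]), xlab="X1", ylab="X2")  # hexagonally binned scatterplot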
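As a quick sanity check on the "approximately 0.52" figure, the sketch below compares the correlation implied by the covariance matrix with the sample correlation of the generated values. The seed and the names rho and r are assumptions added for illustration; the sample value will differ slightly from run to run without a seed.

```r
library(MASS)                      # provides mvrnorm()

set.seed(42)                       # arbitrary seed (an assumption, not in the original)
mu <- c(1, -1)
Sigma <- matrix(c(3, 2,
                  2, 5), nrow = 2)

# Correlation implied by the covariance matrix: 2 / sqrt(3 * 5)
rho <- cov2cor(Sigma)[1, 2]

# Sample correlation of the simulated data should be close to rho
xvals <- mvrnorm(10000, mu, Sigma)
r <- cor(xvals[, 1], xvals[, 2])

print(c(implied = rho, sample = r))
```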
SAS
We’re not aware of a SAS procedure to generate a binned scatterplot or of previously existing macros to do it. Ken wrote a relatively simple macro to do it, which can be found here. The macro uses proc gmap, and we hope that someone will develop an approach using proc template and proc sgrender, as demonstrated in an example from SAS Institute.
After running the macro, the following code generates the image shown below.
data Sigma (type=cov);
infile cards;
input _type_ $ _Name_ $ x1 x2;
cards;
cov x1 3 2
cov x2 2 5
;
run;
proc simnormal data=Sigma out=mvnorms numreal = 10000;
var x1 x2;
run;
%twodhist(data=mvnorms,x=x1,y=x2,nbinsx=30,nbinsy=30,nshades=9);
We note that the default number of shades shown in R, and the number chosen here for SAS, seem to exceed the eye’s ability to differentiate, especially for the darker shades.
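On the R side, one possible way to reduce the number of shades is the colorcut argument of the hexbin plot method, which, as we understand it, sets the cut points (as fractions of the count range) that separate the shade classes. The sketch below is a hedged illustration: the simulated data, the seed, and the choice of five shades are assumptions, and the plot is only drawn if the hexbin package is installed.

```r
set.seed(1)                          # arbitrary seed, for repeatability only
x <- rnorm(10000)
y <- 0.5 * x + rnorm(10000)          # illustrative data (an assumption)

# Six cut points on [0, 1] define five shade classes
cuts <- seq(0, 1, length.out = 6)

if (requireNamespace("hexbin", quietly = TRUE)) {
  library(hexbin)
  hb <- hexbin(x, y)
  # colorcut replaces the default, finer set of cut points
  plot(hb, colorcut = cuts, xlab = "X1", ylab = "X2")
}
```

For the SAS macro, the analogous change would be a smaller value for the nshades parameter shown in the call above.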
Update
An anonymous commenter reported that the SAS code bombed when run. I (Ken) added a new version of the code at the link listed above. I note it here only to emphasize that in either SAS or R, settings or objects in the environment can affect the performance of code. If you plan to share code, an item to add to your checklist is to run the code in a fresh session.
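For R, one way to approximate a fresh session without leaving the current one is to run the script in a separate Rscript process started with --vanilla, which skips saved workspaces and profile files. The sketch below is only an illustration: the temporary script and its contents are hypothetical stand-ins for the code you intend to share.

```r
# Write a small stand-in script to a temporary file (hypothetical contents)
script <- tempfile(fileext = ".R")
writeLines(c(
  'stopifnot(!exists("xvals"))   # nothing leaks in from the calling session',
  'x <- rnorm(10)',
  'cat("ran cleanly\\n")'
), script)

# Run it in a new R process that ignores .RData, .Rprofile and site profiles
rscript <- file.path(R.home("bin"), "Rscript")
out <- system2(rscript, c("--vanilla", shQuote(script)), stdout = TRUE)

# system2() attaches a "status" attribute only when the exit code is non-zero
status <- attr(out, "status")
ok <- is.null(status) || status == 0
print(out)
```

If the script relies on an object or option that exists only in your interactive session, the child process fails and ok is FALSE.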