Progress reading SAS sas7bdat files (natively) in R

This post was kindly contributed by BioStatMatt » SAS - go there to comment and to read the full post.

This post describes some preliminary results from a compatibility study of the SAS sas7bdat file format. The most current results are stored in a github repository here: sas7bdat

The ultimate goal is a native solution to the incompatibility between open-source statistical software (e.g. R) and sas7bdat database files.

Demonstration

There has been significant progress in interpreting these files natively within R. Enter the following code at the R prompt to read a sas7bdat dataset from the UTK Statistics Department website:

source("http://biostatmatt.com/R/sas7bdat.R")
read.sas7bdat("http://bus.utk.edu/stat/stat579/hotel.sas7bdat")

Alternatively, download sas7bdat.R from the github repository.

Study Data

The study considers a collection of 142 sas7bdat datasets, obtained freely from web resources. Source information is available in the repository (data/sources.csv). The files in this collection are generally inefficient data storage containers: xz -9 compression resulted in a 42.7-fold reduction in overall collection size. Indeed, the collection requires 541.4 MB uncompressed, and just 12.5 MB compressed (xz -9). However, this overall figure is driven by a few very large data files. Across individual files, the median and mean compression ratios are 4.5 and 13.7, respectively.
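The gap between the overall and the per-file ratios can be illustrated with a minimal sketch in base R. This is not the code used in the study. It uses memCompress(type = "xz"), whose settings may not match command-line xz -9, and the helper name summarize_ratios and the toy byte vectors are assumptions made up for illustration.

```r
# Summarize compression ratios (uncompressed size / compressed size)
# for a list of raw vectors, each standing in for one file's contents.
summarize_ratios <- function(raw_list) {
  raw_sizes <- vapply(raw_list, length, integer(1))
  xz_sizes <- vapply(
    raw_list,
    function(b) length(memCompress(b, type = "xz")),
    integer(1)
  )
  per_file <- raw_sizes / xz_sizes
  list(
    overall = sum(raw_sizes) / sum(xz_sizes),  # collection-level ratio
    median  = median(per_file),                # typical file
    mean    = mean(per_file)
  )
}

# Toy data (hypothetical): one large, highly redundant "file" and two
# small files of random bytes that barely compress at all.
set.seed(1)
toy_files <- list(
  big    = raw(1e6),
  small1 = as.raw(sample(0:255, 2000, replace = TRUE)),
  small2 = as.raw(sample(0:255, 2000, replace = TRUE))
)
ratios <- summarize_ratios(toy_files)
print(ratios)
```

For a real file, the raw vector could be obtained with readBin(path, what = "raw", n = file.size(path)). In the toy data, the single large file dominates the overall ratio while the median stays close to 1, the same pattern described above.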

Format Description

The most impactful product of this study, in terms of compatibility, is a description of the sas7bdat file format. Although others have presumably conducted similar compatibility studies (e.g. Chris Long, of the Oceanview Consultancy), neither a description of the format nor code that implements a reader has been released. The doc/ directory of the github repository contains a first description of the format.

Prototype Reader

The code available at github implements the function read.sas7bdat, written purely in R, which extracts meaningful data from ~91% of the data file collection. The remaining data files have unusual properties (e.g. originating on non-Microsoft Windows platforms). This should be considered a reference/experimental implementation, rather than an everyday tool (but, see also doc/implementation-notes.rst).
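Because the reader is experimental, it is prudent to wrap calls in tryCatch when applying it to many files. The following is a minimal sketch of how a success rate could be tallied. It is not necessarily how the ~91% figure was obtained. The helper try_reader, the stand-in stub_reader, and the file names are hypothetical; in practice, read.sas7bdat would be passed as the reader argument.

```r
# Apply a reader function to each path and record whether it
# returned without raising an error.
try_reader <- function(paths, reader) {
  vapply(paths, function(p) {
    tryCatch({
      reader(p)
      TRUE
    }, error = function(e) FALSE)
  }, logical(1))
}

# Hypothetical stand-in so the sketch runs without sas7bdat.R loaded:
# it fails on any path containing "unix" and succeeds otherwise.
stub_reader <- function(path) {
  if (grepl("unix", path)) stop("unsupported file variant")
  data.frame(x = 1)
}

paths <- c("a_win.sas7bdat", "b_unix.sas7bdat", "c_win.sas7bdat")
ok <- try_reader(paths, stub_reader)
print(ok)        # named logical vector, one entry per path
print(mean(ok))  # proportion of files read without error
```

Note that returning without error is a weaker criterion than extracting meaningful data; checking the contents of each result would need additional, dataset-specific validation.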

“Need More Input”

The sas7bdat file format is complex, and many of its features have no known purpose. I am happy to have feedback on both successes and failures. I am also looking for additional material for investigation, especially files originating on Linux, SunOS, and other non-Microsoft Windows platforms.

The study could benefit from additional technical contributions. There is an enormous amount of skill within the R community, and some of it relates to sas7bdat files (e.g. recent support for sas7bdat files in Revolution Analytics R). If you’re interested, it won’t hurt to fork the repository at github! Ultimately, I’d like to contribute an R package, or code for the existing foreign package, to read sas7bdat files. Also, it would be fun to compile and write about tools and strategies for ‘compatibility studies’ of numeric database files. Again, collaborators are welcome.

The text, code, and data presented or referenced herein are the result of original research. © Matt Shotwell, VUMC, 2011.
