Data Structure for Repeated Measures Analysis… A Teaser

This post was kindly contributed by The SAS Training Post - go there to comment and to read the full post.

Next week’s blog entry will build on this one, so I want you to take notes, OK?

It’s not headline news that in most cases, the best way to handle a repeated measures analysis is with a mixed models approach, especially for Normal responses. (For other distributions in the exponential family, GLIMMIX also has some advantages over GEEs, but that’s another blog post for another day.) You have more flexibility in modeling the variances and covariances among the time points, data do not need to be balanced, and a small amount of missingness doesn’t spell the end of your statistical power.

But sometimes the data you have are living in the past: arranged as if you were going to use a multivariate repeated measures analysis. This multivariate data structure arranges the data with one row per subject, each time-based response a column in the data set. This enables the GLM procedure to set up H matrices for the model effects, and E and T matrices for testing hypotheses about those effects. It also means that if a subject has any missing time points, you lose the entire subject’s data. I’ve worked on many repeated measures studies in my pre-SAS years, and I tell you, I’ve been on the phones, email, snail mail, and even knocked on doors to try to bring subjects back to the lab for follow-ups. I mourned over every dropout. To be able to use at least the observations you have for a subject before dropout would be consolation to a weary researcher’s broken heart.

Enter the mixed models approach to repeated measures. But your data need to be restructured before you can use MIXED for repeated measures analysis. The structure you need is, coincidentally, the same one you would use for a univariate repeated measures analysis, like in the old-olden days of PROC ANOVA with hand-made error terms (well, almost hand-made). Remember those? The good old days. But I digress.

The MIXED and GLIMMIX procedures require the data to be in the univariate structure, with one row per measurement. Notice that these procedures still use complete case analysis (CCA), but now the “case” is different. Instead of a subject, which in the context of a mixed model can be many things at once (a person, a clinic, a network…), the “case” is one measurement occurrence. A missing time point therefore costs you only that one row, not the whole subject.

How do you put your wide (multivariate) data into the long (univariate) structure? Well, there are a number of ways, and to some extent it depends on how you have organized your data. If the multivariate response variable names share a prefix, then the %MAKELONG macro will convert your data easily.
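
If you would rather not use a macro, the same wide-to-long conversion can be sketched with PROC TRANSPOSE. This is only an illustration under assumptions: the data set names (wide, long_raw, long), the subject variable id, and the responses y1-y3 sharing the prefix "y" are all hypothetical, not taken from any real study.

```sas
/* Hypothetical wide data: one row per subject, responses y1-y3 share the prefix "y" */
data wide;
   input id y1 y2 y3;
   datalines;
1 10 12 15
2 11 . 14
;
run;

/* PROC TRANSPOSE needs the BY variable sorted */
proc sort data=wide;
   by id;
run;

/* One output row per subject and response variable;
   the original variable name lands in VARNAME, the value in Y */
proc transpose data=wide out=long_raw(rename=(col1=y)) name=varname;
   by id;
   var y1-y3;
run;

/* Recover a numeric time index from the suffix of the variable name */
data long;
   set long_raw;
   time = input(substr(varname, 2), 8.);
   if not missing(y);   /* optional: keep only the measurements actually observed */
   drop varname;
run;
```

Subject 2 keeps two rows here instead of being dropped entirely, which is exactly the advantage of the long structure.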

What if you want to go back to the wide structure (for example, to create graphs to profile subjects over time)? There’s a macro for that as well: %MAKEWIDE.
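
Going from long back to wide can also be sketched without a macro, again with PROC TRANSPOSE. The names below (long, wide_again, id, time, y) are hypothetical and chosen only for illustration.

```sas
/* Hypothetical long data: one row per subject per time point */
data long;
   input id time y;
   datalines;
1 1 10
1 2 12
1 3 15
2 1 11
2 3 14
;
run;

proc sort data=long;
   by id time;
run;

/* ID TIME together with PREFIX=y builds the column names y1, y2, y3;
   a time point a subject never had simply comes out as a missing value */
proc transpose data=long out=wide_again(drop=_name_) prefix=y;
   by id;
   id time;
   var y;
run;
```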

What if your variables do not share a prefix, but instead have different names (such as SavBal, CheckBal, and InvestAmt)? Then you will need an alternative strategy. For example:

This needs some rearrangement, but there are two issues. First, there is no subject identifier, and I will want this in next week’s blog when I fit a mixed model. Second, the dependent variables are not named with a common prefix. In fact, they aren’t even measured over time! They are three variables measured for one person at a given time. (I’ll explain why in next week’s blog.)

So, my preference is to use arrays to handle this:
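
A minimal sketch of an array-based restructuring along these lines (not necessarily the exact code used for this post) is shown here. Only the response names SavBal, CheckBal, and InvestAmt come from the text above. The data set names (accounts, accounts_long), the new variables (subject, account, amount), and the data values are assumptions made up for illustration.

```sas
/* Hypothetical input: one row per person, three differently named responses,
   and no subject identifier */
data accounts;
   input SavBal CheckBal InvestAmt;
   datalines;
1200 300 5000
800 150 .
;
run;

data accounts_long;
   set accounts;
   length account $ 9;                 /* long enough for "InvestAmt" */
   subject = _n_;                      /* build a subject identifier from the row number */
   array resp{3} SavBal CheckBal InvestAmt;
   do i = 1 to dim(resp);
      account = vname(resp{i});        /* name of the source variable */
      amount  = resp{i};               /* its value for this subject */
      output;                          /* write one row per subject per response */
   end;
   keep subject account amount;
run;
```

Because the array lists the variables explicitly, no common prefix is needed, and the VNAME function carries each original variable name into the long data set as a label for the response.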

The result is a long data set with one row per subject per response variable.

I tip my hat to SAS Tech Support, who provide the %MAKELONG and %MAKEWIDE macros and to Gerhard Svolba, who authored them. If someone wants to turn my arrays into a macro, feel free. I’ll tip my hat to you, too.

Tune in next week for the punchline to the joke:
“Three correlated responses walk into a bar…”
