How to store data about your data in your data

This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post.

Although the NSA and news media have given metadata a bad name in the popular press, the fact remains that information about the nature of your data is extremely valuable.

For example, I posted an article yesterday about data cardinality. Cardinality measures the uniqueness of values in a variable. Cardinality is metadata: it’s an attribute of the data variable that isn’t the actual data content. If you know the cardinality of a variable, then you can decide what types of analyses might be appropriate. That’s very useful.

I provided a method to calculate cardinality. The method (a SAS macro program) creates a data set with the cardinality information, plus an optional report. But what if you could infuse that cardinality information back into the original source data set, so that this metadata was present and visible the next time that you needed it? The information would be part of the data set, and you wouldn’t need to keep track of another data table “on the side” or recalculate the values for each operation in your process.

With SAS 9.4, you can achieve this using extended attributes in SAS data sets. Extended attributes can be added, modified, and removed by PROC DATASETS and a series of XATTR statements.

The following program uses the %getcardinality macro program to create a cardinality data set, which includes values for nlevels and pct_unique. Then it generates a series of attributes and values (using the XATTR ADD VAR statement) to add these back into the original source data. (Well, in my case I made a copy of the data because I didn’t want to modify the SASHELP version.)

/* See http://blogs.sas.com/content/sasdummy/about-cardinality/ */
%getcardinality(SASHELP.CARS, WORK.CARDS, 0);
 
/* Assemble XATTR statements to add in nlevels and pct_unique */
/* These were calculated in the %getcardinality macro         */
proc sql;
/* will generate statements like:                               */
/* XATTR ADD VAR Invoice (nlevels=425 pct_unique=0.9929906542); */
select 
  cat('XATTR ADD VAR ', 
         Name,' (nlevels=',
         nlevels,' pct_unique=',pct_unique,');') 
          into :xattr separated by ' '
    from work.cards;
quit;
 
proc datasets lib=work nolist;
  /* make a copy of the SASHELP.CARS data */
  /* so I don't modify the original */
  copy in=sashelp out=work;
  select cars;
 
  /* Now get ready to modify WORK.CARS */
  modify cars;
 
  XATTR ADD DS STATUS="Added cardinality on &sysdate.";
  /* Fold in the XATTR statements */
  &xattr.;
 
  /* Report on the contents including extended attributes */
  /* save just the attributes to a data set */
  ods output ExtendedAttributesVar = work.savedAttrs;
  contents data=cars;
quit;

At the end of the program, the PROC DATASETS CONTENTS statement produces a report and data set that shows the extended attributes, which are now “baked into” the SAS data set that they describe. Here’s the report from PROC DATASETS — note the new “Extended Attributes” sections.


Perhaps more useful, I used the ODS OUTPUT statement to save a copy of the ExtendedAttributesVar table, which contains the cardinality information that can inform my analysis.


Of course, storing cardinality is just one of many possible applications for extended attributes. You could extend my example here and “precalculate” any univariate statistic that you might want to know later: mean, median, sum, and so on. Or you could store audit details at the data set level with XATTR ADD DS. You can even store a copy of the SAS program that you used to create the data — Diane Olson provides an example of this in her paper about extended attributes.

Important note: Only the BASE engine supports extended attributes. You can also access them from the META engine when the underlying metadata library is a BASE library. And SAS releases prior to 9.4 won’t be able to access a data set once it contains extended attributes.

What other uses can you imagine for extended attributes? Post back in the comments!

tags: cardinality, extended attributes, PROC DATASETS, SAS 9.4

This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post.