Expanding lengths of all character variables in SAS data sets

This post was kindly contributed by SAS Users - go there to comment and to read the full post.

In my earlier blog post, Changing variable type and variable length in SAS datasets, I showed how you can effectively change variable lengths in a SAS data set. That approach works fine when you need to change the length attribute for a few variables on a case-by-case basis. But what if you need to change lengths for all character variables in a data set? Or for all data sets in a data library? For example, you might need to expand (increase) all your character variable lengths by 50%. In that case, the case-by-case approach becomes too laborious and inefficient.

What is a character variable’s length attribute?

Before reading any further, let’s take a quick quiz:

Q: A character variable length attribute represents a number of:

  A. Bits
  B. Bytes
  C. Centimeters
  D. Characters

If your answer is anything but B (Bytes), it’s incorrect. According to the SAS documentation, length refers to the number of bytes used to store each of the variable’s values in a SAS data set. You can use a LENGTH statement to set the length of both numeric and character variables.

It is true, though, that for some older encoding systems (ASCII, ISO/IEC 8859, EBCDIC, etc.) there was no difference between the number of bytes and the number of characters, as those systems were based on exactly one byte per character. They are even called Single-Byte Character Sets (SBCS) for that reason. The problem is that they can accommodate a maximum of only 2^8 = 256 symbols, which is not nearly enough to cover the variety of natural languages, special characters, emojis, etc.

Why would we want to expand character variable lengths?

Use case 1. Expanding character values range

For this scenario, let’s consider Internet traffic analysis where your data contains multiple character columns for Internet Protocol addresses (IP addresses) in 32-bit version 4 (IPv4, e.g. ‘125.255.101.180’). You transition to the newer 128-bit IPv6 standard (e.g. ‘2001:0000:3238:DFE1:0063:0000:0000:FEFB’) and need to modify your data structure to accommodate the longer character values of the new standard.

Use case 2. Migrating SAS data to multi-byte encoding environment

In this scenario, you migrate SAS data sets from older SBCS environments to newer Multi-Byte Character Set (MBCS) encoding environments. In such a case, the ability to increase character variable lengths in bulk with a simple action becomes especially important.

Currently, the most commonly used MBCS is Unicode, which is supported by all modern operating systems, databases and web browsers. Of the different flavors of Unicode (UTF-8, UTF-16, UTF-32), the most popular is UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding that uses from 1 to 4 one-byte (8-bit) code units per character; it is capable of encoding 1,112,064 valid code points, which cover most modern languages, including Arabic and Hebrew characters, hieroglyphs, emojis, and many other special characters.

Since each UTF-8 encoded character may require somewhere between one and four bytes, and not all SBCS characters are represented by one byte in UTF-8, data migration from SBCS to UTF-8 may cause data truncation and subsequently data loss.
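To make the byte-versus-character distinction concrete, here is a minimal sketch that compares the two counts for a single value. It assumes a SAS session started with ENCODING=UTF-8, and the variable names (city, n_bytes, n_chars) are made up for illustration. LENGTH reports the position of the last non-blank byte, while KLENGTH counts characters.

```sas
/* Sketch: compare the byte length and character length of one value. */
/* Assumes the session encoding is UTF-8; in an SBCS session the two  */
/* counts would be the same.                                           */
data _null_;
   length city $ 20;            /* 20 bytes of storage, not 20 characters */
   city = 'Zürich';
   n_bytes = length(city);      /* LENGTH counts bytes                    */
   n_chars = klength(city);     /* KLENGTH counts characters              */
   put n_bytes= n_chars=;       /* writes both counts to the log          */
run;
```

Since ü occupies two bytes in UTF-8, one would expect the byte count to exceed the character count by one. A variable sized for exactly six single-byte characters would therefore be too short for this value after transcoding.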

When SAS reads an SBCS-encoded data set and writes its records into a UTF-8-encoded data set, it issues an ERROR message in the log and stops execution:

ERROR: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

When SAS reads an SBCS-encoded data set and produces only a UTF-8-encoded printed report (without generating a UTF-8-encoded output data set), it issues a WARNING message (with the same description as the ERROR message above) and continues execution:

WARNING: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

Either ERROR or WARNING is unacceptable and must be properly addressed.

How to expand all character variables lengths?

Regardless of character transcoding, SAS’ CVP engine is a short and effective answer to this question. CVP stands for Character Variable Padding, which is exactly what this special-purpose engine does: it pads (expands) character variables by a number of bytes. The CVP engine is part of Base SAS and does not require any additional licensing.

The CVP engine is a read-only engine for SAS data sets only. You can think of it as a magnifying glass: it creates an expanded view of the character data descriptors (lengths) without changing them. Still, we can use the CVP engine to actually convert a data set or a whole data library to a version with expanded character variables. All we need to do is define our source library as a CVP library, for example:

libname inlib cvp 'c:\source_folder';

Then use PROC COPY to create expanded versions of our original data sets in a target library:

libname outlib 'c:\target_folder';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Or, if we need to expand character variable lengths for the whole library, then we use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

It’s that easy. And the icing on the cake is that the CVP engine by default automatically adjusts the variables’ format widths to match the expanded byte lengths for all converted character variables.
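One way to confirm the expansion is to compare column lengths before and after the copy. The sketch below is only an illustration: srclib is a hypothetical plain (non-CVP) libref assigned to the same source folder, and the query reads the DICTIONARY.COLUMNS table to list each character variable's length in the source and in the target library.

```sas
/* srclib is a hypothetical plain (non-CVP) libref for the source folder */
libname srclib 'c:\source_folder';

proc sql;
   /* List source and target lengths side by side for character columns */
   select s.memname,
          s.name,
          s.length as source_length,
          t.length as target_length
   from dictionary.columns as s
        inner join dictionary.columns as t
        on  s.memname = t.memname
        and upcase(s.name) = upcase(t.name)
   where s.libname = 'SRCLIB'
     and t.libname = 'OUTLIB'
     and s.type = 'char'
   order by s.memname, s.name;
quit;
```

With the default settings, each target_length should be about 1.5 times the corresponding source_length.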

Avoiding character data truncation by using the CVP Engine

CVP Engine is a near-perfect SAS solution to the problem of potential data truncation when data is transcoded during migration or move from SBCS-based to MBCS-based systems.

To avoid data loss from possible truncation during transcoding, we can use the above code with a slight but important modification: define the target library with the outencoding='UTF-8' option. As a result, our target data will be not only expanded lengthwise but also properly encoded. We run this modified code in the old SBCS environment before migrating our data sets to the new MBCS environment:

libname inlib cvp 'c:\source_folder';
libname outlib 'c:\utf8_target_folder' outencoding='UTF-8';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Again, if you need to expand character variable lengths for the whole library, then you can use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

After that we can safely move our expanded, UTF-8-encoded data to the new UTF-8 environment.
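Before moving the files, it may be worth confirming that the copies really carry the new encoding. A minimal check, assuming dataset1 was among the copied data sets, is to run PROC CONTENTS on the target copy and look at the Encoding attribute in the header of its output:

```sas
/* Inspect the target copy; the Encoding attribute in the header */
/* of the report should show UTF-8 rather than the SBCS encoding */
proc contents data=outlib.dataset1;
run;
```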

Code notes

  • The code above creates new versions of your original data sets with the desired encoding and with character variable lengths expanded by 50% (the default). As shown below, this default behavior can be changed with the CVPBYTES= or CVPMULTIPLIER= options, which explicitly define the expansion rate.
  • Note that the CVP engine is specified on the input library: since it is a read-only engine, it is available for input (read) processing only.
  • For the output library you specify your desired encoding option, in this case outencoding='UTF-8'.
  • The noclone option tells PROC COPY not to copy data set attributes from the source. This ensures the output data sets are created with the attributes of the target library, including the encoding set by outencoding=, rather than cloned from the input data sets.
  • If you want to migrate your data sets using PROC MIGRATE, you should first expand column lengths with PROC COPY as shown above, since the CVP engine is not currently supported with PROC MIGRATE.
  • The CVP engine supports only SAS data files (no SAS views, catalogs, item stores, and so on).

CVP Engine options

There are several options available with the CVP Engine. Here are the most widely used:

CVPBYTES=bytes – specifies the number of bytes by which to expand character variable lengths. The lengths of character variables are increased by adding the specified bytes value to the current length.

Example: libname inlib 'SAS data-library' cvpbytes=5;

The CVPBYTES= option implicitly specifies the CVP engine; that is, if you specify CVPBYTES=, you don’t have to name the CVP engine explicitly because SAS will use it automatically.
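As an illustration, consider the IP address use case from earlier. A dotted IPv4 address needs at most 15 bytes, while a fully written IPv6 address needs 39, so adding 24 bytes would be enough. The sketch below reuses the folder paths and data set names from the examples above. Note that CVPBYTES= widens every character variable in the library, not only the IP address columns.

```sas
/* Sketch: widen every character variable by 24 bytes (15 + 24 = 39) */
libname inlib  'c:\source_folder' cvpbytes=24;   /* CVP engine is implied */
libname outlib 'c:\target_folder';

proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;
```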

CVPMULTIPLIER=multiplier – specifies a multiplier by which to expand character variable lengths. The lengths of character variables are increased by multiplying the current length by the specified multiplier value. You can specify a multiplier value from 1 to 5, or you can specify 0, in which case the CVP engine determines the multiplier automatically.

Example: libname inlib 'SAS data-library' cvpmultiplier=2.5;

The CVPMULTIPLIER= option also implicitly specifies the CVP engine; that is, if you specify CVPMULTIPLIER=, you don’t have to name the CVP engine explicitly because SAS will use it automatically.

Notes:

  • You cannot specify both the CVPMULTIPLIER= option and the CVPBYTES= option. Specify only one of these options.
  • If you explicitly assign the CVP engine but do not specify either CVPBYTES= or CVPMULTIPLIER=, then SAS defaults to using CVPMULTIPLIER=1.5 to increase the lengths of the character variables.
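If the default 50% expansion is not enough for your data, a larger multiplier can be combined with the UTF-8 target library. The sketch below uses a multiplier of 4 as a deliberately generous assumption, matching UTF-8's maximum of four bytes per character; most SBCS data will need less. The paths are the same placeholders used earlier in this post.

```sas
/* Sketch: quadruple all character lengths, then write UTF-8 copies */
libname inlib  'c:\source_folder' cvpmultiplier=4;   /* CVP engine is implied */
libname outlib 'c:\utf8_target_folder' outencoding='UTF-8';

proc copy in=inlib out=outlib noclone;
run;
```

With this setting, a variable stored as 10 bytes in the source would be 40 bytes in the target. The trade-off is larger data sets, so a smaller multiplier may be preferable when you know which characters your data contains.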

Your thoughts?

Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments section below.

Expanding lengths of all character variables in SAS data sets was published on SAS Users.
