This post was kindly contributed by SAS Users - go there to comment and to read the full post. |
Understanding the problem
Working with character data of different encodings across different systems can occasionally result in failures to translate or transcode that data from its source format to a destination format. When using a SAS/ACCESS LIBNAME engine to query DBMS data, this scenario can end up causing errors when trying to read the data into SAS. To help alleviate selected instances of this problem, we introduced the SUBCHAR= LIBNAME and data set options to certain SAS/ACCESS engines.
A possible solution
The SUBCHAR= option lets you configure how the SAS/ACCESS engine deals with transcoding errors that occur in the engine itself. This does not control any transcoding issues that might occur in the underlying DBMS client. Depending on the settings in the DBMS client and the OS configuration, the DBMS client might choose to transcode the data itself and therefore bypass any attempts by the SAS/ACCESS engine to handle any incompatible data.
How do I know where the error is occurring so that I know SUBCHAR= might help?
For SAS/ACCESS engines that support SUBCHAR=, the error is always roughly the same.
If you see this error message, you can likely use SUBCHAR= to work around the error. If the error message instead includes the name of your DBMS, client layers, or both, it suggests the error is occurring in the DBMS client and SUBCHAR= does not help.
The details
The SUBCHAR= option takes one of these values: QUESTIONMARK, SPACE, SUB, or UESC. These option values determine which one or more characters the SAS/ACCESS engine substitutes for any characters in your DBMS data that cannot be represented in the current SAS session encoding. The alias for this option, SUBSTITUTION_CHAR, makes this option’s behavior a bit clearer.
The first two option values, QUESTIONMARK and SPACE, are self-explanatory. Any invalid character found during transcoding are replaced with the question mark (?) or space ( ) character, respectively.
The third option value, SUB, replaces the invalid character with the OS substitution character for that encoding. For instance, on Linux running in latin1, that character’s hexadecimal representation is 0x1A. In a latin1 SAS session encoding, the output of that character – for example, in a PROC PRINT — displays as a small rectangular box.
The final option value, UESC, replaces the invalid character with the Unicode ESCape for that character.
How do I find these pesky characters?
You might ask: How do I know which columns and/or rows have errors in them?
When you enable the SUBCHAR= option, the SAS/ACCESS engine also displays a warning in the SAS log to indicate that the substitution is taking place. Here is an example of such a warning.
As the warning suggests, you can turn on the sastrace facility so that the SAS/ACCESS engine displays additional information about the transcoding failure. This valuable debugging information helps you track down cases of incompatible characters in your data for the given SAS session encoding. With sastrace enabled, you’ll see a message similar to this in the SAS log
This debug message will tell you which column(s) failed to transcode and how many times. It will also give you the first row (observation) where a transcoding failure was detected.
The examples
For this blog, I created a simple one-column table in Amazon Redshift and inserted 4 rows.
I inserted the Greek omega character (U+03A9) at the end of the value in the third row. For Amazon Redshift, character data is stored in UTF-8 encoding.
If I try a PROC PRINT of this table from SAS running in a latin1 encoding on a Linux system, I get the expected error message.
The output of the PRINT is the three rows without incompatible characters.
So, let’s try out the first option value, QUESTIONMARK.
As you can see, with SUBCHAR=UESC you get the Unicode escape for the omega character, so you can tell exactly what character couldn’t be translated.
Does my SAS/ACCESS engine support it?
As of this writing, these SAS/ACCESS engines support the SUBCHAR= option.
• Amazon Redshift
• Greenplum
• Hadoop
• HAWQ
• Impala
• JDBC
• Microsoft SQL Server
• Netezza
• ODBC
• OLE DB
• PostgreSQL
• SAP HANA
• Spark
• Vertica
The wrap up
As I mentioned earlier, the SUBCHAR= option helps only when the SAS/ACCESS engine is transcoding the data.
The SUBCHAR= option can be a useful tool to pinpoint incompatible characters that might exist in your data to suppress those characters during output if you know you don’t need them. Hopefully, this article can be a useful guide in understanding the SUBCHAR-acteristics of certain transcoding scenarios you might encounter in your SAS jobs.
Acknowledgements
Thanks to Salman Maher and Kat Turk for their help and support.
The SAS/ACCESS Developer's Corner: SUBCHAR-acteristics was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post. |