This post was kindly contributed by SAS Users - go there to comment and to read the full post.
Before we delve into unquoting SAS character variables, let’s briefly review the existing SAS functionality for quoting and unquoting character strings.
%QUOTE and %UNQUOTE macro functions
Don’t be fooled by these macro functions’ names. They have nothing to do with quoting or un-quoting character variables’ values. Moreover, they have nothing to do with quoting or un-quoting even macro variables’ values. According to the %QUOTE Macro Function documentation it masks special characters and mnemonic operators in a resolved value at macro execution. %UNQUOTE Macro Function unmasks all special characters and mnemonic operators so they are interpreted as macro language elements instead of as text. There are many other SAS “macro quoting functions” (%SUPERQ, %BQUOTE, %NRBQUOTE, all macro functions whose name starts with %Q: %QSCAN, %QSUBSTR, %QSYSFUNC, etc.) that perform some action including masking.
Historically, however, the SAS Macro Language uses the terms “quote” and “unquote” to mean “mask” and “unmask”. Keep that in mind when reading the SAS Macro documentation.
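To make the "mask" versus "unmask" meaning concrete, here is a minimal sketch. It uses %NRSTR (another macro quoting function, chosen here only for illustration) to mask an ampersand, and %UNQUOTE to unmask it again. The macro variable name a is arbitrary.

```sas
/* Mask the ampersand so the text is stored as is, without being resolved */
%let a = %nrstr(&sysdate9);

/* The masked value is treated as plain text: the log shows the literal reference */
%put Masked: &a;

/* After unmasking, the reference is expected to resolve to the session start date */
%put Unmasked: %unquote(&a);
```

Note that no quotation marks are added or removed anywhere in this example, which is exactly the point: macro "quoting" is about masking, not about quotation marks.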
QUOTE function
Most SAS programmers are familiar with the QUOTE function that adds quotation marks around a character value. It can add double quotation marks (by default) or single quotation marks if you specify that in its second argument.
This function goes even further: it doubles any quotation mark of the enclosing kind that already exists within the value, so that an embedded quotation mark is escaped (not treated as an opening or closing quotation mark) during parsing.
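A short sketch of both forms of the QUOTE function may help here. The variable names and the sample text are made up for illustration; STRIP() is applied first so that trailing blanks of the fixed-length variable do not end up inside the quotation marks.

```sas
data _null_;
   length s $20 q1 q2 $40;
   s  = 'Say "hi", it''s me';      /* value: Say "hi", it's me */
   q1 = quote(strip(s));           /* double quotation marks (default); embedded " are doubled */
   q2 = quote(strip(s), "'");      /* single quotation marks; embedded ' is doubled */
   put q1= / q2=;
run;
```

With this input, q1 should come out as "Say ""hi"", it's me" and q2 as 'Say "hi", it''s me'.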
DEQUOTE function
There is also a complementary DEQUOTE function that removes matching quotation marks from a character string that begins with a quotation mark. But be warned that it also deletes all characters to the right of the first matching quotation mark. In my view, deleting those characters is overkill because when writing a SAS program, we may not know what is going to be in the data and whether it’s okay to delete its part outside the first matching quotes. That is why you need to be extra careful if you decide to use this function. Here is an example of what I mean. If you run the following code:
data a;
   input x $ 1-50;
   datalines;
'This is what you get. Let's be careful.'
;

data _null_;
   set a;
   y = dequote(x);
   put x= / y=;
run;
you will get the following in the SAS log:
y=This is what you get. Let
This is hardly what you really wanted as you have just lost valuable information – part of the y character value got deleted: 's be careful. I would rather not remove the quotation marks at all than remove them at the expense of losing meaningful information.
$QUOTE informat
The $QUOTE informat does exactly what the DEQUOTE() function does: it removes matching quotation marks from a character string that begins with a quotation mark. You can use it in the example above by replacing
y = dequote(x);
with the INPUT() function
y = input(x, $quote50.);
Or you can use it directly in the INPUT statement when reading raw data from datalines or an external file:
input x $quote50.;
In addition to removing all characters to the right of the closing quotation mark, both the $QUOTE informat and the DEQUOTE() function do the following unconventional, peculiar things:
- Remove a lone quotation mark (either double or single) when it’s the only character in the string; apparently, the lone quotation mark is matched to itself.
- Match single quotation mark with double quotation mark as if they are the same.
- Remove matching quotation marks from a character string that begins with a quotation mark; if your string has one or more leading blanks (that is, a quotation mark is not the first character), nothing gets removed (un-quoted).
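If you want to check these behaviors in your own SAS session, a small test harness along the following lines can be used. The test values are made up for illustration; the step simply writes the original value next to the DEQUOTE() and $QUOTE results so you can compare them in the log.

```sas
data _null_;
   infile datalines truncover;
   input x $char20.;                 /* $CHAR keeps leading blanks */
   length y z $20;
   y = dequote(x);                   /* DEQUOTE function */
   z = input(x, $quote20.);          /* $QUOTE informat */
   put x= $char20. +1 y= $char20. +1 z= $char20.;
   datalines;
"
'
"abc'
  "abc"
"abc" def
;
```

The five test lines cover a lone double quotation mark, a lone single quotation mark, mismatched quotation marks, a quoted value with leading blanks, and text following the closing quotation mark.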
If the described behavior matches your use case, you are welcome to use either the $QUOTE informat or the DEQUOTE() function. Otherwise, please read on.
UNQUOTE function definition
Up to this point such a function did not exist, but we are about to create one to justify the title. Let’s keep it simple and straightforward. Here is what I propose our new unquote() function should do:
- If the first and last non-blank characters of a character string value are matching quotation marks, we will remove them. We will not consider quotation marks matching if one of them is a single quotation mark and the other is a double quotation mark.
- We will remove those matching quotation marks whether they are both single quotation marks OR both double quotation marks.
- We are not going to remove or change any other quotation marks that may be present within those matching quotation marks that we remove.
- We will remove leading and trailing blanks outside the matching quotation marks that we delete.
- However, we will not remove any leading or trailing blanks within the matching quotation marks that we delete. You may additionally apply the STRIP() function if you need to do that.
To summarize these specifications, our new UNQUOTE() function will extract a character substring within matching quotation marks if they are the first and the last non-blank characters in a character string. Otherwise, it returns the character argument unchanged.
UNQUOTE function implementation
Here is how such a function can be implemented using PROC FCMP:
libname funclib 'c:\projects\functions';

proc fcmp outlib=funclib.userfuncs.v1; /* outlib=libname.dataset.package */
   function unquote(x $) $32767;
      pos1 = notspace(x); *<- first non-blank character position;
      if pos1=0 then return (x); *<- empty string;
      char1 = char(x, pos1); *<- first non-blank character;
      if char1 not in ('"', "'") then return (x); *<- first non-blank character is not a quotation mark;
      posL = notspace(x, -length(x)); *<- last non-blank character position;
      if pos1=posL then return (x); *<- single character string;
      charL = char(x, posL); *<- last non-blank character;
      if charL^=char1 then return (x); *<- last non-blank character does not match first;
      /* at this point we should have matching quotation marks */
      return (substrn(x, pos1 + 1, posL - pos1 - 1)); *<- remove first and last quotation character;
   endfunc;
run;
Here are the highlights of this implementation:
We use multiple RETURN statements: we sequentially check for different special conditions and if one of them is met we return the argument value intact. The RETURN statement does not just return the value, but also stops any further function execution.
At the very end, after making sure that none of the special conditions is met, we strip the matching quotation marks from the argument value, along with the leading and trailing blanks outside of them.
NOTE: SAS user-defined functions are stored in a SAS data set specified in the outlib= option of PROC FCMP. It requires a 3-level name (libref.datasetname.packagename) for the function definition location to allow for several versions of the same-name function to be stored there.
However, when a user-defined function is used in a SAS DATA Step, only a 2-level name (libref.datasetname) can be specified in the CMPLIB= system option. If that data set has several same-name functions stored in different packages, the DATA Step uses the latest function definition (found in a package closest to the bottom of the data set).
UNQUOTE function results
Let’s use the following code to test our newly minted user-defined function UNQUOTE():
libname funclib 'c:\projects\functions';
options cmplib=funclib.userfuncs;

data A;
   infile datalines truncover;
   input @1 S $char100.;
   datalines;
'
"
How about this?
How about this?
"How about this?"
'How about this?'
"How about this?'
'How about this?"
" How about this?"
' How about this?'
' How "about" this?'
' How 'about' this?'
" How about this?"
" How "about" this?"
" How 'about' this?"
' How about this?'
;

data B;
   set A;
   length NEW_S $100;
   label NEW_S = 'unquote(S)';
   NEW_S = unquote(S);
run;
This code produces the following output table:
As you can see it does exactly what we wanted it to do – removing matching first and last quotation marks as well as stripping out blanks outside the matching quotation marks.
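As a side-by-side check, the following sketch runs both DEQUOTE() and the user-defined UNQUOTE() on the value from the earlier DEQUOTE example. It assumes the UNQUOTE() function has already been compiled into funclib.userfuncs as shown above; the variable names are arbitrary.

```sas
libname funclib 'c:\projects\functions';
options cmplib=funclib.userfuncs;

data _null_;
   length x y z $50;
   x = "'This is what you get. Let's be careful.'";
   y = dequote(x);    /* stops at the first matching quotation mark */
   z = unquote(x);    /* removes only the first and last quotation marks */
   put y= / z=;
run;
```

Here y should be truncated to This is what you get. Let, while z should keep the whole sentence: This is what you get. Let's be careful.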
DSD (Delimiter-Sensitive Data) option
This INFILE statement option is particularly useful when using LIST input to read and un-quote comma-delimited raw data. In addition to removing enclosing quotation marks from character values, the DSD option specifies that when data values are enclosed in quotation marks, delimiters within the value are masked, that is, treated as character data (not as delimiters). It also sets the default delimiter to a comma and treats two consecutive delimiters as a missing value.
In contrast with the above UNQUOTE() function, the DSD option will not remove enclosing quotation marks if additional quotation marks of the same kind are present inside the character value. When the DSD option does strip enclosing quotation marks, it also strips leading and trailing blanks outside and within the removed quotation marks.
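Here is a minimal sketch of the DSD option at work with LIST input. The data set name, variable names and data values are made up for illustration.

```sas
data C;
   infile datalines dsd truncover;     /* DSD: comma delimiter, quotes removed, ",," = missing */
   length name $30 city $20 state $2;
   input name city state;
   datalines;
"Smith, John",Boston,MA
"Doe, Jane",,NY
;

proc print data=C;
run;
```

The comma inside the quoted names is read as data rather than as a delimiter, the enclosing quotation marks are removed, and the two consecutive commas in the second record produce a missing value for city.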
Additional Resources
- Using PROC FCMP to the Fullest: Getting Started and Doing More
- How does PROC FCMP store functions?
- Finding n-th instance of a substring within a string
- Expanding lengths of all character variables in SAS data sets
Your thoughts?
Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.
How to unquote SAS character variable values was published on SAS Users.