Removing trailing characters from SAS strings

This post was kindly contributed by SAS Users - go there to comment and to read the full post.

Deleting any trailing characters in SAS stringsIn my previous post, we solved the task of removing specified leading characters from SAS strings. In this post, we tackle the complementary task of removing trailing characters.

While removing trailing blanks is well covered in SAS by the TRIM() and TRIMN() functions, removing non-blank trailing characters remains a bit of a mystery that can pop up during text string processing.

For example, you may need to clean up the following strings by removing all trailing x‘s from them:

012345x
012345xxx
012345xx

These extra characters can result from data entry errors, prior data manipulations, etc. No matter how you get them in, you want them out.

How to remove any trailing characters

For instance, let’s remove all occurrences of the arbitrary trailing character '*'. The following diagram illustrates what we are going to achieve and how:
Diagram: Deleting trailing characters
In order to remove a specified character (in this example '*') from all trailing positions in a string, we need to search our string from right to left starting from the rightmost non-blank character position and find the position p of the first character in that string that is not equal to the specified character. Note, that despite our right-to-left direction of search the position p=10 is still enumerated from left-to-right. Then we can extract the substring starting from position 1 with the length equal to the found position p.

Unlike in our leading characters removal solution, out of two contenders for our search functionality, VERIFY and FINDC, the VERIFY function has to be dropped from the competition as it does not provide right-to-left search functionality. However, the FINDC function stays on track. Here is a possible solution using the FINDC() function.

Using FINDC() function with negative start-position

The FINDC(X, C, ’K’, -LENGTH(X)) function searches string X from right to left starting from the last non-blank character position determined by the optional start-position argument equal to LENGTH(X), and returns the position P of the first character that does not appear in C.

Here we use the K modifier that switches the default behavior of searching for any character that appears in C to searching for any character that does not appear in C.

The direction of search is defined by the minus sign in front of the start-position (a negative start-position argument translates into searching from right to left.)

Then we can apply the SUBSTR(X, 1, P) function that extracts a substring of X starting from position 1 with a length of P which is effectively a substring of the first P characters in X.

Special considerations

Before we proceed to the code implementation of the outlined solution let’s consider the following edge case.

If our string X consists of all '*' characters and nothing else the FINDC() function will find no character (and therefore no position) that is not equal to '*'. In this case it will return 0. However, 0 is not a valid third argument value for the SUBSTR(X, 1, P) function. Valid values are 1 . . . through VLENGTH(X) – the length attribute of X. Having a 0 value for the third argument will trigger the automatic data step variable _ERROR_=1 and the following note generated in the SAS log:

NOTE: Invalid third argument to function SUBSTR at line ## column #.

Therefore, we need to handle this special case separately, conditionally using SUBSTR(X, 1, P) for P>0 and assigning blank ('') otherwise.

Code implementation for removing trailing characters

Now we are ready to put everything together. First, let’s create a test data table:

data TEST;
   input X $ 1-20;
   datalines;
*It's done***
*********
**01234*ABC**
No trailing *'s
;

Then we apply the logic described above. The following DATA step illustrates our coding solution for deleting trailing characters:

data CLEAN (keep=X Y);
   set TEST;
   C = '*'; *<- trailing character(s) to be removed;
 
   P = findc(X, C, 'K', -length(X));
   if P then Y = substr(X, 1, P); 
        else Y = '';
 
   put _n_= / X= / P= / Y= /;
run;

The SAS log will show interim and final results by the DATA step iterations:

_N_=1
X=*It's done***
P=10
Y=*It's done
 
_N_=2
X=*********
P=0
Y=
 
_N_=3
X=**01234*ABC**
P=11
Y=**01234*ABC
 
_N_=4
X=No trailing *'s
P=15
Y=No trailing *'s

Here is the output data table CLEAN showing the original and the resulting strings X and Y side by side:
Removing any trailing characters in SAS strings

Conclusion

The solution presented in this blog post expands trailing character deletion functionality beyond solely blanks (which are handled by the TRIM and TRIMN functions). Moreover, using this coding technique, we can simultaneously remove a variety of trailing characters. For example, if we have a string X='012345xxx.%' and specify C = 'x.%' (the order of characters listed within the value of C does not matter), then all three characters 'x', '.', and '%' will be removed from all trailing positions of X. The resulting string will be Y='012345'.

In addition, numerous modifiers of the FINDC() function allow specifying many characters in bulk, without explicitly listing them one by one. For example, we may augment a list of characters being removed by adding the D modifier as in P = FINDC(X, C, ‘KD‘, -LENGTH(X)) which will remove all trailing digits in addition to those characters specified in C. Similarly, we may throw in the U modifier as in P = FINDC(X, C, ‘KDU‘, -LENGTH(X)) which adds all uppercase letters to the list of trailing characters to be removed. And so on.

Additional resources

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions, concerns, comments? Please share with us below.

Removing trailing characters from SAS strings was published on SAS Users.

This post was kindly contributed by SAS Users - go there to comment and to read the full post.