This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post. |
Since we added the new “Recommended by SAS” widget in the SAS Support Communities, I often find myself diverted to topics that I probably would not have found otherwise. This is how I landed on this question and solution from 2011 — “How to convert 5ft. 4in. (Char) into inches (Num)“. While the question was deftly answered by my friend (and SAS partner) Jan K. from the Netherlands, the topic inspired me to take it a step further here.
Jan began his response by throwing a little shade on the USA:
Short of moving to a country that has a decent metric system in place, I suggest using a regular expression.
On behalf of my nation I just want say that, for the record, we tried. But we did not get very far with our metrication, so we are stuck with the imperial system for most non-scientific endeavors.
Matching patterns with a regular expression
Regular expressions are a powerful method for finding specific patterns in text. The syntax of regular expressions can be a challenge for those getting started, but once you’ve solved a few pattern-recognition problems by using them, you’ll never go back to your old methods.
Beginning with the solution offered by Jan, I extended this program to read in a “ft. in.” measurement, convert to the component number values, express the total value in inches, and then convert the measurement to centimeters. I know that even with my changes, we can think of patterns that might not be matched. But allow me to describe the updates:
- Long-time users of PRXPARSE in SAS will notice that I did not apply the common practice of wrapping PRXPARSE in an “IF _N_=1” condition. For several releases of SAS, it’s been the case that when using a constant parse pattern the DATA step will compile the regular expression just once. Parsing the expression hundreds or thousands of times across each DATA step iteration would be a performance concern — but since in our case the pattern never changes, SAS compiles it just once.
- The PRXPOSN function returns the nth “capture buffer” from the pattern match. Capture buffers are identified in the pattern by parentheses — you count each open parenthesis to arrive at the expected buffer. So the first buffer matches on the first sequence of digits: (\d*). The second buffer is the optional whitespace between ft. and in.: (\s*). Buffer 3 is the entire pattern for “n in.”: ((\d?\.?\d?)in.). Finally, buffer 4 is an “inner” capture group of buffer 3, containing just the sequence of digits with optional decimal for inches: (\d?\.?\d?)
- The PRXPOSN function returns the text value of the match, so we have to use the INPUT function to convert that to a SAS numeric value.
- Finally, I added the calculations to convert to total inches, and then centimeters.
Here’s my program, followed by the result:
data measure; length original $ 25 feet 8 inches 8 total_inches 8 total_cm 8; /* constant regex is parsed just once */ re = prxparse('/(\d*)ft.(\s*)((\d?\.?\d?)in.)?/'); input; original = _infile_; if prxmatch(re, original) then do; feet = input ( prxposn(re, 1, original), best12.); inches = input ( prxposn(re, 4, original), best12.); if missing(inches) and not missing(feet) then inches=0; end; else original = "[NO MATCH] " || original; total_inches = (feet*12) + inches; total_cm = total_inches * 2.54; drop re; cards; 5ft. 4in. 4ft 0in. 6ft. 10in. 3ft.2in. 4ft. 6ft. 1.5in. 20ft. 11in. 25ft. 6.5in. Soooooo big ; |
Other tools to help with regular expressions
The Internet offers a plethora of online tools to help developers build and test regular expression syntax. Here’s a screenshot from RegExr.com, which I used to test some aspects of my program.
Tools like these provide wonderful insight into the capture groups and regex directives that will influence the pattern matching. They are part tutorial, part workbench for regex crafters at all levels.
Many of these tools also include community contributions for matching common patterns. For example, many of us often need to match/parse data with e-mail addresses, phone numbers, and other tokens as data fields. Sites like RegExr.com include the syntax for many of these common patterns that you can copy and use in your SAS programs.
See Also
- Using a regular expression to validate a SAS variable name
- In his book High-Performance SAS Coding, Christian Graffeuille (ChrisNZ on the communities) devotes a chapter to using regular expressions in SAS. (He also answers a TON of regex questions on the community.)
- Ron Cody also teaches about regular expressions for data cleansing — see his blog post and even more in his book, Cody’s Data Cleansing Techniques using SAS.
The post Convert a text-based measurement to a number in SAS appeared first on The SAS Dummy.
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post. |