Find your data pattern with Perl

This post was kindly contributed by The SAS Training Post - go there to comment and to read the full post.

Who doesn’t like bargains? I’m sure you will all agree that good quality at a next-to-nothing cost is irresistible. My recent Dollarama run had me ecstatic about the gloves that come in all colours, styles and sizes for just over a dollar. (Fact: big retail stores charge over 10 times the price). Of course I picked up quite a few, the red one caught my eye and oh! that beige snowflake fingerless pattern looked rather stylish! The jacquard lining was an extra bonus so you got the warmth of wool without the prickly sensation. I raved about it to my coworkers and they couldn’t resist. They just had to go back with me and see for themselves- good quality and patterns at throwaway prices. But, it did come with a large dose of good-natured ribbing about seeing me run a Dollarama franchise in the distant future!

In a recent SAS Programming 1: Essentials class, Julie Monette from Statcan asked “Help! How can I find data that matches a specific pattern”? My HS10_ column has a series of any 10 or 6 digit numbers. An additional challenge- this series never appears in the same position”.

Being one of the largest SAS users in Canada, a million rows of data is like child’s play to Statcan. So I’m sure you can appreciate the need for finding an easy and economical solution for them. Here is a sampling of Julie’s data:

Sticks or profile shapes of subheading 3916.10
Reproduction proofs for the production of printing plates, rolls, of tariff item No. 8442.50.20
Microcopies of tariff item No. 4903.00.10, 4905.91.00, 4911.10.10 or 4911.10.20

How did I help Julie find her pattern?

Once you see the single statement I crafted, I’m sure you’ll be equally convinced that the Perl regular expression is a must-have in your SAS toolkit. For someone who is always looking for beautiful patterns that can be bought cheaply-thoughts of using the IF statement were quickly discarded in favor of the WHERE clause. This is an efficiency question that I promise to tackle in a future blog.

I used a combination of techniques. The LIBNAME statement (that I wrote about in an earlier post) and the WHERE clause (for filtering data) were my mainstays. For pattern finding I’d like to introduce you to the Perl regular expression using the PRXMATCH function to locate the pattern’s position.

/ I used forward slashes as default Perl delimiters.

\d matches a digit 0 to 9

{n} matches the previous expression n times, \d{4} matches any 4 digits

\. Is the pattern to match a period


Of the original 12,865 rows, SAS found 52 rows that matched our pattern.

Wasn’t that amazingly efficient? For more, here’s a handy regular expressions cheat sheet and a nice paper by Ron Cody on Perl regular expressions. Now that you’ve got a taste of the Perl expression, the sky is the limit to finding beautiful patterns in your data. I’d love to hear from you on how you use Perl pattern recognition. Are you able to apply the technique I showed in your work? Drop me a note, I’d love to hear about your experience.

This post was kindly contributed by The SAS Training Post - go there to comment and to read the full post.