SAS multiple character and varying length Delimiters

This post was kindly contributed by Jared Prins' Blog - SAS - go there to comment and to read the full post.

I’ve solved a problem I had in my previous post “Strip Characters From Between Two Delimiters“. At work I was given a strangely formatted data file – example.

I needed to separate out the data chunks to work with it. While my first solution was to give up and use Ruby, my second solution is much easier and uses SAS. I was able to mashup bits of code suggestions from SAS programmers all over the internet.

Chris suggested this PRX solution which can work, but it took extra effort to format the data to fit. I had to go outside the SAS toolset and use HTMLtidy to format the XML for use with the PRX solution.

I’ve strung together the code below which compares the two solutions. No doubt there are inefficiencies with this code, but I learned all sorts of valuable lessons:

I’ve relearned that there is always more than one way to skin a cat
Regex is difficult to learn, but worth it
The tranwrd function has limitations
How to pause code to wait until the systask command is complete
Infile is fun to use

But the most important lesson is this:

Use the right tools for job. Don’t limit yourself to only using a single tool. SAS is great for what it was designed to do. But in the case of prettying up an XML file, HTMLtidy was specifically written to do just that. SAS let’s you easily run shell commands such as HTMLtidy (for EG users, you need to edit your registry).

Here is the code. Have fun with it. Don’t forget to put startdata.txt and tidy.config in your c:\ or edit the paths accordingly. The tidy config file tells tidy you are dealing with XML input and you want XML output whereas the default tidy configuration is for HTML.

%let SOL=”<row><datetime>”;

%let EOL=”</email></row>”;

/*** CREATE DATASET ***/

data data1;

infile “c:\startdata.txt” dsd missover truncover lrecl = 2048;

   input line $2048.;

/* swap out the odd delimeters for XML tags. */

line = tranwrd(line,”:Location:”,”</datetime><location>”);

line = tranwrd(line,”:Gender:”,”</location><gender>”);

line = tranwrd(line,”:Comment:”,”</gender><comment>”);

line = tranwrd(line,”:Email:”,”</comment><email>”);

/* finish the XML */

line = cats(&SOL,line,&EOL);

run;

/*Note: Be careful with tranwrd.

If the row is exactly 2048 characters and you swap in characters longer than what

is being swapped out, you’ll cut off your data.*/

/*** COMPLETE the XML file ***/

data data2;set data1 end=eof;

if _n_ =1 then

line = cats(“<?xml version=’1.0′ encoding=’windows-1252′ ?><xml>”,line);

if eof then

line = cats(line,”</xml>”);

run;

/*** EXPORT the XML which is one observation per line ***/

proc export data=data2

   outfile = “C:\enddata.txt”

   dbms=tab replace;

   putnames=no;

run;

/*** IMPORT the XML file using XML libname. ***/

libname a xml ‘c:\enddata.txt’;

data data3; set a.row; run;

/*

At this point, we are now done. We can use data3 easily.

The above solution works fantastic as long as you have no

special characters in your data.

For example, the ampersand & is a special character in XML

and SAS won’t import it.

One way around this is to include CDATA tags

<![CDATA[text goes here]]> in the tranwrd

swap.

e.g. line = tranwrd(line,”:Location:”,”]]></datetime><location><![CDATA[“);

Another way could be to swap out special characters for literal characters.

e.g. line = tranwrd(line,”&”,”and”);

Since I am only interested in the COMMENTS portion

of the dataset, the below PRX solution worked for me.

My biggest issue was to “beautify” the XML so that only one tagset appeared

on each line.

*/

/* If we don’t beautify the XML, the following PRX solution does NOT work.

It will result in an empty dataset.*/

data data5 (keep = line );

   retain queName ;

   retain line ;

   set data2;

   /*use PRX to capture the structure of XML data;*/

   If _n_=1 then do;

   queName=prxparse(‘/^\<comment\>/’);

   end;

   queNameN=prxmatch(queName,line);

   /*use PRX to remove the XXML tags;*/

   if queNameN>0 then do;

   rx1=prxparse(“s/<.*?>//”);

   call prxchange(rx1,99,line);

   output;

   end;

run;

/*** HTMLtidy is used to properly format and beautify the XML file ***/

SYSTASK COMMAND “tidy -config c:\tidy.config -m c:\enddata.txt” taskname=prog1;

waitfor prog1; /* we have to wait for the systask to run before continuing. */

/*** BRING THE PROPERLY FORMED XML DATA BACK IN ***/

data data4;

infile “c:\enddata.txt” dsd missover truncover lrecl = 2048;

   input line $2048.;

run;

/*** EXTRACT THE COMMENTS DATA NODE ***/

data data6 (keep = line);

   retain queName ;

   retain line ;

   set data4;

   /*use PRX to capture the structure of XML data;*/

   If _n_=1 then do;

   queName=prxparse(‘/^\<comment\>/’);

   end;

   queNameN=prxmatch(queName,line);

   /*use PRX to remove the XXML tags;*/

   if queNameN>0 then do;

   rx1=prxparse(“s/<.*?>//”);

   call prxchange(rx1,99,line);

   output;

   end;

run;

This post was kindly contributed by Jared Prins' Blog - SAS - go there to comment and to read the full post.

Post navigation