This post was kindly contributed by Jared Prins' Blog - SAS - go there to comment and to read the full post.
I’ve solved a problem I had in my previous post, “Strip Characters From Between Two Delimiters”. At work I was given a strangely formatted data file (example).
I needed to separate out the data chunks to work with them. While my first solution was to give up and use Ruby, my second solution is much easier and uses SAS. I was able to mash up bits of code suggested by SAS programmers all over the internet.
Chris suggested this PRX solution, which can work, but it took extra effort to format the data to fit. I had to go outside the SAS toolset and use HTMLtidy to format the XML for use with the PRX solution.
I’ve strung together the code below which compares the two solutions. No doubt there are inefficiencies with this code, but I learned all sorts of valuable lessons:
- I’ve relearned that there is always more than one way to skin a cat
- Regex is difficult to learn, but worth it
- The tranwrd function has limitations
- How to pause code to wait until the systask command is complete
- Infile is fun to use
But the most important lesson is this:
Use the right tools for the job. Don’t limit yourself to only using a single tool. SAS is great for what it was designed to do. But in the case of prettying up an XML file, HTMLtidy was specifically written to do just that. SAS lets you easily run shell commands such as HTMLtidy (for EG users, you need to edit your registry).
Here is the code. Have fun with it. Don’t forget to put startdata.txt and tidy.config in your c:\ or edit the paths accordingly. The tidy config file tells tidy you are dealing with XML input and you want XML output whereas the default tidy configuration is for HTML.
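The tidy.config file itself is not reproduced in this post. As a minimal sketch, assuming Tidy's standard `input-xml` and `output-xml` options are all that is needed to switch it from HTML to XML handling, it might contain something like this (the real file may well set more options):

```
// tidy.config (hypothetical contents, not the original file)
// treat the input as XML rather than HTML
input-xml: yes
// write the result back out as XML
output-xml: yes
```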
%let SOL="<row><datetime>";
%let EOL="</email></row>";

/*** CREATE DATASET ***/
data data1;
    infile "c:\startdata.txt" dsd missover truncover lrecl = 2048;
    input line $2048.;
    /* swap out the odd delimiters for XML tags. */
    line = tranwrd(line,":Location:","</datetime><location>");
    line = tranwrd(line,":Gender:","</location><gender>");
    line = tranwrd(line,":Comment:","</gender><comment>");
    line = tranwrd(line,":Email:","</comment><email>");
    /* finish the XML */
    line = cats(&SOL,line,&EOL);
run;

/*
Note: Be careful with tranwrd.
If the row is exactly 2048 characters and you swap in characters longer than what
is being swapped out, you'll cut off your data.
*/

/*** COMPLETE the XML file ***/
data data2;
    set data1 end=eof;
    if _n_ = 1 then
        line = cats("<?xml version='1.0' encoding='windows-1252' ?><xml>",line);
    if eof then
        line = cats(line,"</xml>");
run;

/*** EXPORT the XML which is one observation per line ***/
proc export data=data2
    outfile = "C:\enddata.txt"
    dbms=tab replace;
    putnames=no;
run;

/*** IMPORT the XML file using XML libname. ***/
libname a xml 'c:\enddata.txt';
data data3;
    set a.row;
run;

/*
At this point, we are now done. We can use data3 easily.
The above solution works fantastic as long as you have no
special characters in your data.
For example, the ampersand & is a special character in XML
and SAS won't import it.
One way around this is to include CDATA tags
<![CDATA[text goes here]]> in the tranwrd swap.
e.g. line = tranwrd(line,":Location:","]]></datetime><location><![CDATA[");
Another way could be to swap out special characters for literal characters.
e.g. line = tranwrd(line,"&","and");
Since I am only interested in the COMMENTS portion
of the dataset, the below PRX solution worked for me.
My biggest issue was to "beautify" the XML so that only one tagset appeared
on each line.
*/

/*
If we don't beautify the XML, the following PRX solution does NOT work.
It will result in an empty dataset.
*/
data data5 (keep = line);
    retain queName;
    retain line;
    set data2;
    /* use PRX to capture the structure of XML data */
    if _n_=1 then do;
        queName=prxparse('/^\<comment\>/');
    end;
    queNameN=prxmatch(queName,line);
    /* use PRX to remove the XML tags */
    if queNameN>0 then do;
        rx1=prxparse("s/<.*?>//");
        call prxchange(rx1,99,line);
        output;
    end;
run;

/*** HTMLtidy is used to properly format and beautify the XML file ***/
SYSTASK COMMAND "tidy -config c:\tidy.config -m c:\enddata.txt" taskname=prog1;
waitfor prog1; /* we have to wait for the systask to run before continuing. */

/*** BRING THE PROPERLY FORMED XML DATA BACK IN ***/
data data4;
    infile "c:\enddata.txt" dsd missover truncover lrecl = 2048;
    input line $2048.;
run;

/*** EXTRACT THE COMMENTS DATA NODE ***/
data data6 (keep = line);
    retain queName;
    retain line;
    set data4;
    /* use PRX to capture the structure of XML data */
    if _n_=1 then do;
        queName=prxparse('/^\<comment\>/');
    end;
    queNameN=prxmatch(queName,line);
    /* use PRX to remove the XML tags */
    if queNameN>0 then do;
        rx1=prxparse("s/<.*?>//");
        call prxchange(rx1,99,line);
        output;
    end;
run;
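The tranwrd limitation mentioned in the lessons and in the note inside the code is easy to see on a small scale. Here is a minimal sketch using made-up variable names (`short`, `wide`) and a made-up `<tag>` replacement, none of which come from the real data: the result of tranwrd is cut to the length of whatever variable receives it.

```sas
/* Sketch: TRANWRD output is cut to the length of the receiving variable. */
/* The variable names and the "<tag>" value are illustrative only.        */
data _null_;
    length short $10 wide $40;
    short = "a:b:c:d";
    wide  = short;

    /* Each 1-character ":" becomes a 5-character "<tag>",   */
    /* so the full result needs 19 characters.               */
    short = tranwrd(short, ":", "<tag>");  /* only the first 10 characters fit */
    wide  = tranwrd(wide,  ":", "<tag>");  /* 40 characters leaves enough room */

    put short= wide=;  /* write both values to the log for comparison */
run;
```

For the program above, one way to leave headroom would probably be to declare something like `length line $4096;` before the `input` statement in data1, so that `line` is wider than the 2048 characters being read. The 4096 is an arbitrary figure, not something I have tuned.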