Reading a table from a website into a SAS dataset

This post was kindly contributed by SASopedia - go there to comment and to read the full post.

Many a times we may want to read a table from the webpages into our datasets. This may be a requirement especially when I would want to analyse the stock market shares and their corresponding trends over the past. This can be done in many ways depending on the web application that is in consideration.

Here I discuss the filename url and the other related methods to access the static webpage which we see in the browser. However, there are many other different methods like FTPing the webpage through some mechanism and then parsing the html/aspx source tags to get the required data, etc.

Now that the Cricket world cup is here, i’ve decided to use the http://www.espncricinfo.com/ to show how we can read the scorecard into our datasets. I’ve used a match score card which appears like this in the website:

To begin with, we need to assign a filename to the url where the table resides, by using the filename url syntax:

filename fn url “http://www.espncricinfo.com/icc_cricket_worldcup2011/engine/match/473333.html”;

Now that the fileref has been added, we try reading the file into the dataset using the infile/input statements in a datastep:

data _null_;
infile fn lrecl=30000;;
input col1 $10000.;
file “~/test.txt”;
put col1 $10000.;
run;

Note: The default value for the lrecl (where we specify the maximum record length in the file), is 256 characters. However, one can specify a value upto 32767. I’ve specified a length of 30000 just assuming that the html file max length would be 30000.

The above code reads one entire line from the url specified (which essentially contains a html file) into the SAS as a single character variable of length 1000 and writes it into a file named test.txt.

When we open the test.txt to read the contents of the html, we see a lot of html tags which needs to be parsed into a dataset to get the table of our choice.

In the test.txt, we see the following tags appearing in the file as shown below :

<td width=”192″><a class=”playerName” href=”http://www.blogger.com/icc_cricket_worldcup2011/content/player/35263.html” target=”” title=”view the player profile for Virender Sehwag”>V Sehwag </a>&nbsp; </td>

<td class=”battingRuns”>23</td>

<td class=”battingDetails”>30</td>

<td class=”battingDetails”>0</td>

Now, all we need to do is look out for the occurrences of  the text tag: ‘class=”playerName”‘ to fetch the player name; ‘class=”battingRuns”>’ to fetch the player score; ‘class=”battingDetails”>’ to fetch the player matches, and so on and so forth..

This can easily be done by using the following data step code:

data inp;
infile “~/test.txt” lrecl=30000;
input @’class=”playerName”‘ name1 $300. @’class=”battingRuns”>’ runs1 : $20. @’class=”battingDetails”>’ matches1 : $20.;
run;

The above code searches for the occurrences of the text ‘class=”playerName”‘ and reads 300 characters following it into the variable name1. Similarly, it also searches for the occurrence of the text ‘class=”battingRuns”‘ and ‘class=”battingDetails”‘ and reads upto the next 20 characters until it encounters a space (the default delimiter).

Now the first observation of the dataset inp contains the following values:

Name1
Runs1
Matches1
href=”http://www.blogger.com/icc_cricket_worldcup2011/content/player/35263.html” target=”” title=”view the player profile for Virender Sehwag”>V Sehwag </a>&nbsp; </td>
23</td>
30</td>

Now, we need to extract the name (V Sehwag) from the name1 variable. This can be done by picking the index of </a> and the string ‘”>’ from the value of the variable name1. The function is as follows:

x=index(t,'</a>’);
y=index(t,’playerName’);
name=substr(t,x+4,y-x+4);

The above set of functions calculate the x and y indices which is the beginning point and the ending point of the string V Sehwag. Then i do a substr of the string knowing the beginning position and the ending position.

To extract the numeric value from the character value, we use the following function:

runs=input(compress(lowcase(runs1),’abcdefghijklmnopqrstuvwxyz<>/’),8.);

Now, coming all the above set of functions, a data step can be built which would give us the final set of data as follows:

data final;
set inp;
x=index(name1,'</a>’);
y=index(name1,’playerName’);
name=substr(name1,x+4,y-x+4);

runs=input(compress(lowcase(runs1),’abcdefghijklmnopqrstuvwxyz<>/’),8.);
matches=input(compress(lowcase(matches1),’abcdefghijklmnopqrstuvwxyz<>/’),8.);
run;

Thats it for now.. Let me know your experiences or suggestions on doing this in a better way…

This post was kindly contributed by SASopedia - go there to comment and to read the full post.