Many a times we may want to read a table from the webpages into our datasets. This may be a requirement especially when I would want to analyse the stock market shares and their corresponding trends over the past. This can be done in many ways depending on the web application that is in consideration.
Here I discuss the filename url and the other related methods to access the static webpage which we see in the browser. However, there are many other different methods like FTPing the webpage through some mechanism and then parsing the html/aspx source tags to get the required data, etc.
Now that the Cricket world cup is here, i've decided to use the http://www.espncricinfo.com/ to show how we can read the scorecard into our datasets. I've used a match score card which appears like this in the website:
To begin with, we need to assign a filename to the url where the table resides, by using the filename url syntax:
filename fn url "http://www.espncricinfo.com/icc_cricket_worldcup2011/engine/match/473333.html";
Now that the fileref has been added, we try reading the file into the dataset using the infile/input statements in a datastep:
data _null_;
infile fn lrecl=30000;;
input col1 $10000.;
file "~/test.txt";
put col1 $10000.;
run;
Note: The default value for the lrecl (where we specify the maximum record length in the file), is 256 characters. However, one can specify a value upto 32767. I've specified a length of 30000 just assuming that the html file max length would be 30000.
The above code reads one entire line from the url specified (which essentially contains a html file) into the SAS as a single character variable of length 1000 and writes it into a file named test.txt.
When we open the test.txt to read the contents of the html, we see a lot of html tags which needs to be parsed into a dataset to get the table of our choice.
In the test.txt, we see the following tags appearing in the file as shown below :
<td width="192"><a class="playerName" href="http://www.blogger.com/icc_cricket_worldcup2011/content/player/35263.html" target="" title="view the player profile for Virender Sehwag">V Sehwag </a> </td>
<td class="battingRuns">23</td>
<td class="battingDetails">30</td>
<td class="battingDetails">0</td>
Now, all we need to do is look out for the occurrences of the text tag: 'class="playerName"' to fetch the player name; 'class="battingRuns">' to fetch the player score; 'class="battingDetails">' to fetch the player matches, and so on and so forth..
This can easily be done by using the following data step code:
data inp;
infile "~/test.txt" lrecl=30000;
input @'class="playerName"' name1 $300. @'class="battingRuns">' runs1 : $20. @'class="battingDetails">' matches1 : $20.;
run;
The above code searches for the occurrences of the text 'class="playerName"' and reads 300 characters following it into the variable name1. Similarly, it also searches for the occurrence of the text 'class="battingRuns"' and 'class="battingDetails"' and reads upto the next 20 characters until it encounters a space (the default delimiter).
Now the first observation of the dataset inp contains the following values:
Now, we need to extract the name (V Sehwag) from the name1 variable. This can be done by picking the index of </a> and the string '">' from the value of the variable name1. The function is as follows:
x=index(t,'</a>');
y=index(t,'playerName');
name=substr(t,x+4,y-x+4);
The above set of functions calculate the x and y indices which is the beginning point and the ending point of the string V Sehwag. Then i do a substr of the string knowing the beginning position and the ending position.
To extract the numeric value from the character value, we use the following function:
runs=input(compress(lowcase(runs1),'abcdefghijklmnopqrstuvwxyz<>/'),8.);
Now, coming all the above set of functions, a data step can be built which would give us the final set of data as follows:
data final;
set inp;
x=index(name1,'</a>');
y=index(name1,'playerName');
name=substr(name1,x+4,y-x+4);
runs=input(compress(lowcase(runs1),'abcdefghijklmnopqrstuvwxyz<>/'),8.);
matches=input(compress(lowcase(matches1),'abcdefghijklmnopqrstuvwxyz<>/'),8.);
run;
Thats it for now.. Let me know your experiences or suggestions on doing this in a better way...
Here I discuss the filename url and the other related methods to access the static webpage which we see in the browser. However, there are many other different methods like FTPing the webpage through some mechanism and then parsing the html/aspx source tags to get the required data, etc.
Now that the Cricket world cup is here, i've decided to use the http://www.espncricinfo.com/ to show how we can read the scorecard into our datasets. I've used a match score card which appears like this in the website:
To begin with, we need to assign a filename to the url where the table resides, by using the filename url syntax:
filename fn url "http://www.espncricinfo.com/icc_cricket_worldcup2011/engine/match/473333.html";
Now that the fileref has been added, we try reading the file into the dataset using the infile/input statements in a datastep:
data _null_;
infile fn lrecl=30000;;
input col1 $10000.;
file "~/test.txt";
put col1 $10000.;
run;
Note: The default value for the lrecl (where we specify the maximum record length in the file), is 256 characters. However, one can specify a value upto 32767. I've specified a length of 30000 just assuming that the html file max length would be 30000.
The above code reads one entire line from the url specified (which essentially contains a html file) into the SAS as a single character variable of length 1000 and writes it into a file named test.txt.
When we open the test.txt to read the contents of the html, we see a lot of html tags which needs to be parsed into a dataset to get the table of our choice.
In the test.txt, we see the following tags appearing in the file as shown below :
<td width="192"><a class="playerName" href="http://www.blogger.com/icc_cricket_worldcup2011/content/player/35263.html" target="" title="view the player profile for Virender Sehwag">V Sehwag </a> </td>
<td class="battingRuns">23</td>
<td class="battingDetails">30</td>
<td class="battingDetails">0</td>
Now, all we need to do is look out for the occurrences of the text tag: 'class="playerName"' to fetch the player name; 'class="battingRuns">' to fetch the player score; 'class="battingDetails">' to fetch the player matches, and so on and so forth..
This can easily be done by using the following data step code:
data inp;
infile "~/test.txt" lrecl=30000;
input @'class="playerName"' name1 $300. @'class="battingRuns">' runs1 : $20. @'class="battingDetails">' matches1 : $20.;
run;
The above code searches for the occurrences of the text 'class="playerName"' and reads 300 characters following it into the variable name1. Similarly, it also searches for the occurrence of the text 'class="battingRuns"' and 'class="battingDetails"' and reads upto the next 20 characters until it encounters a space (the default delimiter).
Now the first observation of the dataset inp contains the following values:
Name1 | Runs1 | Matches1 |
href="http://www.blogger.com/icc_cricket_worldcup2011/content/player/35263.html" target="" title="view the player profile for Virender Sehwag">V Sehwag </a> </td> | 23</td> | 30</td> |
Now, we need to extract the name (V Sehwag) from the name1 variable. This can be done by picking the index of </a> and the string '">' from the value of the variable name1. The function is as follows:
x=index(t,'</a>');
y=index(t,'playerName');
name=substr(t,x+4,y-x+4);
The above set of functions calculate the x and y indices which is the beginning point and the ending point of the string V Sehwag. Then i do a substr of the string knowing the beginning position and the ending position.
To extract the numeric value from the character value, we use the following function:
runs=input(compress(lowcase(runs1),'abcdefghijklmnopqrstuvwxyz<>/'),8.);
Now, coming all the above set of functions, a data step can be built which would give us the final set of data as follows:
data final;
set inp;
x=index(name1,'</a>');
y=index(name1,'playerName');
name=substr(name1,x+4,y-x+4);
runs=input(compress(lowcase(runs1),'abcdefghijklmnopqrstuvwxyz<>/'),8.);
matches=input(compress(lowcase(matches1),'abcdefghijklmnopqrstuvwxyz<>/'),8.);
run;
Thats it for now.. Let me know your experiences or suggestions on doing this in a better way...
Well, apart from connecting directly to the URL from SAS, the rest of the procedure is quite intensive! I hadn't known about the @class types.
ReplyDeleteWouldn't it just be easier to copy it into Excel and upload??! :)
Nice post, and I think I'm going to be following this from now on.
Thanks,
Arun (ex-colleague)
Arun,
ReplyDeleteIt would always be the best to copy paste the table from the browser to the excel sheet.. The above method would come handy only when you need to do this on a regular basis.. so that you can automate the code..
However, you may need to be sure that the format of the html which is being rendered has to be consistent always... Otherwise our parsing mechanism goes for a toss..
You may also want to explore the libname xml method for parsing the html files as explained here: http://support.sas.com/rnd/base/xmlengine/sxle913/usersguide913.htm
Thanks for your encouraging words.. This truly motivates me to write better!
The compress function has a third argument, which can be used to add a list of characters to the second argument. I think 'a' specifies all alphabet characters (upper and lower), so perhaps that would shorten your code. Take a look at the other modifiers for more useful characters.
ReplyDeletehttp://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212246.htm
Hi Chris,
ReplyDeleteUsing the compress function would definitely help in parsing the html codes...
I also realized later that i could even parse this using the xml engine in SAS as expained here:
http://support.sas.com/rnd/base/xmlengine/sxle91/xmlprocess.html
However, we may not be lucky always to get a 'well constructed' html page to employ this method for parsing...