Monday, February 28, 2011

Making SAS Interactive (Part 1): Using stdin and stdout

Many a times, we may come across a need for having a dynamic programs. Meaning, we may need the user to key in the input and run the code accordingly, based on his input. This can be achieved in SAS by using the automatic file descriptors: stdin and stdout. This is more widely used in UNIX environment, especially when we batch submit the code in the command line.

In the below code, I illustrate the use of stdin and stdout by implementing a simple calculator, which takes in the numbers and the operators as the arguments and outputs the results.

data test;
if (_N_ eq 1) then do;
 file stdout;
 infile stdin;
 put @1 "Enter the first variable:";
 input X @;
 put @1 "Enter the second variable:";
 input Y @;
 put @1 "Choose the operator: + - * / **:";
 input op $;
end;
retain X Y op;
select (op);
 when ('+') result=X+Y;
 when ('-') result=X-Y;
 when ('*') result=X*Y;
 when ('/') result=X/Y;
 when ('**') result=X**Y;
 otherwise ;
end;
put "The result is:" result;
run;


In the above code, I've redirected the infile and file statements to the stdin and stdout respectively. So the input is always read through the terminal key and the output is always written to the terminal.

When we run the above code in the batch mode, we get the following output:


We can also route the output of a procedure into the terminal using the proc printo as shown below:

proc printto print=stdout;
run;


The below code would output all the details of the student whose name is keyed in the terminal for the sashelp.class dataset:

data name;
title;
if (_N_ eq 1) then do;
 file stdout;
 infile stdin;
 put @1 "Enter the student name:";
 input n $;
end;
retain n;
call symput('name',n);
run;

proc printto print=stdout;
run;
options nodate nonumber;
proc print data=sashelp.class noobs;
where name="&name";
run;


This would give us the below output:



Let me know if you guys have any thoughts or other approaches.

More to come: Making SAS Iinteractive (Part 2): Using window prompts

Saturday, February 19, 2011

Reading a table from a website into a SAS dataset

Many a times we may want to read a table from the webpages into our datasets. This may be a requirement especially when I would want to analyse the stock market shares and their corresponding trends over the past. This can be done in many ways depending on the web application that is in consideration.

Here I discuss the filename url and the other related methods to access the static webpage which we see in the browser. However, there are many other different methods like FTPing the webpage through some mechanism and then parsing the html/aspx source tags to get the required data, etc.

Now that the Cricket world cup is here, i've decided to use the http://www.espncricinfo.com/ to show how we can read the scorecard into our datasets. I've used a match score card which appears like this in the website:



To begin with, we need to assign a filename to the url where the table resides, by using the filename url syntax:

filename fn url "http://www.espncricinfo.com/icc_cricket_worldcup2011/engine/match/473333.html";

Now that the fileref has been added, we try reading the file into the dataset using the infile/input statements in a datastep:

data _null_;
infile fn lrecl=30000;;
input col1 $10000.;
file "~/test.txt";
put col1 $10000.;
run;

Note: The default value for the lrecl (where we specify the maximum record length in the file), is 256 characters. However, one can specify a value upto 32767. I've specified a length of 30000 just assuming that the html file max length would be 30000.

The above code reads one entire line from the url specified (which essentially contains a html file) into the SAS as a single character variable of length 1000 and writes it into a file named test.txt.

When we open the test.txt to read the contents of the html, we see a lot of html tags which needs to be parsed into a dataset to get the table of our choice.

In the test.txt, we see the following tags appearing in the file as shown below :

<td width="192"><a class="playerName" href="http://www.blogger.com/icc_cricket_worldcup2011/content/player/35263.html" target="" title="view the player profile for Virender Sehwag">V Sehwag </a>&nbsp; </td>

<td class="battingRuns">23</td>

<td class="battingDetails">30</td>

<td class="battingDetails">0</td>

Now, all we need to do is look out for the occurrences of  the text tag: 'class="playerName"' to fetch the player name; 'class="battingRuns">' to fetch the player score; 'class="battingDetails">' to fetch the player matches, and so on and so forth..

This can easily be done by using the following data step code:

data inp;
infile "~/test.txt" lrecl=30000;
input @'class="playerName"' name1 $300. @'class="battingRuns">' runs1 : $20. @'class="battingDetails">' matches1 : $20.;
run;


The above code searches for the occurrences of the text 'class="playerName"' and reads 300 characters following it into the variable name1. Similarly, it also searches for the occurrence of the text 'class="battingRuns"' and 'class="battingDetails"' and reads upto the next 20 characters until it encounters a space (the default delimiter).

Now the first observation of the dataset inp contains the following values:


Name1
Runs1
Matches1
href="http://www.blogger.com/icc_cricket_worldcup2011/content/player/35263.html" target="" title="view the player profile for Virender Sehwag">V Sehwag </a>&nbsp; </td>
23</td>
30</td>


Now, we need to extract the name (V Sehwag) from the name1 variable. This can be done by picking the index of </a> and the string '">' from the value of the variable name1. The function is as follows:

x=index(t,'</a>');
y=index(t,'playerName');
name=substr(t,x+4,y-x+4);


The above set of functions calculate the x and y indices which is the beginning point and the ending point of the string V Sehwag. Then i do a substr of the string knowing the beginning position and the ending position.

To extract the numeric value from the character value, we use the following function:

runs=input(compress(lowcase(runs1),'abcdefghijklmnopqrstuvwxyz<>/'),8.);

Now, coming all the above set of functions, a data step can be built which would give us the final set of data as follows:

data final;
set inp;
x=index(name1,'</a>');
y=index(name1,'playerName');
name=substr(name1,x+4,y-x+4);

runs=input(compress(lowcase(runs1),'abcdefghijklmnopqrstuvwxyz<>/'),8.);
matches=input(compress(lowcase(matches1),'abcdefghijklmnopqrstuvwxyz<>/'),8.);
run;

Thats it for now.. Let me know your experiences or suggestions on doing this in a better way...