Saturday, April 9, 2011

Creating a search engine in SAS

It's the age of the search engine! I remember people "Yahoo"ing in the late '90s, "Google"ing through the late 2000s, and now "Bing"ing.

I just wondered... why not SAS? So I started off by reading up on the Yahoo search engine APIs. They have released a new API called BOSS. Its documentation is available here: http://developer.yahoo.com/search/boss/boss_guide/

The next step was to get an API key, which was generated after I filled out their form. With it, I could start accessing their BOSS API...

I used PROC HTTP to access the BOSS API with the program below:

filename in "C:\test\curr_in";
filename out "C:\test\curr_out.txt";


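data _null_;
title;
if (_N_ eq 1) then do;
 length n $200;   /* the default length of 8 would cut the search text short */
 file stdout;
 infile stdin truncover;
 put @1 "Enter the search text:";
 input n $200.;   /* read the whole line, not just the first word */
 /* urlencode() keeps multi-word queries intact; compress() would run the words together */
 var='appid=xxxxxxx&query='||urlencode(strip(n))||'&results=1';
 file in;
 put var $;
end;
stop;   /* one prompt is enough; end the step explicitly */
run;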



proc http in=in out=out url="http://search.yahooapis.com/WebSearchService/V1/webSearch" method="post" ct="application/x-www-form-urlencoded";
run;

The above program reads the search text from the user, writes the request parameters to the file curr_in, and posts that file to the API using PROC HTTP.

Note that the API key is shown as xxxxxxx; replace it with the key you generate on the developer.yahoo.com site.
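
To see what the request body looks like for a multi-word query, here is a minimal sketch that percent-encodes the search text with the URLENCODE function before building the parameter string. The sample query and the variable names query and body are made up for illustration; the parameter names are the ones used in the program above.

```sas
data _null_;
 length query $200 body $400;
 query = 'sas proc http';   /* made-up sample search text */
 /* single quotes keep the macro processor from touching &query and &results */
 body = cats('appid=xxxxxxx&query=', urlencode(strip(query)), '&results=1');
 put body=;                 /* writes the assembled request body to the log */
run;
```

In the log, the blanks in the search text should appear percent-encoded rather than removed, so the API receives the words as separate terms.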

Once the program has run, the API response is written to the curr_out.txt file as XML. This XML is then parsed to extract the needed fields, which are written to stdout. The code below does this:

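data new;
infile out lrecl=10000 truncover;
input @1 rec $10000.;   /* match lrecl so long lines are not cut off at 1000 characters */
/* Each field is the text between its opening and closing tag.
   This assumes all tags of a result sit on the same line of the response. */
if(index(rec,'<Summary>')>0) then do;
 title= substr(rec,index(rec,'<Title>')+7,index(rec,'</Title>')-(index(rec,'<Title>')+7));
 summary=substr(rec,index(rec,'<Summary>')+9,index(rec,'</Summary>')-(index(rec,'<Summary>')+9));
 url = substr(rec,index(rec,'<Url>')+5,index(rec,'</Url>')-(index(rec,'<Url>')+5));
 output;
end;
run;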
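
As an alternative to the INDEX/SUBSTR arithmetic, the same tag extraction can be sketched with SAS's Perl regular expression functions (PRXPARSE, PRXMATCH, PRXPOSN). This is not what the program above does, just another way to pull out the text between two tags; it leaves a field blank when its tag is missing. The sample line below is made up and only mimics the shape of one result.

```sas
data _null_;
 length rec $200 title url $100;
 /* made-up sample line, shaped like one result in the XML response */
 rec = '<Result><Title>SAS</Title><Url>http://www.sas.com</Url></Result>';
 /* one pattern per tag; (.*?) captures the text between the tags */
 re_title = prxparse('/<Title>(.*?)<\/Title>/');
 re_url   = prxparse('/<Url>(.*?)<\/Url>/');
 /* assign only when the pattern matches, so a missing tag leaves the field blank */
 if prxmatch(re_title, rec) then title = prxposn(re_title, 1, rec);
 if prxmatch(re_url, rec) then url = prxposn(re_url, 1, rec);
 put title= url=;
run;
```

The trade-off is readability against familiarity: the patterns state the tag names once each, while the INDEX/SUBSTR version needs the tag lengths hard-coded.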

data _null_;
set new;
file stdout;
put "Title: " title;
put "Summary: " summary;
put "Url: " url;
run;


This produces the output as shown below:


Let me know your feedback/comments!

3 comments:

  1. You have stdin and stdout in the infile and file statements

    Your filerefs in the FILENAME statements are in and out

    where are stdin and stdout defined? Am I missing something?

  2. Dr. AnnMaria: stdin and stdout are built-in file descriptors that point to the command-line prompt by default. This means they read input from, and write output to, the Unix command prompt.

    I've illustrated this in one of my earlier posts below:
    http://pramod-r.blogspot.com/2011/02/making-sas-interactive-part-1-using.html

    Let me know if this makes sense..

  3. Hey really interesting..Good stuff :)
