Java Spider

I want to have a spider modified or built, whichever is easier. You can use existing open-source libraries or anything else; it doesn’t matter as long as it achieves the tasks.
I want to be able to run the spider both as an applet and from the command line, so that it can be executed as a cron job.

The spider must be able to accept command-line arguments, e.g.
public static void main(String[] args) { String var = args[0]; }

The applet should have a simple GUI.
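
As an illustration of the command-line requirement, here is a minimal sketch of argument handling. The class name `SpiderOptions` and the flags `--leave-domain` and `--reindex` are hypothetical names chosen for this example; the posting does not define any option syntax.

```java
// Hypothetical holder for the spider's command-line options.
// The flag names below are assumptions made for illustration only.
class SpiderOptions {
    final String domain;        // first argument: the domain to crawl
    final boolean leaveDomain;  // true if the spider may follow links off the domain
    final boolean reindex;      // true if changed pages should be re-indexed

    SpiderOptions(String domain, boolean leaveDomain, boolean reindex) {
        this.domain = domain;
        this.leaveDomain = leaveDomain;
        this.reindex = reindex;
    }

    // Parses args as: <domain> [--leave-domain] [--reindex]
    static SpiderOptions parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException(
                "usage: <domain> [--leave-domain] [--reindex]");
        }
        boolean leave = false;
        boolean reindex = false;
        for (int i = 1; i < args.length; i++) {
            switch (args[i]) {
                case "--leave-domain": leave = true; break;
                case "--reindex": reindex = true; break;
                default:
                    throw new IllegalArgumentException("unknown option: " + args[i]);
            }
        }
        return new SpiderOptions(args[0], leave, reindex);
    }
}
```

A `main(String[] args)` method would call `SpiderOptions.parse(args)` first, and the applet GUI could build the same object from its form fields, so both entry points share one crawl routine.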

The spider should be able to take in a domain name and crawl that domain only, unless the option is chosen for the spider to leave the domain. It must have the option to re-index a page if its html has changed.
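
The two rules above could be sketched as follows. This is only one possible approach: treating subdomains (such as www.example.com) as part of the domain is an assumption, and detecting a changed page by comparing a SHA-256 fingerprint of the html against a stored one is a technique chosen for this example, not something the posting prescribes. The class and method names are hypothetical.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: decides which links to follow and which pages to re-index.
class CrawlPolicy {

    // True if the link may be crawled. Assumption: subdomains count as in-domain.
    static boolean inScope(String startDomain, String link, boolean leaveDomain) {
        String host;
        try {
            host = URI.create(link).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed link
        }
        if (host == null) {
            return false; // e.g. mailto: links have no host
        }
        if (leaveDomain) {
            return true;
        }
        host = host.toLowerCase(Locale.ROOT);
        String domain = startDomain.toLowerCase(Locale.ROOT);
        return host.equals(domain) || host.endsWith("." + domain);
    }

    // Hex-encoded SHA-256 of the page contents.
    static String fingerprint(String html) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(html.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every JVM
        }
    }

    // True if the page was never indexed or its contents differ from the stored fingerprint.
    static boolean needsReindex(String storedFingerprint, String html) {
        return storedFingerprint == null || !storedFingerprint.equals(fingerprint(html));
    }
}
```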

It should check the header status of a page and not index the page unless it is available, i.e. status 200 etc.
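
A minimal sketch of the status check, using the JDK's `HttpURLConnection` with a HEAD request so the body is not downloaded. The posting says "status 200 etc." without listing the other codes, so treating only 200 as indexable is an assumption here; the class name and the 5-second timeouts are likewise illustrative.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for the availability check.
class StatusCheck {

    // Assumption: only 200 OK counts as "available"; widen this if other codes qualify.
    static boolean isIndexable(int status) {
        return status == HttpURLConnection.HTTP_OK;
    }

    // Sends a HEAD request and returns the HTTP status code.
    // Redirects are followed by default, so this is the status of the final page.
    static int fetchStatus(String pageUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000); // milliseconds, illustrative value
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

The crawler would call `StatusCheck.isIndexable(StatusCheck.fetchStatus(url))` before fetching and parsing a page.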

————— Specs —————————

Spider gets full html page contents
if the html tag I want to check for (e.g. <object></object>) is found then
    parse all html tags
    get: array of tags I specify
    example: String getTags[] = {"title", "keyword"}

    if keyword is empty/missing and description or title is not empty then
        split description at every word
        return array of keywords, limited to 250
    else if title is empty
        attempt to extract keywords from the html body, up to 250 words
if the html tag I checked for is not found then do not parse the page; just get all the links from the page and continue crawling.
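
The keyword fallback in the specs could look roughly like the sketch below. It is an interpretation, not the definitive rule: the specs do not say what to split when the description is empty but the title is not, so this version falls back to the title in that case, and to the body text only when both are empty. It also assumes the keyword meta value is comma-separated. All names are hypothetical, and the inputs are assumed to be plain text already extracted from the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper implementing one reading of the keyword fallback.
class KeywordFallback {
    static final int MAX_KEYWORDS = 250; // limit taken from the specs

    static boolean blank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // Splits text on whitespace and keeps at most MAX_KEYWORDS words.
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        if (blank(text)) {
            return words;
        }
        for (String word : text.trim().split("\\s+")) {
            if (words.size() == MAX_KEYWORDS) {
                break;
            }
            words.add(word);
        }
        return words;
    }

    static List<String> keywords(String keywordMeta, String description,
                                 String title, String bodyText) {
        if (!blank(keywordMeta)) {
            // Assumption: the keyword meta value is a comma-separated list.
            List<String> out = new ArrayList<>();
            for (String k : keywordMeta.split(",")) {
                if (!k.trim().isEmpty()) {
                    out.add(k.trim());
                }
            }
            return out;
        }
        if (!blank(description)) {
            return splitWords(description);
        }
        if (!blank(title)) {
            return splitWords(title); // assumption: title stands in for a missing description
        }
        return splitWords(bodyText);
    }
}
```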

The crawler needs to be able to return the values of the html tags and attributes that I specify.

I’d like the values returned to be in an associative array/map so that

myObject['title'] will contain the title
myObject['keyword'] will contain an array of keywords
myObject['tagName']['attribute'] will get the attribute value of the html tag, for example
myObject['embed']['src']
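
Java has no associative-array syntax, so the structure above would most likely become nested maps. A minimal sketch, assuming one attribute set per tag name (pages with several `<embed>` tags would need a list per tag instead); the class `PageData` and its method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical result object returned by the crawler for one page.
class PageData {
    String title = "";                                  // myObject['title']
    final List<String> keywords = new ArrayList<>();    // myObject['keyword']

    // tag name -> (attribute name -> attribute value), i.e. myObject['tagName']['attribute']
    private final Map<String, Map<String, String>> tags = new HashMap<>();

    // Stores an attribute value; tag and attribute names are case-insensitive, as in html.
    void putAttribute(String tag, String attribute, String value) {
        tags.computeIfAbsent(tag.toLowerCase(Locale.ROOT), t -> new HashMap<>())
            .put(attribute.toLowerCase(Locale.ROOT), value);
    }

    // Returns the attribute value, or null if the tag or attribute was not recorded.
    String attribute(String tag, String attribute) {
        Map<String, String> attrs = tags.get(tag.toLowerCase(Locale.ROOT));
        return attrs == null ? null : attrs.get(attribute.toLowerCase(Locale.ROOT));
    }
}
```

With this shape, `page.attribute("embed", "src")` plays the role of myObject['embed']['src'].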

Lastly, I want the data to be inserted/indexed in my MySQL database, but only if the html tag I checked for was found.
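
A sketch of the conditional insert using plain JDBC. The table name `pages` and its columns `url`, `title` and `keywords` are made up for this example, since the posting does not describe the database schema; storing the keywords as one comma-joined string is also just an assumption. A MySQL JDBC driver would be needed on the classpath to obtain the `Connection`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical indexer: writes a page to the database only when the checked tag was found.
class PageIndexer {
    // Assumed schema: pages(url, title, keywords). Adjust to the real table.
    static final String INSERT_SQL =
            "INSERT INTO pages (url, title, keywords) VALUES (?, ?, ?)";

    // Returns true if a row was written, false if the page was skipped.
    static boolean index(Connection db, boolean tagFound, String url,
                         String title, List<String> keywords) throws SQLException {
        if (!tagFound) {
            return false; // the checked tag is missing, so the page is not indexed
        }
        try (PreparedStatement ps = db.prepareStatement(INSERT_SQL)) {
            ps.setString(1, url);
            ps.setString(2, title);
            ps.setString(3, String.join(",", keywords));
            ps.executeUpdate();
            return true;
        }
    }
}
```

Using a `PreparedStatement` with placeholders keeps page content (which is untrusted) from being interpreted as SQL.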

Please make sure you read and understand the requirements. This will be integrated into one of my projects and it needs to be built according to my specs.

The spider can be a modification to the one found here
http://www.developer.com/java/other/article.php/1573761/Programming-a-Spider-in-Java.htm

or here
http://www.javaworld.com/javaworld/jw-11-2004/jw-1101-spider.html

or anything from the net
http://www.google.com/search?hl=en&source=hp&q=java+web+spider&aq=0&oq=java+web+spid&aqi=g2

or if you already have a class or library that does this.

It doesn’t matter; I just want a spider customized to do the above.

Escrow payment only… No automated bids please.
