Simple PHP Scraper Project

I am looking for a coder to extract content from a Government website for publication on my website. This is legitimate: the content is for non-commercial use, which is allowed, and I have had this confirmed by the body concerned. I only need a specific subset of the content (for example, by locale), fetched only once a week, so this is not a major search-engine-spider type of requirement.

I need a coder to…

Create a PHP script, using fopen or cURL, that will:
a) contact the website to obtain a session ID
b) submit specific variables (which I must be able to change), together with the session ID received in the previous step
c) spider the links containing certain text on the first results page and on approximately 4 following pages, saving the 100 or so links (normally 20 per page * 5 pages) for the next step
d) extract a specific URL variable from each of the 100 (or so) links, and request the printer-friendly page using that variable
e) for each printer-friendly page obtained, save the content to a SQL database, split into one field for each variable on the page (each variable is enclosed in a named ‘div’ tag for extraction).
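
As a rough illustration of the parsing side of steps c) to e), a minimal sketch might look like the following. It is not a finished solution: the link text filter, the query variable name (`ref`), the div ids (`title`, `body`) and the `pages` table are all hypothetical placeholders, and it assumes the named divs are identified by their `id` attribute.

```php
<?php
// Sketch only. Variable names, div ids and the table layout are assumptions,
// not details taken from the target website.

/** Step c: return the href of every <a> whose text contains $needle. */
function extractLinks(string $html, string $needle): array
{
    libxml_use_internal_errors(true); // tolerate non-validating real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href') && stripos($a->textContent, $needle) !== false) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

/** Step d: pull one query-string variable out of a URL, or null if absent. */
function extractUrlVar(string $url, string $name): ?string
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }
    parse_str($query, $vars);
    return isset($vars[$name]) && is_string($vars[$name]) ? $vars[$name] : null;
}

/** Step e: map each requested div id to its trimmed text (null if missing). */
function extractDivFields(string $html, array $divIds): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $fields = [];
    foreach ($divIds as $id) {
        // Ids are hard-coded by the script author, so they are not escaped here.
        $node = $xpath->query("//div[@id='$id']")->item(0);
        $fields[$id] = $node ? trim($node->textContent) : null;
    }
    return $fields;
}

/** Step e: store one page with a prepared statement (hypothetical table). */
function savePage(PDO $pdo, string $ref, array $fields): void
{
    $stmt = $pdo->prepare('INSERT INTO pages (ref, title, body) VALUES (?, ?, ?)');
    $stmt->execute([$ref, $fields['title'] ?? null, $fields['body'] ?? null]);
}
```

The DOM extension is used here instead of regular expressions because it copes better with messy markup; the same functions would be called once per cached page.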

Because the site is session-based, it would probably be a good idea to fetch all the content first (i.e. cache the pages as files) and process it separately. All requests are GET requests. You will need to capture the session redirect.
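
To illustrate the fetch-then-process idea, here is a minimal sketch assuming cURL. It keeps session cookies in a cookie jar, follows the redirect and reports the final URL (in case the session ID is carried there rather than in a cookie), and caches each page on disk. The function names, the cache layout and the example URL in the comments are hypothetical.

```php
<?php
// Sketch only. Paths, URLs and function names are assumptions.

/** GET a URL, keeping session cookies and following redirects. */
function httpGet(string $url, string $cookieJar): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow the session redirect
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_COOKIEJAR      => $cookieJar, // cookies are written here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieJar, // previously stored cookies are sent back
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    // The effective URL is where the redirect chain ended.
    return ['body' => $body, 'url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL)];
}

/** Return the cached copy of $url if present, otherwise fetch and cache it. */
function fetchCached(string $url, string $cacheDir, callable $fetch): string
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $path = $cacheDir . '/' . sha1($url) . '.html';
    if (is_file($path)) {
        return file_get_contents($path);
    }
    $body = $fetch($url);
    file_put_contents($path, $body);
    return $body;
}

// Example wiring (placeholder URL, not executed here):
// $jar  = sys_get_temp_dir() . '/cookies.txt';
// $html = fetchCached('https://example.gov/search?locale=X', __DIR__ . '/cache',
//     fn (string $u): string => httpGet($u, $jar)['body']);
```

For a weekly run, the cache directory would need to be emptied (or named by date) before each run so that stale pages are not reused.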

This should be relatively easy for a professional PHP coder.
