Multithreaded Site Scraper (VB.NET or C#)

The purpose of this software is to determine whether a website is relevant to a keyword, or list of keywords, by checking whether those words appear in the title or text of the page.

The software will load a list of sites and a list of keywords, open each site in the background, scan it, and count how often the keywords appear. If the count is above a set threshold, the site is added to a list of “relevant” sites.

Visit a list of websites (blogs), and for each one either:
1. Get the title of the website, or
2. Get the text/HTML of the entire page (or only the first *x amount* of HTML, if possible, to reduce bandwidth used)
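
To illustrate the two options above, here is a minimal sketch of pulling the title and the visible text out of a page. It is written in Python only for brevity; the finished tool should still be VB.NET or C#. The names `PageText` and `extract_title_and_text` are made up for this example, and skipping script/style content is my assumption about what "text of the page" should mean.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collects the <title> text and the visible text of a page."""

    SKIPPED = {"script", "style"}  # assumption: never count code or CSS as page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0:
            self.text_parts.append(data)


def extract_title_and_text(html):
    """Return (title, visible_text) with runs of whitespace collapsed."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())
    return title, text
```

Because the parser is fed whatever HTML was downloaded, the same function can be used on a page that was only partially fetched.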

Must support a user-selectable number of threads. No proxy support is required.
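
A rough sketch of what I mean by a variable number of threads, again in Python purely as an illustration of the behaviour (the .NET version would use its own threading tools). `fetch_head`, `fetch_all`, `thread_count` and `max_bytes` are hypothetical names, and the 64 KB cap and 10 second timeout are placeholder values, not requirements.

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.request import urlopen


def fetch_head(url, max_bytes=65536, timeout=10):
    """Download at most max_bytes of a page; return "" on any failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read(max_bytes).decode("utf-8", errors="replace")
    except (OSError, ValueError, HTTPException):
        # Unreachable host, HTTP error, malformed URL, broken response.
        return ""


def fetch_all(urls, thread_count=8, fetch=fetch_head):
    """Fetch every URL using thread_count worker threads; returns {url: html}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        pages = pool.map(fetch, urls)  # results come back in input order
        return dict(zip(urls, pages))
```

The `fetch` parameter is only there so the threading can be tried out with a stand-in function instead of real network requests.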

Search the returned data for keywords entered by the user in a multiline text box, with the ability to load a text file into the text box.

Export only the sites that contain the keywords to a text file.

If possible, remove all comments on blog before searching for keywords, to avoid false positives.

Options/Fields wanted in software:

Url List

Keyword List

Keyword Frequency Threshold
(For example, if a threshold of 5 is selected, and one keyword appears 3 times in the body and another keyword appears 4 times, then the overall keyword frequency is 7, which passes the threshold.)

Option to search for keywords only in h1, h2, and h3 tags
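
The threshold example and the headings-only option could work roughly like the sketch below (Python for illustration only). All function names are made up. Whole-word, case-insensitive matching and treating a count equal to the threshold as a pass are my assumptions. The regex-based heading extraction is only approximate and a real HTML parser would be more robust.

```python
import re


def keyword_frequency(text, keywords):
    """Total whole-word, case-insensitive occurrences of all keywords."""
    total = 0
    for keyword in keywords:
        keyword = keyword.strip()
        if not keyword:
            continue  # ignore blank lines from the keyword box
        pattern = r"\b" + re.escape(keyword) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def passes_threshold(text, keywords, threshold):
    """Assumption: a count equal to the threshold also passes."""
    return keyword_frequency(text, keywords) >= threshold


def heading_text(html):
    """Text inside h1, h2 and h3 elements (approximate, regex-based)."""
    blocks = re.findall(
        r"<(h[1-3])\b[^>]*>(.*?)</\1\s*>", html, flags=re.IGNORECASE | re.DOTALL
    )
    # Drop any tags nested inside the headings, keeping only their text.
    return " ".join(re.sub(r"<[^>]+>", " ", inner) for _tag, inner in blocks)
```

With the headings-only option enabled, the keyword count would be run on `heading_text(html)` instead of the full page text.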

Read-only multiline textbox that shows all results that contain the keywords and pass all tests.

Multiline textbox for “negative” keywords, which automatically stop the current site from being added to the export list. This option, if enabled, takes priority over all other keywords. If a site contains 5 keywords and 1 negative keyword, it is not considered relevant and is therefore not added to the export list.
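
The priority rule for negative keywords could be expressed like this (a Python sketch for illustration only; `count_hits`, `is_relevant` and `use_negatives` are hypothetical names, and whole-word, case-insensitive matching is an assumption):

```python
import re


def count_hits(text, words):
    """Total whole-word, case-insensitive occurrences of the given words."""
    return sum(
        len(re.findall(r"\b" + re.escape(w.strip()) + r"\b", text, flags=re.IGNORECASE))
        for w in words
        if w.strip()  # ignore blank lines from the text box
    )


def is_relevant(text, keywords, negative_keywords, threshold, use_negatives=True):
    """A single negative keyword overrides any number of keyword hits."""
    if use_negatives and count_hits(text, negative_keywords) > 0:
        return False
    return count_hits(text, keywords) >= threshold
```

Checking the negative keywords first means a rejected site never needs its normal keywords counted at all.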

Option to filter out all comments and trackbacks on blog, to avoid spam comments from triggering the site as relevant.

This should be an extremely easy job for most coders, and I need it done very quickly.
