Screen Scraping Exercise 2

Given a list of up to several hundred thousand Chinese terms, write a C# program that fetches, for each term, the search result count and the first 10 headlines (plain-text headline, newspaper link, article link and date) from news.baidu.com and stores them in a SQLite or SQL Server database. Write out all SQL inserts explicitly. Headlines should be exact matches for the term. An API for Baidu may exist; if so, use it. Searching RSS feeds may be more straightforward; if so, do it. Use UTF-8 encoding. The expected tables are described below.

Terms:
termid
term (text, unique)

SearchResults:
searchid
termid
count (integer)
searched (datetime)

Headlines:
headlineid
headline (text, unique) – make sure this is long enough

TermHeadlines:
termid
headlineid
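
As a rough sketch, the four tables could be declared in SQLite as follows. The column types, keys and constraints are assumptions, not part of the exercise, and the search results table is written as SearchResults, without a space. The table outlines do not include columns for the newspaper link, article link or date, so the sketch leaves them out as well. The term, count and headline in the inserts are made-up placeholder values.

```sql
-- Assumption: SQLite dialect. TEXT columns in SQLite have no fixed length,
-- so the headline column is "long enough" by construction; on SQL Server
-- a sized NVARCHAR column would be needed instead.
PRAGMA foreign_keys = ON;

CREATE TABLE Terms (
    termid INTEGER PRIMARY KEY,
    term   TEXT NOT NULL UNIQUE
);

CREATE TABLE SearchResults (
    searchid INTEGER PRIMARY KEY,
    termid   INTEGER NOT NULL REFERENCES Terms(termid),
    "count"  INTEGER NOT NULL,
    searched TEXT NOT NULL          -- datetime stored as 'YYYY-MM-DD HH:MM:SS'
);

CREATE TABLE Headlines (
    headlineid INTEGER PRIMARY KEY,
    headline   TEXT NOT NULL UNIQUE
);

CREATE TABLE TermHeadlines (
    termid     INTEGER NOT NULL REFERENCES Terms(termid),
    headlineid INTEGER NOT NULL REFERENCES Headlines(headlineid),
    PRIMARY KEY (termid, headlineid)
);

-- Explicit inserts for one term (placeholder values, not real search data).
INSERT OR IGNORE INTO Terms (term) VALUES ('中国');

INSERT INTO SearchResults (termid, "count", searched)
SELECT termid, 12345, datetime('now') FROM Terms WHERE term = '中国';

INSERT OR IGNORE INTO Headlines (headline) VALUES ('示例标题：中国');

INSERT OR IGNORE INTO TermHeadlines (termid, headlineid)
SELECT t.termid, h.headlineid
FROM Terms AS t, Headlines AS h
WHERE t.term = '中国' AND h.headline = '示例标题：中国';
```

INSERT OR IGNORE is used so that re-running a term or meeting the same headline twice does not violate the unique constraints.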

Given a term, I should be able to get the hit count and the latest 10 sample headlines. The term MUST be present in the headline.
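
One way this lookup might be written in SQLite is sketched below. It assumes the search results table is named SearchResults (without the space), that :term is a bound parameter, and that searched is stored in a sortable text format. Since the table outlines have no date column for headlines, "latest" is approximated by the highest headlineid, which is an assumption about insertion order.

```sql
-- Hit count from the most recent search for the term.
SELECT sr."count", sr.searched
FROM SearchResults AS sr
JOIN Terms AS t ON t.termid = sr.termid
WHERE t.term = :term
ORDER BY sr.searched DESC
LIMIT 1;

-- Up to 10 most recently stored headlines linked to the term.
-- instr(...) > 0 re-checks that the term really appears in the headline.
SELECT h.headline
FROM Headlines AS h
JOIN TermHeadlines AS th ON th.headlineid = h.headlineid
JOIN Terms AS t ON t.termid = th.termid
WHERE t.term = :term
  AND instr(h.headline, t.term) > 0
ORDER BY h.headlineid DESC
LIMIT 10;
```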

Please successfully complete one run (we will review partial results) and provide the source code. For a large sample word list, you may use the simplified or traditional (or both) column from cc-cedict.org.
