Need a freelancer with Data Extraction experience.
The task requires writing the source code for a program that extracts information from the very large crawled snapshot of the Web hosted on Amazon S3: http://www.commoncrawl.org/data/accessing-the-data/
The program would take a domain as input, run on an EC2 instance, and produce a tab-separated text file with three columns: SourceURL\tAnchorText\tTargetURL, where the TargetURL points to the input domain.
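To make the expected output concrete, here is a rough Python sketch of the per-page step only: given one page's URL and HTML, it emits the SourceURL\tAnchorText\tTargetURL rows whose target falls on the input domain. The names AnchorCollector, host_matches and extract_rows are made up for illustration, treating subdomains as part of the domain is an assumption, and reading the crawl files from S3 is deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # href is None when the attribute is missing or has no value
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so tabs/newlines cannot break the TSV row
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def host_matches(host, domain):
    """Assumption: subdomains (www.example.com) count as the input domain."""
    host = (host or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def extract_rows(source_url, html, domain):
    """Return tab-separated rows for links on one page that target `domain`."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    rows = []
    for href, text in parser.links:
        try:
            target = urljoin(source_url, href)  # resolve relative links
            host = urlparse(target).hostname
        except ValueError:
            continue  # skip malformed URLs found in crawled pages
        if host_matches(host, domain):
            rows.append("\t".join((source_url, text, target)))
    return rows
```

Calling extract_rows("http://source.org/page", page_html, "example.com") for each crawled page and writing the returned rows to a file, one per line, would give the requested three-column output.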
…
