Similar Content Checker Using K-shingle Algorithm

I want a Java desktop application that connects remotely to the MySQL database of my article directory (http://tinyurl.com/39vpw6w). You must know how to configure MySQL for remote access. The app has to detect similar content using the k-shingle algorithm. It has to run through the articles (the field is article_body) by id: it starts at id=1 and checks it against the whole database, then moves on to id=2, and so on…

Parameters to enter in the app:

k-value.

article_body start id (default=1)

% similarity threshold (articles with a similarity higher than this value are reported)
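
To make the k-value and % similarity parameters concrete, here is a minimal sketch of one way the comparison could work. It is not a finished implementation. It assumes word-level shingles (k consecutive words) and Jaccard similarity expressed as a percentage. The posting does not say which shingle unit or similarity measure is wanted, so both are assumptions, and the names ShingleSketch, shingles and similarityPercent are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: word-level k-shingles compared with Jaccard similarity.
// Both choices are assumptions; the posting only says "k-shingle".
class ShingleSketch {

    // Returns the set of all runs of k consecutive words in the text.
    static Set<String> shingles(String text, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        // Lower-case and split on anything that is not a letter or digit.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + k)));
        }
        return result;
    }

    // Jaccard similarity as a percentage: shared shingles / all distinct shingles.
    static double similarityPercent(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // nothing to compare
        }
        Set<String> smaller = a.size() <= b.size() ? a : b;
        Set<String> larger = smaller == a ? b : a;
        int common = 0;
        for (String s : smaller) {
            if (larger.contains(s)) {
                common++;
            }
        }
        int union = a.size() + b.size() - common;
        return 100.0 * common / union;
    }

    public static void main(String[] args) {
        Set<String> first = shingles("The quick brown fox jumps", 2);
        Set<String> second = shingles("the quick brown fox sleeps", 2);
        System.out.println(similarityPercent(first, second));
    }
}
```

With k=2 the two sample sentences share three of five distinct shingles. A larger k makes the check stricter, because longer runs of words have to match.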

After processing, it has to give a list of the similar articles it found. The original is always the article with the lowest id. I need the ability to delete the duplicates one by one or all at once.
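
The scan and the "original is the lowest id" rule could be sketched roughly as follows. This is an illustration under assumptions, not the required design. An in-memory map of id to article_body stands in for the database. The similarity function is passed in as a parameter so any k-shingle comparison can be plugged in. An article already flagged as a duplicate is skipped in later passes. DuplicateScanSketch and findDuplicates are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.ToDoubleBiFunction;

// Sketch only: walks the articles by ascending id and maps each duplicate
// to its original, where the original is always the lower id.
class DuplicateScanSketch {

    // articles: id -> article_body (in-memory stand-in for the database table).
    // startId: first id to treat as a possible original.
    // thresholdPercent: pairs scoring higher than this are reported.
    // similarity: returns a percentage for two article bodies (assumed 0..100).
    static Map<Integer, Integer> findDuplicates(
            NavigableMap<Integer, String> articles,
            int startId,
            double thresholdPercent,
            ToDoubleBiFunction<String, String> similarity) {
        Map<Integer, Integer> duplicateToOriginal = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> original : articles.tailMap(startId, true).entrySet()) {
            if (duplicateToOriginal.containsKey(original.getKey())) {
                continue; // already flagged as a copy of a lower id
            }
            // Only compare against higher ids, so the original is always the lowest id.
            for (Map.Entry<Integer, String> candidate
                    : articles.tailMap(original.getKey(), false).entrySet()) {
                if (duplicateToOriginal.containsKey(candidate.getKey())) {
                    continue;
                }
                double percent = similarity.applyAsDouble(original.getValue(), candidate.getValue());
                if (percent > thresholdPercent) {
                    duplicateToOriginal.put(candidate.getKey(), original.getKey());
                }
            }
        }
        return duplicateToOriginal;
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> articles = new TreeMap<>();
        articles.put(1, "first text");
        articles.put(2, "second text");
        articles.put(3, "first text");
        // Toy similarity for the demo: identical bodies score 100, anything else 0.
        Map<Integer, Integer> result = findDuplicates(
                articles, 1, 80.0, (x, y) -> x.equals(y) ? 100.0 : 0.0);
        System.out.println(result); // duplicate id -> original id
    }
}
```

The returned map (duplicate id to original id) is what a results list with "delete one" and "delete all" actions could be built on. Comparing every pair this way grows quadratically with the number of articles, so it only shows the logic and is not a speed-tuned approach.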

This has to run fast. I have posted this project as a PHP job three times before and nobody could do it properly; a PHP script is too slow. The site has over 700,000 articles that need to be checked.
