I am developing a web crawler that can receive any given domain, crawl its main page, build a list of all the sub-links on that page, and then crawl those too.
I have already programmed the part that retrieves the main content of any given page: I use HtmlUnit to fetch the HTML and Boilerpipe to identify the main content, with pretty accurate results.
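For context, that part looks roughly like this. It is only a minimal sketch, assuming a reasonably recent HtmlUnit 2.x (the pre-3.x com.gargoylesoftware package names, with WebClient as AutoCloseable) and Boilerpipe's ArticleExtractor; the class and method names are just placeholders I picked for this post.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class MainContentFetcher {

    // Fetches a page with HtmlUnit (JavaScript enabled) and runs Boilerpipe's
    // ArticleExtractor over the rendered markup to pull out the main content.
    public static String fetchMainContent(String url) throws Exception {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);
            String renderedHtml = page.asXml();

            // Boilerpipe strips navigation and boilerplate, keeping the article text.
            return ArticleExtractor.INSTANCE.getText(renderedHtml);
        }
    }
}
```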
Now I am facing the problem of identifying all the sub-links of a page. The biggest issue is that every webpage has its own HTML structure. I have tried the following approaches:
- Searching all the anchor (a) tags: this was my first idea and it worked pretty well, but it broke down on pages that rely on JavaScript, because they don't use anchor tags but (div) tags with onclick handlers, e.g. onclick="widgetEvCall('handlers.openResult', event, this, '/Attraction_Review-g187497-d670716-Reviews-Barcelona_Bus_Turistic-Barcelona_Catalonia.html')" (approach 1 in the sketch after this list).
- Searching all the tags with an onclick attribute: this didn't work either, as not all sites use that attribute (approach 2 below).
- Getting the button.click() response: the problem here is that not all the elements that redirect to a link are buttons; some of them are just divs (approach 3 below).
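To make those three attempts concrete, here is a stripped-down HtmlUnit sketch of each one. These are not my actual crawler methods, just illustrative helpers written for this post; the method names and XPath expressions are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkCollector {

    // Approach 1: plain anchor tags. Misses JS-driven "links" placed on divs.
    static List<String> collectAnchorHrefs(HtmlPage page) {
        List<String> hrefs = new ArrayList<>();
        for (HtmlAnchor anchor : page.getAnchors()) {
            hrefs.add(anchor.getHrefAttribute());
        }
        return hrefs;
    }

    // Approach 2: any element carrying an onclick attribute. Only helps on
    // sites that actually put the target URL inside the handler string.
    static List<String> collectOnclickHandlers(HtmlPage page) {
        List<String> handlers = new ArrayList<>();
        for (Object node : page.getByXPath("//*[@onclick]")) {
            handlers.add(((HtmlElement) node).getAttribute("onclick"));
        }
        return handlers;
    }

    // Approach 3: click clickable elements and record where they lead.
    // Restricted to buttons here; plain divs with JS handlers are not covered,
    // which is exactly the limitation described above. Note that a click that
    // triggers navigation replaces the page, so in a real crawler the page
    // would need to be re-fetched between clicks.
    static List<String> collectClickTargets(HtmlPage page) throws Exception {
        List<String> urls = new ArrayList<>();
        for (Object node : page.getByXPath("//button")) {
            Page result = ((HtmlElement) node).click();
            urls.add(result.getUrl().toString());
        }
        return urls;
    }
}
```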
I know that JSoup can do this pretty easily, but it crashes when it hits JavaScript-generated elements. At this point I have run out of ideas. Could anyone help me with this task?
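For comparison, the static case I mean when I say JSoup makes this easy is just something like the following (the URL is a placeholder). Since JSoup does not execute JavaScript, any links that scripts inject after load simply never appear in the parsed document.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinkDemo {
    public static void main(String[] args) throws Exception {
        // Fine for server-rendered pages: grab every href and resolve it
        // against the page's base URL.
        Document doc = Jsoup.connect("https://example.com").get();
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}
```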