HTML Parsing and Screen Scraping with the Simple HTML DOM Library


If you need to parse HTML, regular expressions aren't the way to go. In this tutorial, you'll learn how to use an open source, easily learned parser to read, modify, and spit back out HTML from external sources. Using Nettuts+ as an example, you'll learn how to get a list of all the articles published on the site and display them.


Step 1. Preparation

The first thing you'll need to do is download a copy of the Simple HTML DOM library, freely available from SourceForge.

There are several files in the download, but the only one you need is the simple_html_dom.php file; the rest are examples and documentation.


Step 2. Parsing Basics

This library is very easy to use, but there are some basics you should review before putting it into action.

Loading HTML

$html = new simple_html_dom();

// Load from a string
$html->load('<html><body><p>Hello World!</p><p>We\'re here</p></body></html>');

// Load a file
$html->load_file('http://net.tutsplus.com/');

You can create your initial object either by loading HTML from a string, or from a file. Loading a file can be done either via URL, or via your local file system.

A note of caution: the load_file() method delegates its job to PHP's file_get_contents(). If allow_url_fopen is not set to true in your php.ini file, you may not be able to open a remote file this way. In that case, you can always fall back on the cURL library to fetch remote pages, then read them in with the load() method.
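If you do find yourself in that situation, a minimal cURL fallback might look like this (the URL is just an example):

// fetch the page ourselves with cURL ...
$ch = curl_init('http://net.tutsplus.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
$raw = curl_exec($ch);
curl_close($ch);

// ... then hand the string to the parser
$html = new simple_html_dom();
$html->load($raw);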

Accessing Information


Once you have your DOM object, you can start to work with it by using find() and creating collections. A collection is a group of objects found via a selector; the syntax is quite similar to jQuery's.

<html>
<body>
    <p>Hello World!</p>
    <p>We're Here.</p>
</body>
</html>

In this example HTML, we’re going to take a look at how to access the information in the second paragraph, change it, and then output the results.

# create and load the HTML
include('simple_html_dom.php');
$html = new simple_html_dom();
$html->load("<html><body><p>Hello World!</p><p>We're here</p></body></html>");

# find all paragraph tags in the HTML
$element = $html->find("p");

# modify it
$element[1]->innertext .= " and we're here to stay.";

# output it!
echo $html->save();

The find() method always returns a collection (array) of tags unless you specify, as a second parameter, that you only want the nth element of that collection.

Lines 2-4: Load the HTML from a string, as explained previously.

Line 7: This line finds all <p> tags in the HTML, and returns them as an array. The first paragraph will have an index of 0, and subsequent paragraphs will be indexed accordingly.

Line 10: This accesses the second item in our collection of paragraphs (index 1), and makes an addition to its innertext attribute. innertext represents the contents between the tags, while outertext represents the contents including the tag. We could replace the tag entirely by using outertext.
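As a quick illustration of that last point (this snippet is an aside, not part of our running example), assigning to outertext swaps out the entire tag:

# replace the whole second paragraph, <p> tags and all
$element[1]->outertext = '<div>A replacement element.</div>';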

We’re going to add one more line, and modify the class of our second paragraph tag.

$element[1]->class = "class_name";
echo $html->save();

The resulting HTML of the save() call would be:

<html>
<body>
    <p>Hello World!</p>
    <p class="class_name">We're here and we're here to stay.</p>
</body>
</html>

Other Selectors

Here are some other examples of selectors. If you’ve used jQuery, these will seem very familiar.

# get the first occurrence of id="foo"
$single = $html->find('#foo', 0);

# get all elements with class="foo"
$collection = $html->find('.foo');

# get all the anchor tags on a page
$collection = $html->find('a');

# get all anchor tags that are inside H1 tags
$collection = $html->find('h1 a');

# get all img tags with a title of 'himom'
$collection = $html->find('img[title=himom]');

The first example isn’t entirely intuitive – all queries by default return collections, even an ID query, which should only return a single result. However, by specifying the second parameter, we are saying “only return the first item of this collection”.

This means $single is a single element, rather than an array of elements with one item.

The rest of the examples are self-explanatory.
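One more pattern worth knowing: find() can also be called on a single element, which lets you drill down into it. Here's a quick sketch, using a hypothetical div with the class article:

# grab the first div with class "article", then list the links inside it
$article = $html->find('div.article', 0);
if ($article) {
    foreach ($article->find('a') as $link) {
        echo $link->href . "\n";
    }
}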

Documentation

Complete documentation on the library can be found at the project documentation page.


Step 3. Real World Example

To put this library into action, we're going to write a quick script to scrape the contents of the Nettuts+ website and produce a list of articles present on the site by title and description, purely as an example. Scraping is a tricky area of the web, and shouldn't be performed without permission.

include('simple_html_dom.php');

$articles = array();
getArticles('http://net.tutsplus.com/page/76/');

We start by including the library, and calling the getArticles function with the page we’d like to start parsing. In this case we’re starting near the end and being kind to Nettuts’ server.

We’re also declaring a global array to make it simple to gather all the article information in one place. Before we begin parsing, let’s take a look at how an article summary is described on Nettuts+.

<div class="preview">
    <!-- Post Taxonomies -->
    <div class="post_taxonomy"> ... </div>
    <!-- Post Title -->
    <h1 class="post_title"><a>Title</a></h1>
    <!-- Post Meta -->
    <div class="post_meta"> ... </div>
    <div class="text"><p>Description</p></div>
</div>

This represents a basic post format on the site, including source code comments. Why are the comments important? They count as nodes to the parser.
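To make that concrete, here's how the children of the preview div above end up indexed, counting comments as nodes (based on the markup shown):

# children of <div class="preview">, with comments counted as nodes:
#   children(0) -> <!-- Post Taxonomies -->
#   children(1) -> <div class="post_taxonomy">
#   children(2) -> <!-- Post Title -->
#   children(3) -> <h1 class="post_title">  (the title we're after)
#   children(4) -> <!-- Post Meta -->
#   children(5) -> <div class="post_meta">
#   children(6) -> <div class="text">       (holds the description)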


Step 4. Starting the Parsing Function

function getArticles($page) {
    global $articles;

    $html = new simple_html_dom();
    $html->load_file($page);

    // ... more ...
}

We begin very simply: we bring our global into scope, create a new simple_html_dom object, then load the page we want to parse. This function is going to call itself later, so we're setting it up to accept the URL as a parameter.


Step 5. Finding the Information We Want

$items = $html->find('div[class=preview]');  

foreach($items as $post) {
    # remember comments count as nodes
    $articles[] = array($post->children(3)->outertext,
                        $post->children(6)->first_child()->outertext);
}

This is the meat of the getArticles function, so let's take a closer look to really understand what's happening.

Line 1: Creates an array of elements: divs with the class preview. We now have a collection of articles stored in $items.

Line 5: $post now refers to a single div with the class preview. If we look at the original HTML, we can see that the child at index 3 is the H1 containing the article title. We take its outertext and assign it to $articles[index][0].

Remember to start at 0 and to count comments when trying to determine the proper index of a child node.

Line 6: The child at index 6 of $post is <div class="text">. We want the description text from within, so we grab the first child's outertext; this will include the paragraph tag. A single record in $articles now looks like this:

$articles[0][0] = "My Article Name Here";
$articles[0][1] = "This is my article description";

Step 6. Pagination

The first thing we do is determine how to find our next page. On Nettuts+, the URLs are easy to figure out, but we’re going to pretend they aren’t, and get the next link via parsing.


If we look at the HTML, we see the following:

<a href="http://net.tutsplus.com/page/2/" class="nextpostslink">»</a>

If there is a next page (and there won’t always be), we’ll find an anchor with the class of ‘nextpostslink’. Now that information can be put to use.

if($next = $html->find('a[class=nextpostslink]', 0)) {
    $URL = $next->href;

    $html->clear();
    unset($html);

    getArticles($URL);
}

On the first line, we see if we can find an anchor with the class nextpostslink. Take special notice of the second parameter for find(). This specifies we only want the first element (index 0) of the found collection returned. $next will only be holding a single element, rather than a group of elements.

Next, we assign the link's href to the variable $URL. This is important because we're about to destroy the HTML object. Due to a PHP 5 circular-reference memory leak, the current simple_html_dom object must be cleared and unset before another one is created. Failure to do so could cause you to eat up all your available memory.

Finally, we call getArticles with the URL of the next page. This recursion ends when there are no more pages to parse.
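If you want a safety net against runaway recursion (or simply want to limit how many pages you hit), one option is an extra depth parameter. This is a hypothetical variation, not part of the original function:

function getArticles($page, $depth = 0) {
    global $articles;

    if ($depth >= 10) return;   // stop after ten pages; the limit is arbitrary

    // ... same parsing code as before ...

    if ($next = $html->find('a[class=nextpostslink]', 0)) {
        $URL = $next->href;

        $html->clear();
        unset($html);

        getArticles($URL, $depth + 1);  // pass the incremented depth along
    }
}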


Step 7. Outputting the Results

First we’re going to set up a few basic stylings. This is completely arbitrary – you can make your output look however you wish.

#main {
    margin:80px auto;
    width:500px;
}
h1 {
    font:bold 40px/38px helvetica, verdana, sans-serif;
    margin:0;
}
h1 a {
    color:#600;
    text-decoration:none;
}
p {
    background: #ECECEC;
    font:10px/14px verdana, sans-serif;
    margin:8px 0 15px;
    border: 1px #CCC solid;
    padding: 15px;
}
.item {
    padding:10px;
}

Next we’re going to put a small bit of PHP in the page to output the previously stored information.

<?php
    foreach($articles as $item) {
        echo "<div class='item'>";
        echo $item[0];
        echo $item[1];
        echo "</div>";
    }
?>

The final result is a single HTML page listing all the articles, starting on the page indicated by the first getArticles() call.


Step 8. Conclusion

If you’re parsing a great deal of pages (say, the entire site) it may take longer then the max execution time allowed by your server. For example, running from my local machine it takes about one second per page (including time to fetch).

On a site like Nettuts+, which currently has 78 pages of tutorials, this would run for over a minute.
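If your host allows it, you can raise or remove the limit for a long-running script with PHP's standard set_time_limit() (note that some shared hosts disable it):

set_time_limit(0);   // 0 removes the execution time limit entirely; use with care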

This tutorial should get you started with HTML parsing. There are other ways to work with the DOM, including PHP's built-in extension, which lets you use powerful XPath selectors to find elements. For ease of use and quick starts, I find this library to be one of the best.
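For comparison, here is roughly what our preview-div query would look like with the built-in extension. This is a minimal sketch, not a drop-in replacement for the code above; $raw is assumed to hold the page's HTML source:

// a rough equivalent of $html->find('div[class=preview]')
$doc = new DOMDocument();
@$doc->loadHTML($raw);              // @ silences warnings on imperfect markup
$xpath = new DOMXPath($doc);

$items = $xpath->query("//div[@class='preview']");
foreach ($items as $post) {
    // $post is a DOMElement here, so you work with the DOM API
    // (childNodes, getAttribute, etc.) instead of Simple HTML DOM's helpers
}

As a closing note, always remember to obtain permission before scraping a site; this is important. Thanks for reading!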