Advanced Regular Expression Tips and Techniques

Regular Expressions are the Swiss Army knife for searching through information for certain patterns. They have a wide arsenal of tools, some of which often go undiscovered or underutilized. Today I will show you some advanced tips for working with regular expressions.


Adding Comments

Sometimes regular expressions can become complex and unreadable. A regular expression you write today may seem too obscure to you tomorrow even though it was your own work. Much like programming in general, it is a good idea to add comments to improve the readability of regular expressions.

For example, here is something we might use to check for US phone numbers.

preg_match("/^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$/",$number)

It can become much more readable with comments and some extra spacing.

preg_match("/^

			(1[-\s.])?	# optional '1-', '1.' or '1 '
			( \( )?		# optional opening parenthesis
			\d{3}		# the area code
			(?(2) \) )	# if there was an opening parenthesis, close it
			[-\s.]?		# followed by '-' or '.' or space
			\d{3}		# first 3 digits
			[-\s.]?		# followed by '-' or '.' or space
			\d{4}		# last 4 digits

			$/x",$number);

Let’s put it within a code segment.

$numbers = array(
"123 555 6789",
"1-(123)-555-6789",
"(123-555-6789",
"(123).555.6789",
"123 55 6789");

foreach ($numbers as $number) {
	echo "$number is ";

	if (preg_match("/^

			(1[-\s.])?	# optional '1-', '1.' or '1 '
			( \( )?		# optional opening parenthesis
			\d{3}		# the area code
			(?(2) \) )	# if there was an opening parenthesis, close it
			[-\s.]?		# followed by '-' or '.' or space
			\d{3}		# first 3 digits
			[-\s.]?		# followed by '-' or '.' or space
			\d{4}		# last 4 digits

			$/x",$number)) {

		echo "valid\n";
	} else {
		echo "invalid\n";
	}
}

/* prints

123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is invalid
(123).555.6789 is valid
123 55 6789 is invalid

*/

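The trick is to use the ‘x’ modifier at the end of the regular expression. It causes whitespace in the pattern to be ignored unless it is escaped or placed inside a character class, so use \s (or an escaped space) where you need to match actual whitespace. This makes it easy to add comments. Comments start with ‘#’ and end at a newline. Take care not to use the pattern delimiter (‘/’ here) inside a comment, as it would end the pattern early.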
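
One consequence of the ‘x’ modifier is that a literal space or ‘#’ in the pattern needs special treatment. The following is a minimal sketch of one way to handle that; the "quantity, space, hash, item number" format is a made-up example, not something from the phone number pattern above.

```php
<?php
// In x mode, write a literal space as [ ] and a literal hash as \#
$pattern = '/^
	\d+   # quantity
	[ ]   # a literal space, kept because it is inside a character class
	\#    # a literal hash sign, escaped so it does not start a comment
	\d+   # item number
$/x';

$with_space    = preg_match($pattern, '3 #42'); // 1
$without_space = preg_match($pattern, '3#42');  // 0

var_dump($with_space, $without_space);
```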


Using Callbacks

In PHP, preg_replace_callback() lets a callback function compute the replacement for each match.

Sometimes you need to do multiple replacements. If you call preg_replace() or str_replace() for each pattern, the string will be parsed over and over again.

Let’s look at this example, where we have an e-mail template.

$template = "Hello [first_name] [last_name],

Thank you for purchasing [product_name] from [store_name].

The total cost of your purchase was [product_price] plus [ship_price] for shipping.

You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days.

Sincerely,
[store_manager_name]";

// assume $data array has all the replacement data
// such as $data['first_name'] $data['product_price'] etc...

$template = str_replace("[first_name]",$data['first_name'],$template);
$template = str_replace("[last_name]",$data['last_name'],$template);
$template = str_replace("[store_name]",$data['store_name'],$template);
$template = str_replace("[product_name]",$data['product_name'],$template);
$template = str_replace("[product_price]",$data['product_price'],$template);
$template = str_replace("[ship_price]",$data['ship_price'],$template);
$template = str_replace("[ship_days_min]",$data['ship_days_min'],$template);
$template = str_replace("[ship_days_max]",$data['ship_days_max'],$template);
$template = str_replace("[store_manager_name]",$data['store_manager_name'],$template);

// this could be done in a loop too,
// but I wanted to emphasize how many replacements were made

Notice that each replacement has something in common. They are always strings enclosed within square brackets. We can catch them all with a single regular expression, and handle the replacements in a callback function.

So here is the better way of doing this with callbacks:

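// ...

// this will call the callback every time it sees [some_name];
// (\w+) stops at the closing bracket, so two placeholders
// on the same line are not swallowed as a single match
$template = preg_replace_callback('/\[(\w+)\]/', function ($matches) use ($data) {
	// $matches[1] now contains the string between the brackets
	// ($data is not visible inside a function unless imported with 'use')

	if (isset($data[$matches[1]])) {
		// return the replacement string
		return $data[$matches[1]];
	} else {
		// unknown placeholder: leave it untouched
		return $matches[0];
	}
}, $template);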

Now the string in $template is only parsed by the regular expression once.
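
To try the idea end to end, here is a small self-contained sketch. The template text and the values in $data are hypothetical and only there for illustration; the [order_id] placeholder is deliberately missing from $data to show the fallback.

```php
<?php
// Hypothetical replacement data, for illustration only
$data = array(
	'first_name' => 'Jane',
	'store_name' => 'Example Store',
);

$template = "Hello [first_name], thanks for shopping at [store_name]. Ref: [order_id]";

$result = preg_replace_callback('/\[(\w+)\]/', function ($matches) use ($data) {
	// Replace known placeholders; leave unknown ones as they are
	return isset($data[$matches[1]]) ? $data[$matches[1]] : $matches[0];
}, $template);

echo $result, "\n";
// prints: Hello Jane, thanks for shopping at Example Store. Ref: [order_id]
```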


Greedy vs. Ungreedy

Before I start explaining this concept, I would like to show an example first. Let’s say we are looking to find anchor tags in a piece of HTML:

$html = 'Hello <a href="/world">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {

	print_r($matches);

}

The result will be as expected:

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="/world">World!</a>
        )

)
*/

Let’s change the input and add a second anchor tag:

$html = '<a href="/hello">Hello</a>
<a href="/world">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {

	print_r($matches);

}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="/hello">Hello</a>
            [1] => <a href="/world">World!</a>
        )

)
*/

Again, it seems to be fine so far. But don’t let this trick you. The only reason it works is that the anchor tags are on separate lines, and by default the dot (.) does not match newline characters (see the ‘s’ modifier). If we encounter two anchor tags on the same line, it will no longer work as expected:

$html = '<a href="/hello">Hello</a> <a href="/world">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {

	print_r($matches);

}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="/hello">Hello</a> <a href="/world">World!</a>
        )

)
*/

This time the pattern matches the first opening tag, the last closing tag, and everything in between as a single match, instead of making two separate matches. This is due to the default behavior being “greedy”.

“When greedy, the quantifiers (such as * or +) match as many characters as possible.”

If you add a question mark after the quantifier (.*?) it becomes “ungreedy”:

$html = '<a href="/hello">Hello</a> <a href="/world">World!</a>';

// note the ?'s after the *'s
if (preg_match_all('/<a.*?>.*?<\/a>/',$html,$matches)) {

	print_r($matches);

}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="/hello">Hello</a>
            [1] => <a href="/world">World!</a>
        )

)
*/

Now the result is correct. Another way to trigger the ungreedy behavior is to use the U pattern modifier, which inverts the greediness of every quantifier in the pattern.
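
As a rough illustration of the U modifier, the sketch below runs the same pattern with and without it. The bold-tag input is a made-up example.

```php
<?php
$html = '<b>one</b> <b>two</b>';

// Without U: quantifiers are greedy by default
preg_match_all('/<b>.*<\/b>/', $html, $greedy);

// With U: the same quantifier is now ungreedy
preg_match_all('/<b>.*<\/b>/U', $html, $ungreedy);

print_r($greedy[0]);   // one element containing the whole string
print_r($ungreedy[0]); // two elements, one per tag pair
```

Keep in mind that U flips every quantifier, so under U a trailing question mark (.*?) makes the quantifier greedy again.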


Lookahead and Lookbehind Assertions

A lookahead assertion searches for a pattern match that follows the current match. This might be explained easier through an example.

The following pattern first matches for ‘foo’, and then it checks to see if it is followed by ‘bar’:

$pattern = '/foo(?=bar)/';

preg_match($pattern,'Hello foo'); // false
preg_match($pattern,'Hello foobar'); // true

It may not seem very useful, as we could have simply checked for ‘foobar’ instead. However, it is also possible to use lookaheads for making negative assertions. The following example matches ‘foo’, only if it is NOT followed by ‘bar’.

$pattern = '/foo(?!bar)/';

preg_match($pattern,'Hello foo'); // true
preg_match($pattern,'Hello foobar'); // false
preg_match($pattern,'Hello foobaz'); // true

Lookbehind assertions work similarly, but they look for patterns before the current match. You may use (?<= for positive assertions, and (?<! for negative assertions.

The following pattern matches if there is a ‘bar’ and it is not preceded by ‘foo’.

$pattern = '/(?<!foo)bar/';

preg_match($pattern,'Hello bar'); // true
preg_match($pattern,'Hello foobar'); // false
preg_match($pattern,'Hello bazbar'); // true
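
Because assertions match a position rather than any characters, they can also be combined to find "gaps" in a string. Here is a small sketch, not from the examples above, that uses a lookbehind and a lookahead together to insert thousands separators; it assumes the input is a plain string of digits.

```php
<?php
// Match every position that has a digit before it and a multiple
// of three digits between it and the end of the string
$pattern = '/(?<=\d)(?=(?:\d{3})+$)/';

$formatted = preg_replace($pattern, ',', '1234567');
$short     = preg_replace($pattern, ',', '123');

echo $formatted, "\n"; // prints: 1,234,567
echo $short, "\n";     // prints: 123
```

For real number formatting, number_format() is the simpler choice; the point here is only to show that both assertions leave the digits themselves untouched.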

Conditional (If-Then-Else) Patterns

Regular expressions provide the functionality for checking certain conditions. The format is as follows:

(?(condition)true-pattern|false-pattern)

or

(?(condition)true-pattern)

The condition can be a number, in which case it refers to a previously captured subpattern.

For example we can use this to check for opening and closing angle brackets:

$pattern = '/^(<)?[a-z]+(?(1)>)$/';

preg_match($pattern, '<test>'); // true
preg_match($pattern, '<foo'); // false
preg_match($pattern, 'hello'); // true

In the example above, ‘1’ refers to the subpattern (<), which is optional since it is followed by a question mark. Only if that subpattern matched does the pattern require a closing bracket.

The condition can also be an assertion:

// if it begins with 'q', it must begin with 'qu'
// else it must begin with 'f'
$pattern = '/^(?(?=q)qu|f)/';

preg_match($pattern, 'quake'); // true
preg_match($pattern, 'qwerty'); // false
preg_match($pattern, 'foo'); // true
preg_match($pattern, 'bar'); // false
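
The two-branch form also works with a numbered condition. The sketch below is a hypothetical format check, not part of the examples above: if the string opens with a parenthesis, a closing parenthesis must follow the first three digits; otherwise a dash is required in the same place.

```php
<?php
// (?(1)\)|-) reads: if group 1 matched, expect ")", else expect "-"
$pattern = '/^(\()?\d{3}(?(1)\)|-)\d{4}$/';

$paren   = preg_match($pattern, '(555)1234'); // 1
$dash    = preg_match($pattern, '555-1234');  // 1
$open    = preg_match($pattern, '(555-1234'); // 0, parenthesis never closed
$neither = preg_match($pattern, '5551234');   // 0, no dash and no parentheses

var_dump($paren, $dash, $open, $neither);
```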

Filtering Patterns

There are various reasons for input filtering when developing web applications. We filter data before inserting it into a database, or outputting it to the browser. Similarly, it is necessary to filter any arbitrary string before including it in a regular expression. PHP provides a function named preg_quote to do the job.

In the following example we use a string that contains a special character (*).

$word = '*world*';

$text = 'Hello *world*!';

preg_match('/'.$word.'/', $text); // causes a warning
// the second argument also escapes the delimiter, in case $word contains one
preg_match('/'.preg_quote($word, '/').'/', $text); // true

The same thing can also be accomplished by enclosing the string between \Q and \E. Any special character after \Q is treated as a literal until \E.

$word = '*world*';

$text = 'Hello *world*!';

preg_match('/\Q'.$word.'\E/', $text); // true

However, this second method is not 100% safe, as the string itself can contain \E.
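
To make the difference concrete, here is a small sketch comparing the two approaches on an input that contains \E. The value of $word is a hypothetical hostile input chosen for illustration.

```php
<?php
$word = 'a\E.*';           // hypothetical input that contains \E
$text = 'anything at all';

// \Q...\E: the embedded \E ends the quoting early, so ".*" is live again
$loose = preg_match('/^\Q' . $word . '\E$/', $text);

// preg_quote() escapes the backslash too, so only the literal text matches
$strict  = preg_match('/^' . preg_quote($word, '/') . '$/', $text);
$literal = preg_match('/^' . preg_quote($word, '/') . '$/', $word);

var_dump($loose);   // int(1): the input changed the meaning of the pattern
var_dump($strict);  // int(0)
var_dump($literal); // int(1)
```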


Non-capturing Subpatterns

Subpatterns, enclosed by parentheses, get captured into an array so that we can use them later if needed. But there is a way to NOT capture them also.

Let’s start with a very simple example:

preg_match('/(f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => foo'
echo "b* => " . $matches[2]; // prints 'b* => bar'

Now let’s make a small change by adding another subpattern (H.*) to the front:

preg_match('/(H.*) (f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => Hello'
echo "b* => " . $matches[2]; // prints 'b* => foo'

The $matches array has changed, which could cause the script to stop working properly, depending on what we do with those variables in the code. Now we have to find every occurrence of the $matches array in the code, and adjust the index number accordingly.

If we are not really interested in the contents of the new subpattern we just added, we can make it ‘non-capturing’ like this:

preg_match('/(?:H.*) (f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => foo'
echo "b* => " . $matches[2]; // prints 'b* => bar'

By adding ‘?:’ at the beginning of the subpattern, we no longer capture it in the $matches array, so the other array values do not get shifted.
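
Non-capturing groups are also handy when parentheses are needed only for grouping, for example around alternatives. A minimal sketch, using a made-up greeting as input:

```php
<?php
// The titles need to be grouped for the alternation,
// but only the surname is of interest
preg_match('/(?:Mr|Ms|Dr)\. (\w+)/', 'Dear Dr. Smith,', $matches);

print_r($matches);
// $matches[0] is 'Dr. Smith' (the full match)
// $matches[1] is 'Smith' (the only captured subpattern)
```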


Named Subpatterns

There is another method for preventing pitfalls like the one in the previous example. We can actually give names to each subpattern, so that we can reference them later on using those names instead of array index numbers. This is the format: (?P<name>pattern)

We could rewrite the first example in the previous section, like this:

preg_match('/(?P<fstar>f.*)(?P<bstar>b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches['fstar']; // prints 'f* => foo'
echo "b* => " . $matches['bstar']; // prints 'b* => bar'

Now we can add another subpattern, without disturbing the existing matches in the $matches array:

preg_match('/(?P<hi>H.*) (?P<fstar>f.*)(?P<bstar>b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches['fstar']; // prints 'f* => foo'
echo "b* => " . $matches['bstar']; // prints 'b* => bar'

echo "h* => " . $matches['hi']; // prints 'h* => Hello'
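
Named subpatterns are still stored under their numeric index as well, so existing code that uses numbers keeps working. A short sketch with a made-up date string:

```php
<?php
$pattern = '/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/';

preg_match($pattern, 'Posted on 2010-05-17', $matches);

echo $matches['year'], "\n";  // prints: 2010
echo $matches['month'], "\n"; // prints: 05
echo $matches[3], "\n";       // prints: 17 (same value as $matches['day'])
```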

Don’t Reinvent the Wheel

Perhaps it’s most important to know when NOT to use regular expressions. There are many situations where you can find existing utilities that you can use instead.

Parsing [X]HTML

A poster at Stack Overflow has a brilliant explanation of why we should not use regular expressions to parse [X]HTML.

…dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of corrupt entities…

Joking aside, it is a good idea to take some time and figure out what kind of XML or HTML parsers are available, and how they work. For example, PHP offers multiple extensions related to XML (and HTML).

Example: Getting the second link url in an HTML page

// loadHTML() is an instance method; calling it statically fails in PHP 8
$doc = new DOMDocument();
$doc->loadHTML('
	<html>
	<body>Test
		<a href="http://www.nettuts.com">First link</a>
		<a href="http://net.tutsplus.com">Second link</a>
	</body>
	</html>
');

echo $doc->getElementsByTagName('a')
		->item(1)
		->getAttribute('href');

// prints: http://net.tutsplus.com
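
The same approach extends to collecting every link on a page. This is a minimal sketch that assumes the DOM extension is enabled; the example.com URLs are placeholders.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('
	<html>
	<body>
		<a href="http://example.com/a">First link</a>
		<a href="http://example.com/b">Second link</a>
	</body>
	</html>
');

// Walk all anchor elements and collect their href attributes
$urls = array();
foreach ($doc->getElementsByTagName('a') as $link) {
	$urls[] = $link->getAttribute('href');
}

print_r($urls);
```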

Validating Form Input

Again, you can use existing functions to validate user inputs, such as form submissions.

if (!filter_var($_POST['email'], FILTER_VALIDATE_EMAIL)) {

	$errors []= "Please enter a valid e-mail.";
}

// get supported filters
print_r(filter_list());

/* output (the exact list depends on the PHP version)
Array
(
    [0] => int
    [1] => boolean
    [2] => float
    [3] => validate_regexp
    [4] => validate_url
    [5] => validate_email
    [6] => validate_ip
    [7] => string
    [8] => stripped
    [9] => encoded
    [10] => special_chars
    [11] => unsafe_raw
    [12] => email
    [13] => url
    [14] => number_int
    [15] => number_float
    [16] => magic_quotes
    [17] => callback
)
*/

More info: PHP Data Filtering
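
A few more validation filters can be sketched the same way. The input values below are hypothetical; note that the validate filters return the filtered value on success and false on failure.

```php
<?php
// Integer within a range, using the filter's own options
$range = array('options' => array('min_range' => 1, 'max_range' => 120));

$age     = filter_var('42', FILTER_VALIDATE_INT, $range);  // int 42
$too_old = filter_var('150', FILTER_VALIDATE_INT, $range); // false

$url = filter_var('http://example.com/path', FILTER_VALIDATE_URL); // the URL string
$ip  = filter_var('999.1.1.1', FILTER_VALIDATE_IP);                // false

var_dump($age, $too_old, $url, $ip);
```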

Other

Here are some other utilities to keep in mind, before using regular expressions:


Thanks so much for reading!


