Regular Expressions – Grouping

As we learned in our Regular Expressions Primer, regular expressions can be simple or extremely complex. However, there are a few concepts that were left out. Today we are going to cover one of those concepts… grouping.

In regular expressions, grouping allows you to “chunk” parts of your expression and tell the regular expression engine to capture each “chunk” separately, in addition to the overall match. This allows you to, say, find the type of domain (com, eu, biz) of an email address, or check whether a URI is using encryption (https). There are plenty of reasons to use grouping, but enough talk, let’s get to it. Today we will be using PHP yet again, but grouping is supported by any self-respecting regular expression engine.

The Basics of Grouping

Grouping is implemented with parentheses. Anything inside a pair of parentheses is considered a group, and groups are numbered by the order in which they appear in the expression. So if I have two groups, whichever appears first (from left to right) will be group one, and the one that appears next will be group two. Most engines, PHP’s included, also allow groups nested inside other groups, but for today we will stick to single-level groups.

Used this way, grouping has no effect on what gets matched; it only affects the way matches are returned for processing. Let’s take a simple example, which finds (or validates) an email address:

//Without Grouping
$regex = '/[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+\.\w{2,3}/';

//With Grouping
$regex = '/([a-zA-Z0-9_-]+)@([a-zA-Z0-9_-]+)\.(\w{2,3})/';

As you can see, the only difference here is that some pieces of our expression are wrapped in parentheses. In PHP, when a match is returned, you get the whole match first, followed by all the groups, or chunks, that were defined. In the case of our example, we get the user, the domain, and the top-level domain (com, edu, etc.), on top of the full match.

Let’s say for a second you have “user@mydomain.edu”. When you run it through our regular expression you will get back “user@mydomain.edu”, “user”, “mydomain”, and “edu”. This is the essence of grouping: an easy way to check parts of a match.
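To see those pieces come back in PHP, here is a minimal sketch using preg_match, which fills its third argument with the full match followed by each group. The variable names ($email, $matches) are only for illustration, and the hyphen is placed last in each character class so it is always read literally.

```php
<?php
// Grouped email expression: user, domain, and top-level domain.
$regex = '/([a-zA-Z0-9_-]+)@([a-zA-Z0-9_-]+)\.(\w{2,3})/';
$email = 'user@mydomain.edu';

// preg_match returns 1 on a match and fills $matches:
// index 0 is the whole match, 1..3 are the groups in left-to-right order.
if (preg_match($regex, $email, $matches) === 1) {
    echo $matches[0], "\n"; // user@mydomain.edu
    echo $matches[1], "\n"; // user
    echo $matches[2], "\n"; // mydomain
    echo $matches[3], "\n"; // edu
}
```

The pattern is written in single quotes so PHP passes the backslashes through to the regular expression engine untouched.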

Back References

In regular expressions, back-references are a way to refer to the groups captured by your expression, for example when replacing matches. In order to use back-references, you must have groups defined in your expression.

Taking the example above, let’s say we want to replace all the domains in a set of emails with “awesomesite”. In PHP you use the preg_replace function, which uses regular expressions to replace text in a string. Let’s check out how to use back-references to replace the domain in our email.

$text = 'my email is "user@domain.com" and allows me to send emails';
$regex = '/([a-zA-Z0-9_-]+)@([a-zA-Z0-9_-]+)\.(\w{2,3})/';

$result = preg_replace($regex, '$1@awesomesite.$3', $text);

So this will replace “domain” with “awesomesite”, in turn making the email address “user@awesomesite.com”. Now the replacement text is the important part here. The “$1” and “$3” are the back-references, which reference the first and third groups respectively. In this example we just drop the second group and replace it with whatever we want. It’s really that easy.
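Because preg_replace replaces every match in the subject by default, the same call also handles a string containing several addresses. A small sketch, using made-up addresses, with the pattern and replacement in single quotes so PHP leaves the “$1” and “$3” alone:

```php
<?php
$regex = '/([a-zA-Z0-9_-]+)@([a-zA-Z0-9_-]+)\.(\w{2,3})/';
$text  = 'Contact alice@example.com or bob@school.edu for details';

// $1 is the user and $3 the top-level domain; group 2 is simply not reused.
$result = preg_replace($regex, '$1@awesomesite.$3', $text);

echo $result, "\n";
// Contact alice@awesomesite.com or bob@awesomesite.edu for details
```

Note that preg_replace returns the new string rather than changing $text, so the result has to be assigned to something.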

The only trick to using back-references like this, in PHP, is that you have to use a special syntax when you want to follow a back-reference with a number:

//Single quotes stop PHP from treating ${1} as a variable of its own
$result = preg_replace($regex, '${1}1234@awesomesite.$3', $text);

This example will give us an email of “user1234@awesomesite.com”. The “${1}” syntax is used when you want to follow a back-reference with a number, since a plain “$11234” would be read as a reference to group 11 followed by “234”. Nothing is stopping the paranoid programmer from using that syntax all the time.
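To make the difference concrete, here is a short sketch comparing the two forms against a single address. The variable names ($ambiguous, $explicit) are only for illustration.

```php
<?php
$regex = '/([a-zA-Z0-9_-]+)@([a-zA-Z0-9_-]+)\.(\w{2,3})/';
$text  = 'user@domain.com';

// '$11' is read as a reference to group 11, which does not exist here,
// so it is replaced with an empty string and only "234" is left over.
$ambiguous = preg_replace($regex, '$11234@awesomesite.$3', $text);

// '${1}' ends the reference explicitly, so the digits that follow are literal.
$explicit = preg_replace($regex, '${1}1234@awesomesite.$3', $text);

echo $ambiguous, "\n"; // 234@awesomesite.com
echo $explicit, "\n";  // user1234@awesomesite.com
```

Both replacement strings are single-quoted on purpose: inside double quotes PHP would try to interpolate “${1}” itself before preg_replace ever saw it.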

So that is the power of grouping. Pretty simple once you get the hang of regular expressions. Just remember, when you need coding help, all you have to do is Switch On The Code.
