Pattern recognition is hard to get right, and for a developer the solution can be as short as a few lines of code or longer than you can count on your fingers and toes. Today we are going to go over a very common tool used for text pattern recognition: regular expressions. So let’s get started and dive right in.
Regular expressions provide a standard framework for pattern recognition in strings. They allow you to define and find specific pieces of text, which can be slightly more complicated than it sounds. In fact, regular expressions are really a language unto themselves, albeit a small one.
Pretty much any high-level language has an implementation of regular expressions. For this tutorial we will be using PHP for our examples, which uses Perl-style regular expressions. However, keep in mind that most regular expression engines share the same core syntax, with small variations in features being the only difference.
Our First Example
The first rule of regular expressions is to remember that we are searching for patterns rather than exact text. For example, I may want to validate that something is in the form of currency ($##.##), rather than a specific amount of money. In fact, let’s roll with that example first.
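$regex = '/\$\d+\.\d{2,}/';
Before we get too deep, let me explain exactly what this does. We start by encasing everything in / (forward slashes), i.e. “/…expression…/”. PHP’s Perl-style functions require a delimiter around the pattern, and the forward slash is the conventional choice. The pattern sits in single quotes so that PHP leaves the backslashes and the “$” alone.
Next we dive into the expression itself. First we have a literal “$”. An unescaped “$” has a special meaning in a regular expression (it marks the end of the string), so it has to be escaped as “\$”. It is followed by any digit at least once, expressed as “\d+”. Then it looks for a literal “.”, which also has to be escaped, thus “\.”. Finally we look for two or more digits, which accounts for the expression “\d{2,}”.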
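To see the currency pattern in action, here is a minimal sketch using PHP’s preg_match function. The sample strings are made up for illustration, and the dollar sign is written as “\$” because an unescaped “$” would be treated as an end-of-string anchor rather than a literal character.

```php
<?php
// Single quotes keep PHP from touching the backslashes or the "$" in the pattern.
// "\$" is a literal dollar sign; a bare "$" would be the end-of-string anchor.
$regex = '/\$\d+\.\d{2,}/';

// Hypothetical sample inputs, chosen only for illustration.
$samples = ['$19.99', '$5.5', '19.99', 'Total: $100.00'];

$results = [];
foreach ($samples as $sample) {
    // preg_match returns 1 when the pattern is found and 0 when it is not.
    $results[$sample] = preg_match($regex, $sample);
    echo $sample, ' => ', $results[$sample] ? 'match' : 'no match', PHP_EOL;
}
```

Notice that “Total: $100.00” matches as well: the pattern only has to appear somewhere inside the string, not make up the whole string.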
So there are a few things that may need a bit more explanation here. While things like “\d” are pretty straightforward (“\d” being a match for any digit), things like “+” are a bit confusing. The “+” and “{2,}” in our example both deal with matching things a certain number of times. In fact there are a few more of these “number of occurrences” operators, so here is a super sweet table for reference.
Number of Occurrence Operators
Operator | Matches…
? | 0 or 1 times
* | 0 or more times
+ | 1 or more times
{min,max} | Between the min and max number of times (written without a space after the comma). You can leave the max out to specify no max; for example, {2,} matches 2 or more times. A single number, such as {1}, matches exactly that many times.
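The operators in the table can be tried out with a short sketch like the one below. The patterns and sample strings are hypothetical; preg_match’s optional third argument is used to capture what was actually matched, so you can see how many characters each operator consumed.

```php
<?php
// Pattern / subject pairs; the subjects are made-up sample strings.
$checks = [
    ['/colou?r/', 'colour'],   // "?": the "u" is optional...
    ['/colou?r/', 'color'],    // ...so both spellings match
    ['/ab*c/',    'ac'],       // "*": zero "b"s still matches
    ['/ab+c/',    'ac'],       // "+": needs at least one "b"
    ['/ab+c/',    'abbbc'],
    ['/\d{2,4}/', '12345'],    // {2,4}: takes at most four digits
    ['/\d{2,}/',  '7'],        // {2,}: a single digit is too few
];

$found = [];
foreach ($checks as [$pattern, $subject]) {
    // $m[0] holds the text matched by the whole pattern; null means no match.
    $found[] = preg_match($pattern, $subject, $m) ? $m[0] : null;
}
print_r($found);
// colour, color, ac, (no match), abbbc, 1234, (no match)
```

The “1234” result shows that these operators grab as many characters as they are allowed to, which is why only four of the five digits are matched.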
Ok, so we have some operators. What about the things you use to actually create the pattern to match x number of times? Well, let’s start with a table for reference.
Simple Expressions
Expression | Matches…
\d | Any digit (0-9)
\D | Any non-digit
\w | Any alphanumeric character or underscore
\W | Any character that is not alphanumeric or an underscore
\s | Any whitespace character (space, tab, etc.)
. | Any character except a newline (by default)
Between these two tables we can start to wrap our heads around our example. Let’s take “\d+”. In simple terms it says “there must be a collection of one or more digits in a row”, i.e. “01”, “19”, “1”, “432”, etc. We also now know why we have to escape the “.”, as it has a special meaning to the engine.
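As a rough illustration of the simple expressions, the sketch below counts how many characters of a made-up sample string each one matches. It relies on preg_match_all, which returns the number of matches found.

```php
<?php
$text = 'Order 66, aisle_7!';   // hypothetical sample string, 18 characters long

// preg_match_all returns the number of non-overlapping matches.
$digits     = preg_match_all('/\d/', $text);   // 3: "6", "6", "7"
$wordChars  = preg_match_all('/\w/', $text);   // 14: letters, digits, and "_"
$nonWord    = preg_match_all('/\W/', $text);   // 4: two spaces, "," and "!"
$whitespace = preg_match_all('/\s/', $text);   // 2
$anything   = preg_match_all('/./', $text);    // 18: every character

// "\d+" grabs the first run of one or more digits in a row.
preg_match('/\d+/', $text, $m);
$firstNumber = $m[0];                          // "66"

echo "$digits digits, $wordChars word characters, first number: $firstNumber", PHP_EOL;
```

Note that the underscore in “aisle_7” is counted by “\w”, which is easy to forget.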
Digging Deeper
It is already pretty apparent that things can get complex fast…and this is only the tip of the iceberg. What if you have a specific group of acceptable characters? Or maybe you just want to match lowercase alphabetic characters? Well, this is where classes and []s come in.
Using []s, you can define a list of characters to match or use some built-in “classes” that allow you to specify a range of alphabetic or numeric characters. As crazy as that sounds, it is extremely useful. Let’s go with another simple example:
$regex = "/[f-jA-C\sGKO]/";
So here we have an expression that is going to match a single character. However, there are a lot of choices as to what that character could be.
First we have the range “f-j” which matches any lowercase letter between f and j. Keep in mind this is case sensitive, so our range only matches lowercase letters, while our next range of “A-C” only matches uppercase letters.
After our two ranges, we have “\s”, which matches any whitespace character, followed by a collection of single uppercase letters. Each letter stands on its own, so any one of the last three characters counts as a valid match.
So basically our expression is looking for a single character that is f, g, h, i, j, A, B, C, G, K, O, or a whitespace character of some kind.
Let’s go ahead and mix things up a bit more. Let’s say we modify our example to be:
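A quick way to check that list is to run a handful of single characters through the expression. The candidate characters below are just an illustrative sample, not an exhaustive test.

```php
<?php
$regex = '/[f-jA-C\sGKO]/';

// Hypothetical candidates: some inside the class, some outside it.
$candidates = ['g', 'B', ' ', 'K', 'a', 'D', 'F', '7'];

$accepted = [];
foreach ($candidates as $char) {
    // Keep only the characters the class accepts.
    if (preg_match($regex, $char) === 1) {
        $accepted[] = $char;
    }
}
echo implode(',', $accepted), PHP_EOL;   // g,B, ,K
```

The uppercase “F” is rejected even though lowercase “f” is in range, which shows the case sensitivity at work.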
$regex = "/^[f-jA-C\sGKO]$/";
What? Now we have a “^” and “$” in there? Well, those are position operators. The “^” represents the start of a given string and the “$” represents the end. In essence, we have limited our expression to a one-character string containing one of the acceptable characters. Let’s get a table going one last time:
Ranges, Classes, and Positions
Expression | Effect
[] | Used for specifying a collection of single characters or a range of characters. Any character defined is case sensitive.
- | Range specifier, case sensitive. Example: [a-fA-F]
^ | Denotes that the pattern should start at the beginning of the string.
$ | Denotes that the pattern should end at the end of the string.
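To make the effect of the position operators concrete, here is a small sketch comparing the character-class expression with and without “^” and “$”. The sample strings are made up for the comparison.

```php
<?php
$unanchored = '/[f-jA-C\sGKO]/';    // the character may appear anywhere
$anchored   = '/^[f-jA-C\sGKO]$/';  // the character must be the entire string

// Hypothetical sample inputs.
$subjects = ['h', 'high', 'xhx'];

$report = [];
foreach ($subjects as $subject) {
    $report[$subject] = [
        'unanchored' => preg_match($unanchored, $subject),
        'anchored'   => preg_match($anchored, $subject),
    ];
}
print_r($report);
```

All three strings contain an “h”, so the unanchored expression matches each of them, but only the one-character string “h” gets past the anchored version.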
Closing
I can sit here and explain things to you all day, but in the end examples are really the best way to explain regular expressions. So here are a few that might help you visualize how regular expressions come together.
A Few Examples
Expression | Matches (separated by commas)
/[Hh]ello\s?[Ww]orld/ | hello world,Hello World,helloworld
/[bBfFHh]{1}a[a-zA-Z]+/ | Bat,hat,Hall,ball,Fatt,bait,hate
/http[s]?/ | http,https
/\d\.\d{2,4}/ | 1.44,5.121,9.4356
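If you would rather verify the table than take my word for it, a sketch along these lines runs every listed string through its expression. The array layout and variable names are my own choices for the illustration.

```php
<?php
// Each pattern from the table, mapped to the strings it is listed as matching.
$examples = [
    '/[Hh]ello\s?[Ww]orld/'   => ['hello world', 'Hello World', 'helloworld'],
    '/[bBfFHh]{1}a[a-zA-Z]+/' => ['Bat', 'hat', 'Hall', 'ball', 'Fatt', 'bait', 'hate'],
    '/http[s]?/'              => ['http', 'https'],
    '/\d\.\d{2,4}/'           => ['1.44', '5.121', '9.4356'],
];

$failures = [];
foreach ($examples as $pattern => $subjects) {
    foreach ($subjects as $subject) {
        // Record any pair where the pattern fails to match.
        if (preg_match($pattern, $subject) !== 1) {
            $failures[] = "$pattern did not match $subject";
        }
    }
}
echo $failures ? implode(PHP_EOL, $failures) : 'All examples matched', PHP_EOL;
```

Try adding strings of your own to the lists, such as “cat” for the second expression, to see which ones fail and why.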
Ok, so now we have a simple understanding of regular expressions and some examples to look through. Before I wrap this up officially though, I have compiled all of our reference tables in one super reference, or cheat sheet if you will, below.
Regular Expression Reference
Expression | Effect
\d | Any digit (0-9)
\D | Any non-digit
\w | Any alphanumeric character or underscore
\W | Any character that is not alphanumeric or an underscore
\s | Any whitespace character (space, tab, etc.)
. | Any character except a newline (by default)
? | 0 or 1 times
* | 0 or more times
+ | 1 or more times
{min,max} | Between the min and max number of times (written without a space after the comma). You can leave the max out to specify no max; for example, {2,} matches 2 or more times. A single number, such as {1}, matches exactly that many times.
[] | Used for specifying a collection of single characters or a range of characters. Any character defined is case sensitive.
- | Range specifier, case sensitive. Example: [a-fA-F]
^ | Denotes that the pattern should start at the beginning of the string.
$ | Denotes that the pattern should end at the end of the string.
This is going to wrap it up for our introduction to regular expressions. Keep an eye out for more, because things can be a lot crazier. Just remember, when you need coding help, all you have to do is Switch On The Code.