An introduction to regular expressions

Regular expressions, better known as regex or regexp, are supported in almost every programming language, and you find them in many types of applications. If you have never seen a regex before, it may look complicated because of all the special characters. Luckily, regular expressions turn out to have a clear structure.

Searching text with a regular expression

A regular expression can be used to search data. This may be a long text in which you want to find or replace a specific string, or a set of URLs that you want to rewrite in a .htaccess file. Instead of searching for an exact match of your search query, a regex searches for patterns. For example, you may use Google Analytics to gather statistics for your website and want to know how many people visited a page with a date in the URL. Instead of defining your search query as ‘show me the pages where the URL contains 01-01-2013 or 02-01-2013 or …’, you simply use a regex: ‘search for pages where the URL contains dd-mm-yyyy’.
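As a rough sketch of the dd-mm-yyyy idea, here is how such a search could look in Python's `re` module. The URLs are invented for this illustration, and the symbols `\d` and `{ }` used in the pattern are explained further on in this post.

```python
import re

# Hypothetical URLs, invented for this illustration.
urls = [
    "/blog/01-01-2013/new-year",
    "/blog/about-us",
    "/news/02-01-2013",
]

# Two digits, a dash, two digits, a dash, four digits: the shape dd-mm-yyyy.
date_pattern = re.compile(r"\d{2}-\d{2}-\d{4}")

# search() looks for the pattern anywhere inside each URL.
urls_with_date = [url for url in urls if date_pattern.search(url)]
print(urls_with_date)  # ['/blog/01-01-2013/new-year', '/news/02-01-2013']
```

Note that this pattern only checks the shape of the date: something like 99-99-9999 would be matched as well.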

Easy regex for simple search queries

A complex query likely needs a complex regular expression, but a regex is just as useful in simple cases. You do need some knowledge of the structure and of the special characters used in regular expressions. Here are some examples:

Dot .

A dot matches any single character in a string (in most regex flavors every character except a line break), for example:

chapter .

will match chapter 1, chapter 2, chapter 3 and chapter b, but not the whole of chapter 12: there is only one dot, so the match stops after ‘chapter 1’. To match chapter 12 completely, the following regex with two dots can be used:

chapter ..
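The difference between one and two dots can be tried out with a small Python sketch. It assumes Python's `re` module, where `fullmatch` requires the whole string to fit the pattern and `search` only needs the pattern to occur somewhere in the string.

```python
import re

titles = ["chapter 1", "chapter b", "chapter 12"]

# fullmatch: the entire string has to fit the pattern.
one_dot = [t for t in titles if re.fullmatch(r"chapter .", t)]
two_dots = [t for t in titles if re.fullmatch(r"chapter ..", t)]

# search: the pattern may occur anywhere, so "chapter 12" is found too
# (the matched part is then only "chapter 1").
found_anywhere = [t for t in titles if re.search(r"chapter .", t)]

print(one_dot)         # ['chapter 1', 'chapter b']
print(two_dots)        # ['chapter 12']
print(found_anywhere)  # ['chapter 1', 'chapter b', 'chapter 12']
```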

Special characters

In a regular expression many characters have a special meaning. You just saw that a . (dot) matches any single character, but + * ? ^ $ | \ and the brackets ( ) [ ] { } have special functions as well. If you want to match one of those characters literally, escape it by putting a backslash ( \ ) in front of it. Example: to determine whether an IP address has a certain value, the following regex may be used:

123\.456\.7\.8
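To see why the backslashes matter, the sketch below compares the escaped pattern with an unescaped one in Python. The test string with letters in place of the dots is made up for this illustration; `re.escape` is a standard helper that adds the backslashes for you.

```python
import re

escaped = re.compile(r"123\.456\.7\.8")   # dots are literal dots
unescaped = re.compile(r"123.456.7.8")    # dots match any character

# Both patterns accept the intended value.
escaped_ok = escaped.fullmatch("123.456.7.8") is not None
unescaped_ok = unescaped.fullmatch("123.456.7.8") is not None

# Only the unescaped pattern also accepts other characters in place of the dots.
escaped_wrong = escaped.fullmatch("123x456y7z8") is not None
unescaped_wrong = unescaped.fullmatch("123x456y7z8") is not None

print(escaped_ok, unescaped_ok)        # True True
print(escaped_wrong, unescaped_wrong)  # False True

# re.escape() builds the escaped pattern from a plain string.
print(re.escape("123.456.7.8"))        # 123\.456\.7\.8
```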

Ranges

Square brackets indicate that exactly one of the characters within the brackets has to be matched. Therefore, the regular expression

P[lL]int-sites

will match both PLint-sites (our company name) and Plint-sites. By using a - (hyphen) inside the brackets, a range of characters can be indicated. Two regexes that are often used are

[0-9] or [^0-9]

The first regular expression matches a single digit from 0 to 9, while the second one matches any single character that is not a digit.
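A short Python sketch of these bracket expressions follows. The sample text ‘room 42b’ is invented for this illustration; `findall` returns every part of the text that the pattern matches, which shows that a bracket expression matches one character at a time.

```python
import re

# One of the characters between the brackets has to be present.
upper = re.fullmatch(r"P[lL]int-sites", "PLint-sites") is not None
lower = re.fullmatch(r"P[lL]int-sites", "Plint-sites") is not None
other = re.fullmatch(r"P[lL]int-sites", "Pxint-sites") is not None
print(upper, lower, other)  # True True False

# [0-9] matches a single digit, [^0-9] a single character that is not a digit.
digits = re.findall(r"[0-9]", "room 42b")
non_digits = re.findall(r"[^0-9]", "room 42b")
print(digits)      # ['4', '2']
print(non_digits)  # ['r', 'o', 'o', 'm', ' ', 'b']
```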

Repetition of characters

So far we only considered variations of a single character within a regular expression. It is also possible to match a character that is repeated, using the symbols ? * + or { }. The question mark indicates that the character before it is optional: it may occur 0 or 1 times. Therefore,

12?

will match 1 and 12. The plus sign (+) matches one or more occurrences of the previous character:

12+

This expression will match 12, 122, 1222 etc. The asterisk * is comparable, but also matches when the character in front of the * does not occur at all: 12* matches 1 while 12+ does not. Finally, curly braces { } indicate a specific number of repetitions of the previous character. Between the braces you may put a single number or a range of numbers, written as a minimum and a maximum separated by a comma:

ab{3}

matches abbb, while

ab{1,3}

forms a match with ab, abb and abbb.
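The repetition symbols can be compared side by side in Python. The helper `matches` below is a hypothetical convenience function written for this sketch, not part of the `re` module; it checks whether a whole string fits a pattern. In Python's `re` module a range of repetitions is written with a comma, as in `{1,3}`.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? : the previous character occurs 0 or 1 times
print(matches(r"12?", "1"), matches(r"12?", "12"), matches(r"12?", "122"))
# True True False

# + : the previous character occurs 1 or more times
print(matches(r"12+", "1"), matches(r"12+", "12"), matches(r"12+", "1222"))
# False True True

# * : the previous character occurs 0 or more times
print(matches(r"12*", "1"), matches(r"12*", "1222"))
# True True

# {3} : exactly 3 times, {1,3} : between 1 and 3 times
print(matches(r"ab{3}", "abbb"), matches(r"ab{3}", "abb"))
# True False
print(matches(r"ab{1,3}", "ab"), matches(r"ab{1,3}", "abbb"), matches(r"ab{1,3}", "abbbb"))
# True True False
```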

Start and end of a string

The characters ^ and $ are used in a regular expression to indicate the start and end of a string, respectively.

^PLint

matches with ‘PLint-sites’, but not with ‘we are PLint-sites’. In contrast,

PLint$

forms a match with ‘we are PLint’, but not with ‘PLint-sites’. Be careful: the symbol ^ is also used as the first character within square brackets [] to indicate that the character should not be from the set between the brackets.
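Both uses of ^ can be checked with a few lines of Python. The sample string ‘ab1c!’ is made up for this sketch; `search` is used so that only the anchors decide where the pattern may match.

```python
import re

# ^ anchors the pattern to the start of the string.
start_ok = re.search(r"^PLint", "PLint-sites") is not None
start_fail = re.search(r"^PLint", "we are PLint-sites") is not None
print(start_ok, start_fail)  # True False

# $ anchors the pattern to the end of the string.
end_ok = re.search(r"PLint$", "we are PLint") is not None
end_fail = re.search(r"PLint$", "PLint-sites") is not None
print(end_ok, end_fail)  # True False

# Inside square brackets, ^ means "not one of these characters".
not_lowercase = re.findall(r"[^a-z]", "ab1c!")
print(not_lowercase)  # ['1', '!']
```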

Abbreviations

For some frequently occurring character sets a shorthand may be used: \d matches any digit from 0 to 9, \s matches a whitespace character (such as a space, tab or line break), and \w matches any letter a-z or A-Z, any digit, or an underscore.
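The abbreviations can be tried on a small sample in Python. The text below is invented for this illustration. One caveat for Python specifically: on ordinary strings, `\d` and `\w` also accept digits and letters from other scripts unless the `re.ASCII` flag is passed.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "order_42 shipped on 01-01-2013"

numbers = re.findall(r"\d+", text)  # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits and underscores
parts = re.split(r"\s+", text)      # split on runs of whitespace

print(numbers)  # ['42', '01', '01', '2013']
print(words)    # ['order_42', 'shipped', 'on', '01', '01', '2013']
print(parts)    # ['order_42', 'shipped', 'on', '01-01-2013']
```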

Groups

With parentheses () groups are indicated and the pipe | is used as ‘or’. For example:

P(L|l)int-sites

will match both PLint-sites and Plint-sites.
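In Python, a group does a little more than combine alternatives: the part of the text matched by the parentheses can be read back afterwards. A minimal sketch, assuming the `re` module:

```python
import re

pattern = re.compile(r"P(L|l)int-sites")

upper = pattern.fullmatch("PLint-sites")
lower = pattern.fullmatch("Plint-sites")
other = pattern.fullmatch("Pxint-sites")

# group(1) returns what the first pair of parentheses matched.
print(upper.group(1))  # L
print(lower.group(1))  # l
print(other)           # None
```

For a choice between single characters like this one, the square brackets P[lL]int-sites do the same job; groups become more useful when the alternatives are longer than one character.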

Getting started with regular expressions

Although regular expressions may seem a little complex, they are very powerful once you know how to use them. There are several tools available on the internet to help you write a regex. For example, there are regex testers that show in real time which part of a text is matched by your expression. Another nice tool to test your regex skills is the regex crossword challenge. Starting with simple exercises, you will gradually gain experience with the more complex regular expressions.

Regular expressions offer lots of possibilities to search parts of a text, to selectively rewrite URLs within a .htaccess file or to filter statistics within Google Analytics. The examples in this blog post are very simple and only meant as an illustration of the possibilities. Time to get started and write your own (complex) regular expressions!

Leonie Derendorp, web developer and co-owner of PLint-sites in Sittard, The Netherlands. I love to create complex web applications using Laravel!