9.4 Regular expressions

9.4.1 What is a regular expression?

Regular expressions (regex for short) are patterns that describe strings. They are super useful because they allow for complicated search patterns. Many tools and programming languages understand regex, including grep and sed. In fact, grep and sed interpret any pattern as a regex by default.

9.4.2 Basic syntax

A regex consists of parts to be matched in a specific order. For example, the pattern li consists of two parts l and i that need to be matched in sequence.

A simple example:

$ echo 'I really like grep!' | grep 'li'
I really like grep!

Note: Even though it is not necessary for grep in this case, it is good coding practice to enclose regex patterns in quotes, as they often contain characters that the shell would otherwise interpret itself.

Below you can find some of the most often used regular expressions as well as a few examples showing you how to use them.

The characters ^ and $ indicate the start and end of a string, respectively:

$ echo 'blah blah blah' | grep 'blah'
$ echo 'blah blah blah' | grep '^blah'
$ echo 'blah blah blah' | grep 'blah$'
blah blah blah
blah blah blah
blah blah blah

When you execute these commands in the terminal, you can see that the first command highlights all three ‘blah’s, which is exactly what we expect from grep. With the second command, only the first of the three ‘blah’s is highlighted. This is because the ^ tells grep to look for the pattern ‘blah’ only at the beginning of a string. Similarly, the third command highlights only the last of the ‘blah’s, since the $ sign tells grep to look for the pattern ‘blah’ only at the end of a string.

Note: The highlighting is unfortunately not preserved when writing to a file such as this site. It is best to execute these examples in a terminal to see their effect.
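Since the highlighting is not visible here, the effect of the two anchors can also be checked with input that spans several lines, because grep treats every input line as a separate string and prints only the lines that match. The following is a minimal sketch with made-up input lines (the variable names are chosen for illustration only):

```shell
# Made-up input: three lines, only some of which start or end with 'blah'
input=$(printf 'blah one\ntwo blah\nthree\n')

# Keep only the lines that start with 'blah'
starts=$(printf '%s\n' "$input" | grep '^blah')

# Keep only the lines that end with 'blah'
ends=$(printf '%s\n' "$input" | grep 'blah$')

echo "$starts"   # blah one
echo "$ends"     # two blah
```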

9.4.3 Character classes

If you are looking for patterns with several alternatives, for example strings that differ in only one character, you can use either [ab] or (a|b). Both ways are valid, but the (a|b) syntax requires extended regex (option -E):

$ echo 'AUG CAG UAU UGG CAC AAC' | grep 'AU'
AUG CAG UAU UGG CAC AAC

$ echo 'AUG CAG UAU UGG CAC AAC' | grep 'A[UC]'
AUG CAG UAU UGG CAC AAC

Here, we are looking for a pattern that is either ‘AU’ or ‘AC’.

$ echo 'AUG CAG UAU UGG CAC AAC' | grep -E 'A(G|C)'
AUG CAG UAU UGG CAC AAC

Here, we are looking for a pattern that is either ‘AG’ or ‘AC’.
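To see exactly which parts match without relying on colours, one option is the -o flag of grep, which prints every match on its own line. This is a minimal sketch, assuming a grep that supports -o (GNU and BSD grep do); paste is used here only to join the matches onto a single line:

```shell
codons='AUG CAG UAU UGG CAC AAC'

# Character class: 'A' followed by either 'U' or 'C'
class_matches=$(echo "$codons" | grep -o 'A[UC]' | paste -sd ' ' -)

# Alternation: 'A' followed by either 'G' or 'C' (needs -E)
alt_matches=$(echo "$codons" | grep -oE 'A(G|C)' | paste -sd ' ' -)

echo "$class_matches"   # AU AU AC AC
echo "$alt_matches"     # AG AC AC
```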

You can also specify a range of characters, and you can even combine multiple ranges in one character class: [A-C], [3-7], [f-m3-5].

$ echo 'abcdefghijklmnop012345678' | grep '[e-h]'
abcdefghijklmnop012345678

$ echo 'abcdefghijklmnop012345678' | grep '[l-w4-6]'
abcdefghijklmnop012345678
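The characters selected by a combined range can be listed one by one with grep -o. In this sketch, LC_ALL=C is an addition not used in the examples above; it is set so that ranges follow plain ASCII order, since ranges may behave differently in other locales:

```shell
# Print every single character matched by the combined ranges l-w and 4-6
range_matches=$(echo 'abcdefghijklmnop012345678' | LC_ALL=C grep -o '[l-w4-6]' | paste -sd ' ' -)

echo "$range_matches"   # l m n o p 4 5 6
```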

You can always use square brackets to indicate the ranges of characters that you would like to match, but for some frequently used classes, shortcuts have been established. For GNU tools, some helpful shortcuts include:

. (the dot) matches any character - use carefully!
[[:alnum:]] is equivalent to [a-zA-Z0-9]
[[:punct:]] matches all punctuation characters
\w matches all word characters, equivalent to [a-zA-Z0-9_]
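\d matches all digits, equivalent to [0-9] (a Perl-style shortcut: GNU grep only understands it with the option -P, otherwise use [0-9] or [[:digit:]])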
\s matches all white spaces (space, tab, new line, …)
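The bracketed classes can be tried out as follows. This is a small sketch on a made-up input string; it deliberately sticks to the POSIX classes ([[:digit:]], [[:punct:]], [[:alnum:]]), since the backslash shortcuts are not supported by every tool, and it sets LC_ALL=C as an extra assumption so that the classes cover ASCII characters only:

```shell
sample='sample_01: 42 reads!'   # made-up input

# Runs of digits
digits=$(echo "$sample" | LC_ALL=C grep -oE '[[:digit:]]+' | paste -sd ' ' -)

# Single punctuation characters
punct=$(echo "$sample" | LC_ALL=C grep -o '[[:punct:]]' | paste -sd ' ' -)

# Runs of word characters, written out as alphanumerics plus underscore
words=$(echo "$sample" | LC_ALL=C grep -oE '[[:alnum:]_]+' | paste -sd ' ' -)

echo "$digits"   # 01 42
echo "$punct"    # _ : !
echo "$words"    # sample_01 42 reads
```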

It is often easier to define a class by exclusion using the negation operator ^:

$ echo 'AUG CAG UAU UGG CAC AAC' | grep '[^A]'
AUG CAG UAU UGG CAC AAC

In case you are confused now: just before, we learned that ^ matches at the beginning of the string, and now we tell you it means “anything but the given characters”. Don’t worry, both are true, and the meaning depends on the position. Square brackets [] match one character out of a set, and a ^ placed directly after the opening bracket means “match anything EXCEPT these characters”. A ^ outside of square brackets, as in the first example with the ‘blah’s, anchors the match to the beginning of the string. Both cases are quite useful, so it pays to know the difference.
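Both meanings of ^ can even appear in the same pattern. The following sketch uses made-up input with one codon per line and keeps only the lines whose first character is not an A:

```shell
# Made-up input: one codon per line
codon_lines=$(printf 'AUG\nCAG\nUAU\nAAC\n')

# First ^ : anchor at the start of the line
# [^A]    : any single character except 'A'
not_a_first=$(printf '%s\n' "$codon_lines" | grep '^[^A]' | paste -sd ' ' -)

echo "$not_a_first"   # CAG UAU
```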

Special characters need to be escaped using the backslash:

$ echo 'file.txt file_txt.htm files.gz' | grep 'file.'
file.txt file_txt.htm files.gz

$ echo 'file.txt file_txt.htm files.gz' | grep 'file\.'
file.txt file_txt.htm files.gz

In the first example, a match is highlighted in all three file names, since the dot (.) stands for ‘match any character’. If you actually want to match a literal dot, you have to escape it with a backslash, as in the second example, where only ‘file.’ in ‘file.txt’ matches.
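The difference between the two patterns becomes visible when only the matched parts are printed. A minimal sketch, again assuming a grep with the -o flag:

```shell
files='file.txt file_txt.htm files.gz'

# Unescaped dot: 'file' followed by any single character
any_char=$(echo "$files" | grep -o 'file.' | paste -sd ' ' -)

# Escaped dot: 'file' followed by a literal dot
literal_dot=$(echo "$files" | grep -o 'file\.' | paste -sd ' ' -)

echo "$any_char"      # file. file_ files
echo "$literal_dot"   # file.
```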

9.4.4 Quantifiers

You can use curly brackets to specify the number of occurrences that you are looking for in your pattern. These so-called quantifiers immediately follow the character (or character class) that you want to be repeated. Their basic syntax is {m,n}, where m and n specify the minimal and maximal number of occurrences required.

$ echo 'ATTACTTACCTTACCCTTACCCCTTACCCCCTT' | grep -E 'AC{2,3}T'
ATTACTTACCTTACCCTTACCCCTTACCCCCTT

Note: For quantifiers written in this form to work, grep must be launched with the option -E, which activates extended regex. (In basic regex, the curly brackets have to be escaped as \{m,n\}.)

In the above example, we are looking for one A, followed by 2 OR 3 C’s and then followed by one T.

If you require an exact number of occurrences (i.e. if m=n), you need to specify only one number:

$ echo 'ATTACTTACCTTACCCTTACCCCTTACCCCCTT' | grep -E 'AC{4}T'
ATTACTTACCTTACCCTTACCCCTTACCCCCTT

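If you omit m or n, zero or an infinite number is implied, respectively. In other words, you may search for “up to” or “at least” a certain number of occurrences (note that omitting m, as in {,3}, is a GNU extension; {0,3} is the portable spelling):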

$ echo 'ATTACTTACCTTACCCTTACCCCTTACCCCCTT' | grep -E 'AC{,3}T'
ATTACTTACCTTACCCTTACCCCTTACCCCCTT

$ echo 'ATTACTTACCTTACCCTTACCCCTTACCCCCTT' | grep -E 'AC{3,}T'
ATTACTTACCTTACCCTTACCCCTTACCCCCTT

In case you want to get fancy, there are again shortcuts for these expressions:

C* is equivalent to C{0,}, implying “zero or more”.
C+ is equivalent to C{1,}, implying “one or more”.
C? is equivalent to C{0,1}, implying “zero or one”.
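The quantifiers above can be compared side by side by printing only the matches. This is a sketch using the same sequence as in the examples; the last command additionally shows the escaped curly brackets of basic regex (no -E), which should give the same result as its extended counterpart:

```shell
dna='ATTACTTACCTTACCCTTACCCCTTACCCCCTT'

# Two or three C's between A and T
two_to_three=$(echo "$dna" | grep -oE 'AC{2,3}T' | paste -sd ' ' -)

# At least three C's between A and T
at_least_three=$(echo "$dna" | grep -oE 'AC{3,}T' | paste -sd ' ' -)

# Zero or one C between A and T
zero_or_one=$(echo "$dna" | grep -oE 'AC?T' | paste -sd ' ' -)

# Basic regex: same quantifier as the first command, with escaped brackets
basic=$(echo "$dna" | grep -o 'AC\{2,3\}T' | paste -sd ' ' -)

echo "$two_to_three"     # ACCT ACCCT
echo "$at_least_three"   # ACCCT ACCCCT ACCCCCT
echo "$zero_or_one"      # AT ACT
echo "$basic"            # ACCT ACCCT
```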

9.4.5 Grouping

As mentioned before, quantifiers are written after the target character or character class. If you want them to be applied to a whole word, you need to group its characters together using round brackets ().

$ echo 'AGTGTACCACAGTGTGTGTCACCAC' | grep -E '[AC](GT)+[AC]'
AGTGTACCACAGTGTGTGTCACCAC

In words, the above example looks for either an A or a C (that’s the [AC]), followed by one or more occurrences of GT (that’s the (GT)+), and then again for either an A or a C (that’s the second [AC]).
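What the round brackets change is easiest to see by running the same quantifier with and without them. A small sketch on a made-up input string:

```shell
repeats='GTGTGT GTTT'   # made-up input

# Without grouping, + applies to the T only: one G followed by one or more T
ungrouped=$(echo "$repeats" | grep -oE 'GT+' | paste -sd ' ' -)

# With grouping, + applies to the whole word GT
grouped=$(echo "$repeats" | grep -oE '(GT)+' | paste -sd ' ' -)

echo "$ungrouped"   # GT GT GT GTTT
echo "$grouped"     # GTGTGT GT
```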

Groups themselves may contain alternatives:

$ echo 'ACAC ACAG AGAG' | grep -E '(A[CG])+'
ACAC ACAG AGAG

In the above example, we search for any of the words “AC” or “AG”, and such words may occur once or more. That applies to all three words printed above. If you want to limit your search to the words “ACAC” and “AGAG” only and not “ACAG”, you will need to use back references.

9.4.6 Back reference

In regex, you may refer back to any group that matched before using \n where n is the number of the group. The first group would then be \1, the second group \2 and so forth.

Referring back to groups allows us to match a complicated pattern multiple times. In the above example, for instance, we may limit the search to words such as “ACAC” or “AGAG” by first looking for the part “AC” or “AG”, and then again for whatever was found first.

$ echo 'ACAC ACAG AGAG' | grep -E '(A[CG])\1+'
ACAC ACAG AGAG

Note: Back-referencing a group written with plain round brackets, as done here, also requires extended regex (i.e. the option -E).

Decomposing the above command: the regex part A[CG] is looking for a word “AC” or “AG”. We then use () to create a group out of this match and refer back to it using \1. So basically the above regex reads “Search for either ‘AC’ or ‘AG’, followed by whatever was found, once or more”.

We could also achieve the same by writing a regex that searches for any two bases, followed by the same combination of bases again.

$ echo 'ACAC ACAG AGAG' | grep -E '([ACGT][ACGT])\1'
ACAC ACAG AGAG

Or, to make things more complicated, we could achieve the same thing using two groups:

$ echo 'ACAC ACAG AGAG' | grep -E '([ACGT])([ACGT])\1\2'
ACAC ACAG AGAG
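The effect of the back reference can be checked by printing only the matches with and without it. This is a sketch, assuming a grep that accepts back references together with -E (GNU grep does):

```shell
words='ACAC ACAG AGAG'

# Any sequence of AC/AG blocks
any_repeat=$(echo "$words" | grep -oE '(A[CG])+' | paste -sd ' ' -)

# A block followed by one or more copies of exactly the same block
same_repeat=$(echo "$words" | grep -oE '(A[CG])\1+' | paste -sd ' ' -)

echo "$any_repeat"    # ACAC ACAG AGAG
echo "$same_repeat"   # ACAC AGAG
```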

Back references are particularly helpful in search-replace settings. Consider the following sentence, in which you would like to replace the comma (,) with a dot (.) whenever it is used as a decimal delimiter: “As I said, 1,5 is smaller than 7,8!”

Simply replacing all commas with dots does not work, as there is also a comma after “said” that must stay. We can easily identify the correct commas using grep:

$ echo 'As I said, 1,5 is smaller than 7,8!' | grep '[0-9],[0-9]'
As I said, 1,5 is smaller than 7,8!

The above code identifies all commas between digits. To now replace these with dots, however, we need to make sure that the digits before and after the comma remain untouched. One way to do this is to refer back to the digits found during the search in the replacement string:

$ echo 'As I said, 1,5 is smaller than 7,8!' | sed -E 's/([0-9]),([0-9])/\1.\2/g'
As I said, 1.5 is smaller than 7.8!

Note: Just as with grep, you need to use the -E option with sed to allow for extended regex features.
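Back references in the replacement string do not have to appear in their original order. As a further sketch on a made-up sentence, the three parts of a date written as YYYY-MM-DD can be captured and put back together as DD.MM.YYYY:

```shell
# Made-up example: capture year, month and day, then emit them in reverse order
reordered=$(echo 'Sampled on 2024-03-15.' | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3.\2.\1/g')

echo "$reordered"   # Sampled on 15.03.2024.
```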

Let’s look at another example. Imagine you have a list of phone numbers in the format “+41761234567”, but you would like to display them nicely as “+41 76 123 45 67”, that is, with spaces at the right places (for Swiss numbers, anyway). How can we do that? Actually, it is rather easy when using back-referencing.

$ echo '+41761234567' | sed -E 's/(\+[0-9]{2})([0-9]{2})([0-9]{3})([0-9]{2})([0-9]{2})/\1 \2 \3 \4 \5/g'
+41 76 123 45 67

This may seem cryptic at first, so let’s decompose the regex used. The basic idea is to break the search pattern down into five groups matching the bits of the phone number to be printed together. The first bit is identified with the regex group (\+[0-9]{2}), which looks for a plus sign followed by two digits. (Note that the + needs to be escaped.) The second bit consists of the next two digits, identified with ([0-9]{2}), followed by a bit with three digits identified with ([0-9]{3}), and so forth. When replacing the string, we can then refer back to these groups.
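When such a command is applied to a whole list, it may be safer to combine the groups with the anchors ^ and $, so that only lines consisting of exactly one complete number are reformatted. This is a sketch on a made-up list with one number per line; the second entry is deliberately too short and should be left untouched:

```shell
# Made-up list: one complete and one incomplete number
numbers=$(printf '+41761234567\n+4176123\n')

# ^ and $ require the five groups to cover the entire line
formatted=$(printf '%s\n' "$numbers" | sed -E 's/^(\+[0-9]{2})([0-9]{2})([0-9]{3})([0-9]{2})([0-9]{2})$/\1 \2 \3 \4 \5/')

echo "$formatted"
# +41 76 123 45 67
# +4176123
```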

Note that groups are always numbered from left to right, with their position defined by the opening parenthesis (, even when nested. To illustrate that, consider a case in which we aim to put parentheses around every stretch of a DNA sequence in which a base is repeated three or more times.

Let us again begin by identifying such patterns with grep:

$ echo 'ACCGTGCTTTGCAAACGTACGTTTTCGACAACCTATAAAAG' | grep -E '([ACGT])\1{2,}'
ACCGTGCTTTGCAAACGTACGTTTTCGACAACCTATAAAAG

Here we need to use a back reference to ensure that whatever character was found first is found at least two more times. To now put parentheses around these occurrences, we also need to refer back to the entire pattern found.

$ echo 'ACCGTGCTTTGCAAACGTACGTTTTCGACAACCTATAAAAG' | sed -E 's/(([ACGT])\2{2,})/(\1)/g'
ACCGTGC(TTT)GC(AAA)CGTACG(TTTT)CGACAACCTAT(AAAA)G

Note that when doing so, the group indicating the entire pattern is the one with the first opening parenthesis (, so it will be group number one, while the inner group ([ACGT]) will be number two.
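The numbering of nested groups can be made visible by printing both groups in the replacement. In this sketch, the short input sequence and the square brackets with the | separator in the replacement are chosen for illustration only:

```shell
# \1 is the outer group (the whole repeat), \2 the inner group (a single base)
numbering=$(echo 'GAAAC' | sed -E 's/(([ACGT])\2{2,})/[\1|\2]/')

echo "$numbering"   # G[AAA|A]C
```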

9.4.7 That’s it!

Well done, you finished this section! If you had never heard of regular expressions before, that was probably quite challenging. Make sure to execute the code blocks in the tutorial, as things will become clearer that way, and the exercises will hopefully help you implement some of the concepts you just learned about.