10.4 Extended matching

There are several ways to search for a pattern, depending on whether you want lines in which a substring occurs anywhere or lines in which a column matches the pattern exactly. These are the basic options:

Return all lines that match the substring anywhere in the line:

$ awk '/ID/' BanthracisProteome.txt | wc -l
50532

Return the lines where the first column (or word) exactly matches the complete search pattern:

$ awk '$1=="ID"' BanthracisProteome.txt | wc -l
5493

Return the lines where the second column exactly matches the complete search pattern:

$ awk '$2=="ID"' BanthracisProteome.txt | wc -l
0
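
The three forms above can be compared side by side on a small input. This is a minimal sketch: the three-line input and the variable names are made up for illustration and are not taken from BanthracisProteome.txt.

```shell
# Hypothetical three-line input, invented for this illustration.
toy='ID BA_0001
DE NUCLEOTIDE binding
AC IDX 42'

# Substring anywhere in the line: "ID" occurs in every line
# (as a field, inside "NUCLEOTIDE", and inside "IDX").
any=$(printf '%s\n' "$toy" | awk '/ID/' | wc -l)

# First column is exactly "ID": only the first line qualifies.
first=$(printf '%s\n' "$toy" | awk '$1 == "ID"' | wc -l)

# Second column is exactly "ID": no line qualifies ("IDX" is not an exact match).
second=$(printf '%s\n' "$toy" | awk '$2 == "ID"' | wc -l)

echo "anywhere: $any, first column: $first, second column: $second"
```

On this input the counts are 3, 1 and 0, which mirrors the pattern seen with the proteome file: the substring search is the most permissive, and the exact comparisons only accept a column that equals the whole pattern.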

To search for a pattern anywhere within a specific column, use the ~ operator. Be aware that this produces partial matches: the search pattern “ID” is also found within the string “NUCLEOTIDE”, and the pattern “sample1” also matches the strings “sample11”, “sample12”, “sample13”, …:

$ awk '$2 ~ /ID/' BanthracisProteome.txt | wc -l
40676
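
If partial matches are not wanted, one option is to anchor the regular expression with ^ (start) and $ (end) so that the whole column has to match. The sketch below assumes a hypothetical two-column sample sheet that is not part of the tutorial data.

```shell
# Hypothetical sample sheet, invented for this example.
samples='run1 sample1
run2 sample11
run3 sample12'

# Unanchored: "sample1" is also found inside sample11 and sample12.
loose=$(printf '%s\n' "$samples" | awk '$2 ~ /sample1/ {print $1}')

# Anchored with ^ and $: the entire second column must be "sample1".
strict=$(printf '%s\n' "$samples" | awk '$2 ~ /^sample1$/ {print $1}')

echo "unanchored:" $loose
echo "anchored:" $strict
```

The unanchored search reports all three runs, while the anchored one reports only run1. For a fixed string, the anchored form gives the same result as the exact comparison $2 == "sample1".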

To extract lines based on multiple search patterns at once, you can use && (and) and || (or).

For example, the following command returns all lines that have either exactly “ID” in the first column or the string “ID” anywhere in the second column:

$ awk '$1 == "ID" || $2 ~ /ID/' BanthracisProteome.txt | wc -l
46163

The next command, in contrast, returns only the lines that have exactly “ID” in the first column and also contain the string “ID” anywhere in the second column:

$ awk '$1 == "ID" && $2 ~ /ID/' BanthracisProteome.txt | wc -l
6
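
The counts reported above are consistent with each other: 5493 + 40676 - 6 = 46163, because a line that satisfies both conditions is returned only once by ||. The same relationship can be checked on a small input. This is a sketch only, with a hypothetical four-line input invented for the purpose.

```shell
# Hypothetical four-line input, invented for this example.
toy='ID PEPTIDE_1
ID BA_0002
DE NUCLEOTIDE
AC P12345'

# Each condition on its own.
a=$(printf '%s\n' "$toy" | awk '$1 == "ID"' | wc -l)
b=$(printf '%s\n' "$toy" | awk '$2 ~ /ID/' | wc -l)

# Combined with || (at least one condition holds) and && (both hold).
either=$(printf '%s\n' "$toy" | awk '$1 == "ID" || $2 ~ /ID/' | wc -l)
both=$(printf '%s\n' "$toy" | awk '$1 == "ID" && $2 ~ /ID/' | wc -l)

# either should equal a + b - both.
echo "a=$((a)) b=$((b)) either=$((either)) both=$((both))"
```

Here each single condition matches two lines, one line matches both, and three lines match at least one, so 2 + 2 - 1 = 3.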

You can also use regular expressions in your search patterns. As in the tutorial on regular expressions (chapter 9.4), the same results can be obtained by matching the regex pattern in awk:

$ echo -e 'TACACACTTTAGAGTTTACAGACTTT' | awk '$1 ~ /(A[CG])+/'
TACACACTTTAGAGTTTACAGACTTT
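
Note that awk prints the whole line as soon as the pattern matches somewhere in the column; it does not show which part matched. To see the matching part, one possibility is awk's built-in match() function, which sets the variables RSTART and RLENGTH. The following is a minimal sketch using the same sequence; the variable names seq and first_hit are chosen here for illustration.

```shell
seq='TACACACTTTAGAGTTTACAGACTTT'

# match() returns the start position of the leftmost match (0 if none)
# and sets RSTART and RLENGTH; substr() then cuts out the matched part.
first_hit=$(echo "$seq" | awk 'match($1, /(A[CG])+/) { print substr($1, RSTART, RLENGTH) }')

echo "$first_hit"
```

For this sequence the leftmost match is ACACAC, starting at the second base. Later matches in the same line (such as AGAG) are not reported by a single call to match().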