9.1 grep
grep is an essential command when working in Bash: it searches for text in a given file. Specifically, grep prints all lines that contain a match to the specified pattern.
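Random fact: the name grep comes from the command g/re/p in the ed text editor, which stands for "globally search for a regular expression and print"; it is often expanded as Global Regular Expression Print.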
The syntax of grep is simple: type grep followed by the pattern you are looking for and the name of the file to search. For every line that contains the pattern, grep prints the entire line. If colour output is enabled (for example through grep --color=auto, which many systems set up as an alias), the matching text is highlighted, typically in red.
Let’s look at an example. Using the command below, we print those of the first ten lines of the Banthracis proteome file (the default output of head) that contain the string ‘Ba’. grep is case-sensitive, so it will not print lines that contain only ‘BA’ or ‘ba’.
$ head BanthracisProteome.txt | grep Ba
OS Bacillus anthracis.
OC Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus;
If you are looking for the entire name ‘Bacillus anthracis’, you can use grep as well, but you need to put the pattern in quotes so that the shell passes both words to grep as a single pattern.
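A minimal sketch of the quoted form. Because the proteome file may not be at hand, the snippet builds a small stand-in file: two lines are copied from the output above and the third line is made up for illustration. The variable names sample and quoted are assumptions of this sketch.

```shell
# Hypothetical stand-in for BanthracisProteome.txt.
sample=$(mktemp)
printf '%s\n' \
  'OS Bacillus anthracis.' \
  'OC Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus;' \
  'DE Bacillus subtilis (made-up line for illustration).' > "$sample"

# Quoted: both words, including the space, reach grep as ONE pattern.
quoted=$(grep 'Bacillus anthracis' "$sample")
echo "$quoted"

# Unquoted, the shell would split the words: grep would search for the
# pattern "Bacillus" in a file called "anthracis" and in the sample file.
#   grep Bacillus anthracis "$sample"

rm -f "$sample"
```

Only the OS line is printed here; the other two lines contain ‘Bacillus’ but not the full two-word name.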
Often, we are interested not only in whether a specific pattern is present in a file, but also in how many times it occurs (that is, on how many lines of the file). Because grep prints only the lines that match the pattern, we can count them by piping the output of grep to wc -l.
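As a rough sketch, counting matching lines with a pipe could look like this. The stand-in file and the variable names sample and n_lines are assumptions; the ID line is made up.

```shell
# Hypothetical stand-in for BanthracisProteome.txt.
sample=$(mktemp)
printf '%s\n' \
  'OS Bacillus anthracis.' \
  'OC Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus;' \
  'ID EXAMPLE_ENTRY (made-up line)' > "$sample"

# grep prints the matching lines; wc -l counts them.
n_lines=$(grep Ba "$sample" | wc -l)
echo "$n_lines"

rm -f "$sample"
```

The OC line contains ‘Ba’ several times but is counted once, because the pipe counts matching lines rather than individual matches.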
Note: the -l option of wc specifies that we are only interested in the number of lines. Without it, wc prints the number of lines, the number of words and the number of bytes.
grep also has a built-in line-count option, the -c flag, which gives the same result as piping the output of grep to wc -l.
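The following sketch compares the two ways of counting on a small stand-in file (the file contents and the names sample, with_c and with_wc are assumptions made for illustration).

```shell
# Hypothetical stand-in for BanthracisProteome.txt.
sample=$(mktemp)
printf '%s\n' \
  'OS Bacillus anthracis.' \
  'OC Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus;' \
  'ID EXAMPLE_ENTRY (made-up line)' > "$sample"

# Built-in count of matching lines.
with_c=$(grep -c Ba "$sample")

# The same count obtained with a pipe.
with_wc=$(grep Ba "$sample" | wc -l)

echo "grep -c: $with_c"
echo "grep | wc -l: $with_wc"

rm -f "$sample"
```

Besides being shorter, grep -c avoids starting a second process.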
grep has many more options; a few are shown below. Remember that you can always consult the manual by typing man grep.
If you want to keep only the lines that do NOT match a pattern, you can use grep -v.
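A small sketch of the inverted match, again on a stand-in file whose ID line is made up (sample and kept are names chosen for this example):

```shell
# Hypothetical stand-in for BanthracisProteome.txt.
sample=$(mktemp)
printf '%s\n' \
  'OS Bacillus anthracis.' \
  'OC Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus;' \
  'ID EXAMPLE_ENTRY (made-up line)' > "$sample"

# -v inverts the match: print only the lines WITHOUT 'Ba'.
kept=$(grep -v Ba "$sample")
echo "$kept"

rm -f "$sample"
```

Only the ID line is printed, since it is the only line that does not contain ‘Ba’.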
Naturally, the number of lines that contain a pattern plus the number of lines that do not contain it equals the total number of lines in the file:
$ tot=$(wc -l < BanthracisProteome.txt)        # total number of lines
$ id=$(grep -c ID BanthracisProteome.txt)      # lines containing ID
$ notid=$(grep -cv ID BanthracisProteome.txt)  # lines not containing ID
$ sum=$((id + notid))                          # shell arithmetic, no bc needed
$ if [ "$sum" -eq "$tot" ]; then echo "Indeed!"; fi
Indeed!
If you only want the first few matching lines, you can use grep -m followed by a number. grep then stops after printing that many matching lines:
$ grep -m3 'Q[1-9]' BanthracisProteome.txt
AC Q81UJ9; Q6I2S8; Q6KWK9;
DR ProteinModelPortal; Q81UJ9; -.
DR IntAct; Q81UJ9; 2.
This gives the same output as
$ grep 'Q[1-9]' BanthracisProteome.txt | head -n3
AC Q81UJ9; Q6I2S8; Q6KWK9;
DR ProteinModelPortal; Q81UJ9; -.
DR IntAct; Q81UJ9; 2.
As mentioned before, grep is case-sensitive; however, you can use the -i flag to make it case-insensitive.
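To illustrate the difference, here is a short sketch on a stand-in file. The first line is taken from the output shown earlier; the two xx lines are made up, and the names sample, sensitive and insensitive are assumptions of this example.

```shell
# Hypothetical stand-in for BanthracisProteome.txt.
sample=$(mktemp)
printf '%s\n' \
  'OS Bacillus anthracis.' \
  'xx BACILLUS in upper case (made-up line)' \
  'xx bacillus in lower case (made-up line)' > "$sample"

# Default: case-sensitive, only the exact spelling 'Ba' matches.
sensitive=$(grep -c Ba "$sample")

# -i: case-insensitive, so 'Ba', 'BA' and 'ba' all match.
insensitive=$(grep -ci Ba "$sample")

echo "case-sensitive:   $sensitive"
echo "case-insensitive: $insensitive"

rm -f "$sample"
```

Without -i only the first line matches; with -i all three lines do.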