10.1 Basic structure

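awk programs are usually enclosed in single quotes so that the shell does not interpret characters such as $. A program consists of one or more rules of the form pattern {action}, where the pattern specifies when the action should be performed. Either part may be omitted: without a pattern, the action is performed on every line; without an action, each matching line is printed.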

In the following example, we are looking for the pattern “Dan” in the file BanthracisProteome.txt. Since no action is given, each line containing the pattern is printed. As with grep, the search is case sensitive (a search for “DAN” and one for “Dan” will give different results):

$ awk '/Dan/'  BanthracisProteome.txt
RA   Escuyer V., Duflot E., Sezer O., Danchin A., Mock M.;
RA   Escuyer V., Duflot E., Mock M., Danchin A.;
RA   Danchin A.;
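
If case should be ignored, one portable option is to lowercase each line before matching. The sketch below is only an illustration: the input lines are made up (they are not taken from BanthracisProteome.txt) and the variable names are arbitrary; tolower() is a standard awk function.

```shell
# Made-up input lines, not taken from BanthracisProteome.txt.
input='RA   Danchin A.;
RA   DANIEL B.;
RA   Mock M.;'

# Lowercase each line, then test it against a lowercase pattern.
# Matching lines are printed unchanged because no action is given.
matches=$(printf '%s\n' "$input" | awk 'tolower($0) ~ /dan/')
printf '%s\n' "$matches"
```

Both the “Danchin” and the “DANIEL” line are printed, while the “Mock” line is not.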

Columns (fields) are addressed with the $ sign: $1, $2 and $3 refer to the first, second and third column, respectively. By default, any run of whitespace (spaces or tabs) separates columns, so you don’t need to specify whether your field delimiter is a tab or multiple spaces. $0 stands for the whole line. Hence, in the example below, {print $0} is equivalent to the default, where no action is given:

$ awk '/Dan/ {print $0}' BanthracisProteome.txt
RA   Escuyer V., Duflot E., Sezer O., Danchin A., Mock M.;
RA   Escuyer V., Duflot E., Mock M., Danchin A.;
RA   Danchin A.;

$ awk '/Dan/ {print $1,$3}' BanthracisProteome.txt
RA V.,
RA V.,
RA A.;
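
When the number of columns varies from line to line, the built-in variable NF (the number of fields in the current line) can help, and $NF then refers to the last column. A minimal sketch, using one of the lines printed above as input (the shell variable names are arbitrary):

```shell
# One line in the style of the output above.
line='RA   Escuyer V., Duflot E., Mock M., Danchin A.;'

# NF holds the number of fields in the current line; $NF is the last field.
result=$(printf '%s\n' "$line" | awk '{print NF, $NF}')
echo "$result"   # prints: 9 A.;
```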

awk understands the basic arithmetic operators +, -, * and / (addition, subtraction, multiplication and division) as well as % for modulo (the remainder of a division). In a print statement, a comma separates the output items with a space; to concatenate values without a separator, write them next to each other with a space instead of a comma:

$ echo "1 2 3 4 5" | awk '{print $1, $2*$3, $3-$4, $5%$2, $1 $3}'
1 6 -1 1 13
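
The difference between the comma and plain juxtaposition becomes easier to see when the output field separator (the built-in variable OFS) is changed. The sketch below also shows that division is not integer division; the numbers 7 and 2 and the shell variable names are chosen only for illustration.

```shell
# A comma inserts OFS (a space by default, here set to "-");
# values written side by side are concatenated with nothing in between.
joined=$(echo "1 2 3 4 5" | awk 'BEGIN {OFS="-"} {print $1, $3, $1 $3}')
echo "$joined"     # prints: 1-3-13

# Division is floating point; int() truncates the result.
division=$(awk 'BEGIN {print 7/2, int(7/2), 7%2}')
echo "$division"   # prints: 3.5 3 1
```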

As you can see, awk can not only read files: you can also pipe the standard output of another command to awk, and you can pipe awk’s output to other commands:

$ echo "1 2 3 4 5" | awk '{print $1, $2*$3, $3-$4, $5%$2, $1 $3}' | sed 's/ / and /g'
1 and 6 and -1 and 1 and 13

awk can execute actions before the first line and after the last line of a file or stream with the BEGIN and END blocks. As each of them is executed only once, they are good places to define variables (BEGIN) or to print a final calculation (END). Note that each print statement ends its output with a newline, so for readability in this example we translate newlines to spaces at the end:

$ echo "sed awk" | awk 'BEGIN {print "The parrot"} {print} END {print "!"}' | tr '\n' ' '
The parrot sed awk !

Note: {print} without specification prints the default, which is $0 (the whole line).
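
To make the order of execution more visible, the following sketch numbers each input line with the built-in variable NR (the number of lines read so far) and prints the total in the END block. The three input words are made up for illustration.

```shell
# BEGIN runs before any input is read, the middle rule runs once per line,
# and END runs after the last line, when NR equals the total line count.
report=$(printf 'sed\nawk\ngrep\n' |
  awk 'BEGIN {print "tools:"} {print NR ": " $0} END {print NR " lines"}')
echo "$report"
```

This prints “tools:”, then the three numbered lines, then “3 lines”.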

You can also split columns on other delimiters with the -F option:

$ echo "I-don't-like-awk-exercises-even-though-they're-helpful." | awk -F '-' 'BEGIN {print "But"} {print $1, $3, $4",", $7"."}'  | tr '\n' ' '
But I like awk, though.

The argument of -F can also be a regular expression, so a bracket expression allows multiple delimiters:

$ echo "Why-does_this-sentence-look_so_funny-?" | awk -F '[-_]' '{print $1, $3, $4, $7, $8}'
Why this sentence funny ?
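
The delimiter can also be set inside the program through the built-in variable FS, typically in a BEGIN block so that it is in place before the first line is split. A small sketch with a made-up colon-separated record, showing that both forms give the same result:

```shell
# A made-up colon-separated record.
record='alice:x:1001:/home/alice'

# -F on the command line and FS set in a BEGIN block are two ways
# of choosing the input field separator.
with_flag=$(echo "$record" | awk -F ':' '{print $1, $4}')
with_fs=$(echo "$record" | awk 'BEGIN {FS=":"} {print $1, $4}')
echo "$with_flag"   # prints: alice /home/alice
echo "$with_fs"     # prints: alice /home/alice
```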

You can also sum over columns with the += operator. In this example, we add every value in column 1 to the variable sum (an unset variable counts as 0 in arithmetic) and print the result in the END block:

$ echo -e "1 2\n2 3\n5 6\n10 11"
1 2
2 3
5 6
10 11
$ echo " "
 
$ echo -e "1 2\n2 3\n5 6\n10 11" | awk '{sum+=$1}; END {print "sum of first column is " sum}'
sum of first column is 18
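
The same idea extends to several columns and to averages, since NR holds the number of lines read once the END block is reached. A minimal sketch on the same four lines, using printf instead of echo -e for portability (the names s1, s2, data and stats are arbitrary):

```shell
# The same four lines as above.
data=$(printf '1 2\n2 3\n5 6\n10 11\n')

# Accumulate both columns line by line; in END, NR is the number of lines,
# so s1/NR is the mean of the first column.
stats=$(printf '%s\n' "$data" | awk '{s1 += $1; s2 += $2} END {print s1, s2, s1/NR}')
echo "$stats"   # prints: 18 22 4.5
```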