10.6 Exercises

10.6.1 At the beginning it feels awk-ward

See Section 12.0.17 for solutions.

  1. Use seq and tr to write all numbers from 1 to 1000 on individual lines into a file named numbers.txt.

  2. Use awk to extract all lines from numbers.txt where the first column is between 107 and 121.

  3. Write an awk command that replicates grep "100" numbers.txt.

  4. Use awk on numbers.txt to create a new file moreNumbers.txt that contains the original number in the first column and the log of that number in the second column.

  5. Extend your command from above to write the new file moreNumbers.txt with three columns: 1) the original number, 2) the log of that number and 3) either “even” or “odd” depending on whether the original number was even or odd.

  6. Write an awk command that counts the number of lines of a file. Test it on moreNumbers.txt and compare to wc -l.

  7. Use awk to calculate the mean of the first column and the mean of the second column of moreNumbers.txt. Both means should be printed at the end.

  8. Restrict the calculation of the mean to rows for which the third column reads “odd” only.

  9. Write an awk command that prints, for every line of moreNumbers.txt except the first, the difference between the second-column value of that line and that of the preceding line. The first line of your output should thus be the difference between the second column of the second and the first line.

  10. Write an awk command that prints the values of the second column of moreNumbers.txt ten per line. The first line should thus consist of the values of the first ten lines, the second line of those of lines 11 through 20, and so forth.
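
As a warm-up before attempting the exercises, the two awk building blocks they rely on most (a pattern followed by an action, and a variable that is accumulated line by line and reported in the END block) can be sketched on a toy input. The input and the variable names toy, selected and total are made up for illustration and are not part of the exercise files; this is a minimal sketch, not a solution to any of the exercises.

```shell
# Toy input (hypothetical, not one of the exercise files): three lines, two columns
toy=$(printf '1 a\n2 b\n3 a\n')

# pattern { action }: print the first column of all lines whose second column is "a"
selected=$(printf '%s\n' "$toy" | awk '$2 == "a" { print $1 }')
echo "$selected"

# accumulate a value over all lines and print it once, in the END block
total=$(printf '%s\n' "$toy" | awk '{ sum += $1 } END { print sum }')
echo "$total"
```

The first command prints the lines 1 and 3, the second prints 6. Uninitialized awk variables such as sum start out as zero (or the empty string), so no explicit initialization is needed here.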

10.6.2 awk on the Banthracis proteome

See Section 12.0.18 for solutions.

Note: The following exercises require the Banthracis proteome BanthracisProteome.txt. You may get the file as follows:

$ # download the zipped file using wget
$ wget --compression=auto https://bitbucket.org/wegmannlab/bash_lecture/raw/master/Files/BanthracisProteome.txt.gz
$ # unzip the file
$ gunzip BanthracisProteome.txt.gz

  1. Extract all lines from BanthracisProteome.txt that contain “ID” in the first column and save them in a new file “prots.txt”.

  2. Use awk to calculate the percentage of the proteins in prots.txt that are “Reviewed”.

  3. Write a BASH script that does the same thing using grep, wc and bc.

  4. Use awk to get the total length of all proteins together in amino acids.

  5. Use awk to print a file “len.txt” containing only the name of the protein and its length in amino acids (without the AA). Then, use awk to print a file “status.txt” containing only the name of the protein and its status (e.g. “Reviewed”) for all proteins that contain only letters (no numbers) in their name. Finally, use join to create a file called len_status.txt, containing three columns: the protein name, its length and its status.

  6. Use awk to create a new file “seq.txt” that contains one line per protein with three columns: 1) the name, 2) the length, 3) the amino acid sequence (as a single column without spaces). Hint: Remember that awk reads your file line by line. Use this to store values in a variable until a certain condition is met.

  7. Make sure that the output of exercise 6 (seq.txt) is correct. To do so, test whether the length of the protein sequence equals the number in column 2. If a line fails this sanity check, print “ERR: lengths differ for <the failed line>”. Use the END block to add a last line, stating either “file checked” or “ERR: something went wrong”. (You can correct the length of failed proteins and repeat to see if your sanity check works both ways.)
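
The hint in exercise 6 (store values in a variable until a certain condition is met) can be illustrated on a toy input. The layout below is hypothetical and deliberately differs from that of BanthracisProteome.txt: a line starting with ">" opens a record and the lines that follow hold its pieces. The names toy, joined, name and buf are chosen for illustration only. This is a minimal sketch of the buffering idea, not a solution to the exercise.

```shell
# Toy input (hypothetical layout, NOT the format of BanthracisProteome.txt):
# a line starting with ">" opens a record, the following lines hold its pieces.
toy=$(printf '>first\nAB CD\nEF\n>second\nGH\n')

joined=$(printf '%s\n' "$toy" | awk '
    /^>/ {                                   # a new record starts
        if (name != "") print name, buf      # flush the previous record, if any
        name = substr($0, 2)                 # record name without the leading ">"
        buf = ""                             # reset the buffer
        next                                 # skip the rule below for this line
    }
    {
        gsub(/ /, "")                        # remove spaces within the piece
        buf = buf $0                         # append the piece to the buffer
    }
    END { if (name != "") print name, buf }  # flush the last record
')
echo "$joined"
```

This prints "first ABCDEF" and "second GH" on two lines. The last record has to be flushed in the END block because no further ">" line follows it to trigger the print.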