10.6 Exercises
10.6.1 At the beginning it feels awk-ward
See Section 12.0.17 for solutions.
1. Use seq and tr to write all numbers from 1 to 1000 on individual lines into a file named numbers.txt.
2. Use awk to extract all lines where the first column is between 107 and 121.
3. Write an awk command that replicates grep "100" numbers.txt.
4. Use awk on numbers.txt to create a new file moreNumbers.txt that contains the original number in the first column and the log of that number in the second column.
5. Extend your command from above to write the new file moreNumbers.txt with three columns: 1) the original number, 2) the log of that number and 3) either “even” or “odd” depending on whether the original number was even or odd.
6. Write an awk command that counts the number of lines of a file. Test it on moreNumbers.txt and compare to wc -l.
7. Use awk to calculate the mean of each of the first and second column of moreNumbers.txt. Both means should be printed at the end.
8. Restrict the calculation of the mean to rows for which the third column reads “odd” only.
9. Write an awk command that always prints the difference between the values in the second column of two consecutive lines of moreNumbers.txt. Omit printing anything for the first line. Your first line should thus be the difference between the second column of the second and first line.
10. Write an awk command that always prints ten values of the second column of moreNumbers.txt onto one line. The first line should thus consist of the values of the first ten lines, the second line those of lines 11 through 20 and so forth.
10.6.2 awk on the Banthracis proteome
See Section 12.0.18 for solutions.
Note: The following exercises require the Banthracis proteome BanthracisProteome.txt. You may get the file as follows:
$ # download the zipped file using wget
$ wget --compression=auto https://bitbucket.org/wegmannlab/bash_lecture/raw/master/Files/BanthracisProteome.txt.gz
$ # unzip the file
$ gunzip BanthracisProteome.txt.gz
1. Extract all lines from BanthracisProteome.txt that contain “ID” in the first column and save them in a new file “prots.txt”.
2. Use awk to calculate the percentage of them that are “Reviewed”.
3. Write a BASH script that does the same thing using grep, wc and bc.
4. Use awk to get the total length of all proteins together in amino acids.
5. Use awk to write a file “len.txt” containing only the name of the protein and its length in amino acids (without the AA). Then, use awk to write a file “status.txt” containing only the name of the protein and its status (e.g. “Reviewed”) for all proteins that contain only letters (no numbers) in their name. Finally, use join to create a file called len_status.txt containing three columns: the protein name, its length and its status.
6. Use awk to create a new file “seq.txt” that contains one line per protein with three columns: 1) the name, 2) the length, 3) the amino acid sequence (as a single column without spaces). Hint: Remember that awk reads your file line by line, and use this to store values in a variable until a certain condition is met.
7. Make sure that the output of exercise 6 (seq.txt) is correct. To do so, test if the length of the protein sequence equals the number in column 2. If a line fails this sanity check, print “ERR: lengths differ for <the failed line>”. Use the END block to add a last line, either stating “file checked” or “ERR: something went wrong”. (You can correct the length of failed proteins and repeat to see if your sanity check works both ways.)
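The hint in exercise 6 (store values in a variable until a certain condition is met) can be illustrated on a toy input without giving away the solution. The sketch below is only an illustration under stated assumptions: the input format (a line starting with ">" opens a record, the following lines are chunks of that record) and the function name join_chunks are hypothetical and are not the format of BanthracisProteome.txt.

```shell
# Hypothetical toy input, NOT the proteome format: a line starting with ">"
# opens a record, and the lines after it are chunks belonging to that record.
join_chunks() {
  awk '
    /^>/ {                                   # a new record starts
      if (name != "") print name, seq        # flush the previous record
      name = substr($0, 2); seq = ""         # remember the name, reset the buffer
      next
    }
    { gsub(/ /, ""); seq = seq $0 }          # strip spaces, append this chunk
    END { if (name != "") print name, seq }  # flush the last record
  '
}

printf '>a\nAB CD\nEF\n>b\nGH\n' | join_chunks
# prints:
# a ABCDEF
# b GH
```

The buffer is printed only when the next record begins, so the END block is needed to print the final record, which has no following record to trigger it.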