9.7 Exercises

Note: Some of these exercises require the B. anthracis proteome file BanthracisProteome.txt. You can download it as follows:

$ # download the zipped file using wget
$ wget --compression=auto https://bitbucket.org/wegmannlabteaching/bash_lecture/raw/master/Files/BanthracisProteome.txt.gz
$ # unzip the file
$ gunzip BanthracisProteome.txt.gz

9.7.1 Warm up

See Section 12.0.12 for solutions.

Create a file DNA.txt and, using vim, add a random sequence of at least 20 letters drawn from A, G, C and T.
Print the content of DNA.txt to the screen, but replace all As with Gs on the fly.
Again replace all As with Gs on the fly, but add the output to DNA.txt as an extra line. Print the content of DNA.txt to the screen to check.

9.7.2 grep

See Section 12.0.13 for solutions.

Extract all lines containing “ID” from BanthracisProteome.txt and look at them using less.
Extract the first 10 lines containing “123” from BanthracisProteome.txt.
Extract the last 10 lines that do not contain “123” from BanthracisProteome.txt.
Count how many lines of the file BanthracisProteome.txt contain the pattern “out”, ignoring the first 100,000 lines.
Count how many lines of BanthracisProteome.txt contain the pattern “grep”, regardless of the case (hence also allow for “Grep”, “gREp”, “GREP”, …).
Write a script “extract.sh” that extracts the first and last 10 lines from a file (argument 1) that match a pattern (argument 2) and writes them to a new file (argument 3). Use this script to get the first and last 10 lines of BanthracisProteome.txt containing “ab” (case insensitive) followed by a digit (e.g. ab7 or aB5) and store these lines in “abc.txt”.
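As a hint for the script exercise, the following sketch shows how positional arguments can be used. It uses a shell function, but a script file reads its arguments in the same way through $1, $2 and so on. The function name count_matches and the toy file are made up for illustration and are not part of the lecture material.

```shell
# Hypothetical helper: count_matches FILE PATTERN
# Counts the lines of FILE that match PATTERN, ignoring case.
count_matches() {
    # "$1" is the first argument (file), "$2" the second (pattern)
    grep -c -i -- "$2" "$1"
}

# Toy input file (assumption: any small text file would do)
tmpfile=$(mktemp)
printf 'ab7 first\nxyz\nAB5 second\nabc\n' > "$tmpfile"

# Count lines containing "ab" followed by a digit, in any case
n=$(count_matches "$tmpfile" 'ab[0-9]')
echo "$n"
```

Quoting "$1" and "$2" keeps file names and patterns that contain spaces intact.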

9.7.3 tr and sed

See Section 12.0.14 for solutions.

Extract the first 7 lines from BanthracisProteome.txt that contain the pattern “F*R”, where * stands for any capital letter. Before printing, replace all capital letters with lowercase ones. Find one solution with tr and one with sed.
Print the first 10 lines of BanthracisProteome.txt to screen, but squeeze all spaces (make sure that multiple spaces following each other are printed as a single one).
Extract all lines from BanthracisProteome.txt that contain ‘Reviewed’ and store them in a new file reviewed.txt. When doing so, delete all semicolons ‘;’ from all lines and squeeze all spaces.
Replace all ‘Reviewed’ with ‘tscheggt’ inside the file ‘reviewed.txt’.
Use sed to replace back ‘tscheggt’ with ‘reviewed’ inside ‘reviewed.txt’, but only on lines 5-9.
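A sed substitution can be restricted to a range of lines by prefixing the command with an address such as 2,3. The sketch below illustrates this on made-up toy input (the words cat and dog are placeholders, not part of the exercise data).

```shell
# Toy input: five lines, each containing the word "cat"
toy=$(printf 'cat 1\ncat 2\ncat 3\ncat 4\ncat 5\n')

# Substitute only on lines 2 to 3; all other lines pass through unchanged
result=$(printf '%s\n' "$toy" | sed '2,3s/cat/dog/')
echo "$result"
```

With GNU sed, the option -i applies such a command to a file in place rather than printing the result to the screen.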

9.7.4 Sorting

See Section 12.0.15 for solutions.

The file “reviewed.txt” contains the name and length of all proteins that have been manually reviewed. Use cut to extract all names, sort them and just look at the last 10.
Use this strategy to find the length of the longest and shortest of all reviewed proteins.
Check if there are any reviewed proteins of the exact same length. How many are there?
What is the most common length and how many reviewed proteins have this length?
Use seq to write the numbers from 1 to 100 to a file named ‘numbers.txt’. Extract the third column of all lines in ‘BanthracisProteome.txt’ that contain “KEGG” to a file named ‘KEGG.txt’. Paste these two files into a new file named ‘combined.txt’.
Get the content of combined.txt in shuffled order (use the tool shuf) and write the first 50 lines of it to a new file named shuff.txt. Join it with combined.txt on the first column. Each line should then contain the same KEGG ID twice. Note that the tool join expects files to be sorted in alphabetic order (the default of sort).
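The note on join can be illustrated with a small sketch on made-up data. The file names left.txt and right.txt and their contents are assumptions for illustration only. Both inputs are sorted alphabetically on the join column first, so that "10" and "100" come before "9".

```shell
# Toy data (hypothetical): two small files keyed on the first column
tmpdir=$(mktemp -d)
printf '10 b\n9 a\n100 c\n' > "$tmpdir/left.txt"
printf '100 z\n9 x\n10 y\n' > "$tmpdir/right.txt"

# join expects both inputs sorted on the join field in the same
# alphabetic order, so sort on the first column before joining
sort -k1,1 "$tmpdir/left.txt"  > "$tmpdir/left.sorted"
sort -k1,1 "$tmpdir/right.txt" > "$tmpdir/right.sorted"

# Join on column 1 of both files
joined=$(join -1 1 -2 1 "$tmpdir/left.sorted" "$tmpdir/right.sorted")
echo "$joined"
```

Sorting numerically with sort -n instead would order the keys as 9, 10, 100, which is not the order join expects.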

9.7.5 Regular expressions

See Section 12.0.16 for solutions.

Extract all lines from BanthracisProteome.txt that contain “SEQUENCE” followed by an arbitrary number of white spaces, a number, another white space, and “AA”. An example line would be “SQ SEQUENCE 205 AA; …”.
Extract all lines from BanthracisProteome.txt with GO terms “GO:0009000” to “GO:0009999”.
Extract all lines from BanthracisProteome.txt that contain a proper date of the form “01-JUN-2003”.
Write a regex to extract all proper email addresses of the form “firstname.lastname@something.com”. It should match “groovy.gorilla@jungle.com” or “hey.dude@cool.com” but not “secret@cia.com”, “blah.blah.internet” or “no.domain@short”.
Write a regular expression that extracts names from text such as “My dear Ronald Fisher, I hope you enjoyed reading the book about Thomas Bayes the other day. Kind regards, Gertrude Cox”. Names are identified as two consecutive words, each starting with a capital letter and separated by a space.
Consider text containing geographical coordinates in the format 46.9462873,7.4446943. Write a sed command that identifies such coordinates and replaces them with the same coordinates but shown with the degree symbol as 46.9462873°,7.4446943°. Your code should work correctly in sentences such as “To refresh during a boring session, I jumped 234.26 meters from 46.9462873,7.4446943 to 46.9450720,7.4462414.”
Replicate the behavior of tr -s '[a-z]' using a search-replace command with sed. Use the sentence “My name is Kaa, so trusssst in meeeee” to test.
Use sed to put square brackets [] around all runs of repeated lower-case characters such as “aaa” or “bb” or “ccccc”. Also test with “My name is Kaa, so trusssst in meeeee”.
Use sed to replace all parentheses () within equations with [] if they contain nested parentheses () within them. Your code should thus change “2.5x(y(6.0 + z) + 6.3) - 7.2(5.1 - a) + (1.8 - b)(3.2 + (c - 1.9))” to “2.5x[y(6.0 + z) + 6.3] - 7.2(5.1 - a) + (1.8 - b)[3.2 + (c - 1.9)]”.
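Several of these exercises rely on capture groups and back-references. As a reminder of the mechanics, here is a minimal sketch on made-up input (the strings are illustrative and unrelated to the exercise solutions). A group in parentheses can be reused as \1, \2 and so on, either in the replacement or within the pattern itself.

```shell
# Reuse capture groups in the replacement: swap "Last, First" to "First Last"
# (-E enables extended regular expressions, so groups are written as (...))
swapped=$(echo 'Fisher, Ronald' | sed -E 's/([A-Za-z]+), ([A-Za-z]+)/\2 \1/')
echo "$swapped"

# Reuse a capture group inside the pattern: keep only lines in which
# some character is immediately followed by itself
doubled=$(printf 'hello\nworld\n' | grep '\(.\)\1')
echo "$doubled"
```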