7.2 Looking at files

In this section, we will use the proteome of Bacillus anthracis (BanthracisProteome.txt) as an example file. You can download it as follows:

$ # download the zipped file using wget
$ wget https://bitbucket.org/wegmannlabteaching/bash_lecture/raw/master/Files/BanthracisProteome.txt.gz
$ # unzip the file (we will discuss more about gunzip in the next section)
$ gunzip BanthracisProteome.txt.gz

7.2.1 `more` and `less`

We have learned about cat, which prints the whole file to STDOUT. For large files, the output rushes past so quickly that it is cumbersome to scroll back up to the part we wanted to see.

Thus, for large files, there are better options. Meet more and less!

$ more BanthracisProteome.txt

more shows one screen at a time, and allows us to scroll. For this, press space to get to the next page, and return to move down one line.

Less is more - one of many Linux jokes. less is an improved version of more that lets us navigate in both directions:

$ less BanthracisProteome.txt

Use the arrow keys to move up and down line by line, space to jump down one page, and q to quit.

Of course, we can also open a file in a graphical text editor, such as gedit. However, these tools usually need to read the whole file before displaying, which might take a long time for large files.

7.2.2 Line count

To count the number of lines in a file, we can use the command wc:

$ wc BanthracisProteome.txt
  515365  2987124 24285426 BanthracisProteome.txt

The three numbers are the line, word and byte counts, respectively (for plain ASCII text, the byte count equals the number of characters). Usually, we are mostly interested in the number of lines in a file. For this, we can use the option -l:

$ wc -l BanthracisProteome.txt
515365 BanthracisProteome.txt
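
The individual counts can also be requested one at a time. The following is a minimal sketch on a small three-line file; sample.txt and the variable names are assumptions made for illustration and are not part of the course files. Reading the file from STDIN makes wc print the number only, without the file name.

```shell
# Create a small, hypothetical sample file with three lines
printf 'one two\nthree\nfour five six\n' > sample.txt

# Count lines (-l), words (-w) and bytes (-c) separately.
# Reading from STDIN suppresses the file name in the output.
n_lines=$(wc -l < sample.txt)
n_words=$(wc -w < sample.txt)
n_bytes=$(wc -c < sample.txt)

echo "lines:" $n_lines "words:" $n_words "bytes:" $n_bytes
```

Storing the count in a variable like this is handy when the number is needed later in a script.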

7.2.3 Head and tail

The commands head and tail are useful when we want to extract parts of a file, for example individual lines or a set of lines. head will start at the beginning of the file. By default, the first 10 lines of a file are printed:

$ head BanthracisProteome.txt
ID   3MGH_BACAN              Reviewed;         205 AA.
AC   Q81UJ9; Q6I2S8; Q6KWK9;
DT   26-APR-2004, integrated into UniProtKB/Swiss-Prot.
DT   01-JUN-2003, sequence version 1.
DT   13-NOV-2013, entry version 69.
DE   RecName: Full=Putative 3-methyladenine DNA glycosylase;
DE            EC=3.2.2.-;
GN   OrderedLocusNames=BA_0869, GBAA_0869, BAS0826;
OS   Bacillus anthracis.
OC   Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus;

With the flag -n, we can change the number of lines that are printed:

$ head -n 5 BanthracisProteome.txt
ID   3MGH_BACAN              Reviewed;         205 AA.
AC   Q81UJ9; Q6I2S8; Q6KWK9;
DT   26-APR-2004, integrated into UniProtKB/Swiss-Prot.
DT   01-JUN-2003, sequence version 1.
DT   13-NOV-2013, entry version 69.

$ head -n 5 BanthracisProteome.txt | wc -l
5

tail, on the other hand, starts at the end of the file. The same options apply here:

$ tail BanthracisProteome.txt
DR   OMA; KKKHRVR; -.
DR   OrthoDB; EOG6X3WBN; -.
DR   ProtClustDB; CLSK687586; -.
DR   BioCyc; ANTHRA:GBAA_PXO1_0081-MONOMER; -.
DR   BioCyc; BANT261594:GJ7F-5672-MONOMER; -.
PE   4: Predicted;
KW   Complete proteome; Plasmid; Reference proteome.
SQ   SEQUENCE   53 AA;  6111 MW;  8BB41F35C12C0D07 CRC64;
     MDKKKKQRVR RAIFIGVIAM IVSLYIGNEL QDRNGKSYAP AKYFETGTKL ISY
//

$ tail -n 5 BanthracisProteome.txt
PE   4: Predicted;
KW   Complete proteome; Plasmid; Reference proteome.
SQ   SEQUENCE   53 AA;  6111 MW;  8BB41F35C12C0D07 CRC64;
     MDKKKKQRVR RAIFIGVIAM IVSLYIGNEL QDRNGKSYAP AKYFETGTKL ISY
//

We can chain these two commands to print e.g. lines 8-10 only:

$ head -n 10 BanthracisProteome.txt | tail -n 3
GN   OrderedLocusNames=BA_0869, GBAA_0869, BAS0826;
OS   Bacillus anthracis.
OC   Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus;
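
The same idea works for any range of lines: head keeps everything up to the last wanted line, and tail then keeps as many lines as the range is long. Below is a rough sketch of this arithmetic; the file sample_lines.txt and the variables first and last are hypothetical names chosen for this example.

```shell
# Create a hypothetical 12-line sample file: "line 1" ... "line 12"
printf 'line %d\n' 1 2 3 4 5 6 7 8 9 10 11 12 > sample_lines.txt

# Print lines first..last (inclusive):
# head keeps everything up to line $last,
# tail then keeps the final (last - first + 1) lines of that output
first=8
last=10
range=$(head -n "$last" sample_lines.txt | tail -n "$((last - first + 1))")

echo "$range"
```

Note that this only yields the intended range if the file has at least as many lines as the value of last; for a shorter file, tail simply returns the final lines that exist.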

In addition, tail -n +x prints the entire file, starting at line x. For example, if we want the entire file, except for the first two lines, we have to start at line 3:

$ tail -n +3 BanthracisProteome.txt
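
Because the number given to tail -n + is the line to start at, skipping the first k lines means starting at line k + 1. Here is a small sketch of this off-by-one rule, assuming a made-up file sample_header.txt with two header lines (both the file and the variable skip are illustrative names only):

```shell
# Hypothetical sample file: two header lines followed by three data lines
printf '# header 1\n# header 2\ndata 1\ndata 2\ndata 3\n' > sample_header.txt

# To skip the first $skip lines, start printing at line skip + 1
skip=2
body=$(tail -n "+$((skip + 1))" sample_header.txt)

echo "$body"
```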

7.2.4 Cut

cut is a command that extracts columns from each line of a file. Let’s discuss its most important options on the first few lines of BanthracisProteome.txt, which look like this:

$ head -n 2 BanthracisProteome.txt
ID   3MGH_BACAN              Reviewed;         205 AA.
AC   Q81UJ9; Q6I2S8; Q6KWK9;

The flag -f is used to specify the columns that should be extracted. This can be individual numbers (-f 1), ranges (-f 1-3), lists (-f 1,5) or a mixture of those (-f 1,3-5). For example, if we want to print the first column only, we can use:

$ head -n 2 BanthracisProteome.txt | cut -f1
ID   3MGH_BACAN              Reviewed;         205 AA.
AC   Q81UJ9; Q6I2S8; Q6KWK9;

Well… That didn’t really work, did it? The output looks exactly the same as before. The reason is that cut assumes by default that the columns are separated by a tab (\t), but this is not the case for our file. The first two columns are separated by three spaces. We use the -d flag to specify the delimiter:

$ head -n 2 BanthracisProteome.txt | cut -f1 -d ' '
ID
AC

… and this works nicely. However, if we take the second column instead:

$ head -n 2 BanthracisProteome.txt | cut -f2 -d ' '

we get an empty column. This is because cut will extract everything between the first and the second space - which is nothing! In such cases, it makes sense to use the command tr upfront. tr -s squeezes repeated occurrences of a character, i.e. it will turn all double/triple/… spaces into a single space. This makes it much easier to parse the file with cut:

$ head -n 2 BanthracisProteome.txt | tr -s ' ' | cut -f2 -d ' '
3MGH_BACAN
Q81UJ9;
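
To see what the squeezing step does on its own, the sketch below applies it to a single line copied from the head output above and then selects two fields at once. The variable names are assumptions made for this illustration.

```shell
# One line in the style of the proteome file, with runs of spaces
line='ID   3MGH_BACAN              Reviewed;         205 AA.'

# Squeeze each run of spaces into a single space
squeezed=$(printf '%s\n' "$line" | tr -s ' ')
echo "$squeezed"

# With single spaces, the field numbers match the visible columns:
# fields 2 and 4 are the entry name and the sequence length
fields=$(printf '%s\n' "$squeezed" | cut -d ' ' -f 2,4)
echo "$fields"
```

When several fields are selected, cut joins them in the output with the same delimiter it used for splitting, here a single space.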

We will hear more about tr later.
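
For comparison, cut needs no -d flag when the columns really are separated by tabs. The following sketch uses a small made-up table, sample.tsv (a hypothetical file, not part of the course material), to try out a single field and a mixed list of fields and ranges:

```shell
# Hypothetical tab-separated table with five columns and two rows
printf 'a\tb\tc\td\te\n1\t2\t3\t4\t5\n' > sample.tsv

# No -d needed: cut splits on tabs by default
first_col=$(cut -f 1 sample.tsv)

# A mixture of a single field and a range: columns 1, 3, 4 and 5
mixed=$(cut -f 1,3-5 sample.tsv)

echo "$first_col"
echo "$mixed"
```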