9.3 sed
sed is yet another powerful stream editor for filtering and transforming text. In contrast to tr, sed works line by line by default, so depending on the task one of the two may be a better fit than the other.
The syntax of sed can feel weird at first, but it is such a powerful command that learning to use it pays off very quickly. sed is most often used to substitute (find and replace) characters or words in a line. The basic syntax for this is:
sed 's/from/to/' inputFileName > outputFileName
or
sed 's/from/to/g' inputFileName > outputFileName
or
sed 'y/from/to/' inputFileName > outputFileName
There are a couple of things going on here. First, you will notice that the entire command after sed is written in single quotes '. The quotes stop the shell from interpreting spaces and special characters in the command before sed sees it, so it is best to always include them. Then, the command starts with either the letter ‘s’ or the letter ‘y’. The ‘s’ stands for substitute and the ‘y’ for translate. What is the difference? When you use the ‘s’ command, the entire pattern that you provide is replaced by your replacement string. Let’s look at an example:
$ for i in `seq 1 3`; do echo "hello there, line $i" >> sed.txt; done
$ sed 's/there/you/' sed.txt
hello you, line 1
hello you, line 2
hello you, line 3
You can see that sed replaced the word ‘there’ with the word ‘you’ on every line, as specified in the command.
The ‘y’ command, on the other hand, translates individual characters (similar to tr):
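As a minimal sketch, assuming the sed.txt file from above and the character sets eoi and XYZ (chosen to match the description that follows), such a translation could look like this:

```shell
# Recreate the three-line sample file used in this section
for i in 1 2 3; do echo "hello there, line $i"; done > sed.txt

# Translate e -> X, o -> Y, i -> Z, one character at a time
translated=$(sed 'y/eoi/XYZ/' sed.txt)
echo "$translated"
# hXllY thXrX, lZnX 1
# hXllY thXrX, lZnX 2
# hXllY thXrX, lZnX 3
```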
Here, every ‘e’ has been translated (replaced) to an ‘X’, every ‘o’ to a ‘Y’ and every ‘i’ to a ‘Z’, so there is no need for ‘eoi’ to occur together as one word.
You probably noticed the ‘g’ at the end of one of the sed commands in the examples at the beginning of this section. The ‘g’ stands for global, and you can choose to include it or not. Why would you want to include it? Since sed works line by line, it reads a line until it meets the pattern that you are looking for, replaces that one occurrence and then moves on to the next line. So, if you want ALL occurrences replaced (not just the first one in each line), you have to specify the global option.
Let’s look at an example using the global option. Say we want to replace every small ‘l’ with a capital ‘L’ in our file:
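A minimal sketch of a first attempt without the global option, assuming the sed.txt file created earlier in this section:

```shell
# Recreate the three-line sample file used in this section
for i in 1 2 3; do echo "hello there, line $i"; done > sed.txt

# No 'g' flag: only the first 'l' on each line is replaced
first_only=$(sed 's/l/L/' sed.txt)
echo "$first_only"
# heLlo there, line 1
# heLlo there, line 2
# heLlo there, line 3
```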
As you can see, if we do not provide the global option, sed only replaces the first instance in every line. Now, let’s add the global option:
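The same substitution with the ‘g’ flag appended, again sketched against the sed.txt sample file from this section:

```shell
# Recreate the three-line sample file used in this section
for i in 1 2 3; do echo "hello there, line $i"; done > sed.txt

# With 'g': every 'l' on each line is replaced
all_matches=$(sed 's/l/L/g' sed.txt)
echo "$all_matches"
# heLLo there, Line 1
# heLLo there, Line 2
# heLLo there, Line 3
```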
Perfect, now we successfully replaced all small l’s with capital L’s.
Note: In this example, you could use the ‘y’ command instead of ‘s’. There would then be no need to set the global option, since ‘y’ translates every character that matches.
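To check this claim, the following sketch runs both variants on the sed.txt sample file and compares their output. The variable names are only illustrative.

```shell
# Recreate the three-line sample file used in this section
for i in 1 2 3; do echo "hello there, line $i"; done > sed.txt

with_s=$(sed 's/l/L/g' sed.txt)   # substitute, global flag needed
with_y=$(sed 'y/l/L/' sed.txt)    # translate, always affects every match

if [ "$with_s" = "$with_y" ]; then
    echo "identical output"
fi
# identical output
```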
Here are some more examples of sed using the file that you are already familiar with:
Translating characters: sed 'y/from/to/'
$ head -n2 BanthracisProteome.txt | sed 'y/\n/ /'
ID 3MGH_BACAN Reviewed; 205 AA.
AC Q81UJ9; Q6I2S8; Q6KWK9;
In the example above you can see that sed cannot replace newlines this way: since it is applied line by line, the newline at the end of each line is not part of the text that the command gets to see.
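For comparison, tr works on the whole character stream rather than on lines, so it can do this job. A small sketch, using the two lines shown above as stand-in input instead of the real file:

```shell
# Two sample lines standing in for the head of BanthracisProteome.txt
joined=$(printf 'ID 3MGH_BACAN Reviewed; 205 AA.\nAC Q81UJ9; Q6I2S8; Q6KWK9;\n' | tr '\n' ' ')
echo "$joined"
# ID 3MGH_BACAN Reviewed; 205 AA. AC Q81UJ9; Q6I2S8; Q6KWK9;
```

Note that the final newline is turned into a space as well, so the joined line ends with a trailing blank.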
Replacing text: sed 's/from/to/'
$ head -n1 BanthracisProteome.txt | sed 's/BACAN/FRENCH FRIES/'
ID 3MGH_FRENCH FRIES Reviewed; 205 AA.
Note: Any character other than backslash or newline can be used instead of a slash to delimit the pattern and the replacement. Within the pattern and the replacement, the chosen delimiter itself can be used as a literal character, but you have to precede it with a backslash:
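A sketch of both cases, using made-up input strings: first a pipe character as a convenient delimiter for text that contains slashes, then the letter ‘a’ as delimiter with an escaped ‘a’ in the replacement.

```shell
# '|' as delimiter avoids having to escape the slashes in a path
moved=$(echo "/usr/bin" | sed 's|/usr|/opt|')
echo "$moved"
# /opt/bin

# 'a' as delimiter: s a u a \a a g  ->  pattern "u", replacement "a", flag g
swapped=$(echo "lunch budget" | sed 'saua\aag')
echo "$swapped"
# lanch badget
```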
In the example above, the letter ‘a’ was chosen as a delimiter, but at the same time we wanted to replace every ‘u’ with an ‘a’ as well, so we had to precede the replacement letter ‘a’ with a backslash, otherwise it wouldn’t work.
sed also lets you apply commands to specific lines only:
$ head -2 BanthracisProteome.txt | sed '2 s/A/a/'
ID 3MGH_BACAN Reviewed; 205 AA.
aC Q81UJ9; Q6I2S8; Q6KWK9;
$ head -10 BanthracisProteome.txt | sed '2,10 s/A/a/'
ID 3MGH_BACAN Reviewed; 205 AA.
aC Q81UJ9; Q6I2S8; Q6KWK9;
DT 26-aPR-2004, integrated into UniProtKB/Swiss-Prot.
DT 01-JUN-2003, sequence version 1.
DT 13-NOV-2013, entry version 69.
DE RecName: Full=Putative 3-methyladenine DNa glycosylase;
DE EC=3.2.2.-;
GN OrderedLocusNames=Ba_0869, GBAA_0869, BAS0826;
OS Bacillus anthracis.
OC Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus;
Note: to indicate a range that runs until the end of a file, use $ as the last address, for example sed '10,$ s/A/a/'.
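A small sketch of such an open-ended range, using the three-line sed.txt sample from earlier in this section rather than the proteome file:

```shell
# Recreate the three-line sample file used in this section
for i in 1 2 3; do echo "hello there, line $i"; done > sed.txt

# Apply the substitution from line 2 to the last line ($)
from_two=$(sed '2,$ s/hello/goodbye/' sed.txt)
echo "$from_two"
# hello there, line 1
# goodbye there, line 2
# goodbye there, line 3
```

The single quotes matter here: they keep the shell from treating $ as the start of a variable name.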
sed can do a lot more, such as deleting, replacing or adding lines. But all these tasks can also be achieved with awk, whose syntax many people find easier. Actually, for most of the things you want to achieve there are multiple ways using different commands. It is up to you to find out which ones suit you best. Some commands will be more applicable or efficient for certain tasks, but often there is no right or wrong.
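As one illustration of this, here is a sketch that deletes the second line of the sed.txt sample file, once with sed and once with awk. The d command and the NR variable are standard features of the two tools; the variable names are only illustrative.

```shell
# Recreate the three-line sample file used in this section
for i in 1 2 3; do echo "hello there, line $i"; done > sed.txt

# sed: delete line 2
via_sed=$(sed '2d' sed.txt)

# awk: print every line whose line number (NR) is not 2
via_awk=$(awk 'NR != 2' sed.txt)

echo "$via_sed"
# hello there, line 1
# hello there, line 3
```

Both commands produce the same two lines here, so the choice comes down to which notation you prefer.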