9.5 Sorting and unique lines: sort, uniq
In this section we are going to look at two essential commands of the Linux command line, sort and uniq. Both are quite simple to understand and they will make your coding life much easier further down the line.
9.5.1 sort
Let’s start with sort. This command does what its name suggests: it takes a text file as input and outputs its lines in sorted order.
$ for i in 7 5 10 1 5 100 5; do echo $i; done > numbers.txt
$ sort numbers.txt | tr '\n' ' '; echo
1 10 100 5 5 5 7
Note: The tr statement just ensures the numbers are printed on one line, and the echo at the end adds a newline when printing the results to the screen. Neither is needed for sort to work.
Well, did you think it would do that? By default, sort compares lines as text, character by character, not numerically, which is why 10 and 100 come before 5. If you want to sort by numeric value instead, simply add the flag -n.
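The difference between the two orderings can be sketched as follows. The snippet recreates numbers.txt so that it runs on its own; the variable names are only illustrative.

```shell
# Recreate the example file from above.
printf '%s\n' 7 5 10 1 5 100 5 > numbers.txt

# Default: lines are compared as text, so "10" and "100" sort before "5".
text_sorted=$(sort numbers.txt | tr '\n' ' ')

# -n: lines are compared by their numeric value.
numeric_sorted=$(sort -n numbers.txt | tr '\n' ' ')

echo "$text_sorted"      # 1 10 100 5 5 5 7
echo "$numeric_sorted"   # 1 5 5 5 7 10 100
```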
There are a few additional operations you can do, such as sorting in reverse order (using sort -r), randomizing file lines (using sort -R; pay attention to the seed!), removing duplicate lines (using sort -u), or checking whether a file is sorted (using sort -c).
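As a minimal sketch, here is how some of these flags behave on numbers.txt when combined with -n. sort -R is left out because its output changes from run to run. The check with -c relies on the exit status of sort, which is zero only if the input is already in order; the variable names are illustrative.

```shell
# Recreate the example file from above.
printf '%s\n' 7 5 10 1 5 100 5 > numbers.txt

# -r reverses the order (here: largest number first).
reversed=$(sort -n -r numbers.txt | tr '\n' ' ')

# -u keeps only one copy of each line.
deduplicated=$(sort -n -u numbers.txt | tr '\n' ' ')

# -c only checks the order; it sorts nothing and reports via its exit status.
if sort -n -c numbers.txt 2>/dev/null; then
    order_state="sorted"
else
    order_state="not sorted"
fi

echo "$reversed"       # 100 10 7 5 5 5 1
echo "$deduplicated"   # 1 5 7 10 100
echo "$order_state"    # not sorted
```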
9.5.2 uniq
If you have a file with a lot of duplicated lines and you want to know how many unique entries you have, you can use the command uniq. Again, the syntax is quite intuitive, but one thing you need to know is that only duplicated lines that are right next to each other are detected. This sounds weird, but when you run sort first and pipe its output into uniq, it will never be an issue.
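The adjacency rule can be illustrated with numbers.txt, in which the three 5s are separated by other values. This is a small sketch with illustrative variable names; the file is recreated so that the snippet runs on its own.

```shell
# Recreate the example file from above.
printf '%s\n' 7 5 10 1 5 100 5 > numbers.txt

# uniq alone: no two identical lines are adjacent, so nothing is removed.
without_sort=$(uniq numbers.txt | tr '\n' ' ')

# sort first: identical lines end up next to each other and are collapsed.
with_sort=$(sort -n numbers.txt | uniq | tr '\n' ' ')

echo "$without_sort"   # 7 5 10 1 5 100 5
echo "$with_sort"      # 1 5 7 10 100
```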
Run on the unsorted numbers.txt, uniq removes no duplicates, because the three 5s are not next to each other. As soon as we sort the file first, however, it works!
You can use uniq -d to do the opposite and keep only the duplicated lines, printing one copy of each.
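For numbers.txt, the only value that occurs more than once is 5, so a quick check might look like this (the variable name is illustrative):

```shell
# Recreate the example file from above.
printf '%s\n' 7 5 10 1 5 100 5 > numbers.txt

# -d prints one copy of every line that appears at least twice in a row,
# which is why the input is sorted first.
duplicated=$(sort -n numbers.txt | uniq -d)

echo "$duplicated"   # 5
```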
If you want to know how many times a specific line occurs in your file, you can use the -c flag.
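A sketch of counting with -c on numbers.txt follows. uniq -c prints the count in front of each line, padded with leading spaces, so awk is used here to normalize the spacing. Sorting the counts numerically in reverse is a common way to find the most frequent line; the variable names and the output wording are illustrative.

```shell
# Recreate the example file from above.
printf '%s\n' 7 5 10 1 5 100 5 > numbers.txt

# One "count value" pair per distinct line.
counts=$(sort -n numbers.txt | uniq -c | awk '{print $1, $2}' | tr '\n' ' ')

# Rank by count, highest first, and keep the top entry.
most_common=$(sort -n numbers.txt | uniq -c | sort -rn | head -n 1 |
    awk '{print $2 " occurs " $1 " times"}')

echo "$counts"        # 1 1 3 5 1 7 1 10 1 100
echo "$most_common"   # 5 occurs 3 times
```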
You can also use sort and uniq -d to only keep entries that are present in two or more files:
$ for x in human chimp orangutan gorilla; do echo $x; done > primates.txt
$ head -2 primates.txt > primates.short
$ cat primates.txt primates.short | sort | uniq -d
chimp
human
The command cat simply concatenates the contents of both files, one after the other. After sorting, the -d flag returns the duplicated lines, i.e. the lines that are present in both files. Note that this only holds as long as no line is repeated within a single file.
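If a file may contain repeated lines of its own, one possible safeguard is to deduplicate each file with sort -u before combining them. The sketch below uses a hypothetical file, primates.dup.txt, in which gorilla appears twice, to show the difference; the file name and variable names are assumptions made for illustration.

```shell
# Recreate the example files from above.
printf '%s\n' human chimp orangutan gorilla > primates.txt
head -2 primates.txt > primates.short

# Hypothetical file in which "gorilla" is listed twice.
{ cat primates.txt; echo gorilla; } > primates.dup.txt

# Naive approach: "gorilla" is reported although it is in only one file.
naive=$(cat primates.dup.txt primates.short | sort | uniq -d | tr '\n' ' ')

# Deduplicate each file first, so a duplicate can only come from two files.
shared=$({ sort -u primates.dup.txt; sort -u primates.short; } |
    sort | uniq -d | tr '\n' ' ')

echo "$naive"    # chimp gorilla human
echo "$shared"   # chimp human
```

Replacing uniq -d with uniq -u in the second pipeline would instead list the entries that occur in only one of the two files.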