7.1 Encoding
Computers store 0s and 1s only. All data are thus represented in a binary sequence, that is, a sequence of 0s and 1s. The most basic unit of binary is a bit, which can be 0 or 1. With two bits, a computer can represent 4 different patterns: 00, 01, 10 and 11. With three bits, a computer can represent 8 different patterns: 000, 001, 010, 011, 100, 101, 110, 111. And so on.
The next larger unit is a byte, which consists of 8 bits. Guess: how many patterns can we represent with a byte? Each of the eight bits can be either 0 or 1, which results in \(2^8 = 256\) patterns.
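We can let the shell do this arithmetic for us:

```bash
echo $((2**3))   # 8 patterns with three bits
echo $((2**8))   # 256 patterns with one byte
```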
Since computers store 0s and 1s only, text needs to be encoded in 1s and 0s. There are different “maps”, called character encodings, that define how to get from a character (e.g. a letter, a digit, a whitespace character) to a binary representation.
7.1.1 ASCII
ASCII stands for American Standard Code for Information Interchange. It is a very early standardized character-encoding scheme using 7 bits, originally developed for teletype machines. With 7 bits, we can store up to \(2^7 = 128\) code points. Each such code point is mapped to a specific character. Below are some examples:
| Binary | Decimal | Character |
|---|---|---|
| 0000010 | 2 | Start of text |
| 0000100 | 4 | End of transmission |
| 0001001 | 9 | Horizontal tab |
| 0001010 | 10 | Line feed |
| 0001101 | 13 | Carriage return |
| 0110000 | 48 | 0 |
| 0110001 | 49 | 1 |
| 1000001 | 65 | A |
| 1000010 | 66 | B |
| 1100001 | 97 | a |
| 1100010 | 98 | b |
To form words or sentences, these bits are concatenated. For example, the sentence “ASCII is awesome!” results in the following binary code:
01000001 01010011 01000011 01001001 01001001 00100000 01101001 01110011 00100000 01100001 01110111 01100101 01110011 01101111 01101101 01100101 00100001
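We can reproduce this on the command line; a small sketch, assuming xxd is installed and the shell is bash:

```bash
# Print the decimal ASCII code of a character (a POSIX printf feature).
printf '%d\n' "'A"                    # 65

# Dump the sentence bit by bit; each group of 8 bits is one character.
echo -n 'ASCII is awesome!' | xxd -b
```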
ASCII can represent the lower- and uppercase English alphabet (52 letters), the digits 0-9, a couple of punctuation marks (.,!?), whitespace characters, and other symbols (e.g. +-/#%&). The limit of 128 code points is therefore quickly reached, and many important characters, such as letters from non-English alphabets, have no ASCII representation.
7.1.2 UTF-8
UTF-8 stands for Universal Character Set Transformation Format, 8-bit. UTF-8 is a more flexible encoding that can encode all possible characters in Unicode. Currently, there are 1’112’064 such characters (compared to 128 in ASCII)!
UTF-8 uses one to four bytes per character, depending on the character. The first 128 characters - exactly those of ASCII - are encoded with a single byte. Characters that appear later in Unicode are encoded with two, three and eventually four bytes.
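To see the variable width in action, we can count how many bytes single characters occupy (a sketch, assuming a UTF-8 terminal and the wc command):

```bash
printf 'A' | wc -c     # 1 byte  (ASCII character)
printf 'ä' | wc -c     # 2 bytes
printf '€' | wc -c     # 3 bytes
printf '😀' | wc -c    # 4 bytes
```

Note that wc -c counts bytes, not characters, which is exactly what we want here.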
If we have a sequence of bytes, like the one for the sentence “ASCII is awesome!” above, the computer needs to know which bytes together form one character. This was easy for ASCII, since each character was encoded by exactly one byte. For UTF-8, there is a specific structure that encodes this:
| Byte 1 (leading byte) | Byte 2 (continuation) | Byte 3 (continuation) | Byte 4 (continuation) |
|---|---|---|---|
| 0xxxxxxx | | | |
| 110xxxxx | 10xxxxxx | | |
| 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The leading byte contains information on how many bytes are used. If only one byte is used, it starts with a 0. If two bytes are used in total, it starts with 110. And so on.
In contrast, every continuation byte starts with 10. Therefore, starting from any random position in the file, the computer can easily find the beginning of a character: it skips bytes as long as they start with 10, and the first byte that does not start with 10 is a leading byte.
The x placeholders in the table above are then filled with the bits of the actual character.
For example:
The letter A is encoded by 1000001 (7 bits). UTF-8 adds a leading 0 to encode that a single byte is used, resulting in 01000001.
The € sign has the Unicode code point U+20AC, which is 10000010101100 in binary (14 bits). Unfortunately, two bytes are not sufficient, since the total number of bytes must also be encoded: two bytes spend 5 of their 16 bits on the prefixes (110 and 10), leaving only 11 bits for the character. We therefore need 3 bytes, which leave 16 bits (24 minus 8 prefix bits) for the character. The encoding fills the bits from right to left:
- Fill \(3^{rd}\) byte until it is full \(\to\) 10101100.
- Fill \(2^{nd}\) byte until it is full \(\to\) 10000010.
- Fill \(1^{st}\) byte until all bits of the character have been used, then pad the remaining positions with zeros until the byte is full \(\to\) 11100010.
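We can verify both encodings from this example on the command line, for instance with xxd -b (assuming a UTF-8 locale):

```bash
printf 'A' | xxd -b    # shows 01000001                     (one byte)
printf '€' | xxd -b    # shows 11100010 10000010 10101100   (three bytes)
```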
You might think that it would be a lot easier if UTF-8 simply used four bytes for every character instead of this complicated switching. The reason for the variable length is simple: it saves a lot of memory. A text containing only English letters would require four times as much memory with a fixed four-byte encoding.
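As a rough illustration (assuming iconv is available): the same English sentence takes four times as many bytes when converted to the fixed four-byte encoding UTF-32.

```bash
printf 'ASCII is awesome!' | wc -c                               # 17 bytes in UTF-8
printf 'ASCII is awesome!' | iconv -f UTF-8 -t UTF-32BE | wc -c  # 68 bytes in UTF-32
```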
7.1.3 Newline issue
The end of a line carries a historic legacy. In the teletype age, two different characters were required to end a line: a carriage return (CR) and a line feed (LF). Nowadays, on Unix systems, only the line feed is required. On Windows machines, both characters are used. And to make it more complicated, Mac systems used the carriage return up to Mac OS 9 and the line feed since Mac OS X. The important thing to realize is that, depending on the system, a computer might insert different characters to end a line. This can be problematic when using the same file on a different system.
However, with a few Bash commands, we can easily replace the newline characters.
Let’s consider the file test.txt, which uses both carriage return (\r) and line feed (\n) to end a line:
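```bash
# Create test.txt; the contents are just an example. Both lines end in \r\n:
# the first from the explicit \r\n, the second from the trailing \r plus the
# newline that echo appends itself.
echo -e "Hello\r\nWorld\r" > test.txt
```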
Here, echo -e enables the interpretation of backslash escape sequences, so \n is turned into an actual line feed and \r into a carriage return.
Let’s have a look at the file. With cat -e, otherwise invisible characters are printed using visible substitutes.
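Assuming the test.txt created above, the output looks like this:

```bash
cat -e test.txt
# Hello^M$
# World^M$
```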
This option displays Unix line endings (\n) as $ and Windows line endings (\r\n) as ^M$.
If we want to get rid of the Windows line endings, we can remove the carriage returns (\r) from the file using the following pipe:
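```bash
# Delete every \r and write the result to a new file
# (test_unix.txt is just an example name), then check with cat -e.
cat test.txt | tr -d '\r' > test_unix.txt
cat -e test_unix.txt
# Hello$
# World$
```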
We will learn more about tr later. But in this example, tr -d will delete all \r characters.