Inventing UTF-8
Does the UTF‑8 encoding scheme make sense? Is it the best design we can come up with? In this article i’m walking through the process of designing a UTF‑8 alternative that… well… just magically ends up being UTF‑8. How lovely when things just make sense!
This blog post is the result of many discussions about the UTF-8 and UTF-16 encoding formats, weighing the ups and downs of each. To meaningfully compare different design decisions, however, it is important to understand why these formats were designed the way they were. And what better way of understanding their designs than by reïnventing them yourself?
The challenge seems simple:
- We start off with ASCII, this old and limited 7-bit text encoding scheme with a lot of strange control codes barely anyone remembers.
- We want to be able to encode all of Unicode, which we now know encodes a single scalar value in at most 21 bits.
- Ideally we want to use as few bits per scalar value as reasonable.
- We boldly assume that most common text will largely consist of code points in the old ASCII range, so ideally pure ASCII text, when encoded using our new scheme, should use no more memory than before.
LEB128
For the machine, our text is nothing but a sequence of integers. So our challenge boils down to finding an optimal encoding scheme for a single one of these integers and then encoding them all, one after another, with that scheme. Given our 4th requirement and the 7‑bitness of ASCII, we have only one bit to spare to indicate whether or not more bytes belonging to the same integer value follow.
LEB128, or »Little Endian Base 128«, is a common variable-length integer encoding scheme serving precisely this purpose. In LEB128 every number is split up into 7‑bit chunks (2⁷=128), from lowest to highest bits. Each of these chunks then gets an 8th tag bit that is 0 for the first chunk of an integer and 1 for all immediately following chunks belonging to that same integer.
Here’s an example of how different numbers are encoded in LEB128:
Decimal | Binary (BE) | LEB128 |
---|---|---|
41 | 0010’1001 | 0010’1001 |
297 | 1 0010’1001 | 0010’1001 1000’0010 |
809 | 11 0010’1001 | 0010’1001 1000’0110 |
33577 | 1000’0011 0010’1001 | 0010’1001 1000’0110 1000’0010 |
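To make the chunking concrete, here’s a minimal sketch of what an encoder for the scheme described above could look like. The function name and the little test loop are made up for illustration; only the bit layout comes from the description above.

```rust
/// Encode `value` using the LEB128 variant described above:
/// 7-bit chunks from lowest to highest, with the tag bit 0 on the
/// first chunk and 1 on every following chunk of the same integer.
fn encode_leb128_variant(mut value: u32, out: &mut Vec<u8>) {
    // First chunk: the lowest 7 bits, tag bit 0.
    out.push((value & 0x7F) as u8);
    value >>= 7;
    // Remaining chunks: the next 7 bits each, tag bit 1.
    while value != 0 {
        out.push(0x80 | (value & 0x7F) as u8);
        value >>= 7;
    }
}

fn main() {
    for n in [41u32, 297, 809, 33577] {
        let mut bytes = Vec::new();
        encode_leb128_variant(n, &mut bytes);
        // Prints 29 / 29 82 / 29 86 / 29 86 82, matching the table rows.
        println!("{n}: {bytes:02X?}");
    }
}
```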
Sure, this format can waste an unfortunate amount of bits when the last chunk of an integer consists of mostly padding. However, for all other chunks this encoding scheme is pleasantly efficient. Awesome! So we’re done here, right?
… or not?
The format above clearly works well enough for the purpose of encoding arbitrary Unicode scalar values using as few bytes as reasonable. However, how well does it actually hold up when confronted with the real world? The world where a ton of software is and was written in very old versions of C, and where text files are not just read but also written and edited by humans, with random sections of text inserted and deleted somewhere in the middle of files.
The observant among y’all have already noticed that the example numbers in the table above are not really random: they were carefully selected! If you look closely at the LEB128 column, you’ll notice that all blue chunks (0010’1001) are equal. So are all red chunks (1000’0110), and all green chunks (1000’0010). If you were to delete the green chunk of 33577 you would get the LEB128 encoding of 809, and if you deleted the red chunk you would get 297. Delete both red and green and you get 41. Deleting a random byte in the middle is an easy mistake to make in software, especially in a language with text handling as error-prone as C’s. It will happen somewhere. Even the opposite could easily occur: randomly inserting bytes where they don’t belong, such as falsely turning 41 into 297 by appending the green chunk. Ideally, such byte injection or removal faults should not turn characters of text into arbitrary different characters.
We can conclude that an additional requirement is to be added to our original list:
- The encoding scheme should gracefully handle faulty injection or removal of arbitrary single bytes. I.e. we should be able to detect whether a scalar value is broken and we should be able to read the rest of the text without further errors.
We need a counter!
A pretty robust solution would be to enumerate all chunks of a scalar value from the 0th to the last. Again, due to our first requirement, a topmost bit of 0 in the first byte should indicate that this is the only byte of data for the current number, so that our encoding scheme is a strict superset of ASCII.
Let’s try a naïve first scheme. If the tag bit of the first byte is 0, the remaining 7 bits are plain ASCII. If, however, the tag bit is 1, the next few top bits form a counter field, and the rest is our scalar value chunk. Here’s the tricky part, however: we need to be able to count a total of 21 chunk bits to represent all of Unicode. So how many bytes will we need to accumulate enough chunk bits for the last Unicode scalar value? To make matters worse, the more bytes we need, the more counter bits we need, which in turn leaves us fewer chunk bits per byte, which makes us need more bytes. Sounds confusing? Well, it is confusing. So let’s get rid of the confusion by demonstrating a working implementation of this scheme.
If the tag bit is 1, the next three top bits are a 3‑bit number that counts down the remaining bytes in the sequence. The last byte has a counter value of 000. This gives us a total of 4 bits per chunk to work with, allowing us to encode up to 24‑bit integers using a total of 6 bytes. Here’s how this looks for the number 33577:
Decimal | Binary (BE) | Counter (LE, naïve) |
---|---|---|
33577 | 1000’0011 0010’1001 | 1011’1001 1010’0010 1001’0011 1000’1000 |
� | 1000’0011 ----’1001 | 1011’1001 1001’0011 1000’1000 |
2098 | 1000 0011’0010 | 1010’0010 1001’0011 1000’1000 |
� | ----’---- 0010’1001 | 1011’1001 1010’0010 1010’0010 1001’0011 1000’1000 |
If we were to now remove the byte of the red, green or black chunk (1010’0010, 1001’0011 or 1000’1000), our text decoder program would detect that a numbered byte is missing. However, something interesting happens if we were to delete the blue chunk (1011’1001): the remainder gets falsely accepted as a different number again! We have no way of knowing whether the red chunk was the first or a later byte of an encoded number. It gets even weirder if we inject a byte that repeats an adjacent chunk counter. In the last example, the red chunk is repeated. We can detect that the first two bytes of our encoded message form an incomplete scalar value, given that the follow-up counter 001 is missing. However, the last three bytes once again get recognised as a valid number, 2098.
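To make the failure mode tangible, here’s a rough decoder sketch for this countdown scheme. The function name and error strings are made up for illustration; the bit layout is the one described above.

```rust
/// Try to decode one scalar value under the naïve countdown scheme:
/// ASCII bytes start with 0, every other byte is
/// 1 | 3-bit countdown of remaining bytes | 4 payload bits.
fn decode_countdown(bytes: &[u8]) -> Result<(u32, usize), &'static str> {
    let first = *bytes.first().ok_or("empty input")?;
    if first & 0x80 == 0 {
        return Ok((first as u32, 1)); // plain ASCII byte
    }
    let mut expected = (first >> 4) & 0b111; // countdown of the current byte
    let mut value = (first & 0x0F) as u32;   // lowest 4 payload bits
    let mut shift = 4;
    let mut len = 1;
    while expected > 0 {
        let byte = *bytes.get(len).ok_or("sequence cut off")?;
        let counter = (byte >> 4) & 0b111;
        // Every follow-up byte must continue the countdown exactly.
        if byte & 0x80 == 0 || counter != expected - 1 {
            return Err("countdown mismatch: a byte was dropped or injected");
        }
        value |= ((byte & 0x0F) as u32) << shift;
        shift += 4;
        expected = counter;
        len += 1;
    }
    Ok((value, len))
}
```

Fed all four bytes of 33577 this returns 33577, but fed the same bytes with the first one deleted it happily returns 2098: exactly the false accept from the table.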
Indeed, this encoding scheme is clearly broken. But can we potentially fix it, somehow? Well, we can try, by identifying a crucial piece of information that this scheme is lacking.
We need a start bit!
Let’s go back to LEB128 for a moment. We just identified a new kind of fault in the naïve counter scheme that cannot happen in LEB128: If we delete the first bytes of a sequence, the remaining bytes still form a valid encoded number! The solution for this is easy. In LEB128, a single bit tells us whether a byte is the first of a sequence or any of the remaining bytes.
Let’s add such a start/rest bit to our previous counter design. That gives us 3 counter bits, 1 start/rest bit, and 3 payload bits, allowing us to encode up to 21‑bit integers in 7 bytes. Here’s what it looks like now:
Dec. | Binary (BE) | Counter (LE, terrible) | Error |
---|---|---|---|
297 | 1 0010’1001 | 10010001 11001101 11000100 | All good! |
� | � | 11001101 11000100 | Where’s the start, mate? |
� | 1 00--’-001 | 10010001 11000100 | Something’s missing in the middle… |
� | - --10’1001 | 10010001 11001101 | Where’s the rest? |
� | - ----’-001 | 10010001 11000100 11001101 | Hole in the middle, missing start, missing end. |
� | - --10’1001 | 10010001 11001101 11001101 11000100 | Missing rest, missing start. |
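For completeness, here’s a sketch of an encoder for this layout: every non-ASCII byte is 1 tag bit, 1 start/rest bit, a 3-bit countdown and 3 payload bits. The function name is made up for illustration.

```rust
/// Encode one scalar value with the start/rest + countdown scheme.
fn encode_terrible(value: u32, out: &mut Vec<u8>) {
    if value < 0x80 {
        out.push(value as u8); // plain ASCII passes through unchanged
        return;
    }
    // Number of 3-bit payload chunks needed (at most 7 for 21-bit values).
    let bits = 32 - value.leading_zeros();
    let chunks = (bits + 2) / 3;
    for i in 0..chunks {
        let payload = ((value >> (3 * i)) & 0b111) as u8; // lowest chunks first
        let countdown = ((chunks - 1 - i) as u8) << 3;
        let start_rest: u8 = if i == 0 { 0 } else { 0x40 };
        out.push(0x80 | start_rest | countdown | payload);
    }
}
```

Encoding 297 with this sketch reproduces the three bytes 10010001 11001101 11000100 from the first table row.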
This new encoding scheme seems to work as intended! But it’s garbage. For every 3 bits of data we need 5 bits of meta data. This is over 100% bloat! Our next goal is thus to reduce the amount of meta data bits we currently use.
We need one counter!
Let’s consider how likely and severe these faults are:
- Injecting or removing a single byte somewhere in the middle of text can easily happen. Off-by-1 bugs are frighteningly common. We still want to be able to catch such errors.
- Swapping two bytes in the middle, however, as shown in the 5th example of the terrible counter, is extremely unlikely.
- We sometimes cannot tell whether it’s one broken scalar value or two. But that’s fine as long as we know they’re definitely broken.
- Deleting or inserting a number of bytes in the middle such that the surrounding scalar values will be successfully recognised as one, or such that a single scalar value will be recognised as two, is so unlikely that chances are it was done on purpose.
We can draw an important conclusion from this: the only counter that really matters is that of the first byte of a sequence! With it alone we can already tell whether we have too few or too many follow-up bytes. The old follow-up byte counters only potentially help with detecting byte swaps.
This means that the first byte of a sequence still only encodes 3 payload bits. However, the remaining bytes of a sequence can now encode 6 payload bits each. Not bad. Here’s how this looks:
Dec. | Binary (BE) | Counter (LE, decent) | Error |
---|---|---|---|
297 | 1 0010’1001 | 10001001 11100101 | All good! |
� | - ----’-001 | 10001001 | Missing rest. |
� | � | 11100101 | Missing start. |
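Sketched as code, an encoder for this scheme might look as follows. Names are made up, and the chunk order follows the little-endian convention used in the tables so far.

```rust
/// Encode one scalar value with the single-counter scheme:
/// start byte     = 1 | 0 | 3-bit follow-up count | 3 payload bits,
/// follow-up byte = 1 | 1 | 6 payload bits.
fn encode_single_counter(value: u32, out: &mut Vec<u8>) {
    if value < 0x80 {
        out.push(value as u8); // plain ASCII
        return;
    }
    // How many 6-bit follow-up chunks are needed on top of the 3 start bits?
    let mut follow_ups = 1;
    while value >> (3 + 6 * follow_ups) != 0 {
        follow_ups += 1;
    }
    // Start byte: tag 1, start bit 0, counter, lowest 3 payload bits.
    out.push(0x80 | ((follow_ups as u8) << 3) | (value & 0b111) as u8);
    // Follow-up bytes: tag 1, rest bit 1, next 6 payload bits each.
    for i in 0..follow_ups {
        out.push(0xC0 | ((value >> (3 + 6 * i)) & 0x3F) as u8);
    }
}
```

Encoding 297 with this sketch reproduces the two bytes 10001001 11100101 from the table above.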
The payload bytes of this scheme do quite well. All they need is a single bit to distinguish them from ASCII, and another bit to distinguish them from the start of a sequence. We cannot really do better than that. The starting byte still uses up an unfortunate amount of meta data bits, however.
Information density of the start byte
In our latest scheme, two bytes are capable of encoding… just 9 bits of information. This is just barely enough to encode all of ASCII, Latin‑1, Latin Extended‑A, and half of Latin Extended‑B. We could bias (read: offset) the encoded number by 128, given that encoding ASCII in two bytes would be redundant. This would allow us to also cover the rest of Latin Extended‑B. Not really an exciting amount of coverage for two entire bytes. Needless to say, if any of the designs so far had been the proposal for UTF‑8, we’d all be using UTF‑16 now.
We now want to optimise our encoding scheme for the case where a sequence takes two bytes, in order to maximise BMP coverage. In theory, our starting byte only needs 3 bits of meta data to serve its purpose:
- One bit tells it apart from ASCII.
- One bit marks it as the start of a sequence.
- One bit tells us whether this sequence has more than 2 bytes.
This leaves us with 5 payload bits in the starting byte if our sequence is just 2 bytes long, giving us a total of 11 payload bits. Applying the same idea recursively gives us 4 payload bits in the starting byte of a 3‑byte sequence, or 16 bits in total, covering the entire BMP. An interesting pattern that may allow us to clean up our current mess of bit fields.
A unary counter
Here’s the core idea:
- We start at the top-most bit of our starting byte, assuming our sequence is just 1 byte long.
- If it is a 0, we have our number of bytes in the current sequence.
- If it is a 1, our sequence is one byte longer than we thought. Continue checking the next bit.
In essence, we count leading 1s, which is also known as a unary number system. Here’s what the starting byte of any sequence may look like using this scheme:
Starting Byte | # Follow-Up Bytes |
---|---|
0xxx’xxxx | 0 |
10xx’xxxx | 1 |
110x’xxxx | 2 |
1110’xxxx | 3 |
1111’0xxx | 4 |
1111’10xx | 5 |
1111’110x | 6 |
1111’1110 | 7 |
This nicely covers our starting byte, but the follow-up bytes need to be encoded efficiently as well. The bit pattern of a follow-up byte must never be confusable with a start byte. Remember that we found that 2 meta data bits are as good as it gets for follow-up bytes. What if… we just removed the second row of the starting byte table and used that freed-up pattern for our follow-up bytes?
Starting Byte | # Follow-Up Bytes |
---|---|
0xxx’xxxx | 0 |
110x’xxxx | 1 |
1110’xxxx | 2 |
1111’0xxx | 3 |
1111’10xx | 4 |
1111’110x | 5 |
1111’1110 | 6 |
Well, the counter value is a bit wonky now, but for computers it’s trivial to calculate:
```rust
let ones = byte.leading_ones();               // `byte` is the starting byte of a sequence
let follow_up_bytes = ones.saturating_sub(1); // 0 for plain ASCII, otherwise ones - 1
```
And now this is the pattern of our follow-up bytes:
Follow-Up Byte |
---|
10xx’xxxx |
We can now detect and gracefully handle all the error cases we explored and analysed in earlier sections, and do so while squeezing as many payload bits as we can into every single byte of a sequence.
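Putting the pieces together, here’s a rough decoder sketch for this unary-counter scheme. It puts the most significant payload bits into the starting byte (the order UTF-8 itself uses); the names and error strings are made up for illustration.

```rust
/// Decode one scalar value from the unary-counter scheme (no range checks yet).
fn decode_unary_counter(bytes: &[u8]) -> Result<(u32, usize), &'static str> {
    let first = *bytes.first().ok_or("empty input")?;
    let ones = first.leading_ones();
    if ones == 1 {
        return Err("lone follow-up byte (10xx'xxxx) without a start byte");
    }
    if ones == 8 {
        return Err("11111111 is not a valid starting byte");
    }
    let follow_ups = ones.saturating_sub(1) as usize;
    // Payload bits of the starting byte: everything below the leading 1s and their 0 terminator.
    let mut value = (first as u32) & (0x7F >> ones);
    for i in 1..=follow_ups {
        let byte = *bytes.get(i).ok_or("sequence cut off")?;
        if byte & 0xC0 != 0x80 {
            return Err("expected a follow-up byte of the form 10xx'xxxx");
        }
        value = (value << 6) | (byte & 0x3F) as u32;
    }
    Ok((value, follow_ups + 1))
}
```

This quick pass still happily accepts the redundant and bogus sequences discussed in the next section.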
Removing nonsense
Let’s have a look at what a 4‑byte sequence looks like in our current scheme:
Starting Byte | Follow-Up 1 | Follow-Up 2 | Follow-Up 3 |
---|---|---|---|
1111’0xxx | 10xx’xxxx | 10xx’xxxx | 10xx’xxxx |
If we count all the x bits together, we get… exactly 21 bits, the magic lower limit of bits needed to store every possible Unicode scalar value. So really we can proclaim that all starting bytes that would ask for more than 3 follow-up bytes are bogus, further shrinking our starting byte table:
Starting Byte | # Follow-Up Bytes |
---|---|
0xxx’xxxx | 0 |
110x’xxxx | 1 |
1110’xxxx | 2 |
1111’0xxx | 3 |
There are a lot more cases of redundancy we could cover. For example, the sequence 1100’000x 10xx’xxxx only ever encodes the exact same scalar value as just 0xxx’xxxx, wasting a byte. And then there are the UTF‑16 surrogates, which while being Unicode code points are not Unicode scalar values. We can now choose how to handle these edge cases:
- We could remove all these redundant encodings and the surrogate gap by biasing successive values. This is… complicated to do right.
- We could instead introduce specialised bounds checks at each step. Also annoying.
- However, the bounds-check approach allows us to split the handling of UTF-8 text into two parts: a messy but correct validation pass, and a quick and dirty decoding pass that accepts surrogates and wasteful redundant encodings. We only rarely need the full validation.
Given our 21‑bit range and given that all these gaps and redundancies don’t add up to being able to remove a whole follow-up byte, we don’t really gain anything from the complicated biasing technique, so let’s go for the bounds checks instead.
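As a sketch, the bounds checks for one decoded sequence might look like this; the limits are the standard Unicode ones, and the function shape is just an illustration.

```rust
/// Validate one decoded scalar value: reject overlong encodings,
/// UTF-16 surrogates, and values beyond Unicode's 21-bit range.
fn validate(value: u32, bytes_used: usize) -> Result<(), &'static str> {
    // Smallest value that actually needs this many bytes; anything
    // below it is a wasteful redundant (overlong) encoding.
    let min = match bytes_used {
        1 => 0x0000,
        2 => 0x0080,
        3 => 0x0800,
        4 => 0x1_0000,
        _ => return Err("more than 3 follow-up bytes is never needed"),
    };
    if value < min {
        return Err("redundant (overlong) encoding");
    }
    if (0xD800..=0xDFFF).contains(&value) {
        return Err("UTF-16 surrogate, not a scalar value");
    }
    if value > 0x10_FFFF {
        return Err("beyond the last Unicode code point");
    }
    Ok(())
}
```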
UTF-8 makes sense!
As you may have already guessed, this latest version of a potential Unicode encoding scheme is… exactly UTF-8. Now, how did we arrive here again?
- It all began with LEB128, the densest possible variable length encoding based on bytes.
- Given the context of where and how our byte-based text encoding scheme would be used, there are a number of fault tolerance requirements LEB128 just doesn’t meet.
- All that’s needed to satisfy these requirements is the addition of a counter and a start/rest bit.
- As it happens, a counter based on unary numbers serves all our needs while maximising information density.
- And that’s pretty much how you arrive at the design of UTF-8.
It’s nice, being able to appreciate how much damn good engineering went into this encoding scheme.
Terms used
- Unicode Scalar Value
- This is Unicode speak for what programmers would call `char` or `rune` or indeed `UnicodeScalarValue`, depending on which programming language they came from. Why this fancy and unfortunately quite unwieldy name? Well, what most people think of when talking about »characters« in text, Unicode calls »extended grapheme clusters«. As it turns out, a single character can consist of one or arbitrarily many scalar values. And what Unicode calls a »rune« is… well… actual letters of runic alphabets. So what is a scalar value, then? A Unicode code point… except for the few code points called »surrogates«, which are needed to make UTF‑16 able to actually represent all Unicode scalar values.
- Little Endian
- LE
- These days all computer memory is usually addressable in units of bytes. However, we usually operate on pieces of data that are multiple bytes in size. For these, there are multiple ways they can be laid out in memory, though only two are relevant. Take the number 0x0815, for example. We can split it up into the two bytes 0x08 and 0x15. In Little Endian byte order, the bytes containing the lowest bits of a datum are put at the lower memory addresses. In Big Endian, the other relevant scheme, the higher bits of a datum are put at the lower addresses. Big Endian is the digit ordering you are familiar with from e.g. the English language.

  Memory layout of 0x0815 in different byte orders. N is the memory address from where we start writing our number.

  Endian | N+0 | N+1 |
  ---|---|---|
  Little | 0x15 | 0x08 |
  Big | 0x08 | 0x15 |

- Big Endian
- BE
- The opposite byte order of LE.
- Basic Multilingual Plane
- BMP
- The Basic Multilingual Plane of Unicode, also known as plane 0, is a block of 65536 Unicode code points (Yes, code points, the surrogates live here.) that cover the vast majority of human text. Being able to represent all code points of the BMP up to at least the surrogates in as few bytes as possible is highly desirable.
- Basic Latin (aka. ASCII)
- Latin‑1 Supplement
- Latin Extended‑A
- Latin Extended‑B
- Phonetic Extensions
- Phonetic Extensions Supplement
- Latin Extended: The Empire Strikes Back
- Combining Diacritical Marks Supplement
- Latin Extended Additional
- Latin Extended‑C
- Latin Extended‑D
- Latin Extended‑E
- … and probably more i forgot or overlooked are sections of the Unicode BMP for just Latin script. Only one of these is made up.
- Unary Numbers
- In unary number systems there is only one digit. A number is represented by how often that one digit occurs. Say this digit is 🐜. Then 🐜🐜🐜🐜 is the number 4. We can represent unary numbers inside a binary bit stream by picking one digit, say 1, as our unary digit, and then counting all of them until we hit our first binary digit that does not match our unary one, say 0. Or, using ants again, 🐜🐜🐜🏳️‍🌈🐜🐜🐜🐜 represents the two separate numbers 3 and 4.
(Side note: Dear font designers, fix the ant emoji, please! They often look nothing like an ant. Learn from Noto.)