In the Mojibake library I use data from the Unicode standard's UnicodeData.txt file. That file is automatically generated and full of similar strings.
00E7;LATIN SMALL LETTER C WITH CEDILLA;...
00E8;LATIN SMALL LETTER E WITH GRAVE;Ll;...
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;...
00EA;LATIN SMALL LETTER E WITH CIRCUMFLEX;...
00EB;LATIN SMALL LETTER E WITH DIAERESIS;...
00EC;LATIN SMALL LETTER I WITH GRAVE;...
...
The aim of Mojibake is to be as small as possible, and the name column of the unicode_data table is by far the largest piece of data in my SQLite database. How can the database be made as small as possible? There are several techniques.
Remove automatically generated entries
The UnicodeData.txt file contains generated names, especially for old or unused characters. Example: the entire block from U+13000 to U+143FF is made of sequential names:
13000;EGYPTIAN HIEROGLYPH A001;Lo;0;L;;;;;N;;;;;
13001;EGYPTIAN HIEROGLYPH A002;Lo;0;L;;;;;N;;;;;
13002;EGYPTIAN HIEROGLYPH A003;Lo;0;L;;;;;N;;;;;
13003;EGYPTIAN HIEROGLYPH A004;Lo;0;L;;;;;N;;;;;
The same goes for ANATOLIAN HIEROGLYPH, CUNEIFORM SIGN, CJK Compatibility Ideographs, etc. All of these can be removed from the database and simply rebuilt at runtime by using:
snprintf(character->name, 128, "CJK UNIFIED IDEOGRAPH-%X", codepoint);
snprintf(character->name, 128, "CUNEIFORM SIGN %s", name);
// And others
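To make the idea concrete, here is a minimal sketch of the runtime reconstruction. The function name and the single range it checks are illustrative, not Mojibake's actual API; the real data covers many more blocks.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Illustrative sketch: rebuild a generated name at runtime. Only the base
// CJK Unified Ideographs block is handled here. Returns true if a name
// was generated.
static bool rebuild_generated_name(uint32_t codepoint, char *name, size_t size) {
    if(codepoint >= 0x4E00 && codepoint <= 0x9FFF) {
        snprintf(name, size, "CJK UNIFIED IDEOGRAPH-%X", (unsigned int)codepoint);

        return true;
    }

    // Not a generated name: it has to be read from the database.
    return false;
}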
But my database was still pretty big: 2,486,272 bytes
Prefix compression
What is prefix compression? Simple: when two strings share the same prefix, there is no need to save it twice.
Example:
LATIN SMALL LETTER C WITH CEDILLA
LATIN SMALL LETTER E WITH GRAVE
There are many previous studies on how to store this, because a filesystem typically has many files that start with the same string. But the Google File System paper is the one that caught my attention: https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf.
- Sort the strings
- Identify a common prefix
- Store only the differing parts
- Save the length of the prefix shared with the previous string
Let’s take a real example. We sort some strings found in the file:
0. LATIN SMALL LETTER C WITH CEDILLA
1. LATIN SMALL LETTER E WITH ACUTE
2. LATIN SMALL LETTER E WITH CIRCUMFLEX
3. LATIN SMALL LETTER E WITH GRAVE
We calculate the prefix shared with the previous string and store its length.
0. 0 | LATIN SMALL LETTER C WITH CEDILLA
1. 19 | E WITH ACUTE
2. 26 | CIRCUMFLEX
3. 26 | GRAVE
So the second string is “the substring of length 19 of the string at the previous position (0), concatenated with the string stored at position 1”
"LATIN SMALL LETTER " + "E WITH ACUTE"
"LATIN SMALL LETTER E WITH ACUTE"
The first string is stored in full. The others refer to the previous one. Let’s imagine that SQLite uses only 8 bits for that integer. The savings will then be 19 + 26 + 26 - 1 - 1 - 1 = 68 bytes. This is what Google uses with GFS, SSTables, and Bigtable, and it’s a simple, performant way to save space.
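Here is a minimal sketch of that encoding step in C, assuming the names are already sorted. The function names are mine, not part of Mojibake; a real encoder would write the pairs to storage instead of printing them.

#include <stddef.h>
#include <stdio.h>

// Length of the prefix shared by two strings.
static size_t common_prefix_length(const char *a, const char *b) {
    size_t i = 0;

    while(a[i] != '\0' && a[i] == b[i]) {
        i++;
    }

    return i;
}

// Encode each sorted name as "<shared prefix length> | <suffix>".
// The first name shares nothing, so it is emitted in full.
static void encode(const char **sorted_names, size_t count) {
    for(size_t i = 0; i < count; i++) {
        size_t shared = i == 0 ? 0 : common_prefix_length(sorted_names[i - 1], sorted_names[i]);

        printf("%zu. %zu | %s\n", i, shared, sorted_names[i] + shared);
    }
}

Running encode on the four sorted names above reproduces the table: prefix lengths 0, 19, 26, 26.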
Is this the best approach for Mojibake? No. In Mojibake you can access a random entry of the database, for example by using the WASM magic: https://mojibake.zaerl.com/?function=mjb_codepoint_character&codepoint=128
Since each entry depends on the previous entry’s prefix, you must decompress sequentially from the start up to the requested position, which is a waste of time.
Position 0: Direct access ✓
Position 1: Rebuild from 0 → 1
Position 2: Rebuild from 0 → 1 → 2
Position N: Rebuild from 0 → 1 → 2 → ... → N ✗ Slow!
How do SSTables solve this when there are millions of rows? They simply store full keys at regular intervals, so there is no need to read N rows. Let’s imagine an “interval” of three.
LATIN SMALL LETTER C WITH CEDILLA    // Checkpoint
LATIN SMALL LETTER E WITH ACUTE      // 1. 19 | E WITH ACUTE
LATIN SMALL LETTER E WITH CIRCUMFLEX // 2. 26 | CIRCUMFLEX
LATIN SMALL LETTER E WITH GRAVE      // Checkpoint
...
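A sketch of how lookup could work with checkpoints, under the assumption that entries are stored as (shared prefix length, suffix) pairs and every third entry is a full name. The struct and the decode function are illustrative, not how SSTables or Mojibake are actually implemented.

#include <stddef.h>
#include <stdio.h>

#define CHECKPOINT_INTERVAL 3

// One compressed entry: the length of the prefix shared with the previous
// entry and the remaining suffix. Checkpoint entries have prefix_length 0.
typedef struct {
    size_t prefix_length;
    const char *suffix;
} entry_t;

// Decode entry `index` into `out` (assumed large enough): start from the
// nearest checkpoint at or before it and re-apply the suffixes forward.
// At most CHECKPOINT_INTERVAL - 1 extra entries are read, instead of `index`.
static void decode(const entry_t *entries, size_t index, char *out, size_t size) {
    size_t checkpoint = index - index % CHECKPOINT_INTERVAL;

    for(size_t i = checkpoint; i <= index; i++) {
        snprintf(out + entries[i].prefix_length, size - entries[i].prefix_length, "%s", entries[i].suffix);
    }
}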
Mojibake saves data once and reads it many times. So I decided to use a “prefix dictionary table” approach.
CREATE TABLE IF NOT EXISTS prefixes (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS unicode_data (
    codepoint INTEGER PRIMARY KEY,
    -- multiple columns
    prefix INTEGER -- a foreign key to the prefixes table
);
The new query is slightly more complex:
SELECT
    u.codepoint,
    CASE WHEN p.name IS NOT NULL THEN p.name || u.name ELSE u.name END AS name,
    -- multiple columns
FROM unicode_data u LEFT JOIN prefixes p ON u.prefix = p.id WHERE u.codepoint = ?
In other words:
- I read common prefixes
- I create the prefixes table
- I save the prefix id on each unicode_data row
- I concatenate the two strings at runtime, if needed
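As a sketch, this is roughly how the lookup could be run through the SQLite C API, reduced to the name column only; the error handling is minimal and the function is mine, not part of the Mojibake API.

#include <sqlite3.h>
#include <stdint.h>
#include <stdio.h>

// Sketch: fetch and print the (possibly prefix-compressed) name of a codepoint.
static int print_name(sqlite3 *db, uint32_t codepoint) {
    const char *sql =
        "SELECT CASE WHEN p.name IS NOT NULL THEN p.name || u.name ELSE u.name END "
        "FROM unicode_data u LEFT JOIN prefixes p ON u.prefix = p.id "
        "WHERE u.codepoint = ?";
    sqlite3_stmt *stmt = NULL;
    int rc = sqlite3_prepare_v2(db, sql, -1, &stmt, NULL);

    if(rc != SQLITE_OK) {
        return rc;
    }

    sqlite3_bind_int(stmt, 1, (int)codepoint);

    if(sqlite3_step(stmt) == SQLITE_ROW) {
        printf("%s\n", (const char *)sqlite3_column_text(stmt, 0));
    }

    return sqlite3_finalize(stmt);
}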
On finding the best set of prefixes to save
This is the most complex thing to do. When does saving a prefix actually save space? It depends on how much space SQLite needs for:
- A new row in the prefixes table (plus its index)
- An integer in the unicode_data table (instead of a NULL)
Does this take less space? The only way to be sure it does is to… try. SQLite is very, very good when it comes to optimization. Let’s see what I thought was the best approach, in the new PrefixCompressor class.
Keep in mind that the algorithm is not meant to be run multiple times, and so it is not optimized for space or speed.
First step: for every Unicode character name (~53k names) of length N, we save a map entry for every prefix (a candidate) longer than P (4 by default). That is O(N²) time and space per name:
| String length N | Calculation | Total characters |
|---|---|---|
| N = 5 | 5 | 5 |
| N = 7 | 5 + 6 + 7 | 18 |
| N = 10 | 5 + 6 + … + 10 | 45 (!) |
That is a lot of RAM, but as I said, it’s not a problem. In the Unicode 17 UnicodeData.txt file the name column totals 405,878 bytes, and the mean name length is ~11. For every candidate prefix we save a set of usage information.
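A sketch of that first step, assuming P = 4. It only prints the candidates; the real PrefixCompressor records each of them in a map together with its usage statistics.

#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MIN_PREFIX_LENGTH 4

// First step, sketched: every prefix longer than the minimum length is a
// candidate. For a name of length N this touches 5 + 6 + ... + N characters,
// hence the O(N^2) cost per name. A real implementation would store each
// candidate in a map with usage counters instead of printing it.
static void collect_candidates(const char **names, size_t count) {
    for(size_t i = 0; i < count; i++) {
        size_t length = strlen(names[i]);

        for(size_t p = MIN_PREFIX_LENGTH + 1; p <= length; p++) {
            printf("%.*s\n", (int)p, names[i]);
        }
    }
}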
We then greedily select the prefixes, calculating the benefit of adding each one. Finally, we sort the selected prefixes by length, descending, for greedy matching (longest first).
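The benefit test can be sketched like this. The two overhead constants are illustrative guesses, not SQLite’s actual per-row and per-integer costs, which is exactly why the only reliable check is to build the database and measure it.

#include <stddef.h>

// Illustrative guesses, not SQLite's real costs: bytes for a new row in the
// prefixes table (plus index) and for the integer foreign key stored in
// unicode_data instead of a NULL.
#define PREFIX_ROW_OVERHEAD 12
#define FOREIGN_KEY_OVERHEAD 2

// Estimated bytes saved by storing a prefix of `length` bytes shared by
// `uses` names: every use drops the prefix from the name but pays for the
// foreign key, and the prefix itself is stored once. Negative: not worth it.
static long benefit(size_t length, size_t uses) {
    return (long)(uses * length)
        - (long)(uses * FOREIGN_KEY_OVERHEAD)
        - (long)(length + PREFIX_ROW_OVERHEAD);
}

With these made-up overheads, “LATIN SMALL LETTER ” (19 bytes) used by 100 names would save 19 × 100 - 2 × 100 - 31 = 1,669 bytes.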
Conclusion
| Metric | Value |
|---|---|
| Original size | 2,486,272 bytes (~2.37 MB) |
| New size | 1,925,120 bytes (~1.84 MB) |
| Space saved | 561,152 bytes (~548 KB) |
| Reduction | 22.6% |
Was it worth it? Yeah, it was fun.