In the Mojibake library I use data from the Unicode standard's UnicodeData.txt file. That file is automatically generated and full of similar strings.
00E7;LATIN SMALL LETTER C WITH CEDILLA;...
00E8;LATIN SMALL LETTER E WITH GRAVE;Ll;...
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;...
00EA;LATIN SMALL LETTER E WITH CIRCUMFLEX;...
00EB;LATIN SMALL LETTER E WITH DIAERESIS;...
00EC;LATIN SMALL LETTER I WITH GRAVE;...
...
The aim of Mojibake is to be as small as possible, and the name column of the unicode_data table is by far the largest piece of data in my SQLite database. How can the database be made as small as possible? There are several techniques.
Remove automatically generated entries
The UnicodeData.txt file contains generated names, especially for old or unused characters. Example: the entire block from U+13000 to U+143FF is made of sequential names:
13000;EGYPTIAN HIEROGLYPH A001;Lo;0;L;;;;;N;;;;;
13001;EGYPTIAN HIEROGLYPH A002;Lo;0;L;;;;;N;;;;;
13002;EGYPTIAN HIEROGLYPH A003;Lo;0;L;;;;;N;;;;;
13003;EGYPTIAN HIEROGLYPH A004;Lo;0;L;;;;;N;;;;;
The same goes for ANATOLIAN HIEROGLYPH, CUNEIFORM SIGN, CJK Compatibility Ideographs, etc. All of these can be removed from the database and simply rebuilt at runtime by using:
snprintf(character->name, 128, "CJK UNIFIED IDEOGRAPH-%X", codepoint);
snprintf(character->name, 128, "CUNEIFORM SIGN %s", name);
// And others
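To make the idea concrete, here is a minimal sketch of the runtime reconstruction. The function name and the single range it checks are illustrative, not Mojibake's actual API; the real data covers many more blocks.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Illustrative sketch: rebuild a generated name at runtime. Only the base
// CJK Unified Ideographs block is handled here. Returns true if a name
// was generated.
static bool rebuild_generated_name(uint32_t codepoint, char *name, size_t size) {
    if(codepoint >= 0x4E00 && codepoint <= 0x9FFF) {
        snprintf(name, size, "CJK UNIFIED IDEOGRAPH-%X", (unsigned int)codepoint);

        return true;
    }

    // Not a generated name: it has to be read from the database.
    return false;
}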
But my database was still pretty big: 2,486,272 bytes
Prefix compression
What is prefix compression? Simple: when two strings share the same prefix, there is no need to save it twice.
Example:
LATIN SMALL LETTER C WITH CEDILLA
LATIN SMALL LETTER E WITH GRAVE
There are many previous studies on how to store this, because a filesystem typically has many files that start with the same string. But the Google File System paper is the one that caught my attention: https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf.
- Sort the strings
- Identify a common prefix
- Store only the differing parts
- Save the length of the prefix shared with the previous string
Let’s take a real example. We sort some strings found in the file:
0. LATIN SMALL LETTER C WITH CEDILLA
1. LATIN SMALL LETTER E WITH ACUTE
2. LATIN SMALL LETTER E WITH CIRCUMFLEX
3. LATIN SMALL LETTER E WITH GRAVE
We calculate the prefix shared with the previous string and store its length.
0. 0 | LATIN SMALL LETTER C WITH CEDILLA
1. 19 | E WITH ACUTE
2. 26 | CIRCUMFLEX
3. 26 | GRAVE
So the second string is “the substring of length 19 of the string at the previous position (0), concatenated with the string stored at position 1”
"LATIN SMALL LETTER " + "E WITH ACUTE"
"LATIN SMALL LETTER E WITH ACUTE"
The first string is stored in full. The others refer to the previous one. Let’s imagine that SQLite uses only 8 bits for that integer. The savings will then be 19 + 26 + 26 - 1 - 1 - 1 = 68 bytes. This is what Google uses with GFS, SSTables, and Bigtable, and it’s a simple, performant way to save space.
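Here is a minimal sketch of that encoding step in C, assuming the names are already sorted. The function names are mine, not part of Mojibake; a real encoder would write the pairs to storage instead of printing them.

#include <stddef.h>
#include <stdio.h>

// Length of the prefix shared by two strings.
static size_t common_prefix_length(const char *a, const char *b) {
    size_t i = 0;

    while(a[i] != '\0' && a[i] == b[i]) {
        i++;
    }

    return i;
}

// Encode each sorted name as "<shared prefix length> | <suffix>".
// The first name shares nothing, so it is emitted in full.
static void encode(const char **sorted_names, size_t count) {
    for(size_t i = 0; i < count; i++) {
        size_t shared = i == 0 ? 0 : common_prefix_length(sorted_names[i - 1], sorted_names[i]);

        printf("%zu. %zu | %s\n", i, shared, sorted_names[i] + shared);
    }
}

Running encode on the four sorted names above reproduces the table: prefix lengths 0, 19, 26, 26.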
Is this the best approach for Mojibake? No. In Mojibake you can access a random entry of the database, for example by using the WASM magic: https://mojibake.zaerl.com/?function=mjb_codepoint_character&codepoint=128
Since each entry depends on the previous entry’s prefix, you must decompress sequentially from the start up to the requested position, which is a waste of time.
Position 0: Direct access ✓
Position 1: Rebuild from 0 → 1
Position 2: Rebuild from 0 → 1 → 2
Position N: Rebuild from 0 → 1 → 2 → ... → N ✗ Slow!
How do SSTables solve this when there are millions of rows? They simply store full keys at regular intervals, so there is no need to read N rows. Let’s imagine an “interval” of three.
LATIN SMALL LETTER C WITH CEDILLA    // Checkpoint
LATIN SMALL LETTER E WITH ACUTE      // 1. 19 | E WITH ACUTE
LATIN SMALL LETTER E WITH CIRCUMFLEX // 2. 26 | CIRCUMFLEX
LATIN SMALL LETTER E WITH GRAVE      // Checkpoint
...
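A sketch of how lookup could work with checkpoints, under the assumption that entries are stored as (shared prefix length, suffix) pairs and every third entry is a full name. The struct and the decode function are illustrative, not how SSTables or Mojibake are actually implemented.

#include <stddef.h>
#include <stdio.h>

#define CHECKPOINT_INTERVAL 3

// One compressed entry: the length of the prefix shared with the previous
// entry and the remaining suffix. Checkpoint entries have prefix_length 0.
typedef struct {
    size_t prefix_length;
    const char *suffix;
} entry_t;

// Decode entry `index` into `out` (assumed large enough): start from the
// nearest checkpoint at or before it and re-apply the suffixes forward.
// At most CHECKPOINT_INTERVAL - 1 extra entries are read, instead of `index`.
static void decode(const entry_t *entries, size_t index, char *out, size_t size) {
    size_t checkpoint = index - index % CHECKPOINT_INTERVAL;

    for(size_t i = checkpoint; i <= index; i++) {
        snprintf(out + entries[i].prefix_length, size - entries[i].prefix_length, "%s", entries[i].suffix);
    }
}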
Mojibake saves data once and reads it many times. So I decided to use a “prefix dictionary table” approach.
CREATE TABLE IF NOT EXISTS prefixes (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS unicode_data (
    codepoint INTEGER PRIMARY KEY,
    -- multiple columns
    prefix INTEGER -- a foreign key to the prefixes table
);
The new query is slightly more complex:
SELECT
    u.codepoint,
    CASE WHEN p.name IS NOT NULL THEN p.name || u.name ELSE u.name END AS name,
    -- multiple columns
FROM unicode_data u LEFT JOIN prefixes p ON u.prefix = p.id WHERE u.codepoint = ?
In other words:
- I read common prefixes
- I create the prefixes table
- I save the prefix id on each unicode_data row
- I concatenate the two strings at runtime, if needed
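As a sketch, this is roughly how the lookup could be run through the SQLite C API, reduced to the name column only; the error handling is minimal and the function is mine, not part of the Mojibake API.

#include <sqlite3.h>
#include <stdint.h>
#include <stdio.h>

// Sketch: fetch and print the (possibly prefix-compressed) name of a codepoint.
static int print_name(sqlite3 *db, uint32_t codepoint) {
    const char *sql =
        "SELECT CASE WHEN p.name IS NOT NULL THEN p.name || u.name ELSE u.name END "
        "FROM unicode_data u LEFT JOIN prefixes p ON u.prefix = p.id "
        "WHERE u.codepoint = ?";
    sqlite3_stmt *stmt = NULL;
    int rc = sqlite3_prepare_v2(db, sql, -1, &stmt, NULL);

    if(rc != SQLITE_OK) {
        return rc;
    }

    sqlite3_bind_int(stmt, 1, (int)codepoint);

    if(sqlite3_step(stmt) == SQLITE_ROW) {
        printf("%s\n", (const char *)sqlite3_column_text(stmt, 0));
    }

    return sqlite3_finalize(stmt);
}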
On finding the best set of prefixes to save
This is the most complex thing to do. When does saving a prefix actually save space? It depends on how much space SQLite needs for:
- A new row in the prefixes table (plus its index)
- An integer in the unicode_data table (instead of a NULL)
Does this take less space? The only way to be sure it does is to… try. SQLite is very, very good when it comes to optimization. Let’s see what I thought was the best approach, in the new PrefixCompressor class.
Keep in mind that the algorithm is not meant to be run multiple times, and so it is not optimized for space or speed.
First step: for every Unicode character name (~53k names) of length N, we save a map entry for every prefix (a candidate) longer than P (4 by default). That is O(N²) time and space per name:
| String length N | Calculation | Total characters |
|---|---|---|
| N = 5 | 5 | 5 |
| N = 7 | 5 + 6 + 7 | 18 |
| N = 10 | 5 + 6 + … + 10 | 45 (!) |
That is a lot of RAM, but as I said, it’s not a problem. In the Unicode 17 UnicodeData.txt file the name column totals 405,878 bytes, and the mean name length is ~11. For every candidate prefix we save a set of usage information.
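A sketch of that first step, assuming P = 4. It only prints the candidates; the real PrefixCompressor records each of them in a map together with its usage statistics.

#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MIN_PREFIX_LENGTH 4

// First step, sketched: every prefix longer than the minimum length is a
// candidate. For a name of length N this touches 5 + 6 + ... + N characters,
// hence the O(N^2) cost per name. A real implementation would store each
// candidate in a map with usage counters instead of printing it.
static void collect_candidates(const char **names, size_t count) {
    for(size_t i = 0; i < count; i++) {
        size_t length = strlen(names[i]);

        for(size_t p = MIN_PREFIX_LENGTH + 1; p <= length; p++) {
            printf("%.*s\n", (int)p, names[i]);
        }
    }
}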
We then greedily select the prefixes, calculating the benefit of adding each one. Finally, we sort the selected prefixes by length, descending, for greedy matching (longest first).
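The benefit test can be sketched like this. The two overhead constants are illustrative guesses, not SQLite’s actual per-row and per-integer costs, which is exactly why the only reliable check is to build the database and measure it.

#include <stddef.h>

// Illustrative guesses, not SQLite's real costs: bytes for a new row in the
// prefixes table (plus index) and for the integer foreign key stored in
// unicode_data instead of a NULL.
#define PREFIX_ROW_OVERHEAD 12
#define FOREIGN_KEY_OVERHEAD 2

// Estimated bytes saved by storing a prefix of `length` bytes shared by
// `uses` names: every use drops the prefix from the name but pays for the
// foreign key, and the prefix itself is stored once. Negative: not worth it.
static long benefit(size_t length, size_t uses) {
    return (long)(uses * length)
        - (long)(uses * FOREIGN_KEY_OVERHEAD)
        - (long)(length + PREFIX_ROW_OVERHEAD);
}

With these made-up overheads, “LATIN SMALL LETTER ” (19 bytes) used by 100 names would save 19 × 100 - 2 × 100 - 31 = 1,669 bytes.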
Conclusion
| Metric | Value |
|---|---|
| Original size | 2,486,272 bytes (~2.37 MB) |
| New size | 1,925,120 bytes (~1.84 MB) |
| Space saved | 561,152 bytes (~548 KB) |
| Reduction | 22.6% |
Was it worth it? Yeah, it was fun.