JSON parser in 64h: 03/64, Unicode parse tests

Welcome, this is the third episode of Write a JSON parser in 64 hours. My idea of creating new code is: “make it testable, think about a test, write the code that will pass that test”.

Once you start writing tests, time flies: this third hour felt like it was over after ten minutes.

About unit tests

I needed to test what was happening under the hood. A JSON parser loops over N bytes that become M codepoints, with M <= N.

For example, the 🌍 emoji EARTH GLOBE EUROPE-AFRICA U+1F30D is encoded as:

F0 9F 8C 8D in UTF-8
3C D8 0D DF in UTF-16LE (two 16-bit code units, a surrogate pair)
D8 3C DF 0D in UTF-16BE
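
To see where the two UTF-16 code units come from, here is a minimal sketch of the surrogate pair arithmetic. The function name sketch_utf16_surrogates is made up for this illustration and is not part of the parser.

```c
#include <stdint.h>

/* Hypothetical helper (not from the repository): splits a codepoint
 * above U+FFFF into its UTF-16 high and low surrogates.
 * Returns 1 on success, 0 if the codepoint does not need a pair. */
static int sketch_utf16_surrogates(uint32_t codepoint, uint16_t *high, uint16_t *low)
{
    if (codepoint < 0x10000 || codepoint > 0x10FFFF) {
        return 0;
    }
    uint32_t offset = codepoint - 0x10000;           /* 20 significant bits */
    *high = (uint16_t)(0xD800 + (offset >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (offset & 0x3FF));   /* bottom 10 bits */
    return 1;
}
```

For U+1F30D the offset is 0xF30D, which gives the code units D83C and DF0D. UTF-16BE stores each unit high byte first, UTF-16LE low byte first.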

If you run strlen("🌍"); in C, you will get 4 if your text editor saves the file in UTF-8. But the length of this string in codepoints is 1, whatever its representation.
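
A minimal sketch of the difference, assuming the input is well-formed UTF-8: counting codepoints only requires skipping continuation bytes (those of the form 10xxxxxx). The name utf8_codepoint_count is mine, not the parser's.

```c
#include <stddef.h>

/* Hypothetical helper (not from the repository): counts the codepoints
 * of a NUL-terminated string, assuming it is well-formed UTF-8.
 * Every byte that is not a continuation byte starts a new codepoint. */
static size_t utf8_codepoint_count(const char *text)
{
    size_t count = 0;
    for (; *text != '\0'; text++) {
        if (((unsigned char)*text & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

Called on the four bytes F0 9F 8C 8D, it returns 1 where strlen returns 4.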

So I decided to have the parser call a function for every codepoint it decodes: a callback that lets a test hook into the process.

typedef void (*j128_codepoint_callback)(size_t index, size_t string_index, j128_codepoint codepoint);

These are the parameters:

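index: the zero-based byte position in the source string where the codepoint ends (its last byte).
string_index: the zero-based position of the codepoint among the codepoints parsed.
codepoint: the decoded codepoint value.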
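
Here is a minimal sketch of how such a callback could be driven and observed from a test. Only the callback typedef comes from the post. The uint32_t definition of j128_codepoint, the sketch_decode_utf8 function, and the sketch_record callback with its global arrays are assumptions made for illustration, not the repository's code. The decoder checks lead and continuation bytes but does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the post does not show this type; uint32_t is a guess. */
typedef uint32_t j128_codepoint;

/* Callback type as given in the post. */
typedef void (*j128_codepoint_callback)(size_t index, size_t string_index, j128_codepoint codepoint);

/* Hypothetical decoder: calls `callback` once per decoded codepoint.
 * Returns the number of codepoints, or (size_t)-1 on malformed input. */
static size_t sketch_decode_utf8(const unsigned char *bytes, size_t length,
                                 j128_codepoint_callback callback)
{
    size_t string_index = 0;
    size_t i = 0;
    while (i < length) {
        unsigned char lead = bytes[i];
        j128_codepoint codepoint;
        size_t extra; /* number of continuation bytes expected */
        if (lead < 0x80)                { codepoint = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { codepoint = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { codepoint = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { codepoint = lead & 0x07; extra = 3; }
        else return (size_t)-1;                       /* invalid lead byte */
        if (extra > length - i - 1) return (size_t)-1; /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char next = bytes[i + k];
            if ((next & 0xC0) != 0x80) return (size_t)-1;
            codepoint = (codepoint << 6) | (next & 0x3F);
        }
        i += extra; /* i is now the last byte of this codepoint */
        callback(i, string_index, codepoint);
        string_index++;
        i++;
    }
    return string_index;
}

/* Hypothetical test callback: records every call. The typedef has no
 * user-data parameter, so the sketch stores the calls in globals. */
#define SKETCH_MAX_CALLS 16
static size_t sketch_indexes[SKETCH_MAX_CALLS];
static size_t sketch_string_indexes[SKETCH_MAX_CALLS];
static j128_codepoint sketch_codepoints[SKETCH_MAX_CALLS];
static size_t sketch_call_count = 0;

static void sketch_record(size_t index, size_t string_index, j128_codepoint codepoint)
{
    if (sketch_call_count < SKETCH_MAX_CALLS) {
        sketch_indexes[sketch_call_count] = index;
        sketch_string_indexes[sketch_call_count] = string_index;
        sketch_codepoints[sketch_call_count] = codepoint;
        sketch_call_count++;
    }
}
```

A test passes sketch_record to the decoder and then compares the recorded arrays against the expected calls.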

An example string

Let’s take what seems like an innocent string: €uro ¥en. This string has 8 codepoints and 11 bytes.

Codepoint name	Codepoint	Byte
		0xE2
		0x82
U+20AC EURO SIGN	€	0xE2 0x82 0xAC
U+0075 LATIN SMALL LETTER U	u	0x75
U+0072 LATIN SMALL LETTER R	r	0x72
U+006F LATIN SMALL LETTER O	o	0x6F
U+0020 SPACE		0x20
		0xC2
U+00A5 YEN SIGN	¥	0xC2 0xA5
U+0065 LATIN SMALL LETTER E	e	0x65
U+006E LATIN SMALL LETTER N	n	0x6E

The function is called 8 times with these values:

Index	String index	Codepoint
2	0	0x20AC
3	1	0x75
4	2	0x72
5	3	0x6F
6	4	0x20
8	5	0xA5
9	6	0x65
10	7	0x6E

The third hour is finished!

Time flies. You can see the third version of the repository here.

Until next time!
