JSON parser in 64h: 03/64, Unicode parse tests

Welcome, this is the third episode of Write a JSON parser in 64 hours. My idea of creating new code is: “make it testable, think about a test, write the code that will pass that test”.

Once you start working on tests, time flies: this third hour felt like it was over after ten minutes.

About unit tests

I needed to test what was happening under the hood. A JSON parser loops over N bytes that become M codepoints, with M <= N.

For example, the 🌍 emoji (EARTH GLOBE EUROPE-AFRICA, U+1F30D) is encoded as:

  1. F0 9F 8C 8D in UTF-8
  2. 3CD8 0DDF in UTF-16LE (two 16-bit code units)
  3. D83C DF0D in UTF-16BE
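
The byte sequences above can be reproduced with a few shifts and masks. What follows is a minimal sketch, not code from the j128 repository: the names encode_utf8 and encode_utf16 are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one codepoint as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if the value is a
   surrogate or lies above U+10FFFF. */
size_t encode_utf8(uint32_t cp, uint8_t out[4]) {
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0; /* surrogates are not valid scalar values */
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: encode one codepoint as UTF-16 code units.
   Returns 1 or 2 (a surrogate pair), or 0 if the value is invalid. */
size_t encode_utf16(uint32_t cp, uint16_t out[2]) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For U+1F30D, encode_utf16 yields the code units D83C and DF0D. Writing each unit most significant byte first gives the UTF-16BE bytes, and swapping the two bytes of each unit gives the UTF-16LE bytes.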

If you run strlen("🌍"); in C you will get 4 if your text editor saves the file as UTF-8. But the length of this string in codepoints is 1, whatever its representation.
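
To make the difference concrete, here is a small sketch that counts codepoints instead of bytes. The function name utf8_codepoint_count is hypothetical, and the sketch assumes the input is already valid UTF-8: it simply skips continuation bytes, which all have the bit pattern 10xxxxxx.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the codepoints in a NUL-terminated string.
   Assumes valid UTF-8, so every byte that is not a continuation byte
   (10xxxxxx) starts a new codepoint. */
size_t utf8_codepoint_count(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

With the emoji written as explicit bytes, "\xF0\x9F\x8C\x8D", strlen reports 4 while utf8_codepoint_count reports 1.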

So I decided to add a callback function that is called for every codepoint, so that a test can hook into the parsing process.

typedef void (*j128_codepoint_callback)(size_t index, size_t string_index, j128_codepoint codepoint);

These are the parameters:

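  1. index: the byte position in the source string; for a multi-byte codepoint this is the position of its last byte.
  2. string_index: the index of the codepoint just parsed (the nth codepoint).
  3. codepoint: the decoded codepoint value.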
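
A test can pass a callback that simply records what it receives. The following is only a sketch: the definition of j128_codepoint is not shown in this post, so uint32_t is an assumption, and the names recorded_call, calls, call_count and record_codepoint are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the real j128_codepoint type is not shown here,
   uint32_t is used as a stand-in. */
typedef uint32_t j128_codepoint;

typedef void (*j128_codepoint_callback)(size_t index, size_t string_index, j128_codepoint codepoint);

/* Hypothetical test helper: one recorded invocation of the callback. */
typedef struct {
    size_t index;
    size_t string_index;
    j128_codepoint codepoint;
} recorded_call;

#define MAX_CALLS 32

static recorded_call calls[MAX_CALLS];
static size_t call_count = 0;

/* Matches j128_codepoint_callback: stores every call so that a test
   can inspect the sequence afterwards. Extra calls are dropped. */
static void record_codepoint(size_t index, size_t string_index, j128_codepoint codepoint) {
    if (call_count < MAX_CALLS) {
        calls[call_count].index = index;
        calls[call_count].string_index = string_index;
        calls[call_count].codepoint = codepoint;
        call_count++;
    }
}
```

Recording into a fixed array keeps the helper free of allocations, which is usually enough for short test strings.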

An example string

Let’s take what seems like an innocent string: €uro ¥en. This string has 8 codepoints and 11 bytes.

| Codepoint name | Codepoint | Bytes |
|---|---|---|
| | | 0xE2 |
| | | 0x82 |
| U+20AC EURO SIGN | € | 0xE2 0x82 0xAC |
| U+0075 LATIN SMALL LETTER U | u | 0x75 |
| U+0072 LATIN SMALL LETTER R | r | 0x72 |
| U+006F LATIN SMALL LETTER O | o | 0x6F |
| U+0020 SPACE | (space) | 0x20 |
| | | 0xC2 |
| U+00A5 YEN SIGN | ¥ | 0xC2 0xA5 |
| U+0065 LATIN SMALL LETTER E | e | 0x65 |
| U+006E LATIN SMALL LETTER N | n | 0x6E |

The function is called 8 times with these values:

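| Index | String index | Codepoint |
|---|---|---|
| 2 | 0 | 0x20AC |
| 3 | 1 | 0x75 |
| 4 | 2 | 0x72 |
| 5 | 3 | 0x6F |
| 6 | 4 | 0x20 |
| 8 | 5 | 0xA5 |
| 9 | 6 | 0x65 |
| 10 | 7 | 0x6E |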
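
The eight calls above can be reproduced with a small decoding loop. This is a sketch under stated assumptions, not the parser from the repository: walk_utf8, record, seen_index, seen_codepoint and seen_count are hypothetical names, j128_codepoint is assumed to be uint32_t, and the loop only checks the shape of each sequence (it does not reject overlong forms or surrogates).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t j128_codepoint; /* assumption: the real type is not shown */
typedef void (*j128_codepoint_callback)(size_t index, size_t string_index, j128_codepoint codepoint);

/* Hypothetical recorder used by the test. */
static size_t seen_index[16];
static j128_codepoint seen_codepoint[16];
static size_t seen_count = 0;

static void record(size_t index, size_t string_index, j128_codepoint codepoint) {
    (void)string_index; /* equals seen_count when calls arrive in order */
    if (seen_count < 16) {
        seen_index[seen_count] = index;
        seen_codepoint[seen_count] = codepoint;
        seen_count++;
    }
}

/* Hypothetical decoder: walks the bytes and calls the callback once per
   codepoint, passing the offset of the codepoint's last byte as index.
   Returns the number of codepoints, or (size_t)-1 on malformed input. */
size_t walk_utf8(const unsigned char *bytes, size_t length, j128_codepoint_callback callback) {
    size_t string_index = 0;
    size_t i = 0;

    while (i < length) {
        unsigned char lead = bytes[i];
        j128_codepoint cp;
        size_t extra; /* number of continuation bytes that must follow */

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return (size_t)-1; /* stray continuation or invalid lead byte */

        if (extra > length - i - 1) {
            return (size_t)-1; /* sequence is cut off by the end of input */
        }
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = bytes[i + k];
            if ((c & 0xC0) != 0x80) {
                return (size_t)-1; /* not a continuation byte */
            }
            cp = (cp << 6) | (c & 0x3F);
        }

        i += extra; /* i is now the offset of the codepoint's last byte */
        callback(i, string_index, cp);
        string_index++;
        i++;
    }
    return string_index;
}
```

Feeding it the 11 bytes of €uro ¥en produces 8 calls. The index jumps from 6 to 8 because the first byte of ¥ (0xC2, at offset 7) does not complete a codepoint on its own.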

The third hour is finished!

Time flies. You can see the third version of the repository here.

Until next time!

@online{zaerl2025-json-parser-in-64h-03-64-unicode-parse-tests,
  author = {Francesco Bigiarini},
  title = {JSON parser in 64h: 03/64, Unicode parse tests},
  date = {2025-02-28},
  url = {https://zaerl.com/2025/02/28/json-parser-in-64h-03-64-unicode-parse-tests/},
  urldate = {2025-02-28}
}