Welcome, this is the third episode of Write a JSON parser in 64 hours. My idea of creating new code is: “make it testable, think about a test, write the code that will pass that test”.
Once you start working on tests, time flies: this third hour felt like it was over after ten minutes.
About unit tests
I needed to test what was happening under the hood. A JSON parser loops over N bytes that become M codepoints, with M <= N.
For example the 🌍 emoji EARTH GLOBE EUROPE-AFRICA U+1F30D:
- F0 9F 8C 8D in UTF-8
- 3CD8 0DDF in UTF-16LE (two 16-bit units)
- D83C DF0D in UTF-16BE
If you run strlen("🌍"); in C you will get 4, provided your text editor saves the file in UTF-8. But the length of this string in codepoints is 1, whatever its byte representation.
So I decided to have the parser call a function for every codepoint it decodes. This callback is the hook that lets a test observe the process.
typedef void (*j128_codepoint_callback)(size_t index, size_t string_index, j128_codepoint codepoint);
These are the parameters:
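- index: the position in the source string, in bytes. For a multi-byte codepoint this is the position of its last byte, as the example table later in this post shows.
- string_index: the nth codepoint parsed.
- codepoint: the codepoint itself.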
An example string
Let’s take what seems like an innocent string: €uro ¥en. This string has 8 codepoints and 11 bytes.
| Codepoint name | Codepoint | Bytes |
| --- | --- | --- |
| U+20AC EURO SIGN | € | 0xE2 0x82 0xAC |
| U+0075 LATIN SMALL LETTER U | u | 0x75 |
| U+0072 LATIN SMALL LETTER R | r | 0x72 |
| U+006F LATIN SMALL LETTER O | o | 0x6F |
| U+0020 SPACE | (space) | 0x20 |
| U+00A5 YEN SIGN | ¥ | 0xC2 0xA5 |
| U+0065 LATIN SMALL LETTER E | e | 0x65 |
| U+006E LATIN SMALL LETTER N | n | 0x6E |
The function is called 8 times with these values:
| Index | String index | Codepoint |
| --- | --- | --- |
| 2 | 0 | 0x20AC |
| 3 | 1 | 0x75 |
| 4 | 2 | 0x72 |
| 5 | 3 | 0x6F |
| 6 | 4 | 0x20 |
| 8 | 5 | 0xA5 |
| 9 | 6 | 0x65 |
| 10 | 7 | 0x6E |
The third hour is finished!
Time flies. You can see the third version of the repository here.
Until next time!