Let’s take a break from the Unicode world and talk about JSON. There are many parsers in the wild, each with a different approach. I will use the Lemon Parser Generator rather than one of the better-known tools.
Lemon is an LALR(1) parser generator for C. It does the same job as “bison” and “yacc”
…
In yacc and bison, the parser calls the tokenizer. In Lemon, the tokenizer calls the parser.
Lemon does need a tokenizer to feed it data. I could use re2c or a similar tool for this, but I will start easy and write a very basic tokenizer myself.
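To make the direction of control concrete, here is a minimal sketch of the push style, where the tokenizer owns the loop and hands each token to the parser. It does not use Lemon at all: toy_parser, toy_parser_push and toy_tokenize are hypothetical stand-ins that only track square-bracket nesting. With Lemon, the generated Parse function would play the role of toy_parser_push, and a token value of 0 signals the end of the input in the same way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token codes for this toy; 0 marks the end of the input. */
enum { TOY_END = 0, TOY_OPEN = 1, TOY_CLOSE = 2 };

/* Stand-in for the parser state that a generated parser would keep. */
typedef struct {
    int depth;     /* current bracket nesting */
    bool failed;   /* a token arrived that could not be accepted */
    bool accepted; /* end of input reached with balanced brackets */
} toy_parser;

static void toy_parser_init(toy_parser *parser)
{
    assert(parser != NULL);
    parser->depth = 0;
    parser->failed = false;
    parser->accepted = false;
}

/* The parser never asks for input: it is handed one token per call. */
static void toy_parser_push(toy_parser *parser, int token)
{
    assert(parser != NULL);
    if (parser->failed || parser->accepted) {
        return;
    }
    switch (token) {
    case TOY_OPEN:
        parser->depth++;
        break;
    case TOY_CLOSE:
        if (parser->depth == 0) {
            parser->failed = true;
        } else {
            parser->depth--;
        }
        break;
    case TOY_END:
        if (parser->depth == 0) {
            parser->accepted = true;
        } else {
            parser->failed = true;
        }
        break;
    default:
        parser->failed = true;
        break;
    }
}

/* The tokenizer drives: it scans the text and calls the parser. */
static bool toy_tokenize(const char *text, toy_parser *parser)
{
    for (const char *c = text; *c != '\0'; c++) {
        if (*c == '[') {
            toy_parser_push(parser, TOY_OPEN);
        } else if (*c == ']') {
            toy_parser_push(parser, TOY_CLOSE);
        }
        /* Every other character is ignored in this toy. */
    }
    toy_parser_push(parser, TOY_END);
    return parser->accepted;
}
```

After toy_parser_init, calling toy_tokenize("[[ ][ ]]", &parser) accepts the input, while "[[]" is rejected because the end-of-input token arrives with one bracket still open.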
These whitespace code points can be skipped:
0x0020: // SPACE
0x0009: // CHARACTER TABULATION
0x000A: // LINE FEED (LF)
0x000D: // CARRIAGE RETURN (CR)
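As a minimal sketch, the skip list can be written as a predicate plus a small loop. The names is_skippable and skip_whitespace are hypothetical, and the input is assumed to be an array of already-decoded code points stored as uint32_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for the four code points that can be skipped. */
static bool is_skippable(uint32_t codepoint)
{
    switch (codepoint) {
    case 0x0020: /* SPACE */
    case 0x0009: /* CHARACTER TABULATION */
    case 0x000A: /* LINE FEED (LF) */
    case 0x000D: /* CARRIAGE RETURN (CR) */
        return true;
    default:
        return false;
    }
}

/* Returns the index of the first code point at or after start that is
   not skippable, or length when only skippable code points remain. */
static size_t skip_whitespace(const uint32_t *codepoints, size_t length, size_t start)
{
    assert(codepoints != NULL || length == 0);
    size_t i = start;
    while (i < length && is_skippable(codepoints[i])) {
        i++;
    }
    return i;
}
```

Only these four code points are treated as skippable here; something like 0x00A0 NO-BREAK SPACE is left for the tokenizer to handle.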
String literals
Parsing JSON is a Minefield mentions this (ECMA-262, 7.8.4 String Literals):
All characters may appear literally in a string literal except for the closing quote character, backslash, carriage return, line separator, paragraph separator, and line feed
The latest specification says this:
All code points may appear literally in a string literal except for the closing quote code points, U+005C (REVERSE SOLIDUS), U+000D (CARRIAGE RETURN), and U+000A (LINE FEED)
Does this mean that 0x2028 LINE SEPARATOR and 0x2029 PARAGRAPH SEPARATOR are now accepted?
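Putting the two quoted rules side by side as predicates makes the difference easier to see. This is only a sketch of the wording quoted above, with hypothetical function names; it encodes the ECMAScript string literal rules, not the string rules a JSON tokenizer has to enforce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Older wording: also excludes the line and paragraph separators. */
static bool literal_allowed_old(uint32_t codepoint, uint32_t closing_quote)
{
    assert(closing_quote == 0x0022 || closing_quote == 0x0027);
    return codepoint != closing_quote
        && codepoint != 0x005C  /* REVERSE SOLIDUS */
        && codepoint != 0x000D  /* CARRIAGE RETURN */
        && codepoint != 0x000A  /* LINE FEED */
        && codepoint != 0x2028  /* LINE SEPARATOR */
        && codepoint != 0x2029; /* PARAGRAPH SEPARATOR */
}

/* Newer wording: only the closing quote, U+005C, U+000D and U+000A. */
static bool literal_allowed_new(uint32_t codepoint, uint32_t closing_quote)
{
    assert(closing_quote == 0x0022 || closing_quote == 0x0027);
    return codepoint != closing_quote
        && codepoint != 0x005C  /* REVERSE SOLIDUS */
        && codepoint != 0x000D  /* CARRIAGE RETURN */
        && codepoint != 0x000A; /* LINE FEED */
}
```

The two functions disagree only on 0x2028 and 0x2029; 0x000A is rejected by both.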
A basic tokenizer
I started with a very basic tokenizer that skips 0x0020, 0x0009, 0x000A and 0x000D, and accepts {, }, [, ], : and ,.
The callback now accepts the token as well:
typedef void (*j128_tokenizer_callback)(size_t index, size_t string_index, j128_codepoint codepoint, j128_token token);
I created an enum for the token values. I know that Lemon generates a list of #defines for the various tokens, but the enum is needed for the tests.
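Here is a minimal sketch of how such a tokenizer and a test could fit together. Only the callback signature comes from the code above; the enum members, the uint32_t definition of j128_codepoint, and the functions classify, tokenize and record_token are assumptions made for illustration. The input is taken to be already-decoded code points, so index and string_index are simply given the same value here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t j128_codepoint; /* assumption: the real type may differ */

/* Hypothetical token names; Lemon generates its own #defines. */
typedef enum {
    J128_TOKEN_UNKNOWN = 0,
    J128_TOKEN_LEFT_BRACE,    /* { */
    J128_TOKEN_RIGHT_BRACE,   /* } */
    J128_TOKEN_LEFT_BRACKET,  /* [ */
    J128_TOKEN_RIGHT_BRACKET, /* ] */
    J128_TOKEN_COLON,         /* : */
    J128_TOKEN_COMMA          /* , */
} j128_token;

typedef void (*j128_tokenizer_callback)(size_t index, size_t string_index,
                                        j128_codepoint codepoint, j128_token token);

/* Maps a single code point to a token value. */
static j128_token classify(j128_codepoint codepoint)
{
    switch (codepoint) {
    case '{': return J128_TOKEN_LEFT_BRACE;
    case '}': return J128_TOKEN_RIGHT_BRACE;
    case '[': return J128_TOKEN_LEFT_BRACKET;
    case ']': return J128_TOKEN_RIGHT_BRACKET;
    case ':': return J128_TOKEN_COLON;
    case ',': return J128_TOKEN_COMMA;
    default:  return J128_TOKEN_UNKNOWN;
    }
}

/* Skips the four whitespace code points and reports everything else. */
static void tokenize(const j128_codepoint *codepoints, size_t length,
                     j128_tokenizer_callback callback)
{
    assert(callback != NULL);
    for (size_t i = 0; i < length; i++) {
        j128_codepoint cp = codepoints[i];
        if (cp == 0x0020 || cp == 0x0009 || cp == 0x000A || cp == 0x000D) {
            continue;
        }
        callback(i, i, cp, classify(cp));
    }
}

/* Recording callback of the kind a test could use. */
#define MAX_RECORDED 16
static j128_token recorded[MAX_RECORDED];
static size_t recorded_count;

static void record_token(size_t index, size_t string_index,
                         j128_codepoint codepoint, j128_token token)
{
    (void)index;
    (void)string_index;
    (void)codepoint;
    if (recorded_count < MAX_RECORDED) {
        recorded[recorded_count++] = token;
    }
}
```

A test can reset recorded_count, call tokenize with record_token, and then compare the recorded array against the expected enum values, without depending on the header that Lemon generates.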
Add Lemon
The lemon.c and lempar.c files can be downloaded from the SQLite site, without installing anything. Adding a CMake rule is very simple:
set(LEMON_SOURCES
lemon.c
)
add_executable(lemon ${LEMON_SOURCES})
target_compile_options(lemon PRIVATE -Wno-strict-prototypes)
I needed to silence one warning on top of the main options -Wall -Wextra -pedantic -Wno-unused-parameter: some functions in lemon.c are declared with an empty parameter list instead of (void).
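For illustration, this is the kind of declaration that the strict-prototypes warning is about. The function names are hypothetical and are not taken from lemon.c. Before C23, empty parentheses in a declaration say nothing about the parameters, whereas (void) states that there are none.

```c
/* Empty parentheses: before C23 this is not a prototype, so a compiler
   running with -Wstrict-prototypes reports it. */
static int old_style();

/* (void) makes it a real prototype: the function takes no arguments. */
static int strict_style(void);

static int old_style()
{
    return 1;
}

static int strict_style(void)
{
    return 2;
}
```

Both functions compile and behave the same way; the difference is only in how strictly the compiler can check the calls.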
The fourth hour is finished!
Time flies. You can see the third version of the repository here.
Until next time!