Add a language construct to PHP, part 1

These days, I’m studying the PHP core to see the various techniques used. This is part 1 of our journey, where we will add a patch to an existing construct. Approaching such a big project takes work.

Adding a new function is pretty simple. Here is a quick example written in C taken from the Zend Engine official documentation:

PHP_FUNCTION(test_scale)
{
    double x;

    ZEND_PARSE_PARAMETERS_START(1, 1)
        Z_PARAM_DOUBLE(x)
    ZEND_PARSE_PARAMETERS_END();

    RETURN_DOUBLE(x * 2);
}

The PHP Internals Book is a great place to start. You can read it to understand how things work under the hood.

But how about adding a new language construct?

First of all, what are we talking about? Simple: something like echo. From the documentation:

echo is not a function but a language construct. Its arguments are a list of expressions following the echo keyword, separated by commas, and not delimited by parentheses. Unlike some other language constructs, echo does not have any return value, so it cannot be used in the context of an expression.

Let's start from the very beginning before digging too deep.

Make it start (on macOS)

I forked the repository at https://github.com/php/php-src to https://github.com/zaerl/php-src and checked whether I could compile it.

./buildconf
./configure --enable-debug

configure: error: Please specify the install prefix of iconv with --with-iconv=

Ok, not a problem. Install libiconv and proceed.

./configure --enable-debug --with-iconv=/opt/homebrew/opt/libiconv/

Everything went smoothly. Next, I checked how many cores my computer has:

sysctl -n hw.ncpu
14

Now, let’s do it with the power of parallel processing.

make -j14
Build complete.
Don't forget to run 'make test'.

I then checked that the freshly built binary actually runs:

./sapi/cli/php --version
PHP 8.4.0-dev (cli) (built: May 13 2024 09:03:04) (NTS DEBUG)
Copyright (c) The PHP Group
Zend Engine v4.4.0-dev, Copyright (c) Zend Technologies

Now let's check whether the test suite passes.

make TEST_PHP_ARGS=-j14 test

18,915 functional tests later, I got just one failing:

DOMNode::isEqualNode() [ext/dom/tests/DOMNode_isEqualNode.phpt]

I don’t think this is much of a problem. So let’s consider this a success and move on.

A new construct, but what?

PHP already has a lot of operators and a lot of reserved keywords. So let’s choose one that, in order of importance:

  1. Does not break preexisting tests.
  2. Is similar to a preexisting one. I chose echo.
  3. Isn’t too complicated.
  4. Serves some purpose and is not a joke.

So, I decided to create a new one called always. It is a construct that checks if a value is true and, if it is not, throws an exception.

Let’s start slow

PHP uses the Zend Engine under the hood. Its lexer is generated with re2c and its parser with Bison. Adding a new construct means updating:

  1. The lexer.
  2. The parser.
  3. The compiler.
  4. The virtual machine.
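To make the four stages concrete, here is a minimal sketch of the same pipeline for a toy language whose only statement is echo "text";. Every type and function name in it (toy_lex, toy_compile, toy_execute and so on) is invented for illustration and does not exist in php-src. The parser and compiler are fused into one function and there is no AST, which is a simplification compared to what the Zend Engine does.

```c
#include <assert.h>
#include <string.h>

/* Toy names throughout: none of these types or functions exist in php-src. */
typedef enum { TOK_ECHO, TOK_STRING, TOK_SEMI, TOK_END, TOK_ERROR } toy_token;
typedef struct { const char *src; size_t pos; char text[64]; } toy_lexer;
typedef enum { OP_ECHO, OP_HALT } toy_opcode;
typedef struct { toy_opcode op; char operand[64]; } toy_op;

/* 1. Lexer: turn characters into tokens (no word-boundary check, to stay short). */
toy_token toy_lex(toy_lexer *lx)
{
    while (lx->src[lx->pos] == ' ') lx->pos++;
    const char *p = lx->src + lx->pos;

    if (*p == '\0') return TOK_END;
    if (*p == ';') { lx->pos++; return TOK_SEMI; }
    if (strncmp(p, "echo", 4) == 0) { lx->pos += 4; return TOK_ECHO; }
    if (*p == '"') {
        const char *end = strchr(p + 1, '"');
        if (end == NULL || (size_t)(end - p - 1) >= sizeof lx->text) return TOK_ERROR;
        size_t len = (size_t)(end - p - 1);
        memcpy(lx->text, p + 1, len);
        lx->text[len] = '\0';
        lx->pos += len + 2; /* skip the content and both quotes */
        return TOK_STRING;
    }
    return TOK_ERROR;
}

/* 2 + 3. Parser and compiler: accept `echo STRING ;` and emit one op per statement.
 * Returns the number of ops written (including the final OP_HALT), or -1 on error. */
int toy_compile(const char *src, toy_op *ops, int max_ops)
{
    toy_lexer lx = { src, 0, "" };
    int n = 0;

    if (max_ops < 1) return -1;
    for (;;) {
        toy_token t = toy_lex(&lx);
        if (t == TOK_END) break;
        if (t != TOK_ECHO || toy_lex(&lx) != TOK_STRING || n + 1 >= max_ops) return -1;
        ops[n].op = OP_ECHO;
        strcpy(ops[n].operand, lx.text);
        n++;
        if (toy_lex(&lx) != TOK_SEMI) return -1;
    }
    ops[n].op = OP_HALT;
    ops[n].operand[0] = '\0';
    return n + 1;
}

/* 4. Virtual machine: run the ops in order, collecting the output in `out`. */
void toy_execute(const toy_op *ops, char *out, size_t out_size)
{
    assert(out_size > 0);
    out[0] = '\0';
    for (const toy_op *op = ops; op->op != OP_HALT; op++) {
        strncat(out, op->operand, out_size - strlen(out) - 1);
    }
}
```

Compiling the source echo "Hello"; echo " World"; with toy_compile yields two OP_ECHO ops plus OP_HALT, and toy_execute then produces Hello World. A new construct has to be taught to each of these stages in turn.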

I immediately abandoned the idea of doing this in an extension, or even a Zend extension. I’m pretty sure it’s impossible to do all of this from there.

You can also hook into AST processing, which “allows you to modify the AST after it is parsed and created,” but that is outside the scope of this post. And not fun enough, IMHO.

Let's start hacking:

git checkout -b experimental/step-1

In this first part, we will make a small patch to echo to get comfortable with the codebase instead of jumping straight to the serious stuff.

We will add:

  1. A new php.ini configuration directive called echoln.
  2. A change to echo: if echoln is true, then echo appends a \n to the output.
  3. A functional test that ensures this is true.
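Before touching the engine, the intended behavior can be sketched in a few lines of standalone C. This is only a model of the plan above: toy_output, toy_write and toy_echo are hypothetical names, and the fixed buffer stands in for PHP's real output layer.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for PHP's output layer: appends to a fixed buffer. */
typedef struct { char data[256]; size_t len; } toy_output;

void toy_write(toy_output *out, const char *str, size_t len)
{
    assert(out->len + len < sizeof out->data);
    memcpy(out->data + out->len, str, len);
    out->len += len;
    out->data[out->len] = '\0';
}

/* echo with the proposed setting: write the value, then "\n" only when echoln is on. */
void toy_echo(toy_output *out, const char *str, bool echoln)
{
    size_t len = strlen(str);

    if (len != 0) {
        toy_write(out, str, len);
    }
    if (echoln) {
        toy_write(out, "\n", 1);
    }
}
```

With echoln off, toy_echo behaves like a plain write. With echoln on, every call ends with a newline, including a call with an empty string, because the newline is written independently of the length check.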

The lexer

The first step is to find where echo is declared and how. PHP, like many other interpreters, uses Lex/Yacc-style files to generate the code. In the Zend/zend_language_scanner.l lexer file, we find the token at line 1,532:

<ST_IN_SCRIPTING>"echo" {
	RETURN_TOKEN_WITH_IDENT(T_ECHO);
}

In the Zend/zend_language_parser.y parser file, the token is declared at line 121:

%token <ident> T_ECHO          "'echo'"

We will use T_ECHO as the source of truth in the code. The only syntax in the Yacc file that looked strange to me is the %precedence keyword. These are the ones I was used to:

  1. %left
  2. %right
  3. %nonassoc

From the Bison manual:

… sometimes, when trying to solve a conflict, precedence suffices. In such a case, using %left, %right, or %nonassoc might hide future (associativity related) conflicts that would remain hidden.

That makes sense. I like the idea.

The parser

Digging through the parser file brought me to this declaration at line 1,102:

echo_expr:
	expr { $$ = zend_ast_create(ZEND_AST_ECHO, $1); }
;

The statement is compiled at line 10,616 of the Zend/zend_compile.c file:

case ZEND_AST_ECHO:
	zend_compile_echo(ast);
	break;
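The two snippets above follow a common pattern: the grammar action builds an AST node tagged with a kind, and the compiler later switches on that kind. Here is a rough standalone sketch of the pattern. The toy_ast type and its helpers are made up for this post and are much simpler than the real zend_ast, which supports several children and many more kinds.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy AST: the kinds and helpers below are illustrative, not the Zend API. */
typedef enum { TOY_AST_STRING, TOY_AST_ECHO } toy_ast_kind;

typedef struct toy_ast {
    toy_ast_kind kind;
    const char *value;      /* used by TOY_AST_STRING */
    struct toy_ast *child;  /* used by TOY_AST_ECHO */
} toy_ast;

/* What a grammar action would call to build a node. */
toy_ast *toy_ast_create(toy_ast_kind kind, const char *value, toy_ast *child)
{
    toy_ast *node = malloc(sizeof *node);
    assert(node != NULL);
    node->kind = kind;
    node->value = value;
    node->child = child;
    return node;
}

void toy_ast_destroy(toy_ast *node)
{
    if (node == NULL) return;
    toy_ast_destroy(node->child);
    free(node);
}

/* Compiler step: dispatch on the node kind and write a textual "opcode" to `out`.
 * Returns 0 on success, -1 for a kind that is not a statement. */
int toy_compile_stmt(const toy_ast *ast, char *out, size_t out_size)
{
    switch (ast->kind) {
        case TOY_AST_ECHO:
            assert(ast->child != NULL && ast->child->kind == TOY_AST_STRING);
            snprintf(out, out_size, "ECHO \"%s\"", ast->child->value);
            return 0;
        default:
            return -1;
    }
}
```

Building a TOY_AST_ECHO node around a TOY_AST_STRING child and passing it to toy_compile_stmt yields the text ECHO "Hello World!" for the string Hello World!. Adding a new statement in this model means adding a new kind and a new case, which is the shape of the work ahead for always.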

The C function

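PHP does not interpret source text directly: it compiles .php files to opcodes, and with OPcache enabled it reuses the cached opcodes instead of recompiling the files for every request. The function that emits the ZEND_ECHO opcode is in the same file, at line 5,397: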

static void zend_compile_echo(zend_ast *ast) /* {{{ */
{
	zend_op *opline;
	zend_ast *expr_ast = ast->child[0];

	znode expr_node;
	zend_compile_expr(&expr_node, expr_ast);

	opline = zend_emit_op(NULL, ZEND_ECHO, &expr_node, NULL);
	opline->extended_value = 0;
}
/* }}} */

And finally, the VM handler that does the actual print, at line 1,679 of the Zend/zend_vm_def.h file:

ZEND_VM_HANDLER(136, ZEND_ECHO, CONST|TMPVAR|CV, ANY)
{
	USE_OPLINE
	zval *z;

	SAVE_OPLINE();
	z = GET_OP1_ZVAL_PTR_UNDEF(BP_VAR_R);

	if (Z_TYPE_P(z) == IS_STRING) {
		zend_string *str = Z_STR_P(z);

		if (ZSTR_LEN(str) != 0) {
			zend_write(ZSTR_VAL(str), ZSTR_LEN(str));
		}
	} else {
		zend_string *str = zval_get_string_func(z);

		if (ZSTR_LEN(str) != 0) {
			zend_write(ZSTR_VAL(str), ZSTR_LEN(str));
		} else if (OP1_TYPE == IS_CV && UNEXPECTED(Z_TYPE_P(z) == IS_UNDEF)) {
			ZVAL_UNDEFINED_OP1();
		}
		zend_string_release_ex(str, 0);
	}

	FREE_OP1();
	ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();
}

This handler does the following:

  1. If the operand z is already a string, it prints it with zend_write.
  2. Otherwise, it converts the value to a string with zval_get_string_func and prints the result.

It’s interesting to trace the infamous “Array to string conversion” warning from this handler. The echo handler uses zval_get_string_func and not zval_try_get_string_func. So, an array is never really converted to a string: a warning is raised and the ZEND_STR_ARRAY_CAPITALIZED (Array) constant is used instead.
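The conversion rule can be modeled with a small tagged value. This is a sketch under simplifying assumptions: toy_value and toy_get_string are hypothetical, only three types are covered, and the "warning" is reduced to a return value instead of going through the engine's error handling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy value with three of the many types a zval can hold. */
typedef enum { TOY_STRING, TOY_LONG, TOY_ARRAY } toy_type;
typedef struct { toy_type type; const char *str; long num; } toy_value;

/* Writes the string form of `v` into `buf`.
 * Returns the number of warnings that would be raised (0 or 1). */
int toy_get_string(const toy_value *v, char *buf, size_t size)
{
    switch (v->type) {
        case TOY_STRING:
            snprintf(buf, size, "%s", v->str);
            return 0;
        case TOY_LONG:
            snprintf(buf, size, "%ld", v->num);
            return 0;
        case TOY_ARRAY:
            /* No real conversion: a fixed placeholder plus a warning. */
            snprintf(buf, size, "Array");
            return 1;
    }
    return 0;
}
```

In this model, a long such as 42 becomes the string 42 with no warning, while any array becomes the fixed text Array together with one warning, which mirrors what echo prints for an array.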

The code is much cleaner than I thought it could have been, even with historical constructs like echo.

Add the new php.ini directive

Before changing the code of the ZEND_ECHO handler, we need to add a new php.ini configuration directive. Digging through the code, I made two changes:

  1. Added echoln to the PHP_INI section of main/main.c.
  2. Added echoln to the _php_core_globals struct in main/php_globals.h.
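The idea behind these two changes is that a directive is a named entry with a default value, bound to a field of a globals struct. The sketch below models that idea in standalone C. It is not the PHP API: in php-src, boolean directives are declared with macros such as STD_PHP_INI_BOOLEAN, while toy_ini_entry, toy_ini_register and toy_ini_set are invented names, and the set of accepted "true" spellings is an assumption made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Toy counterpart of the globals struct that gains an `echoln` field. */
typedef struct { bool echoln; } toy_core_globals;

/* One directive: its name, its default value, and the field it updates. */
typedef struct {
    const char *name;
    const char *default_value;
    bool *target;
} toy_ini_entry;

/* Assumed rule for this sketch: "1", "on", "true", "yes" (any case) mean true. */
bool toy_parse_bool(const char *value)
{
    char lower[8];
    size_t i;

    for (i = 0; value[i] != '\0' && i < sizeof lower - 1; i++) {
        lower[i] = (char)tolower((unsigned char)value[i]);
    }
    lower[i] = '\0';
    return strcmp(lower, "1") == 0 || strcmp(lower, "on") == 0
        || strcmp(lower, "true") == 0 || strcmp(lower, "yes") == 0;
}

/* Registration: every directive starts from its default value. */
void toy_ini_register(const toy_ini_entry *entries, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        *entries[i].target = toy_parse_bool(entries[i].default_value);
    }
}

/* Change one directive by name; returns false when the name is unknown. */
bool toy_ini_set(const toy_ini_entry *entries, size_t count,
                 const char *name, const char *value)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(entries[i].name, name) == 0) {
            *entries[i].target = toy_parse_bool(value);
            return true;
        }
    }
    return false;
}
```

Registering an echoln entry with the default "0" leaves the field false, so existing scripts keep their behavior. Setting it to "On" or "1" flips the field that the echo code can later read.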

I ran make and make test, and nothing exploded. Good start.

Add a new section to ZEND_ECHO

It’s now time to modify the handler and add our new section:

ZEND_VM_HANDLER(136, ZEND_ECHO, CONST|TMPVAR|CV, ANY)
{
	USE_OPLINE
	zval *z;

	SAVE_OPLINE();
	z = GET_OP1_ZVAL_PTR_UNDEF(BP_VAR_R);

	if (Z_TYPE_P(z) == IS_STRING) {
		zend_string *str = Z_STR_P(z);

		if (ZSTR_LEN(str) != 0) {
			zend_write(ZSTR_VAL(str), ZSTR_LEN(str));
		}
	} else {
		zend_string *str = zval_get_string_func(z);

		if (ZSTR_LEN(str) != 0) {
			zend_write(ZSTR_VAL(str), ZSTR_LEN(str));
		} else if (OP1_TYPE == IS_CV && UNEXPECTED(Z_TYPE_P(z) == IS_UNDEF)) {
			ZVAL_UNDEFINED_OP1();
		}
		zend_string_release_ex(str, 0);
	}

	/* New section: append a newline after every echo when the echoln directive is on. */
	bool echoln = INI_BOOL("echoln");

	if (echoln) {
		zend_write("\n", 1);
	}

	FREE_OP1();
	ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();
}

Test the results

All preexisting tests should work without touching anything. We then add two new files to test our new functionality:

Zend/tests/echoln/001.phpt

--TEST--
Basic "echoln" test with default value.
--FILE--
<?php
echo "Hello World!";
?>
--EXPECT--
Hello World!

Zend/tests/echoln/002.phpt

--TEST--
Basic "echoln" test
--INI--
echoln=1
--FILE--
<?php
echo "Hello World!";
?>
--EXPECT--
Hello World!

The second one sets our directive to true and checks that the output ends with a newline (notice the last empty line).

Conclusions

Changing PHP internals is easy! You can get all the changes here: Add a language construct to PHP, part 1. Next time, we will continue our journey by adding our always token to the grammar, an opcode generator, and a brand-new VM function.

In the meantime, happy hacking!

@online{zaerl2024-add-a-language-construct-to-php-part-1,
  author = {Francesco Bigiarini},
  title = {Add a language construct to PHP, part 1},
  date = {2024-05-13},
  url = {https://zaerl.com/2024/05/13/add-a-language-construct-to-php-part-1/},
  urldate = {2024-05-13}
}