Parsing a expression using the Doctrine Lexer

Willem Turkstra
5 min readFeb 27, 2021

What in the world is a lexer? And why do we need it? A Lexer parses a input string and recognizes certain characters or a set of characters. These recognized parts are then given a type.

When you stroll around on the internet for a while you will find lots of information on lexing and parsing programming languages. But lexing can also be used on a smaller scale. Doctrine uses this to parse their own query language.

Use case
I am always creating new side projects and every side project I always want to write better code. In this case I was creating a api. This api needed some way of applying a filter to the GET endpoints of my resources. The idea was that you could add filters to as many properties as you liked using ‘any’ operators you want like: ‘eq’, ‘neq’, ‘gt’, ‘gte’, ‘lt’, ‘lte’, ‘like’, etc. Also the expression had te be easy to read. There are a lot of different ways to do this but I settled on using the following syntax: property:operator:value You can also apply multiple filters using AND and OR and even group filters together: name:eq:willem AND (age:gt:25 OR car:eq:volvo) .

This al had to be parsed into something that I could use in creating database queries or whatever storage I wanted to use for the project.

Doctrine Lexer
Most of the time looking for a solution Iwill be looking at my screen for hours. Just thinking, searching the internet for any information Icould find. Eventually someone mentioned the Doctrine Lexer class. The class will give you the tools to walk over your string and find certain patterns. Doctrine Lexer uses regex for pattern matching. Every found pattern can be marked with a type. This way when the lexing is finished you can use these types to determine what you want to do with that part of the string.

I was already using the doctrine libraries in my symfony project. But if you want to use it you can use composer to install it: composer require doctrine/lexer . You will need to implement the AbstractLexer class.
It wants you to implement a few functions:

  • getCatchablePatterns — Provide the patterns you want to find
  • getNonCatchablePatterns — Patterns you want to ignore
  • getType — What is the type of the matched pattern?

Implementation
So we want to parse this expression:name:eq:willem . Where name is the property, eq is the operator and willem is the value to be filtered.

To achieve this we added the following patterns in the getCatchablePatterns :

/* PROPERTIES */ "(?<=\s|^|\()[a-zA-Z0-9_]*:", 
/* OPERATORS */
'(?<=:)[a-z]+(?=:)',
/* Quoted string */ "(?<=:)[a-zA-Z0-9]+(?=\s|$|\))|'(?:[^']|'')*'",
/* LOGIC OPERATORS */ '(?<=\s)AND|OR(?=\s+)',
/* GROUPER */ '(?<=\s)\(|\)(?=$|\s)'

Each pattern will look for a specific section in the expression.

This first regex will look for any value at the start of the string or preceded by a space or a open parenthesis, ending with a colon. If you look at our expression that should match name: .

The next pattern will look for an operator. This is a lowercase string starting and ending with a colon. In our expression this matches eq

The next pattern will look for quoted string starting with a colon and ending with a whitespace, the end of the string or the end of a group signified by )

The fourth pattern looks for the logic operators in the expression. These operator can either be AND or OR . Our example expression doesn’t use this.

The fifth pattern will look for the start or and of a group signified by ( or ) .

getType
For each matched pattern the Doctrine Lexer will call the getType method. For every value you will need to determine what the type is. The type of a pattern is not just property, operator or value. For each operator we can return a different type. This gives more flexability to parsing the results of the lexer.

In my case I defined the following types:

const OP_EQUALS = 0;
const OP_NOT_EQUALS = 1;
const OP_GREATER_THEN = 2;
const OP_LESS_THEN = 3;
const OP_GREATER_THEN_EQUALS = 4;
const OP_LESS_THEN_EQUALS = 5;
const OP_LIKE = 6;

const LOGIC_OPERATOR = 50;

const FILTER_PROPERTY = 100;
const FILTER_VALUE = 200;

const GROUP_START = 300;
const GROUP_END = 310;

const OPERATORS = [
0, 1, 2, 3, 4, 5, 6
];

So how do you give the matched token a type. For our example expression name:eq:willem The getType function will receive the following strings: “name:”, “eq”, “willem”. How are “eq” and “willem” different? How will we make sure “eq” will get the type operator and “willem” will get the type value . It’s all in the order of the conditional statements. An operator will always be a certain value. Ex: ‘eq’, ‘neq’, ‘gt’, ‘gte’, ‘lt’, ‘lte’, ‘like’, etc. So if the passed value is equal to one of these value you can set the corresponding type. All other values will be considered a value .

The property type is a bit different. See how we left the colon at the end of the string. In the getType function we can check if this colon is present at the end of the string. If this is the case we return the type property . And because the value is passed by reference we can strip the colon before returning the property.

Here is a full look at the code for the getType function:

protected function getType(&$value)
{
if ($value == '(') {
return self::GROUP_START;
} elseif ($value == ')') {
return self::GROUP_END;
}

//operators
switch (true) {
case $value === 'eq':
return self::OP_EQUALS;
case $value === 'neq':
return self::OP_NOT_EQUALS;
case $value === 'gt':
return self::OP_GREATER_THEN;
case $value === 'lt':
return self::OP_LESS_THEN;
case $value === 'gte':
return self::OP_GREATER_THEN_EQUALS;
case $value === 'lte':
return self::OP_LESS_THEN_EQUALS;
case $value === 'like':
return self::OP_LIKE;
}

if ($value == 'AND' || $value == 'OR') {
return self::LOGIC_OPERATOR;
}

if (substr($value, -1) == ':') {
$value = substr($value, 0 , -1);
return self::FILTER_PROPERTY;
}

return self::FILTER_VALUE;
}

Conclusion
You will now have a write a lexer which turns your plain string into bits which will be easily digested by your code. The next step would be to write a parser which will create a Abstract Syntax Tree but that’s a subject for another article.

Thank you for reading this. If you have any improvements or questions you can always send me a message or leave a comment!

--

--