Johan Åhlgren

Project 1: Building the parse tree

Now that we have finished our tokenizer, we will look at the second task: building the parse tree. Once we have that, we can ask the tree if a set of strings are a valid match or not. More precisely, we will now build a tree of expressions. We will do this in two steps – first, we will use a TStringEvaluatorTokenization object to create a list of tokens, and then we will traverse that list and build our tree.

Expressions

I have taken the liberty of rearranging the grammar a bit, so that it matches what we will be doing more closely:

Expr  ::= Expr BinOp Expr
          |   UnOp Expr
          |   '(' Expr ')'
          |   String
    BinOp ::= AND | OR
    UnOp  ::= NOT

We will have three types of expressions – the binary operation (“AND” and “OR”), the unary operation (“NOT”) and a string. Note that the parentheses will not generate any expressions in themselves, but simply change the structure of the tree, as we hinted at in the first article.

A “binary operation” expression needs two child nodes and knowledge about whether it represents an “AND” or an “OR”. To evaluate a binary operation means evaluating the two child expressions and then apply either “AND” or “OR” to the results.

A “unary operation” expression (in our case the “NOT” operator”) is even simpler – it needs only one child node, and evaluating it means only evaluating the child node and return the negation of its result.

Finally, a string expression needs only the actual string. To evaluate it against a set of strings, we simply check if one of the supplied strings matches it.

Parsing the token list

We will start by trying to match our expression as a binary expression. This means starting with trying to parse an expression (actually, we continue with trying to parse a unary one). We then see if the parsed expression is followed by a binary operator (“AND” or “OR”) and in that case try to parse another expression and build a tree containing the two parsed expressions as subtrees of an “AND” or “OR” node. If the first expression is not followed by a binary operator, we return that expression immediately. Thus, parsing a binary expression will return either a TBinOpExpr node with two expression child nodes, or simply an expression node of the type returned by the ParseUnaryExpression method.

function TStringEvaluator.ParseBinaryExpression: TExpression;
var
  Operator: String;
  Left, Right: TExpression;
begin
  Left := ParseUnaryExpression;
  if ThereAreMoreTokens then
  begin
    while NextIsBinOp do
    begin
      Operator := ParseNext;
      Right := ParseBinaryExpression;
      Left := TBinOpExpr.Create(Left, Operator, Right);
    end;
  end;
  Result := Left;
end;

Parsing a unary expression means that we try to match the next token against the word “NOT”. It this is a match, we create a “NOT” node with one child and try to parse an expression (actually the one containing parentheses) and place it in the child node. If we do not find “NOT”, we immediately try to parse another expression and return that. Thus, parsing a unary expression will return either a TUnOpExpr node with another expression as its child, or simply an expression of the type returned by the ParseParentheses method.

function TStringEvaluator.ParseUnaryExpression: TExpression;
var
  Operator: String;
  SubExpression: TExpression;
begin
  if NextTokenEquals('NOT') then
    Operator := ParseNext
  else
    Operator := '';

  SubExpression := ParseParentheses;
  if Operator = '' then
    Result := SubExpression
  else
    Result := TUnOpExpr.Create(Operator, SubExpression);
end;

Parsing an expression surrounded by parentheses means that we try to match the next token against “(“. It this is a match, we parse any expression (starting over from the top, by trying to parse a binary expression) and once we have done that, we check that the next token is “)”. If not, we raise an exception to indicate that an error was found. If we don’t find the initial “(“, we continue by trying to parse a string. Thus, parsing an expression inside parentheses will return either what is returned by ParseBinaryExpression or what is returned by ParsePrimitive.

function TStringEvaluator.ParseParentheses: TExpression;
begin
  if NextTokenEquals('(') then
  begin
    ParseNext;
    Result := ParseBinaryExpression;
    if NextTokenEquals(')') then
      ParseNext
    else
      Raise EExpressionError.Create(Format(‘Invalid token "%s". Expected ")".’, [NextToken]));
  end{if}
  else
    Result := ParseString;
end;

Finally, parsing a string means that we try to match the next token against anything really, creating and returning a TStringExpr.

function TStringEvaluator.ParsePrimitive: TExpression;
begin
  Result := TStringExpr.Create(ParseNext);
end;

As you see in the code segments above, we have a number of help methods, such as NextTokenEquals and ParseNext. Hopefully, the purpose of all these will be clear when you look at the source code. The code includes two test projects. The first one is a visual application where you may enter any expression you like and match it against one or more strings. This application will also display a simple representation of the parse tree, so that you may check the effect e.g. of moving parentheses around. The second one is a simple unit test verifying that a number of evaluations give the expected result.

Leave a Reply

Your email address will not be published.