Parse tree

aka: abstract syntax tree (AST), concrete syntax tree (CST).

Parsing is the process of convertion of text (string) to a tree according to the given grammar.

Let’s explore examples to understand the concept.

Token

The simplest possible grammar: S -> "a";.

Grammar grpah	Parse tree for `a`

Concatenation

Concatenation of two tokens S -> "a" "b";.

Grammar grpah	Parse tree for `ab`

Unordered choice

S -> "a" | "b".

Grammar grpah	Parse tree for `a`	Parse tree for `b`

General recursion

Classical example S -> "" | S "a";.

Grammar tree	Parse tree for `aa`

Parse tree in this case looks differently compared to grammar, but we can see how it served as pattern.

Kleene star

There are 3 different grammars which correspond to the same language:

S -> "" | S "a";
S -> "" | "a" S;
S -> "a"*;

First two are general recursion, but the last one is “iteration” (Kleene star). So I thought it would make sense that Kleene star would be closer “concatenation” tree. Note: this feature is specific to my implementaion of PwZ. And we have following result:

1. Parse tree for `aa`	2. Parse tree for `aa`	3. Parse tree for `aa`

Controlling tree structure

Note: this feature is specific to my implementaion of PwZ.

Compact tree

We can make parse tree more compact:

by merging all nameless nodes to the closest named parent (“vertical” compaction)
by removing all empty strings (" $\epsilon$ ") and empty nodes (“horizontal” compaction)

Original parse tree	”Vertical” compaction	”Horizontal” compaction

And if we apply both strategies:

Original parse tree	Compact parse tree

You can think of it as turning CST to AST.

Ignored nodes

There are cases when symbol is required to distinguish something, but it useless in the final tree (AST). For example imagine function call syntax name(arg1, arg2). What is important: name of the function, arguments. What is not important: (, ), ,, space, tabs, etc.

For this case you can use ignored nodes - they will be parsed as other nodes, but after matching they will return empty string instead. For example:

Add -> [0-9] "+" [0-9];
Add -> [0-9] ign("+") [0-9];

1 parse tree without `ign`	2 parse tree with `ign`	2 compact parse tree with `ign`

Lexical nodes

Some parsers (especially old-school ones) require separate step before parsing - tokenization. It requires separate article why I don’t like it. But other parsers doesn’t require this step. It’s called scannerless parser. Instead you can provide: lexical and syntactical grammar and parser (or parser generator) will figure out everything.

In my implementation of PwZ you can use lex for this. For example:

Str -> "\"" [^"]* "\"";
Str -> "\"" lex([^"]*) "\"";
Str -> ign("\"") lex([^"]*) ign("\"");

1 parse tree without `lex`	2 parse tree with `lex`	3 parse tree with `lex` and `ign`

Nameless nodes

Sometime you want to reuse some part of grammar, in order to do this you would use rule (aka symbol or variable). But if you do - it will introduce new node in the tree. If you want to reuse grammar, but don’t want for it to have named node you can do this with special syntax for nameless nodes.

Str -> ign("\"") lex([^"]*) ign("\""); S -> Str ":" Str;
_Str -> ign("\"") lex([^"]*) ign("\""); S -> _Str ":" _Str;

1 parse tree for `"ab":"cd"`	2 parse tree for `"ab":"cd"`