Parsing is the process of convertion of text (string) to a tree according to the given grammar.
Let’s explore examples to understand the concept.
Token
The simplest possible grammar: S -> "a";.
Grammar grpah
Parse tree for a
Concatenation
Concatenation of two tokens S -> "a" "b";.
Grammar grpah
Parse tree for ab
Unordered choice
S -> "a" | "b".
Grammar grpah
Parse tree for a
Parse tree for b
General recursion
Classical example S -> "" | S "a";.
Grammar tree
Parse tree for aa
Parse tree in this case looks differently compared to grammar, but we can see how it served as pattern.
Kleene star
There are 3 different grammars which correspond to the same language:
S -> "" | S "a";
S -> "" | "a" S;
S -> "a"*;
First two are general recursion, but the last one is “iteration” (Kleene star). So I thought it would make sense that Kleene star would be closer “concatenation” tree. Note: this feature is specific to my implementaion of PwZ. And we have following result:
1. Parse tree for aa
2. Parse tree for aa
3. Parse tree for aa
Controlling tree structure
Note: this feature is specific to my implementaion of PwZ.
Compact tree
We can make parse tree more compact:
by merging all nameless nodes to the closest named parent (“vertical” compaction)
by removing all empty strings ("ϵ") and empty nodes (“horizontal” compaction)
Original parse tree
”Vertical” compaction
”Horizontal” compaction
And if we apply both strategies:
Original parse tree
Compact parse tree
You can think of it as turning CST to AST.
Ignored nodes
There are cases when symbol is required to distinguish something, but it useless in the final tree (AST). For example imagine function call syntax name(arg1, arg2). What is important: name of the function, arguments. What is not important: (, ), ,, space, tabs, etc.
For this case you can use ignored nodes - they will be parsed as other nodes, but after matching they will return empty string instead. For example:
Add -> [0-9] "+" [0-9];
Add -> [0-9] ign("+") [0-9];
1 parse tree without ign
2 parse tree with ign
2 compact parse tree with ign
Lexical nodes
Some parsers (especially old-school ones) require separate step before parsing - tokenization. It requires separate article why I don’t like it. But other parsers doesn’t require this step. It’s called scannerless parser. Instead you can provide: lexical and syntactical grammar and parser (or parser generator) will figure out everything.
In my implementation of PwZ you can use lex for this. For example:
Str -> "\"" [^"]* "\"";
Str -> "\"" lex([^"]*) "\"";
Str -> ign("\"") lex([^"]*) ign("\"");
1 parse tree without lex
2 parse tree with lex
3 parse tree with lex and ign
Nameless nodes
Sometime you want to reuse some part of grammar, in order to do this you would use rule (aka symbol or variable). But if you do - it will introduce new node in the tree. If you want to reuse grammar, but don’t want for it to have named node you can do this with special syntax for nameless nodes.
Str -> ign("\"") lex([^"]*) ign("\""); S -> Str ":" Str;
_Str -> ign("\"") lex([^"]*) ign("\""); S -> _Str ":" _Str;