Skip to content

Parse tree

aka: abstract syntax tree (AST), concrete syntax tree (CST).

Related: Parse tree

Parsing is the process of convertion of text (string) to a tree according to the given grammar.

Let’s explore examples to understand the concept.

Token

The simplest possible grammar: S -> "a";.

Grammar grpahParse tree for a
n9045749999596724 S n11926066025070647 a n9045749999596724->n11926066025070647
n41195121185425454 S n41911625361547333 a n41195121185425454->n41911625361547333

Concatenation

Concatenation of two tokens S -> "a" "b";.

Grammar grpahParse tree for ab
n9275719828898807 S n13858811829854334 a n9275719828898807->n13858811829854334 n0608917576374699 b n9275719828898807->n0608917576374699
n9303045922793305 S n7650376750750165 a n9303045922793305->n7650376750750165 n42882526791292386 b n9303045922793305->n42882526791292386

Unordered choice

S -> "a" | "b".

Grammar grpahParse tree for aParse tree for b
n5093862988777245 S n8222578374431715 a n5093862988777245->n8222578374431715 n6689840189905794 b n5093862988777245->n6689840189905794
n3794395835173099 S n8210612338905203 a n3794395835173099->n8210612338905203
n9344910309976735 S n10334189319787201 b n9344910309976735->n10334189319787201

General recursion

Classical example S -> "" | S "a";.

Grammar treeParse tree for aa
n6662298823076602 S n3347014940928845 ϵ n6662298823076602->n3347014940928845 n6765139321907154 n6662298823076602->n6765139321907154 n8561749272233043 S n6662298823076602->n8561749272233043 n6765139321907154->n8561749272233043 n05314451514827656 a n6765139321907154->n05314451514827656
n5006403791996545 S n42231996027912255 n5006403791996545->n42231996027912255 n6491878904415445 S n42231996027912255->n6491878904415445 n6042320386288764 a n42231996027912255->n6042320386288764 n5053541104499739 n6491878904415445->n5053541104499739 n8956392544214549 S n5053541104499739->n8956392544214549 n15328496078572007 a n5053541104499739->n15328496078572007 n3824833350844432 ϵ n8956392544214549->n3824833350844432

Parse tree in this case looks differently compared to grammar, but we can see how it served as pattern.

Kleene star

There are 3 different grammars which correspond to the same language:

  1. S -> "" | S "a";
  2. S -> "" | "a" S;
  3. S -> "a"*;

First two are general recursion, but the last one is “iteration” (Kleene star). So I thought it would make sense that Kleene star would be closer “concatenation” tree. Note: this feature is specific to my implementaion of PwZ. And we have following result:

1. Parse tree for aa2. Parse tree for aa3. Parse tree for aa
n3043006438163376 S n9728049399011753 n3043006438163376->n9728049399011753 n8319843472498287 S n9728049399011753->n8319843472498287 n6542965663601759 a n9728049399011753->n6542965663601759 n7004871900710512 n8319843472498287->n7004871900710512 n08476003327969273 S n7004871900710512->n08476003327969273 n5890175849443013 a n7004871900710512->n5890175849443013 n8324096090466866 ϵ n08476003327969273->n8324096090466866
n7537102755878418 S n8442850623447136 n7537102755878418->n8442850623447136 n6967294540807689 a n8442850623447136->n6967294540807689 n7339097184795345 S n8442850623447136->n7339097184795345 n6729065784398922 n7339097184795345->n6729065784398922 n046515517983160226 a n6729065784398922->n046515517983160226 n7917684307491757 S n6729065784398922->n7917684307491757 n6606868571645985 ϵ n7917684307491757->n6606868571645985
n4883494029957438 S n2127887264108348 a n4883494029957438->n2127887264108348 n22314792045871457 a n4883494029957438->n22314792045871457

Controlling tree structure

Note: this feature is specific to my implementaion of PwZ.

Compact tree

We can make parse tree more compact:

  • by merging all nameless nodes to the closest named parent (“vertical” compaction)
  • by removing all empty strings ("ϵ\epsilon") and empty nodes (“horizontal” compaction)
Original parse tree”Vertical” compaction”Horizontal” compaction
n6510282769913096 S n7766559486044913 n6510282769913096->n7766559486044913 n69369496377974 S n7766559486044913->n69369496377974 n7764090426902568 a n7766559486044913->n7764090426902568 n6359462172518746 n69369496377974->n6359462172518746 n6924282712430736 S n6359462172518746->n6924282712430736 n05323449982383388 a n6359462172518746->n05323449982383388 n8005595171473734 ϵ n6924282712430736->n8005595171473734
n668184891145793 S n22670337828713905 S n668184891145793->n22670337828713905 n03843972884844482 a n668184891145793->n03843972884844482 n4561020234052182 S n22670337828713905->n4561020234052182 n4981727573289707 a n22670337828713905->n4981727573289707 n8303961819781502 ϵ n4561020234052182->n8303961819781502
n9873646761395711 S n14420038487421194 n9873646761395711->n14420038487421194 n38374772292209247 S n14420038487421194->n38374772292209247 n1884843380328045 a n14420038487421194->n1884843380328045 n07261699754742779 n38374772292209247->n07261699754742779 n28979664051481935 a n07261699754742779->n28979664051481935

And if we apply both strategies:

Original parse treeCompact parse tree
n1753534596556685 S n2746892695128744 n1753534596556685->n2746892695128744 n16573227392022316 S n2746892695128744->n16573227392022316 n6024169160236814 a n2746892695128744->n6024169160236814 n9770138542531104 n16573227392022316->n9770138542531104 n24677040964926356 S n9770138542531104->n24677040964926356 n00626885661193155 a n9770138542531104->n00626885661193155 n9441275839120051 ϵ n24677040964926356->n9441275839120051
n16054467333272893 S n8726319633204935 S n16054467333272893->n8726319633204935 n7634020904901848 a n16054467333272893->n7634020904901848 n3526325471007703 a n8726319633204935->n3526325471007703

You can think of it as turning CST to AST.

Ignored nodes

There are cases when symbol is required to distinguish something, but it useless in the final tree (AST). For example imagine function call syntax name(arg1, arg2). What is important: name of the function, arguments. What is not important: (, ), ,, space, tabs, etc.

For this case you can use ignored nodes - they will be parsed as other nodes, but after matching they will return empty string instead. For example:

  1. Add -> [0-9] "+" [0-9];
  2. Add -> [0-9] ign("+") [0-9];
1 parse tree without ign2 parse tree with ign2 compact parse tree with ign
n4598044255256266 Add n8418611712276887 1 n4598044255256266->n8418611712276887 n04094158808628601 + n4598044255256266->n04094158808628601 n2192343091977238 2 n4598044255256266->n2192343091977238
n7941217230887205 Add n4290883082986461 1 n7941217230887205->n4290883082986461 n14368428176353953 ϵ n7941217230887205->n14368428176353953 n3004001709862565 2 n7941217230887205->n3004001709862565
n13119092542922428 Add n5491957392655258 1 n13119092542922428->n5491957392655258 n027191408656265237 2 n13119092542922428->n027191408656265237

Lexical nodes

Some parsers (especially old-school ones) require separate step before parsing - tokenization. It requires separate article why I don’t like it. But other parsers doesn’t require this step. It’s called scannerless parser. Instead you can provide: lexical and syntactical grammar and parser (or parser generator) will figure out everything.

In my implementation of PwZ you can use lex for this. For example:

  1. Str -> "\"" [^"]* "\"";
  2. Str -> "\"" lex([^"]*) "\"";
  3. Str -> ign("\"") lex([^"]*) ign("\"");
1 parse tree without lex2 parse tree with lex3 parse tree with lex and ign
n8003870514085889 Str n8092892130329867 " n8003870514085889->n8092892130329867 n7993326765117879 a n8003870514085889->n7993326765117879 n30318277258436077 b n8003870514085889->n30318277258436077 n7698525908846532 " n8003870514085889->n7698525908846532
n382821226500462 Str n8673600512739206 " n382821226500462->n8673600512739206 n7289792222684683 ab n382821226500462->n7289792222684683 n16599085183132045 " n382821226500462->n16599085183132045
n014038622867061035 Str n6890412575184821 ab n014038622867061035->n6890412575184821

Nameless nodes

Sometime you want to reuse some part of grammar, in order to do this you would use rule (aka symbol or variable). But if you do - it will introduce new node in the tree. If you want to reuse grammar, but don’t want for it to have named node you can do this with special syntax for nameless nodes.

  1. Str -> ign("\"") lex([^"]*) ign("\""); S -> Str ":" Str;
  2. _Str -> ign("\"") lex([^"]*) ign("\""); S -> _Str ":" _Str;
1 parse tree for "ab":"cd"2 parse tree for "ab":"cd"
n6464202781921584 S n914437454706706 Str n6464202781921584->n914437454706706 n23901888217093048 : n6464202781921584->n23901888217093048 n42457962590009046 Str n6464202781921584->n42457962590009046 n2995846372147848 ab n914437454706706->n2995846372147848 n37151783901027535 cd n42457962590009046->n37151783901027535
n015301433556427657 S n8457480902268921 ab n015301433556427657->n8457480902268921 n4512717323474287 : n015301433556427657->n4512717323474287 n056194224041648644 cd n015301433556427657->n056194224041648644