Skip to content
Parsing
TwitterGitHub

Parse tree

aka: abstract syntax tree (AST), concrete syntax tree (CST).

Parsing is the process of convertion of text (string) to a tree according to the given grammar.

Let’s explore examples to understand the concept.

Token

The simplest possible grammar: S -> "a";.

Grammar grpahParse tree for a
n7060007805014068 S n7452539520328383 a n7060007805014068->n7452539520328383
n1871205778921492 S n7322121965521333 a n1871205778921492->n7322121965521333

Concatenation

Concatenation of two tokens S -> "a" "b";.

Grammar grpahParse tree for ab
n2856368964205491 S n15715927627935677 a n2856368964205491->n15715927627935677 n5837503248973428 b n2856368964205491->n5837503248973428
n4040722386197615 S n7562140445944674 a n4040722386197615->n7562140445944674 n24874183835193864 b n4040722386197615->n24874183835193864

Unordered choice

S -> "a" | "b".

Grammar grpahParse tree for aParse tree for b
n8045646947857164 S n14693252912421073 a n8045646947857164->n14693252912421073 n7483466783761956 b n8045646947857164->n7483466783761956
n27886367173634774 S n33864314936985784 a n27886367173634774->n33864314936985784
n02074445370937661 S n871549760573145 b n02074445370937661->n871549760573145

General recursion

Classical example S -> "" | S "a";.

Grammar treeParse tree for aa
n683017876096379 S n6304467161798706 ϵ n683017876096379->n6304467161798706 n21697369711013814 n683017876096379->n21697369711013814 n7113932124340794 S n683017876096379->n7113932124340794 n21697369711013814->n7113932124340794 n893145762809824 a n21697369711013814->n893145762809824
n3136599601872925 S n13194918225267616 n3136599601872925->n13194918225267616 n27953582897169116 S n13194918225267616->n27953582897169116 n012387406176586513 a n13194918225267616->n012387406176586513 n7027360761426871 n27953582897169116->n7027360761426871 n22074715058858718 S n7027360761426871->n22074715058858718 n7415925070188671 a n7027360761426871->n7415925070188671 n618660922048182 ϵ n22074715058858718->n618660922048182

Parse tree in this case looks differently compared to grammar, but we can see how it served as pattern.

Kleene star

There are 3 different grammars which correspond to the same language:

  1. S -> "" | S "a";
  2. S -> "" | "a" S;
  3. S -> "a"*;

First two are general recursion, but the last one is “iteration” (Kleene star). So I thought it would make sense that Kleene star would be closer “concatenation” tree. Note: this feature is specific to my implementaion of PwZ. And we have following result:

1. Parse tree for aa2. Parse tree for aa3. Parse tree for aa
n21718019709781444 S n5482421110861406 n21718019709781444->n5482421110861406 n2917684155211284 S n5482421110861406->n2917684155211284 n8753600193084252 a n5482421110861406->n8753600193084252 n616448862493373 n2917684155211284->n616448862493373 n35399059838500446 S n616448862493373->n35399059838500446 n0008218288928241169 a n616448862493373->n0008218288928241169 n1827234170298404 ϵ n35399059838500446->n1827234170298404
n7380270597746155 S n4753423934092096 n7380270597746155->n4753423934092096 n2568308484582835 a n4753423934092096->n2568308484582835 n4759936602678476 S n4753423934092096->n4759936602678476 n6367332410891906 n4759936602678476->n6367332410891906 n37689252770982873 a n6367332410891906->n37689252770982873 n5291158448653823 S n6367332410891906->n5291158448653823 n15196337763016698 ϵ n5291158448653823->n15196337763016698
n5798025634606105 S n2356030970841121 a n5798025634606105->n2356030970841121 n18616130130612119 a n5798025634606105->n18616130130612119

Controlling tree structure

Note: this feature is specific to my implementaion of PwZ.

Compact tree

We can make parse tree more compact:

  • by merging all nameless nodes to the closest named parent (“vertical” compaction)
  • by removing all empty strings (”ϵ\epsilon”) and empty nodes (“horizontal” compaction)
Original parse tree“Vertical” compaction“Horizontal” compaction
n7674978539124542 S n6590158023573054 n7674978539124542->n6590158023573054 n46544375967444807 S n6590158023573054->n46544375967444807 n6527611975747332 a n6590158023573054->n6527611975747332 n9859842441669182 n46544375967444807->n9859842441669182 n7460103310964217 S n9859842441669182->n7460103310964217 n5891310648759651 a n9859842441669182->n5891310648759651 n25045475586769106 ϵ n7460103310964217->n25045475586769106
n308641378317116 S n7234166300078388 S n308641378317116->n7234166300078388 n7234782005236879 a n308641378317116->n7234782005236879 n7561167291166642 S n7234166300078388->n7561167291166642 n8776692144285414 a n7234166300078388->n8776692144285414 n29408896854927846 ϵ n7561167291166642->n29408896854927846
n6233140959796466 S n8243156586819966 n6233140959796466->n8243156586819966 n33054981445694454 S n8243156586819966->n33054981445694454 n2268228524400009 a n8243156586819966->n2268228524400009 n9852398525007511 n33054981445694454->n9852398525007511 n9819389977888839 a n9852398525007511->n9819389977888839

And if we apply both strategies:

Original parse treeCompact parse tree
n8145363475482481 S n4812740990121245 n8145363475482481->n4812740990121245 n11961119705599299 S n4812740990121245->n11961119705599299 n2976738294516119 a n4812740990121245->n2976738294516119 n3378465226665375 n11961119705599299->n3378465226665375 n4754782338820793 S n3378465226665375->n4754782338820793 n20155457712250668 a n3378465226665375->n20155457712250668 n6651678779670522 ϵ n4754782338820793->n6651678779670522
n24173996660263408 S n18531704294953188 S n24173996660263408->n18531704294953188 n9138198555422579 a n24173996660263408->n9138198555422579 n6930752428338598 a n18531704294953188->n6930752428338598

You can think of it as turning CST to AST.

Ignored nodes

There are cases when symbol is required to distinguish something, but it useless in the final tree (AST). For example imagine function call syntax name(arg1, arg2). What is important: name of the function, arguments. What is not important: (, ), ,, space, tabs, etc.

For this case you can use ignored nodes - they will be parsed as other nodes, but after matching they will return empty string instead. For example:

  1. Add -> [0-9] "+" [0-9];
  2. Add -> [0-9] ign("+") [0-9];
1 parse tree without ign2 parse tree with ign2 compact parse tree with ign
n9791961155923108 Add n3009355066848798 1 n9791961155923108->n3009355066848798 n21893896935374202 + n9791961155923108->n21893896935374202 n7885671773104035 2 n9791961155923108->n7885671773104035
n029426769988936785 Add n3687645455724562 1 n029426769988936785->n3687645455724562 n021050868879461726 ϵ n029426769988936785->n021050868879461726 n604674112804886 2 n029426769988936785->n604674112804886
n5017440490196137 Add n7986102753026105 1 n5017440490196137->n7986102753026105 n9340032853849445 2 n5017440490196137->n9340032853849445

Lexical nodes

Some parsers (especially old-school ones) require separate step before parsing - tokenization. It requires separate article why I don’t like it. But other parsers doesn’t require this step. It’s called scanerless parser. Instead you can provide: lexical and syntactical grammar and parser (or parser generator) will figure out everything.

In my implementation of PwZ you can use lex for this. For example:

  1. Str -> "\"" [^"]* "\"";
  2. Str -> "\"" lex([^"]*) "\"";
  3. Str -> ign("\"") lex([^"]*) ign("\"");
1 parse tree without lex2 parse tree with lex3 parse tree with lex and ign
n0175377895189337 Str n5655466186198417 " n0175377895189337->n5655466186198417 n06674506989227025 a n0175377895189337->n06674506989227025 n8439042715435887 b n0175377895189337->n8439042715435887 n37883685777063136 " n0175377895189337->n37883685777063136
n2738139522031151 Str n642697871829313 " n2738139522031151->n642697871829313 n12937634696708322 ab n2738139522031151->n12937634696708322 n6523160803721881 " n2738139522031151->n6523160803721881
n2901168456902312 Str n2013146804370638 ab n2901168456902312->n2013146804370638

Nameless nodes

Sometime you want to reuse some part of grammar, in order to do this you would use rule (aka symbol or variable). But if you do - it will introduce new node in the tree. If you want to reuse grammar, but don’t want for it to have named node you can do this with special syntax for nameless nodes.

  1. Str -> ign("\"") lex([^"]*) ign("\""); S -> Str ":" Str;
  2. _Str -> ign("\"") lex([^"]*) ign("\""); S -> _Str ":" _Str;
1 parse tree for "ab":"cd"2 parse tree for "ab":"cd"
n37909087534710006 S n712988260475818 Str n37909087534710006->n712988260475818 n6866088978516496 : n37909087534710006->n6866088978516496 n6138730037110032 Str n37909087534710006->n6138730037110032 n2939344152676633 ab n712988260475818->n2939344152676633 n17195379276966682 cd n6138730037110032->n17195379276966682
n014097929830768186 S n4764521751424384 ab n014097929830768186->n4764521751424384 n11386907403955426 : n014097929830768186->n11386907403955426 n1233341918759654 cd n014097929830768186->n1233341918759654