Skip to content

Parse tree

aka: abstract syntax tree (AST), concrete syntax tree (CST).

Related: Parse tree

Parsing is the process of convertion of text (string) to a tree according to the given grammar.

Let’s explore examples to understand the concept.

Token

The simplest possible grammar: S -> "a";.

Grammar grpahParse tree for a
n4783407317013666 S n3590642813007243 a n4783407317013666->n3590642813007243
n3107780802906901 S n397108051030874 a n3107780802906901->n397108051030874

Concatenation

Concatenation of two tokens S -> "a" "b";.

Grammar grpahParse tree for ab
n052007353442672155 S n17703023510764226 a n052007353442672155->n17703023510764226 n426172810115234 b n052007353442672155->n426172810115234
n020481441947321155 S n9108876704275235 a n020481441947321155->n9108876704275235 n4576135188110908 b n020481441947321155->n4576135188110908

Unordered choice

S -> "a" | "b".

Grammar grpahParse tree for aParse tree for b
n3754043108705223 S n20881270689124287 a n3754043108705223->n20881270689124287 n8108681281442534 b n3754043108705223->n8108681281442534
n8327232116986778 S n3927437484980527 a n8327232116986778->n3927437484980527
n8960223782042691 S n5612808461361614 b n8960223782042691->n5612808461361614

General recursion

Classical example S -> "" | S "a";.

Grammar treeParse tree for aa
n6997591863283679 S n0795279838207561 ϵ n6997591863283679->n0795279838207561 n14582919107658077 n6997591863283679->n14582919107658077 n9207645680855676 S n6997591863283679->n9207645680855676 n14582919107658077->n9207645680855676 n14464551871511389 a n14582919107658077->n14464551871511389
n6492055652583058 S n5249722959985594 n6492055652583058->n5249722959985594 n8837948431406446 S n5249722959985594->n8837948431406446 n22859616998469723 a n5249722959985594->n22859616998469723 n21124634684803345 n8837948431406446->n21124634684803345 n6289960675929076 S n21124634684803345->n6289960675929076 n7300506233952748 a n21124634684803345->n7300506233952748 n7529239396311636 ϵ n6289960675929076->n7529239396311636

Parse tree in this case looks differently compared to grammar, but we can see how it served as pattern.

Kleene star

There are 3 different grammars which correspond to the same language:

  1. S -> "" | S "a";
  2. S -> "" | "a" S;
  3. S -> "a"*;

First two are general recursion, but the last one is “iteration” (Kleene star). So I thought it would make sense that Kleene star would be closer “concatenation” tree. Note: this feature is specific to my implementaion of PwZ. And we have following result:

1. Parse tree for aa2. Parse tree for aa3. Parse tree for aa
n6017120557004081 S n19337121687245484 n6017120557004081->n19337121687245484 n8880891772879735 S n19337121687245484->n8880891772879735 n2513608085647112 a n19337121687245484->n2513608085647112 n0896644763302965 n8880891772879735->n0896644763302965 n8008824170284816 S n0896644763302965->n8008824170284816 n30000166340799117 a n0896644763302965->n30000166340799117 n39672143804705673 ϵ n8008824170284816->n39672143804705673
n46164704475294305 S n014807923250251376 n46164704475294305->n014807923250251376 n7431166954045925 a n014807923250251376->n7431166954045925 n2178144670087474 S n014807923250251376->n2178144670087474 n6652386961220653 n2178144670087474->n6652386961220653 n693652823695778 a n6652386961220653->n693652823695778 n7420584136866917 S n6652386961220653->n7420584136866917 n2599165206118317 ϵ n7420584136866917->n2599165206118317
n7195513709262811 S n2120341826827159 a n7195513709262811->n2120341826827159 n7680293425967579 a n7195513709262811->n7680293425967579

Controlling tree structure

Note: this feature is specific to my implementaion of PwZ.

Compact tree

We can make parse tree more compact:

  • by merging all nameless nodes to the closest named parent (“vertical” compaction)
  • by removing all empty strings ("ϵ\epsilon") and empty nodes (“horizontal” compaction)
Original parse tree”Vertical” compaction”Horizontal” compaction
n3997816514072263 S n5833781807330256 n3997816514072263->n5833781807330256 n93662199109719 S n5833781807330256->n93662199109719 n838620320208296 a n5833781807330256->n838620320208296 n13648080397381612 n93662199109719->n13648080397381612 n9551508346457325 S n13648080397381612->n9551508346457325 n7197297061677121 a n13648080397381612->n7197297061677121 n5638618390168562 ϵ n9551508346457325->n5638618390168562
n1661160703194282 S n29948122414068634 S n1661160703194282->n29948122414068634 n2604499119735746 a n1661160703194282->n2604499119735746 n5976292650604684 S n29948122414068634->n5976292650604684 n43676546643602054 a n29948122414068634->n43676546643602054 n5262725907790964 ϵ n5976292650604684->n5262725907790964
n6548196011842256 S n9725217260600041 n6548196011842256->n9725217260600041 n03689638330650302 S n9725217260600041->n03689638330650302 n22084412736683 a n9725217260600041->n22084412736683 n8691982751343301 n03689638330650302->n8691982751343301 n6894446126611633 a n8691982751343301->n6894446126611633

And if we apply both strategies:

Original parse treeCompact parse tree
n6597824908382672 S n09022258800868199 n6597824908382672->n09022258800868199 n6101143444899242 S n09022258800868199->n6101143444899242 n8813013924904392 a n09022258800868199->n8813013924904392 n3456019172516869 n6101143444899242->n3456019172516869 n6575612291419073 S n3456019172516869->n6575612291419073 n7057091458928444 a n3456019172516869->n7057091458928444 n6711020502455505 ϵ n6575612291419073->n6711020502455505
n451700451781351 S n35104018736240605 S n451700451781351->n35104018736240605 n6185049711442412 a n451700451781351->n6185049711442412 n7780190693858409 a n35104018736240605->n7780190693858409

You can think of it as turning CST to AST.

Ignored nodes

There are cases when symbol is required to distinguish something, but it useless in the final tree (AST). For example imagine function call syntax name(arg1, arg2). What is important: name of the function, arguments. What is not important: (, ), ,, space, tabs, etc.

For this case you can use ignored nodes - they will be parsed as other nodes, but after matching they will return empty string instead. For example:

  1. Add -> [0-9] "+" [0-9];
  2. Add -> [0-9] ign("+") [0-9];
1 parse tree without ign2 parse tree with ign2 compact parse tree with ign
n9855488418348683 Add n9007194074590692 1 n9855488418348683->n9007194074590692 n4816269414498566 + n9855488418348683->n4816269414498566 n8032742721316621 2 n9855488418348683->n8032742721316621
n21725053702579378 Add n3658390565465697 1 n21725053702579378->n3658390565465697 n33323007065353694 ϵ n21725053702579378->n33323007065353694 n019995225869479327 2 n21725053702579378->n019995225869479327
n5494565970758611 Add n5616973986945557 1 n5494565970758611->n5616973986945557 n37092613730947877 2 n5494565970758611->n37092613730947877

Lexical nodes

Some parsers (especially old-school ones) require separate step before parsing - tokenization. It requires separate article why I don’t like it. But other parsers doesn’t require this step. It’s called scanerless parser. Instead you can provide: lexical and syntactical grammar and parser (or parser generator) will figure out everything.

In my implementation of PwZ you can use lex for this. For example:

  1. Str -> "\"" [^"]* "\"";
  2. Str -> "\"" lex([^"]*) "\"";
  3. Str -> ign("\"") lex([^"]*) ign("\"");
1 parse tree without lex2 parse tree with lex3 parse tree with lex and ign
n3915263478060973 Str n2169519285856718 " n3915263478060973->n2169519285856718 n38899958537595025 a n3915263478060973->n38899958537595025 n5632502434136455 b n3915263478060973->n5632502434136455 n7451473916980613 " n3915263478060973->n7451473916980613
n45541851910613973 Str n5196733014765964 " n45541851910613973->n5196733014765964 n9906611442833408 ab n45541851910613973->n9906611442833408 n7758987054952506 " n45541851910613973->n7758987054952506
n2740912105812279 Str n013428900280047174 ab n2740912105812279->n013428900280047174

Nameless nodes

Sometime you want to reuse some part of grammar, in order to do this you would use rule (aka symbol or variable). But if you do - it will introduce new node in the tree. If you want to reuse grammar, but don’t want for it to have named node you can do this with special syntax for nameless nodes.

  1. Str -> ign("\"") lex([^"]*) ign("\""); S -> Str ":" Str;
  2. _Str -> ign("\"") lex([^"]*) ign("\""); S -> _Str ":" _Str;
1 parse tree for "ab":"cd"2 parse tree for "ab":"cd"
n050584147890006204 S n5791832725934403 Str n050584147890006204->n5791832725934403 n7531011176793065 : n050584147890006204->n7531011176793065 n5563517693141165 Str n050584147890006204->n5563517693141165 n44155237538407643 ab n5791832725934403->n44155237538407643 n44668031045917256 cd n5563517693141165->n44668031045917256
n2913538776125706 S n19950655561609176 ab n2913538776125706->n19950655561609176 n2933319556451959 : n2913538776125706->n2933319556451959 n4009318555370802 cd n2913538776125706->n4009318555370802