Skip to content

Parse tree

aka: abstract syntax tree (AST), concrete syntax tree (CST).

Related: Parse tree

Parsing is the process of convertion of text (string) to a tree according to the given grammar.

Let’s explore examples to understand the concept.

Token

The simplest possible grammar: S -> "a";.

Grammar grpahParse tree for a
n813184298163284 S n421311476429173 a n813184298163284->n421311476429173
n4780998647806558 S n5112091785679438 a n4780998647806558->n5112091785679438

Concatenation

Concatenation of two tokens S -> "a" "b";.

Grammar grpahParse tree for ab
n11606046068676701 S n6917593460068514 a n11606046068676701->n6917593460068514 n09814866093295804 b n11606046068676701->n09814866093295804
n04061371929816504 S n9513583410960711 a n04061371929816504->n9513583410960711 n8819367714009747 b n04061371929816504->n8819367714009747

Unordered choice

S -> "a" | "b".

Grammar grpahParse tree for aParse tree for b
n9623589181880099 S n8794472870353884 a n9623589181880099->n8794472870353884 n6843055658389234 b n9623589181880099->n6843055658389234
n18175409133549758 S n8512865300824 a n18175409133549758->n8512865300824
n17158913633944706 S n29153727278509245 b n17158913633944706->n29153727278509245

General recursion

Classical example S -> "" | S "a";.

Grammar treeParse tree for aa
n11703089481518747 S n3895187226479764 ϵ n11703089481518747->n3895187226479764 n19799433052563997 n11703089481518747->n19799433052563997 n43746899581915955 S n11703089481518747->n43746899581915955 n19799433052563997->n43746899581915955 n19503110025817239 a n19799433052563997->n19503110025817239
n4825749565839468 S n5484942576503358 n4825749565839468->n5484942576503358 n9933247903764728 S n5484942576503358->n9933247903764728 n12550013058956155 a n5484942576503358->n12550013058956155 n3311801140571211 n9933247903764728->n3311801140571211 n7454044922622964 S n3311801140571211->n7454044922622964 n16870593769903386 a n3311801140571211->n16870593769903386 n13115369902964713 ϵ n7454044922622964->n13115369902964713

Parse tree in this case looks differently compared to grammar, but we can see how it served as pattern.

Kleene star

There are 3 different grammars which correspond to the same language:

  1. S -> "" | S "a";
  2. S -> "" | "a" S;
  3. S -> "a"*;

First two are general recursion, but the last one is “iteration” (Kleene star). So I thought it would make sense that Kleene star would be closer “concatenation” tree. Note: this feature is specific to my implementaion of PwZ. And we have following result:

1. Parse tree for aa2. Parse tree for aa3. Parse tree for aa
n9497541400379339 S n5304666149727009 n9497541400379339->n5304666149727009 n09570597512284862 S n5304666149727009->n09570597512284862 n9235553489096964 a n5304666149727009->n9235553489096964 n6093911790700168 n09570597512284862->n6093911790700168 n9012457872948172 S n6093911790700168->n9012457872948172 n2748276167726549 a n6093911790700168->n2748276167726549 n4041468247985689 ϵ n9012457872948172->n4041468247985689
n7809089035503398 S n511868803203972 n7809089035503398->n511868803203972 n2933100614541473 a n511868803203972->n2933100614541473 n03821986117297138 S n511868803203972->n03821986117297138 n11923744798526426 n03821986117297138->n11923744798526426 n6681474605340196 a n11923744798526426->n6681474605340196 n844074234528216 S n11923744798526426->n844074234528216 n7663133135711022 ϵ n844074234528216->n7663133135711022
n8285849527186011 S n41631416009966227 a n8285849527186011->n41631416009966227 n5392759241251306 a n8285849527186011->n5392759241251306

Controlling tree structure

Note: this feature is specific to my implementaion of PwZ.

Compact tree

We can make parse tree more compact:

  • by merging all nameless nodes to the closest named parent (“vertical” compaction)
  • by removing all empty strings ("ϵ\epsilon") and empty nodes (“horizontal” compaction)
Original parse tree”Vertical” compaction”Horizontal” compaction
n8922555665076952 S n5273220324692329 n8922555665076952->n5273220324692329 n43439784087089794 S n5273220324692329->n43439784087089794 n192933861101094 a n5273220324692329->n192933861101094 n665436834414459 n43439784087089794->n665436834414459 n9711575698847987 S n665436834414459->n9711575698847987 n4263910680089409 a n665436834414459->n4263910680089409 n3848672186063815 ϵ n9711575698847987->n3848672186063815
n6131444675153426 S n10435564141380382 S n6131444675153426->n10435564141380382 n2620869979859448 a n6131444675153426->n2620869979859448 n7984550199582565 S n10435564141380382->n7984550199582565 n7660758733844635 a n10435564141380382->n7660758733844635 n25427069395128976 ϵ n7984550199582565->n25427069395128976
n40928330306648464 S n8589882311873791 n40928330306648464->n8589882311873791 n5466309484414209 S n8589882311873791->n5466309484414209 n34765873499181565 a n8589882311873791->n34765873499181565 n5733265103604592 n5466309484414209->n5733265103604592 n40047379291763363 a n5733265103604592->n40047379291763363

And if we apply both strategies:

Original parse treeCompact parse tree
n6013151876668503 S n3185072165964684 n6013151876668503->n3185072165964684 n356965941572186 S n3185072165964684->n356965941572186 n09604310647193803 a n3185072165964684->n09604310647193803 n2567346577607006 n356965941572186->n2567346577607006 n3500169529030852 S n2567346577607006->n3500169529030852 n28427880077769596 a n2567346577607006->n28427880077769596 n1691958120326007 ϵ n3500169529030852->n1691958120326007
n7225868314182411 S n3415334463526869 S n7225868314182411->n3415334463526869 n228141169766519 a n7225868314182411->n228141169766519 n8243674297723558 a n3415334463526869->n8243674297723558

You can think of it as turning CST to AST.

Ignored nodes

There are cases when symbol is required to distinguish something, but it useless in the final tree (AST). For example imagine function call syntax name(arg1, arg2). What is important: name of the function, arguments. What is not important: (, ), ,, space, tabs, etc.

For this case you can use ignored nodes - they will be parsed as other nodes, but after matching they will return empty string instead. For example:

  1. Add -> [0-9] "+" [0-9];
  2. Add -> [0-9] ign("+") [0-9];
1 parse tree without ign2 parse tree with ign2 compact parse tree with ign
n25903784768653315 Add n8487286457535439 1 n25903784768653315->n8487286457535439 n8631301939068938 + n25903784768653315->n8631301939068938 n029784335832441 2 n25903784768653315->n029784335832441
n055528650711698546 Add n5519194048518252 1 n055528650711698546->n5519194048518252 n15012554758956465 ϵ n055528650711698546->n15012554758956465 n43148965342828594 2 n055528650711698546->n43148965342828594
n27738437943177185 Add n19919447902819365 1 n27738437943177185->n19919447902819365 n06756698201579603 2 n27738437943177185->n06756698201579603

Lexical nodes

Some parsers (especially old-school ones) require separate step before parsing - tokenization. It requires separate article why I don’t like it. But other parsers doesn’t require this step. It’s called scanerless parser. Instead you can provide: lexical and syntactical grammar and parser (or parser generator) will figure out everything.

In my implementation of PwZ you can use lex for this. For example:

  1. Str -> "\"" [^"]* "\"";
  2. Str -> "\"" lex([^"]*) "\"";
  3. Str -> ign("\"") lex([^"]*) ign("\"");
1 parse tree without lex2 parse tree with lex3 parse tree with lex and ign
n8928962869514658 Str n34601561264776093 " n8928962869514658->n34601561264776093 n6943474121477113 a n8928962869514658->n6943474121477113 n6549923846506378 b n8928962869514658->n6549923846506378 n37246265129212497 " n8928962869514658->n37246265129212497
n4761643462159788 Str n31494441103763204 " n4761643462159788->n31494441103763204 n3604696950959314 ab n4761643462159788->n3604696950959314 n2417349764729837 " n4761643462159788->n2417349764729837
n1866403513185675 Str n40517725751816935 ab n1866403513185675->n40517725751816935

Nameless nodes

Sometime you want to reuse some part of grammar, in order to do this you would use rule (aka symbol or variable). But if you do - it will introduce new node in the tree. If you want to reuse grammar, but don’t want for it to have named node you can do this with special syntax for nameless nodes.

  1. Str -> ign("\"") lex([^"]*) ign("\""); S -> Str ":" Str;
  2. _Str -> ign("\"") lex([^"]*) ign("\""); S -> _Str ":" _Str;
1 parse tree for "ab":"cd"2 parse tree for "ab":"cd"
n45727501520760705 S n3625767565015181 Str n45727501520760705->n3625767565015181 n5086168203751809 : n45727501520760705->n5086168203751809 n5782509022597091 Str n45727501520760705->n5782509022597091 n8624681331870558 ab n3625767565015181->n8624681331870558 n2654070902556831 cd n5782509022597091->n2654070902556831
n3795826758183771 S n014668468531677137 ab n3795826758183771->n014668468531677137 n9705109526801756 : n3795826758183771->n9705109526801756 n715355837081338 cd n3795826758183771->n715355837081338