Skip to content
Parsing
TwitterGitHub

Grammar graph

Grammar can be interpreted as directed graph (tree in a trivial cases). There is always a starting point from each you start traversing and the same node would be a “root” of a parse tree.

Let’s explore examples to understand the concept.

Token

The simplest possible grammar: S -> "a";. Grammar tree for it would look like:

n7060007805014068 S n7452539520328383 a n7060007805014068->n7452539520328383

Concatenation

Concatenation of two tokens S -> "a" "b";. In this case we can consider concatentaion as a separate node with label ”\cdot” or not. I prefer second version because it’s more compact:

❌ With explicit node✅ Without explicit node
n1981570908891468 S n011498424537101704 n1981570908891468->n011498424537101704 n8612461732603098 a n011498424537101704->n8612461732603098 n8798271321154361 b n011498424537101704->n8798271321154361
n2856368964205491 S n15715927627935677 a n2856368964205491->n15715927627935677 n5837503248973428 b n2856368964205491->n5837503248973428

Unordered choice

S -> "a" | "b". Again we can show unordered choice as separate node with label ”\cup” or not.

❌ With explicit node✅ Without explicit node
n9595417875498509 S n6442583001216662 n9595417875498509->n6442583001216662 n2946100068805293 a n6442583001216662->n2946100068805293 n27219624106797746 b n6442583001216662->n27219624106797746
n8045646947857164 S n14693252912421073 a n8045646947857164->n14693252912421073 n7483466783761956 b n8045646947857164->n7483466783761956

For unordered choice order of children doesn’t matter (but for concatentaion it’s important).

Kleene star

S -> "a"*;. Again we can show Kleene star as separate node with label ”\ast” or not.

❌ With explicit node✅ Without explicit node
n05840554564222278 S n9714527563649196 n05840554564222278->n9714527563649196 n4133737706936975 a n9714527563649196->n4133737706936975
n7269489612647642 S n5180372803135314 a n7269489612647642->n5180372803135314

For all other operations situation is the same.

Named vs nameless nodes

Let’s take a look at this example A -> "a" "b"; S -> A | "c". In this case each node is named. But what about this example: S -> "a" "b" | "c". It is essentially the same grammar so it should have the same structure.

✅ Named nodes✅ Nameless nodes
n308614525444038 S n8483068989559897 A n308614525444038->n8483068989559897 n8155341890448089 c n308614525444038->n8155341890448089 n3107815494225419 a n8483068989559897->n3107815494225419 n46156924664654597 b n8483068989559897->n46156924664654597
n3734418640287944 S n70917365978309 n3734418640287944->n70917365978309 n5095004341590013 c n3734418640287944->n5095004341590013 n04474686142508322 a n70917365978309->n04474686142508322 n6455078364446363 b n70917365978309->n6455078364446363

Binary vs N-arry operator

Concatentation is a binary operator (per se) so S -> "a" "b" "c"; can be represented as binary-tree (depending on associativity), or we can treat concatenation as N-arry operator.

❌ Binary left associative❌ Binary right associative✅ N-arry
n953203755479699 S n7558936023704126 n953203755479699->n7558936023704126 n5766282887628194 c n953203755479699->n5766282887628194 n4458317872840085 a n7558936023704126->n4458317872840085 n13986519114937535 b n7558936023704126->n13986519114937535
n6515416471064241 S n27084865981607487 a n6515416471064241->n27084865981607487 n6044271556030707 n6515416471064241->n6044271556030707 n0012074200385367995 b n6044271556030707->n0012074200385367995 n40044764569882685 c n6044271556030707->n40044764569882685
n8225901453982913 S n029390192021720107 a n8225901453982913->n029390192021720107 n5749013934006377 b n8225901453982913->n5749013934006377 n3454283158947937 c n8225901453982913->n3454283158947937

General recursion

Classical example for recursion is S -> "" | S "a";.

n683017876096379 S n6304467161798706 ϵ n683017876096379->n6304467161798706 n21697369711013814 n683017876096379->n21697369711013814 n7113932124340794 S n683017876096379->n7113932124340794 n21697369711013814->n7113932124340794 n893145762809824 a n21697369711013814->n893145762809824

ϵ\epsilon” is an empty string. Dotted line shows that this is cycle (loop) e.g. this is the same ”SS” node. It’s drawn twice so it would be easier to understand the structure.

“LCRS” representation

Maybe it would make easier to understans as LCRS-tree, rather than DAG. For example: S -> "a" "b" "c";

As DAGAS LCRS-tree
n8225901453982913 S n029390192021720107 a n8225901453982913->n029390192021720107 n5749013934006377 b n8225901453982913->n5749013934006377 n3454283158947937 c n8225901453982913->n3454283158947937
n8225901453982913 S n029390192021720107 a n8225901453982913->n029390192021720107 n5749013934006377 b n029390192021720107->n5749013934006377 n3454283158947937 c n5749013934006377->n3454283158947937