Parsing

# Grammar

There are different ways to specify language, for example:

• automata - if automata accpets string than it’s part of the language
• regular expression
• grmmar, for example
• Chomsky generative grammar
• BNF (EBNF, ABNF)
• PEG
• etc.
• system of languge equations

Those all approaches equivalent. In this article we’re gonna focus on grammars.

## Chomsky generative grammar

Chomsky generative grammar $G$ is a system used for generating strings over some alphabet. It is defined by a 4-tuple:

$G = (N, \Sigma, P, S)$

Where:

1. $N$ is a finite set of non-terminal symbols. These symbols are never present in the strings of the language, but are used in the production rules to generate those strings.

2. $\Sigma$ is a finite set of terminal symbols. These symbols appear in the strings generated by the grammar and are distinct from the non-terminals, i.e., $N \cap \Sigma = \emptyset$. The symbols in $\Sigma$ form the alphabet of the language generated by the grammar.

3. $P$ is a finite set of production rules. Each production rule is of the form: $\alpha \rightarrow \beta$ Where $\alpha$ is a string of symbols that contains at least one non-terminal (i.e., $\alpha \in (N \cup \Sigma)^+$ and $\alpha$ contains at least one element from $N$), and $\beta$ is a string of symbols from $N \cup \Sigma$, i.e., $\beta \in (N \cup \Sigma)^*$.

4. $S$ is the start symbol, where $S \in N$. This is the symbol from which derivation or string generation starts.

A grammar specifies how to construct valid strings of the language. A string is considered part of the language if there exists some sequence of production rule applications (a derivation) that starts with the start symbol $S$ and results in that string.

### Chomsky hierarchy

The Chomsky hierarchy classifies grammars into four levels based on their generative power:

1. Type 0 - Recursively Enumerable (RE) Grammar

• Production Rules: $\alpha \rightarrow \beta$ where $\alpha, \beta \in (N \cup \Sigma)^+$ and $|\alpha| \leq |\beta|$
• Language: Recursively Enumerable Language
• Recognized by: Turing Machine
2. Type 1 - Context-Sensitive Grammar (CSG)

• Production Rules: $\alpha \rightarrow \beta$ where $\alpha, \beta \in (N \cup \Sigma)^+$ and $|\alpha| \leq |\beta|$
• Language: Context-Sensitive Language (CSL)
• Recognized by: Linear Bounded Automaton (LBA)
3. Type 2 - Context-Free Grammar (CFG)

• Production Rules: $A \rightarrow \beta$ where $A \in N$ and $\beta \in (N \cup \Sigma)^*$
• Language: Context-Free Language (CFL)
• Recognized by: Pushdown Automaton (PDA)
4. Type 3 - Regular Grammar

• Production Rules:
• $A \rightarrow aB$ or $A \rightarrow a$ or $A \rightarrow \epsilon$ for right-linear regular grammars.
• $A \rightarrow Ba$ or $A \rightarrow a$ or $A \rightarrow \epsilon$ for left-linear regular grammars.
• Language: Regular Language
• Recognized by: Finite Automaton (either Deterministic (DFA) or Nondeterministic (NFA))

However, this is a classical hierarchy proposed by Chomsky in 1956 in his paper “Three Models for the Description of Language.” Since then, it has been extended to capture more nuances.

### Extended Chomsky hierarchy

1. Linear grammar or Linear Context-Free Grammar
• Is a Context-Free Grammar that has at most one non-terminal in the right-hand side of each of its productions
• Examples: $A \rightarrow aB$, $A \rightarrow Ba$, $A \rightarrow aBa$
2. Right-Linear grammar
• Examples: $A \rightarrow aB$ or $A \rightarrow a$ or $A \rightarrow \epsilon$
3. Left-Linear grammar
• Examples: $A \rightarrow Ba$ or $A \rightarrow a$ or $A \rightarrow \epsilon$