# Lexer and Parser Generator
This project implements a simple lexer and parser generator. The lexer is based on a trie data structure to efficiently match token patterns defined in an external file, and the parser uses a pushdown automaton (PDA) approach to build a parse tree from the token stream. The project is organized into separate files for lexing, parsing, and execution.
## Table of Contents
- Lexer and Parser Generator
- Table of Contents
- Overview
- Project Structure
- Lexer Details
- Parser Details
- Usage Instructions
- Code Walkthrough
- Future Enhancements
- Conclusion
## Overview
This project comprises two major components:
- Lexer:
  - Uses a trie (prefix tree) to store and match token symbols.
  - Reads token definitions from `lex_rules.txt`.
  - Processes the input stream character by character to output tokens.
- Parser:
  - Reads grammar rules from `parse_rules.txt` using a BNF-like syntax.
  - Constructs a pushdown automaton (PDA) from the grammar.
  - Uses a breadth-first search to find a valid parsing path and builds a parse tree.
  - Visualizes the resulting parse tree.
## Project Structure
- `lexer.cpp`: Contains the implementation of the lexer's trie, token insertion, and tokenization process.
- `parser.cpp`: Implements the parser using PDA techniques, grammar rule processing, and parse tree construction.
- `main.cpp`: Provides the main entry point; runs the lexer and parser on an input file and visualizes the parse tree.
- `lex_rules.txt`: Specifies the lexical rules.
- `parse_rules.txt`: Defines grammar rules in a production rule format.
- `input.txt`: Contains the input string to be tokenized and parsed.
## Lexer Details

### Trie Data Structure
- Purpose: The trie stores all token symbols from `lex_rules.txt` for fast prefix matching.
- Implementation: Each node in the trie is represented by an array of 256 integers (one per possible byte value), initially set to -1. A separate vector (`term`) stores each node's terminal status and token name.
- Key functions:
  - `initNode()`: Initializes a trie node.
  - `insert(std::string sym, std::string name)`: Inserts a symbol into the trie.
  - `feed(char c)`: Processes one character at a time and returns a token when a complete symbol is matched.
Example of token insertion:
```cpp
void lexer::Trie::insert(std::string sym, std::string name){
    int node = 0;
    for(unsigned char c : sym){          // unsigned char avoids negative indices for bytes >= 128
        if(nodes[node][c] == -1){
            nodes[node][c] = initNode(); // allocate a fresh node on first use of this edge
        }
        node = nodes[node][c];
    }
    term[node] = {true, name};           // mark the final node as terminal with its token name
}
```
### Token Generation
- When the `feed()` function detects a complete token (a terminal node in the trie), it generates a `Token` object.
- Some tokens may be marked as special (defined in the `sp` map) so that the actual character is included in the token value.
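To make the trie/`feed()` interaction concrete, here is a minimal, self-contained sketch. The member names (`nodes`, `term`, `feed`) follow the description above, but this standalone `MiniTrie` and its reset-on-match policy are illustrative assumptions, not the project's actual implementation:

```cpp
#include <array>
#include <string>
#include <utility>
#include <vector>

// Minimal trie-based lexer sketch (an assumption mirroring the description above).
struct MiniTrie {
    std::vector<std::array<int, 256>> nodes;         // per-node edge table
    std::vector<std::pair<bool, std::string>> term;  // terminal flag + token name
    int cur = 0;                                     // current node while feeding

    MiniTrie() { initNode(); }                       // node 0 is the root

    int initNode() {
        std::array<int, 256> n;
        n.fill(-1);
        nodes.push_back(n);
        term.push_back({false, ""});
        return (int)nodes.size() - 1;
    }

    void insert(const std::string& sym, const std::string& name) {
        int node = 0;
        for (unsigned char c : sym) {
            if (nodes[node][c] == -1) nodes[node][c] = initNode();
            node = nodes[node][c];
        }
        term[node] = {true, name};
    }

    // Feed one character; returns the token name when a complete symbol
    // is matched, otherwise an empty string.
    std::string feed(char c) {
        int nxt = nodes[cur][(unsigned char)c];
        if (nxt == -1) { cur = 0; return ""; }       // no match: restart at the root
        cur = nxt;
        if (term[cur].first) {                       // terminal node reached
            std::string name = term[cur].second;
            cur = 0;                                 // restart for the next token
            return name;
        }
        return "";
    }
};
```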
### Lexical Rules File
- File: `lex_rules.txt`
- Format: Each line maps a symbol (or a range, e.g., `0-->9` for digits) to a token name.
- The lexer constructor (`lexer::genLexer`) reads this file and populates the trie.
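As a purely illustrative sketch (the exact line syntax is whatever `lexer::genLexer` expects; only the `0-->9` range notation is confirmed by the description above, the rest is an assumption), a rules file might look like:

```text
0-->9 NUMBER
+     PLUS
*     TIMES
(     LPAREN
)     RPAREN
```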
## Parser Details

### Grammar Rule Parsing
- The parser reads grammar rules from `parse_rules.txt`.
- Each rule is defined in a BNF-like format using `::=`, with alternatives separated by `|`.
- Rules are split into tokens and stored in a structured format for later PDA construction.
### PDA Construction
- Symbol IDs: Each grammar symbol (terminal or nonterminal) is assigned a unique numeric ID.
- Transitions: The PDA builds transitions for each production rule, indicating how symbols are replaced or reduced during parsing.
- Epsilon transitions: These are added for non-token symbols so the parser can handle optional or recursive rules.
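Symbol-ID assignment is typically just an interning table. The sketch below is an assumption (the `symId`/`idSym` names echo the `pda.idSym` member mentioned later, but this standalone helper is not the project's actual API):

```cpp
#include <map>
#include <string>
#include <vector>

// Interns each distinct grammar symbol, assigning a unique numeric ID
// (illustrative assumption; mirrors the symbol-ID scheme described above).
struct SymbolTable {
    std::map<std::string, int> symId;  // symbol -> ID
    std::vector<std::string> idSym;    // ID -> symbol

    int getId(const std::string& sym) {
        auto it = symId.find(sym);
        if (it != symId.end()) return it->second;  // already interned
        int id = (int)idSym.size();                // next free ID
        symId[sym] = id;
        idSym.push_back(sym);
        return id;
    }
};
```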
### Parse Tree Construction
- Breadth-first search (BFS): The parser uses a BFS strategy to explore possible parsing paths.
- Recursive tree population: The `populateTree()` function recursively builds the parse tree once a valid parsing path is found.
Example snippet from `populateTree()`:
```cpp
void parser::genParser::populateTree(int curNode, int curSym, int &pind, std::vector<int> path,
                                     int &tind, std::vector<Token> tokens, parseTree &tree){
    tree.id[curNode] = pda.idSym[curSym];
    if(pda.isToken[curSym]){                       // leaf: attach the token's value
        tree.val[curNode] = tokens[tind++].getVal();
        return;
    }
    int ind = path[pind++];                        // which production this path chose
    for(size_t i = 0; i < pda.pro[curSym][ind].size(); i++){
        int nNode = tree.newNode();
        tree.adj[curNode].push_back(nNode);
        populateTree(nNode, pda.pro[curSym][ind][i], pind, path, tind, tokens, tree);
    }
}
```
### Grammar Rules File
- File: `parse_rules.txt`
- Format: Each line is a production rule.
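For illustration only (the `::=` and `|` notation comes from the description above, but this particular grammar is an assumption), a rules file could contain:

```text
expr   ::= term | term + expr
term   ::= factor | factor * term
factor ::= NUMBER | ( expr )
```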
## Usage Instructions
- Prepare the rule files:
  - Edit `lex_rules.txt` to define your token symbols.
  - Edit `parse_rules.txt` to define your grammar productions.
- Provide an input file:
  - Create or edit `input.txt` with the string you wish to tokenize and parse.
- Compile the project:
  - Use a C++ compiler (e.g., g++) to build `lexer.cpp`, `parser.cpp`, and `main.cpp`.
- Run the executable:
  - The program prints the tokens produced by the lexer and then displays the parse tree.
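A typical build-and-run invocation might look like the following; the source file names come from the project structure above, but the C++ standard flag and output binary name are assumptions:

```shell
# Compile the three translation units into one executable (binary name assumed).
g++ -std=c++17 -O2 lexer.cpp parser.cpp main.cpp -o parsergen

# Run it; the program reads lex_rules.txt, parse_rules.txt, and input.txt
# from the working directory.
./parsergen
```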
## Code Walkthrough

### Lexer (`lexer.cpp`)
- Trie initialization: `initNode()` initializes a node for the trie.
- Token insertion: `insert()` adds each token symbol to the trie, handling individual characters and ranges.
- Character feeding: `feed()` processes each input character and matches it against the trie; when a terminal node is reached, a token is produced.
### Parser (`parser.cpp`)
- Grammar initialization: The PDA constructor assigns IDs to grammar symbols and builds transitions from the production rules.
- Parsing process: The `parse()` function uses a queue-based search to find a valid sequence of transitions that consumes the token stream.
- Tree population: Once a valid parse is found, `populateTree()` recursively builds a parse tree reflecting the grammar structure.
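The queue-based search can be illustrated with a toy breadth-first derivation checker. This standalone sketch is an assumption: it explores (remaining-stack, input-position) configurations directly rather than the project's actual PDA transitions, and it caps the number of steps to guard against blowup on recursive grammars:

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

using Symbols = std::vector<std::string>;

// BFS over parser configurations: does `tokens` derive from `start`
// under the productions in `prods`? (Toy sketch, not the project's parse().)
bool bfsParse(const std::map<std::string, std::vector<Symbols>>& prods,
              const std::string& start, const Symbols& tokens) {
    struct Config { Symbols stack; size_t pos; };
    std::queue<Config> q;
    q.push({{start}, 0});
    int steps = 0;
    while (!q.empty() && steps++ < 100000) {         // step cap for safety
        Config c = q.front(); q.pop();
        if (c.stack.empty()) {
            if (c.pos == tokens.size()) return true; // all input consumed
            continue;
        }
        std::string top = c.stack.back();
        c.stack.pop_back();
        auto it = prods.find(top);
        if (it == prods.end()) {                     // terminal: must match input
            if (c.pos < tokens.size() && tokens[c.pos] == top)
                q.push({c.stack, c.pos + 1});
        } else {                                     // nonterminal: try each alternative
            for (const Symbols& rhs : it->second) {
                Config nc = c;
                for (auto r = rhs.rbegin(); r != rhs.rend(); ++r)
                    nc.stack.push_back(*r);          // push RHS so leftmost symbol is on top
                q.push(nc);
            }
        }
    }
    return false;
}
```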
### Main Application (`main.cpp`)
- Lexing phase: Reads `input.txt` and uses the lexer to produce tokens.
- Parsing phase: Uses the parser to convert the token stream into a parse tree.
- Visualization: The `vizTree()` function prints the parse tree in a human-readable hierarchical format.
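A hierarchical printer in the spirit of `vizTree()` can walk the adjacency list with depth-based indentation. Everything below (the toy `Tree` struct and the `render` helper, which returns a string instead of printing) is an illustrative assumption, though the `id`/`adj` layout mirrors the `populateTree()` snippet earlier:

```cpp
#include <string>
#include <vector>

// Toy parse tree: node labels plus an adjacency list of children
// (layout assumed to mirror tree.id / tree.adj above).
struct Tree {
    std::vector<std::string> id;        // node labels
    std::vector<std::vector<int>> adj;  // children per node
};

// Renders the tree with two spaces of indentation per depth level.
std::string render(const Tree& t, int node = 0, int depth = 0) {
    std::string out(2 * depth, ' ');    // indentation for this node
    out += t.id[node] + "\n";
    for (int child : t.adj[node])
        out += render(t, child, depth + 1);
    return out;
}
```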
## Future Enhancements
- Stream-based lexing: Modify the lexer to accept character streams (instead of file names) for greater flexibility.
- Improved error handling: Enhance error reporting in both the lexer and parser to provide more meaningful diagnostics.
- Extended grammar support: Expand the parser to handle more complex grammars and additional language features.
## Conclusion
This lexer and parser generator provides a modular and extensible framework for processing textual input based on external token and grammar rules. The trie-based lexer efficiently matches token patterns, while the PDA-based parser constructs a parse tree that reflects the input's syntactic structure. This design lays the groundwork for further enhancements and adaptations to support more advanced language processing needs.