Lexers: Understanding and Implementing
What is a Lexer?
A lexer (or lexical analyzer) is the component of a compiler or interpreter that scans input text and groups characters into tokens, the basic units of syntax that later stages such as the parser work with.
Example: Lexer for Mathematical Expressions
Consider a lexer that processes simple mathematical expressions, recognizing numbers, operators, and parentheses.
Python Example:
import re

# Token patterns; within the combined regex below, the alternatives are tried in this order.
token_specification = [
    ('NUMBER',     r'\d+(\.\d*)?'),  # integer or decimal number
    ('PLUS',       r'\+'),
    ('MINUS',      r'-'),
    ('TIMES',      r'\*'),
    ('DIVIDE',     r'/'),
    ('LPAREN',     r'\('),
    ('RPAREN',     r'\)'),
    ('WHITESPACE', r'\s+'),          # matched, but discarded below
]

def lexer(code):
    tokens = []
    # Combine the patterns into a single regex of named alternatives.
    regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_specification)
    # Note: characters that match none of the patterns are silently skipped by finditer.
    for match in re.finditer(regex, code):
        kind = match.lastgroup   # name of the pattern that matched
        value = match.group()    # the matched text
        if kind != 'WHITESPACE':
            tokens.append((kind, value))
    return tokens

expr = "3 + 5 * (2 - 8)"
print(lexer(expr))
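Running this on the sample expression should print a token list like the following:

[('NUMBER', '3'), ('PLUS', '+'), ('NUMBER', '5'), ('TIMES', '*'), ('LPAREN', '('), ('NUMBER', '2'), ('MINUS', '-'), ('NUMBER', '8'), ('RPAREN', ')')]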
Lexer for Simple Language Constructs
Consider a basic lexer for a hypothetical programming language:
Python Example:
import re

token_specification = [
    ('KEYWORD',    r'\b(if|else|while|return)\b'),  # must come before IDENTIFIER
    ('IDENTIFIER', r'[a-zA-Z_]\w*'),
    ('NUMBER',     r'\d+'),
    ('OPERATOR',   r'==|[+\-*/=]'),                 # try '==' before a single '='
    ('BRACE',      r'[{}]'),
    ('PAREN',      r'[()]'),
    ('SEMICOLON',  r';'),
    ('WHITESPACE', r'\s+'),
]

def lexer(code):
    tokens = []
    regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_specification)
    for match in re.finditer(regex, code):
        kind = match.lastgroup
        value = match.group()
        if kind != 'WHITESPACE':
            tokens.append((kind, value))
    return tokens

code = "if (x == 10) { return y + 2; }"
print(lexer(code))
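Because the OPERATOR pattern lists == ahead of the single-character operators, x == 10 lexes as one OPERATOR token rather than two. Running the example should print:

[('KEYWORD', 'if'), ('PAREN', '('), ('IDENTIFIER', 'x'), ('OPERATOR', '=='), ('NUMBER', '10'), ('PAREN', ')'), ('BRACE', '{'), ('KEYWORD', 'return'), ('IDENTIFIER', 'y'), ('OPERATOR', '+'), ('NUMBER', '2'), ('SEMICOLON', ';'), ('BRACE', '}')]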
Lexer for a Custom Robot Command Language
Let's define a simple language to control a robot with commands like MOVE, TURN, and STOP.
Python Example:
import re

token_specification = [
    ('COMMAND',    r'\b(MOVE|TURN|STOP)\b'),
    ('DIRECTION',  r'\b(LEFT|RIGHT|FORWARD|BACKWARD)\b'),
    ('NUMBER',     r'\d+'),
    ('WHITESPACE', r'\s+'),   # also covers the newlines between commands
]

def lexer(code):
    tokens = []
    regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_specification)
    for match in re.finditer(regex, code):
        kind = match.lastgroup
        value = match.group()
        if kind != 'WHITESPACE':
            tokens.append((kind, value))
    return tokens

robot_code = "MOVE FORWARD 10\nTURN LEFT\nMOVE BACKWARD 5\nSTOP"
print(lexer(robot_code))
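Newlines are absorbed by the WHITESPACE pattern, so running the example should print:

[('COMMAND', 'MOVE'), ('DIRECTION', 'FORWARD'), ('NUMBER', '10'), ('COMMAND', 'TURN'), ('DIRECTION', 'LEFT'), ('COMMAND', 'MOVE'), ('DIRECTION', 'BACKWARD'), ('NUMBER', '5'), ('COMMAND', 'STOP')]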
Conclusion
Lexers are the first stage in processing source code and mathematical expressions: they break raw input into meaningful tokens that parsers, interpreters, and compilers can then consume.
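As a concrete illustration of that further processing, here is a minimal sketch of a consumer for the tokens produced by the robot lexer above. The interpret function and its fixed MOVE/TURN/STOP handling are illustrative assumptions, not part of any standard tooling.

def interpret(tokens):
    # Walk the token stream and print the action each command describes.
    # Illustrative only: assumes the token layout produced by the robot lexer above.
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == 'COMMAND' and value == 'MOVE':
            direction = tokens[i + 1][1]  # expects a DIRECTION token
            distance = tokens[i + 2][1]   # expects a NUMBER token
            print(f"move {direction.lower()} by {distance} units")
            i += 3
        elif kind == 'COMMAND' and value == 'TURN':
            direction = tokens[i + 1][1]  # expects a DIRECTION token
            print(f"turn {direction.lower()}")
            i += 2
        elif kind == 'COMMAND' and value == 'STOP':
            print("stop")
            i += 1
        else:
            raise SyntaxError(f"unexpected token {kind} {value!r}")

interpret(lexer(robot_code))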