Lexers: Understanding and Implementing

What is a Lexer?

A lexer (or lexical analyzer) is the first stage of a compiler or interpreter: it scans raw input text and groups characters into tokens, the basic units of syntax such as numbers, identifiers, and operators.

Example: Lexer for Mathematical Expressions

Consider a lexer that processes simple mathematical expressions, recognizing numbers, operators, and parentheses.

Python Example:

import re

token_specification = [
    ('NUMBER',  r'\d+(?:\.\d*)?'),  # non-capturing group so match.lastgroup stays reliable
    ('PLUS',    r'\+'),
    ('MINUS',   r'-'),
    ('TIMES',   r'\*'),
    ('DIVIDE',  r'/'),
    ('LPAREN',  r'\('),
    ('RPAREN',  r'\)'),
    ('WHITESPACE', r'\s+'),
]

def lexer(code):
    tokens = []
    regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_specification)
    for match in re.finditer(regex, code):
        kind = match.lastgroup
        value = match.group()
        if kind != 'WHITESPACE':
            tokens.append((kind, value))
    return tokens

expr = "3 + 5 * (2 - 8)"
print(lexer(expr))
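The lexer above silently skips any character that no pattern matches. A real lexer usually reports such characters instead. One minimal sketch of that extension: append a catch-all MISMATCH rule (the name is our own choice) as the last alternative, so it only fires when every other pattern has failed.

```python
import re

token_specification = [
    ('NUMBER',  r'\d+(?:\.\d*)?'),
    ('PLUS',    r'\+'),
    ('MINUS',   r'-'),
    ('TIMES',   r'\*'),
    ('DIVIDE',  r'/'),
    ('LPAREN',  r'\('),
    ('RPAREN',  r'\)'),
    ('WHITESPACE', r'\s+'),
    ('MISMATCH', r'.'),  # catch-all: any single character no earlier rule matched
]

def lexer(code):
    tokens = []
    regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_specification)
    for match in re.finditer(regex, code):
        kind = match.lastgroup
        value = match.group()
        if kind == 'MISMATCH':
            raise SyntaxError(f'Unexpected character {value!r} at position {match.start()}')
        if kind != 'WHITESPACE':
            tokens.append((kind, value))
    return tokens

print(lexer("3 + 5"))  # [('NUMBER', '3'), ('PLUS', '+'), ('NUMBER', '5')]
```

Because regex alternation tries patterns left to right, MISMATCH must come last; otherwise it would shadow the real token rules.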

Lexer for Simple Language Constructs

Consider a basic lexer for a hypothetical programming language:

Python Example:

import re

token_specification = [
    ('KEYWORD', r'\b(?:if|else|while|return)\b'),  # non-capturing group so match.lastgroup stays reliable
    ('IDENTIFIER', r'[a-zA-Z_]\w*'),
    ('NUMBER', r'\d+'),
    ('OPERATOR', r'==|[+\-*/=]'),  # try '==' before the single-character operators
    ('BRACE', r'[{}]'),
    ('PAREN', r'[()]'),
    ('SEMICOLON', r';'),
    ('WHITESPACE', r'\s+'),
]

def lexer(code):
    tokens = []
    regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_specification)
    for match in re.finditer(regex, code):
        kind = match.lastgroup
        value = match.group()
        if kind != 'WHITESPACE':
            tokens.append((kind, value))
    return tokens

code = "if (x == 10) { return y + 2; }"
print(lexer(code))
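For useful error messages, lexers typically record where each token appears. A sketch of one way to track line and column numbers, reusing the token specification above but splitting whitespace into a NEWLINE rule so line boundaries can be counted (these rule names are our own):

```python
import re

token_specification = [
    ('KEYWORD', r'\b(?:if|else|while|return)\b'),
    ('IDENTIFIER', r'[a-zA-Z_]\w*'),
    ('NUMBER', r'\d+'),
    ('OPERATOR', r'==|[+\-*/=]'),
    ('BRACE', r'[{}]'),
    ('PAREN', r'[()]'),
    ('SEMICOLON', r';'),
    ('NEWLINE', r'\n'),          # counted separately to advance the line number
    ('WHITESPACE', r'[ \t]+'),
]

def lexer(code):
    tokens = []
    regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_specification)
    line, line_start = 1, 0
    for match in re.finditer(regex, code):
        kind = match.lastgroup
        value = match.group()
        column = match.start() - line_start + 1  # 1-based column within the current line
        if kind == 'NEWLINE':
            line += 1
            line_start = match.end()
        elif kind != 'WHITESPACE':
            tokens.append((kind, value, line, column))
    return tokens

print(lexer("x = 1\nreturn x;"))
```

Each token now carries a (kind, value, line, column) tuple, which later stages can quote when reporting errors.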

Lexer for a Custom Robot Command Language

Let's define a simple language to control a robot with commands like MOVE, TURN, and STOP.

Python Example:

import re

token_specification = [
    ('COMMAND', r'\b(?:MOVE|TURN|STOP)\b'),  # non-capturing group so match.lastgroup stays reliable
    ('DIRECTION', r'\b(?:LEFT|RIGHT|FORWARD|BACKWARD)\b'),
    ('NUMBER', r'\d+'),
    ('WHITESPACE', r'\s+'),
]

def lexer(code):
    tokens = []
    regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_specification)
    for match in re.finditer(regex, code):
        kind = match.lastgroup
        value = match.group()
        if kind != 'WHITESPACE':
            tokens.append((kind, value))
    return tokens

robot_code = "MOVE FORWARD 10\nTURN LEFT\nMOVE BACKWARD 5\nSTOP"
print(lexer(robot_code))
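To illustrate the "further processing" that follows lexing, here is a minimal sketch of an interpreter that consumes these tokens, grouping each COMMAND with the arguments that follow it. The grouping logic and the `execute` function are our own illustration, not part of any defined robot-language semantics.

```python
import re

token_specification = [
    ('COMMAND', r'\b(?:MOVE|TURN|STOP)\b'),
    ('DIRECTION', r'\b(?:LEFT|RIGHT|FORWARD|BACKWARD)\b'),
    ('NUMBER', r'\d+'),
    ('WHITESPACE', r'\s+'),
]

def lexer(code):
    tokens = []
    regex = '|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_specification)
    for match in re.finditer(regex, code):
        if match.lastgroup != 'WHITESPACE':
            tokens.append((match.lastgroup, match.group()))
    return tokens

def execute(tokens):
    """Group each COMMAND with its trailing arguments and describe the action."""
    actions = []
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == 'COMMAND':
            # Collect this command's arguments: everything up to the next COMMAND.
            args = []
            i += 1
            while i < len(tokens) and tokens[i][0] != 'COMMAND':
                args.append(tokens[i][1])
                i += 1
            actions.append(' '.join([value] + args))
        else:
            i += 1  # skip stray tokens that precede any command
    return actions

robot_code = "MOVE FORWARD 10\nTURN LEFT\nMOVE BACKWARD 5\nSTOP"
print(execute(lexer(robot_code)))
# ['MOVE FORWARD 10', 'TURN LEFT', 'MOVE BACKWARD 5', 'STOP']
```

Separating lexing from execution like this lets each stage stay simple: the lexer only classifies characters, and the interpreter only reasons about token sequences.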

Conclusion

Lexers are the first step in processing code and mathematical expressions: they break raw input into meaningful tokens that parsers, interpreters, and compilers can then consume.