# Tokenizer
## Motivation
To parse math text into tree structures that encode the Order of Operations of the input, we first need an intermediate representation. Specifically, we want to build a list of the characters in the text that correspond to relevant tokens for a math expression. That is what the tokenizer does.
The tokenization process treats the input string as an array of characters, iterating over them to produce a list of tokens with `type`/`value` properties. While building the collection, the tokenizer also optionally discards extra whitespace characters.
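To make the character-by-character scan concrete, here is a minimal, hypothetical sketch of the idea (not mathy_core's actual implementation): it walks the string once, groups digit runs into constants, and skips whitespace rather than emitting tokens for it.

```python
from typing import List, NamedTuple


class SimpleToken(NamedTuple):
    value: str
    type: str


def simple_tokenize(text: str) -> List[SimpleToken]:
    """Scan the input one character at a time, emitting (value, type) tokens."""
    tokens: List[SimpleToken] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1  # discard whitespace instead of emitting a token
        elif ch.isdigit():
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1  # group consecutive digits into one constant token
            tokens.append(SimpleToken(text[start:i], "Constant"))
        elif ch.isalpha():
            tokens.append(SimpleToken(ch, "Variable"))
            i += 1
        elif ch in "+-*/^()":
            tokens.append(SimpleToken(ch, "Operator"))
            i += 1
        else:
            raise ValueError(f"unrecognized character: {ch!r}")
    # Terminate with an empty EOF token, mirroring mathy_core's output below.
    tokens.append(SimpleToken("", "EOF"))
    return tokens


print(simple_tokenize("8 - (2 + 4)"))
```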
## Visual Example
As an example, consider the input text `8 - (2 + 4)` and its token representation:

- The top row contains the token value.
- The bottom row contains the integer type of the token above it.

You can reproduce both rows with the snippet below.
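This sketch prints the same two rows using the `Tokenizer` API shown later on this page; the exact integers you see depend on the `TOKEN_TYPES` constants in your installed version of mathy_core.

```python
from typing import List

from mathy_core import Token, Tokenizer

tokens: List[Token] = Tokenizer().tokenize("8 - (2 + 4)")

# Top row: token values; bottom row: integer token types.
print(" | ".join(f"{t.value!r:>5}" for t in tokens))
print(" | ".join(f"{t.type:>5}" for t in tokens))
```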
## Code Example
Simple tokenization only requires a few lines of code:
```python
from typing import List

from mathy_core import Token, Tokenizer

text = "4x + 2x^3 * 7x"
tokenizer = Tokenizer()
tokens: List[Token] = tokenizer.tokenize(text)

# Each token pairs an integer `type` with the matched `value` text.
for token in tokens:
    print(f"type: {token.type}, value: {token.value}")
```
## Conceptual Example
To better understand the tokenizer, let's build a token array manually and then compare it to the tokenizer's output:
```python
from typing import List

from mathy_core import Token, TOKEN_TYPES, Tokenizer

# Hand-built tokens for the input "4x + 2", ending with the EOF terminator.
manual_tokens: List[Token] = [
    Token("4", TOKEN_TYPES.Constant),
    Token("x", TOKEN_TYPES.Variable),
    Token("+", TOKEN_TYPES.Plus),
    Token("2", TOKEN_TYPES.Constant),
    Token("", TOKEN_TYPES.EOF),
]
auto_tokens: List[Token] = Tokenizer().tokenize("4x + 2")

# The tokenizer should produce exactly the tokens we built by hand.
for i, token in enumerate(manual_tokens):
    assert auto_tokens[i].value == token.value
    assert auto_tokens[i].type == token.type
```
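As a follow-up, we can check the whitespace claim from earlier: assuming the default `Tokenizer` configuration discards whitespace (as described above), spacing variations of the same expression should tokenize identically. This comparison is a sketch under that assumption, not an official test from the library.

```python
from mathy_core import Tokenizer

tokenizer = Tokenizer()
spaced = tokenizer.tokenize("4x + 2")
compact = tokenizer.tokenize("4x+2")

# Whitespace is skipped during the scan, so both inputs yield the same tokens.
assert [(t.type, t.value) for t in spaced] == [(t.type, t.value) for t in compact]
```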