To parse math text into tree structures that encode the Order of Operations of the input, we first need an intermediate representation. Specifically, we want to convert the text into a list of tokens, where each token identifies a span of characters that is meaningful in a math expression. That is what the tokenizer does.

The tokenization process treats the input string as an array of characters, iterating over them to produce a list of tokens with type/value properties. While building the collection, the tokenizer also optionally discards extra whitespace characters.
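The loop described above can be sketched in a few lines of plain Python. This is a simplified illustration of the character-iteration approach, not mathy_core's actual implementation; the type names are strings here purely for readability:

```python
from typing import List, Tuple


def tiny_tokenize(text: str, discard_whitespace: bool = True) -> List[Tuple[str, str]]:
    """Scan characters left to right, emitting (type, value) pairs.

    A minimal sketch of the process described above -- not mathy_core's
    implementation. Runs of digits are grouped into a single Constant.
    """
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            # Whitespace is optionally discarded while building the list.
            if not discard_whitespace:
                tokens.append(("Whitespace", ch))
            i += 1
        elif ch.isdigit():
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(("Constant", text[start:i]))
        elif ch.isalpha():
            tokens.append(("Variable", ch))
            i += 1
        else:
            kinds = {"+": "Plus", "-": "Minus", "*": "Multiply",
                     "^": "Exp", "(": "OpenParen", ")": "CloseParen"}
            tokens.append((kinds[ch], ch))
            i += 1
    # A trailing end-of-input token with an empty value.
    tokens.append(("EOF", ""))
    return tokens
```

For example, `tiny_tokenize("4x + 2")` yields Constant, Variable, Plus, Constant, and EOF tokens, mirroring the shape of the real tokenizer's output shown later in this section.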

Visual Example

As an example, consider the input text 8 - (2 + 4) and its token representation.

Value:  8    -    (      2    +    4    )      (EOF)
Type:   1    8    256    1    4    1    512    8192

  • The top row contains each token's value.
  • The bottom row shows the integer type code of each token.
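The integer codes in the table are all powers of two, which allows token categories to be combined as bit flags. As a sketch, the two rows for 8 - (2 + 4) can be rebuilt from a small lookup; the comment names below are assumptions for illustration, while the integers come directly from the table above:

```python
# Integer type codes taken from the table above; the symbolic names in
# the comments are illustrative, not mathy_core's TOKEN_TYPES members.
TYPE_CODES = {
    "8": 1, "2": 1, "4": 1,  # Constant
    "+": 4,                  # Plus
    "-": 8,                  # Minus
    "(": 256,                # OpenParen
    ")": 512,                # CloseParen
}

values = [ch for ch in "8 - (2 + 4)" if not ch.isspace()]
types = [TYPE_CODES[ch] for ch in values] + [8192]  # trailing EOF code

print(" ".join(values))
print(" ".join(str(t) for t in types))
```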

Code Example

Simple tokenization requires only a few lines of code:


from typing import List

from mathy_core import Token, Tokenizer

text = "4x + 2x^3 * 7x"
tokenizer = Tokenizer()
tokens: List[Token] = tokenizer.tokenize(text)

for token in tokens:
    print(f"type: {token.type}, value: {token.value}")

Conceptual Example

To better understand the tokenizer, let's build a tokens array manually, then compare it to the tokenizer's output:


from typing import List

from mathy_core import Token, TOKEN_TYPES, Tokenizer

manual_tokens: List[Token] = [
    Token("4", TOKEN_TYPES.Constant),
    Token("x", TOKEN_TYPES.Variable),
    Token("+", TOKEN_TYPES.Plus),
    Token("2", TOKEN_TYPES.Constant),
    Token("", TOKEN_TYPES.EOF),
]
auto_tokens: List[Token] = Tokenizer().tokenize("4x + 2")

for i, token in enumerate(manual_tokens):
    assert auto_tokens[i].value == token.value
    assert auto_tokens[i].type == token.type

Last update: November 22, 2020