
Tokenizer

Motivation

To parse math text into tree structures that encode the order of operations of the input, we first need an intermediate representation. Specifically, we want to identify the substrings of the input that correspond to meaningful math tokens. That is what the Tokenizer does.

The tokenization process treats the input string as an array of characters, iterating over them to produce an array of tokens that have type/value properties. While building the array, the tokenizer also verifies that the expression contains only valid math tokens, and it discards extra whitespace characters.
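The loop above can be sketched in plain Python. This is a minimal illustration of the character-scanning idea, not mathy's actual implementation: the `Token` class, the integer type codes, and the set of recognized characters here are all simplified assumptions.

```python
# A minimal sketch of the tokenization loop described above.
# The type codes and recognized characters are illustrative only.
from typing import List, NamedTuple


class Token(NamedTuple):
    value: str
    type: int


# Hypothetical type codes for this sketch only
CONSTANT, VARIABLE, PLUS, MINUS, EOF = 1, 2, 3, 4, 5


def tokenize(text: str) -> List[Token]:
    tokens: List[Token] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1  # discard extra whitespace
        elif ch.isdigit():
            # group consecutive digits into one constant token
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(Token(text[start:i], CONSTANT))
        elif ch.isalpha():
            tokens.append(Token(ch, VARIABLE))
            i += 1
        elif ch == "+":
            tokens.append(Token(ch, PLUS))
            i += 1
        elif ch == "-":
            tokens.append(Token(ch, MINUS))
            i += 1
        else:
            # reject anything that is not a valid math token
            raise ValueError(f"invalid character: {ch!r}")
    # terminate the stream with an end-of-file token
    tokens.append(Token("", EOF))
    return tokens


print([t.value for t in tokenize("4x + 2")])  # ['4', 'x', '+', '2', '']
```

Note that the real tokenizer handles many more token types (parentheses, exponents, multiplication, and so on), but the shape of the loop is the same: inspect the current character, emit a token or skip whitespace, and fail on anything unrecognized.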

Visual Example

As an example, consider the input text 8 - (2 + 4) and its token representation.

value:  8    -    (     2    +    4    )     ""
type:   2    16   512   2    8    2    1024  8192

  • The top row contains the token value.
  • The bottom row contains the integer type of the token represented by the value.
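The type values in the table are all powers of two, which suggests they are bit flags that can be combined into masks. The sketch below pairs the two rows of the table; only the numeric values come from the table above, while the constant names are illustrative guesses, not mathy's actual identifiers.

```python
# Token type flags taken from the table above; the names are
# illustrative, only the numeric values appear in the document.
TOKEN_CONSTANT = 2
TOKEN_PLUS = 8
TOKEN_MINUS = 16
TOKEN_OPEN_PAREN = 512
TOKEN_CLOSE_PAREN = 1024
TOKEN_EOF = 8192

# The two rows of the table, paired up as (value, type):
pairs = list(
    zip(
        ["8", "-", "(", "2", "+", "4", ")", ""],
        [
            TOKEN_CONSTANT,
            TOKEN_MINUS,
            TOKEN_OPEN_PAREN,
            TOKEN_CONSTANT,
            TOKEN_PLUS,
            TOKEN_CONSTANT,
            TOKEN_CLOSE_PAREN,
            TOKEN_EOF,
        ],
    )
)
for value, token_type in pairs:
    print(f"{value!r:5} -> {token_type}")

# Power-of-two values let related types be grouped into a mask:
TOKEN_PAREN = TOKEN_OPEN_PAREN | TOKEN_CLOSE_PAREN
print(bool(TOKEN_CLOSE_PAREN & TOKEN_PAREN))  # True
```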

Code Example

Basic tokenization only requires a few lines of code:


from typing import List
from mathy import Tokenizer, Token

text = "4x + 2x^3 * 7x"
tokenizer = Tokenizer()
tokens: List[Token] = tokenizer.tokenize(text)

for token in tokens:
    print(f"type: {token.type}, value: {token.value}")

Conceptual Example

To better understand the tokenizer, let's build a token array manually and compare it to the one that the tokenizer outputs:


from typing import List
from mathy import Token, TokenConstant, TokenEOF, Tokenizer, TokenPlus, TokenVariable

manual_tokens: List[Token] = [
    Token("4", TokenConstant),
    Token("x", TokenVariable),
    Token("+", TokenPlus),
    Token("2", TokenConstant),
    Token("", TokenEOF),
]
auto_tokens: List[Token] = Tokenizer().tokenize("4x + 2")

for i, token in enumerate(manual_tokens):
    assert auto_tokens[i].value == token.value
    assert auto_tokens[i].type == token.type


Last update: December 16, 2019