Tokenizer

Motivation

To parse math text into tree structures that encode the Order of Operations of the input, we first need an intermediate representation. Specifically, we want to identify the characters in the text that correspond to the meaningful tokens of a math expression. That is what the tokenizer does.

The tokenization process treats the input string as an array of characters, iterating over them to produce a list of tokens with type/value properties. While building the collection, the tokenizer also optionally discards extra whitespace characters.
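To make that character scan concrete, here is a minimal sketch of the process described above. This is not the mathy_core implementation; it is a simplified illustration that walks the string, discards whitespace, and emits (type, value) pairs for digits, letters, and operator characters.

```python
from typing import List, Tuple


def tokenize_sketch(text: str) -> List[Tuple[str, str]]:
    """Scan `text` character by character and emit (type, value) pairs."""
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            # Discard extra whitespace characters
            i += 1
            continue
        if ch.isdigit():
            # Group consecutive digits into a single constant token
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(("constant", text[start:i]))
            continue
        if ch.isalpha():
            tokens.append(("variable", ch))
            i += 1
            continue
        if ch in "+-*/^()":
            tokens.append(("operator", ch))
            i += 1
            continue
        raise ValueError(f"unexpected character: {ch!r}")
    # Terminate the stream with an end-of-file marker
    tokens.append(("eof", ""))
    return tokens


print(tokenize_sketch("8 - (2 + 4)"))
```

The real tokenizer produces `Token` objects with integer type codes rather than strings, but the control flow is the same idea: a single pass over the characters that groups them into typed tokens.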

Visual Example

As an example, consider the input text 8 - (2 + 4) and its token representation.

8    -    (    2    +    4    )
2    16   512  2    8    2    1024  16384

  • The top row contains the token values (the final token is the end-of-file marker, whose value is empty).
  • The bottom row contains the integer type of each token.
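Notice that the integer types in the bottom row are all distinct powers of two, which means several types can be combined into a single bitmask and tested with a bitwise AND. The sketch below assumes that design; the constant names are hypothetical, but the integer values are taken from the table above.

```python
# Hypothetical constant names; values come from the token table above.
TOKEN_CONSTANT = 2
TOKEN_PLUS = 8
TOKEN_MINUS = 16
TOKEN_OPEN_PAREN = 512
TOKEN_CLOSE_PAREN = 1024
TOKEN_EOF = 16384

# Because each type occupies its own bit, a set of types collapses
# into one integer mask.
BINARY_OPERATORS = TOKEN_PLUS | TOKEN_MINUS


def is_binary_operator(token_type: int) -> bool:
    """Check membership in the operator set with a single bitwise AND."""
    return (token_type & BINARY_OPERATORS) != 0


print(is_binary_operator(TOKEN_PLUS))      # True
print(is_binary_operator(TOKEN_CONSTANT))  # False
```

This kind of flag scheme lets a parser ask "is this token any of these types?" in one comparison instead of checking each type individually.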

Code Example

Simple tokenization only requires a few lines of code:

from typing import List

from mathy_core import Token, Tokenizer

text = "4x + 2x^3 * 7x"
tokenizer = Tokenizer()
tokens: List[Token] = tokenizer.tokenize(text)

for token in tokens:
    print(f"type: {token.type}, value: {token.value}")

Conceptual Example

To better understand the tokenizer, let's build a token list manually, then compare it to the tokenizer's output:

from typing import List

from mathy_core import (
    Token,
    TokenConstant,
    TokenEOF,
    Tokenizer,
    TokenPlus,
    TokenVariable,
)

manual_tokens: List[Token] = [
    Token("4", TokenConstant),
    Token("x", TokenVariable),
    Token("+", TokenPlus),
    Token("2", TokenConstant),
    Token("", TokenEOF),
]
auto_tokens: List[Token] = Tokenizer().tokenize("4x + 2")

for i, token in enumerate(manual_tokens):
    assert auto_tokens[i].value == token.value
    assert auto_tokens[i].type == token.type


Last update: November 22, 2020