Tokenizer Training
The process of building a tokenizer's vocabulary from a corpus of text. The tokenizer learns which subword units to use based on frequency patterns in the training corpus.
Why It Matters
Tokenizer training determines cost-efficiency across languages. A tokenizer trained primarily on English may require roughly 3x more tokens to encode equivalent Chinese text, which raises inference cost and consumes more of the model's context window for those languages.
Example
Training a BPE tokenizer on a multilingual corpus so that it learns efficient tokens for English, Chinese, Arabic, and Hindi — balancing vocabulary size with encoding efficiency.
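The core of BPE training can be sketched in a few lines: count adjacent symbol pairs across the corpus, merge the most frequent pair, and repeat. This is a toy illustration on a tiny English word list, not a production trainer (real tokenizers such as those in the Hugging Face tokenizers library work at byte level and handle much larger corpora):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words (toy sketch, not a real trainer)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with one symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = ["lower", "lower", "lowest", "newer", "newer", "newer", "wider"]
merges = train_bpe(corpus, num_merges=4)
print(merges)
```

On a multilingual corpus the same loop naturally allocates merges to whichever scripts and words appear most often — which is exactly why corpus composition drives per-language efficiency.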
Think of it like...
Like creating a shorthand system — you make abbreviations for frequently used phrases, and the best system depends on what language and domain you are writing in.
Related Terms
Tokenizer
A component that converts raw text into tokens (numerical representations) that a language model can process. Different tokenizers split text differently, affecting model performance and efficiency.
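At its simplest, the text-to-numbers mapping is a vocabulary lookup with a fallback for unknown input. A minimal word-level sketch (real tokenizers use subwords; the vocabulary here is made up for illustration):

```python
def encode(text, vocab, unk_id=0):
    """Toy word-level tokenizer: map each whitespace-separated word to an ID.
    Unknown words fall back to unk_id; the vocabulary below is illustrative."""
    return [vocab.get(word, unk_id) for word in text.split()]

vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
print(encode("the cat sat there", vocab))
```

Subword tokenizers avoid this sketch's main weakness: any word outside the vocabulary collapses to a single unknown ID instead of being split into smaller known pieces.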
Byte-Pair Encoding
A subword tokenization algorithm that starts with individual characters and iteratively merges the most frequent pairs to create a vocabulary of subword units. It balances vocabulary size with handling of rare words.
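Once trained, a BPE tokenizer encodes new text by replaying its merge list in order. A minimal sketch, assuming a hypothetical merge list like one learned from an English-heavy corpus:

```python
def bpe_encode(word, merges):
    """Apply learned BPE merges, in learned order, to split a word into subwords."""
    symbols = list(word)  # start from individual characters
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # merge this adjacent pair into one subword
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merges, as a toy trainer might learn them (illustrative only).
merges = [("e", "r"), ("er", "s"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowers", merges))
```

Note how a word never seen during training still splits into known subword units rather than an unknown token — this is how BPE handles rare words.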
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.
Multilingual AI
AI models capable of understanding and generating text in multiple languages. Modern LLMs often support 50-100+ languages, though performance varies significantly across languages.