Tokenizer Vocabulary
The complete set of tokens (words, subwords, characters) that a tokenizer can recognize and map to numerical IDs. Vocabulary size affects model efficiency and multilingual capability.
Why It Matters
Vocabulary design impacts cost, quality, and language support. A tokenizer with poor coverage of your language splits text into more tokens, which raises per-token API costs, fills the context window faster, and tends to degrade output quality.
Example
GPT-4's tokenizer (cl100k_base) has roughly 100K tokens covering common English words, subwords, and characters from many languages. The Chinese character for 'dragon' (龙) might be a single token, or fall back to three byte-level tokens if it isn't in the vocabulary.
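To make the coverage point concrete, here is a minimal sketch of how a vocabulary lookup with byte fallback might work. The vocabulary, IDs, and `BYTE_OFFSET` value are all hypothetical illustrations, not GPT-4's actual tokenizer; real tokenizers use more sophisticated matching than the greedy longest-match shown here.

```python
# Toy vocabulary: strings mapped to integer IDs (hypothetical values).
VOCAB = {"the": 0, "drag": 1, "on": 2, "dragon": 3}
BYTE_OFFSET = 10  # hypothetical ID region reserved for raw-byte fallback tokens

def encode(text, vocab=VOCAB):
    # Greedy longest-match: at each position, take the longest vocabulary
    # entry; fall back to one token per UTF-8 byte for uncovered characters.
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            ids.extend(BYTE_OFFSET + b for b in text[i].encode("utf-8"))
            i += 1
    return ids

print(encode("dragon"))  # in-vocabulary: one token, [3]
print(len(encode("龙")))  # not covered: falls back to 3 byte tokens
```

A word the vocabulary covers costs one token, while an uncovered character costs as many tokens as its UTF-8 byte length, which is exactly why poor coverage makes text more expensive to process.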
Think of it like...
Like the dictionary a translator uses — a bigger, more diverse dictionary handles more languages and expressions efficiently.
Related Terms
Tokenizer
A component that converts raw text into a sequence of tokens, each mapped to a numerical ID, that a language model can process. Different tokenizers split text differently, affecting model performance and efficiency.
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.
Byte-Pair Encoding
A subword tokenization algorithm that starts with individual characters and iteratively merges the most frequent pairs to create a vocabulary of subword units. It balances vocabulary size with handling of rare words.