Artificial Intelligence

Tokenizer Vocabulary

The complete set of tokens (words, subwords, characters) that a tokenizer can recognize and map to numerical IDs. Vocabulary size affects model efficiency and multilingual capability.
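At its core, the vocabulary is just a lookup table between token strings and integer IDs. A minimal sketch (the tokens and IDs below are invented for illustration):

```python
# A vocabulary maps each token string to a unique integer ID,
# and the inverse table maps IDs back to strings for decoding.
# These entries are hypothetical, not from any real tokenizer.
vocab = {"hello": 0, "world": 1, "lo": 2, "wor": 3, "ld": 4}
id_to_token = {i: t for t, i in vocab.items()}

def lookup(tokens):
    """Map a list of token strings to their numerical IDs."""
    return [vocab[t] for t in tokens]

print(lookup(["hello", "world"]))  # -> [0, 1]
print(id_to_token[3])              # -> 'wor'
```

Real vocabularies hold tens or hundreds of thousands of such entries, and the model's embedding table has one row per ID.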

Why It Matters

Vocabulary design impacts cost, quality, and language support. A tokenizer with poor coverage of a language splits its text into many more tokens, which raises per-token API costs, consumes context-window space faster, and tends to degrade output quality.

Example

GPT-4's tokenizer has ~100K tokens, including common English words, subwords, and characters from many languages. The Chinese character for 'dragon' (龙) might be a single token, or it might fall back to three byte-level tokens (one per UTF-8 byte), depending on vocabulary coverage.
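The byte-fallback behavior can be sketched with a toy tokenizer: a tiny hypothetical vocabulary of whole words layered over 256 byte tokens, so any character not in the vocabulary costs one token per UTF-8 byte. All tokens and IDs here are invented for illustration, not taken from a real tokenizer.

```python
# Toy tokenizer: byte tokens occupy IDs 0-255, then whole-word
# tokens are appended on top. Unknown text falls back to bytes.
VOCAB = {bytes([b]).decode("latin-1"): b for b in range(256)}
for word in ["the", "dragon", "token"]:   # hypothetical word list
    VOCAB[word] = len(VOCAB)

def encode(text: str) -> list[int]:
    """Emit one ID per known word; otherwise one ID per UTF-8 byte."""
    ids = []
    for piece in text.split():
        if piece in VOCAB:
            ids.append(VOCAB[piece])      # known word: a single token
        else:
            ids.extend(b for b in piece.encode("utf-8"))  # byte fallback
    return ids

print(encode("the dragon"))  # two in-vocabulary words -> 2 tokens
print(encode("龙"))          # one character, 3 UTF-8 bytes -> 3 tokens
```

A character covered by the vocabulary costs one token; the same character outside it costs three. Multiply that across a whole document and the cost gap becomes substantial.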

Think of it like...

Like the dictionary a translator uses — a bigger, more diverse dictionary handles more languages and expressions efficiently.

Related Terms