Tokenizer
A component that splits raw text into tokens (units such as words or subwords, each mapped to a numeric ID) that a language model can process. Different tokenizers split the same text differently, which affects model performance and efficiency.
Why It Matters
The tokenizer determines how efficiently a model processes text and directly impacts costs, as API pricing is based on token count.
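Because pricing is per token, the cost of a request is simple arithmetic over token counts. A minimal sketch with hypothetical prices (the function name and the $3/$15 per-million-token rates are illustrative, not any real provider's pricing):

```python
def api_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost of one request, given hypothetical per-million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical pricing: $3 per million input tokens, $15 per million output.
cost = api_cost(2000, 500, 3.0, 15.0)
print(f"${cost:.4f}")  # $0.0135
```

A tokenizer that needs fewer tokens for the same text therefore translates directly into lower cost per request.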
Example
The word 'unhappiness' might be split into ['un', 'happiness'] by one tokenizer or ['un', 'happ', 'iness'] by another; each split changes how the model represents the text and how many tokens it is billed for.
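The example above can be reproduced with a minimal greedy longest-match tokenizer. This is a simplified sketch, and the two vocabularies are hypothetical; real tokenizers use learned vocabularies with thousands of entries:

```python
def greedy_tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            # Fall back to a single character if nothing matches.
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Two hypothetical vocabularies produce two different splits.
vocab_a = {"un", "happiness"}
vocab_b = {"un", "happ", "iness"}
print(greedy_tokenize("unhappiness", vocab_a))  # ['un', 'happiness']
print(greedy_tokenize("unhappiness", vocab_b))  # ['un', 'happ', 'iness']
```

The same input costs two tokens under one vocabulary and three under the other, which is exactly why tokenizer choice affects efficiency and price.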
Think of it like...
Like breaking chocolate into pieces before sharing — the size and shape of the pieces determine how many you get and how easy they are to work with.
Related Terms
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.
Byte-Pair Encoding
A subword tokenization algorithm that starts with individual characters and iteratively merges the most frequent pairs to create a vocabulary of subword units. It balances vocabulary size with handling of rare words.
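The merge loop described above can be sketched in a few lines. This is a toy version trained on three hypothetical words, showing only the core idea (count adjacent pairs, merge the most frequent, repeat); production BPE implementations add word-frequency weighting, special tokens, and byte-level handling:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair."""
    # Start from character-level sequences.
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across all sequences.
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        merged = best[0] + best[1]
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

merges, seqs = bpe_train(["low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(seqs)    # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

After two merges the shared stem "low" has become a single vocabulary unit, while the rarer suffixes stay as characters, illustrating the frequency/vocabulary-size trade-off.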
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the generated output. Larger context windows allow models to handle longer documents.
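Since the window covers both input and output, the prompt budget is the window size minus the tokens reserved for the reply. A minimal sketch with hypothetical numbers (the 8,192-token window and 1,024-token output reservation are illustrative):

```python
def fits_context(prompt_tokens, max_output_tokens, context_window):
    """Check whether a prompt plus the requested output fits the window."""
    return prompt_tokens + max_output_tokens <= context_window

# Hypothetical: an 8,192-token window with 1,024 tokens reserved for the
# reply leaves room for a prompt of at most 7,168 tokens.
print(fits_context(7168, 1024, 8192))  # True
print(fits_context(7500, 1024, 8192))  # False
```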