Token Limit
The maximum number of tokens a model can process in a single request, including both the input prompt and the generated output. Exceeding the limit results in truncated input or errors.
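In practice, applications check the combined size of the prompt and the reserved output budget before sending a request. A minimal sketch, assuming a stand-in `count_tokens` function (a real system would use the model's own tokenizer library for exact counts):

```python
# Illustrative sketch: enforcing a token limit before sending a request.
# count_tokens is a stand-in; here we approximate with a whitespace
# split, which undercounts compared with real subword tokenizers.

def count_tokens(text: str) -> int:
    """Rough token count; real tokenizers split text more finely."""
    return len(text.split())

def fits_within_limit(prompt: str, max_output_tokens: int, token_limit: int) -> bool:
    """The prompt plus the reserved output budget must stay under the limit."""
    return count_tokens(prompt) + max_output_tokens <= token_limit

print(fits_within_limit("summarize this report", 100, 4096))  # True
```

If the check fails, the application must shorten the prompt (or raise the limit by choosing a larger-context model) rather than rely on the provider silently truncating input.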
Why It Matters
Token limits constrain what tasks an LLM can handle. Understanding and working within these limits is essential for building reliable LLM applications.
Example
GPT-4 Turbo, with a 128K token limit, can process roughly a 300-page book in a single request, while older models with 4K limits could handle only about ten pages.
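The page estimates can be reproduced with back-of-the-envelope arithmetic, assuming rough rules of thumb of ~0.75 words per token and ~320 words per printed page (both approximations, not exact figures):

```python
# Rough conversion from a token limit to an approximate page count.
# words_per_token and words_per_page are rules of thumb, not exact.

def approx_pages(token_limit: int,
                 words_per_token: float = 0.75,
                 words_per_page: int = 320) -> float:
    return token_limit * words_per_token / words_per_page

print(round(approx_pages(128_000)))  # ~300 pages
print(round(approx_pages(4_096)))    # ~10 pages
```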
Think of it like...
Like the weight limit on an elevator — you can fit more people or cargo, but there is a hard maximum. Going over means something gets left behind.
Related Terms
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.
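A toy illustration of the idea that tokens are not always whole words. Real LLM tokenizers (BPE, WordPiece, and similar) learn subword units from data; this regex merely shows text splitting into word pieces and punctuation:

```python
import re

# Illustrative only: production tokenizers learn subword vocabularies;
# this regex just demonstrates that tokens can be sub-word fragments
# or punctuation marks, not only whole words.

def toy_tokenize(text: str) -> list[str]:
    """Split text into word runs and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(toy_tokenize("Don't stop!"))  # ['Don', "'", 't', 'stop', '!']
```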
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the generated output. Larger context windows allow models to handle longer documents.
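Because input and output share one window, a longer prompt leaves fewer tokens for the reply. A sketch with hypothetical numbers (a real application would count tokens with the model's own tokenizer):

```python
# Sketch: with a fixed context window, the prompt and the generated
# output share a single token budget. Window size is hypothetical.

CONTEXT_WINDOW = 8_192  # total tokens the model can see per request

def max_output_budget(prompt_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for generation after the prompt is accounted for."""
    return max(window - prompt_tokens, 0)

print(max_output_budget(6_000))  # 2192 tokens left for the reply
print(max_output_budget(9_000))  # 0 -- the prompt alone overflows the window
```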
Tokenizer
A component that converts raw text into tokens (numerical representations) that a language model can process. Different tokenizers split text differently, affecting model performance and efficiency.
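The "numerical representations" step can be sketched with a toy fixed vocabulary; production tokenizers use learned subword vocabularies with tens of thousands of entries, but the text-to-IDs mapping works the same way:

```python
# Minimal sketch of the text -> token IDs step, using a tiny
# hypothetical vocabulary with an <unk> fallback for unknown tokens.

VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, ".": 4}

def encode(tokens: list[str]) -> list[int]:
    """Map each token to its vocabulary ID, falling back to <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]

print(encode(["the", "cat", "sat", "."]))  # [1, 2, 3, 4]
print(encode(["the", "dog"]))              # [1, 0]  ('dog' is out of vocabulary)
```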
Chunking
The process of breaking large documents into smaller pieces (chunks) before creating embeddings for use in RAG systems. Chunk size and strategy significantly impact retrieval quality.
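A common default strategy is fixed-size chunks with overlap, so that information spanning a chunk boundary appears in at least one chunk intact. A sketch that counts words for simplicity (many systems count tokens instead):

```python
# Fixed-size chunking with overlap, a common baseline for RAG pipelines.
# Sizes are in words here for simplicity; token-based sizing is typical
# in practice.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk has reached the end of the document
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
print(len(chunk_text(doc)))  # 3 chunks: words 0-199, 150-349, 300-499
```

Larger chunks preserve more context per piece; smaller chunks give finer-grained retrieval. The overlap guards against splitting a key passage across two chunks.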
Summarization
The NLP task of condensing a longer text into a shorter version while preserving the key information and main points. Summarization can be extractive (selecting key sentences) or abstractive (generating new text).
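The extractive variant can be illustrated with a toy frequency-based scorer that selects the highest-scoring sentences; real extractive systems use much stronger signals, but the select-versus-generate distinction is the same:

```python
import re
from collections import Counter

# Toy extractive summarizer: score each sentence by the corpus-wide
# frequency of its words and keep the top-scoring sentences, in their
# original order. Purely illustrative.

def extractive_summary(text: str, n_sentences: int = 1) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)

text = "Tokens matter. Token limits bound tokens per request. Cats sleep."
print(extractive_summary(text))  # 'Token limits bound tokens per request.'
```

An abstractive summarizer would instead generate new sentences, which is why it requires a generative model rather than a sentence selector.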