Long Short-Term Memory
A type of recurrent neural network designed to learn long-term dependencies through special gating mechanisms that control information flow. LSTMs address the vanishing gradient problem of standard RNNs.
Why It Matters
LSTMs were the dominant architecture for sequence tasks before transformers. Understanding them provides context for why transformers were such a breakthrough.
Example
An LSTM processing a paragraph where a character's name mentioned in the first sentence is needed to understand a pronoun in the last sentence — maintaining that memory across the gap.
Think of it like...
Like a note-taking system with three decisions at each step: what to forget from your notes, what to add from new information, and what to share as your current understanding.
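The three decisions in that analogy map directly onto the LSTM's three gates. A minimal sketch of one LSTM time step, using NumPy (function names, shapes, and the stacked weight layout here are illustrative choices, not a reference implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W stacks all four gate weights: shape (4*hidden, input+hidden);
    b has shape (4*hidden,). Layout (forget, input, candidate, output) is a choice
    made for this sketch."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[:hidden])              # forget gate: what to erase from the notes
    i = sigmoid(z[hidden:2*hidden])      # input gate: how much new info to write
    g = np.tanh(z[2*hidden:3*hidden])    # candidate values to write
    o = sigmoid(z[3*hidden:])            # output gate: what to share right now
    c = f * c_prev + i * g               # updated cell state (long-term memory)
    h = o * np.tanh(c)                   # hidden state (current understanding)
    return h, c
```

The key detail is the cell-state update `c = f * c_prev + i * g`: because it is additive rather than repeatedly squashed through a nonlinearity, gradients can flow across many time steps, which is how LSTMs sidestep the vanishing gradient problem.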
Related Terms
Recurrent Neural Network
A type of neural network designed for sequential data where the output at each step depends on previous steps. RNNs have a form of memory that allows them to process sequences like text, time series, and audio.
GRU
Gated Recurrent Unit — a simplified variant of the LSTM that uses two gates (update and reset) instead of three, with fewer parameters, while achieving similar performance on many sequence tasks. It is faster to train than an LSTM.
Vanishing Gradient Problem
A training difficulty in deep networks where gradients become exponentially smaller as they are propagated back through many layers, making it nearly impossible for early layers to learn.
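The exponential shrinkage is easy to see with a toy calculation: if each backward step through the network multiplies the gradient by a factor below 1, the gradient after many steps is that factor raised to the number of steps. (This scalar model is a deliberate simplification; in a real network the per-step factor is a Jacobian, not a single number.)

```python
# Toy model of gradient flow: each layer (or time step) scales the
# gradient by a constant per-step factor during backpropagation.
def gradient_magnitude(per_step_factor, num_steps):
    return per_step_factor ** num_steps

# A factor of 0.9 seems mild, but over many steps it collapses:
shallow = gradient_magnitude(0.9, 10)    # roughly 0.35 — still usable
deep = gradient_magnitude(0.9, 100)      # roughly 0.00003 — effectively zero
```

With 100 steps the gradient reaching the earliest layers is about five orders of magnitude smaller than the one at the output, which is why those layers barely learn.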
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Sequence-to-Sequence
A model architecture that transforms one sequence into another, where the input and output can be different lengths. It uses an encoder to process input and a decoder to generate output.