Stochastic Gradient Descent
A variant of gradient descent that updates model parameters using a single random training example (or a small random batch) at each step instead of the entire dataset. Each update is much cheaper to compute, and the noise in the updates can help the optimizer escape shallow local minima.
Why It Matters
SGD is the most widely used optimization algorithm in deep learning. The noise introduced by random sampling often helps it find better solutions than full-batch, deterministic gradient descent.
Example
Updating a neural network's weights after each individual training image, rather than first computing the average error across all 1 million images.
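The per-example update above can be sketched on a toy problem. This is a minimal illustration, not from the source: a single weight is fit to synthetic data `y = 2x`, with one randomly chosen example per step.

```python
import random

# Minimal SGD sketch: fit y = w*x to data generated with w = 2,
# updating after each sampled example rather than averaging over
# the whole dataset. (Toy dataset and names are illustrative.)
random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 11)]  # (input, target) pairs

w = 0.0      # the single model parameter
lr = 0.005   # learning rate (step size)
for step in range(1000):
    x, y = random.choice(data)       # one random training example
    grad = 2 * (w * x - y) * x       # d/dw of the squared error (w*x - y)^2
    w -= lr * grad                   # small step against the gradient
```

Each step uses only one example's gradient, so individual updates are noisy, but over many steps `w` settles near the true value of 2.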
Think of it like...
Like adjusting your golf swing after every shot rather than waiting until the end of a round — more frequent adjustments lead to faster improvement.
Related Terms
Gradient Descent
An optimization algorithm used to minimize the error (loss) of a model by iteratively adjusting parameters in the direction of the negative gradient, which reduces the loss most quickly. It is the primary method for training machine learning models.
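The full-batch update rule can be shown in a few lines. A minimal sketch on an assumed toy objective `f(w) = (w - 3)^2`, whose gradient is computed exactly at every step:

```python
# Full-batch gradient descent sketch: minimize f(w) = (w - 3)^2.
# The exact gradient f'(w) = 2*(w - 3) is used each iteration
# (illustrative toy function, not from the source).
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (w - 3.0)   # derivative of the loss at the current w
    w -= lr * grad         # step in the direction of steepest descent
```

Because the gradient uses all the information at once, each step is deterministic and `w` converges smoothly to the minimum at 3.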
Learning Rate
A hyperparameter that controls how much the model's weights are adjusted in response to errors during each training step. It determines the size of the steps taken during gradient descent optimization.
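The effect of the step size can be seen by running the same descent with different learning rates. A sketch on the assumed toy objective `f(w) = w^2` (gradient `2w`); the specific rate values are illustrative:

```python
# How the learning rate changes gradient-descent behaviour on f(w) = w^2.
def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w^2 is 2w
    return w

small = descend(0.01)  # tiny steps: slow but steady progress toward 0
good = descend(0.4)    # moderate steps: rapid convergence
big = descend(1.1)     # oversized steps: each update overshoots and diverges
```

Too small and training crawls; too large and the iterate overshoots the minimum and grows without bound.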
Adam Optimizer
An adaptive optimization algorithm that combines momentum with a per-parameter adaptive learning rate. Adam maintains exponentially decaying running averages of both the gradients and the squared gradients.
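The two running averages can be sketched directly from the standard Adam update, here applied to an assumed toy objective `f(w) = (w - 3)^2` with the usual default coefficients:

```python
import math

# Minimal Adam sketch on f(w) = (w - 3)^2, using the common defaults
# beta1=0.9, beta2=0.999, eps=1e-8 (toy objective chosen for illustration).
w, lr = 0.0, 0.02
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 2001):
    g = 2 * (w - 3.0)                     # gradient of the loss
    m = beta1 * m + (1 - beta1) * g       # running average of gradients
    v = beta2 * v + (1 - beta2) * g * g   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

Dividing by the root of the squared-gradient average gives each parameter its own effective step size, which is what makes Adam "adaptive".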
Momentum
An optimization technique that accelerates gradient descent by accumulating a velocity vector in the direction of persistent gradients, helping overcome local minima and noisy gradients.
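The velocity accumulation can be sketched with the classic heavy-ball form of momentum, again on an assumed toy objective `f(w) = (w - 3)^2`; the coefficients are illustrative:

```python
# Heavy-ball momentum sketch on f(w) = (w - 3)^2: the velocity vector
# accumulates past gradients, smoothing out noisy or flat regions.
w, v = 0.0, 0.0
lr, mu = 0.05, 0.9   # learning rate and momentum coefficient
for _ in range(200):
    g = 2 * (w - 3.0)    # current gradient
    v = mu * v - lr * g  # decay the old velocity, add the new gradient step
    w += v               # move by the accumulated velocity
```

Because each update carries a decayed memory of previous gradients, consistent directions build up speed while oscillating directions partially cancel.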