Data Science

Data Augmentation

Techniques for artificially expanding a training dataset by creating modified versions of existing data. This helps models generalize better, especially when training data is limited.

Why It Matters

Data augmentation is one of the cheapest ways to improve model performance: it increases the number and variety of training examples without the cost of collecting and labeling new data.

Example

Flipping, rotating, cropping, and adjusting the brightness of training images to create variations. Generating ten variants per photo turns 1,000 photos into 10,000 training examples.
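The transformations in this example can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: it assumes square grayscale images stored as nested lists of 0-255 integers, and the function names (`augment`, `expand_dataset`), the crop size, and the 0.8-1.2 brightness range are all hypothetical choices made for the sketch. Real projects would typically use an image library instead.

```python
import random

def flip_horizontal(img):
    # Mirror each row left to right.
    return [row[::-1] for row in img]

def rotate_90(img):
    # Rotate the image 90 degrees clockwise.
    return [list(col) for col in zip(*img[::-1])]

def random_crop(img, size, rng):
    # Cut a size x size window from a random position.
    top = rng.randint(0, len(img) - size)
    left = rng.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]

def adjust_brightness(img, factor):
    # Scale every pixel and clamp the result to the valid 0-255 range.
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]

def augment(img, rng, crop_size):
    # Apply a random combination of the four transformations.
    out = img
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        out = rotate_90(out)
    out = random_crop(out, crop_size, rng)
    return adjust_brightness(out, rng.uniform(0.8, 1.2))

def expand_dataset(images, variants_per_image, crop_size, seed=0):
    # Produce several augmented variants of every input image.
    rng = random.Random(seed)
    return [augment(img, rng, crop_size)
            for img in images
            for _ in range(variants_per_image)]

# Five toy 8x8 "photos" stand in for a real dataset.
photos = [[[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
          for _ in range(5)]
augmented = expand_dataset(photos, variants_per_image=10, crop_size=6)
print(len(augmented), len(augmented[0]), len(augmented[0][0]))  # 50 6 6
```

With ten variants per image, 5 inputs become 50 examples, the same ratio as 1,000 photos becoming 10,000. The labels are assumed to stay valid under these transformations, which is not true for every task (a flipped digit or rotated road sign may change meaning).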

Think of it like...

Like a musician practicing a song in different keys and tempos — it is the same core material but the variations build more robust skill.

Related Terms

Training Data

The dataset used to teach a machine learning model. It contains examples (and often labels) that the model learns patterns from during the training process. The quality and quantity of training data directly impact model performance.

Overfitting

When a model learns the training data too well — including its noise and random fluctuations — and performs poorly on new, unseen data. The model essentially memorizes rather than generalizes.

Synthetic Data

Artificially generated data that mimics the statistical properties and patterns of real data. It is created using algorithms, simulations, or generative models rather than collected from real-world events.

Regularization

Techniques used to prevent overfitting by adding constraints or penalties to the model during training. Regularization discourages the model from becoming too complex or fitting noise in the training data.
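As one illustration of a penalty, the sketch below fits a single-weight linear model y ≈ w·x and adds an L2 (ridge) penalty on w. The choice of L2 and the names `fit_slope` and `l2_penalty` are assumptions made for this example; the definition above covers many other techniques as well.

```python
def fit_slope(xs, ys, l2_penalty=0.0):
    # Minimizes mean((w*x - y)^2) + l2_penalty * w^2 over w.
    # Setting the derivative to zero gives the closed form below.
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + n * l2_penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = fit_slope(xs, ys)                      # no penalty
w_penalized = fit_slope(xs, ys, l2_penalty=1.0)  # penalty shrinks the weight
print(w_plain, w_penalized)
```

The penalized weight is pulled toward zero relative to the unpenalized fit, which is the sense in which the penalty discourages the model from fitting the training data too closely.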

Generalization

A model's ability to perform well on new, unseen data that was not part of its training set. Generalization is the ultimate goal of machine learning — learning patterns, not memorizing examples.
