AI Glossary
The definitive dictionary for AI, Machine Learning, and Governance terminology. From Flash Attention to RAG — look up any term.
A
Accuracy
The percentage of correct predictions out of all predictions made by a model. While intuitive, accuracy can be misleading for imbalanced datasets.
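A minimal sketch in plain Python, illustrating both the metric and the imbalanced-data pitfall (function name is illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# On an imbalanced set, always predicting the majority class looks deceptively good:
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # 90% negative
preds  = [0] * 10                          # model that never predicts positive
print(accuracy(labels, preds))             # 0.9, despite missing every positive
```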
Activation Function
A mathematical function applied to the output of each neuron in a neural network that introduces non-linearity. Without activation functions, a neural network would just be a series of linear transformations.
Active Learning
A training strategy where the model identifies the most informative unlabeled examples and requests human labels only for those. This minimizes labeling effort by focusing on the examples that matter most.
Adam Optimizer
An adaptive optimization algorithm that combines momentum and adaptive learning rates for each parameter. Adam maintains running averages of both gradients and squared gradients.
Adversarial Training
A defense technique where adversarial examples are included in the training data to make the model more robust against attacks. The model learns to handle both normal and adversarial inputs.
Anomaly Detection
Techniques for identifying data points, events, or observations that deviate significantly from expected patterns. Anomalies can indicate fraud, equipment failure, security breaches, or other important events.
Autoencoder
A neural network that learns to compress data into a lower-dimensional representation (encoding) and then reconstruct it back (decoding). It learns what features are most important for faithful reconstruction.
AutoML
Automated Machine Learning — tools and techniques that automate the end-to-end process of applying machine learning, including feature engineering, model selection, and hyperparameter tuning.
B
Backpropagation
The primary algorithm used to train neural networks. It calculates how much each weight in the network contributed to the error, then adjusts weights backward from the output layer to reduce future errors.
Batch Normalization
A technique that normalizes the inputs to each layer in a neural network by adjusting and scaling them to have zero mean and unit variance. This stabilizes and accelerates the training process.
Batch Size
The number of training examples processed together before the model updates its parameters. Batch size affects training speed, memory usage, and how smoothly the model learns.
Bayesian Optimization
A sequential optimization strategy for finding the best hyperparameters by building a probabilistic model of the objective function and using it to select the most promising configurations to evaluate.
Bi-Encoder
A model that independently encodes two texts into separate vectors, then compares them using a similarity metric like cosine similarity. Bi-encoders are fast because vectors can be pre-computed.
Bias-Variance Tradeoff
The fundamental tension in ML between a model that is too simple (high bias, underfitting) and one that is too complex (high variance, overfitting). The goal is finding the sweet spot.
C
Catastrophic Forgetting
The tendency of neural networks to completely forget previously learned information when trained on new data or tasks. New learning overwrites old knowledge.
Catastrophic Interference
When learning new information in a neural network severely disrupts previously learned knowledge. This is the phenomenon underlying catastrophic forgetting, and the two terms are often used interchangeably.
CatBoost
A gradient boosting library by Yandex that handles categorical features natively without requiring manual encoding. CatBoost also addresses prediction shift and target leakage.
Causal Inference
Statistical methods for determining cause-and-effect relationships from data, going beyond correlation to understand whether X actually causes Y.
Causal Language Model
A training approach where the model predicts the next token given only the preceding tokens (left-to-right). This is how GPT models are trained and is the basis for text generation.
Classification
A type of supervised learning task where the model predicts which category or class an input belongs to. The output is a discrete label rather than a continuous value.
Clustering
An unsupervised learning technique that groups similar data points together based on their characteristics, without predefined labels. The algorithm discovers natural groupings in the data.
Cold Start Problem
The challenge of making recommendations for new users (who have no history) or new items (which have no ratings). Cold start is a fundamental difficulty in recommendation systems.
Collaborative Filtering
A recommendation technique that predicts a user's interests based on the preferences of similar users. It assumes people who agreed in the past will agree again in the future.
Concept Bottleneck
A model architecture that forces predictions through a set of human-interpretable concepts. The model first predicts concepts, then uses those concepts to make the final prediction.
Confusion Matrix
A table that summarizes the performance of a classification model by showing true positives, true negatives, false positives, and false negatives. It reveals the types of errors a model makes.
Confusion Matrix Metrics
The set of performance metrics computed from the four cells of the confusion matrix (true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)), such as accuracy, precision, recall, and F1 score.
Content-Based Filtering
A recommendation technique that suggests items similar to those a user has previously liked, based on the items' features and attributes rather than other users' behavior.
Context Distillation
A technique where the behavior of a model prompted with detailed instructions is distilled into a model that exhibits the same behavior without the instructions.
Contextual Bandits
An extension of multi-armed bandits where the agent observes context (features) before making a decision, enabling personalized choices based on the current situation.
Continual Learning
Training a model on new data or tasks over time without forgetting previously learned knowledge. Also called lifelong learning or incremental learning.
Continual Pre-Training
Extending a pre-trained model's training on new domain-specific data without starting from scratch. It adapts the model to a new domain while preserving general capabilities.
Contrastive Learning
A self-supervised technique where the model learns by comparing similar (positive) and dissimilar (negative) pairs of examples. It learns representations where similar items are close and different items are far apart.
Convolutional Neural Network
A type of neural network specifically designed for processing grid-like data such as images. CNNs use convolutional layers that apply filters to detect patterns like edges, textures, and shapes at different scales.
Cosine Similarity
A metric that measures the similarity between two vectors by calculating the cosine of the angle between them. Values range from -1 (opposite) to 1 (identical), with 0 meaning the vectors are orthogonal (no measured similarity).
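The formula is the dot product divided by the product of the vector norms. A self-contained sketch:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (identical direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
```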
Cross-Encoder
A model that takes two texts as input simultaneously and outputs a relevance or similarity score. Unlike bi-encoders, cross-encoders consider the full interaction between both texts.
Cross-Entropy
A loss function commonly used in classification tasks that measures the difference between the predicted probability distribution and the actual distribution. Lower cross-entropy means better predictions.
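For a one-hot true label this reduces to the negative log of the probability assigned to the correct class. A minimal sketch:

```python
import math

def cross_entropy(true_dist, pred_dist):
    """H(p, q) = -sum_i p_i * log(q_i); lower means better predictions."""
    return -sum(p * math.log(q) for p, q in zip(true_dist, pred_dist) if p > 0)

# True class is index 0 (one-hot). A confident correct prediction scores low:
print(cross_entropy([1, 0, 0], [0.9, 0.05, 0.05]))  # ~0.105
print(cross_entropy([1, 0, 0], [0.4, 0.3, 0.3]))    # ~0.916
```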
Cross-Validation
A model evaluation technique that splits data into multiple folds, trains on some folds and tests on the held-out fold, repeating so every fold serves as the test set. It provides a robust estimate of model performance.
Curriculum Learning
A training strategy inspired by human education where the model is exposed to training examples in a meaningful order — starting with easier examples and gradually increasing difficulty.
D
Data Parallelism
A distributed training approach where the training data is split across multiple GPUs, each holding a complete copy of the model. Gradients are averaged across GPUs after each batch.
Decision Tree
A supervised learning algorithm that makes predictions by learning a series of if-then-else decision rules from the data. It creates a tree-like structure where each internal node tests a feature and each leaf provides a prediction.
Deep Learning
A specialized subset of machine learning that uses artificial neural networks with multiple layers (hence 'deep') to learn complex patterns in data. Deep learning excels at tasks like image recognition, speech processing, and natural language understanding.
Dimensionality Reduction
Techniques that reduce the number of features (dimensions) in a dataset while preserving the most important information. This makes data easier to visualize, speeds up training, and can improve model performance.
Distributed Training
Splitting model training across multiple GPUs or machines to handle larger models or datasets and reduce training time. Techniques include data parallelism and model parallelism.
DPO
Direct Preference Optimization — a simpler alternative to RLHF that directly optimizes a language model from human preference data without needing a separate reward model. It is more stable and easier to implement.
Dropout
A regularization technique where random neurons are temporarily disabled (dropped out) during each training step. This forces the network to not rely too heavily on any single neuron and builds redundancy.
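A toy version of the common "inverted dropout" variant, in which surviving activations are scaled up during training so that inference needs no adjustment (function name and details are illustrative):

```python
import random

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p and scale survivors
    by 1/(1-p) so the expected activation is unchanged; a no-op at inference."""
    if not training:
        return list(activations)
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

random.seed(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5))  # some units zeroed, survivors doubled
print(dropout([1.0, 2.0], training=False))   # unchanged at inference time
```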
E
Early Stopping
A regularization technique where training is halted when the model's performance on validation data stops improving, even if training loss continues to decrease. It prevents overfitting by finding the optimal training duration.
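The patience-based loop most frameworks implement can be sketched as a toy function over a precomputed list of validation losses (names are illustrative):

```python
def train_with_early_stopping(val_losses, patience=2):
    """Stop once validation loss fails to improve for `patience` consecutive
    checks; return the index of the best epoch seen."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Validation loss bottoms out at epoch 2, then rises; training halts early:
print(train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.7, 0.8]))  # 2
```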
Elastic Weight Consolidation
A technique for continual learning that identifies which weights are important for previously learned tasks and penalizes changes to those weights during new learning.
Embedding Fine-Tuning
Adapting a pre-trained embedding model to a specific domain or task by further training it on domain-specific data, improving retrieval quality for specialized applications.
Ensemble Learning
A strategy that combines multiple models to produce better predictions than any single model alone. Ensemble methods leverage the diversity of different models to reduce errors.
Epoch
One complete pass through the entire training dataset during model training. Models typically require multiple epochs to learn effectively, with each pass refining the model's understanding.
Exploding Gradient
A training problem where gradients become extremely large during backpropagation, causing weight updates to be so drastic that the model becomes unstable and training diverges.
Exploration vs Exploitation
The fundamental tradeoff in reinforcement learning between trying new actions (exploration) to discover potentially better strategies and using known good actions (exploitation) to maximize current reward.
F
F1 Score
The harmonic mean of precision and recall, providing a single metric that balances both. F1 scores range from 0 to 1, with 1 being perfect precision and recall.
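Computed from raw confusion-matrix counts, this looks like (a minimal sketch):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 4 false negatives:
# precision = 8/10 = 0.8, recall = 8/12 ~ 0.667, F1 ~ 0.727
print(round(f1_score(tp=8, fp=2, fn=4), 3))  # 0.727
```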
Feature Engineering
The process of selecting, transforming, and creating input variables (features) from raw data to improve model performance. It requires domain knowledge to identify what information is most useful for the model.
Federated Learning
A decentralized training approach where a model is trained across multiple devices or organizations without sharing raw data. Each participant trains locally and only shares model updates.
Fine-Tuning
The process of taking a pre-trained model and further training it on a smaller, domain-specific dataset to specialize its behavior for a particular task or domain. Fine-tuning adjusts the model's weights to improve performance on the target task.
G
Generalization
A model's ability to perform well on new, unseen data that was not part of its training set. Generalization is the ultimate goal of machine learning — learning patterns, not memorizing examples.
Gradient Accumulation
A technique that simulates larger batch sizes by accumulating gradients over multiple forward passes before performing a single weight update. This enables large effective batch sizes on limited hardware.
Gradient Boosting
An ensemble technique that builds models sequentially, where each new model focuses on correcting the errors made by previous models. It combines many weak learners into a single strong learner.
Gradient Clipping
A technique that caps gradient values at a maximum threshold during training to prevent exploding gradients. If a gradient exceeds the threshold, it is scaled down.
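One common variant clips by the L2 norm of the whole gradient vector rather than per-element (function name is illustrative):

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale the gradient vector down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # ~[0.6, 0.8]: norm rescaled to 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))  # unchanged, norm is already 0.5
```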
Gradient Descent
An optimization algorithm used to minimize the error (loss) of a model by iteratively adjusting parameters in the direction that reduces the loss most quickly. It is the primary method for training machine learning models.
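The update rule is simply "step against the gradient, scaled by the learning rate." A toy sketch minimizing a one-dimensional quadratic:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to minimize a function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); the minimum is at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # converges to ~3.0
```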
Graph Neural Network
A type of neural network designed to operate on graph-structured data (nodes and edges). GNNs learn representations of nodes, edges, or entire graphs by aggregating information from neighbors.
GRU
Gated Recurrent Unit — a simplified version of LSTM that uses fewer gates and parameters while achieving similar performance on many sequence tasks. It is faster to train than LSTM.
H
Hyperparameter
Settings that are configured before training begins and control how the model learns, as opposed to parameters which are learned during training. Examples include learning rate, batch size, and number of layers.
Hyperparameter Tuning
The process of systematically searching for the best combination of hyperparameters for a model. Since hyperparameters are set before training, finding optimal values requires experimentation.
K
K-Means
A clustering algorithm that partitions data into K groups by iteratively assigning each data point to the nearest cluster center and then recalculating the centers. K must be specified in advance.
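A toy one-dimensional version of the assign-then-recompute loop (Lloyd's algorithm), with a deliberately naive initialization; real implementations use smarter seeding such as k-means++:

```python
def kmeans_1d(points, k, iters=10):
    """Assign each point to its nearest center, recompute centers as cluster
    means, and repeat."""
    centers = sorted(points)[:k]  # naive init: first k sorted points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious groups around 1 and 10; centers converge to roughly [1.0, 10.0]:
print(kmeans_1d([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], k=2))
```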
Knowledge Distillation
A model compression technique where a smaller 'student' model is trained to mimic the behavior of a larger 'teacher' model. The student learns not just correct answers but the teacher's nuanced probability distributions.
L
Latent Space
A compressed, lower-dimensional representation of data learned by a model. Points in latent space capture the essential features of the data, and nearby points represent similar data items.
Layer Normalization
A normalization technique that normalizes the inputs across the features for each individual example (rather than across the batch). It stabilizes training in transformers and RNNs.
Learning Rate
A hyperparameter that controls how much the model's weights are adjusted in response to errors during each training step. It determines the size of the steps taken during gradient descent optimization.
LightGBM
Light Gradient Boosting Machine — Microsoft's gradient boosting framework optimized for speed and efficiency. LightGBM uses histogram-based splitting and leaf-wise growth for faster training.
LIME
Local Interpretable Model-agnostic Explanations — a technique that explains individual predictions by approximating the complex model locally with a simple, interpretable model.
Linear Regression
The simplest regression algorithm that models the relationship between input features and a continuous output as a straight line (or hyperplane in multiple dimensions). It minimizes the sum of squared errors.
Logistic Regression
A classification algorithm that uses the sigmoid function to predict the probability of a binary outcome. Despite its name containing 'regression,' it is used for classification tasks.
Long Short-Term Memory
A type of recurrent neural network designed to learn long-term dependencies through special gating mechanisms that control information flow. LSTMs address the vanishing gradient problem of standard RNNs.
LoRA
Low-Rank Adaptation — a parameter-efficient fine-tuning technique that freezes the original model weights and adds small trainable matrices to each layer. It dramatically reduces the compute and memory needed for fine-tuning.
Loss Function
A mathematical function that measures how far a model's predictions are from the actual correct values. The goal of training is to minimize this loss function, making predictions as accurate as possible.
M
Machine Learning
A subset of AI where systems learn patterns from data and improve their performance over time without being explicitly programmed for every scenario. ML algorithms build mathematical models from training data to make predictions or decisions.
Masked Language Model
A training approach where random tokens in the input are replaced with a special [MASK] token and the model learns to predict the original tokens from context. This is how BERT was pre-trained.
Meta-Learning
An approach where models 'learn to learn' — they are trained across many tasks so they can quickly adapt to new tasks with minimal data. Also called learning to learn.
Mixed Precision Training
Training neural networks using a combination of 16-bit and 32-bit floating-point numbers to speed up computation and reduce memory usage while maintaining model accuracy.
Model Distillation Pipeline
An end-to-end workflow for transferring knowledge from a large teacher model to a smaller student model, including data generation, training, evaluation, and deployment.
Model Merging
Combining the weights of multiple fine-tuned models into a single model that inherits capabilities from all source models, without additional training.
Model Parallelism
A distributed training approach where the model itself is split across multiple GPUs, with each GPU holding and computing a different portion of the model.
Momentum
An optimization technique that accelerates gradient descent by accumulating a velocity vector in the direction of persistent gradients, helping overcome local minima and noisy gradients.
Multi-Armed Bandit
A simplified reinforcement learning problem where an agent must choose between multiple options (arms) with unknown payoffs, balancing exploration of new options with exploitation of known good ones.
N
Neural Architecture Search
An automated technique for finding optimal neural network architectures by searching through a vast space of possible designs. NAS automates architecture decisions that normally require expert intuition.
Neural Network
A computing system inspired by the biological neural networks in the human brain. It consists of interconnected nodes (neurons) organized in layers that process information and learn to recognize patterns.
Noise
Random variation or errors in data that do not represent true underlying patterns. In deep learning, noise can also refer to the random input used in generative models.
O
Online Learning
A training paradigm where the model updates continuously as new data arrives, one example at a time (or in small batches), rather than training on a fixed dataset.
Overfitting
When a model learns the training data too well — including its noise and random fluctuations — and performs poorly on new, unseen data. The model essentially memorizes rather than generalizes.
Overfitting Prevention
The collection of techniques used to ensure a model generalizes well to unseen data rather than memorizing training examples. Includes regularization, dropout, early stopping, and data augmentation.
P
Parameter
Any learnable value in a machine learning model that is adjusted during training. Parameters include weights and biases in neural networks. Model size is often described by parameter count.
Perceptron
The simplest form of a neural network — a single neuron that takes weighted inputs, sums them, and applies an activation function to produce an output. It is the fundamental building block of neural networks.
Perplexity
A metric that measures how well a language model predicts text. Lower perplexity indicates the model is less 'surprised' by the text, meaning it can predict the next token more accurately.
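Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each token. A minimal sketch:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability over the predicted tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that assigns every token probability 0.25 has perplexity ~4:
# it is as "surprised" as if choosing uniformly among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```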
Pre-training
The initial phase of training a model on a large, general-purpose dataset before specializing it for specific tasks. Pre-training gives the model broad knowledge and capabilities.
Precision
Of all the items the model predicted as positive, the proportion that were actually positive. Precision measures how trustworthy the model's positive predictions are.
Preference Optimization
Training techniques that directly optimize models based on human preference data, where humans indicate which of two model outputs they prefer.
Principal Component Analysis
A dimensionality reduction technique that transforms data into a new coordinate system where the first axis captures the most variance, the second axis the next most, and so on.
Prompt Tuning
A parameter-efficient fine-tuning technique that prepends learnable 'soft prompt' tokens to the input while keeping the main model weights frozen. Only the soft prompt parameters are trained.
Pruning
A model compression technique that removes unnecessary or redundant weights, neurons, or layers from a trained neural network. Like pruning a plant, it removes parts that are not contributing to overall health.
Q
QLoRA
Quantized Low-Rank Adaptation — combines LoRA with quantization to further reduce memory requirements for fine-tuning. It quantizes the base model to 4-bit precision while training LoRA adapters in higher precision.
Quantization-Aware Training
Training a model while simulating the effects of quantization, so the model learns to maintain accuracy even when weights are later reduced to lower precision.
R
Random Forest
An ensemble learning method that builds multiple decision trees during training and outputs the majority vote (classification) or average prediction (regression) of all the trees. The 'forest' of diverse trees is more robust than any single tree.
Recall
Of all the actually positive items in the dataset, the proportion that the model correctly identified. Recall measures how completely the model finds all relevant items.
Recurrent Neural Network
A type of neural network designed for sequential data where the output at each step depends on previous steps. RNNs have a form of memory that allows them to process sequences like text, time series, and audio.
Regression
A type of supervised learning task where the model predicts a continuous numerical value rather than a discrete category. The output can be any number within a range.
Regularization
Techniques used to prevent overfitting by adding constraints or penalties to the model during training. Regularization discourages the model from becoming too complex or fitting noise in the training data.
Reinforcement Learning
A type of machine learning where an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties. The agent aims to maximize cumulative reward over time through trial and error.
Reinforcement Learning from AI Feedback
A variant of RLHF where AI models (instead of humans) provide the feedback used to train reward models and align language models. RLAIF reduces the cost of human feedback and sidesteps its scalability constraints.
ReLU
Rectified Linear Unit — the most commonly used activation function in deep learning. It outputs the input directly if positive, and zero otherwise: f(x) = max(0, x).
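The definition translates directly to code:

```python
def relu(x):
    """f(x) = max(0, x): pass positives through, zero out negatives."""
    return max(0.0, x)

print([relu(x) for x in [-2.0, -0.5, 0.0, 0.5, 2.0]])  # [0.0, 0.0, 0.0, 0.5, 2.0]
```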
Representation Learning
The process of automatically discovering useful features or representations from raw data, rather than manually engineering them. Deep learning excels at learning hierarchical representations.
Residual Connection
A shortcut that allows the input to a layer to bypass one or more layers and be added directly to the output. This enables training of much deeper networks by ensuring gradient flow.
Retraining
The process of training a model again on updated data to restore or improve its performance. Retraining addresses model drift and incorporates new patterns the original model did not learn.
Retrieval-Augmented Fine-Tuning
Combining fine-tuning with retrieval capabilities, training a model to effectively use retrieved context. RAFT teaches the model when and how to leverage external knowledge.
Reward Model
A model trained to predict how good a response is based on human preferences. In RLHF, the reward model scores outputs to guide the language model toward responses humans prefer.
Reward Modeling
Training a separate model to predict human preferences, which then serves as the reward signal for reinforcement learning. The reward model learns what humans consider 'good' responses.
Reward Shaping
The practice of designing intermediate rewards to guide a reinforcement learning agent toward desired behavior, rather than only providing reward at the final goal state.
RLHF
Reinforcement Learning from Human Feedback — a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model that guides further training.
S
Self-Supervised Learning
A training approach where the model generates its own labels from the data, typically by masking or hiding parts of the input and learning to predict them. No human-annotated labels are needed.
Sentence Transformers
A framework for computing dense vector representations (embeddings) for sentences and paragraphs. Built on top of transformer models and optimized for semantic similarity tasks.
SHAP
SHapley Additive exPlanations — a method based on game theory that explains individual predictions by calculating each feature's contribution to the prediction. SHAP values are additive and consistent.
Sigmoid
An activation function that squashes input values into a range between 0 and 1, creating an S-shaped curve. It is commonly used for binary classification outputs and in certain neural network architectures.
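A minimal sketch of the function and its S-shaped behavior:

```python
import math

def sigmoid(x):
    """1 / (1 + e^-x): squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # 0.5, the midpoint of the S-curve
print(sigmoid(4.0))   # ~0.982, approaching 1 for large positive inputs
print(sigmoid(-4.0))  # ~0.018, approaching 0 for large negative inputs
```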
Softmax
A function that converts a vector of numbers into a probability distribution, where each value is between 0 and 1 and all values sum to 1. It is typically used as the final layer in classification models.
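A minimal sketch, using the standard max-subtraction trick for numerical stability:

```python
import math

def softmax(logits):
    """Exponentiate and normalize; subtracting the max first avoids overflow."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # roughly [0.659, 0.242, 0.099], summing to 1
```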
Stochastic
Involving randomness or probability. In ML, stochastic processes include random weight initialization, stochastic gradient descent, and probabilistic sampling during text generation.
Stochastic Gradient Descent
A variant of gradient descent that updates model parameters using a single random training example (or small batch) at each step instead of the entire dataset. It is faster and can escape local minima.
Supervised Learning
A type of machine learning where the model is trained on labeled data — input-output pairs where the correct answer is provided. The model learns to map inputs to outputs and can then predict outputs for new, unseen inputs.
Support Vector Machine
A classification algorithm that finds the optimal hyperplane (decision boundary) that maximizes the margin between different classes. SVMs are effective in high-dimensional spaces.
T
Tensor
A multi-dimensional array of numbers — the fundamental data structure in deep learning. Scalars are 0D tensors, vectors are 1D, matrices are 2D, and higher-dimensional arrays are nD tensors.
TF-IDF
Term Frequency-Inverse Document Frequency — a statistical measure that evaluates how important a word is to a document within a collection. Words frequent in one document but rare across documents score high.
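A toy sketch using one common smoothed idf variant; libraries differ in the exact formula and normalization, so treat the constants here as illustrative:

```python
import math

def tf_idf(term, doc, corpus):
    """tf(t, d) * idf(t): term frequency in the document, scaled by how
    rare the term is across the corpus (smoothed idf)."""
    tf = doc.count(term) / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + docs_with_term)) + 1
    return tf * idf

corpus = [
    "the cat sat".split(),
    "the dog ran".split(),
    "the cat ran home".split(),
]
# "the" appears in every document, so it scores lower than the rarer "dog":
print(tf_idf("the", corpus[1], corpus))
print(tf_idf("dog", corpus[1], corpus))
```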
Topic Modeling
An unsupervised technique that automatically discovers abstract themes (topics) in a collection of documents. Each document is represented as a mixture of topics.
Training-Serving Skew
A discrepancy between how features are computed during model training versus how they are computed during production serving. This is one of the most common and hardest-to-detect causes of model failure.
Transfer Learning
A technique where a model trained on one task is repurposed as the starting point for a model on a different but related task. Instead of training from scratch, you leverage knowledge the model has already acquired.
U
Underfitting
When a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and new data. The model has not learned enough from the training data.
Unsupervised Learning
A type of machine learning where the model learns patterns from unlabeled data without being told what the correct output should be. The algorithm discovers hidden structures, groupings, or patterns in the data on its own.
W
Weight
A numerical parameter in a neural network that is learned during training. Weights determine the strength of connections between neurons and collectively encode the model's knowledge.
Word Embedding
A technique that maps words to dense numerical vectors where semantic relationships are captured. Similar words have similar vectors, and relationships like analogy are encoded in vector arithmetic.