Data Pipeline
An automated workflow that extracts data from sources, transforms it through processing steps, and loads it into a destination for use. In ML, data pipelines ensure consistent data flow from raw sources to model training.
Why It Matters
Data pipelines are the plumbing of AI systems. A broken pipeline means no fresh data, stale models, and degraded performance — often without anyone noticing.
Example
An ETL pipeline that, every hour, extracts customer data from a CRM, joins it with transaction records from a database, cleans the result, and loads it into a feature store for model training.
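The extract, transform, and load stages from the example above can be sketched in plain Python. This is a minimal, self-contained illustration: the in-memory CRM rows, transaction rows, and dict-based feature store are all hypothetical stand-ins for real source systems and storage.

```python
from datetime import datetime, timezone

# Hypothetical in-memory sources standing in for a CRM and a transactions DB.
CRM_ROWS = [
    {"customer_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"customer_id": 2, "name": "Grace", "email": None},  # missing email
]
TXN_ROWS = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 55.5},
]

def extract():
    """Pull raw rows from the source systems."""
    return list(CRM_ROWS), list(TXN_ROWS)

def transform(customers, transactions):
    """Join CRM and transaction data, clean it, and compute features."""
    spend = {}
    for t in transactions:
        spend[t["customer_id"]] = spend.get(t["customer_id"], 0.0) + t["amount"]
    features = []
    for c in customers:
        features.append({
            "customer_id": c["customer_id"],
            "has_email": c["email"] is not None,  # cleaning: normalize missing values
            "total_spend": spend.get(c["customer_id"], 0.0),
            "computed_at": datetime.now(timezone.utc).isoformat(),
        })
    return features

def load(features, store):
    """Write features into the destination, keyed by customer_id."""
    for row in features:
        store[row["customer_id"]] = row

def run_pipeline(store):
    customers, transactions = extract()
    load(transform(customers, transactions), store)

feature_store = {}
run_pipeline(feature_store)
# feature_store[1]["total_spend"] == 150.0 (120.0 + 30.0)
```

In production the same three stages would typically run on a scheduler (e.g. a cron job or an orchestrator) and write to durable storage, but the shape of the workflow is the same.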
Think of it like...
Like a factory assembly line — raw materials enter one end, pass through processing stations, and finished products emerge at the other end, all running automatically.
Related Terms
ETL
Extract, Transform, Load — a data integration process that extracts data from source systems, transforms it into a usable format, and loads it into a destination system.
Data Engineering
The practice of designing, building, and maintaining the systems and infrastructure that collect, store, and prepare data for analysis and machine learning.
Feature Store
A centralized repository for storing, managing, and serving machine learning features. It ensures consistent feature computation between training and serving, and enables feature reuse across teams.
MLOps
Machine Learning Operations — the set of practices that combine ML, DevOps, and data engineering to deploy and maintain ML models in production reliably and efficiently.
Data Preprocessing
The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categories, scaling features, and removing outliers.
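Three of the preprocessing steps named above (imputing missing values, encoding categories, scaling features) can be shown in a short stdlib-only sketch. The toy rows and column names are hypothetical, chosen just to make each step visible.

```python
# Toy raw data: one missing numeric value and one categorical column.
rows = [
    {"age": 20, "plan": "free"},
    {"age": None, "plan": "pro"},  # missing value to impute
    {"age": 40, "plan": "free"},
]

# 1. Handle missing values: impute missing ages with the column mean.
known = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(known) / len(known)  # 30.0
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# 2. Encode categories: one-hot encode the "plan" column.
categories = sorted({r["plan"] for r in rows})
for r in rows:
    for c in categories:
        r[f"plan_{c}"] = 1 if r["plan"] == c else 0
    del r["plan"]

# 3. Scale features: min-max scale "age" into [0, 1].
lo, hi = min(r["age"] for r in rows), max(r["age"] for r in rows)
for r in rows:
    r["age"] = (r["age"] - lo) / (hi - lo)
```

Libraries such as scikit-learn provide these transformations as reusable, fitted components, which matters in a pipeline because the same parameters (e.g. the mean used for imputation) must be applied identically at training and serving time.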