Data Science

Data Preprocessing

The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categorical variables, scaling features, and detecting and handling outliers.

Why It Matters

Data preprocessing is often estimated to take 60-80% of a data scientist's time. Poor preprocessing is also a common cause of model failure in production, because a model can only learn from the data it is given.

Example

Cleaning a customer dataset by filling missing ages with the median, converting 'Male/Female' to 0/1, normalizing income values to a 0-1 range, and removing duplicate records.
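
The four steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the records, the field names (id, age, gender, income), and the preprocess function are all hypothetical, and the Male=0 / Female=1 mapping is an arbitrary choice.

```python
from statistics import median

# Hypothetical raw records; None marks a missing age.
raw_customers = [
    {"id": 1, "age": 34, "gender": "Male", "income": 40000},
    {"id": 2, "age": None, "gender": "Female", "income": 85000},
    {"id": 3, "age": 52, "gender": "Female", "income": 60000},
    {"id": 3, "age": 52, "gender": "Female", "income": 60000},  # exact duplicate
    {"id": 4, "age": 28, "gender": "Male", "income": 100000},
]

GENDER_CODES = {"Male": 0, "Female": 1}  # assumed mapping, for illustration only


def preprocess(records):
    """Deduplicate, fill missing ages, encode gender, and min-max scale income."""
    # 1. Drop exact duplicate records, keeping the first occurrence.
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(record))  # copy so the input is not mutated

    # 2. Fill missing ages with the median of the known ages.
    median_age = median(r["age"] for r in unique if r["age"] is not None)

    # 3. Find the income range for min-max scaling to [0, 1].
    incomes = [r["income"] for r in unique]
    low, high = min(incomes), max(incomes)
    span = high - low

    for r in unique:
        if r["age"] is None:
            r["age"] = median_age
        r["gender"] = GENDER_CODES[r["gender"]]
        # If every income is identical, map all of them to 0.0 to avoid dividing by zero.
        r["income"] = (r["income"] - low) / span if span else 0.0
    return unique


clean_customers = preprocess(raw_customers)
```

In this sketch duplicates are removed first, so that repeated rows do not skew the median or the income range. With the sample data, the missing age is filled with 34, the median of 28, 34, and 52.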

Think of it like...

Like preparing ingredients before cooking — washing vegetables, measuring spices, and preheating the oven. The prep work determines the quality of the final dish.

Related Terms