Data Science

Data Preprocessing

The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categorical variables, scaling features, and detecting and handling outliers.

Why It Matters

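Data preprocessing is often estimated to take 60-80% of a data scientist's time. Poor preprocessing is also a common cause of model failure in production, because a model trained on incomplete, inconsistent, or badly scaled data learns from those flaws.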

Example

Cleaning a customer dataset by filling missing ages with the median, converting 'Male/Female' to 0/1, normalizing income values to a 0-1 range, and removing duplicate records.
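
The steps in this example can be sketched in a few lines of pandas. This is a minimal illustration, not a definitive pipeline: the DataFrame, its column names (customer_id, age, gender, income), and the preprocess helper are all hypothetical, invented here to mirror the example above.

```python
import pandas as pd

# Hypothetical customer records; columns and values are invented for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34.0, None, None, 51.0, 28.0],
    "gender": ["Male", "Female", "Female", "Female", "Male"],
    "income": [40000.0, 85000.0, 85000.0, 120000.0, 60000.0],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps from the example (illustrative helper)."""
    # 1. Remove exact duplicate records first, so repeated rows
    #    do not skew the median computed in the next step.
    out = df.drop_duplicates().copy()

    # 2. Fill missing ages with the median of the observed ages.
    out["age"] = out["age"].fillna(out["age"].median())

    # 3. Encode 'Male'/'Female' as 0/1. Any other value would become NaN,
    #    so real data needs a check for unexpected categories.
    out["gender"] = out["gender"].map({"Male": 0, "Female": 1})

    # 4. Min-max normalize income to the 0-1 range.
    #    Assumes income is not constant (otherwise this divides by zero).
    lo, hi = out["income"].min(), out["income"].max()
    out["income"] = (out["income"] - lo) / (hi - lo)

    return out.reset_index(drop=True)


clean = preprocess(customers)
print(clean)
```

The order of the steps is a design choice: dropping duplicates before computing the median and the min/max keeps repeated records from influencing those statistics. In a real project the fill value and the scaling range would normally be computed on the training data only and then reused on new data.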

Think of it like...

Preparing ingredients before cooking: washing vegetables, measuring spices, and preheating the oven. The prep work largely determines the quality of the final dish.

Related Terms

Feature Engineering

The process of selecting, transforming, and creating input variables (features) from raw data to improve model performance. It requires domain knowledge to identify what information is most useful for the model.
