Data Science
Data Preprocessing
The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categorical variables, scaling features, and detecting or removing outliers.
Why It Matters
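Data preprocessing is often estimated to take 60-80% of a data scientist's time, though the figure varies by project. Poor preprocessing is a frequent cause of model failure in production, because a model can only learn from the data it is actually given.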
Example
Cleaning a customer dataset by filling missing ages with the median, converting 'Male/Female' to 0/1, normalizing income values to a 0-1 range, and removing duplicate records.
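The four steps in this example can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the field names (name, age, gender, income), the preprocess function, and the sample records are all made up for the sketch, and in practice a library such as pandas or scikit-learn would usually handle these steps.

```python
from statistics import median

def preprocess(records):
    """Return a cleaned copy of a list of customer dicts (hypothetical schema)."""
    # Remove exact duplicate records, keeping the first occurrence.
    seen = set()
    unique = []
    for r in records:
        key = (r["name"], r["age"], r["gender"], r["income"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Median of the ages that are present, used to fill the missing ones.
    age_median = median(r["age"] for r in unique if r["age"] is not None)

    # Assumed encoding for this sketch: Male -> 0, Female -> 1.
    gender_codes = {"Male": 0, "Female": 1}

    # Min-max scaling maps the smallest income to 0 and the largest to 1.
    incomes = [r["income"] for r in unique]
    lo, hi = min(incomes), max(incomes)
    span = hi - lo

    cleaned = []
    for r in unique:
        cleaned.append({
            "name": r["name"],
            "age": r["age"] if r["age"] is not None else age_median,
            "gender": gender_codes[r["gender"]],
            # Guard against division by zero when all incomes are equal.
            "income": (r["income"] - lo) / span if span else 0.0,
        })
    return cleaned

# Made-up sample data for illustration only.
customers = [
    {"name": "Ana", "age": 34, "gender": "Female", "income": 52000},
    {"name": "Ben", "age": None, "gender": "Male", "income": 38000},
    {"name": "Cy", "age": 51, "gender": "Male", "income": 94000},
    {"name": "Ana", "age": 34, "gender": "Female", "income": 52000},  # duplicate
]

cleaned = preprocess(customers)
for row in cleaned:
    print(row)
```

The sketch removes duplicates first so that repeated rows do not skew the median or the income range. That ordering is a design choice made here, not something the example prescribes.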
Think of it like...
Preparing ingredients before cooking: washing vegetables, measuring spices, and preheating the oven. The prep work largely determines the quality of the final dish.