Semi-Structured Data
Data that has some organizational structure, such as tags or key-value pairs, but does not conform to a rigid schema like that of a relational database table. Examples include JSON, XML, and HTML.
Why It Matters
Semi-structured data is the format of most APIs and web content. ML systems must parse and normalize it before it can be used for training.
Example
A JSON API response: {"user": {"name": "John", "orders": [{"id": 1, "amount": 29.99}]}}. The keys and nesting give it structure, but fields can be missing, added, or nested differently from one record to the next.
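A minimal Python sketch of how a response like this might be parsed and normalized into flat rows, one per order. The function name flatten_orders and the output column names are hypothetical, chosen for illustration; missing fields are read with .get() so that records with a different shape do not raise errors.

```python
import json


def flatten_orders(raw: str) -> list[dict]:
    """Turn one nested user record into flat rows, one row per order."""
    record = json.loads(raw)
    # Fields may be absent in semi-structured data, so fall back to defaults.
    user = record.get("user", {})
    rows = []
    for order in user.get("orders", []):
        rows.append(
            {
                "user_name": user.get("name"),
                "order_id": order.get("id"),
                "amount": order.get("amount"),
            }
        )
    return rows


raw = '{"user": {"name": "John", "orders": [{"id": 1, "amount": 29.99}]}}'
rows = flatten_orders(raw)
print(rows)
```

A record with no "orders" key simply produces an empty list, which is one way to tolerate the variation between records described above.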
Think of it like...
Like a form letter with blanks — there is a template (structure) but the content varies and can include different amounts of information.
Related Terms
Structured Data
Data organized in a predefined format with clear rows and columns, like spreadsheets and relational databases. Each field has a defined type and meaning.
Unstructured Data
Data without a predefined format or organization, such as text documents, images, videos, audio, and social media posts. Commonly cited industry estimates put the share of enterprise data that is unstructured at around 80% or more.
Data Preprocessing
The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categories, scaling features, and removing outliers.
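As a rough illustration, three of these steps can be sketched in plain Python: filling missing values with the mean, min-max scaling, and one-hot encoding of categories. The function name preprocess and the field names "amount" and "category" are assumptions made for this example. Outlier removal is left out, and the sketch assumes at least one row has a known amount.

```python
from statistics import mean


def preprocess(rows):
    """Return (features, categories) for rows with 'amount' and 'category'."""
    # Handle missing values: replace None with the mean of the known amounts.
    known = [r["amount"] for r in rows if r["amount"] is not None]
    fill = mean(known)
    amounts = [r["amount"] if r["amount"] is not None else fill for r in rows]

    # Scale the numeric feature to the range [0, 1] (min-max scaling).
    lo, hi = min(amounts), max(amounts)
    span = (hi - lo) or 1.0  # avoid division by zero when all values match

    # Encode categories as one-hot vectors in a fixed, sorted order.
    categories = sorted({r["category"] for r in rows})

    features = []
    for row, amount in zip(rows, amounts):
        scaled = (amount - lo) / span
        one_hot = [1.0 if row["category"] == c else 0.0 for c in categories]
        features.append([scaled] + one_hot)
    return features, categories


sample = [
    {"amount": 10.0, "category": "a"},
    {"amount": None, "category": "b"},
    {"amount": 30.0, "category": "a"},
]
features, categories = preprocess(sample)
print(categories)
print(features)
```

In practice these steps are usually done with a library such as pandas or scikit-learn; the hand-written version here only shows what each step does to the data.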