TF-IDF
Term Frequency-Inverse Document Frequency — a statistical measure of how important a word is to a document within a collection. A term scores high when it is frequent in one document but rare across the rest of the collection.
Why It Matters
TF-IDF is foundational to information retrieval and text analysis. Understanding it explains why search engines rank documents by more than raw word counts.
Example
The word 'neural' appearing 20 times in an AI paper scores high (frequent in this doc, rare overall), while 'the' appearing 50 times scores low (common everywhere).
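The weighting behind this example can be sketched in a few lines. This is a minimal illustration using raw term counts and a natural-log IDF; the corpus and the exact smoothing are assumptions, and production libraries (e.g. scikit-learn) apply additional normalization.

```python
import math

def tf_idf(term, doc, corpus):
    # Term frequency: raw count of the term in this document.
    tf = doc.count(term)
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for d in corpus if term in d)
    # Inverse document frequency: rare terms get a large boost,
    # ubiquitous terms get a weight near zero.
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Toy corpus (tokenized documents) chosen for illustration.
docs = [
    ["neural", "networks", "learn", "neural", "representations"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
]

# 'neural' is frequent in doc 0 and appears nowhere else, so it scores high;
# 'the' appears in most documents, so its IDF drags the score down.
print(tf_idf("neural", docs[0], docs))  # 2 * ln(3/1) ≈ 2.197
print(tf_idf("the", docs[1], docs))     # 2 * ln(3/2) ≈ 0.811
```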
Think of it like...
Like judging a chef's specialty — if they make pasta every day (high frequency) and no other chef in town does (rare), pasta is clearly their defining dish.
Related Terms
BM25
Best Matching 25 — a widely used ranking function for keyword-based information retrieval. BM25 scores documents based on query term frequency, document length, and corpus statistics, refining TF-IDF-style weighting with term-frequency saturation and document-length normalization.
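The ingredients above (term frequency, document length, corpus statistics) can be seen in a compact sketch of the Okapi BM25 formula for a single query term. The toy corpus and the default parameter values are assumptions for illustration; real search engines tune k1 and b and sum scores over all query terms.

```python
import math

def bm25_score(term, doc, corpus, k1=1.5, b=0.75):
    # Okapi BM25 score of one query term for one document.
    # k1 controls term-frequency saturation; b controls length normalization.
    n = sum(1 for d in corpus if term in d)            # docs containing term
    idf = math.log((len(corpus) - n + 0.5) / (n + 0.5) + 1)
    tf = doc.count(term)
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average doc length
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))

# Same toy corpus as before, chosen for illustration.
docs = [
    ["neural", "networks", "learn", "neural", "representations"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
]

# A rare, repeated term outranks a ubiquitous one, as with TF-IDF,
# but repeated occurrences saturate instead of growing linearly.
print(bm25_score("neural", docs[0], docs))
print(bm25_score("the", docs[1], docs))
```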
Text Mining
The process of deriving meaningful patterns, trends, and insights from large collections of text data using NLP and statistical techniques.
Natural Language Processing
The branch of AI that deals with the interaction between computers and human language. NLP enables machines to read, interpret, and generate human language in useful ways.
Feature Engineering
The process of selecting, transforming, and creating input variables (features) from raw data to improve model performance. It requires domain knowledge to identify what information is most useful for the model.
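A small sketch of what that process looks like in practice, turning a raw record into model-ready features. The record format and the specific features are hypothetical, chosen to show how domain knowledge (e.g. that weekend traffic behaves differently) shapes the features a model receives.

```python
from datetime import datetime

def engineer_features(raw):
    # Derive model-ready features from one raw log record.
    # The input schema here is a made-up example, not a standard format.
    ts = datetime.fromisoformat(raw["timestamp"])
    text = raw["message"]
    return {
        "hour_of_day": ts.hour,            # time-of-day often predicts behavior
        "is_weekend": ts.weekday() >= 5,   # domain knowledge: weekends differ
        "msg_length": len(text),           # simple derived numeric feature
        "has_error": "error" in text.lower(),  # keyword flag as a binary feature
    }

record = {"timestamp": "2024-06-01T14:30:00", "message": "Error: timeout"}
print(engineer_features(record))
```

Each output key is a feature a downstream model can consume directly, which is the point: the raw record itself is not in a form most models can use.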