Vision-Language Model
An AI model that can process both visual and textual inputs, interpreting images and generating text about them. VLMs combine computer vision with language understanding in a single model.
Why It Matters
VLMs enable applications that require understanding visual content — from analyzing charts and diagrams to answering questions about photos.
Example
GPT-4V analyzing a photograph of a whiteboard with handwritten notes, transcribing the text, understanding the diagrams, and answering questions about the content.
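The example above boils down to sending an image and a question together in one request. As a minimal sketch, this is what such a request payload could look like, assuming the OpenAI-style chat format where message content mixes text and image parts; the model name and URL are placeholders, and no actual API call is made:

```python
def build_vlm_request(image_url: str, question: str) -> dict:
    """Assemble a chat-style request pairing an image with a text question."""
    return {
        "model": "gpt-4o",  # placeholder model name
        "messages": [
            {
                "role": "user",
                "content": [
                    # Text and image travel in the same message, so the
                    # model can ground its answer in the picture.
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

request = build_vlm_request(
    "https://example.com/whiteboard.jpg",  # placeholder image URL
    "Transcribe the handwritten notes and explain the diagram.",
)
```

The key point is structural: unlike a text-only LLM request, a VLM request carries visual and textual inputs side by side in a single turn.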
Think of it like...
Like a person who can both see and speak — they can look at something, understand it, and describe or answer questions about it in words.
Related Terms
Multimodal AI
AI systems that can process and generate multiple types of data — text, images, audio, video — within a single model. Multimodal models understand the relationships between different data types.
Computer Vision
A field of AI that trains computers to interpret and understand visual information from the world — images, videos, and real-time camera feeds. It enables machines to 'see' and make decisions based on what they see.
Large Language Model
A type of AI model trained on massive amounts of text data that can understand and generate human-like text. LLMs are typically built on the transformer architecture and have billions of parameters, enabling them to perform a wide range of language tasks.
CLIP
Contrastive Language-Image Pre-training — an OpenAI model trained to understand the relationship between images and text. CLIP can match images to text descriptions without being trained on specific image categories.
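CLIP's matching step can be illustrated with a toy sketch: images and captions are embedded into one shared vector space, and the best caption for an image is the one whose embedding is most similar (by cosine similarity) to the image embedding. The vectors below are made up for illustration; real CLIP embeddings come from its trained image and text encoders.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_caption(image_emb, caption_embs, captions):
    """Return the caption whose embedding is closest to the image embedding."""
    scores = [cosine_similarity(image_emb, c) for c in caption_embs]
    return captions[int(np.argmax(scores))]

# Made-up 3-d embeddings standing in for real CLIP encoder outputs.
image_emb = np.array([0.9, 0.1, 0.2])   # pretend: a photo of a dog
caption_embs = [
    np.array([0.8, 0.2, 0.1]),          # "a photo of a dog"
    np.array([0.1, 0.9, 0.3]),          # "a photo of a cat"
]
captions = ["a photo of a dog", "a photo of a cat"]

print(best_caption(image_emb, caption_embs, captions))  # prints "a photo of a dog"
```

This zero-shot matching is why CLIP needs no training on specific image categories: any new label can be scored simply by embedding its text description and comparing similarities.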