concept also: VLM, vision language model, multimodal model

Vision-language model

A model that takes both images and text as input and reasons over them together — describing pictures, answering questions about them, or grounding language in what it sees.

Concept · updated June 2026 · verified 2026-06-04

In a nutshell

A vision-language model joins a vision encoder to a language model so the two share one representation: pixels and words live in the same space. That lets it caption an image, answer questions about a screenshot, or read a document and act on it. The training trick is alignment — pairing huge numbers of images with text until the model learns which words go with what it sees.
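One common way to get pixels and words into the same space is to project the vision encoder's patch features into the language model's embedding width and prepend them to the text tokens. The toy sketch below assumes that design. The dimensions, the tiny vocabulary, the random weights, and the helper names (`matmul`, `random_matrix`, `build_input_sequence`) are all illustrative placeholders, not taken from any particular model.

```python
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    """Random weights standing in for parameters that would normally be learned."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

VISION_DIM = 6    # assumed width of the vision encoder's patch features
TEXT_DIM = 4      # assumed width of the language model's token embeddings
NUM_PATCHES = 3   # assumed number of image patches
VOCAB = {"what": 0, "is": 1, "this": 2}  # hypothetical toy vocabulary

# Stand-in for the vision encoder's output: one feature vector per image patch.
patch_features = random_matrix(NUM_PATCHES, VISION_DIM, rng)

# Projector: maps vision features into the text embedding space.
projector = random_matrix(VISION_DIM, TEXT_DIM, rng)

# Stand-in for the language model's token embedding table.
embedding_table = random_matrix(len(VOCAB), TEXT_DIM, rng)


def build_input_sequence(patches, prompt):
    """Return projected image tokens followed by text token embeddings."""
    image_tokens = matmul(patches, projector)
    text_tokens = [embedding_table[VOCAB[word]] for word in prompt.split()]
    return image_tokens + text_tokens


sequence = build_input_sequence(patch_features, "what is this")
print(len(sequence), len(sequence[0]))  # 6 4
```

After projection, every element of the sequence has the same width, so the language model can attend over image and text positions alike. A real implementation would use a tensor library and trained weights.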

In practice, VLMs power image captioning, document understanding, screenshot-to-action pipelines for agents, and most "look at this and tell me…" interactions.

The interesting recent thread is how small a competent VLM can be — readable reference implementations now fit in a few hundred lines, which makes the architecture teachable rather than mysterious.

Where it came from

Year: 2021

Source: Radford et al., CLIP (OpenAI)

Why it mattered: Contrastive image–text pretraining showed that a model could connect pixels and language at scale.
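To make "contrastive image–text pretraining" concrete, here is a minimal sketch of a symmetric contrastive loss in the spirit of CLIP. It is not the authors' code. It assumes that the image and text embeddings are already computed and that row i of each list belongs to the same pair. The temperature value and the function names are illustrative assumptions.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under a softmax over `logits`."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: each image should pick its own text, and vice versa."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # sim[i][j] is the temperature-scaled cosine similarity of image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    n = len(sim)
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    columns = [list(col) for col in zip(*sim)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy batch: pairs that line up give a low loss, swapped pairs a high one.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)  # True
```

Minimizing this loss pulls matching image and text embeddings together and pushes mismatched ones apart, which is how the two modalities come to share one space.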

In themes

Open Models

Agents

Mentioned in 1 find

nanoVLM