Vision-language model
A model that takes both images and text as input and reasons over them together — describing pictures, answering questions about them, or grounding language in what it sees.
A vision-language model (VLM) joins a vision encoder to a language model so the two share one representation: image features and word embeddings live in the same vector space. That lets it caption an image, answer questions about a screenshot, or read a document and act on it. The training trick is alignment: the model is trained on huge numbers of paired images and text until it learns which words go with what it sees.
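To make the shared representation concrete, here is a minimal sketch of one common way to wire the pieces together: project the vision encoder's patch features into the language model's embedding size, then place them in front of the text token embeddings. Everything in it is an illustrative assumption rather than any particular model's design: the class name ToyVLM, the dimensions, and the random matrices standing in for trained networks are all hypothetical. Other designs, such as cross-attention between the two models, exist as well.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random weights; a stand-in for parameters a real model would learn."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyVLM:
    """Hypothetical toy model: vision encoder -> projector -> shared token sequence."""

    def __init__(self, patch_dim, vision_dim, text_dim, vocab_size, seed=0):
        rng = random.Random(seed)
        # Assumption: a single linear map stands in for a full vision encoder.
        self.vision_encoder = random_matrix(vision_dim, patch_dim, rng)
        # The projector maps vision features to the text embedding size.
        self.projector = random_matrix(text_dim, vision_dim, rng)
        # One embedding vector per vocabulary entry, as in a language model.
        self.token_embeddings = random_matrix(vocab_size, text_dim, rng)

    def encode_image(self, patches):
        """Turn flattened image patches into vectors of size text_dim."""
        features = [matvec(self.vision_encoder, patch) for patch in patches]
        return [matvec(self.projector, feature) for feature in features]

    def embed_text(self, token_ids):
        """Look up the embedding vector for each text token id."""
        return [self.token_embeddings[token_id] for token_id in token_ids]

    def build_input_sequence(self, patches, token_ids):
        """Image vectors first, then text vectors, all the same size."""
        return self.encode_image(patches) + self.embed_text(token_ids)


model = ToyVLM(patch_dim=12, vision_dim=8, text_dim=6, vocab_size=50)
patches = [[0.5] * 12 for _ in range(4)]  # four fake flattened patches
sequence = model.build_input_sequence(patches, token_ids=[3, 17, 42])
print(len(sequence), len(sequence[0]))  # prints "7 6"
```

The resulting sequence holds four image vectors followed by three text vectors, all of the same size, which is what lets a language model attend over both. In a trained model the projector weights are what alignment adjusts, so that an image vector lands near the words that describe it.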
In practice, VLMs power image captioning, document understanding, screenshots-to-actions for agents, and most "look at this and tell me…" interactions.
The interesting recent thread is how small a competent VLM can be: readable reference implementations now fit in a few hundred lines of code, which makes the architecture teachable rather than mysterious.