Vision Language Models: Learning From Text & Images Together

Vision language models are multimodal models that learn from both text and images at the same time, which lets them perform a variety of tasks such as image captioning and visual question answering.
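
As a concrete illustration, here is a minimal sketch of asking a pretrained vision language model to caption an image with the Hugging Face transformers library; the checkpoint name and the image path are illustrative assumptions, not recommendations from this text.

```python
# Minimal captioning sketch, assuming the `transformers` package is installed.
from transformers import pipeline

# Any image-to-text VLM checkpoint works here; this one is just an example.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# "cat.jpg" is a hypothetical local image file.
result = captioner("cat.jpg")
print(result)  # e.g. [{'generated_text': 'a cat sitting on a couch'}]
```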

The Open VLM Leaderboard is a leaderboard that ranks various vision language models according to a range of metrics and their average scores.

The most comprehensive benchmark for assessing vision language models is MMMU (Massive Multi-discipline Multimodal Understanding), which contains multimodal questions requiring college-level subject knowledge across disciplines such as the arts and engineering.

MMBench is another evaluation benchmark, made up of 3,000 single-choice questions covering 20 distinct skills such as OCR and object localization.

The key trick is to unify the image and text representations and feed the combined input to a text decoder for generation.
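
A minimal sketch of that idea is shown below, using placeholder modules and assuming the decoder exposes its token embedding table and accepts precomputed input embeddings (as many autoregressive language models do): patch features from a vision encoder are projected into the decoder's embedding space and prepended to the text token embeddings, giving the decoder one unified sequence to generate from.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy sketch: unify image and text representations for a text decoder."""

    def __init__(self, vision_encoder, text_decoder, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder               # e.g. a ViT returning patch features
        self.projector = nn.Linear(vision_dim, text_dim)   # maps image features into text space
        self.text_decoder = text_decoder                    # autoregressive language model

    def forward(self, pixel_values, input_ids):
        # Encode the image into patch features: (batch, num_patches, vision_dim)
        image_feats = self.vision_encoder(pixel_values)
        # Project them into the decoder's embedding space: (batch, num_patches, text_dim)
        image_embeds = self.projector(image_feats)
        # Embed the text tokens with the decoder's own embedding table
        text_embeds = self.text_decoder.embed_tokens(input_ids)
        # Unified representation: image embeddings followed by text embeddings
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        # The decoder attends over both modalities while generating text
        return self.text_decoder(inputs_embeds=inputs_embeds)
```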

By consuming images together with their accompanying text descriptions, the model learns to associate information from the two modalities.
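
Under the same assumptions as the sketch above, that association can be learned with an ordinary next-token prediction loss on the caption: the model sees the image plus the caption so far and is penalized when it fails to predict the next caption token, so reducing the loss requires drawing on the image features. The helper below is illustrative and assumes the decoder returns an object with a `.logits` field.

```python
import torch.nn.functional as F

def caption_loss(model, pixel_values, caption_ids):
    """Next-token prediction on caption tokens, conditioned on the image.

    `model` is a TinyVLM-style module as sketched above;
    `caption_ids` has shape (batch, seq_len).
    """
    outputs = model(pixel_values, caption_ids)
    logits = outputs.logits  # (batch, num_patches + seq_len, vocab_size)

    num_patches = logits.size(1) - caption_ids.size(1)
    # Keep positions that predict caption tokens; drop the image positions
    # and the final position (there is nothing left to predict after it).
    text_logits = logits[:, num_patches:-1, :]
    targets = caption_ids[:, 1:]  # each kept position predicts the next caption token

    return F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        targets.reshape(-1),
    )
```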