
Most modern multimodal AI systems with publicly documented designs, such as Qwen2.5-VL and the LLaVA family, are built on the same fundamental architecture: a vision encoder that converts pixels to vectors, a language model that processes text tokens, and a projection layer that connects them. Proprietary systems such as GPT-4o and Claude Sonnet 4.5 do not disclose their designs, but they are generally assumed to follow similar principles. Understanding how these three components interact, and the non-obvious engineering decisions at each stage, is now prerequisite knowledge for building systems that process images alongside text.
The open-source VLM ecosystem in 2025 and 2026 has converged on a set of design patterns that differ meaningfully from the architecture most explanations describe. The early CLIP-plus-GPT framing, while useful for intuition, misrepresents how production VLMs actually process visual information. The differences matter for understanding capability limits, benchmark results, and deployment tradeoffs.
CLIP: The Vision Encoder That Started It
Alec Radford, Jong Wook Kim, Chris Hallacy, and colleagues at OpenAI published CLIP (Contrastive Language-Image Pre-training) in 2021. CLIP is not a VLM. It is a dual-encoder model that learns to align image and text representations in a shared embedding space. Understanding CLIP is essential because its vision encoder became the default visual backbone for most open-source VLMs for the next three years.
CLIP trains two encoders simultaneously: a Vision Transformer (ViT) that encodes images and a text transformer that encodes captions. The training objective is contrastive: for a batch of N image-caption pairs, maximize the cosine similarity of the N correct image-text pairs while minimizing the similarity of the N^2 - N incorrect combinations. This is the InfoNCE contrastive loss applied at the batch level, computed symmetrically in the image-to-text and text-to-image directions.
CLIP was trained on 400 million image-text pairs scraped from the web, a dataset called WIT (WebImageText). The scale of this training allowed CLIP to learn general visual representations that transfer broadly. CLIP ViT-L/14 splits the input into 14×14-pixel patches and produces a 1,024-dimensional embedding for each patch. These embeddings capture semantic meaning at the patch level: not just whether an image contains a cat, but where the cat is and what context surrounds it.
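The batch-level objective can be made concrete with a small sketch. This is a minimal pure-Python illustration, not CLIP's actual training code: the function names are hypothetical, and the fixed temperature of 0.07 is an assumption (CLIP learns its temperature during training).

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE; pair i is (image_embs[i], text_embs[i])."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: row i is an N-way classification whose correct class is i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same classification over columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of two pairs: correctly matched versus deliberately swapped captions.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when every image is most similar to its own caption and large when the captions are swapped. Note that every row's softmax normalizes over the whole batch, which is the dependence on batch size that later work set out to remove.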
The critical property of CLIP for downstream VLM use is that its embeddings are semantically aligned with language: words and concepts that are linguistically related produce embeddings that are geometrically close in the shared space. This alignment is what enables a language model to make sense of image tokens that were never in its training data.
SigLIP: Replacing the Contrastive Loss
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer at Google published SigLIP (Sigmoid Loss for Language-Image Pre-training) in 2023 as an alternative to CLIP’s contrastive training objective.
CLIP’s batch-level contrastive loss requires large batch sizes to be effective, because the N^2 - N mismatched pairs in a batch are its only source of negative training signal. This creates a memory bottleneck: seeing diverse negatives requires large batches, and large batches require proportionally more GPU memory. In practice, CLIP training requires batch sizes of tens of thousands to work well.
SigLIP replaces the softmax normalization across the full batch with a sigmoid function applied independently to each image-text pair. Each pair is treated as a binary classification problem: does this image match this text? Mismatched pairs within the batch still serve as negatives, but because each pair’s loss term is computed without normalizing over the rest of the batch, the loss does not depend on a particular batch size to be effective.
The practical consequences are significant. SigLIP achieves CLIP-level performance with smaller batches, enabling training on less memory. SigLIP models also show better zero-shot classification performance on most benchmarks. PaliGemma (Google, 2024), Phi-4-multimodal, DeepSeek-VL2, and Idefics2 all use SigLIP rather than CLIP as their visual backbone, reflecting a broad shift toward SigLIP for new training runs.
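The pairwise formulation can be sketched as follows. This is a minimal illustration under stated assumptions: the inputs are already unit-normalized, the function names are hypothetical, and the temperature `t=10.0` and bias `b=-10.0` are fixed here, whereas in SigLIP they are learnable parameters.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(image_embs, text_embs, t=10.0, b=-10.0):
    """Sum of independent binary losses over all image-text pairs in the batch."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * c for a, c in zip(image_embs[i], text_embs[j]))
            label = 1.0 if i == j else -1.0  # +1 for a true pair, -1 otherwise
            total += log_sigmoid(label * (t * sim + b))
    return -total / n

aligned = siglip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = siglip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Each term in the double loop depends only on its own pair, so no softmax over the batch is needed. That independence is what loosens the link between batch size and loss quality.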
The Three Adapter Architectures
The projection layer, the component that converts vision encoder output into the format the language model expects, is where the three dominant VLM architectures diverge most significantly. Each approach makes different tradeoffs between simplicity, efficiency, and expressiveness.
Linear projection (LLaVA 1.0): The original LLaVA paper by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee (2023) used a single linear layer to map CLIP embedding dimensions to the language model’s input dimensions. CLIP ViT-L/14 outputs 1,024-dimensional vectors; Vicuna (a LLaMA-based language model) expects 4,096-dimensional inputs. The linear layer learns a fixed affine mapping between these spaces.
This approach is computationally trivial but limited. A linear layer can rescale and rotate the embedding space but cannot reshape it nonlinearly, which means that if the alignment between visual and language representations requires nonlinear transformation, the linear adapter cannot learn it.
MLP projection (LLaVA 1.5 and many current open-source VLMs): LLaVA 1.5 replaced the linear adapter with a two-layer MLP with GELU activation. This is now the most common approach in open-source VLMs, including later LLaVA releases and DeepSeek-VL. It is not universal: Llama 3.2 Vision injects image features through cross-attention layers instead, and PaliGemma uses a single linear projection. The MLP can learn nonlinear mappings that better align the visual representation space with the language model’s expected input distribution.
The MLP approach has a clean implementation, adds minimal parameters, and works well when the vision encoder already produces semantically rich representations (i.e., when the adaptation is a relatively small adjustment). Its limitation is that it provides no mechanism for compressing the token sequence length. A 336×336 image processed by ViT-L/14 at 14-pixel patches produces 576 vision tokens, all of which are passed to the language model and consume context window space.
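A two-layer MLP projector can be sketched in a few lines. This is an illustrative pure-Python version with toy dimensions standing in for 1,024 and 4,096; the helper names are hypothetical, and the hidden width equal to the language model width is an assumption about the layer layout.

```python
import math
import random

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def make_linear(d_in, d_out, rng):
    # Weight matrix of shape (d_out, d_in) plus a zero bias.
    weights = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    return weights, [0.0] * d_out

def apply_linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def mlp_project(fc1, fc2, token):
    # Linear -> GELU -> Linear, applied to one vision token at a time.
    hidden = [gelu(h) for h in apply_linear(fc1, token)]
    return apply_linear(fc2, hidden)

def mlp_param_count(d_vision, d_llm):
    return (d_vision * d_llm + d_llm) + (d_llm * d_llm + d_llm)

rng = random.Random(0)
D_VISION, D_LLM = 8, 16  # toy stand-ins for 1,024 and 4,096
fc1 = make_linear(D_VISION, D_LLM, rng)
fc2 = make_linear(D_LLM, D_LLM, rng)
vision_tokens = [[rng.gauss(0.0, 1.0) for _ in range(D_VISION)] for _ in range(4)]
projected = [mlp_project(fc1, fc2, tok) for tok in vision_tokens]
full_size_params = mlp_param_count(1024, 4096)
```

The projector changes the width of each token but not the number of tokens: four vectors go in and four come out. At full size under these assumptions it holds about 21 million parameters, small next to a multi-billion-parameter language model.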
Q-Former (BLIP-2): Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi at Salesforce Research published BLIP-2 in 2023 with a Querying Transformer (Q-Former) as the projection component. The Q-Former maintains a fixed set of learnable query vectors (32 in BLIP-2) that attend to the visual tokens from the vision encoder through cross-attention. The output is a fixed number of embeddings, one per query, regardless of input image size, providing automatic visual token compression.
The Q-Former approach is more expressive than MLP projection but also more complex and slower to train. Its main advantage is output sequence length control: the language model always receives the same number of visual tokens regardless of image resolution. This matters for high-resolution inputs where a naive ViT would produce thousands of visual tokens. BLIP-2 and InstructBLIP used Q-Former. Most subsequent models have moved to MLP projection with explicit token compression instead.
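The fixed-length output follows directly from how cross-attention works, as the sketch below illustrates. It is a heavily simplified stand-in for a Q-Former (single head, one step, no learned key/value projections, no self-attention or feed-forward blocks), with hypothetical names and toy sizes.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, visual_tokens):
    """Each query returns an attention-weighted average of the visual tokens."""
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [sum(q * v for q, v in zip(query, token)) / math.sqrt(dim)
                  for token in visual_tokens]
        weights = softmax(scores)
        outputs.append([sum(w * token[d] for w, token in zip(weights, visual_tokens))
                        for d in range(dim)])
    return outputs

rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # toy sizes chosen for illustration

def random_vectors(count):
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]

queries = random_vectors(NUM_QUERIES)  # stands in for the learned query vectors
low_res = cross_attend(queries, random_vectors(16))
high_res = cross_attend(queries, random_vectors(64))
```

Both calls return exactly `NUM_QUERIES` vectors, whether the encoder produced 16 or 64 visual tokens. The sequence length seen by the language model is set by the number of queries, not by image resolution.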
Visual Token Compression: Handling High Resolution
A standard 224×224 image processed by ViT-L/14 produces 256 visual tokens. A 336×336 image produces 576. A 1024×1024 image produces 5,329 (73 × 73 whole patches, since 1,024 is not a multiple of 14). Each visual token consumes context window space, and self-attention compute grows with the square of the sequence length. At high resolutions, the visual token burden becomes the dominant factor in inference cost.
Qwen2.5-VL (Alibaba, 2025) addresses this through grouped visual token compression. Groups of four adjacent visual tokens are concatenated and passed through a two-layer MLP, compressing each group of four tokens into a single embedding of the LLM’s dimension. This reduces the visual token count by 4x, and the MLP is trained to retain as much information from all four tokens as it can in the compressed representation.
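The token counts above come from simple patch arithmetic, sketched here. The function name is hypothetical, and the sketch assumes partial patches at the image border are dropped and that no class token is counted.

```python
def vit_token_count(height, width, patch=14):
    """Number of whole patches; partial patches at the border are dropped."""
    return (height // patch) * (width // patch)

counts = {side: vit_token_count(side, side) for side in (224, 336, 1024)}

# Self-attention compares every token with every other token, so its cost
# scales with the square of the token count.
attention_cost_ratio = (counts[1024] / counts[336]) ** 2

for side, tokens in counts.items():
    print(f"{side}x{side}: {tokens} tokens, {tokens * tokens} attention pairs")
```

Going from 336×336 to 1024×1024 multiplies the token count by roughly nine but the attention cost by more than eighty, which is why resolution dominates inference cost so quickly.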
The key insight is that adjacent visual tokens in a ViT are highly correlated. Neighboring patches tend to represent similar visual content. Compressing four adjacent tokens into one discards spatial resolution only in the sense that the compressed token represents a 2×2 patch region rather than a 1×1 patch, which is acceptable for most understanding tasks but would harm fine-grained tasks requiring per-pixel localization.
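The grouping step can be illustrated on a toy grid. This sketch shows only the concatenation of each 2×2 block; the MLP that would then map the concatenated vector to the language model's width is omitted, and the function name is hypothetical.

```python
def merge_2x2(grid):
    """grid[r][c] is a token vector; concatenate every 2x2 block into one vector."""
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid height and width must both be even")
    merged = []
    for r in range(0, rows, 2):
        merged_row = []
        for c in range(0, cols, 2):
            merged_row.append(
                grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]
            )
        merged.append(merged_row)
    return merged

# Toy 4x4 grid in which each token simply stores its own (row, col) position.
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
merged = merge_2x2(grid)
tokens_before = len(grid) * len(grid[0])
tokens_after = len(merged) * len(merged[0])
```

Sixteen tokens of width 2 become four tokens of width 8. Nothing is lost at the concatenation step itself; any information loss happens when the following MLP squeezes the wider vector back down to a single embedding.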
LLaVA-NeXT (LLaVA 1.6) took a different approach to high resolution: dynamic slicing. Large images are divided into up to six sub-images (variable number based on aspect ratio), each sub-image is processed independently through the vision encoder, and the resulting visual tokens from all sub-images are concatenated. This preserves full resolution detail at the cost of a variable and potentially large number of visual tokens passed to the language model.
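How the token count varies with aspect ratio can be sketched as below. The grid-selection rule here is a hypothetical stand-in, not LLaVA-NeXT's actual procedure: it picks the tile grid of at most six tiles whose shape best matches the image's aspect ratio, preferring more tiles on ties, and it assumes 576 tokens per 336×336 tile.

```python
import math

TOKENS_PER_TILE = 576  # one 336x336 tile through ViT-L/14

def choose_grid(width, height, max_tiles=6):
    """Return (rows, cols) whose aspect ratio is closest to the image's."""
    target = width / height
    best_key, best_grid = None, None
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            error = abs(math.log((cols / rows) / target))
            key = (error, -(rows * cols))  # tie-break toward more tiles
            if best_key is None or key < best_key:
                best_key, best_grid = key, (rows, cols)
    return best_grid

def sliced_token_count(width, height):
    rows, cols = choose_grid(width, height)
    return rows * cols * TOKENS_PER_TILE

wide = sliced_token_count(1344, 672)      # 2:1 landscape
square = sliced_token_count(1000, 1000)   # 1:1
tall = sliced_token_count(672, 2016)      # 1:3 portrait
```

Under this rule the three images cost 1,152, 2,304, and 1,728 visual tokens respectively, so the language model's input length depends on the shape of each image rather than being fixed in advance.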
The Two-Stage Training Pipeline
Almost all open-source VLMs follow a two-stage training recipe that reflects the different types of alignment they need to learn.
Stage 1 (visual-language alignment): The vision encoder and language model are both frozen. Only the projection adapter is trained, on large-scale image-caption pairs. The goal is to teach the adapter to map visual representations into a form the language model can process coherently, without changing either the visual or language representation spaces. Training on 500,000 to 5 million image-caption pairs is typical. This stage is fast because it trains very few parameters.
Stage 2 (instruction tuning): The projection adapter and some or all of the language model are unfrozen. The vision encoder typically remains frozen. Many teams use LoRA fine-tuning for this stage rather than full parameter updates, on the reasoning that a capable pre-trained language model needs only relatively low-rank adjustments; full fine-tuning of the language model, as in the original LLaVA recipe, is also common. The model is trained on instruction-following datasets containing complex multimodal tasks: visual question answering, detailed image description, document understanding, chart reading, and spatial reasoning.
The LLaVA training recipe, with its 595K image-caption pretraining set and 158K instruction-following dataset, has become the reference point for comparing Stage 1 and Stage 2 data requirements. Subsequent models have scaled both stages substantially. Exact figures for Qwen2.5-VL’s training data are not disclosed, but it is substantially larger than LLaVA’s.
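The difference between the two stages is mostly a question of which parameters receive gradients. The sketch below uses round, illustrative parameter counts (assumptions, not taken from any model card) and hypothetical names to compare the trainable budget per stage, and to show why a LoRA update is so much smaller than a full weight matrix.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StagePlan:
    name: str
    trainable: frozenset  # component names that receive gradient updates

# Illustrative round numbers only.
COMPONENT_PARAMS = {
    "vision_encoder": 300_000_000,
    "projector": 21_000_000,
    "language_model": 7_000_000_000,
}

STAGE_1 = StagePlan("alignment", frozenset({"projector"}))
STAGE_2 = StagePlan("instruction_tuning", frozenset({"projector", "language_model"}))

def trainable_params(plan):
    return sum(COMPONENT_PARAMS[name] for name in plan.trainable)

def lora_params(d_out, d_in, rank):
    """Parameters in a rank-r update B @ A for one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)

stage1 = trainable_params(STAGE_1)
stage2_full = trainable_params(STAGE_2)
full_matrix = 4096 * 4096
lora_matrix = lora_params(4096, 4096, rank=16)
```

With these numbers Stage 1 trains well under one percent of the model. In Stage 2, a rank-16 LoRA update to a single 4,096×4,096 matrix needs 128 times fewer trainable parameters than updating the matrix directly.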
Where Visual Information Lives in the Language Model
A 2024 interpretability paper from UC Berkeley (“Towards Interpreting Visual Information Processing in Vision-Language Models”) studied LLaVA 1.5 7B to understand where in the transformer’s computation visual information is processed and retained.
The paper found that visual information is highly localized to the token positions corresponding to patch locations in the original image. The model’s representation of “where the cat is” lives in the specific tokens that represent the image patches showing the cat, not distributed across all visual tokens. This locality property means that visual tokens interact with text tokens primarily through cross-token attention in early layers, and that the spatial structure of the visual input is approximately preserved in the language model’s processing.
The logit lens analysis showed that visual token representations progressively converge toward the vocabulary embedding space across layers. No training objective explicitly supervises this correspondence; it emerges as a consequence of joint training on image-text pairs.
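The logit lens itself is a simple procedure: take the hidden state at some intermediate layer, multiply it by the model's output (unembedding) matrix, and read off the most likely token. The sketch below uses a made-up three-word vocabulary and hand-written hidden states purely for illustration; a real analysis would use the model's own weights and would normally apply the final layer norm first.

```python
def logit_lens(hidden_state, unembedding, vocab):
    """Project a hidden state onto the vocabulary and return the top token."""
    logits = [sum(h * w for h, w in zip(hidden_state, row)) for row in unembedding]
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]

# Hypothetical toy setup: identity unembedding over a three-word vocabulary.
vocab = ["cat", "dog", "sky"]
unembedding = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Invented hidden states of one visual token at three successive layers.
states_by_layer = [
    [0.2, 0.3, 0.1],
    [0.5, 0.3, 0.1],
    [0.9, 0.1, 0.0],
]
decoded = [logit_lens(state, unembedding, vocab) for state in states_by_layer]
```

In this toy run the decoded token moves from "dog" at the first layer to "cat" at the later ones, mimicking the kind of layer-by-layer convergence toward a vocabulary item that the analysis looks for.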
Benchmark Performance: What the Numbers Mean
The dominant VLM benchmarks as of 2025-2026 include MMBench, MMMU (Massive Multidisciplinary Multimodal Understanding), DocVQA (document visual question answering), TextVQA (text recognition in images), and MMStar. Each tests a different capability cluster.
MMBench and MMStar test general multimodal reasoning across categories including cognition, perception, and knowledge. MMMU is the most demanding: it requires college-level knowledge across 30 subjects in six disciplines combined with visual understanding, and no current open-source model scores above approximately 65% on its full validation set, a substantial gap from human expert performance near 88%.
DocVQA and TextVQA test OCR and text understanding in images. This is where recent models have improved most dramatically. Qwen2.5-VL reports scores above 95% on DocVQA, approaching human-level reading of clean document text, while an earlier model like LLaVA 1.5 scored only around 58% on TextVQA.
The benchmark numbers that appear in model release posts are not standardized comparisons. Labs evaluate at different resolutions, with different prompts, and on different versions of the evaluation sets. Comparing scores across labs without controlling for these factors produces misleading conclusions about relative capability.
The Context Window Bottleneck
The practical constraint that shapes VLM deployment more than architecture is the context window. A 336×336 image produces 576 visual tokens using standard LLaVA processing. A language model with a 4,096-token context window can fit at most seven such images (7 × 576 = 4,032 tokens), and realistically six or fewer once the prompt and the answer are given room. This is why proprietary models with 1-million-token context windows, like Gemini 1.5 Pro, have a qualitative capability advantage on tasks requiring analysis of long documents, video, or many images simultaneously.
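The budget arithmetic is easy to check. This sketch assumes 576 tokens per image and a hypothetical reserve of 512 tokens for the prompt and the generated answer; the function name is illustrative.

```python
def images_that_fit(context_tokens, tokens_per_image, reserved_for_text=0):
    """How many whole images fit after setting aside tokens for text."""
    return max(0, (context_tokens - reserved_for_text) // tokens_per_image)

no_text = images_that_fit(4096, 576)
with_prompt = images_that_fit(4096, 576, reserved_for_text=512)
with_4x_compression = images_that_fit(4096, 576 // 4, reserved_for_text=512)
```

Seven images fit only if no text is present at all; reserving 512 tokens for text brings the count down to six. With 4x token compression the same window holds 24 images, which shows how directly compression translates into multi-image capacity.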
The open-source response to this bottleneck has been visual token compression (Qwen2.5-VL’s 4x compression), dynamic resolution (LLaVA-NeXT’s adaptive slicing), and extended context training. But the fundamental math is unforgiving: high-resolution multi-image processing at scale requires either very long context windows or aggressive token compression, and both carry costs. Speculative decoding reduces per-token generation latency once tokens are being produced, but it cannot compress the volume of visual tokens a VLM must first process through attention.
For the medical imaging domain, this context bottleneck is especially significant. Radiology AI systems like the Merlin CT foundation model process volumetric scans with thousands of cross-sections, each a high-resolution image. Clinical VLMs that need to reason across a full CT scan must either compress aggressively or operate on pre-selected slices rather than full volumes, a meaningful architectural constraint on what clinical AI can do with current VLM designs.
Proprietary vs Open-Source: The Current Gap
GPT-4o, Claude Sonnet 4.5, and Gemini 1.5 Pro all use custom vision architectures that are not publicly disclosed. The performance gap between these models and the best open-source VLMs (InternVL3, Qwen2.5-VL, LLaVA-OneVision) on MMMU is roughly 10-15 percentage points, shrinking from roughly 30 points two years ago.
The gap is smaller on document and text understanding tasks (DocVQA, TextVQA), where open-source models now perform within a few percent of the best proprietary models. The remaining gap on MMMU reflects the reasoning quality of the language model backbone rather than visual processing capability. Stronger language models produce better VLMs even at equivalent visual architectures, which means the frontier model capability gap flows directly from language model quality gaps. The compute-optimal scaling decisions labs make for their base language models set the quality ceiling for any VLM built on top of them.
The practical deployment recommendation as of 2026: InternVL3 or Qwen2.5-VL for open-source deployments requiring strong visual reasoning, with preference for the largest variants (72B for Qwen2.5-VL, 78B for InternVL3) when GPU memory allows. For document-heavy applications (form reading, chart analysis, PDF processing), both open-source and proprietary models now perform at commercially usable levels. For complex multimodal reasoning requiring graduate-level knowledge synthesis, the proprietary frontier models still hold a meaningful edge.
What Current VLMs Cannot Do Well
Spatial reasoning remains a consistent failure mode across all VLM architectures. Tasks that require precise localization or ordinal spatial judgment show accuracy rates well below what would be expected if models were reading spatial information reliably. The visual token locality finding from the UC Berkeley interpretability work may explain part of this: spatial relationships between objects require integrating information across multiple token positions, and current attention patterns may not do this efficiently.
Counting is a known failure mode. VLMs tend to undercount objects, with accuracy degrading as the object count rises above roughly five. A likely contributor is the ViT patch representation: patches that contain multiple small objects do not encode count information explicitly, and the language model cannot reconstruct exact counts from patch embeddings that were not trained to encode them.
Fine-grained text recognition is a third weak spot. (Text rendering in generated images is a separate problem: VLMs read images, they do not generate them.) Recognition of text in low-resolution or stylized inputs shows high error rates even in the best current models. Benchmark scores above 95% on text-reading tasks are achieved on relatively clean text at readable resolution. Handwriting, highly stylized fonts, and partially occluded text reduce accuracy substantially.
For practitioners choosing between VLMs for a specific use case, the most informative evaluation is to run the candidate models on a sample of production inputs rather than relying on published benchmark scores. Benchmark composition rarely matches production distribution, and the specific capabilities that separate models in benchmarks are often not the ones that determine performance on a given real-world task.