My Written Word

Tag: Multimodal AI

Vision-Language Models: Architecture and the Benchmark Gap

May 18, 2026

How CLIP, SigLIP, Q-Former, and MLP adapters work in vision-language models, why Qwen2.5-VL compresses visual tokens by 4x, and what current VLMs still cannot do.