GitHub Repository

Evaluating text-to-image/video/3D models with VQAScore

585 stars · Python

VQAScore – open eval metric/reward model, now for text-to-video

by linzhiqiu · Jun 9, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem

Replaced CLIPScore across the field; 2M+ downloads on Hugging Face.

Strengths
  • VQA-based scoring approach is genuinely clever—asks VLMs yes/no questions instead of embedding similarity.
  • Adoption by groups at DeepMind, NVIDIA, and ByteDance suggests it solves a real pain point.
  • Supports 20+ VLMs including GPT, Gemini, Qwen—keeps improving as models get better.
Weaknesses
  • The wow moment was 2 years ago with images; the video extension is incremental, not groundbreaking.
  • Evaluation metrics are inherently niche: they only matter if you're training or benchmarking generators.
Category
Target Audience

ML researchers and generative AI developers

Similar To

CLIPScore · PickScore · ImageReward

Post Description

Two years ago we released VQAScore: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It became a go-to evaluation metric and reward model for image generation, replacing CLIPScore across the field (2M+ downloads on Hugging Face; used by groups at DeepMind, NVIDIA, ByteDance).
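
To make the scoring idea concrete, here is a minimal sketch of "ask a yes/no question, use P(Yes) as the score". It is not the project's implementation: `build_question`, `p_yes`, `vqascore`, and `fake_vlm` are hypothetical names, the question wording is copied from this post rather than from the library, and the stand-in model returns hand-picked logits over three tokens instead of a real VLM's full vocabulary.

```python
import math

# Question wording taken from the post; the template the library actually
# uses may differ (assumption).
QUESTION_TEMPLATE = 'Does this image show "{prompt}"?'


def build_question(prompt: str) -> str:
    """Turn a generation prompt into a yes/no question for the VLM."""
    return QUESTION_TEMPLATE.format(prompt=prompt)


def p_yes(answer_logits: dict[str, float]) -> float:
    """Softmax over the first answer token's logits; return the mass on "Yes"."""
    peak = max(answer_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(logit - peak) for tok, logit in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())


def vqascore(image, prompt: str, vlm_logits_fn) -> float:
    """Score = P("Yes" | image, question). `vlm_logits_fn` is a hypothetical
    callable that returns answer-token logits for an (image, question) pair."""
    return p_yes(vlm_logits_fn(image, build_question(prompt)))


def fake_vlm(image, question):
    # Hypothetical stand-in for a real VLM forward pass: fixed logits.
    return {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}


score = vqascore("cat.png", "a cat on a sofa", fake_vlm)
print(f"VQAScore-style score: {score:.3f}")
```

Because the score is a probability rather than an embedding similarity, it stays in [0, 1] and can be compared across prompts for the same model. Swapping in a stronger VLM only means replacing the callable passed as `vlm_logits_fn`.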

We just added text-to-video evaluation with 20+ VLMs (GPT, Gemini, Qwen). It is free and open-source, and it keeps getting better as the underlying VLMs improve.

Paper: https://arxiv.org/abs/2404.01291

Happy to answer questions and would love feedback.

Similar Projects

AI/ML · Mid

Pipevals – a visual pipeline builder for evaluation-driven AI

Early learning project in a crowded eval space dominated by LangSmith and Arize.

Ship It · Bold Bet
tilt
622mo ago