VQAScore – open eval metric/reward model, now for text-to-video

by linzhiqiu · Jun 9, 2026 · 1 point · 0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem

Widely adopted as a replacement for CLIPScore, with 2M+ downloads on Hugging Face.

Strengths

• The VQA-based scoring approach is genuinely clever: it asks a VLM a yes/no question about the image and uses P(Yes) as the score, rather than comparing embeddings.
• Adoption by groups at DeepMind, NVIDIA, and ByteDance suggests it solves a real pain point.
• Supports 20+ VLMs, including GPT, Gemini, and Qwen, so scores keep improving as the underlying models get better.

Weaknesses

• The breakthrough came two years ago with images; the video extension is incremental rather than groundbreaking.
• Evaluation metrics are inherently niche: this only matters if you are training or benchmarking generators.

Post Description

Two years ago we released VQAScore: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It became a go-to evaluation metric and reward model for image generation, replacing CLIPScore across the field (2M+ downloads on Hugging Face; used by groups at DeepMind, NVIDIA, ByteDance).
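
To make the scoring rule concrete, here is a minimal sketch of the P(Yes) computation described above. It is not the project's implementation: the function names, the exact question wording, and the toy logits are assumptions for illustration, and a real run would take the answer-position logits from a VLM that has seen the image and the question.

```python
import math
from typing import Mapping


def build_question(prompt: str) -> str:
    """Fill the question template quoted in the post (exact wording is an assumption)."""
    return f"Does this image show {prompt}?"


def p_yes(answer_logits: Mapping[str, float], yes_token: str = "Yes") -> float:
    """Softmax over the answer-position logits; return the probability of `yes_token`."""
    peak = max(answer_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in answer_logits.items()}
    return exps[yes_token] / sum(exps.values())


# Hypothetical logits standing in for a VLM's output at the first answer token.
toy_logits = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}

question = build_question("a cat sitting on a red sofa")
score = p_yes(toy_logits)  # a value in (0, 1); higher means better prompt alignment
```

Because the score is a probability, it can be compared across candidate images for the same prompt, which is what makes it usable both as an evaluation metric and as a reward signal.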

We just added text-to-video evaluation with 20+ VLMs (GPT, Gemini, Qwen). It is free and open-source, and it keeps getting better as the underlying VLMs improve.

Paper: https://arxiv.org/abs/2404.01291

Happy to answer questions and would love feedback.