ECCV 2026

Spotlight: Identifying and Localizing Video Generation Errors Using VLMs

A benchmark for evaluating Vision Language Models on precisely localizing and explaining errors in AI-generated videos: 1,600+ fine-grained, temporally localized error annotations across 6 categories and 3 difficulty levels.

Aditya Chinchure¹·Sahithya Ravi¹·Pushkar Shukla²·Vered Shwartz¹·Leonid Sigal¹

¹University of British Columbia ²Toyota Technological Institute at Chicago

arXiv Dataset Code

Overview

Can VLMs find the errors in AI-generated video?

As Text-to-Video models push toward higher realism, their artifacts become nuanced, fine-grained, and spatio-temporally localized. VLMs are increasingly used as automatic evaluators — but can they actually detect, localize, and explain these errors? Spotlight is a benchmark of 600 videos from state-of-the-art T2V models (Veo 3, Seedance, LTX-2), annotated with 1,600+ fine-grained error localizations across physics, semantics, and anatomy.

Our experiments reveal that current VLMs lag well behind humans — the best baselines trail human performance by nearly 2×, pointing to the need for more robust perception and hallucination mitigation.

600 AI-generated videos: Veo 3 · Seedance · LTX-2
1,600+ error localizations: fine-grained & temporal
6 error categories: across 3 dimensions
Nearly 2× human advantage over the best VLM

Spotlight teaser: a kangaroo video with localized error annotations, and a comparison with other video benchmarks. Spotlight annotates local errors with time segments and explanations, capabilities other video benchmarks lack.

Examples

See the errors for yourself

Real AI-generated clips, each carrying timestamped, categorized error annotations.

Taxonomy

Six error types, three dimensions

Every annotation in Spotlight is tagged with one of six fine-grained error types, spanning the physics, semantics, and anatomy of a generated video.

Physics

Physical Violations

Violations of physical laws, e.g., floating objects or unnatural trajectories.

An object hovers with no support.

Appearance / Disappearance

Objects, people, or backgrounds appear or disappear unnaturally between frames.

“The white sheet of paper appears out of thin air.”

Motion Artifacts

Unnatural motion such as objects passing through each other or jittery movement.

A hand clips through a solid table.

Semantics

Prompt Adherence

Failure to follow key elements or intent of the text prompt.

“Prompt says she is reading, but she scrolls too fast to be reading.”

Logical Errors

Actions that cannot logically co-exist, or illogical story progression.

A glass shatters, then is whole again.

Anatomical

Body Pose & Anatomy

Impossible body shapes or joint movements; unrealistic morphing.

A limb bends against its joint.
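The taxonomy above maps naturally onto a small data structure. The following is a minimal sketch, not the dataset's actual schema: the class names `Dimension`, `ErrorType`, and `ErrorAnnotation`, the field names, and the timestamps in the example are all assumptions made for illustration. Only the six type names, the three dimensions, and the quoted explanation come from the page.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    PHYSICS = "Physics"
    SEMANTICS = "Semantics"
    ANATOMICAL = "Anatomical"


class ErrorType(Enum):
    # Each member carries (display label, parent dimension).
    PHYSICAL_VIOLATION = ("Physical Violations", Dimension.PHYSICS)
    APPEARANCE_DISAPPEARANCE = ("Appearance / Disappearance", Dimension.PHYSICS)
    MOTION_ARTIFACT = ("Motion Artifacts", Dimension.PHYSICS)
    PROMPT_ADHERENCE = ("Prompt Adherence", Dimension.SEMANTICS)
    LOGICAL_ERROR = ("Logical Errors", Dimension.SEMANTICS)
    BODY_POSE_ANATOMY = ("Body Pose & Anatomy", Dimension.ANATOMICAL)

    def __init__(self, label, dimension):
        self.label = label
        self.dimension = dimension


@dataclass(frozen=True)
class ErrorAnnotation:
    """One temporally localized error (hypothetical field names)."""
    error_type: ErrorType
    start_s: float      # segment start, in seconds
    end_s: float        # segment end, in seconds
    explanation: str    # free-text description of the error

    def __post_init__(self):
        if not 0 <= self.start_s <= self.end_s:
            raise ValueError("segment must satisfy 0 <= start_s <= end_s")


# Illustrative annotation; the 2.0-3.5 s segment is made up.
example = ErrorAnnotation(
    ErrorType.APPEARANCE_DISAPPEARANCE,
    2.0,
    3.5,
    "The white sheet of paper appears out of thin air.",
)
print(example.error_type.label, "->", example.error_type.dimension.value)
```

Attaching the dimension to each error type means the three-way grouping never has to be stored separately on an annotation.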

Method

Scoring localized error predictions

We compare a model's predicted errors against ground-truth annotations with a pairwise-matching metric, then evaluate several inference-time baselines.

1. Pairwise scores

Score every predicted error against every ground-truth error to fill matrix F.

2. Threshold τ = 0.7

Keep only confident matches to build the binary match matrix.

3. Bipartite matching

Find the best one-to-one match pairs, then compute M and M%.

Fig. 3: T2V data analysis (a) and a worked example of pairwise matching (b). Left: distribution of T2V models across difficulty and average segment length. Right: a worked example of pairwise matching producing M=0.85 and M%=2/4.
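The three scoring steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the page does not define M and M% exactly, so the sketch assumes M is the mean score of the matched pairs and M% is the fraction of ground-truth errors that were matched. It also assumes rows of F are ground-truth errors and columns are predictions, and that "best" means the most pairs first, then the highest total score. The function name `match_errors` and the matrix values are made up; the values were chosen only so that the result has the same shape as the numbers quoted for Fig. 3 (M of about 0.85, M% = 2/4).

```python
from functools import lru_cache

TAU = 0.7  # threshold from step 2


def match_errors(F, tau=TAU):
    """Return (pairs, M, M_pct) for a pairwise score matrix F.

    F[i][j] is the score of ground-truth error i against predicted error j
    (orientation is an assumption). The exhaustive search below is exact
    but only suitable for the small matrices of a single video.
    """
    n_gt = len(F)
    n_pred = len(F[0]) if F else 0

    # Step 2: binary match matrix, keeping only confident pairs.
    B = [[F[i][j] >= tau for j in range(n_pred)] for i in range(n_gt)]

    # Step 3: best one-to-one matching over the allowed pairs.
    @lru_cache(maxsize=None)
    def best(i, used):
        # Best (count, total score, pairs) for ground-truth errors i..end,
        # where `used` is a bitmask of predictions already taken.
        if i == n_gt:
            return (0, 0.0, ())
        result = best(i + 1, used)  # option: leave error i unmatched
        for j in range(n_pred):
            if B[i][j] and not used & (1 << j):
                cnt, tot, pairs = best(i + 1, used | (1 << j))
                cand = (cnt + 1, tot + F[i][j], ((i, j),) + pairs)
                if cand[:2] > result[:2]:
                    result = cand
        return result

    count, total, pairs = best(0, 0)
    M = total / count if count else 0.0       # assumed: mean matched score
    M_pct = count / n_gt if n_gt else 0.0     # assumed: share of GT matched
    return list(pairs), M, M_pct


# Made-up scores: 4 ground-truth errors (rows), 3 predictions (columns).
F = [
    [0.90, 0.30, 0.10],
    [0.75, 0.80, 0.20],
    [0.20, 0.10, 0.50],
    [0.10, 0.40, 0.60],
]
pairs, M, M_pct = match_errors(F)
print(pairs, round(M, 2), M_pct)
```

For larger matrices, an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual replacement for the exhaustive search.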

Inference-time baselines

Window-Based

Find errors in short windows, then join, filter, and merge with pairwise comparison.

Sequential

First reason about errors over the full video, then localize each one in a second pass.

Multi-Agent

Specialized agents identify errors by type (physics, anatomy, adherence…), then join all errors.
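As a rough illustration of the Window-Based baseline, the sketch below slides a window over the video, joins the per-window findings on a global timeline, and merges overlapping findings that a pairwise comparison judges to be the same error. Everything here is hypothetical: `detect` stands in for a VLM call on one window, `same_error` stands in for the pairwise comparison, and the window and stride lengths are arbitrary defaults. The filter step is left out because the page does not state its criterion.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, description)


def make_windows(duration_s: float, window_s: float,
                 stride_s: float) -> List[Tuple[float, float]]:
    """Cover [0, duration_s] with overlapping windows."""
    if window_s <= 0 or stride_s <= 0:
        raise ValueError("window_s and stride_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + window_s, duration_s)))
        if start + window_s >= duration_s:
            break
        start += stride_s
    return windows


def window_based(duration_s: float,
                 detect: Callable[[float, float], List[Segment]],
                 same_error: Callable[[str, str], bool],
                 window_s: float = 4.0,
                 stride_s: float = 2.0) -> List[Segment]:
    # Join: shift window-relative segments onto the global timeline.
    joined: List[Segment] = []
    for w_start, w_end in make_windows(duration_s, window_s, stride_s):
        for s, e, desc in detect(w_start, w_end):
            joined.append((w_start + s, w_start + e, desc))

    # Merge: fuse segments that overlap in time and describe the same error.
    merged: List[Segment] = []
    for seg in sorted(joined):
        for k, kept in enumerate(merged):
            overlaps = seg[0] <= kept[1] and kept[0] <= seg[1]
            if overlaps and same_error(kept[2], seg[2]):
                merged[k] = (min(kept[0], seg[0]), max(kept[1], seg[1]), kept[2])
                break
        else:
            merged.append(seg)
    return merged


def fake_detect(w_start: float, w_end: float) -> List[Segment]:
    # Stub detector: pretends one error spans 3.0-5.0 s of the full video
    # and reports the part visible in this window, in window-relative time.
    lo, hi = max(3.0, w_start), min(5.0, w_end)
    if lo < hi:
        return [(lo - w_start, hi - w_start, "hand clips through table")]
    return []


result = window_based(8.0, fake_detect, lambda a, b: a == b)
print(result)
```

With the stub detector, the same error is reported by all three windows and collapses into a single segment after merging.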

Dataset composition

Videos by difficulty, stacked by T2V model.

Key result

Humans still beat the best VLMs by nearly 2×

Across both metrics, the strongest baselines fall far short of human annotators at localizing and explaining errors.

Chart panels: Score (S+P), M(E, Ê) and %GT (S+P), M%(E, Ê).

Error analysis

Where the models go wrong

We categorized the failures of Gemini 2.5 Pro. Faulty perception and hallucination dominate — the model often misses what is visibly there, or invents errors that aren't.

Failure modes (Gemini 2.5 Pro)

Faulty perception

Can't compare frames and misses what's visibly there, which leads to hallucinated errors.

Wrong world knowledge

Bad intuition for what is physically plausible.

Localization failure

Open-source models (Qwen3-VL) flag the whole clip as a single error.

Fig. 5: Qualitative examples of model successes and failures against ground-truth error annotations.

Citation

Cite Spotlight

@inproceedings{chinchure2026spotlight,
  title     = {Spotlight: Identifying and Localizing Video Generation Errors Using VLMs},
  author    = {Chinchure, Aditya and Ravi, Sahithya and Shukla, Pushkar and
               Shwartz, Vered and Sigal, Leonid},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
  eprint    = {2511.18102},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
