ECCV 2026

Spotlight: Identifying and Localizing Video Generation Errors Using VLMs

A benchmark for evaluating Vision Language Models on precisely localizing and explaining errors in AI-generated videos: 1,600+ fine-grained, temporally localized error annotations across 6 categories and 3 difficulty levels.

Aditya Chinchure¹ · Sahithya Ravi¹ · Pushkar Shukla² · Vered Shwartz¹ · Leonid Sigal¹

¹University of British Columbia · ²Toyota Technological Institute at Chicago

Overview

Can VLMs find the errors in AI-generated video?

As Text-to-Video models push toward higher realism, their artifacts become nuanced, fine-grained, and spatio-temporally localized. VLMs are increasingly used as automatic evaluators — but can they actually detect, localize, and explain these errors? Spotlight is a benchmark of 600 videos from state-of-the-art T2V models (Veo 3, Seedance, LTX-2), annotated with 1,600+ fine-grained error localizations across physics, semantics, and anatomy.

Our experiments reveal that current VLMs lag well behind humans: human annotators score nearly 2× higher than the best baselines, pointing to the need for more robust perception and hallucination mitigation.

  • 600 AI-generated videos: Veo 3 · Seedance · LTX-2
  • 1,600+ error localizations: fine-grained & temporal
  • 6 error categories across 3 dimensions
  • Nearly 2× human advantage over the best VLM
Spotlight teaser: a kangaroo video with localized error annotations, and a comparison showing Spotlight uniquely provides local errors and time segments.
Spotlight annotates local errors with time segments and explanations — capabilities other video benchmarks lack.

Examples

See the errors for yourself

Real AI-generated clips, each carrying timestamped, categorized error annotations.

Taxonomy

Six error types, three dimensions

Every annotation in Spotlight is tagged with one of six fine-grained error types, spanning the physics, semantics, and anatomy of a generated video.

Physics

1. Physical Violations

Violations of physical laws, e.g., floating objects or unnatural trajectories.

An object hovers with no support.

2. Appearance / Disappearance

Objects, people, or backgrounds appear or disappear unnaturally between frames.

“The white sheet of paper appears out of thin air.”

3. Motion Artifacts

Unnatural motion such as objects passing through each other or jittery movement.

A hand clips through a solid table.

Semantics

4. Prompt Adherence

Failure to follow key elements or intent of the text prompt.

“Prompt says she is reading, but she scrolls too fast to be reading.”

5. Logical Errors

Actions that cannot logically co-exist, or illogical story progression.

A glass shatters, then is whole again.

Anatomy

6. Body Pose & Anatomy

Impossible body shapes or joint movements; unrealistic morphing.

A limb bends against its joint.
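
The taxonomy above maps naturally onto a small record type. The following Python sketch is illustrative only and is not the released annotation format: the class names, field names, and the timestamps in the example are assumptions, while the six type names and three dimensions are taken from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    """The six fine-grained error types listed in the taxonomy."""
    PHYSICAL_VIOLATION = "Physical Violations"
    APPEARANCE_DISAPPEARANCE = "Appearance / Disappearance"
    MOTION_ARTIFACT = "Motion Artifacts"
    PROMPT_ADHERENCE = "Prompt Adherence"
    LOGICAL_ERROR = "Logical Errors"
    BODY_POSE_ANATOMY = "Body Pose & Anatomy"


# Each error type belongs to exactly one of the three dimensions.
DIMENSION = {
    ErrorType.PHYSICAL_VIOLATION: "Physics",
    ErrorType.APPEARANCE_DISAPPEARANCE: "Physics",
    ErrorType.MOTION_ARTIFACT: "Physics",
    ErrorType.PROMPT_ADHERENCE: "Semantics",
    ErrorType.LOGICAL_ERROR: "Semantics",
    ErrorType.BODY_POSE_ANATOMY: "Anatomy",
}


@dataclass(frozen=True)
class ErrorAnnotation:
    """One localized error: a time segment, a type, and an explanation.

    Field names are hypothetical; times are in seconds from the clip start.
    """
    start_s: float
    end_s: float
    error_type: ErrorType
    explanation: str

    def __post_init__(self):
        # Reject segments that start before 0 or end before they start.
        if not 0 <= self.start_s <= self.end_s:
            raise ValueError("expected 0 <= start_s <= end_s")

    @property
    def dimension(self) -> str:
        return DIMENSION[self.error_type]


# Example using the explanation quoted above; the timestamps are made up.
example = ErrorAnnotation(
    start_s=2.0,
    end_s=3.5,
    error_type=ErrorType.APPEARANCE_DISAPPEARANCE,
    explanation="The white sheet of paper appears out of thin air.",
)
print(example.dimension)
```

Keeping the dimension as a lookup rather than a stored field means an annotation cannot carry a type and dimension that disagree.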

Method

Scoring localized error predictions

We compare a model's predicted errors against ground-truth annotations with a pairwise-matching metric, then evaluate several inference-time baselines.

  1. Pairwise scores

    Score every predicted error against every ground-truth error to fill matrix F.

  2. Threshold τ = 0.7

    Keep only confident matches to build the binary match matrix.

  3. Bipartite matching

    Find the best one-to-one match pairs, then compute M and M%.
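
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The scoring function that fills F is not described on this page, so the sketch takes F as given. It further assumes that a pair counts as confident when its score is at least τ, that "best" means maximum total score, that M is the mean score over matched pairs, and that M% is the fraction of ground-truth errors that get matched. The function name `match_errors` and the example matrix are hypothetical.

```python
from functools import lru_cache


def match_errors(F, tau=0.7):
    """Match predictions (rows of F) to ground-truth errors (columns of F).

    F[i][j] is the score of predicted error i against ground-truth error j.
    Returns (pairs, m, m_pct) under the assumptions stated in the lead-in.
    """
    n_pred = len(F)
    n_gt = len(F[0]) if F else 0

    @lru_cache(maxsize=None)
    def best(i, used):
        # Best (total score, pairs) for predictions i..end, where `used`
        # is a bitmask of ground-truth columns that are already taken.
        if i == n_pred:
            return 0.0, ()
        result = best(i + 1, used)  # option: leave prediction i unmatched
        for j in range(n_gt):
            # Only pairs that pass the threshold are eligible (step 2).
            if F[i][j] >= tau and not (used >> j) & 1:
                total, pairs = best(i + 1, used | (1 << j))
                candidate = (total + F[i][j], ((i, j),) + pairs)
                if candidate[0] > result[0]:
                    result = candidate
        return result

    total, pairs = best(0, 0)
    m = total / len(pairs) if pairs else 0.0      # assumed definition of M
    m_pct = len(pairs) / n_gt if n_gt else 0.0    # assumed definition of M%
    return list(pairs), m, m_pct


# Hypothetical scores: 3 predicted errors against 4 ground-truth errors.
F = [
    [0.90, 0.20, 0.10, 0.00],
    [0.75, 0.80, 0.30, 0.10],
    [0.10, 0.40, 0.50, 0.20],
]
pairs, m, m_pct = match_errors(F)
print(pairs, round(m, 2), m_pct)
```

With this made-up matrix, two of the four ground-truth errors are matched with a mean score of about 0.85, which has the same shape as the worked example in Fig. 3(b); the figure's actual scores are not reproduced here. The exhaustive search is only practical for a handful of errors per video, and a Hungarian-algorithm solver would be the usual replacement at larger sizes.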

Left: distribution of T2V models across difficulty and average segment length. Right: a worked example of pairwise matching producing M=0.85 and M%=2/4.
Fig. 3: T2V data analysis (a) and a worked example of pairwise matching (b).

Inference-time baselines

Window-Based

Find errors in short windows, then join, filter, and merge with pairwise comparison.

Sequential

First reason about errors over the full video, then localize each one in a second pass.

Multi-Agent

Specialized agents identify errors by type (physics, anatomy, adherence…), then join all errors.

Diagrams of three inference-time baselines: window-based, sequential, and multi-agent.
Fig. 4: Inference-time baseline strategies.
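
To make the window-based baseline more concrete, here is a rough Python sketch of its two mechanical parts: cutting a video into short windows, and merging per-window predictions that describe the same error. It is an assumption-laden illustration, not the method as implemented. The window length, stride, and threshold are made-up values, the VLM call that produces per-window predictions is omitted, the filter step is left out, and the pairwise comparison is replaced here by a simple temporal-overlap (IoU) test on predictions of the same type. All function names are hypothetical.

```python
def make_windows(duration, window=4.0, stride=2.0):
    """Return overlapping (start, end) windows covering [0, duration] seconds."""
    windows = []
    start = 0.0
    while start < duration:
        windows.append((start, min(start + window, duration)))
        if start + window >= duration:
            break  # the last window already reaches the end of the clip
        start += stride
    return windows


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def merge_predictions(preds, iou_threshold=0.3):
    """Merge same-type predictions whose time segments overlap enough.

    preds is a list of (start, end, error_type) tuples gathered from all
    windows. Stand-in for the pairwise-comparison merge described above.
    """
    merged = []
    for start, end, kind in sorted(preds):
        for k, (m_start, m_end, m_kind) in enumerate(merged):
            same = m_kind == kind
            if same and temporal_iou((m_start, m_end), (start, end)) >= iou_threshold:
                # Grow the existing segment to cover both predictions.
                merged[k] = (min(m_start, start), max(m_end, end), m_kind)
                break
        else:
            merged.append((start, end, kind))
    return merged


windows = make_windows(10.0)
preds = [
    (1.0, 4.0, "physics"),
    (2.0, 5.0, "physics"),   # overlaps the first prediction, same type
    (2.0, 5.0, "anatomy"),   # same span, different type: kept separate
]
print(windows)
print(merge_predictions(preds))
```

Overlapping windows make it likely that one error is reported twice, which is why some merge step is needed before scoring.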

Dataset composition

Videos by difficulty, stacked by T2V model.

Key result

Humans still beat the best VLMs by nearly 2×

Across both metrics, the strongest baselines fall far short of human annotators at localizing and explaining errors.

Chart panels: Score (S+P), reported as M(E, Ê); %GT (S+P), reported as M%(E, Ê).

Error analysis

Where the models go wrong

We categorized the failures of Gemini 2.5 Pro. Faulty perception and hallucination dominate: the model often misses what is visibly there, or invents errors that aren't there.

Failure modes (Gemini 2.5 Pro)

  • Faulty perception

    Fails to compare frames and misses what is visibly there, which leads to hallucinated errors.

  • Wrong world knowledge

    Bad intuition for what is physically plausible.

  • Localization failure

    Open-source models (Qwen3-VL) flag the whole clip as a single error.

Qualitative examples comparing model predictions against ground-truth error annotations.
Fig. 5: Qualitative examples — model successes and failures against ground-truth annotations.

Citation

Cite Spotlight

@inproceedings{chinchure2026spotlight,
  title     = {Spotlight: Identifying and Localizing Video Generation Errors Using VLMs},
  author    = {Chinchure, Aditya and Ravi, Sahithya and Shukla, Pushkar and
               Shwartz, Vered and Sigal, Leonid},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
  eprint    = {2511.18102},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}