Results

Our Findings on Query-based Video Moment Retrieval

Checks on Visual Input

SOTA Models Often Ignore Visual Input

Our analyses revealed that some deep models rely heavily on language priors in video moment retrieval. We describe a visual sanity check that investigates whether a model actually uses its visual input. The check is easy to run: we randomly reorder the visual features of a video and observe how the model's output changes.
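A minimal sketch of this check is shown below. The `model.predict(video_features, query)` interface is an assumption for illustration; substitute the prediction call of the model under test.

```python
import numpy as np

def visual_sanity_check(model, video_features, query, seed=0):
    """Shuffle a video's features along the temporal axis and compare
    the model's predictions before and after the shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = video_features[rng.permutation(len(video_features))]

    original_pred = model.predict(video_features, query)  # hypothetical interface
    shuffled_pred = model.predict(shuffled, query)

    # A model that truly uses visual input should change its prediction;
    # near-identical outputs suggest it relies on language priors instead.
    return original_pred, shuffled_pred
```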

How Well Do the Blind Baselines Perform?

Blind Baselines Perform Unexpectedly Well

We built three blind baselines that never use video for training or inference. These baselines put the scores of deep models into context. Surprisingly, our blind baselines are competitive with, and even outperform, some deep models.
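The sketch below illustrates the idea of a blind baseline; it is not the exact baselines described here, but a hedged example that predicts a fixed temporal segment learned from training annotations, assuming moments are given as (start, end) fractions normalized by video duration.

```python
import numpy as np

class PriorOnlyBaseline:
    """A blind baseline: ignores the video and query entirely and
    predicts a segment derived from the training-set prior."""

    def fit(self, train_moments):
        # Average the normalized start/end of all annotated moments.
        moments = np.asarray(train_moments, dtype=float)
        self.start_, self.end_ = moments.mean(axis=0)
        return self

    def predict(self, video_duration, query=None):
        # Neither the video content nor the query is used; the output
        # depends only on the prior learned from annotations.
        return self.start_ * video_duration, self.end_ * video_duration
```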