Uncovering Hidden Challenges in Query-based Video Moment Retrieval

Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä

Paper

BMVC'20 paper and supplementary

arXiv

Experiments

Experiments and notebooks

neptune.ai

Code

Evaluation and visualization toolkit

GitHub

Overview

Query-based moment retrieval is the problem of localising a specific clip in an untrimmed video according to a query sentence. This is a challenging task that requires interpreting both the natural language query and the video content. As in many other areas of computer vision and machine learning, progress in query-based moment retrieval is heavily driven by benchmark datasets, and their quality therefore has a significant impact on the field. In this paper, we present a series of experiments assessing how well the benchmark results reflect true progress in solving the moment retrieval task. Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models. Moreover, we present new sanity check experiments and approaches for visualising the results. Finally, we suggest possible directions to improve temporal sentence grounding in the future.

Video Overview

Citation

@inproceedings{otani2020challengesmr,
author={Otani, Mayu and Nakashima, Yuta and Rahtu, Esa and Heikkil{\"{a}}, Janne},
title = {Uncovering Hidden Challenges in Query-Based Video Moment Retrieval},
booktitle={The British Machine Vision Conference (BMVC)},
year = {2020},
}

Latest Posts

Checks on Visual Input

SOTA Models Often Ignore Visual Input

Our analyses revealed that some deep models rely heavily on language priors in video moment retrieval. We describe a visual sanity check for investigating whether a model actually uses its visual input. The check is easy to try: we randomly reorder the visual features of a video and observe how much the output changes. A sketch of the procedure is shown below.
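A minimal sketch of the check, assuming a hypothetical model callable as model(video_features, query) that takes a NumPy array of per-segment visual features and returns a predicted (start, end) pair in seconds; the interface is illustrative, not the exact code of our toolkit.

import numpy as np

def visual_sanity_check(model, video_features, query, num_trials=10, seed=0):
    """Shuffle the temporal order of the visual features and measure how much
    the predicted moment moves. A model that truly uses visual input should
    produce noticeably different predictions on shuffled features."""
    rng = np.random.default_rng(seed)
    original = model(video_features, query)  # prediction on the intact video
    shifts = []
    for _ in range(num_trials):
        order = rng.permutation(len(video_features))  # random temporal reorder
        perturbed = model(video_features[order], query)
        # absolute change of the predicted boundaries (seconds)
        shifts.append(abs(perturbed[0] - original[0]) + abs(perturbed[1] - original[1]))
    return float(np.mean(shifts))  # a value near 0 suggests the model ignores visual input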

How Well Do the Blind Baselines Perform?

Blind Baselines Perform Unexpectedly Well

We built three blind baselines that never use the video for training or inference. These baselines put the scores of deep models in context: surprisingly, they are competitive with, and even outperform, some deep models. A sketch of one such baseline is given below.
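As a rough illustration (not the exact baselines from the paper), a prior-only blind baseline can be sketched as follows, assuming training annotations are available as (moment_start, moment_end, video_duration) triples in seconds.

import numpy as np

class PriorOnlyBaseline:
    """Predicts a fixed moment, chosen as the mean normalised start/end over
    the training annotations. The video content is never looked at."""

    def fit(self, annotations):
        # annotations: iterable of (start, end, duration) triples in seconds
        starts = [s / d for s, e, d in annotations]
        ends = [e / d for s, e, d in annotations]
        self.start, self.end = float(np.mean(starts)), float(np.mean(ends))
        return self

    def predict(self, video_duration, query=None):
        # The query argument only matches the retrieval interface;
        # this baseline ignores both the query and the video.
        return self.start * video_duration, self.end * video_duration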