CV · LG · Jul 16, 2016

Exploiting Multi-modal Curriculum in Noisy Web Data for Large-scale Concept Learning

arXiv:1607.04780v1 · 1 citation
AI Analysis

This paper addresses scalable concept learning from noisy web videos, a problem of interest to the multimedia and machine learning communities. It is an incremental improvement that contributes novel multi-modal approaches.

The paper tackles learning video concept detectors from noisy web data without manual annotations. It proposes WELL, a method based on multi-modal curriculum learning, and reports that WELL outperforms state-of-the-art methods by a statistically significant margin and achieves accuracy comparable to supervised learning on clean data.

Learning video concept detectors automatically from big but noisy web data, with no additional manual annotations, is a novel but challenging area in the multimedia and machine learning communities. A considerable number of videos on the web are associated with rich but noisy contextual information, such as the title, which provides weak annotations or labels about the video content. To leverage these large-scale noisy web labels, this paper proposes a novel method called WEbly-Labeled Learning (WELL), which builds on a state-of-the-art machine learning algorithm inspired by the human learning process. WELL introduces a number of novel multi-modal approaches for incorporating meaningful prior knowledge, called a curriculum, from the noisy web videos. To investigate this problem, we empirically study curricula constructed from the multi-modal features of videos collected from YouTube and Flickr. The efficacy and scalability of WELL have been extensively demonstrated on two public benchmarks, including the largest multimedia dataset and the largest manually-labeled video set. The comprehensive experimental results demonstrate that WELL outperforms state-of-the-art studies by a statistically significant margin on learning concepts from noisy web video data. In addition, the results verify that WELL is robust to the level of noisiness in the video data. Notably, WELL trained on sufficient noisy web labels is able to achieve accuracy comparable to supervised learning methods trained on clean manually-labeled data.
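The abstract describes learning from noisy labels by following a curriculum, but gives no algorithmic detail. The sketch below is a generic illustration of that idea and not WELL's actual formulation. It assumes a plain logistic-regression detector and a self-paced loop that starts from the samples a prior trusts most, then gradually admits more samples whose loss is low under the current model. All names here (`self_paced_curriculum`, `prior_conf`, `start_frac`, `grow`, `max_frac`) are hypothetical. `prior_conf` stands in for any per-video confidence that a web label is correct, such as how strongly the title matches the concept.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def per_sample_loss(w, X, y):
    """Logistic loss of each sample, for labels in {0, 1}."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1.0 - 1e-12)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def fit_weighted(X, y, v, w, lr=0.5, steps=200):
    """Gradient descent on the logistic loss, weighted by the selection vector v."""
    for _ in range(steps):
        grad = X.T @ (v * (sigmoid(X @ w) - y)) / max(v.sum(), 1.0)
        w = w - lr * grad
    return w


def self_paced_curriculum(X, y_noisy, prior_conf, rounds=4,
                          start_frac=0.3, grow=0.1, max_frac=0.8):
    """Alternate between fitting the model and re-selecting 'easy' samples.

    Assumption: prior_conf is a hypothetical per-sample score where a higher
    value means the noisy label is more likely to be correct.
    """
    n, d = X.shape
    w = np.zeros(d)

    # Curriculum start: trust only the samples the prior is most confident about.
    v = np.zeros(n)
    v[np.argsort(-prior_conf)[: int(start_frac * n)]] = 1.0

    frac = start_frac
    for _ in range(rounds):
        w = fit_weighted(X, y_noisy, v, w)
        # Raise the loss threshold so a larger share of samples is admitted.
        frac = min(max_frac, frac + grow)
        losses = per_sample_loss(w, X, y_noisy)
        threshold = np.quantile(losses, frac)
        v = (losses <= threshold).astype(float)
    return w, v
```

The function returns the learned weights `w` and a 0/1 vector `v` marking which samples were selected in the final round. Capping the admitted fraction with `max_frac` is a design choice of this sketch: it keeps the highest-loss samples, which are the most likely to be mislabeled, out of training.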

Code Implementations: 1 repo