Canonical Correlation Inference for Mapping Abstract Scenes to Text
This addresses the challenge of mapping visual abstract scenes to text, but it appears incremental as it applies an existing statistical method to a specific domain without claiming major breakthroughs.
The authors tackled the problem of generating textual descriptions for abstract scenes by developing a structured prediction technique based on canonical correlation analysis, which projects inputs and outputs into a shared space to minimize distance, and demonstrated it on a language-vision task.
We describe a technique for structured prediction, based on canonical correlation analysis. Our learning algorithm finds two projections for the input and the output spaces that aim at projecting a given input and its correct output into points close to each other. We demonstrate our technique on a language-vision problem, namely the problem of giving a textual description to an "abstract scene".