Sequential Coordination of Deep Models for Learning Visual Arithmetic
This work addresses the challenge of combining perception and reasoning in AI for tasks such as visual arithmetic. The contribution is incremental in that it builds on existing methods, namely deep neural networks and reinforcement learning.
The paper tackles the problem of integrating perception and reasoning for visual arithmetic tasks, such as performing arithmetic on handwritten digits in images. It proposes a two-tiered architecture in which a controller coordinates specialized modules, and reports improved sample efficiency over standard feedforward networks.
Achieving machine intelligence requires a smooth integration of perception and reasoning, yet models developed to date tend to specialize in one or the other; sophisticated manipulation of symbols acquired from rich perceptual spaces has so far proved elusive. Consider a visual arithmetic task, where the goal is to carry out simple arithmetical algorithms on digits presented under natural conditions (e.g. hand-written, placed randomly). We propose a two-tiered architecture for tackling this problem. The lower tier consists of a heterogeneous collection of information processing modules, which can include pre-trained deep neural networks for locating and extracting characters from the image, as well as modules performing symbolic transformations on the representations extracted by perception. The higher tier consists of a controller, trained using reinforcement learning, which coordinates the modules in order to solve the high-level task. For instance, the controller may learn in what contexts to execute the perceptual networks and what symbolic transformations to apply to their outputs. The resulting model is able to solve a variety of tasks in the visual arithmetic domain, and has several advantages over standard, architecturally homogeneous feedforward networks including improved sample efficiency.
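The two-tiered control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it (State, MODULES, scripted_controller, run_episode) is hypothetical. Three simplifying assumptions apply: strings such as "3" stand in for image patches, int() parsing stands in for a pre-trained digit classifier, and a hand-written policy stands in for the controller that the paper trains with reinforcement learning.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class State:
    glyphs: List[str]               # assumption: strings stand in for image patches
    cursor: int = 0                 # index of the patch currently attended to
    register: Optional[int] = None  # last symbol extracted by perception
    total: int = 0                  # running result of the symbolic updates
    done: bool = False


# Lower tier: a heterogeneous set of modules acting on a shared state.
def classify(state: State) -> None:
    # Stand-in for a pre-trained perceptual network (here: just parse the string).
    state.register = int(state.glyphs[state.cursor])


def add(state: State) -> None:
    # Symbolic transformation applied to the output of perception.
    if state.register is not None:
        state.total += state.register
        state.register = None


def shift(state: State) -> None:
    # Move attention to the next patch.
    state.cursor += 1


def stop(state: State) -> None:
    state.done = True


MODULES: Dict[str, Callable[[State], None]] = {
    "classify": classify,
    "add": add,
    "shift": shift,
    "stop": stop,
}


# Higher tier: the controller chooses which module to run next.
# Assumption: this fixed policy replaces a policy learned with RL.
def scripted_controller(state: State, last_action: Optional[str]) -> str:
    if state.cursor >= len(state.glyphs):
        return "stop"
    if last_action in (None, "shift"):
        return "classify"
    if last_action == "classify":
        return "add"
    return "shift"  # the previous action was "add"


def run_episode(
    glyphs: List[str],
    controller: Callable[[State, Optional[str]], str],
    max_steps: int = 50,
) -> Tuple[int, List[str]]:
    state = State(glyphs=list(glyphs))
    last_action: Optional[str] = None
    trace: List[str] = []
    for _ in range(max_steps):
        action = controller(state, last_action)
        MODULES[action](state)  # execute the selected lower-tier module
        trace.append(action)
        last_action = action
        if state.done:
            break
    return state.total, trace


total, trace = run_episode(["3", "5", "1"], scripted_controller)
print(total, trace)
```

In a learned version, scripted_controller would be replaced by a policy that maps observations to module choices and is updated from a task-level reward. The modules and the episode loop could stay unchanged.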