M Ganesh Kumar

CV
h-index: 3
5 papers
18 citations
Novelty: 47%
AI Score: 28

5 Papers

LG Sep 8, 2023
Compositional Learning of Visually-Grounded Concepts Using Reinforcement

Zijun Lin, Haidi Azaman, M Ganesh Kumar et al.

Children can rapidly generalize compositionally-constructed rules to unseen test sets. On the other hand, deep reinforcement learning (RL) agents need to be trained over millions of episodes, and their ability to generalize to unseen combinations remains unclear. Hence, we investigate the compositional abilities of RL agents, using the task of navigating to specified color-shape targets in synthetic 3D environments. First, we show that when RL agents are naively trained to navigate to target color-shape combinations, they implicitly learn to decompose the combinations, allowing them to (re-)compose these and succeed at held-out test combinations ("compositional learning"). Second, when agents are pretrained to learn invariant shape and color concepts ("concept learning"), the number of episodes subsequently needed for compositional learning decreased by 20 times. Furthermore, only agents trained on both concept and compositional learning could solve a more complex, out-of-distribution environment in zero-shot fashion. Finally, we verified that only text encoders pretrained on image-text datasets (e.g. CLIP) reduced the number of training episodes needed for our agents to demonstrate compositional learning, and also generalized to 5 unseen colors in zero-shot fashion. Overall, our results are the first to demonstrate that RL agents can be trained to implicitly learn concepts and compositionality, to solve more complex environments in zero-shot fashion.
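The held-out evaluation described in this abstract can be pictured with a small sketch. The color and shape names, the `split_combinations` helper, and the rule of holding out one combination per color are all assumptions made for illustration; the abstract does not state how the authors chose their held-out set.

```python
from itertools import product


def split_combinations(colors, shapes):
    """Split the color x shape grid into train and held-out test pairs.

    One pair per color is held out (a "diagonal" of the grid), so every
    color and every shape still appears in training and only the
    combination is new at test time. This split rule is an assumption.
    """
    all_pairs = list(product(colors, shapes))
    held_out = {(c, shapes[i % len(shapes)]) for i, c in enumerate(colors)}
    train = [p for p in all_pairs if p not in held_out]
    test = [p for p in all_pairs if p in held_out]
    return train, test


colors = ["red", "green", "blue"]        # hypothetical attribute values
shapes = ["cube", "sphere", "capsule"]   # hypothetical attribute values
train, test = split_combinations(colors, shapes)
```

With this split, an agent that only memorized whole color-shape pairs would fail on `test`, whereas an agent that learned color and shape separately could recompose them.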

CV Sep 7, 2023
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners

Clarence Lee, M Ganesh Kumar, Cheston Tan

State-of-the-art visual grounding models can achieve high detection accuracy, but they are not designed to distinguish between all objects versus only certain objects of interest. In natural language, in order to specify a particular object or set of objects of interest, humans use determiners such as "my", "either" and "those". Determiners, as an important word class, are a type of schema in natural language about the reference or quantity of the noun. Existing grounded referencing datasets place much less emphasis on determiners, compared to other word classes such as nouns, verbs and adjectives. This makes it difficult to develop models that understand the full variety and complexity of object referencing. Thus, we have developed and released the DetermiNet dataset, which comprises 250,000 synthetically generated images and captions based on 25 determiners. The task is to predict bounding boxes to identify objects of interest, constrained by the semantics of the given determiner. We find that current state-of-the-art visual grounding models do not perform well on the dataset, highlighting the limitations of existing models on reference and quantification tasks.

CV Apr 9, 2025
Human-like compositional learning of visually-grounded concepts using synthetic environments

Zijun Lin, M Ganesh Kumar, Cheston Tan

The compositional structure of language enables humans to decompose complex phrases and map them to novel visual concepts, showcasing flexible intelligence. While several algorithms exhibit compositionality, they fail to elucidate how humans learn to compose concept classes and ground visual cues through trial and error. To investigate this multi-modal learning challenge, we designed a 3D synthetic environment in which an agent learns, via reinforcement, to navigate to a target specified by a natural language instruction. These instructions comprise nouns, attributes, and critically, determiners, prepositions, or both. The vast array of word combinations heightens the compositional complexity of the visual grounding task, as navigating to a blue cube above red spheres is not rewarded when the instruction specifies navigating to "some blue cubes below the red sphere". We first demonstrate that reinforcement learning agents can ground determiner concepts to visual targets but struggle with more complex prepositional concepts. Second, we show that curriculum learning, a strategy humans employ, enhances concept learning efficiency, reducing the required training episodes by 15% in determiner environments and enabling agents to easily learn prepositional concepts. Finally, we establish that agents trained on determiner or prepositional concepts can decompose held-out test instructions and rapidly adapt their navigation policies to unseen visual object combinations. Leveraging synthetic environments, our findings demonstrate that multi-modal reinforcement learning agents can achieve compositional understanding of complex concept classes and highlight the efficacy of human-like learning strategies in improving artificial systems' learning efficiency.

NE Jun 25, 2021
A nonlinear hidden layer enables actor-critic agents to learn multiple paired association navigation

M Ganesh Kumar, Cheston Tan, Camilo Libedinsky et al.

Navigation to multiple cued reward locations has been increasingly used to study rodent learning. Though deep reinforcement learning agents have been shown to be able to learn the task, they are not biologically plausible. Biologically plausible classic actor-critic agents have been shown to learn to navigate to single reward locations, but which biologically plausible agents are able to learn multiple cue-reward location tasks has remained unclear. In this computational study, we show versions of classic agents that learn to navigate to a single reward location, and adapt to reward location displacement, but are not able to learn multiple paired association navigation. The limitation is overcome by an agent in which place cell and cue information are first processed by a feedforward nonlinear hidden layer with synapses to the actor and critic subject to temporal difference error-modulated plasticity. Faster learning is obtained when the feedforward layer is replaced by a recurrent reservoir network.
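The idea of a nonlinear hidden layer whose outgoing synapses are modulated by a temporal difference (TD) error can be sketched as follows. This is a minimal illustration, not the paper's model: the `tanh` nonlinearity, the fixed input weights, the critic-only readout, and the values of `gamma` and `lr` are all assumptions, and the actor would be updated analogously.

```python
import math


def hidden(x, w_in):
    """Nonlinear feedforward layer: h_j = tanh(sum_i w_in[j][i] * x[i])."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_in]


def value(h, w_critic):
    """Linear critic readout from the hidden activity."""
    return sum(w * hj for w, hj in zip(w_critic, h))


def td_update(x, x_next, reward, w_in, w_critic, gamma=0.9, lr=0.1):
    """One TD(0) step: only the hidden-to-critic weights are plastic."""
    h = hidden(x, w_in)
    h_next = hidden(x_next, w_in)
    # TD error: reward plus discounted next value, minus current value.
    delta = reward + gamma * value(h_next, w_critic) - value(h, w_critic)
    # Plasticity is gated by the TD error and the presynaptic activity.
    new_w = [w + lr * delta * hj for w, hj in zip(w_critic, h)]
    return delta, new_w
```

Here `x` stands in for the concatenated place cell and cue input; because the hidden layer mixes the two nonlinearly, the readout can assign different values to the same place under different cues.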

NE Jun 7, 2021
One-shot learning of paired association navigation with biologically plausible schemas

M Ganesh Kumar, Cheston Tan, Camilo Libedinsky et al.

Schemas are knowledge structures that can enable rapid learning. Rodent one-shot learning in a multiple paired association navigation task has been postulated to be schema-dependent. We still only poorly understand how schemas, conceptualized at Marr's computational level, are neurally implemented. Moreover, a biologically plausible computational model of the rodent learning has not been demonstrated. Accordingly, we here compose an agent from schemas with biologically plausible neural implementations. The agent gradually learns a metric representation of its environment using a path integration temporal difference error, allowing it to localize in any environment. Additionally, the agent contains an associative memory that can stably form numerous one-shot associations between sensory cues and goal coordinates, implemented with a feedforward layer or a reservoir of recurrently connected neurons whose plastic output weights are governed by a 4-factor reward-modulated Exploratory Hebbian (EH) rule. A third network performs vector subtraction between the agent's current and goal location to decide the direction of movement. We further show that schemas supplemented by an actor-critic allow the agent to succeed even if an obstacle prevents direct heading, and that temporal-difference learning of a working memory gating mechanism enables one-shot learning despite distractors. Our agent recapitulates learning behavior observed in experiments and provides testable predictions that can be probed in future experiments.
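The vector subtraction step mentioned in this abstract amounts to a simple computation, sketched below. In the paper it is performed by a neural network; the plain arithmetic, the 2D coordinates, and the `heading` function name here are assumptions for illustration only.

```python
import math


def heading(current, goal):
    """Unit vector pointing from current to goal, or (0.0, 0.0) at the goal."""
    dx = goal[0] - current[0]
    dy = goal[1] - current[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        # Already at the goal: no movement direction is defined.
        return (0.0, 0.0)
    return (dx / dist, dy / dist)
```

In this picture, the associative memory would supply `goal` from a sensory cue and path integration would supply `current`, so a single learned cue-goal association is enough to produce a movement direction from any position.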