LG | AI | Sep 23, 2024

CANDERE-COACH: Reinforcement Learning from Noisy Feedback

arXiv:2409.15521v1 | 2 citations | h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the challenge of noisy feedback in RL for applications where human teachers are imperfect. It is an incremental advance, building on existing feedback-based learning frameworks.

The paper tackles the problem of reinforcement learning from noisy human feedback by proposing the CANDERE-COACH algorithm, which uses a noise-filtering mechanism to handle up to 40% incorrect feedback and demonstrates effectiveness in three domains.

Reinforcement learning (RL) has recently been applied to many challenging tasks. However, to perform well, it requires access to a good reward function, which is often sparse or manually engineered with scope for error. Introducing human prior knowledge, for example through imitation learning, learning from preferences, or inverse reinforcement learning, is often seen as a possible solution to this problem. Learning from feedback is another framework that enables an RL agent to learn from binary evaluative signals describing the teacher's (positive or negative) evaluation of the agent's action. However, these methods often assume that the teacher's evaluative feedback is perfect, which is restrictive. In practice, such feedback can be noisy due to limited teacher expertise or other exacerbating factors such as cognitive load, availability, and distraction. In this work, we propose the CANDERE-COACH algorithm, which is capable of learning from noisy feedback provided by a non-optimal teacher. We propose a noise-filtering mechanism to de-noise online feedback data, thereby enabling the RL agent to learn successfully even when up to 40% of the teacher feedback is incorrect. Experiments on three common domains demonstrate the effectiveness of the proposed approach.
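To make the 40% figure concrete, here is a minimal sketch of what noisy binary feedback can look like. It assumes the noise is an independent flip of the correct label with a fixed probability, which is an illustrative assumption; the abstract does not state the noise model. The names `noisy_feedback` and `empirical_error_rate` are hypothetical, and the sketch does not implement the CANDERE-COACH noise filter itself.

```python
import random


def noisy_feedback(is_good_action: bool, noise_rate: float, rng: random.Random) -> int:
    """Return binary feedback (+1 or -1), flipped with probability noise_rate.

    Assumption: noise is an independent label flip per feedback signal.
    """
    true_label = 1 if is_good_action else -1
    if rng.random() < noise_rate:
        return -true_label  # the teacher gives the wrong evaluation
    return true_label


def empirical_error_rate(n: int, noise_rate: float, seed: int = 0) -> float:
    """Fraction of n simulated feedback signals that disagree with the true label."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n):
        is_good = rng.random() < 0.5  # hypothetical: half of the actions are good
        true_label = 1 if is_good else -1
        if noisy_feedback(is_good, noise_rate, rng) != true_label:
            wrong += 1
    return wrong / n


if __name__ == "__main__":
    # With noise_rate=0.4, roughly 40% of the feedback signals are incorrect.
    print(empirical_error_rate(20_000, noise_rate=0.4))
```

At a flip rate of 0.4 the feedback is still correct more often than not, so the signal remains informative in aggregate. That is the regime in which the abstract reports successful learning.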

