LGMay 2, 2024
Continuously evolving rewards in an open-ended environmentRichard M. Bailey
Unambiguous identification of the rewards driving behaviours of entities operating in complex open-ended real-world environments is difficult, partly because goals and associated behaviours emerge endogenously and are dynamically updated as environments change. Reproducing such dynamics in models would be useful in many domains, particularly where fixed reward functions limit the adaptive capabilities of agents. Simulation experiments described assess a candidate algorithm for the dynamic updating of rewards, RULE: Reward Updating through Learning and Expectation. The approach is tested in a simplified ecosystem-like setting where experiments challenge entities' survival, calling for significant behavioural change. The population of entities successfully demonstrate the abandonment of an initially rewarded but ultimately detrimental behaviour, amplification of beneficial behaviour, and appropriate responses to novel items added to their environment. These adjustment happen through endogenous modification of the entities' underlying reward function, during continuous learning, without external intervention.
AIOct 17, 2025
Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RLRichard M. Bailey
So-called `wicked problems', those involving complex multi-dimensional settings, non-verifiable outcomes, heterogeneous impacts and a lack of single objectively correct answers, have plagued humans throughout history. Modern examples include decisions over justice frameworks, solving environmental pollution, planning for pandemic resilience and food security. The use of state-of-the-art artificial intelligence systems (notably Large Language Model-based agents) collaborating with humans on solving such problems is being actively explored. While the abilities of LLMs can be improved by, for example, fine-tuning, hand-crafted system prompts and scaffolding with external tools, LLMs lack endogenous mechanisms to develop expertise through experience in such settings. This work address this gap with Dialectica, a framework where agents engage in structured dialogue on defined topics, augmented by memory, self-reflection, and policy-constrained context editing. Formally, discussion is viewed as an implicit meta-reinforcement learning process. The `dialogue-trained' agents are evaluated post-hoc using judged pairwise comparisons of elicited responses. Across two model architectures (locally run Qwen3:30b and OpenAI's o4-mini) results show that enabling reflection-based context editing during discussion produces agents which dominate their baseline counterparts on Elo scores, normalized Bradley-Terry-Davidson ability, and AlphaRank mass. The predicted signatures of learning are observed qualitatively in statement and reflection logs, where reflections identify weaknesses and reliably shape subsequent statements. Agreement between quantitative and qualitative evidence supports dialogue-driven context evolution as a practical path to targeted expertise amplification in open non-verifiable domains.