LG Feb 6
Trust Regions Sell, But Who's Buying? Overlap Geometry as an Alternative Trust Region for Policy Optimization
Gaurish Trivedi, Alakh Sharma, Kartikey Singh Bhandari et al.
Standard trust-region methods constrain policy updates via Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent rare, large likelihood-ratio excursions that destabilize training, which is precisely the failure mode that motivates heuristics such as PPO's clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger/Rényi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), enforcing overlap constraints via square-root ratio updates: BPPO clips the square-root ratio q = sqrt(r), and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.
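To make the square-root ratio idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract states only that BPPO clips q = sqrt(r), so the surrogate form (a PPO-style pessimistic minimum over r * A with r recovered as q squared), the function names, and the default `eps` are all assumptions made for illustration. The first helper computes the Bhattacharyya coefficient for two discrete distributions.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Overlap BC(p, q) = sum_i sqrt(p_i * q_i) for two discrete distributions."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def sqrt_ratio_clipped_surrogate(logp_new, logp_old, advantages, eps=0.1):
    """Hypothetical PPO-style surrogate that clips q = sqrt(r) instead of r.

    Assumption (not confirmed by the abstract): the objective is the mean of
    min(r * A, r_clipped * A), where r = q**2 and q is clipped to
    [1 - eps, 1 + eps].
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # q = sqrt(pi_new / pi_old), computed in log space for stability
        q = math.exp(0.5 * (lp_new - lp_old))
        q_clipped = min(max(q, 1.0 - eps), 1.0 + eps)
        # Pessimistic bound, mirroring the min in PPO's clipped objective
        total += min(q * q * adv, q_clipped * q_clipped * adv)
    return total / len(advantages)
```

Under this reading, clipping q to [1 - eps, 1 + eps] bounds the likelihood ratio r to [(1 - eps)^2, (1 + eps)^2], so the clip range is asymmetric in r even though it is symmetric in q.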
31.3 AI May 3
Beyond Sentiment: A Multi-Agent Pipeline for Actionable Business Advice from Reviews
Kartikey Singh Bhandari, Tanish Jain, Archit Agrawal et al.
Customer reviews contain valuable signals about service quality, but converting large-scale review corpora into actionable business recommendations remains difficult. Standard sentiment/aspect analysis is largely descriptive, while direct prompting of large language models (LLMs) often yields generic and repetitive advice that is weakly grounded in user feedback. We propose a hierarchical decision-support pipeline that explicitly separates signal compression, problem abstraction, candidate generation, objective-based evaluation, and cost-aware routing into different agents. This architectural decomposition produces auditable intermediate artifacts and enables controllable trade-offs between advice quality and token budget. Experiments on Yelp reviews from three service domains show consistent improvements over single-pass LLM baselines across multiple advice quality dimensions, including actionability, relevance, and non-redundancy. A human evaluation further indicates that users generally prefer our system's recommendations. These results highlight the value of structured agentic decomposition for scalable, cost-aware business decision support.
LG Aug 26, 2025
HAEPO: History-Aggregated Exploratory Policy Optimization
Gaurish Trivedi, Alakh Sharma, Kartikey Singh Bhandari et al.
Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work, such as DPO, leverages full sequence log-likelihoods to capture an entire trajectory of the model's decisions, while methods like GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss to combat these shortcomings. HAEPO compresses each trajectory into the sum of its log-probabilities (a cumulative log-likelihood), and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, thus encouraging broader exploration. We add entropy regularization, which stabilizes the aggressive updates and prevents premature collapse, and a soft KL penalty relative to a frozen copy of the previous (reference) policy. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior on par with or better than PPO, GRPO, and DPO across diverse tasks. Thus, HAEPO provides a stable and interpretable framework by explicitly leveraging full-trajectory history while balancing exploration and stability.
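The trajectory-level aggregation described above can be sketched in a few lines of plain Python. This is one possible reading of the abstract, not the authors' loss: it assumes the softmax is taken over the cumulative log-likelihoods and that the resulting weights multiply the returns. The entropy regularizer and the soft KL penalty are omitted, and all function names are hypothetical.

```python
import math

def trajectory_log_likelihoods(token_logps):
    """Compress each trajectory into the sum of its per-step log-probabilities."""
    return [sum(steps) for steps in token_logps]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def haepo_style_loss(token_logps, returns):
    """Hypothetical return-weighted loss over a batch of trajectories.

    Assumption: weights come from a softmax across the trajectories'
    cumulative log-likelihoods; the loss is the negative weighted return.
    The entropy and KL terms mentioned in the abstract are left out.
    """
    weights = softmax(trajectory_log_likelihoods(token_logps))
    return -sum(w * r for w, r in zip(weights, returns))
```

For example, `haepo_style_loss([[-1.0, -1.0], [-2.0]], [1.0, 3.0])` gives both trajectories the same cumulative log-likelihood, so each receives weight 0.5 and the loss is the negative mean return.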