SY | Oct 25, 2016
Policy Design for Controlling Set-Point Temperature of ACs in Shared Spaces of Buildings
Wayes Tushar, Wang Tao, Lan Lan et al.
Air conditioning (AC) systems account for a major share of energy consumption in buildings. Shared spaces make up a considerable part of office floor area, and most office employees hold their meetings and perform daily tasks there, so the ACs in these areas have a significant impact on the energy usage of the entire office building. The cost of this energy consumption, however, is not paid by the shared space users, and the AC's temperature set-point is not determined based on the users' preferences. This latter factor is compounded by the fact that different people may prefer different temperature set-points and have different sensitivities to changes in temperature. Therefore, it is a challenging task to design an office policy that decides on a particular set-point based on such a diverse preference set. As a result, users are not aware of the energy consumption in shared spaces, which may increase the energy wastage and related cost of office buildings. In this context, this paper proposes an energy policy for an office shared space by exploiting an established temperature control mechanism. In particular, we choose meeting rooms in an office building as the test case and design a policy under which each user of the room can state a preferred temperature set-point and is paid for the discomfort felt if the set-point is not fixed according to that preference. Conversely, users who enjoy thermal comfort compensate the other users of the room. Thus, the policy makes users cognizant of, and responsible for paying for, the energy consumption of the office space they are sharing, and at the same time ensures that users are satisfied either via thermal comfort or through incentives. The policy is also shown to be beneficial for building management. Through experiment-based case studies, we show the effectiveness of the proposed policy.
CV | Jan 16, 2025 | Code
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Zhihe Yang, Xufang Luo, Dongqi Han et al.
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data is on-policy with respect to the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate these problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data samples, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples. Our implementation is available at https://github.com/zhyang2226/OPA-DPO.
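For readers unfamiliar with the objective this abstract builds on, here is a minimal sketch of the standard per-pair DPO loss. It is not the OPA-DPO method itself. The function name, argument names, and the default `beta` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of each full response under the trained policy and the frozen reference policy.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All names here are hypothetical; beta=0.1 is an assumed default,
    not a value taken from the abstract above.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy equals the reference, the margin is zero and the
# loss equals log(2), regardless of the individual log-probabilities.
loss_at_init = dpo_pair_loss(-5.0, -7.0, -5.0, -7.0)
```

The loss depends on the policy only through its log-ratios against the reference, which is why the abstract's point about whether the preference data is on-policy with respect to that reference matters.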
CL | May 19, 2025 | Code
Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs
Zhihe Yang, Xufang Luo, Zilong Wang et al.
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
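The claim that low-probability tokens carry larger gradients can be illustrated with a simplified proxy. The sketch below measures the gradient of log pi(token) with respect to the logits of a softmax, rather than with respect to model parameters as in the paper; that simplification, the function names, and the example logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logprob_grad_norm(logits, token):
    """L2 norm of d log pi(token) / d logits for a softmax policy.

    The gradient is (one_hot(token) - probs), so its norm grows as
    the probability of the sampled token shrinks.
    """
    probs = softmax(logits)
    grad = [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


# Hypothetical 3-token vocabulary: token 0 is likely, token 1 is not.
example_logits = [4.0, 0.0, 0.0]
norm_high_prob = logprob_grad_norm(example_logits, 0)
norm_low_prob = logprob_grad_norm(example_logits, 1)
```

In this toy setting `norm_low_prob` exceeds `norm_high_prob`, which matches the direction of the effect described above. It does not reproduce Advantage Reweighting or Lopti.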
AI | Dec 22, 2024
GAS: Generative Auto-bidding with Post-training Search
Yewen Li, Shuai Mao, Jingtong Gao et al.
Auto-bidding is essential in facilitating online advertising by automatically placing bids on behalf of advertisers. Generative auto-bidding, which generates bids based on an adjustable condition using models like transformers and diffusers, has recently emerged as a new trend due to its potential to learn optimal strategies directly from data and adjust flexibly to preferences. However, generative models suffer from low-quality data, which leads to a mismatch between the condition, such as return-to-go, and the true action value, especially in long sequential decision-making. Moreover, the majority preference in the dataset may hinder models' ability to generalize to minority advertisers' preferences. While it is possible to collect high-quality data and retrain multiple models for different preferences, the high cost makes this unaffordable, hindering the advancement of auto-bidding into the era of large foundation models. To address this, we propose a flexible and practical Generative Auto-bidding scheme using post-training Search, termed GAS, to refine a base policy model's output and adapt to various preferences. We use weak-to-strong search alignment by training small critics for different preferences and an MCTS-inspired search to refine the model's output. Specifically, a novel voting mechanism with transformer-based critics trained with policy indications can enhance search alignment performance. Additionally, utilizing the search, we provide a fine-tuning method for high-frequency preference scenarios that takes computational efficiency into account. Extensive experiments on a real-world dataset and an online A/B test on the Kuaishou advertising platform demonstrate the effectiveness of GAS, achieving significant improvements, e.g., a 4.60% increase in target cost.
LG | May 29, 2025
ADG: Ambient Diffusion-Guided Dataset Recovery for Corruption-Robust Offline Reinforcement Learning
Zeyuan Liu, Zhihe Yang, Jiawei Xu et al.
Real-world datasets collected from sensors or human inputs are prone to noise and errors, posing significant challenges for applying offline reinforcement learning (RL). While existing methods have made progress in addressing corrupted actions and rewards, they remain insufficient for handling corruption in high-dimensional state spaces and for cases where multiple elements in the dataset are corrupted simultaneously. Diffusion models, known for their strong denoising capabilities, offer a promising direction for this problem, but their tendency to overfit noisy samples limits their direct applicability. To overcome this, we propose Ambient Diffusion-Guided Dataset Recovery (ADG), a novel approach that pioneers the use of diffusion models to tackle data corruption in offline RL. First, we introduce Ambient Denoising Diffusion Probabilistic Models (DDPM) from approximated distributions, which enable learning on partially corrupted datasets with theoretical guarantees. Second, we use the noise-prediction property of Ambient DDPM to distinguish between clean and corrupted data, and then use the clean subset to train a standard DDPM. Third, we employ the trained standard DDPM to refine the previously identified corrupted data, enhancing data quality for subsequent offline RL training. A notable strength of ADG is its versatility: it can be seamlessly integrated with any offline RL algorithm. Experiments on a range of benchmarks, including MuJoCo, Kitchen, and Adroit, demonstrate that ADG effectively mitigates the impact of corrupted data and improves the robustness of offline RL under various noise settings, achieving state-of-the-art results.