Gaurav Chaudhary

LG
h-index8
6papers
13citations
Novelty46%
AI Score49

6 Papers

LGFeb 8, 2023
Predicting the performance of hybrid ventilation in buildings using a multivariate attention-based biLSTM Encoder-Decoder neural network

Gaurav Chaudhary, Hicham Johra, Laurent Georges et al.

Hybrid ventilation is an energy-efficient solution to provide fresh air for most climates, given that it has a reliable control system. To operate such systems optimally, a high-fidelity control-oriented modesl is required. It should enable near-real time forecast of the indoor air temperature based on operational conditions such as window opening and HVAC operating schedules. However, physics-based control-oriented models (i.e., white-box models) are labour-intensive and computationally expensive. Alternatively, black-box models based on artificial neural networks can be trained to be good estimators for building dynamics. This paper investigates the capabilities of a deep neural network (DNN), which is a multivariate multi-head attention-based long short-term memory (LSTM) encoder-decoder neural network, to predict indoor air temperature when windows are opened or closed. Training and test data are generated from a detailed multi-zone office building model (EnergyPlus). Pseudo-random signals are used for the indoor air temperature setpoints and window opening instances. The results indicate that the DNN is able to accurately predict the indoor air temperature of five zones whenever windows are opened or closed. The prediction error plateaus after the 24th step ahead prediction (6 hr ahead prediction).

DCNov 28, 2022
CWD: A Machine Learning based Approach to Detect Unknown Cloud Workloads

Mohammad Hossain, Derssie Mebratu, Niranjan Hasabnis et al.

Workloads in modern cloud data centers are becoming increasingly complex. The number of workloads running in cloud data centers has been growing exponentially for the last few years, and cloud service providers (CSP) have been supporting on-demand services in real-time. Realizing the growing complexity of cloud environment and cloud workloads, hardware vendors such as Intel and AMD are increasingly introducing cloud-specific workload acceleration features in their CPU platforms. These features are typically targeted towards popular and commonly-used cloud workloads. Nonetheless, uncommon, customer-specific workloads (unknown workloads), if their characteristics are different from common workloads (known workloads), may not realize the potential of the underlying platform. To address this problem of realizing the full potential of the underlying platform, we develop a machine learning based technique to characterize, profile and predict workloads running in the cloud environment. Experimental evaluation of our technique demonstrates good prediction performance. We also develop techniques to analyze the performance of the model in a standalone manner.

LGMar 29
Match or Replay: Self Imitating Proximal Policy Optimization

Gaurav Chaudhary, Laxmidhar Behera, Washim Uddin Mondal

Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards. Traditional exploration strategies can lead to slow learning and suboptimal performance because agents fail to systematically build on previously successful experiences, thereby reducing sample efficiency. To tackle this issue, we propose a self-imitating on-policy algorithm that enhances exploration and sample efficiency by leveraging past high-reward state-action pairs to guide policy updates. Our method incorporates self-imitation by using optimal transport distance in dense reward environments to prioritize state visitation distributions that match the most rewarding trajectory. In sparse-reward environments, we uniformly replay successful self-encountered trajectories to facilitate structured exploration. Experimental results across diverse environments demonstrate substantial improvements in learning efficiency, including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards. Our approach achieves faster convergence and significantly higher success rates compared to state-of-the-art self-imitating RL baselines. These findings underscore the potential of self-imitation as a robust strategy for enhancing exploration in RL, with applicability to more complex tasks.

LGDec 28, 2025
TEACH: Temporal Variance-Driven Curriculum for Reinforcement Learning

Gaurav Chaudhary, Laxmidhar Behera

Reinforcement Learning (RL) has achieved significant success in solving single-goal tasks. However, uniform goal selection often results in sample inefficiency in multi-goal settings where agents must learn a universal goal-conditioned policy. Inspired by the adaptive and structured learning processes observed in biological systems, we propose a novel Student-Teacher learning paradigm with a Temporal Variance-Driven Curriculum to accelerate Goal-Conditioned RL. In this framework, the teacher module dynamically prioritizes goals with the highest temporal variance in the policy's confidence score, parameterized by the state-action value (Q) function. The teacher provides an adaptive and focused learning signal by targeting these high-uncertainty goals, fostering continual and efficient progress. We establish a theoretical connection between the temporal variance of Q-values and the evolution of the policy, providing insights into the method's underlying principles. Our approach is algorithm-agnostic and integrates seamlessly with existing RL frameworks. We demonstrate this through evaluation across 11 diverse robotic manipulation and maze navigation tasks. The results show consistent and notable improvements over state-of-the-art curriculum learning and goal-selection methods.

LGJun 11, 2025
MOORL: A Framework for Integrating Offline-Online Reinforcement Learning

Gaurav Chaudhary, Wassim Uddin Mondal, Laxmidhar Behera

Sample efficiency and exploration remain critical challenges in Deep Reinforcement Learning (DRL), particularly in complex domains. Offline RL, which enables agents to learn optimal policies from static, pre-collected datasets, has emerged as a promising alternative. However, offline RL is constrained by issues such as out-of-distribution (OOD) actions that limit policy performance and generalization. To overcome these limitations, we propose Meta Offline-Online Reinforcement Learning (MOORL), a hybrid framework that unifies offline and online RL for efficient and scalable learning. While previous hybrid methods rely on extensive design components and added computational complexity to utilize offline data effectively, MOORL introduces a meta-policy that seamlessly adapts across offline and online trajectories. This enables the agent to leverage offline data for robust initialization while utilizing online interactions to drive efficient exploration. Our theoretical analysis demonstrates that the hybrid approach enhances exploration by effectively combining the complementary strengths of offline and online data. Furthermore, we demonstrate that MOORL learns a stable Q-function without added complexity. Extensive experiments on 28 tasks from the D4RL and V-D4RL benchmarks validate its effectiveness, showing consistent improvements over state-of-the-art offline and hybrid RL baselines. With minimal computational overhead, MOORL achieves strong performance, underscoring its potential for practical applications in real-world scenarios.

LGJul 17, 2025
From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning

Gaurav Chaudhary, Laxmidhar Behera

Offline Reinforcement Learning (RL) aims to learn effective policies from a static dataset without requiring further agent-environment interactions. However, its practical adoption is often hindered by the need for explicit reward annotations, which can be costly to engineer or difficult to obtain retrospectively. To address this, we propose ReLOAD (Reinforcement Learning with Offline Reward Annotation via Distillation), a novel reward annotation framework for offline RL. Unlike existing methods that depend on complex alignment procedures, our approach adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations using a simple yet effective embedding discrepancy measure. First, we train a predictor network to mimic a fixed target network's embeddings based on expert state transitions. Later, the prediction error between these networks serves as a reward signal for each transition in the static dataset. This mechanism provides a structured reward signal without requiring handcrafted reward annotations. We provide a formal theoretical construct that offers insights into how RND prediction errors effectively serve as intrinsic rewards by distinguishing expert-like transitions. Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods.