5 Papers

LGJan 15, 2025
Projection Implicit Q-Learning with Support Constraint for Offline Reinforcement Learning

Xinchen Han, Hossam Afifi, Michel Marot

Offline Reinforcement Learning (RL) faces a critical challenge of extrapolation errors caused by out-of-distribution (OOD) actions. Implicit Q-Learning (IQL) algorithm employs expectile regression to achieve in-sample learning, effectively mitigating the risks associated with OOD actions. However, the fixed hyperparameter in policy evaluation and density-based policy improvement method limit its overall efficiency. In this paper, we propose Proj-IQL, a projective IQL algorithm enhanced with the support constraint. In the policy evaluation phase, Proj-IQL generalizes the one-step approach to a multi-step approach through vector projection, while maintaining in-sample learning and expectile regression framework. In the policy improvement phase, Proj-IQL introduces support constraint that is more aligned with the policy evaluation approach. Furthermore, we theoretically demonstrate that Proj-IQL guarantees monotonic policy improvement and enjoys a progressively more rigorous criterion for superior actions. Empirical results demonstrate the Proj-IQL achieves state-of-the-art performance on D4RL benchmarks, especially in challenging navigation domains.

LGFeb 10
Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Xinchen Han, Hossam Afifi, Michel Marot et al.

Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational costs and latency without proportional performance gains. In this paper, we propose \textbf{F}ine-grained \textbf{G}roup policy \textbf{O}ptimization (\textbf{FGO}), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. Meanwhile, as an enhanced variant of Group Relative Policy Optimization (GRPO), FGO successfully addresses two major limitations of the GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance, and simultaneously resolves the key limitations of GRPO.

LGJan 30
Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning

Xinchen Han, Qiuyang Fang, Hossam Afifi et al.

Offline Reinforcement Learning (RL) relies on policy constraints to mitigate extrapolation error, where both the constraint form and constraint strength critically shape performance. However, most existing methods commit to a single constraint family: weighted behavior cloning, density regularization, or support constraints, without a unified principle that explains their connections or trade-offs. In this work, we propose Continuous Constraint Interpolation (CCI), a unified optimization framework in which these three constraint families arise as special cases along a common constraint spectrum. The CCI framework introduces a single interpolation parameter that enables smooth transitions and principled combinations across constraint types. Building on CCI, we develop Automatic Constraint Policy Optimization (ACPO), a practical primal--dual algorithm that adapts the interpolation parameter via a Lagrangian dual update. Moreover, we establish a maximum-entropy performance difference lemma and derive performance lower bounds for both the closed-form optimal policy and its parametric projection. Experiments on D4RL and NeoRL2 demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.

CRFeb 4, 2015
Hash Chain Links Resynchronization Methods in Video Streaming Security Performance Comparison

Emad Abd-Elrahman, Mohamed Boutabia, Hossam Afifi

Hash chains provide a secure and light way of security to data authentication including two aspects: Data Integrity and Data Origin Authentication. The real challenge of using the hash chains is how it could recover the synchronization state and continue keeping the hash link in case of packet loss? Based on the packet loss tolerance and some accepted delay of video delivery which are representing the permitted tolerance for heavy loaded applications, we propose different mechanisms for such synchronization recovery. Each mechanism is suitable to use according to the video use case and the low capabilities of end devices. This paper proposes comparative results between them based on the status of each one and its overhead. Then, we propose a hybrid technique based Redundancy Code (RC). This hybrid algorithm is simulated and compared analytically against the other techniques (SHHC, TSP, MLHC and TSS). Moreover, a global performance evaluation in terms of delay and overhead is conducted for all techniques.

NIFeb 4, 2015
Optimization of Quality of Experience through File Duplication in Video Sharing Servers

Emad Abd-Elrahman, Tarek Rekik, Hossam Afifi

Consumers of short videos on Internet can have a bad Quality of Experience QoE due to the long distance between the consumers and the servers that hosting the videos. We propose an optimization of the file allocation in telecommunication operators content sharing servers to improve the QoE through files duplication, thus bringing the files closer to the consumers. This optimization allows the network operator to set the level of QoE and to have control over the users access cost by setting a number of parameters. Two optimization methods are given and are followed by a comparison of their efficiency. Also, the hosting costs versus the gain of optimization are analytically discussed.