Mahammad Humayoo

LG
h-index: 10
Papers: 4
Citations: 16
Novelty: 51%
AI Score: 26

4 Papers

LG · Nov 22, 2024
Segmenting Action-Value Functions Over Time-Scales in SARSA via TD($Δ$)

Mahammad Humayoo

In many episodic reinforcement learning (RL) environments, SARSA-based methods are used to improve policies that maximize returns over long horizons. Traditional SARSA algorithms struggle to balance bias and variance, primarily because they depend on a single, constant discount factor ($η$). This work extends the temporal difference decomposition method, TD($Δ$), to the SARSA algorithm; the resulting method is called SARSA($Δ$). SARSA is a widely used on-policy RL method that improves action-value functions via temporal difference updates. By decomposing the action-value function into components, each linked to a specific discount factor, SARSA($Δ$) makes learning easier across a range of time scales. This decomposition makes learning more effective and ensures consistency, particularly where long-horizon improvement is needed. The results show that the proposed method lowers bias in SARSA's updates and speeds up convergence in both deterministic and stochastic settings, including dense-reward Atari environments. Experiments on a variety of benchmark settings show that SARSA($Δ$) outperforms existing TD learning techniques in both tabular and deep RL environments.
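The decomposition described in this abstract can be sketched in tabular form. The following is a minimal illustration, not the paper's implementation. It assumes the usual TD($Δ$) construction: discount factors are ordered $γ_0 < γ_1 < \dots < γ_Z$ (written $γ$ here rather than $η$), the first component estimates the action value under $γ_0$, and each later component estimates the difference between the action values at two consecutive discount factors. The names `sarsa_delta_update`, `q_value`, `W`, and `gammas` are hypothetical.

```python
def sarsa_delta_update(W, gammas, s, a, r, s_next, a_next, alpha, done=False):
    """One tabular SARSA(Delta)-style update (illustrative sketch).

    W[z][s][a] is component z of the action value. The action value for
    discount gammas[z] is the sum W[0][s][a] + ... + W[z][s][a].
    """
    targets = []
    q_prev_next = 0.0  # running sum: action value at (s', a') for gammas[z-1]
    for z, gamma in enumerate(gammas):
        w_next = 0.0 if done else W[z][s_next][a_next]
        if z == 0:
            # Shortest horizon: ordinary SARSA target with gamma_0.
            target = r + gamma * w_next
        else:
            # Longer horizons: bootstrap from the shorter-horizon value
            # plus this component's own next-step estimate.
            target = (gamma - gammas[z - 1]) * q_prev_next + gamma * w_next
        targets.append(target)
        q_prev_next += w_next
    # Apply all updates after computing every target from the old values.
    for z, target in enumerate(targets):
        W[z][s][a] += alpha * (target - W[z][s][a])


def q_value(W, s, a):
    """Full action value at the largest discount factor: sum of all components."""
    return sum(component[s][a] for component in W)
```

As a sanity check under these assumptions, consider a single state and action with reward 1 on every step and `gammas = [0.5, 0.9]`. Repeated updates drive the two components toward 2 and 8, and their sum toward $1/(1 - 0.9) = 10$, the value an ordinary SARSA learner with discount 0.9 would approach.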

LG · Nov 21, 2024
Time-Scale Separation in Q-Learning: Extending TD($\triangle$) for Action-Value Function Decomposition

Mahammad Humayoo

Q-Learning is a fundamental off-policy reinforcement learning (RL) algorithm that approximates action-value functions in order to learn optimal policies. However, it struggles to balance bias and variance, particularly for long-term rewards. This paper introduces Q($Δ$)-Learning, an extension of TD($Δ$) to the Q-Learning framework. TD($Δ$) enables efficient learning over several time scales by decomposing the Q($Δ$)-function into components associated with distinct discount factors. This approach improves learning stability and scalability, especially for long-horizon tasks where discounting bias may impede convergence. Our method ensures that each component of the Q($Δ$)-function is learned separately, which speeds up convergence on shorter time scales and improves learning on longer ones. We demonstrate through theoretical analysis and empirical evaluation on standard benchmarks such as Atari that Q($Δ$)-Learning outperforms conventional Q-Learning and TD learning methods in both tabular and deep RL environments.
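One way to picture the off-policy variant is the tabular sketch below. It is an assumption-laden illustration rather than the paper's algorithm: it supposes ordered discount factors $γ_0 < \dots < γ_Z$, defines component $z \ge 1$ as the difference between the optimal action values at $γ_z$ and $γ_{z-1}$, and lets each horizon take its own greedy maximum over next actions. The function name `q_delta_update` and the table layout `W[z][s][a]` are hypothetical.

```python
def q_delta_update(W, gammas, s, a, r, s_next, alpha, done=False):
    """One tabular Q(Delta)-style update (illustrative sketch).

    W[z][s][a] is component z; the action value for gammas[z] is the
    sum of components 0..z. Each horizon bootstraps from its own max.
    """
    n_actions = len(W[0][s_next])
    q_next = [0.0] * n_actions  # action values at s' for the current horizon
    prev_max = 0.0              # max over a' for the previous horizon
    targets = []
    for z, gamma in enumerate(gammas):
        if not done:
            q_next = [q + w for q, w in zip(q_next, W[z][s_next])]
        cur_max = max(q_next)
        if z == 0:
            # Shortest horizon: ordinary Q-learning target with gamma_0.
            target = r + gamma * cur_max
        else:
            # Difference of the two Bellman optimality targets; r cancels.
            target = gamma * cur_max - gammas[z - 1] * prev_max
        targets.append(target)
        prev_max = cur_max
    # Apply all updates after every target has been computed.
    for z, target in enumerate(targets):
        W[z][s][a] += alpha * (target - W[z][s][a])
```

With this construction the per-component targets telescope, so the sum of components 0 through $z$ follows exactly the ordinary Q-learning update for discount $γ_z$. The decomposition changes how the value is represented and which parts can be learned quickly, not the fixed point it approaches.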

LG · Sep 4, 2019
Parameter Estimation with the Ordered $\ell_{2}$ Regularization via an Alternating Direction Method of Multipliers

Mahammad Humayoo, Xueqi Cheng

Regularization is a popular technique in machine learning for model estimation and for avoiding overfitting. Prior studies have found that modern ordered regularization can be more effective than traditional regularization in handling highly correlated, high-dimensional data, because ordered regularization can reject irrelevant variables and yield accurate parameter estimates. How to scale ordered regularization to large-scale training data remains an open question. This paper explores parameter estimation with ordered $\ell_{2}$ regularization via the Alternating Direction Method of Multipliers (ADMM), called ADMM-O$\ell_{2}$. The advantages of ADMM-O$\ell_{2}$ include (i) scaling ordered $\ell_{2}$ regularization to large-scale datasets, (ii) estimating parameters correctly by automatically excluding irrelevant variables, and (iii) a fast convergence rate. Experimental results on both synthetic and real data indicate that ADMM-O$\ell_{2}$ performs better than or comparably to several state-of-the-art baselines.
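The three-step structure of ADMM can be illustrated on a deliberately simplified problem. The sketch below is not ADMM-O$\ell_{2}$: it swaps in a plain (unordered) $\ell_{2}$ penalty and assumes a diagonal design, so that every step has a closed form and no linear-algebra library is needed. The ordered penalty would change only the `z` update. The function name `admm_ridge_diag` and its parameters are hypothetical.

```python
def admm_ridge_diag(a, b, lam, rho=1.0, iters=200):
    """ADMM sketch for a separable ridge problem (plain l2, not ordered l2).

    Minimizes 0.5 * sum((a[i]*x[i] - b[i])**2) + 0.5 * lam * sum(z[i]**2)
    subject to x = z, using the scaled dual variable u.
    """
    n = len(a)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n
    for _ in range(iters):
        # x-update: minimize the data-fit term plus (rho/2) * (x - z + u)^2.
        x = [(a[i] * b[i] + rho * (z[i] - u[i])) / (a[i] ** 2 + rho)
             for i in range(n)]
        # z-update: proximal step for the penalty term.
        z = [rho * (x[i] + u[i]) / (lam + rho) for i in range(n)]
        # Dual update: accumulate the constraint violation x - z.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z
```

For this toy problem the exact minimizer is known, $x_i = a_i b_i / (a_i^2 + λ)$, which gives a direct check on the iterates. In a general, non-diagonal problem the `x` update becomes a linear solve, which is the step that ADMM-style methods typically distribute or cache to handle large datasets.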

LG · Oct 30, 2018
Relative Importance Sampling for off-Policy Actor-Critic in Deep Reinforcement Learning

Mahammad Humayoo, Gengzhong Zheng, Xiaoqing Dong et al.

Off-policy learning is less stable than on-policy learning in reinforcement learning (RL). A major cause of this instability is the difference between the probability distributions of the target policy ($π$) and the behavior policy ($b$). This distributional mismatch is also a source of high variance. The discrepancy between the two distributions can be reduced using importance sampling (IS). However, importance sampling itself has high variance, which is exacerbated in sequential settings. We propose a smooth form of importance sampling, relative importance sampling (RIS), which mitigates variance and stabilizes learning. To control variance, we tune the smoothness parameter $β\in[0, 1]$ in RIS. Using this strategy, we develop the first model-free relative importance sampling off-policy actor-critic (RIS-off-PAC) algorithms in RL. Our method uses a network to generate the target policy (the actor) and a value function (the critic) to evaluate the current policy ($π$) from samples drawn under the behavior policy. Our algorithms are trained with action values from the behavior policy, rather than the target policy, in the reward function. Both the actor and the critic are deep neural networks. Our methods performed as well as or better than several state-of-the-art RL baselines on OpenAI Gym challenges and synthetic datasets.
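The effect of the smoothness parameter can be seen in a small numerical sketch. It assumes the relative weight takes the form used in relative density-ratio estimation, $π(a|s) / (β\,π(a|s) + (1-β)\,b(a|s))$; the abstract does not state the formula, so treat this form and the name `relative_is_weight` as assumptions.

```python
def relative_is_weight(pi_prob, b_prob, beta):
    """Relative importance weight for one action (assumed form).

    pi_prob: probability of the action under the target policy.
    b_prob:  probability of the action under the behavior policy.
    beta:    smoothness parameter in [0, 1].
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # The denominator is zero only if both probabilities are zero,
    # or if beta == 0 and b_prob == 0; callers should avoid those cases.
    return pi_prob / (beta * pi_prob + (1.0 - beta) * b_prob)
```

Under this form, $β = 0$ recovers the ordinary importance weight $π/b$, $β = 1$ gives a constant weight of 1, and any $β > 0$ bounds the weight above by $1/β$. For example, with target probability 0.9 and behavior probability 0.01, the ordinary weight is about 90, while $β = 0.5$ gives a weight just under 2. That bound is what limits the variance of the off-policy correction.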