Hao Hu

h-index11

3papers

84citations

Novelty65%

AI Score47

Ranked #33,713 of 194,257 authors (top 17%)#1,883 in AI (top 15%)

3 Papers

25.6AISep 29, 2023Code

Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency

Zhihan Liu, Hao Hu, Shenao Zhang et al.

Large language models (LLMs) demonstrate impressive reasoning abilities, but translating reasoning into actions in the real world remains challenging. In particular, it remains unclear how to complete a given task provably within a minimum number of interactions with the external environment, e.g., through an internal mechanism of reasoning. To this end, we propose a principled framework with provable regret guarantees to orchestrate reasoning and acting, which we call "reason for future, act for now" (\texttt{RAFA}). Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future"). At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state. The key idea is to cast reasoning in LLMs as learning and planning in Bayesian adaptive Markov decision processes (MDPs). Correspondingly, we prompt LLMs to form an updated posterior of the unknown environment from the memory buffer (learning) and generate an optimal trajectory for multiple future steps that maximizes a value function (planning). The learning and planning subroutines are performed in an "in-context" manner to emulate the actor-critic update for MDPs. Our theoretical analysis proves that the novel combination of long-term reasoning and short-term acting achieves a $\sqrt{T}$ regret. Here, $T$ denotes the number of online interactions. In particular, the regret bound highlights an intriguing interplay between the prior knowledge obtained through pretraining and the uncertainty reduction achieved by reasoning and acting. Our empirical validation shows that it outperforms various existing frameworks and achieves nearly perfect scores on a few benchmarks.

2.3NAOct 29, 2025

Scaling Optimized Hermite Approximation Methods

Hao Hu, Haijun Yu

Hermite polynomials and functions have extensive applications in scientific and engineering problems. Although it is recognized that employing the scaled Hermite functions rather than the standard ones can remarkably enhance the approximation performance, the understanding of the scaling factor remains insufficient. Due to the lack of theoretical analysis, recent publications still cast doubt on whether the Hermite spectral method is inferior to other methods. To dispel this doubt, we show in this article that the inefficiency of the Hermite spectral method comes from the imbalance in the decay speed of the objective function within the spatial and frequency domains. Proper scaling can render the Hermite spectral methods comparable to other methods. To make it solid, we propose a novel error analysis framework for the scaled Hermite approximation. Taking the $L^2$ projection error as an example, our framework illustrates that there are three different components of errors: the spatial truncation error, the frequency truncation error, and the Hermite spectral approximation error. Through this perspective, finding the optimal scaling factor is equivalent to balancing the spatial and frequency truncation errors. As applications, we show that geometric convergence can be recovered by proper scaling for a class of functions. Furthermore, we show that proper scaling can double the convergence order for smooth functions with algebraic decay. The perplexing pre-asymptotic sub-geometric convergence when approximating algebraic decay functions can be perfectly explained by this framework.

19.2LGMay 29, 2023Code

Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration

Zhihan Liu, Miao Lu, Wei Xiong et al.

In online reinforcement learning (online RL), balancing exploration and exploitation is crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration. However, in order to cope with general function approximators, most of them involve impractical algorithmic components to incentivize exploration, such as optimization within data-dependent level-sets or complicated sampling procedures. To address this challenge, we propose an easy-to-implement RL framework called \textit{Maximize to Explore} (\texttt{MEX}), which only needs to optimize \emph{unconstrainedly} a single objective that integrates the estimation and planning components while balancing exploration and exploitation automatically. Theoretically, we prove that \texttt{MEX} achieves a sublinear regret with general function approximations for Markov decision processes (MDP) and is further extendable to two-player zero-sum Markov games (MG). Meanwhile, we adapt deep RL baselines to design practical versions of \texttt{MEX}, in both model-free and model-based manners, which can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards. Compared with existing sample-efficient online RL algorithms with general function approximations, \texttt{MEX} achieves similar sample efficiency while enjoying a lower computational cost and is more compatible with modern deep RL methods.