CVApr 12, 2023Code
A Closer Look at the Explainability of Contrastive Language-Image Pre-trainingYi Li, Hualiang Wang, Yiqun Duan et al.
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions on the visualization results. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at its architecture and features. Based on thorough analyses, we find the raw self-attentions link to inconsistent semantic regions, resulting in the opposite visualization. Besides, the noisy activations are owing to redundant features among categories. Building on these insights, we propose the CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features, without further fine-tuning as classical CAM methods. This approach significantly improves the explainability of CLIP, surpassing existing methods by large margins. Besides, it enables multimodal visualization and extends the capacity of raw CLIP on open-vocabulary tasks without extra alignment. The code is available at https://github.com/xmed-lab/CLIP_Surgery.
LGSep 14, 2022
Distributionally Robust Offline Reinforcement Learning with Linear Function ApproximationXiaoteng Ma, Zhipeng Liang, Jose Blanchet et al. · tsinghua
Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms~-- one for each of the two settings~-- that achieve error bounds $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension in the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, they provide the first non-asymptotic results of the sample complexity in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm against the non-robust one.
MLJan 27, 2023
Single-Trajectory Distributionally Robust Reinforcement LearningZhipeng Liang, Xiaoteng Ma, Jose Blanchet et al.
To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modelling the environment, thus the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms.
LGFeb 10, 2023
Debiasing Recommendation by Learning Identifiable Latent ConfoundersQing Zhang, Xiaoying Zhang, Yang Liu et al.
Recommendation systems aim to predict users' feedback on items not exposed to them. Confounding bias arises due to the presence of unmeasured variables (e.g., the socio-economic status of a user) that can affect both a user's exposure and feedback. Existing methods either (1) make untenable assumptions about these unmeasured variables or (2) directly infer latent confounders from users' exposure. However, they cannot guarantee the identification of counterfactual feedback, which can lead to biased predictions. In this work, we propose a novel method, i.e., identifiable deconfounder (iDCF), which leverages a set of proxy variables (e.g., observed user features) to resolve the aforementioned non-identification issue. The proposed iDCF is a general deconfounded recommendation framework that applies proximal causal inference to infer the unmeasured confounders and identify the counterfactual feedback with theoretical guarantees. Extensive experiments on various real-world and synthetic datasets verify the proposed method's effectiveness and robustness.
LGOct 21, 2022
Optimal Contextual Bandits with Knapsacks under Realizability via Regression OraclesYuxuan Han, Jialin Zeng, Yang Wang et al.
We study the stochastic contextual bandit with knapsacks (CBwK) problem, where each action, taken upon a context, not only leads to a random reward but also costs a random resource consumption in a vector form. The challenge is to maximize the total reward without violating the budget for each resource. We study this problem under a general realizability setting where the expected reward and expected cost are functions of contexts and actions in some given general function classes $\mathcal{F}$ and $\mathcal{G}$, respectively. Existing works on CBwK are restricted to the linear function class since they use UCB-type algorithms, which heavily rely on the linear form and thus are difficult to extend to general function classes. Motivated by online regression oracles that have been successfully applied to contextual bandits, we propose the first universal and optimal algorithmic framework for CBwK by reducing it to online regression. We also establish the lower regret bound to show the optimality of our algorithm for a variety of function classes.
LGAug 29, 2023
Stochastic Graph Bandit Learning with Side-ObservationsXueping Gong, Jiheng Zhang
In this paper, we investigate the stochastic contextual bandit with general function space and graph feedback. We propose an algorithm that addresses this problem by adapting to both the underlying graph structures and reward gaps. To the best of our knowledge, our algorithm is the first to provide a gap-dependent upper bound in this stochastic setting, bridging the research gap left by the work in [35]. In comparison to [31,33,35], our method offers improved regret upper bounds and does not require knowledge of graphical quantities. We conduct numerical experiments to demonstrate the computational efficiency and effectiveness of our approach in terms of regret upper bounds. These findings highlight the significance of our algorithm in advancing the field of stochastic contextual bandits with graph feedback, opening up avenues for practical applications in various domains.
LGAug 7, 2023
Efficient Transfer Learning via Causal BoundsXueping Gong, Wei You, Jiheng Zhang
Transfer learning seeks to accelerate sequential decision-making by leveraging offline data from related agents. However, data from heterogeneous sources that differ in observed features, distributions, or unobserved confounders often render causal effects non-identifiable and bias naive estimators. We address this by forming ambiguity sets of structural causal models defined via integral constraints on their joint densities. Optimizing any causal effect over these sets leads to generally non-convex programs whose solutions tightly bound the range of possible effects under heterogeneity or confounding. To solve these programs efficiently, we develop a hit-and-run sampler that explores the entire ambiguity set and, when paired with a local optimization oracle, produces causal bound estimates that converge almost surely to the true limits. We further accommodate estimation error by relaxing the ambiguity set and exploit the Lipschitz continuity of causal effects to establish precise error propagation guarantees. These causal bounds are then embedded into bandit algorithms via arm elimination and truncated UCB indices, yielding optimal gap-dependent and minimax regret bounds. To handle estimation error, we also develop a safe algorithm for incorporating noisy causal bounds. In the contextual-bandit setting with function approximation, our method uses causal bounds to prune both the function class and the per-context action set, achieving matching upper and lower regret bounds with only logarithmic dependence on function-class complexity. Our analysis precisely characterizes when and how causal side-information accelerates online learning, and experiments on synthetic benchmarks confirm substantial regret reductions in data-scarce or confounded regimes.
LGSep 7, 2022
Dual Instrumental Method for Confounded Kernelized BanditsXueping Gong, Jiheng Zhang
The contextual bandit problem is a theoretically justified framework with wide applications in various fields. While the previous study on this problem usually requires independence between noise and contexts, our work considers a more sensible setting where the noise becomes a latent confounder that affects both contexts and rewards. Such a confounded setting is more realistic and could expand to a broader range of applications. However, the unresolved confounder will cause a bias in reward function estimation and thus lead to a large regret. To deal with the challenges brought by the confounder, we apply the dual instrumental variable regression, which can correctly identify the true reward function. We prove the convergence rate of this method is near-optimal in two types of widely used reproducing kernel Hilbert spaces. Therefore, we can design computationally efficient and regret-optimal algorithms based on the theoretical guarantees for confounded bandit problems. The numerical results illustrate the efficacy of our proposed algorithms in the confounded bandit setting.
LGJun 16, 2022
On Private Online Convex Optimization: Optimal Algorithms in $\ell_p$-Geometry and High Dimensional Contextual BanditsYuxuan Han, Zhicong Liang, Zhipeng Liang et al.
Differentially private (DP) stochastic convex optimization (SCO) is ubiquitous in trustworthy machine learning algorithm design. This paper studies the DP-SCO problem with streaming data sampled from a distribution and arrives sequentially. We also consider the continual release model where parameters related to private information are updated and released upon each new data, often known as the online algorithms. Despite that numerous algorithms have been developed to achieve the optimal excess risks in different $\ell_p$ norm geometries, yet none of the existing ones can be adapted to the streaming and continual release setting. To address such a challenge as the online convex optimization with privacy protection, we propose a private variant of online Frank-Wolfe algorithm with recursive gradients for variance reduction to update and reveal the parameters upon each data. Combined with the adaptive differential privacy analysis, our online algorithm achieves in linear time the optimal excess risk when $1<p\leq 2$ and the state-of-the-art excess risk meeting the non-private lower ones when $2<p\leq\infty$. Our algorithm can also be extended to the case $p=1$ to achieve nearly dimension-independent excess risk. While previous variance reduction results on recursive gradient have theoretical guarantee only in the independent and identically distributed sample setting, we establish such a guarantee in a non-stationary setting. To demonstrate the virtues of our method, we design the first DP algorithm for high-dimensional generalized linear bandits with logarithmic regret. Comparative experiments with a variety of DP-SCO and DP-Bandit algorithms exhibit the efficacy and utility of the proposed algorithms.
AINov 12, 2025
OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement LearningZezhen Ding, Zhen Tan, Jiheng Zhang et al.
Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, asmost require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. In addition, it improves the capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of $67.7\%$, using only $1/10$ the synthetic data required by prior methods such as ORLM, exceeding ORLM's solving accuracy by up to $4.2\%$. Remarkably, OR-R1 outperforms ORLM by over $2.4\%$ with just $100$ synthetic samples. Furthermore, TGRPO contributes an additional $3.1\%-6.4\%$ improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from $13\%$ to $7\%$. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.
LGNov 12, 2024Code
FlowTS: Time Series Generation via Rectified FlowYang Hu, Xiao Wang, Zezhen Ding et al.
Diffusion-based models have significant achievements in time series generation but suffer from inefficient computation: solving high-dimensional ODEs/SDEs via iterative numerical solvers demands hundreds to thousands of drift function evaluations per sample, incurring prohibitive costs. To resolve this, we propose FlowTS, an ODE-based model that leverages rectified flow with straight-line transport in probability space. By learning geodesic paths between distributions, FlowTS achieves computational efficiency through exact linear trajectory simulation, accelerating training and generation while improving performances. We further introduce an adaptive sampling strategy inspired by the exploration-exploitation trade-off, balancing noise adaptation and precision. Notably, FlowTS enables seamless adaptation from unconditional to conditional generation without retraining, ensuring efficient real-world deployment. Also, to enhance generation authenticity, FlowTS integrates trend and seasonality decomposition, attention registers (for global context aggregation), and Rotary Position Embedding (RoPE) (for position information). For unconditional setting, extensive experiments demonstrate that FlowTS achieves state-of-the-art performance, with context FID scores of 0.019 and 0.011 on Stock and ETTh datasets (prev. best: 0.067, 0.061). For conditional setting, we have achieved superior performance in solar forecasting (MSE 213, prev. best: 375) and MuJoCo imputation tasks (MSE 7e-5, prev. best 2.7e-4). The code is available at https://github.com/UNITES-Lab/FlowTS.
LGMay 10
Learning to Bid with Unknown Private Values in Budget-Constrained First-Price AuctionsZihao Hu, Yuxiao Wen, Yuan Yao et al.
The transition to First-Price Auctions (FPA) in digital advertising has spurred significant research, yet existing work typically assumes access to a valuation oracle, ignoring the reality that values must be inferred from censored data. While Linear Treatment Effect (LTE) models address this by learning value uplift, they have not been adapted to realistic settings with hard Budget constraints or Return-on-Spend (RoS) targets requiring regret and violation control. In this work, we propose a unified primal-dual framework for constrained FPAs that jointly learns the latent LTE valuation parameters and the competitor's bid distribution. This simultaneous learning introduces a critical technical challenge: the estimation error is dynamically scaled by the Lagrangian multiplier, potentially leading to unbounded regret. We resolve this by leveraging a strong Slater condition and a novel adaptive burn-in procedure to stabilize the dual variables. Our approach achieves near-optimal regret guarantees, providing the first theoretically grounded solution for constrained bidding with latent valuations.
LGApr 27
Geometry-Aware Offline-to-Online Learning in Linear Contextual BanditsZean Han, Ruihan Lin, Zezhen Ding et al.
We study offline-to-online learning in linear contextual bandits with biased offline regression data: the offline parameter need not match the online one, so history should not be treated as a single warm start. We model directional transfer with a shift certificate $(M_{\mathrm{shift}},ρ)$ and offline ridge estimation, yielding a geometry-aware confidence region for the online parameter rather than an isotropic radius. We propose \emph{Ellipsoidal-MINUCB}, which combines a standard online branch with an offline-informed pooled branch and uses offline information only when it tightens uncertainty. With high probability, regret is bounded by the minimum of a standard SupLinUCB-style fallback and a pooled term that separates statistical width from a certificate-weighted shift penalty. Under a simple alignment condition, the pooled term further simplifies to a rate governed by an effective dimension induced by the offline geometry. We also show that a purely Euclidean (scalar) shift bound, by itself, does not determine which feature directions are transferable. Beyond this fixed certificate, we show how to learn a data-driven certificate from data at finitely many refresh times and establish a high-probability regret bound for Ellipsoidal-MINUCB with epoch-wise learned certificates. Experiments match the main prediction: gains are strongest at intermediate horizons when offline coverage and transferability align, while the method otherwise tracks the safe online baseline.
LGJan 23, 2025
Learning to Bid in Non-Stationary Repeated First-Price AuctionsZihao Hu, Xiaoyu Fan, Yuan Yao et al.
First-price auctions have recently gained significant traction in digital advertising markets, exemplified by Google's transition from second-price to first-price auctions. Unlike in second-price auctions, where bidding one's private valuation is a dominant strategy, determining an optimal bidding strategy in first-price auctions is more complex. From a learning perspective, the learner (a specific bidder) can interact with the environment (other bidders, i.e., opponents) sequentially to infer their behaviors. Existing research often assumes specific environmental conditions and benchmarks performance against the best fixed policy (static benchmark). While this approach ensures strong learning guarantees, the static benchmark can deviate significantly from the optimal strategy in environments with even mild non-stationarity. To address such scenarios, a dynamic benchmark--representing the sum of the highest achievable rewards at each time step--offers a more suitable objective. However, achieving no-regret learning with respect to the dynamic benchmark requires additional constraints. By inspecting reward functions in online first-price auctions, we introduce two metrics to quantify the regularity of the sequence of opponents' highest bids, which serve as measures of non-stationarity. We provide a minimax-optimal characterization of the dynamic regret for the class of sequences of opponents' highest bids that satisfy either of these regularity constraints. Our main technical tool is the Optimistic Mirror Descent (OMD) framework with a novel optimism configuration, which is well-suited for achieving minimax-optimal dynamic regret rates in this context. We then use synthetic datasets to validate our theoretical guarantees and demonstrate that our methods outperform existing ones.
LGJan 23, 2025
Unveiling Discrete Clues: Superior Healthcare Predictions for Rare DiseasesChuang Zhao, Hui Tang, Jiheng Zhang et al.
Accurate healthcare prediction is essential for improving patient outcomes. Existing work primarily leverages advanced frameworks like attention or graph networks to capture the intricate collaborative (CO) signals in electronic health records. However, prediction for rare diseases remains challenging due to limited co-occurrence and inadequately tailored approaches. To address this issue, this paper proposes UDC, a novel method that unveils discrete clues to bridge consistent textual knowledge and CO signals within a unified semantic space, thereby enriching the representation semantics of rare diseases. Specifically, we focus on addressing two key sub-problems: (1) acquiring distinguishable discrete encodings for precise disease representation and (2) achieving semantic alignment between textual knowledge and the CO signals at the code level. For the first sub-problem, we refine the standard vector quantized process to include condition awareness. Additionally, we develop an advanced contrastive approach in the decoding stage, leveraging synthetic and mixed-domain targets as hard negatives to enrich the perceptibility of the reconstructed representation for downstream tasks. For the second sub-problem, we introduce a novel codebook update strategy using co-teacher distillation. This approach facilitates bidirectional supervision between textual knowledge and CO signals, thereby aligning semantically equivalent information in a shared discrete latent space. Extensive experiments on three datasets demonstrate our superiority.
LGMar 14, 2025
Make Optimization Once and for All with Fine-grained GuidanceMingjia Shi, Ruihan Lin, Xuxi Chen et al.
Learning to Optimize (L2O) enhances optimization efficiency with integrated neural networks. L2O paradigms achieve great outcomes, e.g., refitting optimizer, generating unseen solutions iteratively or directly. However, conventional L2O methods require intricate design and rely on specific optimization processes, limiting scalability and generalization. Our analyses explore general framework for learning optimization, called Diff-L2O, focusing on augmenting sampled solutions from a wider view rather than local updates in real optimization process only. Meanwhile, we give the related generalization bound, showing that the sample diversity of Diff-L2O brings better performance. This bound can be simply applied to other fields, discussing diversity, mean-variance, and different tasks. Diff-L2O's strong compatibility is empirically verified with only minute-level training, comparing with other hour-levels.
LGMar 18, 2024
RL in Markov Games with Independent Function Approximation: Improved Sample Complexity Bound under the Local Access ModelJunyi Fan, Yuxuan Han, Jialin Zeng et al.
Efficiently learning equilibria with large state and action spaces in general-sum Markov games while overcoming the curse of multi-agency is a challenging problem. Recent works have attempted to solve this problem by employing independent linear function classes to approximate the marginal $Q$-value for each agent. However, existing sample complexity bounds under such a framework have a suboptimal dependency on the desired accuracy $\varepsilon$ or the action space. In this work, we introduce a new algorithm, Lin-Confident-FTRL, for learning coarse correlated equilibria (CCE) with local access to the simulator, i.e., one can interact with the underlying environment on the visited states. Up to a logarithmic dependence on the size of the state space, Lin-Confident-FTRL learns $ε$-CCE with a provable optimal accuracy bound $O(ε^{-2})$ and gets rids of the linear dependency on the action space, while scaling polynomially with relevant problem parameters (such as the number of agents and time horizon). Moreover, our analysis of Linear-Confident-FTRL generalizes the virtual policy iteration technique in the single-agent local planning literature, which yields a new computationally efficient algorithm with a tighter sample complexity bound when assuming random access to the simulator.
LGMar 2, 2025
Parameter-Adaptive Dynamic PricingXueping Gong, Jiheng Zhang
Dynamic pricing is crucial in sectors like e-commerce and transportation, balancing exploration of demand patterns and exploitation of pricing strategies. Existing methods often require precise knowledge of the demand function, e.g., the H{ö}lder smoothness level and Lipschitz constant, limiting practical utility. This paper introduces an adaptive approach to address these challenges without prior parameter knowledge. By partitioning the demand function's domain and employing a linear bandit structure, we develop an algorithm that manages regret efficiently, enhancing flexibility and practicality. Our Parameter-Adaptive Dynamic Pricing (PADP) algorithm outperforms existing methods, offering improved regret bounds and extensions for contextual information. Numerical experiments validate our approach, demonstrating its superiority in handling unknown demand parameters.
LGJun 24, 2024
Minimax Optimality in Contextual Dynamic Pricing with General Valuation ModelsXueping Gong, Wei You, Jiheng Zhang
We study contextual dynamic pricing, where a decision maker posts personalized prices based on observable contexts and receives binary purchase feedback indicating whether the customer's valuation exceeds the price. Each valuation is modeled as an unknown latent function of the context, corrupted by independent and identically distributed market noise from an unknown distribution. Relying only on Lipschitz continuity of the noise distribution and bounded valuations, we propose a minimax-optimal algorithm. To accommodate the unknown distribution, our method discretizes the relevant noise range to form a finite set of candidate prices, then applies layered data partitioning to obtain confidence bounds substantially tighter than those derived via the elliptical-potential lemma. A key advantage is that estimation bias in the valuation function cancels when comparing upper confidence bounds, eliminating the need to know the Lipschitz constant. The framework extends beyond linear models to general function classes through offline regression oracles. Our regret analysis depends solely on the oracle's estimation error, typically governed by the statistical complexity of the class. These techniques yield a regret upper bound matching the minimax lower bound up to logarithmic factors. Furthermore, we refine these guarantees under additional structures -- e.g., linear valuation models, second-order smoothness, sparsity, and known noise distribution or observable valuations -- and compare our bounds and assumptions with prior dynamic-pricing methods. Finally, numerical experiments corroborate the theory and show clear improvements over benchmark methods.
CRJul 22, 2021
Improving Blockchain Consistency Bound by Assigning Weights to Random BlocksQing Zhang, Xueping Gong, Huizhong Li et al.
Blockchains based on the celebrated Nakamoto consensus protocol have shown promise in several applications, including cryptocurrencies. However, these blockchains have inherent scalability limits caused by the protocol's consensus properties. In particular, the \emph{consistency} property demonstrates a tight trade-off between block production speed and the system's security in terms of resisting adversarial attacks. This paper proposes a novel method, Ironclad, that improves blockchain consistency bound by assigning a different weight to randomly selected blocks. We apply our method to the original Nakamoto protocol and rigorously prove that such a combination can improve the consistency bound significantly by analyzing the fundamental consensus properties. Such an improvement enables a much faster block production rate than the original Nakamoto protocol under the same security guarantee with the same proportion of malicious mining power (see Figure 1).
MLJun 7, 2021
Generalized Linear Bandits with Local Differential PrivacyYuxuan Han, Zhipeng Liang, Yang Wang et al.
Contextual bandit algorithms are useful in personalized online decision-making. However, many applications such as personalized medicine and online advertising require the utilization of individual-specific information for effective learning, while user's data should remain private from the server due to privacy concerns. This motivates the introduction of local differential privacy (LDP), a stringent notion in privacy, to contextual bandits. In this paper, we design LDP algorithms for stochastic generalized linear bandits to achieve the same regret bound as in non-privacy settings. Our main idea is to develop a stochastic gradient-based estimator and update mechanism to ensure LDP. We then exploit the flexibility of stochastic gradient descent (SGD), whose theoretical guarantee for bandit problems is rarely explored, in dealing with generalized linear bandits. We also develop an estimator and update mechanism based on Ordinary Least Square (OLS) for linear bandits. Finally, we conduct experiments with both simulation and real-world datasets to demonstrate the consistently superb performance of our algorithms under LDP constraints with reasonably small parameters $(\varepsilon, δ)$ to ensure strong privacy protection.