Jiaming Liang

CV
h-index31
35papers
879citations
Novelty53%
AI Score60

35 Papers

LGJun 2, 2023
RITA: Group Attention is All You Need for Timeseries Analytics

Jiaming Liang, Lei Cao, Samuel Madden et al.

Timeseries analytics is of great importance in many real-world applications. Recently, the Transformer model, popular in natural language processing, has been leveraged to learn high quality feature embeddings from timeseries, core to the performance of various timeseries analytics tasks. However, the quadratic time and space complexities limit Transformers' scalability, especially for long timeseries. To address these issues, we develop a timeseries analytics tool, RITA, which uses a novel attention mechanism, named group attention, to address this scalability issue. Group attention dynamically clusters the objects based on their similarity into a small number of groups and approximately computes the attention at the coarse group granularity. It thus significantly reduces the time and space complexity, yet provides a theoretical guarantee on the quality of the computed attention. The dynamic scheduler of RITA continuously adapts the number of groups and the batch size in the training process, ensuring group attention always uses the fewest groups needed to meet the approximation quality requirement. Extensive experiments on various timeseries datasets and analytics tasks demonstrate that RITA outperforms the state-of-the-art in accuracy and is significantly faster -- with speedups of up to 63X.

SPMar 26, 2023
Driver Profiling and Bayesian Workload Estimation Using Naturalistic Peripheral Detection Study Data

Nermin Caber, Bashar I. Ahmad, Jiaming Liang et al.

Monitoring drivers' mental workload facilitates initiating and maintaining safe interactions with in-vehicle information systems, and thus delivers adaptive human machine interaction with reduced impact on the primary task of driving. In this paper, we tackle the problem of workload estimation from driving performance data. First, we present a novel on-road study for collecting subjective workload data via a modified peripheral detection task in naturalistic settings. Key environmental factors that induce a high mental workload are identified via video analysis, e.g. junctions and behaviour of vehicle in front. Second, a supervised learning framework using state-of-the-art time series classifiers (e.g. convolutional neural network and transform techniques) is introduced to profile drivers based on the average workload they experience during a journey. A Bayesian filtering approach is then proposed for sequentially estimating, in (near) real-time, the driver's instantaneous workload. This computationally efficient and flexible method can be easily personalised to a driver (e.g. incorporate their inferred average workload profile), adapted to driving/environmental contexts (e.g. road type) and extended with data streams from new sources. The efficacy of the presented profiling and instantaneous workload estimation approaches are demonstrated using the on-road study data, showing $F_{1}$ scores of up to 92% and 81%, respectively.

FLU-DYNOct 31, 2022
GotFlow3D: Recurrent Graph Optimal Transport for Learning 3D Flow Motion in Particle Tracking

Jiaming Liang, Chao Xu, Shengze Cai

Flow visualization technologies such as particle tracking velocimetry (PTV) are broadly used in understanding the all-pervasiveness three-dimensional (3D) turbulent flow from nature and industrial processes. Despite the advances in 3D acquisition techniques, the developed motion estimation algorithms in particle tracking remain great challenges of large particle displacements, dense particle distributions and high computational cost. By introducing a novel deep neural network based on recurrent Graph Optimal Transport, called GotFlow3D, we present an end-to-end solution to learn the 3D fluid flow motion from double-frame particle sets. The proposed network constructs two graphs in the geometric and feature space and further enriches the original particle representations with the fused intrinsic and extrinsic features learnt from a graph neural network. The extracted deep features are subsequently utilized to make optimal transport plans indicating the correspondences of particle pairs, which are then iteratively and adaptively retrieved to guide the recurrent flow learning. Experimental evaluations, including assessments on numerical experiments and validations on real-world experiments, demonstrate that the proposed GotFlow3D achieves state-of-the-art performance against both recently-developed scene flow learners and particle tracking algorithms, with impressive accuracy, robustness and generalization ability, which can provide deeper insight into the complex dynamics of broad physical and biological systems.

IVMar 16, 2023
SigVIC: Spatial Importance Guided Variable-Rate Image Compression

Jiaming Liang, Meiqin Liu, Chao Yao et al.

Variable-rate mechanism has improved the flexibility and efficiency of learning-based image compression that trains multiple models for different rate-distortion tradeoffs. One of the most common approaches for variable-rate is to channel-wisely or spatial-uniformly scale the internal features. However, the diversity of spatial importance is instructive for bit allocation of image compression. In this paper, we introduce a Spatial Importance Guided Variable-rate Image Compression (SigVIC), in which a spatial gating unit (SGU) is designed for adaptively learning a spatial importance mask. Then, a spatial scaling network (SSN) takes the spatial importance mask to guide the feature scaling and bit allocation for variable-rate. Moreover, to improve the quality of decoded image, Top-K shallow features are selected to refine the decoded features through a shallow feature fusion module (SFFM). Experiments show that our method outperforms other learning-based methods (whether variable-rate or not) and traditional codecs, with storage saving and high flexibility.

CVMay 9Code
Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression

Jiaming Liang, Chi-Man Pun, Weisi Lin et al.

Learned image compression (LIC) integrates deep neural networks (DNNs) to map high-dimensional images into compact latent representations, reducing redundancy and achieving superior rate-distortion (RD) performance in benign settings. Unfortunately, due to inherent vulnerabilities in DNNs, LIC systems are susceptible to adversarial perturbations that lead to downstream deterioration, compression rate degradation, untargeted distortion, and both local semantic manipulation (LSM) and low-resolution ($3\times28\times28$) global semantic manipulation (GSM). However, high-resolution GSM remains unexplored due to its intractability. Notably, the existing project gradient descent (PGD) method achieves near-perfect white-box attacks for classification, segmentation, and other tasks, yet fails to generalize to high-resolution GSM. Our theoretical and empirical analyses reveal that well-performing GSM drives adversarial examples from the Identity Region to the Amplification Region through the Lazying-Oscillating-Refining stages. General $\ell_{\infty}$-bounded attacks fail on high-resolution GSM because their step-size schedules cannot accommodate both the Oscillating and Refining stages. Based on this, we propose the Periodic Geometric Decay schedule that enables $\ell_{\infty}$-bounded high-resolution GSM. To verify our approach, we integrate it with PGD, yielding a minimal variant, PGD$^{2}$-GSM. Extensive experiments on the Kodak $(3\times768\times512)$ demonstrate that our PGD$^{2}$-GSM is the first to stably achieve high-resolution GSM, thereby exposing a novel threat to LIC systems. Code is available at https://github.com/chinaliangjiaming/PGD2-GSM.

ROApr 22
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

Tianle Zhang, Zhihao Yuan, Dafeng Chi et al.

Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while relatively large differences across robot embodiments impede effective behavior knowledge transfer. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA presents a multi-source multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, especially on diverse tasks with generalization demands.

LGFeb 12Code
SafeNeuron: Neuron-Level Safety Alignment for Large Language Models

Zhaoxin Wang, Jiaming Liang, Fengbin Zhu et al.

Large language models (LLMs) and multimodal LLMs are typically safety-aligned before release to prevent harmful content generation. However, recent studies show that safety behaviors are concentrated in a small subset of parameters, making alignment brittle and easily bypassed through neuron-level attacks. Moreover, most existing alignment methods operate at the behavioral level, offering limited control over the model's internal safety mechanisms. In this work, we propose SafeNeuron, a neuron-level safety alignment framework that improves robustness by redistributing safety representations across the network. SafeNeuron first identifies safety-related neurons, then freezes these neurons during preference optimization to prevent reliance on sparse safety pathways and force the model to construct redundant safety representations. Extensive experiments across models and modalities demonstrate that SafeNeuron significantly improves robustness against neuron pruning attacks, reduces the risk of open-source models being repurposed as red-team generators, and preserves general capabilities. Furthermore, our layer-wise analysis reveals that safety behaviors are governed by stable and shared internal representations. Overall, SafeNeuron provides an interpretable and robust perspective for model alignment.

LGMay 7Code
Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

Haydn Jones, Yimeng Zeng, Alden Rose et al.

Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.

LGMay 20, 2022
A Proximal Algorithm for Sampling from Non-convex Potentials

Jiaming Liang, Yongxin Chen

We study sampling problems associated with non-convex potentials that meanwhile lack smoothness. In particular, we consider target distributions that satisfy either logarithmic-Sobolev inequality or Poincaré inequality. Rather than smooth, the potentials are assumed to be semi-smooth or the summation of multiple semi-smooth functions. We develop a sampling algorithm that resembles proximal algorithms in optimization for this challenging sampling task. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling in the non-convex and semi-smooth setting. This work extends the recent algorithm in \cite{LiaChe21,LiaChe22} for non-smooth/semi-smooth log-concave distribution to the setting with non-convex potentials. In almost all the cases of sampling considered in this work, our proximal sampling algorithm achieves better complexity than all existing methods.

OCMay 11
Log-Averaged Mirror Prox for Fast, Large-Scale Optimal Transport in Linear Space

Matthew X. Burns, Jiaming Liang

We propose Log-Averaged Mirror Prox (LAMP), a linear-space primal-dual method for large-scale optimal transport. LAMP implements primal mirror prox updates by tracking an averaged dual sequence, reducing storage complexity from ${O}(nm)$ to $O(n+m)$ while preserving dense, GPU-friendly reductions. Consequently, LAMP preserves the last-iterate $\widetilde{O}( nm\varepsilon^{-1})$ arithmetic complexity of conservatively parameterized primal-dual mirror prox. We further analyze LAMP as a direct optimal transport solver in a more performant parameter regime, providing a last-iterate sub-optimality certificate dependent on infeasibility and an explicit $O(1/t)$ term. Moreover, we give a computable sufficient condition for best-iterate convergence to a saddle-point. Numerical experiments with an optimized CUDA implementation show that LAMP outperforms first-order baselines in several high-accuracy (entropic) optimal transport problems. LAMP is further shown to scale up to problems with $n=m=2^{18}$ marginal supports, which were previously beyond the reach of primal-dual first-order methods.

CVMay 10, 2022
STDC-MA Network for Semantic Segmentation

Xiaochun Lei, Linjun Lu, Zetao Jiang et al.

Semantic segmentation is applied extensively in autonomous driving and intelligent transportation with methods that highly demand spatial and semantic information. Here, an STDC-MA network is proposed to meet these demands. First, the STDC-Seg structure is employed in STDC-MA to ensure a lightweight and efficient structure. Subsequently, the feature alignment module (FAM) is applied to understand the offset between high-level and low-level features, solving the problem of pixel offset related to upsampling on the high-level feature map. Our approach implements the effective fusion between high-level features and low-level features. A hierarchical multiscale attention mechanism is adopted to reveal the relationship among attention regions from two different input sizes of one image. Through this relationship, regions receiving much attention are integrated into the segmentation results, thereby reducing the unfocused regions of the input image and improving the effective utilization of multiscale features. STDC- MA maintains the segmentation speed as an STDC-Seg network while improving the segmentation accuracy of small objects. STDC-MA was verified on the verification set of Cityscapes. The segmentation result of STDC-MA attained 76.81% mIOU with the input of 0.5x scale, 3.61% higher than STDC-Seg.

ROApr 26Code
EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks

Yihang Li, Xuelong Wei, Jingzhou Luo et al.

The advancement of robot learning is currently hindered by the scarcity of large-scale, high-quality datasets. While established data collection methods such as teleoperation and universal manipulation interfaces dominate current datasets, they suffer from inherent limitations in scalability and real-world deployability. Human egocentric video collection, by contrast, has emerged as a promising approach to enable scalable, natural and in-the-wild data collection. As such, we present EgoLive, a large-scale, high-quality egocentric dataset designed explicitly for robot manipulation learning. EgoLive establishes three distinctive technical advantages over existing egocentric datasets: first, it represents the largest open-source annotated egocentric dataset focused on real-world task-oriented human routines to date; second, it delivers leading data quality via a customized head-mounted capture device and comprehensive high-precision multi-modal annotations; third, all data is collected exclusively in unconstrained real-world scenarios and encompasses vertical field human working data, including home service, retail, and other practical work scenarios, providing superior diversity and ecological validity. With the introduction of EgoLive, we aim to provide the research community with a scalable, high-quality dataset that accelerates breakthroughs in generalizable robotic models and facilitates the real-world deployment of robot systems.

CVJan 24
OTI: A Model-free and Visually Interpretable Measure of Image Attackability

Jiaming Liang, Haowei Liu, Chi-Man Pun

Despite the tremendous success of neural networks, benign images can be corrupted by adversarial perturbations to deceive these models. Intriguingly, images differ in their attackability. Specifically, given an attack configuration, some images are easily corrupted, whereas others are more resistant. Evaluating image attackability has important applications in active learning, adversarial training, and attack enhancement. This prompts a growing interest in developing attackability measures. However, existing methods are scarce and suffer from two major limitations: (1) They rely on a model proxy to provide prior knowledge (e.g., gradients or minimal perturbation) to extract model-dependent image features. Unfortunately, in practice, many task-specific models are not readily accessible. (2) Extracted features characterizing image attackability lack visual interpretability, obscuring their direct relationship with the images. To address these, we propose a novel Object Texture Intensity (OTI), a model-free and visually interpretable measure of image attackability, which measures image attackability as the texture intensity of the image's semantic object. Theoretically, we describe the principles of OTI from the perspectives of decision boundaries as well as the mid- and high-frequency characteristics of adversarial perturbations. Comprehensive experiments demonstrate that OTI is effective and computationally efficient. In addition, our OTI provides the adversarial machine learning community with a visual understanding of attackability.

CVMar 27
SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection

Jiaming Liang, Yifeng Zhan, Chunlin Liu et al.

Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision--language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector's ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.

CVMar 26
A Unified Spatial Alignment Framework for Highly Transferable Transformation-Based Attacks on Spatially Structured Tasks

Jiaming Liang, Chi-Man Pun

Transformation-based adversarial attacks (TAAs) demonstrate strong transferability when deceiving classification models. However, existing TAAs often perform unsatisfactorily or even fail when applied to structured tasks such as semantic segmentation and object detection. Encouragingly, recent studies that categorize transformations into non-spatial and spatial transformations inspire us to address this challenge. We find that for non-structured tasks, labels are spatially non-structured, and thus TAAs are not required to adjust labels when applying spatial transformations. In contrast, for structured tasks, labels are spatially structured, and failing to transform labels synchronously with inputs can cause spatial misalignment and yield erroneous gradients. To address these issues, we propose a novel unified Spatial Alignment Framework (SAF) for highly transferable TAAs on spatially structured tasks, where the TAAs spatially transform labels synchronously with the input using the proposed Spatial Alignment (SA) algorithm. Extensive experiments demonstrate the crucial role of our SAF for TAAs on structured tasks. Specifically, in non-targeted attacks, our SAF degrades the average mIoU on Cityscapes from 24.50 to 11.34, and on Kvasir-SEG from 49.91 to 31.80, while reducing the average mAP of COCO from 17.89 to 5.25.

CVMar 26
Efficient Preemptive Robustification with Image Sharpening

Jiaming Liang, Chi-Man Pun

Despite their great success, deep neural networks rely on high-dimensional, non-robust representations, making them vulnerable to imperceptible perturbations, even in transfer scenarios. To address this, both training-time defenses (e.g., adversarial training and robust architecture design) and post-attack defenses (e.g., input purification and adversarial detection) have been extensively studied. Recently, a limited body of work has preliminarily explored a pre-attack defense paradigm, termed preemptive robustification, which introduces subtle modifications to benign samples prior to attack to proactively resist adversarial perturbations. Unfortunately, their practical applicability remains questionable due to several limitations, including (1) reliance on well-trained classifiers as surrogates to provide robustness priors, (2) substantial computational overhead arising from iterative optimization or trained generators for robustification, and (3) limited interpretability of the optimization- or generation-based robustification processes. Inspired by recent studies revealing a positive correlation between texture intensity and the robustness of benign samples, we show that image sharpening alone can efficiently robustify images. To the best of our knowledge, this is the first surrogate-free, optimization-free, generator-free, and human-interpretable robustification approach. Extensive experiments demonstrate that sharpening yields remarkable robustness gains with low computational cost, especially in transfer scenarios.

CVNov 4, 2025Code
MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation

Jiawen Liu, Yuanbo Zeng, Jiaming Liang et al.

Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph, state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64 % on DRIVE and 1.25 % on STARE, demonstrating its effectiveness and advancement. The project code is public via https://github.com/liujiawen-jpg/MM-UNet.

CVOct 16, 2023
GreatSplicing: A Semantically Rich Splicing Dataset

Jiaming Liang, Yuwan Xue, Haowei Liu et al.

In existing splicing forgery datasets, the insufficient semantic variety of spliced regions causes trained detection models to overfit semantic features rather than learn genuine splicing traces. Meanwhile, the lack of a reasonable benchmark dataset has led to inconsistent experimental settings across existing detection methods. To address these issues, we propose GreatSplicing, a manually created, large-scale, high-quality splicing dataset. GreatSplicing comprises 5,000 spliced images and covers spliced regions across 335 distinct semantic categories, enabling detection models to learn splicing traces more effectively. Empirical results show that detection models trained on GreatSplicing achieve low misidentification rates and stronger cross-dataset generalization compared to existing datasets. GreatSplicing is now publicly available for research purposes at the following link.

CVMar 29, 2024Code
Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions

Runhao Zeng, Xiaoyong Chen, Jiaming Liang et al.

Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can be occasionally corrupted, such as missing or blurred frames. Interestingly, existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness, we establish two temporal corruption robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions, and end-to-end methods are often more susceptible than those with a pre-trained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance, TAD models tend to yield the largest performance drop. Besides building a benchmark, we further develop a simple but effective robust training method to defend against temporal corruptions, through the FrameDrop augmentation and Temporal-Robust Consistency loss. Remarkably, our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.

CVNov 15, 2025
Dynamic Parameter Optimization for Highly Transferable Transformation-Based Attacks

Jiaming Liang, Chi-Man Pun

Despite their wide application, the vulnerabilities of deep neural networks raise societal concerns. Among them, transformation-based attacks have demonstrated notable success in transfer attacks. However, existing attacks suffer from blind spots in parameter optimization, limiting their full potential. Specifically, (1) prior work generally considers low-iteration settings, yet attacks perform quite differently at higher iterations, so characterizing overall performance based only on low-iteration results is misleading. (2) Existing attacks use uniform parameters for different surrogate models, iterations, and tasks, which greatly impairs transferability. (3) Traditional transformation parameter optimization relies on grid search. For n parameters with m steps each, the complexity is O(mn). Large computational overhead limits further optimization of parameters. To address these limitations, we conduct an empirical study with various transformations as baselines, revealing three dynamic patterns of transferability with respect to parameter strength. We further propose a novel Concentric Decay Model (CDM) to effectively explain these patterns. Building on these insights, we propose an efficient Dynamic Parameter Optimization (DPO) based on the rise-then-fall pattern, reducing the complexity to O(nlogm). Comprehensive experiments on existing transformation-based attacks across different surrogate models, iterations, and tasks demonstrate that our DPO can significantly improve transferability.

IVMay 15, 2025Code
HWA-UNETR: Hierarchical Window Aggregate UNETR for 3D Multimodal Gastric Lesion Segmentation

Jiaming Liang, Lihuan Dai, Xiaoqi Sheng et al.

Multimodal medical image segmentation faces significant challenges in the context of gastric cancer lesion analysis. This clinical context is defined by the scarcity of independent multimodal datasets and the imperative to amalgamate inherently misaligned modalities. As a result, algorithms are constrained to train on approximate data and depend on application migration, leading to substantial resource expenditure and a potential decline in analysis accuracy. To address those challenges, we have made two major contributions: First, we publicly disseminate the GCM 2025 dataset, which serves as the first large-scale, open-source collection of gastric cancer multimodal MRI scans, featuring professionally annotated FS-T2W, CE-T1W, and ADC images from 500 patients. Second, we introduce HWA-UNETR, a novel 3D segmentation framework that employs an original HWA block with learnable window aggregation layers to establish dynamic feature correspondences between different modalities' anatomical structures, and leverages the innovative tri-orientated fusion mamba mechanism for context modeling and capturing long-range spatial dependencies. Extensive experiments on our GCM 2025 dataset and the publicly BraTS 2021 dataset validate the performance of our framework, demonstrating that the new approach surpasses existing methods by up to 1.68\% in the Dice score while maintaining solid robustness. The dataset and code are public via https://github.com/JeMing-creater/HWA-UNETR.

LGFeb 26
Multilingual Safety Alignment Via Sparse Weight Editing

Jiaming Liang, Zhaoxin Wang, Handing Wang

Large Language Models (LLMs) exhibit significant safety disparities across languages, with low-resource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and dependent on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint. Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate that our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities, all achieved with a single, data-efficient calculation.

LGAug 14, 2025Code
A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design

Haydn Thomas Jones, Natalie Maus, Josh Magnus Ludan et al.

AI-driven discovery can greatly reduce design time and enhance new therapeutics' effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60\% of molecules proposed had high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts, and appropriate entity representations (i.e. SMILES or refseq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLava architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutic Data Commons (TDC). Medex is highly effective for creating models with strong priors: in supervised prediction problems that use our data as pretraining, our best models with 15M learnable parameters outperform larger 2B TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with Medex can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset at https://huggingface.co/datasets/medexanon/Medex, and will provide expanded versions as available literature grows.

ROMar 23
Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?

Boyang Cai, Qiwei Liang, Jiawei Li et al.

Does multi-view demonstration truly improve robot manipulation, or merely enhance cross-view robustness? We present a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success and generalization. Performance varies non-monotonically with view coverage, revealing effective regimes rather than a simple "more is better" trend. Notably, multi-view data breaks the scaling limitation of single-view datasets and continues to raise performance ceilings after saturation. Mechanistic analysis shows that multi-view learning promotes manipulation-relevant visual representations, better aligns the action head with the learned feature distribution, and reduces overfitting. Motivated by the importance of multi-view data and its scarcity in large-scale robotic datasets, as well as the difficulty of collecting additional viewpoints in real world settings, we propose RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs. The generated data consistently improves downstream policies in both simulation and real-world environments.

ROFeb 27, 2025
CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-scale Reinforcement Learning in Autonomous Driving

Dongkun Zhang, Jiaming Liang, Ke Guo et al.

Trajectory planning is vital for autonomous driving, ensuring safe and efficient navigation in complex environments. While recent learning-based methods, particularly reinforcement learning (RL), have shown promise in specific scenarios, RL planners struggle with training inefficiencies and managing large-scale, real-world driving scenarios. In this paper, we introduce \textbf{CarPlanner}, a \textbf{C}onsistent \textbf{a}uto-\textbf{r}egressive \textbf{Planner} that uses RL to generate multi-modal trajectories. The auto-regressive structure enables efficient large-scale RL training, while the incorporation of consistency ensures stable policy learning by maintaining coherent temporal consistency across time steps. Moreover, CarPlanner employs a generation-selection framework with an expert-guided reward function and an invariant-view module, simplifying RL training and enhancing policy performance. Extensive analysis demonstrates that our proposed RL framework effectively addresses the challenges of training efficiency and performance enhancement, positioning CarPlanner as a promising solution for trajectory planning in autonomous driving. To the best of our knowledge, we are the first to demonstrate that the RL-based planner can surpass both IL- and rule-based state-of-the-arts (SOTAs) on the challenging large-scale real-world dataset nuPlan. Our proposed CarPlanner surpasses RL-, IL-, and rule-based SOTA approaches within this demanding dataset.

MLFeb 16
Constrained and Composite Sampling via Proximal Sampler

Thanh Dang, Jiaming Liang

We study two log-concave sampling problems: constrained sampling and composite sampling. First, we consider sampling from a target distribution with density proportional to $\exp(-f(x))$ supported on a convex set $K \subset \mathbb{R}^d$, where $f$ is convex. The main challenge is enforcing feasibility without degrading mixing. Using an epigraph transformation, we reduce this task to sampling from a nearly uniform distribution over a lifted convex set in $\mathbb{R}^{d+1}$. We then solve the lifted problem using a proximal sampler. Assuming only a separation oracle for $K$ and a subgradient oracle for $f$, we develop an implementation of the proximal sampler based on the cutting-plane method and rejection sampling. Unlike existing constrained samplers that rely on projection, reflection, barrier functions, or mirror maps, our approach enforces feasibility using only minimal oracle access, resulting in a practical and unbiased sampler without knowing the geometry of the constraint set. Second, we study composite sampling, where the target is proportional to $\exp(-f(x)-h(x))$ with closed and convex $f$ and $h$. This composite structure is standard in Bayesian inference with $f$ modeling data fidelity and $h$ encoding prior information. We reduce composite sampling via an epigraph lifting of $h$ to constrained sampling in $\mathbb{R}^{d+1}$, which allows direct application of the constrained sampling algorithm developed in the first part. This reduction results in a double epigraph lifting formulation in $\mathbb{R}^{d+2}$, on which we apply a proximal sampler. By keeping $f$ and $h$ separate, we further demonstrate how different combinations of oracle access (such as subgradient and proximal) can be leveraged to construct separation oracles for the lifted problem. For both sampling problems, we establish mixing time bounds measured in Rényi and $χ^2$ divergences.

OCApr 2, 2024
Proximal Oracles for Optimization and Sampling

Jiaming Liang, Yongxin Chen

We consider convex optimization with non-smooth objective function and log-concave sampling with non-smooth potential (negative log density). In particular, we study two specific settings where the convex objective/potential function is either Hölder smooth or in hybrid form as the finite sum of Hölder smooth components. To overcome the challenges caused by non-smoothness, our algorithms employ two powerful proximal frameworks in optimization and sampling: the proximal point framework for optimization and the alternating sampling framework (ASF) that uses Gibbs sampling on an augmented distribution. A key component of both optimization and sampling algorithms is the efficient implementation of the proximal map by the regularized cutting-plane method. We establish its iteration-complexity under both Hölder smoothness and hybrid settings using novel convergence analysis, yielding results that are new to the literature. We further propose an adaptive proximal bundle method for non-smooth optimization that employs an aggressive adaptive stepsize strategy, which adjusts stepsizes only when necessary and never rejects iterates. The proposed method is universal since it does not need any problem parameters as input. Additionally, we provide an exact implementation of a proximal sampling oracle, analogous to the proximal map in optimization, along with simple complexity analyses for both the Hölder smooth and hybrid cases, using a novel technique based on a modified Gaussian integral. Finally, we combine this proximal sampling oracle and ASF to obtain a Markov chain Monte Carlo method with non-asymptotic complexity bounds for sampling in Hölder smooth and hybrid settings.

OCFeb 14, 2024
Variance Reduction and Low Sample Complexity in Stochastic Optimization via Proximal Point Method

Jiaming Liang

High-probability guarantees in stochastic optimization are often obtained only under strong noise assumptions such as sub-Gaussian tails. We show that such guarantees can also be achieved under the weaker assumption of bounded variance by developing a stochastic proximal point method. This method combines a proximal subproblem solver, which inherently reduces variance, with a probability booster that amplifies per-iteration reliability into high-confidence results. The analysis demonstrates convergence with low sample complexity, without restrictive noise assumptions or reliance on mini-batching.

RONov 25, 2025
Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning

Qiwei Liang, Boyang Cai, Minghao Lai et al.

Despite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state-action-state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a self-supervised framework that learns dynamics-aware 3D representations without action or reconstruction supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy, AFRO substantially increases manipulation success rates across 16 simulated and 4 real-world tasks, outperforming existing pre-training approaches. The framework also scales favorably with data volume and task complexity. Qualitative visualizations indicate that AFRO learns semantically rich, discriminative features, offering an effective pre-training solution for 3D representation learning in robotics. Project page: https://kolakivy.github.io/AFRO/

DSOct 3, 2025
Oracle-based Uniform Sampling from Convex Bodies

Thanh Dang, Jiaming Liang

We propose new Markov chain Monte Carlo algorithms to sample a uniform distribution on a convex body $K$. Our algorithms are based on the Alternating Sampling Framework/proximal sampler, which uses Gibbs sampling on an augmented distribution and assumes access to the so-called restricted Gaussian oracle (RGO). The key contribution of this work is the efficient implementation of RGO for uniform sampling on $K$ via rejection sampling and access to either a projection oracle or a separation oracle on $K$. In both oracle cases, we establish non-asymptotic complexities to obtain unbiased samples where the accuracy is measured in Rényi divergence or $χ^2$-divergence.

CVMar 21, 2025
Temporal Action Detection Model Compression by Progressive Block Drop

Xiaoyong Chen, Yong Guo, Jiaming Liang et al.

Temporal action detection (TAD) aims to identify and localize action instances in untrimmed videos, which is essential for various video understanding tasks. However, recent improvements in model performance, driven by larger feature extractors and datasets, have led to increased computational demands. This presents a challenge for applications like autonomous driving and robotics, which rely on limited computational resources. While existing channel pruning methods can compress these models, reducing the number of channels often hinders the parallelization efficiency of GPU, due to the inefficient multiplication between small matrices. Instead of pruning channels, we propose a Progressive Block Drop method that reduces model depth while retaining layer width. In this way, we still use large matrices for computation but reduce the number of multiplications. Our approach iteratively removes redundant blocks in two steps: first, we drop blocks with minimal impact on model performance; and second, we employ a parameter-efficient cross-depth alignment technique, fine-tuning the pruned model to restore model accuracy. Our method achieves a 25% reduction in computational overhead on two TAD benchmarks (THUMOS14 and ActivityNet-1.3) to achieve lossless compression. More critically, we empirically show that our method is orthogonal to channel pruning methods and can be combined with it to yield further efficiency gains.

DBJun 13, 2024
FeatNavigator: Automatic Feature Augmentation on Tabular Data

Jiaming Liang, Chuan Lei, Xiao Qin et al.

Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, aiming to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and generalizability. While recent works have investigated automatic feature augmentation, most of them have limited capabilities in utilizing all useful features as many of them are in candidate tables not directly joinable with the base table. Worse yet, with numerous join paths leading to these distant features, existing solutions fail to fully exploit them within a reasonable compute budget. We present FeatNavigator, an effective and efficient framework that explores and integrates high-quality features in relational tables for ML models. FeatNavigator evaluates a feature from two aspects: (1) the intrinsic value of a feature towards an ML task (i.e., feature importance) and (2) the efficacy of a join path connecting the feature to the base table (i.e., integration quality). FeatNavigator strategically selects a small set of available features and their corresponding join paths to train a feature importance estimation model and an integration quality prediction model. Furthermore, FeatNavigator's search algorithm exploits both estimated feature importance and integration quality to identify the optimized feature augmentation plan. Our experimental results show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.

LGFeb 28, 2022
A Proximal Algorithm for Sampling

Jiaming Liang, Yongxin Chen

We study sampling problems associated with potentials that lack smoothness. The potentials can be either convex or non-convex. Departing from the standard smooth setting, the potentials are only assumed to be weakly smooth or non-smooth, or the summation of multiple such functions. We develop a sampling algorithm that resembles proximal algorithms in optimization for this challenging sampling task. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling for both non-convex and convex potentials that are not necessarily smooth. In almost all the cases of sampling considered in this work, our proximal sampling algorithm achieves better complexity than all existing methods.

LGOct 9, 2021
A Proximal Algorithm for Sampling from Non-smooth Potentials

Jiaming Liang, Yongxin Chen

In this work, we examine sampling problems with non-smooth potentials. We propose a novel Markov chain Monte Carlo algorithm for sampling from non-smooth potentials. We provide a non-asymptotical analysis of our algorithm and establish a polynomial-time complexity $\tilde {\cal O}(d\varepsilon^{-1})$ to obtain $\varepsilon$ total variation distance to the target density, better than most existing results under the same assumptions. Our method is based on the proximal bundle method and an alternating sampling framework. This framework requires the so-called restricted Gaussian oracle, which can be viewed as a sampling counterpart of the proximal mapping in convex optimization. One key contribution of this work is a fast algorithm that realizes the restricted Gaussian oracle for any convex non-smooth potential with bounded Lipschitz constant.

CLMay 31, 2021
A Semantic-based Method for Unsupervised Commonsense Question Answering

Yilin Niu, Fei Huang, Jiaming Liang et al.

Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data. Among existing work, a popular solution is to use pre-trained language models to score candidate choices directly conditioned on the question or context. However, such scores from language models can be easily affected by irrelevant factors, such as word frequencies, sentence structures, etc. These distracting factors may not only mislead the model to choose a wrong answer but also make it oversensitive to lexical perturbations in candidate answers. In this paper, we present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering. Instead of directly scoring each answer choice, our method first generates a set of plausible answers with generative models (e.g., GPT-2), and then uses these plausible answers to select the correct choice by considering the semantic similarity between each plausible answer and each choice. We devise a simple, yet sound formalism for this idea and verify its effectiveness and robustness with extensive experiments. We evaluate the proposed method on four benchmark datasets, and our method achieves the best results in unsupervised settings. Moreover, when attacked by TextFooler with synonym replacement, SEQA demonstrates much less performance drops than baselines, thereby indicating stronger robustness.