80.0ROMay 22
Point Tracking Improves World Action ModelsJiarui Guan, Wenshuai Zhao, Yue Pei et al.
Robot policy learning benefits from world-action models that capture environment dynamics, but pixel-level prediction entangles dynamics with nuisance factors such as lighting and texture, making learned representations vulnerable to task-irrelevant visual variation. We propose JOPAT, a JOint Pixel-And-Track World-Action Model that predicts latent visual observations, 2D point tracks with visibility, and actions in a single denoising diffusion transformer. The key insight is that tracks provide an explicit representation of motion that captures long-horizon dynamics and remains robust under occlusion or partial out-of-frame motion, offering greater utility than modeling pixel appearance alone. On LIBERO and real-world LeRobot tasks, JOPAT improves over pixel-based baselines, with the largest gains on long-horizon tasks involving occlusion, object interaction, and off-screen motion.
GNFeb 9
AntigenLM: Structure-Aware DNA Language Modeling for InfluenzaYue Pei, Xuebin Chi, Yu Kang
Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.
LGAug 22, 2025
Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement LearningYue Pei, Hongming Zhang, Chao Gao et al.
Offline reinforcement learning (RL) has achieved significant advances in domains such as robotic control, autonomous driving, and medical decision-making. Most existing methods primarily focus on training policies that maximize cumulative returns from a given dataset. However, many real-world applications require precise control over policy performance levels, rather than simply pursuing the best possible return. Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task, enabling the extraction of diverse policies by conditioning on different desired returns. Yet, existing RvS-based transformers, such as Decision Transformer (DT), struggle to reliably align the actual achieved returns with specified target returns, especially when interpolating within underrepresented returns or extrapolating beyond the dataset. To address this limitation, we propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL. Doctor integrates the strengths of supervised learning (SL) and temporal difference (TD) learning by jointly optimizing the action prediction and value estimation. During inference, Doctor introduces a double-check mechanism: actions are first sampled around the desired target returns and then validated with value functions. This ensures more accurate alignment between predicted actions and desired target returns. We evaluate Doctor on the D4RL and EpiCare benchmarks, demonstrating aligned control yields stronger performance and tunable expertise, showing its effectiveness in a wide range of tasks.