AIAug 16, 2025
MOVER: Multimodal Optimal Transport with Volume-based Embedding RegularizationHaochen You, Baojing Liu
Recent advances in multimodal learning have largely relied on pairwise contrastive objectives to align different modalities, such as text, video, and audio, in a shared embedding space. While effective in bi-modal setups, these approaches struggle to generalize across multiple modalities and often lack semantic structure in high-dimensional spaces. In this paper, we propose MOVER, a novel framework that combines optimal transport-based soft alignment with volume-based geometric regularization to build semantically aligned and structured multimodal representations. By integrating a transport-guided matching mechanism with a geometric volume minimization objective (GAVE), MOVER encourages consistent alignment across all modalities in a modality-agnostic manner. Experiments on text-video-audio retrieval tasks demonstrate that MOVER significantly outperforms prior state-of-the-art methods in both zero-shot and finetuned settings. Additional analysis shows improved generalization to unseen modality combinations and stronger structural consistency in the learned embedding space.
LGAug 24, 2025
Modular MeanFlow: Towards Stable and Scalable One-Step Generative ModelingHaochen You, Baojing Liu, Hongyang He
One-step generative modeling seeks to generate high-quality data samples in a single function evaluation, significantly improving efficiency over traditional diffusion or flow-based models. In this work, we introduce Modular MeanFlow (MMF), a flexible and theoretically grounded approach for learning time-averaged velocity fields. Our method derives a family of loss functions based on a differential identity linking instantaneous and average velocities, and incorporates a gradient modulation mechanism that enables stable training without sacrificing expressiveness. We further propose a curriculum-style warmup schedule to smoothly transition from coarse supervision to fully differentiable training. The MMF formulation unifies and generalizes existing consistency-based and flow-matching methods, while avoiding expensive higher-order derivatives. Empirical results across image synthesis and trajectory modeling tasks demonstrate that MMF achieves competitive sample quality, robust convergence, and strong generalization, particularly under low-data or out-of-distribution settings.
LGFeb 1
Generalized Radius and Integrated Codebook Transforms for Differentiable Vector QuantizationHaochen You, Heng Zhang, Hongyang He et al.
Vector quantization (VQ) underpins modern generative and representation models by turning continuous latents into discrete tokens. Yet hard nearest-neighbor assignments are non-differentiable and are typically optimized with heuristic straight-through estimators, which couple the update step size to the quantization gap and train each code in isolation, leading to unstable gradients and severe codebook under-utilization at scale. In this paper, we introduce GRIT-VQ (Generalized Radius and Integrated Transform-Vector Quantization), a unified surrogate framework that keeps hard assignments in the forward pass while making VQ fully differentiable. GRIT-VQ replaces the straight-through estimator with a radius-based update that moves latents along the quantization direction with a controllable, geometry-aware step, and applies a data-agnostic integrated transform to the codebook so that all codes are updated through shared parameters instead of independently. Our theoretical analysis clarifies the fundamental optimization dynamics introduced by GRIT-VQ, establishing conditions for stable gradient flow, coordinated codebook evolution, and reliable avoidance of collapse across a broad family of quantizers. Across image reconstruction, image generation, and recommendation tokenization benchmarks, GRIT-VQ consistently improves reconstruction error, generative quality, and recommendation accuracy while substantially increasing codebook utilization compared to existing VQ variants.
CLOct 2, 2025
ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context ReasoningHaochen You, Baojing Liu
While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization - largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary innovations: Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and Self-Organizing Encoder Structure (SOES) for position-free structure induction. ReSSFormer replaces conventional depth stacking with recurrent inference, substitutes full attention with token- and expert-level sparsity, and models latent token topology directly from content. Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets, highlighting its scalability, efficiency, and structural flexibility.
LGOct 2, 2025
Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude ControlHaochen You, Baojing Liu
Gradient clipping is widely used to stabilize deep network training, but its formulation as a hard, fixed threshold limits flexibility and ignores gradient distribution dynamics. We propose SPAMP (Statistical Per-layer Adaptive Modulation and Projection), a unified framework that generalizes clipping into smooth, per-layer gradient shaping. SPAMP tracks local gradient statistics, dynamically estimates thresholds, and applies power-based transformations to modulate update magnitudes in a differentiable manner. This perspective recasts clipping and warmup as dual mechanisms for controlling the effective update scale $η_t \|g_t\|$, offering a principled alternative to rigid heuristics. Extensive experiments across image and language tasks demonstrate that SPAMP improves stability, convergence, and robustness over existing methods.
LGSep 7, 2025
MCIGLE: Multimodal Exemplar-Free Class-Incremental Graph LearningHaochen You, Baojing Liu
Exemplar-free class-incremental learning enables models to learn new classes over time without storing data from old ones. As multimodal graph-structured data becomes increasingly prevalent, existing methods struggle with challenges like catastrophic forgetting, distribution bias, memory limits, and weak generalization. We propose MCIGLE, a novel framework that addresses these issues by extracting and aligning multimodal graph features and applying Concatenated Recursive Least Squares for effective knowledge retention. Through multi-channel processing, MCIGLE balances accuracy and memory preservation. Experiments on public datasets validate its effectiveness and generalizability.
LGSep 7, 2025
Metric Embedding Initialization-Based Differentially Private and Explainable Graph ClusteringHaochen You, Baojing Liu
Graph clustering under the framework of differential privacy, which aims to process graph-structured data while protecting individual privacy, has been receiving increasing attention. Despite significant achievements in current research, challenges such as high noise, low efficiency and poor interpretability continue to severely constrain the development of this field. In this paper, we construct a differentially private and interpretable graph clustering approach based on metric embedding initialization. Specifically, we construct an SDP optimization, extract the key set and provide a well-initialized clustering configuration using an HST-based initialization method. Subsequently, we apply an established k-median clustering strategy to derive the cluster results and offer comparative explanations for the query set through differences from the cluster centers. Extensive experiments on public datasets demonstrate that our proposed framework outperforms existing methods in various clustering metrics while strictly ensuring privacy.