CVJun 2Code
MemoGen: Can Past Experience Improve Future Text-to-Image Generation?Wenshuo Chen, Kuimou Yu, Bowen Tian et al.
Modern text-to-image models have achieved strong visual synthesis, yet remain unreliable when prompts require implicit visual constraints, relational reasoning, or external knowledge. Existing retrieval-augmented and agentic generation methods mitigate this issue by acquiring external knowledge, references, or refined prompts for the current request, yet they typically treat each generation as an isolated episode and do not systematically preserve past successes or failures for future use. In this work, we ask whether a text-to-image system can continually improve from its own generation experience without updating the underlying generator. We propose MemoGen, a training-free framework that augments existing image generators with an agentic evolution layer. For each task, MemoGen explicitly infers visual requirements, retrieves external evidence and references when necessary, translates them into executable generation constraints, evaluates the generated result, and stores task understanding, reference choices, visual feedback, successful strategies, and failure lessons as reusable experience memory. Across evolution rounds, the agent retrieves relevant experience to improve similar future generations, selectively repairing previously failed cases while preserving successful ones, thereby enabling test-time self-evolution without parameter updates. Extensive experiments on knowledge-intensive and reasoning-oriented benchmarks demonstrate the effectiveness of this paradigm: after only two evolution rounds, MemoGen built upon the open-source Qwen-Image backbone surpasses strong proprietary systems such as Nano Banana Pro and GPT-Image-1 on WISE and Mind-Bench, showing that explicit experience memory can serve as a powerful continual learning signal for reliable text-to-image generation.
AIMay 31
AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian SurpriseBowen Tian, Caixue He, Jiemin Wu et al.
Editing complex, long-form knowledge in Large Language Models remains a significant challenge due to the difficulty of maintaining generation coherence. Existing autoregressive methods like AnyEdit alleviate length constraints but rely on Fixed-window Chunking, which disregards logical structure and compromises consistency. To address this, we present AnyEdit++, a structure-aware framework incorporating Bayes-Chunk, an adaptive segmentation mechanism that dynamically identifies semantic boundaries based on Bayesian Surprise. We underpin this approach with a theoretical framework establishing two key principles: (1) Structural Independence: we prove that cross-segment interference is minimized when anchor keys are geometrically orthogonal (a condition naturally satisfied by our surprisal-based boundaries but violated by fixed windows), and (2) Causal Locality: we demonstrate that updates injected at these semantic peaks yield strictly superior control compared to arbitrary split points. Extensive experiments across mathematical reasoning, code generation, and narrative tasks demonstrate that AnyEdit++ achieves superior performance and robustness compared to state-of-the-art baselines, validating that structural awareness is critical for effective long-form knowledge editing.
LGMay 20Code
Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation AlignmentHaozhe Jia, Pengyu Yin, Wenshuo Chen et al.
Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce **REPA-P**, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight $1{\times}1$ projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing **zero overhead**. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to $2{\times}$, reduces physics residuals by up to $66.4\%$, and improves out-of-distribution robustness by up to $49.3\%$, with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at [https://github.com/Hxxxz0/REPA-P](https://github.com/Hxxxz0/REPA-P).
CVSep 5, 2024Code
PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised LearningBowen Tian, Songning Lai, Lujundong Li et al.
Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce Precision-Enhanced Pseudo-Labeling(PEPL) approach specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness.Our code has been open sourced at https://github.com/TianSuya/SemiFG.
LGApr 28, 2022
Anomaly Detection by Leveraging Incomplete Anomalous Knowledge with Anomaly-Aware Bidirectional GANsBowen Tian, Qinliang Su, Jian Yin
The goal of anomaly detection is to identify anomalous samples from normal ones. In this paper, a small number of anomalies are assumed to be available at the training stage, but they are assumed to be collected only from several anomaly types, leaving the majority of anomaly types not represented in the collected anomaly dataset at all. To effectively leverage this kind of incomplete anomalous knowledge represented by the collected anomalies, we propose to learn a probability distribution that can not only model the normal samples, but also guarantee to assign low density values for the collected anomalies. To this end, an anomaly-aware generative adversarial network (GAN) is developed, which, in addition to modeling the normal samples as most GANs do, is able to explicitly avoid assigning probabilities for collected anomalous samples. Moreover, to facilitate the computation of anomaly detection criteria like reconstruction error, the proposed anomaly-aware GAN is designed to be bidirectional, attaching an encoder for the generator. Extensive experimental results demonstrate that our proposed method is able to effectively make use of the incomplete anomalous information, leading to significant performance gains compared to existing methods.
CLAug 20, 2024
QUITO-X: A New Perspective on Context Compression from the Information Bottleneck TheoryYihang Wang, Xu Huang, Bowen Tian et al.
Generative LLM have achieved remarkable success in various industrial applications, owing to their promising In-Context Learning capabilities. However, the issue of long context in complex tasks poses a significant barrier to their wider adoption, manifested in two main aspects: (i) The excessively long context leads to high costs and inference delays. (ii) A substantial amount of task-irrelevant information introduced by long contexts exacerbates the "lost in the middle" problem. Existing methods compress context by removing redundant tokens using metrics such as self-information or PPL, which is inconsistent with the objective of retaining the most important tokens when conditioning on a given query. In this study, we introduce information bottleneck theory (IB) to model the problem, offering a novel perspective that thoroughly addresses the essential properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate mutual information in IB, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate compared to the state-of-the-art, while maintaining question answering performance. In particular, the context compressed by our method even outperform the full context in some cases.
LGFeb 3, 2023
Leveraging Contaminated Datasets to Learn Clean-Data Distribution with Purified Generative Adversarial NetworksBowen Tian, Qinliang Su, Jianxing Yu
Generative adversarial networks (GANs) are known for their strong abilities on capturing the underlying distribution of training instances. Since the seminal work of GAN, many variants of GAN have been proposed. However, existing GANs are almost established on the assumption that the training dataset is clean. But in many real-world applications, this may not hold, that is, the training dataset may be contaminated by a proportion of undesired instances. When training on such datasets, existing GANs will learn a mixture distribution of desired and contaminated instances, rather than the desired distribution of desired data only (target distribution). To learn the target distribution from contaminated datasets, two purified generative adversarial networks (PuriGAN) are developed, in which the discriminators are augmented with the capability to distinguish between target and contaminated instances by leveraging an extra dataset solely composed of contamination instances. We prove that under some mild conditions, the proposed PuriGANs are guaranteed to converge to the distribution of desired instances. Experimental results on several datasets demonstrate that the proposed PuriGANs are able to generate much better images from the desired distribution than comparable baselines when trained on contaminated datasets. In addition, we also demonstrate the usefulness of PuriGAN on downstream applications by applying it to the tasks of semi-supervised anomaly detection on contaminated datasets and PU-learning. Experimental results show that PuriGAN is able to deliver the best performance over comparable baselines on both tasks.
CVOct 10, 2025Code
RadioFlow: Efficient Radio Map Construction Framework with Flow MatchingHaozhe Jia, Wenshuo Chen, Xiucheng Wang et al.
Accurate and real-time radio map (RM) generation is crucial for next-generation wireless systems, yet diffusion-based approaches often suffer from large model sizes, slow iterative denoising, and high inference latency, which hinder practical deployment. To overcome these limitations, we propose \textbf{RadioFlow}, a novel flow-matching-based generative framework that achieves high-fidelity RM generation through single-step efficient sampling. Unlike conventional diffusion models, RadioFlow learns continuous transport trajectories between noise and data, enabling both training and inference to be significantly accelerated while preserving reconstruction accuracy. Comprehensive experiments demonstrate that RadioFlow achieves state-of-the-art performance with \textbf{up to 8$\times$ fewer parameters} and \textbf{over 4$\times$ faster inference} compared to the leading diffusion-based baseline (RadioDiff). This advancement provides a promising pathway toward scalable, energy-efficient, and real-time electromagnetic digital twins for future 6G networks. We release the code at \href{https://github.com/Hxxxz0/RadioFlow}{GitHub}.
LGAug 19, 2025Code
Text2Weight: Bridging Natural Language and Neural Network Weight SpacesBowen Tian, Wenshuo Chen, Zexi Li et al.
How far are we really from automatically generating neural networks? While neural network weight generation shows promise, current approaches struggle with generalization to unseen tasks and practical application exploration. To address this, we propose T2W, a diffusion transformer framework that generates task-specific weights conditioned on natural language descriptions. T2W hierarchically processes network parameters into uniform blocks, integrates text embeddings from CLIP via a prior attention mechanism, and employs adversarial training with weight-space augmentation to enhance generalization. Experiments on Cifar100, Caltech256, and TinyImageNet demonstrate T2W's ability to produce high-quality weights for unseen tasks, outperforming optimization-based initialization and enabling novel applications such as weight enhancement and text-guided model fusion. Our work bridges textual semantics with weight-space dynamics, supported by an open-source dataset of text-weight pairs, advancing the practicality of generative models in neural network parameter synthesis. Our code is available on Github.
CVApr 29
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion ModelsHaosen Li, Wenshuo Chen, Lei Wang et al.
Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie's Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.
IRNov 7, 2024
Best Practices for Distilling Large Language Models into BERT for Web Search RankingDezhi Ye, Junwei Hu, Jiabin Fan et al.
Recent studies have highlighted the significant potential of Large Language Models (LLMs) as zero-shot relevance rankers. These methods predominantly utilize prompt learning to assess the relevance between queries and documents by generating a ranked list of potential documents. Despite their promise, the substantial costs associated with LLMs pose a significant challenge for their direct implementation in commercial search systems. To overcome this barrier and fully exploit the capabilities of LLMs for text ranking, we explore techniques to transfer the ranking expertise of LLMs to a more compact model similar to BERT, using a ranking loss to enable the deployment of less resource-intensive models. Specifically, we enhance the training of LLMs through Continued Pre-Training, taking the query as input and the clicked title and summary as output. We then proceed with supervised fine-tuning of the LLM using a rank loss, assigning the final token as a representative of the entire sentence. Given the inherent characteristics of autoregressive language models, only the final token </s> can encapsulate all preceding tokens. Additionally, we introduce a hybrid point-wise and margin MSE loss to transfer the ranking knowledge from LLMs to smaller models like BERT. This method creates a viable solution for environments with strict resource constraints. Both offline and online evaluations have confirmed the efficacy of our approach, and our model has been successfully integrated into a commercial web search engine as of February 2024.
CVJun 3, 2025
ANT: Adaptive Neural Temporal-Aware Text-to-Motion ModelWenshuo Chen, Kuimou Yu, Haozhe Jia et al.
While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulation governing morphological specialization, we propose **(ANT)**, an **A**daptive **N**eural **T**emporal-Aware architecture. ANT orchestrates semantic granularity through: **(i) Semantic Temporally Adaptive (STA) Module:** Automatically partitions denoising into low-frequency structural planning and high-frequency refinement via spectral analysis. **(ii) Dynamic Classifier-Free Guidance scheduling (DCFG):** Adaptively adjusts conditional to unconditional ratio enhancing efficiency while maintaining fidelity. Extensive experiments show that ANT can be applied to various baselines, significantly improving model performance, and achieving state-of-the-art semantic alignment on StableMoFusion.
CVJan 31, 2025
Physics-Informed Representation Alignment for Sparse Radio-Map ReconstructionHaozhe Jia, Wenshuo Chen, Zhihui Huang et al.
Radio map reconstruction is essential for enabling advanced applications, yet challenges such as complex signal propagation and sparse observational data hinder accurate reconstruction in practical scenarios. Existing methods often fail to align physical constraints with data-driven features, particularly under sparse measurement conditions. To address these issues, we propose **Phy**sics-Aligned **R**adio **M**ap **D**iffusion **M**odel (**PhyRMDM**), a novel framework that establishes cross-domain representation alignment between physical principles and neural network features through dual learning pathways. The proposed model integrates **Physics-Informed Neural Networks (PINNs)** with a **representation alignment mechanism** that explicitly enforces consistency between Helmholtz equation constraints and environmental propagation patterns. Experimental results demonstrate significant improvements over state-of-the-art methods, achieving **NMSE of 0.0031** under *Static Radio Map (SRM)* conditions, and **NMSE of 0.0047** with **Dynamic Radio Map (DRM)** scenarios. The proposed representation alignment paradigm provides **37.2%** accuracy enhancement in ultra-sparse cases (**1%** sampling rate), confirming its effectiveness in bridging physics-based modeling and deep learning for radio map reconstruction.
CVSep 29, 2025
LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion ModelHaozhe Jia, Wenshuo Chen, Yuqi Lin et al.
While current diffusion-based models, typically built on U-Net architectures, have shown promising results on the text-to-motion generation task, they still suffer from semantic misalignment and kinematic artifacts. Through analysis, we identify severe gradient attenuation in the deep layers of the network as a key bottleneck, leading to insufficient learning of high-level features. To address this issue, we propose \textbf{LUMA} (\textit{\textbf{L}ow-dimension \textbf{U}nified \textbf{M}otion \textbf{A}lignment}), a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. The first path incorporates a lightweight MoCLIP model trained via contrastive learning without relying on external data, offering semantic supervision in the temporal domain. The second path introduces complementary alignment signals in the frequency domain, extracted from low-frequency DCT components known for their rich semantic content. These two anchors are adaptively fused through a temporal modulation mechanism, allowing the model to progressively transition from coarse alignment to fine-grained semantic refinement throughout the denoising process. Experimental results on HumanML3D and KIT-ML demonstrate that LUMA achieves state-of-the-art performance, with FID scores of 0.035 and 0.123, respectively. Furthermore, LUMA accelerates convergence by 1.4$\times$ compared to the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.
SEMar 12, 2025
You Only Train Once: A Flexible Training Framework for Code Vulnerability Detection Driven by Vul-VectorBowen Tian, Zhengyang Xu, Mingqiang Wu et al.
With the pervasive integration of computer applications across industries, the presence of vulnerabilities within code bases poses significant risks. The diversity of software ecosystems coupled with the intricate nature of modern software engineering has led to a shift from manual code vulnerability identification towards the adoption of automated tools. Among these, deep learning-based approaches have risen to prominence due to their superior accuracy; however, these methodologies encounter several obstacles. Primarily, they necessitate extensive labeled datasets and prolonged training periods, and given the rapid emergence of new vulnerabilities, the frequent retraining of models becomes a resource-intensive endeavor, thereby limiting their applicability in cutting-edge scenarios. To mitigate these challenges, this paper introduces the \underline{\textbf{YOTO}}--\underline{\textbf{Y}}ou \underline{\textbf{O}}nly \underline{\textbf{T}}rain \underline{\textbf{O}}nce framework. This innovative approach facilitates the integration of multiple types of vulnerability detection models via parameter fusion, eliminating the need for joint training. Consequently, YOTO enables swift adaptation to newly discovered vulnerabilities, significantly reducing both the time and computational resources required for model updates.
LGOct 4, 2021
Global Convergence and Stability of Stochastic Gradient DescentVivak Patel, Shushu Zhang, Bowen Tian
In machine learning, stochastic gradient descent (SGD) is widely deployed to train models using highly non-convex objectives with equally complex noise models. Unfortunately, SGD theory often makes restrictive assumptions that fail to capture the non-convexity of real problems, and almost entirely ignore the complex noise models that exist in practice. In this work, we make substantial progress on this shortcoming. First, we establish that SGD's iterates will either globally converge to a stationary point or diverge under nearly arbitrary nonconvexity and noise models. Under a slightly more restrictive assumption on the joint behavior of the non-convexity and noise model that generalizes current assumptions in the literature, we show that the objective function cannot diverge, even if the iterates diverge. As a consequence of our results, SGD can be applied to a greater range of stochastic optimization problems with confidence about its global convergence behavior and stability.