CVAug 14, 2024Code
KIND: Knowledge Integration and Diversion for Training Decomposable ModelsYucheng Xie, Fu Feng, Ruixiao Shi et al.
Pre-trained models have become the preferred backbone due to the increasing complexity of model parameters. However, traditional pre-trained models often face deployment challenges due to their fixed sizes, and are prone to negative transfer when discrepancies arise between training tasks and target tasks. To address this, we propose KIND, a novel pre-training method designed to construct decomposable models. KIND integrates knowledge by incorporating Singular Value Decomposition (SVD) as a structural constraint, with each basic component represented as a combination of a column vector, singular value, and row vector from U, Σ, and V^\top matrices. These components are categorized into learngenes for encapsulating class-agnostic knowledge and tailors for capturing class-specific knowledge, with knowledge diversion facilitated by a class gate mechanism during training. Extensive experiments demonstrate that models pre-trained with KIND can be decomposed into learngenes and tailors, which can be adaptively recombined for diverse resource-constrained deployments. Moreover, for tasks with large domain shifts, transferring only learngenes with task-agnostic knowledge, when combined with randomly initialized tailors, effectively mitigates domain shifts. Code will be made available at https://github.com/Te4P0t/KIND.
NEJun 17, 2023
Genes in Intelligent AgentsFu Feng, Jing Wang, Xu Yang et al.
The genes in nature give the lives on earth the current biological intelligence through transmission and accumulation over billions of years. Inspired by the biological intelligence, artificial intelligence (AI) has devoted to building the machine intelligence. Although it has achieved thriving successes, the machine intelligence still lags far behind the biological intelligence. The reason may lie in that animals are born with some intelligence encoded in their genes, but machines lack such intelligence and learn from scratch. Inspired by the genes of animals, we define the ``genes'' of machines named as the ``learngenes'' and propose the Genetic Reinforcement Learning (GRL). GRL is a computational framework that simulates the evolution of organisms in reinforcement learning (RL) and leverages the learngenes to learn and evolve the intelligence agents. Leveraging GRL, we first show that the learngenes take the form of the fragments of the agents' neural networks and can be inherited across generations. Second, we validate that the learngenes can transfer ancestral experience to the agents and bring them instincts and strong learning abilities. Third, we justify the Lamarckian inheritance of the intelligent agents and the continuous evolution of the learngenes. Overall, the learngenes have taken the machine intelligence one more step toward the biological intelligence.
CVSep 28, 2024
FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion ModelsYucheng Xie, Fu Feng, Ruixiao Shi et al.
The training of diffusion models is computationally intensive, making effective pre-training essential. However, real-world deployments often demand models of variable sizes due to diverse memory and computational constraints, posing challenges when corresponding pre-trained versions are unavailable. To address this, we propose FINE, a novel pre-training method whose resulting model can flexibly factorize its knowledge into fundamental components, termed learngenes, enabling direct initialization of models of various sizes and eliminating the need for repeated pre-training. Rather than optimizing a conventional full-parameter model, FINE represents each layer's weights as the product of $U_{\star}$, $Σ_{\star}^{(l)}$, and $V_{\star}^\top$, where $U_{\star}$ and $V_{\star}$ serve as size-agnostic learngenes shared across layers, while $Σ_{\star}^{(l)}$ remains layer-specific. By jointly training these components, FINE forms a decomposable and transferable knowledge structure that allows efficient initialization through flexible recombination of learngenes, requiring only light retraining of $Σ_{\star}^{(l)}$ on limited data. Extensive experiments demonstrate the efficiency of FINE, achieving state-of-the-art performance in initializing variable-sized models across diverse resource-constrained deployments. Furthermore, models initialized by FINE effectively adapt to diverse tasks, showcasing the task-agnostic versatility of learngenes.
81.5LGApr 16
Constraint-based Pre-training: From Structured Constraints to Scalable Model InitializationFu Feng, Yucheng Xie, Ruixiao Shi et al.
The pre-training and fine-tuning paradigm has become the dominant approach for model adaptation. However, conventional pre-training typically yields models at a fixed scale, whereas practical deployment often requires models of varying sizes, exposing its limitations when target model scales differ from those used during pre-training. To address this, we propose an innovative constraint-based pre-training paradigm that imposes structured constraints during pre-training to disentangle size-agnostic knowledge into reusable weight templates, while assigning size-specific adaptation to lightweight weight scalers, thereby reformulating variable-sized model initialization as a multi-task adaptation problem. Within this paradigm, we further introduce WeiT, which employs Kronecker-based constraints to regularize the pre-training process. Specifically, model parameters are represented as compositions of weight templates via concatenation and weighted aggregation, with adaptive connections governed by lightweight weight scalers whose parameters are learned from limited data. This design enables flexible and efficient construction of model weights across diverse downstream scales. Extensive experiments demonstrate the efficiency and effectiveness of WeiT, achieving state-of-the-art performance in initializing models with varying depths and widths across a broad range of perception and embodied learning tasks, including Image Classification, Image Generation, and Embodied Control. Moreover, its effectiveness generalizes to both Transformer-based and Convolution-based architectures, consistently enabling faster convergence and improved performance even under full training.
63.7CVMar 18
A Creative Agent is Worth a 64-Token TemplateRuixiao Shi, Fu Feng, Yucheng Xie et al.
Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as ``a creative vinyl record-inspired skyscraper'', these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes ``creativity'' costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce \textbf{CAT}, a framework for \textbf{C}reative \textbf{A}gent \textbf{T}okenization that encapsulates agents' intrinsic understanding of ``creativity'' through a \textit{Creative Tokenizer}. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent's latent creative representations. Extensive experiments on \textbf{\textit{Architecture Design}}, \textbf{\textit{Furniture Design}}, and \textbf{\textit{Nature Mixture}} tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a $3.7\times$ speedup and a $4.8\times$ reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.
LGDec 10, 2025
Knowledge Diversion for Efficient Morphology Control and Policy TransferFu Feng, Ruixiao Shi, Yucheng Xie et al.
Universal morphology control aims to learn a universal policy that generalizes across heterogeneous agent morphologies, with Transformer-based controllers emerging as a popular choice. However, such architectures incur substantial computational costs, resulting in high deployment overhead, and existing methods exhibit limited cross-task generalization, necessitating training from scratch for each new task. To this end, we propose \textbf{DivMorph}, a modular training paradigm that leverages knowledge diversion to learn decomposable controllers. DivMorph factorizes randomly initialized Transformer weights into factor units via SVD prior to training and employs dynamic soft gating to modulate these units based on task and morphology embeddings, separating them into shared \textit{learngenes} and morphology- and task-specific \textit{tailors}, thereby achieving knowledge disentanglement. By selectively activating relevant components, DivMorph enables scalable and efficient policy deployment while supporting effective policy transfer to novel tasks. Extensive experiments demonstrate that DivMorph achieves state-of-the-art performance, achieving a 3$\times$ improvement in sample efficiency over direct finetuning for cross-task transfer and a 17$\times$ reduction in model size for single-agent deployment.
CVJan 27
Self-Supervised Weight Templates for Scalable Vision Model InitializationYucheng Xie, Fu Feng, Ruixiao Shi et al.
The increasing scale and complexity of modern model parameters underscore the importance of pre-trained models. However, deployment often demands architectures of varying sizes, exposing limitations of conventional pre-training and fine-tuning. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization in vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures with varying depths and widths. Target models are subsequently initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be efficiently learned from minimal training data. To further enhance flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for improved cross-width generalization. Extensive experiments on \textsc{classification}, \textsc{detection}, \textsc{segmentation} and \textsc{generation} tasks demonstrate the state-of-the-art performance of SWEET for initializing variable-sized vision models.
LGJun 25, 2024Code
WAVE: Weight Templates for Adaptive Initialization of Variable-sized ModelsFu Feng, Yucheng Xie, Jing Wang et al.
The growing complexity of model parameters underscores the significance of pre-trained models. However, deployment constraints often necessitate models of varying sizes, exposing limitations in the conventional pre-training and fine-tuning paradigm, particularly when target model sizes are incompatible with pre-trained ones. To address this challenge, we propose WAVE, a novel approach that reformulates variable-sized model initialization from a multi-task perspective, where initializing each model size is treated as a distinct task. WAVE employs shared, size-agnostic weight templates alongside size-specific weight scalers to achieve consistent initialization across various model sizes. These weight templates, constructed within the Learngene framework, integrate knowledge from pre-trained models through a distillation process constrained by Kronecker-based rules. Target models are then initialized by concatenating and weighting these templates, with adaptive connection rules established by lightweight weight scalers, whose parameters are learned from minimal training data. Extensive experiments demonstrate the efficiency of WAVE, achieving state-of-the-art performance in initializing models of various depth and width. The knowledge encapsulated in weight templates is also task-agnostic, allowing for seamless transfer across diverse downstream datasets. Code will be made available at https://github.com/fu-feng/WAVE.
CVOct 31, 2024Code
Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative GenerationFu Feng, Yucheng Xie, Xu Yang et al.
``Creative'' remains an inherently abstract concept for both humans and diffusion models. While text-to-image (T2I) diffusion models can easily generate out-of-distribution concepts like ``a blue banana'', they struggle with generating combinatorial objects such as ``a creative mixture that resembles a lettuce and a mantis'', due to difficulties in understanding the semantic depth of ``creative''. Current methods rely heavily on synthesizing reference prompts or images to achieve a creative effect, typically requiring retraining for each unique creative output-a process that is computationally intensive and limits practical applications. To address this, we introduce CreTok, which brings meta-creativity to diffusion models by redefining ``creative'' as a new token, \texttt{<CreTok>}, thus enhancing models' semantic understanding for combinatorial creativity. CreTok achieves such redefinition by iteratively sampling diverse text pairs from our proposed CangJie dataset to form adaptive prompts and restrictive prompts, and then optimizing the similarity between their respective text embeddings. Extensive experiments demonstrate that <CreTok> enables the universal and direct generation of combinatorial creativity across diverse concepts without additional training, achieving state-of-the-art performance with improved text-image alignment and higher human preference ratings. Code will be made available at https://github.com/fu-feng/CreTok.
LGMar 8
A Unified Framework for Knowledge Transfer in Bidirectional Model ScalingJianlu Shen, Fu Feng, Jiaze Xu et al.
Transferring pre-trained knowledge from a source model to a target model of a different architectural size is a key challenge for flexible and efficient model scaling. However, current parameter-space methods treat Small-to-Large (S2L) and Large-to-Small (L2S) scaling as separate, incompatible problems, focusing on parameter synthesis and selection, respectively. This fragmented perspective has resulted in specialized tools, hindering a unified, bidirectional framework. In this paper, we propose BoT (Bidirectional knowledge Transfer), the first size-agnostic framework to unify S2L and L2S scaling. Our core insight is to treat model weights as continuous signals, where models of different sizes represent distinct discretizations of the transferable knowledge. This multi-resolution perspective directly casts S2L and L2S scaling as the signal processing operations of upsampling and downsampling, naturally leading to the adoption of the Discrete Wavelet Transform (DWT) and its Inverse (IDWT). BoT leverages the recursive nature of wavelets, using the decomposition level as a dynamic scaling factor to bridge disparate model sizes in a parameter-free and computationally efficient manner. Extensive experiments on DeiT, BERT, and GPT demonstrate significant pre-training FLOPs savings (up to 67.1% for S2L, 52.8% for L2S) and state-of-the-art performance on benchmarks like GLUE and SQuAD.
CVMay 13, 2025
FAD: Frequency Adaptation and Diversion for Cross-domain Few-shot LearningRuixiao Shi, Fu Feng, Yucheng Xie et al.
Cross-domain few-shot learning (CD-FSL) requires models to generalize from limited labeled samples under significant distribution shifts. While recent methods enhance adaptability through lightweight task-specific modules, they operate solely in the spatial domain and overlook frequency-specific variations that are often critical for robust transfer. We observe that spatially similar images across domains can differ substantially in their spectral representations, with low and high frequencies capturing complementary semantic information at coarse and fine levels. This indicates that uniform spatial adaptation may overlook these spectral distinctions, thus constraining generalization. To address this, we introduce Frequency Adaptation and Diversion (FAD), a frequency-aware framework that explicitly models and modulates spectral components. At its core is the Frequency Diversion Adapter, which transforms intermediate features into the frequency domain using the discrete Fourier transform (DFT), partitions them into low, mid, and high-frequency bands via radial masks, and reconstructs each band using inverse DFT (IDFT). Each frequency band is then adapted using a dedicated convolutional branch with a kernel size tailored to its spectral scale, enabling targeted and disentangled adaptation across frequencies. Extensive experiments on the Meta-Dataset benchmark demonstrate that FAD consistently outperforms state-of-the-art methods on both seen and unseen domains, validating the utility of frequency-domain representations and band-wise adaptation for improving generalization in CD-FSL.
CVMay 6, 2025
Distribution-Conditional Generation: From Class Distribution to Creative GenerationFu Feng, Yucheng Xie, Xu Yang et al.
Text-to-image (T2I) diffusion models are effective at producing semantically aligned images, but their reliance on training data distributions limits their ability to synthesize truly novel, out-of-distribution concepts. Existing methods typically enhance creativity by combining pairs of known concepts, yielding compositions that, while out-of-distribution, remain linguistically describable and bounded within the existing semantic space. Inspired by the soft probabilistic outputs of classifiers on ambiguous inputs, we propose Distribution-Conditional Generation, a novel formulation that models creativity as image synthesis conditioned on class distributions, enabling semantically unconstrained creative generation. Building on this, we propose DisTok, an encoder-decoder framework that maps class distributions into a latent space and decodes them into tokens of creative concept. DisTok maintains a dynamic concept pool and iteratively sampling and fusing concept pairs, enabling the generation of tokens aligned with increasingly complex class distributions. To enforce distributional consistency, latent vectors sampled from a Gaussian prior are decoded into tokens and rendered into images, whose class distributions-predicted by a vision-language model-supervise the alignment between input distributions and the visual semantics of generated tokens. The resulting tokens are added to the concept pool for subsequent composition. Extensive experiments demonstrate that DisTok, by unifying distribution-conditioned fusion and sampling-based synthesis, enables efficient and flexible token-level generation, achieving state-of-the-art performance with superior text-image alignment and human preference scores.
LGMar 8
One-for-All Model Initialization with Frequency-Domain KnowledgeJianlu Shen, Fu Feng, Yucheng Xie et al.
Transferring knowledge by fine-tuning large-scale pre-trained networks has become a standard paradigm for downstream tasks, yet the knowledge of a pre-trained model is tightly coupled with monolithic architecture, which restricts flexible reuse across models of varying scales. In response to this challenge, recent approaches typically resort to either parameter selection, which fails to capture the interdependent structure of this knowledge, or parameter prediction using generative models that depend on impractical access to large network collections. In this paper, we empirically demonstrate that a model's foundational, task-agnostic knowledge, its "learngene", is encoded within the low-frequency components of its weights, and can be efficiently inherited by downstream models. Based on this insight, we propose FRONT (FRequency dOmain kNowledge Transfer), a novel framework that uses the Discrete Cosine Transform (DCT) to isolate the low-frequency "learngene". This learngene can be seamlessly adapted to initialize models of arbitrary size via simple truncation or padding, a process that is entirely training-free. For enhanced performance, we propose an optional low-cost refinement process that introduces a spectral regularizer to further improve the learngene's transferability. Extensive experiments demonstrate that FRONT achieves the state-of-the-art performance, accelerates convergence by up to 15 times in vision tasks, and reduces training FLOPs by an average of 40.5% in language tasks.
CVJul 31, 2025
DivControl: Knowledge Diversion for Controllable Image GenerationYucheng Xie, Fu Feng, Ruixiao Shi et al.
Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components-pairs of singular vectors-which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4$\times$ less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
LGJan 16, 2024
Transferring Core Knowledge via LearngenesFu Feng, Jing Wang, Xin Geng
The pre-training paradigm fine-tunes the models trained on large-scale datasets to downstream tasks with enhanced performance. It transfers all knowledge to downstream tasks without discriminating which part is necessary or unnecessary, which may lead to negative transfer. In comparison, knowledge transfer in nature is much more efficient. When passing genetic information to descendants, ancestors encode only the essential knowledge into genes, which act as the medium. Inspired by that, we adopt a recent concept called ``learngene'' and refine its structures by mimicking the structures of natural genes. We propose the Genetic Transfer Learning (GTL) -- a framework to copy the evolutionary process of organisms into neural networks. GTL trains a population of networks, selects superior learngenes by tournaments, performs learngene mutations, and passes the learngenes to next generations. Finally, we successfully extract the learngenes of VGG11 and ResNet12. We show that the learngenes bring the descendant networks instincts and strong learning ability: with 20% parameters, the learngenes bring 12% and 16% improvements of accuracy on CIFAR-FS and miniImageNet. Besides, the learngenes have the scalability and adaptability on the downstream structure of networks and datasets. Overall, we offer a novel insight that transferring core knowledge via learngenes may be sufficient and efficient for neural networks.