83.5 · CV · Apr 15
Seedance 2.0: Advancing Video Generation for World Complexity
Team Seedance, De Chen, Liyang Chen et al. · gatech
Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, large-scale architecture for joint multi-modal audio-video generation. It supports four input modalities (text, image, audio, and video) and integrates one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 directly generates audio-video content with durations from 4 to 15 seconds, at native output resolutions of 480p and 720p. For multi-modal reference inputs, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast, an accelerated variant of Seedance 2.0 designed to boost generation speed for low-latency scenarios. Overall, Seedance 2.0 significantly improves foundational generation capabilities and multi-modal generation performance, bringing an enhanced creative experience to end users.
CV · Aug 19, 2024 · Code
RealCustom++: Representing Images as Real Textual Word for Real-Time Customization
Zhendong Mao, Mengqi Huang, Fei Ding et al.
Given a text and an image of a specific subject, text-to-image customization aims to generate new images that align with both the text and the subject's appearance. Existing works follow the pseudo-word paradigm, which represents the subject as a non-existent pseudo word and combines it with other text to generate images. However, the pseudo word causes semantic conflict from its different learning objective and entanglement from overlapping influence scopes with other texts, resulting in a dual-optimum paradox where subject similarity and text controllability cannot be optimal simultaneously. To address this, we propose RealCustom++, a novel real-word paradigm that represents the subject with a non-conflicting real word to first generate a coherent guidance image and corresponding subject mask, thereby disentangling the influence scopes of the text and subject for simultaneous optimization. Specifically, RealCustom++ introduces a train-inference decoupled framework: (1) during training, it learns a general alignment between visual conditions and all real words in the text; and (2) during inference, a dual-branch architecture is employed, where the Guidance Branch produces the subject guidance mask and the Generation Branch utilizes this mask to customize the generation of the specific real word exclusively within subject-relevant regions. In contrast to previous methods that excel in either controllability or similarity, RealCustom++ achieves superior performance in both, with improvements of 7.48% in controllability, 3.04% in similarity, and 76.43% in generation quality. For multi-subject customization, RealCustom++ further achieves improvements of 4.6% in controllability and 6.34% in multi-subject similarity. Our work has been applied in JiMeng of ByteDance, and code is released at https://github.com/bytedance/RealCustom.
CL · Nov 2, 2022
Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model
Mingqi Li, Fei Ding, Dan Zhang et al.
Pre-trained multilingual language models play an important role in cross-lingual natural language understanding tasks. However, existing methods do not focus on learning the semantic structure of representations and thus cannot fully optimize their performance. In this paper, we propose Multi-level Multilingual Knowledge Distillation (MMKD), a novel method for improving multilingual language models. Specifically, we employ a teacher-student framework to transfer the rich semantic representation knowledge in English BERT. We propose token-, word-, sentence-, and structure-level alignment objectives to encourage multiple levels of consistency between source-target pairs and correlation similarity between teacher and student models. We conduct experiments on cross-lingual evaluation benchmarks including XNLI, PAWS-X, and XQuAD. Experimental results show that MMKD outperforms other baseline models of similar size on XNLI and XQuAD and obtains comparable performance on PAWS-X. In particular, MMKD obtains significant performance gains on low-resource languages.
98.5 · CV · May 18 · Code
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Fengyi Fu, Mengqi Huang, Shaojin Wu et al.
We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.
CV · Apr 11, 2022
XMP-Font: Self-Supervised Cross-Modality Pre-training for Few-Shot Font Generation
Wei Liu, Fangyue Liu, Fei Ding et al.
Generating a new font library is a very labor-intensive and time-consuming job for glyph-rich scripts. Few-shot font generation is thus required, as it needs only a few glyph references and no fine-tuning at test time. Existing methods follow the style-content disentanglement paradigm and expect novel fonts to be produced by combining the style codes of the reference glyphs and the content representations of the source. However, these few-shot font generation methods either fail to capture content-independent style representations, or employ localized component-wise style representations, which are insufficient to model many Chinese font styles that involve hyper-component features such as inter-component spacing and "connected-stroke". To resolve these drawbacks and make the style representations more reliable, we propose a self-supervised cross-modality pre-training strategy and a cross-modality transformer-based encoder that is conditioned jointly on the glyph image and the corresponding stroke labels. The cross-modality encoder is pre-trained in a self-supervised manner to allow effective capture of cross- and intra-modality correlations, which facilitates content-style disentanglement and the modeling of style representations at all scales (stroke-level, component-level and character-level). The pre-trained encoder is then applied to the downstream font generation task without fine-tuning. Experimental comparisons with state-of-the-art methods demonstrate that our method successfully transfers styles of all scales. In addition, it only requires one reference glyph and achieves the lowest rate of bad cases in the few-shot font generation task, 28% lower than the second best
82.8 · LG · May 22
State commitment learning: training language models to distinguish computation from memory
Fei Ding, Yongkang Zhang, Runhao Liu et al.
Reasoning language models do not distinguish tokens used for computation from tokens that constitute persistent state: once generated, all hidden thoughts remain in context and influence future predictions. As a result, downstream reasoning may depend on failed attempts, dead ends, and private scratch work that should not be safely relied on later. We recast this phenomenon as a new training objective, state commitment learning: training models to explicitly distinguish information that should be committed as persistent state from temporary computation that can be discarded. We define a counterfactual criterion, persistent-state sufficiency, which makes it trainable and measurable whether an answer remains usable after hidden thoughts are erased. We then propose Counterfactual Erasure RL (CERL), which evaluates, under the same prefix, both a path that keeps hidden thoughts and a path that erases them, and gives reward only when the erasure path remains correct. We also introduce the Erasure Dependence Protocol and show across mathematics, long-chain logic, scientific QA, and multi-turn tool-use evaluation that CERL substantially reduces answer dependence on hidden thoughts without sacrificing accuracy, consistently outperforming correctness-only RL and long-answer SFT baselines.
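The reward rule described in this abstract (reward only when the erasure path remains correct) can be illustrated with a minimal Python sketch. The `PairedRollout` container and the `erasure_dependence` statistic are hypothetical names introduced here for illustration; they are not the paper's implementation or its Erasure Dependence Protocol.

```python
from dataclasses import dataclass


@dataclass
class PairedRollout:
    """Outcome of two rollouts that share one prefix (hypothetical container)."""
    keep_correct: bool   # answer is correct when hidden thoughts stay in context
    erase_correct: bool  # answer is correct after hidden thoughts are erased


def cerl_reward(rollout: PairedRollout) -> float:
    """Give reward 1.0 only when the erasure path is still correct."""
    return 1.0 if rollout.erase_correct else 0.0


def erasure_dependence(rollouts: list[PairedRollout]) -> float:
    """Illustrative statistic (an assumption, not the paper's protocol):
    fraction of keep-path successes that fail once thoughts are erased."""
    kept = [r for r in rollouts if r.keep_correct]
    if not kept:
        return 0.0
    return sum(not r.erase_correct for r in kept) / len(kept)
```

Under this sketch, a rollout that is correct only while its hidden thoughts remain in context earns no reward, which is the behavior the abstract attributes to CERL.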
99.5 · IR · Apr 6
SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs
Bi Xue, Hong Wu, Lei Chen et al.
Serving deep learning based recommendation models (DLRM) at scale is challenging. Existing approaches rely on dedicated ANN indexing and filtering services on CPUs, suffering from non-negligible costs and missing co-design opportunities. Such inefficiency makes it difficult for them to support complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we present SilverTorch, a model-based serving system that brings all components into one unified model. It unifies model serving by replacing standalone indexing and filtering services with model layers. We propose a model-based GPU Bloom index for feature filtering and a fused Int8 ANN kernel for nearest neighbor search. Through co-design of the ANN search and feature filtering, we reduce GPU memory usage and eliminate computation. Benefiting from this design, we scale up retrieval by introducing an OverArch scoring layer and a multi-task retrieval with a Value Model to aggregate scores. These advancements improve retrieval accuracy and enable future studies on serving more complex models. Our evaluation on industry-scale datasets shows that SilverTorch achieves up to 23.7× higher throughput compared to state-of-the-art approaches. We also demonstrate that the SilverTorch solution is 13.35× more cost-efficient than a CPU-based solution while improving accuracy by serving more complex models. SilverTorch is deployed at scale, serving hundreds of models online and supporting recommendation for diverse applications.
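For readers unfamiliar with Bloom-filter-style membership tests, which the Bloom index above builds on, here is a generic CPU sketch in Python. It is not SilverTorch's GPU index: the class name, bit-array size, hash count, and the use of SHA-256 are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

A filtering step could call `might_contain("lang=en")` on a candidate's feature filter and skip candidates for which it returns `False`.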
CV · Nov 30, 2025 · Code
Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards
Qiang Lyu, Zicong Chen, Chongxiao Wang et al.
Recently, Group Relative Policy Optimization (GRPO) has shown promising potential for aligning text-to-image (T2I) models, yet existing GRPO-based methods suffer from two critical limitations. (1) Shared credit assignment: trajectory-level advantages derived from group-normalized sparse terminal rewards are uniformly applied across timesteps, failing to accurately estimate the potential of early denoising steps with vast exploration spaces. (2) Reward-mixing: predefined weights for combining multi-objective rewards (e.g., text accuracy, visual quality, text color), which have mismatched scales and variances, lead to unstable gradients and conflicting updates. To address these issues, we propose Multi-GRPO, a multi-group advantage estimation framework with two orthogonal grouping mechanisms. For better credit assignment, we introduce tree-based trajectories inspired by Monte Carlo Tree Search: branching trajectories at selected early denoising steps naturally forms temporal groups, enabling accurate advantage estimation for early steps via descendant leaves while amortizing computation through shared prefixes. For multi-objective optimization, we introduce reward-based grouping to compute advantages for each reward function independently before aggregation, disentangling conflicting signals. To facilitate evaluation of multiple objective alignment, we curate OCR-Color-10, a visual text rendering dataset with explicit color constraints. Across the single-reward PickScore-25k and multi-objective OCR-Color-10 benchmarks, Multi-GRPO achieves superior stability and alignment performance, effectively balancing conflicting objectives. Code will be publicly available at https://github.com/fikry102/Multi-GRPO.
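The reward-based grouping idea (normalize each reward within the group on its own, then aggregate) can be sketched in a few lines of Python. The function names are hypothetical, and aggregating by a plain sum is an assumption; the abstract does not state the aggregation rule.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Standardize one reward's values across the samples of a group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def reward_grouped_advantages(rewards_by_fn):
    """rewards_by_fn maps a reward name to one value per sample in the group.

    Each reward is normalized independently, so rewards with very different
    scales contribute comparably; the per-sample sum is an assumed aggregation.
    """
    per_fn = [group_normalize(vals) for vals in rewards_by_fn.values()]
    return [sum(column) for column in zip(*per_fn)]
```

For example, with `{"ocr": [0, 1, 0, 1], "color": [10, 10, 30, 30]}` both rewards are mapped to roughly unit scale before being summed, so the larger raw range of the second reward does not dominate.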
31.8 · CE · May 4
From Production Envelopes to Executable Schedules: Sound Constructive Refinement for High-Mix Manufacturing
Runhao Liu, Zhengyang Cheng, Fei Ding et al.
High-mix manufacturing systems require production plans that are both profitable and refinable into executable machine-level schedules under heterogeneous resources, mold-dependent compatibility, setup losses, delivery windows, and accessory synchronization. We study this problem as a production-envelope refinement task. A rolling-horizon mixed-integer linear programming (MILP) planner generates a valid production envelope that fixes daily production, fulfillment, mold states, inventory flows, outsourcing, and unmet-demand variables. A structure-aware constructive scheduler then refines this envelope into concrete order-machine allocations while preserving capacity feasibility, product-mold-machine compatibility, and delivery-window compliance. The scheduler enforces a one-mold-per-machine-per-day stability rule to avoid intra-day mold fragmentation. We establish residual invariants and prove a soundness theorem: whenever refinement terminates with zero residual fulfillment, the returned allocation is executable with respect to the valid envelope. The framework is implemented as an Advanced Planning and Scheduling (APS) prototype and evaluated on a real industrial case from a Jiangsu smartphone-case manufacturer in China with 37 product types, 150 orders, and over 8.3 million requested units. The proposed stable refinement achieves 100% on-time delivery, eliminates outsourcing, and bounds changeover-driven capacity loss to 1.9-4.6%. Across nine demand and changeover perturbation scenarios, it maintains robust delivery performance, showing that sound envelope refinement is a practical mechanism for reliable manufacturing scheduling.
60.8 · RO · May 3
DexSim2Real: Foundation Model-Guided Sim-to-Real Transfer for Generalizable Dexterous Manipulation
Zijian Zeng, Fei Ding, Huiming Yang et al.
Sim-to-real transfer remains a critical bottleneck for deploying dexterous manipulation policies learned in simulation to real-world robots. Existing approaches rely on manually designed domain randomization or task-specific adaptation, limiting their generalizability across diverse manipulation scenarios. We present DexSim2Real, an integrated framework that leverages vision-language foundation models to bridge the sim-to-real gap for dexterous manipulation. Our system combines three components: (1) Foundation Model-Guided Domain Randomization (FM-DR), which uses a vision-language model as a visual realism critic to optimize simulation parameters via closed-loop CMA-ES, complementing text-based approaches like DrEureka with direct visual feedback; (2) a Tactile-Visual Cross-Attention Policy (TVCAP) that adapts cross-attention visuo-tactile fusion to zero-shot sim-to-real RL; and (3) a Progressive Skill Curriculum (PSC) that builds on LLM-based task decomposition with a difficulty scheduler tailored to contact-rich dexterous tasks. Extensive experiments on six challenging manipulation tasks with blinded evaluation demonstrate that DexSim2Real achieves a 78.2% average real-world success rate, outperforming DrEureka and DeXtreme while reducing the sim-to-real performance gap to only 8.3%.
CV · Aug 26, 2025 · Code
USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
Shaojin Wu, Mengqi Huang, Yufeng Cheng et al.
Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
LG · Jun 8, 2023
Contrastive Representation Disentanglement for Clustering
Fei Ding, Dan Zhang, Yin Yang et al.
Clustering continues to be a significant and challenging task. Recent studies have demonstrated impressive results by applying clustering to feature representations acquired through self-supervised learning, particularly on small datasets. However, when dealing with datasets containing a large number of clusters, such as ImageNet, current methods struggle to achieve satisfactory clustering performance. In this paper, we introduce a novel method called Contrastive representation Disentanglement for Clustering (CDC) that leverages contrastive learning to directly disentangle the feature representation for clustering. In CDC, we decompose the representation into two distinct components: one component encodes categorical information under an equipartition constraint, and the other component captures instance-specific factors. To train our model, we propose a contrastive loss that effectively utilizes both components of the representation. We conduct a theoretical analysis of the proposed loss and highlight how it assigns different weights to negative samples during the process of disentangling the feature representation. Further analysis of the gradients reveals that larger weights emphasize a stronger focus on hard negative samples. As a result, the proposed loss exhibits strong expressiveness, enabling efficient disentanglement of categorical information. Through experimental evaluation on various benchmark datasets, our method demonstrates either state-of-the-art or highly competitive clustering performance. Notably, on the complete ImageNet dataset, we achieve an accuracy of 53.4%, surpassing existing methods by a substantial margin of +10.2%.
CL · Mar 4
Order Is Not Layout: Order-to-Space Bias in Image Generation
Yongkang Zhang, Zonglin Zhao, Yuechen Zhang et al.
We study a systematic bias in modern image generation models: the mention order of entities in text spuriously determines spatial layout and entity-role binding. We term this phenomenon Order-to-Space Bias (OTS) and show that it arises in both text-to-image and image-to-image generation, often overriding grounded cues and causing incorrect layouts or swapped assignments. To quantify OTS, we introduce OTS-Bench, which isolates order effects with paired prompts differing only in entity order and evaluates models along two dimensions: homogenization and correctness. Experiments show that OTS is widespread in modern image generation models, and provide evidence that it is primarily data-driven and manifests during the early stages of layout formation. Motivated by this insight, we show that both targeted fine-tuning and early-stage intervention strategies can substantially reduce OTS while preserving generation quality.
CV · Sep 8, 2025 · Code
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Yufeng Cheng, Wenxu Wu, Shaojin Wu et al.
Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, which limits the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preservation. Code and model: https://github.com/bytedance/UMO
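To illustrate what a global assignment over identities means in general, the Python sketch below matches generated faces to reference identities so that the total similarity is maximal. It is a brute-force illustration for small square matrices, not UMO's reward or training code; the function name and the similarity matrix are hypothetical.

```python
from itertools import permutations


def best_assignment(similarity):
    """similarity[i][j]: similarity of generated face i to reference identity j.

    Tries every one-to-one matching (fine for a handful of identities) and
    returns the matching with the highest total similarity and that total.
    """
    n = len(similarity)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(similarity[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score
```

With `[[0.9, 0.8], [0.85, 0.1]]`, matching each face to its individually best reference would pair face 0 with identity 0 and leave face 1 with a poor match, whereas the global optimum swaps them. For larger problems, a polynomial-time solver such as SciPy's `linear_sum_assignment` is the usual choice.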
CL · Jun 11, 2025 · Code
Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting
Fei Ding, Baiqiao Wang
Supervised Fine-Tuning (SFT) is a critical step for enhancing the instruction-following capabilities of Large Language Models (LLMs) and adapting them to specialized domains. However, SFT often leads to a degradation of the model's general abilities, a phenomenon known as catastrophic forgetting. This problem is exacerbated when third-party practitioners fine-tune open-source models, as the original SFT data is typically not available. To address this challenge, we propose a novel and cost-effective SFT method that effectively mitigates catastrophic forgetting without requiring access to the original SFT data. Our approach first reconstructs the likely instruction distribution of the base model. It then employs a multi-model generation and filtering pipeline to synthesize a high-quality general-purpose dataset. This synthetic dataset is mixed with new, domain-specific data for fine-tuning. Experimental results show that our method not only preserves the model's capabilities in general domains but also improves task-specific performance, outperforming baselines that use publicly available SFT datasets.
50.7 · LG · Apr 20
HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation
Zijian Zeng, Fei Ding, Huiming Yang et al.
Vision-Language-Action (VLA) models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance. We show that this failure is not resolved by extending context length alone in the current reactive execution setting; instead, it stems from three recurring execution-loop deficiencies: the memory gap, the verification gap, and the recovery gap. We present HELM, a model-agnostic framework that addresses these deficiencies with three components: an Episodic Memory Module (EMM) that retrieves key task history via CLIP-indexed keyframes, a learned State Verifier (SV) that predicts action failure before execution from observation, action, subgoal, and memory-conditioned context, and a Harness Controller (HC) that performs rollback and replanning. The SV is the core learning contribution: it consistently outperforms rule-based feasibility checks and ensemble uncertainty baselines, and its effectiveness depends critically on access to episodic memory. On LIBERO-LONG, HELM improves task success rate by 23.1 percentage points over OpenVLA (58.4% to 81.5%), while extending the context window to H=32 yields only a 5.4-point gain and same-budget LoRA adaptation remains 12.2 points below HELM. HELM also improves long-horizon performance on CALVIN and substantially boosts recovery success under controlled perturbations. Ablations and mechanism analyses isolate the contribution of each component, and we release LIBERO-Recovery as a perturbation-injection protocol for evaluating failure recovery in long-horizon manipulation.
70.8 · LG · Apr 19
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
Fei Ding, Yongkang Zhang, Runhao Liu et al.
This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insufficiently characterized: the comparison units used during training lack inherent comparability. Building on this observation, we propose a new perspective: the length problem should not be viewed merely as a loss-scaling or normalization bias, but rather as a \emph{comparison unit construction} problem. We further establish a sample-construction-based training framework that, instead of applying post-hoc corrections to unequal-length responses, proactively constructs equal-length, alignable, and comparable training segments during generation. Within this framework, we propose EqLen, a concrete method applicable to group-relative comparison algorithms such as GRPO, GSPO, and RLOO. Through dual-track synchronous generation, prefix inheritance, and segment masking, EqLen efficiently collects effective equal-length training segments and enables stable
94.5 · LG · Apr 19
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
Fei Ding, Yongkang Zhang, Runhao Liu et al.
The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.
CV · Apr 2, 2025
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
Shaojin Wu, Mengqi Huang, Wenxu Wu et al.
Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.
CV · Dec 21, 2023
DreamTuner: Single Image is Enough for Subject-Driven Generation
Miao Hua, Jiawei Liu, Fei Ding et al.
Diffusion-based models have demonstrated impressive capabilities for text-to-image generation and are expected to support personalized subject-driven generation, which requires generating customized concepts from one or a few reference images. However, existing methods based on fine-tuning fail to balance the trade-off between subject learning and the maintenance of the generation capabilities of pretrained models. Moreover, other methods that utilize additional image encoders tend to lose important details of the subject due to encoding compression. To address these challenges, we propose DreamTuner, a novel method that injects reference information from coarse to fine to achieve subject-driven image generation more effectively. DreamTuner introduces a subject-encoder for coarse subject identity preservation, where the compressed general subject features are introduced through an attention layer before visual-text cross-attention. We then modify the self-attention layers within pretrained text-to-image models to self-subject-attention layers to refine the details of the target subject. The generated image queries detailed features from both the reference image and itself in self-subject-attention. It is worth emphasizing that self-subject-attention is an effective, elegant, and training-free method for maintaining the detailed features of customized subjects and can serve as a plug-and-play solution during inference. Finally, with additional subject-driven fine-tuning, DreamTuner achieves remarkable performance in subject-driven image generation, which can be controlled by text or other conditions such as pose. For further details, please visit the project page at https://dreamtuner-diffusion.github.io/.
79.6 · LG · Apr 4
Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation
Fei Ding, Yongkang Zhang, Youwei Wang et al.
Under sparse terminal rewards, intra-group comparisons have become the dominant paradigm for fine-tuning reasoning models via reinforcement learning. However, long-term training often leads to issues such as ineffective update accumulation (learning tax), solution probability drift, and entropy collapse. This paper presents a necessary condition for algorithm design from a token-level credit assignment perspective: to prevent reward-irrelevant drift, intra-group objectives must maintain gradient exchangeability across token updates, enabling gradient cancellation on weak-credit/high-frequency tokens. We show that two common mechanisms that disrupt exchangeability make "non-cancellation" a structural norm. Based on this, we propose minimal intra-group transformations to restore or approximate the cancellation structure in the shared token space. Experimental results demonstrate that these transformations stabilize training, improve sample efficiency, and enhance final performance, validating the value of this design condition.
CVApr 23, 2025
DreamO: A Unified Framework for Image CustomizationChong Mou, Yanze Wu, Wenxu Wu et al.
Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) has demonstrated strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their ability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process inputs of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.
76.0SYApr 26
Physics-Aware LLM-Based Probabilistic Wind Power Scenario Generation under Extreme Icing ConditionsLei Wang, Ying Zhang, Di Shi et al.
Accurately characterizing wind power uncertainty under icing and post-disaster conditions remains a critical challenge for resilient power system operation. To address this issue, this paper proposes a physics-aware large language model (LLM) framework for probabilistic wind power scenario generation under extreme icing conditions. The proposed framework integrates supervisory control and data acquisition (SCADA)-based physical modeling, multimodal tokenization, and a causal Transformer architecture trained in an autoregressive manner. A physics-aware decoding scheme effectively enforces rated power limits and ramping constraints on the generated trajectories while preserving stochastic diversity. Case studies using real wind turbine data show that the proposed method reproduces icing-induced power degradation and temporal variability observed during extreme weather. The resulting scenarios are physically consistent and high-fidelity, thereby significantly enhancing resilience assessment and recovery planning in renewable-integrated power systems.
CVDec 30, 2024
VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing ControlShaojin Wu, Fei Ding, Mengqi Huang et al.
While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.
CVDec 28, 2025
PoseStreamer: A Multi-modal Framework for 3D Tracking of Unseen Moving ObjectsHuiming Yang, Linglin Liao, Fei Ding et al.
Six-degree-of-freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in scenarios with high-speed object motion. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically for high-speed motion scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed motion scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.
CVApr 12, 2024
MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression RecognitionLinhuang Wang, Xin Kang, Fei Ding et al.
Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former simultaneously extracts temporal information while continually integrating multi-scale spatial information. This process culminates in the generation of multi-scale spatio-temporal features that are utilized for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations provide further validation of our approach's proficiency in leveraging spatio-temporal information within DFER.
85.5LGApr 20
Reducing Credit Assignment Variance via Counterfactual Reasoning PathsFei Ding, Yongkang Zhang, Yeling Peng et al.
Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards, leading to poor credit assignment conditions where the final feedback is evenly propagated across all intermediate decisions. This results in high gradient variance, unstable training, and numerous ineffective updates, ultimately causing the model to fail and preventing sustained improvement. We introduce a counterfactual comparison-based credit assignment framework, which samples multiple reasoning trajectories under the same input. By treating their differences as an implicit approximation of alternative decisions, we construct an implicit process-level advantage estimator that transforms sparse terminal rewards into step-sensitive learning signals. Based on this, we propose Implicit Behavior Policy Optimization (IBPO), which significantly improves training stability and performance upper bounds on mathematical and code reasoning benchmarks, pointing to a promising direction for unlocking the performance potential of LLMs.
LGJun 5, 2025
Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language ModelsFei Ding, Baiqiao Wang, Zijian Zeng et al.
The Group Relative Policy Optimization (GRPO) algorithm has demonstrated considerable success in enhancing the reasoning capabilities of large language models (LLMs), as evidenced by DeepSeek-R1. However, the absence of intermediate supervision in GRPO frequently leads to inefficient exploration dynamics. A single error in a complex reasoning chain can invalidate the entire solution, resulting in abrupt reward vanishing and compromising training stability. To address these challenges, we propose MGRPO (Multi-layer GRPO). MGRPO operates in two layers: the first layer employs standard GRPO to generate an initial response. This response, along with the original query, is then fed into a second-layer GRPO process. This second layer is specifically trained to identify and correct errors in the initial response, effectively creating a self-correction loop. This mechanism provides implicit process-level supervision by rewarding successful error correction, without requiring an explicit, densely-annotated reward model. Experimental results on several mathematical reasoning benchmarks demonstrate that MGRPO significantly outperforms standard GRPO, achieving superior performance by fostering both reasoning and self-correction abilities.
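To make the "reward vanishing" issue above concrete, here is a minimal sketch of the group-relative advantage that GRPO is commonly described as using: each sampled response is scored against the mean and standard deviation of its own group. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, and this is not the MGRPO implementation, only the baseline signal it builds on.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group (illustrative sketch).

    rewards: terminal rewards for several responses to the same query.
    eps: small constant (an assumption here) that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group yields a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response fails, all advantages are zero and nothing is learned.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this formulation, a group in which every response receives the same reward contributes no gradient. A second, correction-focused layer could help in such cases by producing additional rewarded outcomes.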
ROApr 28, 2025
Socially-Aware Autonomous Driving: Inferring Yielding Intentions for Safer InteractionsJing Wang, Yan Jin, Hamid Taghavifar et al.
Autonomous driving technology has advanced rapidly over the past decade, and it is becoming increasingly likely that autonomous vehicles (AVs) will soon coexist with human-driven vehicles (HVs) on the roads. Currently, safety and reliable decision-making remain significant challenges, particularly when AVs are navigating lane changes and interacting with surrounding HVs. Therefore, precise estimation of the intentions of surrounding HVs can assist AVs in making more reliable and safer lane-change decisions. This involves not only understanding their current behaviors but also predicting their future motions without any direct communication. However, distinguishing between the passing and yielding intentions of surrounding HVs remains difficult. To address this challenge, we propose a social intention estimation algorithm rooted in a Directed Acyclic Graph (DAG), coupled with a decision-making framework employing Deep Reinforcement Learning (DRL) algorithms. To evaluate the method's performance, the proposed framework is tested and applied in a lane-changing scenario within a simulated environment. Furthermore, the experiment results demonstrate how our approach enhances the ability of AVs to navigate lane changes safely and efficiently on roads.
CVFeb 2, 2022
Decoupled IoU Regression for Object DetectionYan Gao, Qimeng Wang, Xu Tang et al.
Non-maximum suppression (NMS) is widely used in object detection pipelines for removing duplicated bounding boxes. The inconsistency between the confidence for NMS and the real localization confidence seriously affects detection performance. Prior works propose to predict Intersection-over-Union (IoU) between bounding boxes and corresponding ground-truths to improve NMS, while accurately predicting IoU is still a challenging problem. We argue that the complex definition of IoU and feature misalignment make it difficult to predict IoU accurately. In this paper, we propose a novel Decoupled IoU Regression (DIR) model to handle these problems. The proposed DIR decouples the traditional localization confidence metric IoU into two new metrics, Purity and Integrity. Purity reflects the proportion of the object area in the detected bounding box, and Integrity refers to the completeness of the detected object area. Separately predicting Purity and Integrity can divide the complex mapping between the bounding box and its IoU into two clearer mappings and model them independently. In addition, a simple but effective feature realignment approach is also introduced to make the IoU regressor work in a hindsight manner, which can make the target mapping more stable. The proposed DIR can be conveniently integrated with existing two-stage detectors and significantly improve their performance. Through a simple implementation of DIR with HTC, we obtain 51.3% AP on MS COCO benchmark, which outperforms previous methods and achieves state-of-the-art.
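The decoupling described above can be checked with a little arithmetic. The sketch below assumes the object area is approximated by the ground-truth box, so that Purity is intersection over detected-box area and Integrity is intersection over ground-truth area. Under that assumption, IoU follows as 1 / (1/Purity + 1/Integrity - 1). The helper names are hypothetical and this is not the paper's code.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(overlap)


def purity_integrity(det, gt):
    """Purity = overlap / detected area; Integrity = overlap / ground-truth area."""
    inter = intersection_area(det, gt)
    return inter / box_area(det), inter / box_area(gt)


def iou_from_parts(purity, integrity):
    """Recombine the two decoupled metrics into IoU (0 if the boxes do not overlap)."""
    if purity <= 0.0 or integrity <= 0.0:
        return 0.0
    return 1.0 / (1.0 / purity + 1.0 / integrity - 1.0)


def iou_direct(det, gt):
    """Reference IoU computed straight from the box areas."""
    inter = intersection_area(det, gt)
    union = box_area(det) + box_area(gt) - inter
    return inter / union


det, gt = (0.0, 0.0, 4.0, 4.0), (2.0, 2.0, 6.0, 6.0)
p, q = purity_integrity(det, gt)
recombined = iou_from_parts(p, q)
reference = iou_direct(det, gt)
```

For these two boxes both routes give the same IoU, which illustrates why predicting the two simpler ratios separately loses no information about the original target.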
CVDec 17, 2020
Multi-shot Temporal Event Localization: a BenchmarkXiaolong Liu, Yao Hu, Song Bai et al.
Current developments in temporal event or action localization usually target actions captured by a single camera. However, extensive events or actions in the wild may be captured as a sequence of shots by multiple cameras at different positions. In this paper, we propose a new and challenging task called multi-shot temporal event localization, and accordingly, collect a large-scale dataset called MUlti-Shot EventS (MUSES). MUSES has 31,477 event instances for a total of 716 video hours. The core nature of MUSES is the frequent shot cuts, for an average of 19 shots per instance and 176 shots per video, which induces large intra-instance variations. Our comprehensive evaluations show that the state-of-the-art method in temporal action localization only achieves an mAP of 13.1% at IoU=0.5. As a minor contribution, we present a simple baseline approach for handling the intra-instance variations, which reports an mAP of 18.9% on MUSES and 56.9% on THUMOS14 at IoU=0.5. To facilitate research in this direction, we release the dataset and the project code at https://songbai.site/muses/.
CVDec 1, 2020
Multi-level Knowledge Distillation via Knowledge Alignment and CorrelationFei Ding, Yin Yang, Hongxin Hu et al.
Knowledge distillation (KD) has become an important technique for model compression and knowledge transfer. In this work, we first perform a comprehensive analysis of the knowledge transferred by different KD methods. We demonstrate that traditional KD methods, which minimize the KL divergence of softmax outputs between networks, are related to the knowledge alignment of an individual sample only. Meanwhile, recent contrastive learning-based KD methods mainly transfer relational knowledge between different samples, namely, knowledge correlation. Since it is important to transfer the full knowledge from teacher to student, we introduce Multi-level Knowledge Distillation (MLKD), which effectively considers both knowledge alignment and correlation. MLKD is task-agnostic and model-agnostic, and can easily transfer knowledge from supervised or self-supervised pretrained teachers. We show that MLKD can improve the reliability and transferability of learned representations. Experiments demonstrate that MLKD outperforms other state-of-the-art methods on a large number of experimental settings including different (a) pretraining strategies, (b) network architectures, (c) datasets, and (d) tasks.
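As a point of reference for the "traditional KD" objective mentioned above, the following sketch computes the KL divergence between teacher and student softmax outputs for a single sample. The temperature parameter and the function names are illustrative assumptions. It shows only the per-sample alignment term, not MLKD itself.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def kd_kl_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened output distributions."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


matched = kd_kl_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
mismatched = kd_kl_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

Because the loss depends only on one sample's teacher and student outputs, it carries no information about how different samples relate to each other, which is the gap that the correlation term described in the abstract is meant to cover.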
SYJun 24, 2020
Model-Free Voltage Regulation of Unbalanced Distribution Network Based on Surrogate Model and Deep Reinforcement LearningDi Cao, Junbo Zhao, Weihao Hu et al.
Accurate knowledge of the distribution system topology and parameters is required to achieve good voltage controls, but this is difficult to obtain in practice. This paper develops a model-free approach based on the surrogate model and deep reinforcement learning (DRL). We have also extended it to deal with unbalanced three-phase scenarios. The key idea is to learn a surrogate model to capture the relationship between the power injections and voltage fluctuation of each node from historical data instead of using the original inaccurate model affected by errors and uncertainties. This allows us to integrate the DRL with the learned surrogate model. In particular, DRL is applied to learn the optimal control strategy from the experiences obtained by continuous interactions with the surrogate model. The integrated framework contains training three networks, i.e., surrogate model, actor, and critic networks, which fully leverage the strong nonlinear fitting ability of deep learning and DRL for online decision making. Several single-phase approaches have also been extended to deal with three-phase unbalance scenarios and the simulation results on the IEEE 123-bus system show that our proposed method can achieve similar performance as those that use accurate physical models.
SYMay 31, 2020
Distributed Voltage Regulation of Active Distribution System Based on Enhanced Multi-agent Deep Reinforcement LearningDi Cao, Junbo Zhao, Weihao Hu et al.
This paper proposes a data-driven distributed voltage control approach based on the spectrum clustering and the enhanced multi-agent deep reinforcement learning (MADRL) algorithm. Via the unsupervised clustering, the whole distribution system can be decomposed into several sub-networks according to the voltage and reactive power sensitivity. Then, the distributed control problem of each sub-network is modeled as Markov games and solved by the enhanced MADRL algorithm, where each sub-network is modeled as an adaptive agent. Deep neural networks are used in each agent to approximate the policy function and the action value function. All agents are centrally trained to learn the optimal coordinated voltage regulation strategy while executed in a distributed manner to make decisions based on only local information. The proposed method can significantly reduce the requirements of communications and knowledge of system parameters. It also effectively deals with uncertainties and can provide online coordinated control based on the latest local information. Comparison results with other existing model-based and data-driven methods on IEEE 33-bus and 123-bus systems demonstrate the effectiveness and benefits of the proposed approach.
LGMar 18, 2020
Unsupervised Hierarchical Graph Representation Learning by Mutual Information MaximizationFei Ding, Xiaohong Zhang, Justin Sybrandt et al.
Graph representation learning based on graph neural networks (GNNs) can greatly improve the performance of downstream tasks, such as node and graph classification. However, the general GNN models do not aggregate node information in a hierarchical manner, and can miss key higher-order structural features of many graphs. The hierarchical aggregation also enables the graph representations to be explainable. In addition, supervised graph representation learning requires labeled data, which is expensive and error-prone. To address these issues, we present an unsupervised graph representation learning method, Unsupervised Hierarchical Graph Representation (UHGR), which can generate hierarchical representations of graphs. Our method focuses on maximizing mutual information between "local" and high-level "global" representations, which enables us to learn the node embeddings and graph embeddings without any labeled data. To demonstrate the effectiveness of the proposed method, we perform the node and graph classification using the learned node and graph embeddings. The results show that the proposed method achieves comparable results to state-of-the-art supervised methods on several benchmarks. In addition, our visualization of hierarchical representations indicates that our method can capture meaningful and interpretable clusters.
CVNov 20, 2019
Hierarchical Attention Networks for Medical Image SegmentationFei Ding, Gang Yang, Jinlu Liu et al.
Medical images are characterized by inter-class indistinction, high variability, and noise, which make pixel recognition challenging. Unlike previous self-attention based methods that capture context information from one level, we reformulate the self-attention mechanism from the view of the high-order graph and propose a novel method, namely Hierarchical Attention Network (HANet), to address the problem of medical image segmentation. Concretely, an HA module embedded in the HANet captures context information from neighbors of multiple levels, where these neighbors are extracted from the high-order graph. In the high-order graph, there is an edge between two nodes only if the correlation between them is high enough, which naturally reduces the noisy attention information caused by the inter-class indistinction. The proposed HA module is robust to the variance of input and can be flexibly inserted into existing convolutional neural networks. We conduct experiments on three medical image segmentation tasks including optic disc/cup segmentation, blood vessel segmentation, and lung segmentation. Extensive results show our method is more effective and robust than the existing state-of-the-art methods.
LGNov 13, 2019
Double cycle-consistent generative adversarial network for unsupervised conditional generationFei Ding, Feng Luo, Yin Yang
Conditional generative models have achieved considerable success in the past few years, but usually require a lot of labeled data. Recently, ClusterGAN combines GAN with an encoder to achieve remarkable clustering performance via unsupervised conditional generation. However, it ignores the real conditional distribution of data, which leads to generating less diverse samples for each class and makes the encoder only achieve sub-optimal clustering performance. Here, we propose a new unsupervised conditional generation framework, Double Cycle-Consistent Conditional GAN (DC3-GAN), which can generate diverse class-conditioned samples. We enforce the encoder and the generator of GAN to form an encoder-generator pair in addition to the generator-encoder pair, which enables us to avoid the low-diversity generation and the triviality of latent features. We train the encoder-generator pair using real data, which can indirectly estimate the real conditional distribution. Meanwhile, this framework enforces the outputs of the encoder to match the inputs of GAN and the prior noise distribution, which disentangles latent space into two parts: one-hot discrete and continuous latent variables. The former can be directly expressed as clusters and the latter represents remaining unspecified factors. This work demonstrates that enhancing the diversity of unsupervised conditional generated samples can improve the clustering performance. Experiments on different benchmark datasets show that the proposed method outperforms existing generative model-based clustering methods, and also achieves the optimal disentanglement performance.