CLAug 14, 2024Code
Bridging and Modeling Correlations in Pairwise Data for Direct Preference OptimizationYuxin Jiang, Bo Huang, Yufei Wang et al.
Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the generation of the winning response and the losing response within pairwise data are typically isolated, leading to weak correlations between them as well as suboptimal alignment performance. To address this issue, we propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. Firstly, we increase the consistency and informativeness of the pairwise preference signals through targeted modifications, synthesizing a pseudo-winning response by improving the losing response with the winning response as a reference. Secondly, we identify that DPO alone is insufficient to model these correlations and capture nuanced variations. Therefore, we propose learning token-level correlations by dynamically leveraging the policy model's confidence during training. Comprehensive experiments on QA, math, and instruction-following tasks demonstrate the effectiveness of our approach, significantly surpassing competitive baselines, including DPO. Additionally, our in-depth quantitative analysis reveals the reasons behind our method's superior performance over DPO and showcases its versatility to other DPO variants. We release our repository at https://github.com/YJiangcm/BMC.
CVJan 23Code
StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided BackdoorsQinkai Yu, Chong Zhang, Gaojie Jin et al.
Annotating medical data for training AI models is often costly and limited due to the shortage of specialists with relevant clinical expertise. This challenge is further compounded by privacy and ethical concerns associated with sensitive patient information. As a result, well-trained medical segmentation models on private datasets constitute valuable intellectual property requiring robust protection mechanisms. Existing model protection techniques primarily focus on classification and generative tasks, while segmentation models-crucial to medical image analysis-remain largely underexplored. In this paper, we propose a novel, stealthy, and harmless method, StealthMark, for verifying the ownership of medical segmentation models under black-box conditions. Our approach subtly modulates model uncertainty without altering the final segmentation outputs, thereby preserving the model's performance. To enable ownership verification, we incorporate model-agnostic explanation methods, e.g. LIME, to extract feature attributions from the model outputs. Under specific triggering conditions, these explanations reveal a distinct and verifiable watermark. We further design the watermark as a QR code to facilitate robust and recognizable ownership claims. We conducted extensive experiments across four medical imaging datasets and five mainstream segmentation models. The results demonstrate the effectiveness, stealthiness, and harmlessness of our method on the original model's segmentation performance. For example, when applied to the SAM model, StealthMark consistently achieved ASR above 95% across various datasets while maintaining less than a 1% drop in Dice and AUC scores, significantly outperforming backdoor-based watermarking methods and highlighting its strong potential for practical deployment. Our implementation code is made available at: https://github.com/Qinkaiyu/StealthMark.
LGMar 13, 2023
Bridging the Gap between Chemical Reaction Pretraining and Conditional Molecule Generation with a Unified ModelBo Qiang, Yiran Zhou, Yuheng Ding et al.
Chemical reactions are the fundamental building blocks of drug design and organic chemistry research. In recent years, there has been a growing need for a large-scale deep-learning framework that can efficiently capture the basic rules of chemical reactions. In this paper, we have proposed a unified framework that addresses both the reaction representation learning and molecule generation tasks, which allows for a more holistic approach. Inspired by the organic chemistry mechanism, we develop a novel pretraining framework that enables us to incorporate inductive biases into the model. Our framework achieves state-of-the-art results on challenging downstream tasks. By possessing chemical knowledge, our generative framework overcome the limitations of current molecule generation models that rely on a small number of reaction templates. In the extensive experiments, our model generates synthesizable drug-like structures of high quality. Overall, our work presents a significant step toward a large-scale deep-learning framework for a variety of reaction-based applications.
AISep 11, 2023Code
UniKG: A Benchmark and Universal Embedding for Large-Scale Knowledge GraphsYide Qiu, Shaoxiang Ling, Tong Zhang et al.
Irregular data in real-world are usually organized as heterogeneous graphs (HGs) consisting of multiple types of nodes and edges. To explore useful knowledge from real-world data, both the large-scale encyclopedic HG datasets and corresponding effective learning methods are crucial, but haven't been well investigated. In this paper, we construct a large-scale HG benchmark dataset named UniKG from Wikidata to facilitate knowledge mining and heterogeneous graph representation learning. Overall, UniKG contains more than 77 million multi-attribute entities and 2000 diverse association types, which significantly surpasses the scale of existing HG datasets. To perform effective learning on the large-scale UniKG, two key measures are taken, including (i) the semantic alignment strategy for multi-attribute entities, which projects the feature description of multi-attribute nodes into a common embedding space to facilitate node aggregation in a large receptive field; (ii) proposing a novel plug-and-play anisotropy propagation module (APM) to learn effective multi-hop anisotropy propagation kernels, which extends methods of large-scale homogeneous graphs to heterogeneous graphs. These two strategies enable efficient information propagation among a tremendous number of multi-attribute entities and meantimes adaptively mine multi-attribute association through the multi-hop aggregation in large-scale HGs. We set up a node classification task on our UniKG dataset, and evaluate multiple baseline methods which are constructed by embedding our APM into large-scale homogenous graph learning methods. Our UniKG dataset and the baseline codes have been released at https://github.com/Yide-Qiu/UniKG.
AIApr 14Code
IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric CalibrationYanji He, Yuxin Jiang, Yiwen Wu et al.
Large Language Models are increasingly deployed for decision-making, yet their adoption in high-stakes domains remains limited by miscalibrated probabilities, unfaithful explanations, and inability to incorporate expert knowledge precisely. We propose IDEA, a framework that extracts LLM decision knowledge into an interpretable parametric model over semantically meaningful factors. Through joint learning of verbal-to-numerical mappings and decision parameters via EM, correlated sampling that preserves factor dependencies, and direct parameter editing with mathematical guarantees, IDEA produces calibrated probabilities while enabling quantitative human-AI collaboration. Experiments across five datasets show IDEA with Qwen-3-32B (78.6%) outperforms DeepSeek R1 (68.1%) and GPT-5.2 (77.9%), achieving perfect factor exclusion and exact calibration -- precision unattainable through prompting alone. The implementation is publicly available at https://github.com/leonbig/IDEA.
CVOct 15, 2023Code
Bounding and Filling: A Fast and Flexible Framework for Image CaptioningZheng Ma, Changxin Wang, Bo Huang et al.
Most image captioning models following an autoregressive manner suffer from significant inference latency. Several models adopted a non-autoregressive manner to speed up the process. However, the vanilla non-autoregressive manner results in subpar performance, since it generates all words simultaneously, which fails to capture the relationships between words in a description. The semi-autoregressive manner employs a partially parallel method to preserve performance, but it sacrifices inference speed. In this paper, we introduce a fast and flexible framework for image captioning called BoFiCap based on bounding and filling techniques. The BoFiCap model leverages the inherent characteristics of image captioning tasks to pre-define bounding boxes for image regions and their relationships. Subsequently, the BoFiCap model fills corresponding words in each box using two-generation manners. Leveraging the box hints, our filling process allows each word to better perceive other words. Additionally, our model offers flexible image description generation: 1) by employing different generation manners based on speed or performance requirements, 2) producing varied sentences based on user-specified boxes. Experimental evaluations on the MS-COCO benchmark dataset demonstrate that our framework in a non-autoregressive manner achieves the state-of-the-art on task-specific metric CIDEr (125.6) while speeding up 9.22x than the baseline model with an autoregressive manner; in a semi-autoregressive manner, our method reaches 128.4 on CIDEr while a 3.69x speedup. Our code and data is available at https://github.com/ChangxinWang/BoFiCap.
CVDec 5, 2023Code
SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary ConstraintsXianping Ma, Qianqian Wu, Xingyu Zhao et al.
Semantic segmentation of remote sensing imagery plays a pivotal role in extracting precise information for diverse down-stream applications. Recent development of the Segment Anything Model (SAM), an advanced general-purpose segmentation model, has revolutionized this field, presenting new avenues for accurate and efficient segmentation. However, SAM is limited to generating segmentation results without class information. Consequently, the utilization of such a powerful general vision model for semantic segmentation in remote sensing images has become a focal point of research. In this paper, we present a streamlined framework aimed at leveraging the raw output of SAM by exploiting two novel concepts called SAM-Generated Object (SGO) and SAM-Generated Boundary (SGB). More specifically, we propose a novel object loss and further introduce a boundary loss as augmentative components to aid in model optimization in a general semantic segmentation framework. Taking into account the content characteristics of SGO, we introduce the concept of object consistency to leverage segmented regions lacking semantic information. By imposing constraints on the consistency of predicted values within objects, the object loss aims to enhance semantic segmentation performance. Furthermore, the boundary loss capitalizes on the distinctive features of SGB by directing the model's attention to the boundary information of the object. Experimental results on two well-known datasets, namely ISPRS Vaihingen and LoveDA Urban, demonstrate the effectiveness of our proposed method. The source code for this work will be accessible at https://github.com/sstary/SSRS.
LGNov 23, 2023
Which Matters Most in Making Fund Investment Decisions? A Multi-granularity Graph Disentangled Learning FrameworkChunjing Gan, Binbin Hu, Bo Huang et al.
In this paper, we highlight that both conformity and risk preference matter in making fund investment decisions beyond personal interest and seek to jointly characterize these aspects in a disentangled manner. Consequently, we develop a novel M ulti-granularity Graph Disentangled Learning framework named MGDL to effectively perform intelligent matching of fund investment products. Benefiting from the well-established fund graph and the attention module, multi-granularity user representations are derived from historical behaviors to separately express personal interest, conformity and risk preference in a fine-grained way. To attain stronger disentangled representations with specific semantics, MGDL explicitly involve two self-supervised signals, i.e., fund type based contrasts and fund popularity. Extensive experiments in offline and online environments verify the effectiveness of MGDL.
AIDec 18, 2025Code
StarCraft+: Benchmarking Multi-agent Algorithms in Adversary ParadigmYadong Li, Tong Zhang, Bo Huang et al.
Deep multi-agent reinforcement learning (MARL) algorithms are booming in the field of collaborative intelligence, and StarCraft multi-agent challenge (SMAC) is widely-used as the benchmark therein. However, imaginary opponents of MARL algorithms are practically configured and controlled in a fixed built-in AI mode, which causes less diversity and versatility in algorithm evaluation. To address this issue, in this work, we establish a multi-agent algorithm-vs-algorithm environment, named StarCraft II battle arena (SC2BA), to refresh the benchmarking of MARL algorithms in an adversary paradigm. Taking StarCraft as infrastructure, the SC2BA environment is specifically created for inter-algorithm adversary with the consideration of fairness, usability and customizability, and meantime an adversarial PyMARL (APyMARL) library is developed with easy-to-use interfaces/modules. Grounding in SC2BA, we benchmark those classic MARL algorithms in two types of adversarial modes: dual-algorithm paired adversary and multi-algorithm mixed adversary, where the former conducts the adversary of pairwise algorithms while the latter focuses on the adversary to multiple behaviors from a group of algorithms. The extensive benchmark experiments exhibit some thought-provoking observations/problems in the effectivity, sensibility and scalability of these completed algorithms. The SC2BA environment as well as reproduced experiments are released in \href{https://github.com/dooliu/SC2BA}{Github}, and we believe that this work could mark a new step for the MARL field in the coming years.
SYMar 13, 2024Code
LLM-Assisted Light: Leveraging Large Language Model Capabilities for Human-Mimetic Traffic Signal Control in Complex Urban EnvironmentsMaonan Wang, Aoyu Pang, Yuheng Kan et al.
Traffic congestion in metropolitan areas presents a formidable challenge with far-reaching economic, environmental, and societal ramifications. Therefore, effective congestion management is imperative, with traffic signal control (TSC) systems being pivotal in this endeavor. Conventional TSC systems, designed upon rule-based algorithms or reinforcement learning (RL), frequently exhibit deficiencies in managing the complexities and variabilities of urban traffic flows, constrained by their limited capacity for adaptation to unfamiliar scenarios. In response to these limitations, this work introduces an innovative approach that integrates Large Language Models (LLMs) into TSC, harnessing their advanced reasoning and decision-making faculties. Specifically, a hybrid framework that augments LLMs with a suite of perception and decision-making tools is proposed, facilitating the interrogation of both the static and dynamic traffic information. This design places the LLM at the center of the decision-making process, combining external traffic data with established TSC methods. Moreover, a simulation platform is developed to corroborate the efficacy of the proposed framework. The findings from our simulations attest to the system's adeptness in adjusting to a multiplicity of traffic environments without the need for additional training. Notably, in cases of Sensor Outage (SO), our approach surpasses conventional RL-based systems by reducing the average waiting time by $20.4\%$. This research signifies a notable advance in TSC strategies and paves the way for the integration of LLMs into real-world, dynamic scenarios, highlighting their potential to revolutionize traffic management. The related code is available at https://github.com/Traffic-Alpha/LLM-Assisted-Light.
CVOct 15, 2024Code
MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic SegmentationXianping Ma, Xiaokang Zhang, Man-On Pun et al.
Multimodal remote sensing data, collected from a variety of sensors, provide a comprehensive and integrated perspective of the Earth's surface. By employing multimodal fusion techniques, semantic segmentation offers more detailed insights into geographic scenes compared to single-modality approaches. Building upon recent advancements in vision foundation models, particularly the Segment Anything Model (SAM), this study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation. At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data. In addition, a pyramid-based Deep Fusion Module (DFM) is incorporated to further integrate high-level geographic features across multiple scales before decoding. This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data. Experimental results on two well-established fine-resolution multimodal remote sensing datasets, ISPRS Vaihingen and ISPRS Potsdam, confirm that the proposed MANet significantly surpasses current models in the task of multimodal semantic segmentation. The source code for this work will be accessible at https://github.com/sstary/SSRS.
AIJan 18
Holos: A Web-Scale LLM-Based Multi-Agent System for the Agentic WebXiaohang Nie, Zihan Guo, Zicai Cui et al.
As large language models (LLM)-driven agents transition from isolated task solvers to persistent digital entities, the emergence of the Agentic Web, an ecosystem where heterogeneous agents autonomously interact and co-evolve, marks a pivotal shift toward Artificial General Intelligence (AGI). However, LLM-based multi-agent systems (LaMAS) are hindered by open-world issues such as scaling friction, coordination breakdown, and value dissipation. To address these challenges, we introduce Holos, a web-scale LaMAS architected for long-term ecological persistence. Holos adopts a five-layer architecture, with core modules primarily featuring the Nuwa engine for high-efficiency agent generation and hosting, a market-driven Orchestrator for resilient coordination, and an endogenous value cycle to achieve incentive compatibility. By bridging the gap between micro-level collaboration and macro-scale emergence, Holos hopes to lay the foundation for the next generation of the self-organizing and continuously evolving Agentic Web. We have publicly released Holos (accessible at https://holosai.io), providing a resource for the community and a testbed for future research in large-scale agentic ecosystems.
CVOct 30, 2025
FullPart: Generating each 3D Part at Full ResolutionLihe Ding, Shaocong Dong, Yaokun Li et al.
Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method - even small ones - is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and model to benefit future research in 3D part generation.
HEP-PHApr 8, 2024Code
Xiwu: A Basis Flexible and Learnable LLM for High Energy PhysicsZhengde Zhang, Yiyu Zhang, Haodong Yao et al.
Large Language Models (LLMs) are undergoing a period of rapid updates and changes, with state-of-the-art (SOTA) model frequently being replaced. When applying LLMs to a specific scientific field, it's challenging to acquire unique domain knowledge while keeping the model itself advanced. To address this challenge, a sophisticated large language model system named as Xiwu has been developed, allowing you switch between the most advanced foundation models and quickly teach the model domain knowledge. In this work, we will report on the best practices for applying LLMs in the field of high-energy physics (HEP), including: a seed fission technology is proposed and some data collection and cleaning tools are developed to quickly obtain domain AI-Ready dataset; a just-in-time learning system is implemented based on the vector store technology; an on-the-fly fine-tuning system has been developed to facilitate rapid training under a specified foundation model. The results show that Xiwu can smoothly switch between foundation models such as LLaMA, Vicuna, ChatGLM and Grok-1. The trained Xiwu model is significantly outperformed the benchmark model on the HEP knowledge question-and-answering and code generation. This strategy significantly enhances the potential for growth of our model's performance, with the hope of surpassing GPT-4 as it evolves with the development of open-source models. This work provides a customized LLM for the field of HEP, while also offering references for applying LLM to other fields, the corresponding codes are available on Github.
CVDec 28, 2025
TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long ContextsHao Zhang, Mengsi Lyu, Bo Huang et al.
Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into Original Model sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach can reduce up to 80% of visual tokens while maintaining performance in long context settings.
AIMay 9
From Holo Pockets to Electron Density: GPT-style Drug Design with DensityJiahao Chen, Letian Gao, Yanhao Zhu et al.
Recent advances in generative modeling have enabled significant progress in structure-based drug design (SBDD). Existing methods typically condition molecule generation on empty binding pockets from holo complexes, overlooking informative components such as the filler (ligands and solvent). Here, we leverage low-resolution electron density (ED) derived from the filler as a physically grounded condition for \textit{de novo} drug design. We consider two types of ED, calculated and cryo-EM/X-ray, obtainable from computational or experimental sources, supporting unified pre-training and experimental integration. Compared with rigid pocket representations, experimental ED naturally captures conformational flexibility and provides a more faithful description of the binding environment. Based on this, we introduce EDMolGPT, a decoder-only autoregressive framework that generates molecules from low-resolution ED point clouds. By grounding generation in physically meaningful density signals, EDMolGPT mitigates structural bias and produces molecules with 3D conformations. Evaluations on 101 biological targets verify the effectiveness. Our project page: https://jiahaochen1.github.io/EDMolGPT_Page/.
LGSep 11, 2025
Sensitivity-LoRA: Low-Load Sensitivity-Based Fine-Tuning for Large Language ModelsHao Zhang, Bo Huang, Zhenjia Li et al.
Large Language Models (LLMs) have transformed both everyday life and scientific research. However, adapting LLMs from general-purpose models to specialized tasks remains challenging, particularly in resource-constrained environments. Low-Rank Adaptation (LoRA), a prominent method within Parameter-Efficient Fine-Tuning (PEFT), has emerged as a promising approach to LLMs by approximating model weight updates using low-rank decomposition. However, LoRA is limited by its uniform rank ( r ) allocation to each incremental matrix, and existing rank allocation techniques aimed at addressing this issue remain computationally inefficient, complex, and unstable, hindering practical applications. To address these limitations, we propose Sensitivity-LoRA, an efficient fine-tuning method that dynamically allocates ranks to weight matrices based on both their global and local sensitivities. It leverages the second-order derivatives (Hessian Matrix) of the loss function to effectively capture weight sensitivity, enabling optimal rank allocation with minimal computational overhead. Our experimental results have demonstrated robust effectiveness, efficiency and stability of Sensitivity-LoRA across diverse tasks and benchmarks.
AIFeb 1, 2025
Understanding and Optimizing Agentic Workflows via Shapley valueYingxuan Yang, Bo Huang, Siyuan Qi et al.
Agentic workflows have become the dominant paradigm for building complex AI systems, orchestrating specialized components, such as planning, reasoning, action execution, and reflection, to tackle sophisticated real-world tasks. However, systematically analyzing and optimizing these workflows remains challenging due to intricate component interdependencies and the lack of principled attribution methods. In this work, we introduce ShapleyFlow, the first framework that employs cooperative game theory to analyze and optimize agentic workflows. By applying the Shapley value to evaluate all possible component configurations, ShapleyFlow enables fine-grained attribution of each component's contribution and facilitates the identification of task-specific optimal configurations. Through a constructed dataset evaluated across 7 scenarios, such as navigation, math and OS, we demonstrate 3 key contributions: (1) Theoretical Framework: a principled game-theoretic approach for the attribution of contributions in agentic workflows. (2) Optimal Workflow Discovery: ShapleyFlow identifies task-specific component configurations that consistently outperform workflows relying on a single LLM across all tested tasks. (3) Comprehensive Analysis: we construct and analyze over 1,500 tasks, providing actionable insights and design guidelines for optimizing workflows across multiple domains.
CVDec 14, 2023
Dataset Distillation via Adversarial Prediction MatchingMingyang Chen, Bo Huang, Junda Lu et al.
Dataset distillation is the technique of synthesizing smaller condensed datasets from large original datasets while retaining necessary information to persist the effect. In this paper, we approach the dataset distillation problem from a novel perspective: we regard minimizing the prediction discrepancy on the real data distribution between models, which are respectively trained on the large original dataset and on the small distilled dataset, as a conduit for condensing information from the raw data into the distilled version. An adversarial framework is proposed to solve the problem efficiently. In contrast to existing distillation methods involving nested optimization or long-range gradient unrolling, our approach hinges on single-level optimization. This ensures the memory efficiency of our method and provides a flexible tradeoff between time and memory budgets, allowing us to distil ImageNet-1K using a minimum of only 6.5GB of GPU memory. Under the optimal tradeoff strategy, it requires only 2.5$\times$ less memory and 5$\times$ less runtime compared to the state-of-the-art. Empirically, our method can produce synthetic datasets just 10% the size of the original, yet achieve, on average, 94% of the test accuracy of models trained on the full original datasets including ImageNet-1K, significantly surpassing state-of-the-art. Additionally, extensive tests reveal that our distilled datasets excel in cross-architecture generalization capabilities.
CVJan 22, 2024
MetaSeg: Content-Aware Meta-Net for Omni-Supervised Semantic SegmentationShenwang Jiang, Jianan Li, Ying Wang et al.
Noisy labels, inevitably existing in pseudo segmentation labels generated from weak object-level annotations, severely hampers model optimization for semantic segmentation. Previous works often rely on massive hand-crafted losses and carefully-tuned hyper-parameters to resist noise, suffering poor generalization capability and high model complexity. Inspired by recent advances in meta learning, we argue that rather than struggling to tolerate noise hidden behind clean labels passively, a more feasible solution would be to find out the noisy regions actively, so as to simply ignore them during model optimization. With this in mind, this work presents a novel meta learning based semantic segmentation method, MetaSeg, that comprises a primary content-aware meta-net (CAM-Net) to sever as a noise indicator for an arbitrary segmentation model counterpart. Specifically, CAM-Net learns to generate pixel-wise weights to suppress noisy regions with incorrect pseudo labels while highlighting clean ones by exploiting hybrid strengthened features from image content, providing straightforward and reliable guidance for optimizing the segmentation model. Moreover, to break the barrier of time-consuming training when applying meta learning to common large segmentation models, we further present a new decoupled training strategy that optimizes different model layers in a divide-and-conquer manner. Extensive experiments on object, medical, remote sensing and human segmentation shows that our method achieves superior performance, approaching that of fully supervised settings, which paves a new promising way for omni-supervised semantic segmentation.
CVMar 10, 2025
AttenST: A Training-Free Attention-Driven Style Transfer Framework with Pre-Trained Diffusion ModelsBo Huang, Wenlun Xu, Qizhuo Han et al.
While diffusion models have achieved remarkable progress in style transfer tasks, existing methods typically rely on fine-tuning or optimizing pre-trained models during inference, leading to high computational costs and challenges in balancing content preservation with style integration. To address these limitations, we introduce AttenST, a training-free attention-driven style transfer framework. Specifically, we propose a style-guided self-attention mechanism that conditions self-attention on the reference style by retaining the query of the content image while substituting its key and value with those from the style image, enabling effective style feature integration. To mitigate style information loss during inversion, we introduce a style-preserving inversion strategy that refines inversion accuracy through multiple resampling steps. Additionally, we propose a content-aware adaptive instance normalization, which integrates content statistics into the normalization process to optimize style fusion while mitigating the content degradation. Furthermore, we introduce a dual-feature cross-attention mechanism to fuse content and style features, ensuring a harmonious synthesis of structural fidelity and stylistic expression. Extensive experiments demonstrate that AttenST outperforms existing methods, achieving state-of-the-art performance in style transfer dataset.
MAApr 5
Agentization of Digital Assets for the Agentic Web: Concepts, Techniques, and BenchmarkLinyao Chen, Bo Huang, Qinlao Zhao et al.
Agentic Web, as a new paradigm that redefines the internet through autonomous, goal-driven interactions, plays an important role in group intelligence. As the foundational semantic primitives of the Agentic Web, digital assets encapsulate interactive web elements into agents, which expand the capacities and coverage of agents in agentic web. The lack of automated methodologies for agent generation limits the wider usage of digital assets and the advancement of the Agentic Web. In this paper, we first formalize these challenges by strictly defining the A2A-Agentization process, decomposing it into critical stages and identifying key technical hurdles on top of the A2A protocol. Based on this framework, we develop an Agentization Agent to agentize digital assets for the Agentic Web. To rigorously evaluate this capability, we propose A2A-Agentization Bench, the first benchmark explicitly designed to evaluate agentization quality in terms of fidelity and interoperability. Our experiments demonstrate that our approach effectively activates the functional capabilities of digital assets and enables interoperable A2A multi-agent collaboration. We believe this work will further facilitate scalable and standardized integration of digital assets into the Agentic Web ecosystem.
CVNov 24, 2025
Diffusion Reconstruction-based Data Likelihood Estimation for Core-Set SelectionMingyang Chen, Jiawei Du, Bo Huang et al.
Existing core-set selection methods predominantly rely on heuristic scoring signals such as training dynamics or model uncertainty, lacking explicit modeling of data likelihood. This omission may hinder the constructed subset from capturing subtle yet critical distributional structures that underpin effective model training. In this work, we propose a novel, theoretically grounded approach that leverages diffusion models to estimate data likelihood via reconstruction deviation induced by partial reverse denoising. Specifically, we establish a formal connection between reconstruction error and data likelihood, grounded in the Evidence Lower Bound (ELBO) of Markovian diffusion processes, thereby enabling a principled, distribution-aware scoring criterion for data selection. Complementarily, we introduce an efficient information-theoretic method to identify the optimal reconstruction timestep, ensuring that the deviation provides a reliable signal indicative of underlying data likelihood. Extensive experiments on ImageNet demonstrate that reconstruction deviation offers an effective scoring criterion, consistently outperforming existing baselines across selection ratios, and closely matching full-data training using only 50% of the data. Further analysis shows that the likelihood-informed nature of our score reveals informative insights in data selection, shedding light on the interplay between data distributional characteristics and model learning preferences.
LGOct 3, 2025
HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing PerformanceHao Zhang, Zhenjia Li, Runfeng Bao et al.
Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), has emerged as a promising approach to fine-tuning large language models(LLMs) while reducing computational and memory overhead. However, LoRA assumes a uniform rank \textit{r} for each incremental matrix, not accounting for the varying significance of weight matrices across different modules and layers. AdaLoRA leverages Singular Value Decomposition (SVD) to parameterize updates and employs pruning of singular values to introduce dynamic rank allocation, thereby enhancing adaptability. However, during the training process, it often encounters issues of slow convergence speed and high computational overhead. To address this issue, we propose HyperAdaLoRA, a novel framework that accelerates the convergence of AdaLoRA by leveraging a hypernetwork. Instead of directly optimizing the components of Singular Value Decomposition $(P, Λ, Q)$, HyperAdaLoRA employs a hypernetwork based on attention mechanisms to dynamically generate these parameters. By pruning the outputs of the hypernetwork that generates the singular values, dynamic rank allocation is achieved. Comprehensive experiments on various datasets and models demonstrate that our method achieves faster convergence without sacrificing performance. Additionally, further extension experiments on other LoRA-based approaches validate the broad applicability of our method.
LGJul 29, 2025
GraphTorque: Torque-Driven Rewiring Graph Neural NetworkSujia Huang, Lele Fu, Zhen Cui et al.
Graph Neural Networks (GNNs) have emerged as powerful tools for learning from graph-structured data, leveraging message passing to diffuse information and update node representations. However, most efforts have suggested that native interactions encoded in the graph may not be friendly for this process, motivating the development of graph rewiring methods. In this work, we propose a torque-driven hierarchical rewiring strategy, inspired by the notion of torque in classical mechanics, dynamically modulating message passing to improve representation learning in heterophilous and homophilous graphs. Specifically, we define the torque by treating the feature distance as a lever arm vector and the neighbor feature as a force vector weighted by the homophily disparity between nodes. We use the metric to hierarchically reconfigure receptive field of each layer by judiciously pruning high-torque edges and adding low-torque links, suppressing the impact of irrelevant information and boosting pertinent signals during message passing. Extensive evaluations on benchmark datasets show that the proposed approach surpasses state-of-the-art rewiring methods on both heterophilous and homophilous graphs.
CVJan 5, 2025
Fitting Different Interactive Information: Joint Classification of Emotion and IntentionXinger Li, Zhiqiang Zhong, Bo Huang et al.
This paper is the first-place solution for ICASSP MEIJU@2025 Track I, which focuses on low-resource multimodal emotion and intention recognition. How to effectively utilize a large amount of unlabeled data, while ensuring the mutual promotion of different difficulty levels tasks in the interaction stage, these two points become the key to the competition. In this paper, pseudo-label labeling is carried out on the model trained with labeled data, and samples with high confidence and their labels are selected to alleviate the problem of low resources. At the same time, the characteristic of easy represented ability of intention recognition found in the experiment is used to make mutually promote with emotion recognition under different attention heads, and higher performance of intention recognition is achieved through fusion. Finally, under the refined processing data, we achieve the score of 0.5532 in the Test set, and win the championship of the track.
LGMay 17, 2023
Generation of 3D Molecules in Pockets via Language ModelWei Feng, Lvwei Wang, Zaiyun Lin et al.
Generative models for molecules based on sequential line notation (e.g. SMILES) or graph representation have attracted an increasing interest in the field of structure-based drug design, but they struggle to capture important 3D spatial interactions and often produce undesirable molecular structures. To address these challenges, we introduce Lingo3DMol, a pocket-based 3D molecule generation method that combines language models and geometric deep learning technology. A new molecular representation, fragment-based SMILES with local and global coordinates, was developed to assist the model in learning molecular topologies and atomic spatial positions. Additionally, we trained a separate noncovalent interaction predictor to provide essential binding pattern information for the generative model. Lingo3DMol can efficiently traverse drug-like chemical spaces, preventing the formation of unusual structures. The Directory of Useful Decoys-Enhanced (DUD-E) dataset was used for evaluation. Lingo3DMol outperformed state-of-the-art methods in terms of drug-likeness, synthetic accessibility, pocket binding mode, and molecule generation speed.
LGDec 30, 2021
Delving into Sample Loss Curve to Embrace Noisy and Imbalanced DataShenwang Jiang, Jianan Li, Ying Wang et al.
Corrupted labels and class imbalance are commonly encountered in practically collected training data, which easily leads to over-fitting of deep neural networks (DNNs). Existing approaches alleviate these issues by adopting a sample re-weighting strategy, which is to re-weight sample by designing weighting function. However, it is only applicable for training data containing only either one type of data biases. In practice, however, biased samples with corrupted labels and of tailed classes commonly co-exist in training data. How to handle them simultaneously is a key but under-explored problem. In this paper, we find that these two types of biased samples, though have similar transient loss, have distinguishable trend and characteristics in loss curves, which could provide valuable priors for sample weight assignment. Motivated by this, we delve into the loss curves and propose a novel probe-and-allocate training strategy: In the probing stage, we train the network on the whole biased training data without intervention, and record the loss curve of each sample as an additional attribute; In the allocating stage, we feed the resulting attribute to a newly designed curve-perception network, named CurveNet, to learn to identify the bias type of each sample and assign proper weights through meta-learning adaptively. The training speed of meta learning also blocks its application. To solve it, we propose a method named skip layer meta optimization (SLMO) to accelerate training speed by skipping the bottom layers. Extensive synthetic and real experiments well validate the proposed method, which achieves state-of-the-art performance on multiple challenging benchmarks.
CYNov 29, 2021
MDistMult: A Multiple Scoring Functions Model for Link Prediction on Antiviral Drugs Knowledge GraphWeichuan Wang, Zhiwen Xie, Jin Liu et al.
Knowledge graphs (KGs) on COVID-19 have been constructed to accelerate the research process of COVID-19. However, KGs are always incomplete, especially the new constructed COVID-19 KGs. Link prediction task aims to predict missing entities for (e, r, t) or (h, r, e), where h and t are certain entities, e is an entity that needs to be predicted and r is a relation. This task also has the potential to solve COVID-19 related KGs' incomplete problem. Although various knowledge graph embedding (KGE) approaches have been proposed to the link prediction task, these existing methods suffer from the limitation of using a single scoring function, which fails to capture rich features of COVID-19 KGs. In this work, we propose the MDistMult model that leverages multiple scoring functions to extract more features from existing triples. We employ experiments on the CCKS2020 COVID-19 Antiviral Drugs Knowledge Graph (CADKG). The experimental results demonstrate that our MDistMult achieves state-of-the-art performance in link prediction task on the CADKG dataset
LGNov 24, 2020
WeiPS: a symmetric fusion model framework for large-scale online learningXiang Yu, Fuping Chu, Junqi Wu et al.
The recommendation system is an important commercial application of machine learning, where billions of feed views in the information flow every day. In reality, the interaction between user and item usually makes user's interest changing over time, thus many companies (e.g. ByteDance, Baidu, Alibaba, and Weibo) employ online learning as an effective way to quickly capture user interests. However, hundreds of billions of model parameters present online learning with challenges for real-time model deployment. Besides, model stability is another key point for online learning. To this end, we design and implement a symmetric fusion online learning system framework called WeiPS, which integrates model training and model inference. Specifically, WeiPS carries out second level model deployment by streaming update mechanism to satisfy the consistency requirement. Moreover, it uses multi-level fault tolerance and real-time domino degradation to achieve high availability requirement.
AIAug 22, 2020
Reimagining City Configuration: Automated Urban Planning via Adversarial LearningDongjie Wang, Yanjie Fu, Pengyang Wang et al.
Urban planning refers to the efforts of designing land-use configurations. Effective urban planning can help to mitigate the operational and social vulnerability of a urban system, such as high tax, crimes, traffic congestion and accidents, pollution, depression, and anxiety. Due to the high complexity of urban systems, such tasks are mostly completed by professional planners. But, human planners take longer time. The recent advance of deep learning motivates us to ask: can machines learn at a human capability to automatically and quickly calculate land-use configuration, so human planners can finally adjust machine-generated plans for specific needs? To this end, we formulate the automated urban planning problem into a task of learning to configure land-uses, given the surrounding spatial contexts. To set up the task, we define a land-use configuration as a longitude-latitude-channel tensor, where each channel is a category of POIs and the value of an entry is the number of POIs. The objective is then to propose an adversarial learning framework that can automatically generate such tensor for an unplanned area. In particular, we first characterize the contexts of surrounding areas of an unplanned area by learning representations from spatial graphs using geographic and human mobility data. Second, we combine each unplanned area and its surrounding context representation as a tuple, and categorize all the tuples into positive (well-planned areas) and negative samples (poorly-planned areas). Third, we develop an adversarial land-use configuration approach, where the surrounding context representation is fed into a generator to generate a land-use configuration, and a discriminator learns to distinguish among positive and negative samples.
IVAug 12, 2020
An Inter- and Intra-Band Loss for Pansharpening Convolutional Neural NetworksJiajun Cai, Bo Huang
Pansharpening aims to fuse panchromatic and multispectral images from the satellite to generate images with both high spatial and spectral resolution. With the successful applications of deep learning in the computer vision field, a lot of scholars have proposed many convolutional neural networks (CNNs) to solve the pansharpening task. These pansharpening networks focused on various distinctive structures of CNNs, and most of them are trained by L2 loss between fused images and simulated desired multispectral images. However, L2 loss is designed to directly minimize the difference of spectral information of each band, which does not consider the inter-band relations in the training process. In this letter, we propose a novel inter- and intra-band (IIB) loss to overcome the drawback of original L2 loss. Our proposed IIB loss can effectively preserve both inter- and intra-band relations and can be directly applied to different pansharpening CNNs.
IRAug 6, 2020
UBER-GNN: A User-Based Embeddings Recommendation based on Graph Neural NetworksBo Huang, Ye Bi, Zhenyu Wu et al.
The problem of session-based recommendation aims to predict user next actions based on session histories. Previous methods models session histories into sequences and estimate user latent features by RNN and GNN methods to make recommendations. However under massive-scale and complicated financial recommendation scenarios with both virtual and real commodities , such methods are not sufficient to represent accurate user latent features and neglect the long-term characteristics of users. To take long-term preference and dynamic interests into account, we propose a novel method, i.e. User-Based Embeddings Recommendation with Graph Neural Network, UBER-GNN for brevity. UBER-GNN takes advantage of structured data to generate longterm user preferences, and transfers session sequences into graphs to generate graph-based dynamic interests. The final user latent feature is then represented as the composition of the long-term preferences and the dynamic interests using attention mechanism. Extensive experiments conducted on real Ping An scenario show that UBER-GNN outperforms the state-of-the-art session-based recommendation methods.
CVMar 28, 2020
Multiform Fonts-to-Fonts Translation via Style and Content Disentangled Representations of Chinese CharacterFenxi Xiao, Jie Zhang, Bo Huang et al.
This paper mainly discusses the generation of personalized fonts as the problem of image style transfer. The main purpose of this paper is to design a network framework that can extract and recombine the content and style of the characters. These attempts can be used to synthesize the entire set of fonts with only a small amount of characters. The paper combines various depth networks such as Convolutional Neural Network, Multi-layer Perceptron and Residual Network to find the optimal model to extract the features of the fonts character. The result shows that those characters we have generated is very close to real characters, using Structural Similarity index and Peak Signal-to-Noise Ratio evaluation criterions.
CVMar 27, 2020
Automatic Generation of Chinese Handwriting via Fonts Style Representation LearningFenxi Xiao, Bo Huang, Xia Wu
In this paper, we propose and end-to-end deep Chinese font generation system. This system can generate new style fonts by interpolation of latent style-related embeding variables that could achieve smooth transition between different style. Our method is simpler and more effective than other methods, which will help to improve the font design efficiency
SPMar 16, 2020
Inverse design of multilayer nanoparticles using artificial neural networks and genetic algorithmCankun Qiu, Zhi Luo, Xia Wu et al.
The light scattering of multilayer nanoparticles can be solved by Maxwell equations. However, it is difficult to solve the inverse design of multilayer nanoparticles by using the traditional trial-and-error method. Here, we present a method for forward simulation and inverse design of multilayer nanoparticles. We combine the global search ability of genetic algorithm with the local search ability of neural network. First, the genetic algorithm is used to find a suitable solution, and then the neural network is used to fine-tune it. Due to the non-unique relationship between physical structures and optical responses, we first train a forward neural network, and then it is applied to the inverse design of multilayer nanoparticles. Not only here, this method can easily be extended to predict and find the best design parameters for other optical structures.
IVMar 16, 2020
u-net CNN based fourier ptychographyYican Chen, Zhi Luo, Xia Wu et al.
Fourier ptychography is a recently explored imaging method for overcoming the diffraction limit of conventional cameras with applications in microscopy and yielding high-resolution images. In order to splice together low-resolution images taken under different illumination angles of coherent light source, an iterative phase retrieval algorithm is adopted. However, the reconstruction procedure is slow and needs a good many of overlap in the Fourier domain for the continuous recorded low-resolution images and is also worse under system aberrations such as noise or random update sequence. In this paper, we propose a new retrieval algorithm that is based on convolutional neural networks. Once well trained, our model can perform high-quality reconstruction rapidly by using the graphics processing unit. The experiments demonstrate that our model achieves better reconstruction results and is more robust under system aberrations.
IVAug 22, 2019
Multi-Task Deep Learning with Dynamic Programming for Embryo Early Development Stage Classification from Time-Lapse VideosZihan Liu, Bo Huang, Yuqi Cui et al.
Time-lapse is a technology used to record the development of embryos during in-vitro fertilization (IVF). Accurate classification of embryo early development stages can provide embryologists valuable information for assessing the embryo quality, and hence is critical to the success of IVF. This paper proposes a multi-task deep learning with dynamic programming (MTDL-DP) approach for this purpose. It first uses MTDL to pre-classify each frame in the time-lapse video to an embryo development stage, and then DP to optimize the stage sequence so that the stage number is monotonically non-decreasing, which usually holds in practice. Different MTDL frameworks, e.g., one-to-many, many-to-one, and many-to-many, are investigated. It is shown that the one-to-many MTDL framework achieved the best compromise between performance and computational cost. To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from time-lapse videos.
CVJan 25, 2017
Towards End-to-End Face Recognition through Alignment LearningYuanyi Zhong, Jiansheng Chen, Bo Huang
Plenty of effective methods have been proposed for face recognition during the past decade. Although these methods differ essentially in many aspects, a common practice of them is to specifically align the facial area based on the prior knowledge of human face structure before feature extraction. In most systems, the face alignment module is implemented independently. This has actually caused difficulties in the designing and training of end-to-end face recognition models. In this paper we study the possibility of alignment learning in end-to-end face recognition, in which neither prior knowledge on facial landmarks nor artificially defined geometric transformations are required. Specifically, spatial transformer layers are inserted in front of the feature extraction layers in a Convolutional Neural Network (CNN) for face recognition. Only human identity clues are used for driving the neural network to automatically learn the most suitable geometric transformation and the most appropriate facial area for the recognition task. To ensure reproducibility, our model is trained purely on the publicly available CASIA-WebFace dataset, and is tested on the Labeled Face in the Wild (LFW) dataset. We have achieved a verification accuracy of 99.08\% which is comparable to state-of-the-art single model based methods.
MLJul 22, 2013
Square Deal: Lower Bounds and Improved Relaxations for Tensor RecoveryCun Mu, Bo Huang, John Wright et al.
Recovering a low-rank tensor from incomplete information is a recurring problem in signal processing and machine learning. The most popular convex relaxation of this problem minimizes the sum of the nuclear norms of the unfoldings of the tensor. We show that this approach can be substantially suboptimal: reliably recovering a $K$-way tensor of length $n$ and Tucker rank $r$ from Gaussian measurements requires $Ω(r n^{K-1})$ observations. In contrast, a certain (intractable) nonconvex formulation needs only $O(r^K + nrK)$ observations. We introduce a very simple, new convex relaxation, which partially bridges this gap. Our new formulation succeeds with $O(r^{\lfloor K/2 \rfloor}n^{\lceil K/2 \rceil})$ observations. While these results pertain to Gaussian measurements, simulations strongly suggest that the new norm also outperforms the sum of nuclear norms for tensor completion from a random subset of entries. Our lower bound for the sum-of-nuclear-norms model follows from a new result on recovering signals with multiple sparse structures (e.g. sparse, low rank), which perhaps surprisingly demonstrates the significant suboptimality of the commonly used recovery approach via minimizing the sum of individual sparsity inducing norms (e.g. $l_1$, nuclear norm). Our new formulation for low-rank tensor recovery however opens the possibility in reducing the sample complexity by exploiting several structures jointly.