Hao Jiang

CV
h-index98
184papers
5,630citations
Novelty51%
AI Score61

184 Papers

CVMar 25, 2023Code
DoNet: Deep De-overlapping Network for Cytology Instance Segmentation

Hao Jiang, Rushan Zhang, Yanning Zhou et al.

Cell instance segmentation in cytology images has significant importance for biology analysis and cancer screening, while remains challenging due to 1) the extensive overlapping translucent cell clusters that cause the ambiguous boundaries, and 2) the confusion of mimics and debris as nuclei. In this work, we proposed a De-overlapping Network (DoNet) in a decompose-and-recombined strategy. A Dual-path Region Segmentation Module (DRM) explicitly decomposes the cell clusters into intersection and complement regions, followed by a Semantic Consistency-guided Recombination Module (CRM) for integration. To further introduce the containment relationship of the nucleus in the cytoplasm, we design a Mask-guided Region Proposal Strategy (MRP) that integrates the cell attention maps for inner-cell instance prediction. We validate the proposed approach on ISBI2014 and CPS datasets. Experiments show that our proposed DoNet significantly outperforms other state-of-the-art (SOTA) cell instance segmentation methods. The code is available at https://github.com/DeepDoNet/DoNet.

95.2CVMay 28Code
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

Tianpeng Bu, Xin Liu, Qihua Chen et al.

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis. GUI-RobustEval contains $1,216$ executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates $800k$ high-quality data via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a $47.4\%$ success rate and a $33.8\%$ All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.

AISep 2, 2022Code
Multi-Modal Experience Inspired AI Creation

Qian Cao, Xu Chen, Ruihua Song et al.

AI creation, such as poem or lyrics generation, has attracted increasing attention from both industry and academic communities, with many promising models proposed in the past few years. Existing methods usually estimate the outputs based on single and independent visual or textual information. However, in reality, humans usually make creations according to their experiences, which may involve different modalities and be sequentially correlated. To model such human capabilities, in this paper, we define and solve a novel AI creation problem based on human experiences. More specifically, we study how to generate texts based on sequential multi-modal information. Compared with the previous works, this task is much more difficult because the designed model has to well understand and adapt the semantics among different modalities and effectively convert them into the output in a sequential manner. To alleviate these difficulties, we firstly design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network. For more effective optimization, we then propose a curriculum negative sampling strategy tailored for the sequential inputs. To benchmark this problem and demonstrate the effectiveness of our model, we manually labeled a new multi-modal experience dataset. With this dataset, we conduct extensive experiments by comparing our model with a series of representative baselines, where we can demonstrate significant improvements in our model based on both automatic and human-centered metrics. The code and data are available at: \url{https://github.com/Aman-4-Real/MMTG}.

CVMar 28, 2023
Egocentric Auditory Attention Localization in Conversations

Fiona Ryan, Hao Jiang, Abhinav Shukla et al. · gatech

In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention, or the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound sources. The computer vision and audio research communities have made great strides towards recognizing sound sources and speakers in scenes. In this work, we take a step further by focusing on the problem of localizing auditory attention targets in egocentric video, or detecting who in a camera wearer's field of view they are listening to. To tackle the new and challenging Selective Auditory Attention Localization problem, we propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention. Our approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset. Project page: https://fkryan.github.io/saal

CVAug 28, 2024Code
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Fangxun Shu, Yue Liao, Le Zhuo et al.

We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions to enable the student model to emulate the teacher network's understanding. Following this, we introduce preference distillation via Direct Preference Optimization (DPO), where the key lies in treating l-MLLM as the reference model. During this phase, the s-MLLM's ability to discriminate between superior and inferior examples is significantly enhanced beyond l-MLLM, leading to a better student that surpasses its teacher, particularly in hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while maintaining a minimal number of activated parameters and low computational costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of the training data and 23% trainable parameters. These results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of more efficient MLLMs. The code will be available on: https://github.com/shufangxun/LLaVA-MoD.

98.7LGMay 27Code
Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

Hao Jiang, Shurui Li, Tianpeng Bu et al.

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain balance during training with suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under the same token budget, but also reuses the tree structure for effective IB-Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.

87.6CVJun 4
LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Jianzong Wu, Hao Lian, Jiongfan Yang et al.

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier field. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation inevitably doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling and directly adding the clean source video latent to the noised target latent, this elegant design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is seamlessly integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, exhibiting exceptional superiority in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.

97.0IRJun 4
OneReason Technical Report

OneRec Team, Biao Yang, Boyang Ding et al.

Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style ``think before answer'' paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user's behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.

98.5LGJun 4
Balancing Image Compression and Generation with Bootstrapped Tokenization

Haozhe Chi, Jinghan Li, Hao Jiang et al.

Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, thus redundancy persists between tokens. The mix of information of different granularity also complicates the training of generators. This paper introduces SelfBootTok, a method that resolves this by cleanly decomposing information into global and local token groups. Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.

CVJul 28, 2024Code
On the Evaluation Consistency of Attribution-based Explanations

Jiarui Duan, Haoling Li, Haofei Zhang et al.

Attribution-based explanations are garnering increasing attention recently and have emerged as the predominant approach towards \textit{eXplanable Artificial Intelligence}~(XAI). However, the absence of consistent configurations and systematic investigations in prior literature impedes comprehensive evaluations of existing methodologies. In this work, we introduce {Meta-Rank}, an open platform for benchmarking attribution methods in the image domain. Presently, Meta-Rank assesses eight exemplary attribution methods using six renowned model architectures on four diverse datasets, employing both the \textit{Most Relevant First} (MoRF) and \textit{Least Relevant First} (LeRF) evaluation protocols. Through extensive experimentation, our benchmark reveals three insights in attribution evaluation endeavors: 1) evaluating attribution methods under disparate settings can yield divergent performance rankings; 2) although inconsistent across numerous cases, the performance rankings exhibit remarkable consistency across distinct checkpoints along the same training trajectory; 3) prior attempts at consistent evaluation fare no better than baselines when extended to more heterogeneous models and datasets. Our findings underscore the necessity for future research in this domain to conduct rigorous evaluations encompassing a broader range of models and datasets, and to reassess the assumptions underlying the empirical success of different attribution methods. Our code is publicly available at \url{https://github.com/TreeThree-R/Meta-Rank}.

97.1CVMay 26Code
DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

Peng Zhang, Guanghao Zhang, Wanggui He et al.

Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per-window frame rate is largely fixed. As a result, fine-grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets new state-of-the-art on most metrics. Code is available at https://github.com/zhangguanghao523/DynFrame.

CLAug 19, 2024Code
TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition

Tianwei Lin, Jiang Liu, Wenqiao Zhang et al.

While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multidimensional task scenarios. To address this issue, one straightforward solution is to introduce task-specific LoRA modules as domain experts, leveraging the modeling of multiple experts' capabilities and thus enhancing the general capability of multi-task learning. Despite promising, these additional components often add complexity to the training and inference process, contravening the efficient characterization of PEFT designed for. Considering this, we introduce an innovative PEFT method, TeamLoRA, consisting of a collaboration and competition module for experts, and thus achieving the right balance of effectiveness and efficiency: (i) For collaboration, a novel knowledge-sharing and -organizing mechanism is devised to appropriately reduce the scale of matrix operations, thereby boosting the training and inference speed. (ii) For competition, we propose leveraging a game-theoretic interaction mechanism for experts, encouraging experts to transfer their domain-specific knowledge while facing diverse downstream tasks, and thus enhancing the performance. By doing so, TeamLoRA elegantly connects the experts as a "Team" with internal collaboration and competition, enabling a faster and more accurate PEFT paradigm for multi-task learning. To validate the superiority of TeamLoRA, we curate a comprehensive multi-task evaluation(CME) benchmark to thoroughly assess the capability of multi-task learning. Experiments conducted on our CME and other benchmarks indicate the effectiveness and efficiency of TeamLoRA. Our project is available at https://github.com/Lin-Tianwei/TeamLoRA.

CVSep 26, 2024Code
Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

Qihan Huang, Siming Fu, Jinlong Liu et al.

Personalized text-to-image generation methods can generate customized images based on the reference images, which have garnered wide research interest. Recent methods propose a finetuning-free approach with a decoupled cross-attention mechanism to generate personalized images requiring no test-time finetuning. However, when multiple reference images are provided, the current decoupled cross-attention mechanism encounters the object confusion problem and fails to map each reference image to its corresponding object, thereby seriously limiting its scope of application. To address the object confusion problem, in this work we investigate the relevance of different positions of the latent image features to the target object in diffusion model, and accordingly propose a weighted-merge method to merge multiple reference image features into the corresponding objects. Next, we integrate this weighted-merge method into existing pre-trained models and continue to train the model on a multi-object dataset constructed from the open-sourced SA-1B dataset. To mitigate object confusion and reduce training costs, we propose an object quality score to estimate the image quality for the selection of high-quality training samples. Furthermore, our weighted-merge training framework can be employed on single-object generation when a single object has multiple reference images. The experiments verify that our method achieves superior performance to the state-of-the-arts on the Concept101 dataset and DreamBooth dataset of multi-object personalized image generation, and remarkably improves the performance on single-object personalized image generation. Our code is available at https://github.com/hqhQAQ/MIP-Adapter.

CLMar 14, 2022
Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering

Jiawei Zhou, Xiaoguang Li, Lifeng Shang et al.

To alleviate the data scarcity problem in training question answering systems, recent works propose additional intermediate pre-training for dense passage retrieval (DPR). However, there still remains a large discrepancy between the provided upstream signals and the downstream question-passage relevance, which leads to less improvement. To bridge this gap, we propose the HyperLink-induced Pre-training (HLP), a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show our HLP outperforms the BM25 by up to 7 points as well as other pre-training methods by more than 10 points in terms of top-20 retrieval accuracy under the zero-shot scenario. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios.

CVJan 4, 2023
Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

Sagnik Majumder, Hao Jiang, Pierre Moulon et al.

Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people ("egos") move in a scene and talk among themselves, they receive rich audio-visual cues that can help uncover the unseen areas of the scene. Given the high cost of continuously processing egocentric visual streams, we further explore how to actively coordinate the sampling of visual information, so as to minimize redundancy and reduce power use. To that end, we present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space. We evaluate the approach using a state-of-the-art audio-visual simulator for 3D scenes as well as real-world video. Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff. Project: http://vision.cs.utexas.edu/projects/chat2map.

CVApr 7, 2023Code
DATE: Domain Adaptive Product Seeker for E-commerce

Haoyuan Li, Hao Jiang, Tao Jin et al.

Product Retrieval (PR) and Grounding (PG), aiming to seek image and object-level products respectively according to a textual query, have attracted great interest recently for better shopping experience. Owing to the lack of relevant datasets, we collect two large-scale benchmark datasets from Taobao Mall and Live domains with about 474k and 101k image-query pairs for PR, and manually annotate the object bounding boxes in each image for PG. As annotating boxes is expensive and time-consuming, we attempt to transfer knowledge from annotated domain to unannotated for PG to achieve un-supervised Domain Adaptation (PG-DA). We propose a {\bf D}omain {\bf A}daptive Produc{\bf t} S{\bf e}eker ({\bf DATE}) framework, regarding PR and PG as Product Seeking problem at different levels, to assist the query {\bf date} the product. Concretely, we first design a semantics-aggregated feature extractor for each modality to obtain concentrated and comprehensive features for following efficient retrieval and fine-grained grounding tasks. Then, we present two cooperative seekers to simultaneously search the image for PR and localize the product for PG. Besides, we devise a domain aligner for PG-DA to alleviate uni-modal marginal and multi-modal conditional distribution shift between source and target domains, and design a pseudo box generator to dynamically select reliable instances and generate bounding boxes for further knowledge transfer. Extensive experiments show that our DATE achieves satisfactory performance in fully-supervised PR, PG and un-supervised PG-DA. Our desensitized datasets will be publicly available here\footnote{\url{https://github.com/Taobao-live/Product-Seeking}}.

CVApr 27, 2023
MIPI 2023 Challenge on RGB+ToF Depth Completion: Methods and Results

Qingpeng Zhu, Wenxiu Sun, Yuekun Dai et al.

Depth completion from RGB images and sparse Time-of-Flight (ToF) measurements is an important problem in computer vision and robotics. While traditional methods for depth completion have relied on stereo vision or structured light techniques, recent advances in deep learning have enabled more accurate and efficient completion of depth maps from RGB images and sparse ToF measurements. To evaluate the performance of different depth completion methods, we organized an RGB+sparse ToF depth completion competition. The competition aimed to encourage research in this area by providing a standardized dataset and evaluation metrics to compare the accuracy of different approaches. In this report, we present the results of the competition and analyze the strengths and weaknesses of the top-performing methods. We also discuss the implications of our findings for future research in RGB+sparse ToF depth completion. We hope that this competition and report will help to advance the state-of-the-art in this important area of research. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2023.

LGJul 4, 2023
Training Energy-Based Models with Diffusion Contrastive Divergences

Weijian Luo, Hao Jiang, Tianyang Hu et al.

Energy-Based Models (EBMs) have been widely used for generative modeling. Contrastive Divergence (CD), a prevailing training objective for EBMs, requires sampling from the EBM with Markov Chain Monte Carlo methods (MCMCs), which leads to an irreconcilable trade-off between the computational burden and the validity of the CD. Running MCMCs till convergence is computationally intensive. On the other hand, short-run MCMC brings in an extra non-negligible parameter gradient term that is difficult to handle. In this paper, we provide a general interpretation of CD, viewing it as a special instance of our proposed Diffusion Contrastive Divergence (DCD) family. By replacing the Langevin dynamic used in CD with other EBM-parameter-free diffusion processes, we propose a more efficient divergence. We show that the proposed DCDs are both more computationally efficient than the CD and are not limited to a non-negligible gradient term. We conduct intensive experiments, including both synthesis data modeling and high-dimensional image denoising and generation, to show the advantages of the proposed DCDs. On the synthetic data learning and image denoising experiments, our proposed DCD outperforms CD by a large margin. In image generation experiments, the proposed DCD is capable of training an energy-based model for generating the Celab-A $32\times 32$ dataset, which is comparable to existing EBMs.

AISep 27, 2024Code
Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

Hongzhe Huang, Jiang Liu, Zhewen Yu et al.

Recent advances in Multi-modal Large Language Models (MLLMs), such as LLaVA-series models, are driven by massive machine-generated instruction-following data tuning. Such automatic instruction collection pipelines, however, inadvertently introduce significant variability in data quality. This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment, to compress this vast corpus of machine-generated multimodal instructions to a compact and high-quality form: (i) For human preference alignment, we have collected a machine-generated multimodal instruction dataset and established a comprehensive set of both subjective and objective criteria to guide the data quality assessment critically from human experts. By doing so, a reward model was trained on the annotated dataset to internalize the nuanced human understanding of instruction alignment. (ii) For LLM preference alignment, given the instruction selected by the reward model, we propose leveraging the inner LLM used in MLLM to align the writing style of visual instructions with that of the inner LLM itself, resulting in LLM-aligned instruction improvement. Extensive experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%. Impressively, by aggressively reducing the training instructions from 158k to 14k (9$\times$ smaller), our model consistently outperforms its full-size dataset counterpart across various MLLM benchmarks. Our project is available at https://github.com/DCDmllm/Align2LLaVA.

CVSep 21, 2024Code
GAInS: Gradient Anomaly-aware Biomedical Instance Segmentation

Runsheng Liu, Hao Jiang, Yanning Zhou et al.

Instance segmentation plays a vital role in the morphological quantification of biomedical entities such as tissues and cells, enabling precise identification and delineation of different structures. Current methods often address the challenges of touching, overlapping or crossing instances through individual modeling, while neglecting the intrinsic interrelation between these conditions. In this work, we propose a Gradient Anomaly-aware Biomedical Instance Segmentation approach (GAInS), which leverages instance gradient information to perceive local gradient anomaly regions, thus modeling the spatial relationship between instances and refining local region segmentation. Specifically, GAInS is firstly built on a Gradient Anomaly Mapping Module (GAMM), which encodes the radial fields of instances through window sliding to obtain instance gradient anomaly maps. To efficiently refine boundaries and regions with gradient anomaly attention, we propose an Adaptive Local Refinement Module (ALRM) with a gradient anomaly-aware loss function. Extensive comparisons and ablation experiments in three biomedical scenarios demonstrate that our proposed GAInS outperforms other state-of-the-art (SOTA) instance segmentation methods. The code is available at https://github.com/DeepGAInS/GAInS.

CVFeb 26Code
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

Bowen Cui, Yuanbin Wang, Huajiang Xu et al.

Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by $+$0.031 ImageReward at 4.87$\times$ speedup and even surpassing the full-step baseline by $+$0.028 ImageReward at 3.54$\times$ speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.

87.3CVApr 14Code
PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

Jinlong Liu, Wanggui He, Peng Zhang et al.

Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires \emph{no} annotation and \emph{no} reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.

SEAug 21, 2024
RePair: Automated Program Repair with Process-based Feedback

Yuze Zhao, Zhenya Huang, Yixiao Ma et al.

The gap between the trepidation of program reliability and the expense of repairs underscores the indispensability of Automated Program Repair (APR). APR is instrumental in transforming vulnerable programs into more robust ones, bolstering program reliability while simultaneously diminishing the financial burden of manual repairs. Commercial-scale language models (LM) have taken APR to unprecedented levels. However, the emergence reveals that for models fewer than 100B parameters, making single-step modifications may be difficult to achieve the desired effect. Moreover, humans interact with the LM through explicit prompts, which hinders the LM from receiving feedback from compiler and test cases to automatically optimize its repair policies. In this literature, we explore how small-scale LM (less than 20B) achieve excellent performance through process supervision and feedback. We start by constructing a dataset named CodeNet4Repair, replete with multiple repair records, which supervises the fine-tuning of a foundational model. Building upon the encouraging outcomes of reinforcement learning, we develop a reward model that serves as a critic, providing feedback for the fine-tuned LM's action, progressively optimizing its policy. During inference, we require the LM to generate solutions iteratively until the repair effect no longer improves or hits the maximum step limit. The results show that process-based not only outperforms larger outcome-based generation methods, but also nearly matches the performance of closed-source commercial large-scale LMs.

CLJul 5, 2022
PReGAN: Answer Oriented Passage Ranking with Weakly Supervised GAN

Pan Du, Jian-Yun Nie, Yutao Zhu et al.

Beyond topical relevance, passage ranking for open-domain factoid question answering also requires a passage to contain an answer (answerability). While a few recent studies have incorporated some reading capability into a ranker to account for answerability, the ranker is still hindered by the noisy nature of the training data typically available in this area, which considers any passage containing an answer entity as a positive sample. However, the answer entity in a passage is not necessarily mentioned in relation with the given question. To address the problem, we propose an approach called \ttt{PReGAN} for Passage Reranking based on Generative Adversarial Neural networks, which incorporates a discriminator on answerability, in addition to a discriminator on topical relevance. The goal is to force the generator to rank higher a passage that is topically relevant and contains an answer. Experiments on five public datasets show that \ttt{PReGAN} can better rank appropriate passages, which in turn, boosts the effectiveness of QA systems, and outperforms the existing approaches without using external data.

CVSep 21, 2024Code
Holistic and Historical Instance Comparison for Cervical Cell Detection

Hao Jiang, Runsheng Liu, Yanning Zhou et al.

Cytology screening from Papanicolaou (Pap) smears is a common and effective tool for the preventive clinical management of cervical cancer, where abnormal cell detection from whole slide images serves as the foundation for reporting cervical cytology. However, cervical cell detection remains challenging due to 1) hazily-defined cell types (e.g., ASC-US) with subtle morphological discrepancies caused by the dynamic cancerization process, i.e., cell class ambiguity, and 2) imbalanced class distributions of clinical data may cause missed detection, especially for minor categories, i.e., cell class imbalance. To this end, we propose a holistic and historical instance comparison approach for cervical cell detection. Specifically, we first develop a holistic instance comparison scheme enforcing both RoI-level and class-level cell discrimination. This coarse-to-fine cell comparison encourages the model to learn foreground-distinguishable and class-wise representations. To emphatically improve the distinguishability of minor classes, we then introduce a historical instance comparison scheme with a confident sample selection-based memory bank, which involves comparing current embeddings with historical embeddings for better cell instance discrimination. Extensive experiments and analysis on two large-scale cytology datasets including 42,592 and 114,513 cervical cells demonstrate the effectiveness of our method. The code is available at https://github.com/hjiangaz/HERO.

OCNov 28, 2018
A convergence framework for inexact nonconvex and nonsmooth algorithms and its applications to several iterations

Tao Sun, Hao Jiang, Lizhi Cheng et al.

In this paper, we consider the convergence of an abstract inexact nonconvex and nonsmooth algorithm. We promise a pseudo sufficient descent condition and a pseudo relative error condition, which are both related to an auxiliary sequence, for the algorithm; and a continuity condition is assumed to hold. In fact, a lot of classical inexact nonconvex and nonsmooth algorithms allow these three conditions. Under a special kind of summable assumption on the auxiliary sequence, we prove the sequence generated by the general algorithm converges to a critical point of the objective function if being assumed Kurdyka- Lojasiewicz property. The core of the proofs lies in building a new Lyapunov function, whose successive difference provides a bound for the successive difference of the points generated by the algorithm. And then, we apply our findings to several classical nonconvex iterative algorithms and derive the corresponding convergence results

LGMar 3, 2022
MetaDT: Meta Decision Tree with Class Hierarchy for Interpretable Few-Shot Learning

Baoquan Zhang, Hao Jiang, Xutao Li et al.

Few-Shot Learning (FSL) is a challenging task, which aims to recognize novel classes with few examples. Recently, lots of methods have been proposed from the perspective of meta-learning and representation learning. However, few works focus on the interpretability of FSL decision process. In this paper, we take a step towards the interpretable FSL by proposing a novel meta-learning based decision tree framework, namely, MetaDT. In particular, the FSL interpretability is achieved from two aspects, i.e., a concept aspect and a visual aspect. On the concept aspect, we first introduce a tree-like concept hierarchy as FSL prior. Then, resorting to the prior, we split each few-shot task to a set of subtasks with different concept levels and then perform class prediction via a model of decision tree. The advantage of such design is that a sequence of high-level concept decisions that lead up to a final class prediction can be obtained, which clarifies the FSL decision process. On the visual aspect, a set of subtask-specific classifiers with visual attention mechanism is designed to perform decision at each node of the decision tree. As a result, a subtask-specific heatmap visualization can be obtained to achieve the decision interpretability of each tree node. At last, to alleviate the data scarcity issue of FSL, we regard the prior of concept hierarchy as an undirected graph, and then design a graph convolution-based decision tree inference network as our meta-learner to infer parameters of the decision tree. Extensive experiments on performance comparison and interpretability analysis show superiority of our MetaDT.

AIFeb 21, 2023
Future Aware Pricing and Matching for Sustainable On-demand Ride Pooling

Xianjie Zhang, Pradeep Varakantham, Hao Jiang

The popularity of on-demand ride pooling is owing to the benefits offered to customers (lower prices), taxi drivers (higher revenue), environment (lower carbon footprint due to fewer vehicles) and aggregation companies like Uber (higher revenue). To achieve these benefits, two key interlinked challenges have to be solved effectively: (a) pricing -- setting prices to customer requests for taxis; and (b) matching -- assignment of customers (that accepted the prices) to taxis/cars. Traditionally, both these challenges have been studied individually and using myopic approaches (considering only current requests), without considering the impact of current matching on addressing future requests. In this paper, we develop a novel framework that handles the pricing and matching problems together, while also considering the future impact of the pricing and matching decisions. In our experimental results on a real-world taxi dataset, we demonstrate that our framework can significantly improve revenue (up to 17% and on average 6.4%) in a sustainable manner by reducing the number of vehicles (up to 14% and on average 10.6%) required to obtain a given fixed revenue and the overall distance travelled by vehicles (up to 11.1% and on average 3.7%). That is to say, we are able to provide an ideal win-win scenario for all stakeholders (customers, drivers, aggregator, environment) involved by obtaining higher revenue for customers, drivers, aggregator (ride pooling company) while being good for the environment (due to fewer number of vehicles on the road and lesser fuel consumed).

MMJun 7, 2023
Video Compression with Arbitrary Rescaling Network

Mengxi Guo, Shijie Zhao, Hao Jiang et al.

Most video platforms provide video streaming services with different qualities, and the quality of the services is usually adjusted by the resolution of the videos. So high-resolution videos need to be downsampled for compression. In order to solve the problem of video coding at different resolutions, we propose a rate-guided arbitrary rescaling network (RARN) for video resizing before encoding. To help the RARN be compatible with standard codecs and generate compression-friendly results, an iteratively optimized transformer-based virtual codec (TVC) is introduced to simulate the key components of video encoding and perform bitrate estimation. By iteratively training the TVC and the RARN, we achieved 5%-29% BD-Rate reduction anchored by linear interpolation under different encoding configurations and resolutions, exceeding the previous methods on most test videos. Furthermore, the lightweight RARN structure can process FHD (1080p) content at real-time speed (91 FPS) and obtain a considerable rate reduction.

CVFeb 6Code
Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Yunze Tong, Mushui Liu, Canyu Zhao et al.

Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect, and (ii) it identifies turning points-steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend-and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments also demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.

NAFeb 17, 2017
Accurate Quotient-Difference algorithm: error analysis, improvements and applications

Peibing Du, Roberto Barrio, Hao Jiang et al.

The compensated quotient-difference (Compqd) algorithm is proposed along with some applications. The main motivation is based on the fact that the standard quotient-difference (qd) algorithm can be numerically unstable. The Compqd algorithm is obtained by applying error-free transformations to improve the traditional qd algorithm. We study in detail the error analysis of the qd and Compqd algorithms and we introduce new condition numbers so that the relative forward rounding error bounds can be derived directly. Our numerical experiments illustrate that the Compqd algorithm is much more accurate than the qd algorithm, relegating the influence of the condition numbers up to second order in the rounding unit of the computer. Three applications of the new algorithm in the obtention of continued fractions and in pole and zero detection are shown.

88.5ITApr 21
Cramer-Rao Bounds for Activity Detection in Conventional and Fluid Antenna Systems

Zhentian Zhang, Kai-Kit Wong, Hao Jiang et al.

In this letter, we develop a unified Cramér-Rao bound (CRB) framework to characterize the fundamental performance limits of transmission activity detection in fluid antenna systems (FASs) and conventional multiple fixed-position antenna (FPA) systems. To facilitate CRB analysis applicable to activity indicators, we relax the binary activity states to continuous parameters, thereby aligning the bound-based evaluation with practical threshold-based detection decisions. Closed-form CRB expressions are derived for two representative detection formulations, namely covariance-oriented and coherent models. Moreover, for single-antenna FASs, we obtain a closed-form coherent CRB by leveraging random matrix theory. The results demonstrate that CRB-based analysis provides a tractable and informative benchmark for evaluating activity detection across architectures and detection schemes, and further reveal that FASs can deliver strong spatial-diversity gains with significantly reduced complexity.

LGJan 27, 2023
Solving Richly Constrained Reinforcement Learning through State Augmentation and Reward Penalties

Hao Jiang, Tien Mai, Pradeep Varakantham et al.

Constrained Reinforcement Learning has been employed to enforce safety constraints on policy through the use of expected cost constraints. The key challenge is in handling expected cost accumulated using the policy and not just in a single step. Existing methods have developed innovative ways of converting this cost constraint over entire policy to constraints over local decisions (at each time step). While such approaches have provided good solutions with regards to objective, they can either be overly aggressive or conservative with respect to costs. This is owing to use of estimates for "future" or "backward" costs in local cost constraints. To that end, we provide an equivalent unconstrained formulation to constrained RL that has an augmented state space and reward penalties. This intuitive formulation is general and has interesting theoretical properties. More importantly, this provides a new paradigm for solving constrained RL problems effectively. As we show in our experimental results, we are able to outperform leading approaches on multiple benchmark problems from literature.

CVJul 10, 2024
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

Wanggui He, Siming Fu, Mushui Liu et al.

Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this framework lends itself to migration towards any-to-any task adaptability. Furthermore, MARS employs a multi-stage training strategy that first establishes robust image-text alignment through complementary bidirectional tasks and subsequently concentrates on refining the T2I generation process, significantly augmenting text-image synchrony and the granularity of image details. Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating the training efficiency and the potential for swift deployment in various applications.

88.3SPMay 26
Geometry-Structured Channel Reconstruction for Conventional and Fluid Antenna Systems: Bayesian Inference and Fundamental Limits

Zhentian Zhang, Kai-Kit Wong, Kaitao Meng et al.

Accurate channel state information (CSI) acquisition is critical for exploiting the spatial flexibility of fluid antenna systems (FASs). However, port selection and transmission optimization require CSI over a large number of candidate port positions, making direct port-wise estimation prohibitively costly in terms of pilot overhead. This paper addresses this challenge through geometry-structured channel reconstruction, which exploits the fact that the port-domain CSI can be parameterized by a small number of dominant propagation paths. We first establish fundamental mean square error (MSE) and normalized MSE (NMSE) benchmarks for both geometry-structured and unstructured channel reconstruction, providing analytical references for evaluating the intrinsic benefit of geometric modeling in conventional antenna systems and FASs. Motivated by the strong spatial correlation induced by densely distributed fluid antenna ports, we further propose a Bayesian reconstruction framework, termed geometry-structured expectation-maximization approximate message passing (GS-EM-AMP). The proposed algorithm incorporates geometric channel structure into the EM-AMP procedure and adaptively learns unknown statistical parameters from noisy observations. Numerical results demonstrate that GS-EM-AMP achieves near-bound reconstruction accuracy while maintaining strong robustness against steering-domain correlation, thereby offering an efficient and reliable solution for large-scale CSI acquisition in FASs.

ETNov 13, 2023
Pruning random resistive memory for optimizing analogue AI

Yi Li, Songqi Wang, Yaping Zhao et al.

The rapid advancement of artificial intelligence (AI) has been marked by the large language models exhibiting human-like intelligence. However, these models also present unprecedented challenges to energy consumption and environmental sustainability. One promising solution is to revisit analogue computing, a technique that predates digital computing and exploits emerging analogue electronic devices, such as resistive memory, which features in-memory computing, high scalability, and nonvolatility. However, analogue computing still faces the same challenges as before: programming nonidealities and expensive programming due to the underlying devices physics. Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning to optimize the topology of a randomly weighted analogue resistive memory neural network. Software-wise, the topology of a randomly weighted neural network is optimized by pruning connections rather than precisely tuning resistive memory weights. Hardware-wise, we reveal the physical origin of the programming stochasticity using transmission electron microscopy, which is leveraged for large-scale and low-cost implementation of an overparameterized random neural network containing high-performance sub-networks. We implemented the co-design on a 40nm 256K resistive memory macro, observing 17.3% and 19.9% accuracy improvements in image and audio classification on FashionMNIST and Spoken digits datasets, as well as 9.8% (2%) improvement in PR (ROC) in image segmentation on DRIVE datasets, respectively. This is accompanied by 82.1%, 51.2%, and 99.8% improvement in energy efficiency thanks to analogue in-memory computing. By embracing the intrinsic stochasticity and in-memory computing, this work may solve the biggest obstacle of analogue computing systems and thus unleash their immense potential for next-generation AI hardware.

85.1ROMay 11Code
Muninn: Your Trajectory Diffusion Model But Faster

Gokul Puthumanaillam, Hao Jiang, Ruben Hernandez et al.

Diffusion-based trajectory planners can synthesize rich, multimodal robot motions, but their iterative denoising makes online planning and control prohibitively slow. Existing accelerations either modify the sampler or compress the network--sacrificing plan quality or requiring retraining without accounting for downstream control risk. We address the problem of making diffusion-based trajectory planners fast enough for real-time robot use without retraining the model or sacrificing trajectory quality, and in a way that works across diverse state-space diffusion architectures. Our key insight is that diffusion trajectory planners expose two signals we can exploit: a cheap probe of how their internal trajectory representation changes across steps, and analytic coefficients that describe how denoiser errors affect the sampler's state update. By calibrating the first signal against the second on offline runs, we obtain a per-step score that upper-bounds how far the final trajectory can deviate when we reuse a cached denoiser output, and we treat this bound as an uncertainty budget that we can spend over the denoising process. Building on this insight, we present Muninn, a training-free caching wrapper that tracks this uncertainty budget during sampling and, at each diffusion step, chooses between reusing a cached denoiser output when the predicted deviation is small and recomputing the denoiser when it is not. Across standard benchmarks Muninn delivers up to 4.6x wall-clock speedups across several trajectory diffusion models by reducing denoiser evaluations, while preserving task performance and safety metrics. Muninn further certifies that cached rollouts remain within a specified distance of their full-compute counterparts, and we validate these gains in real-time closed-loop navigation and manipulation hardware deployments. Project page: https://github.com/gokulp01/Muninn.

CVAug 8, 2024
Improving Network Interpretability via Explanation Consistency Evaluation

Hefeng Wu, Hao Jiang, Keze Wang et al.

While deep neural networks have achieved remarkable performance, they tend to lack transparency in prediction. The pursuit of greater interpretability in neural networks often results in a degradation of their original performance. Some works strive to improve both interpretability and performance, but they primarily depend on meticulously imposed conditions. In this paper, we propose a simple yet effective framework that acquires more explainable activation heatmaps and simultaneously increase the model performance, without the need for any extra supervision. Specifically, our concise framework introduces a new metric, i.e., explanation consistency, to reweight the training samples adaptively in model learning. The explanation consistency metric is utilized to measure the similarity between the model's visual explanations of the original samples and those of semantic-preserved adversarial samples, whose background regions are perturbed by using image adversarial attack techniques. Our framework then promotes the model learning by paying closer attention to those training samples with a high difference in explanations (i.e., low explanation consistency), for which the current model cannot provide robust interpretations. Comprehensive experimental results on various benchmarks demonstrate the superiority of our framework in multiple aspects, including higher recognition accuracy, greater data debiasing capability, stronger network robustness, and more precise localization ability on both regular networks and interpretable networks. We also provide extensive ablation studies and qualitative analyses to unveil the detailed contribution of each component.

CVFeb 5, 2024Code
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Yang Jin, Zhicheng Sun, Kun Xu et al. · pku

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training due to the modeling of its spatiotemporal dynamics. In this paper, we address such limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the generated tokens from the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is both capable of comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models are available at https://video-lavit.github.io.

ARJul 12, 2024
Dynamic neural network with memristive CIM and CAM for 2D and 3D vision

Yue Zhang, Woyu Zhang, Shaocong Wang et al.

The brain is dynamic, associative and efficient. It reconfigures by associating the inputs with past experiences, with fused memory and processing. In contrast, AI models are static, unable to associate inputs with past experiences, and run on digital computers with physically separated memory and processing. We propose a hardware-software co-design, a semantic memory-based dynamic neural network (DNN) using memristor. The network associates incoming data with the past experience stored as semantic vectors. The network and the semantic memory are physically implemented on noise-robust ternary memristor-based Computing-In-Memory (CIM) and Content-Addressable Memory (CAM) circuits, respectively. We validate our co-designs, using a 40nm memristor macro, on ResNet and PointNet++ for classifying images and 3D points from the MNIST and ModelNet datasets, which not only achieves accuracy on par with software but also a 48.1% and 15.9% reduction in computational budget. Moreover, it delivers a 77.6% and 93.3% reduction in energy consumption.

AINov 11, 2023
TrainerAgent: Customizable and Efficient Model Training through LLM-Powered Multi-Agent System

Haoyuan Li, Hao Jiang, Tianke Zhang et al.

Training AI models has always been challenging, especially when there is a need for custom models to provide personalized services. Algorithm engineers often face a lengthy process to iteratively develop models tailored to specific business requirements, making it even more difficult for non-experts. The quest for high-quality and efficient model development, along with the emergence of Large Language Model (LLM) Agents, has become a key focus in the industry. Leveraging the powerful analytical, planning, and decision-making capabilities of LLM, we propose a TrainerAgent system comprising a multi-agent framework including Task, Data, Model and Server agents. These agents analyze user-defined tasks, input data, and requirements (e.g., accuracy, speed), optimizing them comprehensively from both data and model perspectives to obtain satisfactory models, and finally deploy these models as online service. Experimental evaluations on classical discriminative and generative tasks in computer vision and natural language processing domains demonstrate that our system consistently produces models that meet the desired criteria. Furthermore, the system exhibits the ability to critically identify and reject unattainable tasks, such as fantastical scenarios or unethical requests, ensuring robustness and safety. This research presents a significant advancement in achieving desired models with increased efficiency and quality as compared to traditional model development, facilitated by the integration of LLM-powered analysis, decision-making, and execution capabilities, as well as the collaboration among four agents. We anticipate that our work will contribute to the advancement of research on TrainerAgent in both academic and industry communities, potentially establishing it as a new paradigm for model development in the field of AI.

CVFeb 14, 2025Code
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

Tianwei Lin, Wenqiao Zhang, Sijing Li et al.

We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively learn the HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.

90.9ITApr 17
Beyond Covariance: Generative Spatial Correlation Modeling and Channel Interpolation for Fluid Antenna Systems

Zhentian Zhang, Hao Jiang, Kai-Kit Wong et al.

Fluid antenna systems (FAS) enable unprecedented spatial diversity within a compact form factor by flexibly switching among high-density antenna ports. To activate this capability, channel state information (CSI) over the ports is required, which implies high estimation overhead because the number of ports is usually very large. Conventional estimation schemes tend to first estimate the CSI for a small number of ports and then infer the CSI for the remaining antenna ports by interpolation exploiting correlation characteristics. However, existing correlation-based techniques lack generalization ability, and the fundamental limits of interpolating the CSI from sparse observations remain poorly understood. This paper adopts a generative modeling framework for characterizing the channel correlation among the FAS ports that departs fundamentally from covariance-descriptive models. Specifically, we represent the spatially sampled channel as a $p$th-order autoregressive (AR) Gauss-Markov process, which provides a principled and tunable tradeoff between model complexity and approximation accuracy via the AR order. In so doing, we can characterize the limits of channel interpolation by deriving the globally optimal minimum mean-square error (MMSE) estimator and establishing a tight lower bound on the minimum number of observations required to meet a prescribed reconstruction error. To reduce the complexity of MMSE estimation, we then exploit the state-space structure due to the ${\rm AR}(p)$ model and develop a Kalman filtering/smoothing-based interpolation algorithm. The resulting method attains the optimal MMSE performance with strictly linear complexity $\mathcal{O}(N)$ with $N$ denoting the number of ports, resulting in a scalable, efficient, and theoretically grounded framework for practical FAS channel reconstruction.

IRAug 27, 2024
MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce

Hao Jiang, Haoxiang Zhang, Qingshan Hou et al.

Providing high-quality item recall for text queries is crucial in large-scale e-commerce search systems. Current Embedding-based Retrieval Systems (ERS) embed queries and items into a shared low-dimensional space, but uni-modality ERS rely too heavily on textual features, making them unreliable in complex contexts. While multi-modality ERS incorporate various data sources, they often overlook individual preferences for different modalities, leading to suboptimal results. To address these issues, we propose MRSE, a Multi-modality Retrieval System that integrates text, item images, and user preferences through lightweight mixture-of-expert (LMoE) modules to better align features across and within modalities. MRSE also builds user profiles at a multi-modality level and introduces a novel hybrid loss function that enhances consistency and robustness using hard negative sampling. Experiments on a large-scale dataset from Shopee and online A/B testing show that MRSE achieves an 18.9% improvement in offline relevance and a 3.7% gain in online core metrics compared to Shopee's state-of-the-art uni-modality system.

81.3ITMay 21
Finite-Aperture Planar Fluid Antenna Array

Zhentian Zhang, Jingyuan Xu, Kai-Kit Wong et al.

Fluid antenna systems (FASs) are emerging as a reconfigurable-aperture technology that expands physical-layer design beyond fixed, rigid antenna geometries. While the \emph{fading diversity} of FASs -- which exploits spatial channel fluctuations for signal enhancement and interference avoidance -- has been widely studied, the \emph{geometry diversity} created by reconfigurable port placement remains far less understood, particularly for planar architectures under finite-aperture constraints. This paper develops a systematic analytical framework for finite-aperture planar fluid antenna arrays (FAAs). First, we derive a closed-form characterization of the minimum inter-port distance under uniform random placement over a rectangular aperture and show that it follows a Rayleigh law. Its mean scales as $\mathcal{O}(M^{-1})$, in sharp contrast to the $\mathcal{O}(M^{-2})$ behavior in the linear case in which $M$ represents the number of candidate ports, revealing a fundamentally more favorable packing geometry in two dimensions. Secondly, we establish a universal Cramér-Rao bound (CRB) for joint elevation-azimuth estimation, governed by a $2\times 2$ \emph{geometric inertia matrix} whose determinant and eigenstructure fully capture the role of port placement in estimation precision. We further prove that both the trace and determinant of this matrix are invariant to the azimuth look direction. Third, we uncover an intrinsic \emph{precision--ambiguity trade-off}: maximizing the geometric determinant to minimize the CRB drives ports toward the aperture boundary, but simultaneously increases sidelobe-induced spatial ambiguity.

47.2CVMar 31
TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding

Jingbin You, Zehao Li, Hao Jiang et al.

3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.

AIFeb 26
FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning

Zehao Li, Hongwei Yu, Hao Jiang et al.

Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixed-depth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external verification. To address these limitations, we propose FactGuard, an agentic framework for video misinformation detection that formulates verification as an iterative reasoning process built upon MLLMs. FactGuard explicitly assesses task ambiguity and selectively invokes external tools to acquire critical evidence, enabling progressive refinement of reasoning trajectories. To further strengthen this capability, we introduce a two-stage training strategy that combines domain-specific agentic supervised fine-tuning with decision-aware reinforcement learning to optimize tool usage and calibrate risk-sensitive decision making. Extensive experiments on FakeSV, FakeTT, and FakeVV demonstrate FactGuard's state-of-the-art performance and validate its excellent robustness and generalization capacity.

NAMar 19, 2016
A note on the convergence of nonconvex line search

Tao Sun, Lizhi Chenga, Hao Jiang

In this note, we consider the line search for a class of abstract nonconvex algorithm which have been deeply studied in the Kurdyka-Lojasiewicz theory. We provide a weak convergence result of the line search in general. When the objective function satisfies the Kurdyka-Lojasiewicz property and some certain assumption, a global convergence result can be derived. An application is presented for the L0-regularized least square minimization in the end of the paper.

CLFeb 2Code
D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use

Bowen Xu, Shaoyu Wu, Hao Jiang et al.

Effective tool use and reasoning are essential capabilities for large reasoning models~(LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework D-CORE~(\underline{\textbf{D}}ecomposing tasks and \underline{\textbf{Co}}mposing \underline{\textbf{Re}}asoning processes) that first incentivize the LRMs' task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning~(RL) to restore LRMs' reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate superiority of our method: D-CORE-8B reaches 77.7\% accuracy, surpassing the best-performing 8B model by 5.7\%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3\%, outperforming 70B models despite being 5$\times$ smaller. The source code is available at https://github.com/alibaba/EfficientAI.

AINov 6, 2025Code
Agentmandering: A Game-Theoretic Framework for Fair Redistricting via Large Language Model Agents

Hao Li, Haotian Chen, Ruoyuan Gong et al.

Redistricting plays a central role in shaping how votes are translated into political power. While existing computational methods primarily aim to generate large ensembles of legally valid districting plans, they often neglect the strategic dynamics involved in the selection process. This oversight creates opportunities for partisan actors to cherry-pick maps that, while technically compliant, are politically advantageous. Simply satisfying formal constraints does not ensure fairness when the selection process itself can be manipulated. We propose \textbf{Agentmandering}, a framework that reimagines redistricting as a turn-based negotiation between two agents representing opposing political interests. Drawing inspiration from game-theoretic ideas, particularly the \textit{Choose-and-Freeze} protocol, our method embeds strategic interaction into the redistricting process via large language model (LLM) agents. Agents alternate between selecting and freezing districts from a small set of candidate maps, gradually partitioning the state through constrained and interpretable choices. Evaluation on post-2020 U.S. Census data across all states shows that Agentmandering significantly reduces partisan bias and unfairness, while achieving 2 to 3 orders of magnitude lower variance than standard baselines. These results demonstrate both fairness and stability, especially in swing-state scenarios. Our code is available at https://github.com/Lihaogx/AgentMandering.