CLJul 13, 2023Code
In-context Autoencoder for Context Compression in a Large Language ModelTao Ge, Jing Hu, Lei Wang et al. · microsoft-research
We propose the In-context Autoencoder (ICAE), leveraging the power of a large language model (LLM) to compress a long context into short compact memory slots that can be directly conditioned on by the LLM for various purposes. ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data, enabling it to generate memory slots that accurately and comprehensively represent the original context. Then, it is fine-tuned on instruction data for producing desirable responses to various prompts. Experiments demonstrate that our lightweight ICAE, introducing about 1% additional parameters, effectively achieves $4\times$ context compression based on Llama, offering advantages in both improved latency and GPU memory cost during inference, and showing an interesting insight in memorization as well as potential for scalability. These promising results imply a novel perspective on the connection between working memory in cognitive science and representation learning in LLMs, revealing ICAE's significant implications in addressing the long context problem and suggesting further research in LLM context management. Our data, code and models are available at https://github.com/getao/icae.
CLDec 1, 2022
Extensible Prompts for Language Models on Zero-shot Language Style CustomizationTao Ge, Jing Hu, Li Dong et al. · microsoft-research
We propose eXtensible Prompt (X-Prompt) for prompting a large language model (LLM) beyond natural language (NL). X-Prompt instructs an LLM with not only NL but also an extensible vocabulary of imaginary words. Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words, thereby making a prompt more descriptive. Also, these imaginary words are designed to be out-of-distribution (OOD) robust so that they can be (re)used like NL words in various prompts, distinguishing X-Prompt from soft prompt that is for fitting in-distribution data. We propose context-augmented learning (CAL) to learn imaginary words for general usability, enabling them to work properly in OOD (unseen) prompts. We experiment X-Prompt for zero-shot language style customization as a case study. The promising results of X-Prompt demonstrate its potential to facilitate advanced interaction beyond the natural language interface, bridging the communication gap between humans and LLMs.
7.9CLJun 4
YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA TransitionPSBC LLM Team, Huawei LLM Team, Ruihan Long et al.
Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.
CVSep 24, 2023Code
RL-I2IT: Image-to-Image Translation with Deep Reinforcement LearningJing Hu, Ziwei Luo, Chengming Feng et al.
Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model is always challenging, requiring a huge number of parameters and easily falling into bad global minimums and overfitting. In this work, we reformulate I2IT as a step-wise decision-making problem via deep reinforcement learning (DRL) and propose a novel framework that performs RL-based I2IT (RL-I2IT). The key feature in the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model to progressively transform a source image successively to a target image. Considering that it is challenging to handle high dimensional continuous state and action spaces in the conventional RL framework, we introduce meta policy with a new concept Plan to the standard Actor-Critic model, which is of a lower dimension than the original image and can facilitate the actor to generate a tractable high dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems. Our implementation of the RL-I2IT framework is available at https://github.com/Algolzw/SPAC-Deformable-Registration.
LGFeb 25, 2023
DeepOHeat: Operator Learning-based Ultra-fast Thermal Simulation in 3D-IC DesignZiyue Liu, Yixing Li, Jing Hu et al.
Thermal issue is a major concern in 3D integrated circuit (IC) design. Thermal optimization of 3D IC often requires massive expensive PDE simulations. Neural network-based thermal prediction models can perform real-time prediction for many unseen new designs. However, existing works either solve 2D temperature fields only or do not generalize well to new designs with unseen design configurations (e.g., heat sources and boundary conditions). In this paper, for the first time, we propose DeepOHeat, a physics-aware operator learning framework to predict the temperature field of a family of heat equations with multiple parametric or non-parametric design configurations. This framework learns a functional map from the function space of multiple key PDE configurations (e.g., boundary conditions, power maps, heat transfer coefficients) to the function space of the corresponding solution (i.e., temperature fields), enabling fast thermal analysis and optimization by changing key design configurations (rather than just some parameters). We test DeepOHeat on some industrial design cases and compare it against Celsius 3D from Cadence Design Systems. Our results show that, for the unseen testing cases, a well-trained DeepOHeat can produce accurate results with $1000\times$ to $300000\times$ speedup.
LGJul 10, 2022
TCR: A Transformer Based Deep Network for Predicting Cancer Drugs ResponseJie Gao, Jing Hu, Wanqing Sun et al.
Predicting clinical outcomes to anti-cancer drugs on a personalized basis is challenging in cancer treatment due to the heterogeneity of tumors. Traditional computational efforts have been made to model the effect of drug response on individual samples depicted by their molecular profile, yet overfitting occurs because of the high dimension for omics data, hindering models from clinical application. Recent research shows that deep learning is a promising approach to build drug response models by learning alignment patterns between drugs and samples. However, existing studies employed the simple feature fusion strategy and only considered the drug features as a whole representation while ignoring the substructure information that may play a vital role when aligning drugs and genes. Hereby in this paper, we propose TCR (Transformer based network for Cancer drug Response) to predict anti-cancer drug response. By utilizing an attention mechanism, TCR is able to learn the interactions between drug atom/sub-structure and molecular signatures efficiently in our study. Furthermore, a dual loss function and cross sampling strategy were designed to improve the prediction power of TCR. We show that TCR outperformed all other methods under various data splitting strategies on all evaluation matrices (some with significant improvement). Extensive experiments demonstrate that TCR shows significantly improved generalization ability on independent in-vitro experiments and in-vivo real patient data. Our study highlights the prediction power of TCR and its potential value for cancer drug repurpose and precision oncology treatment.
IVNov 10, 2023
UMedNeRF: Uncertainty-aware Single View Volumetric Rendering for Medical Neural Radiance FieldsJing Hu, Qinrui Fan, Shu Hu et al.
In the field of clinical medicine, computed tomography (CT) is an effective medical imaging modality for the diagnosis of various pathologies. Compared with X-ray images, CT images can provide more information, including multi-planar slices and three-dimensional structures for clinical diagnosis. However, CT imaging requires patients to be exposed to large doses of ionizing radiation for a long time, which may cause irreversible physical harm. In this paper, we propose an Uncertainty-aware MedNeRF (UMedNeRF) network based on generated radiation fields. The network can learn a continuous representation of CT projections from 2D X-ray images by obtaining the internal structure and depth information and using adaptive loss weights to ensure the quality of the generated images. Our model is trained on publicly available knee and chest datasets, and we show the results of CT projection rendering with a single X-ray and compare our method with other methods based on generated radiation fields.
CLFeb 4
ERNIE 5.0 Technical ReportHaifeng Wang, Hua Wu, Tian Wu et al.
In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model desinged for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
CVSep 30, 2023
Controlling Neural Style Transfer with Deep Reinforcement LearningChengming Feng, Jing Hu, Xin Wang et al.
Controlling the degree of stylization in the Neural Style Transfer (NST) is a little tricky since it usually needs hand-engineering on hyper-parameters. In this paper, we propose the first deep Reinforcement Learning (RL) based architecture that splits one-step style transfer into a step-wise process for the NST task. Our RL-based method tends to preserve more details and structures of the content image in early steps, and synthesize more style patterns in later steps. It is a user-easily-controlled style-transfer method. Additionally, as our RL-based model performs the stylization progressively, it is lightweight and has lower computational complexity than existing one-step Deep Learning (DL) based models. Experimental results demonstrate the effectiveness and robustness of our method.
IVAug 8, 2024
An Explainable Non-local Network for COVID-19 DiagnosisJingfu Yang, Peng Huang, Jing Hu et al.
The CNN has achieved excellent results in the automatic classification of medical images. In this study, we propose a novel deep residual 3D attention non-local network (NL-RAN) to classify CT images included COVID-19, common pneumonia, and normal to perform rapid and explainable COVID-19 diagnosis. We built a deep residual 3D attention non-local network that could achieve end-to-end training. The network is embedded with a nonlocal module to capture global information, while a 3D attention module is embedded to focus on the details of the lesion so that it can directly analyze the 3D lung CT and output the classification results. The output of the attention module can be used as a heat map to increase the interpretability of the model. 4079 3D CT scans were included in this study. Each scan had a unique label (novel coronavirus pneumonia, common pneumonia, and normal). The CT scans cohort was randomly split into a training set of 3263 scans, a validation set of 408 scans, and a testing set of 408 scans. And compare with existing mainstream classification methods, such as CovNet, CBAM, ResNet, etc. Simultaneously compare the visualization results with visualization methods such as CAM. Model performance was evaluated using the Area Under the ROC Curve(AUC), precision, and F1-score. The NL-RAN achieved the AUC of 0.9903, the precision of 0.9473, and the F1-score of 0.9462, surpass all the classification methods compared. The heat map output by the attention module is also clearer than the heat map output by CAM. Our experimental results indicate that our proposed method performs significantly better than existing methods. In addition, the first attention module outputs a heat map containing detailed outline information to increase the interpretability of the model. Our experiments indicate that the inference of our model is fast. It can provide real-time assistance with diagnosis.
LGMay 27, 2022
Efficient Forecasting of Large Scale Hierarchical Time Series via Multilevel ClusteringXing Han, Tongzheng Ren, Jing Hu et al.
We propose a novel approach to the problem of clustering hierarchically aggregated time-series data, which has remained an understudied problem though it has several commercial applications. We first group time series at each aggregated level, while simultaneously leveraging local and global information. The proposed method can cluster hierarchical time series (HTS) with different lengths and structures. For common two-level hierarchies, we employ a combined objective for local and global clustering over spaces of discrete probability measures, using Wasserstein distance coupled with Soft-DTW divergence. For multi-level hierarchies, we present a bottom-up procedure that progressively leverages lower-level information for higher-level clustering. Our final goal is to improve both the accuracy and speed of forecasts for a larger number of HTS needed for a real-world application. To attain this goal, each time series is first assigned the forecast for its cluster representative, which can be considered as a "shrinkage prior" for the set of time series it represents. Then this base forecast can be quickly fine-tuned to adjust to the specifics of that time series. We empirically show that our method substantially improves performance in terms of both speed and accuracy for large-scale forecasting tasks involving much HTS.
96.0CEMay 20
HyFrac.fun: A 3D Hydraulic Fracturing Simulator on CloudJing Hu, Qian Liu, Jaroon Rungamornrat
When multiple hydraulic fractures propagate simultaneously from a horizontal wellbore, elastic stress-shadow interactions generate complex non-planar three-dimensional geometries whose effect on subsequent reservoir drainage has infrequently been quantified, because the propagation and production solvers have historically been incompatible stand-alone tools. This paper presents HyFrac.fun, a cloud-native platform that bridges this gap by exploiting a structural isomorphism between the two SGBEM--FEM governing operator systems. The platform enables automated zero-conversion handoff of the evolved 3D fracture mesh directly to the steady-state Darcy production solver for realizing a fully integrated lifecycle simulation of multi-stage non-planar hydraulic fractures. The lifecycle analysis reveals a double shadow phenomenon: the mechanical stress shadow that suppresses inner-fracture growth during stimulation mirrors a fluid pressure shadow that reduces the inner fracture's drawout rate at small cluster spacing. Critically, switching to a shear-thinning power-law fracturing fluid leaves the fracture trajectories and production rates almost unchanged, demonstrating that stress-shadow-controlled fracture geometry instead of fluid rheology is the primary determinant of long-term production efficiency at equal injection rates. These physics findings are accessible from integrated fracture propagation and production simulations.
SDJan 23
CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal DistillationJing Hu, Danxiang Zhu, Xianlong Luo et al.
Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and semantically critical tokens. At the sequence level, CORD introduces a judge-based global reward to optimize complete reasoning trajectories via Group Relative Policy Optimization (GRPO). Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning and substantially bridges the audio-text performance gap with only 80k synthetic training samples, validating the efficacy and data efficiency of our on-policy, multi-level cross-modal alignment approach.
CVMar 27, 2025Code
VADMamba: Exploring State Space Models for Fast Video Anomaly DetectionJiahao Lyu, Minghua Zhao, Jing Hu et al.
Video anomaly detection (VAD) methods are mostly CNN-based or Transformer-based, achieving impressive results, but the focus on detection accuracy often comes at the expense of inference speed. The emergence of state space models in computer vision, exemplified by the Mamba model, demonstrates improved computational efficiency through selective scans and showcases the great potential for long-range modeling. Our study pioneers the application of Mamba to VAD, dubbed VADMamba, which is based on multi-task learning for frame prediction and optical flow reconstruction. Specifically, we propose the VQ-Mamba Unet (VQ-MaU) framework, which incorporates a Vector Quantization (VQ) layer and Mamba-based Non-negative Visual State Space (NVSS) block. Furthermore, two individual VQ-MaU networks separately predict frames and reconstruct corresponding optical flows, further boosting accuracy through a clip-level fusion evaluation strategy. Experimental results validate the efficacy of the proposed VADMamba across three benchmark datasets, demonstrating superior performance in inference speed compared to previous work. Code is available at https://github.com/jLooo/VADMamba.
CVJul 22, 2024
Bidirectional skip-frame prediction for video anomaly detection with intra-domain disparity-driven attentionJiahao Lyu, Minghua Zhao, Jing Hu et al.
With the widespread deployment of video surveillance devices and the demand for intelligent system development, video anomaly detection (VAD) has become an important part of constructing intelligent surveillance systems. Expanding the discriminative boundary between normal and abnormal events to enhance performance is the common goal and challenge of VAD. To address this problem, we propose a Bidirectional Skip-frame Prediction (BiSP) network based on a dual-stream autoencoder, from the perspective of learning the intra-domain disparity between different features. The BiSP skips frames in the training phase to achieve the forward and backward frame prediction respectively, and in the testing phase, it utilizes bidirectional consecutive frames to co-predict the same intermediate frames, thus expanding the degree of disparity between normal and abnormal events. The BiSP designs the variance channel attention and context spatial attention from the perspectives of movement patterns and object scales, respectively, thus ensuring the maximization of the disparity between normal and abnormal in the feature extraction and delivery with different dimensions. Extensive experiments from four benchmark datasets demonstrate the effectiveness of the proposed BiSP, which substantially outperforms state-of-the-art competing methods.
CVMay 10, 2025Code
Improving Generalization of Medical Image Registration Foundation ModelJing Hu, Kaiwei Yu, Hongjiang Xian et al.
Deformable registration is a fundamental task in medical image processing, aiming to achieve precise alignment by establishing nonlinear correspondences between images. Traditional methods offer good adaptability and interpretability but are limited by computational efficiency. Although deep learning approaches have significantly improved registration speed and accuracy, they often lack flexibility and generalizability across different datasets and tasks. In recent years, foundation models have emerged as a promising direction, leveraging large and diverse datasets to learn universal features and transformation patterns for image registration, thus demonstrating strong cross-task transferability. However, these models still face challenges in generalization and robustness when encountering novel anatomical structures, varying imaging conditions, or unseen modalities. To address these limitations, this paper incorporates Sharpness-Aware Minimization (SAM) into foundation models to enhance their generalization and robustness in medical image registration. By optimizing the flatness of the loss landscape, SAM improves model stability across diverse data distributions and strengthens its ability to handle complex clinical scenarios. Experimental results show that foundation models integrated with SAM achieve significant improvements in cross-dataset registration performance, offering new insights for the advancement of medical image registration technology. Our code is available at https://github.com/Promise13/fm_sam}{https://github.com/Promise13/fm\_sam.
CVSep 26, 2024
Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly DetectionJiahao Lyu, Minghua Zhao, Jing Hu et al.
Video anomaly detection (VAD) often learns the distribution of normal samples and detects the anomaly through measuring significant deviations, but the undesired generalization may reconstruct a few anomalies thus suppressing the deviations. Meanwhile, most VADs cannot cope with cross-dataset validation for new target domains, and few-shot methods must laboriously rely on model-tuning from the target domain to complete domain adaptation. To address these problems, we propose a novel VAD method with a motion-guided memory module to achieve cross-dataset validation with zero-shot. First, we add Gaussian blur to the raw appearance images, thereby constructing the global pseudo-anomaly, which serves as the input to the network. Then, we propose multi-scale residual channel attention to deblur the pseudo-anomaly in normal samples. Next, memory items are obtained by recording the motion features in the training phase, which are used to retrieve the motion features from the raw information in the testing phase. Lastly, our method can ignore the blurred real anomaly through attention and rely on motion memory items to increase the normality gap between normal and abnormal motion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method. Compared with cross-domain methods, our method achieves competitive performance without adaptation during testing.
CVMay 7, 2025Code
RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style GenerationJing Hu, Chengming Feng, Shu Hu et al.
Arbitrary style transfer aims to apply the style of any given artistic image to another content image. Still, existing deep learning-based methods often require significant computational costs to generate diverse stylized results. Motivated by this, we propose a novel reinforcement learning-based framework for arbitrary style transfer RLMiniStyler. This framework leverages a unified reinforcement learning policy to iteratively guide the style transfer process by exploring and exploiting stylization feedback, generating smooth sequences of stylized results while achieving model lightweight. Furthermore, we introduce an uncertainty-aware multi-task learning strategy that automatically adjusts loss weights to adapt to the content and style balance requirements at different training stages, thereby accelerating model convergence. Through a series of experiments across image various resolutions, we have validated the advantages of RLMiniStyler over other state-of-the-art methods in generating high-quality, diverse artistic image sequences at a lower cost. Codes are available at https://github.com/fengxiaoming520/RLMiniStyler.
CVJan 26
Forward Consistency Learning with Gated Context Aggregation for Video Anomaly DetectionJiahao Lyu, Minghua Zhao, Xuewen Huang et al.
As a crucial element of public security, video anomaly detection (VAD) aims to measure deviations from normal patterns for various events in real-time surveillance systems. However, most existing VAD methods rely on large-scale models to pursue extreme accuracy, limiting their feasibility on resource-limited edge devices. Moreover, mainstream prediction-based VAD detects anomalies using only single-frame future prediction errors, overlooking the richer constraints from longer-term temporal forward information. In this paper, we introduce FoGA, a lightweight VAD model that performs Forward consistency learning with Gated context Aggregation, containing about 2M parameters and tailored for potential edge devices. Specifically, we propose a Unet-based method that performs feature extraction on consecutive frames to generate both immediate and forward predictions. Then, we introduce a gated context aggregation module into the skip connections to dynamically fuse encoder and decoder features at the same spatial scale. Finally, the model is jointly optimized with a novel forward consistency loss, and a hybrid anomaly measurement strategy is adopted to integrate errors from both immediate and forward frames for more accurate detection. Extensive experiments demonstrate the effectiveness of the proposed method, which substantially outperforms state-of-the-art competing methods, running up to 155 FPS. Hence, our FoGA achieves an excellent trade-off between performance and the efficiency metric.
72.7CVApr 3
The Eleventh NTIRE 2026 Efficient Super-Resolution Challenge ReportBin Ren, Hang Guo, Yan Shu et al.
This paper reviews the NTIRE 2026 challenge on efficient single-image super-resolution with a focus on the proposed solutions and results. The aim of this challenge is to devise a network that reduces one or several aspects, such as runtime, parameters, and FLOPs, while maintaining PSNR of around 26.90 dB on the DIV2K_LSDIR_valid dataset, and 26.99 dB on the DIV2K_LSDIR_test dataset. The challenge had 95 registered participants, and 15 teams made valid submissions. They gauge the state-of-the-art results for efficient single-image super-resolution.
CVApr 14, 2025
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge ReportBin Ren, Hang Guo, Lei Sun et al.
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw \textbf{244} registered entrants, with \textbf{43} teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
CVDec 13, 2024
HS-FPN: High Frequency and Spatial Perception FPN for Tiny Object DetectionZican Shi, Jing Hu, Jie Ren et al.
The introduction of Feature Pyramid Network (FPN) has significantly improved object detection performance. However, substantial challenges remain in detecting tiny objects, as their features occupy only a very small proportion of the feature maps. Although FPN integrates multi-scale features, it does not directly enhance or enrich the features of tiny objects. Furthermore, FPN lacks spatial perception ability. To address these issues, we propose a novel High Frequency and Spatial Perception Feature Pyramid Network (HS-FPN) with two innovative modules. First, we designed a high frequency perception module (HFP) that generates high frequency responses through high pass filters. These high frequency responses are used as mask weights from both spatial and channel perspectives to enrich and highlight the features of tiny objects in the original feature maps. Second, we developed a spatial dependency perception module (SDP) to capture the spatial dependencies that FPN lacks. Our experiments demonstrate that detectors based on HS-FPN exhibit competitive advantages over state-of-the-art models on the AI-TOD dataset for tiny object detection.
BMApr 16, 2024
HelixFold-Multimer: Elevating Protein Complex Structure Prediction to New HeightsXiaomin Fang, Jie Gao, Jing Hu et al.
While monomer protein structure prediction tools boast impressive accuracy, the prediction of protein complex structures remains a daunting challenge in the field. This challenge is particularly pronounced in scenarios involving complexes with protein chains from different species, such as antigen-antibody interactions, where accuracy often falls short. Limited by the accuracy of complex prediction, tasks based on precise protein-protein interaction analysis also face obstacles. In this report, we highlight the ongoing advancements of our protein complex structure prediction model, HelixFold-Multimer, underscoring its enhanced performance. HelixFold-Multimer provides precise predictions for diverse protein complex structures, especially in therapeutic protein interactions. Notably, HelixFold-Multimer achieves remarkable success in antigen-antibody and peptide-protein structure prediction, greatly surpassing AlphaFold 3. HelixFold-Multimer is now available for public use on the PaddleHelix platform, offering both a general version and an antigen-antibody version. Researchers can conveniently access and utilize this service for their development needs.
CVJan 17, 2024
Efficient Image Super-Resolution via Symmetric Visual Attention NetworkChengxu Wu, Qinrui Fan, Shu Hu et al.
An important development direction in the Single-Image Super-Resolution (SISR) algorithms is to improve the efficiency of the algorithms. Recently, efficient Super-Resolution (SR) research focuses on reducing model complexity and improving efficiency through improved deep small kernel convolution, leading to a small receptive field. The large receptive field obtained by large kernel convolution can significantly improve image quality, but the computational cost is too high. To improve the reconstruction details of efficient super-resolution reconstruction, we propose a Symmetric Visual Attention Network (SVAN) by applying large receptive fields. The SVAN decomposes a large kernel convolution into three different combinations of convolution operations and combines them with an attention mechanism to form a Symmetric Large Kernel Attention Block (SLKAB), which forms a symmetric attention block with a bottleneck structure by the size of the receptive field in the convolution combination to extract depth features effectively as the basic component of the SVAN. Our network gets a large receptive field while minimizing the number of parameters and improving the perceptual ability of the model. The experimental results show that the proposed SVAN can obtain high-quality super-resolution reconstruction results using only about 30% of the parameters of existing SOTA methods.
CVAug 2, 2025
Self-Enhanced Image Clustering with Cross-Modal Semantic ConsistencyZihan Li, Wei Sun, Jing Hu et al.
While large language-image pre-trained models like CLIP offer powerful generic features for image clustering, existing methods typically freeze the encoder. This creates a fundamental mismatch between the model's task-agnostic representations and the demands of a specific clustering task, imposing a ceiling on performance. To break this ceiling, we propose a self-enhanced framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency and then specializes the encoder through Self-Enhancement. In the first stage, we focus on Cross-Modal Semantic Consistency. By mining consistency between generated image-text pairs at the instance, cluster assignment, and cluster center levels, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. This alignment process is bolstered by a novel method for generating higher-quality cluster centers and a dynamic balancing regularizer to ensure well-distributed assignments. In the second stage, we introduce a Self-Enhanced fine-tuning strategy. The well-aligned model from the first stage acts as a reliable pseudo-label generator. These self-generated supervisory signals are then used to feed back the efficient, joint optimization of the vision encoder and clustering heads, unlocking their full potential. Extensive experiments on six mainstream datasets show that our method outperforms existing deep clustering methods by significant margins. Notably, our ViT-B/32 model already matches or even surpasses the accuracy of state-of-the-art methods built upon the far larger ViT-L/14.
BMDec 13, 2024
Precise Antigen-Antibody Structure Predictions Enhance Antibody Development with HelixFold-MultimerJie Gao, Jing Hu, Lihang Liu et al.
The accurate prediction of antigen-antibody structures is essential for advancing immunology and therapeutic development, as it helps elucidate molecular interactions that underlie immune responses. Despite recent progress with deep learning models like AlphaFold and RoseTTAFold, accurately modeling antigen-antibody complexes remains a challenge due to their unique evolutionary characteristics. HelixFold-Multimer, a specialized model developed for this purpose, builds on the framework of AlphaFold-Multimer and demonstrates improved precision for antigen-antibody structures. HelixFold-Multimer not only surpasses other models in accuracy but also provides essential insights into antibody development, enabling more precise identification of binding sites, improved interaction prediction, and enhanced design of therapeutic antibodies. These advances underscore HelixFold-Multimer's potential in supporting antibody research and therapeutic innovation.
46.9CVApr 1
VADMamba++: Efficient Video Anomaly Detection via Hybrid Modeling in Grayscale SpaceJihao Lyu, Minghua Zhao, Jing Hu et al.
VADMamba pioneered the introduction of Mamba to Video Anomaly Detection (VAD), achieving high accuracy and fast inference through hybrid proxy tasks. Nevertheless, its heavy reliance on optical flow as auxiliary input and inter-task fusion scoring constrains its applicability to a single proxy task. In this paper, we introduce VADMamba++, an efficient VAD method based on the Gray-to-RGB paradigm that enforces a Single-Channel to Three-Channel reconstruction mapping, designed for a single proxy task and operating without auxiliary inputs. This paradigm compels inferring color appearances from grayscale structures, allowing anomalies to be more effectively revealed through dual inconsistencies between structure and chromatic cues. Specifically, VADMamba++ reconstructs grayscale frames into the RGB space to simultaneously discriminate structural geometry and chromatic fidelity, thereby enhancing sensitivity to explicit visual anomalies. We further design a hybrid modeling backbone that integrates Mamba, CNN, and Transformer modules to capture diverse normal patterns while suppressing the appearance of anomalies. Furthermore, an intra-task fusion scoring strategy integrates explicit future-frame prediction errors with implicit quantized feature errors, further improving accuracy under a single task setting. Extensive experiments on three benchmark datasets demonstrate that VADMamba++ outperforms state-of-the-art methods while meeting performance and efficiency, especially under a strict single-task setting with only frame-level inputs.
SDFeb 15
Eureka-Audio: Triggering Audio Intelligence in Compact Language ModelsDan Zhang, Yishu Lei, Jing Hu et al.
We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models that are 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrates strong performance on automatic speech recognition (ASR), audio understanding, and dense audio captioning, matching or surpassing multiple 7B to 30B audio and omni-modal baselines. The model adopts a unified end-to-end architecture composed of a lightweight language backbone, a Whisper-based audio encoder, and a sparsely activated Mixture-of-Experts (MoE) adapter that explicitly accounts for audio heterogeneity and alleviates cross-modal optimization conflicts under limited capacity. To further enhance paralinguistic reasoning, we introduce DataFlux, a closed loop audio instruction data synthesis and verification pipeline that constructs high quality, logically consistent supervision from raw audio. Extensive evaluations across ASR, knowledge reasoning, safety, instruction following, and paralinguistic benchmarks, demonstrate that Eureka-Audio achieves an efficient balance between computational cost and performance. These results establish Eureka Audio as a strong and practical baseline for lightweight audio understanding models.
CVJul 23, 2025
SRMambaV2: Biomimetic Attention for Sparse Point Cloud Upsampling in Autonomous DrivingChuang Chen, Xiaolin Qin, Jing Hu et al.
Upsampling LiDAR point clouds in autonomous driving scenarios remains a significant challenge due to the inherent sparsity and complex 3D structures of the data. Recent studies have attempted to address this problem by converting the complex 3D spatial scenes into 2D image super-resolution tasks. However, due to the sparse and blurry feature representation of range images, accurately reconstructing detailed and complex spatial topologies remains a major difficulty. To tackle this, we propose a novel sparse point cloud upsampling method named SRMambaV2, which enhances the upsampling accuracy in long-range sparse regions while preserving the overall geometric reconstruction quality. Specifically, inspired by human driver visual perception, we design a biomimetic 2D selective scanning self-attention (2DSSA) mechanism to model the feature distribution in distant sparse areas. Meanwhile, we introduce a dual-branch network architecture to enhance the representation of sparse features. In addition, we introduce a progressive adaptive loss (PAL) function to further refine the reconstruction of fine-grained details during the upsampling process. Experimental results demonstrate that SRMambaV2 achieves superior performance in both qualitative and quantitative evaluations, highlighting its effectiveness and practical value in automotive sparse point cloud upsampling tasks.
BMJul 3, 2025
HelixDesign-Antibody: A Scalable Production-Grade Platform for Antibody Design Built on HelixFold3Jie Gao, Jing Hu, Shanzhuo Zhang et al.
Antibody engineering is essential for developing therapeutics and advancing biomedical research. Traditional discovery methods often rely on time-consuming and resource-intensive experimental screening. To enhance and streamline this process, we introduce a production-grade, high-throughput platform built on HelixFold3, HelixDesign-Antibody, which utilizes the high-accuracy structure prediction model, HelixFold3. The platform facilitates the large-scale generation of antibody candidate sequences and evaluates their interaction with antigens. Integrated high-performance computing (HPC) support enables high-throughput screening, addressing challenges such as fragmented toolchains and high computational demands. Validation on multiple antigens showcases the platform's ability to generate diverse and high-quality antibodies, confirming a scaling law where exploring larger sequence spaces increases the likelihood of identifying optimal binders. This platform provides a seamless, accessible solution for large-scale antibody design and is available via the antibody design page of PaddleHelix platform.
CVJun 15, 2025
Combining Self-attention and Dilation Convolutional for Semantic Segmentation of Coal Maceral GroupsZhenghao Xi, Zhengnan Lv, Yang Zheng et al.
The segmentation of coal maceral groups can be described as a semantic segmentation process of coal maceral group images, which is of great significance for studying the chemical properties of coal. Generally, existing semantic segmentation models of coal maceral groups use the method of stacking parameters to achieve higher accuracy. It leads to increased computational requirements and impacts model training efficiency. At the same time, due to the professionalism and diversity of coal maceral group images sampling, obtaining the number of samples for model training requires a long time and professional personnel operation. To address these issues, We have innovatively developed an IoT-based DA-VIT parallel network model. By utilizing this model, we can continuously broaden the dataset through IoT and achieving sustained improvement in the accuracy of coal maceral groups segmentation. Besides, we decouple the parallel network from the backbone network to ensure the normal using of the backbone network during model data updates. Secondly, DCSA mechanism of DA-VIT is introduced to enhance the local feature information of coal microscopic images. This DCSA can decompose the large kernels of convolutional attention into multiple scales and reduce 81.18% of parameters.Finally, we performed the contrast experiment and ablation experiment between DA-VIT and state-of-the-art methods at lots of evaluation metrics. Experimental results show that DA-VIT-Base achieves 92.14% pixel accuracy and 63.18% mIoU. Params and FLOPs of DA-VIT-Tiny are 4.95M and 8.99G, respectively. All of the evaluation metrics of the proposed DA-VIT are better than other state-of-the-art methods.
BMMay 28, 2025
HelixDesign-Binder: A Scalable Production-Grade Platform for Binder Design Built on HelixFold3Jie Gao, Jun Li, Jing Hu et al.
Protein binder design is central to therapeutics, diagnostics, and synthetic biology, yet practical deployment remains challenging due to fragmented workflows, high computational costs, and complex tool integration. We present HelixDesign-Binder, a production-grade, high-throughput platform built on HelixFold3 that automates the full binder design pipeline, from backbone generation and sequence design to structural evaluation and multi-dimensional scoring. By unifying these stages into a scalable and user-friendly system, HelixDesign-Binder enables efficient exploration of binder candidates with favorable structural, energetic, and physicochemical properties. The platform leverages Baidu Cloud's high-performance infrastructure to support large-scale design and incorporates advanced scoring metrics, including ipTM, predicted binding free energy, and interface hydrophobicity. Benchmarking across six protein targets demonstrates that HelixDesign-Binder reliably produces diverse and high-quality binders, some of which match or exceed validated designs in predicted binding affinity. HelixDesign-Binder is accessible via an interactive web interface in PaddleHelix platform, supporting both academic research and industrial applications in antibody and protein binder development.
LGDec 12, 2024
Goal-Conditioned Supervised Learning for Multi-Objective RecommendationShijun Li, Hilaf Hasson, Jing Hu et al.
Multi-objective learning endeavors to concurrently optimize multiple objectives using a single model, aiming to achieve high and balanced performance across diverse objectives. However, this often entails a more complex optimization problem, particularly when navigating potential conflicts between objectives, leading to solutions with higher memory requirements and computational complexity. This paper introduces a Multi-Objective Goal-Conditioned Supervised Learning (MOGCSL) framework for automatically learning to achieve multiple objectives from offline sequential data. MOGCSL extends the conventional GCSL method to multi-objective scenarios by redefining goals from one-dimensional scalars to multi-dimensional vectors. It benefits from naturally eliminating the need for complex architectures and optimization constraints. Moreover, MOGCSL effectively filters out uninformative or noisy instances that fail to achieve desirable long-term rewards across multiple objectives. We also introduces a novel goal-selection algorithm for MOGCSL to model and identify "high" achievable goals for inference. While MOGCSL is quite general, we focus on its application to the next action prediction problem in commercial-grade recommender systems. In this context, any viable solution needs to be reasonably scalable and also be robust to large amounts of noisy data that is characteristic of this application space. We show that MOGCSL performs admirably on both counts by extensive experiments on real-world recommendation datasets. Also, analysis and experiments are included to explain its strength in discounting the noisier portions of training data in recommender systems with multiple objectives.
AO-PHApr 26, 2024
MetaSD: A Unified Framework for Scalable Downscaling of Meteorological Variables in Diverse SituationsJing Hu, Honghu Zhang, Peng Zheng et al.
Addressing complex meteorological processes at a fine spatial resolution requires substantial computational resources. To accelerate meteorological simulations, researchers have utilized neural networks to downscale meteorological variables from low-resolution simulations. Despite notable advancements, contemporary cutting-edge downscaling algorithms tailored to specific variables. Addressing meteorological variables in isolation overlooks their interconnectedness, leading to an incomplete understanding of atmospheric dynamics. Additionally, the laborious processes of data collection, annotation, and computational resources required for individual variable downscaling are significant hurdles. Given the limited versatility of existing models across different meteorological variables and their failure to account for inter-variable relationships, this paper proposes a unified downscaling approach leveraging meta-learning. This framework aims to facilitate the downscaling of diverse meteorological variables derived from various numerical models and spatiotemporal scales. Trained at variables consisted of temperature, wind, surface pressure and total precipitation from ERA5 and GFS, the proposed method can be extended to downscale convective precipitation, potential energy, height, humidity and ozone from CFS, S2S and CMIP6 at different spatiotemporal scales, which demonstrating its capability to capture the interconnections among diverse variables. Our approach represents the initial effort to create a generalized downscaling model. Experimental evidence demonstrates that the proposed model outperforms existing top downscaling methods in both quantitative and qualitative assessments.
LGDec 22, 2021
Dynamic Combination of Heterogeneous Models for Hierarchical Time SeriesXing Han, Jing Hu, Joydeep Ghosh
We introduce a framework to dynamically combine heterogeneous models called \texttt{DYCHEM}, which forecasts a set of time series that are related through an aggregation hierarchy. Different types of forecasting models can be employed as individual ``experts'' so that each model is tailored to the nature of the corresponding time series. \texttt{DYCHEM} learns hierarchical structures during the training stage to help generalize better across all the time series being modeled and also mitigates coherency issues that arise due to constraints imposed by the hierarchy. To improve the reliability of forecasts, we construct quantile estimations based on the point forecasts obtained from combined heterogeneous models. The resulting quantile forecasts are coherent and independent of the choice of forecasting models. We conduct a comprehensive evaluation of both point and quantile forecasts for hierarchical time series (HTS), including public data and user records from a large financial software company. In general, our method is robust, adaptive to datasets with different properties, and highly configurable and efficient for large-scale forecasting pipelines.
IVDec 14, 2021
Stochastic Planner-Actor-Critic for Unsupervised Deformable Image RegistrationZiwei Luo, Jing Hu, Xin Wang et al.
Large deformations of organs, caused by diverse shapes and nonlinear shape changes, pose a significant challenge for medical image registration. Traditional registration methods need to iteratively optimize an objective function via a specific deformation model along with meticulous parameter tuning, but which have limited capabilities in registering images with large deformations. While deep learning-based methods can learn the complex mapping from input images to their respective deformation field, it is regression-based and is prone to be stuck at local minima, particularly when large deformations are involved. To this end, we present Stochastic Planner-Actor-Critic (SPAC), a novel reinforcement learning-based framework that performs step-wise registration. The key notion is warping a moving image successively by each time step to finally align to a fixed image. Considering that it is challenging to handle high dimensional continuous action and state spaces in the conventional reinforcement learning (RL) framework, we introduce a new concept `Plan' to the standard Actor-Critic model, which is of low dimension and can facilitate the actor to generate a tractable high dimensional action. The entire framework is based on unsupervised training and operates in an end-to-end manner. We evaluate our method on several 2D and 3D medical image datasets, some of which contain large deformations. Our empirical results highlight that our work achieves consistent, significant gains and outperforms state-of-the-art methods.
CVDec 14, 2021
Stochastic Actor-Executor-Critic for Image-to-Image TranslationZiwei Luo, Jing Hu, Xin Wang et al.
Training a model-free deep reinforcement learning model to solve image-to-image translation is difficult since it involves high-dimensional continuous state and action spaces. In this paper, we draw inspiration from the recent success of the maximum entropy reinforcement learning framework designed for challenging continuous control problems to develop stochastic policies over high dimensional continuous spaces including image representation, generation, and control simultaneously. Central to this method is the Stochastic Actor-Executor-Critic (SAEC) which is an off-policy actor-critic model with an additional executor to generate realistic images. Specifically, the actor focuses on the high-level representation and control policy by a stochastic latent action, as well as explicitly directs the executor to generate low-level actions to manipulate the state. Experiments on several image-to-image translation tasks have demonstrated the effectiveness and robustness of the proposed SAEC when facing high-dimensional continuous space problems.
CVDec 4, 2021
Feature-based Recognition Framework for Super-resolution ImagesJing Hu, Meiqi Zhang, Rui Zhang
In practical application, the performance of recognition network usually decreases when being applied on super-resolution images. In this paper, we propose a feature-based recognition network combined with GAN (FGAN). Our network improves the recognition accuracy by extracting more features that benefit recognition from SR images. In the experiment, we build three datasets using three different super-resolution algorithm, and our network increases the recognition accuracy by more than 6% comparing with ReaNet50 and DenseNet121.
CVJan 29, 2020
Robust Multimodal Image Registration Using Deep Recurrent Reinforcement LearningShanhui Sun, Jing Hu, Mingqing Yao et al.
The crucial components of a conventional image registration method are the choice of the right feature representations and similarity measures. These two components, although elaborately designed, are somewhat handcrafted using human knowledge. To this end, these two components are tackled in an end-to-end manner via reinforcement learning in this work. Specifically, an artificial agent, which is composed of a combined policy and value network, is trained to adjust the moving image toward the right direction. We train this network using an asynchronous reinforcement learning algorithm, where a customized reward function is also leveraged to encourage robust image registration. This trained network is further incorporated with a lookahead inference to improve the registration capability. The advantage of this algorithm is fully demonstrated by our superior performance on clinical MR and CT image pairs to other state-of-the-art medical image registration methods.