Xiaobo Li

CV
h-index34
30papers
891citations
Novelty48%
AI Score57

30 Papers

CVMar 26, 2022Code
GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection

Yue Liao, Aixi Zhang, Miao Lu et al.

The task of Human-Object Interaction~(HOI) detection could be divided into two core problems, i.e., human-object association and interaction understanding. In this paper, we reveal and address the disadvantages of the conventional query-driven HOI detectors from the two aspects. For the association, previous two-branch methods suffer from complex and costly post-matching, while single-branch methods ignore the features distinction in different tasks. We propose Guided-Embedding Network~(GEN) to attain a two-branch pipeline without post-matching. In GEN, we design an instance decoder to detect humans and objects with two independent query sets and a position Guided Embedding~(p-GE) to mark the human and object in the same position as a pair. Besides, we design an interaction decoder to classify interactions, where the interaction queries are made of instance Guided Embeddings (i-GE) generated from the outputs of each instance decoder layer. For the interaction understanding, previous methods suffer from long-tailed distribution and zero-shot discovery. This paper proposes a Visual-Linguistic Knowledge Transfer (VLKT) training strategy to enhance interaction understanding by transferring knowledge from a visual-linguistic pre-trained model CLIP. In specific, we extract text embeddings for all labels with CLIP to initialize the classifier and adopt a mimic loss to minimize the visual feature distance between GEN and CLIP. As a result, GEN-VLKT outperforms the state of the art by large margins on multiple datasets, e.g., +5.05 mAP on HICO-Det. The source codes are available at https://github.com/YueLiao/gen-vlkt.

CVMay 9, 2022Code
Towards 3D Face Reconstruction in Perspective Projection: Estimating 6DoF Face Pose from Monocular Image

Yueying Kao, Bowen Pan, Miao Xu et al.

In 3D face reconstruction, orthogonal projection has been widely employed to substitute perspective projection to simplify the fitting process. This approximation performs well when the distance between camera and face is far enough. However, in some scenarios that the face is very close to camera or moving along the camera axis, the methods suffer from the inaccurate reconstruction and unstable temporal fitting due to the distortion under the perspective projection. In this paper, we aim to address the problem of single-image 3D face reconstruction under perspective projection. Specifically, a deep neural network, Perspective Network (PerspNet), is proposed to simultaneously reconstruct 3D face shape in canonical space and learn the correspondence between 2D pixels and 3D points, by which the 6DoF (6 Degrees of Freedom) face pose can be estimated to represent perspective projection. Besides, we contribute a large ARKitFace dataset to enable the training and evaluation of 3D face reconstruction solutions under the scenarios of perspective projection, which has 902,724 2D facial images with ground-truth 3D face mesh and annotated 6DoF pose parameters. Experimental results show that our approach outperforms current state-of-the-art methods by a significant margin. The code and data are available at https://github.com/cbsropenproject/6dof_face.

36.0CLJun 2
Hint-Guided Diversified Policy Optimization for LLM Reasoning

Zhiyu Cao, Kaixin Wu, Mingjie Zhong et al.

Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to the outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), allowing the model to first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages of Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning to incentivize the model to generate diverse and reliable solutions following the ``propose-select-think'' trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.

CVApr 8, 2023
GANHead: Towards Generative Animatable Neural Head Avatars

Sijing Wu, Yichao Yan, Yunhao Li et al.

To bring digital avatars into people's lives, it is highly demanded to efficiently generate complete, realistic, and animatable head avatars. This task is challenging, and it is difficult for existing methods to satisfy all the requirements at once. To achieve these goals, we propose GANHead (Generative Animatable Neural Head Avatar), a novel generative head model that takes advantages of both the fine-grained control over the explicit expression parameters and the realistic rendering results of implicit representations. Specifically, GANHead represents coarse geometry, fine-gained details and texture via three networks in canonical space to obtain the ability to generate complete and realistic head avatars. To achieve flexible animation, we define the deformation filed by standard linear blend skinning (LBS), with the learned continuous pose and expression bases and LBS weights. This allows the avatars to be directly animated by FLAME parameters and generalize well to unseen poses and expressions. Compared to state-of-the-art (SOTA) methods, GANHead achieves superior performance on head avatar generation and raw scan fitting.

CVJul 1, 2022
ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022

Naiyuan Liu, Xiaohan Wang, Xiaobo Li et al.

In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D Natural Language Queries (NLQ) Challenge in CVPR 2022. Given a video clip and a text query, the goal of this challenge is to locate a temporal moment of the video clip where the answer to the query can be obtained. To tackle this task, we propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips. Besides, we propose two data augmentation strategies to increase the diversity of training samples. The experimental results demonstrate the effectiveness of our method. The final submission ranked first on the leaderboard.

CVAug 23, 2022
Quality-Constant Per-Shot Encoding by Two-Pass Learning-based Rate Factor Prediction

Chunlei Cai, Yi Wang, Xiaobo Li et al.

Providing quality-constant streams can simultaneously guarantee user experience and prevent wasting bit-rate. In this paper, we propose a novel deep learning based two-pass encoder parameter prediction framework to decide rate factor (RF), with which encoder can output streams with constant quality. For each one-shot segment in a video, the proposed method firstly extracts spatial, temporal and pre-coding features by an ultra fast pre-process. Based on these features, a RF parameter is predicted by a deep neural network. Video encoder uses the RF to compress segment as the first encoding pass. Then VMAF quality of the first pass encoding is measured. If the quality doesn't meet target, a second pass RF prediction and encoding will be performed. With the help of first pass predicted RF and corresponding actual quality as feedback, the second pass prediction will be highly accurate. Experiments show the proposed method requires only 1.55 times encoding complexity on average, meanwhile the accuracy, that the compressed video's actual VMAF is within $\pm1$ around the target VMAF, reaches 98.88%.

CVDec 2, 2022
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Fangxun Shu, Biaolong Chen, Yue Liao et al.

We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC), for video-text retrieval tasks. Our MAC aims to reduce video representation's spatial and temporal redundancy in the VidLP model by a mask sampling mechanism to improve pre-training efficiency. Comparing conventional temporal sparse sampling, we propose to randomly mask a high ratio of spatial regions and only feed visible regions into the encoder as sparse spatial sampling. Similarly, we adopt the mask sampling technique for text inputs for consistency. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a masked-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment with masked modeling encourages the model to learn a robust and general multimodal representation from incomplete and unstable inputs. Coupling these designs enables efficient end-to-end pre-training: reduce FLOPs (60% off), accelerate pre-training (by 3x), and improve performance. Our MAC achieves state-of-the-art results on various video-text retrieval datasets, including MSR-VTT, DiDeMo, and ActivityNet. Our approach is omnivorous to input modalities. With minimal modifications, we achieve competitive results on image-text retrieval tasks.

CVSep 10, 2024Code
ALSS-YOLO: An Adaptive Lightweight Channel Split and Shuffling Network for TIR Wildlife Detection in UAV Imagery

Ang He, Xiaobo Li, Ximei Wu et al.

Unmanned aerial vehicles (UAVs) equipped with thermal infrared (TIR) cameras play a crucial role in combating nocturnal wildlife poaching. However, TIR images often face challenges such as jitter, and wildlife overlap, necessitating UAVs to possess the capability to identify blurred and overlapping small targets. Current traditional lightweight networks deployed on UAVs struggle to extract features from blurry small targets. To address this issue, we developed ALSS-YOLO, an efficient and lightweight detector optimized for TIR aerial images. Firstly, we propose a novel Adaptive Lightweight Channel Split and Shuffling (ALSS) module. This module employs an adaptive channel split strategy to optimize feature extraction and integrates a channel shuffling mechanism to enhance information exchange between channels. This improves the extraction of blurry features, crucial for handling jitter-induced blur and overlapping targets. Secondly, we developed a Lightweight Coordinate Attention (LCA) module that employs adaptive pooling and grouped convolution to integrate feature information across dimensions. This module ensures lightweight operation while maintaining high detection precision and robustness against jitter and target overlap. Additionally, we developed a single-channel focus module to aggregate the width and height information of each channel into four-dimensional channel fusion, which improves the feature representation efficiency of infrared images. Finally, we modify the localization loss function to emphasize the loss value associated with small objects to improve localization accuracy. Extensive experiments on the BIRDSAI and ISOD TIR UAV wildlife datasets show that ALSS-YOLO achieves state-of-the-art performance, Our code is openly available at https://github.com/helloworlder8/computer_vision.

CVSep 18, 2024Code
SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba

Xiangning Zhang, Qingwei Zhang, Jinnan Chen et al.

Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially developed for early gastric cancer treatment and has expanded to address diverse gastrointestinal lesions. While computer-assisted surgery (CAS) systems enhance ESD precision and safety, their efficacy hinges on accurate real-time surgical phase recognition, a task complicated by ESD's inherent complexity, including heterogeneous lesion characteristics and dynamic tissue interactions. Existing video-based phase recognition algorithms, constrained by inefficient temporal context modeling, exhibit limited performance in capturing fine-grained phase transitions and long-range dependencies. To overcome these limitations, we propose SPRMamba, a novel framework integrating a Mamba-based architecture with a Scaled Residual TranMamba (SRTM) block to synergize long-term temporal modeling and localized detail extraction. SPRMamba further introduces the Hierarchical Sampling Strategy to optimize computational efficiency, enabling real-time processing critical for clinical deployment. Evaluated on the ESD385 dataset and the cholecystectomy benchmark Cholec80, SPRMamba achieves state-of-the-art performance (87.64% accuracy on ESD385, +1.0% over prior methods), demonstrating robust generalizability across surgical workflows. This advancement bridges the gap between computational efficiency and temporal sensitivity, offering a transformative tool for intraoperative guidance and skill assessment in ESD surgery. The code is accessible at https://github.com/Zxnyyyyy/SPRMamba.

IVJun 20, 2023
CAMP-Net: Consistency-Aware Multi-Prior Network for Accelerated MRI Reconstruction

Liping Zhang, Xiaobo Li, Weitian Chen

Undersampling k-space data in MRI reduces scan time but pose challenges in image reconstruction. Considerable progress has been made in reconstructing accelerated MRI. However, restoration of high-frequency image details in highly undersampled data remains challenging. To address this issue, we propose CAMP-Net, an unrolling-based Consistency-Aware Multi-Prior Network for accelerated MRI reconstruction. CAMP-Net leverages complementary multi-prior knowledge and multi-slice information from various domains to enhance reconstruction quality. Specifically, CAMP-Net comprises three interleaved modules for image enhancement, k-space restoration, and calibration consistency, respectively. These modules jointly learn priors from data in image domain, k-domain, and calibration region, respectively, in data-driven manner during each unrolled iteration. Notably, the encoded calibration prior knowledge extracted from auto-calibrating signals implicitly guides the learning of consistency-aware k-space correlation for reliable interpolation of missing k-space data. To maximize the benefits of image domain and k-domain prior knowledge, the reconstructions are aggregated in a frequency fusion module, exploiting their complementary properties to optimize the trade-off between artifact removal and fine detail preservation. Additionally, we incorporate a surface data fidelity layer during the learning of k-domain and calibration domain priors to prevent degradation of the reconstruction caused by padding-induced data imperfections. We evaluate the generalizability and robustness of our method on three large public datasets with varying acceleration factors and sampling patterns. The experimental results demonstrate that our method outperforms state-of-the-art approaches in terms of both reconstruction quality and $T_2$ mapping estimation, particularly in scenarios with high acceleration factors.

34.9CVApr 18
IMA-MoE: An Interpretable Modality-Aware Mixture-of-Experts Framework for Characterizing the Neurobiological Signatures of Binge Eating Disorder

Lin Zhao, Qiaohui Gao, Elizabeth Martin et al.

Binge eating disorder (BED) is the most prevalent eating disorder. However, current diagnostic frameworks remain largely grounded in symptom-based criteria rather than underlying biological mechanisms, thereby limiting early detection and the development of biologically-informed interventions. Emerging studies have begun to investigate the neurobiological signatures of BED, yet their findings are often difficult to generalize due to the reliance on hypothesis-driven parametric models, single-modality analyses, and limited data diversity. Therefore, there is a critical need for advanced data-driven frameworks capable of modeling multimodal data to uncover generalizable and biologically meaningful signatures of BED. In this study, we propose the Interpretable Modality-Aware Mixture-of-Experts (IMA-MoE), a novel architecture designed to integrate heterogeneous neuroimaging, behavioral, hormonal, and demographic measures within a unified predictive framework. By encoding each measure as a distinct token, IMA-MoE enables flexible modeling of cross-modal dependencies while preserving modality-specific characteristics. We further introduce a token-importance mechanism to enhance interpretability by quantifying the contribution of each measure to model predictions. Evaluated on the large-scale Adolescent Brain Cognitive Development (ABCD) dataset, IMA-MoE demonstrates superior performance in differentiating BED from healthy controls compared with baseline methods, while revealing sex-specific predictive patterns, with hormonal measures contributing more prominently to prediction in females. Collectively, these findings highlight the promise of interpretable, data-driven multimodal modeling in advancing biologically-informed characterization of BED and facilitating more precise and personalized interventions in neuropsychiatric disorders.

89.6AIMay 26
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax, Aili Chen, Aonian Li et al.

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.

CVMay 24, 2022
Context Attention Network for Skeleton Extraction

Zixuan Huang, Yunfeng Wang, Zhiwen Chen et al.

Skeleton extraction is a task focused on providing a simple representation of an object by extracting the skeleton from the given binary or RGB image. In recent years many attractive works in skeleton extraction have been made. But as far as we know, there is little research on how to utilize the context information in the binary shape of objects. In this paper, we propose an attention-based model called Context Attention Network (CANet), which integrates the context extraction module in a UNet architecture and can effectively improve the ability of network to extract the skeleton pixels. Meanwhile, we also use some novel techniques including distance transform, weight focal loss to achieve good results on the given dataset. Finally, without model ensemble and with only 80% of the training images, our method achieves 0.822 F1 score during the development phase and 0.8507 F1 score during the final phase of the Pixel SkelNetOn Competition, ranking 1st place on the leaderboard.

CLJun 16, 2025Code
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax, Aili Chen, Aonian Li et al.

We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.

CVSep 23, 2024
HOTVCOM: Generating Buzzworthy Comments for Videos

Yuyan Chen, Yiwen Qian, Songzhou Yan et al.

In the era of social media video platforms, popular ``hot-comments'' play a crucial role in attracting user impressions of short-form videos, making them vital for marketing and branding purpose. However, existing research predominantly focuses on generating descriptive comments or ``danmaku'' in English, offering immediate reactions to specific video moments. Addressing this gap, our study introduces \textsc{HotVCom}, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments. We also present the \texttt{ComHeat} framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments on the Chinese video dataset. Empirical evaluations highlight the effectiveness of our framework, demonstrating its excellence on both the newly constructed and existing datasets.

12.2CVApr 18
Conditional Evidence Reconstruction and Decomposition for Interpretable Multimodal Diagnosis

Shaowen Wan, Yanjun Lv, Lu Zhang et al.

Neurobiological and neurodegenerative diseases are inherently multifactorial, arising from coupled influences spanning genetic susceptibility, brain alterations, and environmental and behavioral factors. Multimodal modeling has therefore been increasingly adopted for disease diagnosis by integrating complementary evidence across data sources. However, in both large-scale cohorts and real-world clinical workflows, modality coverage is often incomplete, making many multimodal models brittle when one or more modalities are unavailable. Existing approaches to incomplete multimodal diagnosis typically rely on group-wise or static priors, which may fail to capture subject-specific cross-modal dependencies; moreover, many models provide limited interpretability into which evidence sources drive the final decision. To address these limitations, we propose Conditional Evidence Reconstruction and Decomposition (CERD), a framework for interpretable multimodal diagnosis with incomplete modalities. CERD first reconstructs missing modality representations conditioned on each subject's observed inputs, then decomposes diagnostic evidence into shared cross-modal corroboration and modality-specific cues via logit-level attribution. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate that CERD outperforms competitive baselines under incomplete-modality settings while producing structured and clinically aligned evidence attributions for trustworthy decision support.

CVAug 11, 2021Code
Mining the Benefits of Two-stage and One-stage HOI Detection

Aixi Zhang, Yue Liao, Si Liu et al.

Two-stage methods have dominated Human-Object Interaction (HOI) detection for several years. Recently, one-stage HOI detection methods have become popular. In this paper, we aim to explore the essential pros and cons of two-stage and one-stage methods. With this as the goal, we find that conventional two-stage methods mainly suffer from positioning positive interactive human-object pairs, while one-stage methods are challenging to make an appropriate trade-off on multi-task learning, i.e., object detection, and interaction classification. Therefore, a core problem is how to take the essence and discard the dregs from the conventional two types of methods. To this end, we propose a novel one-stage framework with disentangling human-object detection and interaction classification in a cascade manner. In detail, we first design a human-object pair generator based on a state-of-the-art one-stage HOI detector by removing the interaction classification module or head and then design a relatively isolated interaction classifier to classify each human-object pair. Two cascade decoders in our proposed framework can focus on one specific task, detection or interaction classification. In terms of the specific implementation, we adopt a transformer-based HOI detector as our base model. The newly introduced disentangling paradigm outperforms existing methods by a large margin, with a significant relative mAP gain of 9.32% on HICO-Det. The source codes are available at https://github.com/YueLiao/CDN.

28.4CVMay 8
LoHGNet: Infrared Small Target Detection through Lorentz Geometric Encoding with High-Order Relation Learning

Qianwen Ma, Yang Xu, Shangwei Deng et al.

Infrared small target detection (IRSTD) remains challenging due to the scarcity of useful target cues and the presence of severe background clutter. Most current methods rely on conventional feature learning and local interaction modeling, where features are represented in Euclidean space. However, such designs may still be limited in describing the subtle differences of weak targets and the contextual relations between targets and backgrounds. To address these limitations, we propose LoHGNet, an IRSTD network that integrates Lorentz geometric encoding with high-order relation learning. By introducing Lorentz manifold based feature learning, LoHGNet offers a different feature representation from conventional IRSTD methods and provides new discriminative cues for IRSTD. Specifically, a Lorentz encoding branch is constructed with the Geometric Attention Guided Lorentz Residual Convolution Module (GA-LRCM) to perform feature modeling under hyperbolic geometric constraints and enhance the hierarchical geometric representation capability of weak targets. Subsequently, the hyperbolic features are mapped into the Euclidean tangent space through logarithmic mapping, and a High-Order Relation Learning Module (HORL) is designed to model the high-order contextual dependencies between targets and backgrounds via hypergraph construction, thereby improving target discrimination in complex backgrounds. Experimental results on three datasets demonstrate that the proposed LoHGNet achieves competitive performance in both detection accuracy and adaptability to complex scenes. The code will be available at https://github.com/Kingwin97.

IMApr 2, 2024
CSST Strong Lensing Preparation: a Framework for Detecting Strong Lenses in the Multi-color Imaging Survey by the China Survey Space Telescope (CSST)

Xu Li, Ruiqi Sun, Jiameng Lv et al.

Strong gravitational lensing is a powerful tool for investigating dark matter and dark energy properties. With the advent of large-scale sky surveys, we can discover strong lensing systems on an unprecedented scale, which requires efficient tools to extract them from billions of astronomical objects. The existing mainstream lens-finding tools are based on machine learning algorithms and applied to cut-out-centered galaxies. However, according to the design and survey strategy of optical surveys by CSST, preparing cutouts with multiple bands requires considerable efforts. To overcome these challenges, we have developed a framework based on a hierarchical visual Transformer with a sliding window technique to search for strong lensing systems within entire images. Moreover, given that multi-color images of strong lensing systems can provide insights into their physical characteristics, our framework is specifically crafted to identify strong lensing systems in images with any number of channels. As evaluated using CSST mock data based on an Semi-Analytic Model named CosmoDC2, our framework achieves precision and recall rates of 0.98 and 0.90, respectively. To evaluate the effectiveness of our method in real observations, we have applied it to a subset of images from the DESI Legacy Imaging Surveys and media images from Euclid Early Release Observations. 61 new strong lensing system candidates are discovered by our method. However, we also identified false positives arising primarily from the simplified galaxy morphology assumptions within the simulation. This underscores the practical limitations of our approach while simultaneously highlighting potential avenues for future improvements.

CVSep 22, 2025
Development and validation of an AI foundation model for endoscopic diagnosis of esophagogastric junction adenocarcinoma: a cohort and deep learning study

Yikun Ma, Bo Li, Ying Chen et al.

The early detection of esophagogastric junction adenocarcinoma (EGJA) is crucial for improving patient prognosis, yet its current diagnosis is highly operator-dependent. This paper aims to make the first attempt to develop an artificial intelligence (AI) foundation model-based method for both screening and staging diagnosis of EGJA using endoscopic images. In this cohort and learning study, we conducted a multicentre study across seven Chinese hospitals between December 28, 2016 and December 30, 2024. It comprises 12,302 images from 1,546 patients; 8,249 of them were employed for model training, while the remaining were divided into the held-out (112 patients, 914 images), external (230 patients, 1,539 images), and prospective (198 patients, 1,600 images) test sets for evaluation. The proposed model employs DINOv2 (a vision foundation model) and ResNet50 (a convolutional neural network) to extract features of global appearance and local details of endoscopic images for EGJA staging diagnosis. Our model demonstrates satisfactory performance for EGJA staging diagnosis across three test sets, achieving an accuracy of 0.9256, 0.8895, and 0.8956, respectively. In contrast, among representative AI models, the best one (ResNet50) achieves an accuracy of 0.9125, 0.8382, and 0.8519 on the three test sets, respectively; the expert endoscopists achieve an accuracy of 0.8147 on the held-out test set. Moreover, with the assistance of our model, the overall accuracy for the trainee, competent, and expert endoscopists improves from 0.7035, 0.7350, and 0.8147 to 0.8497, 0.8521, and 0.8696, respectively. To our knowledge, our model is the first application of foundation models for EGJA staging diagnosis and demonstrates great potential in both diagnostic accuracy and efficiency.

CVJul 4, 2025
CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection

Xiangning Zhang, Jinnan Chen, Qingwei Zhang et al.

Gastrointestinal malignancies constitute a leading cause of cancer-related mortality worldwide, with advanced-stage prognosis remaining particularly dismal. Originating as a groundbreaking technique for early gastric cancer treatment, Endoscopic Submucosal Dissection has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that reimagines phase recognition through denoising diffusion principles while preserving the core iterative refinement philosophy. This architecture progressively reconstructs phase sequences starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, including positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into the model training to improve its ability to correct phase logical errors. Comprehensive evaluations on ESD820, Cholec80, and external multi-center demonstrate that our proposed CPKD achieves superior or comparable performance to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.

CVJun 26, 2025
Video Virtual Try-on with Conditional Diffusion Transformer Inpainter

Cheng Zou, Senlin Cheng, Bolei Xu et al.

Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task, on the one hand, the output video should be in good spatial-temporal consistency, on the other hand, the details of the given garment need to be preserved well in all the frames. Naively using image-based try-on methods frame by frame can get poor results due to severe inconsistency. Recent diffusion-based video try-on methods, though very few, happen to coincide with a similar solution: inserting temporal attention into image-based try-on model to adapt it for video try-on task, which have shown improvements but there still exist inconsistency problems. In this paper, we propose ViTI (Video Try-on Inpainter), formulate and implement video virtual try-on as a conditional video inpainting task, which is different from previous methods. In this way, we start with a video generation problem instead of an image-based try-on problem, which from the beginning has a better spatial-temporal consistency. Specifically, at first we build a video inpainting framework based on Diffusion Transformer with full 3D spatial-temporal attention, and then we progressively adapt it for video garment inpainting, with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt with good spatial-temporal consistency. Finally, as other try-on methods, garment condition is added to the model to make sure the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.

CLJan 8, 2022
A Unified Review of Deep Learning for Automated Medical Coding

Shaoxiong Ji, Wei Sun, Xiaobo Li et al.

Automated medical coding, an essential task for healthcare operation and delivery, makes unstructured data manageable by predicting medical codes from clinical documents. Recent advances in deep learning and natural language processing have been widely applied to this task. However, deep learning-based medical coding lacks a unified view of the design of neural network architectures. This review proposes a unified framework to provide a general understanding of the building blocks of medical coding models and summarizes recent advanced models under the proposed framework. Our unified framework decomposes medical coding into four main components, i.e., encoder modules for text feature extraction, mechanisms for building deep encoder architectures, decoder modules for transforming hidden representations into medical codes, and the usage of auxiliary information. Finally, we introduce the benchmarks and real-world usage and discuss key research challenges and future directions.

CVDec 14, 2021
Weakly Supervised High-Fidelity Clothing Model Generation

Ruili Feng, Cheng Ma, Chengji Shen et al.

The development of online economics arouses the demand of generating images of models on product clothes, to display new clothes and promote sales. However, the expensive proprietary model images challenge the existing image virtual try-on methods in this scenario, as most of them need to be trained on considerable amounts of model images accompanied with paired clothes images. In this paper, we propose a cheap yet scalable weakly-supervised method called Deep Generative Projection (DGP) to address this specific scenario. Lying in the heart of the proposed method is to imitate the process of human predicting the wearing effect, which is an unsupervised imagination based on life experience rather than computation rules learned from supervisions. Here a pretrained StyleGAN is used to capture the practical experience of wearing. Experiments show that projecting the rough alignment of clothing and body onto the StyleGAN space can yield photo-realistic wearing results. Experiments on real scene proprietary model images demonstrate the superiority of DGP over several state-of-the-art supervised methods when generating clothing model images.

CVJun 18, 2021
Multi-Granularity Network with Modal Attention for Dense Affective Understanding

Baoming Yan, Lin Wang, Ke Gao et al.

Video affective understanding, which aims to predict the evoked expressions by the video content, is desired for video creation and recommendation. In the recent EEV challenge, a dense affective understanding task is proposed and requires frame-level affective prediction. In this paper, we propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features for better description of the target frame. Specifically, the multi-granularity features could be divided into frame-level, clips-level and video-level features, which corresponds to visual-salient content, semantic-context and video theme information. Then the modal attention fusion module is designed to fuse the multi-granularity features and emphasize more affection-relevant modals. Finally, the fused feature is fed into a Mixtures Of Experts (MOE) classifier to predict the expressions. Further employing model-ensemble post-processing, the proposed method achieves the correlation score of 0.02292 in the EEV challenge.

CVApr 5, 2021
A Dual-Critic Reinforcement Learning Framework for Frame-level Bit Allocation in HEVC/H.265

Yung-Han Ho, Guo-Lun Jin, Yun Liang et al.

This paper introduces a dual-critic reinforcement learning (RL) framework to address the problem of frame-level bit allocation in HEVC/H.265. The objective is to minimize the distortion of a group of pictures (GOP) under a rate constraint. Previous RL-based methods tackle such a constrained optimization problem by maximizing a single reward function that often combines a distortion and a rate reward. However, the way how these rewards are combined is usually ad hoc and may not generalize well to various coding conditions and video sequences. To overcome this issue, we adapt the deep deterministic policy gradient (DDPG) reinforcement learning algorithm for use with two critics, with one learning to predict the distortion reward and the other the rate reward. In particular, the distortion critic works to update the agent when the rate constraint is satisfied. By contrast, the rate critic makes the rate constraint a priority when the agent goes over the bit budget. Experimental results on commonly used datasets show that our method outperforms the bit allocation scheme in x265 and the single-critic baseline by a significant margin in terms of rate-distortion performance while offering fairly precise rate control.

CVSep 25, 2020
AIM 2020 Challenge on Real Image Super-Resolution: Methods and Results

Pengxu Wei, Hannan Lu, Radu Timofte et al.

This paper introduces the real image Super-Resolution (SR) challenge that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2020. This challenge involves three tracks to super-resolve an input image for $\times$2, $\times$3 and $\times$4 scaling factors, respectively. The goal is to attract more attention to realistic image degradation for the SR task, which is much more complicated and challenging, and contributes to real-world image super-resolution applications. 452 participants were registered for three tracks in total, and 24 teams submitted their results. They gauge the state-of-the-art approaches for real image SR in terms of PSNR and SSIM.

CVAug 28, 2020
Pixel-Face: A Large-Scale, High-Resolution Benchmark for 3D Face Reconstruction

Jiangjing Lyu, Xiaobo Li, Xiangyu Zhu et al.

3D face reconstruction is a fundamental task that can facilitate numerous applications such as robust facial analysis and augmented reality. It is also a challenging task due to the lack of high-quality datasets that can fuel current deep learning-based methods. However, existing datasets are limited in quantity, realisticity and diversity. To circumvent these hurdles, we introduce Pixel-Face, a large-scale, high-resolution and diverse 3D face dataset with massive annotations. Specifically, Pixel-Face contains 855 subjects aging from 18 to 80. Each subject has more than 20 samples with various expressions. Each sample is composed of high-resolution multi-view RGB images and 3D meshes with various expressions. Moreover, we collect precise landmarks annotation and 3D registration result for each data. To demonstrate the advantages of Pixel-Face, we re-parameterize the 3D Morphable Model (3DMM) into Pixel-3DM using the collected data. We show that the obtained Pixel-3DM is better in modeling a wide range of face shapes and expressions. We also carefully benchmark existing 3D face reconstruction methods on our dataset. Moreover, Pixel-Face serves as an effective training source. We observe that the performance of current face reconstruction models significantly improves both on existing benchmarks and Pixel-Face after being fine-tuned using our newly collected data. Extensive experiments demonstrate the effectiveness of Pixel-3DM and the usefulness of Pixel-Face.

CVJul 15, 2020
Augmented Bi-path Network for Few-shot Learning

Baoming Yan, Chen Zhou, Bo Zhao et al.

Few-shot Learning (FSL) which aims to learn from few labeled training data is becoming a popular research topic, due to the expensive labeling cost in many real-world applications. One kind of successful FSL method learns to compare the testing (query) image and training (support) image by simply concatenating the features of two images and feeding it into the neural network. However, with few labeled data in each class, the neural network has difficulty in learning or comparing the local features of two images. Such simple image-level comparison may cause serious mis-classification. To solve this problem, we propose Augmented Bi-path Network (ABNet) for learning to compare both global and local features on multi-scales. Specifically, the salient patches are extracted and embedded as the local features for every image. Then, the model learns to augment the features for better robustness. Finally, the model learns to compare global and local features separately, i.e., in two paths, before merging the similarities. Extensive experiments show that the proposed ABNet outperforms the state-of-the-art methods. Both quantitative and visual ablation studies are provided to verify that the proposed modules lead to more precise comparison results.

CRMay 6, 2013
How to Collaborate between Threshold Schemes

Daoshun Wang, Ziwei Ye, Xiaobo Li

Threshold schemes have been used to protect secrets by distributing shares to participants. To protect two secrets, we can use two separate traditional schemes, say, a (t1, n1) scheme and a (t2, n2) scheme. If there are u (<=min(t1, t2)) participants involved in both schemes, each of these u participants must keep two different shares. This paper proposes a method that allows each common participant to keep only one share. Our method constructs two polynomials with u common crossover points. We give theoretical details and two demonstrative examples. This algorithm can also handle the collaboration between more than two schemes.