Jiahao Xia

CV
h-index13
17papers
162citations
Novelty51%
AI Score59

17 Papers

CVMar 13, 2022Code
Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning

Jiahao Xia, Weiwei qu, Wenjian Huang et al.

Heatmap regression methods have dominated face alignment area in recent years while they ignore the inherent relation between different landmarks. In this paper, we propose a Sparse Local Patch Transformer (SLPT) for learning the inherent relation. The SLPT generates the representation of each single landmark from a local patch and aggregates them by an adaptive inherent relation based on the attention mechanism. The subpixel coordinate of each landmark is predicted independently based on the aggregated feature. Moreover, a coarse-to-fine framework is further introduced to incorporate with the SLPT, which enables the initial landmarks to gradually converge to the target facial landmarks using fine-grained features from dynamically resized local patches. Extensive experiments carried out on three popular benchmarks, including WFLW, 300W and COFW, demonstrate that the proposed method works at the state-of-the-art level with much less computational complexity by learning the inherent relation between facial landmarks. The code is available at the project website.

59.1CVJun 1
PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images

Yuan Zhang, Jiahao Xia, Junzhang Huang et al.

Data scarcity in multimodal pathology motivates unified generative models that synthesize modality-specific appearance while preserving anatomically coherent structure. Although modalities differ in appearance statistics, morphological structures such as cellular topology and tissue boundaries are largely preserved across acquisition protocols. However, existing methods often model these factors within a homogeneous token stream, implicitly coupling structure with appearance and weakening structural controllability under modality shifts. To address this, we propose pathology Autorgressive modeling (PathAR), a structure-first autoregressive synthesis framework that explicitly factorizes structure and appearance for modality-label-conditioned pathology generation.PathAR employs a dual vector quantization (Dual-VQ) tokenizer to decompose samples into mask-grounded structure and appearance tokens, and an interleaved autoregressive (IAR) transformer with asymmetric attention visibility to enforce structure-to-appearance dependence. PathAR stabilizes morphology under heterogeneous modality-specific appearances and enables spatially aligned image--mask pair generation. Extensive experiments show that PathAR improves structural consistency and modality fidelity over baselines, maintains sample diversity, supports downstream segmentation in data-scarce regimes, and demonstrates extensibility to finer-grained intra-modality organ-label variation.

CVJul 17, 2024Code
Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection

Hu Cao, Zehua Zhang, Yan Xia et al.

In frame-based vision, object detection faces substantial performance degradation under challenging conditions due to the limited sensing capability of conventional cameras. Event cameras output sparse and asynchronous events, providing a potential solution to solve these problems. However, effectively fusing two heterogeneous modalities remains an open issue. In this work, we propose a novel hierarchical feature refinement network for event-frame fusion. The core concept is the design of the coarse-to-fine fusion module, denoted as the cross-modality adaptive feature refinement (CAFR) module. In the initial phase, the bidirectional cross-modality interaction (BCI) part facilitates information bridging from two distinct sources. Subsequently, the features are further refined by aligning the channel-level mean and variance in the two-fold adaptive feature refinement (TAFR) part. We conducted extensive experiments on two benchmarks: the low-resolution PKU-DDD17-Car dataset and the high-resolution DSEC dataset. Experimental results show that our method surpasses the state-of-the-art by an impressive margin of $\textbf{8.0}\%$ on the DSEC dataset. Besides, our method exhibits significantly better robustness (\textbf{69.5}\% versus \textbf{38.7}\%) when introducing 15 different corruption types to the frame images. The code can be found at the link (https://github.com/HuCaoFighting/FRN).

CVOct 8, 2023Code
Enhancing Cross-Dataset Performance of Distracted Driving Detection With Score Softmax Classifier And Dynamic Gaussian Smoothing Supervision

Cong Duan, Zixuan Liu, Jiahao Xia et al.

Deep neural networks enable real-time monitoring of in-vehicle drivers, facilitating the timely prediction of distractions, fatigue, and potential hazards. This technology is now integral to intelligent transportation systems. Recent research has exposed unreliable cross-dataset driver behavior recognition due to a limited number of data samples and background noise. In this paper, we propose a Score-Softmax classifier, which reduces the model overconfidence by enhancing category independence. Imitating the human scoring process, we designed a two-dimensional dynamic supervisory matrix consisting of one-dimensional Gaussian-smoothed labels. The dynamic loss descent direction and Gaussian smoothing increase the uncertainty of training to prevent the model from falling into noise traps. Furthermore, we introduce a simple and convenient multi-channel information fusion method;it addresses the fusion issue among arbitrary Score-Softmax classification heads. We conducted cross-dataset experiments using the SFDDD, AUCDD, and the 100-Driver datasets, demonstrating that Score-Softmax improves cross-dataset performance without modifying the model architecture. The experiments indicate that the Score-Softmax classifier reduces the interference of background noise, enhancing the robustness of the model. It increases the cross-dataset accuracy by 21.34%, 11.89%, and 18.77% on the three datasets, respectively. The code is publicly available at https://github.com/congduan-HNU/SSoftmax.

43.4CVMar 20Code
Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement

Chunlei Zhang, Jiahao Xia, Yun Xiao et al.

Multimodal image registration is a fundamental task and a prerequisite for downstream cross-modal analysis. Despite recent progress in shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, allowing modality-private cues to leak into the shared space. Second, most multi-scale frameworks support only a single transformation type, limiting their applicability when global misalignment and local deformation coexist. To address these issues, we formulate hybrid multimodal registration as jointly learning a stable shared feature space and a unified hybrid transformation. Based on this view, we propose HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) extracts multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues and projects shared features into a stable subspace for matching. Built on this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative coarse-to-fine estimation of global rigid parameters and deformation fields, which are fused into a coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on rigid and non-rigid registration tasks. The code is available at the project website.

CVAug 15, 2024
Unsupervised Part Discovery via Dual Representation Alignment

Jiahao Xia, Wenjian Huang, Min Xu et al.

Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformer can learn instance-level attention without labels, extracting high-quality instance-level representations for boosting downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, the geometric and semantic constraints are applied to the part representations through the intermediate results in alignment for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention.

CVApr 2, 2023
Multimodal Hyperspectral Image Classification via Interconnected Fusion

Lu Huo, Jiahao Xia, Leijie Zhang et al.

Existing multiple modality fusion methods, such as concatenation, summation, and encoder-decoder-based fusion, have recently been employed to combine modality characteristics of Hyperspectral Image (HSI) and Light Detection And Ranging (LiDAR). However, these methods consider the relationship of HSI-LiDAR signals from limited perspectives. More specifically, they overlook the contextual information across modalities of HSI and LiDAR and the intra-modality characteristics of LiDAR. In this paper, we provide a new insight into feature fusion to explore the relationships across HSI and LiDAR modalities comprehensively. An Interconnected Fusion (IF) framework is proposed. Firstly, the center patch of the HSI input is extracted and replicated to the size of the HSI input. Then, nine different perspectives in the fusion matrix are generated by calculating self-attention and cross-attention among the replicated center patch, HSI input, and corresponding LiDAR input. In this way, the intra- and inter-modality characteristics can be fully exploited, and contextual information is considered in both intra-modality and inter-modality manner. These nine interrelated elements in the fusion matrix can complement each other and eliminate biases, which can generate a multi-modality representation for classification accurately. Extensive experiments have been conducted on three widely used datasets: Trento, MUUFL, and Houston. The IF framework achieves state-of-the-art results on these datasets compared to existing approaches.

LGJul 3, 2024
Differential Encoding for Improved Representation Learning over Graphs

Haimin Zhang, Jiahao Xia, Min Xu

Combining the message-passing paradigm with the global attention mechanism has emerged as an effective framework for learning over graphs. The message-passing paradigm and the global attention mechanism fundamentally generate node embeddings based on information aggregated from a node's local neighborhood or from the whole graph. The most basic and commonly used aggregation approach is to take the sum of information from a node's local neighbourhood or from the whole graph. However, it is unknown if the dominant information is from a node itself or from the node's neighbours (or the rest of the graph nodes). Therefore, there exists information lost at each layer of embedding generation, and this information lost could be accumulated and become more serious when more layers are used in the model. In this paper, we present a differential encoding method to address the issue of information lost. The idea of our method is to encode the differential representation between the information from a node's neighbours (or the rest of the graph nodes) and that from the node itself. The obtained differential encoding is then combined with the original aggregated local or global representation to generate the updated node embedding. By integrating differential encodings, the representational ability of generated node embeddings is improved. The differential encoding method is empirically evaluated on different graph tasks on seven benchmark datasets. The results show that it is a general method that improves the message-passing update and the global attention update, advancing the state-of-the-art performance for graph representation learning on these datasets.

IVJul 29, 2025Code
Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images

Yutao Hu, Ying Zheng, Shumei Miao et al.

Foundation models have demonstrated remarkable potential in medical domain. However, their application to complex cardiovascular diagnostics remains underexplored. In this paper, we present Cardiac-CLIP, a multi-modal foundation model designed for 3D cardiac CT images. Cardiac-CLIP is developed through a two-stage pre-training strategy. The first stage employs a 3D masked autoencoder (MAE) to perform self-supervised representation learning from large-scale unlabeled volumetric data, enabling the visual encoder to capture rich anatomical and contextual features. In the second stage, contrastive learning is introduced to align visual and textual representations, facilitating cross-modal understanding. To support the pre-training, we collect 16641 real clinical CT scans, supplemented by 114k publicly available data. Meanwhile, we standardize free-text radiology reports into unified templates and construct the pathology vectors according to diagnostic attributes, based on which the soft-label matrix is generated to supervise the contrastive learning process. On the other hand, to comprehensively evaluate the effectiveness of Cardiac-CLIP, we collect 6,722 real-clinical data from 12 independent institutions, along with the open-source data to construct the evaluation dataset. Specifically, Cardiac-CLIP is comprehensively evaluated across multiple tasks, including cardiovascular abnormality classification, information retrieval and clinical analysis. Experimental results demonstrate that Cardiac-CLIP achieves state-of-the-art performance across various downstream tasks in both internal and external data. Particularly, Cardiac-CLIP exhibits great effectiveness in supporting complex clinical tasks such as the prospective prediction of acute coronary syndrome, which is notoriously difficult in real-world scenarios.

CVJul 16, 2025Code
Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints

Jiahao Xia, Yike Wu, Wenjian Huang et al.

Part-level features are crucial for image understanding, but few studies focus on them because of the lack of fine-grained labels. Although unsupervised part discovery can eliminate the reliance on labels, most of them cannot maintain robustness across various categories and scenarios, which restricts their application range. To overcome this limitation, we present a more effective paradigm for unsupervised part discovery, named Masked Part Autoencoder (MPAE). It first learns part descriptors as well as a feature map from the inputs and produces patch features from a masked version of the original images. Then, the masked regions are filled with the learned part descriptors based on the similarity between the local features and descriptors. By restoring these masked patches using the part descriptors, they become better aligned with their part shapes, guided by appearance features from unmasked patches. Finally, MPAE robustly discovers meaningful parts that closely match the actual object shapes, even in complex scenarios. Moreover, several looser yet more effective constraints are proposed to enable MPAE to identify the presence of parts across various scenarios and categories in an unsupervised manner. This provides the foundation for addressing challenges posed by occlusion and for exploring part similarity across multiple categories. Extensive experiments demonstrate that our method robustly discovers meaningful parts across various categories and scenarios. The code is available at the project https://github.com/Jiahao-UTS/MPAE.

CVMar 28, 2025Code
Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment

Jiahao Xia, Min Xu, Wenjian Huang et al.

Despite the similar structures of human faces, existing face alignment methods cannot learn unified knowledge from multiple datasets with different landmark annotations. The limited training samples in a single dataset commonly result in fragile robustness in this field. To mitigate knowledge discrepancies among different datasets and train a task-agnostic unified face alignment (TUFA) framework, this paper presents a strategy to unify knowledge from multiple datasets. Specifically, we calculate a mean face shape for each dataset. To explicitly align these mean shapes on an interpretable plane based on their semantics, each shape is then incorporated with a group of semantic alignment embeddings. The 2D coordinates of these aligned shapes can be viewed as the anchors of the plane. By encoding them into structure prompts and further regressing the corresponding facial landmarks using image features, a mapping from the plane to the target faces is finally established, which unifies the learning target of different datasets. Consequently, multiple datasets can be utilized to boost the generalization ability of the model. The successful mitigation of discrepancies also enhances the efficiency of knowledge transferring to a novel dataset, significantly boosts the performance of few-shot face alignment. Additionally, the interpretable plane endows TUFA with a task-agnostic characteristic, enabling it to locate landmarks unseen during training in a zero-shot manner. Extensive experiments are carried on seven benchmarks and the results demonstrate an impressive improvement in face alignment brought by knowledge discrepancies mitigation. The code is available at https://github.com/Jiahao-UTS/TUFA.

82.1CVMar 26
Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

Yike Wu, Necva Bolucu, Stephen Wan et al.

Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models~(VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, Large Language Models~(LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose \textbf{SGREC}, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78\%), RefCOCO+ testB (53.43\%), and RefCOCOg val (73.28\%), highlighting its strong visual scene understanding.

CVJul 24, 2025
HumanMaterial: Human Material Estimation from a Single Image via Progressive Training

Yu Jiang, Jiahao Xia, Jiongming Qin et al.

Full-body Human inverse rendering based on physically-based rendering aims to acquire high-quality materials, which helps achieve photo-realistic rendering under arbitrary illuminations. This task requires estimating multiple material maps and usually relies on the constraint of rendering result. The absence of constraints on the material maps makes inverse rendering an ill-posed task. Previous works alleviated this problem by building material dataset for training, but their simplified material data and rendering equation lead to rendering results with limited realism, especially that of skin. To further alleviate this problem, we construct a higher-quality dataset (OpenHumanBRDF) based on scanned real data and statistical material data. In addition to the normal, diffuse albedo, roughness, specular albedo, we produce displacement and subsurface scattering to enhance the realism of rendering results, especially for the skin. With the increase in prediction tasks for more materials, using an end-to-end model as in the previous work struggles to balance the importance among various material maps, and leads to model underfitting. Therefore, we design a model (HumanMaterial) with progressive training strategy to make full use of the supervision information of the material maps and improve the performance of material estimation. HumanMaterial first obtain the initial material results via three prior models, and then refine the results by a finetuning model. Prior models estimate different material maps, and each map has different significance for rendering results. Thus, we design a Controlled PBR Rendering (CPR) loss, which enhances the importance of the materials to be optimized during the training of prior models. Extensive experiments on OpenHumanBRDF dataset and real data demonstrate that our method achieves state-of-the-art performance.

LGJun 22, 2025
h-calibration: Rethinking Classifier Recalibration with Probabilistic Error-Bounded Objective

Wenjian Huang, Guiping Cao, Jiahao Xia et al.

Deep neural networks have demonstrated remarkable performance across numerous learning tasks but often suffer from miscalibration, resulting in unreliable probability outputs. This has inspired many recent works on mitigating miscalibration, particularly through post-hoc recalibration methods that aim to obtain calibrated probabilities without sacrificing the classification performance of pre-trained models. In this study, we summarize and categorize previous works into three general strategies: intuitively designed methods, binning-based methods, and methods based on formulations of ideal calibration. Through theoretical and practical analysis, we highlight ten common limitations in previous approaches. To address these limitations, we propose a probabilistic learning framework for calibration called h-calibration, which theoretically constructs an equivalent learning formulation for canonical calibration with boundedness. On this basis, we design a simple yet effective post-hoc calibration algorithm. Our method not only overcomes the ten identified limitations but also achieves markedly better performance than traditional methods, as validated by extensive experiments. We further analyze, both theoretically and experimentally, the relationship and advantages of our learning objective compared to traditional proper scoring rule. In summary, our probabilistic framework derives an approximately equivalent differentiable objective for learning error-bounded calibrated probabilities, elucidating the correspondence and convergence properties of computational statistics with respect to theoretical bounds in canonical calibration. The theoretical effectiveness is verified on standard post-hoc calibration benchmarks by achieving state-of-the-art performance. This research offers valuable reference for learning reliable likelihood in related fields.

IVMar 12, 2025
FCaS: Fine-grained Cardiac Image Synthesis based on 3D Template Conditional Diffusion Model

Jiahao Xia, Yutao Hu, Yaolei Qi et al.

Solving medical imaging data scarcity through semantic image generation has attracted significant attention in recent years. However, existing methods primarily focus on generating whole-organ or large-tissue structures, showing limited effectiveness for organs with fine-grained structure. Due to stringent topological consistency, fragile coronary features, and complex 3D morphological heterogeneity in cardiac imaging, accurately reconstructing fine-grained anatomical details of the heart remains a great challenge. To address this problem, in this paper, we propose the Fine-grained Cardiac image Synthesis(FCaS) framework, established on 3D template conditional diffusion model. FCaS achieves precise cardiac structure generation using Template-guided Conditional Diffusion Model (TCDM) through bidirectional mechanisms, which provides the fine-grained topological structure information of target image through the guidance of template. Meanwhile, we design a deformable Mask Generation Module (MGM) to mitigate the scarcity of high-quality and diverse reference mask in the generation process. Furthermore, to alleviate the confusion caused by imprecise synthetic images, we propose a Confidence-aware Adaptive Learning (CAL) strategy to facilitate the pre-training of downstream segmentation tasks. Specifically, we introduce the Skip-Sampling Variance (SSV) estimation to obtain confidence maps, which are subsequently employed to rectify the pre-training on downstream tasks. Experimental results demonstrate that images generated from FCaS achieves state-of-the-art performance in topological consistency and visual quality, which significantly facilitates the downstream tasks as well. Code will be released in the future.

CVMar 13, 2021
An Efficient Multitask Neural Network for Face Alignment, Head Pose Estimation and Face Tracking

Jiahao Xia, Haimin Zhang, Shiping Wen et al.

While Convolutional Neural Networks (CNNs) have significantly boosted the performance of face related algorithms, maintaining accuracy and efficiency simultaneously in practical use remains challenging. The state-of-the-art methods employ deeper networks for better performance, which makes it less practical for mobile applications because of more parameters and higher computational complexity. Therefore, we propose an efficient multitask neural network, Alignment & Tracking & Pose Network (ATPN) for face alignment, face tracking and head pose estimation. Specifically, to achieve better performance with fewer layers for face alignment, we introduce a shortcut connection between shallow-layer and deep-layer features. We find the shallow-layer features are highly correspond to facial boundaries that can provide the structural information of face and it is crucial for face alignment. Moreover, we generate a cheap heatmap based on the face alignment result and fuse it with features to improve the performance of the other two tasks. Based on the heatmap, the network can utilize both geometric information of landmarks and appearance information for head pose estimation. The heatmap also provides attention clues for face tracking. The face tracking task also saves us the face detection procedure for each frame, which also significantly boost the real-time capability for video-based tasks. We experimentally validate ATPN on four benchmark datasets, WFLW, 300VW, WIDER Face and 300W-LP. The experimental results demonstrate that it achieves better performance with much less parameters and lower computational complexity compared to other light models.

CVMar 8, 2020
Single-View 3D Object Reconstruction from Shape Priors in Memory

Shuo Yang, Min Xu, Haozhe Xie et al.

Existing methods for single-view 3D object reconstruction directly learn to transform image features into 3D representations. However, these methods are vulnerable to images containing noisy backgrounds and heavy occlusions because the extracted image features do not contain enough information to reconstruct high-quality 3D shapes. Humans routinely use incomplete or noisy visual cues from an image to retrieve similar 3D shapes from their memory and reconstruct the 3D shape of an object. Inspired by this, we propose a novel method, named Mem3D, that explicitly constructs shape priors to supplement the missing information in the image. Specifically, the shape priors are in the forms of "image-voxel" pairs in the memory network, which is stored by a well-designed writing strategy during training. We also propose a voxel triplet loss function that helps to retrieve the precise 3D shapes that are highly related to the input image from shape priors. The LSTM-based shape encoder is introduced to extract information from the retrieved 3D shapes, which are useful in recovering the 3D shape of an object that is heavily occluded or in complex environments. Experimental results demonstrate that Mem3D significantly improves reconstruction quality and performs favorably against state-of-the-art methods on the ShapeNet and Pix3D datasets.