49.0CVMay 29
Variational Adapter for Cross-modal Similarity RepresentationWenZhang Wei, Zhipeng Gui, Dehua Peng et al.
The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching or multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs the generalization performance of cross-modal tasks. While prior research has attempted to mitigate this by modeling intra-modal ambiguity, it often overlooks inherent annotation flaws, leading to suboptimal uncertainty allocation. To address these challenges, we propose a Variational Adapter for Cross-modal Similarity Representation (VACSR). This approach reformulates image-text matching with fine-grained semantic scarcity as a variational inference problem. It constructs a latent space for cross-modal similarity and uses regularization techniques to mitigate overfitting to binary annotations. Experiments on image-text retrieval, domain generalization, and base-to-novel generalization demonstrate the proposed method's effectiveness and robust generalization ability.
CVSep 15, 2023
Dynamic Visual Semantic Sub-Embeddings and Fast Re-RankingWenzhang Wei, Zhipeng Gui, Changguang Wu et al.
The core of cross-modal matching is to accurately measure the similarity between different modalities in a unified representation space. However, compared to textual descriptions of a certain perspective, the visual modality has more semantic variations. So, images are usually associated with multiple textual captions in databases. Although popular symmetric embedding methods have explored numerous modal interaction approaches, they often learn toward increasing the average expression probability of multiple semantic variations within image embeddings. Consequently, information entropy in embeddings is increased, resulting in redundancy and decreased accuracy. In this work, we propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy. Specifically, we obtain a set of heterogeneous visual sub-embeddings through dynamic orthogonal constraint loss. To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution and employ a variance-aware weighting loss to assign different weights to the optimization process. In addition, we develop a Fast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and enhance the performance. We compare the performance with existing set-based method using four image feature encoders and two text feature encoders on three benchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role of different components by ablation studies and perform a sensitivity analysis of the hyperparameters. The qualitative analysis of visualized bidirectional retrieval and attention maps further demonstrates the ability of our method to encode semantic variations.
73.7CVMar 23Code
SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect DetectionShuxian Zhao, Jie Gui, Baosheng Yu et al.
Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on https://github.com/Zhaosxian/SteelDefectX.
SEOct 22, 2024Code
GeoCode-GPT: A Large Language Model for Geospatial Code Generation TasksShuyang Hou, Zhangxiao Shen, Anqi Zhao et al.
The increasing demand for spatiotemporal data and modeling tasks in geosciences has made geospatial code generation technology a critical factor in enhancing productivity. Although large language models (LLMs) have demonstrated potential in code generation tasks, they often encounter issues such as refusal to code or hallucination in geospatial code generation due to a lack of domain-specific knowledge and code corpora. To address these challenges, this paper presents and open-sources the GeoCode-PT and GeoCode-SFT corpora, along with the GeoCode-Eval evaluation dataset. Additionally, by leveraging QLoRA and LoRA for pretraining and fine-tuning, we introduce GeoCode-GPT-7B, the first LLM focused on geospatial code generation, fine-tuned from Code Llama-7B. Furthermore, we establish a comprehensive geospatial code evaluation framework, incorporating option matching, expert validation, and prompt engineering scoring for LLMs, and systematically evaluate GeoCode-GPT-7B using the GeoCode-Eval dataset. Experimental results show that GeoCode-GPT outperforms other models in multiple-choice accuracy by 9.1% to 32.1%, in code summarization ability by 1.7% to 25.4%, and in code generation capability by 1.2% to 25.1%. This paper provides a solution and empirical validation for enhancing LLMs' performance in geospatial code generation, extends the boundaries of domain-specific model applications, and offers valuable insights into unlocking their potential in geospatial code generation.
CVFeb 10
Toward Fine-Grained Facial Control in 3D Talking Head GenerationShaoyang Xie, Xiaofeng Cong, Baosheng Yu et al.
Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs by a pretrained model, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
LGDec 7, 2023Code
A Robust and Efficient Boundary Point Detection Method by Measuring Local Direction DispersionDehua Peng, Zhipeng Gui, Jie Gui et al.
Boundary point detection aims to outline the external contour structure of clusters and enhance the inter-cluster discrimination, thus bolstering the performance of the downstream classification and clustering tasks. However, existing boundary point detectors are sensitive to density heterogeneity or cannot identify boundary points in concave structures and high-dimensional manifolds. In this work, we propose a robust and efficient boundary point detection method based on Local Direction Dispersion (LoDD). The core of boundary point detection lies in measuring the difference between boundary points and internal points. It is a common observation that an internal point is surrounded by its neighbors in all directions, while the neighbors of a boundary point tend to be distributed only in a certain directional range. By considering this observation, we adopt density-independent K-Nearest Neighbors (KNN) method to determine neighboring points and design a centrality metric LoDD using the eigenvalues of the covariance matrix to depict the distribution uniformity of KNN. We also develop a grid-structure assumption of data distribution to determine the parameters adaptively. The effectiveness of LoDD is demonstrated on synthetic datasets, real-world benchmarks, and application of training set split for deep learning model and hole detection on point cloud data. The datasets and toolkit are available at: https://github.com/ZPGuiGroupWhu/lodd.
LGDec 7, 2023Code
MeanCut: A Greedy-Optimized Graph Clustering via Path-based Similarity and Degree Descent CriterionDehua Peng, Zhipeng Gui, Huayi Wu
As the most typical graph clustering method, spectral clustering is popular and attractive due to the remarkable performance, easy implementation, and strong adaptability. Classical spectral clustering measures the edge weights of graph using pairwise Euclidean-based metric, and solves the optimal graph partition by relaxing the constraints of indicator matrix and performing Laplacian decomposition. However, Euclidean-based similarity might cause skew graph cuts when handling non-spherical data distributions, and the relaxation strategy introduces information loss. Meanwhile, spectral clustering requires specifying the number of clusters, which is hard to determine without enough prior knowledge. In this work, we leverage the path-based similarity to enhance intra-cluster associations, and propose MeanCut as the objective function and greedily optimize it in degree descending order for a nondestructive graph partition. This algorithm enables the identification of arbitrary shaped clusters and is robust to noise. To reduce the computational complexity of similarity calculation, we transform optimal path search into generating the maximum spanning tree (MST), and develop a fast MST (FastMST) algorithm to further improve its time-efficiency. Moreover, we define a density gradient factor (DGF) for separating the weakly connected clusters. The validity of our algorithm is demonstrated by testifying on real-world benchmarks and application of face recognition. The source code of MeanCut is available at https://github.com/ZPGuiGroupWhu/MeanCut-Clustering.
LGDec 31, 2023
Interpreting the Curse of Dimensionality from Distance Concentration and Manifold EffectDehua Peng, Zhipeng Gui, Huayi Wu
The characteristics of data like distribution and heterogeneity, become more complex and counterintuitive as dimensionality increases. This phenomenon is known as curse of dimensionality, where common patterns and relationships (e.g., internal pattern and boundary pattern) that hold in low-dimensional space may be invalid in higher-dimensional space. It leads to a decreasing performance for the regression, classification, or clustering models or algorithms. Curse of dimensionality can be attributed to many causes. In this paper, we first summarize the potential challenges associated with manipulating high-dimensional data, and explains the possible causes for the failure of regression, classification, or clustering tasks. Subsequently, we delve into two major causes of the curse of dimensionality, distance concentration, and manifold effect, by performing theoretical and empirical analyses. The results demonstrate that, as the dimensionality increases, nearest neighbor search (NNS) using three classical distance measurements, Minkowski distance, Chebyshev distance, and cosine distance, becomes meaningless. Meanwhile, the data incorporates more redundant features, and the variance contribution of principal component analysis (PCA) is skewed towards a few dimensions.
CVJan 10, 2025
PEACE: Empowering Geologic Map Holistic Understanding with MLLMsYangyu Huang, Tianyi Gao, Haoran Xu et al.
Geologic map, as a fundamental diagram in geology science, provides critical insights into the structure and composition of Earth's subsurface and surface. These maps are indispensable in various fields, including disaster detection, resource exploration, and civil engineering. Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding. This gap is primarily due to the challenging nature of cartographic generalization, which involves handling high-resolution map, managing multiple associated components, and requiring domain-specific knowledge. To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce GeoMap-Agent, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). Inspired by the interdisciplinary collaboration among human scientists, an AI expert group acts as consultants, utilizing a diverse tool pool to comprehensively analyze questions. Through comprehensive experiments, GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming 0.369 of GPT-4o. Our work, emPowering gEologic mAp holistiC undErstanding (PEACE) with MLLMs, paves the way for advanced AI applications in geology, enhancing the efficiency and accuracy of geological investigations.
SEMay 19, 2025
AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language ModelsShuyang Hou, Zhangxiao Shen, Huayi Wu et al.
Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline-from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs-including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models-revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.
CVAug 29, 2025
Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal PruningYuquan Bi, Hongsong Wang, Xinli Shi et al.
Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5\%, inference MACs by 56.8\%, and improves inference speed by an average of 81.1\% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
LGApr 21, 2025
A Survey on Small Sample Imbalance Problem: Metrics, Feature Analysis, and SolutionsShuxian Zhao, Jie Gui, Minjing Dong et al.
The small sample imbalance (S&I) problem is a major challenge in machine learning and data analysis. It is characterized by a small number of samples and an imbalanced class distribution, which leads to poor model performance. In addition, indistinct inter-class feature distributions further complicate classification tasks. Existing methods often rely on algorithmic heuristics without sufficiently analyzing the underlying data characteristics. We argue that a detailed analysis from the data perspective is essential before developing an appropriate solution. Therefore, this paper proposes a systematic analytical framework for the S\&I problem. We first summarize imbalance metrics and complexity analysis methods, highlighting the need for interpretable benchmarks to characterize S&I problems. Second, we review recent solutions for conventional, complexity-based, and extreme S&I problems, revealing methodological differences in handling various data distributions. Our summary finds that resampling remains a widely adopted solution. However, we conduct experiments on binary and multiclass datasets, revealing that classifier performance differences significantly exceed the improvements achieved through resampling. Finally, this paper highlights open questions and discusses future trends.
LGJan 2, 2024
Sampling-enabled scalable manifold learning unveils the discriminative cluster structure of high-dimensional dataDehua Peng, Zhipeng Gui, Wenzhang Wei et al.
As a pivotal branch of machine learning, manifold learning uncovers the intrinsic low-dimensional structure within complex nonlinear manifolds in high-dimensional space for visualization, classification, clustering, and gaining key insights. Although existing techniques have achieved remarkable successes, they suffer from extensive distortions of cluster structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability for handling large-scale data. We hence propose a sampling-based Scalable manifold learning technique that enables Uniform and Discriminative Embedding, namely SUDE, for large-scale and high-dimensional data. It starts by seeking a set of landmarks to construct the low-dimensional skeleton of the entire data, and then incorporates the non-landmarks into the learned space based on the constrained locally linear embedding (CLLE). We empirically validated the effectiveness of SUDE on synthetic datasets and real-world benchmarks, and applied it to analyze single-cell data and detect anomalies in electrocardiogram (ECG) signals. SUDE exhibits distinct advantage in scalability with respect to data size and embedding dimension, and has promising performance in cluster separation, integrity, and global structure preservation. The experiments also demonstrate notable robustness in embedding quality as the sampling rate decreases.
CODec 10, 2019
Optimizing and accelerating space-time Ripley's K function based on Apache Spark for distributed spatiotemporal point pattern analysisYuan Wang, Zhipeng Gui, Huayi Wu et al.
With increasing point of interest (POI) datasets available with fine-grained spatial and temporal attributes, space-time Ripley's K function has been regarded as a powerful approach to analyze spatiotemporal point process. However, space-time Ripley's K function is computationally intensive for point-wise distance comparisons, edge correction and simulations for significance testing. Parallel computing technologies like OpenMP, MPI and CUDA have been leveraged to accelerate the K function, and related experiments have demonstrated the substantial acceleration. Nevertheless, previous works have not extended optimization of Ripley's K function from space dimension to space-time dimension. Without sophisticated spatiotemporal query and partitioning mechanisms, extra computational overhead can be problematic. Meanwhile, these researches were limited by the restricted scalability and relative expensive programming cost of parallel frameworks and impeded their applications for large POI dataset and Ripley's K function variations. This paper presents a distributed computing method to accelerate space-time Ripley's K function upon state-of-the-art distributed computing framework Apache Spark, and four strategies are adopted to simplify calculation procedures and accelerate distributed computing respectively. Based on the optimized method, a web-based visual analytics framework prototype has been developed. Experiments prove the feasibility and time efficiency of the proposed method, and also demonstrate its value on promoting applications of space-time Ripley's K function in ecology, geography, sociology, economics, urban transportation and other fields.