Jingyu Xu

CV
h-index9
17papers
282citations
Novelty38%
AI Score48

17 Papers

78.8CVApr 2Code
TOL: Textual Localization with OpenStreetMap

Youqi Liao, Shuhao Kang, Jingyu Xu et al.

Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.

CVSep 29, 2024
Causal Deciphering and Inpainting in Spatio-Temporal Dynamics via Diffusion Model

Yifan Duan, Jian Zhao, pengcheng et al.

Spatio-temporal (ST) prediction has garnered a De facto attention in earth sciences, such as meteorological prediction, human mobility perception. However, the scarcity of data coupled with the high expenses involved in sensor deployment results in notable data imbalances. Furthermore, models that are excessively customized and devoid of causal connections further undermine the generalizability and interpretability. To this end, we establish a causal framework for ST predictions, termed CaPaint, which targets to identify causal regions in data and endow model with causal reasoning ability in a two-stage process. Going beyond this process, we utilize the back-door adjustment to specifically address the sub-regions identified as non-causal in the upstream phase. Specifically, we employ a novel image inpainting technique. By using a fine-tuned unconditional Diffusion Probabilistic Model (DDPM) as the generative prior, we in-fill the masks defined as environmental parts, offering the possibility of reliable extrapolation for potential data distributions. CaPaint overcomes the high complexity dilemma of optimal ST causal discovery models by reducing the data generation complexity from exponential to quasi-linear levels. Extensive experiments conducted on five real-world ST benchmarks demonstrate that integrating the CaPaint concept allows models to achieve improvements ranging from 4.3% to 77.3%. Moreover, compared to traditional mainstream ST augmenters, CaPaint underscores the potential of diffusion models in ST enhancement, offering a novel paradigm for this field. Our project is available at https://anonymous.4open.science/r/12345-DFCC.

IVDec 30, 2025
Generative Video Compression: Towards 0.01% Compression Rate for Video Transmission

Xiangyu Chen, Jixiang Luo, Jingyu Xu et al.

Whether a video can be compressed at an extreme compression rate as low as 0.01%? To this end, we achieve the compression rate as 0.02% at some cases by introducing Generative Video Compression (GVC), a new framework that redefines the limits of video compression by leveraging modern generative video models to achieve extreme compression rates while preserving a perception-centric, task-oriented communication paradigm, corresponding to Level C of the Shannon-Weaver model. Besides, How we trade computation for compression rate or bandwidth? GVC answers this question by shifting the burden from transmission to inference: it encodes video into extremely compact representations and delegates content reconstruction to the receiver, where powerful generative priors synthesize high-quality video from minimal transmitted information. Is GVC practical and deployable? To ensure practical deployment, we propose a compression-computation trade-off strategy, enabling fast inference on consume-grade GPUs. Within the AI Flow framework, GVC opens new possibility for video communication in bandwidth- and resource-constrained environments such as emergency rescue, remote surveillance, and mobile edge computing. Through empirical validation, we demonstrate that GVC offers a viable path toward a new effective, efficient, scalable, and practical video communication paradigm.

DCFeb 27, 2024
Application of Machine Learning Optimization in Cloud Computing Resource Scheduling and Management

Yifan Zhang, Bo Liu, Yulu Gong et al.

In recent years, cloud computing has been widely used. Cloud computing refers to the centralized computing resources, users through the access to the centralized resources to complete the calculation, the cloud computing center will return the results of the program processing to the user. Cloud computing is not only for individual users, but also for enterprise users. By purchasing a cloud server, users do not have to buy a large number of computers, saving computing costs. According to a report by China Economic News Network, the scale of cloud computing in China has reached 209.1 billion yuan. At present, the more mature cloud service providers in China are Ali Cloud, Baidu Cloud, Huawei Cloud and so on. Therefore, this paper proposes an innovative approach to solve complex problems in cloud computing resource scheduling and management using machine learning optimization techniques. Through in-depth study of challenges such as low resource utilization and unbalanced load in the cloud environment, this study proposes a comprehensive solution, including optimization methods such as deep learning and genetic algorithm, to improve system performance and efficiency, and thus bring new breakthroughs and progress in the field of cloud computing resource management.Rational allocation of resources plays a crucial role in cloud computing. In the resource allocation of cloud computing, the cloud computing center has limited cloud resources, and users arrive in sequence. Each user requests the cloud computing center to use a certain number of cloud resources at a specific time.

DCMar 20, 2024
Dynamic Resource Allocation for Virtual Machine Migration Optimization using Machine Learning

Yulu Gong, Jiaxin Huang, Bo Liu et al.

The paragraph is grammatically correct and logically coherent. It discusses the importance of mobile terminal cloud computing migration technology in meeting the demands of evolving computer and cloud computing technologies. It emphasizes the need for efficient data access and storage, as well as the utilization of cloud computing migration technology to prevent additional time delays. The paragraph also highlights the contributions of cloud computing migration technology to expanding cloud computing services. Additionally, it acknowledges the role of virtualization as a fundamental capability of cloud computing while emphasizing that cloud computing and virtualization are not inherently interconnected. Finally, it introduces machine learning-based virtual machine migration optimization and dynamic resource allocation as a critical research direction in cloud computing, citing the limitations of static rules or manual settings in traditional cloud computing environments. Overall, the paragraph effectively communicates the importance of machine learning technology in addressing resource allocation and virtual machine migration challenges in cloud computing.

CPDec 10, 2024
A Consolidated Volatility Prediction with Back Propagation Neural Network and Genetic Algorithm

Zong Ke, Jingyu Xu, Zizhou Zhang et al.

This paper provides a unique approach with AI algorithms to predict emerging stock markets volatility. Traditionally, stock volatility is derived from historical volatility,Monte Carlo simulation and implied volatility as well. In this paper, the writer designs a consolidated model with back-propagation neural network and genetic algorithm to predict future volatility of emerging stock markets and found that the results are quite accurate with low errors.

CVMar 13, 2024
Improved YOLOv5 Based on Attention Mechanism and FasterNet for Foreign Object Detection on Railway and Airway tracks

Zongqing Qi, Danqing Ma, Jingyu Xu et al.

In recent years, there have been frequent incidents of foreign objects intruding into railway and Airport runways. These objects can include pedestrians, vehicles, animals, and debris. This paper introduces an improved YOLOv5 architecture incorporating FasterNet and attention mechanisms to enhance the detection of foreign objects on railways and Airport runways. This study proposes a new dataset, AARFOD (Aero and Rail Foreign Object Detection), which combines two public datasets for detecting foreign objects in aviation and railway systems.The dataset aims to improve the recognition capabilities of foreign object targets. Experimental results on this large dataset have demonstrated significant performance improvements of the proposed model over the baseline YOLOv5 model, reducing computational requirements.Improved YOLO model shows a significant improvement in precision by 1.2%, recall rate by 1.0%, and mAP@.5 by 0.6%, while mAP@.5-.95 remained unchanged. The parameters were reduced by approximately 25.12%, and GFLOPs were reduced by about 10.63%. In the ablation experiment, it is found that the FasterNet module can significantly reduce the number of parameters of the model, and the reference of the attention mechanism can slow down the performance loss caused by lightweight.

AIMar 26, 2024
Practical Applications of Advanced Cloud Services and Generative AI Systems in Medical Image Analysis

Jingyu Xu, Binbin Wu, Jiaxin Huang et al.

The medical field is one of the important fields in the application of artificial intelligence technology. With the explosive growth and diversification of medical data, as well as the continuous improvement of medical needs and challenges, artificial intelligence technology is playing an increasingly important role in the medical field. Artificial intelligence technologies represented by computer vision, natural language processing, and machine learning have been widely penetrated into diverse scenarios such as medical imaging, health management, medical information, and drug research and development, and have become an important driving force for improving the level and quality of medical services.The article explores the transformative potential of generative AI in medical imaging, emphasizing its ability to generate syntheticACM-2 data, enhance images, aid in anomaly detection, and facilitate image-to-image translation. Despite challenges like model complexity, the applications of generative models in healthcare, including Med-PaLM 2 technology, show promising results. By addressing limitations in dataset size and diversity, these models contribute to more accurate diagnoses and improved patient outcomes. However, ethical considerations and collaboration among stakeholders are essential for responsible implementation. Through experiments leveraging GANs to augment brain tumor MRI datasets, the study demonstrates how generative AI can enhance image quality and diversity, ultimately advancing medical diagnostics and patient care.

SPJan 30, 2024
Progress in artificial intelligence applications based on the combination of self-driven sensors and deep learning

Weixiang Wan, Wenjian Sun, Qiang Zeng et al.

In the era of Internet of Things, how to develop a smart sensor system with sustainable power supply, easy deployment and flexible use has become a difficult problem to be solved. The traditional power supply has problems such as frequent replacement or charging when in use, which limits the development of wearable devices. The contact-to-separate friction nanogenerator (TENG) was prepared by using polychotomy thy lene (PTFE) and aluminum (AI) foils. Human motion energy was collected by human body arrangement, and human motion posture was monitored according to the changes of output electrical signals. In 2012, Academician Wang Zhong lin and his team invented the triboelectric nanogenerator (TENG), which uses Maxwell displacement current as a driving force to directly convert mechanical stimuli into electrical signals, so it can be used as a self-driven sensor. Teng-based sensors have the advantages of simple structure and high instantaneous power density, which provides an important means for building intelligent sensor systems. At the same time, machine learning, as a technology with low cost, short development cycle, strong data processing ability and prediction ability, has a significant effect on the processing of a large number of electrical signals generated by TENG, and the combination with TENG sensors will promote the rapid development of intelligent sensor networks in the future. Therefore, this paper is based on the intelligent sound monitoring and recognition system of TENG, which has good sound recognition capability, and aims to evaluate the feasibility of the sound perception module architecture in ubiquitous sensor networks.

ROFeb 28, 2024
Automatic driving lane change safety prediction model based on LSTM

Wenjian Sun, Linying Pan, Jingyu Xu et al.

Autonomous driving technology can improve traffic safety and reduce traffic accidents. In addition, it improves traffic flow, reduces congestion, saves energy and increases travel efficiency. In the relatively mature automatic driving technology, the automatic driving function is divided into several modules: perception, decision-making, planning and control, and a reasonable division of labor can improve the stability of the system. Therefore, autonomous vehicles need to have the ability to predict the trajectory of surrounding vehicles in order to make reasonable decision planning and safety measures to improve driving safety. By using deep learning method, a safety-sensitive deep learning model based on short term memory (LSTM) network is proposed. This model can alleviate the shortcomings of current automatic driving trajectory planning, and the output trajectory not only ensures high accuracy but also improves safety. The cell state simulation algorithm simulates the trackability of the trajectory generated by this model. The research results show that compared with the traditional model-based method, the trajectory prediction method based on LSTM network has obvious advantages in predicting the trajectory in the long time domain. The intention recognition module considering interactive information has higher prediction and accuracy, and the algorithm results show that the trajectory is very smooth based on the premise of safe prediction and efficient lane change. And autonomous vehicles can efficiently and safely complete lane changes.

NIFeb 28, 2024
The Fusion of Deep Reinforcement Learning and Edge Computing for Real-time Monitoring and Control Optimization in IoT Environments

Jingyu Xu, Weixiang Wan, Linying Pan et al.

In response to the demand for real-time performance and control quality in industrial Internet of Things (IoT) environments, this paper proposes an optimization control system based on deep reinforcement learning and edge computing. The system leverages cloud-edge collaboration, deploys lightweight policy networks at the edge, predicts system states, and outputs controls at a high frequency, enabling monitoring and optimization of industrial objectives. Additionally, a dynamic resource allocation mechanism is designed to ensure rational scheduling of edge computing resources, achieving global optimization. Results demonstrate that this approach reduces cloud-edge communication latency, accelerates response to abnormal situations, reduces system failure rates, extends average equipment operating time, and saves costs for manual maintenance and replacement. This ensures real-time and stable control.

LGJan 21
SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model

Xianghao Zhan, Jingyu Xu, Yuanning Zheng et al.

Spatial transcriptomics enables spatial gene expression profiling, motivating computational models that capture spatially conditioned regulatory relationships. We introduce SAGE-FM, a lightweight spatial transcriptomics foundation model based on graph convolutional networks (GCNs) trained with a masked central spot prediction objective. Trained on 416 human Visium samples spanning 15 organs, SAGE-FM learns spatially coherent embeddings that robustly recover masked genes, with 91% of masked genes showing significant correlations (p < 0.05). The embeddings generated by SAGE-FM outperform MOFA and existing spatial transcriptomics methods in unsupervised clustering and preservation of biological heterogeneity. SAGE-FM generalizes to downstream tasks, enabling 81% accuracy in pathologist-defined spot annotation in oropharyngeal squamous cell carcinoma and improving glioblastoma subtype prediction relative to MOFA. In silico perturbation experiments further demonstrate that the model captures directional ligand-receptor and upstream-downstream regulatory effects consistent with ground truth. These results demonstrate that simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics.

IVSep 19, 2025
FMD-TransUNet: Abdominal Multi-Organ Segmentation Based on Frequency Domain Multi-Axis Representation Learning and Dual Attention Mechanisms

Fang Lu, Jingyu Xu, Qinxiu Sun et al.

Accurate abdominal multi-organ segmentation is critical for clinical applications. Although numerous deep learning-based automatic segmentation methods have been developed, they still struggle to segment small, irregular, or anatomically complex organs. Moreover, most current methods focus on spatial-domain analysis, often overlooking the synergistic potential of frequency-domain representations. To address these limitations, we propose a novel framework named FMD-TransUNet for precise abdominal multi-organ segmentation. It innovatively integrates the Multi-axis External Weight Block (MEWB) and the improved dual attention module (DA+) into the TransUNet framework. The MEWB extracts multi-axis frequency-domain features to capture both global anatomical structures and local boundary details, providing complementary information to spatial-domain representations. The DA+ block utilizes depthwise separable convolutions and incorporates spatial and channel attention mechanisms to enhance feature fusion, reduce redundant information, and narrow the semantic gap between the encoder and decoder. Experimental validation on the Synapse dataset shows that FMD-TransUNet outperforms other recent state-of-the-art methods, achieving an average DSC of 81.32\% and a HD of 16.35 mm across eight abdominal organs. Compared to the baseline model, the average DSC increased by 3.84\%, and the average HD decreased by 15.34 mm. These results demonstrate the effectiveness of FMD-TransUNet in improving the accuracy of abdominal multi-organ segmentation.

CVJul 21, 2025
Conditional Video Generation for High-Efficiency Video Compression

Fangqiu Yi, Jingyu Xu, Jiawei Shao et al.

Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.

CVMar 7, 2025
FMT:A Multimodal Pneumonia Detection Model Based on Stacking MOE Framework

Jingyu Xu, Yang Wang

Artificial intelligence has shown the potential to improve diagnostic accuracy through medical image analysis for pneumonia diagnosis. However, traditional multimodal approaches often fail to address real-world challenges such as incomplete data and modality loss. In this study, a Flexible Multimodal Transformer (FMT) was proposed, which uses ResNet-50 and BERT for joint representation learning, followed by a dynamic masked attention strategy that simulates clinical modality loss to improve robustness; finally, a sequential mixture of experts (MOE) architecture was used to achieve multi-level decision refinement. After evaluation on a small multimodal pneumonia dataset, FMT achieved state-of-the-art performance with 94% accuracy, 95% recall, and 93% F1 score, outperforming single-modal baselines (ResNet: 89%; BERT: 79%) and the medical benchmark CheXMed (90%), providing a scalable solution for multimodal diagnosis of pneumonia in resource-constrained medical settings.

CVFeb 6, 2025
Advanced Object Detection and Pose Estimation with Hybrid Task Cascade and High-Resolution Networks

Yuhui Jin, Yaqiong Zhang, Zheyuan Xu et al.

In the field of computer vision, 6D object detection and pose estimation are critical for applications such as robotics, augmented reality, and autonomous driving. Traditional methods often struggle with achieving high accuracy in both object detection and precise pose estimation simultaneously. This study proposes an improved 6D object detection and pose estimation pipeline based on the existing 6D-VNet framework, enhanced by integrating a Hybrid Task Cascade (HTC) and a High-Resolution Network (HRNet) backbone. By leveraging the strengths of HTC's multi-stage refinement process and HRNet's ability to maintain high-resolution representations, our approach significantly improves detection accuracy and pose estimation precision. Furthermore, we introduce advanced post-processing techniques and a novel model integration strategy that collectively contribute to superior performance on public and private benchmarks. Our method demonstrates substantial improvements over state-of-the-art models, making it a valuable contribution to the domain of 6D object detection and pose estimation.

CLJan 18, 2024
Automated Scoring of Clinical Patient Notes using Advanced NLP and Pseudo Labeling

Jingyu Xu, Yifeng Jiang, Bin Yuan et al.

Clinical patient notes are critical for documenting patient interactions, diagnoses, and treatment plans in medical practice. Ensuring accurate evaluation of these notes is essential for medical education and certification. However, manual evaluation is complex and time-consuming, often resulting in variability and resource-intensive assessments. To tackle these challenges, this research introduces an approach leveraging state-of-the-art Natural Language Processing (NLP) techniques, specifically Masked Language Modeling (MLM) pretraining, and pseudo labeling. Our methodology enhances efficiency and effectiveness, significantly reducing training time without compromising performance. Experimental results showcase improved model performance, indicating a potential transformation in clinical note assessment.