X. Feng

CV
h-index21
7papers
41citations
Novelty50%
AI Score53

7 Papers

NAMar 29, 2016
A multi-modes Monte Carlo finite element method for elliptic partial differential equations with random coefficients

X. Feng, J. Lin., C. Lorton

This paper develops and analyzes an efficient numerical method for solving elliptic partial differential equations, where the diffusion coefficients are random perturbations of deterministic diffusion coefficients. The method is based upon a multi-modes representation of the solution as a power series of the perturbation parameter, and the Monte Carlo technique for sampling the probability space. One key feature of the proposed method is that the governing equations for all the expanded mode functions share the same deterministic diffusion coefficients, thus an efficient direct solver by repeated use of the $LU$ decomposition matrices can be employed for solving the finite element discretized linear systems. It is shown that the computational complexity of the whole algorithm is comparable to that of solving a few deterministic elliptic partial differential equations using the $LU$ director solver. Error estimates are derived for the method, and numerical experiments are provided to test the efficiency of the algorithm and validate the theoretical results.

66.9CVMay 18
Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

X. Feng, J. Zhu, M. Wu et al.

Without incurring significant computational overhead, train-free long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose \textbf{MIGA}, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.

CVJul 26, 2025Code
ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking

X. Feng, S. Hu, X. Li et al.

Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target words identification solely based on textual content, and design an innovative context words calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks and ATCTrack achieves a new SOTA performance. The code and models will be released at: https://github.com/XiaokunFeng/ATCTrack.

CVMay 26, 2025Code
CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features

X. Feng, D. Zhang, S. Hu et al.

Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (\eg, depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing the refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks. The code and models will be released at: https://github.com/XiaokunFeng/CSTrack.

CVJul 15, 2025
NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation

X. Feng, H. Yu, M. Wu et al.

With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.

CVDec 27, 2024
Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues

X. Feng, D. Zhang, S. Hu et al.

Vision-Language Tracking (VLT) aims to localize a target in video sequences using a visual template and language description. While textual cues enhance tracking potential, current datasets typically contain much more image data than text, limiting the ability of VLT methods to align the two modalities effectively. To address this imbalance, we propose a novel plug-and-play method named CTVLT that leverages the strong text-image alignment capabilities of foundation grounding models. CTVLT converts textual cues into interpretable visual heatmaps, which are easier for trackers to process. Specifically, we design a textual cue mapping module that transforms textual cues into target distribution heatmaps, visually representing the location described by the text. Additionally, the heatmap guidance module fuses these heatmaps with the search image to guide tracking more effectively. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of our approach, achieving state-of-the-art performance and validating the utility of our method for enhanced VLT.

IVOct 17, 2019
Detecting intracranial aneurysm rupture from 3D surfaces using a novel GraphNet approach

Z. Ma, L. Song, X. Feng et al.

Intracranial aneurysm (IA) is a life-threatening blood spot in human's brain if it ruptures and causes cerebral hemorrhage. It is challenging to detect whether an IA has ruptured from medical images. In this paper, we propose a novel graph based neural network named GraphNet to detect IA rupture from 3D surface data. GraphNet is based on graph convolution network (GCN) and is designed for graph-level classification and node-level segmentation. The network uses GCN blocks to extract surface local features and pools to global features. 1250 patient data including 385 ruptured and 865 unruptured IAs were collected from clinic for experiments. The performance on randomly selected 234 test patient data was reported. The experiment with the proposed GraphNet achieved accuracy of 0.82, area-under-curve (AUC) of receiver operating characteristic (ROC) curve 0.82 in the classification task, significantly outperforming the baseline approach without using graph based networks. The segmentation output of the model achieved mean graph-node-based dice coefficient (DSC) score 0.88.