CVDec 19, 2025Code
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual ReasoningQi Song, Honglin Li, Yingchen Yu et al.
Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool-call, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.
AIMay 23
TIGER: Text-Informed Generalized Enzyme-Reaction RetrievalYuhang Zhang, Keyan Ding, Peilin Chen et al.
Enzyme-reaction retrieval is a fundamental problem in computational biology, underpinning enzyme characterization, reaction mechanism elucidation, and the rational design of metabolic pathways and biocatalysts. As a bidirectional task, it entails both enzyme-to-reaction and reaction-to-enzyme mapping. However, existing approaches suffer from poor generalization across tasks and distributions, with performance highly sensitive to dataset splits and substantial asymmetry between retrieval directions. To address these challenges, we present TIGER, a Text-Informed Generalized Enzyme-Reaction Retrieval framework that leverages protein-to-text generation models to distill textual semantic knowledge from enzyme sequences, providing a generalized representation that bridges enzymes and biochemical reactions. To ensure the quality and reliability of textual semantics, we design a Dynamic Gating Network that adaptively fuses text-derived knowledge with sequence features, enabling more consistent and informative enzyme representations, while a Structure-Shared Feature Projector aligns enzyme and reaction representations within a unified latent space. Extensive experiments demonstrate that, under bidirectional retrieval supervision, TIGER significantly outperforms state-of-the-art baselines across diverse distributions and exhibits strong robustness and transferability across tasks.
CVMay 7
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D ReconstructionFeifei Li, Qi Song, Chi Zhang et al.
Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.
CVJul 10, 2024
Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base ModelQi Song, Ziyuan Luo, Ka Chun Cheung et al.
Neural Radiance Fields (NeRFs) have become a key method for 3D scene representation. With the rising prominence and influence of NeRF, safeguarding its intellectual property has become increasingly important. In this paper, we propose \textbf{NeRFProtector}, which adopts a plug-and-play strategy to protect NeRF's copyright during its creation. NeRFProtector utilizes a pre-trained watermarking base model, enabling NeRF creators to embed binary messages directly while creating their NeRF. Our plug-and-play property ensures NeRF creators can flexibly choose NeRF variants without excessive modifications. Leveraging our newly designed progressive distillation, we demonstrate performance on par with several leading-edge neural rendering methods. Our project is available at: \url{https://qsong2001.github.io/NeRFProtector}.
CVMay 21
Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided RefinementYiming Xu, Yixuan Liu, Yuhang Zhang et al.
Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evidence and cannot effectively leverage similar cases or external knowledge. In clinical practice, diagnosis is typically supported by historical similar cases and their associated symptoms. To simulate this diagnostic process, we propose a framework that performs case-aware reasoning using multimodal knowledge graphs for explainable medical image diagnosis. Given an input image, our method constructs a multimodal knowledge graph from adaptively retrieved similar cases, enabling more effective utilization of related samples. We further introduce a knowledge propagation and injection mechanism, where an image-centric Graph Attention Network propagates knowledge semantics to obtain case-based features, followed by a bidirectional cross-modal attention mechanism that injects these features into visual representations for cross-modal alignment. To mitigate noisy retrieval, we design a confidence-calibrated decision refinement scheme that estimates the reliability of each retrieved case by jointly considering prediction confidence and sample similarity, adaptively adjusting its contribution to the final prediction and providing interpretable case-level evidence. Extensive experiments on multiple medical imaging datasets show that our approach consistently outperforms strong baselines, and ablation studies validate the effectiveness of each component. The source code is publicly available at https://anonymous.4open.science/r/MKG-CARE-8B7B.
AIFeb 17, 2025Code
KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge GraphsQi Zhao, Hongyu Yang, Qi Song et al.
Large language models (LLMs) have demonstrated remarkable capabilities in various complex tasks, yet they still suffer from hallucinations. By incorporating and exploring external knowledge, such as knowledge graphs(KGs), LLM's ability to provide factual answers has been enhanced. This approach carries significant practical implications. However, existing methods suffer from three key limitations: insufficient mining of LLMs' internal knowledge, constrained generation of interpretable reasoning paths, and unclear fusion of internal and external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large model framework driven by the collaboration of internal and external knowledge. It relies on the internal knowledge of the LLM to guide the exploration of interpretable directed subgraphs in external knowledge graphs, better integrating the two knowledge sources for more accurate reasoning. Extensive experiments on multiple real-world datasets demonstrate the effectiveness of KnowPath. Our code and data are available at https://github.com/tize-72/KnowPath.
CLMar 17
SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question AnsweringJunli Liang, Pengfei Zhou, Wangqiu Zhou et al.
Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.
CLSep 25, 2025Code
RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMsCan Lin, Zhengwang Jiang, Ling Zheng et al.
Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.
CVFeb 29, 2024Code
BFRFormer: Transformer-based generator for Real-World Blind Face RestorationGuojing Ge, Qi Song, Guibo Zhu et al.
Blind face restoration is a challenging task due to the unknown and complex degradation. Although face prior-based methods and reference-based methods have recently demonstrated high-quality results, the restored images tend to contain over-smoothed results and lose identity-preserved details when the degradation is severe. It is observed that this is attributed to short-range dependencies, the intrinsic limitation of convolutional neural networks. To model long-range dependencies, we propose a Transformer-based blind face restoration method, named BFRFormer, to reconstruct images with more identity-preserved details in an end-to-end manner. In BFRFormer, to remove blocking artifacts, the wavelet discriminator and aggregated attention module are developed, and spectral normalization and balanced consistency regulation are adaptively applied to address the training instability and over-fitting problem, respectively. Extensive experiments show that our method outperforms state-of-the-art methods on a synthetic dataset and four real-world datasets. The source code, Casia-Test dataset, and pre-trained models are released at https://github.com/s8Znk/BFRFormer.
CVAug 23, 2025Code
Align 3D Representation and Text Embedding for 3D Content PersonalizationQi Song, Ziyuan Luo, Ka Chun Cheung et al.
Recent advances in NeRF and 3DGS have significantly enhanced the efficiency and quality of 3D content synthesis. However, efficient personalization of generated 3D content remains a critical challenge. Current 3D personalization approaches predominantly rely on knowledge distillation-based methods, which require computationally expensive retraining procedures. To address this challenge, we propose \textbf{Invert3D}, a novel framework for convenient 3D content personalization. Nowadays, vision-language models such as CLIP enable direct image personalization through aligned vision-text embedding spaces. However, the inherent structural differences between 3D content and 2D images preclude direct application of these techniques to 3D personalization. Our approach bridges this gap by establishing alignment between 3D representations and text embedding spaces. Specifically, we develop a camera-conditioned 3D-to-text inverse mechanism that projects 3D contents into a 3D embedding aligned with text embeddings. This alignment enables efficient manipulation and personalization of 3D content through natural language prompts, eliminating the need for computationally retraining procedures. Extensive experiments demonstrate that Invert3D achieves effective personalization of 3D content. Our work is available at: https://github.com/qsong2001/Invert3D.
LGMay 29, 2025Code
CrossLinear: Plug-and-Play Cross-Correlation Embedding for Time Series Forecasting with Exogenous VariablesPengfei Zhou, Yunlong Liu, Junli Liang et al.
Time series forecasting with exogenous variables is a critical emerging paradigm that presents unique challenges in modeling dependencies between variables. Traditional models often struggle to differentiate between endogenous and exogenous variables, leading to inefficiencies and overfitting. In this paper, we introduce CrossLinear, a novel Linear-based forecasting model that addresses these challenges by incorporating a plug-and-play cross-correlation embedding module. This lightweight module captures the dependencies between variables with minimal computational cost and seamlessly integrates into existing neural networks. Specifically, it captures time-invariant and direct variable dependencies while disregarding time-varying or indirect dependencies, thereby mitigating the risk of overfitting in dependency modeling and contributing to consistent performance improvements. Furthermore, CrossLinear employs patch-wise processing and a global linear head to effectively capture both short-term and long-term temporal dependencies, further improving its forecasting precision. Extensive experiments on 12 real-world datasets demonstrate that CrossLinear achieves superior performance in both short-term and long-term forecasting tasks. The ablation study underscores the effectiveness of the cross-correlation embedding module. Additionally, the generalizability of this module makes it a valuable plug-in for various forecasting tasks across different domains. Codes are available at https://github.com/mumiao2000/CrossLinear.
CLMay 6, 2025Code
TeleEval-OS: Performance evaluations of large language models for operations schedulingYanyan Wang, Yingying Wang, Junli Liang et al.
The rapid advancement of large language models (LLMs) has significantly propelled progress in artificial intelligence, demonstrating substantial application potential across multiple specialized domains. Telecommunications operation scheduling (OS) is a critical aspect of the telecommunications industry, involving the coordinated management of networks, services, risks, and human resources to optimize production scheduling and ensure unified service control. However, the inherent complexity and domain-specific nature of OS tasks, coupled with the absence of comprehensive evaluation benchmarks, have hindered thorough exploration of LLMs' application potential in this critical field. To address this research gap, we propose the first Telecommunications Operation Scheduling Evaluation Benchmark (TeleEval-OS). Specifically, this benchmark comprises 15 datasets across 13 subtasks, comprehensively simulating four key operational stages: intelligent ticket creation, intelligent ticket handling, intelligent ticket closure, and intelligent evaluation. To systematically assess the performance of LLMs on tasks of varying complexity, we categorize their capabilities in telecommunications operation scheduling into four hierarchical levels, arranged in ascending order of difficulty: basic NLP, knowledge Q&A, report generation, and report analysis. On TeleEval-OS, we leverage zero-shot and few-shot evaluation methods to comprehensively assess 10 open-source LLMs (e.g., DeepSeek-V3) and 4 closed-source LLMs (e.g., GPT-4o) across diverse scenarios. Experimental results demonstrate that open-source LLMs can outperform closed-source LLMs in specific scenarios, highlighting their significant potential and value in the field of telecommunications operation scheduling.
CVDec 25, 2025
Towards Long-window Anchoring in Vision-Language Model DistillationHaoyi Zhou, Shuo Li, Tianyu Chen et al.
While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small branches fail on linguistics-photography alignment for a limited window size. We discover that knowledge distillation improves students' capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
CLNov 13, 2025
GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph PromptZhenhe Li, Can Lin, Ling Zheng et al.
Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.
CLFeb 22, 2024
LLMBind: A Unified Modality-Task Integration FrameworkBin Zhu, Munan Ning, Peng Jin et al.
In the multi-modal domain, the dependence of various models on specific input formats leads to user confusion and hinders progress. To address this challenge, we introduce \textbf{LLMBind}, a novel framework designed to unify a diverse array of multi-modal tasks. By harnessing a Mixture-of-Experts (MoE) Large Language Model (LLM), LLMBind processes multi-modal inputs and generates task-specific tokens, enabling the invocation of corresponding models to accomplish tasks. This unique approach empowers LLMBind to interpret inputs and generate outputs across various modalities, including image, text, video, and audio. Furthermore, we have constructed an interaction dataset comprising 400k instructions, which unlocks the ability of LLMBind for interactive visual generation and editing tasks. Extensive experimentation demonstrates that LLMBind achieves very superior performance across diverse tasks and outperforms existing models in user evaluations conducted in real-world scenarios. Moreover, the adaptability of LLMBind allows for seamless integration with the latest models and extension to new modality tasks, highlighting its potential to serve as a unified AI agent for modeling universal modalities.
CVOct 30, 2024
Geometry Cloak: Preventing TGS-based 3D Reconstruction from Copyrighted ImagesQi Song, Ziyuan Luo, Ka Chun Cheung et al.
Single-view 3D reconstruction methods like Triplane Gaussian Splatting (TGS) have enabled high-quality 3D model generation from just a single image input within seconds. However, this capability raises concerns about potential misuse, where malicious users could exploit TGS to create unauthorized 3D models from copyrighted images. To prevent such infringement, we propose a novel image protection approach that embeds invisible geometry perturbations, termed "geometry cloaks", into images before supplying them to TGS. These carefully crafted perturbations encode a customized message that is revealed when TGS attempts 3D reconstructions of the cloaked image. Unlike conventional adversarial attacks that simply degrade output quality, our method forces TGS to fail the 3D reconstruction in a specific way - by generating an identifiable customized pattern that acts as a watermark. This watermark allows copyright holders to assert ownership over any attempted 3D reconstructions made from their protected images. Extensive experiments have verified the effectiveness of our geometry cloak. Our project is available at https://qsong2001.github.io/geometry_cloak.
ROMar 18
Physics-informed Deep Mixture-of-Koopmans Vehicle Dynamics Model with Dual-branch Encoder for Distributed Electric-drive TrucksJinyu Miao, Pu Zhang, Rujun Yan et al.
Advanced autonomous driving systems require accurate vehicle dynamics modeling. However, identifying a precise dynamics model remains challenging due to strong nonlinearities and the coupled longitudinal and lateral dynamic characteristics. Previous research has employed physics-based analytical models or neural networks to construct vehicle dynamics representations. Nevertheless, these approaches often struggle to simultaneously achieve satisfactory performance in terms of system identification efficiency, modeling accuracy, and compatibility with linear control strategies. In this paper, we propose a fully data-driven dynamics modeling method tailored for complex distributed electric-drive trucks (DETs), leveraging Koopman operator theory to represent highly nonlinear dynamics in a lifted linear embedding space. To achieve high-precision modeling, we first propose a novel dual-branch encoder which encodes dynamic states and provides a powerful basis for the proposed Koopman-based methods entitled KODE. A physics-informed supervision mechanism, grounded in the geometric consistency of temporal vehicle motion, is incorporated into the training process to facilitate effective learning of both the encoder and the Koopman operator. Furthermore, to accommodate the diverse driving patterns of DETs, we extend the vanilla Koopman operator to a mixture-of-Koopman operator framework, enhancing modeling capability. Simulations conducted in a high-fidelity TruckSim environment and real-world experiments demonstrate that the proposed approach achieves state-of-the-art performance in long-term dynamics state estimation.
CVApr 17, 2024
Improving Hierarchical Representations of Vectorized HD Maps with Perspective CluesChi Zhang, Qi Song, Feifei Li et al.
The construction of vectorized High-Definition (HD) maps from onboard surround-view cameras has become a significant focus in autonomous driving. However, current map vector estimation pipelines face two key limitations: input-agnostic queries struggle to capture complex map structures, and the view transformation leads to information loss. These issues often result in inaccurate shape restoration or missing instances in map predictions. To address this concern, we propose a novel approach, namely \textbf{PerCMap}, which explicitly exploits clues from perspective-view features at both instance and point level. Specifically, at instance level, we propose Cross-view Instance Activation (CIA) to activate instance queries across surround-view images, thereby helping the model recover the instance attributes of map vectors. At point level, we design Dual-view Point Embedding (DPE), which fuses features from both views to generate input-aware positional embeddings and improve the accuracy of point coordinate estimation. Extensive experiments on \textit{nuScenes} and \textit{Argoverse 2} demonstrate that PerCMap achieves strong and consistent performance across benchmarks, reaching 67.1 and 70.5 mAP, respectively.
IRFeb 28, 2025
Unleashing the Potential of Two-Tower Models: Diffusion-Based Cross-Interaction for Large-Scale MatchingYihan Wang, Fei Xiong, Zhexin Han et al.
Two-tower models are widely adopted in the industrial-scale matching stage across a broad range of application domains, such as content recommendations, advertisement systems, and search engines. This model efficiently handles large-scale candidate item screening by separating user and item representations. However, the decoupling network also leads to a neglect of potential information interaction between the user and item representations. Current state-of-the-art (SOTA) approaches include adding a shallow fully connected layer(i.e., COLD), which is limited by performance and can only be used in the ranking stage. For performance considerations, another approach attempts to capture historical positive interaction information from the other tower by regarding them as the input features(i.e., DAT). Later research showed that the gains achieved by this method are still limited because of lacking the guidance on the next user intent. To address the aforementioned challenges, we propose a "cross-interaction decoupling architecture" within our matching paradigm. This user-tower architecture leverages a diffusion module to reconstruct the next positive intention representation and employs a mixed-attention module to facilitate comprehensive cross-interaction. During the next positive intention generation, we further enhance the accuracy of its reconstruction by explicitly extracting the temporal drift within user behavior sequences. Experiments on two real-world datasets and one industrial dataset demonstrate that our method outperforms the SOTA two-tower models significantly, and our diffusion approach outperforms other generative models in reconstructing item representations.
CVApr 1, 2025
ADGaussian: Generalizable Gaussian Splatting for Autonomous Driving with Multi-modal InputsQi Song, Chenghong Li, Haotong Lin et al.
We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric clue. Furthermore, we propose a multi-modal feature matching strategy coupled with a multi-scale Gaussian decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on two large-scale autonomous driving datasets, Waymo and KITTI, demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.
CVAug 31, 2025
MarkSplatter: Generalizable Watermarking for 3D Gaussian Splatting Model via Splatter Image StructureXiufeng Huang, Ziyuan Luo, Qi Song et al.
The growing popularity of 3D Gaussian Splatting (3DGS) has intensified the need for effective copyright protection. Current 3DGS watermarking methods rely on computationally expensive fine-tuning procedures for each predefined message. We propose the first generalizable watermarking framework that enables efficient protection of Splatter Image-based 3DGS models through a single forward pass. We introduce GaussianBridge that transforms unstructured 3D Gaussians into Splatter Image format, enabling direct neural processing for arbitrary message embedding. To ensure imperceptibility, we design a Gaussian-Uncertainty-Perceptual heatmap prediction strategy for preserving visual quality. For robust message recovery, we develop a dense segmentation-based extraction mechanism that maintains reliable extraction even when watermarked objects occupy minimal regions in rendered views. Project page: https://kevinhuangxf.github.io/marksplatter.
CVNov 27, 2025
Creating Blank Canvas Against AI-enabled Image ForgeryQi Song, Ziyuan Luo, Renjie Wan
AIGC-based image editing technology has greatly simplified the realistic-level image modification, causing serious potential risks of image forgery. This paper introduces a new approach to tampering detection using the Segment Anything Model (SAM). Instead of training SAM to identify tampered areas, we propose a novel strategy. The entire image is transformed into a blank canvas from the perspective of neural models. Any modifications to this blank canvas would be noticeable to the models. To achieve this idea, we introduce adversarial perturbations to prevent SAM from ``seeing anything'', allowing it to identify forged regions when the image is tampered with. Due to SAM's powerful perceiving capabilities, naive adversarial attacks cannot completely tame SAM. To thoroughly deceive SAM and make it blind to the image, we introduce a frequency-aware optimization strategy, which further enhances the capability of tamper localization. Extensive experimental results demonstrate the effectiveness of our method.
LGAug 5, 2025
Heterogeneity-Oblivious Robust Federated LearningWeiyao Zhang, Jinyang Li, Qi Song et al.
Federated Learning (FL) remains highly vulnerable to poisoning attacks, especially under real-world hyper-heterogeneity, where clients differ significantly in data distributions, communication capabilities, and model architectures. Such heterogeneity not only undermines the effectiveness of aggregation strategies but also makes attacks more difficult to detect. Furthermore, high-dimensional models expand the attack surface. To address these challenges, we propose Horus, a heterogeneity-oblivious robust FL framework centered on low-rank adaptations (LoRAs). Rather than aggregating full model parameters, Horus inserts LoRAs into empirically stable layers and aggregates only LoRAs to reduce the attack uncover a key empirical observation that the input projection (LoRA-A) is markedly more stable than the output projection (LoRA-B) under heterogeneity and poisoning. Leveraging this, we design a Heterogeneity-Oblivious Poisoning Score using the features from LoRA-A to filter poisoned clients. For the remaining benign clients, we propose projection-aware aggregation mechanism to preserve collaborative signals while suppressing drifts, which reweights client updates by consistency with the global directions. Extensive experiments across diverse datasets, model architectures, and attacks demonstrate that Horus consistently outperforms state-of-the-art baselines in both robustness and accuracy.
LGMay 26, 2025
Advanced Long-term Earth System ForecastingHao Wu, Yuan Gao, Ruijian Gou et al.
Reliable long-term forecasting of Earth system dynamics is fundamentally limited by instabilities in current artificial intelligence (AI) models during extended autoregressive simulations. These failures often originate from inherent spectral bias, leading to inadequate representation of critical high-frequency, small-scale processes and subsequent uncontrolled error amplification. Inspired by the nested grids in numerical models used to resolve small scales, we present TritonCast. At the core of its design is a dedicated latent dynamical core, which ensures the long-term stability of the macro-evolution at a coarse scale. An outer structure then fuses this stable trend with fine-grained local details. This design effectively mitigates the spectral bias caused by cross-scale interactions. In atmospheric science, it achieves state-of-the-art accuracy on the WeatherBench 2 benchmark while demonstrating exceptional long-term stability: executing year-long autoregressive global forecasts and completing multi-year climate simulations that span the entire available $2500$-day test period without drift. In oceanography, it extends skillful eddy forecast to $120$ days and exhibits unprecedented zero-shot cross-resolution generalization. Ablation studies reveal that this performance stems from the synergistic interplay of the architecture's core components. TritonCast thus offers a promising pathway towards a new generation of trustworthy, AI-driven simulations. This significant advance has the potential to accelerate discovery in climate and Earth system science, enabling more reliable long-term forecasting and deeper insights into complex geophysical dynamics.
CVFeb 7, 2025
PoI: A Filter to Extract Pixel of Interest from Novel View Synthesis for Scene Coordinate RegressionFeifei Li, Qi Song, Chi Zhang et al.
Novel View synthesis (NVS) techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), can augment camera pose estimation by extending training data with rendered images. However, the images rendered by these methods are often plagued by blurring, undermining their reliability as training data for camera pose estimation. This limitation is particularly critical for Scene Coordinate Regression (SCR) methods, which aim at pixel-level 3D coordinate estimation, because rendering artifacts directly lead to estimation inaccuracies. To address this challenge, we propose a dual-criteria filtering mechanism that dynamically identifies and discards suboptimal pixels during training. The dual-criteria filter evaluates two concurrent metrics: (1) real-time SCR reprojection error, and (2) gradient threshold, across the coordinate regression domain. In addition, for visual localization problems in sparse input scenarios, it will be even more necessary to use data generated by NVS to assist the localization task. We design a coarse-to-fine PoI variant using sparse input NVS to solve this problem. Experiments across indoor and outdoor benchmarks confirm our method's efficacy. It achieves state-of-the-art localization accuracy while maintaining computational efficiency.
CVAug 13, 2024
Divide and Conquer: Improving Multi-Camera 3D Perception with 2D Semantic-Depth Priors and Input-Dependent QueriesQi Song, Qingyong Hu, Chi Zhang et al.
3D perception tasks, such as 3D object detection and Bird's-Eye-View (BEV) segmentation using multi-camera images, have drawn significant attention recently. Despite the fact that accurately estimating both semantic and 3D scene layouts are crucial for this task, existing techniques often neglect the synergistic effects of semantic and depth cues, leading to the occurrence of classification and position estimation errors. Additionally, the input-independent nature of initial queries also limits the learning capacity of Transformer-based models. To tackle these challenges, we propose an input-aware Transformer framework that leverages Semantics and Depth as priors (named SDTR). Our approach involves the use of an S-D Encoder that explicitly models semantic and depth priors, thereby disentangling the learning process of object categorization and position estimation. Moreover, we introduce a Prior-guided Query Builder that incorporates the semantic prior into the initial queries of the Transformer, resulting in more effective input-aware queries. Extensive experiments on the nuScenes and Lyft benchmarks demonstrate the state-of-the-art performance of our method in both 3D object detection and BEV segmentation tasks.
CVMay 10, 2023
Generative Steganographic FlowPing Wei, Ge Luo, Qi Song et al.
Generative steganography (GS) is a new data hiding manner, featuring direct generation of stego media from secret data. Existing GS methods are generally criticized for their poor performances. In this paper, we propose a novel flow based GS approach -- Generative Steganographic Flow (GSF), which provides direct generation of stego images without cover image. We take the stego image generation and secret data recovery process as an invertible transformation, and build a reversible bijective mapping between input secret data and generated stego images. In the forward mapping, secret data is hidden in the input latent of Glow model to generate stego images. By reversing the mapping, hidden data can be extracted exactly from generated stego images. Furthermore, we propose a novel latent optimization strategy to improve the fidelity of stego images. Experimental results show our proposed GSF has far better performances than SOTA works.
CVFeb 27, 2022
Synergistic Network Learning and Label Correction for Noise-robust Image ClassificationChen Gong, Kong Bin, Eric J. Seibel et al.
Large training datasets almost always contain examples with inaccurate or incorrect labels. Deep Neural Networks (DNNs) tend to overfit training label noise, resulting in poorer model performance in practice. To address this problem, we propose a robust label correction framework combining the ideas of small loss selection and noise correction, which learns network parameters and reassigns ground truth labels iteratively. Taking the expertise of DNNs to learn meaningful patterns before fitting noise, our framework first trains two networks over the current dataset with small loss selection. Based on the classification loss and agreement loss of two networks, we can measure the confidence of training data. More and more confident samples are selected for label correction during the learning process. We demonstrate our method on both synthetic and real-world datasets with different noise types and rates, including CIFAR-10, CIFAR-100 and Clothing1M, where our method outperforms the baseline approaches.
LGJan 13, 2022
Recursive Least Squares for Training and Pruning Convolutional Neural NetworksTianzong Yu, Chunyuan Zhang, Yuan Wang et al.
Convolutional neural networks (CNNs) have succeeded in many practical applications. However, their high computation and storage requirements often make them difficult to deploy on resource-constrained devices. In order to tackle this issue, many pruning algorithms have been proposed for CNNs, but most of them can't prune CNNs to a reasonable level. In this paper, we propose a novel algorithm for training and pruning CNNs based on the recursive least squares (RLS) optimization. After training a CNN for some epochs, our algorithm combines inverse input autocorrelation matrices and weight matrices to evaluate and prune unimportant input channels or nodes layer by layer. Then, our algorithm will continue to train the pruned network, and won't do the next pruning until the pruned network recovers the full performance of the old network. Besides for CNNs, the proposed algorithm can be used for feedforward neural networks (FNNs). Three experiments on MNIST, CIFAR-10 and SVHN datasets show that our algorithm can achieve the more reasonable pruning and have higher learning efficiency than other four popular pruning algorithms.
LGJan 13, 2022
Recursive Least Squares Policy Control with Echo State NetworkChunyuan Zhang, Chao Liu, Qi Song et al.
The echo state network (ESN) is a special type of recurrent neural networks for processing the time-series dataset. However, limited by the strong correlation among sequential samples of the agent, ESN-based policy control algorithms are difficult to use the recursive least squares (RLS) algorithm to update the ESN's parameters. To solve this problem, we propose two novel policy control algorithms, ESNRLS-Q and ESNRLS-Sarsa. Firstly, to reduce the correlation of training samples, we use the leaky integrator ESN and the mini-batch learning mode. Secondly, to make RLS suitable for training ESN in mini-batch mode, we present a new mean-approximation method for updating the RLS correlation matrix. Thirdly, to prevent ESN from over-fitting, we use the L1 regularization technique. Lastly, to prevent the target state-action value from overestimation, we employ the Mellowmax method. Simulation results show that our algorithms have good convergence performance.
IVDec 14, 2021
Stochastic Planner-Actor-Critic for Unsupervised Deformable Image RegistrationZiwei Luo, Jing Hu, Xin Wang et al.
Large deformations of organs, caused by diverse shapes and nonlinear shape changes, pose a significant challenge for medical image registration. Traditional registration methods need to iteratively optimize an objective function via a specific deformation model along with meticulous parameter tuning, but which have limited capabilities in registering images with large deformations. While deep learning-based methods can learn the complex mapping from input images to their respective deformation field, it is regression-based and is prone to be stuck at local minima, particularly when large deformations are involved. To this end, we present Stochastic Planner-Actor-Critic (SPAC), a novel reinforcement learning-based framework that performs step-wise registration. The key notion is warping a moving image successively by each time step to finally align to a fixed image. Considering that it is challenging to handle high dimensional continuous action and state spaces in the conventional reinforcement learning (RL) framework, we introduce a new concept `Plan' to the standard Actor-Critic model, which is of low dimension and can facilitate the actor to generate a tractable high dimensional action. The entire framework is based on unsupervised training and operates in an end-to-end manner. We evaluate our method on several 2D and 3D medical image datasets, some of which contain large deformations. Our empirical results highlight that our work achieves consistent, significant gains and outperforms state-of-the-art methods.
CVDec 14, 2021
Stochastic Actor-Executor-Critic for Image-to-Image TranslationZiwei Luo, Jing Hu, Xin Wang et al.
Training a model-free deep reinforcement learning model to solve image-to-image translation is difficult since it involves high-dimensional continuous state and action spaces. In this paper, we draw inspiration from the recent success of the maximum entropy reinforcement learning framework designed for challenging continuous control problems to develop stochastic policies over high dimensional continuous spaces including image representation, generation, and control simultaneously. Central to this method is the Stochastic Actor-Executor-Critic (SAEC) which is an off-policy actor-critic model with an additional executor to generate realistic images. Specifically, the actor focuses on the high-level representation and control policy by a stochastic latent action, as well as explicitly directs the executor to generate low-level actions to manipulate the state. Experiments on several image-to-image translation tasks have demonstrated the effectiveness and robustness of the proposed SAEC when facing high-dimensional continuous space problems.
CVDec 8, 2021
Fully Attentional Network for Semantic SegmentationQi Song, Jie Li, Chenghong Li et al.
Recent non-local self-attention methods have proven to be effective in capturing long-range dependencies for semantic segmentation. These methods usually form a similarity map of RC*C (by compressing spatial dimensions) or RHW*HW (by compressing channels) to describe the feature relations along either channel or spatial dimensions, where C is the number of channels, H and W are the spatial dimensions of the input feature map. However, such practices tend to condense feature dependencies along the other dimensions,hence causing attention missing, which might lead to inferior results for small/thin categories or inconsistent segmentation inside large objects. To address this problem, we propose anew approach, namely Fully Attentional Network (FLANet),to encode both spatial and channel attentions in a single similarity map while maintaining high computational efficiency. Specifically, for each channel map, our FLANet can harvest feature responses from all other channel maps, and the associated spatial positions as well, through a novel fully attentional module. Our new method has achieved state-of-the-art performance on three challenging semantic segmentation datasets,i.e., 83.6%, 46.99%, and 88.5% on the Cityscapes test set,the ADE20K validation set, and the PASCAL VOC test set,respectively.
SDNov 15, 2021
Time-Frequency Attention for Monaural Speech EnhancementQiquan Zhang, Qi Song, Zhaoheng Ni et al.
Most studies on speech enhancement generally don't consider the energy distribution of speech in time-frequency (T-F) representation, which is important for accurate prediction of mask or spectra. In this paper, we present a simple yet effective T-F attention (TFA) module, where a 2-D attention map is produced to provide differentiated weights to the spectral components of T-F representation. To validate the effectiveness of our proposed TFA module, we use the residual temporal convolution network (ResTCN) as the backbone network and conduct extensive experiments on two commonly used training targets. Our experiments demonstrate that applying our TFA module significantly improves the performance in terms of five objective evaluation metrics with negligible parameter overhead. The evaluation results show that the proposed ResTCN with the TFA module (ResTCN+TFA) consistently outperforms other baselines by a large margin.
CVOct 27, 2021
Denoised Non-Local Neural Network for Semantic SegmentationQi Song, Jie Li, Hao Guo et al.
The non-local network has become a widely used technique for semantic segmentation, which computes an attention map to measure the relationships of each pixel pair. However, most of the current popular non-local models tend to ignore the phenomenon that the calculated attention map appears to be very noisy, containing inter-class and intra-class inconsistencies, which lowers the accuracy and reliability of the non-local methods. In this paper, we figuratively denote these inconsistencies as attention noises and explore the solutions to denoise them. Specifically, we inventively propose a Denoised Non-Local Network (Denoised NL), which consists of two primary modules, i.e., the Global Rectifying (GR) block and the Local Retention (LR) block, to eliminate the inter-class and intra-class noises respectively. First, GR adopts the class-level predictions to capture a binary map to distinguish whether the selected two pixels belong to the same category. Second, LR captures the ignored local dependencies and further uses them to rectify the unwanted hollows in the attention map. The experimental results on two challenging semantic segmentation datasets demonstrate the superior performance of our model. Without any external training data, our proposed Denoised NL can achieve the state-of-the-art performance of 83.5\% and 46.69\% mIoU on Cityscapes and ADE20K, respectively.
ROSep 11, 2021
Autonomous Underwater Vehicle-Manipulator Systems Path Planning with RRTAUVMS AlgorithmXiaoxu Cao, Linyi Gu, JunChen Mu et al.
Autonomous Underwater Vehicle-Manipulator systems (AUVMS) is a new tool for ocean exploration, the AUVMS path planning problem is addressed in this paper. AUVMS is a high dimension system with a large difference in inertia distribution, also it works in a complex environment with obstacles. By integrating the rapidly-exploring random tree(RRT) algorithm with the AUVMS kinematics model, the proposed RRTAUVMS algorithm could randomly sample in the configuration space(C-Space), and also grow the tree directly towards the workspace goal in the task space. The RRTAUVMS can also deal with the redundant mapping of workspace planning goal and configuration space goal. Compared with the traditional RRT algorithm, the efficiency of the AUVMS path planning can be significantly improved.
LGSep 7, 2021
Revisiting Recursive Least Squares for Training Deep Neural NetworksChunyuan Zhang, Qi Song, Hui Zhou et al.
Recursive least squares (RLS) algorithms were once widely used for training small-scale neural networks, due to their fast convergence. However, previous RLS algorithms are unsuitable for training deep neural networks (DNNs), since they have high computational complexity and too many preconditions. In this paper, to overcome these drawbacks, we propose three novel RLS optimization algorithms for training feedforward neural networks, convolutional neural networks and recurrent neural networks (including long short-term memory networks), by using the error backpropagation and our average-approximation RLS method, together with the equivalent gradients of the linear least squares loss function with respect to the linear outputs of hidden layers. Compared with previous RLS optimization algorithms, our algorithms are simple and elegant. They can be viewed as an improved stochastic gradient descent (SGD) algorithm, which uses the inverse autocorrelation matrix of each layer as the adaptive learning rate. Their time and space complexities are only several times those of SGD. They only require the loss function to be the mean squared error and the activation function of the output layer to be invertible. In fact, our algorithms can be also used in combination with other first-order optimization algorithms without requiring these two preconditions. In addition, we present two improved methods for our algorithms. Finally, we demonstrate their effectiveness compared to the Adam algorithm on MNIST, CIFAR-10 and IMDB datasets, and investigate the influences of their hyperparameters experimentally.
CVJun 3, 2021
Transferable Adversarial Examples for Anchor Free Object DetectionQuanyu Liao, Xin Wang, Bin Kong et al.
Deep neural networks have been demonstrated to be vulnerable to adversarial attacks: subtle perturbation can completely change prediction result. The vulnerability has led to a surge of research in this direction, including adversarial attacks on object detection networks. However, previous studies are dedicated to attacking anchor-based object detectors. In this paper, we present the first adversarial attack on anchor-free object detectors. It conducts category-wise, instead of previously instance-wise, attacks on object detectors, and leverages high-level semantic information to efficiently generate transferable adversarial examples, which can also be transferred to attack other object detectors, even anchor-based detectors such as Faster R-CNN. Experimental results on two benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance and transferability.
CVJun 3, 2021
Imperceptible Adversarial Examples for Fake Image DetectionQuanyu Liao, Yuezun Li, Xin Wang et al.
Fooling people with highly realistic fake images generated with Deepfake or GANs brings a great social disturbance to our society. Many methods have been proposed to detect fake images, but they are vulnerable to adversarial perturbations -- intentionally designed noises that can lead to the wrong prediction. Existing methods of attacking fake image detectors usually generate adversarial perturbations to perturb almost the entire image. This is redundant and increases the perceptibility of perturbations. In this paper, we propose a novel method to disrupt the fake image detection by determining key pixels to a fake image detector and attacking only the key pixels, which results in the $L_0$ and the $L_2$ norms of adversarial perturbations much less than those of existing works. Experiments on two public datasets with three fake image detectors indicate that our proposed method achieves state-of-the-art performance in both white-box and black-box attacks.
CVMar 10, 2021
AttaNet: Attention-Augmented Network for Fast and Accurate Scene ParsingQi Song, Kangfu Mei, Rui Huang
Two factors have proven to be very important to the performance of semantic segmentation models: global context and multi-level semantics. However, generating features that capture both factors always leads to high computational complexity, which is problematic in real-time scenarios. In this paper, we propose a new model, called Attention-Augmented Network (AttaNet), to capture both global context and multilevel semantics while keeping the efficiency high. AttaNet consists of two primary modules: Strip Attention Module (SAM) and Attention Fusion Module (AFM). Viewing that in challenging images with low segmentation accuracy, there are a significantly larger amount of vertical strip areas than horizontal ones, SAM utilizes a striping operation to reduce the complexity of encoding global context in the vertical direction drastically while keeping most of contextual information, compared to the non-local approaches. Moreover, AFM follows a cross-level aggregation strategy to limit the computation, and adopts an attention strategy to weight the importance of different levels of features at each pixel when fusing them, obtaining an efficient multi-level representation. We have conducted extensive experiments on two semantic segmentation benchmarks, and our network achieves different levels of speed/accuracy trade-offs on Cityscapes, e.g., 71 FPS/79.9% mIoU, 130 FPS/78.5% mIoU, and 180 FPS/70.1% mIoU, and leading performance on ADE20K as well.
CVOct 27, 2020
Fast Local Attack: Generating Local Adversarial Examples for Object DetectorsQuanyu Liao, Xin Wang, Bin Kong et al.
The deep neural network is vulnerable to adversarial examples. Adding imperceptible adversarial perturbations to images is enough to make them fail. Most existing research focuses on attacking image classifiers or anchor-based object detectors, but they generate globally perturbation on the whole image, which is unnecessary. In our work, we leverage higher-level semantic information to generate high aggressive local perturbations for anchor-free object detectors. As a result, it is less computationally intensive and achieves a higher black-box attack as well as transferring attack performance. The adversarial examples generated by our method are not only capable of attacking anchor-free object detectors, but also able to be transferred to attack anchor-based object detector.
CVApr 5, 2020
ReADS: A Rectified Attentional Double Supervised Network for Scene Text RecognitionQi Song, Qianyi Jiang, Nan Li et al.
In recent years, scene text recognition is always regarded as a sequence-to-sequence problem. Connectionist Temporal Classification (CTC) and Attentional sequence recognition (Attn) are two very prevailing approaches to tackle this problem while they may fail in some scenarios respectively. CTC concentrates more on every individual character but is weak in text semantic dependency modeling. Attn based methods have better context semantic modeling ability while tends to overfit on limited training data. In this paper, we elaborately design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition. To overcome the weakness of CTC and Attn, both of them are applied in our method but with different modules in two supervised branches which can make a complementary to each other. Moreover, effective spatial and channel attention mechanisms are introduced to eliminate background noise and extract valid foreground information. Finally, a simple rectified network is implemented to rectify irregular text. The ReADS can be trained end-to-end and only word-level annotations are required. Extensive experiments on various benchmarks verify the effectiveness of ReADS which achieves state-of-the-art performance.
CVFeb 10, 2020
Category-wise Attack: Transferable Adversarial Examples for Anchor Free Object DetectionQuanyu Liao, Xin Wang, Bin Kong et al.
Deep neural networks have been demonstrated to be vulnerable to adversarial attacks: subtle perturbations can completely change the classification results. Their vulnerability has led to a surge of research in this direction. However, most works dedicated to attacking anchor-based object detection models. In this work, we aim to present an effective and efficient algorithm to generate adversarial examples to attack anchor-free object models based on two approaches. First, we conduct category-wise instead of instance-wise attacks on the object detectors. Second, we leverage the high-level semantic information to generate the adversarial examples. Surprisingly, the generated adversarial examples it not only able to effectively attack the targeted anchor-free object detector but also to be transferred to attack other object detectors, even anchor-based detectors such as Faster R-CNN.
CVFeb 5, 2020
Domain Embedded Multi-model Generative Adversarial Networks for Image-based Face InpaintingXian Zhang, Xin Wang, Bin Kong et al.
Prior knowledge of face shape and structure plays an important role in face inpainting. However, traditional face inpainting methods mainly focus on the generated image resolution of the missing portion without consideration of the special particularities of the human face explicitly and generally produce discordant facial parts. To solve this problem, we present a domain embedded multi-model generative adversarial model for inpainting of face images with large cropped regions. We firstly represent only face regions using the latent variable as the domain knowledge and combine it with the non-face parts textures to generate high-quality face images with plausible contents. Two adversarial discriminators are finally used to judge whether the generated distribution is close to the real distribution or not. It can not only synthesize novel image structures but also explicitly utilize the embedded face domain knowledge to generate better predictions with consistency on structures and appearance. Experiments on both CelebA and CelebA-HQ face datasets demonstrate that our proposed approach achieved state-of-the-art performance and generates higher quality inpainting results than existing ones.
CVJan 29, 2020
Robust Multimodal Image Registration Using Deep Recurrent Reinforcement LearningShanhui Sun, Jing Hu, Mingqing Yao et al.
The crucial components of a conventional image registration method are the choice of the right feature representations and similarity measures. These two components, although elaborately designed, are somewhat handcrafted using human knowledge. To this end, these two components are tackled in an end-to-end manner via reinforcement learning in this work. Specifically, an artificial agent, which is composed of a combined policy and value network, is trained to adjust the moving image toward the right direction. We train this network using an asynchronous reinforcement learning algorithm, where a customized reward function is also leveraged to encourage robust image registration. This trained network is further incorporated with a lookahead inference to improve the registration capability. The advantage of this algorithm is fully demonstrated by our superior performance on clinical MR and CT image pairs to other state-of-the-art medical image registration methods.
CVDec 20, 2019
ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on SignboardXi Liu, Rui Zhang, Yongsheng Zhou et al.
Chinese scene text reading is one of the most challenging problems in computer vision and has attracted great interest. Different from English text, Chinese has more than 6000 commonly used characters and Chinesecharacters can be arranged in various layouts with numerous fonts. The Chinese signboards in street view are a good choice for Chinese scene text images since they have different backgrounds, fonts and layouts. We organized a competition called ICDAR2019-ReCTS, which mainly focuses on reading Chinese text on signboard. This report presents the final results of the competition. A large-scale dataset of 25,000 annotated signboard images, in which all the text lines and characters are annotated with locations and transcriptions, were released. Four tasks, namely character recognition, text line recognition, text line detection and end-to-end recognition were set up. Besides, considering the Chinese text ambiguity issue, we proposed a multi ground truth (multi-GT) evaluation method to make evaluation fairer. The competition started on March 1, 2019 and ended on April 30, 2019. 262 submissions from 46 teams are received. Most of the participants come from universities, research institutes, and tech companies in China. There are also some participants from the United States, Australia, Singapore, and Korea. 21 teams submit results for Task 1, 23 teams submit results for Task 2, 24 teams submit results for Task 3, and 13 teams submit results for Task 4. The official website for the competition is http://rrc.cvc.uab.es/?ch=12.
CVMar 25, 2019
DeepCenterline: a Multi-task Fully Convolutional Network for Centerline ExtractionZhihui Guo, Junjie Bai, Yi Lu et al.
A novel centerline extraction framework is reported which combines an end-to-end trainable multi-task fully convolutional network (FCN) with a minimal path extractor. The FCN simultaneously computes centerline distance maps and detects branch endpoints. The method generates single-pixel-wide centerlines with no spurious branches. It handles arbitrary tree-structured object with no prior assumption regarding depth of the tree or its bifurcation pattern. It is also robust to substantial scale changes across different parts of the target object and minor imperfections of the object's segmentation mask. To the best of our knowledge, this is the first deep-learning based centerline extraction method that guarantees single-pixel-wide centerline for a complex tree-structured object. The proposed method is validated in coronary artery centerline extraction on a dataset of 620 patients (400 of which used as test set). This application is challenging due to the large number of coronary branches, branch tortuosity, and large variations in length, thickness, shape, etc. The proposed method generates well-positioned centerlines, exhibiting lower number of missing branches and is more robust in the presence of minor imperfections of the object segmentation mask. Compared to a state-of-the-art traditional minimal path approach, our method improves patient-level success rate of centerline extraction from 54.3% to 88.8% according to independent human expert review.
IRMar 18, 2019
POI Semantic Model with a Deep Convolutional StructureJi Zhao, Meiyu Yu, Huan Chen et al.
When using the electronic map, POI retrieval is the initial and important step, whose quality directly affects the user experience. Similarity between user query and POI information is the most critical feature in POI retrieval. An accurate similarity calculation is challenging since the mismatch between a query and a retrieval text may exist in the case of a mistyped query or an alias inquiry. In this paper, we propose a POI latent semantic model based on deep networks, which can effectively extract query features and POI information features for the similarity calculation. Our model describes the semantic information of complex texts at multiple layers, and achieves multi-field matches by modeling POI's name and detailed address respectively. Our model is evaluated by the POI retrieval ranking datasets, including the labeled data of relevance and real-world user click data in POI retrieval. Results show that our model significantly outperforms our competitors in POI retrieval ranking tasks. The proposed algorithm has become a critical component of an online system serving millions of people everyday.
CVJan 29, 2019
Attention-driven Tree-structured Convolutional LSTM for High Dimensional Data UnderstandingBin Kong, Xin Wang, Junjie Bai et al.
Modeling the sequential information of image sequences has been a vital step of various vision tasks and convolutional long short-term memory (ConvLSTM) has demonstrated its superb performance in such spatiotemporal problems. Nevertheless, the hierarchical data structures in a significant amount of tasks (e.g., human body parts and vessel/airway tree in biomedical images) cannot be properly modeled by sequential models. Thus, ConvLSTM is not suitable for tree-structured image data analysis. In order to address these limitations, we present tree-structured ConvLSTM models for tree-structured image analysis tasks which can be trained end-to-end. To demonstrate the effectiveness of the proposed tree-structured ConvLSTM model, we present a tree-structured segmentation framework which consists of a tree-structured ConvLSTM and an attention fully convolutional network (FCN) model. The proposed framework is extensively validated on four large-scale coronary artery datasets. The results demonstrate the effectiveness and efficiency of the proposed method.
CVJan 2, 2019
Flow Based Self-supervised Pixel Embedding for Image SegmentationBin Ma, Shubao Liu, Yingxuan Zhi et al.
We propose a new self-supervised approach to image feature learning from motion cue. This new approach leverages recent advances in deep learning in two directions: 1) the success of training deep neural network in estimating optical flow in real data using synthetic flow data; and 2) emerging work in learning image features from motion cues, such as optical flow. Building on these, we demonstrate that image features can be learned in self-supervision by first training an optical flow estimator with synthetic flow data, and then learning image features from the estimated flows in real motion data. We demonstrate and evaluate this approach on an image segmentation task. Using the learned image feature representation, the network performs significantly better than the ones trained from scratch in few-shot segmentation tasks.