Dongchen Lu

CV
h-index4
3papers
23citations
Novelty60%
AI Score43

3 Papers

CVMay 30
DeepLatent: Think with Images via Parallel Latent Visual Reasoning

Dongchen Lu, Zhimo Li, Mao Shu et al.

The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.

CVMar 27, 2025
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression

Dongchen Lu, Yuyao Sun, Zilu Zhang et al.

Most multimodal large language models (MLLMs) treat visual tokens as "a sequence of text", integrating them with text tokens into a large language model (LLM). However, a great quantity of visual tokens significantly increases the demand for computational resources and time. In this paper, we propose InternVL-X, which outperforms the InternVL model in both performance and efficiency by incorporating three visual token compression methods. First, we propose a novel vision-language projector, PVTC. This component integrates adjacent visual embeddings to form a local query and utilizes the transformed CLS token as a global query, then performs point-to-region cross-attention through these local and global queries to more effectively convert visual features. Second, we present a layer-wise visual token compression module, LVTC, which compresses tokens in the LLM shallow layers and then expands them through upsampling and residual connections in the deeper layers. This significantly enhances the model computational efficiency. Futhermore, we propose an efficient high resolution slicing method, RVTC, which dynamically adjusts the number of visual tokens based on image area or length filtering. RVTC greatly enhances training efficiency with only a slight reduction in performance. By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.

CVApr 18, 2021
OSKDet: Towards Orientation-sensitive Keypoint Localization for Rotated Object Detection

Dongchen Lu

Rotated object detection is a challenging issue of computer vision field. Loss of spatial information and confusion of parametric order have been the bottleneck for rotated detection accuracy. In this paper, we propose an orientation-sensitive keypoint based rotated detector OSKDet. We adopt a set of keypoints to characterize the target and predict the keypoint heatmap on ROI to form a rotated target. By proposing the orientation-sensitive heatmap, OSKDet could learn the shape and direction of rotated target implicitly and has stronger modeling capabilities for target representation, which improves the localization accuracy and acquires high quality detection results. To extract highly effective features at border areas, we design a rotation-aware deformable convolution module. Furthermore, we explore a new keypoint reorder algorithm and feature fusion module based on the angle distribution to eliminate the confusion of keypoint order. Experimental results on several public benchmarks show the state-of-the-art performance of OSKDet. Specifically, we achieve an AP of 77.81% on DOTA, 89.91% on HRSC2016, and 97.18% on UCAS-AOD, respectively.