72.6 CV Jun 2
Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation
Dongsheng Wang, Dawei Su, Hui Huang
Recently, zero-shot 3D scene understanding via 2D Vision-Language Models (VLMs) has gained increasing research interest due to their promising spatial reasoning capabilities. Typically, multiple 2D views are sampled from a 3D point cloud and fed into pre-trained VLMs to answer a given question. This paradigm highlights the critical role of input context quality and raises the challenge of retaining as many task-relevant 3D details as possible under a limited input budget. We propose KeyVT, a hierarchical approach for input context collection at both the view and token levels. Specifically, we combine pixel features with camera parameters and assess view importance based on both semantic content and geometric position, resulting in spatially consistent and task-relevant views. Furthermore, we address redundancy among patches across selected views by identifying representative tokens under the optimal transport (OT) framework, where view tokens and key tokens are formulated as two discrete distributions in the embedding space. These key tokens are expected to cover all view features by minimizing the OT distance. We evaluate our framework on three widely used benchmarks, demonstrating significant improvements over existing tuning-free methods and performance comparable to training-based approaches.
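To illustrate the idea that key tokens should "cover" the view tokens under an OT distance, here is a minimal sketch, not the KeyVT implementation. It assumes uniform weights on both token sets, a squared Euclidean cost, and an entropic (Sinkhorn) solver; the abstract specifies none of these, and the function name `sinkhorn_ot` and the toy 2D "tokens" are hypothetical.

```python
import math

def sinkhorn_ot(view_tokens, key_tokens, eps=0.1, iters=200):
    """Entropic OT between two uniform-weight point sets (illustrative assumption)."""
    n, m = len(view_tokens), len(key_tokens)
    # Squared Euclidean cost between every view token and every key token.
    cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in key_tokens]
            for x in view_tokens]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass on view tokens (assumption)
    b = [1.0 / m] * m  # uniform mass on key tokens (assumption)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternate scaling so the plan matches the row and column marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    distance = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return distance, plan

# Two clusters of view tokens; key sets that cover both clusters vs. only one.
views = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]]
spread_keys = [[0.0, 0.05], [1.0, 0.95]]
clumped_keys = [[0.0, 0.05], [0.0, 0.0]]
d_spread, _ = sinkhorn_ot(views, spread_keys)
d_clumped, _ = sinkhorn_ot(views, clumped_keys)
print(d_spread < d_clumped)  # the covering key set has the smaller OT distance
```

A key set that sits in only one cluster must transport mass across the gap, so its distance is larger; minimizing the OT distance therefore favors keys that represent all of the view features.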
IR Feb 25 Code
RETLLM: Training and Data-Free MLLMs for Multimodal Information Retrieval
Dawei Su, Dongsheng Wang
Multimodal information retrieval (MMIR) has gained attention for its flexibility in handling text, images, or mixed queries and candidates. Recent breakthroughs in multimodal large language models (MLLMs) boost MMIR performance by incorporating MLLM knowledge under the contrastive finetuning framework. However, they suffer from pre-training inconsistency and require large datasets. In this work, we introduce a novel framework, RetLLM, designed to query MLLMs for MMIR in a training- and data-free manner. Specifically, we formulate MMIR as a similarity score generation task and prompt MLLMs to directly predict retrieval scores in a coarse-then-fine pipeline. At the coarse stage, a top-k filtering strategy builds a small yet high-quality candidate pool for each query, enabling MLLMs to focus on semantically relevant candidates. Subsequently, the retrieval score is predicted by feeding both the query and candidate into MLLMs at the fine stage. Importantly, we propose a visual enhancement module during reasoning to help MLLMs re-pick forgotten visuals, improving retrieval. Extensive experiments on MMIR benchmarks show that RetLLM outperforms fine-tuned models. Ablation studies further verify each component. Our work demonstrates that MLLMs can achieve strong MMIR performance without any training, highlighting their inherent multimodal reasoning ability in a simple, scalable framework. We release our code at: https://github.com/alivecat05/RETLLM
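The coarse-then-fine pipeline described above can be sketched as follows. This is a hedged illustration, not the RetLLM code: it assumes the coarse stage ranks candidates by cosine similarity of precomputed embeddings, and it stands in for the MLLM with a hypothetical `fine_score` callable that returns a score for a candidate id.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def coarse_then_fine(query_vec, candidates, fine_score, k=2):
    """Keep the top-k candidates by embedding similarity, then rerank them.

    candidates: dict mapping candidate id -> embedding (assumed precomputed).
    fine_score: hypothetical callable standing in for an MLLM-predicted score.
    """
    # Coarse stage: cheap similarity filter builds a small candidate pool.
    pool = sorted(candidates,
                  key=lambda cid: cosine(query_vec, candidates[cid]),
                  reverse=True)[:k]
    # Fine stage: the expensive scorer only sees the filtered pool.
    return sorted(pool, key=fine_score, reverse=True)

candidates = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
stub_scores = {"a": 0.2, "b": 0.9, "c": 1.0}  # stand-in for MLLM outputs
ranking = coarse_then_fine([1.0, 0.0], candidates, stub_scores.get, k=2)
print(ranking)  # -> ['b', 'a']
```

Candidate `c` never reaches the fine stage even though the stub scorer rates it highest, which shows why the quality of the coarse pool matters.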
66.4 LG Apr 16
Improving Sparse Autoencoder with Dynamic Attention
Dongsheng Wang, Jinsen Zhang, Dawei Su et al.
Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for each neuron remains challenging in practice: excessive sparsity can lead to poor reconstruction, whereas insufficient sparsity may harm interpretability. While existing activation functions such as ReLU and TopK provide certain sparsity guarantees, they typically require additional sparsity regularization or cherry-picked hyperparameters. We show in this paper that dynamically sparse attention mechanisms using sparsemax can bridge this trade-off, owing to their ability to determine the number of activations in a data-dependent manner. Specifically, we first explore a new class of SAEs based on the cross-attention architecture, with the latent features as queries and the learnable dictionary as the key and value matrices. To encourage sparse pattern learning, we employ a sparsemax-based attention strategy that automatically infers a sparse set of elements according to the complexity of each neuron, resulting in a more flexible and general activation function. Through comprehensive evaluation and visualization, we show that our approach achieves lower reconstruction loss while producing high-quality concepts, particularly in top-n classification tasks.
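For readers unfamiliar with sparsemax, the sketch below shows the standard sort-and-threshold computation and how it could serve as the attention activation over a dictionary. It is an illustration under assumptions, not the paper's architecture: the dot-product scoring and the helper name `sparse_attention_reconstruct` are hypothetical, and there is no training loop.

```python
def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex."""
    z_sorted = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumulative += z
        # The support condition holds for k = 1..k* and fails afterwards,
        # so the last update leaves tau at the threshold for the full support.
        if 1.0 + k * z > cumulative:
            tau = (cumulative - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]

def sparse_attention_reconstruct(query, keys, values):
    """Attend from one latent query over dictionary keys; mix the values.

    Dot-product scores are an assumption made for this sketch.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = sparsemax(scores)
    dim = len(values[0])
    recon = [sum(w * val[d] for w, val in zip(weights, values)) for d in range(dim)]
    return weights, recon

print(sparsemax([2.0, 1.0, 0.1]))  # -> [1.0, 0.0, 0.0]
```

Unlike TopK, the number of nonzero weights is not fixed in advance: well-separated scores collapse onto a single dictionary entry, while closer scores spread mass over several entries.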
SY Aug 24, 2017
Online Coherence Identification Using Dynamic Time Warping for Controlled Islanding
Hasan Ul Banna, Zhe Yu, Di Shi et al.
Controlled islanding is considered to be the last countermeasure to prevent system-wide blackouts in case of cascading failures. It splits the system into self-sustained islands to maintain transient stability at the expense of possible loss of load. Generator coherence identification is critical to a controlled islanding scheme, as it helps identify the optimal cut-set to maintain system transient stability. This paper presents a novel approach for online generator coherence identification using phasor measurement unit (PMU) data and dynamic time warping (DTW). Results from the coherence identification are used to further cluster non-generator buses using spectral clustering with the objective of minimizing power flow disruption. The proposed approach is validated and compared to existing methods on the IEEE 39-bus system, through which its advantages are demonstrated.
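As a rough illustration of how DTW can compare generator trajectories, the sketch below implements the textbook DTW recurrence and a simple threshold grouping. The grouping rule, the function `group_coherent`, and the toy signals are assumptions for illustration; the abstract does not describe how DTW distances are turned into coherent groups.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def group_coherent(signals, threshold):
    """Greedy grouping (hypothetical rule): join the first group whose
    representative is within `threshold` DTW distance, else start a new group."""
    groups = []
    for name, series in signals.items():
        for group in groups:
            if dtw_distance(series, signals[group[0]]) <= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Toy trajectories: G2 is a delayed copy of G1, G3 swings the opposite way.
signals = {
    "G1": [0.0, 1.0, 2.0, 3.0],
    "G2": [0.0, 0.0, 1.0, 2.0, 3.0],
    "G3": [0.0, -1.0, -2.0, -3.0],
}
print(group_coherent(signals, threshold=0.5))  # -> [['G1', 'G2'], ['G3']]
```

Because DTW aligns samples elastically in time, the delayed copy `G2` has zero distance to `G1`, which a point-by-point comparison would not give.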
SY Jun 14, 2017
PMU Assisted Power System Parameter Calibration at Jiangsu Electric Power Company
Xiao Lu, Di Shi, Bin Zhu et al.
An online PMU-assisted Power System Parameter Calibration System (PSPCS) was recently developed and implemented at State Grid Jiangsu Electric Power Company (JEPC). PSPCS leverages high-resolution PMU data and data mining techniques to perform online screening of the EMS and Production Management System (PMS) databases for data cleaning, model validation, and parameter calibration. PSPCS regularly calculates transmission line and generator parameters in real time and compares the results with the databases to identify any records with significant discrepancies. Once a consistent discrepancy is observed, the system raises a flag and further investigation is initiated, including a novel density-based spatial clustering procedure for parameter/data calibration. A novel metric is proposed to quantify the credibility of PMU-based parameter identification. This paper discusses the proposed methodologies, the challenges, and the implementation issues identified during the development and deployment of PSPCS.
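To make the "calculate, then compare with the database" step concrete, here is a minimal sketch that assumes a nominal pi line model and two-ended synchronized phasors. This is a textbook formulation, not necessarily the method used in PSPCS; the function names and the per-unit values are hypothetical. With both currents measured flowing into the line, I_s = (V_s - V_r)/Z + V_s·Y/2 and I_r = (V_r - V_s)/Z + V_r·Y/2, which can be solved for Z and Y.

```python
def pi_line_parameters(v_s, i_s, v_r, i_r):
    """Series impedance Z and total shunt admittance Y of a nominal pi line.

    Assumes complex phasors at both ends, with i_s and i_r both defined
    as flowing INTO the line (sign convention chosen for this sketch).
    """
    z = (v_s ** 2 - v_r ** 2) / (i_s * v_r - i_r * v_s)
    y = 2 * (i_s + i_r) / (v_s + v_r)
    return z, y

def relative_discrepancy(estimated, recorded):
    """Relative gap between an estimated parameter and the database record."""
    return abs(estimated - recorded) / abs(recorded)

# Synthesize phasors from known parameters, then recover them.
z_true = complex(0.01, 0.1)    # per-unit series impedance (made-up value)
y_true = complex(0.0, 0.02)    # per-unit shunt admittance (made-up value)
v_s = complex(1.0, 0.0)
v_r = complex(0.98, -0.05)
i_s = (v_s - v_r) / z_true + v_s * y_true / 2
i_r = (v_r - v_s) / z_true + v_r * y_true / 2

z_est, y_est = pi_line_parameters(v_s, i_s, v_r, i_r)
z_database = complex(0.01, 0.12)  # hypothetical stale database record
flagged = relative_discrepancy(z_est, z_database) > 0.05
print(flagged)  # -> True
```

In practice a single snapshot is noisy, which is presumably why the abstract stresses consistent discrepancies and a credibility metric before a record is flagged.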