CV · Aug 25, 2023 · Code
How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection
Yiyang Yao, Peng Liu, Tiancheng Zhao et al. · cmu
Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, which do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models all fail on the new tasks except for simple object types, demonstrating the value of the proposed dataset in pinpointing the weakness of current OVD models and guiding future research. Furthermore, the proposed NMS-AP metric is verified by experiments to provide a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at https://github.com/om-ai-lab/OVDEval
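The abstract does not say how NMS-AP is computed. One plausible reading, assumed here, is that a class-agnostic NMS pass runs before AP is scored, so a model cannot collect credit by emitting the same box under several fine-grained labels. The sketch below shows only that filtering step. The function names, the `(box, label, score)` tuple layout, and the 0.5 IoU threshold are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Pred = Tuple[Box, str, float]             # (box, label, score); layout is an assumption


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_nms(preds: List[Pred], iou_threshold: float = 0.5) -> List[Pred]:
    """Keep the highest-scoring box in each overlapping group, ignoring labels."""
    kept: List[Pred] = []
    for pred in sorted(preds, key=lambda p: p[2], reverse=True):
        # Labels are deliberately not compared: a duplicate box with a
        # different label is still suppressed.
        if all(iou(pred[0], k[0]) <= iou_threshold for k in kept):
            kept.append(pred)
    return kept


predictions = [
    ((10.0, 10.0, 50.0, 50.0), "red car", 0.90),
    ((10.0, 10.0, 50.0, 50.0), "blue car", 0.85),   # same box, contradictory label
    ((100.0, 100.0, 140.0, 140.0), "red car", 0.60),
]
filtered = class_agnostic_nms(predictions)
```

In this toy input, the "blue car" duplicate is dropped before any AP computation, so only one label per object can count as a hit.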
99.9 · CV · Apr 14 · Code
NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1)
Guanyi Qin, Jie Liang, Bingbing Zhang et al. · baidu
In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at https://github.com/narthchin/RAIM-PIQA, and the official homepage is accessible at https://www.codabench.org/competitions/12789/.
CV · Jul 6, 2024 · Code
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
Tiancheng Zhao, Qianqian Zhang, Kyusong Lee et al. · cmu
We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are processed, making it more efficient and adaptable. It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities. OmChat utilizes an active progressive multimodal pretraining strategy, which gradually increases the model's capacity for long contexts and enhances its overall abilities. By selecting high-quality data during training, OmChat learns from the most relevant and informative data points. With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos, outperforming most open-source models in these benchmarks. Additionally, OmChat proposes a prompting strategy for unifying complex multimodal inputs, including single-image text, multi-image text, and videos, and achieves competitive performance on single-image benchmarks. To further evaluate the model's capabilities, we propose a benchmark dataset named Temporal Visual Needle in a Haystack. This dataset assesses OmChat's ability to comprehend temporal visual details within long videos. Our analysis highlights several key factors contributing to OmChat's success: support for any-aspect high image resolution, the active progressive pretraining strategy, and high-quality supervised fine-tuning datasets. This report provides a detailed overview of OmChat's capabilities and the strategies that enhance its performance in visual understanding.
CV · Apr 10, 2025 · Code
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen, Peng Liu, Jingcheng Li et al. · cmu
Recently, DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1
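To make the idea of a rule-based reward with deterministic ground truth concrete, here is a minimal sketch for a single-box grounding task. It is not the reward used in VLM-R1. The `<answer>[x1, y1, x2, y2]</answer>` output format, the function names, and the 0.5 IoU threshold are all assumptions chosen for illustration.

```python
import re
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical output format: the model wraps its box in <answer>[...]</answer>.
ANSWER_RE = re.compile(r"<answer>\s*\[([^\]]+)\]\s*</answer>")


def parse_box(completion: str) -> Optional[Box]:
    """Extract a 4-number box from the completion, or None if malformed."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    try:
        values = [float(v) for v in match.group(1).split(",")]
    except ValueError:
        return None
    if len(values) != 4:
        return None
    return (values[0], values[1], values[2], values[3])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(completion: str, gt_box: Box, iou_threshold: float = 0.5) -> float:
    """Return 1.0 if the parsed box matches the ground truth closely enough, else 0.0."""
    box = parse_box(completion)
    if box is None:
        return 0.0  # malformed output earns nothing
    return 1.0 if iou(box, gt_box) >= iou_threshold else 0.0
```

Because the reward depends only on string parsing and box geometry, it is deterministic and needs no learned reward model, which is the property the abstract attributes to rule-based rewards.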
99.5 · IR · Apr 6
SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs
Bi Xue, Hong Wu, Lei Chen et al.
Serving deep learning based recommendation models (DLRM) at scale is challenging. Existing approaches rely on dedicated ANN indexing and filtering services on CPUs, suffering from non-negligible costs and missing co-design opportunities. Such inefficiency makes it difficult for them to support complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we present SilverTorch, a model-based serving system that brings all components into one unified model. It unifies model serving by replacing standalone indexing and filtering services with model layers. We propose a model-based GPU Bloom index for feature filtering and a fused Int8 ANN kernel for nearest neighbor search. Through co-design of the ANN search and feature filtering, we reduce GPU memory usage and eliminate computation. Benefiting from this design, we scale up retrieval by introducing an OverArch scoring layer and a multi-task retrieval with a Value Model to aggregate scores. These advancements improve the retrieval accuracy and enable future studies for serving more complex models. Our evaluation on industry-scale datasets shows that SilverTorch achieves up to 23.7× higher throughput compared to the state-of-the-art approaches. We also demonstrate that the SilverTorch solution is 13.35× more cost-efficient than a CPU-based solution while improving accuracy via serving more complex models. SilverTorch is deployed at scale, serving hundreds of models online and supporting recommendation for diverse applications.
SY · Apr 7, 2019 · Code
An Open Source Modeling Framework for Interdependent Energy-Transportation-Communication Infrastructure in Smart and Connected Communities
Xing Lu, Kathryn Hinkelman, Yangyang Fu et al.
Infrastructure in future smart and connected communities is envisioned as an aggregate of public services, including the energy, transportation and communication systems, all intertwined with each other. The intrinsic interdependency among these systems may exert underlying influence on both design and operation of the heterogeneous infrastructures. However, few prior studies have tapped into the interdependency among the three systems in order to quantify their potential impacts during standard operation. In response to this, this paper proposes an open source, flexible, integrated modeling framework suitable for designing coupled energy, transportation, and communication systems and for assessing the impact of their interdependencies. First, a novel multi-level, multi-layer, multi-agent approach is proposed to enable flexible modeling of the interconnected energy, transportation, and communication systems. Then, for the framework's proof-of-concept, preliminary component and system-level models for different systems are designed and implemented using Modelica, an equation-based object-oriented modeling language. Finally, three case studies of gradually increasing complexity are presented (energy, energy + transportation, energy + transportation + communication) to evaluate the interdependencies among the three systems. Quantitative analyses show that the deviation of the average velocity on the road can be 10.5% and the deviation of the power draw from the grid can be 7% with or without considering the transportation and communication system at the peak commute time, indicating the presence of notable interdependencies. The proposed modeling framework also has the potential to be further extended for various modeling purposes and use cases, such as dynamic modeling and optimization, resilience analysis, and integrated decision making in future connected communities.
39.2 · IT · Apr 2
On the Capacity Region of Additive-Multiplicative MAC with Heterogeneous Input Constraints
Qianqian Zhang, Ying-Chang Liang
This paper characterizes the capacity region of a two-user additive-multiplicative multiple access channel (AM-MAC) under heterogeneous input constraints. This model captures the fundamental limits of symbiotic radio, where an active primary transmitter (PT) conveys information via active transmission subject to an average power constraint, while a passive backscatter device (BD) modulates signals through backscattering under a peak amplitude constraint. Our main results are threefold. Firstly, we prove that the sum-rate capacity equals the PT's point-to-point capacity, achieved when the PT employs Gaussian signaling and the BD acts as a pure reflector to assist the PT's transmission. Secondly, to achieve the BD's maximum achievable rate, the PT must adopt a constant-envelope signaling strategy, while the optimal BD distribution exhibits a concentric-circle structure with a uniform phase. Thirdly, for the remaining boundary points, we establish that the optimal PT signal consists of a continuous uniform phase and a discrete amplitude, whereas the optimal BD distribution is fully discrete. Finally, numerical results are provided to characterize the capacity region by solving a specialized nonlinear optimization problem. To demonstrate the practical implications, we also characterize a baseline rate pair and evaluate the overall performance of the AM-MAC.
CL · May 30, 2025
Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research
Qianqian Zhang, Jiajia Liao, Heting Ying et al. · cmu
Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks. However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison. We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible and extensible framework that addresses these challenges through three key contributions: (1) a modular architecture with a graph-based workflow engine, efficient memory management, and clean component abstraction; (2) a comprehensive suite of reusable agent algorithms implementing state-of-the-art reasoning approaches; and (3) a rigorous evaluation framework enabling systematic comparison across multiple dimensions. Through extensive experiments on mathematical reasoning and multimodal tasks, we evaluate various agent algorithms across different LLMs, revealing important insights about their relative strengths and applicability. Our results demonstrate that while sophisticated reasoning approaches can enhance agent capabilities, simpler methods like Chain-of-Thought often exhibit robust performance with significantly lower computational overhead. AGORA not only simplifies language agent development but also establishes a foundation for reproducible agent research through standardized evaluation protocols.
CV · May 20, 2025
Selective Structured State Space for Multispectral-fused Small Target Detection
Qianqian Zhang, WeiJun Wang, Yunxing Liu et al.
Target detection in high-resolution remote sensing imagery faces challenges due to the low recognition accuracy of small targets and high computational costs. The computational complexity of the Transformer architecture increases quadratically with image resolution, while Convolutional Neural Network (CNN) architectures are forced to stack deeper convolutional layers to expand their receptive fields, leading to an explosive growth in computational demands. To address these computational constraints, we leverage Mamba's linear complexity for efficiency. However, Mamba's performance declines for small targets, primarily because small targets occupy a limited area in the image and have limited semantic information. Accurate identification of these small targets necessitates not only Mamba's global attention capabilities but also the precise capture of fine local details. To this end, we enhance Mamba by developing the Enhanced Small Target Detection (ESTD) module and the Convolutional Attention Residual Gate (CARG) module. The ESTD module bolsters local attention to capture fine-grained details, while the CARG module, built upon Mamba, emphasizes spatial and channel-wise information, collectively improving the model's ability to capture distinctive representations of small targets. Additionally, to highlight the semantic representation of small targets, we design a Mask Enhanced Pixel-level Fusion (MEPF) module for multispectral fusion, which enhances target features by effectively fusing visible and infrared multimodal information.
IT · Nov 14, 2024
Latency Optimization in LEO Satellite Communications with Hybrid Beam Pattern and Interference Control
Qianqian Zhang, Ye Hu, Minchae Jung
The rapid advancement of low Earth orbit (LEO) satellite communication systems has significantly enhanced global connectivity, offering high-capacity, low-latency services crucial for next-generation applications. However, the dense configuration of LEO constellations poses challenges in resource allocation optimization and interference management, complicating coexistence with other communication systems. To address these limitations, this paper proposes a novel framework for optimizing the beam scheduling and resource allocation in multi-beam LEO systems. To satisfy the uneven terrestrial traffic demand, a hybrid beam pattern is employed to enhance the downlink quality of service and minimize the transmission latency from LEO satellites to ground user terminals. Additionally, a dynamic co-channel interference (CCI) control mechanism is developed to mitigate inter-beam interference within the LEO constellation and limit cross-system interference affecting protected users from other networks. The problem of user-beam-frequency allocation with power optimization is formulated as a mixed-integer dynamic programming model and solved using a low-complexity neural network-based graph generation algorithm. Simulation results show that the proposed approach outperforms the baseline methods of full frequency reuse and single-channel transmission, and highlight the potential for further performance improvement with multi-user transmissions.
LG · Jan 18, 2022
Leaving No One Behind: A Multi-Scenario Multi-Task Meta Learning Approach for Advertiser Modeling
Qianqian Zhang, Xinru Liao, Quan Liu et al.
Advertisers play an essential role in many e-commerce platforms like Taobao and Amazon. Fulfilling their marketing needs and supporting their business growth is critical to the long-term prosperity of platform economies. However, compared with extensive studies on user modeling such as click-through rate predictions, much less attention has been drawn to advertisers, especially in terms of understanding their diverse demands and performance. Different from user modeling, advertiser modeling generally involves many kinds of tasks (e.g. predictions of advertisers' expenditure, active-rate, or total impressions of promoted products). In addition, major e-commerce platforms often provide multiple marketing scenarios (e.g. Sponsored Search, Display Ads, Live Streaming Ads) while advertisers' behavior tends to be dispersed among many of them. This raises the necessity of multi-task and multi-scenario consideration in comprehensive advertiser modeling, which faces the following challenges: first, one model per scenario or per task simply doesn't scale; second, it is particularly hard to model new or minor scenarios with limited data samples; third, inter-scenario correlations are complicated, and may vary given different tasks. To tackle these challenges, we propose a multi-scenario multi-task meta learning approach (M2M) which simultaneously predicts multiple tasks in multiple advertising scenarios.
NI · Aug 3, 2021
Semi-Supervised Learning for Channel Charting-Aided IoT Localization in Millimeter Wave Networks
Qianqian Zhang, Walid Saad
In this paper, a novel framework is proposed for channel charting (CC)-aided localization in millimeter wave networks. In particular, a convolutional autoencoder model is proposed to estimate the three-dimensional location of wireless user equipment (UE), based on multipath channel state information (CSI), received by different base stations. In order to learn the radio-geometry map and capture the relative position of each UE, an autoencoder-based channel chart is constructed in an unsupervised manner, such that neighboring UEs in the physical space will remain close in the channel chart. Next, the channel charting model is extended to a semi-supervised framework, where the autoencoder is divided into two components: an encoder and a decoder, and each component is optimized individually, using the labeled CSI dataset with associated location information, to further improve positioning accuracy. Simulation results show that the proposed CC-aided semi-supervised localization yields a higher accuracy, compared with existing supervised positioning and conventional unsupervised CC approaches.
IT · Feb 2, 2021
Distributed Conditional Generative Adversarial Networks (GANs) for Data-Driven Millimeter Wave Communications in UAV Networks
Qianqian Zhang, Aidin Ferdowsi, Walid Saad et al.
In this paper, a novel framework is proposed to perform data-driven air-to-ground (A2G) channel estimation for millimeter wave (mmWave) communications in an unmanned aerial vehicle (UAV) wireless network. First, an effective channel estimation approach is developed to collect mmWave channel information, allowing each UAV to train a stand-alone channel model via a conditional generative adversarial network (CGAN) along each beamforming direction. Next, in order to expand the application scenarios of the trained channel model into a broader spatial-temporal domain, a cooperative framework, based on a distributed CGAN architecture, is developed, allowing each UAV to collaboratively learn the mmWave channel distribution in a fully-distributed manner. To guarantee an efficient learning process, necessary and sufficient conditions for the optimal UAV network topology that maximizes the learning rate for cooperative channel modeling are derived, and the optimal CGAN learning solution per UAV is subsequently characterized, based on the distributed network structure. Simulation results show that the proposed distributed CGAN approach is robust to the local training error at each UAV. Meanwhile, a larger airborne network size requires more communication resources per UAV to guarantee an efficient learning rate. The results also show that, compared with a stand-alone CGAN without information sharing and two other distributed schemes, namely a multi-discriminator CGAN and a federated CGAN method, the proposed distributed CGAN approach yields a higher modeling accuracy while learning the environment, and it achieves a larger average data rate in the online performance of UAV downlink mmWave communications.
IV · Nov 30, 2020
SAR Image Despeckling Based on Convolutional Denoising Autoencoder
Qianqian Zhang, Ruizhi Sun
In Synthetic Aperture Radar (SAR) imaging, despeckling is very important for image analysis, since speckle is a kind of multiplicative noise caused by the coherent imaging system. During the past three decades, various algorithms have been proposed to denoise SAR images. Generally, BM3D is considered the state-of-the-art technique for removing speckle noise, with excellent performance. More recently, deep learning has succeeded in image denoising and achieved an improvement over conventional methods, although a large training dataset is required. Unlike most SAR image despeckling approaches, the proposed approach learns the speckle from corrupted images directly. In this paper, working with a dataset of limited scale, we explore the use of a convolutional denoising autoencoder (C-DAE) to reconstruct speckle-free SAR images. A batch normalization strategy is integrated with the C-DAE to speed up training. Moreover, we evaluate image quality with the standard metrics PSNR and SSIM. The results show that our approach performs better than some others.
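For readers unfamiliar with the PSNR metric named above, a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), follows. The function name, the nested-list image representation, and the default peak value of 255 (8-bit images) are assumptions for illustration. The abstract does not state the dynamic range used in the paper's evaluation.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[Sequence[float]],
         estimate: Sequence[Sequence[float]],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-sized grayscale images."""
    ref_pixels = [p for row in reference for p in row]
    est_pixels = [p for row in estimate for p in row]
    if len(ref_pixels) != len(est_pixels) or not ref_pixels:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(ref_pixels, est_pixels)) / len(ref_pixels)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [[100.0, 100.0], [100.0, 100.0]]
despeckled = [[110.0, 90.0], [110.0, 90.0]]
score_db = psnr(clean, despeckled)  # MSE is 100, so roughly 28.13 dB
```

Higher values indicate that the despeckled output is closer to the clean reference. SSIM, the other metric mentioned, measures structural similarity and is not covered by this sketch.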
IT · Nov 3, 2020
Distributional Reinforcement Learning for mmWave Communications with Intelligent Reflectors on a UAV
Qianqian Zhang, Walid Saad, Mehdi Bennis
In this paper, a novel communication framework that uses an unmanned aerial vehicle (UAV)-carried intelligent reflector (IR) is proposed to enhance multi-user downlink transmissions over millimeter wave (mmWave) frequencies. In order to maximize the downlink sum-rate, the optimal precoding matrix (at the base station) and reflection coefficient (at the IR) are jointly derived. Next, to address the uncertainty of mmWave channels and maintain line-of-sight links in a real-time manner, a distributional reinforcement learning approach, based on quantile regression optimization, is proposed to learn the propagation environment of mmWave communications, and, then, optimize the location of the UAV-IR so as to maximize the long-term downlink communication capacity. Simulation results show that the proposed learning-based deployment of the UAV-IR yields a significant advantage, compared to a non-learning UAV-IR, a static IR, and a direct transmission scheme, in terms of the average data rate and the achievable line-of-sight probability of downlink mmWave communications.
IT · Feb 24, 2020
Millimeter Wave Communications with an Intelligent Reflector: Performance Optimization and Distributional Reinforcement Learning
Qianqian Zhang, Walid Saad, Mehdi Bennis
In this paper, a novel framework is proposed to optimize the downlink multi-user communication of a millimeter wave base station, which is assisted by a reconfigurable intelligent reflector (IR). In particular, a channel estimation approach is developed to measure the channel state information (CSI) in real-time. First, for a perfect CSI scenario, the precoding transmission of the BS and the reflection coefficient of the IR are jointly optimized, via an iterative approach, so as to maximize the sum of downlink rates towards multiple users. Next, in the imperfect CSI scenario, a distributional reinforcement learning (DRL) approach is proposed to learn the optimal IR reflection and maximize the expectation of downlink capacity. In order to model the transmission rate's probability distribution, a learning algorithm, based on quantile regression (QR), is developed, and the proposed QR-DRL method is proved to converge to a stable distribution of downlink transmission rate. Simulation results show that, in the error-free CSI scenario, the proposed approach yields an over 30% increase and a 2-fold increase in the downlink sum-rate, compared with a fixed IR reflection scheme and direct transmission scheme, respectively. Simulation results also show that by deploying more IR elements, the downlink sum-rate can be significantly improved. However, as the number of IR components increases, more time is required for channel estimation, and the slope of increase in the IR-aided transmission rate will become smaller. Furthermore, under limited knowledge of CSI, simulation results show that the proposed QR-DRL method, which learns a full distribution of the downlink rate, yields a better prediction accuracy and improves the downlink rate by 10% for online deployments, compared with a Q-learning baseline.
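The quantile regression idea behind QR-based distributional learning can be illustrated with the standard pinball loss, whose minimizer over a set of samples is the corresponding quantile. The sketch below is a generic illustration of that loss, not the paper's QR-DRL algorithm. The function names and the toy rate samples are assumptions.

```python
from typing import Sequence


def pinball_loss(target: float, prediction: float, tau: float) -> float:
    """Quantile (pinball) loss for quantile level tau in (0, 1)."""
    diff = target - prediction
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return max(tau * diff, (tau - 1.0) * diff)


def mean_pinball_loss(samples: Sequence[float], prediction: float, tau: float) -> float:
    """Average pinball loss of one quantile estimate over observed samples."""
    return sum(pinball_loss(s, prediction, tau) for s in samples) / len(samples)


def empirical_quantile(samples: Sequence[float], tau: float) -> float:
    """Pick the sample value that minimizes the mean pinball loss."""
    return min(samples, key=lambda q: mean_pinball_loss(samples, q, tau))


# Toy "downlink rate" observations (hypothetical values).
rates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
median_rate = empirical_quantile(rates, 0.5)
```

Learning several quantile levels at once gives an approximation of the full rate distribution rather than only its mean, which is the distinction the abstract draws against a Q-learning baseline.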