SPJun 1
Multi-view imaging in networked sensing systems: A covariance-based approachJunyuan Gao, Weifeng Zhu, Yanmo Hu et al.
This paper considers multi-view imaging in a sixth-generation (6G) integrated sensing and communication network, which consists of a transmit base-station (BS), multiple receive BSs connected to a central processing unit (CPU), and multiple extended targets. Our goal is to devise an effective multi-view imaging technique that can jointly leverage the targets' echo signals at all the receive BSs to precisely construct the image of these targets. To achieve this goal, we propose a two-phase approach. In Phase I, each receive BS recovers an individual image based on the sample covariance matrix of its received signals. Specifically, we propose a novel covariance-based imaging framework to jointly estimate effective scattering intensity and grid positions, which reduces the number of estimated parameters leveraging channel statistical properties and allows grid adjustment to conform to target geometry. In Phase II, the CPU fuses the individual images of all the receivers to construct a high-quality image of all the targets. Specifically, we design edge-preserving natural neighbor interpolation (EP-NNI) to map individual heterogeneous images onto common and finer grids, and then propose a joint optimization framework to estimate fused scattering intensity and BS fields of view. Extensive numerical results show that the proposed scheme significantly enhances imaging performance, facilitating high-quality environment reconstruction for future 6G networks.
CVSep 7, 2023Code
Stroke-based Neural Painting and Stylization with Dynamically Predicted Painting RegionTeng Hu, Ran Yi, Haokun Zhu et al. · tsinghua
Stroke-based rendering aims to recreate an image with a set of strokes. Most existing methods render complex images using an uniform-block-dividing strategy, which leads to boundary inconsistency artifacts. To solve the problem, we propose Compositional Neural Painter, a novel stroke-based rendering framework which dynamically predicts the next painting region based on the current canvas, instead of dividing the image plane uniformly into painting regions. We start from an empty canvas and divide the painting process into several steps. At each step, a compositor network trained with a phasic RL strategy first predicts the next painting region, then a painter network trained with a WGAN discriminator predicts stroke parameters, and a stroke renderer paints the strokes onto the painting region of the current canvas. Moreover, we extend our method to stroke-based style transfer with a novel differentiable distance transform loss, which helps preserve the structure of the input image during stroke-based stylization. Extensive experiments show our model outperforms the existing models in both stroke-based neural painting and stroke-based stylization. Code is available at https://github.com/sjtuplayer/Compositional_Neural_Painter
CVSep 7, 2023Code
Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model AdaptionTeng Hu, Jiangning Zhang, Liang Liu et al. · tsinghua
Training a generative model with limited number of samples is a challenging task. Current methods primarily rely on few-shot model adaption to train the network. However, in scenarios where data is extremely limited (less than 10), the generative network tends to overfit and suffers from content degradation. To address these problems, we propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss, which targets different learning objectives at distinct training stages of the diffusion model. Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when t is large, and learn local details of target domain when t is small, leading to an improvement in the capture of content, style and local details. Furthermore, we introduce a novel directional distribution consistency loss that ensures the consistency between the generated and source distributions more efficiently and stably than the prior methods, preventing our model from overfitting. Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation. Theoretical analysis, qualitative and quantitative experiments demonstrate the superiority of our approach in few-shot generative model adaption tasks compared to state-of-the-art methods. The source code is available at: https://github.com/sjtuplayer/few-shot-diffusion.
CVSep 7, 2023Code
Toward High Quality Facial Representation LearningYue Wang, Jinlong Peng, Jiangning Zhang et al. · tsinghua
Face analysis tasks have a wide range of applications, but the universal facial representation has only been explored in a few works. In this paper, we explore high-performance pre-training methods to boost the face analysis tasks such as face alignment and face parsing. We propose a self-supervised pre-training framework, called \textbf{\it Mask Contrastive Face (MCF)}, with mask image modeling and a contrastive strategy specially adjusted for face domain tasks. To improve the facial representation quality, we use feature map of a pre-trained visual backbone as a supervision item and use a partially pre-trained decoder for mask image modeling. To handle the face identity during the pre-training stage, we further use random masks to build contrastive learning pairs. We conduct the pre-training on the LAION-FACE-cropped dataset, a variants of LAION-FACE 20M, which contains more than 20 million face images from Internet websites. For efficiency pre-training, we explore our framework pre-training performance on a small part of LAION-FACE-cropped and verify the superiority with different pre-training settings. Our model pre-trained with the full pre-training dataset outperforms the state-of-the-art methods on multiple downstream tasks. Our model achieves 0.932 NME$_{diag}$ for AFLW-19 face alignment and 93.96 F1 score for LaPa face parsing. Code is available at https://github.com/nomewang/MCF.
ITMay 15, 2018
A D2D-based Protocol for Ultra-Reliable Wireless Communications for Industrial AutomationLiang Liu, Wei Yu · utoronto
As one indispensable use case for the 5G wireless systems on the roadmap, ultra-reliable and low latency communications (URLLC) is a crucial requirement for the coming era of wireless industrial automation. This paper aims to develop communication techniques for making such a paradigm shift from the conventional human-type broadband communications to the emerging machine-type URLLC. One fundamental task for URLLC is to deliver a short command from the controller to each actuator within the stringent delay requirement and also with high-reliability in the downlink. Motivated by the geographic feature in industrial automation that in the factories many tasks are assigned to different groups of devices who work in close proximity to each other and thus can form clusters of reliable device-to-device (D2D) networks, this paper proposes a novel two-phase transmission protocol for achieving the above goal. Specifically, in the first phase within the latency requirement, the multi-antenna base station (BS) combines the messages of each group together and multicasts them to the corresponding groups; while in the second phase, the devices that have decoded the messages successfully, who are defined as the leaders, help relay the messages to the other devices in their groups. Under this protocol, we further design an innovative leader selection based beamforming strategy at the BS by utilizing the sparse optimization technique, which leads to the desired sparsity pattern in user activity, i.e., at least one leader exists in each group, in the first phase, thus making full utilization of the reliable D2D networks in the second phase. Simulation results are provided to show that the proposed two-phase transmission protocol considerably improves the reliability of the whole system within the stringent latency requirement as compared to other existing schemes for URLLC such as Occupy CoW.
CVMar 16, 2023Code
MixTeacher: Mining Promising Labels with Mixed Scale Teacher for Semi-Supervised Object DetectionLiang Liu, Boshen Zhang, Jiangning Zhang et al.
Scale variation across object instances remains a key challenge in object detection task. Despite the remarkable progress made by modern detection models, this challenge is particularly evident in the semi-supervised case. While existing semi-supervised object detection methods rely on strict conditions to filter high-quality pseudo labels from network predictions, we observe that objects with extreme scale tend to have low confidence, resulting in a lack of positive supervision for these objects. In this paper, we propose a novel framework that addresses the scale variation problem by introducing a mixed scale teacher to improve pseudo label generation and scale-invariant learning. Additionally, we propose mining pseudo labels using score promotion of predictions across scales, which benefits from better predictions from mixed scale features. Our extensive experiments on MS COCO and PASCAL VOC benchmarks under various semi-supervised settings demonstrate that our method achieves new state-of-the-art performance. The code and models are available at \url{https://github.com/lliuz/MixTeacher}.
CVMar 10, 2023Code
Iterative Few-shot Semantic Segmentation from Image Label TextHaohan Wang, Liang Liu, Wuhao Zhang et al.
Few-shot semantic segmentation aims to learn to segment unseen class objects with the guidance of only a few support images. Most previous methods rely on the pixel-level label of support images. In this paper, we focus on a more challenging setting, in which only the image-level labels are available. We propose a general framework to firstly generate coarse masks with the help of the powerful vision-language model CLIP, and then iteratively and mutually refine the mask predictions of support and query images. Extensive experiments on PASCAL-5i and COCO-20i datasets demonstrate that our method not only outperforms the state-of-the-art weakly supervised approaches by a significant margin, but also achieves comparable or better results to recent supervised methods. Moreover, our method owns an excellent generalization ability for the images in the wild and uncommon classes. Code will be available at https://github.com/Whileherham/IMR-HSNet.
CLFeb 2Code
Kimi K2.5: Visual Agentic IntelligenceKimi Team, Tongtong Bai, Yifan Bai et al.
We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
CVMar 14, 2023Code
Calibrated Teacher for Sparsely Annotated Object DetectionHaohan Wang, Liang Liu, Boshen Zhang et al.
Fully supervised object detection requires training images in which all instances are annotated. This is actually impractical due to the high labor and time costs and the unavoidable missing annotations. As a result, the incomplete annotation in each image could provide misleading supervision and harm the training. Recent works on sparsely annotated object detection alleviate this problem by generating pseudo labels for the missing annotations. Such a mechanism is sensitive to the threshold of the pseudo label score. However, the effective threshold is different in different training stages and among different object detectors. Therefore, the current methods with fixed thresholds have sub-optimal performance, and are difficult to be applied to other detectors. In order to resolve this obstacle, we propose a Calibrated Teacher, of which the confidence estimation of the prediction is well calibrated to match its real precision. In this way, different detectors in different training stages would share a similar distribution of the output confidence, so that multiple detectors could share the same fixed threshold and achieve better performance. Furthermore, we present a simple but effective Focal IoU Weight (FIoU) for the classification loss. FIoU aims at reducing the loss weight of false negative samples caused by the missing annotation, and thus works as the complement of the teacher-student paradigm. Extensive experiments show that our methods set new state-of-the-art under all different sparse settings in COCO. Code will be available at https://github.com/Whileherham/CalibratedTeacher.
SPFeb 10, 2023Code
The LuViRA Dataset: Synchronized Vision, Radio, and Audio Sensors for Indoor LocalizationIlayda Yaman, Guoda Tian, Martin Larsson et al.
We present a synchronized multisensory dataset for accurate and robust indoor localization: the Lund University Vision, Radio, and Audio (LuViRA) Dataset. The dataset includes color images, corresponding depth maps, inertial measurement unit (IMU) readings, channel response between a 5G massive multiple-input and multiple-output (MIMO) testbed and user equipment, audio recorded by 12 microphones, and accurate six degrees of freedom (6DOF) pose ground truth of 0.5 mm. We synchronize these sensors to ensure that all data is recorded simultaneously. A camera, speaker, and transmit antenna are placed on top of a slowly moving service robot, and 89 trajectories are recorded. Each trajectory includes 20 to 50 seconds of recorded sensor data and ground truth labels. Data from different sensors can be used separately or jointly to perform localization tasks, and data from the motion capture (mocap) system is used to verify the results obtained by the localization algorithms. The main aim of this dataset is to enable research on sensor fusion with the most commonly used sensors for localization tasks. Moreover, the full dataset or some parts of it can also be used for other research areas such as channel estimation, image classification, etc. Our dataset is available at: https://github.com/ilaydayaman/LuViRA_Dataset
CVFeb 14, 2023Code
Self-Supervised Likelihood Estimation with Energy Guidance for Anomaly Segmentation in Urban ScenesYuanpeng Tu, Yuxi Li, Boshen Zhang et al.
Robust autonomous driving requires agents to accurately identify unexpected areas (anomalies) in urban scenes. To this end, some critical issues remain open: how to design advisable metric to measure anomalies, and how to properly generate training samples of anomaly data? Classical effort in anomaly detection usually resorts to pixel-wise uncertainty or sample synthesis, which ignores the contextual information and sometimes requires auxiliary data with fine-grained annotations. On the contrary, in this paper, we exploit the strong context-dependent nature of the segmentation task and design an energy-guided self-supervised framework for anomaly segmentation, which optimizes an anomaly head by maximizing the likelihood of self-generated anomaly pixels. For this purpose, we design two estimators to model anomaly likelihood, one is a task-agnostic binary estimator and the other depicts the likelihood as residual of task-oriented joint energy. Based on the proposed estimators, we devise an adaptive self-supervised training framework, which exploits the contextual reliance and estimated likelihood to refine mask annotations in anomaly areas. We conduct extensive experiments on challenging Fishyscapes and Road Anomaly benchmarks, demonstrating that without any auxiliary data or synthetic models, our method can still achieve comparable performance to supervised competitors. Code is available at https://github.com/yuanpengtu/SLEEG..
CVJan 3, 2023
Rethinking Mobile Block for Efficient Attention-based ModelsJiangning Zhang, Xiangtai Li, Jian Li et al.
This paper focuses on developing modern, efficient, lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized by attention-based studies. This work rethinks lightweight infrastructure from efficient IRB and effective components of Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMB) for lightweight model design. Following simple but effective design criterion, we deduce a modern Inverted Residual Mobile Block (iRMB) and build a ResNet-like Efficient MOdel (EMO) with only iRMB for down-stream tasks. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, e.g., EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 that surpass equal-order CNN-/Attention-based models, while trading-off the parameter, efficiency, and accuracy well: running 2.8-4.0x faster than EdgeNeXt on iPhone14.
CVFeb 14, 2023
Learning from Noisy Labels with Decoupled Meta Label PurifierYuanpeng Tu, Boshen Zhang, Yuxi Li et al.
Training deep neural networks(DNN) with noisy labels is challenging since DNN can easily memorize inaccurate labels, leading to poor generalization ability. Recently, the meta-learning based label correction strategy is widely adopted to tackle this problem via identifying and correcting potential noisy labels with the help of a small set of clean validation data. Although training with purified labels can effectively improve performance, solving the meta-learning problem inevitably involves a nested loop of bi-level optimization between model weights and hyper-parameters (i.e., label distribution). As compromise, previous methods resort to a coupled learning process with alternating update. In this paper, we empirically find such simultaneous optimization over both model weights and label distribution can not achieve an optimal routine, consequently limiting the representation ability of backbone and accuracy of corrected labels. From this observation, a novel multi-stage label purifier named DMLP is proposed. DMLP decouples the label correction process into label-free representation learning and a simple meta label purifier. In this way, DMLP can focus on extracting discriminative feature and label correction in two distinctive stages. DMLP is a plug-and-play label purifier, the purified labels can be directly reused in naive end-to-end network retraining or other robust learning methods, where state-of-the-art results are obtained on several synthetic and real-world noisy datasets, especially under high noise levels.
CVMay 13, 2022
FRIH: Fine-grained Region-aware Image HarmonizationJinlong Peng, Zekun Luo, Liang Liu et al.
Image harmonization aims to generate a more realistic appearance of foreground and background for a composite image. Existing methods perform the same harmonization process for the whole foreground. However, the implanted foreground always contains different appearance patterns. All the existing solutions ignore the difference of each color block and losing some specific details. Therefore, we propose a novel global-local two stages framework for Fine-grained Region-aware Image Harmonization (FRIH), which is trained end-to-end. In the first stage, the whole input foreground mask is used to make a global coarse-grained harmonization. In the second stage, we adaptively cluster the input foreground mask into several submasks by the corresponding pixel RGB values in the composite image. Each submask and the coarsely adjusted image are concatenated respectively and fed into a lightweight cascaded module, adjusting the global harmonization performance according to the region-aware local feature. Moreover, we further designed a fusion prediction module by fusing features from all the cascaded decoder layers together to generate the final result, which could utilize the different degrees of harmonization results comprehensively. Without bells and whistles, our FRIH algorithm achieves the best performance on iHarmony4 dataset (PSNR is 38.19 dB) with a lightweight model. The parameters for our model are only 11.98 M, far below the existing methods.
SPMar 7, 2023
High-Precision Machine-Learning Based Indoor Localization with Massive MIMO SystemGuoda Tian, Ilayda Yaman, Michiel Sandra et al.
High-precision cellular-based localization is one of the key technologies for next-generation communication systems. In this paper, we investigate the potential of applying machine learning (ML) to a massive multiple-input multiple-output (MIMO) system to enhance localization accuracy. We analyze a new ML-based localization pipeline that has two parallel fully connected neural networks (FCNN). The first FCNN takes the instantaneous spatial covariance matrix to capture angular information, while the second FCNN takes the channel impulse responses to capture delay information. We fuse the estimated coordinates of these two FCNNs for further accuracy improvement. To test the localization algorithm, we performed an indoor measurement campaign with a massive MIMO testbed at 3.7GHz. In the measured scenario, the proposed pipeline can achieve centimeter-level accuracy by combining delay and angular information.
CVFeb 14, 2023
Learning with Noisy labels via Self-supervised Adversarial Noisy MaskingYuanpeng Tu, Boshen Zhang, Yuxi Li et al.
Collecting large-scale datasets is crucial for training deep models, annotating the data, however, inevitably yields noisy labels, which poses challenges to deep learning algorithms. Previous efforts tend to mitigate this problem via identifying and removing noisy samples or correcting their labels according to the statistical properties (e.g., loss values) among training samples. In this paper, we aim to tackle this problem from a new perspective, delving into the deep feature maps, we empirically find that models trained with clean and mislabeled samples manifest distinguishable activation feature distributions. From this observation, a novel robust training approach termed adversarial noisy masking is proposed. The idea is to regularize deep features with a label quality guided masking scheme, which adaptively modulates the input data and label simultaneously, preventing the model to overfit noisy samples. Further, an auxiliary task is designed to reconstruct input data, it naturally provides noise-free self-supervised signals to reinforce the generalization ability of deep models. The proposed method is simple and flexible, it is tested on both synthetic and real-world noisy datasets, where significant improvements are achieved over previous state-of-the-art methods.
LGMay 18, 2022
Scalable Multi-view Clustering with Graph FilteringLiang Liu, Peng Chen, Guangchun Luo et al.
With the explosive growth of multi-source data, multi-view clustering has attracted great attention in recent years. Most existing multi-view methods operate in raw feature space and heavily depend on the quality of original feature representation. Moreover, they are often designed for feature data and ignore the rich topology structure information. Accordingly, in this paper, we propose a generic framework to cluster both attribute and graph data with heterogeneous features. It is capable of exploring the interplay between feature and structure. Specifically, we first adopt graph filtering technique to eliminate high-frequency noise to achieve a clustering-friendly smooth representation. To handle the scalability challenge, we develop a novel sampling strategy to improve the quality of anchors. Extensive experiments on attribute and graph benchmarks demonstrate the superiority of our approach with respect to state-of-the-art approaches.
HCJul 3, 2024
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI AgentsYuxiang Chai, Siyuan Huang, Yazhe Niu et al.
AI agents have drawn increasing attention mostly on their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents which are capable of completing tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from popular mobile applications, which are annotated at multiple levels. Unlike existing GUI-related datasets, e.g., Rico, AitW, etc., AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we finetune a baseline model SPHINX Agent and illustrate the effectiveness of AMEX.The project is available at https://yxchai.com/AMEX/.
AIFeb 27Code
AIDABench: AI Data Analytics BenchmarkYibo Yang, Fei Lei, Yixuan Sun et al.
As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation standards has grown increasingly urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios, failing to capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end-to-end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks in AIDABench are sufficiently challenging that even human experts require 1-2 hours per question when assisted by AI tools, underscoring the benchmark's difficulty and real-world complexity. We evaluate 11 state-of-the-art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results reveal that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model achieving only 59.43% pass-at-1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang-lyx/AIDABench.
LGMar 13, 2023
Spacecraft Anomaly Detection with Attention Temporal Convolution NetworkLiang Liu, Ling Tian, Zhao Kang et al.
Spacecraft faces various situations when carrying out exploration missions in complex space, thus monitoring the anomaly status of spacecraft is crucial to the development of \textcolor{blue}{the} aerospace industry. The time series telemetry data generated by on-orbit spacecraft \textcolor{blue}{contains} important information about the status of spacecraft. However, traditional domain knowledge-based spacecraft anomaly detection methods are not effective due to high dimensionality and complex correlation among variables. In this work, we propose an anomaly detection framework for spacecraft multivariate time-series data based on temporal convolution networks (TCNs). First, we employ dynamic graph attention to model the complex correlation among variables and time series. Second, temporal convolution networks with parallel processing ability are used to extract multidimensional \textcolor{blue}{features} for \textcolor{blue}{the} downstream prediction task. Finally, many potential anomalies are detected by the best threshold. Experiments on real NASA SMAP/MSL spacecraft datasets show the superiority of our proposed model with respect to state-of-the-art methods.
SPSep 6, 2023
LuViRA Dataset Validation and Discussion: Comparing Vision, Radio, and Audio Sensors for Indoor LocalizationIlayda Yaman, Guoda Tian, Erik Tegler et al.
We present a unique comparative analysis, and evaluation of vision, radio, and audio based localization algorithms. We create the first baseline for the aforementioned sensors using the recently published Lund University Vision, Radio, and Audio (LuViRA) dataset, where all the sensors are synchronized and measured in the same environment. Some of the challenges of using each specific sensor for indoor localization tasks are highlighted. Each sensor is paired with a current state-of-the-art localization algorithm and evaluated for different aspects: localization accuracy, reliability and sensitivity to environment changes, calibration requirements, and potential system complexity. Specifically, the evaluation covers the ORB-SLAM3 algorithm for vision-based localization with an RGB-D camera, a machine-learning algorithm for radio-based localization with massive MIMO technology, and the SFS2 algorithm for audio-based localization with distributed microphones. The results can serve as a guideline and basis for further development of robust and high-precision multi-sensory localization systems, e.g., through sensor fusion, context, and environment-aware adaptation.
CVMar 24, 2024Code
SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object TrackingXiaojun Hou, Jiazheng Xing, Yijie Qian et al.
Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at https://github.com/hoqolo/SDSTrack.
AIMar 27, 2025Code
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement LearningZhengxi Lu, Yuxiang Chai, Yaxuan Guo et al.
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multi-modal domains, particularly in graphic user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e. Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on ANDROIDCONTROL. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples. We additionally develop an optimized version, UI-R1-E-3B, which significantly improves both grounding efficiency and accuracy. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. Code website: https://github.com/lll6gg/UI-R1.
LGJul 28, 2025Code
Kimi K2: Open Agentic IntelligenceKimi Team, Yifan Bai, Yiping Bao et al. · tsinghua
We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual -- surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.
AIMay 21
Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision SupportChen Zhan, Xihe Qiu, Xiaoyu Tan et al.
Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.
SDJun 7, 2023
A Mask Free Neural Network for Monaural Speech EnhancementLiang Liu, Haixin Guan, Jinlong Ma et al.
In speech enhancement, the lack of clear structural characteristics in the target speech phase requires the use of conservative and cumbersome network frameworks. It seems difficult to achieve competitive performance using direct methods and simple network architectures. However, we propose the MFNet, a direct and simple network that can not only map speech but also map reverse noise. This network is constructed by stacking global local former blocks (GLFBs), which combine the advantages of Mobileblock for global processing and Metaformer architecture for local interaction. Our experimental results demonstrate that our network using mapping method outperforms masking methods, and direct mapping of reverse noise is the optimal solution in strong noise environments. In a horizontal comparison on the 2020 Deep Noise Suppression (DNS) challenge test set without reverberation, to the best of our knowledge, MFNet is the current state-of-the-art (SOTA) mapping model.
CVAug 28, 2024
DQFormer: Towards Unified LiDAR Panoptic Segmentation with Decoupled QueriesYu Yang, Jianbiao Mei, Liang Liu et al.
LiDAR panoptic segmentation, which jointly performs instance and semantic segmentation for things and stuff classes, plays a fundamental role in LiDAR perception tasks. While most existing methods explicitly separate these two segmentation tasks and utilize different branches (i.e., semantic and instance branches), some recent methods have embraced the query-based paradigm to unify LiDAR panoptic segmentation. However, the distinct spatial distribution and inherent characteristics of objects(things) and their surroundings(stuff) in 3D scenes lead to challenges, including the mutual competition of things/stuff and the ambiguity of classification/segmentation. In this paper, we propose decoupling things/stuff queries according to their intrinsic properties for individual decoding and disentangling classification/segmentation to mitigate ambiguity. To this end, we propose a novel framework dubbed DQFormer to implement semantic and instance segmentation in a unified workflow. Specifically, we design a decoupled query generator to propose informative queries with semantics by localizing things/stuff positions and fusing multi-level BEV embeddings. Moreover, a query-oriented mask decoder is introduced to decode corresponding segmentation masks by performing masked cross-attention between queries and mask embeddings. Finally, the decoded masks are combined with the semantics of the queries to produce panoptic results. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our DQFormer framework.
CVJan 8
SRU-Pix2Pix: A Fusion-Driven Generator Network for Medical Image Translation with Few-Shot LearningXihe Qiu, Yang Dai, Xiaoyu Tan et al.
Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results demonstrate that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.
SPApr 1
SAR/ISAR Imaging in 6G NetworkYanmo Hu, Shuowen Zhang, Ross Murch et al.
Imaging is a crucial sensing function that finds wide applications in environmental reconstruction, autonomous driving, etc. However, the signal processing methods for existing radio imaging techniques, such as millimeter wave (mmWave) imaging, require high-resolution range estimation enabled by Gigahertz-level or even Terahertz-level bandwidth, and cannot be applied in 6G integrated sensing and communication (ISAC) network with Megahertz-level bandwidth. This paper proposes two novel high-resolution radio imaging schemes that can work on the 6G signals with limited bandwidth - bandwidth-independent synthetic aperture radar (BI-SAR), where the movable base station (BS) revolves along the static targets by 360 degrees; as well as bandwidth-independent inverse synthetic aperture radar (BI-ISAR), where the BS is static and the targets revolve along an axis by 360 degrees. Different from conventional SAR and ISAR counterparts that rely on range estimation, our proposed imaging schemes solely utilize Doppler information to perform imaging without any range information. The main technical challenge of our schemes lies in the anisotropic scattering functions over different directions, which hinder the coherent synthesis of the backscattered signals from all directions. We design an iterative adaptive approach-based Doppler association (IAA-DA) algorithm to tackle the above issue. Moreover, we also derive the imaging resolution to characterize the reconstruction quality. Real-world experiments are provided to show the feasibility and the effectiveness of our proposed 6G imaging schemes.
CLMay 27, 2025Code
UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI AgentsHan Xiao, Guozhi Wang, Yuxiang Chai et al.
In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently pro- cesses historical context and unifies action-level and task-level rewards. To sup- port the training of UI-Genie-RM, we develop deliberately-designed data genera- tion strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI- Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory gen- eration without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research in https://github.com/Euphoria16/UI-Genie.
SPApr 1
DF-3DRME: A Data-Friendly Learning Framework for 3D Radio Map Estimation based on Super-Resolution TechniqueLin Zhu, Weifeng Zhu, Shuowen Zhang et al.
High-Resolution three-dimensional (3D) radio maps (RMs) provide rich information about the radio landscape that is essential to a myriad of wireless applications in the future wireless networks. Although deep learning (DL) methods have shown their effectiveness in RM construction, existing approaches require massive high-resolution 3D RM samples in the training dataset, the acquisition of which is labor-intensive and time-consuming in practice. In this paper, our goal is to devise a data-friendly high-resolution 3D RM construction solution via training over a hybrid dataset, wherein the RMs associated with a small fraction of environment maps (EMs) are of high-resolution, while those corresponding to the majority of EMs are of low-resolution. To this end, we propose a Data-Friendly 3D Radio Map Estimator (DF-3DRME), which comprises two processing stages. Specifically, in the first stage, we leverage the abundant low-resolution 3D RM samples to train a neural network, termed the LR-Net, for predicting the low-resolution 3D RM from the input EM, which provides a coarse characterization of the spatial radio propagation. In the second stage, we employ an advanced super-resolution network, termed the SR-Net, to upscale the predicted low-resolution 3D RM to its high-resolution counterpart. Unlike the LR-Net, the SR-Net can be effectively trained with only the limited high-resolution 3D RM samples available in the hybrid dataset. Experimental results demonstrate that the proposed framework achieves compelling reconstruction performance with only 4% of the EMs in the dataset having high-resolution 3D RM labels, which significantly reduces data acquisition overhead and facilitates practical deployment.
SIDec 11, 2025
Understanding Toxic Interaction Across User and Video Clusters in Social Video PlatformsQiao Wang, Liang Liu, Mitsuo Yoshida
Social video platforms shape how people access information, while recommendation systems can narrow exposure and increase the risk of toxic interaction. Previous research has often examined text or users in isolation, overlooking the structural context in which such toxic interactions occur. Without considering who interacts with whom and around what content, it is difficult to explain why negative expressions cluster within particular communities. To address this issue, this study focuses on the Chinese social video platform Bilibili, incorporating video-level information as the environment for user expression, modeling users and videos in an interaction matrix. After normalization and dimensionality reduction, we perform separate clustering on both sides of the video-user interaction matrix with K-means. Cluster assignments facilitate comparisons of user behavior, including message length, posting frequency, and source (barrage and comment), as well as textual features such as sentiment and toxicity, and video attributes defined by uploaders. Such a clustering approach integrates structural ties with content signals to identify stable groups of videos and users. We find clear stratification in interaction style (message length, comment ratio) across user clusters, while sentiment and toxicity differences are weak or inconsistent across video clusters. Across video clusters, viewing volume exhibits a clear hierarchy, with higher exposure groups concentrating more toxic expressions. For such a group, platforms should require timely intervention during periods of rapid growth. Across user clusters, comment ratio and message length form distinct hierarchies, and several clusters with longer and comment-oriented messages exhibit lower toxicity. For such groups, platforms should strengthen mechanisms that sustain rational dialogue and encourage engagement across topics.
AIDec 1, 2025
RoboDriveVLM: A Novel Benchmark and Baseline towards Robust Vision-Language Models for Autonomous DrivingDacheng Liao, Mengshi Qi, Peng Shu et al.
Current Vision-Language Model (VLM)-based end-to-end autonomous driving systems often leverage large language models to generate driving decisions directly based on their understanding of the current scene. However, such systems introduce multiple risks in real-world driving scenarios. To evaluate whether VLMs are truly viable for autonomous driving, we introduce RoboDriveBench, the first robustness benchmark focused on end-to-end trajectory prediction tasks. This benchmark systematically evaluates two critical categories of real-world challenges for VLM-based end-to-end autonomous driving systems through 11 simulated scenarios encompassing various corruption types, including 6 scenarios of sensor corruption caused by environmental variations, along with 5 cases of prompt corruption resulting from human intervention and data transmission failures. Each corruption type includes 250 unique driving scenarios and 5,689 frames, resulting in 64,559 total trajectory prediction cases per evaluation. To overcome these real-world challenges, we propose a novel VLM-based autonomous driving framework called RoboDriveVLM, which enhances robustness by mapping more multimodal data-e.g., lidar and radar-into a unified latent space. Furthermore, we introduce a new Test-Time Adaptation (TTA) method based on cross-modal knowledge distillation to improve the robustness of VLM-based autonomous driving systems. Through extensive experiments, our work highlights the limitations of current VLM-based end-to-end autonomous driving systems and provides a more reliable solution for real-world deployment. Source code and datasets will be released.
ARMay 13
Efficient Implementation of an Adaptive Transformer Accelerator for Massive MIMO Outdoor LocalizationIlayda Yaman, Sijia Cheng, Ove Edfors et al.
We present the implementation of an adaptive Transformer-based localization system for 5G massive MIMO targeting sub-10ms real-time positioning. The design exploits propagation characteristics, where beam-delay channel representations exhibit sparsity, enabling a row-wise skipping mechanism that removes low-energy beam components with minimal control overhead. The contribution is focused on hardware realization of the model using a mixed dataflow architecture, combining input- and output-stationary execution, mapped onto a heterogeneous vector processing engine with parallel processing elements and adder trees for efficient matrix computation. Environment-dependent processing is supported through a lightweight runtime model-switching mechanism, where temporally filtered outputs of a single-layer perceptron router enable stable selection between specialized models with reduced latency. Implemented on a Xilinx Zynq UltraScale+ FPGA and evaluated on real-world massive MIMO measurements, the design achieves up to 65% row sparsity, yielding peak computational speedups of approximately 2x while limiting the average localization accuracy degradation to below 10%, relative to the floating-point baseline model. The accelerator attains below 1.15m localization accuracy across scenarios, with inference latency of 0.51-2.11ms and throughput of up to 1961 positions/s. These results demonstrate that propagation-aware sparsity, mixed dataflow execution, and efficient runtime model switching enable a scalable and low-latency hardware realization of adaptive Transformer-based localization for real-time 5G systems.
AIJul 29, 2025Code
SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented GenerationHao Ye, Mengshi Qi, Zhaohong Liu et al.
In this work, we study how vision-language models (VLMs) can be utilized to enhance the safety for the autonomous driving system, including perception, situational understanding, and path planning. However, existing research has largely overlooked the evaluation of these models in traffic safety-critical driving scenarios. To bridge this gap, we create the benchmark (SafeDrive228K) and propose a new baseline based on VLM with knowledge graph-based retrieval-augmented generation (SafeDriveRAG) for visual question answering (VQA). Specifically, we introduce SafeDrive228K, the first large-scale multimodal question-answering benchmark comprising 228K examples across 18 sub-tasks. This benchmark encompasses a diverse range of traffic safety queries, from traffic accidents and corner cases to common safety knowledge, enabling a thorough assessment of the comprehension and reasoning abilities of the models. Furthermore, we propose a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach that employs a novel multi-scale subgraph retrieval algorithm for efficient information retrieval. By incorporating traffic safety guidelines collected from the Internet, this framework further enhances the model's capacity to handle safety-critical situations. Finally, we conduct comprehensive evaluations on five mainstream VLMs to assess their reliability in safety-sensitive driving tasks. Experimental results demonstrate that integrating RAG significantly improves performance, achieving a +4.73% gain in Traffic Accidents tasks, +8.79% in Corner Cases tasks and +14.57% in Traffic Safety Commonsense across five mainstream VLMs, underscoring the potential of our proposed benchmark and methodology for advancing research in traffic safety. Our source code and data are available at https://github.com/Lumos0507/SafeDriveRAG.
CVJun 25, 2025Code
Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language ModelsWeiyi Zhao, Xiaoyu Tan, Liang Liu et al.
Surgical risk identification is critical for patient safety and reducing preventable medical errors. While multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, they often exhibit visual-semantic knowledge conflicts (VS-KC), failing to identify visual safety violations despite understanding textual rules. To address this, we introduce a dataset comprising over 34,000 synthetic images generated by diffusion models, depicting operating room scenes containing entities that violate established safety rules. These images were created to alleviate data scarcity and examine MLLMs vulnerabilities. In addition, the dataset includes 214 human-annotated images that serve as a gold-standard reference for validation. This comprehensive dataset, spanning diverse perspectives, stages, and configurations, is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs' detection of trained conflict entities and generalizes well to new viewpoints for these entities, but performance on untrained entity types remains poor, highlighting learning specificity and the need for comprehensive training. The main contributions of this work include: (1) a data generation methodology tailored for rule-violation scenarios; (2) the release of the OR-VSKC dataset and its associated benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at https://github.com/zgg2577/VS-KC.
CVJan 28, 2025Code
Target-driven Self-Distillation for Partial Observed Trajectories ForecastingPengfei Zhu, Peng Shu, Mengshi Qi et al.
Accurate prediction of future trajectories of traffic agents is essential for ensuring safe autonomous driving. However, partially observed trajectories can significantly degrade the performance of even state-of-the-art models. Previous approaches often rely on knowledge distillation to transfer features from fully observed trajectories to partially observed ones. This involves firstly training a fully observed model and then using a distillation process to create the final model. While effective, they require multi-stage training, making the training process very expensive. Moreover, knowledge distillation can lead to a performance degradation of the model. In this paper, we introduce a Target-driven Self-Distillation method (TSD) for motion forecasting. Our method leverages predicted accurate targets to guide the model in making predictions under partial observation conditions. By employing self-distillation, the model learns from the feature distributions of both fully observed and partially observed trajectories during a single end-to-end training process. This enhances the model's ability to predict motion accurately in both fully observed and partially observed scenarios. We evaluate our method on multiple datasets and state-of-the-art motion forecasting models. Extensive experimental results demonstrate that our approach achieves significant performance improvements in both settings. To facilitate further research, we will release our code and model checkpoints.
CVNov 28, 2024Code
Improving Batch Normalization with TTA for Robust Object Detection in Self-DrivingDacheng Liao, Mengshi Qi, Liang Liu et al.
In current open real-world autonomous driving scenarios, challenges such as sensor failure and extreme weather conditions hinder the generalization of most autonomous driving perception models to these unseen domain due to the domain shifts between the test and training data. As the parameter scale of autonomous driving perception models grows, traditional test-time adaptation (TTA) methods become unstable and often degrade model performance in most scenarios. To address these challenges, this paper proposes two new robust methods to improve the Batch Normalization with TTA for object detection in autonomous driving: (1) We introduce a LearnableBN layer based on Generalized-search Entropy Minimization (GSEM) method. Specifically, we modify the traditional BN layer by incorporating auxiliary learnable parameters, which enables the BN layer to dynamically update the statistics according to the different input data. (2) We propose a new semantic-consistency based dual-stage-adaptation strategy, which encourages the model to iteratively search for the optimal solution and eliminates unstable samples during the adaptation process. Extensive experiments on the NuScenes-C dataset shows that our method achieves a maximum improvement of about 8% using BEVFormer as the baseline model across six corruption types and three levels of severity. We will make our source code available soon.
CVDec 10, 2023Code
AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion ModelTeng Hu, Jiangning Zhang, Ran Yi et al.
Anomaly inspection plays an important role in industrial manufacture. Existing anomaly inspection methods are limited in their performance due to insufficient anomaly data. Although anomaly generation methods have been proposed to augment the anomaly data, they either suffer from poor generation authenticity or inaccurate alignment between the generated anomalies and masks. To address the above problems, we propose AnomalyDiffusion, a novel diffusion-based few-shot anomaly generation model, which utilizes the strong prior information of latent diffusion model learned from large-scale dataset to enhance the generation authenticity under few-shot training data. Firstly, we propose Spatial Anomaly Embedding, which consists of a learnable anomaly embedding and a spatial embedding encoded from an anomaly mask, disentangling the anomaly information into anomaly appearance and location information. Moreover, to improve the alignment between the generated anomalies and the anomaly masks, we introduce a novel Adaptive Attention Re-weighting Mechanism. Based on the disparities between the generated anomaly image and normal sample, it dynamically guides the model to focus more on the areas with less noticeable generated anomalies, enabling generation of accurately-matched anomalous image-mask pairs. Extensive experiments demonstrate that our model significantly outperforms the state-of-the-art methods in generation authenticity and diversity, and effectively improves the performance of downstream anomaly inspection tasks. The code and data are available in https://github.com/sjtuplayer/anomalydiffusion.
CVDec 14, 2020Code
HR-Depth: High Resolution Self-Supervised Monocular Depth EstimationXiaoyang Lyu, Liang Liu, Mengmeng Wang et al.
Self-supervised learning shows great potential in monoculardepth estimation, using image sequences as the only source ofsupervision. Although people try to use the high-resolutionimage for depth estimation, the accuracy of prediction hasnot been significantly improved. In this work, we find thecore reason comes from the inaccurate depth estimation inlarge gradient regions, making the bilinear interpolation er-ror gradually disappear as the resolution increases. To obtainmore accurate depth estimation in large gradient regions, itis necessary to obtain high-resolution features with spatialand semantic information. Therefore, we present an improvedDepthNet, HR-Depth, with two effective strategies: (1) re-design the skip-connection in DepthNet to get better high-resolution features and (2) propose feature fusion Squeeze-and-Excitation(fSE) module to fuse feature more efficiently.Using Resnet-18 as the encoder, HR-Depth surpasses all pre-vious state-of-the-art(SoTA) methods with the least param-eters at both high and low resolution. Moreover, previousstate-of-the-art methods are based on fairly complex and deepnetworks with a mass of parameters which limits their realapplications. Thus we also construct a lightweight networkwhich uses MobileNetV3 as encoder. Experiments show thatthe lightweight network can perform on par with many largemodels like Monodepth2 at high-resolution with only20%parameters. All codes and models will be available at https://github.com/shawLyu/HR-Depth.
AIMay 9
What Will Happen Next: Large Models-Driven Deduction for Emergency InstancesZhengqing Hu, Dong Chen, Junkun Yuan et al.
Traditional simulation methods reproduce occurred emergency instances through presetting to assist people in risk assessment and emergency decision-making. However, due to the lack of randomness and diversity, existing simulation systems struggle to fully explore the potential risk as emergency instances are scarce. In contrast, Large Models (LMs) can dynamically adjust generation strategies to introduce controllable randomness, while also possessing extensive prior knowledge and cross-domain knowledge transfer capabilities. Inspired by it, we propose the LMs-driven World Line Divergence System (WLDS), which enables diversified visualization and deduction of emergency instances in different domains. WLDS leverages LMs to deduce emergency instances in various development directions, and introduces the factual calibration and logical calibration mechanism to ensure factual accuracy and logical rigor during the deduction process. The interactive module can independently select deduction directions to avoid potential hallucinations that are difficult for the system to identify. Furthermore, by introducing the visualization module, WLDS forms simulation and deduction that combine text and images, which enhances interpretability. Extensive experiments conducted on the proposed Emergency Instances Deduction (EID) benchmark dataset demonstrate that WLDS achieves high-precision and high-fidelity simulation and deduction of emergency instances in multiple specific domains. Relevant experiments further demonstrate that WLDS can generate more emergency instances deduction data for users and provide support for better decision-making in similar emergency instances in the future.
SPMay 4
A Scalable 256-Antenna Distributed MIMO Testbed with Real-Time Fully Digital BeamformingDumitra Iancu, Vilgot Snygg, Sijia Cheng et al.
Distributed massive MIMO (D-MIMO) is a promising technology for future generation wireless systems as it takes advantage of both an increased array aperture and a decentralized processing architecture and topology. In order to truly understand the possibilities and limitations of these approaches in real scenarios, practical realization of testbeds is an essential step in the technology advancement. This work presents the Lund University Large Intelligent Surface testbed -- LuLIS, that can operate up to 256 coherent radio frequency (RF) chains using 16 AMD Zynq UltraScale RFSoC ZCU216 evaluation boards acting as distributed processing nodes. Real-time processing is facilitated by acceleration and distribution of MIMO processing algorithms on the FPGA fabric of the boards. The system is easily scalable, as increasing the number of antennas is done in multiples of 16 by adding more RFSoCs, which also implies addition of another processing node. The design allows up-scaling without hardware redesign, introduction of large latencies or data transfer overhead. The testbed is flexible in terms of deployment, with options of fully distributing the nodes (as in D-MIMO) or co-locating them (as in more traditional Massive MIMO). A detailed description of the implementation of the testbed is presented and initial results are shown for an uplink (UL) transmission from four single-antenna user equipments (UEs) to 64, 128 and 256 base-station antennas.
CVMar 18, 2024
Learning Unified Reference Representation for Unsupervised Multi-class Anomaly DetectionLiren He, Zhengkai Jiang, Jinlong Peng et al.
In the field of multi-class anomaly detection, reconstruction-based methods derived from single-class anomaly detection face the well-known challenge of "learning shortcuts", wherein the model fails to learn the patterns of normal samples as it should, opting instead for shortcuts such as identity mapping or artificial noise elimination. Consequently, the model becomes unable to reconstruct genuine anomalies as normal instances, resulting in a failure of anomaly detection. To counter this issue, we present a novel unified feature reconstruction-based anomaly detection framework termed RLR (Reconstruct features from a Learnable Reference representation). Unlike previous methods, RLR utilizes learnable reference representations to compel the model to learn normal feature patterns explicitly, thereby prevents the model from succumbing to the "learning shortcuts" issue. Additionally, RLR incorporates locality constraints into the learnable reference to facilitate more effective normal pattern capture and utilizes a masked learnable key attention mechanism to enhance robustness. Evaluation of RLR on the 15-category MVTec-AD dataset and the 12-category VisA dataset shows superior performance compared to state-of-the-art methods under the unified setting. The code of RLR will be publicly available.
CRFeb 19, 2025
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative ChaosYang Yao, Xuan Tong, Ruofan Wang et al.
Large Reasoning Models (LRMs) have significantly advanced beyond traditional Large Language Models (LLMs) with their exceptional logical reasoning capabilities, yet these improvements introduce heightened safety risks. When subjected to jailbreak attacks, their ability to generate more targeted and organized content can lead to greater harm. Although some studies claim that reasoning enables safer LRMs against existing LLM attacks, they overlook the inherent flaws within the reasoning process itself. To address this gap, we propose the first jailbreak attack targeting LRMs, exploiting their unique vulnerabilities stemming from the advanced reasoning capabilities. Specifically, we introduce a Chaos Machine, a novel component to transform attack prompts with diverse one-to-one mappings. The chaos mappings iteratively generated by the machine are embedded into the reasoning chain, which strengthens the variability and complexity and also promotes a more robust attack. Based on this, we construct the Mousetrap framework, which makes attacks projected into nonlinear-like low sample spaces with mismatched generalization enhanced. Also, due to the more competing objectives, LRMs gradually maintain the inertia of unpredictable iterative reasoning and fall into our trap. Success rates of the Mousetrap attacking o1-mini, Claude-Sonnet and Gemini-Thinking are as high as 96%, 86% and 98% respectively on our toxic dataset Trotter. On benchmarks such as AdvBench, StrongREJECT, and HarmBench, attacking Claude-Sonnet, well-known for its safety, Mousetrap can astonishingly achieve success rates of 87.5%, 86.58% and 93.13% respectively. Attention: This paper contains inappropriate, offensive and harmful content.
AIJan 2, 2025
A3: Android Agent Arena for Mobile GUI AgentsYuxiang Chai, Hanhao Li, Jiayu Zhang et al.
AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) automated business-level LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations and a new autonomous evaluation process for less human labor and coding expertise. The project is available at https://yuxiangchai.github.io/Android-Agent-Arena/.
CVJan 6, 2024
Self-supervised Feature Adaptation for 3D Industrial Anomaly DetectionYuanpeng Tu, Boshen Zhang, Liang Liu et al.
Industrial anomaly detection is generally addressed as an unsupervised task that aims at locating defects with only normal training samples. Recently, numerous 2D anomaly detection methods have been proposed and have achieved promising results, however, using only the 2D RGB data as input is not sufficient to identify imperceptible geometric surface anomalies. Hence, in this work, we focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets, i.e., ImageNet, to construct feature databases. And we empirically find that directly using these pre-trained models is not optimal, it can either fail to detect subtle defects or mistake abnormal features as normal ones. This may be attributed to the domain gap between target industrial data and source data.Towards this problem, we propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection.Both intra-modal adaptation and cross-modal alignment are optimized from a local-to-global perspective in LSFA to ensure the representation quality and consistency in the inference stage.Extensive experiments demonstrate that our method not only brings a significant performance boost to feature embedding based approaches, but also outperforms previous State-of-The-Art (SoTA) methods prominently on both MVTec-3D AD and Eyecandies datasets, e.g., LSFA achieves 97.1% I-AUROC on MVTec-3D, surpass previous SoTA by +3.4%.
CVNov 28, 2024
T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous DrivingChangsheng Lv, Mengshi Qi, Liang Liu et al.
Understanding the traffic scenes and then generating high-definition (HD) maps present significant challenges in autonomous driving. In this paper, we defined a novel Traffic Topology Scene Graph, a unified scene graph explicitly modeling the lane, controlled and guided by different road signals (e.g., right turn), and topology relationships among them, which is always ignored by previous high-definition (HD) mapping methods. For the generation of T2SG, we propose TopoFormer, a novel one-stage Topology Scene Graph TransFormer with two newly designed layers. Specifically, TopoFormer incorporates a Lane Aggregation Layer (LAL) that leverages the geometric distance among the centerline of lanes to guide the aggregation of global information. Furthermore, we proposed a Counterfactual Intervention Layer (CIL) to model the reasonable road structure ( e.g., intersection, straight) among lanes under counterfactual intervention. Then the generated T2SG can provide a more accurate and explainable description of the topological structure in traffic scenes. Experimental results demonstrate that TopoFormer outperforms existing methods on the T2SG generation task, and the generated T2SG significantly enhances traffic topology reasoning in downstream tasks, achieving a state-of-the-art performance of 46.3 OLS on the OpenLane-V2 benchmark. We will release our source code and model.
SPApr 21
Networked Tracking of Multiple Moving Targets in 6G NetworkYanmo Hu, Weifeng Zhu, Chenshu Wu et al.
This paper considers a networked tracking architecture in 6G integrated sensing and communication (ISAC) systems, where multiple base stations (BSs) cooperatively transmit radio signals and process received echo signals to track multiple moving targets. Compared to the single-BS counterpart, networked tracking allows the moving targets to be associated with different BSs over time such that the wireless resources can be dynamically allocated among BSs based on target locations. However, networked tracking imposes new challenges for algorithm design and resource allocation. In this paper, we first design the networked Kalman Filter (NKF) that is suitable for multi-BS based tracking, then characterize the posterior Cramer-Rao bound (PCRB) under this NKF, and last design the beamforming vectors of all the BSs to minimize the tracking PCRB. Numerical results show that our dynamic beamforming design can properly associate the targets to the suitable BSs at various sensing blocks and reduce the tracking mean-squared error (MSE).
LGFeb 21, 2024
Wisdom of Committee: Distilling from Foundation Model to Specialized Application ModelZichang Liu, Qingyun Liu, Yuening Li et al.
Recent advancements in foundation models have yielded impressive performance across a wide range of tasks. Meanwhile, for specific applications, practitioners have been developing specialized application models. To enjoy the benefits of both kinds of models, one natural path is to transfer the knowledge in foundation models into specialized application models, which are generally more efficient for serving. Techniques from knowledge distillation may be applied here, where the application model learns to mimic the foundation model. However, specialized application models and foundation models have substantial gaps in capacity, employing distinct architectures, using different input features from different modalities, and being optimized on different distributions. These differences in model characteristics lead to significant challenges for distillation methods. In this work, we propose creating a teaching committee comprising both foundation model teachers and complementary teachers. Complementary teachers possess model characteristics akin to the student's, aiming to bridge the gap between the foundation model and specialized application models for a smoother knowledge transfer. Further, to accommodate the dissimilarity among the teachers in the committee, we introduce DiverseDistill, which allows the student to understand the expertise of each teacher and extract task knowledge. Our evaluations demonstrate that adding complementary teachers enhances student performance. Finally, DiverseDistill consistently outperforms baseline distillation methods, regardless of the teacher choices, resulting in significantly improved student performance.
LGFeb 7, 2024
LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different ViewsYuji Roh, Qingyun Liu, Huan Gui et al.
Fine-tuning is becoming widely used for leveraging the power of pre-trained foundation models in new downstream tasks. While there are many successes of fine-tuning on various tasks, recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions (i.e., out-of-distribution; OOD). To improve OOD generalization, some previous studies identify the limitations of fine-tuning data and regulate fine-tuning to preserve the general representation learned from pre-training data. However, potential limitations in the pre-training data and models are often ignored. In this paper, we contend that overly relying on the pre-trained representation may hinder fine-tuning from learning essential representations for downstream tasks and thus hurt its OOD generalization. It can be especially catastrophic when new tasks are from different (sub)domains compared to pre-training data. To address the issues in both pre-training and fine-tuning data, we propose a novel generalizable fine-tuning method LEVI (Layer-wise Ensemble of different VIews), where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model, while preserving its efficiencies. By combining two complementing models, LEVI effectively suppresses problematic features in both the fine-tuning data and pre-trained model and preserves useful features for new tasks. Broad experiments with large language and vision models show that LEVI greatly improves fine-tuning generalization via emphasizing different views from fine-tuning data and pre-trained features.