97.9CLMay 29Code
A Visually Impaired Assistance Benchmark for VLM-as-a-Judge EvaluationYi Zhao, Siqi Wang, Zhe Hu et al.
AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness--Impartiality--Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy, yet exhibits the highest self-preference rate at 94.2%; while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It enables positive improvements in diagnostic accuracy and downstream VIA responses more preferred by BLV users. Data and code are available at: https://github.com/YiyiyiZhao/VIABLE
LGJun 8, 2022Code
ReCo: A Dataset for Residential Community Layout PlanningXi Chen, Yun Xiong, Siqi Wang et al.
Layout planning is centrally important in the field of architecture and urban design. Among the various basic units carrying urban functions, residential community plays a vital part for supporting human life. Therefore, the layout planning of residential community has always been of concern, and has attracted particular attention since the advent of deep learning that facilitates the automated layout generation and spatial pattern recognition. However, the research circles generally suffer from the insufficiency of residential community layout benchmark or high-quality datasets, which hampers the future exploration of data-driven methods for residential community layout planning. The lack of datasets is largely due to the difficulties of large-scale real-world residential data acquisition and long-term expert screening. In order to address the issues and advance a benchmark dataset for various intelligent spatial design and analysis applications in the development of smart city, we introduce Residential Community Layout Planning (ReCo) Dataset, which is the first and largest open-source vector dataset related to real-world community to date. ReCo Dataset is presented in multiple data formats with 37,646 residential community layout plans, covering 598,728 residential buildings with height information. ReCo can be conveniently adapted for residential community layout related urban design tasks, e.g., generative layout design, morphological pattern recognition and spatial evaluation. To validate the utility of ReCo in automated residential community layout planning, two Generative Adversarial Network (GAN) based generative models are further applied to the dataset. We expect ReCo Dataset to inspire more creative and practical work in intelligent design and beyond. The ReCo Dataset is published at: https://www.kaggle.com/fdudsde/reco-dataset.
CVAug 1, 2023Code
Human-M3: A Multi-view Multi-modal Dataset for 3D Human Pose Estimation in Outdoor ScenesBohao Fan, Siqi Wang, Wenxuan Guo et al.
3D human pose estimation in outdoor environments has garnered increasing attention recently. However, prevalent 3D human pose datasets pertaining to outdoor scenes lack diversity, as they predominantly utilize only one type of modality (RGB image or pointcloud), and often feature only one individual within each scene. This limited scope of dataset infrastructure considerably hinders the variability of available data. In this article, we propose Human-M3, an outdoor multi-modal multi-view multi-person human pose database which includes not only multi-view RGB videos of outdoor scenes but also corresponding pointclouds. In order to obtain accurate human poses, we propose an algorithm based on multi-modal data input to generate ground truth annotation. This benefits from robust pointcloud detection and tracking, which solves the problem of inaccurate human localization and matching ambiguity that may exist in previous multi-view RGB videos in outdoor multi-person scenes, and generates reliable ground truth annotations. Evaluation of multiple different modalities algorithms has shown that this database is challenging and suitable for future research. Furthermore, we propose a 3D human pose estimation algorithm based on multi-modal data input, which demonstrates the advantages of multi-modal data input for 3D human pose estimation. Code and data will be released on https://github.com/soullessrobot/Human-M3-Dataset.
79.6CVMay 25Code
ERNIE-Image Technical ReportJiaxiang Liu, Zhida Feng, Pengyu Zou et al.
We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.
CLMay 21, 2025
Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-ThoughtTencent Hunyuan Team, Ao Liu, Botong Zhou et al. · tencent-ai
As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.
CVOct 30, 2023Code
CHAMMI: A benchmark for channel-adaptive models in microscopy imagingZitong Chen, Chau Pham, Siqi Wang et al.
Most neural networks assume that input images have a fixed number of channels (three for RGB images). However, there are many settings where the number of channels may vary, such as microscopy images where the number of channels changes depending on instruments and experimental goals. Yet, there has not been a systemic attempt to create and evaluate neural networks that are invariant to the number and type of channels. As a result, trained models remain specific to individual studies and are hardly reusable for other microscopy settings. In this paper, we present a benchmark for investigating channel-adaptive models in microscopy imaging, which consists of 1) a dataset of varied-channel single-cell images, and 2) a biologically relevant evaluation framework. In addition, we adapted several existing techniques to create channel-adaptive models and compared their performance on this benchmark to fixed-channel, baseline models. We find that channel-adaptive models can generalize better to out-of-domain tasks and can be computationally efficient. We contribute a curated dataset (https://doi.org/10.5281/zenodo.7988357) and an evaluation API (https://github.com/broadinstitute/MorphEm.git) to facilitate objective comparisons in future research and applications.
CVMay 16, 2022Code
An Effective Transformer-based Solution for RSNA Intracranial Hemorrhage Detection CompetitionFangxin Shang, Siqi Wang, Xiaorong Wang et al.
We present an effective method for Intracranial Hemorrhage Detection (IHD) which exceeds the performance of the winner solution in RSNA-IHD competition (2019). Meanwhile, our model only takes quarter parameters and ten percent FLOPs compared to the winner's solution. The IHD task needs to predict the hemorrhage category of each slice for the input brain CT. We review the top-5 solutions for the IHD competition held by the Radiological Society of North America(RSNA) in 2019. Nearly all the top solutions rely on 2D convolutional networks and sequential models (Bidirectional GRU or LSTM) to extract intra-slice and inter-slice features, respectively. All the top solutions enhance the performance by leveraging the model ensemble, and the model number varies from 7 to 31. In the past years, since much progress has been made in the computer vision regime especially Transformer-based models, we introduce the Transformer-based techniques to extract the features in both intra-slice and inter-slice views for IHD tasks. Additionally, a semi-supervised method is embedded into our workflow to further improve the performance. The code is available in the manuscript.
SYDec 17, 2017
Quantitative Control Approach for Wind Turbine Generators to Provide Fast Frequency Response with Guarantee of Rotor SecuritySiqi Wang, Kevin Tomsovic
Wind generation is expected to reach substantially higher levels of penetration in the near future. With the converter interface, the rotor inertia of doubly-fed induction generator (DFIG) based wind turbine generator is effectively decoupled from the system, causing a reduction in inertial response. This can be compensated by enabling the DFIG to provide fast frequency response. This paper proposes a quantitative control approach for DFIG to deliver fast frequency response in the inertial response time scale. A supplementary power surge function is added to the active power reference of DFIG. The exact amount of power surge that is available from DFIG-based wind turbine is quantified based on estimation of maximum extractable energy. Moreover, the operational constraints such as rotor limits and converter over-load limit are considered at the same time. Thus, the proposed approach not only provides adequate inertial response but also ensures the rotor speed is kept within a specified operating range. Rotor safety is guaranteed without the need for an additional rotor speed protection scheme.
LGSep 6, 2024
Active-Passive Federated Learning for Vertically Partitioned Multi-view DataJiyuan Liu, Xinwang Liu, Siqi Wang et al.
Vertical federated learning is a natural and elegant approach to integrate multi-view data vertically partitioned across devices (clients) while preserving their privacies. Apart from the model training, existing methods requires the collaboration of all clients in the model inference. However, the model inference is probably maintained for service in a long time, while the collaboration, especially when the clients belong to different organizations, is unpredictable in real-world scenarios, such as concellation of contract, network unavailablity, etc., resulting in the failure of them. To address this issue, we, at the first attempt, propose a flexible Active-Passive Federated learning (APFed) framework. Specifically, the active client is the initiator of a learning task and responsible to build the complete model, while the passive clients only serve as assistants. Once the model built, the active client can make inference independently. In addition, we instance the APFed framework into two classification methods with employing the reconstruction loss and the contrastive loss on passive clients, respectively. Meanwhile, the two methods are tested in a set of experiments and achieves desired results, validating their effectiveness.
LGFeb 25, 2023
Cross-modal Contrastive Learning for Multimodal Fake News DetectionLongzheng Wang, Chuang Zhang, Hongbo Xu et al.
Automatic detection of multimodal fake news has gained a widespread attention recently. Many existing approaches seek to fuse unimodal features to produce multimodal news representations. However, the potential of powerful cross-modal contrastive learning methods for fake news detection has not been well exploited. Besides, how to aggregate features from different modalities to boost the performance of the decision-making process is still an open question. To address that, we propose COOLANT, a cross-modal contrastive learning framework for multimodal fake news detection, aiming to achieve more accurate image-text alignment. To further improve the alignment precision, we leverage an auxiliary task to soften the loss term of negative samples during the contrast process. A cross-modal fusion module is developed to learn the cross-modality correlations. An attention mechanism with an attention guidance module is implemented to help effectively and interpretably aggregate the aligned unimodal representations and the cross-modality correlations. Finally, we evaluate the COOLANT and conduct a comparative study on two widely used datasets, Twitter and Weibo. The experimental results demonstrate that our COOLANT outperforms previous approaches by a large margin and achieves new state-of-the-art results on the two datasets.
65.4CVMay 25
How Far Has AI Come in Liver Fibrosis Staging? A Large-Scale Real-World Dataset and BenchmarkYuanye Liu, Nannan Shi, Zhejia Zhang et al.
Despite years of methodological progress, how far AI has come in liver fibrosis staging has never been systematically evaluated under the heterogeneous, multi-center conditions that define clinical practice. To address this gap, we introduce LiFS, a large-scale dataset and benchmark derived from the MICCAI 2025 CARE-Liver challenge, comprising 610 patients across multiple centers and scanners with multi-sequence MRI. To the best of our knowledge, LiFS is the first benchmark providing complete gadoxetic acid-enhanced sequences with histopathology-confirmed annotations from diverse real-world scanners. Through systematic evaluation of 9 independently developed methods selected from 96 registered teams against in-cohort radiologist reference results, our findings address how far current AI has progressed toward clinical-level liver fibrosis staging from three complementary perspectives. First, against radiologists, the best AI methods were broadly comparable to the senior radiologist and significantly exceeded the junior radiologist in selected settings, while median AI performance generally approached junior-radiologist levels. Second, from a data perspective, cross-center heterogeneity, label imbalance, and contrast-enhanced sequence variability emerge as the dominant challenges for AI methods. Third, from a technical perspective, methodological design choices, including spatial registration, input dimensionality, multi-modal fusion strategy, and backbone architecture, appear to modulate cross-center robustness, although no single choice alone closes the gap. Overall, LiFS provides a rigorous real-world benchmark for positioning the current state of AI in liver fibrosis staging and for enabling future research on the key challenges that limit clinically reliable deployment.
CLNov 30, 2023
FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, ToxicityShiyao Cui, Zhenyu Zhang, Yilong Chen et al.
The widespread of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts, primarily stemming from factoid, unfair, and toxic content. Previous researchers have invested much effort in assessing the harmlessness of generative language models. However, existing benchmarks are struggling in the era of large language models (LLMs), due to the stronger language generation and instruction following capabilities, as well as wider applications. In this paper, we propose FFT, a new benchmark with 2116 elaborated-designed instances, for LLM harmlessness evaluation with factuality, fairness, and toxicity. To investigate the potential harms of LLMs, we evaluate 9 representative LLMs covering various parameter scales, training stages, and creators. Experiments show that the harmlessness of LLMs is still under-satisfactory, and extensive analysis derives some insightful findings that could inspire future research for harmless LLM research.
CVJun 20, 2023
LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge IntegrationSiqi Wang, Bryan A. Plummer
Learning with noisy labels (LNL) aims to train a high-performing model using a noisy dataset. We observe that noise for a given class often comes from a limited set of categories, yet many LNL methods overlook this. For example, an image mislabeled as a cheetah is more likely a leopard than a hippopotamus due to its visual similarity. Thus, we explore Learning with Noisy Labels with noise source Knowledge integration (LNL+K), which leverages knowledge about likely source(s) of label noise that is often provided in a dataset's meta-data. Integrating noise source knowledge boosts performance even in settings where LNL methods typically fail. For example, LNL+K methods are effective on datasets where noise represents the majority of samples, which breaks a critical premise of most methods developed for LNL. Our LNL+K methods can boost performance even when noise sources are estimated rather than extracted from meta-data. We provide several baseline LNL+K methods that integrate noise source knowledge into state-of-the-art LNL models that are evaluated across six diverse datasets and two types of noise, where we report gains of up to 23% compared to the unadapted methods. Critically, we show that LNL methods fail to generalize on some real-world datasets, even when adapted to integrate noise source knowledge, highlighting the importance of directly exploring LNL+K.
LGDec 4, 2025
RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive DraftingSiqi Wang, Hailong Yang, Junjie Zhu et al.
Reinforcement Learning from Human Feedback (RLHF) is an important fine-tuning technique for large language models (LLMs) and comprises three stages: generation, inference, and training. The generation stage generates samples that are then used to infer learnable experiences for training. We observe that the generation stage is the bottleneck of the entire execution process and consider it a key point for optimization. Specifically, we realize the first attempt to integrate speculative decoding into the RLHF generation stage and propose RLHFSpec, an RLHF system that accelerates generation execution with efficient speculative decoding and sample reallocation. To fully exploit the performance potential provided by speculative decoding, especially dealing with the dynamic workload of the generation stage, RLHFSpec proposes a workload-aware drafting strategy selection mechanism, which selects the near-optimal strategy by jointly considering the verification cost and the number of accepted tokens. Moreover, RLHFSpec also proposes sample reallocation to fully utilize the GPU resources, and optimizes it with an efficient sample migration mechanism. The experimental results show that the RLHFSpec can achieve higher throughput in the generation stage compared to state-of-the-art works. Moreover, due to the effective alleviation of the generation bottleneck, RLHFSpec also shows significant performance speedup in the entire RLHF execution.
CVApr 19, 2024Code
TextSquare: Scaling up Text-Centric Visual Instruction TuningJingqun Tang, Chunhui Lin, Zhen Zhao et al.
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs and sets a new standard on OCRBench(62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.
LGSep 20, 2022
Deep-Steiner: Learning to Solve the Euclidean Steiner Tree ProblemSiqi Wang, Yifan Wang, Guangmo Tong
The Euclidean Steiner tree problem seeks the min-cost network to connect a collection of target locations, and it underlies many applications of wireless networks. In this paper, we present a study on solving the Euclidean Steiner tree problem using reinforcement learning enhanced by graph representation learning. Different from the commonly studied connectivity problems like travelling salesman problem or vehicle routing problem where the search space is finite, the Euclidean Steiner tree problem requires to search over the entire Euclidean space, thereby making the existing methods not applicable. In this paper, we design discretization methods by leveraging the unique characteristics of the Steiner tree, and propose new training schemes for handling the dynamic Steiner points emerging during the incremental construction. Our design is examined through a sanity check using experiments on a collection of datasets, with encouraging results demonstrating the utility of our method as an alternative to classic combinatorial methods.
CVJun 4, 2023
USD: Unknown Sensitive Detector Empowered by Decoupled Objectness and Segment Anything ModelYulin He, Wei Chen, Yusong Tan et al.
Open World Object Detection (OWOD) is a novel and challenging computer vision task that enables object detection with the ability to detect unknown objects. Existing methods typically estimate the object likelihood with an additional objectness branch, but ignore the conflict in learning objectness and classification boundaries, which oppose each other on the semantic manifold and training objective. To address this issue, we propose a simple yet effective learning strategy, namely Decoupled Objectness Learning (DOL), which divides the learning of these two boundaries into suitable decoder layers. Moreover, detecting unknown objects comprehensively requires a large amount of annotations, but labeling all unknown objects is both difficult and expensive. Therefore, we propose to take advantage of the recent Large Vision Model (LVM), specifically the Segment Anything Model (SAM), to enhance the detection of unknown objects. Nevertheless, the output results of SAM contain noise, including backgrounds and fragments, so we introduce an Auxiliary Supervision Framework (ASF) that uses a pseudo-labeling and a soft-weighting strategies to alleviate the negative impact of noise. Extensive experiments on popular benchmarks, including Pascal VOC and MS COCO, demonstrate the effectiveness of our approach. Our proposed Unknown Sensitive Detector (USD) outperforms the recent state-of-the-art methods in terms of Unknown Recall, achieving significant improvements of 14.3\%, 15.5\%, and 8.9\% on the M-OWODB, and 27.1\%, 29.1\%, and 25.1\% on the S-OWODB.
CVJul 15, 2024
2D Neural Fields with Learned DiscontinuitiesChenxi Liu, Siqi Wang, Matthew Fisher et al.
Effective representation of 2D images is fundamental in digital image processing, where traditional methods like raster and vector graphics struggle with sharpness and textural complexity respectively. Current neural fields offer high-fidelity and resolution independence but require predefined meshes with known discontinuities, restricting their utility. We observe that by treating all mesh edges as potential discontinuities, we can represent the magnitude of discontinuities with continuous variables and optimize. Based on this observation, we introduce a novel discontinuous neural field model that jointly approximate the target image and recovers discontinuities. Through systematic evaluations, our neural field demonstrates superior performance in denoising and super-resolution tasks compared to InstantNGP, achieving improvements of over 5dB and 10dB, respectively. Our model also outperforms Mumford-Shah-based methods in accurately capturing discontinuities, with Chamfer distances 3.5x closer to the ground truth. Additionally, our approach shows remarkable capability in handling complex artistic drawings and natural images.
LGSep 20, 2022
Learning the Propagation of Worms in Wireless Sensor NetworksYifan Wang, Siqi Wang, Guangmo Tong
Wireless sensor networks (WSNs) are composed of spatially distributed sensors and are considered vulnerable to attacks by worms and their variants. Due to the distinct strategies of worms propagation, the dynamic behavior varies depending on the different features of the sensors. Modeling the spread of worms can help us understand the worm attack behaviors and analyze the propagation procedure. In this paper, we design a communication model under various worms. We aim to learn our proposed model to analytically derive the dynamics of competitive worms propagation. We develop a new searching space combined with complex neural network models. Furthermore, the experiment results verified our analysis and demonstrated the performance of our proposed learning algorithms.
AISep 21, 2024
A Survey on Large Language Model-empowered Autonomous DrivingYuxuan Zhu, Shiyi Wang, Wenqing Zhong et al.
Artificial intelligence (AI) plays a crucial role in autonomous driving (AD) research, propelling its development towards intelligence and efficiency. Currently, the development of AD technology follows two main technical paths: modularization and end-to-end. Modularization decompose the driving task into modules such as perception, prediction, planning, and control, and train them separately. Due to the inconsistency of training objectives between modules, the integrated effect suffers from bias. End-to-end attempts to address this issue by utilizing a single model that directly maps from sensor data to control signals. This path has limited learning capabilities in a comprehensive set of features and struggles to handle unpredictable long-tail events and complex urban traffic scenarios. In the face of challenges encountered in both paths, many researchers believe that large language models (LLMs) with powerful reasoning capabilities and extensive knowledge understanding may be the solution, expecting LLMs to provide AD systems with deeper levels of understanding and decision-making capabilities. In light of the challenges faced by both paths, many researchers believe that LLMs, with their powerful reasoning abilities and extensive knowledge, could offer a solution. To understand if LLMs could enhance AD, this paper conducts a thorough analysis of the potential applications of LLMs in AD systems, including exploring their optimization strategies in both modular and end-to-end approaches, with a particular focus on how LLMs can tackle the problems and challenges present in current solutions. Furthermore, we discuss an important question: Can LLM-based artificial general intelligence (AGI) be a key to achieve high-level AD? We further analyze the potential limitations and challenges that LLMs may encounter in promoting the development of AD technology.
AINov 24, 2024Code
Decoding Urban Industrial Complexity: Enhancing Knowledge-Driven Insights via IndustryScopeGPTSiqi Wang, Chao Liang, Yunfan Gao et al.
Industrial parks are critical to urban economic growth. Yet, their development often encounters challenges stemming from imbalances between industrial requirements and urban services, underscoring the need for strategic planning and operations. This paper introduces IndustryScopeKG, a pioneering large-scale multi-modal, multi-level industrial park knowledge graph, which integrates diverse urban data including street views, corporate, socio-economic, and geospatial information, capturing the complex relationships and semantics within industrial parks. Alongside this, we present the IndustryScopeGPT framework, which leverages Large Language Models (LLMs) with Monte Carlo Tree Search to enhance tool-augmented reasoning and decision-making in Industrial Park Planning and Operation (IPPO). Our work significantly improves site recommendation and functional planning, demonstrating the potential of combining LLMs with structured datasets to advance industrial park management. This approach sets a new benchmark for intelligent IPPO research and lays a robust foundation for advancing urban industrial development. The dataset and related code are available at https://github.com/Tongji-KGLLM/IndustryScope.
IMMar 2
PhysFormer: A Physics-Embedded Generative Model for Physically Self-Consistent Spectral SynthesisSiqi Wang, Mengmeng Zhang, Yude Bu et al.
In scientific and engineering domains, modeling high-dimensional complex systems governed by partial differential equations (PDEs) remains challenging in terms of physical consistency and numerical stability. However, existing approaches, such as physics-informed neural networks (PINNs), typically rely on known physical fields or coefficients and enforce physical constraints via external loss functions, which can lead to training instability and make it difficult to handle high-dimensional or unobservable scenarios. To this end, we propose PhysFormer, a generative modeling framework that is self-consistent at both the data and physical levels. PhysFormer leverages a low-dimensional, physically interpretable latent space to learn key physical quantities directly from data without requiring known high-dimensional physical field parameters, and embeds the physical process of radiative flux generation within the network to ensure the physical consistency of the generated spectra. In high-dimensional, degenerate inversion tasks, PhysFormer constrains generation within physical limits and enhances spectral fidelity and inversion stability under varying signal-to-noise ratios (SNRs). More broadly, this approach shifts the physical processes from external loss functions into the generative mechanism itself, providing a physically consistent generative modeling paradigm for complex systems involving unknown or unobservable physical quantities.
75.5GRApr 15
A Unified Conditional Flow for Motion Generation, Editing, and Intra-Structural RetargetingJunlin Li, Xinhao Song, Siqi Wang et al.
Text-driven motion editing and intra-structural retargeting, where source and target share topology but may differ in bone lengths, are traditionally handled by fragmented pipelines with incompatible inputs and representations: editing relies on specialized generative steering, while retargeting is deferred to geometric post-processing. We present a unifying perspective where both tasks are cast as instances of conditional transport within a single generative framework. By leveraging recent advances in flow matching, we demonstrate that editing and retargeting are fundamentally the same generative task, distinguished only by which conditioning signal, semantic or structural, is modulated during inference. We implement this vision via a rectified-flow motion model jointly conditioned on text prompts and target skeletal structures. Our architecture extends a DiT-style transformer with per-joint tokenization and explicit joint self-attention to strictly enforce kinematic dependencies, while a multi-condition classifier-free guidance strategy balances text adherence with skeletal conformity. Experiments on SnapMoGen and a multi-character Mixamo subset show that a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting. This unified approach simplifies deployment and improves structural consistency compared to task-specific baselines.
GRDec 13, 2024Code
AniSora: Exploring the Frontiers of Animation Video Generation in the Sora EraYudong Jiang, Baohan Xu, Siqian Yang et al.
Animation has gained significant interest in the recent film and TV industry. Despite the success of advanced video generation models like Sora, Kling, and CogVideoX in generating natural videos, they lack the same effectiveness in handling animation videos. Evaluating animation video generation is also a great challenge due to its unique artist styles, violating the laws of physics and exaggerated motions. In this paper, we present a comprehensive system, AniSora, designed for animation video generation, which includes a data processing pipeline, a controllable generation model, and an evaluation benchmark. Supported by the data processing pipeline with over 10M high-quality data, the generation model incorporates a spatiotemporal mask module to facilitate key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 various animation videos, with specifically developed metrics for animation video generation. Our entire project is publicly available on https://github.com/bilibili/Index-anisora/tree/main.
AIDec 18, 2025
CitySeeker: How Do VLMS Explore Embodied Urban Navigation With Implicit Human Needs?Siqi Wang, Chao Liang, Yunfan Gao et al.
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies-Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with robust spatial intelligence required for tackling "last-mile" navigation challenges.
CLJun 4, 2025Code
LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower RewardYi Zhao, Siqi Wang, Jing Li
Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study focuses on generating precise, in-situ, step-by-step navigation instructions that are practically usable for VI users. Specifically, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to navigation instructions, thereby providing feedback rewards to guide the post-training of a Vision-Language Model (VLM). This enhances instruction accuracy and usability while reducing costly real-world data collection needs. To address the scarcity of dedicated benchmarks in this field, we introduce NIG4VI, a 27k-sample open-source dataset to facilitate training and evaluation. It comprises diverse navigation scenarios with accurate spatial coordinates, supporting detailed and open-ended in-situ instruction generation. Experiments on NIG4VI demonstrate the effectiveness of LaF-GRPO through quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU 14\%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o 0.323), and qualitative analysis further confirms that our method yields more intuitive and safer instructions.
DCOct 16, 2025Code
xLLM Technical ReportTongxuan Liu, Tao Peng, Peijun Yang et al.
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address these challenges, xLLM builds a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module also relies on a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture to provide global KV Cache management and robust fault-tolerant capabilities for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode and an xTensor memory management. xLLM-Engine also further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, collectively serving to substantially boost throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency. Under identical TPOT constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with Deepseek-series models. xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.
CLJun 25, 2024Code
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue CoreferenceErxin Yu, Jing Li, Ming Liao et al.
As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.
CVAug 5, 2021Code
Video Abnormal Event Detection by Learning to Complete Visual Cloze TestsSiqi Wang, Guang Yu, Zhiping Cai et al.
Although deep neural networks (DNNs) enable great progress in video abnormal event detection (VAD), existing solutions typically suffer from two issues: (1) The localization of video events cannot be both precious and comprehensive. (2) The semantics and temporal context are under-explored. To tackle those issues, we are motivated by the prevalent cloze test in education and propose a novel approach named Visual Cloze Completion (VCC), which conducts VAD by learning to complete "visual cloze tests" (VCTs). Specifically, VCC first localizes each video event and encloses it into a spatio-temporal cube (STC). To achieve both precise and comprehensive localization, appearance and motion are used as complementary cues to mark the object region associated with each event. For each marked region, a normalized patch sequence is extracted from current and adjacent frames and stacked into a STC. With each patch and the patch sequence of a STC compared to a visual "word" and "sentence" respectively, we deliberately erase a certain "word" (patch) to yield a VCT. Then, the VCT is completed by training DNNs to infer the erased patch and its optical flow via video semantics. Meanwhile, VCC fully exploits temporal context by alternatively erasing each patch in temporal context and creating multiple VCTs. Furthermore, we propose localization-level, event-level, model-level and decision-level solutions to enhance VCC, which can further exploit VCC's potential and produce significant performance improvement gain. Extensive experiments demonstrate that VCC achieves state-of-the-art VAD performance. Our codes and results are open at https://github.com/yuguangnudt/VEC_VAD/tree/VCC.
CVApr 27, 2021Code
Multi-view Deep One-class Classification: A Systematic ExplorationSiqi Wang, Jiyuan Liu, Guang Yu et al.
One-class classification (OCC), which models one single positive class and distinguishes it from the negative class, has been a long-standing topic with pivotal application to realms like anomaly detection. As modern society often deals with massive high-dimensional complex data spawned by multiple sources, it is natural to consider OCC from the perspective of multi-view deep learning. However, it has not been discussed by the literature and remains an unexplored topic. Motivated by this blank, this paper makes four-fold contributions: First, to our best knowledge, this is the first work that formally identifies and formulates the multi-view deep OCC problem. Second, we take recent advances in relevant areas into account and systematically devise eleven different baseline solutions for multi-view deep OCC, which lays the foundation for research on multi-view deep OCC. Third, to remedy the problem that limited benchmark datasets are available for multi-view deep OCC, we extensively collect existing public data and process them into more than 30 new multi-view benchmark datasets via multiple means, so as to provide a publicly available evaluation platform for multi-view deep OCC. Finally, by comprehensively evaluating the devised solutions on benchmark datasets, we conduct a thorough analysis on the effectiveness of the designed baselines, and hopefully provide other researchers with beneficial guidance and insight to multi-view deep OCC. Our data and codes are opened at https://github.com/liujiyuan13/MvDOCC-datasets and https://github.com/liujiyuan13/MvDOCC-code respectively to facilitate future research.
CVSep 5, 2020Code
A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-IdentificationHaocong Rao, Siqi Wang, Xiping Hu et al.
Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages. Existing solutions either rely on hand-crafted descriptors or supervised gait representation learning. This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID. Specifically, we first create self-supervision by learning to reconstruct unlabeled skeleton sequences reversely, which involves richer high-level semantics to obtain better gait representations. Other pretext tasks are also explored to further improve self-supervised learning. Second, inspired by the fact that motion's continuity endows adjacent skeletons in one skeleton sequence and temporally consecutive skeleton sequences with higher correlations (referred as locality in 3D skeleton data), we propose a locality-aware attention mechanism and a locality-aware contrastive learning scheme, which aim to preserve locality-awareness on intra-sequence level and inter-sequence level respectively during self-supervised learning. Last, with context vectors learned by our locality-aware attention mechanism and contrastive learning scheme, a novel feature named Constrastive Attention-based Gait Encodings (CAGEs) is designed to represent gait effectively. Empirical evaluations show that our approach significantly outperforms skeleton-based counterparts by 15-40% Rank-1 accuracy, and it even achieves superior performance to numerous multi-modal methods with extra RGB or depth information. Our codes are available at https://github.com/Kali-Hac/Locality-Awareness-SGE.
CVAug 27, 2020Code
Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video EventsGuang Yu, Siqi Wang, Zhiping Cai et al.
As a vital topic in media content interpretation, video anomaly detection (VAD) has made fruitful progress via deep neural network (DNN). However, existing methods usually follow a reconstruction or frame prediction routine. They suffer from two gaps: (1) They cannot localize video activities in a both precise and comprehensive manner. (2) They lack sufficient abilities to utilize high-level semantics and temporal context information. Inspired by frequently-used cloze test in language study, we propose a brand-new VAD solution named Video Event Completion (VEC) to bridge gaps above: First, we propose a novel pipeline to achieve both precise and comprehensive enclosure of video activities. Appearance and motion are exploited as mutually complimentary cues to localize regions of interest (RoIs). A normalized spatio-temporal cube (STC) is built from each RoI as a video event, which lays the foundation of VEC and serves as a basic processing unit. Second, we encourage DNN to capture high-level semantics by solving a visual cloze test. To build such a visual cloze test, a certain patch of STC is erased to yield an incomplete event (IE). The DNN learns to restore the original video event from the IE by inferring the missing patch. Third, to incorporate richer motion dynamics, another DNN is trained to infer erased patches' optical flow. Finally, two ensemble strategies using different types of IE and modalities are proposed to boost VAD performance, so as to fully exploit the temporal context and modality information for VAD. VEC can consistently outperform state-of-the-art methods by a notable margin (typically 1.5%-5% AUROC) on commonly-used VAD benchmarks. Our codes and results can be verified at github.com/yuguangnudt/VEC_VAD.
CVAug 21, 2020Code
Self-Supervised Gait Encoding with Locality-Aware Attention for Person Re-IdentificationHaocong Rao, Siqi Wang, Xiping Hu et al.
Gait-based person re-identification (Re-ID) is valuable for safety-critical applications, and using only 3D skeleton data to extract discriminative gait features for person Re-ID is an emerging open topic. Existing methods either adopt hand-crafted features or learn gait features by traditional supervised learning paradigms. Unlike previous methods, we for the first time propose a generic gait encoding approach that can utilize unlabeled skeleton data to learn gait representations in a self-supervised manner. Specifically, we first propose to introduce self-supervision by learning to reconstruct input skeleton sequences in reverse order, which facilitates learning richer high-level semantics and better gait representations. Second, inspired by the fact that motion's continuity endows temporally adjacent skeletons with higher correlations ("locality"), we propose a locality-aware attention mechanism that encourages learning larger attention weights for temporally adjacent skeletons when reconstructing current skeleton, so as to learn locality when encoding gait. Finally, we propose Attention-based Gait Encodings (AGEs), which are built using context vectors learned by locality-aware attention, as final gait representations. AGEs are directly utilized to realize effective person Re-ID. Our approach typically improves existing skeleton-based methods by 10-20% Rank-1 accuracy, and it achieves comparable or even superior performance to multi-modal methods with extra RGB or depth information. Our codes are available at https://github.com/Kali-Hac/SGE-LA.
SYMar 6, 2018Code
Cyber-Physical Testbed for Power System Wide-Area Measurement-Based Control Using Open-Source SoftwareHantao Cui, Fangxing Li, Kevin Tomsovic et al.
The electric power system is a cyber-physical system with power flow in the physical system and information flow in the cyber. Simulation is crucial to understanding the dynamics and control of electric power systems yet the underlying communication system has historically been ignored in these studies. This paper aims at meeting the increasing needs to simulate the operations of a real power system including the physical system, the energy management system, the communication system, and the emerging wide-area measurement-based controls. This paper proposes a cyber-physical testbed design and implementation for verifying and demonstrating wide-area control methods based on streaming telemetry and phasor measurement unit data. The proposed decoupled architecture is composed of a differential algebraic equation based physical system simulator, a software-defined network, a scripting language environment for prototyping an EMS system and a control system, all of which are integrated over industry-standard communication protocols. The proposed testbed is implemented using open-source software packages managed by a Python dispatcher. Finally, demonstrations are presented to show two wide-area measurement-based controls - system separation control and hierarchical voltage control, in the implemented testbed.
CVNov 30, 2023
A Unified Framework for Connecting Noise Modeling to Boost Noise DetectionSiqi Wang, Chau Pham, Bryan A. Plummer
Noisy labels can impair model performance, making the study of learning with noisy labels an important topic. Two conventional approaches are noise modeling and noise detection. However, these two methods are typically studied independently, and there has been limited work on their collaboration. In this work, we explore the integration of these two approaches, proposing an interconnected structure with three crucial blocks: noise modeling, source knowledge identification, and enhanced noise detection using noise source-knowledge-integration methods. This collaboration structure offers advantages such as discriminating hard negatives and preserving genuinely clean labels that might be suspiciously noisy. Our experiments on four datasets, featuring three types of noise and different combinations of each block, demonstrate the efficacy of these components' collaboration. Our collaborative structure methods achieve up to a 10% increase in top-1 classification accuracy in synthesized noise datasets and 3-5% in real-world noisy datasets. The results also suggest that these components make distinct contributions to overall performance across various noise scenarios. These findings provide valuable insights for designing noisy label learning methods customized for specific noise scenarios in the future. Our code is accessible to the public.
LGApr 26, 2023
Membrane Potential Distribution Adjustment and Parametric Surrogate Gradient in Spiking Neural NetworksSiqi Wang, Tee Hiang Cheng, Meng-Hiot Lim
As an emerging network model, spiking neural networks (SNNs) have aroused significant research attentions in recent years. However, the energy-efficient binary spikes do not augur well with gradient descent-based training approaches. Surrogate gradient (SG) strategy is investigated and applied to circumvent this issue and train SNNs from scratch. Due to the lack of well-recognized SG selection rule, most SGs are chosen intuitively. We propose the parametric surrogate gradient (PSG) method to iteratively update SG and eventually determine an optimal surrogate gradient parameter, which calibrates the shape of candidate SGs. In SNNs, neural potential distribution tends to deviate unpredictably due to quantization error. We evaluate such potential shift and propose methodology for potential distribution adjustment (PDA) to minimize the loss of undesired pre-activations. Experimental results demonstrate that the proposed methods can be readily integrated with backpropagation through time (BPTT) algorithm and help modulated SNNs to achieve state-of-the-art performance on both static and dynamic dataset with fewer timesteps.
CVApr 20, 2025
NTIRE 2025 Challenge on Real-World Face Restoration: Methods and ResultsZheng Chen, Jingkai Wang, Kai Liu et al.
This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. The track of the challenge evaluates performance using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 141 registrants, with 13 teams submitting valid models, and ultimately, 10 teams achieved a valid score in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.
AIDec 11, 2025
Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy InterventionYang Yu, Zhuangzhuang Chen, Siqi Wang et al.
Recently, reinforcement learning (RL) has become a common choice in enhancing the reasoning capabilities of vision-language models (VLMs). Considering existing RL-based finetuning methods, entropy intervention turns out to be an effective way to benefit exploratory ability, thereby improving policy performance. Notably, most existing studies intervene in entropy by simply controlling the update of specific tokens during policy optimization of RL. They ignore the entropy intervention during the RL sampling that can boost the performance of GRPO by improving the diversity of responses. In this paper, we propose Selective-adversarial Entropy Intervention, namely SaEI, which enhances policy entropy by distorting the visual input with the token-selective adversarial objective coming from the entropy of sampled responses. Specifically, we first propose entropy-guided adversarial sampling (EgAS) that formulates the entropy of sampled responses as an adversarial objective. Then, the corresponding adversarial gradient can be used to attack the visual input for producing adversarial samples, allowing the policy model to explore a larger answer space during RL sampling. Then, we propose token-selective entropy computation (TsEC) to maximize the effectiveness of adversarial attack in EgAS without distorting factual knowledge within VLMs. Extensive experiments on both in-domain and out-of-domain datasets show that our proposed method can greatly improve policy exploration via entropy intervention, to boost reasoning capabilities. Code will be released once the paper is accepted.
CVJan 23
A Lightweight Medical Image Classification Framework via Self-Supervised Contrastive Learning and Quantum-Enhanced Feature ModelingJingsong Xia, Siqi Wang
Intelligent medical image analysis is essential for clinical decision support but is often limited by scarce annotations, constrained computational resources, and suboptimal model generalization. To address these challenges, we propose a lightweight medical image classification framework that integrates self-supervised contrastive learning with quantum-enhanced feature modeling. MobileNetV2 is employed as a compact backbone and pretrained using a SimCLR-style self-supervised paradigm on unlabeled images. A lightweight parameterized quantum circuit (PQC) is embedded as a quantum feature enhancement module, forming a hybrid classical-quantum architecture, which is subsequently fine-tuned on limited labeled data. Experimental results demonstrate that, with only approximately 2-3 million parameters and low computational cost, the proposed method consistently outperforms classical baselines without self-supervised learning or quantum enhancement in terms of Accuracy, AUC, and F1-score. Feature visualization further indicates improved discriminability and representation stability. Overall, this work provides a practical and forward-looking solution for high-performance medical artificial intelligence under resource-constrained settings.
CLMay 21, 2025
Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM ReasoningJinghui Lu, Haiyang Yu, Siliang Xu et al.
Recent advancements in reasoning have significantly enhanced the capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) across diverse tasks. However, excessive reliance on chain-of-thought (CoT) reasoning can impair model performance and brings unnecessarily lengthened outputs, reducing efficiency. Our work reveals that prolonged reasoning does not universally improve accuracy and even degrade performance on simpler tasks. To address this, we propose Certainty-based Adaptive Reasoning (CAR), a novel framework that dynamically switches between short answers and long-form reasoning based on the model perplexity. CAR first generates a short answer and evaluates its perplexity, triggering reasoning only when the model exhibits low confidence (i.e., high perplexity). Experiments across diverse multimodal VQA/KIE benchmarks and text reasoning datasets show that CAR outperforms both short-answer and long-form reasoning approaches, striking an optimal balance between accuracy and efficiency.
CVJan 26
Generative Diffusion Augmentation with Quantum-Enhanced Discrimination for Medical Image DiagnosisJingsong Xia, Siqi Wang
In biomedical engineering, artificial intelligence has become a pivotal tool for enhancing medical diagnostics, particularly in medical image classification tasks such as detecting pneumonia from chest X-rays and breast cancer screening. However, real-world medical datasets frequently exhibit severe class imbalance, where positive samples substantially outnumber negative samples, leading to biased models with low recall rates for minority classes. This imbalance not only compromises diagnostic accuracy but also poses clinical misdiagnosis risks. To address this challenge, we propose SDA-QEC (Simplified Diffusion Augmentation with Quantum-Enhanced Classification), an innovative framework that integrates simplified diffusion-based data augmentation with quantum-enhanced feature discrimination. Our approach employs a lightweight diffusion augmentor to generate high-quality synthetic samples for minority classes, rebalancing the training distribution. Subsequently, a quantum feature layer embedded within MobileNetV2 architecture enhances the model's discriminative capability through high-dimensional feature mapping in Hilbert space. Comprehensive experiments on coronary angiography image classification demonstrate that SDA-QEC achieves 98.33% accuracy, 98.78% AUC, and 98.33% F1-score, significantly outperforming classical baselines including ResNet18, MobileNetV2, DenseNet121, and VGG16. Notably, our framework simultaneously attains 98.33% sensitivity and 98.33% specificity, achieving a balanced performance critical for clinical deployment. The proposed method validates the feasibility of integrating generative augmentation with quantum-enhanced modeling in real-world medical imaging tasks, offering a novel research pathway for developing highly reliable medical AI systems in small-sample, highly imbalanced, and high-risk diagnostic scenarios.
CVJan 22
A Lightweight Brain-Inspired Machine Learning Framework for Coronary Angiography: Hybrid Neural Representation and Robust Learning StrategiesJingsong Xia, Siqi Wang
Background: Coronary angiography (CAG) is a cornerstone imaging modality for assessing coronary artery disease and guiding interventional treatment decisions. However, in real-world clinical settings, angiographic images are often characterized by complex lesion morphology, severe class imbalance, label uncertainty, and limited computational resources, posing substantial challenges to conventional deep learning approaches in terms of robustness and generalization.Methods: The proposed framework is built upon a pretrained convolutional neural network to construct a lightweight hybrid neural representation. A selective neural plasticity training strategy is introduced to enable efficient parameter adaptation. Furthermore, a brain-inspired attention-modulated loss function, combining Focal Loss with label smoothing, is employed to enhance sensitivity to hard samples and uncertain annotations. Class-imbalance-aware sampling and cosine annealing with warm restarts are adopted to mimic rhythmic regulation and attention allocation mechanisms observed in biological neural systems.Results: Experimental results demonstrate that the proposed lightweight brain-inspired model achieves strong and stable performance in binary coronary angiography classification, yielding competitive accuracy, recall, F1-score, and AUC metrics while maintaining high computational efficiency.Conclusion: This study validates the effectiveness of brain-inspired learning mechanisms in lightweight medical image analysis and provides a biologically plausible and deployable solution for intelligent clinical decision support under limited computational resources.
CVFeb 26, 2024
Asphalt Concrete Characterization Using Digital Image Correlation: A Systematic Review of Best Practices, Applications, and Future VisionSiqi Wang, Zehui Zhu, Tao Ma et al.
Digital Image Correlation (DIC) is an optical technique that measures displacement and strain by tracking pattern movement in a sequence of captured images during testing. DIC has gained recognition in asphalt pavement engineering since the early 2000s. However, users often perceive the DIC technique as an out-of-box tool and lack a thorough understanding of its operational and measurement principles. This article presents a state-of-art review of DIC as a crucial tool for laboratory testing of asphalt concrete (AC), primarily focusing on the widely utilized 2D-DIC and 3D-DIC techniques. To address frequently asked questions from users, the review thoroughly examines the optimal methods for preparing speckle patterns, configuring single-camera or dual-camera imaging systems, conducting DIC analyses, and exploring various applications. Furthermore, emerging DIC methodologies such as Digital Volume Correlation and deep-learning-based DIC are introduced, highlighting their potential for future applications in pavement engineering. The article also provides a comprehensive and reliable flowchart for implementing DIC in AC characterization. Finally, critical directions for future research are presented.
CVJan 25, 2025
"Stones from Other Hills can Polish Jade": Zero-shot Anomaly Image Synthesis via Cross-domain Anomaly InjectionSiqi Wang, Yuanze Hu, Xinwang Liu et al.
Industrial image anomaly detection (IAD) is a pivotal topic with huge value. Due to anomaly's nature, real anomalies in a specific modern industrial domain (i.e. domain-specific anomalies) are usually too rare to collect, which severely hinders IAD. Thus, zero-shot anomaly synthesis (ZSAS), which synthesizes pseudo anomaly images without any domain-specific anomaly, emerges as a vital technique for IAD. However, existing solutions are either unable to synthesize authentic pseudo anomalies, or require cumbersome training. Thus, we focus on ZSAS and propose a brand-new paradigm that can realize both authentic and training-free ZSAS. It is based on a chronically-ignored fact: Although domain-specific anomalies are rare, real anomalies from other domains (i.e. cross-domain anomalies) are actually abundant and directly applicable to ZSAS. Specifically, our new ZSAS paradigm makes three-fold contributions: First, we propose a novel method named Cross-domain Anomaly Injection (CAI), which directly exploits cross-domain anomalies to enable highly authentic ZSAS in a training-free manner. Second, to supply CAI with sufficient cross-domain anomalies, we build the first Domain-agnostic Anomaly Dataset within our best knowledge, which provides ZSAS with abundant real anomaly patterns. Third, we propose a CAI-guided Diffusion Mechanism, which further breaks the quantity limit of real anomalies and enable unlimited anomaly synthesis. Our head-to-head comparison with existing ZSAS solutions justifies our paradigm's superior performance for IAD and demonstrates it as an effective and pragmatic ZSAS solution.
AIAug 26, 2025
AniME: Adaptive Multi-Agent Planning for Long Animation GenerationLisai Zhang, Baohan Xu, Siqian Yang et al.
We present AniME, a director-oriented multi-agent system for automated long-form anime production, covering the full workflow from a story to the final video. The director agent keeps a global memory for the whole workflow, and coordinates several downstream specialized agents. By integrating customized Model Context Protocol (MCP) with downstream model instruction, the specialized agent adaptively selects control conditions for diverse sub-tasks. AniME produces cinematic animation with consistent characters and synchronized audio visual elements, offering a scalable solution for AI-driven anime creation.
AIJun 17, 2025
ADRD: LLM-Driven Autonomous Driving Based on Rule-based Decision SystemsFanzhi Zeng, Siqi Wang, Chuzhao Zhu et al.
How to construct an interpretable autonomous driving decision-making system has become a focal point in academic research. In this study, we propose a novel approach that leverages large language models (LLMs) to generate executable, rule-based decision systems to address this challenge. Specifically, harnessing the strong reasoning and programming capabilities of LLMs, we introduce the ADRD(LLM-Driven Autonomous Driving Based on Rule-based Decision Systems) framework, which integrates three core modules: the Information Module, the Agents Module, and the Testing Module. The framework operates by first aggregating contextual driving scenario information through the Information Module, then utilizing the Agents Module to generate rule-based driving tactics. These tactics are iteratively refined through continuous interaction with the Testing Module. Extensive experimental evaluations demonstrate that ADRD exhibits superior performance in autonomous driving decision tasks. Compared to traditional reinforcement learning approaches and the most advanced LLM-based methods, ADRD shows significant advantages in terms of interpretability, response speed, and driving performance. These results highlight the framework's ability to achieve comprehensive and accurate understanding of complex driving scenarios, and underscore the promising future of transparent, rule-based decision systems that are easily modifiable and broadly applicable. To the best of our knowledge, this is the first work that integrates large language models with rule-based systems for autonomous driving decision-making, and our findings validate its potential for real-world deployment.
CVJan 25
Domain Generalization with Quantum Enhancement for Medical Image Classification: A Lightweight Approach for Cross-Center DeploymentJingsong Xia, Siqi Wang
Medical image artificial intelligence models often achieve strong performance in single-center or single-device settings, yet their effectiveness frequently deteriorates in real-world cross-center deployment due to domain shift, limiting clinical generalizability. To address this challenge, we propose a lightweight domain generalization framework with quantum-enhanced collaborative learning, enabling robust generalization to unseen target domains without relying on real multi-center labeled data. Specifically, a MobileNetV2-based domain-invariant encoder is constructed and optimized through three key components: (1) multi-domain imaging shift simulation using brightness, contrast, sharpening, and noise perturbations to emulate heterogeneous acquisition conditions; (2) domain-adversarial training with gradient reversal to suppress domain-discriminative features; and (3) a lightweight quantum feature enhancement layer that applies parameterized quantum circuits for nonlinear feature mapping and entanglement modeling. In addition, a test-time adaptation strategy is employed during inference to further alleviate distribution shifts. Experiments on simulated multi-center medical imaging datasets demonstrate that the proposed method significantly outperforms baseline models without domain generalization or quantum enhancement on unseen domains, achieving reduced domain-specific performance variance and improved AUC and sensitivity. These results highlight the clinical potential of quantum-enhanced domain generalization under constrained computational resources and provide a feasible paradigm for hybrid quantum--classical medical imaging systems.
LGDec 5, 2025
Scaling and Transferability of Annealing Strategies in Large Language Model TrainingSiqi Wang, Zhengyu Chen, Teng Xiao et al.
Learning rate scheduling is crucial for training large language models, yet understanding the optimal annealing strategies across different model configurations remains challenging. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides a practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings on extensive experiments using both Dense and Mixture-of-Experts (MoE) models, demonstrating that optimal annealing ratios follow consistent patterns and can be transferred across different training configurations.
AIAug 13, 2025
MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR AdvancementWeitao Jia, Jinghui Lu, Haiyang Yu et al.
Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model's performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.
CLJul 27, 2025
Post-Completion Learning for Language ModelsXiang Fei, Siqi Wang, Shu Wei et al.
Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion, to enhance both the reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: let the model evaluate the output content according to the reward rules, then calculate and align the score with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mixed it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.