ITJun 3
Sequence Reconstruction for Substitution Channel: New Sufficient Conditions and AlgorithmsChen Wang, Eitan Yaakobi, Yiwei Zhang
In the sequence reconstruction problem, a codeword $\x$ is transmitted through several identical channels where each channel produces a noisy read of $\x$, and the problem is to analyze how to uniquely reconstruct $\x$ based on these noisy reads. Levenshtein has studied the minimum number of reads which guarantees unique reconstruction of $\x$, which is one sufficient condition for unique reconstruction. In this paper, we move on to a different perspective and propose a new framework for unique reconstruction. Our new sufficient condition for unique reconstruction takes both the number of reads and the distances among the reads into consideration. We offer both theoretical analysis and corresponding efficient reconstruction algorithms for our reconstruction framework.
AIMay 28
PassNet: Scaling Large Language Models for Graph Compiler Pass GenerationYiqun Liu, Yingsheng Wu, Ruqi Yang et al.
Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.
CVJul 31, 2022
Neuro-Symbolic Learning: Principles and Applications in OphthalmologyMuhammad Hassan, Haifei Guan, Aikaterini Melliou et al.
Neural networks have been rapidly expanding in recent years, with novel strategies and applications. However, challenges such as interpretability, explainability, robustness, safety, trust, and sensibility remain unsolved in neural network technologies, despite the fact that they will unavoidably be addressed for critical applications. Attempts have been made to overcome the challenges in neural network computing by representing and embedding domain knowledge in terms of symbolic representations. Thus, the neuro-symbolic learning (NeSyL) notion emerged, which incorporates aspects of symbolic representation and bringing common sense into neural networks (NeSyL). In domains where interpretability, reasoning, and explainability are crucial, such as video and image captioning, question-answering and reasoning, health informatics, and genomics, NeSyL has shown promising outcomes. This review presents a comprehensive survey on the state-of-the-art NeSyL approaches, their principles, advances in machine and deep learning algorithms, applications such as opthalmology, and most importantly, future perspectives of this emerging field.
CVApr 20Code
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action ModelsYiwei Zhang, Xuesong Chen, Jin Gao et al.
Vision-Language Models(VLMs) excel at autoregressive text generation, yet end-to-end autonomous driving requires multi-task learning with structured outputs and heterogeneous decoding behaviors, such as autoregressive language generation, parallel object detection and trajectory regression. To accommodate these differences, existing systems typically introduce separate or cascaded decoders, resulting in architectural fragmentation and limited backbone reuse. In this work, we present a unified autonomous driving framework built upon a pretrained VLM, where heterogeneous decoding behaviors are reconciled within a single transformer decoder. We demonstrate that pretrained VLM attention exhibits strong transferability beyond pure language modeling. By organizing visual and structured query tokens within a single causal decoder, structured queries can naturally condition on visual context through the original attention mechanism. Textual and structured outputs share a common attention backbone, enabling stable joint optimization across heterogeneous tasks. Trajectory planning is realized within the same causal LLM decoder by introducing structured trajectory queries. This unified formulation enables planning to share the pretrained attention backbone with images and perception tokens. Extensive experiments on end-to-end autonomous driving benchmarks demonstrate state-of-the-art performance, including 0.28 L2 and 0.18 collision rate on nuScenes open-loop evaluation and competitive results (86.8 PDMS) on NAVSIM closed-loop evaluation. The full model preserves multi-modal generation capability, while an efficient inference mode achieves approximately 40% lower latency. Code and models are available at https://github.com/Z1zyw/OneDrive
LGMay 6
Information Theoretic Adversarial Training of Large Language ModelsYiwei Zhang, Jeremiah Birrell, Reza Ebrahimi et al.
Large language models (LLMs) remain vulnerable to adversarial prompting despite advances in alignment and safety, often exhibiting harmful behaviors under novel attack strategies. While adversarial training can improve robustness, existing approaches are computationally expensive and difficult to scale. Recent continuous adversarial training methods, such as Continuous adversarial training (CAT) and Continuous Adversarial Preference Optimization (CAPO), address this challenge by leveraging gradient-based perturbations in the embedding space, enabling more efficient and expressive attacks. Building on this paradigm, we propose WARDEN, a distributionally robust adversarial training framework for LLMs that dynamically reweights adversarial examples through an f -divergence ambiguity set around the empirical training distribution. Our method optimizes the worst-case adversarial loss within a divergence ball around the empirical data distribution, automatically emphasizing harder adversarial examples. Using the convex dual formulation, the objective reduces to a log-sum-exp form under the KL divergence, with a dynamical parameter controlling the strength of reweighting. This study leads to a new class of information-theoretic objectives that significantly reduce attack success rates while maintaining model utility. Across multiple LLMs and attack settings, WARDEN substantially reduces attack success rates with computational and utility costs comparable to CAT-, CAPO-, and MixAT-based baselines, making it a practical approach for scalable robust alignment.
NCMar 14, 2023
Emergent Bio-Functional Similarities in a Cortical-Spike-Train-Decoding Spiking Neural Network Facilitate Predictions of Neural ComputationTengjun Liu, Yansong Chua, Yiwei Zhang et al.
Despite its better bio-plausibility, goal-driven spiking neural network (SNN) has not achieved applicable performance for classifying biological spike trains, and showed little bio-functional similarities compared to traditional artificial neural networks. In this study, we proposed the motorSRNN, a recurrent SNN topologically inspired by the neural motor circuit of primates. By employing the motorSRNN in decoding spike trains from the primary motor cortex of monkeys, we achieved a good balance between classification accuracy and energy consumption. The motorSRNN communicated with the input by capturing and cultivating more cosine-tuning, an essential property of neurons in the motor cortex, and maintained its stability during training. Such training-induced cultivation and persistency of cosine-tuning was also observed in our monkeys. Moreover, the motorSRNN produced additional bio-functional similarities at the single-neuron, population, and circuit levels, demonstrating biological authenticity. Thereby, ablation studies on motorSRNN have suggested long-term stable feedback synapses contribute to the training-induced cultivation in the motor cortex. Besides these novel findings and predictions, we offer a new framework for building authentic models of neural computation.
CVMar 11, 2024Code
BEV$^2$PR: BEV-Enhanced Visual Place Recognition with Structural CuesFudong Ge, Yiwei Zhang, Shuhan Shen et al.
In this paper, we propose a new image-based visual place recognition (VPR) framework by exploiting the structural cues in bird's-eye view (BEV) from a single monocular camera. The motivation arises from two key observations about place recognition methods based on both appearance and structure: 1) For the methods relying on LiDAR sensors, the integration of LiDAR in robotic systems has led to increased expenses, while the alignment of data between different sensors is also a major challenge. 2) Other image-/camera-based methods, involving integrating RGB images and their derived variants (eg, pseudo depth images, pseudo 3D point clouds), exhibit several limitations, such as the failure to effectively exploit the explicit spatial relationships between different objects. To tackle the above issues, we design a new BEV-enhanced VPR framework, namely BEV$^2$PR, generating a composite descriptor with both visual cues and spatial awareness based on a single camera. The key points lie in: 1) We use BEV features as an explicit source of structural knowledge in constructing global features. 2) The lower layers of the pre-trained backbone from BEV generation are shared for visual and structural streams in VPR, facilitating the learning of fine-grained local features in the visual stream. 3) The complementary visual and structural features can jointly enhance VPR performance. Our BEV$^2$PR framework enables consistent performance improvements over several popular aggregation modules for RGB global features. The experiments on our collected VPR-NuScenes dataset demonstrate an absolute gain of 2.47% on Recall@1 for the strong Conv-AP baseline to achieve the best performance in our setting, and notably, a 18.06% gain on the hard set. The code and dataset will be available at https://github.com/FudongGe/BEV2PR.
AIFeb 25
FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning EvaluationXiyuan Zhang, Huihang Wu, Jiayu Guo et al.
We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practical business scenarios. For theoretical assessment, we curate a diverse set of examination questions drawn from widely recognized financial qualification exams, enabling evaluation of LLMs deep understanding and application of financial knowledge. In addition, to assess the practical value of LLMs in real-world financial tasks, we propose a systematic evaluation matrix that categorizes complex financial domains and ensures coverage of essential subdomains and business activities. Based on this evaluation matrix, we collect 3,000 financial scenario questions, consisting of closed-form decision questions with reference answers and open-ended questions evaluated by predefined rubrics. We conduct comprehensive evaluations of state-of-the-art LLMs on the FIRE benchmark, including XuanYuan 4.0, our latest financial-domain model, as a strong in-domain baseline. These results enable a systematic analysis of the capability boundaries of current LLMs in financial applications. We publicly release the benchmark questions and evaluation code to facilitate future research.
CVOct 14, 2025Code
VideoLucy: Deep Memory Backtracking for Long Video UnderstandingJialong Zuo, Yongtai Deng, Lingdong Kong et al.
Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly at https://videolucy.github.io
CVNov 3, 2024Code
VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector QuantizationYiwei Zhang, Jin Gao, Fudong Ge et al.
Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics for the environmental elements around the ego car to make the results coherent and realistic. Due to the challenges posed by occlusion, unfavourable imaging conditions and low resolution, \emph{generating} the BEV semantic maps corresponding to corrupted or invalid areas in the perspective view (PV) is appealing very recently. \emph{The question is how to align the PV features with the generative models to facilitate the map estimation}. In this paper, we propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space. Thanks to the obtained BEV tokens accompanied with a codebook embedding encapsulating the semantics for different BEV elements in the groundtruth maps, we are able to directly align the sparse backbone image features with the obtained BEV tokens from the discrete representation learning based on a specialized token decoder module, and finally generate high-quality BEV maps with the BEV codebook embedding serving as a bridge between PV and BEV. We evaluate the BEV map layout estimation performance of our model, termed VQ-Map, on both the nuScenes and Argoverse benchmarks, achieving 62.2/47.6 mean IoU for surround-view/monocular evaluation on nuScenes, as well as 73.4 IoU for monocular evaluation on Argoverse, which all set a new record for this map layout estimation task. The code and models are available on \url{https://github.com/Z1zyw/VQ-Map}.
CVJan 16
SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid AttentionRuibang Li, Guan Luo, Yiwei Zhang et al.
Standard softmax self-attention excels in vision tasks but incurs quadratic complexity O(N^2), limiting high-resolution deployment. Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning from a layer-stacking perspective. We further conduct systematic experiments on layer-wise hybridization patterns of linear and softmax attention. Our results show that, compared with rigid intra-block hybrid designs, fine-grained layer-wise hybridization can match or surpass performance while requiring fewer softmax layers. Building on these findings, we propose SoLA-Vision (Softmax-Linear Attention Vision), a flexible layer-wise hybrid attention backbone that enables fine-grained control over how linear and softmax attention are integrated. By strategically inserting a small number of global softmax layers, SoLA-Vision achieves a strong trade-off between accuracy and computational cost. On ImageNet-1K, SoLA-Vision outperforms purely linear and other hybrid attention models. On dense prediction tasks, it consistently surpasses strong baselines by a considerable margin. Code will be released.
CVJan 14
Integrating Diverse Assignment Strategies into DETRsYiwei Zhang, Jin Gao, Hanshi Wang et al.
Label assignment is a critical component in object detectors, particularly within DETR-style frameworks where the one-to-one matching strategy, despite its end-to-end elegance, suffers from slow convergence due to sparse supervision. While recent works have explored one-to-many assignments to enrich supervisory signals, they often introduce complex, architecture-specific modifications and typically focus on a single auxiliary strategy, lacking a unified and scalable design. In this paper, we first systematically investigate the effects of ``one-to-many'' supervision and reveal a surprising insight that performance gains are driven not by the sheer quantity of supervision, but by the diversity of the assignment strategies employed. This finding suggests that a more elegant, parameter-efficient approach is attainable. Building on this insight, we propose LoRA-DETR, a flexible and lightweight framework that seamlessly integrates diverse assignment strategies into any DETR-style detector. Our method augments the primary network with multiple Low-Rank Adaptation (LoRA) branches during training, each instantiating a different one-to-many assignment rule. These branches act as auxiliary modules that inject rich, varied supervisory gradients into the main model and are discarded during inference, thus incurring no additional computational cost. This design promotes robust joint optimization while maintaining the architectural simplicity of the original detector. Extensive experiments on different baselines validate the effectiveness of our approach. Our work presents a new paradigm for enhancing detectors, demonstrating that diverse ``one-to-many'' supervision can be integrated to achieve state-of-the-art results without compromising model elegance.
CVDec 8, 2025
Online Segment Any 3D Thing as Instance TrackingHanshi Wang, Zijian Cai, Jin Gao et al.
Online, real-time, and fine-grained 3D segmentation constitutes a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advancements employ predefined object queries to aggregate semantic information from Vision Foundation Models (VFMs) outputs that are lifted into 3D point clouds, facilitating spatial information propagation through inter-query interactions. Nevertheless, perception is an inherently dynamic process, rendering temporal understanding a critical yet overlooked dimension within these prevailing query-based pipelines. Therefore, to further unlock the temporal environmental perception capabilities of embodied agents, our work reconceptualizes online 3D segmentation as an instance tracking problem (AutoSeg3D). Our core strategy involves utilizing object queries for temporal information propagation, where long-term instance association promotes the coherence of features and object identities, while short-term instance update enriches instant observations. Given that viewpoint variations in embodied robotics often lead to partial object visibility across frames, this mechanism aids the model in developing a holistic object understanding beyond incomplete instantaneous views. Furthermore, we introduce spatial consistency learning to mitigate the fragmentation problem inherent in VFMs, yielding more comprehensive instance information for enhancing the efficacy of both long-term and short-term temporal learning. The temporal information exchange and consistency learning facilitated by these sparse object queries not only enhance spatial comprehension but also circumvent the computational burden associated with dense temporal point cloud interactions. Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on ScanNet, SceneNN, and 3RScan datasets.
HCMar 19
Deconstructing Open-World Game Mission Design Formula: A Thematic Analysis Using an Action-Block FrameworkKaijie Xu, Yiwei Zhang, Brian Yang et al.
Open-world missions often rely on repeated formulas, yet designers lack systematic ways to examine pacing, variation, and experiential balance across large portfolios. We introduce the Mission Action Quality Vector (MAQV), a six-dimensional framework-covering combat, exploration, narrative, emotion, problem-solving, and uniqueness-paired with an action block grammar representing missions as gameplay sequences. Using about 2200 missions from 20 AAA titles, we apply LLM-assisted parsing to convert community walkthroughs into structured action sequences and score them with MAQV. An interactive dashboard enables designers to reveal underlying mission formulas. In a mixed-methods study with experienced players and designers, we validate the pipeline's fidelity and the tool's usability, and use thematic analysis to identify recurring design trade-offs, pacing grammars, and systematic differences by quest type and franchise evolution. Our work offers a reproducible analytical workflow, a data-driven visualization tool, and reflective insights to support more balanced, varied mission design at scale.
ITMar 11
A Disguise-and-Squeeze PIR Scheme for the MDS-TPIR Setting and BeyondRui Sun, Ran Tao, Jingke Xu et al.
We consider the problem of private information retrieval (PIR) from MDS coded databases with colluding servers, i.e., MDS-TPIR. In the MDS-TPIR setting, $M$ files are stored across $N$ servers, where each file is stored independently using an $(N,K)$-MDS code. A user wants to retrieve one file without disclosing the index of the desired file to any set of up to $T$ colluding servers. The general problem in studying PIR schemes is to maximize the PIR rate, defined as the ratio of the size of the desired file to the size of the total download. Freij-Hollanti et al. proposed a conjecture of the MDS-TPIR capacity (the maximum achievable PIR rate), which was later disproved by Sun and Jafar by a counterexample with $(M,N,T,K)=(2,4,2,2)$. In this paper, we propose a new MDS-TPIR scheme based on a disguise-and-squeeze approach. The features of our scheme include the following. Our scheme generalizes the Sun-Jafar counterexample to $(M,N,T,K)=(2,N,2,K)$ with $N\geq K+2$ for an arbitrary $(N,K)$-MDS coded system, providing more counterexamples to the conjecture by Freij-Hollanti et al. For $(M,N,T,K)=(2,N,2,K)$ and a GRS (generalized Reed-Solomon codes) coded system, our scheme has rate $\frac{N^2-N}{N^2+KN-2K}$, beating the state-of-the-art results. We further show that this rate achieves the linear MDS-TPIR capacity when $K=2$. Our scheme features a significantly smaller field size for implementation and the adaptiveness to generalized PIR models such as multi-file MDS-TPIR and MDS-PIR against cyclically adjacent colluding servers. Lastly, we provide an $ε$-error MDS-TPIR scheme for $T\geq 3$ based on the disguise-and-squeeze framework.
LGMar 31, 2025
Federated Learning for Cross-Domain Data Privacy: A Distributed Approach to Secure CollaborationYiwei Zhang, Jie Liu, Jiawei Wang et al.
This paper proposes a data privacy protection framework based on federated learning, which aims to realize effective cross-domain data collaboration under the premise of ensuring data privacy through distributed learning. Federated learning greatly reduces the risk of privacy breaches by training the model locally on each client and sharing only model parameters rather than raw data. The experiment verifies the high efficiency and privacy protection ability of federated learning under different data sources through the simulation of medical, financial, and user data. The results show that federated learning can not only maintain high model performance in a multi-domain data environment but also ensure effective protection of data privacy. The research in this paper provides a new technical path for cross-domain data collaboration and promotes the application of large-scale data analysis and machine learning while protecting privacy.
LGMar 25, 2025
Social Network User Profiling for Anomaly Detection Based on Graph Neural NetworksYiwei Zhang
This study proposes a risk pricing anomaly detection method for social network user portraits based on graph neural networks (GNNs), aiming to improve the ability to identify abnormal users in social network environments. In view of the limitations of traditional methods in social network data modeling, this paper combines graph autoencoders (GAEs) and graph attention networks (GATs) to achieve accurate detection of abnormal users through dynamic aggregation of neighbor features and reconstruction error evaluation. The Facebook Page-Page Network dataset is used in the experiment and compared with VAE, GNN, Transformer and GAE. The results show that the proposed method achieves the best performance in AUC, F1-score, Precision and Recall, verifying its effectiveness. In addition, this paper explores the computational efficiency of the model in large-scale data and looks forward to combining self-supervised learning, federated learning, and other technologies in the future to improve the robustness and privacy protection of risk assessment. The research results can provide efficient anomaly detection solutions for financial risk control, social security management, and other fields.
CVMay 17, 2024
Safeguarding Vision-Language Models Against Patched Visual Prompt InjectorsJiachen Sun, Changsheng Wang, Jiongxiao Wang et al.
Large language models have become increasingly prominent, also signaling a shift towards multimodality as the next frontier in artificial intelligence, where their embeddings are harnessed as prompts to generate textual content. Vision-language models (VLMs) stand at the forefront of this advancement, offering innovative ways to combine visual and textual data for enhanced understanding and interaction. However, this integration also enlarges the attack surface. Patch-based adversarial attack is considered the most realistic threat model in physical vision applications, as demonstrated in many existing literature. In this paper, we propose to address patched visual prompt injection, where adversaries exploit adversarial patches to generate target content in VLMs. Our investigation reveals that patched adversarial prompts exhibit sensitivity to pixel-wise randomization, a trait that remains robust even against adaptive attacks designed to counteract such defenses. Leveraging this insight, we introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, specifically tailored to protect VLMs from the threat of patched visual prompt injectors. Our framework significantly lowers the attack success rate to a range between 0% and 5.0% on two leading VLMs, while achieving around 67.3% to 95.0% context recovery of the benign images, demonstrating a balance between security and usability.
LGApr 8, 2025
Addressing Class Imbalance with Probabilistic Graphical Models and Variational InferenceYujia Lou, Jie Liu, Yuan Sheng et al.
This study proposes a method for imbalanced data classification based on deep probabilistic graphical models (DPGMs) to solve the problem that traditional methods have insufficient learning ability for minority class samples. To address the classification bias caused by class imbalance, we introduce variational inference optimization probability modeling, which enables the model to adaptively adjust the representation ability of minority classes and combines the class-aware weight adjustment strategy to enhance the classifier's sensitivity to minority classes. In addition, we combine the adversarial learning mechanism to generate minority class samples in the latent space so that the model can better characterize the category boundary in the high-dimensional feature space. The experiment is evaluated on the Kaggle "Credit Card Fraud Detection" dataset and compared with a variety of advanced imbalanced classification methods (such as GAN-based sampling, BRF, XGBoost-Cost Sensitive, SAAD, HAN). The results show that the method in this study has achieved the best performance in AUC, Precision, Recall and F1-score indicators, effectively improving the recognition rate of minority classes and reducing the false alarm rate. This method can be widely used in imbalanced classification tasks such as financial fraud detection, medical diagnosis, and anomaly detection, providing a new solution for related research.
LGMar 14, 2025
Context-Aware Rule Mining Using a Dynamic Transformer-Based FrameworkJie Liu, Yiwei Zhang, Yuan Sheng et al.
This study proposes a dynamic rule data mining algorithm based on an improved Transformer architecture, aiming to improve the accuracy and efficiency of rule mining in a dynamic data environment. With the increase in data volume and complexity, traditional data mining methods are difficult to cope with dynamic data with strong temporal and variable characteristics, so new algorithms are needed to capture the temporal regularity in the data. By improving the Transformer architecture, and introducing a dynamic weight adjustment mechanism and a temporal dependency module, we enable the model to adapt to data changes and mine more accurate rules. Experimental results show that compared with traditional rule mining algorithms, the improved Transformer model has achieved significant improvements in rule mining accuracy, coverage, and stability. The contribution of each module in the algorithm performance is further verified by ablation experiments, proving the importance of temporal dependency and dynamic weight adjustment mechanisms in improving the model effect. In addition, although the improved model has certain challenges in computational efficiency, its advantages in accuracy and coverage enable it to perform well in processing complex dynamic data. Future research will focus on optimizing computational efficiency and combining more deep learning technologies to expand the application scope of the algorithm, especially in practical applications in the fields of finance, medical care, and intelligent recommendation.
LGMar 31, 2025
A Deep Learning Approach to Anomaly Detection in High-Frequency Trading DataQiuliuyang Bao, Jiawei Wang, Hao Gong et al.
This paper proposes an algorithm based on a staged sliding window Transformer architecture to detect abnormal behaviors in the microstructure of the foreign exchange market, focusing on high-frequency EUR/USD trading data. The method captures multi-scale temporal features through a staged sliding window, extracts global and local dependencies by combining the self-attention mechanism and weighted attention mechanism of the Transformer, and uses a classifier to identify abnormal events. Experimental results on a real high-frequency dataset containing order book depth, spread, and trading volume show that the proposed method significantly outperforms traditional machine learning (such as decision trees and random forests) and deep learning methods (such as MLP, CNN, RNN, LSTM) in terms of accuracy (0.93), F1-Score (0.91), and AUC-ROC (0.95). Ablation experiments verify the contribution of each component, and the visualization of order book depth and anomaly detection further reveals the effectiveness of the model under complex market dynamics. Despite the false positive problem, the model still provides important support for market supervision. In the future, noise processing can be optimized and extended to other markets to improve generalization and real-time performance.
CROct 13, 2024
Uncovering Attacks and Defenses in Secure Aggregation for Federated Deep LearningYiwei Zhang, Rouzbeh Behnia, Attila A. Yavuz et al.
Federated learning enables the collaborative learning of a global model on diverse data, preserving data locality and eliminating the need to transfer user data to a central server. However, data privacy remains vulnerable, as attacks can target user training data by exploiting the updates sent by users during each learning iteration. Secure aggregation protocols are designed to mask/encrypt user updates and enable a central server to aggregate the masked information. MicroSecAgg (PoPETS 2024) proposes a single server secure aggregation protocol that aims to mitigate the high communication complexity of the existing approaches by enabling a one-time setup of the secret to be re-used in multiple training iterations. In this paper, we identify a security flaw in the MicroSecAgg that undermines its privacy guarantees. We detail the security flaw and our attack, demonstrating how an adversary can exploit predictable masking values to compromise user privacy. Our findings highlight the critical need for enhanced security measures in secure aggregation protocols, particularly the implementation of dynamic and unpredictable masking strategies. We propose potential countermeasures to mitigate these vulnerabilities and ensure robust privacy protection in the secure aggregation frameworks.
AISep 29, 2025
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language ModelsChenyue Zhou, Mingxuan Wang, Yanbiao Ma et al.
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: ``From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.
CVAug 4, 2025
An Event-based Fast Intensity Reconstruction Scheme for UAV Real-time PerceptionXin Dong, Yiwei Zhang, Yangjie Cui et al.
Event cameras offer significant advantages, including a wide dynamic range, high temporal resolution, and immunity to motion blur, making them highly promising for addressing challenging visual conditions. Extracting and utilizing effective information from asynchronous event streams is essential for the onboard implementation of event cameras. In this paper, we propose a streamlined event-based intensity reconstruction scheme, event-based single integration (ESI), to address such implementation challenges. This method guarantees the portability of conventional frame-based vision methods to event-based scenarios and maintains the intrinsic advantages of event cameras. The ESI approach reconstructs intensity images by performing a single integration of the event streams combined with an enhanced decay algorithm. Such a method enables real-time intensity reconstruction at a high frame rate, typically 100 FPS. Furthermore, the relatively low computation load of ESI fits onboard implementation suitably, such as in UAV-based visual tracking scenarios. Extensive experiments have been conducted to evaluate the performance comparison of ESI and state-of-the-art algorithms. Compared to state-of-the-art algorithms, ESI demonstrates remarkable runtime efficiency improvements, superior reconstruction quality, and a high frame rate. As a result, ESI enhances UAV onboard perception significantly under visual adversary surroundings. In-flight tests, ESI demonstrates effective performance for UAV onboard visual tracking under extremely low illumination conditions(2-10lux), whereas other comparative algorithms fail due to insufficient frame rate, poor image quality, or limited real-time performance.
LGMar 7, 2025
To See a World in a Spark of Neuron: Disentangling Multi-task Interference for Training-free Model MergingZitao Fang, Guodong DU, Shuyang Yu et al.
Fine-tuning pre-trained models on targeted datasets enhances task-specific performance but often comes at the expense of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic, offer a promising solution. However, task interference remains a fundamental challenge, leading to performance degradation and suboptimal merged models. Existing approaches largely overlooked the fundamental roles of neurons, their connectivity, and activation, resulting in a merging process and a merged model that does not consider how neurons relay and process information. In this work, we present the first study that relies on neuronal mechanisms for model merging. Specifically, we decomposed task-specific representations into two complementary neuronal subspaces that regulate input sensitivity and task adaptability. Leveraging this decomposition, we introduced NeuroMerging, a novel merging framework developed to mitigate task interference within neuronal subspaces, enabling training-free model fusion across diverse tasks. Through extensive experiments, we demonstrated that NeuroMerging achieved superior performance compared to existing methods on multi-task benchmarks across both natural language and vision domains. Our findings highlighted the importance of aligning neuronal mechanisms in model merging, offering new insights into mitigating task interference and improving knowledge fusion. Our project is available at https://ZzzitaoFang.github.io/projects/NeuroMerging/.
CVDec 17, 2025
Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 DatasetsJialong Zuo, Haoyou Deng, Hanyu Zhou et al.
The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conducted a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. By utilizing simple textual prompts without fine-tuning, we benchmarked Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while \textbf{Nano Banana Pro demonstrates superior subjective visual quality}, often hallucinating plausible high-frequency details that surpass specialist models, it lags behind in traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency required by conventional metrics. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that achieving the high fidelity of domain specialists remains a significant hurdle.
AISep 30, 2025
AuditAgent: Expert-Guided Multi-Agent Reasoning for Cross-Document Fraudulent Evidence DiscoverySongran Bai, Bingzhe Wu, Yiwei Zhang et al.
Financial fraud detection in real-world scenarios presents significant challenges due to the subtlety and dispersion of evidence across complex, multi-year financial disclosures. In this work, we introduce a novel multi-agent reasoning framework AuditAgent, enhanced with auditing domain expertise, for fine-grained evidence chain localization in financial fraud cases. Leveraging an expert-annotated dataset constructed from enforcement documents and financial reports released by the China Securities Regulatory Commission, our approach integrates subject-level risk priors, a hybrid retrieval strategy, and specialized agent modules to efficiently identify and aggregate cross-report evidence. Extensive experiments demonstrate that our method substantially outperforms General-Purpose Agent paradigm in both recall and interpretability, establishing a new benchmark for automated, transparent financial forensics. Our results highlight the value of domain-specific reasoning and dataset construction for advancing robust financial fraud detection in practical, real-world regulatory applications.
ROJun 10, 2025
WD-DETR: Wavelet Denoising-Enhanced Real-Time Object Detection Transformer for Robot Perception with Event CamerasYangjie Cui, Boyang Gao, Yiwei Zhang et al.
Previous studies on event camera sensing have demonstrated certain detection performance using dense event representations. However, the accumulated noise in such dense representations has received insufficient attention, which degrades the representation quality and increases the likelihood of missed detections. To address this challenge, we propose the Wavelet Denoising-enhanced DEtection TRansformer, i.e., WD-DETR network, for event cameras. In particular, a dense event representation is presented first, which enables real-time reconstruction of events as tensors. Then, a wavelet transform method is designed to filter noise in the event representations. Such a method is integrated into the backbone for feature extraction. The extracted features are subsequently fed into a transformer-based network for object prediction. To further reduce inference time, we incorporate the Dynamic Reorganization Convolution Block (DRCB) as a fusion module within the hybrid encoder. The proposed method has been evaluated on three event-based object detection datasets, i.e., DSEC, Gen1, and 1Mpx. The results demonstrate that WD-DETR outperforms tested state-of-the-art methods. Additionally, we implement our approach on a common onboard computer for robots, the NVIDIA Jetson Orin NX, achieving a high frame rate of approximately 35 FPS using TensorRT FP16, which is exceptionally well-suited for real-time perception of onboard robotic systems.
IVApr 13, 2025
Dual-Modality Computational Ophthalmic Imaging with Deep Learning and Coaxial Optical DesignBoyuan Peng, Jiaju Chen, Yiwei Zhang et al.
The growing burden of myopia and retinal diseases necessitates more accessible and efficient eye screening solutions. This study presents a compact, dual-function optical device that integrates fundus photography and refractive error detection into a unified platform. The system features a coaxial optical design using dichroic mirrors to separate wavelength-dependent imaging paths, enabling simultaneous alignment of fundus and refraction modules. A Dense-U-Net-based algorithm with customized loss functions is employed for accurate pupil segmentation, facilitating automated alignment and focusing. Experimental evaluations demonstrate the system's capability to achieve high-precision pupil localization (EDE = 2.8 px, mIoU = 0.931) and reliable refractive estimation with a mean absolute error below 5%. Despite limitations due to commercial lens components, the proposed framework offers a promising solution for rapid, intelligent, and scalable ophthalmic screening, particularly suitable for community health settings.
CLMay 17, 2021
Room to Grow: Understanding Personal Characteristics Behind Self Improvement Using Social MediaMeiXing Dong, Xueming Xu, Yiwei Zhang et al.
Many people aim for change, but not everyone succeeds. While there are a number of social psychology theories that propose motivation-related characteristics of those who persist with change, few computational studies have explored the motivational stage of personal change. In this paper, we investigate a new dataset consisting of the writings of people who manifest intention to change, some of whom persist while others do not. Using a variety of linguistic analysis techniques, we first examine the writing patterns that distinguish the two groups of people. Persistent people tend to reference more topics related to long-term self-improvement and use a more complicated writing style. Drawing on these consistent differences, we build a classifier that can reliably identify the people more likely to persist, based on their language. Our experiments provide new insights into the motivation-related behavior of people who persist with their intention to change.
SIJan 10, 2020
Measuring Similarity between Brands using Followers' Post in Social MediaYiwei Zhang, Xueting Wang, Yoshiaki Sakai et al.
In this paper, we propose a new measure to estimate the similarity between brands via posts of brands' followers on social network services (SNS). Our method was developed with the intention of exploring the brands that customers are likely to jointly purchase. Nowadays, brands use social media for targeted advertising because influencing users' preferences can greatly affect the trends in sales. We assume that data on SNS allows us to make quantitative comparisons between brands. Our proposed algorithm analyzes the daily photos and hashtags posted by each brand's followers. By clustering them and converting them to histograms, we can calculate the similarity between brands. We evaluated our proposed algorithm with purchase logs, credit card information, and answers to the questionnaires. The experimental results show that the purchase data maintained by a mall or a credit card company can predict the co-purchase very well, but not the customer's willingness to buy products of new brands. On the other hand, our method can predict the users' interest on brands with a correlation value over 0.53, which is pretty high considering that such interest to brands are high subjective and individual dependent.
CVDec 25, 2019
Look, Listen, and Act: Towards Audio-Visual Embodied NavigationChuang Gan, Yiwei Zhang, Jiajun Wu et al.
A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory inputs in an environment and to make a sequence of actions to reach their goals. In this paper, we attempt to approach the problem of Audio-Visual Embodied Navigation, the task of planning the shortest path from a random starting location in a scene to the sound source in an indoor environment, given only raw egocentric visual and audio sensory data. To accomplish this task, the agent is required to learn from various modalities, i.e. relating the audio signal to the visual environment. Here we describe an approach to audio-visual embodied navigation that takes advantage of both visual and audio pieces of evidence. Our solution is based on three key ideas: a visual perception mapper module that constructs its spatial memory of the environment, a sound perception module that infers the relative location of the sound source from the agent, and a dynamic path planner that plans a sequence of actions based on the audio-visual observations and the spatial memory of the environment to navigate toward the goal. Experimental results on a newly collected Visual-Audio-Room dataset using the simulated multi-modal environment demonstrate the effectiveness of our approach over several competitive baselines.
LGNov 22, 2019
Factorized Multimodal Transformer for Multimodal Sequential LearningAmir Zadeh, Chengfeng Mao, Kelly Shi et al.
The complex world around us is inherently multimodal and sequential (continuous). Information is scattered across different modalities and requires multiple continuous sensors to be captured. As machine learning leaps towards better generalization to real world, multimodal sequential learning becomes a fundamental research area. Arguably, modeling arbitrarily distributed spatio-temporal dynamics within and across modalities is the biggest challenge in this research area. In this paper, we present a new transformer model, called the Factorized Multimodal Transformer (FMT) for multimodal sequential learning. FMT inherently models the intramodal and intermodal (involving two or more modalities) dynamics within its multimodal input in a factorized manner. The proposed factorization allows for increasing the number of self-attentions to better model the multimodal phenomena at hand; without encountering difficulties during training (e.g. overfitting) even on relatively low-resource setups. All the attention mechanisms within FMT have a full time-domain receptive field which allows them to asynchronously capture long-range multimodal dynamics. In our experiments we focus on datasets that contain the three commonly studied modalities of language, vision and acoustic. We perform a wide range of experiments, spanning across 3 well-studied datasets and 21 distinct labels. FMT shows superior performance over previously proposed models, setting new state of the art in the studied datasets.
SIOct 21, 2019
User-Aware Folk Popularity Rank: User-Popularity-Based Tag Recommendation That Can Enhance Social PopularityXueting Wang, Yiwei Zhang, Toshihiko Yamasaki
In this paper we propose a method that can enhance the social popularity of a post (i.e., the number of views or likes) by recommending appropriate hash tags considering both content popularity and user popularity. A previous approach called FolkPopularityRank (FP-Rank) considered only the relationship among images, tags, and their popularity. However, the popularity of an image/video is strongly affected by who uploaded it. Therefore, we develop an algorithm that can incorporate user popularity and users' tag usage tendency into the FP-Rank algorithm. The experimental results using 60,000 training images with their accompanying tags and 1,000 test data, which were actually uploaded to a real social network service (SNS), show that, in ten days, our proposed algorithm can achieve 1.2 times more views than the FP-Rank algorithm. This technology would be critical to individual users and companies/brands who want to promote themselves in SNSs.
CVFeb 24, 2019
Bi-Skip: A Motion Deblurring Network Using Self-paced LearningYiwei Zhang, Chunbiao Zhu, Ge Li et al.
A fast and effective motion deblurring method has great application values in real life. This work presents an innovative approach in which a self-paced learning is combined with GAN to deblur image. First, We explain that a proper generator can be used as deep priors and point out that the solution for pixel-based loss is not same with the one for perception-based loss. By using these ideas as starting points, a Bi-Skip network is proposed to improve the generating ability and a bi-level loss is adopted to solve the problem that common conditions are non-identical. Second, considering that the complex motion blur will perturb the network in the training process, a self-paced mechanism is adopted to enhance the robustness of the network. Through extensive evaluations on both qualitative and quantitative criteria, it is demonstrated that our approach has a competitive advantage over state-of-the-art methods.
CVJun 21, 2018
Detection based Defense against Adversarial Examples from the Steganalysis Point of ViewJiayang Liu, Weiming Zhang, Yiwei Zhang et al.
Deep Neural Networks (DNNs) have recently led to significant improvements in many fields. However, DNNs are vulnerable to adversarial examples which are samples with imperceptible perturbations while dramatically misleading the DNNs. Moreover, adversarial examples can be used to perform an attack on various kinds of DNN based systems, even if the adversary has no access to the underlying model. Many defense methods have been proposed, such as obfuscating gradients of the networks or detecting adversarial examples. However it is proved out that these defense methods are not effective or cannot resist secondary adversarial attacks. In this paper, we point out that steganalysis can be applied to adversarial examples detection, and propose a method to enhance steganalysis features by estimating the probability of modifications caused by adversarial attacks. Experimental results show that the proposed method can accurately detect adversarial examples. Moreover, secondary adversarial attacks cannot be directly performed to our method because our method is not based on a neural network but based on high-dimensional artificial features and FLD (Fisher Linear Discriminant) ensemble.
MAAug 25, 2017
Reinforcement Mechanism Design for e-commerceQingpeng Cai, Aris Filos-Ratsikas, Pingzhong Tang et al.
We study the problem of allocating impressions to sellers in e-commerce websites, such as Amazon, eBay or Taobao, aiming to maximize the total revenue generated by the platform. We employ a general framework of reinforcement mechanism design, which uses deep reinforcement learning to design efficient algorithms, taking the strategic behaviour of the sellers into account. Specifically, we model the impression allocation problem as a Markov decision process, where the states encode the history of impressions, prices, transactions and generated revenue and the actions are the possible impression allocations in each round. To tackle the problem of continuity and high-dimensionality of states and actions, we adopt the ideas of the DDPG algorithm to design an actor-critic policy gradient algorithm which takes advantage of the problem domain in order to achieve convergence and stability. We evaluate our proposed algorithm, coined IA(GRU), by comparing it against DDPG, as well as several natural heuristics, under different rationality models for the sellers - we assume that sellers follow well-known no-regret type strategies which may vary in their degree of sophistication. We find that IA(GRU) outperforms all algorithms in terms of the total revenue.