Jingyu Wang

CV
h-index116
47 papers
997 citations
Novelty 54%
AI Score 58

47 Papers

CV · Jul 29, 2022 · Code
Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding

Jiachang Hao, Haifeng Sun, Pengfei Ren et al.

Temporal grounding aims to locate a target video moment that semantically corresponds to the given sentence query in an untrimmed video. However, recent works find that existing methods suffer from a severe temporal bias problem. These methods do not reason about the target moment locations based on the visual-textual semantic alignment but over-rely on the temporal biases of queries in training sets. To this end, this paper proposes a novel training framework for grounding models that uses shuffled videos to address the temporal bias problem without losing grounding accuracy. Our framework introduces two auxiliary tasks, cross-modal matching and temporal order discrimination, to promote the grounding model training. The cross-modal matching task leverages the content consistency between shuffled and original videos to force the grounding model to mine visual contents to semantically match queries. The temporal order discrimination task leverages the difference in temporal order to strengthen the understanding of long-term temporal contexts. Extensive experiments on Charades-STA and ActivityNet Captions demonstrate the effectiveness of our method for mitigating the reliance on temporal biases and strengthening the model's generalization ability against the different temporal distributions. Code is available at https://github.com/haojc/ShufflingVideosForTSG.
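
The shuffling idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the segment-level shuffling scheme, and the 1/0 labels are assumptions made for illustration. It only shows the two properties the abstract relies on, namely that a shuffled video keeps the same content while its temporal order differs.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_segments(
    frames: Sequence[int], num_segments: int, rng: random.Random
) -> List[int]:
    """Split frames into contiguous segments and permute the segment order.

    Hypothetical helper: produces at most num_segments segments.
    """
    if not 1 <= num_segments <= len(frames):
        raise ValueError("num_segments must be between 1 and len(frames)")
    size = -(-len(frames) // num_segments)  # ceiling division
    segments = [list(frames[i:i + size]) for i in range(0, len(frames), size)]
    rng.shuffle(segments)
    return [frame for segment in segments for frame in segment]


def order_discrimination_pair(
    frames: Sequence[int], num_segments: int, rng: random.Random
) -> List[Tuple[List[int], int]]:
    """Return (original, 1) and (shuffled, 0) examples.

    The labels are an assumed encoding for a temporal order discrimination
    task; a random permutation can occasionally reproduce the original order.
    """
    return [
        (list(frames), 1),
        (shuffle_segments(frames, num_segments, rng), 0),
    ]


# Toy "video" of 8 frame indices, shuffled in 4 segments of 2 frames each.
frames = list(range(8))
pairs = order_discrimination_pair(frames, num_segments=4, rng=random.Random(0))
for video, label in pairs:
    print(label, video)
```

Because both examples contain exactly the same frames, a cross-modal matching objective can treat them as matching the same query, while the order label differs.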

SE · Aug 15, 2023 · Code
Interpretable Online Log Analysis Using Large Language Models with Prompt Strategies

Yilun Liu, Shimin Tao, Weibin Meng et al.

Automated log analysis is crucial in modern software-intensive systems for facilitating program comprehension throughout software maintenance and engineering life cycles. Existing methods perform tasks such as log parsing and log anomaly detection by providing a single prediction value without interpretation. However, given the increasing volume of system events, the limited interpretability of analysis results hinders analysts' comprehension of program status and their ability to take appropriate actions. Moreover, these methods require substantial in-domain training data, and their performance declines sharply (by up to 62.5%) in online scenarios involving unseen logs from new domains, a common occurrence due to rapid software updates. In this paper, we propose LogPrompt, a novel interpretable log analysis approach for online scenarios. LogPrompt employs large language models (LLMs) to perform online log analysis tasks via a suite of advanced prompt strategies tailored for log tasks, which enhances LLMs' performance by up to 380.7% compared with simple prompts. Experiments on nine publicly available evaluation datasets across two tasks demonstrate that LogPrompt, despite requiring no in-domain training, outperforms existing approaches trained on thousands of logs by up to 55.9%. We also conduct a human evaluation of LogPrompt's interpretability with six practitioners possessing over 10 years of experience, who rated the generated content highly in terms of usefulness and readability (4.42/5 on average). LogPrompt also exhibits remarkable compatibility with open-source and smaller-scale LLMs, making it flexible for practical deployment. Code of LogPrompt is available at https://github.com/lunyiliu/LogPrompt.

AI · Aug 22, 2024 · Code
Enhanced Fine-Tuning of Lightweight Domain-Specific Q&A Model Based on Large Language Models

Shenglin Zhang, Pengtian Zhu, Minghua Ma et al.

Large language models (LLMs) excel at general question-answering (Q&A) but often fall short in specialized domains due to a lack of domain-specific knowledge. Commercial companies face the dual challenges of privacy protection and resource constraints when involving LLMs for fine-tuning. This paper proposes a novel framework, Self-Evolution, designed to address these issues by leveraging lightweight open-source LLMs through multiple iterative fine-tuning rounds. To enhance the efficiency of iterative fine-tuning, Self-Evolution employs a strategy that filters and reinforces the knowledge with higher value during the iterative process. We employed Self-Evolution on Qwen1.5-7B-Chat using 4,000 documents containing rich domain knowledge from China Mobile, achieving a performance score 174% higher on domain-specific question-answering evaluations than Qwen1.5-7B-Chat and even 22% higher than Qwen1.5-72B-Chat. Self-Evolution has been deployed in China Mobile's daily operation and maintenance for 117 days, and it improves the efficiency of locating alarms, fixing problems, and finding related reports, with an average efficiency improvement of over 18.6%. In addition, we release the Self-Evolution framework code at https://github.com/Zero-Pointer/Self-Evolution.

LG · Sep 27, 2024
Rethinking the Power of Timestamps for Robust Time Series Forecasting: A Global-Local Fusion Perspective

Chengsen Wang, Qi Qi, Jingyu Wang et al.

Time series forecasting has played a pivotal role across various industries, including finance, transportation, energy, healthcare, and climate. Due to the abundant seasonal information they contain, timestamps possess the potential to offer robust global guidance for forecasting techniques. However, existing works primarily focus on local observations, with timestamps being treated merely as an optional supplement that remains underutilized. When data gathered from the real world is polluted, the absence of global information will damage the robust prediction capability of these algorithms. To address these problems, we propose a novel framework named GLAFF. Within this framework, the timestamps are modeled individually to capture the global dependencies. Working as a plugin, GLAFF adaptively adjusts the combined weights for global and local information, enabling seamless collaboration with any time series forecasting backbone. Extensive experiments conducted on nine real-world datasets demonstrate that GLAFF significantly enhances the average performance of widely used mainstream forecasting models by 12.5%, surpassing the previous state-of-the-art method by 5.5%.

GT · Jun 21, 2023
Adaptive DNN Surgery for Selfish Inference Acceleration with On-demand Edge Resource

Xiang Yang, Dezhi Chen, Qi Qi et al.

Deep Neural Networks (DNNs) have significantly improved the accuracy of intelligent applications on mobile devices. DNN surgery, which partitions DNN processing between mobile devices and multi-access edge computing (MEC) servers, can enable real-time inference despite the computational limitations of mobile devices. However, DNN surgery faces a critical challenge: determining the optimal computing resource demand from the server and the corresponding partition strategy, while considering both inference latency and MEC server usage costs. This problem is compounded by two factors: (1) the finite computing capacity of the MEC server, which is shared among multiple devices, leading to inter-dependent demands, and (2) the shift in modern DNN architecture from chains to directed acyclic graphs (DAGs), which complicates potential solutions. In this paper, we introduce a novel Decentralized DNN Surgery (DDS) framework. We formulate the partition strategy as a min-cut and propose a resource allocation game to adaptively schedule the demands of mobile devices in an MEC environment. We prove the existence of a Nash Equilibrium (NE), and develop an iterative algorithm to efficiently reach the NE for each device. Our extensive experiments demonstrate that DDS can effectively handle varying MEC scenarios, achieving up to 1.25× acceleration compared to the state-of-the-art algorithm.

CV · Feb 5, 2023
Decoupled Iterative Refinement Framework for Interacting Hands Reconstruction from a Single RGB Image

Pengfei Ren, Chao Wen, Xiaozheng Zheng et al.

Reconstructing interacting hands from a single RGB image is a very challenging task. On the one hand, severe mutual occlusion and similar local appearance between two hands confuse the extraction of visual features, resulting in the misalignment of estimated hand meshes and the image. On the other hand, there is a complex spatial relationship between interacting hands, which significantly increases the solution space of hand poses and increases the difficulty of network learning. In this paper, we propose a decoupled iterative refinement framework to achieve pixel-aligned hand reconstruction while efficiently modeling the spatial relationship between hands. Specifically, we define two feature spaces with different characteristics, namely 2D visual feature space and 3D joint feature space. First, we obtain joint-wise features from the visual feature map and utilize a graph convolution network and a transformer to perform intra- and inter-hand information interaction in the 3D joint feature space, respectively. Then, we project the joint features with global information back into the 2D visual feature space in an obfuscation-free manner and utilize the 2D convolution for pixel-wise enhancement. By performing multiple alternate enhancements in the two feature spaces, our method can achieve an accurate and robust reconstruction of interacting hands. Our method outperforms all existing two-hand reconstruction methods by a large margin on the InterHand2.6M dataset.

ME · Apr 28
Fractionally Supervised Classification with Maxima Nominated Samples

Mohammad Jafari Jozani, Jingyu Wang

Fractionally supervised classification (FSC) offers a flexible framework for combining labeled and unlabeled data in model-based classification, but existing formulations assume simple random sampling. In many applications, however, the retained observation is an extreme order statistic from a set rather than a randomly selected unit. This is particularly appealing when the target population is rare, since maxima nomination sampling (NS) can enrich the sample with the most informative observations, as in screening, environmental monitoring, repeated testing, and reliability studies. Under such designs, the likelihood function changes fundamentally, and the usual FSC EM construction is no longer valid. We develop FSC for nominated samples by introducing a latent representation that accounts for both the class membership of the observed maximum and the latent composition of the remaining units in the set. The resulting method yields a proper EM algorithm and a coherent weighted-likelihood FSC procedure for NS data. We present the methodology in general form, illustrate it for a rare-event contamination normal mixture, and show through simulation that it substantially improves on the misspecified alternative that ignores the extra rank information of such data. A real-data analysis demonstrates its practical value.

CL · Jul 26, 2023
How Does Diffusion Influence Pretrained Language Models on Out-of-Distribution Data?

Huazheng Wang, Daixuan Cheng, Haifeng Sun et al.

Transformer-based pretrained language models (PLMs) have achieved great success in modern NLP. An important advantage of PLMs is good out-of-distribution (OOD) robustness. Recently, a growing body of work has applied diffusion models to PLMs. It remains under-explored how diffusion influences PLMs on OOD data. The core of diffusion models is a forward diffusion process which gradually applies Gaussian noise to inputs, and a reverse denoising process which removes noise. The noised input reconstruction is a fundamental ability of diffusion models. We directly analyze OOD robustness by measuring the reconstruction loss, including testing the abilities to reconstruct OOD data, and to detect OOD samples. Experiments are conducted by analyzing different training parameters and data statistical features on eight datasets. The results show that finetuning PLMs with diffusion degrades the reconstruction ability on OOD data. The comparison also shows that diffusion models can effectively detect OOD samples, achieving state-of-the-art performance in most of the datasets with an absolute accuracy improvement up to 18%. These results indicate that diffusion reduces OOD robustness of PLMs.
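
As a rough illustration of the forward diffusion process and the reconstruction loss mentioned in the abstract above, the sketch below uses the standard closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a unit Gaussian. The linear noise schedule, the toy 4-dimensional input (standing in for something like an embedding vector), and the use of mean squared error are assumptions for illustration, not details taken from the paper.

```python
import math
import random
from typing import List, Sequence


def alpha_bar(betas: Sequence[float], t: int) -> float:
    """Product of (1 - beta_s) over the first t steps; equals 1.0 at t = 0."""
    product = 1.0
    for beta in betas[:t]:
        product *= 1.0 - beta
    return product


def forward_diffuse(
    x0: Sequence[float], betas: Sequence[float], t: int, rng: random.Random
) -> List[float]:
    """Sample x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps, eps ~ N(0, I)."""
    a_bar = alpha_bar(betas, t)
    signal_scale = math.sqrt(a_bar)
    noise_scale = math.sqrt(1.0 - a_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]


def reconstruction_loss(x0: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between the clean input and a reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x0, x_hat)) / len(x0)


# Hypothetical linear schedule with 10 steps and a toy 4-dimensional input.
betas = [0.01 + 0.02 * i for i in range(10)]
x0 = [0.5, -1.0, 2.0, 0.0]
x_noisy = forward_diffuse(x0, betas, t=10, rng=random.Random(0))

# With no denoiser in this sketch, the noised input itself is scored.
print(reconstruction_loss(x0, x_noisy))
```

In a real setup a trained denoiser would map `x_noisy` back toward `x0`, and the loss would be computed on that output; a higher loss on OOD inputs than on in-distribution inputs is the kind of signal the abstract describes using for OOD detection.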

CV · Feb 2, 2023
HaMuCo: Hand Pose Estimation via Multiview Collaborative Self-Supervised Learning

Xiaozheng Zheng, Chao Wen, Zhou Xue et al.

Recent advancements in 3D hand pose estimation have shown promising results, but their effectiveness has primarily relied on the availability of large-scale annotated datasets, the creation of which is a laborious and costly process. To alleviate the label-hungry limitation, we propose a self-supervised learning framework, HaMuCo, that learns a single-view hand pose estimator from multi-view pseudo 2D labels. However, one of the main challenges of self-supervised learning is the presence of noisy labels and the "groupthink" effect from multiple views. To overcome these issues, we introduce a cross-view interaction network that distills the single-view estimator by utilizing the cross-view correlated features and enforcing multi-view consistency to achieve collaborative learning. Both the single-view estimator and the cross-view interaction network are trained jointly in an end-to-end manner. Extensive experiments show that our method can achieve state-of-the-art performance on multi-view self-supervised hand pose estimation. Furthermore, the proposed cross-view interaction network can also be applied to hand pose estimation from multi-view input and outperforms previous methods under the same settings.

LG · Nov 5, 2024 · Code
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios

Yunkai Dang, Mengxi Gao, Yibo Yan et al. · tsinghua

Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. However, existing studies have concentrated mainly on visual-textual misalignment, leaving largely unexplored the MLLMs' ability to preserve an originally correct answer when confronted with misleading information. We reveal a response uncertainty phenomenon: across nine standard datasets, twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue. To systematically quantify this vulnerability, we propose a two-stage evaluation pipeline: (1) elicit each model's original response on unperturbed inputs; (2) inject explicit (false-answer hints) and implicit (contextual contradictions) misleading instructions, and compute the misleading rate, defined as the fraction of correct-to-incorrect flips. Leveraging the most susceptible examples, we curate the Multimodal Uncertainty Benchmark (MUB), a collection of image-question pairs stratified into low, medium, and high difficulty based on how many of twelve state-of-the-art MLLMs they mislead. Extensive evaluation on twelve open-source and five closed-source models reveals a high uncertainty: average misleading rates exceed 86%, with explicit cues over 67.19% and implicit cues over 80.67%. To reduce the misleading rate, we then fine-tune all open-source MLLMs on a compact 2000-sample mixed-instruction dataset, reducing misleading rates to 6.97% (explicit) and 32.77% (implicit), boosting consistency by nearly 29.37% on highly deceptive inputs, and slightly improving accuracy on standard benchmarks. Our code is available at https://github.com/Yunkaidang/uncertainty.
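
The misleading rate in the two-stage pipeline above can be written down as a short sketch. The abstract describes it as the fraction of correct-to-incorrect flips; taking the denominator to be the items answered correctly in stage 1 is an assumption here, as are the function name and the toy data.

```python
from typing import Sequence


def misleading_rate(
    before_correct: Sequence[bool], after_correct: Sequence[bool]
) -> float:
    """Fraction of initially correct answers that become incorrect.

    before_correct[i]: was item i answered correctly on the unperturbed input?
    after_correct[i]:  was it still correct after the misleading instruction?
    """
    if len(before_correct) != len(after_correct):
        raise ValueError("both sequences must cover the same items")
    # Only items answered correctly in stage 1 can flip from correct to incorrect.
    initially_correct = [i for i, ok in enumerate(before_correct) if ok]
    if not initially_correct:
        return 0.0
    flips = sum(1 for i in initially_correct if not after_correct[i])
    return flips / len(initially_correct)


# Hypothetical toy data: 4 of 5 answers start correct, and 2 of those flip.
before = [True, True, True, False, True]
after = [True, False, False, False, True]
rate = misleading_rate(before, after)
print(rate)  # 0.5
```

The same function can be applied separately to explicit and implicit misleading instructions to obtain the two per-cue rates the abstract reports.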

CV · Sep 30, 2024
RISE-SDF: a Relightable Information-Shared Signed Distance Field for Glossy Object Inverse Rendering

Deheng Zhang, Jingyu Wang, Shaofei Wang et al.

In this paper, we propose a novel end-to-end relightable neural inverse rendering system that achieves high-quality reconstruction of geometry and material properties, thus enabling high-quality relighting. The cornerstone of our method is a two-stage approach for learning a better factorization of scene parameters. In the first stage, we develop a reflection-aware radiance field using a neural signed distance field (SDF) as the geometry representation and deploy an MLP (multilayer perceptron) to estimate indirect illumination. In the second stage, we introduce a novel information-sharing network structure to jointly learn the radiance field and the physically based factorization of the scene. For the physically based factorization, to reduce the noise caused by Monte Carlo sampling, we apply a split-sum approximation with a simplified Disney BRDF and cube mipmap as the environment light representation. In the relighting phase, to enhance the quality of indirect illumination, we propose a second split-sum algorithm to trace secondary rays under the split-sum rendering framework. Furthermore, there is no dataset or protocol available to quantitatively evaluate the inverse rendering performance for glossy objects. To assess the quality of material reconstruction and relighting, we have created a new dataset with ground truth BRDF parameters and relighting results. Our experiments demonstrate that our algorithm achieves state-of-the-art performance in inverse rendering and relighting, with particularly strong results in the reconstruction of highly reflective objects.

SP · Mar 19
Holter-to-Sleep: AI-Enabled Repurposing of Single-Lead ECG for Sleep Phenotyping

Donglin Xie, Qingshuo Zhao, Jingyu Wang et al.

Sleep disturbances are tightly linked to cardiovascular risk, yet polysomnography (PSG), the clinical reference standard, remains resource-intensive and poorly suited for multi-night, home-based, and large-scale screening. Single-lead electrocardiography (ECG), already ubiquitous in Holter and patch-based devices, enables comfortable long-term acquisition and encodes sleep-relevant physiology through autonomic modulation and cardiorespiratory coupling. Here, we present a proof-of-concept Holter-to-Sleep framework that, using single-lead ECG as the sole input, jointly supports overnight sleep phenotyping and Holter-grade cardiac phenotyping within the same recording, and further provides an explicit analytic pathway for scalable cardio-sleep association studies. The framework is developed and validated on a pooled multi-center PSG sample of 10,439 studies spanning four public cohorts, with independent external evaluation to assess cross-cohort generalizability, and additional real-world feasibility assessment using overnight patch-ECG recordings via objective-subjective consistency analysis. This integrated design enables robust extraction of clinically meaningful overnight sleep phenotypes under heterogeneous populations and acquisition conditions, and facilitates systematic linkage between ECG-derived sleep metrics and arrhythmia-related Holter phenotypes. Collectively, the Holter-to-Sleep paradigm offers a practical foundation for low-burden, home-deployable, and scalable cardio-sleep monitoring and research beyond traditional PSG-centric workflows.

CL · Feb 27, 2025 · Code
Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models

Huazheng Wang, Yongcheng Jing, Haifeng Sun et al.

In this paper, we investigate knowledge forgetting in large language models with a focus on its generalisation, ensuring that models forget not only specific training samples but also related implicit knowledge. To this end, we begin by identifying a broader unlearning scope that includes both target data and logically associated samples, including rephrased, subject-replaced, relation-reversed, and one-hop reasoned data. We then conduct a rigorous evaluation of 15 state-of-the-art methods across three datasets, revealing that unlearned models still recall paraphrased answers and retain target facts in their intermediate layers. This motivates us to take a preliminary step toward more generalised implicit knowledge forgetting by proposing PerMU, a novel probability perturbation-based unlearning paradigm. PerMU simulates adversarial unlearning samples to eliminate fact-related tokens from the logit distribution, collectively reducing the probabilities of all answer-associated tokens. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE, using models ranging from 1.3B to 13B in scale. The results demonstrate that PerMU delivers up to a 50.40% improvement in unlearning vanilla target data while maintaining a 40.73% boost in forgetting implicit knowledge. Our code can be found at https://github.com/MaybeLizzy/PERMU.

CV · Nov 20, 2024 · Code
MEGL: Multimodal Explanation-Guided Learning

Yifei Zhang, Tianxu Jiang, Bo Pan et al.

Explaining the decision-making processes of Artificial Intelligence (AI) models is crucial for addressing their "black box" nature, particularly in tasks like image classification. Traditional eXplainable AI (XAI) methods typically rely on unimodal explanations, either visual or textual, each with inherent limitations. Visual explanations highlight key regions but often lack rationale, while textual explanations provide context without spatial grounding. Further, both explanation types can be inconsistent or incomplete, limiting their reliability. To address these challenges, we propose a novel Multimodal Explanation-Guided Learning (MEGL) framework that leverages both visual and textual explanations to enhance model interpretability and improve classification performance. Our Saliency-Driven Textual Grounding (SDTG) approach integrates spatial information from visual explanations into textual rationales, providing spatially grounded and contextually rich explanations. Additionally, we introduce Textual Supervision on Visual Explanations to align visual explanations with textual rationales, even in cases where ground truth visual annotations are missing. A Visual Explanation Distribution Consistency loss further reinforces visual coherence by aligning the generated visual explanations with dataset-level patterns, enabling the model to effectively learn from incomplete multimodal supervision. We validate MEGL on two new datasets, Object-ME and Action-ME, for image classification with multimodal explanations. Experimental results demonstrate that MEGL outperforms previous approaches in prediction accuracy and explanation quality across both visual and textual domains. Our code will be made available upon the acceptance of the paper.

CR · Sep 21, 2023
Extracting Physical Causality from Measurements to Detect and Localize False Data Injection Attacks

Shengyang Wu, Jingyu Wang, Dongyuan Shi

False Data Injection Attack (FDIA) has become a growing concern in modern cyber-physical power systems. Most existing FDIA detection techniques project the raw measurement data into a high-dimensional latent space to separate normal and attacked samples. These approaches focus more on the statistical correlations of data values and are therefore susceptible to data distribution drifts induced by changes in system operating points or changes in FDIA types and strengths, especially for FDIA localization tasks. Causal inference, on the other hand, extracts the causality behind the coordinated fluctuations of different measurements. The causality patterns are determined by fundamental physical laws such as Ohm's Law and Kirchhoff's Law. They are sensitive to the violation of physical laws caused by FDIA, but tend to remain stable with the drift of system operating points. Leveraging this advantage, this paper proposes a joint FDIA detection and localization framework based on causal inference and the Graph Attention Network (GAT) to identify the attacked system nodes. The proposed framework consists of two levels. The lower level uses the X-learner algorithm to estimate the causality strength between measurements and generate Measurement Causality Graphs (MCGs). The upper level then applies a GAT to identify the anomaly patterns in the MCGs. Since the extracted causality patterns are intrinsically related to the measurements, it is easier for the upper level to figure out the attacked nodes than the existing FDIA localization approaches. The performance of the proposed framework is evaluated on the IEEE 39-bus system. Experimental results show that the causality-based FDIA detection and localization mechanism is highly interpretable and robust.

SP · Apr 7
The Breakthrough of Sleep: A Contactless Approach for Accurate Sleep Stage Detection Using the Sleepal AI Lamp

Zhuo Diao, Yueting Li, Jianpeng Wang et al.

Sleep staging is essential for the assessment of sleep quality and the diagnosis of sleep-related disorders. Conventional polysomnography (PSG), while considered the gold standard, is intrusive, labor-intensive, and unsuitable for long-term monitoring. This study evaluates the performance of the Sleepal AI Lamp, a contactless, radar-based consumer-grade sleep tracker, in comparison with gold-standard polysomnography (PSG), using a large-scale dataset comprising 1022 overnight recordings. We extract multi-scale respiratory and motion-related features from radar signals to train a frequency-augmented deep learning model. For the binary sleep-wake classification task, experimental results demonstrated that the model achieved an accuracy of 92.8% alongside a macro-averaged F1 score of 0.895. For four-stage classification (wake, light NREM (N1 + N2), deep NREM (N3), REM), the model achieved an accuracy of 78.5% with a Cohen's kappa coefficient of 0.695 in healthy individuals and maintained a stable accuracy of 77.2% with a kappa of 0.677 in a heterogeneous population including patients with varying severities of obstructive sleep apnea (OSA). These experimental results demonstrate that the sleep staging performance of the contactless Sleepal AI Lamp is in high agreement with expert-labeled PSG sleep stages. Our findings suggest that non-contact radar sensing, combined with advanced temporal modeling, can provide reliable sleep staging performance without requiring physical contact or wearable devices. Owing to its unobtrusive nature, ease of deployment, and robustness to long-term use, the contactless Sleepal AI Lamp shows strong potential for clinical screening, home-based sleep assessment, and continuous longitudinal sleep monitoring in real-world medical and healthcare applications.
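
The abstract above reports agreement with PSG using accuracy and Cohen's kappa. As a reminder of how kappa differs from raw accuracy, the sketch below computes both from two label sequences using the standard definition kappa = (p_o − p_e) / (1 − p_e). The stage labels and the toy sequences are made up for illustration and are not data from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(
    reference: Sequence[Hashable], predicted: Sequence[Hashable]
) -> float:
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(reference)
    # Observed agreement p_o is the plain accuracy.
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    # Chance agreement p_e from the two marginal label distributions.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    expected = sum(ref_counts[k] * pred_counts[k] for k in ref_counts) / (n * n)
    if expected == 1.0:  # both labelings use one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical 8-epoch hypnogram: expert PSG labels vs. model output.
psg = ["wake", "light", "light", "deep", "rem", "light", "wake", "rem"]
model = ["wake", "light", "deep", "deep", "rem", "light", "light", "rem"]

accuracy = sum(a == b for a, b in zip(psg, model)) / len(psg)
kappa = cohens_kappa(psg, model)
print(accuracy, round(kappa, 3))  # 0.75 0.66
```

Here 6 of 8 epochs agree (accuracy 0.75), but the chance agreement of 17/64 lowers kappa to 31/47, which is why kappa is typically reported alongside accuracy for imbalanced sleep-stage distributions.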

CV · Feb 20 · Code
3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis

Ziyue Wang, Linghan Cai, Chang Han Low et al.

3D CT analysis spans a continuum from low-level perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical assistants. Code and data are available at https://github.com/jinlab-imvr/3DMedAgent.

CV · Feb 3, 2024 · Code
Multi-Level Aggregation and Recursive Alignment Architecture for Efficient Parallel Inference Segmentation Network

Yanhua Zhang, Ke Zhang, Jingyu Wang et al.

Real-time semantic segmentation is a crucial research topic for real-world applications. However, many methods lay particular emphasis on reducing the computational complexity and model size, while largely sacrificing the accuracy. To tackle this problem, we propose a parallel inference network customized for semantic segmentation tasks to achieve a good trade-off between speed and accuracy. We employ a shallow backbone to ensure real-time speed, and propose three core components to compensate for the reduced model capacity to improve accuracy. Specifically, we first design a dual-pyramidal path architecture (Multi-level Feature Aggregation Module, MFAM) to aggregate multi-level features from the encoder to each scale, providing hierarchical clues for subsequent spatial alignment and corresponding in-network inference. Then, we build a Recursive Alignment Module (RAM) by combining the flow-based alignment module with a recursive upsampling architecture for accurate spatial alignment between multi-scale feature maps with half the computational complexity of the straightforward alignment method. Finally, we perform independent parallel inference on the aligned features to obtain multi-scale scores, and adaptively fuse them through an attention-based Adaptive Scores Fusion Module (ASFM) so that the final prediction can favor objects of multiple scales. Our framework shows a better balance between speed and accuracy than state-of-the-art real-time methods on Cityscapes and CamVid datasets. We also conducted systematic ablation studies to gain insight into our motivation and architectural design. Code is available at: https://github.com/Yanhua-Zhang/MFARANet.

CL · Dec 16, 2024
ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data

Chengsen Wang, Qi Qi, Jingyu Wang et al.

Human experts typically integrate numerical and textual multimodal information to analyze time series. However, most traditional deep learning predictors rely solely on unimodal numerical data, using a fixed-length window for training and prediction on a single dataset, and cannot adapt to different scenarios. Powerful pre-trained large language models have introduced new opportunities for time series analysis. Yet, existing methods are either inefficient in training, incapable of handling textual information, or lack zero-shot forecasting capability. In this paper, we innovatively model time series as a foreign language and construct ChatTime, a unified framework for time series and text processing. As an out-of-the-box multimodal time series foundation model, ChatTime provides zero-shot forecasting capability and supports bimodal input/output for both time series and text. We design a series of experiments to verify the superior performance of ChatTime across multiple tasks and scenarios, and create four multimodal datasets to address data gaps. The experimental results demonstrate the potential and utility of ChatTime.

CV · Apr 10, 2025
Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao et al. · tsinghua

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that the perceptual complexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approaching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.

CVFeb 17, 2025
Unhackable Temporal Rewarding for Scalable Video MLLMs

En Yu, Kangheng Lin, Liang Zhao et al.

In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models shortcut by fixating on select frames, missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking, defining it from a reinforcement learning perspective, introducing the Temporal Perplexity (TPL) score to assess this misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework to mitigate the temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.

LGDec 7, 2023
Detection and Imputation based Two-Stage Denoising Diffusion Power System Measurement Recovery under Cyber-Physical Uncertainties

Jianhua Pei, Jingyu Wang, Dongyuan Shi et al.

Power system cyber-physical uncertainties, including measurement ambiguities stemming from cyber attacks and data losses, along with system uncertainties introduced by massive renewables and complex dynamics, reduce the likelihood of enhancing the quality of measurements. Fortunately, denoising diffusion models exhibit powerful learning and generation abilities for the complex underlying physics of the real world. To this end, this paper proposes an improved detection and imputation based two-stage denoising diffusion model (TSDM) to identify and reconstruct the measurements with various cyber-physical uncertainties. The first stage of the model comprises a classifier-guided conditional anomaly detection component, while the second stage involves a diffusion-based measurement imputation component. Moreover, the proposed TSDM adopts optimal variance to accelerate the diffusion generation process with subsequence sampling. Extensive numerical case studies demonstrate that the proposed TSDM can accurately recover power system measurements despite renewables-induced strong randomness and highly nonlinear dynamics. Additionally, the proposed TSDM has stronger robustness compared to existing reconstruction networks and exhibits lower computational complexity than general denoising diffusion models.

CVApr 18, 2024
MLS-Track: Multilevel Semantic Interaction in RMOT

Zeliang Ma, Song Yang, Zhe Cui et al.

The new trend in the multi-object tracking task is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer through the introduction of a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art performance. Code and datasets will be available.

CVFeb 13, 2024
An Order-Complexity Aesthetic Assessment Model for Aesthetic-aware Music Recommendation

Xin Jin, Wu Zhou, Jingyu Wang et al.

Computational aesthetic evaluation has made remarkable contributions to visual art works, but its application to music is still rare. Currently, subjective evaluation is still the most effective form of evaluating artistic works. However, subjective evaluation of artistic works consumes a lot of human and material resources. The popular AI-generated content (AIGC) tasks have flooded all industries, and music is no exception. Compared to music produced by humans, however, AI-generated music still sounds mechanical, monotonous, and lacks aesthetic appeal. Due to the lack of music datasets with rating annotations, we have to choose traditional aesthetic equations to objectively measure the beauty of music. In order to improve the quality of AI music generation and further guide computer music production, synthesis, recommendation and other tasks, we use Birkhoff's aesthetic measure to design an aesthetic model that objectively measures the aesthetic beauty of music, and form a recommendation list according to the aesthetic feeling of music. Experiments show that our objective aesthetic model and recommendation method are effective.
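For orientation, Birkhoff's aesthetic measure is usually stated as the ratio of order to complexity. Here M is the aesthetic measure, O the order, and C the complexity; how O and C are computed for music is not given in this abstract, so the formula below is only the classical general form.

```latex
M = \frac{O}{C}
```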

LGOct 11, 2024
Interdependency Matters: Graph Alignment for Multivariate Time Series Anomaly Detection

Yuanyi Wang, Haifeng Sun, Chengsen Wang et al.

Anomaly detection in multivariate time series (MTS) is crucial for various applications in data mining and industry. Current industrial methods typically approach anomaly detection as an unsupervised learning task, aiming to identify deviations by estimating the normal distribution in noisy, label-free datasets. These methods increasingly incorporate interdependencies between channels through graph structures to enhance accuracy. However, the role of interdependencies is more critical than previously understood, as shifts in interdependencies between MTS channels from normal to anomalous data are significant. This observation suggests that anomalies could be detected by changes in these interdependency graph series. To capitalize on this insight, we introduce MADGA (MTS Anomaly Detection via Graph Alignment), which redefines anomaly detection as a graph alignment (GA) problem that explicitly utilizes interdependencies for anomaly detection. MADGA dynamically transforms subsequences into graphs to capture the evolving interdependencies, and graph alignment is performed between these graphs, optimizing an alignment plan that minimizes cost, effectively minimizing the distance for normal data and maximizing it for anomalous data. Uniquely, our GA approach involves explicit alignment of both nodes and edges, employing Wasserstein distance for nodes and Gromov-Wasserstein distance for edges. To our knowledge, this is the first application of GA to MTS anomaly detection that explicitly leverages interdependency for this purpose. Extensive experiments on diverse real-world datasets validate the effectiveness of MADGA, demonstrating its capability to detect anomalies and differentiate interdependencies, consistently achieving state-of-the-art performance across various scenarios.

CVMar 31, 2024
Object-level Copy-Move Forgery Image Detection based on Inconsistency Mining

Jingyu Wang, Niantai Jing, Ziyao Liu et al.

In copy-move tampering operations, perpetrators often employ techniques, such as blurring, to conceal tampering traces, posing significant challenges to the detection of object-level targets with intact structures. Focusing on these challenges, this paper proposes an Object-level Copy-Move Forgery Image Detection based on Inconsistency Mining (IMNet). To obtain complete object-level targets, we customize prototypes for both the source and tampered regions and dynamically update them. Additionally, we extract inconsistent regions between coarse similar regions obtained through self-correlation calculations and regions composed of prototypes. The detected inconsistent regions are used as supplements to coarse similar regions to refine pixel-level detection. We conduct experiments on three public datasets, which validate the effectiveness and robustness of the proposed IMNet.

IRJan 23, 2024
Gradient Flow of Energy: A General and Efficient Approach for Entity Alignment Decoding

Yuanyi Wang, Haifeng Sun, Jingyu Wang et al.

Entity alignment (EA), a pivotal process in integrating multi-source Knowledge Graphs (KGs), seeks to identify equivalent entity pairs across these graphs. Most existing approaches regard EA as a graph representation learning task, concentrating on enhancing graph encoders. However, the decoding process in EA - essential for effective operation and alignment accuracy - has received limited attention and remains tailored to specific datasets and model architectures, necessitating both entity and additional explicit relation embeddings. This specificity limits its applicability, particularly in GNN-based models. To address this gap, we introduce a novel, generalized, and efficient decoding approach for EA, relying solely on entity embeddings. Our method optimizes the decoding process by minimizing Dirichlet energy, leading to the gradient flow within the graph, to maximize graph homophily. The discretization of the gradient flow produces a fast and scalable approach, termed Triple Feature Propagation (TFP). TFP innovatively generalizes adjacency matrices to multi-view matrices: entity-to-entity, entity-to-relation, relation-to-entity, and relation-to-triple. The gradient flow through generalized matrices enables TFP to harness the multi-view structural information of KGs. Rigorous experimentation on diverse public datasets demonstrates that our approach significantly enhances various EA methods. Notably, the approach achieves these advancements with less than 6 seconds of additional computational time, establishing a new benchmark in efficiency and adaptability for future EA methods.
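The idea of minimizing Dirichlet energy by a discretized gradient flow can be sketched on a toy graph. This is a generic illustration only, not TFP itself: the three-node path graph, the step size `tau`, and the function names are assumptions, and only a single entity-to-entity adjacency matrix is used.

```python
import numpy as np

def dirichlet_energy(X, L):
    """Dirichlet energy tr(X^T L X): sum of squared feature differences over edges."""
    return float(np.trace(X.T @ L @ X))

def gradient_flow_step(X, L, tau=0.1):
    """One explicit Euler step of the flow dX/dt = -L X (hypothetical step size tau)."""
    return X - tau * (L @ X)

# Toy 3-node path graph (an assumption for illustration).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A            # combinatorial graph Laplacian
X0 = np.array([[1.0], [0.0], [-1.0]])     # one feature per node
X1 = gradient_flow_step(X0, L)

energy_before = dirichlet_energy(X0, L)
energy_after = dirichlet_energy(X1, L)
```

Each step pulls neighboring node features closer together, which lowers the energy; repeating the step is a simple form of feature propagation.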

CVDec 21, 2023
A Comprehensive End-to-End Computer Vision Framework for Restoration and Recognition of Low-Quality Engineering Drawings

Lvyang Yang, Jiankang Zhang, Huaiqiang Li et al.

The digitization of engineering drawings is crucial for efficient reuse, distribution, and archiving. Existing computer vision approaches for digitizing engineering drawings typically assume the input drawings have high quality. However, in reality, engineering drawings are often blurred and distorted due to improper scanning, storage, and transmission, which may jeopardize the effectiveness of existing approaches. This paper focuses on restoring and recognizing low-quality engineering drawings, where an end-to-end framework is proposed to improve the quality of the drawings and identify the graphical symbols on them. The framework uses K-means clustering to classify different engineering drawing patches into simple and complex texture patches based on their gray level co-occurrence matrix statistics. Computer vision operations and a modified Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) model are then used to improve the quality of the two types of patches, respectively. A modified Faster Region-based Convolutional Neural Network (Faster R-CNN) model is used to recognize the quality-enhanced graphical symbols. Additionally, a multi-stage task-driven collaborative learning strategy is proposed to train the modified ESRGAN and Faster R-CNN models to improve the resolution of engineering drawings in the direction that facilitates graphical symbol recognition, rather than human visual perception. A synthetic data generation method is also proposed to construct quality-degraded samples for training the framework. Experiments on real-world electrical diagrams show that the proposed framework achieves an accuracy of 98.98% and a recall of 99.33%, demonstrating its superiority over previous approaches. Moreover, the framework is integrated into a widely-used power system software application to showcase its practicality.

LGMar 7, 2025
MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration

Jinguang Wang, Jingyu Wang, Haifeng Sun et al.

Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on exploring the per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation inference of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerably expensive. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between the different channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to the subsequent modules to balance the parameter distribution under QSM. Within the static quantization setting of W4A4, MergeQuant reduces the accuracy gap on zero-shot tasks compared to FP16 baseline to 1.3 points on Llama-2-70B model. On Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding, and up to 2.06x speedup in end-to-end compared to FP16 baseline.

LGMay 15, 2025
ChronoSteer: Bridging Large Language Model and Time Series Foundation Model via Synthetic Data

Chengsen Wang, Qi Qi, Zhongwen Rao et al.

Conventional forecasting methods rely on unimodal time series data, limiting their ability to exploit rich textual information. Recently, large language models (LLMs) and time series foundation models (TSFMs) have demonstrated powerful capability in textual reasoning and temporal modeling, respectively. Integrating the strengths of both to construct a multimodal model that concurrently leverages both temporal and textual information for future inference has emerged as a critical research challenge. To address the scarcity of event-series paired data, we propose a decoupled framework: an LLM is employed to transform textual events into revision instructions, which are then used to steer the output of TSFM. To implement this framework, we introduce ChronoSteer, a multimodal TSFM that can be steered through textual revision instructions, effectively bridging LLM and TSFM. Moreover, to mitigate the shortage of cross-modal instruction-series paired data, we devise a two-stage training strategy based on synthetic data. In addition, we also construct a high-quality multimodal time series forecasting benchmark to address the information leakage concerns during evaluation. After integrating with an LLM, ChronoSteer, which is trained exclusively on synthetic data, achieves a 25.7% improvement in prediction accuracy compared to the unimodal backbone and a 22.5% gain over the previous state-of-the-art multimodal method.

NIOct 18, 2024
DRL Optimization Trajectory Generation via Wireless Network Intent-Guided Diffusion Models for Optimizing Resource Allocation

Junjie Wu, Xuming Fang, Dusit Niyato et al.

With the rapid advancements in wireless communication fields, including low-altitude economies, 6G, and Wi-Fi, the scale of wireless networks continues to expand, accompanied by increasing service quality demands. Traditional deep reinforcement learning (DRL)-based optimization models can improve network performance by solving non-convex optimization problems intelligently. However, they heavily rely on online deployment and often require extensive initial training. Online DRL optimization models typically make accurate decisions based on current channel state distributions. When these distributions change, their generalization capability diminishes, which hinders the responsiveness essential for real-time and high-reliability wireless communication networks. Furthermore, different users have varying quality of service (QoS) requirements across diverse scenarios, and conventional online DRL methods struggle to accommodate this variability. Consequently, exploring flexible and customized AI strategies is critical. We propose a wireless network intent (WNI)-guided trajectory generation model based on a generative diffusion model (GDM). This model can be generated and fine-tuned in real time to achieve the objective and meet the constraints of target intent networks, significantly reducing state information exposure during wireless communication. Moreover, the WNI-guided optimization trajectory generation can be customized to address differentiated QoS requirements, enhancing the overall quality of communication in future intelligent networks. Extensive simulation results demonstrate that our approach achieves greater stability in spectral efficiency variations and outperforms traditional DRL optimization models in dynamic communication systems.

IRFeb 5, 2024
Understanding and Guiding Weakly Supervised Entity Alignment with Potential Isomorphism Propagation

Yuanyi Wang, Wei Tang, Haifeng Sun et al.

Weakly Supervised Entity Alignment (EA) is the task of identifying equivalent entities across diverse knowledge graphs (KGs) using only a limited number of seed alignments. Despite substantial advances in aggregation-based weakly supervised EA, the underlying mechanisms in this setting remain unexplored. In this paper, we present a propagation perspective to analyze weakly supervised EA and explain the existing aggregation-based EA models. Our theoretical analysis reveals that these models essentially seek propagation operators for pairwise entity similarities. We further prove that, despite the structural heterogeneity of different KGs, the potentially aligned entities within aggregation-based EA models have isomorphic subgraphs, which is the core premise of EA but has not been investigated. Leveraging this insight, we introduce a potential isomorphism propagation operator to enhance the propagation of neighborhood information across KGs. We develop a general EA framework, PipEA, incorporating this operator to improve the accuracy of every type of aggregation-based model without altering the learning process. Extensive experiments substantiate our theoretical findings and demonstrate PipEA's significant performance gains over state-of-the-art weakly supervised EA methods. Our work not only advances the field but also enhances our comprehension of aggregation-based weakly supervised EA.

GRSep 12, 2025
Chord: Chain of Rendering Decomposition for PBR Material Estimation from Generated Texture Images

Zhi Ying, Boxiang Rong, Jingyu Wang et al.

Material creation and reconstruction are crucial for appearance modeling but traditionally require significant time and expertise from artists. While recent methods leverage visual foundation models to synthesize PBR materials from user-provided inputs, they often fall short in quality, flexibility, and user control. We propose a novel two-stage generate-and-estimate framework for PBR material generation. In the generation stage, a fine-tuned diffusion model synthesizes shaded, tileable texture images aligned with user input. In the estimation stage, we introduce a chained decomposition scheme that sequentially predicts SVBRDF channels by passing previously extracted representation as input into a single-step image-conditional diffusion model. Our method is efficient, high quality, and enables flexible user control. We evaluate our approach against existing material generation and estimation methods, demonstrating superior performance. Our material estimation method shows strong robustness on both generated textures and in-the-wild photographs. Furthermore, we highlight the flexibility of our framework across diverse applications, including text-to-material, image-to-material, structure-guided generation, and material editing.

SPSep 3, 2025
Artificial Intelligence-derived Cardiotocography Age as a Digital Biomarker for Predicting Future Adverse Pregnancy Outcomes

Jinshuai Gu, Zenghui Lin, Jingying Ma et al.

Cardiotocography (CTG) is a low-cost, non-invasive fetal health assessment technique used globally, especially in underdeveloped countries. However, it is currently mainly used to identify the fetus's current status (e.g., fetal acidosis or hypoxia), and the potential of CTG in predicting future adverse pregnancy outcomes has not been fully explored. We aim to develop an AI-based model that predicts biological age from CTG time series (named CTGage), then calculate the age gap between CTGage and actual age (named CTGage-gap), and use this gap as a new digital biomarker for future adverse pregnancy outcomes. The CTGage model is developed using 61,140 records from 11,385 pregnant women, collected at Peking University People's Hospital between 2018 and 2022. For model training, a structurally designed 1D convolutional neural network is used, incorporating distribution-aligned augmented regression technology. The CTGage-gap is categorized into five groups: < -21 days (underestimation group), -21 to -7 days, -7 to 7 days (normal group), 7 to 21 days, and > 21 days (overestimation group). We further define the underestimation group and overestimation group together as the high-risk group. We then compare the incidence of adverse outcomes and maternal diseases across these groups. The average absolute error of the CTGage model is 10.91 days. When comparing the overestimation group with the normal group, the incidence of premature infants is 5.33% vs. 1.42% (p < 0.05) and gestational diabetes mellitus (GDM) incidence is 31.93% vs. 20.86% (p < 0.05). When comparing the underestimation group with the normal group, low birth weight incidence is 0.17% vs. 0.15% (p < 0.05) and anaemia incidence is 37.51% vs. 34.74% (p < 0.05). Artificial intelligence-derived CTGage can predict the future risk of adverse pregnancy outcomes and holds potential as a novel, non-invasive, and easily accessible digital biomarker.
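The five-way grouping of the CTGage-gap can be written as a small lookup. The abstract does not say which group the exact boundary values (-21, -7, 7, 21 days) belong to, so the inclusive/exclusive choices below are assumptions, as are the function names.

```python
def ctgage_gap_group(gap_days):
    """Map a CTGage-gap in days to one of the five groups named in the abstract.

    Boundary handling (e.g. exactly -7 or 7 days) is an assumption.
    """
    if gap_days < -21:
        return "underestimation"
    if gap_days < -7:
        return "-21 to -7 days"
    if gap_days <= 7:
        return "normal"
    if gap_days <= 21:
        return "7 to 21 days"
    return "overestimation"

def is_high_risk(gap_days):
    """High-risk group = underestimation group plus overestimation group."""
    return ctgage_gap_group(gap_days) in {"underestimation", "overestimation"}
```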

LGJun 24, 2025
Efficient Extreme Operating Condition Search for Online Relay Setting Calculation in Renewable Power Systems Based on Parallel Graph Neural Network

Yan Li, Zengli Yang, Youhuai Wang et al.

The Extreme Operating Conditions Search (EOCS) problem is one of the key problems in relay setting calculation, which is used to ensure that the setting values of protection relays can adapt to the changing operating conditions of power systems over a period of time after deployment. The high penetration of renewable energy and the wide application of inverter-based resources make the operating conditions of renewable power systems more volatile, which urges the adoption of the online relay setting calculation strategy. However, the computation speed of existing EOCS methods based on local enumeration, heuristic algorithms, and mathematical programming cannot meet the efficiency requirement of online relay setting calculation. To reduce the time overhead, this paper, for the first time, proposes an efficient deep learning-based EOCS method suitable for online relay setting calculation. First, the power system information is formulated as four layers, i.e., a component parameter layer, a topological connection layer, an electrical distance layer, and a graph distance layer, which are fed into a parallel graph neural network (PGNN) model for feature extraction. Then, the four feature layers corresponding to each node are spliced and stretched, and then fed into the decision network to predict the extreme operating condition of the system. Finally, the proposed PGNN method is validated on the modified IEEE 39-bus and 118-bus test systems, where some of the synchronous generators are replaced by renewable generation units. The nonlinear fault characteristics of renewables are fully considered when computing fault currents. The experiment results show that the proposed PGNN method achieves higher accuracy than the existing methods in solving the EOCS problem. Meanwhile, it also provides greater improvements in online computation time.

LGMay 1, 2025
Unlocking the Potential of Linear Networks for Irregular Multivariate Time Series Forecasting

Chengsen Wang, Qi Qi, Jingyu Wang et al.

Time series forecasting holds significant importance across various industries, including finance, transportation, energy, healthcare, and climate. Despite the widespread use of linear networks due to their low computational cost and effectiveness in modeling temporal dependencies, most existing research has concentrated on regularly sampled and fully observed multivariate time series. However, in practice, we frequently encounter irregular multivariate time series characterized by variable sampling intervals and missing values. The inherent intra-series inconsistency and inter-series asynchrony in such data hinder effective modeling and forecasting with traditional linear networks relying on static weights. To tackle these challenges, this paper introduces a novel model named AiT. AiT utilizes an adaptive linear network capable of dynamically adjusting weights according to observation time points to address intra-series inconsistency, thereby enhancing the accuracy of temporal dependencies modeling. Furthermore, by incorporating the Transformer module on variable semantics embeddings, AiT efficiently captures variable correlations, avoiding the challenge of inter-series asynchrony. Comprehensive experiments across four benchmark datasets demonstrate the superiority of AiT, improving prediction accuracy by 11% and decreasing runtime by 52% compared to existing state-of-the-art methods.

LGMar 3, 2025
OIPR: Evaluation for Time-series Anomaly Detection Inspired by Operator Interest

Yuhan Jing, Jingyu Wang, Lei Zhang et al.

With the growing adoption of time-series anomaly detection (TAD) technology, numerous studies have employed deep learning-based detectors for analyzing time-series data in the fields of Internet services, industrial systems, and sensors. The selection and optimization of anomaly detectors strongly rely on the availability of an effective performance evaluation method for TAD. Since anomalies in time-series data often manifest as a sequence of points, conventional metrics that solely consider the detection of individual points are inadequate. Existing evaluation methods for TAD typically employ point-based or event-based metrics to capture the temporal context. However, point-based metrics tend to overestimate detectors that excel only in detecting long anomalies, while event-based metrics are susceptible to being misled by fragmented detection results. To address these limitations, we propose OIPR, a novel set of TAD evaluation metrics. It models the process of operators receiving detector alarms and handling faults, utilizing the area under the operator interest curve to evaluate the performance of TAD algorithms. Furthermore, we build a special scenario dataset to compare the characteristics of different evaluation methods. Through experiments conducted on the special scenario dataset and five real-world datasets, we demonstrate the remarkable performance of OIPR in extreme and complex scenarios. It achieves a balance between point and event perspectives, overcoming their primary limitations and offering applicability to broader situations.
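The gap between point-based and event-based views can be seen on a toy example. The sketch below uses simple recall variants, not the OIPR metrics (which the abstract does not define); the label sequences and helper names are hypothetical.

```python
def segments(labels):
    """Return (start, end) pairs, end exclusive, for each run of 1s."""
    segs, start = [], None
    for i, v in enumerate(labels):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(labels)))
    return segs

def point_recall(truth, pred):
    """Fraction of anomalous points that were flagged."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    total = sum(truth)
    return tp / total if total else 0.0

def event_recall(truth, pred):
    """Fraction of anomalous events with at least one flagged point."""
    events = segments(truth)
    hit = sum(1 for s, e in events if any(pred[s:e]))
    return hit / len(events) if events else 0.0

# One long anomaly (8 points) and one short anomaly (1 point).
truth = [1] * 8 + [0] * 2 + [1] + [0]
# A detector that only catches the long anomaly.
pred = [1] * 8 + [0] * 4

p_rec = point_recall(truth, pred)
e_rec = event_recall(truth, pred)
```

The detector misses half of the events yet still scores a high point-based recall, which is the kind of overestimation for long anomalies that the abstract describes.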

NIFeb 18, 2025
NTP-INT: Network Traffic Prediction-Driven In-band Network Telemetry for High-load Switches

Penghui Zhang, Hua Zhang, Yuqi Dai et al.

In-band network telemetry (INT) is essential to network management due to its real-time visibility. However, because of the rapid increase in network devices and services, it has become crucial to have targeted access to detailed network information in a dynamic network environment. This paper proposes an intelligent network telemetry system called NTP-INT to obtain more fine-grained network information on high-load switches. Specifically, NTP-INT consists of three modules: a network traffic prediction module, a network pruning module, and a probe path planning module. Firstly, the network traffic prediction module adopts a Multi-Temporal Graph Neural Network (MTGNN) to predict future network traffic and identify high-load switches. Then, we design the network pruning algorithm to generate a subnetwork covering all high-load switches to reduce the complexity of probe path planning. Finally, the probe path planning module uses an attention-mechanism-based deep reinforcement learning (DRL) model to plan efficient probe paths in the network slice. The experimental results demonstrate that NTP-INT can acquire more precise network information on high-load switches while decreasing the control overhead by 50%.

LGJan 16, 2025
Fast Searching of Extreme Operating Conditions for Relay Protection Setting Calculation Based on Graph Neural Network and Reinforcement Learning

Yan Li, Jingyu Wang, Jiankang Zhang et al.

Searching for the Extreme Operating Conditions (EOCs) is one of the core problems of power system relay protection setting calculation. The current methods based on brute-force search, heuristic algorithms, and mathematical programming can hardly meet the requirements of today's power systems in terms of computation speed due to the drastic changes in operating conditions induced by renewables and power electronics. This paper proposes an EOC fast search method, named Graph Dueling Double Deep Q Network (Graph D3QN), which combines graph neural network and deep reinforcement learning to address this challenge. First, the EOC search problem is modeled as a Markov decision process, where the information of the underlying power system is extracted using graph neural networks, so that the EOC of the system can be found via deep reinforcement learning. Then, a two-stage Guided Learning and Free Exploration (GLFE) training framework is constructed to accelerate the convergence speed of reinforcement learning. Finally, the proposed Graph D3QN method is validated through case studies of searching maximum fault current for relay protection setting calculation on the IEEE 39-bus and 118-bus systems. The experimental results demonstrate that Graph D3QN can reduce the computation time by 10 to 1000 times while guaranteeing the accuracy of the selected EOCs.
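For readers unfamiliar with the "dueling" part of the name, the standard dueling aggregation combines a state value with mean-centered action advantages. Whether Graph D3QN uses exactly this form is an assumption; the numbers and the function name below are purely illustrative.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    a = np.asarray(advantages, dtype=float)
    return state_value + a - a.mean()

# Hypothetical value and advantages for three candidate actions.
q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
best_action = int(np.argmax(q))
```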

CLJun 27, 2024
OutlierTune: Efficient Channel-Wise Quantization for Large Language Models

Jinguang Wang, Yuexi Yin, Haifeng Sun et al.

Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on the per-token or per-tensor quantization of activations, making it difficult to achieve both accuracy and hardware efficiency. To address this problem, we propose OutlierTune, an efficient per-channel post-training quantization (PTQ) method for the activations of LLMs. OutlierTune consists of two components: pre-execution of dequantization and symmetrization. The pre-execution of dequantization updates the model weights by the activation scaling factors, avoiding the internal scaling and costly additional computational overheads brought by the per-channel activation quantization. The symmetrization further reduces the quantization differences arising from the weight updates by ensuring the balanced numerical ranges across different activation channels. OutlierTune is easy to implement and hardware-efficient, introducing almost no additional computational overheads during the inference. Extensive experiments show that the proposed framework outperforms existing methods across multiple different tasks. Demonstrating better generalization, this framework improves the Int6 quantization of the instruction-tuning LLMs, such as OPT-IML, to the same level as half-precision (FP16). Moreover, we have shown that the proposed framework is 1.48x faster than the FP16 implementation while reducing approximately 2x memory usage.
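The idea of moving per-channel activation scales into the weights ahead of time can be illustrated with a small NumPy sketch. This is a simplified reading of "pre-execution of dequantization", not the OutlierTune implementation: the int8 range, the random data, and all variable names are assumptions, and symmetrization is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Activations with two outlier channels (indices 2 and 6), and a weight matrix.
X = rng.normal(size=(4, 8)) * np.array([1, 1, 20, 1, 1, 1, 5, 1])
W = rng.normal(size=(8, 3))

# Per-channel symmetric quantization of activations (assumed int8 range).
scale = np.abs(X).max(axis=0) / 127.0
Xq = np.clip(np.round(X / scale), -127, 127)

# Fold the activation scales into the weight rows once, offline.
W_folded = scale[:, None] * W

# At inference time no per-channel rescaling of activations is needed.
Y_folded = Xq @ W_folded
Y_explicit = (Xq * scale) @ W   # dequantize first, then multiply
```

Both paths give the same result because scaling channel j of the activations is equivalent to scaling row j of the weights, which is why the rescaling can be done before inference.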

LGJun 15, 2024
Finite-difference-informed graph network for solving steady-state incompressible flows on block-structured grids

Yiye Zou, Tianyu Li, Lin Lu et al.

Advances in deep learning have enabled physics-informed neural networks to solve partial differential equations. Numerical differentiation using the finite-difference (FD) method is efficient in physics-constrained designs, even in parameterized settings. In traditional computational fluid dynamics (CFD), body-fitted block-structured grids are often employed for complex flow cases when obtaining FD solutions. However, convolution operators in convolutional neural networks for FD are typically limited to single-block grids. To address this issue, graphs and graph networks are used to learn flow representations across multi-block-structured grids. A graph convolution-based FD method (GC-FDM) is proposed to train graph networks in a label-free physics-constrained manner, enabling differentiable FD operations on unstructured graph outputs. To demonstrate model performance from single- to multi-block-structured grids, the parameterized steady incompressible Navier-Stokes equations are solved for a lid-driven cavity flow and the flows around single and double circular cylinder configurations. When compared to a CFD solver under various boundary conditions, the proposed method achieves a relative error in velocity field predictions on the order of $10^{-3}$. Furthermore, the proposed method reduces training costs by approximately 20% compared to a physics-informed neural network. To further verify the effectiveness of GC-FDM in multi-block processing, a 30P30N airfoil geometry is considered and the predicted results are reasonable compared with those given by CFD. Finally, the applicability of GC-FDM to a three-dimensional (3D) case is tested using a 3D cavity geometry.

LGFeb 1, 2022
Approximation of Images via Generalized Higher Order Singular Value Decomposition over Finite-dimensional Commutative Semisimple Algebra

Liang Liao, Sen Lin, Lun Li et al.

Low-rank approximation of images via singular value decomposition is well-received in the era of big data. However, singular value decomposition (SVD) is only for order-two data, i.e., matrices. It is necessary to flatten a higher order input into a matrix or break it into a series of order-two slices to tackle higher order data such as multispectral images and videos with the SVD. Higher order singular value decomposition (HOSVD) extends the SVD and can approximate higher order data using sums of a few rank-one components. We consider the problem of generalizing HOSVD over a finite-dimensional commutative algebra. This algebra, referred to as a t-algebra, generalizes the field of complex numbers. The elements of the algebra, called t-scalars, are fixed-size arrays of complex numbers. One can generalize matrices and tensors over t-scalars and then extend many canonical matrix and tensor algorithms, including HOSVD, to obtain higher-performance versions. The generalization of HOSVD is called THOSVD. Its performance in approximating multi-way data can be further improved by an alternating algorithm. THOSVD also unifies a wide range of principal component analysis algorithms. To exploit the potential of generalized algorithms using t-scalars for approximating images, we use a pixel neighborhood strategy to convert each pixel to a "deeper-order" t-scalar. Experiments on publicly available images show that the generalized algorithm over t-scalars, namely THOSVD, compares favorably with its canonical counterparts.

ARJul 6, 2021
CAP-RAM: A Charge-Domain In-Memory Computing 6T-SRAM for Accurate and Precision-Programmable CNN Inference

Zhiyu Chen, Zhanghao Yu, Qing Jin et al.

A compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference. It leverages a novel charge-domain multiply-and-accumulate (MAC) mechanism and circuitry to achieve superior linearity under process variations compared to conventional IMC designs. The adopted semi-parallel architecture efficiently stores filters from multiple CNN layers by sharing eight standard 6T SRAM cells with one charge-domain MAC circuit. Moreover, up to six levels of bit-width of weights with two encoding schemes and eight levels of input activations are supported. A 7-bit charge-injection SAR (ciSAR) analog-to-digital converter (ADC) getting rid of sample and hold (S&H) and input/reference buffers further improves the overall energy efficiency and throughput. A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM. A single 512x128 macro stores a complete pruned and quantized CNN model to achieve 98.8% inference accuracy on the MNIST data set and 89.0% on the CIFAR-10 data set, with a 573.4-giga operations per second (GOPS) peak throughput and a 49.4-tera operations per second (TOPS)/W energy efficiency.

CVNov 5, 2020
Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction

Xuanzhao Wang, Zhengping Che, Bo Jiang et al.

Video anomaly detection is commonly used in many applications such as security surveillance and is very challenging. A majority of recent video anomaly detection approaches utilize deep reconstruction models, but their performance is often suboptimal because of insufficient reconstruction error differences between normal and abnormal video frames in practice. Meanwhile, frame prediction-based anomaly detection methods have shown promising performance. In this paper, we propose a novel and robust unsupervised video anomaly detection method based on frame prediction, with a design that is more in line with the characteristics of surveillance videos. The proposed method is equipped with a multi-path ConvGRU-based frame prediction network that can better handle semantically informative objects and areas of different scales and capture spatial-temporal dependencies in normal videos. A noise tolerance loss is introduced during training to mitigate the interference caused by background noise. Extensive experiments have been conducted on the CUHK Avenue, ShanghaiTech Campus, and UCSD Pedestrian datasets, and the results show that our proposed method outperforms existing state-of-the-art approaches. Remarkably, our proposed method obtains a frame-level AUROC score of 88.3% on the CUHK Avenue dataset.

CVJul 19, 2020
AWR: Adaptive Weighting Regression for 3D Hand Pose Estimation

Weiting Huang, Pengfei Ren, Jingyu Wang et al.

In this paper, we propose an adaptive weighting regression (AWR) method to leverage the advantages of both detection-based and regression-based methods. Hand joint coordinates are estimated as discrete integration of all pixels in dense representation, guided by adaptive weight maps. This learnable aggregation process introduces both dense and joint supervision that allows end-to-end training and brings adaptability to weight maps, making the network more accurate and robust. Comprehensive exploration experiments are conducted to validate the effectiveness and generality of AWR under various experimental settings, especially its usefulness for different types of dense representation and input modality. Our method outperforms other state-of-the-art methods on four publicly available datasets, including NYU, ICVL, MSRA and HANDS 2017 dataset.
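The phrase "discrete integration of all pixels ... guided by adaptive weight maps" can be read as a weighted average over per-pixel coordinate estimates. The sketch below illustrates that reading under stated assumptions: the weight map is normalized with a softmax, and each pixel simply proposes its own (x, y) position. The function name `aggregate_joint` and both of these choices are hypothetical and are not taken from the paper.

```python
import math


def aggregate_joint(weight_map, coords):
    """Estimate one joint position as a softmax-weighted sum over all pixels.

    weight_map: H x W list of raw (unnormalized) weights for this joint.
    coords:     H x W list of (x, y) estimates, one per pixel (assumed form
                of the dense representation).
    """
    flat_w = [w for row in weight_map for w in row]
    flat_c = [c for row in coords for c in row]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(flat_w)
    exps = [math.exp(w - m) for w in flat_w]
    total = sum(exps)
    x = sum(e * c[0] for e, c in zip(exps, flat_c)) / total
    y = sum(e * c[1] for e, c in zip(exps, flat_c)) / total
    return x, y


# 2 x 2 grid where each pixel proposes its own (column, row) position.
coords = [[(0.0, 0.0), (1.0, 0.0)],
          [(0.0, 1.0), (1.0, 1.0)]]

# Uniform weights: the estimate is the centre of the grid.
uniform = aggregate_joint([[0.0, 0.0], [0.0, 0.0]], coords)

# A strongly peaked weight map pulls the estimate to the bottom-right pixel.
peaked = aggregate_joint([[0.0, 0.0], [0.0, 50.0]], coords)
```

Because every step is differentiable, a loss on the aggregated coordinate can propagate gradients back into the weight map, which is consistent with the end-to-end training the abstract describes.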

CVMay 28, 2019
OICSR: Out-In-Channel Sparsity Regularization for Compact Deep Neural Networks

Jiashi Li, Qi Qi, Jingyu Wang et al.

Channel pruning can significantly accelerate and compress deep neural networks. Many channel pruning works utilize structured sparsity regularization to zero out all the weights in some channels and automatically obtain a structure-sparse network in the training stage. However, these methods apply structured sparsity regularization on each layer separately, so the correlations between consecutive layers are omitted. In this paper, we first combine one out-channel in the current layer and the corresponding in-channel in the next layer as a regularization group, namely an out-in-channel. Our proposed Out-In-Channel Sparsity Regularization (OICSR) considers correlations between successive layers to further retain the predictive power of the compact network. Training with OICSR thoroughly transfers discriminative features into a fraction of the out-in-channels. Correspondingly, OICSR measures channel importance based on statistics computed from two consecutive layers rather than an individual layer. Finally, a global greedy pruning algorithm is designed to remove redundant out-in-channels in an iterative way. Our method is comprehensively evaluated with various CNN architectures including CifarNet, AlexNet, ResNet, DenseNet and PreActSeNet on the CIFAR-10, CIFAR-100 and ImageNet-1K datasets. Notably, on ImageNet-1K, we reduce FLOPs by 37.2% on ResNet-50 while outperforming the original model by 0.22% in top-1 accuracy.
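The out-in-channel grouping can be sketched for two consecutive fully connected layers. Group i collects the weights that produce output channel i of the current layer together with the weights of the next layer that consume it. The abstract does not state which norm is used, so the group-lasso form below (sum of per-group L2 norms) is an assumption, as are the function names and the toy weights.

```python
import math


def out_in_channel_norms(w_cur, w_next):
    """L2 norm of each out-in-channel group.

    w_cur:  out_channels x in_channels weights of the current layer.
    w_next: next_out x out_channels weights of the next layer.
    Group i = row i of w_cur joined with column i of w_next.
    """
    norms = []
    for i in range(len(w_cur)):
        sq = sum(v * v for v in w_cur[i])          # out-channel i, current layer
        sq += sum(row[i] ** 2 for row in w_next)   # in-channel i, next layer
        norms.append(math.sqrt(sq))
    return norms


def out_in_channel_regularizer(w_cur, w_next):
    """Assumed group-lasso penalty: sum of the per-group L2 norms."""
    return sum(out_in_channel_norms(w_cur, w_next))


# Current layer: 3 output channels, 2 inputs. Next layer: 1 output, 3 inputs.
w_cur = [[3.0, 0.0], [0.0, 0.0], [1.0, 2.0]]
w_next = [[4.0, 0.0, 2.0]]

norms = out_in_channel_norms(w_cur, w_next)
penalty = out_in_channel_regularizer(w_cur, w_next)
```

In this toy case the second group has zero norm, so removing that channel changes neither layer's output; a norm computed over both layers is one possible example of the two-layer importance statistic the abstract mentions.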

CVNov 7, 2016
Chinese/English mixed Character Segmentation as Semantic Segmentation

Huabin Zheng, Jingyu Wang, Zhengjie Huang et al.

OCR character segmentation for multilingual printed documents is difficult due to the diversity of different linguistic characters. Previous approaches mainly focus on monolingual texts and are not suitable for multilingual cases. In this work, we particularly tackle the Chinese/English mixed case by reframing it as a semantic segmentation problem. We take advantage of the successful architecture called fully convolutional networks (FCN) in the field of semantic segmentation. Given a wide enough receptive field, FCN can utilize the necessary context around a horizontal position to determine whether this is a splitting point or not. As a deep neural architecture, FCN can automatically learn useful features from raw text line images. Although trained on synthesized samples with simulated random disturbance, our FCN model generalizes well to real-world samples. The experimental results show that our model significantly outperforms the previous methods.