MMAug 28, 2023Code
UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery LocalizationRui Zhang, Hongxia Wang, Mingshan Du et al.
The emergence of artificial intelligence-generated content (AIGC) has raised concerns about the authenticity of multimedia content in various fields. However, existing research for forgery content detection has focused mainly on binary classification tasks of complete videos, which has limited applicability in industrial settings. To address this gap, we propose UMMAFormer, a novel universal transformer framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation. Our approach introduces a Temporal Feature Abnormal Attention (TFAA) module based on temporal feature reconstruction to enhance the detection of temporal differences. We also design a Parallel Cross-Attention Feature Pyramid Network (PCA-FPN) to optimize the Feature Pyramid Network (FPN) for subtle feature enhancement. To evaluate the proposed method, we contribute a novel Temporal Video Inpainting Localization (TVIL) dataset specifically tailored for video inpainting scenes. Our experiments show that our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd, significantly outperforming previous methods. The code and data are available at https://github.com/ymhzyj/UMMAFormer/.
CVMay 19Code
RE-VLM: Event-Augmented Vision-Language Model for Scene UnderstandingHanqing Liu, Mingjie Liu, Luoping Cui et al.
Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments. Code and datasets are available at https://github.com/bupt-ai-cz/RE-VLM.
AINov 13, 2025
Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance SystemsJiahuan Long, Tingsong Jiang, Hanqing Liu et al.
Adversarial patches have emerged as a popular privacy-preserving approach for resisting AI-driven surveillance systems. However, their conspicuous appearance makes them difficult to deploy in real-world scenarios. In this paper, we propose a thermally activated adversarial wearable designed to ensure adaptability and effectiveness in complex real-world environments. The system integrates thermochromic dyes with flexible heating units to induce visually dynamic adversarial patterns on clothing surfaces. In its default state, the clothing appears as an ordinary black T-shirt. Upon heating via an embedded thermal unit, hidden adversarial patterns on the fabric are activated, allowing the wearer to effectively evade detection across both visible and infrared modalities. Physical experiments demonstrate that the adversarial wearable achieves rapid texture activation within 50 seconds and maintains an adversarial success rate above 80\% across diverse real-world surveillance environments. This work demonstrates a new pathway toward physically grounded, user-controllable anti-AI systems, highlighting the growing importance of proactive adversarial techniques for privacy protection in the age of ubiquitous AI surveillance.
CVMay 18
DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary DetectionDonghong Jiang, Endian Lin, Hanqing Liu et al.
Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the iden- tification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine- grained detection tasks involving attributes like color, ma- terial, and texture. We attribute this performance bottle- neck in OVD models to a core issue: when category sig- nals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect bind- ing between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capa- bilities by strengthening attribute semantics at two criti- cal stages. In the text embedding stage, we employ At- tribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further am- plify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encod- ing phase, selectively enhancing the Key and Value vec- tors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with differ- ent attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary mod- els.
AIOct 21, 2024Code
Boosting Jailbreak Transferability for Large Language ModelsHanqing Liu, Lifeng Zhou, Huanqian Yan
Large language models have drawn significant attention to the challenge of safe alignment, especially regarding jailbreak attacks that circumvent security measures to produce harmful content. To address the limitations of existing methods like GCG, which perform well in single-model attacks but lack transferability, we propose several enhancements, including a scenario induction template, optimized suffix selection, and the integration of re-suffix attack mechanism to reduce inconsistent outputs. Our approach has shown superior performance in extensive experiments across various benchmarks, achieving nearly 100% success rates in both attack execution and transferability. Notably, our method has won the first place in the AISG-hosted Global Challenge for Safe and Secure LLMs. The code is released at https://github.com/HqingLiu/SI-GCG.
ROApr 7
JailWAM: Jailbreaking World Action Models in Robot ControlHanqing Liu, Songping Wang, Jiahuan Long et al.
The World Action Model (WAM) can jointly predict future world states and actions, exhibiting stronger physical manipulation capabilities compared with traditional models. Such powerful physical interaction ability is a double-edged sword: if safety is ignored, it will directly threaten personal safety, property security and environmental safety. However, existing research pays extremely limited attention to the critical security gap: the vulnerability of WAM to jailbreak attacks. To fill this gap, we define the Three-Level Safety Classification Framework to systematically quantify the safety of robotic arm motions. Furthermore, we propose JailWAM, the first dedicated jailbreak attack and evaluation framework for WAM, which consists of three core components: (1) Visual-Trajectory Mapping, which unifies heterogeneous action spaces into visual trajectory representations and enables cross-architectural unified evaluation; (2) Risk Discriminator, which serves as a high-recall screening tool that optimizes the efficiency-accuracy trade-off when identifying destructive behaviors in visual trajectories; (3) Dual-Path Verification Strategy, which first conducts rapid coarse screening via a single-image-based video-action generation module, and then performs efficient and comprehensive verification through full closed-loop physical simulation. In addition, we construct JailWAM-Bench, a benchmark for comprehensively evaluating the safety alignment performance of WAM under jailbreak attacks. Experiments in RoboTwin simulation environment demonstrate that the proposed framework efficiently exposes physical vulnerabilities, achieving an 84.2% attack success rate on the state-of-the-art LingBot-VA. Meanwhile, robust defense mechanisms can be constructed based on JailWAM, providing an effective technical solution for designing safe and reliable robot control systems.
LGFeb 1, 2024
Enriched Physics-informed Neural Networks for Dynamic Poisson-Nernst-Planck SystemsXujia Huang, Fajie Wang, Benrong Zhang et al.
This paper proposes a meshless deep learning algorithm, enriched physics-informed neural networks (EPINNs), to solve dynamic Poisson-Nernst-Planck (PNP) equations with strong coupling and nonlinear characteristics. The EPINNs takes the traditional physics-informed neural networks as the foundation framework, and adds the adaptive loss weight to balance the loss functions, which automatically assigns the weights of losses by updating the parameters in each iteration based on the maximum likelihood estimate. The resampling strategy is employed in the EPINNs to accelerate the convergence of loss function. Meanwhile, the GPU parallel computing technique is adopted to accelerate the solving process. Four examples are provided to demonstrate the validity and effectiveness of the proposed method. Numerical results indicate that the new method has better applicability than traditional numerical methods in solving such coupled nonlinear systems. More importantly, the EPINNs is more accurate, stable, and fast than the traditional physics-informed neural networks. This work provides a simple and high-performance numerical tool for addressing PNPs with arbitrary boundary shapes and boundary conditions.
CVApr 21, 2025
Fast Adversarial Training with Weak-to-Strong Spatial-Temporal Consistency in the Frequency Domain on VideosSongping Wang, Hanqing Liu, Yueming Lyu et al.
Adversarial Training (AT) has been shown to significantly enhance adversarial robustness via a min-max optimization approach. However, its effectiveness in video recognition tasks is hampered by two main challenges. First, fast adversarial training for video models remains largely unexplored, which severely impedes its practical applications. Specifically, most video adversarial training methods are computationally costly, with long training times and high expenses. Second, existing methods struggle with the trade-off between clean accuracy and adversarial robustness. To address these challenges, we introduce Video Fast Adversarial Training with Weak-to-Strong consistency (VFAT-WS), the first fast adversarial training method for video data. Specifically, VFAT-WS incorporates the following key designs: First, it integrates a straightforward yet effective temporal frequency augmentation (TF-AUG), and its spatial-temporal enhanced form STF-AUG, along with a single-step PGD attack to boost training efficiency and robustness. Second, it devises a weak-to-strong spatial-temporal consistency regularization, which seamlessly integrates the simpler TF-AUG and the more complex STF-AUG. Leveraging the consistency regularization, it steers the learning process from simple to complex augmentations. Both of them work together to achieve a better trade-off between clean accuracy and robustness. Extensive experiments on UCF-101 and HMDB-51 with both CNN and Transformer-based models demonstrate that VFAT-WS achieves great improvements in adversarial robustness and corruption robustness, while accelerating training by nearly 490%.
CLApr 7, 2025
scAgent: Universal Single-Cell Annotation via a LLM AgentYuren Mao, Yu Mi, Peigen Liu et al.
Cell type annotation is critical for understanding cellular heterogeneity. Based on single-cell RNA-seq data and deep learning models, good progress has been made in annotating a fixed number of cell types within a specific tissue. However, universal cell annotation, which can generalize across tissues, discover novel cell types, and extend to novel cell types, remains less explored. To fill this gap, this paper proposes scAgent, a universal cell annotation framework based on Large Language Models (LLMs). scAgent can identify cell types and discover novel cell types in diverse tissues; furthermore, it is data efficient to learn novel cell types. Experimental studies in 160 cell types and 35 tissues demonstrate the superior performance of scAgent in general cell-type annotation, novel cell discovery, and extensibility to novel cell type.
CVMar 10, 2025
When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation AttackHanqing Liu, Shouwei Ruan, Yao Huang et al.
Vision-Language Models (VLMs) have achieved remarkable success in various tasks, yet their robustness to real-world illumination variations remains largely unexplored. To bridge this gap, we propose \textbf{I}llumination \textbf{T}ransformation \textbf{A}ttack (\textbf{ITA}), the first framework to systematically assess VLMs' robustness against illumination changes. However, there still exist two key challenges: (1) how to model global illumination with fine-grained control to achieve diverse lighting conditions and (2) how to ensure adversarial effectiveness while maintaining naturalness. To address the first challenge, we innovatively decompose global illumination into multiple parameterized point light sources based on the illumination rendering equation. This design enables us to model more diverse lighting variations that previous methods could not capture. Then, by integrating these parameterized lighting variations with physics-based lighting reconstruction techniques, we could precisely render such light interactions in the original scenes, finally meeting the goal of fine-grained lighting control. For the second challenge, by controlling illumination through the lighting reconstrution model's latent space rather than direct pixel manipulation, we inherently preserve physical lighting priors. Furthermore, to prevent potential reconstruction artifacts, we design additional perceptual constraints for maintaining visual consistency with original images and diversity constraints for avoiding light source convergence. Extensive experiments demonstrate that our ITA could significantly reduce the performance of advanced VLMs, e.g., LLaVA-1.6, while possessing competitive naturalness, exposing VLMS' critical illuminiation vulnerabilities.
CRNov 21, 2024
Global Challenge for Safe and Secure LLMs Track 1Xiaojun Jia, Yihao Huang, Yang Liu et al.
This paper introduces the Global Challenge for Safe and Secure Large Language Models (LLMs), a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) to foster the development of advanced defense mechanisms against automated jailbreaking attacks. With the increasing integration of LLMs in critical sectors such as healthcare, finance, and public administration, ensuring these models are resilient to adversarial attacks is vital for preventing misuse and upholding ethical standards. This competition focused on two distinct tracks designed to evaluate and enhance the robustness of LLM security frameworks. Track 1 tasked participants with developing automated methods to probe LLM vulnerabilities by eliciting undesirable responses, effectively testing the limits of existing safety protocols within LLMs. Participants were challenged to devise techniques that could bypass content safeguards across a diverse array of scenarios, from offensive language to misinformation and illegal activities. Through this process, Track 1 aimed to deepen the understanding of LLM vulnerabilities and provide insights for creating more resilient models.
CVApr 18, 2024
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training ModelsShouwei Ruan, Yinpeng Dong, Hanqing Liu et al.
Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable success in computer vision and particularly demonstrated superior robustness to distribution shifts of 2D images. However, their robustness under 3D viewpoint variations is still limited, which can hinder the development for real-world applications. This paper successfully addresses this concern while keeping VLPs' original performance by breaking through two primary obstacles: 1) the scarcity of training data and 2) the suboptimal fine-tuning paradigms. To combat data scarcity, we build the Multi-View Caption (MVCap) dataset -- a comprehensive collection of over four million multi-view image-text pairs across more than 100K objects, providing more potential for VLP models to develop generalizable viewpoint-invariant representations. To address the limitations of existing paradigms in performance trade-offs and training efficiency, we design a novel fine-tuning framework named Omniview-Tuning (OVT). Specifically, OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy, which effectively aligns representations of identical objects from diverse viewpoints without causing overfitting. Additionally, OVT fine-tunes VLP models in a parameter-efficient manner, leading to minimal computational cost. Extensive experiments on various VLP models with different architectures validate that OVT significantly improves the models' resilience to viewpoint shifts and keeps the original performance, establishing a pioneering standard for boosting the viewpoint invariance of VLP models.
ROSep 23, 2025
Eva-VLA: Evaluating Vision-Language-Action Models' Robustness Under Real-World Physical VariationsHanqing Liu, Jiahuan Long, Junqi Wu et al.
Vision-Language-Action (VLA) models have emerged as promising solutions for robotic manipulation, yet their robustness to real-world physical variations remains critically underexplored. To bridge this gap, we propose Eva-VLA, the first unified framework that systematically evaluates the robustness of VLA models by transforming discrete physical variations into continuous optimization problems. However, comprehensively assessing VLA robustness presents two key challenges: (1) how to systematically characterize diverse physical variations encountered in real-world deployments while maintaining evaluation reproducibility, and (2) how to discover worst-case scenarios without prohibitive real-world data collection costs efficiently. To address the first challenge, we decompose real-world variations into three critical domains: object 3D transformations that affect spatial reasoning, illumination variations that challenge visual perception, and adversarial patches that disrupt scene understanding. For the second challenge, we introduce a continuous black-box optimization framework that transforms discrete physical variations into parameter optimization, enabling systematic exploration of worst-case scenarios. Extensive experiments on state-of-the-art OpenVLA models across multiple benchmarks reveal alarming vulnerabilities: all variation types trigger failure rates exceeding 60%, with object transformations causing up to 97.8% failure in long-horizon tasks. Our findings expose critical gaps between controlled laboratory success and unpredictable deployment readiness, while the Eva-VLA framework provides a practical pathway for hardening VLA-based robotic manipulation models against real-world deployment challenges.
CVDec 4, 2024
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?Shouwei Ruan, Hanqing Liu, Yao Huang et al.
Vision Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored. To systematically evaluate VLMs' robustness to real-world 3D variations, we propose AdvDreamer, the first framework capable of generating physically reproducible Adversarial 3D Transformation (Adv-3DT) samples from single-view observations. In AdvDreamer, we integrate three key innovations: Firstly, to characterize real-world 3D variations with limited prior knowledge precisely, we design a zero-shot Monocular Pose Manipulation pipeline built upon generative 3D priors. Secondly, to ensure the visual quality of worst-case Adv-3DT samples, we propose a Naturalness Reward Model that provides continuous naturalness regularization during adversarial optimization, effectively preventing convergence to hallucinated or unnatural elements. Thirdly, to enable systematic evaluation across diverse VLM architectures and visual-language tasks, we introduce the Inverse Semantic Probability loss as the adversarial optimization objective, which solely operates in the fundamental visual-textual alignment space. Based on the captured Adv-3DT samples with high aggressiveness and transferability, we establish MM3DTBench, the first VQA benchmark dataset tailored to evaluate VLM robustness under challenging 3D variations. Extensive evaluations of representative VLMs with varying architectures reveal that real-world 3D variations can pose severe threats to model performance across various tasks.
CVNov 11, 2025
PEOD: A Pixel-Aligned Event-RGB Benchmark for Object Detection under Challenging ConditionsLuoping Cui, Hanqing Liu, Mingjie Liu et al.
Robust object detection for challenging scenarios increasingly relies on event cameras, yet existing Event-RGB datasets remain constrained by sparse coverage of extreme conditions and low spatial resolution (<= 640 x 480), which prevents comprehensive evaluation of detectors under challenging scenarios. To address these limitations, we propose PEOD, the first large-scale, pixel-aligned and high-resolution (1280 x 720) Event-RGB dataset for object detection under challenge conditions. PEOD contains 130+ spatiotemporal-aligned sequences and 340k manual bounding boxes, with 57% of data captured under low-light, overexposure, and high-speed motion. Furthermore, we benchmark 14 methods across three input configurations (Event-based, RGB-based, and Event-RGB fusion) on PEOD. On the full test set and normal subset, fusion-based models achieve the excellent performance. However, in illumination challenge subset, the top event-based model outperforms all fusion models, while fusion models still outperform their RGB-based counterparts, indicating limits of existing fusion methods when the frame modality is severely degraded. PEOD establishes a realistic, high-quality benchmark for multimodal perception and facilitates future research.
CVAug 4, 2025
Beyond RGB and Events: Enhancing Object Detection under Adverse Lighting with Monocular Normal MapsMingjie Liu, Hanqing Liu, Chuang Zhu
Accurate object detection under adverse lighting conditions is critical for real-world applications such as autonomous driving. Although neuromorphic event cameras have been introduced to handle these scenarios, adverse lighting often induces distracting reflections from tunnel walls or road surfaces, which frequently lead to false obstacle detections. However, neither RGB nor event data alone is robust enough to address these complexities, and mitigating these issues without additional sensors remains underexplored. To overcome these challenges, we propose leveraging normal maps, directly predicted from monocular RGB images, as robust geometric cues to suppress false positives and enhance detection accuracy. We introduce NRE-Net, a novel multi-modal detection framework that effectively fuses three complementary modalities: monocularly predicted surface normal maps, RGB images, and event streams. To optimize the fusion process, our framework incorporates two key modules: the Adaptive Dual-stream Fusion Module (ADFM), which integrates RGB and normal map features, and the Event-modality Aware Fusion Module (EAFM), which adapts to the high dynamic range characteristics of event data. Extensive evaluations on the DSEC-Det-sub and PKU-DAVIS-SOD datasets demonstrate that NRE-Net significantly outperforms state-of-the-art methods. Our approach achieves mAP50 improvements of 7.9% and 6.1% over frame-based approaches (e.g., YOLOX), while surpassing the fusion-based SFNet by 2.7% on the DSEC-Det-sub dataset and SODFormer by 7.1% on the PKU-DAVIS-SOD dataset.
CVApr 11, 2025
Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation ModelsJiahuan Long, Tingsong Jiang, Wen Yao et al.
Vision foundation models (VFMs) have demonstrated remarkable capabilities in learning universal visual representations. However, adapting these models to downstream tasks conventionally requires parameter updates, with even parameter-efficient fine-tuning methods necessitating the modification of thousands to millions of weights. In this paper, we investigate the redundancies in the segment anything model (SAM) and then propose a novel parameter-free fine-tuning method. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes selecting, reusing, and enhancing pre-trained features, offering a new perspective on fine-tuning foundation models. Specifically, we introduce a channel selection algorithm based on the model's output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse more task-irrelevant features to downstream tasks, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method in different vision tasks (e.g., image segmentation, depth estimation and image classification). Notably, our approach can seamlessly integrate with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces GPU memory overhead.
NEMar 3, 2021
Surrogate-assisted cooperative signal optimization for large-scale traffic networksYongsheng Liang, Zhigang Ren, Lin Wang et al.
Reasonable setting of traffic signals can be very helpful in alleviating congestion in urban traffic networks. Meta-heuristic optimization algorithms have proved themselves to be able to find high-quality signal timing plans. However, they generally suffer from performance deterioration when solving large-scale traffic signal optimization problems due to the huge search space and limited computational budget. Directing against this issue, this study proposes a surrogate-assisted cooperative signal optimization (SCSO) method. Different from existing methods that directly deal with the entire traffic network, SCSO first decomposes it into a set of tractable sub-networks, and then achieves signal setting by cooperatively optimizing these sub-networks with a surrogate-assisted optimizer. The decomposition operation significantly narrows the search space of the whole traffic network, and the surrogate-assisted optimizer greatly lowers the computational burden by reducing the number of expensive traffic simulations. By taking Newman fast algorithm, radial basis function and a modified estimation of distribution algorithm as decomposer, surrogate model and optimizer, respectively, this study develops a concrete SCSO algorithm. To evaluate its effectiveness and efficiency, a large-scale traffic network involving crossroads and T-junctions is generated based on a real traffic network. Comparison with several existing meta-heuristic algorithms specially designed for traffic signal optimization demonstrates the superiority of SCSO in reducing the average delay time of vehicles.
NEJan 19, 2021
A Surrogate-Assisted Variable Grouping Algorithm for General Large Scale Global Optimization ProblemsAn Chen, Zhigang Ren, Muyi Wang et al.
Problem decomposition plays a vital role when applying cooperative coevolution (CC) to large scale global optimization problems. However, most learning-based decomposition algorithms either only apply to additively separable problems or face the issue of false separability detections. Directing against these limitations, this study proposes a novel decomposition algorithm called surrogate-assisted variable grouping (SVG). SVG first designs a general-separability-oriented detection criterion according to whether the optimum of a variable changes with other variables. This criterion is consistent with the separability definition and thus endows SVG with broad applicability and high accuracy. To reduce the fitness evaluation requirement, SVG seeks the optimum of a variable with the help of a surrogate model rather than the original expensive high-dimensional model. Moreover, it converts the variable grouping process into a dynamic-binary-tree search one, which facilitates reutilizing historical separability detection information and thus reducing detection times. To evaluate the performance of SVG, a suite of benchmark functions with up to 2000 dimensions, including additively and non-additively separable ones, were designed. Experimental results on these functions indicate that, compared with six state-of-the-art decomposition algorithms, SVG possesses broader applicability and competitive efficiency. Furthermore, it can significantly enhance the optimization performance of CC.
CRJun 19, 2019
Catfish Effect Between Internal and External Attackers:Being Semi-honest is HelpfulHanqing Liu, Na Ruan, Joseph K. Liu
The consensus protocol named proof of work (PoW) is widely applied by cryptocurrencies like Bitcoin. Although security of a PoW cryptocurrency is always the top priority, it is threatened by mining attacks like selfish mining. Researchers have proposed many mining attack models with one attacker, and optimized the attacker's strategy. During these mining attacks, an attacker pursues a higher relative revenue (RR) by wasting a large amount of computational power of the honest miners at the cost of a small amount of computational power of himself. In this paper, we propose a mining attack model with two phases: the original system and the multi-attacker system. It is the first model to provide both theoretical and quantitative analysis of mining attacks with two attackers. We explain how the original system turns into the multi-attacker system by introducing two attackers: the internal attacker and the external attacker. If both attackers take the attacking strategy selfish mining, the RR of the internal attacker in multi-attacker system will drop by up to 31.9% compared with his RR in original system. The external attacker will overestimate his RR by up to 44.6% in multiattacker system. Unexpected competitions, auctions between attackers and overestimation of attackers' influence factor are three main causes of both attackers' dropping RR. We propose a mining strategy named Partial Initiative Release (PIR) which is a semi-honest mining strategy in multi-attacker system. In some specific situations, PIR allows the attacker to get more block reward by launching an attack in multi-attacker system.