Pengcheng Xia

CR
h-index13
12papers
175citations
Novelty40%
AI Score53

12 Papers

CVMar 28, 2023Code
AutoKary2022: A Large-Scale Densely Annotated Dataset for Chromosome Instance Segmentation

Dan You, Pengcheng Xia, Qiuzhu Chen et al.

Automated chromosome instance segmentation from metaphase cell microscopic images is critical for the diagnosis of chromosomal disorders (i.e., karyotype analysis). However, it is still a challenging task due to lacking of densely annotated datasets and the complicated morphologies of chromosomes, e.g., dense distribution, arbitrary orientations, and wide range of lengths. To facilitate the development of this area, we take a big step forward and manually construct a large-scale densely annotated dataset named AutoKary2022, which contains over 27,000 chromosome instances in 612 microscopic images from 50 patients. Specifically, each instance is annotated with a polygonal mask and a class label to assist in precise chromosome detection and segmentation. On top of it, we systematically investigate representative methods on this dataset and obtain a number of interesting findings, which helps us have a deeper understanding of the fundamental problems in chromosome instance segmentation. We hope this dataset could advance research towards medical understanding. The dataset can be available at: https://github.com/wangjuncongyu/chromosome-instance-segmentation-dataset.

AIDec 31, 2025Code
Multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis under unseen working conditions

Pengcheng Xia, Yixiang Huang, Chengjin Qin et al.

Intelligent fault diagnosis has become an indispensable technique for ensuring machinery reliability. However, existing methods suffer significant performance decline in real-world scenarios where models are tested under unseen working conditions, while domain adaptation approaches are limited to their reliance on target domain samples. Moreover, most existing studies rely on single-modal sensing signals, overlooking the complementary nature of multi-modal information for improving model generalization. To address these limitations, this paper proposes a multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis. A dual disentanglement framework is developed to decouple modality-invariant and modality-specific features, as well as domain-invariant and domain-specific representations, enabling both comprehensive multi-modal representation learning and robust domain generalization. A cross-domain mixed fusion strategy is designed to randomly mix modality information across domains for modality and domain diversity augmentation. Furthermore, a triple-modal fusion mechanism is introduced to adaptively integrate multi-modal heterogeneous information. Extensive experiments are conducted on induction motor fault diagnosis under both unseen constant and time-varying working conditions. The results demonstrate that the proposed method consistently outperforms advanced methods and comprehensive ablation studies further verify the effectiveness of each proposed component and multi-modal fusion. The code is available at: https://github.com/xiapc1996/MMDG.

58.3SEMay 4Code
CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation

Zirui Wan, Zhaonan Wu, Xinyi Hou et al.

High-quality commit messages are critical for maintaining software projects, yet ensuring their consistency and informativeness remains a practical challenge. While the Conventional Commits Specification (CCS) provides a structured format for commit messages, research on CCS-based commit classification and commit message generation (CMG) is limited by the absence of large-scale benchmarks, semantic annotations, and reliable evaluation methods. In this paper, we introduce CommitSuite, a benchmark comprising 63,533 CCS-compliant commits from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with AST-level code changes, along with LLM-assisted semantic annotations that capture the "what" and "why" behind the change. To evaluate CMG systems, we propose a reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality, enabling semantic-level assessment without relying on human-written references. Our experiments show that LLMs can effectively support both generation and evaluation, with evaluation achieving 0.849 Cohen's Kappa agreement against human judgments. CommitSuite offers a unified resource for structured commit understanding and facilitates reproducible research on commit classification and generation.

56.4CVMar 24
TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment

Chunxia Qin, Chenyu Liu, Pengcheng Xia et al.

Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition) improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a ``perceive-then-fuse'' strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cell and strengthens vision-language alignment. It enhances the interpretability and accuracy of TR. We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.

CVOct 17, 2024Code
DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Hanbo Cheng, Limin Lin, Chenyu Liu et al.

Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.

AIMar 7
Bi-directional digital twin prototype anchoring with multi-periodicity learning for few-shot fault diagnosis

Pengcheng Xia, Zhichao Dong, Yixiang Huang et al.

Intelligent fault diagnosis (IFD) has emerged as a powerful paradigm for ensuring the safety and reliability of industrial machinery. However, traditional IFD methods rely heavily on abundant labeled data for training, which is often difficult to obtain in practical industrial environments. Constructing a digital twin (DT) of the physical asset to obtain simulation data has therefore become a promising alternative. Nevertheless, existing DT-assisted diagnosis methods mainly transfer diagnostic knowledge through domain adaptation techniques, which still require a considerable amount of unlabeled data from the target asset. To address the challenges in few-shot scenarios where only extremely limited samples are available, a bi-directional DT prototype anchoring method with multi-periodicity learning is proposed. Specifically, a framework involving meta-training in the DT virtual space and test-time adaptation in the physical space is constructed for reliable few-shot model adaptation for the target asset. A bi-directional twin-domain prototype anchoring strategy with covariance-guided augmentation for adaptation is further developed to improve the robustness of prototype estimation. In addition, a multi-periodicity feature learning module is designed to capture the intrinsic periodic characteristics within current signals. A DT of an asynchronous motor is built based on finite element method, and experiments are conducted under multiple few-shot settings and three working conditions. Comparative and ablation studies demonstrate the superiority and effectiveness of the proposed method for few-shot fault diagnosis.

CRSep 1, 2021
Trade or Trick? Detecting and Characterizing Scam Tokens on Uniswap Decentralized Exchange

Pengcheng Xia, Haoyu wang, Bingyu Gao et al.

The prosperity of the cryptocurrency ecosystem drives the need for digital asset trading platforms. Beyond centralized exchanges (CEXs), decentralized exchanges (DEXs) are introduced to allow users to trade cryptocurrency without transferring the custody of their digital assets to the middlemen, thus eliminating the security and privacy issues of traditional CEX. Uniswap, as the most prominent cryptocurrency DEX, is continuing to attract scammers, with fraudulent cryptocurrencies flooding in the ecosystem. In this paper, we take the first step to detect and characterize scam tokens on Uniswap. We first collect all the transactions related to Uniswap V2 exchange and investigate the landscape of cryptocurrency trading on Uniswap from different perspectives. Then, we propose an accurate approach for flagging scam tokens on Uniswap based on a guilt-by-association heuristic and a machine-learning powered technique. We have identified over 10K scam tokens listed on Uniswap, which suggests that roughly 50% of the tokens listed on Uniswap are scam tokens. All the scam tokens and liquidity pools are created specialized for the "rug pull" scams, and some scam tokens have embedded tricks and backdoors in the smart contracts. We further observe that thousands of collusion addresses help carry out the scams in league with the scam token/pool creators. The scammers have gained a profit of at least \$16 million from 39,762 potential victims. Our observations in this paper suggest the urgency to identify and stop scams in the decentralized finance ecosystem, and our approach can act as a whistleblower that identifies scam tokens at their early stages.

CRApr 12, 2021
Ethereum Name Service: the Good, the Bad, and the Ugly

Pengcheng Xia, Haoyu Wang, Zhou Yu et al.

DNS has always been criticized for its inherent design flaws, making the system vulnerable to kinds of attacks. Besides, DNS domain names are not fully controlled by the users, which can be easily taken down by the authorities and registrars. Since blockchain has its unique properties like immutability and decentralization, it seems to be promising to build a decentralized name service on blockchain. Ethereum Name Service (ENS), as a novel name service built atop Etheruem, has received great attention from the community. Yet, no existing work has systematically studied this emerging system, especially the security issues and misbehaviors in ENS. To fill the void, we present the first large-scale study of ENS by collecting and analyzing millions of event logs related to ENS. We characterize the ENS system from a number of perspectives. Our findings suggest that ENS is showing gradually popularity during its four years' evolution, mainly due to its distributed and open nature that ENS domain names can be set to any kinds of records, even censored and malicious contents. We have identified several security issues and misbehaviors including traditional DNS security issues and new issues introduced by ENS smart contracts. Attackers are abusing the system with thousands of squatting ENS names, a number of scam blockchain addresses and malicious websites, etc. Our exploration suggests that our community should invest more effort into the detection and mitigation of issues in Blockchain-based Name Services towards building an open and trustworthy name service.

CRNov 5, 2020
Tracking Counterfeit Cryptocurrency End-to-end

Bingyu Gao, Haoyu Wang, Pengcheng Xia et al.

The production of counterfeit money has a long history. It refers to the creation of imitation currency that is produced without the legal sanction of government. With the growth of the cryptocurrency ecosystem, there is expanding evidence that counterfeit cryptocurrency has also appeared. In this paper, we empirically explore the presence of counterfeit cryptocurrencies on Ethereum and measure their impact. By analyzing over 190K ERC-20 tokens (or cryptocurrencies) on Ethereum, we have identified 2, 117 counterfeit tokens that target 94 of the 100 most popular cryptocurrencies. We perform an end-to-end characterization of the counterfeit token ecosystem, including their popularity, creators and holders, fraudulent behaviors and advertising channels. Through this, we have identified two types of scams related to counterfeit tokens and devised techniques to identify such scams. We observe that over 7,104 victims were deceived in these scams, and the overall financial loss sums to a minimum of $ 17 million (74,271.7 ETH). Our findings demonstrate the urgency to identify counterfeit cryptocurrencies and mitigate this threat.

CRJul 27, 2020
Don't Fish in Troubled Waters! Characterizing Coronavirus-themed Cryptocurrency Scams

Pengcheng Xia, Haoyu Wang, Xiapu Luo et al.

As COVID-19 has been spreading across the world since early 2020, a growing number of malicious campaigns are capitalizing the topic of COVID-19. COVID-19 themed cryptocurrency scams are increasingly popular during the pandemic. However, these newly emerging scams are poorly understood by our community. In this paper, we present the first measurement study of COVID-19 themed cryptocurrency scams. We first create a comprehensive taxonomy of COVID-19 scams by manually analyzing the existing scams reported by users from online resources. Then, we propose a hybrid approach to perform the investigation by: 1) collecting reported scams in the wild; and 2) detecting undisclosed ones based on information collected from suspicious entities (e.g., domains, tweets, etc). We have collected 195 confirmed COVID-19 cryptocurrency scams in total, including 91 token scams, 19 giveaway scams, 9 blackmail scams, 14 crypto malware scams, 9 Ponzi scheme scams, and 53 donation scams. We then identified over 200 blockchain addresses associated with these scams, which lead to at least 330K US dollars in losses from 6,329 victims. For each type of scams, we further investigated the tricks and social engineering techniques they used. To facilitate future research, we have released all the well-labelled scams to the research community.

CRMay 29, 2020
Beyond the Virus: A First Look at Coronavirus-themed Mobile Malware

Liu Wang, Ren He, Haoyu Wang et al.

As the COVID-19 pandemic emerged in early 2020, a number of malicious actors have started capitalizing the topic. Although a few media reports mentioned the existence of coronavirus-themed mobile malware, the research community lacks the understanding of the landscape of the coronavirus-themed mobile malware. In this paper, we present the first systematic study of coronavirus-themed Android malware. We first make efforts to create a daily growing COVID-19 themed mobile app dataset, which contains 4,322 COVID-19 themed apk samples (2,500 unique apps) and 611 potential malware samples (370 unique malicious apps) by the time of mid-November, 2020. We then present an analysis of them from multiple perspectives including trends and statistics, installation methods, malicious behaviors and malicious actors behind them. We observe that the COVID-19 themed apps as well as malicious ones began to flourish almost as soon as the pandemic broke out worldwide. Most malicious apps are camouflaged as benign apps using the same app identifiers (e.g., app name, package name and app icon). Their main purposes are either stealing users' private information or making profit by using tricks like phishing and extortion. Furthermore, only a quarter of the COVID-19 malware creators are habitual developers who have been active for a long time, while 75% of them are newcomers in this pandemic. The malicious developers are mainly located in US, mostly targeting countries including English-speaking countries, China, Arabic countries and Europe. To facilitate future research, we have publicly released all the well-labelled COVID-19 themed apps (and malware) to the research community. Till now, over 30 research institutes around the world have requested our dataset for COVID-19 themed research.

CRMar 16, 2020
Characterizing Cryptocurrency Exchange Scams

Pengcheng Xia, Bowen Zhang, Ru Ji et al.

As the indispensable trading platforms of the ecosystem, hundreds of cryptocurrency exchanges are emerging to facilitate the trading of digital assets. While, it also attracts the attentions of attackers. A number of scam attacks were reported targeting cryptocurrency exchanges, leading to a huge mount of financial loss. However, no previous work in our research community has systematically studied this problem. In this paper, we make the first effort to identify and characterize the cryptocurrency exchange scams. We first identify over 1,500 scam domains and over 300 fake apps, by collecting existing reports and using typosquatting generation techniques. Then we investigate the relationship between them, and identify 94 scam domain families and 30 fake app families. We further characterize the impacts of such scams, and reveal that these scams have incurred financial loss of 520k US dollars at least. We further observe that the fake apps have been sneaked to major app markets (including Google Play) to infect unsuspicious users. Our findings demonstrate the urgency to identify and prevent cryptocurrency exchange scams. To facilitate future research, we have publicly released all the identified scam domains and fake apps to the community.