Cong Zhao

CV
h-index: 19
17 papers
865 citations
Novelty: 51%
AI Score: 40

17 Papers

CV · Aug 19, 2023
Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling

Guiqin Wang, Peng Zhao, Cong Zhao et al.

Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels. Most existing models rely on multiple instance learning (MIL), where the predictions of unlabeled instances are supervised by classifying labeled bags. MIL-based methods are relatively well studied and achieve strong performance on classification but not on localization. Generally, they locate temporal regions by the video-level classification but overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics. Specifically, our model entails two components: the first is an unsupervised change-point detection module that detects change-points by learning the latent representations of video features in a temporal hierarchy based on their rates of change, and the second is an attention-based classification model that selects the change-points of the foreground as the boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art methods, and even achieves comparable performance with fully-supervised methods.
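The MIL setup mentioned in this abstract can be illustrated with a minimal sketch. It assumes a common formulation in which per-snippet class logits are pooled over time with a top-k mean to obtain a video-level score, which is then trained against the video-level label. The function names, the toy logits, and the choice of top-k pooling are illustrative assumptions, not the authors' model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def video_level_scores(snippet_logits, k):
    """Top-k mean pooling over time, one score per class.

    snippet_logits: list of T rows (snippets), each holding C class logits.
    """
    num_classes = len(snippet_logits[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((row[c] for row in snippet_logits), reverse=True)
        pooled.append(sum(column[:k]) / k)
    return pooled

def mil_loss(snippet_logits, video_label, k=2):
    """Cross entropy between the pooled (bag-level) prediction and the video label."""
    probs = softmax(video_level_scores(snippet_logits, k))
    return -math.log(probs[video_label])

# Hypothetical logits for 5 snippets and 2 classes (made-up numbers).
logits = [[0.1, 0.0], [3.0, 0.2], [2.5, 0.1], [0.0, 0.3], [0.2, 0.1]]
scores = video_level_scores(logits, k=2)
loss = mil_loss(logits, video_label=0, k=2)
```

Only the bag (video) is labeled here, so the snippets receive no direct temporal supervision, which is the localization weakness the abstract points out.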

CL · Oct 20, 2023
Controlled Randomness Improves the Performance of Transformer Models

Tobias Deußer, Cong Zhao, Wolfgang Krämer et al.

During the pre-training step of natural language models, the main objective is to learn a general representation of the pre-training dataset, usually requiring large amounts of textual data to capture the complexity and diversity of natural language. In contrast, the data available to solve a specific downstream task is often dwarfed by the aforementioned pre-training dataset, especially in domains where data is scarce. We introduce controlled randomness, i.e., noise, into the training process to improve the fine-tuning of language models, and explore the performance of targeted noise added to the parameters of these models. We find that adding such noise can improve the performance in our two downstream tasks: joint named entity recognition and relation extraction, and text summarization.
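As a rough illustration of adding noise to model parameters, the sketch below perturbs a flat list of weights with zero-mean Gaussian noise. The noise distribution, the scale `sigma`, and the helper name `add_gaussian_noise` are assumptions made for this example; the abstract does not specify how or where the noise is applied.

```python
import random

def add_gaussian_noise(params, sigma, rng):
    """Return a copy of params with independent N(0, sigma^2) noise added to each entry."""
    return [w + rng.gauss(0.0, sigma) for w in params]

rng = random.Random(0)          # fixed seed so the run is repeatable
weights = [0.5, -1.2, 0.3]      # hypothetical parameter values

noisy = add_gaussian_noise(weights, 0.01, rng)     # small perturbation
unchanged = add_gaussian_noise(weights, 0.0, rng)  # sigma = 0 leaves weights as-is
```

In a fine-tuning loop, a perturbation like this would typically be applied before or between optimizer steps, with `sigma` treated as a hyperparameter.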

DC · Mar 24, 2022
ACE: Towards Application-Centric Edge-Cloud Collaborative Intelligence

Luhui Wang, Cong Zhao, Shusen Yang et al.

Intelligent applications based on machine learning are impacting many parts of our lives. They are required to operate under rigorous practical constraints in terms of service latency, network bandwidth overheads, and also privacy. Yet current implementations running in the Cloud are unable to satisfy all these constraints. The Edge-Cloud Collaborative Intelligence (ECCI) paradigm has become a popular approach to address such issues, and a rapidly increasing number of applications are being developed and deployed. However, these prototypical implementations are developer-dependent and scenario-specific without generality, and cannot be efficiently applied at large scale or to general ECC scenarios in practice, due to the lack of support for infrastructure management, edge-cloud collaborative service, complex intelligence workload, and efficient performance optimization. In this article, we systematically design and construct the first unified platform, ACE, that handles ever-increasing edge and cloud resources, user-transparent services, and proliferating intelligence workloads with increasing scale and complexity, to facilitate cost-efficient and high-performing ECCI application development and deployment. For verification, we explicitly present the construction process of an ACE-based intelligent video query application, and demonstrate how to achieve customizable performance optimization efficiently. Based on our initial experience, we discuss both the limitations and vision of ACE to shed light on promising issues to elaborate in the approaching ECCI ecosystem.

CV · Dec 14, 2023 · Code
Generative Model-based Feature Knowledge Distillation for Action Recognition

Guiqin Wang, Peng Zhao, Yanjiang Shi et al.

Knowledge distillation (KD), a technique widely employed in computer vision, has emerged as a de facto standard for improving the performance of small neural networks. However, prevailing KD-based approaches in video tasks primarily focus on designing loss functions and fusing cross-modal information. This overlooks the spatial-temporal feature semantics, resulting in limited advancements in model compression. Addressing this gap, our paper introduces an innovative knowledge distillation framework that uses a generative model to train a lightweight student model. In particular, the framework is organized into two steps: the initial phase is Feature Representation, wherein a generative model-based attention module is trained to represent feature semantics; subsequently, the Generative-based Feature Distillation phase encompasses both Generative Distillation and Attention Distillation, with the objective of transferring attention-based feature semantics with the generative model. The efficacy of our approach is demonstrated through comprehensive experiments on diverse popular datasets, showing considerable enhancements in the video action recognition task. Moreover, the effectiveness of our proposed framework is validated in the context of the more intricate video action detection task. Our code is available at https://github.com/aaai-24/Generative-based-KD.

CV · Aug 19, 2025 · Code
Generative Model-Based Feature Attention Module for Video Action Analysis

Guiqin Wang, Peng Zhao, Cong Zhao et al.

Video action analysis is a foundational technology within the realm of intelligent video comprehension, particularly concerning its application in the Internet of Things (IoT). However, existing methodologies overlook feature semantics in feature extraction and focus on optimizing action proposals, so these solutions lack the precision needed for widespread adoption in high-performance IoT applications, such as autonomous driving, which necessitate robust and scalable intelligent video analytics. To address this issue, we propose a novel generative attention-based model to learn the relation of feature semantics. Specifically, by leveraging the differences between the foreground and background of actions, our model simultaneously learns the frame- and segment-dependencies of temporal action feature semantics, which takes advantage of feature semantics in feature extraction effectively. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video tasks, action recognition and action detection. In the context of action detection tasks, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we extend the validation of the effectiveness of our proposed method to a broader task, video action recognition. Our code is available at https://github.com/Generative-Feature-Model/GAF.

LG · Dec 29, 2023
FedLED: Label-Free Equipment Fault Diagnosis with Vertical Federated Transfer Learning

Jie Shen, Shusen Yang, Cong Zhao et al.

Intelligent equipment fault diagnosis based on Federated Transfer Learning (FTL) attracts considerable attention from both academia and industry. It allows real-world industrial agents with limited samples to construct a fault diagnosis model without jeopardizing their raw data privacy. Existing approaches, however, can address neither the intense sample heterogeneity caused by different working conditions of practical agents, nor the extreme scarcity (or even complete absence) of fault labels for newly deployed equipment. To address these issues, we present FedLED, the first unsupervised vertical FTL equipment fault diagnosis method, where knowledge of the unlabeled target domain is further exploited for effective unsupervised model transfer. Results of extensive experiments using data of real equipment monitoring demonstrate that FedLED clearly outperforms SOTA approaches in terms of both diagnosis accuracy (up to 4.13 times) and generality. We expect our work to inspire further study on label-free equipment fault diagnosis systematically enhanced by target domain knowledge.

AI · Dec 6, 2024
TelOps: AI-driven Operations and Maintenance for Telecommunication Networks

Yuqian Yang, Shusen Yang, Cong Zhao et al.

Telecommunication Networks (TNs) have become the most important infrastructure for data communications over the last century. Operations and maintenance (O&M) is extremely important to ensure the availability, effectiveness, and efficiency of TN communications. Unlike artificial intelligence for IT Operations (AIOps), the popular O&M technique for IT systems (e.g., the cloud), O&M for TNs faces three fundamental challenges: topological dependence of network components, highly heterogeneous software, and restricted failure data. This article presents TelOps, the first AI-driven O&M framework for TNs, systematically enhanced with mechanism, data, and empirical knowledge. We provide a comprehensive comparison between TelOps and AIOps, and conduct a proof-of-concept case study on a typical O&M task (failure diagnosis) for a real industrial TN. As the first systematic AI-driven O&M framework for TNs, TelOps opens a new door to applying AI techniques to TN automation.

CV · Jun 5, 2024
EdgeSync: Faster Edge-model Updating via Adaptive Continuous Learning for Video Data Drift

Peng Zhao, Runchu Dong, Guiqin Wang et al.

Real-time video analytics systems typically place models with fewer weights on edge devices to reduce latency. The distribution of video content features may change over time for various reasons (e.g., changes in light and weather), leading to accuracy degradation of existing models. To solve this problem, recent work proposes a framework that uses a remote server to continually train and adapt the lightweight model at the edge with the help of a complex model. However, existing analytics approaches leave two challenges untouched: first, the retraining task is compute-intensive, resulting in large model update delays; second, the new model may not fit the data distribution of the current video stream well enough. To address these challenges, we present EdgeSync, which filters samples by considering both timeliness and inference results to make training samples more relevant to the current video content and to reduce the update delay. To improve the quality of training, EdgeSync also includes a training management module that efficiently adjusts the model training time and training order at runtime. In evaluations on real datasets with complex scenes, our method achieves an improvement of about 3.4% over existing methods and about 10% over traditional means.

CV · Jul 24, 2020
HEU Emotion: A Large-scale Database for Multi-modal Emotion Recognition in the Wild

Jing Chen, Chenhui Wang, Kejun Wang et al.

The study of affective computing in the wild setting is underpinned by databases. Existing multimodal emotion databases collected under real-world conditions are few and small, with a limited number of subjects and expressed in a single language. To fill this gap, we collected, annotated, and prepared to release a new natural state video database (called HEU Emotion). HEU Emotion contains a total of 19,004 video clips, which are divided into two parts according to the data source. The first part contains videos downloaded from Tumblr, Google, and Giphy, including 10 emotions and two modalities (facial expression and body posture). The second part includes a corpus taken manually from movies, TV series, and variety shows, consisting of 10 emotions and three modalities (facial expression, body posture, and emotional speech). HEU Emotion is by far the most extensive multi-modal emotional database with 9,951 subjects. In order to provide a benchmark for emotion recognition, we used many conventional machine learning and deep learning methods to evaluate HEU Emotion. We proposed a Multi-modal Attention module to fuse multi-modal features adaptively. After multi-modal fusion, the recognition accuracies for the two parts increased by 2.19% and 4.01% respectively over those of single-modal facial expression recognition.

LG · May 4, 2020
CDC: Classification Driven Compression for Bandwidth Efficient Edge-Cloud Collaborative Deep Learning

Yuanrui Dong, Peng Zhao, Hanqiao Yu et al.

The emerging edge-cloud collaborative Deep Learning (DL) paradigm aims at improving the performance of practical DL implementations in terms of cloud bandwidth consumption, response latency, and data privacy preservation. Focusing on bandwidth-efficient edge-cloud collaborative training of DNN-based classifiers, we present CDC, a Classification Driven Compression framework that reduces bandwidth consumption while preserving classification accuracy of edge-cloud collaborative DL. Specifically, to reduce bandwidth consumption, for resource-limited edge servers, we develop a lightweight autoencoder with classification guidance for compression with classification-driven feature preservation, which allows edges to upload only the latent code of raw data for accurate global training on the Cloud. Additionally, we design an adjustable quantization scheme that adaptively pursues the tradeoff between bandwidth consumption and classification accuracy under different network conditions, where only fine-tuning is required for rapid compression ratio adjustment. Results of extensive experiments demonstrate that, compared with DNN training with raw data, CDC consumes 14.9 times less bandwidth with an accuracy loss no more than 1.06%, and compared with DNN training with data compressed by AE without guidance, CDC introduces at least 100% lower accuracy loss.
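The bandwidth/accuracy trade-off behind an adjustable quantization scheme can be sketched with plain uniform quantization of a latent code, where the bit width is the adjustable knob. This is a generic stand-in, not CDC's actual scheme; the value range `[-1, 1]`, the function names, and the toy latent vector are assumptions.

```python
def quantize(values, bits, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    codes = []
    for v in values:
        clipped = min(max(v, lo), hi)   # values outside the range are clipped
        codes.append(round((clipped - lo) / step))
    return codes

def dequantize(codes, bits, lo=-1.0, hi=1.0):
    """Map integer codes back to approximate float values."""
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + c * step for c in codes]

def max_error(values, bits):
    """Largest absolute reconstruction error after a quantize/dequantize round trip."""
    restored = dequantize(quantize(values, bits), bits)
    return max(abs(a - b) for a, b in zip(values, restored))

latent = [-0.9, -0.2, 0.05, 0.7]   # hypothetical latent code from an edge-side encoder
coarse = max_error(latent, bits=2)  # fewer bits: less bandwidth, larger error
fine = max_error(latent, bits=8)    # more bits: more bandwidth, smaller error
```

Fewer bits per latent entry mean fewer bytes uploaded but a coarser reconstruction, which is the trade-off the abstract describes tuning under different network conditions.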

DC · Apr 22, 2020
OL4EL: Online Learning for Edge-cloud Collaborative Learning on Heterogeneous Edges with Resource Constraints

Qing Han, Shusen Yang, Xuebin Ren et al.

Distributed machine learning (ML) at the network edge is a promising paradigm that can preserve both network bandwidth and privacy of data providers. However, heterogeneous and limited computation and communication resources on edge servers (or edges) pose great challenges to distributed ML and formulate a new paradigm of Edge Learning (i.e., edge-cloud collaborative machine learning). In this article, we propose a novel framework of 'learning to learn' for effective Edge Learning (EL) on heterogeneous edges with resource constraints. We first model the dynamic determination of collaboration strategy (i.e., the allocation of local iterations at edge servers and global aggregations on the Cloud during the collaborative learning process) as an online optimization problem to achieve the tradeoff between the performance of EL and the resource consumption of edge servers. Then, we propose an Online Learning for EL (OL4EL) framework based on the budget-limited multi-armed bandit model. OL4EL supports both synchronous and asynchronous learning patterns, and can be used for both supervised and unsupervised learning tasks. To evaluate the performance of OL4EL, we conducted both real-world testbed experiments and extensive simulations based on Docker containers, where both Support Vector Machine and K-means were considered as use cases. Experimental results demonstrate that OL4EL significantly outperforms state-of-the-art EL and other collaborative ML approaches in terms of the trade-off between learning performance and resource consumption.
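To make the budget-limited multi-armed bandit idea concrete, the sketch below treats each arm as a candidate collaboration strategy with a fixed resource cost and keeps pulling arms until no arm fits in the remaining budget. The selection rule is a simple UCB1-style index divided by cost, chosen here for illustration; it is not claimed to be the OL4EL algorithm. The arm names, costs, and simulated reward probabilities are all hypothetical.

```python
import math
import random

def run_budgeted_bandit(arms, budget, pull, rng):
    """arms: dict mapping arm name -> cost per pull.
    pull(name, rng) must return a reward in [0, 1]."""
    counts = {a: 0 for a in arms}
    totals = {a: 0.0 for a in arms}
    spent = 0.0
    t = 0
    while True:
        affordable = [a for a in arms if spent + arms[a] <= budget]
        if not affordable:
            break                      # budget exhausted for every arm
        t += 1
        untried = [a for a in affordable if counts[a] == 0]
        if untried:
            choice = untried[0]        # try each arm once first
        else:
            def index(a):
                mean = totals[a] / counts[a]
                bonus = math.sqrt(2.0 * math.log(t) / counts[a])
                return (mean + bonus) / arms[a]   # optimistic reward per unit cost
            choice = max(affordable, key=index)
        totals[choice] += pull(choice, rng)
        counts[choice] += 1
        spent += arms[choice]
    return counts, spent

# Hypothetical strategies: local iterations per global aggregation, with made-up costs.
costs = {"1_local_iter": 1.0, "5_local_iters": 3.0, "10_local_iters": 5.0}
# Made-up success probabilities standing in for observed learning progress.
success_prob = {"1_local_iter": 0.2, "5_local_iters": 0.7, "10_local_iters": 0.8}

def simulated_pull(name, rng):
    return 1.0 if rng.random() < success_prob[name] else 0.0

counts, spent = run_budgeted_bandit(costs, 100.0, simulated_pull, random.Random(0))
```

The key difference from a standard bandit is the stopping condition: the horizon is set by the resource budget rather than by a fixed number of rounds.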

LG · Dec 17, 2019
Asynchronous Federated Learning with Differential Privacy for Edge Intelligence

Yanan Li, Shusen Yang, Xuebin Ren et al.

Federated learning has been shown to be a promising approach in paving the last mile of artificial intelligence, due to its great potential for solving the data isolation problem in large-scale machine learning. Particularly, with consideration of the heterogeneity in practical edge computing systems, asynchronous edge-cloud collaboration based federated learning can further improve the learning efficiency by significantly reducing the straggler effect. Despite no raw data sharing, the open architecture and extensive collaborations of asynchronous federated learning (AFL) still give some malicious participants great opportunities to infer other parties' training data, thus leading to serious privacy concerns. To achieve a rigorous privacy guarantee with high utility, we investigate securing asynchronous edge-cloud collaborative federated learning with differential privacy (DP), focusing on the impacts of differential privacy on model convergence of AFL. Formally, we give the first analysis of the model convergence of AFL under DP and propose a multi-stage adjustable private algorithm (MAPA) to improve the trade-off between model utility and privacy by dynamically adjusting both the noise scale and the learning rate. Through extensive simulations and real-world experiments with an edge-cloud testbed, we demonstrate that MAPA significantly improves both the model accuracy and convergence speed with sufficient privacy guarantee.
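A common way to add differential privacy to gradient-based training is to clip each gradient to a fixed L2 norm and then add Gaussian noise scaled to that norm. The sketch below shows that generic recipe; it is an assumption that the paper builds on something similar, and MAPA's stage-wise adjustment of noise scale and learning rate is not reproduced here. All function and parameter names are illustrative.

```python
import math
import random

def clip_by_l2_norm(grad, max_norm):
    """Scale grad down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def privatize(grad, max_norm, noise_multiplier, rng):
    """Clip the gradient, then add N(0, (noise_multiplier * max_norm)^2) noise per entry."""
    clipped = clip_by_l2_norm(grad, max_norm)
    sigma = noise_multiplier * max_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]

def sgd_step(weights, grad, lr):
    """Plain gradient-descent update."""
    return [w - lr * g for w, g in zip(weights, grad)]

rng = random.Random(0)
weights = [0.0, 0.0]
raw_grad = [3.0, 4.0]                      # hypothetical local gradient, L2 norm 5
noisy_grad = privatize(raw_grad, max_norm=1.0, noise_multiplier=0.5, rng=rng)
weights = sgd_step(weights, noisy_grad, lr=0.1)
```

A larger `noise_multiplier` gives stronger privacy but noisier updates, which is why jointly adjusting the noise scale and learning rate, as the abstract describes, affects convergence.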

LG · May 7, 2019
Neural Architecture Refinement: A Practical Way for Avoiding Overfitting in NAS

Yang Jiang, Cong Zhao, Zeyang Dou et al.

Neural architecture search (NAS) was proposed to automate the architecture design process and has attracted overwhelming interest from both academia and industry. However, it is confronted with an overfitting issue due to the high-dimensional search space composed of the operator selection and skip connections of each layer. This paper explores the architecture overfitting issue in depth based on the reinforcement learning-based NAS framework. We show that the policy gradient method has deep correlations with cross entropy minimization. Based on this correlation, we further demonstrate that, though the reward of NAS is sparse, the policy gradient method implicitly assigns the reward to all operations and skip connections based on the sampling frequency. However, due to the inaccurate reward estimation, the curse of dimensionality, and the hierarchical structure of neural networks, the reward characteristics for operators and skip connections have intrinsic differences, and the assigned rewards for the skip connections are extremely noisy and inaccurate. To alleviate this problem, we propose a neural architecture refinement approach that works with an initial state-of-the-art network structure and refines only its operators. Extensive experiments have demonstrated that the proposed method can achieve impressive results on tasks including classification and face recognition.
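The link between policy gradient and cross entropy mentioned in this abstract can be sketched with the standard REINFORCE estimator. The notation below (policy \pi_\theta, sampled architecture a^{(n)} with L per-layer decisions, reward R) is assumed for illustration and is not taken from the paper.

```latex
% Policy gradient (REINFORCE), with a Monte Carlo estimate over N sampled architectures
\nabla_\theta J(\theta)
  = \mathbb{E}_{a \sim \pi_\theta}\left[ R(a)\, \nabla_\theta \log \pi_\theta(a) \right]
  \approx \frac{1}{N} \sum_{n=1}^{N} R(a^{(n)})\, \nabla_\theta \log \pi_\theta(a^{(n)})

% Autoregressive factorization over the L per-layer decisions
\log \pi_\theta(a) = \sum_{l=1}^{L} \log \pi_\theta(a_l \mid a_{<l})

% The estimate is therefore the negative gradient of a reward-weighted cross entropy,
% in which the sampled decisions play the role of labels
\mathcal{L}_{\mathrm{CE}}(\theta)
  = -\frac{1}{N} \sum_{n=1}^{N} R(a^{(n)}) \sum_{l=1}^{L}
      \log \pi_\theta\!\left(a^{(n)}_l \mid a^{(n)}_{<l}\right),
\qquad
\nabla_\theta J(\theta) \approx -\nabla_\theta \mathcal{L}_{\mathrm{CE}}(\theta)
```

Under this view, every decision in a sampled architecture receives the same architecture-level reward, so the credit a particular operator or skip connection accumulates grows with how often it is sampled.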

CV · Aug 22, 2018
Coarse-to-Fine Annotation Enrichment for Semantic Segmentation Learning

Yadan Luo, Ziwei Wang, Zi Huang et al.

Rich high-quality annotated data is critical for semantic segmentation learning, yet acquiring dense and pixel-wise ground-truth is both labor- and time-consuming. Coarse annotations (e.g., scribbles, coarse polygons) offer an economical alternative, but training with them can hardly achieve satisfactory performance. In order to generate high-quality annotated data with a low time cost for accurate segmentation, in this paper, we propose a novel annotation enrichment strategy, which expands existing coarse annotations of training data to a finer scale. Extensive experiments on the Cityscapes and PASCAL VOC 2012 benchmarks have shown that the neural networks trained with the enriched annotations from our framework yield a significant improvement over those trained with the original coarse labels. Their performance is highly competitive with that obtained by using human-annotated dense annotations. The proposed method also outperforms other state-of-the-art weakly-supervised segmentation methods.

LG · Nov 30, 2017
Towards Accurate Binary Convolutional Neural Network

Xiaofan Lin, Cong Zhao, Wei Pan

We introduce a novel scheme to train binary convolutional neural networks (CNNs) -- CNNs with weights and activations constrained to {-1,+1} at run-time. It has been known that using binary weights and activations drastically reduces memory size and accesses, and can replace arithmetic operations with more efficient bitwise operations, leading to much faster test-time inference and lower power consumption. However, previous works on binarizing CNNs usually result in severe prediction accuracy degradation. In this paper, we address this issue with two major innovations: (1) approximating full-precision weights with the linear combination of multiple binary weight bases; (2) employing multiple binary activations to alleviate information loss. The implementation of the resulting binary CNN, denoted as ABC-Net, is shown to achieve much closer performance to its full-precision counterpart, and even reach comparable prediction accuracy on ImageNet and forest trail datasets, given adequate binary weight bases and activations.
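The idea of approximating full-precision weights by a linear combination of binary bases, W ≈ α_1 B_1 + ... + α_M B_M with each B_i in {-1,+1}, can be illustrated with a small sketch. The fitting procedure used here is greedy residual binarization (binarize the current residual, scale by its mean absolute value, subtract, repeat), chosen because it needs no linear solver; it is a stand-in and not necessarily how ABC-Net constructs its bases or coefficients. The weight values are made up.

```python
def sign(x):
    """Binarize to +1.0 / -1.0 (zero maps to +1.0)."""
    return 1.0 if x >= 0 else -1.0

def binary_approximation(weights, num_bases):
    """Greedily fit weights with num_bases scaled binary vectors."""
    residual = list(weights)
    alphas, bases = [], []
    for _ in range(num_bases):
        base = [sign(r) for r in residual]
        # For a fixed sign pattern, the mean absolute value is the
        # least-squares optimal scale for this single base.
        alpha = sum(abs(r) for r in residual) / len(residual)
        alphas.append(alpha)
        bases.append(base)
        residual = [r - alpha * b for r, b in zip(residual, base)]
    return alphas, bases

def reconstruct(alphas, bases):
    """Evaluate sum_i alpha_i * B_i elementwise."""
    n = len(bases[0])
    return [sum(a * b[i] for a, b in zip(alphas, bases)) for i in range(n)]

def squared_error(weights, num_bases):
    alphas, bases = binary_approximation(weights, num_bases)
    approx = reconstruct(alphas, bases)
    return sum((w - v) ** 2 for w, v in zip(weights, approx))

weights = [0.8, -0.3, 0.1, -0.9, 0.4]      # hypothetical full-precision weights
err_one_base = squared_error(weights, 1)
err_three_bases = squared_error(weights, 3)
```

Each extra base cannot increase the squared error under this greedy rule, which mirrors the abstract's point that accuracy approaches the full-precision model given enough binary bases.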

NI · Jan 8, 2017
Cheating-Resilient Incentive Scheme for Mobile Crowdsensing Systems

Cong Zhao, Xinyu Yang, Wei Yu et al.

Mobile Crowdsensing is a promising paradigm for ubiquitous sensing, which explores the tremendous data collected by mobile smart devices with prominent spatial-temporal coverage. As a fundamental property of Mobile Crowdsensing Systems, temporally recruited mobile users can provide agile, fine-grained, and economical sensing labor; however, their self-interest means the quality of the sensing data cannot be guaranteed, even when there is a fair return. Therefore, a mechanism is required for the system server to recruit well-behaving users for credible sensing, and to stimulate and reward more contributive users based on sensing truth discovery to further increase credible reporting. In this paper, we develop a novel Cheating-Resilient Incentive (CRI) scheme for Mobile Crowdsensing Systems, which achieves credibility-driven user recruitment and payback maximization for honest users with quality data. Via theoretical analysis, we demonstrate the correctness of our design. The performance of our scheme is evaluated based on extensive real-world trace-driven simulations. Our evaluation results show that our scheme is effective in terms of both guaranteeing sensing accuracy and resisting potential cheating behaviors, in practical scenarios as well as in intentionally harsher ones.

NI · Jan 8, 2017
Rapid, User-Transparent, and Trustworthy Device Pairing for D2D-Enabled Mobile Crowdsourcing

Cong Zhao, Shusen Yang, Xinyu Yang et al.

Mobile Crowdsourcing is a promising service paradigm utilizing ubiquitous mobile devices to facilitate large-scale crowdsourcing tasks (e.g., urban sensing and collaborative computing). Many applications in this domain require Device-to-Device (D2D) communications between participating devices for interactive operations such as task collaborations and file transmissions. Considering the private participating devices and their opportunistic encountering behaviors, it is highly desirable to establish secure and trustworthy D2D connections in a fast and autonomous way, which is vital for implementing practical Mobile Crowdsourcing Systems (MCSs). In this paper, we develop an efficient scheme, Trustworthy Device Pairing (TDP), which achieves user-transparent secure D2D connections and reliable peer device selections for trustworthy D2D communications. Through rigorous analysis, we demonstrate the effectiveness and security intensity of TDP in theory. The performance of TDP is evaluated based on both real-world prototype experiments and extensive trace-driven simulations. Evaluation results verify our theoretical analysis and show that TDP significantly outperforms existing approaches in terms of pairing speed, stability, and security.