IT | Mar 12, 2022
Adaptive Information Bottleneck Guided Joint Source and Channel Coding for Image Transmission
Lunan Sun, Yang Yang, Mingzhe Chen et al.
Joint source and channel coding (JSCC) for image transmission has attracted increasing attention due to its robustness and high efficiency. However, the existing deep JSCC research mainly focuses on minimizing the distortion between the transmitted and received information under a fixed number of available channels. Therefore, the transmitted rate may be far more than its required minimum value. In this paper, an adaptive information bottleneck (IB) guided joint source and channel coding (AIB-JSCC) method is proposed for image transmission. The goal of AIB-JSCC is to reduce the transmission rate while improving the image reconstruction quality. In particular, a new IB objective for image transmission is proposed so as to minimize the distortion and the transmission rate. A mathematically tractable lower bound on the proposed objective is derived, and then, adopted as the loss function of AIB-JSCC. To trade off compression and reconstruction quality, an adaptive algorithm is proposed to adjust the hyperparameter of the proposed loss function dynamically according to the distortion during the training. Experimental results show that AIB-JSCC can significantly reduce the required amount of transmitted data and improve the reconstruction quality and downstream task accuracy.
IT | Aug 8, 2023
Federated Inference with Reliable Uncertainty Quantification over Wireless Channels via Conformal Prediction
Meiyi Zhu, Matteo Zecchin, Sangwoo Park et al.
In this paper, we consider a wireless federated inference scenario in which devices and a server share a pre-trained machine learning model. The devices communicate statistical information about their local data to the server over a common wireless channel, aiming to enhance the quality of the inference decision at the server. Recent work has introduced federated conformal prediction (CP), which leverages devices-to-server communication to improve the reliability of the server's decision. With federated CP, devices communicate to the server information about the loss accrued by the shared pre-trained model on the local data, and the server leverages this information to calibrate a decision interval, or set, so that it is guaranteed to contain the correct answer with a pre-defined target reliability level. Previous work assumed noise-free communication, whereby devices can communicate a single real number to the server. In this paper, we study for the first time federated CP in a wireless setting. We introduce a novel protocol, termed wireless federated conformal prediction (WFCP), which builds on type-based multiple access (TBMA) and on a novel quantile correction strategy. WFCP is proved to provide formal reliability guarantees in terms of coverage of the predicted set produced by the server. Using numerical results, we demonstrate the significant advantages of WFCP over digital implementations of existing federated CP schemes, especially in regimes with limited communication resources and/or a large number of devices.
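The calibration step that conformal prediction relies on can be illustrated with a minimal sketch. It shows plain, centralized split conformal prediction only: the wireless channel, TBMA, and the quantile correction of WFCP are not modeled, and the function names `conformal_threshold` and `prediction_set` are hypothetical, chosen for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Scores are nonconformity scores (larger means a worse fit), and alpha is
    the target miscoverage level, so the target coverage is 1 - alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points: only the trivial set is valid.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(label_scores, threshold):
    """Keep every candidate label whose score does not exceed the threshold."""
    return {label for label, score in label_scores.items() if score <= threshold}
```

With seven calibration scores and alpha = 0.25, the threshold is the sixth smallest score. A test input then keeps every label whose score falls at or below that value.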
SP | Sep 12, 2024
Conformal Distributed Remote Inference in Sensor Networks Under Reliability and Communication Constraints
Meiyi Zhu, Matteo Zecchin, Sangwoo Park et al.
This paper presents the communication-constrained distributed conformal risk control (CD-CRC) framework, a novel decision-making framework for sensor networks under communication constraints. Targeting multi-label classification problems, such as segmentation, CD-CRC dynamically adjusts local and global thresholds used to identify significant labels with the goal of ensuring a target false negative rate (FNR), while adhering to communication capacity limits. CD-CRC builds on online exponentiated gradient descent to estimate the relative quality of the observations of different sensors, and on online conformal risk control (CRC) as a mechanism to control local and global thresholds. CD-CRC is proved to offer deterministic worst-case performance guarantees in terms of FNR and communication overhead, while the regret performance in terms of false positive rate (FPR) is characterized as a function of the key hyperparameters. Simulation results highlight the effectiveness of CD-CRC, particularly in communication resource-constrained environments, making it a valuable tool for enhancing the performance and reliability of distributed sensor networks.
CV | Oct 20, 2022
Image-Text Retrieval with Binary and Continuous Label Supervision
Zheng Li, Caili Guo, Zerun Feng et al.
Most image-text retrieval work adopts binary labels indicating whether a pair of image and text matches or not. Such a binary indicator covers only a limited subset of image-text semantic relations, which is insufficient to represent relevance degrees between images and texts described by continuous labels such as image captions. The visual-semantic embedding space obtained by learning binary labels is incoherent and cannot fully characterize the relevance degrees. In addition to the use of binary labels, this paper further incorporates continuous pseudo labels (generally approximated by text similarity between captions) to indicate the relevance degrees. To learn a coherent embedding space, we propose an image-text retrieval framework with Binary and Continuous Label Supervision (BCLS), where binary labels are used to guide the retrieval model to learn limited binary correlations, and continuous labels are complementary to the learning of image-text semantic relations. For the learning of binary labels, we improve the common Triplet ranking loss with Soft Negative mining (Triplet-SN) to improve convergence. For the learning of continuous labels, we design Kendall ranking loss inspired by Kendall rank correlation coefficient (Kendall), which improves the correlation between the similarity scores predicted by the retrieval model and the continuous labels. To mitigate the noise introduced by the continuous pseudo labels, we further design Sliding Window sampling and Hard Sample mining strategy (SW-HS) to alleviate the impact of noise and reduce the complexity of our framework to the same order of magnitude as the triplet ranking loss. Extensive experiments on two image-text retrieval benchmarks demonstrate that our method can improve the performance of state-of-the-art image-text retrieval models.
CV | Sep 28, 2022
Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval
Zheng Li, Caili Guo, Xin Wang et al.
There are two popular loss functions used for vision-language retrieval, i.e., triplet loss and contrastive learning loss, both of which essentially minimize the difference between the similarities of negative pairs and positive pairs. More specifically, Triplet loss with Hard Negative mining (Triplet-HN), which is widely used in existing retrieval models to improve the discriminative ability, easily falls into local minima during training. On the other hand, Vision-Language Contrastive learning loss (VLC), which is widely used in vision-language pre-training, has been shown to achieve significant performance gains on vision-language retrieval, but the performance of fine-tuning with VLC on small datasets is not satisfactory. This paper proposes a unified loss of pair similarity optimization for vision-language retrieval, providing a powerful tool for understanding existing loss functions. Our unified loss includes the hard sample mining strategy of VLC and introduces the margin used by the triplet loss for better similarity separation. It is shown that both Triplet-HN and VLC are special forms of our unified loss. Compared with Triplet-HN, our unified loss converges faster. Compared with VLC, our unified loss is more discriminative and can provide better generalization in downstream fine-tuning tasks. Experiments on image-text and video-text retrieval benchmarks show that our unified loss can significantly improve the performance of state-of-the-art retrieval models.
CV | Mar 1, 2023
Selectively Hard Negative Mining for Alleviating Gradient Vanishing in Image-Text Matching
Zheng Li, Caili Guo, Xin Wang et al.
Recently, a series of Image-Text Matching (ITM) methods have achieved impressive performance. However, we observe that most existing ITM models suffer from vanishing gradients at the beginning of training, which makes these models prone to falling into local minima. Most ITM models adopt triplet loss with Hard Negative mining (HN) as the optimization objective. We find that optimizing an ITM model using only the hard negative samples can easily lead to gradient vanishing. In this paper, we derive the condition under which the gradient vanishes during training. When the difference between the positive pair similarity and the negative pair similarity is close to 0, the gradients on both the image and text encoders will approach 0. To alleviate the gradient vanishing problem, we propose a Selectively Hard Negative Mining (SelHN) strategy, which chooses whether to mine hard negative samples according to the gradient vanishing condition. SelHN can be applied plug-and-play to existing ITM models to give them better training behavior. To further ensure the back-propagation of gradients, we construct a Residual Visual Semantic Embedding model with SelHN, denoted as RVSE++. Extensive experiments on two ITM benchmarks demonstrate the strength of RVSE++, which achieves state-of-the-art performance.
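For readers unfamiliar with the baseline objective discussed above, here is a minimal sketch of a hinge-based triplet loss with hardest-negative mining over a batch similarity matrix. It is a generic formulation, not the SelHN strategy or RVSE++; the function name `triplet_hn_loss`, the default margin of 0.2, and the averaging over the batch are assumptions made for illustration.

```python
def triplet_hn_loss(sim, margin=0.2):
    """Triplet loss with hardest negatives for a square similarity matrix.

    sim[i][j] is the similarity between image i and text j. Matched pairs are
    assumed to lie on the diagonal, and the batch must hold at least two pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        # Hardest negative text for image i (image-to-text direction).
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image for text i (text-to-image direction).
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - positive)
        total += max(0.0, margin + hardest_image - positive)
    return total / n
```

Only the single most similar negative in each direction contributes to the loss, which is what separates hard-negative mining from summing over all negatives in the batch.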
CV | Sep 25, 2023
Boundary-Aware Proposal Generation Method for Temporal Action Localization
Hao Zhang, Chunyan Feng, Jiahui Yang et al.
The goal of Temporal Action Localization (TAL) is to find the categories and temporal boundaries of actions in an untrimmed video. Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries. More importantly, few works consider the background frames that are similar to action frames in pixels but dissimilar in semantics, which also leads to inaccurate temporal boundaries. To address the challenge above, we propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning. Specifically, we define the above background frames as hard negative samples. Contrastive learning with hard negative mining is introduced to improve the discrimination of BAPG. BAPG is independent of the existing TAL network architecture, so it can be applied plug-and-play to mainstream TAL models. Extensive experimental results on THUMOS14 and ActivityNet-1.3 demonstrate that BAPG can significantly improve the performance of TAL.
CV | May 26, 2023
Integrating Listwise Ranking into Pairwise-based Image-Text Retrieval
Zheng Li, Caili Guo, Xin Wang et al.
Image-Text Retrieval (ITR) is essentially a ranking problem. Given a query caption, the goal is to rank candidate images by relevance, from large to small. The current ITR datasets are constructed in a pairwise manner. Image-text pairs are annotated as positive or negative. Correspondingly, ITR models mainly use pairwise losses, such as triplet loss, to learn to rank. Pairwise-based ITR increases positive pair similarity while decreasing negative pair similarity indiscriminately. However, the relevance between dissimilar negative pairs is different. Pairwise annotations cannot reflect this difference in relevance. In the current datasets, pairwise annotations miss many correlations. There are many potential positive pairs among the pairs labeled as negative. Pairwise-based ITR can only rank positive samples before negative samples, but cannot rank negative samples by relevance. In this paper, we integrate listwise ranking into conventional pairwise-based ITR. Listwise ranking optimizes the entire ranking list based on relevance scores. Specifically, we first propose a Relevance Score Calculation (RSC) module to calculate the relevance score of the entire ranked list. Then we choose the ranking metric, Normalized Discounted Cumulative Gain (NDCG), as the optimization objective. We transform the non-differentiable NDCG into a differentiable listwise loss, named Smooth-NDCG (S-NDCG). Our listwise ranking approach can be plug-and-play integrated into current pairwise-based ITR models. Experiments on ITR benchmarks show that integrating listwise ranking can improve the performance of current ITR models and provide more user-friendly retrieval results. The code is available at https://github.com/AAA-Zheng/Listwise_ITR.
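As background for the ranking metric named above, the following sketch computes the standard, non-differentiable NDCG of a ranked list. It is not the S-NDCG loss from the paper. The exponential gain `2**rel - 1` is one common convention and is an assumption here, as are the function names.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores given in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances)
    )


def ndcg(relevances_in_ranked_order):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    if ideal == 0:
        return 0.0  # No relevant item at all: define NDCG as 0.
    return dcg(relevances_in_ranked_order) / ideal
```

A list that is already sorted by decreasing relevance scores exactly 1.0, and any other ordering of distinct relevance values scores lower. The metric depends on sorting, which is why a smooth surrogate is needed before it can serve as a training loss.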
CV | May 20, 2024
Data Augmentation for Text-based Person Retrieval Using Large Language Models
Zheng Li, Lijia Si, Caili Guo et al.
Text-based Person Retrieval (TPR) aims to retrieve person images that match the description given a text query. The performance improvement of the TPR model relies on high-quality data for supervised training. However, it is difficult to construct a large-scale, high-quality TPR dataset due to expensive annotation and privacy protection. Recently, Large Language Models (LLMs) have approached or even surpassed human performance on many NLP tasks, creating the possibility to expand high-quality TPR datasets. This paper proposes an LLM-based Data Augmentation (LLM-DA) method for TPR. LLM-DA uses LLMs to rewrite the text in the current TPR dataset, achieving high-quality expansion of the dataset concisely and efficiently. These rewritten texts are able to increase the diversity of vocabulary and sentence structure while retaining the original key concepts and semantic information. In order to alleviate the hallucinations of LLMs, LLM-DA introduces a Text Faithfulness Filter (TFF) to filter out unfaithful rewritten text. To balance the contributions of original text and augmented text, a Balanced Sampling Strategy (BSS) is proposed to control the proportion of original text and augmented text used for training. LLM-DA is a plug-and-play method that can be easily integrated into various TPR models. Comprehensive experiments on three TPR benchmarks show that LLM-DA can improve the retrieval performance of current TPR models.
LG | Jun 16, 2025
Lightweight Task-Oriented Semantic Communication Empowered by Large-Scale AI Models
Chuanhong Liu, Caili Guo, Yang Yang et al.
Recent studies have focused on leveraging large-scale artificial intelligence (LAI) models to improve semantic representation and compression capabilities. However, the substantial computational demands of LAI models pose significant challenges for real-time communication scenarios. To address this, this paper proposes utilizing knowledge distillation (KD) techniques to extract and condense knowledge from LAI models, effectively reducing model complexity and computation latency. Nevertheless, the inherent complexity of LAI models leads to prolonged inference times during distillation, while their lack of channel awareness compromises the distillation performance. These limitations make standard KD methods unsuitable for task-oriented semantic communication scenarios. To address these issues, we propose a fast distillation method featuring a pre-stored compression mechanism that eliminates the need for repetitive inference, significantly improving efficiency. Furthermore, a channel adaptive module is incorporated to dynamically adjust the transmitted semantic information based on varying channel conditions, enhancing communication reliability and adaptability. In addition, an information bottleneck-based loss function is derived to guide the fast distillation process. Simulation results verify that the proposed scheme outperforms baselines in terms of task accuracy, model size, computation latency, and training data requirements.
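The standard KD objective that the abstract contrasts against can be sketched as a temperature-softened KL divergence between teacher and student outputs. This is the generic formulation only, not the fast distillation method or the information bottleneck-based loss proposed in the paper. The function names and the default temperature of 2.0 are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge. Scaling by the squared temperature is a common convention for keeping gradient magnitudes comparable across temperatures.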
IT | Feb 16, 2024
On the Impact of Uncertainty and Calibration on Likelihood-Ratio Membership Inference Attacks
Meiyi Zhu, Caili Guo, Chunyan Feng et al.
In a membership inference attack (MIA), an attacker exploits the overconfidence exhibited by typical machine learning models to determine whether a specific data point was used to train a target model. In this paper, we analyze the performance of the likelihood ratio attack (LiRA) within an information-theoretical framework that allows the investigation of the impact of the aleatoric uncertainty in the true data generation process, of the epistemic uncertainty caused by a limited training data set, and of the calibration level of the target model. We compare three different settings, in which the attacker receives decreasingly informative feedback from the target model: confidence vector (CV) disclosure, in which the output probability vector is released; true label confidence (TLC) disclosure, in which only the probability assigned to the true label is made available by the model; and decision set (DS) disclosure, in which an adaptive prediction set is produced as in conformal prediction. We derive bounds on the advantage of an MIA adversary with the aim of offering insights into the impact of uncertainty and calibration on the effectiveness of MIAs. Simulation results demonstrate that the derived analytical bounds predict well the effectiveness of MIAs.
LG | Nov 19, 2025
Attention-Based Feature Online Conformal Prediction for Time Series
Meiyi Zhu, Caili Guo, Chunyan Feng et al.
Online conformal prediction (OCP) wraps around any pre-trained predictor to produce prediction sets with coverage guarantees that hold irrespective of temporal dependencies or distribution shifts. However, standard OCP faces two key limitations: it operates in the output space using simple nonconformity (NC) scores, and it treats all historical observations uniformly when estimating quantiles. This paper introduces attention-based feature OCP (AFOCP), which addresses both limitations through two key innovations. First, AFOCP operates in the feature space of pre-trained neural networks, leveraging learned representations to construct more compact prediction sets by concentrating on task-relevant information while suppressing nuisance variation. Second, AFOCP incorporates an attention mechanism that adaptively weights historical observations based on their relevance to the current test point, effectively handling non-stationarity and distribution shifts. We provide theoretical guarantees showing that AFOCP maintains long-term coverage while provably achieving smaller prediction intervals than standard OCP under mild regularity conditions. Extensive experiments on synthetic and real-world time series datasets demonstrate that AFOCP consistently reduces the size of prediction intervals by as much as 88% compared to OCP, while maintaining target coverage levels, validating the benefits of both feature-space calibration and attention-based adaptive weighting.
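The baseline OCP mechanism can be illustrated with a short sketch of the usual online threshold update, in which the threshold rises after a miss and falls after a hit. It covers only a standard output-space update and does not model the feature-space scores or attention weighting of AFOCP. The names `ocp_step` and `run_ocp`, the step size, and the initial threshold are hypothetical.

```python
def ocp_step(threshold, score, alpha, step_size):
    """One online update of the nonconformity threshold.

    A miss (score above the threshold) raises the threshold by
    step_size * (1 - alpha); a hit lowers it by step_size * alpha.
    """
    miss = 1.0 if score > threshold else 0.0
    return threshold + step_size * (miss - alpha)


def run_ocp(scores, alpha=0.1, step_size=0.05, initial_threshold=0.0):
    """Process a stream of scores; return the final threshold and miss rate."""
    threshold = initial_threshold
    misses = 0
    for score in scores:
        if score > threshold:
            misses += 1
        threshold = ocp_step(threshold, score, alpha, step_size)
    return threshold, misses / len(scores)
```

When the scores are bounded, the threshold stays bounded as well, so the long-run miss rate is pulled toward alpha without any distributional assumption on the stream.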
LG | Aug 4, 2025
Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation
Kuiyuan Ding, Caili Guo, Yang Yang et al.
Large-scale models (LSMs) can be an effective framework for semantic representation and understanding, thereby providing a suitable tool for designing semantic communication (SC) systems. However, their direct deployment is often hindered by high computational complexity and resource requirements. In this paper, a novel robust knowledge distillation based semantic communication (RKD-SC) framework is proposed to enable efficient and channel-noise-robust LSM-powered SC. The framework addresses two key challenges: determining optimal compact model architectures and effectively transferring knowledge while maintaining robustness against channel noise. First, a knowledge distillation-based lightweight differentiable architecture search (KDL-DARTS) algorithm is proposed. This algorithm integrates knowledge distillation loss and a complexity penalty into the neural architecture search process to identify high-performance, lightweight semantic encoder architectures. Second, a novel two-stage robust knowledge distillation (RKD) algorithm is developed to transfer semantic capabilities from an LSM (teacher) to a compact encoder (student) and subsequently enhance system robustness. To further improve resilience to channel impairments, a channel-aware transformer (CAT) block is introduced as the channel codec, trained under diverse channel conditions with variable-length outputs. Extensive simulations on image classification tasks demonstrate that the RKD-SC framework significantly reduces model parameters while preserving a high degree of the teacher model's performance and exhibiting superior robustness compared to existing methods.
CV | Jun 6, 2024
Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search
Xin Wang, Fangfang Liu, Zheng Li et al.
Text attribute person search aims to find specific pedestrians through given textual attributes, which is valuable in scenarios such as searching for designated pedestrians based on witness descriptions. The key challenge is the significant modality gap between textual attributes and images. Previous methods focused on achieving explicit representation and alignment through unimodal pre-trained models. Nevertheless, the absence of inter-modality correspondence in these models may lead to distortions in the local intra-modality information. Moreover, these methods only considered inter-modality alignment and ignored the differences between attribute categories. To mitigate the above problems, we propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework to learn the correspondence of local representations between textual attributes and images and combine global representation matching to narrow the modality gap. Firstly, we introduce the CLIP model as the backbone and design prompt templates to transform attribute combinations into structured sentences. This helps the model better understand and match image details. Next, we design a Masked Attribute Prediction (MAP) module that predicts the masked attributes after the interaction of image and masked textual attribute features through multi-modal interaction, thereby achieving implicit local relationship alignment. Finally, we propose an Attribute-IoU Guided Intra-Modal Contrastive (A-IoU IMC) loss, aligning the distribution of different textual attributes in the embedding space with their IoU distribution, achieving better semantic arrangement. Extensive experiments on the Market-1501 Attribute, PETA, and PA100K datasets show that the performance of our proposed method significantly surpasses the current state-of-the-art methods.
CV | Jan 29, 2022
Semantic-assisted image compression
Qizheng Sun, Caili Guo, Yang Yang et al.
Conventional image compression methods typically aim at pixel-level consistency while ignoring the performance of downstream AI tasks. To solve this problem, this paper proposes a Semantic-Assisted Image Compression method (SAIC), which can maintain semantic-level consistency to enable high performance of downstream AI tasks. To this end, we train the compression network using a semantic-level loss function. In particular, the semantic-level loss is measured using a gradient-based semantic weights mechanism (GSW). GSW directly considers the perceptual results of downstream AI tasks. Then, this paper proposes a semantic-level distortion evaluation metric to quantify the amount of semantic information retained during the compression process. Experimental results show that the proposed SAIC method can retain more semantic-level information and achieve better performance on downstream AI tasks compared to the traditional deep learning-based method and the advanced perceptual method at the same compression ratio.
LG | Nov 26, 2021
Jointly Learning Agent and Lane Information for Multimodal Trajectory Prediction
Jie Wang, Caili Guo, Minan Guo et al.
Predicting the plausible future trajectories of nearby agents is a core challenge for the safety of Autonomous Vehicles, and it mainly depends on two external cues: the dynamic neighbor agents and the static scene context. Recent approaches have made great progress in characterizing the two cues separately. However, they ignore the correlation between the two cues, and most of them struggle to achieve map-adaptive prediction. In this paper, we use lanes as scene data and propose a staged network that Jointly learns Agent and Lane information for Multimodal Trajectory Prediction (JAL-MTP). JAL-MTP uses a Social to Lane (S2L) module to jointly represent the static lane and the dynamic motion of the neighboring agents as instance-level lanes, a Recurrent Lane Attention (RLA) mechanism to predict map-adaptive future trajectories from the instance-level lanes, and two selectors to identify the typical and reasonable trajectories. Experiments conducted on the public Argoverse dataset demonstrate that JAL-MTP significantly outperforms the existing models both quantitatively and qualitatively.
CV | Sep 29, 2021
Semantic Communications With AI Tasks
Yang Yang, Caili Guo, Fangfang Liu et al.
Wireless networks are undergoing a radical paradigm shift from "connected things" to "connected intelligence", which coincides with Shannon and Weaver's vision: communications will transform from the technical level to the semantic level. This article proposes a semantic communication method with artificial intelligence tasks (SC-AIT). First, the architecture of SC-AIT is elaborated. Then, based on the proposed architecture, we implement SC-AIT for an image classification task. A prototype of SC-AIT is also established for surface defect detection. Experimental results show that SC-AIT has much lower bandwidth requirements and can achieve more than 40% classification accuracy gains compared with communications at the technical level. Future trends and key challenges for semantic communications are also identified.
CV | Mar 19, 2021
Learning Multiscale Correlations for Human Motion Prediction
Honghong Zhou, Caili Guo, Hao Zhang et al.
In spite of the great progress in human motion prediction, it is still challenging to predict aperiodic and complicated motions. We believe that capturing the correlations among human body components is the key to understanding human motion. In this paper, we propose a novel multiscale graph convolution network (MGCN) to address this problem. Firstly, we design an adaptive multiscale interactional encoding module (MIEM), which is composed of two submodules, a scale transformation module and a scale interaction module, to learn the human body correlations. Secondly, we apply a coarse-to-fine decoding strategy to decode the motions sequentially. We evaluate our approach on two standard benchmark datasets for human motion prediction: Human3.6M and the CMU motion capture dataset. The experiments show that the proposed approach achieves state-of-the-art performance for both short-term and long-term prediction, especially for complicated action categories.
SP | Mar 5, 2021
Optimization of User Selection and Bandwidth Allocation for Federated Learning in VLC/RF Systems
Chuanhong Liu, Caili Guo, Yang Yang et al.
Limited radio frequency (RF) resources restrict the number of users that can participate in federated learning (FL) thus affecting FL convergence speed and performance. In this paper, we first introduce visible light communication (VLC) as a supplement to RF in FL and build a hybrid VLC/RF communication system, in which each indoor user can use both VLC and RF to transmit its FL model parameters. Then, the problem of user selection and bandwidth allocation is studied for FL implemented over a hybrid VLC/RF system aiming to optimize the FL performance. The problem is first separated into two subproblems. The first subproblem is a user selection problem with a given bandwidth allocation, which is solved by a traversal algorithm. The second subproblem is a bandwidth allocation problem with a given user selection, which is solved by a numerical method. The final user selection and bandwidth allocation are obtained by iteratively solving these two subproblems. Simulation results show that the proposed FL algorithm that efficiently uses VLC and RF for FL model transmission can improve the prediction accuracy by up to 10% compared with a conventional FL system using only RF.
AI | Aug 14, 2020
Multi-Agent Deep Reinforcement Learning enabled Computation Resource Allocation in a Vehicular Cloud Network
Shilin Xu, Caili Guo, Rose Qingyang Hu et al.
In this paper, we investigate the computational resource allocation problem in a distributed ad hoc vehicular network with no centralized infrastructure support. To support the ever-increasing computational needs in such a vehicular network, a distributed virtual cloud network (VCN) is formed, based on which a computational resource sharing scheme through offloading among nearby vehicles is proposed. In view of the time-varying computational resources in the VCN, the statistical distribution characteristics of the computational resources are analyzed in detail. Based on this analysis, a resource-aware combinatorial optimization objective mechanism is proposed. To alleviate the non-stationarity caused by the typical multi-agent environment in the VCN, we adopt a centralized training and decentralized execution framework. In addition, we model the objective optimization problem as a Markov game and propose a DRL-based multi-agent deep deterministic policy gradient (MADDPG) algorithm to solve it. Interestingly, to overcome the lack of a real central control unit in the VCN, the allocation is actually completed on the vehicles in a distributed manner. Simulation results are presented to demonstrate the effectiveness of our scheme.
CV | Jun 16, 2020
Exploiting Visual Semantic Reasoning for Video-Text Retrieval
Zerun Feng, Zhimin Zeng, Caili Guo et al.
Video retrieval is a challenging research topic bridging the vision and language areas and has attracted broad attention in recent years. Previous works have been devoted to representing videos by directly encoding from frame-level features. In fact, videos consist of various and abundant semantic relations to which existing methods pay less attention. To address this issue, we propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions. Specifically, we consider frame regions as vertices and construct a fully-connected semantic correlation graph. Then, we perform reasoning by novel random walk rule-based graph convolutional networks to generate region features involved with semantic relations. With the benefit of reasoning, semantic interactions between regions are considered, while the impact of redundancy is suppressed. Finally, the region features are aggregated to form frame-level features for further encoding to measure video-text similarity. Extensive experiments on two public benchmark datasets validate the effectiveness of our method by achieving state-of-the-art performance due to the powerful semantic reasoning.