Kai Niu

IT
h-index25
19papers
1,297citations
Novelty53%
AI Score52

19 Papers

CVNov 2, 2022Code
WITT: A Wireless Image Transmission Transformer for Semantic Communications

Ke Yang, Sixian Wang, Jincheng Dai et al.

In this paper, we aim to redesign the vision Transformer (ViT) as a new backbone to realize semantic image transmission, termed wireless image transmission transformer (WITT). Previous works build upon convolutional neural networks (CNNs), which are inefficient in capturing global dependencies, resulting in degraded end-to-end transmission performance especially for high-resolution images. To tackle this, the proposed WITT employs Swin Transformers as a more capable backbone to extract long-range information. Different from ViTs in image classification tasks, WITT is highly optimized for image transmission while considering the effect of the wireless channel. Specifically, we propose a spatial modulation module to scale the latent representations according to channel state information, which enhances the ability of a single model to deal with various channel conditions. As a result, extensive experiments verify that our WITT attains better performance for different image resolutions, distortion metrics, and channel conditions. The code is available at https://github.com/KeYang8/WITT.

CVMay 26, 2022
Wireless Deep Video Semantic Transmission

Sixian Wang, Jincheng Dai, Zijian Liang et al.

In this paper, we design a new class of high-efficiency deep joint source-channel coding methods to achieve end-to-end video transmission over wireless channels. The proposed methods exploit nonlinear transform and conditional coding architecture to adaptively extract semantic features across video frames, and transmit semantic feature domain representations over wireless channels via deep joint source-channel coding. Our framework is collected under the name deep video semantic transmission (DVST). In particular, benefiting from the strong temporal prior provided by the feature domain context, the learned nonlinear transform function becomes temporally adaptive, resulting in a richer and more accurate entropy model guiding the transmission of current frame. Accordingly, a novel rate adaptive transmission mechanism is developed to customize deep joint source-channel coding for video sources. It learns to allocate the limited channel bandwidth within and among video frames to maximize the overall transmission performance. The whole DVST design is formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion performance under perceptual quality metrics or machine vision task performance metrics. Across standard video source test sequences and various communication scenarios, experiments show that our DVST can generally surpass traditional wireless video coded transmission schemes. The proposed DVST framework can well support future semantic communications due to its video content-aware and machine vision task integration abilities.

ITAug 4, 2022
Communication Beyond Transmitting Bits: Semantics-Guided Source and Channel Coding

Jincheng Dai, Ping Zhang, Kai Niu et al.

Classical communication paradigms focus on accurately transmitting bits over a noisy channel, and Shannon theory provides a fundamental theoretical limit on the rate of reliable communications. In this approach, bits are treated equally, and the communication system is oblivious to what meaning these bits convey or how they would be used. Future communications towards intelligence and conciseness will predictably play a dominant role, and the proliferation of connected intelligent agents requires a radical rethinking of coded transmission paradigm to support the new communication morphology on the horizon. The recent concept of "semantic communications" offers a promising research direction. Injecting semantic guidance into the coded transmission design to achieve semantics-aware communications shows great potential for further breakthrough in effectiveness and reliability. This article sheds light on semantics-guided source and channel coding as a transmission paradigm of semantic communications, which exploits both data semantics diversity and wireless channel diversity together to boost the whole system performance. We present the general system architecture and key techniques, and indicate some open issues on this topic.

CVMay 26, 2022
Perceptual Learned Source-Channel Coding for High-Fidelity Image Semantic Transmission

Jun Wang, Sixian Wang, Jincheng Dai et al.

As one novel approach to realize end-to-end wireless image semantic transmission, deep learning-based joint source-channel coding (deep JSCC) method is emerging in both deep learning and communication communities. However, current deep JSCC image transmission systems are typically optimized for traditional distortion metrics such as peak signal-to-noise ratio (PSNR) or multi-scale structural similarity (MS-SSIM). But for low transmission rates, due to the imperfect wireless channel, these distortion metrics lose significance as they favor pixel-wise preservation. To account for human visual perception in semantic communications, it is of great importance to develop new deep JSCC systems optimized beyond traditional PSNR and MS-SSIM metrics. In this paper, we introduce adversarial losses to optimize deep JSCC, which tends to preserve global semantic information and local texture. Our new deep JSCC architecture combines encoder, wireless channel, decoder/generator, and discriminator, which are jointly learned under both perceptual and adversarial losses. Our method yields human visually much more pleasing results than state-of-the-art engineered image coded transmission systems and traditional deep JSCC systems. A user study confirms that achieving the perceptually similar end-to-end image transmission quality, the proposed method can save about 50\% wireless channel bandwidth cost.

ITNov 8, 2022
Toward Adaptive Semantic Communications: Efficient Data Transmission via Online Learned Nonlinear Transform Source-Channel Coding

Jincheng Dai, Sixian Wang, Ke Yang et al.

The emerging field semantic communication is driving the research of end-to-end data transmission. By utilizing the powerful representation ability of deep learning models, learned data transmission schemes have exhibited superior performance than the established source and channel coding methods. While, so far, research efforts mainly concentrated on architecture and model improvements toward a static target domain. Despite their successes, such learned models are still suboptimal due to the limitations in model capacity and imperfect optimization and generalization, particularly when the testing data distribution or channel response is different from that adopted for model training, as is likely to be the case in real-world. To tackle this, we propose a novel online learned joint source and channel coding approach that leverages the deep learning model's overfitting property. Specifically, we update the off-the-shelf pre-trained models after deployment in a lightweight online fashion to adapt to the distribution shifts in source data and environment domain. We take the overfitting concept to the extreme, proposing a series of implementation-friendly methods to adapt the codec model or representations to an individual data or channel state instance, which can further lead to substantial gains in terms of the bandwidth ratio-distortion performance. The proposed methods enable the communication-efficient adaptation for all parameters in the network without sacrificing decoding speed. Our experiments, including user study, on continually changing target source data and wireless channel environments, demonstrate the effectiveness and efficiency of our approach, on which we outperform existing state-of-the-art engineered transmission scheme (VVC combined with 5G LDPC coded transmission).

CLMay 7, 2022
Attract me to Buy: Advertisement Copywriting Generation with Multimodal Multi-structured Information

Zhipeng Zhang, Xinglin Hou, Kai Niu et al.

Recently, online shopping has gradually become a common way of shopping for people all over the world. Wonderful merchandise advertisements often attract more people to buy. These advertisements properly integrate multimodal multi-structured information of commodities, such as visual spatial information and fine-grained structure information. However, traditional multimodal text generation focuses on the conventional description of what existed and happened, which does not match the requirement of advertisement copywriting in the real world. Because advertisement copywriting has a vivid language style and higher requirements of faithfulness. Unfortunately, there is a lack of reusable evaluation frameworks and a scarcity of datasets. Therefore, we present a dataset, E-MMAD (e-commercial multimodal multi-structured advertisement copywriting), which requires, and supports much more detailed information in text generation. Noticeably, it is one of the largest video captioning datasets in this field. Accordingly, we propose a baseline method and faithfulness evaluation metric on the strength of structured information reasoning to solve the demand in reality on this dataset. It surpasses the previous methods by a large margin on all metrics. The dataset and method are coming soon on \url{https://e-mmad.github.io/e-mmad.net/index.html}.

ITApr 16
A Synonymous Variational Perspective on the Rate-Distortion-Perception Tradeoff

Zijian Liang, Kai Niu, Changshuo Wang et al.

The fundamental limit of natural signal compression has traditionally been characterized by classical rate-distortion (RD) theory through the tradeoff between coding rate and reconstruction distortion, while the rate-distortion-perception (RDP) framework introduces a divergence-based measure of perceptual quality as a modeling principle rather than a theoretically-derived principle, leaving its theoretical origin unclear. In this paper, motivated by a synonymity-based semantic information perspective, we reformulate perceptual reconstruction as recovering any admissible sample within an ideal synonymous set (synset) associated with the source, rather than the source sample itself, and correspondingly establish a synonymous source coding architecture. On this basis, we develop a synonymous variational inference (SVI) analysis framework with a synonymous variational lower bound (SVLBO) for tractable analysis of synset-oriented compression. Within this framework, we establish a synonymity-perception consistency principle, showing that optimal identification of semantic information is theoretically consistent with perceptual optimization. Based on its derivation result, we prove a synonymous RDP tradeoff for the proposed synonymous source coding. These analytical results show that the distributional divergence term arises naturally from the synset-based reconstruction objective, clarify its compatibility with existing RDP formulations and classical RD theory, and suggest the potential advantages of synonymous source coding.

ITMay 13
Blind Recognition of Polar Codes Using Successive Cancellation List Decoding

Changwei Tu, Yang Liu, Xianzhao Feng et al.

Blind recognition of polar codes remains challenging in non-cooperative scenarios, particularly for information-set recognition with known code length. Existing methods mainly rely on threshold decisions determined by the generator-matrix structure and channel bit error probability, without fully exploiting the soft information in received signals. In this letter, we propose a blind recognition method using successive cancellation list (SCL) decoding for polar codes with known code length. The proposed method exploits the distinct statistical behaviors of frozen and information bits in source-side decision log-likelihood ratios (LLRs) over multiple received vectors: frozen bits tend to favor zero decisions, whereas information bits exhibit nearly equiprobable $0/1$ decisions. Based on this property, the decoder expands candidate paths under the frozen-bit and information-bit hypotheses at each bit position, evaluates their reliabilities using the corresponding average path metrics, and retains only the $L_{\mathrm{list}}$ most reliable paths for subsequent recognition. Finally, the information-set pattern corresponding to the most reliable surviving path is selected as the recognition result. Simulation results show that the proposed scheme improves the recognition success rate as the list size increases. For the $(32,16)$, $(64,32)$, and $(128,64)$ polar codes, it achieves at least $2.5$ dB gain over the previous method when $L_{\mathrm{list}}=64$.

ITMay 28, 2025
Synonymous Variational Inference for Perceptual Image Compression

Zijian Liang, Kai Niu, Changshuo Wang et al.

Recent contributions of semantic information theory reveal the set-element relationship between semantic and syntactic information, represented as synonymous relationships. In this paper, we propose a synonymous variational inference (SVI) method based on this synonymity viewpoint to re-analyze the perceptual image compression problem. It takes perceptual similarity as a typical synonymous criterion to build an ideal synonymous set (Synset), and approximate the posterior of its latent synonymous representation with a parametric density by minimizing a partial semantic KL divergence. This analysis theoretically proves that the optimization direction of perception image compression follows a triple tradeoff that can cover the existing rate-distortion-perception schemes. Additionally, we introduce synonymous image compression (SIC), a new image compression scheme that corresponds to the analytical process of SVI, and implement a progressive SIC codec to fully leverage the model's capabilities. Experimental results demonstrate comparable rate-distortion-perception performance using a single progressive SIC codec, thus verifying the effectiveness of our proposed analysis method.

SPFeb 27, 2025
NeRFCom: Feature Transform Coding Meets Neural Radiance Field for Free-View 3D Scene Semantic Transmission

Weijie Yue, Zhongwei Si, Bolin Wu et al.

We introduce NeRFCom, a novel communication system designed for end-to-end 3D scene transmission. Compared to traditional systems relying on handcrafted NeRF semantic feature decomposition for compression and well-adaptive channel coding for transmission error correction, our NeRFCom employs a nonlinear transform and learned probabilistic models, enabling flexible variable-rate joint source-channel coding and efficient bandwidth allocation aligned with the NeRF semantic feature's different contribution to the 3D scene synthesis fidelity. Experimental results demonstrate that NeRFCom achieves free-view 3D scene efficient transmission while maintaining robustness under adverse channel conditions.

ITAug 21, 2025
Way to Build Native AI-driven 6G Air Interface: Principles, Roadmap, and Outlook

Ping Zhang, Kai Niu, Yiming Liu et al.

Artificial intelligence (AI) is expected to serve as a foundational capability across the entire lifecycle of 6G networks, spanning design, deployment, and operation. This article proposes a native AI-driven air interface architecture built around two core characteristics: compression and adaptation. On one hand, compression enables the system to understand and extract essential semantic information from the source data, focusing on task relevance rather than symbol-level accuracy. On the other hand, adaptation allows the air interface to dynamically transmit semantic information across diverse tasks, data types, and channel conditions, ensuring scalability and robustness. This article first introduces the native AI-driven air interface architecture, then discusses representative enabling methodologies, followed by a case study on semantic communication in 6G non-terrestrial networks. Finally, it presents a forward-looking discussion on the future of native AI in 6G, outlining key challenges and research opportunities.

CLMar 20, 2025
Towards Automatic Continual Learning: A Self-Adaptive Framework for Continual Instruction Tuning

Peiyi Lin, Fukai Zhang, Kai Niu et al.

Continual instruction tuning enables large language models (LLMs) to learn incrementally while retaining past knowledge, whereas existing methods primarily focus on how to retain old knowledge rather than on selecting which new knowledge to learn. In domain-specific contexts, maintaining data quality and managing system constraints remain key challenges. To address these issues, we propose an automated continual instruction tuning framework that dynamically filters incoming data, which identify and reduce redundant data across successive updates. Our approach utilizes a small proxy model for efficient perplexity-based filtering, and updates the proxy to ensure that the filtering criteria remain aligned with the evolving state of the deployed model. Compared to existing static data selection methods, our framework can effectively handle incrementally acquired data and shifting distributions. Additionally, it addresses practical deployment challenges by enabling seamless model updates, supporting version rollback and incorporating automatic checkpoint evaluation. We evaluated the system in real-world medical scenarios. It reduced computational costs by 66.7% and improved model performance, and achieved autonomous updates, thus demonstrating its effectiveness for automatic continual instruction tuning.

ITJun 10, 2024
Deep Generative Modeling Reshapes Compression and Transmission: From Efficiency to Resiliency

Jincheng Dai, Xiaoqi Qin, Sixian Wang et al.

Information theory and machine learning are inextricably linked and have even been referred to as "two sides of the same coin". One particularly elegant connection is the essential equivalence between probabilistic generative modeling and data compression or transmission. In this article, we reveal the dual-functionality of deep generative models that reshapes both data compression for efficiency and transmission error concealment for resiliency. We present how the contextual predictive capabilities of powerful generative models can be well positioned to be strong compressors and estimators. In this sense, we advocate for viewing the deep generative modeling problem through the lens of end-to-end communications, and evaluate the compression and error restoration capabilities of foundation generative models. We show that the kernel of many large generative models is powerful predictor that can capture complex relationships among semantic latent variables, and the communication viewpoints provide novel insights into semantic feature tokenization, contextual learning, and usage of deep generative models. In summary, our article highlights the essential connections of generative AI to source and channel coding techniques, and motivates researchers to make further explorations in this emerging topic.

CLJun 5, 2024
Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models

Qiang Sun, Yuanyi Luo, Wenxiao Zhang et al.

Even for a conservative estimate, 80% of enterprise data reside in unstructured files, stored in data lakes that accommodate heterogeneous formats. Classical search engines can no longer meet information seeking needs, especially when the task is to browse and explore for insight formulation. In other words, there are no obvious search keywords to use. Knowledge graphs, due to their natural visual appeals that reduce the human cognitive load, become the winning candidate for heterogeneous data integration and knowledge representation. In this paper, we introduce Docs2KG, a novel framework designed to extract multimodal information from diverse and heterogeneous unstructured documents, including emails, web pages, PDF files, and Excel files. Dynamically generates a unified knowledge graph that represents the extracted key information, Docs2KG enables efficient querying and exploration of document data lakes. Unlike existing approaches that focus on domain-specific data sources or pre-designed schemas, Docs2KG offers a flexible and extensible solution that can adapt to various document structures and content types. The proposed framework unifies data processing supporting a multitude of downstream tasks with improved domain interpretability. Docs2KG is publicly accessible at https://docs2kg.ai4wa.com, and a demonstration video is available at https://docs2kg.ai4wa.com/Video.

ITJan 25, 2022
Distributed Image Transmission using Deep Joint Source-Channel Coding

Sixian Wang, Ke Yang, Jincheng Dai et al.

We study the problem of deep joint source-channel coding (D-JSCC) for correlated image sources, where each source is transmitted through a noisy independent channel to the common receiver. In particular, we consider a pair of images captured by two cameras with probably overlapping fields of view transmitted over wireless channels and reconstructed in the center node. The challenging problem involves designing a practical code to utilize both source and channel correlations to improve transmission efficiency without additional transmission overhead. To tackle this, we need to consider the common information across two stereo images as well as the differences between two transmission channels. In this case, we propose a deep neural networks solution that includes lightweight edge encoders and a powerful center decoder. Besides, in the decoder, we propose a novel channel state information aware cross attention module to highlight the overlapping fields and leverage the relevance between two noisy feature maps.Our results show the impressive improvement of reconstruction quality in both links by exploiting the noisy representations of the other link. Moreover, the proposed scheme shows competitive results compared to the separated schemes with capacity-achieving channel codes.

ITDec 21, 2021
Nonlinear Transform Source-Channel Coding for Semantic Communications

Jincheng Dai, Sixian Wang, Kailin Tan et al.

In this paper, we propose a class of high-efficiency deep joint source-channel coding methods that can closely adapt to the source distribution under the nonlinear transform, it can be collected under the name nonlinear transform source-channel coding (NTSCC). In the considered model, the transmitter first learns a nonlinear analysis transform to map the source data into latent space, then transmits the latent representation to the receiver via deep joint source-channel coding. Our model incorporates the nonlinear transform as a strong prior to effectively extract the source semantic features and provide side information for source-channel coding. Unlike existing conventional deep joint source-channel coding methods, the proposed NTSCC essentially learns both the source latent representation and an entropy model as the prior on the latent representation. Accordingly, novel adaptive rate transmission and hyperprior-aided codec refinement mechanisms are developed to upgrade deep joint source-channel coding. The whole system design is formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion performance under established perceptual quality metrics. Across test image sources with various resolutions, we find that the proposed NTSCC transmission method generally outperforms both the analog transmission using the standard deep joint source-channel coding and the classical separation-based digital transmission. Notably, the proposed NTSCC method can potentially support future semantic communications due to its content-aware ability and perceptual optimization goal.

CVSep 27, 2021
Text-based Person Search in Full Images via Semantic-Driven Proposal Generation

Shizhou Zhang, De Cheng, Wenlong Luo et al.

Finding target persons in full scene images with a query of text description has important practical applications in intelligent video surveillance.However, different from the real-world scenarios where the bounding boxes are not available, existing text-based person retrieval methods mainly focus on the cross modal matching between the query text descriptions and the gallery of cropped pedestrian images. To close the gap, we study the problem of text-based person search in full images by proposing a new end-to-end learning framework which jointly optimize the pedestrian detection, identification and visual-semantic feature embedding tasks. To take full advantage of the query text, the semantic features are leveraged to instruct the Region Proposal Network to pay more attention to the text-described proposals. Besides, a cross-scale visual-semantic embedding mechanism is utilized to improve the performance. To validate the proposed method, we collect and annotate two large-scale benchmark datasets based on the widely adopted image-based person search datasets CUHK-SYSU and PRW. Comprehensive experiments are conducted on the two datasets and compared with the baseline methods, our method achieves the state-of-the-art performance.

CVNov 2, 2020
Actor and Action Modular Network for Text-based Video Segmentation

Jianhua Yang, Yan Huang, Kai Niu et al.

Text-based video segmentation aims to segment an actor in video sequences by specifying the actor and its performing action with a textual query. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of \emph{semantic asymmetry}. The \emph{semantic asymmetry} implies that two modalities contain different amounts of semantic information during the multi-modal fusion process. To alleviate this problem, we propose a novel actor and action modular network that individually localizes the actor and its action in two separate modules. Specifically, we first learn the actor-/action-related content from the video and textual query, and then match them in a symmetrical manner to localize the target tube. The target tube contains the desired actor and action which is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects cross multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and keep the temporal consistency of predictions. The whole model is allowed for joint learning of the actor-action matching and segmentation, as well as achieves the state-of-the-art performance for both single-frame segmentation and full video segmentation on A2D Sentences and J-HMDB Sentences datasets.

CVJun 23, 2019
Improving Description-based Person Re-identification by Multi-granularity Image-text Alignments

Kai Niu, Yan Huang, Wanli Ouyang et al.

Description-based person re-identification (Re-id) is an important task in video surveillance that requires discriminative cross-modal representations to distinguish different people. It is difficult to directly measure the similarity between images and descriptions due to the modality heterogeneity (the cross-modal problem). And all samples belonging to a single category (the fine-grained problem) makes this task even harder than the conventional image-description matching task. In this paper, we propose a Multi-granularity Image-text Alignments (MIA) model to alleviate the cross-modal fine-grained problem for better similarity evaluation in description-based person Re-id. Specifically, three different granularities, i.e., global-global, global-local and local-local alignments are carried out hierarchically. Firstly, the global-global alignment in the Global Contrast (GC) module is for matching the global contexts of images and descriptions. Secondly, the global-local alignment employs the potential relations between local components and global contexts to highlight the distinguishable components while eliminating the uninvolved ones adaptively in the Relation-guided Global-local Alignment (RGA) module. Thirdly, as for the local-local alignment, we match visual human parts with noun phrases in the Bi-directional Fine-grained Matching (BFM) module. The whole network combining multiple granularities can be end-to-end trained without complex pre-processing. To address the difficulties in training the combination of multiple granularities, an effective step training strategy is proposed to train these granularities step-by-step. Extensive experiments and analysis have shown that our method obtains the state-of-the-art performance on the CUHK-PEDES dataset and outperforms the previous methods by a significant margin.