CLSep 2, 2022
Exploiting Hybrid Semantics of Relation Paths for Multi-hop Question Answering Over Knowledge GraphsZile Qiao, Wei Ye, Tong Zhang et al. · pku
Answering natural language questions on knowledge graphs (KGQA) remains a great challenge in terms of understanding complex questions via multi-hop reasoning. Previous efforts usually exploit large-scale entity-related text corpora or knowledge graph (KG) embeddings as auxiliary information to facilitate answer selection. However, the rich semantics implied in off-the-shelf relation paths between entities is far from well explored. This paper proposes improving multi-hop KGQA by exploiting relation paths' hybrid semantics. Specifically, we integrate explicit textual information and implicit KG structural features of relation paths based on a novel rotate-and-scale entity link prediction framework. Extensive experiments on three existing KGQA datasets demonstrate the superiority of our method, especially in multi-hop scenarios. Further investigation confirms our method's systematical coordination between questions and relation paths to identify answer entities.
LGMay 12Code
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage OptimizationXu Chu, Guanyu Wang, Zhijie Tan et al.
Large Language Models (LLMs) suffer from order bias, where their performance is affected by the arrangement order of input elements. This unfairness limits the model's applications in scenarios such as in-context learning and Retrieval-Augmented Generation (RAG). Recent studies attempt to obtain optimal or suboptimal arrangements based on statistical results or using dataset-based search, but these methods increase inference overhead while leaving the model's inherent order bias unresolved. Other studies mitigate order sensitivity through supervised fine-tuning using augmented training sets with multiple order variants, but often at the cost of accuracy, trapping the model in consistent yet incorrect hallucinations. In this paper, we propose \textbf{D}ual \textbf{G}roup \textbf{A}dvantage \textbf{O}ptimization (\textbf{DGAO}), which aims to improve model accuracy and order stability simultaneously. DGAO calculates and balances intra-group relative accuracy advantage and inter-group relative stability advantage, rewarding the policy model for generating order-stable and correct outputs while penalizing order-sensitive or incorrect responses. This marks the first time reinforcement learning has been used to mitigate LLMs' order sensitivity. We also propose two new metrics, Consistency Rate and Overconfidence Rate, to reveal the pseudo-stability of previous methods and guide more comprehensive evaluation. Extensive experiments demonstrate that DGAO achieves superior order fairness while improving performance on RAG, mathematical reasoning, and classification tasks. Our code is available at: https://github.com/Hyalinesky/DGAO.
CLJan 24, 2025Code
Domaino1s: Guiding LLM Reasoning for Explainable Answers in High-Stakes DomainsXu Chu, Zhijie Tan, Hanlin Xue et al.
Large Language Models (LLMs) are widely applied to downstream domains. However, current LLMs for high-stakes domain tasks, such as financial investment and legal QA, typically generate brief answers without reasoning processes and explanations. This limits users' confidence in making decisions based on their responses. While original CoT shows promise, it lacks self-correction mechanisms during reasoning. This work introduces Domain$o1$s, which enhances LLMs' reasoning capabilities on domain tasks through supervised fine-tuning and tree search. We construct CoT-stock-2k and CoT-legal-2k datasets for fine-tuning models that activate domain-specific reasoning steps based on their judgment. Additionally, we propose Selective Tree Exploration to spontaneously explore solution spaces and sample optimal reasoning paths to improve performance. We also introduce PROOF-Score, a new metric for evaluating domain models' explainability, complementing traditional accuracy metrics with richer assessment dimensions. Extensive experiments on stock investment recommendation and legal reasoning QA tasks demonstrate Domaino1s's leading performance and explainability. Our code is available at https://github.com/Hyalinesky/Domaino1s.
CVMay 29, 2025Code
Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual InformationXu Chu, Xinrong Chen, Guanyu Wang et al.
Inference time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, causing visual information to receive less attention and may trigger hallucinations. Although introducing text-only reflection processes shows promise in language models, we demonstrate that it is insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attention visual information during reasoning. We first propose a reinforcement learning method Balanced Reflective Policy Optimization (BRPO), which guides the model to decide when to generate vision-text reflection on its own and balance the number and length of reflections. Then, we formally prove that VLRMs lose attention to visual tokens as reasoning progresses, and demonstrate that supplementing visual information during reflection enhances visual attention. Therefore, during training and inference, Visual Token COPY and Visual Token ROUTE are introduced to force the model to re-attention visual information at the visual level, addressing the limitations of text-only reflection. Experiments on multiple visual QA datasets and hallucination metrics indicate that Qwen-LA achieves leading accuracy performance while reducing hallucinations. Our code is available at: https://github.com/Liar406/Look_Again
MMMar 10
MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement LearningXiang Yuan, Xu Chu, Xinrong Chen et al.
Multimodal Object-Entity Relation Extraction (MORE) is a challenging task in information extraction research. It aims to identify relations between visual objects and textual entities, requiring complex multimodal understanding and cross-modal reasoning abilities. Existing methods, mainly classification-based or generation-based without reasoning, struggle to handle complex extraction scenarios in the MORE task and suffer from limited scalability and intermediate reasoning transparency. To address these challenges, we propose MORE-R1, a novel model that introduces explicit stepwise reasoning with Reinforcement Learning (RL) to enable Large Vision-Language Model (LVLM) to address the MORE task effectively. MORE-R1 integrates a two-stage training process, including an initial cold-start training stage with Supervised Fine-Tuning (SFT) and a subsequent RL stage for reasoning ability optimization. In the initial stage, we design an efficient way to automatically construct a high-quality SFT dataset containing fine-grained stepwise reasoning tailored to the MORE task, enabling the model to learn an effective reasoning paradigm. In the subsequent stage, we employ the Group Relative Policy Optimization (GRPO) RL algorithm with a Progressive Sample-Mixing Strategy to stabilize training and further enhance model's reasoning ability on hard samples. Comprehensive experiments on the MORE benchmark demonstrate that MORE-R1 achieves state-of-the-art performance with significant improvement over baselines.
CLJun 17, 2025Code
GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection VectorsHengyuan Zhang, Xinrong Chen, Yingmin Qiu et al.
Parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), offer an efficient way to adapt large language models with reduced computational costs. However, their performance is limited by the small number of trainable parameters. Recent work combines LoRA with the Mixture-of-Experts (MoE), i.e., LoRA-MoE, to enhance capacity, but two limitations remain in hindering the full exploitation of its potential: 1) the influence of downstream tasks when assigning expert numbers, and 2) the uniform rank assignment across all LoRA experts, which restricts representational diversity. To mitigate these gaps, we propose GuiLoMo, a fine-grained layer-wise expert numbers and ranks allocation strategy with GuidedSelection Vectors (GSVs). GSVs are learned via a prior bilevel optimization process to capture both model- and task-specific needs, and are then used to allocate optimal expert numbers and ranks. Experiments on three backbone models across diverse benchmarks show that GuiLoMo consistently achieves superior or comparable performance to all baselines. Further analysis offers key insights into how expert numbers and ranks vary across layers and tasks, highlighting the benefits of adaptive expert configuration. Our code is available at https://github.com/Liar406/Gui-LoMo.git.
CLJun 6, 2023
Joint Event Extraction via Structural Semantic MatchingHaochen Li, Tianhao Gao, Jingkun Wang et al.
Event Extraction (EE) is one of the essential tasks in information extraction, which aims to detect event mentions from text and find the corresponding argument roles. The EE task can be abstracted as a process of matching the semantic definitions and argument structures of event types with the target text. This paper encodes the semantic features of event types and makes structural matching with target text. Specifically, Semantic Type Embedding (STE) and Dynamic Structure Encoder (DSE) modules are proposed. Also, the Joint Structural Semantic Matching (JSSM) model is built to jointly perform event detection and argument extraction tasks through a bidirectional attention layer. The experimental results on the ACE2005 dataset indicate that our model achieves a significant performance improvement
CVMay 25, 2023Code
NVTC: Nonlinear Vector Transform CodingRunsen Feng, Zongyu Guo, Weiping Li et al.
In theory, vector quantization (VQ) is always better than scalar quantization (SQ) in terms of rate-distortion (R-D) performance. Recent state-of-the-art methods for neural image compression are mainly based on nonlinear transform coding (NTC) with uniform scalar quantization, overlooking the benefits of VQ due to its exponentially increased complexity. In this paper, we first investigate on some toy sources, demonstrating that even if modern neural networks considerably enhance the compression performance of SQ with nonlinear transform, there is still an insurmountable chasm between SQ and VQ. Therefore, revolving around VQ, we propose a novel framework for neural image compression named Nonlinear Vector Transform Coding (NVTC). NVTC solves the critical complexity issue of VQ through (1) a multi-stage quantization strategy and (2) nonlinear vector transforms. In addition, we apply entropy-constrained VQ in latent space to adaptively determine the quantization boundaries for joint rate-distortion optimization, which improves the performance both theoretically and experimentally. Compared to previous NTC approaches, NVTC demonstrates superior rate-distortion performance, faster decoding speed, and smaller model size. Our code is available at https://github.com/USTC-IMCL/NVTC
CLJan 14, 2022Code
Eliciting Knowledge from Pretrained Language Models for Prototypical Prompt VerbalizerYinyi Wei, Tong Mo, Yongtao Jiang et al.
Recent advances on prompt-tuning cast few-shot classification tasks as a masked language modeling problem. By wrapping input into a template and using a verbalizer which constructs a mapping between label space and label word space, prompt-tuning can achieve excellent results in zero-shot and few-shot scenarios. However, typical prompt-tuning needs a manually designed verbalizer which requires domain expertise and human efforts. And the insufficient label space may introduce considerable bias into the results. In this paper, we focus on eliciting knowledge from pretrained language models and propose a prototypical prompt verbalizer for prompt-tuning. Labels are represented by prototypical embeddings in the feature space rather than by discrete words. The distances between the embedding at the masked position of input and prototypical embeddings are used as classification criterion. For zero-shot settings, knowledge is elicited from pretrained language models by a manually designed template to form initial prototypical embeddings. For few-shot settings, models are tuned to learn meaningful and interpretable prototypical embeddings. Our method optimizes models by contrastive learning. Extensive experimental results on several many-class text classification datasets with low-resource settings demonstrate the effectiveness of our approach compared with other verbalizer construction methods. Our implementation is available at https://github.com/Ydongd/prototypical-prompt-verbalizer.
CVNov 23, 2019Code
Region Normalization for Image InpaintingTao Yu, Zongyu Guo, Xin Jin et al.
Feature Normalization (FN) is an important technique to help neural network training, which typically normalizes features across spatial dimensions. Most previous image inpainting methods apply FN in their networks without considering the impact of the corrupted regions of the input image on normalization, e.g. mean and variance shifts. In this work, we show that the mean and variance shifts caused by full-spatial FN limit the image inpainting network training and we propose a spatial region-wise normalization named Region Normalization (RN) to overcome the limitation. RN divides spatial pixels into different regions according to the input mask, and computes the mean and variance in each region for normalization. We develop two kinds of RN for our image inpainting network: (1) Basic RN (RN-B), which normalizes pixels from the corrupted and uncorrupted regions separately based on the original inpainting mask to solve the mean and variance shift problem; (2) Learnable RN (RN-L), which automatically detects potentially corrupted and uncorrupted regions for separate normalization, and performs global affine transformation to enhance their fusion. We apply RN-B in the early layers and RN-L in the latter layers of the network respectively. Experiments show that our method outperforms current state-of-the-art methods quantitatively and qualitatively. We further generalize RN to other inpainting networks and achieve consistent performance improvements. Our code is available at https://github.com/geekyutao/RN.
AIOct 22, 2024
Order Matters: Exploring Order Sensitivity in Multimodal Large Language ModelsZhijie Tan, Xu Chu, Weiping Li et al.
Multimodal Large Language Models (MLLMs) utilize multimodal contexts consisting of text, images, or videos to solve various multimodal tasks. However, we find that changing the order of multimodal input can cause the model's performance to fluctuate between advanced performance and random guessing. This phenomenon exists in both single-modality (text-only or image-only) and mixed-modality (image-text-pair) contexts. Furthermore, we demonstrate that popular MLLMs pay special attention to certain multimodal context positions, particularly the beginning and end. Leveraging this special attention, we place key video frames and important image/text content in special positions within the context and submit them to the MLLM for inference. This method results in average performance gains of 14.7% for video-caption matching and 17.8% for visual question answering tasks. Additionally, we propose a new metric, Position-Invariant Accuracy (PIA), to address order bias in MLLM evaluation. Our research findings contribute to a better understanding of Multi-Modal In-Context Learning (MMICL) and provide practical strategies for enhancing MLLM performance without increasing computational costs.
CVMay 19, 2025
GMM-Based Comprehensive Feature Extraction and Relative Distance Preservation For Few-Shot Cross-Modal RetrievalChengsong Sun, Weiping Li, Xiang Li et al.
Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples, enabling the model to handle unseen classes during inference. Unlike traditional cross-modal retrieval tasks, which assume that both training and testing data share the same class distribution, few-shot retrieval involves data with sparse representations across modalities. Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data, resulting in two main biases in the latent semantic space: intra-modal bias, where sparse samples fail to capture intra-class diversity, and inter-modal bias, where misalignments between image and text distributions exacerbate the semantic gap. These biases hinder retrieval accuracy. To address these issues, we propose a novel method, GCRDP, for few-shot cross-modal retrieval. This approach effectively captures the complex multi-peak distribution of data using a Gaussian Mixture Model (GMM) and incorporates a multi-positive sample contrastive learning mechanism for comprehensive feature modeling. Additionally, we introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions, thereby improving the accuracy of cross-modal representations. We validate our approach through extensive experiments on four benchmark datasets, demonstrating superior performance over six state-of-the-art methods.
CLApr 13, 2025
CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question AnsweringLiqiang Wen, Guanming Xiong, Tong Mo et al.
This study addresses the challenge of ambiguity in knowledge graph question answering (KGQA). While recent KGQA systems have made significant progress, particularly with the integration of large language models (LLMs), they typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications. To address these limitations, we propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification. Our approach employs a Bayesian inference mechanism to quantify query ambiguity and guide LLMs in determining when and how to request clarification from users within a multi-turn dialogue framework. We further develop a two-agent interaction framework where an LLM-based user simulator enables iterative refinement of logical forms through simulated user feedback. Experimental results on the WebQSP and CWQ dataset demonstrate that our method significantly improves performance by effectively resolving semantic ambiguities. Additionally, we contribute a refined dataset of disambiguated queries, derived from interaction histories, to facilitate future research in this direction.
LGJan 24, 2025
GraphSOS: Graph Sampling and Order Selection to Help LLMs Understand Graphs BetterXu Chu, Hanlin Xue, Zhijie Tan et al.
The success of Large Language Models (LLMs) in various domains has led researchers to apply them to graph-related problems by converting graph data into natural language text. However, unlike graph data, natural language inherently has sequential order. We observe a counter-intuitive fact that when the order of nodes or edges in the natural language description of a graph is shuffled, despite describing the same graph, model performance fluctuates between high performance and random guessing. Additionally, due to LLMs' limited input context length, current methods typically randomly sample neighbors of target nodes as representatives of their neighborhood, which may not always be effective for accurate reasoning. To address these gaps, we introduce GraphSOS (Graph Sampling and Order Selection). This novel model framework features an Order Selector Module to ensure proper serialization order of the graph and a Subgraph Sampling Module to sample subgraphs with better structure for better reasoning. Furthermore, we propose Graph CoT obtained through distillation, and enhance LLM's reasoning and zero-shot learning capabilities for graph tasks through instruction tuning. Experiments on multiple datasets for node classification and graph question-answering demonstrate that GraphSOS improves LLMs' performance and generalization ability on graph tasks.
CVMar 9
SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph GenerationJiaye Feng, Qixiang Yin, Yuankun Liu et al.
Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs characterized by low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), designed to engage in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy by leveraging an MLLM and refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward which integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
IROct 24, 2025
CausalRec: A CausalBoost Attention Model for Sequential RecommendationYunbo Hou, Tianle Yang, Ruijie Li et al.
Recent advances in correlation-based sequential recommendation systems have demonstrated substantial success. Specifically, the attention-based model outperforms other RNN-based and Markov chains-based models by capturing both short- and long-term dependencies more effectively. However, solely focusing on item co-occurrences overlooks the underlying motivations behind user behaviors, leading to spurious correlations and potentially inaccurate recommendations. To address this limitation, we present a novel framework that integrates causal attention for sequential recommendation, CausalRec. It incorporates a causal discovery block and a CausalBooster. The causal discovery block learns the causal graph in user behavior sequences, and we provide a theory to guarantee the identifiability of the learned causal graph. The CausalBooster utilizes the discovered causal graph to refine the attention mechanism, prioritizing behaviors with causal significance. Experimental evaluations on real-world datasets indicate that CausalRec outperforms several state-of-the-art methods, with average improvements of 7.21% in Hit Rate (HR) and 8.65% in Normalized Discounted Cumulative Gain (NDCG). To the best of our knowledge, this is the first model to incorporate causality through the attention mechanism in sequential recommendation, demonstrating the value of causality in generating more accurate and reliable recommendations.
CVJan 17, 2025
Mitigating Hallucinations on Object Attributes using Multiview Images and Negative InstructionsZhijie Tan, Yuzhi Li, Shengwei Meng et al.
Current popular Large Vision-Language Models (LVLMs) are suffering from Hallucinations on Object Attributes (HoOA), leading to incorrect determination of fine-grained attributes in the input images. Leveraging significant advancements in 3D generation from a single image, this paper proposes a novel method to mitigate HoOA in LVLMs. This method utilizes multiview images sampled from generated 3D representations as visual prompts for LVLMs, thereby providing more visual information from other viewpoints. Furthermore, we observe the input order of multiple multiview images significantly affects the performance of LVLMs. Consequently, we have devised Multiview Image Augmented VLM (MIAVLM), incorporating a Multiview Attributes Perceiver (MAP) submodule capable of simultaneously eliminating the influence of input image order and aligning visual information from multiview images with Large Language Models (LLMs). Besides, we designed and employed negative instructions to mitigate LVLMs' bias towards ``Yes" responses. Comprehensive experiments demonstrate the effectiveness of our method.
LGJan 17, 2025
Adaptive Spatiotemporal Augmentation for Improving Dynamic Graph LearningXu Chu, Hanlin Xue, Bingce Wang et al.
Dynamic graph augmentation is used to improve the performance of dynamic GNNs. Most methods assume temporal locality, meaning that recent edges are more influential than earlier edges. However, for temporal changes in edges caused by random noise, overemphasizing recent edges while neglecting earlier ones may lead to the model capturing noise. To address this issue, we propose STAA (SpatioTemporal Activity-Aware Random Walk Diffusion). STAA identifies nodes likely to have noisy edges in spatiotemporal dimensions. Spatially, it analyzes critical topological positions through graph wavelet coefficients. Temporally, it analyzes edge evolution through graph wavelet coefficient change rates. Then, random walks are used to reduce the weights of noisy edges, deriving a diffusion matrix containing spatiotemporal information as an augmented adjacency matrix for dynamic GNN learning. Experiments on multiple datasets show that STAA outperforms other dynamic graph augmentation methods in node classification and link prediction tasks.
CVOct 13, 2024
Towards Defining an Efficient and Expandable File Format for AI-Generated ContentsYixin Gao, Runsen Feng, Xin Li et al.
Recently, AI-generated content (AIGC) has gained significant traction due to its powerful creation capability. However, the storage and transmission of large amounts of high-quality AIGC images inevitably pose new challenges for recent file formats. To overcome this, we define a new file format for AIGC images, named AIGIF, enabling ultra-low bitrate coding of AIGC images. Unlike compressing AIGC images intuitively with pixel-wise space as existing file formats, AIGIF instead compresses the generation syntax. This raises a crucial question: Which generation syntax elements, e.g., text prompt, device configuration, etc, are necessary for compression/transmission? To answer this question, we systematically investigate the effects of three essential factors: platform, generative model, and data configuration. We experimentally find that a well-designed composable bitstream structure incorporating the above three factors can achieve an impressive compression ratio of even up to 1/10,000 while still ensuring high fidelity. We also introduce an expandable syntax in AIGIF to support the extension of the most advanced generation models to be developed in the future.
CLJun 12, 2024
Supportiveness-based Knowledge Rewriting for Retrieval-augmented Language ModelingZile Qiao, Wei Ye, Yong Jiang et al.
Retrieval-augmented language models (RALMs) have recently shown great potential in mitigating the limitations of implicit knowledge in LLMs, such as untimely updating of the latest expertise and unreliable retention of long-tail knowledge. However, since the external knowledge base, as well as the retriever, can not guarantee reliability, potentially leading to the knowledge retrieved not being helpful or even misleading for LLM generation. In this paper, we introduce Supportiveness-based Knowledge Rewriting (SKR), a robust and pluggable knowledge rewriter inherently optimized for LLM generation. Specifically, we introduce the novel concept of "supportiveness"--which represents how effectively a knowledge piece facilitates downstream tasks--by considering the perplexity impact of augmented knowledge on the response text of a white-box LLM. Based on knowledge supportiveness, we first design a training data curation strategy for our rewriter model, effectively identifying and filtering out poor or irrelevant rewrites (e.g., with low supportiveness scores) to improve data efficacy. We then introduce the direct preference optimization (DPO) algorithm to align the generated rewrites to optimal supportiveness, guiding the rewriter model to summarize augmented content that better improves the final response. Comprehensive evaluations across six popular knowledge-intensive tasks and four LLMs have demonstrated the effectiveness and superiority of SKR. With only 7B parameters, SKR has shown better knowledge rewriting capability over GPT-4, the current state-of-the-art general-purpose LLM.
MMSep 29, 2020
Sequential Reinforced 360-Degree Video Adaptive Streaming with Cross-user Attentive NetworkJun Fu, Zhibo Chen, Xiaoming Chen et al.
In the tile-based 360-degree video streaming, predicting user's future viewpoints and developing adaptive bitrate (ABR) algorithms are essential for optimizing user's quality of experience (QoE). Traditional single-user based viewpoint prediction methods fail to achieve good performance in long-term prediction, and the recently proposed reinforcement learning (RL) based ABR schemes applied in traditional video streaming can not be directly applied in the tile-based 360-degree video streaming due to the exponential action space. Therefore, we propose a sequential reinforced 360-degree video streaming scheme with cross-user attentive network. Firstly, considering different users may have the similar viewing preference on the same video, we propose a cross-user attentive network (CUAN), boosting the performance of long-term viewpoint prediction by selectively utilizing cross-user information. Secondly, we propose a sequential RL-based (360SRL) ABR approach, transforming action space size of each decision step from exponential to linear via introducing a sequential decision structure. We evaluate the proposed CUAN and 360SRL using trace-driven experiments and experimental results demonstrate that CUAN and 360SRL outperform existing viewpoint prediction and ABR approaches with a noticeable margin.
IVApr 13, 2020
Blind Quality Assessment for Image Superresolution Using Deep Two-Stream Convolutional NetworksWei Zhou, Qiuping Jiang, Yuwang Wang et al.
Numerous image superresolution (SR) algorithms have been proposed for reconstructing high-resolution (HR) images from input images with lower spatial resolutions. However, effectively evaluating the perceptual quality of SR images remains a challenging research problem. In this paper, we propose a no-reference/blind deep neural network-based SR image quality assessor (DeepSRQ). To learn more discriminative feature representations of various distorted SR images, the proposed DeepSRQ is a two-stream convolutional network including two subcomponents for distorted structure and texture SR images. Different from traditional image distortions, the artifacts of SR images cause both image structure and texture quality degradation. Therefore, we choose the two-stream scheme that captures different properties of SR inputs instead of directly learning features from one image stream. Considering the human visual system (HVS) characteristics, the structure stream focuses on extracting features in structural degradations, while the texture stream focuses on the change in textural distributions. In addition, to augment the training data and ensure the category balance, we propose a stride-based adaptive cropping approach for further improvement. Experimental results on three publicly available SR image quality databases demonstrate the effectiveness and generalization ability of our proposed DeepSRQ method compared with state-of-the-art image quality assessment algorithms.
CVMar 25, 2020
Prior-enlightened and Motion-robust Video DeblurringYa Zhou, Jianfeng Xu, Kazuyuki Tasaka et al.
Various blur distortions in video will cause negative impact on both human viewing and video-based applications, which makes motion-robust deblurring methods urgently needed. Most existing works have strong dataset dependency and limited generalization ability in handling challenging scenarios, like blur in low contrast or severe motion areas, and non-uniform blur. Therefore, we propose a PRiOr-enlightened and MOTION-robust video deblurring model (PROMOTION) suitable for challenging blurs. On the one hand, we use 3D group convolution to efficiently encode heterogeneous prior information, explicitly enhancing the scenes' perception while mitigating the output's artifacts. On the other hand, we design the priors representing blur distribution, to better handle non-uniform blur in spatio-temporal domain. Besides the classical camera shake caused global blurry, we also prove the generalization for the downstream task suffering from local blur. Extensive experiments demonstrate we can achieve the state-of-the-art performance on well-known REDS and GoPro datasets, and bring machine task gain.
LGMar 3, 2019
Asynchronous Episodic Deep Deterministic Policy Gradient: Towards Continuous Control in Computationally Complex EnvironmentsZhizheng Zhang, Jiale Chen, Zhibo Chen et al.
Deep Deterministic Policy Gradient (DDPG) has been proved to be a successful reinforcement learning (RL) algorithm for continuous control tasks. However, DDPG still suffers from data insufficiency and training inefficiency, especially in computationally complex environments. In this paper, we propose Asynchronous Episodic DDPG (AE-DDPG), as an expansion of DDPG, which can achieve more effective learning with less training time required. First, we design a modified scheme for data collection in an asynchronous fashion. Generally, for asynchronous RL algorithms, sample efficiency or/and training stability diminish as the degree of parallelism increases. We consider this problem from the perspectives of both data generation and data utilization. In detail, we re-design experience replay by introducing the idea of episodic control so that the agent can latch on good trajectories rapidly. In addition, we also inject a new type of noise in action space to enrich the exploration behaviors. Experiments demonstrate that our AE-DDPG achieves higher rewards and requires less time consuming than most popular RL algorithms in Learning to Run task which has a computationally complex environment. Not limited to the control tasks in computationally complex environments, AE-DDPG also achieves higher rewards and 2- to 4-fold improvement in sample efficiency on average compared to other variants of DDPG in MuJoCo environments. Furthermore, we verify the effectiveness of each proposed technique component through abundant ablation study.
MMDec 22, 2018
Learned Scalable Image Compression with Bidirectional Context Disentanglement NetworkZhizheng Zhang, Zhibo Chen, Jianxin Lin et al.
In this paper, we propose a learned scalable/progressive image compression scheme based on deep neural networks (DNN), named Bidirectional Context Disentanglement Network (BCD-Net). For learning hierarchical representations, we first adopt bit-plane decomposition to decompose the information coarsely before the deep-learning-based transformation. However, the information carried by different bit-planes is not only unequal in entropy but also of different importance for reconstruction. We thus take the hidden features corresponding to different bit-planes as the context and design a network topology with bidirectional flows to disentangle the contextual information for more effective compressed representations. Our proposed scheme enables us to obtain the compressed codes with scalable rates via a one-pass encoding-decoding. Experiment results demonstrate that our proposed model outperforms the state-of-the-art DNN-based scalable image compression methods in both PSNR and MS-SSIM metrics. In addition, our proposed model achieves higher performance in MS-SSIM metric than conventional scalable image codecs. Effectiveness of our technical components is also verified through sufficient ablation experiments.
MMNov 30, 2018
Hybrid Distortion Aggregated Visual Comfort Assessment for Stereoscopic Image RetargetingYa Zhou, Zhibo Chen, Weiping Li
Visual comfort is a quite important factor in 3D media service. Few research efforts have been carried out in this area especially in case of 3D content retargeting which may introduce more complicated visual distortions. In this paper, we propose a Hybrid Distortion Aggregated Visual Comfort Assessment (HDA-VCA) scheme for stereoscopic retargeted images (SRI), considering aggregation of hybrid distortions including structure distortion, information loss, binocular incongruity and semantic distortion. Specifically, a Local-SSIM feature is proposed to reflect the local structural distortion of SRI, and information loss is represented by Dual Natural Scene Statistics (D-NSS) feature extracted from the binocular summation and difference channels. Regarding binocular incongruity, visual comfort zone, window violation, binocular rivalry, and accommodation-vergence conflict of human visual system (HVS) are evaluated. Finally, the semantic distortion is represented by the correlation distance of paired feature maps extracted from original stereoscopic image and its retargeted image by using trained deep neural network. We validate the effectiveness of HDA-VCA on published Stereoscopic Image Retargeting Database (SIRD) and two stereoscopic image databases IEEE-SA and NBU 3D-VCA. The results demonstrate HDA-VCA's superior performance in handling hybrid distortions compared to state-of-the-art VCA schemes.
LGApr 5, 2018
Review of Deep LearningRong Zhang, Weiping Li, Tong Mo
In recent years, China, the United States and other countries, Google and other high-tech companies have increased investment in artificial intelligence. Deep learning is one of the current artificial intelligence research's key areas. This paper analyzes and summarizes the latest progress and future research directions of deep learning. Firstly, three basic models of deep learning are outlined, including multilayer perceptrons, convolutional neural networks, and recurrent neural networks. On this basis, we further analyze the emerging new models of convolution neural networks and recurrent neural networks. This paper then summarizes deep learning's applications in many areas of artificial intelligence, including speech processing, computer vision, natural language processing and so on. Finally, this paper discusses the existing problems of deep learning and gives the corresponding possible solutions.
CVJan 30, 2018
Video-based Sign Language Recognition without Temporal SegmentationJie Huang, Wengang Zhou, Qilin Zhang et al.
Millions of hearing impaired people around the world routinely use some variants of sign languages to communicate, thus the automatic translation of a sign language is meaningful and important. Currently, there are two sub-problems in Sign Language Recognition (SLR), i.e., isolated SLR that recognizes word by word and continuous SLR that translates entire sentences. Existing continuous SLR methods typically utilize isolated SLRs as building blocks, with an extra layer of preprocessing (temporal segmentation) and another layer of post-processing (sentence synthesis). Unfortunately, temporal segmentation itself is non-trivial and inevitably propagates errors into subsequent steps. Worse still, isolated SLR methods typically require strenuous labeling of each word separately in a sentence, severely limiting the amount of attainable training data. To address these challenges, we propose a novel continuous sign recognition framework, the Hierarchical Attention Network with Latent Space (LS-HAN), which eliminates the preprocessing of temporal segmentation. The proposed LS-HAN consists of three components: a two-stream Convolutional Neural Network (CNN) for video feature representation generation, a Latent Space (LS) for semantic gap bridging, and a Hierarchical Attention Network (HAN) for latent space based recognition. Experiments are carried out on two large scale datasets. Experimental results demonstrate the effectiveness of the proposed framework.
AINov 22, 2017
An influence-based fast preceding questionnaire model for elderly assessmentsTong Mo, Rong Zhang, Weiping Li et al.
To improve the efficiency of elderly assessments, an influence-based fast preceding questionnaire model (FPQM) is proposed. Compared with traditional assessments, the FPQM optimizes questionnaires by reordering their attributes. The values of low-ranking attributes can be predicted by the values of the high-ranking attributes. Therefore, the number of attributes can be reduced without redesigning the questionnaires. A new function for calculating the influence of the attributes is proposed based on probability theory. Reordering and reducing algorithms are given based on the attributes' influences. The model is verified through a practical application. The practice in an elderly-care company shows that the FPQM can reduce the number of attributes by 90.56% with a prediction accuracy of 98.39%. Compared with other methods, such as the Expert Knowledge, Rough Set and C4.5 methods, the FPQM achieves the best performance. In addition, the FPQM can also be applied to other questionnaires.
ITJul 24, 2016
Joint Source-Channel Secrecy Using Uncoded Schemes: Towards Secure Source BroadcastLei Yu, Houqiang Li, Weiping Li
This paper investigates a joint source-channel secrecy problem for the Shannon cipher broadcast system. We suppose list secrecy is applied, i.e., a wiretapper is allowed to produce a list of reconstruction sequences and the secrecy is measured by the minimum distortion over the entire list. For discrete communication cases, we propose a permutation-based uncoded scheme, which cascades a random permutation with a symbol-by-symbol mapping. Using this scheme, we derive an inner bound for the admissible region of secret key rate, list rate, wiretapper distortion, and distortions of legitimate users. For the converse part, we easily obtain an outer bound for the admissible region from an existing result. Comparing the outer bound with the inner bound shows that the proposed scheme is optimal under certain conditions. Besides, we extend the proposed scheme to the scalar and vector Gaussian communication scenarios, and characterize the corresponding performance as well. For these two cases, we also propose another uncoded scheme, orthogonal-transform-based scheme, which achieves the same performance as the permutation-based scheme. Interestingly, by introducing the random permutation or the random orthogonal transform into the traditional uncoded scheme, the proposed uncoded schemes, on one hand, provide a certain level of secrecy, and on the other hand, do not lose any performance in terms of the distortions for legitimate users.
CVApr 5, 2016
Comparative Deep Learning of Hybrid Representations for Image RecommendationsChenyi Lei, Dong Liu, Weiping Li et al.
In many image-related tasks, learning expressive and discriminative representations of images is essential, and deep learning has been studied for automating the learning of such representations. Some user-centric tasks, such as image recommendations, call for effective representations of not only images but also preferences and intents of users over images. Such representations are termed \emph{hybrid} and addressed via a deep learning approach in this paper. We design a dual-net deep network, in which the two sub-networks map input images and preferences of users into a same latent semantic space, and then the distances between images and users in the latent space are calculated to make decisions. We further propose a comparative deep learning (CDL) method to train the deep network, using a pair of images compared against one user to learn the pattern of their relative distances. The CDL embraces much more training data than naive deep learning, and thus achieves superior performance than the latter, with no cost of increasing network complexity. Experimental results with real-world data sets for image recommendations have shown the proposed dual-net network and CDL greatly outperform other state-of-the-art image recommendation solutions.
ITJan 18, 2016
Source-Channel Secrecy for Shannon Cipher SystemLei Yu, Houqiang Li, Weiping Li
Recently, a secrecy measure based on list-reconstruction has been proposed [2], in which a wiretapper is allowed to produce a list of $2^{mR_{L}}$ reconstruction sequences and the secrecy is measured by the minimum distortion over the entire list. In this paper, we show that this list secrecy problem is equivalent to the one with secrecy measured by a new quantity \emph{lossy-equivocation}, which is proven to be the minimum optimistic 1-achievable source coding rate (the minimum coding rate needed to reconstruct the source within target distortion with positive probability for \emph{infinitely many blocklengths}) of the source with the wiretapped signal as two-sided information, and also can be seen as a lossy extension of conventional equivocation. Upon this (or list) secrecy measure, we study source-channel secrecy problem in the discrete memoryless Shannon cipher system with \emph{noisy} wiretap channel. Two inner bounds and an outer bound on the achievable region of secret key rate, list rate, wiretapper distortion, and distortion of legitimate user are given. The inner bounds are derived by using uncoded scheme and (operationally) separate scheme, respectively. Thanks to the equivalence between lossy-equivocation secrecy and list secrecy, information spectrum method is leveraged to prove the outer bound. As special cases, the admissible region for the case of degraded wiretap channel or lossless communication for legitimate user has been characterized completely. For both these two cases, separate scheme is proven to be optimal. Interestingly, however, separation indeed suffers performance loss for other certain cases. Besides, we also extend our results to characterize the achievable region for Gaussian communication case. As a side product optimistic lossy source coding has also been addressed.