LG | Mar 8, 2022 | Code
Dual Lottery Ticket Hypothesis
Yue Bai, Huan Wang, Zhiqiang Tao et al.
Fully exploiting the learning capacity of neural networks requires overparameterized dense networks. On the other side, directly training sparse neural networks typically results in unsatisfactory performance. Lottery Ticket Hypothesis (LTH) provides a novel view to investigate sparse network training and maintain its capacity. Concretely, it claims there exist winning tickets from a randomly initialized network found by iterative magnitude pruning and preserving promising trainability (or we say being in trainable condition). In this work, we regard the winning ticket from LTH as the subnetwork which is in trainable condition and its performance as our benchmark, then go from a complementary direction to articulate the Dual Lottery Ticket Hypothesis (DLTH): Randomly selected subnetworks from a randomly initialized dense network can be transformed into a trainable condition and achieve admirable performance compared with LTH -- random tickets in a given lottery pool can be transformed into winning tickets. Specifically, by using uniform-randomly selected subnetworks to represent the general cases, we propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our DLTH. Concretely, we introduce a regularization term to borrow learning capacity and realize information extrusion from the weights which will be masked. After finishing the transformation for the randomly selected subnetworks, we conduct the regular finetuning to evaluate the model using fair comparisons with LTH and other strong baselines. Extensive experiments on several public datasets and comparisons with competitive approaches validate our DLTH as well as the effectiveness of the proposed model RST. Our work is expected to pave a way for inspiring new research directions of sparse network training in the future. Our code is available at https://github.com/yueb17/DLTH.
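The transformation idea in the abstract above (pick a subnetwork uniformly at random, then use a regularization term to squeeze information out of the weights that will be masked) can be illustrated with a toy sketch. This is not the authors' RST implementation: the L2 form of the penalty, the linearly growing coefficient `lam`, and the names `random_mask`, `penalty_grad`, and `transform` are all assumptions made for illustration.

```python
import random

def random_mask(n, sparsity, rng):
    """Choose uniformly at random which of n weights to prune (0) or keep (1)."""
    n_pruned = int(round(n * sparsity))
    pruned = set(rng.sample(range(n), n_pruned))
    return [0 if i in pruned else 1 for i in range(n)]

def penalty_grad(weights, mask, lam):
    """Gradient of (lam / 2) * sum(w_i^2) taken over the to-be-masked weights only."""
    return [lam * w if m == 0 else 0.0 for w, m in zip(weights, mask)]

def transform(weights, mask, task_grad, lr=0.1, lam_step=0.1, steps=50):
    """Gradient descent with a penalty on to-be-masked weights that grows each step.

    Assumption: the penalty coefficient increases linearly; the abstract only
    says that a regularization term is used.
    """
    w = list(weights)
    lam = 0.0
    for _ in range(steps):
        lam += lam_step
        g_task = task_grad(w)
        g_pen = penalty_grad(w, mask, lam)
        w = [wi - lr * (gt + gp) for wi, gt, gp in zip(w, g_task, g_pen)]
    # Apply the mask only after the masked weights have been driven toward zero.
    return [wi * m for wi, m in zip(w, mask)], w

rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(10)]
mask = random_mask(len(weights), sparsity=0.5, rng=rng)
# Toy task with zero gradient, so only the penalty acts on the weights.
sparse_w, pre_mask_w = transform(weights, mask, task_grad=lambda w: [0.0] * len(w))
```

In this toy run the kept weights are untouched, while the weights selected for removal are already close to zero before the mask is applied, so the final masking step changes the network very little.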
CV | Jun 1, 2023 | Code
Cooperative Hardware-Prompt Learning for Snapshot Compressive Imaging
Jiamian Wang, Zongliang Wu, Yulun Zhang et al.
Existing reconstruction models in snapshot compressive imaging systems (SCI) are trained with a single well-calibrated hardware instance, making their performance vulnerable to hardware shifts and limited in adapting to multiple hardware configurations. To facilitate cross-hardware learning, previous efforts attempt to directly collect multi-hardware data and perform centralized training, which is impractical due to severe user data privacy concerns and hardware heterogeneity across different platforms/institutions. In this study, we explicitly consider data privacy and heterogeneity in cooperatively optimizing SCI systems by proposing a Federated Hardware-Prompt learning (FedHP) framework. Rather than mitigating the client drift by rectifying the gradients, which only takes effect on the learning manifold but fails to solve the heterogeneity rooted in the input data space, FedHP learns a hardware-conditioned prompter to align inconsistent data distribution across clients, serving as an indicator of the data inconsistency among different hardware (e.g., coded apertures). Extensive experimental results demonstrate that the proposed FedHP coordinates the pre-trained model to multiple hardware configurations, outperforming prevalent FL frameworks by 0.35dB under challenging heterogeneous settings. Moreover, a Snapshot Spectral Heterogeneous Dataset has been built upon multiple practical SCI systems. Data and code are available at https://github.com/Jiamian-Wang/FedHP-Snapshot-Compressive-Imaging
LG | Oct 13, 2022 | Code
Parameter-Efficient Masking Networks
Yue Bai, Huan Wang, Xu Ma et al.
A deeper network structure generally handles more complicated non-linearity and performs more competitively. Nowadays, advanced network designs often contain a large number of repetitive structures (e.g., Transformer). They empower the network capacity to a new level but also increase the model size inevitably, which is unfriendly to either model restoring or transferring. In this study, we are the first to investigate the representative potential of fixed random weights with limited unique values by learning diverse masks and introduce the Parameter-Efficient Masking Networks (PEMN). It also naturally leads to a new paradigm for model compression to diminish the model size. Concretely, motivated by the repetitive structures in modern neural networks, we utilize one randomly initialized layer, accompanied with different masks, to convey different feature mappings and represent repetitive network modules. Therefore, the model can be expressed as one-layer with a bunch of masks, which significantly reduces the model storage cost. Furthermore, we enhance our strategy by learning masks for a model filled by padding a given random weight vector. In this way, our method can further lower the space complexity, especially for models without many repetitive architectures. We validate the potential of PEMN learning masks on random weights with limited unique values and test its effectiveness for a new compression paradigm based on different network architectures. Code is available at https://github.com/yueb17/PEMN
CV | Mar 16, 2023 | Code
Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution
Jiamian Wang, Huan Wang, Yulun Zhang et al.
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures. However, prevailing SR models suffer from prohibitive memory footprint and intensive computations, which limits further deployment on edge devices. This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead. Two main challenges remain in applying pruning methods for SR. First, the widely-used filter pruning technique reflects limited granularity and restricted adaptability to diverse network structures. Second, existing pruning methods generally operate upon a pre-trained network for the sparse structure determination, hard to get rid of dense model training in the traditional SR paradigm. To address these challenges, we adopt unstructured pruning with sparse models directly trained from scratch. Specifically, we propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method by optimizing the sparse structure of a randomly initialized network at each iteration and tweaking unimportant weights with a small amount proportional to the magnitude scale on-the-fly. We observe that the proposed ISS-P can dynamically learn sparse structures adapting to the optimization process and preserve the sparse model's trainability by yielding a more regularized gradient throughput. Experiments on benchmark datasets demonstrate the effectiveness of the proposed ISS-P over diverse network architectures. Code is available at https://github.com/Jiamian-Wang/Iterative-Soft-Shrinkage-SR
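The contrast between hard pruning and the soft shrinkage described above can be shown in a few lines. The sketch below is a simplified reading of the abstract, not the ISS-P code: at each iteration it re-selects the smallest-magnitude weights and shrinks them by a small fraction of their own magnitude instead of zeroing them. The function name `soft_shrink_step` and the parameters `sparsity` and `alpha` are assumptions.

```python
def soft_shrink_step(weights, sparsity, alpha):
    """Shrink the smallest-magnitude weights by a fraction alpha of their value.

    The set of "unimportant" weights is recomputed on every call, so a weight
    that grows back during training can leave that set later.
    """
    n_small = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    small = set(order[:n_small])
    return [w * (1.0 - alpha) if i in small else w for i, w in enumerate(weights)]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
shrunk = soft_shrink_step(weights, sparsity=0.5, alpha=0.1)
```

Because the shrunk weights stay nonzero, gradients can still flow through them, which is one plausible way to read the abstract's remark about preserving trainability.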
CL | Sep 29, 2024 | Code
Does RAG Introduce Unfairness in LLMs? Evaluating Fairness in Retrieval-Augmented Generation Systems
Xuyang Wu, Shuowei Li, Hsin-Tai Wu et al.
Retrieval-Augmented Generation (RAG) has recently gained significant attention for its enhanced ability to integrate external knowledge sources into open-domain question answering (QA) tasks. However, it remains unclear how these models address fairness concerns, particularly with respect to sensitive attributes such as gender, geographic location, and other demographic factors. First, as language models evolve to prioritize utility, like improving exact match accuracy, fairness considerations may have been largely overlooked. Second, the complex, multi-component architecture of RAG methods poses challenges in identifying and mitigating biases, as each component is optimized for distinct objectives. In this paper, we aim to empirically evaluate fairness in several RAG methods. We propose a fairness evaluation framework tailored to RAG, using scenario-based questions and analyzing disparities across demographic attributes. Our experimental results indicate that, despite recent advances in utility-driven optimization, fairness issues persist in both the retrieval and generation stages. These findings underscore the need for targeted interventions to address fairness concerns throughout the RAG pipeline. The dataset and code used in this study are publicly available at this GitHub repository: https://github.com/elviswxy/RAG_fairness
CL | Feb 21, 2025 | Code
Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning
Xuyang Wu, Jinming Nian, Ting-Ruen Wei et al.
Recent advances in large language models (LLMs) have enabled automatic generation of chain-of-thought (CoT) reasoning, leading to strong performance on tasks such as math and code. However, when reasoning steps reflect social stereotypes (e.g., those related to gender, race or age), they can reinforce harmful associations and lead to misleading conclusions. We present the first systematic evaluation of social bias within LLM-generated reasoning, focusing on reasoning language models (e.g., DeepSeek-R1, OpenAI o1) that natively produce reasoning chains as part of their answers. Using the BBQ dataset, we analyze both prediction accuracy and reasoning bias across a broad spectrum of models, including instruction-tuned and CoT-augmented variants of DeepSeek-R1 (8B/32B), ChatGPT, and other open-source LLMs. We quantify how biased reasoning steps correlate with incorrect predictions and often lead to stereotype expression. To mitigate reasoning-induced bias, we propose Answer Distribution as Bias Proxy (ADBP), a lightweight mitigation method that detects bias by tracking how model predictions change across incremental reasoning steps. ADBP outperforms Stereotype-free Reasoning Pattern (SfRP) baseline in most cases, mitigating bias and improving the accuracy of LLM outputs. Evaluation and mitigation code is available at https://github.com/elviswxy/LLM_reasoning_bias.
CV | Sep 25, 2025 | Code
X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning
Prasanna Reddy Pulakurthi, Jiamian Wang, Majid Rabbani et al.
Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design presents two limitations. Low-quality text-video data pairs could compromise the retrieval, yet are hard to identify and examine. Cosine similarity alone provides no explanation for the ranking results, limiting the interpretability. We ask: can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built upon LLM CoT reasoning in place of the embedding model-based similarity ranking. We first expand the existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and complete ranking. X-CoT empirically improves the retrieval performance and produces detailed rationales. It also facilitates the model behavior and data quality analysis. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
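How pairwise comparison steps can produce both a complete ranking and a trail of rationales is sketched below. This is only an illustration: the real system uses an LLM as the judge, whereas `overlap_judge` here is a hypothetical stand-in that counts shared words between the query and a video caption, and `rank_by_pairwise` is an assumed helper name.

```python
from functools import cmp_to_key

def overlap_judge(query, cap_a, cap_b):
    """Hypothetical stand-in for an LLM pairwise judgment (word overlap with the query)."""
    q = set(query.lower().split())
    score_a = len(q & set(cap_a.lower().split()))
    score_b = len(q & set(cap_b.lower().split()))
    if score_a == score_b:
        winner = "tie"
    else:
        winner = "a" if score_a > score_b else "b"
    reason = f"'{cap_a}' shares {score_a} query words, '{cap_b}' shares {score_b}"
    return winner, reason

def rank_by_pairwise(query, captions, judge):
    """Sort candidates using only pairwise judgments and keep every rationale."""
    rationales = []

    def compare(a, b):
        winner, reason = judge(query, a, b)
        rationales.append(reason)
        if winner == "tie":
            return 0
        return -1 if winner == "a" else 1

    return sorted(captions, key=cmp_to_key(compare)), rationales

query = "a dog catches a frisbee on the beach"
captions = [
    "a cat sleeps on a sofa",
    "a dog runs on the beach",
    "a dog catches a frisbee on the beach",
]
ranked, rationales = rank_by_pairwise(query, captions, overlap_judge)
```

Each entry in `rationales` corresponds to one comparison the sort performed, so the final order can be traced back to individual pairwise decisions.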
CL | Jun 25, 2024 | Code
Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts
Xuyang Wu, Yuan Wang, Hsin-Tai Wu et al.
Large vision-language models (LVLMs) have recently achieved significant progress, demonstrating strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs address demographic biases in real life, especially the disparities across attributes such as gender, skin tone, age and race. In this paper, we empirically investigate visual fairness in several mainstream LVLMs by auditing their performance disparities across demographic attributes using public fairness benchmark datasets (e.g., FACET, UTKFace). Our fairness evaluation framework employs direct and single-choice question prompts on visual question-answering/classification tasks. Despite advancements in visual understanding, our zero-shot prompting results show that both open-source and closed-source LVLMs continue to exhibit fairness issues across different prompts and demographic groups. Furthermore, we propose a potential multi-modal Chain-of-thought (CoT) based strategy for unfairness mitigation, applicable to both open-source and closed-source LVLMs. This approach enhances transparency and offers a scalable solution for addressing fairness, providing a solid foundation for future research and practical efforts in unfairness mitigation. The dataset and code used in this study are publicly available at this GitHub Repository.
IV | Dec 31, 2021 | Code
Modeling Mask Uncertainty in Hyperspectral Image Reconstruction
Jiamian Wang, Yulun Zhang, Xin Yuan et al.
Recently, hyperspectral imaging (HSI) has attracted increasing research attention, especially for the ones based on a coded aperture snapshot spectral imaging (CASSI) system. Existing deep HSI reconstruction models are generally trained on paired data to retrieve original signals upon 2D compressed measurements given by a particular optical hardware mask in CASSI, during which the mask largely impacts the reconstruction performance and could work as a "model hyperparameter" governing on data augmentations. This mask-specific training style will lead to a hardware miscalibration issue, which sets up barriers to deploying deep HSI models among different hardware and noisy environments. To address this challenge, we introduce mask uncertainty for HSI with a complete variational Bayesian learning treatment and explicitly model it through a mask decomposition inspired by real hardware. Specifically, we propose a novel Graph-based Self-Tuning (GST) network to reason uncertainties adapting to varying spatial structures of masks among different hardware. Moreover, we develop a bilevel optimization framework to balance HSI reconstruction and uncertainty estimation, accounting for the hyperparameter property of masks. Extensive experimental results and model discussions validate the effectiveness (over 33/30 dB) of the proposed GST method under two miscalibration scenarios and demonstrate a highly competitive performance compared with the state-of-the-art well-calibrated methods. Our code and pre-trained model are available at https://github.com/Jiamian-Wang/mask_uncertainty_spectral_SCI
CV | Mar 26, 2024
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang, Guohao Sun, Pichao Wang et al.
The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single text embedding may be less expressive to capture the video embedding and empower the retrieval. In this study, we propose a new stochastic text modeling method T-MASS, i.e., text is modeled as a stochastic embedding, to enrich text embedding with a flexible and resilient semantic range, yielding a text mass. To be specific, we introduce a similarity-aware radius module to adapt the scale of the text mass upon the given text-video pairs. Plus, we design and develop a support text regularization to further control the text mass during the training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables the determination of precise text embeddings for relevant pairs. Our experimental results show a substantial improvement of T-MASS over baseline (3% to 6.3% by R@1). Also, T-MASS achieves state-of-the-art performance on five benchmark datasets, including MSRVTT, LSMDC, DiDeMo, VATEX, and Charades.
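The idea of treating a text as a region rather than a single point in embedding space can be illustrated with a small sketch. It is not the T-MASS implementation: the radius is a fixed number here (the abstract describes a learned, similarity-aware radius module), the Gaussian noise and the max-over-samples scoring are assumptions, and the names `sample_text_mass` and `mass_similarity` are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sample_text_mass(text_emb, radius, n_samples, rng):
    """Draw stochastic text embeddings around text_emb (assumed form: t + radius * noise)."""
    return [[t + radius * rng.gauss(0.0, 1.0) for t in text_emb] for _ in range(n_samples)]

def mass_similarity(text_emb, video_emb, radius, n_samples, rng):
    """Score a text-video pair by the best match within the sampled text mass.

    The original point embedding is kept as one candidate (a choice made for
    this sketch), so the score never falls below the plain cosine similarity.
    """
    candidates = [list(text_emb)] + sample_text_mass(text_emb, radius, n_samples, rng)
    return max(cosine(c, video_emb) for c in candidates)

rng = random.Random(0)
text_emb = [1.0, 0.0, 0.0]
video_emb = [0.8, 0.6, 0.0]
point_sim = cosine(text_emb, video_emb)
mass_sim = mass_similarity(text_emb, video_emb, radius=0.3, n_samples=20, rng=rng)
```

With `radius=0` every sample collapses onto the original embedding and the score reduces to ordinary cosine similarity, which makes the radius the single knob controlling how far the text is allowed to reach toward the video.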
CV | Mar 17, 2024
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
Guohao Sun, Can Qin, Jiamian Wang et al.
Recent advances in vision-language models have shown notable generalization in broad tasks through visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language models (LLMs) becomes the whole network's bottleneck. To improve cross-modality alignment, existing works usually consider more visual instruction data covering a broader range of vision tasks to fine-tune the model for question-answering, which, however, is costly to obtain and has not thoroughly explored the rich contextual information contained in images. This paper first attempts to harness the overlooked context within visual instruction data, training the model in a self-supervised manner to "learn" how to ask high-quality questions. In this way, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing the visual clue and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data shows a performance improvement compared with traditional visual-instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts.
IR | Apr 4, 2024
Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers
Yuan Wang, Xuyang Wu, Hsin-Tai Wu et al.
The integration of Large Language Models (LLMs) in information retrieval has raised a critical reevaluation of fairness in the text-ranking models. LLMs, such as GPT models and Llama2, have shown effectiveness in natural language understanding tasks, and prior works (e.g., RankGPT) have also demonstrated that the LLMs exhibit better performance than the traditional ranking models in the ranking task. However, their fairness remains largely unexplored. This paper presents an empirical study evaluating these LLMs using the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which are historically underrepresented in search outcomes. Our analysis delves into how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking algorithms. We assess fairness from both user and content perspectives, contributing an empirical benchmark for evaluating LLMs as the fair ranker.
LG | Feb 12
DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels
Haolei Bai, Lingcheng Kong, Xueyi Chen et al.
Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.
AI | Oct 27, 2025
Latent Chain-of-Thought for Visual Reasoning
Guohao Sun, Hang Hua, Jian Wang et al.
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
CV | Oct 1, 2025
Visual Self-Refinement for Autoregressive Models
Jiamian Wang, Ziqi Zhou, Chaithanya Kumar Mummadi et al.
Autoregressive models excel in sequential modeling and have proven to be effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to suboptimal results. This work proposes a plug-and-play refinement module to enhance the complex spatial correspondence modeling within the generated visual sequence. This module operates as a post-pretraining step to jointly refine all generated tokens of autoregressive model, enhancing vision-language modeling under a shared sequential prediction framework. By leveraging global context and relationship across the tokens, our method mitigates the error accumulation issue within the sequential generation. Experiments demonstrate that the proposed method improves the generation quality, enhancing the model's ability to produce semantically consistent results.
CV | Jun 28, 2024
STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering
Guohao Sun, Can Qin, Huazhu Fu et al.
Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce Self-Training Large Language and Vision Assistant for Medicine (STLLaVA-Med). The proposed method is designed to train a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g., GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance with the utilization of only 9% of the medical data.
LG | Jun 10, 2024
Reinforced Compressive Neural Architecture Search for Versatile Adversarial Robustness
Dingrong Wang, Hitesh Sapkota, Zhiqiang Tao et al.
Prior works on neural architecture search (NAS) for adversarial robustness have discovered that a lightweight and adversarially robust neural network architecture could exist in a non-robust large teacher network, generally disclosed by heuristic rules through statistical analysis and neural architecture search. However, heuristic methods cannot uniformly handle different adversarial attacks and "teacher" network capacity. To solve this challenge, we propose a Reinforced Compressive Neural Architecture Search (RC-NAS) for Versatile Adversarial Robustness. Specifically, we define task settings that compose datasets, adversarial attacks, and teacher network information. Given diverse tasks, we conduct a novel dual-level training paradigm that consists of a meta-training and a fine-tuning phase to effectively expose the RL agent to diverse attack scenarios (in meta-training), and make it adapt quickly to locate a sub-network (in fine-tuning) for any previously unseen scenarios. Experiments show that our framework could achieve adaptive compression towards different initial teacher networks, datasets, and adversarial attacks, resulting in more lightweight and adversarially robust architectures.
CV | Jun 3, 2024
Prototypical Transformer as Unified Motion Learners
Cheng Han, Yawen Lu, Guohao Sun et al.
In this work, we introduce the Prototypical Transformer (ProtoFormer), a general and unified framework that approaches various motion tasks from a prototype perspective. ProtoFormer seamlessly integrates prototype learning with Transformer by thoughtfully considering motion dynamics, introducing two innovative designs. First, Cross-Attention Prototyping discovers prototypes based on signature motion patterns, providing transparency in understanding motion scenes. Second, Latent Synchronization guides feature representation learning via prototypes, effectively mitigating the problem of motion uncertainty. Empirical results demonstrate that our approach achieves competitive performance on popular motion tasks such as optical flow and scene depth. Furthermore, it exhibits generality across various downstream tasks, including object tracking and video stabilization.
IV | Sep 24, 2022
S^2-Transformer for Mask-Aware Hyperspectral Image Reconstruction
Jiamian Wang, Kunpeng Li, Yulun Zhang et al.
Snapshot compressive imaging (SCI) has emerged as a novel way of capturing hyperspectral images. It operates an optical encoder to compress the 3D data into a 2D measurement and adopts a software decoder for the signal reconstruction. Recently, a representative SCI set-up of coded aperture snapshot compressive imager (CASSI) with Transformer reconstruction backend has shown high-fidelity sensing performance. However, dominant spatial and spectral attention designs show limitations in hyperspectral modeling. The spatial attention values describe the inter-pixel correlation but overlook the across-spectra variation within each pixel. The spectral attention size is unscalable to the token spatial size and thus bottlenecks information allocation. Besides, CASSI entangles the spatial and spectral information into a 2D measurement, placing a barrier for information disentanglement and modeling. In addition, CASSI blocks the light with a physical binary mask, yielding the masked data loss. To tackle the above challenges, we propose a spatial-spectral (S^2-) Transformer implemented by a parallel attention design and a mask-aware learning strategy. Firstly, we systematically explore pros and cons of different spatial (-spectral) attention designs, based on which we find performing both attentions in parallel well disentangles and models the blended information. Secondly, the masked pixels induce higher prediction difficulty and should be treated differently from unmasked ones. We adaptively prioritize the loss penalty attributing to the mask structure by referring to the mask-encoded prediction as an uncertainty estimator. We theoretically discuss the distinct convergence tendencies between masked/unmasked regions of the proposed learning strategy. Extensive experiments demonstrate that on average, the results of the proposed method are superior to the state-of-the-art method.
CV | Dec 18, 2021
Adversarial Memory Networks for Action Prediction
Zhiqiang Tao, Yue Bai, Handong Zhao et al.
Action prediction aims to infer the forthcoming human action with partially-observed videos, which is a challenging task due to the limited information underlying early observations. Existing methods mainly adopt a reconstruction strategy to handle this task, expecting to learn a single mapping function from partial observations to full videos to facilitate the prediction process. In this study, we propose adversarial memory networks (AMemNet) to generate the "full video" feature conditioning on a partial video query from two new aspects. Firstly, a key-value structured memory generator is designed to memorize different partial videos as key memories and dynamically write full videos in value memories with gating mechanism and querying attention. Secondly, we develop a class-aware discriminator to guide the memory generator to deliver not only realistic but also discriminative full video features upon adversarial training. The final prediction result of AMemNet is given by late fusion over RGB and optical flow streams. Extensive experimental results on two benchmark video datasets, UCF-101 and HMDB51, are provided to demonstrate the effectiveness of the proposed AMemNet model over state-of-the-art methods.
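The late fusion step mentioned at the end of the abstract above can be sketched in a few lines. The weighted average used here is an assumption (the abstract does not state the fusion rule), and `late_fusion` and `rgb_weight` are illustrative names.

```python
def late_fusion(rgb_scores, flow_scores, rgb_weight=0.5):
    """Combine per-class scores from two streams and return (predicted class, fused scores)."""
    fused = [
        rgb_weight * r + (1.0 - rgb_weight) * f
        for r, f in zip(rgb_scores, flow_scores)
    ]
    predicted = max(range(len(fused)), key=lambda i: fused[i])
    return predicted, fused

rgb = [0.6, 0.3, 0.1]   # class scores from the RGB stream
flow = [0.2, 0.7, 0.1]  # class scores from the optical flow stream
label, fused = late_fusion(rgb, flow)
```

The two streams disagree in this toy example; with equal weights the fused prediction follows the more confident flow stream, while a larger `rgb_weight` would tip it back toward the RGB prediction.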
IV | Aug 17, 2021
A Simple and Efficient Reconstruction Backbone for Snapshot Compressive Imaging
Jiamian Wang, Yulun Zhang, Xin Yuan et al.
The emerging technology of snapshot compressive imaging (SCI) enables capturing high dimensional (HD) data in an efficient way. It is generally implemented by two components: an optical encoder that compresses HD signals into a 2D measurement and an algorithm decoder that retrieves the HD data upon the hardware-encoded measurement. Over a broad range of SCI applications, hyperspectral imaging (HSI) and video compressive sensing have received significant research attention in recent years. Among existing SCI reconstruction algorithms, deep learning-based methods stand out for their promising performance and efficient inference. However, the deep reconstruction network may suffer from overlarge model size and highly-specialized network design, which inevitably lead to costly training time, high memory usage, and limited flexibility, thus discouraging the deployments of SCI systems in practical scenarios. In this paper, we tackle the above challenges by proposing a simple yet highly efficient reconstruction method, namely stacked residual network (SRN), by revisiting the residual learning strategy with nested structures and spatial-invariant property. The proposed SRN empowers high-fidelity data retrieval with fewer computation operations and negligible model size compared with existing networks, and also serves as a versatile backbone applicable for both hyperspectral and video data. Based on the proposed backbone, we first develop the channel attention enhanced SRN (CAE-SRN) to explore the spectral inter-dependencies for fine-grained spatial estimation in HSI. We then employ SRN as a deep denoiser and incorporate it into a generalized alternating projection (GAP) framework -- resulting in GAP-SRN -- to handle the video compressive sensing task. Experimental results demonstrate the state-of-the-art performance and high computational efficiency of the proposed SRN on two SCI applications.
LG | Jul 9, 2021
Automated Graph Learning via Population Based Self-Tuning GCN
Ronghang Zhu, Zhiqiang Tao, Yaliang Li et al.
Owing to the remarkable capability of extracting effective graph embeddings, graph convolutional network (GCN) and its variants have been successfully applied to a broad range of tasks, such as node classification, link prediction, and graph classification. Traditional GCN models suffer from the issues of overfitting and oversmoothing, while some recent techniques like DropEdge could alleviate these issues and thus enable the development of deep GCN. However, training GCN models is non-trivial, as it is sensitive to the choice of hyperparameters such as dropout rate and learning weight decay, especially for deep GCN models. In this paper, we aim to automate the training of GCN models through hyperparameter optimization. To be specific, we propose a self-tuning GCN approach with an alternate training algorithm, and further extend our approach by incorporating the population based training scheme. Experimental results on three benchmark datasets demonstrate the effectiveness of our approaches on optimizing multi-layer GCN, compared with several representative baselines.
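The population based training scheme referred to above follows a generic exploit-and-explore pattern, sketched here for the two hyperparameters the abstract names (dropout rate and weight decay). This is the textbook pattern rather than the paper's algorithm: model weights are left out, and `pbt_step`, `frac`, and `factors` are assumed names and values.

```python
import random

def pbt_step(population, rng, frac=0.25, factors=(0.8, 1.2)):
    """One exploit/explore round over a population of training runs.

    Each member is a dict with 'hparams' (name -> value) and 'score' (higher is
    better). The worst members copy hyperparameters from a top member (exploit)
    and perturb each value by a random factor (explore).
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    k = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:k], ranked[-k:]
    for member in bottom:
        donor = rng.choice(top)
        member["hparams"] = {
            name: value * rng.choice(factors)
            for name, value in donor["hparams"].items()
        }
    return population

rng = random.Random(0)
population = [
    {"hparams": {"dropout": 0.1, "weight_decay": 5e-4}, "score": 0.6},
    {"hparams": {"dropout": 0.3, "weight_decay": 1e-4}, "score": 0.9},
    {"hparams": {"dropout": 0.5, "weight_decay": 1e-3}, "score": 0.4},
    {"hparams": {"dropout": 0.7, "weight_decay": 5e-3}, "score": 0.7},
]
pbt_step(population, rng)
```

In a full training loop this step would alternate with ordinary training and re-evaluation of every member, so hyperparameters are adjusted while the models train rather than fixed in advance.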
CV · Sep 14, 2020
Collaborative Attention Mechanism for Multi-View Action Recognition
Yue Bai, Zhiqiang Tao, Lichen Wang et al.
Multi-view action recognition (MVAR) leverages complementary temporal information from different views to improve the learning performance. Obtaining informative view-specific representations plays an essential role in MVAR. Attention has been widely adopted as an effective strategy for discovering discriminative cues underlying temporal data. However, most existing MVAR methods only utilize attention to extract a representation for each view individually, ignoring the potential to dig out latent patterns based on mutual-support information in attention space. To this end, we propose a collaborative attention mechanism (CAM) for solving the MVAR problem in this paper. The proposed CAM detects the attention differences among multiple views and adaptively integrates frame-level information so that the views benefit each other. Specifically, we extend the long short-term memory (LSTM) to a Mutual-Aid RNN (MAR) to achieve the multi-view collaboration process. CAM takes advantage of the view-specific attention pattern of one view to guide another view and discover potential information that the latter can hardly explore on its own. It paves a novel way to leverage attention information and enhances multi-view representation learning. Extensive experiments on four action datasets illustrate that the proposed CAM achieves better results for each view and also boosts multi-view performance.
LG · Apr 9, 2020
Learnable Subspace Clustering
Jun Li, Hongfu Liu, Zhiqiang Tao et al.
This paper studies the large-scale subspace clustering (LSSC) problem with millions of data points. Many popular subspace clustering methods cannot directly handle the LSSC problem, although they have been considered state-of-the-art methods for small-scale data. A basic reason is that these methods often choose all data points as a big dictionary to build huge coding models, which results in high time and space complexity. In this paper, we develop a learnable subspace clustering paradigm to efficiently solve the LSSC problem. The key idea is to learn a parametric function to partition the high-dimensional subspaces into their underlying low-dimensional subspaces, instead of incurring the expensive costs of the classical coding models. Moreover, we propose a unified robust predictive coding machine (RPCM) to learn the parametric function, which can be solved by an alternating minimization algorithm. In addition, we provide a bounded contraction analysis of the parametric function. To the best of our knowledge, this paper is the first work among subspace clustering methods to efficiently cluster millions of data points. Experiments on million-scale datasets verify that our paradigm outperforms the related state-of-the-art methods in both efficiency and effectiveness.
CV · Mar 29, 2020
Generative Partial Multi-View Clustering
Qianqian Wang, Zhengming Ding, Zhiqiang Tao et al.
With the rapid development of data collection sources and feature extraction methods, multi-view data have become easy to obtain and have received increasing research attention in recent years. Within this area, multi-view clustering (MVC) forms a mainstream research direction and is widely used in data analysis. However, existing MVC methods mainly assume that each sample appears in all the views, without considering the incomplete view case due to data corruption, sensor failure, equipment malfunction, etc. In this study, we design and build a generative partial multi-view clustering model, named GP-MVC, to address the incomplete multi-view problem by explicitly generating the data of missing views. The main idea of GP-MVC is two-fold. First, multi-view encoder networks are trained to learn common low-dimensional representations, followed by a clustering layer to capture the consistent cluster structure across multiple views. Second, view-specific generative adversarial networks are developed to generate the missing data of one view conditioned on the shared representation given by other views. These two steps can promote each other: learning common representations facilitates data imputation, and the generated data can further explore the view consistency. Moreover, a weighted adaptive fusion scheme is implemented to exploit the complementary information among different views. Experimental results on four benchmark datasets are provided to show the effectiveness of the proposed GP-MVC over the state-of-the-art methods.
LG · Jan 3, 2020
Automated Relational Meta-learning
Huaxiu Yao, Xian Wu, Zhiqiang Tao et al.
In order to learn efficiently from a small amount of data on new tasks, meta-learning transfers knowledge learned from previous tasks to the new ones. However, a critical challenge in meta-learning is task heterogeneity, which cannot be handled well by traditional globally shared meta-learning methods. In addition, current task-specific meta-learning methods may either suffer from hand-crafted structure design or lack the capability to capture complex relations between tasks. In this paper, motivated by the way knowledge is organized in knowledge bases, we propose an automated relational meta-learning (ARML) framework that automatically extracts the cross-task relations and constructs the meta-knowledge graph. When a new task arrives, it can quickly find the most relevant structure and tailor the learned structure knowledge to the meta-learner. As a result, the proposed framework not only addresses the challenge of task heterogeneity by a learned meta-knowledge graph, but also increases the model interpretability. We conduct extensive experiments on 2D toy regression and few-shot image classification, and the results demonstrate the superiority of ARML over state-of-the-art baselines.
LG · Nov 24, 2019
Correlative Channel-Aware Fusion for Multi-View Time Series Classification
Yue Bai, Lichen Wang, Zhiqiang Tao et al.
Multi-view time series classification (MVTSC) aims to improve the performance by fusing the distinctive temporal information from multiple views. Existing methods mainly focus on fusing multi-view information at an early stage, e.g., by learning a common feature subspace among multiple views. However, these early fusion methods may not fully exploit the unique temporal patterns of each view in complicated time series. Moreover, the label correlations of multiple views, which are critical to boosting, are usually under-explored for the MVTSC problem. To address the aforementioned issues, we propose a Correlative Channel-Aware Fusion (C2AF) network. First, C2AF extracts comprehensive and robust temporal patterns by a two-stream structured encoder for each view, and captures the intra-view and inter-view label correlations with a graph-based correlation matrix. Second, a channel-aware learnable fusion mechanism is implemented through convolutional neural networks to further explore the global correlative patterns. These two steps are trained end-to-end in the proposed C2AF network. Extensive experimental results on three real-world datasets demonstrate the superiority of our approach over the state-of-the-art methods. A detailed ablation study is also provided to show the effectiveness of each model component.
LG · May 31, 2019
Consensus Clustering: An Embedding Perspective, Extension and Beyond
Hongfu Liu, Zhiqiang Tao, Zhengming Ding
Consensus clustering fuses diverse basic partitions (i.e., clustering results obtained from conventional clustering methods) into an integrated one, and has attracted increasing attention in both academic and industrial areas due to its robust and effective performance. Tremendous research efforts have been made to advance this domain in terms of algorithms and applications. Although some survey papers summarize the existing literature, they neglect to explore the underlying connection among different categories. In contrast, in this paper we aim to provide an embedding perspective to illustrate the consensus mechanism, which transfers categorical basic partitions to other representations (e.g., binary coding, spectral embedding, etc.) for the clustering purpose. To this end, we not only unify two major categories of consensus clustering, but also build an intuitive connection between consensus clustering and graph embedding. Moreover, we elaborate several extensions of classical consensus clustering from different settings and problems. Beyond this, we demonstrate how to leverage consensus clustering to address other tasks, such as constrained clustering, domain adaptation, feature selection, and outlier detection. Finally, we conclude this survey with future work in terms of interpretability, learnability and theoretical analysis.
CV · Jan 6, 2019
Segmentation Guided Image-to-Image Translation with Adversarial Networks
Songyao Jiang, Zhiqiang Tao, Yun Fu
Recently, image-to-image translation, which aims to map images in one domain to another specific one, has received increasing attention. Existing methods mainly solve this task via a deep generative model and focus on exploring the relationship between different domains. However, these methods neglect to utilize higher-level and instance-specific information to guide the training process, leading to a great deal of unrealistic, low-quality generated images. Existing methods also lack spatial controllability during translation. To address these challenges, we propose novel Segmentation Guided Generative Adversarial Networks (SGGAN), which leverage semantic segmentation to further boost the generation performance and provide spatial mapping. In particular, a segmentor network is designed to impose semantic information on the generated images. Experimental results on a multi-domain face image translation task empirically demonstrate the spatial modification ability of our method and its superiority in image quality over several state-of-the-art methods.