CV Mar 9, 2022
Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification
Sen Jia, Yifan Wang
Hyperspectral images (HSI) not only have a broad macroscopic field of view but also contain rich spectral information, and identifying the types of surface objects from this spectral information is one of the main applications in HSI research. In recent years, more and more deep learning methods have been proposed, among which convolutional neural networks (CNNs) are the most influential. However, CNN-based methods struggle to capture long-range dependencies and require a large amount of labeled data for model training. Besides, most self-supervised training methods in the field of HSI classification are based on reconstructing the input samples, which makes it difficult to use unlabeled samples effectively. To address the shortcomings of CNNs, we propose a novel multi-scale convolutional embedding module for HSI that extracts spatial-spectral information effectively and combines well with a Transformer network. To make more efficient use of unlabeled data, we propose a new self-supervised pretext task. It is similar to the masked autoencoder, but our pre-training method masks only the token corresponding to the central pixel in the encoder and feeds the remaining tokens into the decoder to reconstruct the spectral information of the central pixel. Such a pretext task can better model the relationship between the central feature and the domain feature, and yields more stable training results.
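The center-masking step described in this abstract can be illustrated with a minimal sketch. It assumes the patch around a pixel has already been turned into a flat, row-major list of tokens for an odd-sided square window; the function name `center_mask` and the token layout are hypothetical and not taken from the paper.

```python
import math


def center_mask(tokens):
    """Split a flattened odd-sided square patch into visible tokens and the center token.

    Returns (visible_tokens, center_token, center_index). The center token is the
    reconstruction target; the visible tokens are what a decoder would receive.
    """
    n = len(tokens)
    side = math.isqrt(n)
    if side * side != n or side % 2 == 0:
        raise ValueError("expected a flattened square patch with an odd side length")
    center_index = n // 2  # middle element of a row-major odd-sided square
    visible = tokens[:center_index] + tokens[center_index + 1:]
    return visible, tokens[center_index], center_index


if __name__ == "__main__":
    patch = list(range(9))  # stand-in for the 9 tokens of a 3x3 patch
    visible, target, idx = center_mask(patch)
    print(idx, target, visible)
```

Only the masking is shown; the encoder, decoder, and reconstruction loss are left out because the abstract does not specify them.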
IV Aug 8, 2022
SelfCoLearn: Self-supervised collaborative learning for accelerating dynamic MR imaging
Juan Zou, Cheng Li, Sen Jia et al.
Lately, deep learning has been extensively investigated for accelerating dynamic magnetic resonance (MR) imaging, with encouraging progress achieved. However, without fully sampled reference data for training, current approaches may have limited abilities in recovering fine details or structures. To address this challenge, this paper proposes a self-supervised collaborative learning framework (SelfCoLearn) for accurate dynamic MR image reconstruction from undersampled k-space data. The proposed framework is equipped with three important components, namely, dual-network collaborative learning, re-undersampling data augmentation and a specially designed co-training loss. The framework is flexible enough to be integrated with both data-driven networks and model-based iterative unrolled networks. Our method has been evaluated on an in-vivo dataset and compared with four state-of-the-art methods. Results show that our method possesses strong capabilities in capturing essential and inherent representations for direct reconstructions from the undersampled k-space data and thus enables high-quality and fast dynamic MR imaging.
CV Apr 11, 2023
SPIRiT-Diffusion: Self-Consistency Driven Diffusion Model for Accelerated MRI
Zhuo-Xu Cui, Chentao Cao, Yue Wang et al.
Diffusion models have emerged as a leading methodology for image generation and have proven successful in the realm of magnetic resonance imaging (MRI) reconstruction. However, existing reconstruction methods based on diffusion models are primarily formulated in the image domain, making the reconstruction quality susceptible to inaccuracies in coil sensitivity maps (CSMs). k-space interpolation methods can effectively address this issue but conventional diffusion models are not readily applicable in k-space interpolation. To overcome this challenge, we introduce a novel approach called SPIRiT-Diffusion, which is a diffusion model for k-space interpolation inspired by the iterative self-consistent SPIRiT method. Specifically, we utilize the iterative solver of the self-consistent term (i.e., k-space physical prior) in SPIRiT to formulate a novel stochastic differential equation (SDE) governing the diffusion process. Subsequently, k-space data can be interpolated by executing the diffusion process. This innovative approach highlights the optimization model's role in designing the SDE in diffusion models, enabling the diffusion process to align closely with the physics inherent in the optimization model, a concept referred to as model-driven diffusion. We evaluated the proposed SPIRiT-Diffusion method using a 3D joint intracranial and carotid vessel wall imaging dataset. The results convincingly demonstrate its superiority over image-domain reconstruction methods, achieving high reconstruction quality even at a substantial acceleration rate of 10.
CV Aug 30, 2023
Physics-Informed DeepMRI: Bridging the Gap from Heat Diffusion to k-Space Interpolation
Zhuo-Xu Cui, Congcong Liu, Xiaohong Fan et al.
In the field of parallel imaging (PI), alongside image-domain regularization methods, substantial research has been dedicated to exploring $k$-space interpolation. However, the interpretability of these methods remains an unresolved issue. Furthermore, these approaches currently face acceleration limitations that are comparable to those experienced by image-domain methods. In order to enhance interpretability and overcome the acceleration limitations, this paper introduces an interpretable framework that unifies both $k$-space interpolation techniques and image-domain methods, grounded in the physical principles of heat diffusion equations. Building upon this foundational framework, a novel $k$-space interpolation method is proposed. Specifically, we model the process of high-frequency information attenuation in $k$-space as a heat diffusion equation, while the effort to reconstruct high-frequency information from low-frequency regions can be conceptualized as a reverse heat equation. However, solving the reverse heat equation poses a challenging inverse problem. To tackle this challenge, we modify the heat equation to align with the principles of magnetic resonance PI physics and employ the score-based generative method to precisely execute the modified reverse heat diffusion. Finally, experimental validation conducted on publicly available datasets demonstrates the superiority of the proposed approach over traditional $k$-space interpolation methods, deep learning-based $k$-space interpolation methods, and conventional diffusion models in terms of reconstruction accuracy, particularly in high-frequency regions.
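One standard way to see the link between heat diffusion and high-frequency attenuation in $k$-space is the textbook (unmodified) heat equation. The relation below assumes the Fourier convention $\hat{u}(k) = \int u(x)\, e^{-2\pi i k \cdot x}\, dx$; it is not the modified equation proposed in the paper, whose exact form the abstract does not give.

```latex
\begin{aligned}
\frac{\partial u}{\partial t}(x, t) &= \Delta u(x, t)
  && \text{(heat equation in the image domain)} \\
\frac{\partial \hat{u}}{\partial t}(k, t) &= -4\pi^2 \lvert k \rvert^2 \, \hat{u}(k, t)
  && \text{(Fourier transform in } x\text{)} \\
\hat{u}(k, t) &= e^{-4\pi^2 \lvert k \rvert^2 t} \, \hat{u}(k, 0)
  && \text{(solution: exponential decay, faster at large } \lvert k \rvert\text{)}
\end{aligned}
```

Running the last line backward in time multiplies $\hat{u}(k, t)$ by $e^{+4\pi^2 \lvert k \rvert^2 t}$, which amplifies high-frequency noise without bound. This is why the reverse heat equation is an ill-posed inverse problem, as the abstract notes.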
IV Dec 14, 2022
SPIRiT-Diffusion: SPIRiT-driven Score-Based Generative Modeling for Vessel Wall imaging
Chentao Cao, Zhuo-Xu Cui, Jing Cheng et al.
Diffusion models are the most advanced method in image generation and have been successfully applied to MRI reconstruction. However, existing methods do not consider the characteristics of multi-coil acquisition of MRI data. Therefore, we propose a new diffusion model, called SPIRiT-Diffusion, based on the SPIRiT iterative reconstruction algorithm. Specifically, SPIRiT-Diffusion characterizes the prior distribution of coil-by-coil images by score matching and characterizes the k-space redundant prior between coils based on self-consistency. With sufficient prior constraints utilized, we achieve superior reconstruction results on the joint Intracranial and Carotid Vessel Wall imaging dataset.
35.5 CV Mar 20
RAM: Recover Any 3D Human Motion in-the-Wild
Sen Jia, Ning Zhu, Jinqin Zhong et al.
RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW demonstrate that RAM substantially outperforms the previous state of the art in both zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in-the-wild.
CV Aug 11, 2022
K-UNN: k-Space Interpolation With Untrained Neural Network
Zhuo-Xu Cui, Sen Jia, Qingyong Zhu et al.
Recently, untrained neural networks (UNNs) have shown satisfactory performances for MR image reconstruction on random sampling trajectories without using additional full-sampled training data. However, the existing UNN-based approach does not fully use the MR image physical priors, resulting in poor performance in some common scenarios (e.g., partial Fourier, regular sampling, etc.) and the lack of theoretical guarantees for reconstruction accuracy. To bridge this gap, we propose a safeguarded k-space interpolation method for MRI using a specially designed UNN with a tripled architecture driven by three physical priors of the MR images (or k-space data), including sparsity, coil sensitivity smoothness, and phase smoothness. We also prove that the proposed method guarantees tight bounds for interpolated k-space data accuracy. Finally, ablation experiments show that the proposed method can more accurately characterize the physical priors of MR images than existing traditional methods. Additionally, under a series of commonly used sampling trajectories, experiments also show that the proposed method consistently outperforms traditional parallel imaging methods and existing UNNs, and even outperforms the state-of-the-art supervised-trained k-space deep learning methods in some cases.
19.4 CL May 19
BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation
Zijun Jia, Yuanchang Ye, Sen Jia et al.
Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG frames each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration, controlling the system-level error rate among accepted points, while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.
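The cascade decision rule described in this abstract (answer with the LLM-only branch, escalate to RAG, or abstain) can be sketched as follows. This is only the routing step plus an empirical evaluation of one threshold pair on labeled data; it does not implement the sequential graphical testing that certifies thresholds. All names (`route`, `evaluate`), the record layout, and the convention that lower scores mean more confidence are assumptions made for illustration.

```python
def route(u_llm, u_rag, t_llm, t_rag):
    """Pick a branch from two uncertainty scores and a threshold pair (lower = more confident)."""
    if u_llm <= t_llm:
        return "llm"      # the model-only answer is trusted
    if u_rag <= t_rag:
        return "rag"      # escalate to the retrieval-augmented answer
    return "abstain"      # neither branch is trusted


def evaluate(records, t_llm, t_rag):
    """Empirical risk among accepted examples, coverage, and retrieval rate.

    Each record is (u_llm, u_rag, llm_correct, rag_correct).
    """
    accepted = errors = retrievals = 0
    for u_llm, u_rag, llm_ok, rag_ok in records:
        decision = route(u_llm, u_rag, t_llm, t_rag)
        if decision != "llm":
            retrievals += 1  # the RAG branch had to run, even if we abstain afterwards
        if decision == "abstain":
            continue
        accepted += 1
        correct = llm_ok if decision == "llm" else rag_ok
        errors += 0 if correct else 1
    n = len(records)
    risk = errors / accepted if accepted else 0.0
    return risk, accepted / n, retrievals / n


if __name__ == "__main__":
    data = [
        (0.1, 0.9, True, False),
        (0.8, 0.2, False, True),
        (0.9, 0.9, False, False),
        (0.2, 0.5, False, True),
    ]
    print(evaluate(data, t_llm=0.3, t_rag=0.4))
```

Sweeping `(t_llm, t_rag)` over a grid and calling `evaluate` at each point gives the two-dimensional lattice of operating points the abstract refers to; choosing among them with a statistical guarantee is the part this sketch leaves out.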
AI Nov 25, 2024 Code
Human Motion Instruction Tuning
Lei Li, Sen Jia, Jianhao Wang et al.
This paper presents LLaMo (Large Language and Human Motion Assistant), a multimodal framework for human motion instruction tuning. In contrast to conventional instruction-tuning approaches that convert non-linguistic inputs, such as video or motion sequences, into language tokens, LLaMo retains motion in its native form for instruction tuning. This method preserves motion-specific details that are often diminished in tokenization, thereby improving the model's ability to interpret complex human behaviors. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction. Our code and models are available on the project website: https://github.com/ILGLJ/LLaMo.
CV Nov 27, 2024 Code
Graph Canvas for Controllable 3D Scene Generation
Libin Liu, Shen Chen, Sen Jia et al.
Spatial intelligence is foundational to AI systems that interact with the physical world, particularly in 3D scene generation and spatial comprehension. Current methodologies for 3D scene generation often rely heavily on predefined datasets, and struggle to adapt dynamically to changing spatial relationships. In this paper, we introduce GraphCanvas3D, a programmable, extensible, and adaptable framework for controllable 3D scene generation. Leveraging in-context learning, GraphCanvas3D enables dynamic adaptability without the need for retraining, supporting flexible and customizable scene creation. Our framework employs hierarchical, graph-driven scene descriptions, representing spatial elements as graph nodes and establishing coherent relationships among objects in 3D environments. Unlike conventional approaches, which are constrained in adaptability and often require predefined input masks or retraining for modifications, GraphCanvas3D allows for seamless object manipulation and scene adjustments on the fly. Additionally, GraphCanvas3D supports 4D scene generation, incorporating temporal dynamics to model changes over time. Experimental results and user studies demonstrate that GraphCanvas3D enhances usability, flexibility, and adaptability for scene generation. Our code and models are available on the project website: https://github.com/ILGLJ/Graph-Canvas.
CV Aug 10, 2025 Code
SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking
Fengchao Xiong, Zhenxing Wu, Sen Jia et al.
Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models and results will be publicly available via https://github.com/bearshng/suit to support reproducibility.
CV Dec 3, 2021 Code
A Survey: Deep Learning for Hyperspectral Image Classification with Few Labeled Samples
Sen Jia, Shuguo Jiang, Zhijie Lin et al.
With the rapid development of deep learning technology and improvement in computing capability, deep learning has been widely used in the field of hyperspectral image (HSI) classification. In general, deep learning models often contain many trainable parameters and require a massive number of labeled samples to achieve optimal performance. However, in regard to HSI classification, a large number of labeled samples is generally difficult to acquire due to the difficulty and time-consuming nature of manual labeling. Therefore, many research works focus on building a deep learning model for HSI classification with few labeled samples. In this article, we concentrate on this topic and provide a systematic review of the relevant literature. Specifically, the contributions of this paper are twofold. First, the research progress of related methods is categorized according to the learning paradigm, including transfer learning, active learning and few-shot learning. Second, a number of experiments with various state-of-the-art approaches have been carried out, and the results are summarized to reveal the potential research directions. More importantly, although there is a vast gap between deep learning models (which usually need sufficient labeled samples) and the HSI scenario with few labeled samples, the issues of small-sample sets can be well characterized by the fusion of deep learning methods and related techniques, such as transfer learning and a lightweight model. For reproducibility, the source codes of the methods assessed in the paper can be found at https://github.com/ShuGuoJ/HSI-Classification.git.
IV Aug 12, 2021 Code
Deep Amended Gradient Descent for Efficient Spectral Reconstruction from Single RGB Images
Zhiyu Zhu, Hui Liu, Junhui Hou et al.
This paper investigates the problem of recovering hyperspectral (HS) images from single RGB images. To tackle such a severely ill-posed problem, we propose a physically-interpretable, compact, efficient, and end-to-end learning-based framework, namely AGD-Net. Precisely, by taking advantage of the imaging process, we first formulate the problem explicitly based on the classic gradient descent algorithm. Then, we design a lightweight neural network with a multi-stage architecture to mimic the formed amended gradient descent process, in which efficient convolution and novel spectral zero-mean normalization are proposed to effectively extract spatial-spectral features for regressing an initialization, a basic gradient, and an incremental gradient. Besides, based on the approximate low-rank property of HS images, we propose a novel rank loss to promote the similarity between the global structures of reconstructed and ground-truth HS images, which is optimized with our singular value weighting strategy during training. Moreover, AGD-Net, a single network after one-time training, is flexible to handle the reconstruction with various spectral response functions. Extensive experiments over three commonly-used benchmark datasets demonstrate that AGD-Net can improve the reconstruction quality by more than 1.0 dB on average while saving 67$\times$ parameters and 32$\times$ FLOPs, compared with state-of-the-art methods. The code will be publicly available at https://github.com/zbzhzhy/GD-Net.
LG Nov 1, 2025
Learning an Efficient Optimizer via Hybrid-Policy Sub-Trajectory Balance
Yunchuan Guan, Yu Liu, Ke Zhou et al.
Recent advances in generative modeling enable neural networks to generate weights without relying on gradient-based optimization. However, current methods are limited by issues of over-coupling and long-horizon. The former tightly binds weight generation with task-specific objectives, thereby limiting the flexibility of the learned optimizer. The latter leads to inefficiency and low accuracy during inference, caused by the lack of local constraints. In this paper, we propose Lo-Hp, a decoupled two-stage weight generation framework that enhances flexibility through learning various optimization policies. It adopts a hybrid-policy sub-trajectory balance objective, which integrates on-policy and off-policy learning to capture local optimization policies. Theoretically, we demonstrate that learning solely local optimization policies can address the long-horizon issue while enhancing the generation of global optimal weights. In addition, we validate Lo-Hp's superior accuracy and inference efficiency in tasks that require frequent weight updates, such as transfer learning, few-shot learning, domain generalization, and large language model adaptation.
AI Feb 25, 2025
ChatMotion: A Multimodal Multi-Agent for Human Motion Analysis
Lei Li, Sen Jia, Jianhao Wang et al.
Advancements in Multimodal Large Language Models (MLLMs) have improved human motion understanding. However, these models remain constrained by their "instruct-only" nature, lacking interactivity and adaptability for diverse analytical perspectives. To address these challenges, we introduce ChatMotion, a multimodal multi-agent framework for human motion analysis. ChatMotion dynamically interprets user intent, decomposes complex tasks into meta-tasks, and activates specialized function modules for motion comprehension. It integrates multiple specialized modules, such as the MotionCore, to analyze human motion from various perspectives. Extensive experiments demonstrate ChatMotion's precision, adaptability, and user engagement for human motion understanding.
CV Dec 11, 2024
COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework
Xin Dong, Sen Jia, Ming Rui Wang et al.
With the recent emergence of Multimodal Large Language Model (MLLM) technology, it has become possible to exploit its video understanding capability on different classification tasks. In practice, deploying MLLMs online requires a huge amount of GPU resources. In this paper, we propose COEF-VQ, a novel cascaded MLLM framework designed to enhance video quality understanding on the short-video platform while optimizing computational efficiency. Our approach integrates an entropy-based pre-filtering stage, where a lightweight model assesses uncertainty and selectively filters cases before passing them to the more computationally intensive MLLM for final evaluation. By prioritizing high-uncertainty samples for deeper analysis, our framework significantly reduces GPU usage while maintaining the strong classification performance of a full MLLM deployment. To demonstrate the effectiveness of COEF-VQ, we deploy this new framework onto the video management platform (VMP) at the short-video platform, and perform a series of detailed experiments on two in-house tasks related to video quality understanding. We show that COEF-VQ leads to substantial performance gains in the offline evaluation of these two tasks and effectively enhances platform safety with limited resource consumption, significantly reducing the inappropriate content video view rate by 9.9% in an online A/B test without affecting engagement. Post-launch monitoring confirmed sustained improvements, validating its real-world impact.
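The entropy-based pre-filtering idea in this abstract can be sketched as a two-stage cascade. The lightweight model's class probabilities are scored by Shannon entropy, and only samples above a threshold are sent to the expensive model. The function names, the threshold value, and the callable standing in for the MLLM are all hypothetical; the abstract does not describe the actual models or how the threshold is chosen.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cascade_predict(sample, light_probs, heavy_model, threshold):
    """Return (label, stage). Low-entropy cases keep the lightweight prediction;
    high-entropy cases are escalated to the heavy model."""
    if entropy(light_probs) <= threshold:
        label = max(range(len(light_probs)), key=lambda i: light_probs[i])
        return label, "light"
    return heavy_model(sample), "heavy"


if __name__ == "__main__":
    def fake_mllm(sample):
        return 1  # placeholder for the computationally intensive model

    print(cascade_predict("video-a", [0.95, 0.05], fake_mllm, threshold=0.3))
    print(cascade_predict("video-b", [0.50, 0.50], fake_mllm, threshold=0.3))
```

Lowering `threshold` sends more samples to the heavy model. The trade-off between GPU usage and accuracy that the abstract describes is governed by this one parameter in the sketch.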
IV Dec 9, 2024
Diff5T: Benchmarking Human Brain Diffusion MRI with an Extensive 5.0 Tesla K-Space and Spatial Dataset
Shanshan Wang, Shoujun Yu, Jian Cheng et al.
Diffusion magnetic resonance imaging (dMRI) provides critical insights into the microstructural and connectional organization of the human brain. However, the availability of high-field, open-access datasets that include raw k-space data for advanced research remains limited. To address this gap, we introduce Diff5T, a first comprehensive 5.0 Tesla diffusion MRI dataset focusing on the human brain. This dataset includes raw k-space data and reconstructed diffusion images, acquired using a variety of imaging protocols. Diff5T is designed to support the development and benchmarking of innovative methods in artifact correction, image reconstruction, image preprocessing, diffusion modelling and tractography. The dataset features a wide range of diffusion parameters, including multiple b-values and gradient directions, allowing extensive research applications in studying human brain microstructure and connectivity. With its emphasis on open accessibility and detailed benchmarks, Diff5T serves as a valuable resource for advancing human brain mapping research using diffusion MRI, fostering reproducibility, and enabling collaboration across the neuroscience and medical imaging communities.
CV Mar 1, 2025
RFWNet: A Lightweight Remote Sensing Object Detector Integrating Multiscale Receptive Fields and Foreground Focus Mechanism
Yujie Lei, Wenjie Sun, Sen Jia et al.
Challenges in remote sensing object detection (RSOD), such as high interclass similarity, imbalanced foreground-background distribution, and the small size of objects in remote sensing images, significantly hinder detection accuracy. Moreover, the tradeoff between model accuracy and computational complexity poses additional constraints on the application of RSOD algorithms. To address these issues, this study proposes an efficient and lightweight RSOD algorithm integrating multiscale receptive fields and a foreground focus mechanism, named robust foreground weighted network (RFWNet). Specifically, we propose a lightweight backbone network, the receptive field adaptive selection network (RFASNet), which leverages the rich context information of remote sensing images to enhance class separability. Additionally, we develop a foreground-background separation module (FBSM) consisting of a background redundant information filtering module (BRIFM) and a foreground information enhancement module (FIEM) to emphasize critical regions within images while filtering redundant background information. Finally, we design a loss function, the weighted CIoU-Wasserstein loss (LWCW), which weights the IoU-based loss by using the normalized Wasserstein distance to mitigate model sensitivity to small object position deviations. The comprehensive experimental results demonstrate that RFWNet achieved 95.3% and 73.2% mean average precision (mAP) with 6.0 M parameters on the DOTA V1.0 and NWPU VHR-10 datasets, respectively, with an inference speed of 52 FPS.
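The normalized Wasserstein distance mentioned in this abstract is commonly defined by modeling each box as a 2D Gaussian. The sketch below follows that common definition under stated assumptions: boxes are given as (cx, cy, w, h), and `c` is a dataset-dependent normalizing constant whose value here is arbitrary. How RFWNet combines this quantity with the CIoU loss is not stated in the abstract and is not reproduced.

```python
import math


def normalized_wasserstein(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between two boxes given as (cx, cy, w, h).

    Each box is treated as a Gaussian with mean (cx, cy) and covariance
    diag(w^2 / 4, h^2 / 4); the squared 2-Wasserstein distance then reduces to a
    squared Euclidean distance between (cx, cy, w/2, h/2) vectors.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    w2_squared = (
        (ax - bx) ** 2
        + (ay - by) ** 2
        + ((aw - bw) / 2.0) ** 2
        + ((ah - bh) / 2.0) ** 2
    )
    return math.exp(-math.sqrt(w2_squared) / c)


if __name__ == "__main__":
    print(normalized_wasserstein((0, 0, 2, 2), (0, 0, 2, 2)))
    print(normalized_wasserstein((0, 0, 2, 2), (3, 4, 2, 2), c=5.0))
```

Unlike IoU, this similarity stays non-zero and changes smoothly when two small boxes do not overlap at all. That is the usual motivation for using it with small objects.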
CV Nov 29, 2024
LaVIDE: A Language-Vision Discriminator for Detecting Changes in Satellite Image with Map References
Shuguo Jiang, Fang Xu, Sen Jia et al.
Change detection, which typically relies on the comparison of bi-temporal images, is significantly hindered when only a single image is available. Comparing a single image with an existing map, such as OpenStreetMap, which is continuously updated through crowd-sourcing, offers a viable solution to this challenge. Unlike images that carry low-level visual details of ground objects, maps convey high-level categorical information. This discrepancy in abstraction levels complicates the alignment and comparison of the two data types. In this paper, we propose a Language-VIsion Discriminator for dEtecting changes in satellite image with map references, namely LaVIDE, which leverages language to bridge the information gap between maps and images. Specifically, LaVIDE formulates change detection as the problem of "Does the pixel belong to [class]?", aligning maps and images within the feature space of the language-vision model to associate high-level map categories with low-level image details. Moreover, we build a mixture-of-experts discriminative module, which compares linguistic features from maps with visual features from images across various semantic perspectives, achieving comprehensive semantic comparison for change detection. Extensive evaluation on four benchmark datasets demonstrates that LaVIDE can effectively detect changes in satellite image with map references, outperforming state-of-the-art change detection algorithms, e.g., with gains of about 13.8% on the DynamicEarthNet dataset and 4.3% on the SECOND dataset.
CV Jun 15, 2024
Technique Report of CVPR 2024 PBDL Challenges
Ying Fu, Yu Li, Shaodi You et al.
The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements for various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held in CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches.
LG Dec 18, 2021
Equilibrated Zeroth-Order Unrolled Deep Networks for Accelerated MRI
Zhuo-Xu Cui, Jing Cheng, Qingyong Zhu et al.
Recently, model-driven deep learning unrolls a certain iterative algorithm of a regularization model into a cascade network by replacing the first-order information (i.e., (sub)gradient or proximal operator) of the regularizer with a network module, which appears more explainable and predictable compared to common data-driven networks. However, in theory, there is not necessarily a functional regularizer whose first-order information matches the replaced network module, which means the network output may not be covered by the original regularization model. Moreover, up to now, there is also no theory to guarantee the global convergence and robustness (regularity) of unrolled networks under realistic assumptions. To bridge this gap, this paper presents a safeguarded methodology for network unrolling. Specifically, focusing on accelerated MRI, we unroll a zeroth-order algorithm, in which the network module represents the regularizer itself, so that the network output can still be covered by the regularization model. Furthermore, inspired by the idea of deep equilibrium models, before backpropagating, we run the unrolled iterative network until it converges to a fixed point to ensure convergence. In case the measurement data contains noise, we prove that the proposed network is robust against noisy interference. Finally, numerical experiments show that the proposed network consistently outperforms the state-of-the-art MRI reconstruction methods including traditional regularization methods and other deep learning methods.
CV Oct 20, 2021
Simpler Does It: Generating Semantic Labels with Objectness Guidance
Md Amirul Islam, Matthew Kowal, Sen Jia et al.
Existing weakly or semi-supervised semantic segmentation methods utilize image or box-level supervision to generate pseudo-labels for weakly labeled images. However, due to the lack of strong supervision, the generated pseudo-labels are often noisy near the object boundaries, which severely impacts the network's ability to learn strong representations. To address this problem, we present a novel framework that generates pseudo-labels for training images, which are then used to train a segmentation model. To generate pseudo-labels, we combine information from: (i) a class agnostic objectness network that learns to recognize object-like regions, and (ii) either image-level or bounding box annotations. We show the efficacy of our approach by demonstrating how the objectness network can naturally be leveraged to generate object-like regions for unseen categories. We then propose an end-to-end multi-task learning strategy, that jointly learns to segment semantics and objectness using the generated pseudo-labels. Extensive experiments demonstrate the high quality of our generated pseudo-labels and effectiveness of the proposed framework in a variety of domains. Our approach achieves better or competitive performance compared to existing weakly-supervised and semi-supervised methods.
CV Aug 17, 2021
Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs
Md Amirul Islam, Matthew Kowal, Sen Jia et al.
In this paper, we challenge the common assumption that collapsing the spatial dimensions of a 3D (spatial-channel) tensor in a convolutional neural network (CNN) into a vector via global pooling removes all spatial information. Specifically, we demonstrate that positional information is encoded based on the ordering of the channel dimensions, while semantic information is largely not. Following this demonstration, we show the real world impact of these findings by applying them to two applications. First, we propose a simple yet effective data augmentation strategy and loss function which improves the translation invariance of a CNN's output. Second, we propose a method to efficiently determine which channels in the latent representation are responsible for (i) encoding overall position information or (ii) region-specific positions. We first show that semantic segmentation has a significant reliance on the overall position channels to make predictions. We then show for the first time that it is possible to perform a `region-specific' attack, and degrade a network's performance in a particular part of the input. We believe our findings and demonstrated applications will benefit research areas concerned with understanding the characteristics of CNNs.
IV Jul 14, 2021
Multi-Attention Generative Adversarial Network for Remote Sensing Image Super-Resolution
Meng Xu, Zhihao Wang, Jiasong Zhu et al.
Image super-resolution (SR) methods can generate remote sensing images with high spatial resolution without increasing the cost, thereby providing a feasible way to acquire high-resolution remote sensing images, which are difficult to obtain due to the high cost of acquisition equipment and complex weather. Clearly, image super-resolution is a severely ill-posed problem. Fortunately, with the development of deep learning, the powerful fitting ability of deep neural networks has solved this problem to some extent. In this paper, we propose a network based on the generative adversarial network (GAN) to generate high-resolution remote sensing images, named the multi-attention generative adversarial network (MA-GAN). We first design a GAN-based framework for the image SR task. The core of the framework is an image generator with post-upsampling. The main body of the generator contains two blocks: the pyramidal convolution in the residual-dense block (PCRDB) and the attention-based upsample (AUP) block. The attentioned pyramidal convolution (AttPConv) in the PCRDB block is a module that combines multi-scale convolution and channel attention to automatically learn and adjust the scaling of the residuals for better results. The AUP block is a module that incorporates pixel attention (PA) to perform upsampling by arbitrary factors. These two blocks work together to help generate better quality images. The loss function is based on pixel loss, with adversarial loss and feature loss introduced to guide the generator's learning. We have compared our method with several state-of-the-art methods on a remote sensing scene image dataset, and the experimental results consistently demonstrate the effectiveness of the proposed MA-GAN.
CV Apr 13, 2021
SRR-Net: A Super-Resolution-Involved Reconstruction Method for High Resolution MR Imaging
Wenqi Huang, Sen Jia, Ziwen Ke et al.
Improving the image resolution and acquisition speed of magnetic resonance imaging (MRI) is a challenging problem. There are two main strategies for dealing with the speed-resolution trade-off: (1) $k$-space undersampling with high-resolution acquisition, and (2) a pipeline of lower-resolution image reconstruction followed by image super-resolution. However, these approaches either have limited performance at high acceleration factors or suffer from the error accumulation of a two-step structure. In this paper, we combine the ideas of MR reconstruction and image super-resolution, and work on recovering high-resolution (HR) images directly from low-resolution under-sampled $k$-space data. In particular, the SR-involved reconstruction can be formulated as a variational problem, and a learnable network unrolled from its solution algorithm is proposed. A discriminator is introduced to enhance the detail-refining performance. Experimental results using in-vivo HR multi-coil brain data indicate that the proposed SRR-Net is capable of recovering high-resolution brain images with both good visual quality and perceptual quality.
IV Mar 9, 2021
Deep Manifold Learning for Dynamic MR Imaging
Ziwen Ke, Zhuo-Xu Cui, Wenqi Huang et al.
Purpose: To develop a deep learning method on a nonlinear manifold that exploits the temporal redundancy of dynamic signals to reconstruct cardiac MRI data from highly undersampled measurements. Methods: Cardiac MR image reconstruction is modeled as a general compressed sensing (CS) based optimization on a low-rank tensor manifold. The nonlinear manifold is designed to characterize the temporal correlation of dynamic signals. Iterative procedures can be obtained by solving the optimization model on the manifold, including gradient calculation, projection of the gradient to tangent space, and retraction of the tangent space to the manifold. The iterative procedures on the manifold are unrolled into a neural network, dubbed Manifold-Net. The Manifold-Net is trained using in vivo data with a retrospective electrocardiogram (ECG)-gated segmented bSSFP sequence. Results: Experimental results at high accelerations demonstrate that the proposed method can obtain improved reconstruction compared with a CS method, k-t SLR, and two state-of-the-art deep learning-based methods, DC-CNN and CRNN. Conclusion: This work represents the first study unrolling the optimization on manifolds into neural networks. Specifically, the designed low-rank manifold provides a new technical route for applying low-rank priors in dynamic MR imaging.
CV Jan 28, 2021
Position, Padding and Predictions: A Deeper Look at Position Information in CNNs
Md Amirul Islam, Matthew Kowal, Sen Jia et al.
In contrast to fully connected networks, Convolutional Neural Networks (CNNs) achieve efficiency by learning weights associated with local filters with a finite spatial extent. An implication of this is that a filter may know what it is looking at, but not where it is positioned in the image. In this paper, we first test this hypothesis and reveal that a surprising degree of absolute position information is encoded in commonly used CNNs. We show that zero padding drives CNNs to encode position information in their internal representations, while a lack of padding precludes position encoding. This gives rise to deeper questions about the role of position information in CNNs: (i) What boundary heuristics enable optimal position encoding for downstream tasks? (ii) Does position encoding affect the learning of semantic representations? (iii) Does position encoding always improve performance? To provide answers, we perform the largest case study to date on the role that padding and border heuristics play in CNNs. We design novel tasks which allow us to quantify boundary effects as a function of the distance to the border. Numerous semantic objectives reveal the effect of the border on semantic representations. Finally, we demonstrate the implications of these findings on multiple real-world tasks to show that position information can either help or hurt performance.
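The padding effect described in the abstract above can be illustrated with a toy example. The following is a minimal sketch, not the paper's experiment: it assumes a single 3x3 all-ones filter applied to a constant 5x5 image, and the helper names `conv2d_valid` and `conv2d_same_zero_pad` are hypothetical. With zero padding, the response of the filter depends on the distance to the border, so the output carries a position cue; without padding, the output is constant.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Plain 2-D cross-correlation without padding (hypothetical helper)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def conv2d_same_zero_pad(image, kernel):
    """Same-size cross-correlation using zero padding (hypothetical helper)."""
    ph, pw = kernel.shape[0] // 2, kernel.shape[1] // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    return conv2d_valid(padded, kernel)


# A constant image contains no content that could reveal position.
image = np.ones((5, 5))
kernel = np.ones((3, 3))

padded_out = conv2d_same_zero_pad(image, kernel)  # varies near the border
valid_out = conv2d_valid(image, kernel)           # constant everywhere
```

In this toy setting, `padded_out` takes smaller values at corners and edges than in the interior, while `valid_out` is uniform, which is one simple way to see how zero padding can leak border-relative position into feature maps.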
CV Jan 27, 2021
Shape or Texture: Understanding Discriminative Features in CNNs
Md Amirul Islam, Matthew Kowal, Patrick Esser et al.
In contrast to previous evidence that neurons in the later layers of a Convolutional Neural Network (CNN) respond to complex object shapes, recent studies have shown that CNNs actually exhibit a 'texture bias': given an image with both texture and shape cues (e.g., a stylized image), a CNN is biased towards predicting the category corresponding to the texture. However, these previous studies conduct experiments on the final classification output of the network, and fail to robustly evaluate the bias contained (i) in the latent representations, and (ii) on a per-pixel level. In this paper, we design a series of experiments that overcome these issues. We do this with the goal of better understanding what type of shape information contained in the network is discriminative, where shape information is encoded, as well as when the network learns about object shape during training. We show that a network learns the majority of overall shape information in the first few epochs of training and that this information is largely encoded in the last few layers of a CNN. Finally, we show that the encoding of shape does not imply the encoding of localized per-pixel semantic information. The experimental results and findings provide a more accurate understanding of the behaviour of current CNNs, thus helping to inform future design choices.
CV Nov 9, 2020
Deep Learning based Monocular Depth Prediction: Datasets, Methods and Applications
Qing Li, Jiasong Zhu, Jun Liu et al.
Estimating depth from RGB images can facilitate many computer vision tasks, such as indoor localization, height estimation, and simultaneous localization and mapping (SLAM). Recently, monocular depth estimation has made great progress owing to the rapid development of deep learning techniques, which surpass traditional machine learning-based methods by a large margin in terms of accuracy and speed. Despite the rapid progress on this topic, a comprehensive review that summarizes the current progress and outlines future directions is lacking. In this survey, we first introduce the datasets for depth estimation, and then give a comprehensive introduction to the methods from three perspectives: supervised learning-based methods, unsupervised learning-based methods, and sparse samples guidance-based methods. In addition, downstream applications that benefit from this progress are illustrated. Finally, we point out future directions and conclude the paper.
IV Oct 26, 2020
Deep Low-rank plus Sparse Network for Dynamic MR Imaging
Wenqi Huang, Ziwen Ke, Zhuo-Xu Cui et al.
In dynamic magnetic resonance (MR) imaging, low-rank plus sparse (L+S) decomposition, or robust principal component analysis (PCA), has achieved stunning performance. However, the selection of the parameters of L+S is empirical, and the acceleration rate is limited, which are common failings of iterative compressed sensing MR imaging (CS-MRI) reconstruction methods. Many deep learning approaches have been proposed to address these issues, but few of them use a low-rank prior. In this paper, a model-based low-rank plus sparse network, dubbed L+S-Net, is proposed for dynamic MR reconstruction. In particular, we use an alternating linearized minimization method to solve the optimization problem with low-rank and sparse regularization. Learned soft singular value thresholding is introduced to ensure the clear separation of the L component and S component. Then, the iterative steps are unrolled into a network in which the regularization parameters are learnable. We prove that the proposed L+S-Net achieves global convergence under two standard assumptions. Experiments on retrospective and prospective cardiac cine datasets show that the proposed model outperforms state-of-the-art CS and existing deep learning methods and has great potential for extremely high acceleration factors (up to 24x).
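The singular value thresholding step mentioned in the abstract above can be sketched in its classical, fixed-threshold form. This is a minimal NumPy illustration and not the learned operator from the paper: the threshold `tau` is a plain argument here, whereas the abstract describes it as learnable, and the function name `singular_value_threshold` is a hypothetical one chosen for this example.

```python
import numpy as np


def singular_value_threshold(x, tau):
    """Soft-threshold the singular values of a 2-D array by tau.

    Classical fixed-threshold form; in a learned variant, tau would be
    a trainable parameter rather than a constant.
    """
    u, s, vh = np.linalg.svd(x, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # shrink, and zero out small values
    return (u * s_shrunk) @ vh           # same as u @ diag(s_shrunk) @ vh


# Toy usage: singular values 3 and 1 with tau = 2 leave a rank-1 matrix.
x = np.diag([3.0, 1.0])
low_rank = singular_value_threshold(x, tau=2.0)
```

Shrinking the singular values in this way is the proximal operator of the nuclear norm, which is why it appears as the low-rank update in iterative L+S style schemes.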
IV Jun 22, 2020
Deep Low-rank Prior in Dynamic MR Imaging
Ziwen Ke, Wenqi Huang, Jing Cheng et al.
Deep learning methods have achieved attractive performance in dynamic MR cine imaging. However, all of these methods are driven only by the sparse prior of MR images, while the important low-rank (LR) prior of dynamic MR cine images is not explored, which limits further improvement in dynamic MR reconstruction. In this paper, a learned singular value thresholding (Learned-SVT) operation is proposed to explore a deep low-rank prior in dynamic MR imaging for obtaining improved reconstruction results. In particular, we come up with two novel and distinct schemes to introduce the learnable low-rank prior into deep network architectures, in an unrolling manner and in a plug-and-play manner respectively. In the unrolling manner, we put forward a model-based unrolling sparse and low-rank network for dynamic MR imaging, dubbed SLR-Net. The SLR-Net is defined over a deep network flow graph, which is unrolled from the iterative procedures in the Iterative Shrinkage-Thresholding Algorithm (ISTA) for optimizing a sparse and low-rank based dynamic MRI model. In the plug-and-play manner, we present a plug-and-play LR network module that can be easily embedded into any other dynamic MR neural network without changing the network paradigm. Experimental results show that both schemes can further improve on state-of-the-art CS methods, such as k-t SLR, and sparsity-driven deep learning-based methods, such as DC-CNN and CRNN, both qualitatively and quantitatively.
CV Feb 24, 2020
Revisiting Saliency Metrics: Farthest-Neighbor Area Under Curve
Sen Jia, Neil D. B. Bruce
Saliency detection has been widely studied because it plays an important role in various vision applications, but it is difficult to evaluate saliency systems because each measure has its own bias. In this paper, we first revisit the problem of applying the widely used saliency metrics to modern Convolutional Neural Networks (CNNs). Our investigation shows that the saliency datasets have been built based on different choices of parameters and that CNNs are designed to fit a dataset-specific distribution. Secondly, we show that the Shuffled Area Under Curve (S-AUC) metric still suffers from spatial biases. We propose a new saliency metric based on the AUC property, denoted Farthest-Neighbor AUC (FN-AUC), which aims at sampling a more directional negative set for evaluation. We also propose a strategy to measure the quality of the sampled negative set. Our experiments show that FN-AUC can measure spatial biases, central and peripheral, more effectively than S-AUC without penalizing the fixation locations. Thirdly, we propose a global smoothing function to overcome the problem of few value degrees (output quantization) in computing AUC metrics. Compared with random noise, our smoothing function can create unique values without losing the relative saliency relationship.
CV Jan 22, 2020
How Much Position Information Do Convolutional Neural Networks Encode?
Md Amirul Islam, Sen Jia, Neil D. B. Bruce
In contrast to fully connected networks, Convolutional Neural Networks (CNNs) achieve efficiency by learning weights associated with local filters with a finite spatial extent. An implication of this is that a filter may know what it is looking at, but not where it is positioned in the image. Information concerning absolute position is inherently useful, and it is reasonable to assume that deep CNNs may implicitly learn to encode this information if there is a means to do so. In this paper, we test this hypothesis, revealing the surprising degree of absolute position information that is encoded in commonly used neural networks. A comprehensive set of experiments shows the validity of this hypothesis and sheds light on how and where this information is represented, while offering clues to where positional information is derived from in deep CNNs.
CV Jan 8, 2019
Richer and Deeper Supervision Network for Salient Object Detection
Sen Jia, Neil D. B. Bruce
Recent Salient Object Detection (SOD) systems are mostly based on Convolutional Neural Networks (CNNs). Specifically, the Deeply Supervised Saliency (DSS) system has shown that it is very useful to add short connections to the network and to supervise the side outputs. In this work, we propose a new SOD system that aims to pass back global information more efficiently and effectively. Richer and Deeper Supervision (RDS) is applied to better combine features from each side output without demanding much extra computational space. Meanwhile, the backbone network used for SOD is normally pre-trained on the object classification dataset ImageNet. However, the pre-trained model has been trained on cropped images so that it focuses only on distinguishing features within the region of the object, whereas the ignored background information is also significant for SOD. We try to solve this problem by introducing training data designed for object detection. Coarse global information is learned from entire images with their bounding boxes before training on the SOD dataset. The large-scale set of object images can slightly improve the performance of SOD. Our experiments show that the proposed RDS network achieves state-of-the-art results on five public SOD datasets.
CV Sep 30, 2018
DIMENSION: Dynamic MR Imaging with Both K-space and Spatial Prior Knowledge Obtained via Multi-Supervised Network Training
Shanshan Wang, Ziwen Ke, Huitao Cheng et al.
Dynamic MR image reconstruction from incomplete k-space data has generated great research interest due to its capability to reduce scan time. Nevertheless, the reconstruction problem is still challenging due to its ill-posed nature. Most existing methods either suffer from long iterative reconstruction times or exploit limited prior knowledge. This paper proposes a dynamic MR imaging method with both k-space and spatial prior knowledge integrated via multi-supervised network training, dubbed DIMENSION. Specifically, the DIMENSION architecture consists of a frequential prior network for updating the k-space with its network prediction and a spatial prior network for capturing image structures and details. Furthermore, a multi-supervised network training technique is developed to constrain the frequency domain information and reconstruction results at different levels. Comparisons with classical k-t FOCUSS, k-t SLR, L+S and a state-of-the-art CNN-based method on in vivo datasets show that our method can achieve improved reconstruction results in a shorter time.
LG Jun 16, 2018
Right for the Right Reason: Training Agnostic Networks
Sen Jia, Thomas Lansdall-Welfare, Nello Cristianini
We consider the problem of a neural network being requested to classify images (or other inputs) without making implicit use of a "protected concept", that is a concept that should not play any role in the decision of the network. Typically these concepts include information such as gender or race, or other contextual information such as image backgrounds that might be implicitly reflected in unknown correlations with other variables, making it insufficient to simply remove them from the input features. In other words, making accurate predictions is not good enough if those predictions rely on information that should not be used: predictive performance is not the only important metric for learning systems. We apply a method developed in the context of domain adaptation to address this problem of "being right for the right reason", where we request a classifier to make a decision in a way that is entirely 'agnostic' to a given protected concept (e.g. gender, race, background etc.), even if this could be implicitly reflected in other attributes via unknown correlations. After defining the concept of an 'agnostic model', we demonstrate how the Domain-Adversarial Neural Network can remove unwanted information from a model using a gradient reversal layer.
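The gradient reversal layer named in the abstract above has a very small core, which can be sketched without any deep learning framework. This is a minimal illustration under stated assumptions: the class name `GradientReversal`, its `forward`/`backward` methods, and the scaling factor `lam` are hypothetical names for this example, and a real model would implement the same rule through its framework's automatic differentiation.

```python
import numpy as np


class GradientReversal:
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed scaling factor for the reversed gradient

    def forward(self, x):
        # Features pass through unchanged to the protected-concept classifier.
        return x

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is sign-flipped,
        # so the extractor is pushed to make the protected concept harder to predict.
        return -self.lam * grad_output


layer = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0, 3.0])
out = layer.forward(features)
grad_to_features = layer.backward(np.array([0.2, 0.4, -0.6]))
```

Placing such a layer between the shared feature extractor and the classifier for the protected concept is the usual arrangement in domain-adversarial training: the concept classifier still minimizes its own loss, while the features are updated in the opposite direction.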
CV May 2, 2018
EML-NET: An Expandable Multi-Layer NETwork for Saliency Prediction
Sen Jia, Neil D. B. Bruce
Saliency prediction can benefit from training that involves scene understanding that may be tangential to the central task; this may include understanding places, spatial layout, or objects, or involve different datasets and their biases. One can combine models, but doing this in a sophisticated manner can be complex, and can also result in unwieldy networks or produce competing objectives that are hard to balance. In this paper, we propose a scalable system to leverage multiple powerful deep CNN models to better extract visual features for saliency prediction. Our design differs from previous studies in that the whole system is trained in an almost end-to-end piece-wise fashion. The encoder and decoder components are separately trained to deal with complexity tied to the computational paradigm and required space. Furthermore, the encoder can contain more than one CNN model to extract features, and models can have different architectures or be pre-trained on different datasets. This parallel design yields a better computational paradigm, overcoming limits to the variety of information or inference that can be combined at the encoder stage towards deeper networks and a more powerful encoding. Our network can be easily expanded at almost no additional cost, and other pre-trained CNN models can be incorporated, making a wider range of visual knowledge available. We denote our expandable multi-layer network as EML-NET, and our method achieves state-of-the-art results on the public saliency benchmarks SALICON, MIT300 and CAT2000.