CVJul 5, 2022Code
Softmax-free Linear TransformersJiachen Lu, Junge Zhang, Xiatian Zhu et al.
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has a quadratic complexity in both computation and memory usage. This motivates the development of approximating the self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximations, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. As preserving the softmax operation challenges any subsequent linearization efforts. By this insight, a family of Softmax-Free Transformers (SOFT) are proposed. Specifically, a Gaussian kernel function is adopted to replace the dot-product similarity, enabling a full self-attention matrix to be approximated under low-rank matrix decomposition. For computational robustness, we estimate the Moore-Penrose inverse using an iterative Newton-Raphson method in the forward process only, while calculating its theoretical gradients only once in the backward process. To further expand applicability (e.g., dense prediction tasks), an efficient symmetric normalization technique is introduced. Extensive experiments on ImageNet, COCO, and ADE20K show that our SOFT significantly improves the computational efficiency of existing ViT variants. With linear complexity, much longer token sequences are permitted by SOFT, resulting in superior trade-off between accuracy and complexity. Code and models are available at https://github.com/fudan-zvg/SOFT.
87.3LGMay 29
Conformal Reliability: A New Evaluation Metric for Conditional GenerationYachen Gao, Xinwei Sun, Yikai Wang et al.
Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, which takes into account their inherent uncertainty, is still lacking. Existing metrics, which typically assess a single output, may fail to capture the variability or potential risks in generation. In this paper, we propose a novel evaluation metric called reliability score based on conformal prediction, which measures the worst-case performance within the prediction set at a pre-specified confidence level. However, computing this score is challenging due to the high-dimensional nature of the output space and the nonconvexity of both the metric function and the prediction set. To efficiently compute this score, we introduce Conformal ReLiability (CReL), a framework that can (i) construct the prediction set with desired coverage; and (ii) accurately optimize the reliability score within the constructed prediction set. We provide theoretical results on coverage and demonstrate empirically that our method produces more informative prediction sets than existing approaches. Experiments on synthetic data and the image-to-text and text-to-image tasks further demonstrate the interpretability of our new metric, and the validity and effectiveness of our computational framework. Source code can be found at https://ggc29.github.io/CReL/.
CVAug 24, 2022
AGO-Net: Association-Guided 3D Point Cloud Object Detection NetworkLiang Du, Xiaoqing Ye, Xiao Tan et al.
The human brain can effortlessly recognize and localize objects, whereas current 3D object detection methods based on LiDAR point clouds still report inferior performance for detecting occluded and distant objects: the point cloud appearance varies greatly due to occlusion, and has inherent variance in point densities along the distance to sensors. Therefore, designing feature representations robust to such point clouds is critical. Inspired by human associative recognition, we propose a novel 3D detection framework that associates intact features for objects via domain adaptation. We bridge the gap between the perceptual domain, where features are derived from real scenes with sub-optimal representations, and the conceptual domain, where features are extracted from augmented scenes that consist of non-occlusion objects with rich detailed information. A feasible method is investigated to construct conceptual scenes without external datasets. We further introduce an attention-based re-weighting module that adaptively strengthens the feature adaptation of more informative regions. The network's feature enhancement ability is exploited without introducing extra cost during inference, which is plug-and-play in various 3D detection frameworks. We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed. Experiments on nuScenes and Waymo datasets also validate the versatility of our method.
CVFeb 13, 2023Code
Preconditioned Score-based Generative ModelsHengyuan Ma, Xiatian Zhu, Jianfeng Feng et al.
Score-based generative models (SGMs) have recently emerged as a promising class of generative models. However, a fundamental limitation is that their sampling process is slow due to a need for many (e.g., 2000) iterations of sequential computations. An intuitive acceleration method is to reduce the sampling iterations which however causes severe performance degradation. We assault this problem to the ill-conditioned issues of the Langevin dynamics and reverse diffusion in the sampling process. Under this insight, we propose a novel preconditioned diffusion sampling (PDS) method that leverages matrix preconditioning to alleviate the aforementioned problem. PDS alters the sampling process of a vanilla SGM at marginal extra computation cost and without model retraining. Theoretically, we prove that PDS preserves the output distribution of the SGM, with no risk of inducing systematical bias to the original sampling process. We further theoretically reveal a relation between the parameter of PDS and the sampling iterations, easing the parameter estimation under varying sampling iterations. Extensive experiments on various image datasets with a variety of resolutions and diversity validate that our PDS consistently accelerates off-the-shelf SGMs whilst maintaining the synthesis quality. In particular, PDS can accelerate by up to 28x on more challenging high-resolution (1024x1024) image generation. Compared with the latest generative models (e.g., CLD-SGM and Analytic-DDIM), PDS can achieve the best sampling quality on CIFAR-10 at an FID score of 1.99. Our code is publicly available to foster any further research https://github.com/fudan-zvg/PDS.
CVJul 19, 2022
Vision Transformers: From Semantic Segmentation to Dense PredictionLi Zhang, Jiachen Lu, Sixiao Zheng et al.
The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representation at full receptive field per layer across all the image patches, in comparison to the increasing receptive fields of CNNs across layers and other alternatives (e.g., large kernels and atrous convolution). In this work, for the first time we explore the global context learning potentials of ViTs for dense visual prediction (e.g., semantic segmentation). Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information, critical for dense prediction tasks. We first demonstrate that encoding an image as a sequence of patches, a vanilla ViT without local convolution and resolution reduction can yield stronger visual representation for semantic segmentation. For example, our model, termed as SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on the day of submission) and performs competitively on Cityscapes. However, the basic ViT architecture falls short in broader dense prediction applications, such as object detection and instance segmentation, due to its lack of a pyramidal structure, high computational demand, and insufficient local context. For tackling general dense visual prediction tasks in a cost-effective manner, we further formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture. Extensive experiments show that our methods achieve appealing performance on a variety of dense prediction tasks (e.g., object detection and instance segmentation and semantic segmentation) as well as image classification.
CVSep 17, 2024Code
MinD-3D++: Advancing fMRI-Based 3D Reconstruction with High-Quality Textured Mesh Generation and a Comprehensive DatasetJianxiong Gao, Yanwei Fu, Yuqian Fu et al.
Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4,768 3D objects. The dataset consists of two components: fMRI-Shape, previously introduced and available at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape, and fMRI-Objaverse, proposed in this paper and available at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Objaverse. fMRI-Objaverse includes data from 5 subjects, 4 of whom are also part of the core set in fMRI-Shape. Each subject views 3,142 3D objects across 117 categories, all accompanied by text captions. This significantly enhances the diversity and potential applications of the dataset. Moreover, we propose MinD-3D++, a novel framework for decoding textured 3D visual information from fMRI signals. The framework evaluates the feasibility of not only reconstructing 3D objects from the human mind but also generating, for the first time, 3D textured meshes with detailed textures from fMRI data. We establish new benchmarks by designing metrics at the semantic, structural, and textured levels to evaluate model performance. Furthermore, we assess the model's effectiveness in out-of-distribution settings and analyze the attribution of the proposed 3D pari fMRI dataset in visual regions of interest (ROIs) in fMRI signals. Our experiments demonstrate that MinD-3D++ not only reconstructs 3D objects with high semantic and spatial accuracy but also provides deeper insights into how the human brain processes 3D visual information. Project page: https://jianxgao.github.io/MinD-3D.
CVMar 26, 2023
Joint fMRI Decoding and Encoding with Latent Embedding AlignmentXuelin Qian, Yikai Wang, Yanwei Fu et al.
The connection between brain activity and corresponding visual stimuli is crucial in comprehending the human brain. While deep generative models have exhibited advancement in recovering brain recordings by generating images conditioned on fMRI signals, accomplishing high-quality generation with consistent semantics continues to pose challenges. Moreover, the prediction of brain activity from visual stimuli remains a formidable undertaking. In this paper, we introduce a unified framework that addresses both fMRI decoding and encoding. Commencing with the establishment of two latent spaces capable of representing and reconstructing fMRI signals and visual images, respectively, we proceed to align the fMRI signals and visual images within the latent space, thereby enabling a bidirectional transformation between the two domains. Our Latent Embedding Alignment (LEA) model concurrently recovers visual stimuli from fMRI signals and predicts brain activity from images within a unified framework. The performance of LEA surpasses that of existing methods on multiple benchmark fMRI decoding and encoding datasets. By integrating fMRI decoding and encoding, LEA offers a comprehensive solution for modeling the intricate relationship between brain activity and visual stimuli.
91.8SDMay 29
MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained PriorsGuangyin Bao, Taiping Zeng, Jianfeng Feng et al.
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectral-similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising attempt for auditory neuroscience research and non-invasive speech brain-computer interfaces.
CVJan 15, 2023
Rethinking Precision of Pseudo Label: Test-Time Adaptation via Complementary LearningJiayi Han, Longbin Zeng, Liang Du et al.
In this work, we propose a novel complementary learning approach to enhance test-time adaptation (TTA), which has been proven to exhibit good performance on testing data with distribution shifts such as corruptions. In test-time adaptation tasks, information from the source domain is typically unavailable and the model has to be optimized without supervision for test-time samples. Hence, usual methods assign labels for unannotated data with the prediction by a well-trained source model in an unsupervised learning framework. Previous studies have employed unsupervised objectives, such as the entropy of model predictions, as optimization targets to effectively learn features for test-time samples. However, the performance of the model is easily compromised by the quality of pseudo-labels, since inaccuracies in pseudo-labels introduce noise to the model. Therefore, we propose to leverage the "less probable categories" to decrease the risk of incorrect pseudo-labeling. The complementary label is introduced to designate these categories. We highlight that the risk function of complementary labels agrees with their Vanilla loss formula under the conventional true label distribution. Experiments show that the proposed learning algorithm achieves state-of-the-art performance on different datasets and experiment settings.
IVDec 4, 2022
MouseGAN++: Unsupervised Disentanglement and Contrastive Representation for Multiple MRI Modalities Synthesis and Structural Segmentation of Mouse BrainZiqi Yu, Xiaoyang Han, Shengjie Zhang et al.
Segmenting the fine structure of the mouse brain on magnetic resonance (MR) images is critical for delineating morphological regions, analyzing brain function, and understanding their relationships. Compared to a single MRI modality, multimodal MRI data provide complementary tissue features that can be exploited by deep learning models, resulting in better segmentation results. However, multimodal mouse brain MRI data is often lacking, making automatic segmentation of mouse brain fine structure a very challenging task. To address this issue, it is necessary to fuse multimodal MRI data to produce distinguished contrasts in different brain structures. Hence, we propose a novel disentangled and contrastive GAN-based framework, named MouseGAN++, to synthesize multiple MR modalities from single ones in a structure-preserving manner, thus improving the segmentation performance by imputing missing modalities and multi-modality fusion. Our results demonstrate that the translation performance of our method outperforms the state-of-the-art methods. Using the subsequently learned modality-invariant information as well as the modality-translated images, MouseGAN++ can segment fine brain structures with averaged dice coefficients of 90.0% (T2w) and 87.9% (T1w), respectively, achieving around +10% performance improvement compared to the state-of-the-art algorithms. Our results demonstrate that MouseGAN++, as a simultaneous image synthesis and segmentation method, can be used to fuse cross-modality information in an unpaired manner and yield more robust performance in the absence of multimodal data. We release our method as a mouse brain structural segmentation tool for free academic usage at https://github.com/yu02019.
CVJul 5, 2022
Accelerating Score-based Generative Models with Preconditioned Diffusion SamplingHengyuan Ma, Li Zhang, Xiatian Zhu et al.
Score-based generative models (SGMs) have recently emerged as a promising class of generative models. However, a fundamental limitation is that their inference is very slow due to a need for many (e.g., 2000) iterations of sequential computations. An intuitive acceleration method is to reduce the sampling iterations which however causes severe performance degradation. We investigate this problem by viewing the diffusion sampling process as a Metropolis adjusted Langevin algorithm, which helps reveal the underlying cause to be ill-conditioned curvature. Under this insight, we propose a model-agnostic preconditioned diffusion sampling (PDS) method that leverages matrix preconditioning to alleviate the aforementioned problem. Crucially, PDS is proven theoretically to converge to the original target distribution of a SGM, no need for retraining. Extensive experiments on three image datasets with a variety of resolutions and diversity validate that PDS consistently accelerates off-the-shelf SGMs whilst maintaining the synthesis quality. In particular, PDS can accelerate by up to 29x on more challenging high resolution (1024x1024) image generation.
CVJun 8, 2022
Accelerating Score-based Generative Models for High-Resolution Image SynthesisHengyuan Ma, Li Zhang, Xiatian Zhu et al.
Score-based generative models (SGMs) have recently emerged as a promising class of generative models. The key idea is to produce high-quality images by recurrently adding Gaussian noises and gradients to a Gaussian sample until converging to the target distribution, a.k.a. the diffusion sampling. To ensure stability of convergence in sampling and generation quality, however, this sequential sampling process has to take a small step size and many sampling iterations (e.g., 2000). Several acceleration methods have been proposed with focus on low-resolution generation. In this work, we consider the acceleration of high-resolution generation with SGMs, a more challenging yet more important problem. We prove theoretically that this slow convergence drawback is primarily due to the ignorance of the target distribution. Further, we introduce a novel Target Distribution Aware Sampling (TDAS) method by leveraging the structural priors in space and frequency domains. Extensive experiments on CIFAR-10, CelebA, LSUN, and FFHQ datasets validate that TDAS can consistently accelerate state-of-the-art SGMs, particularly on more challenging high resolution (1024x1024) image generation tasks by up to 18.4x, whilst largely maintaining the synthesis quality. With fewer sampling iterations, TDAS can still generate good quality images. In contrast, the existing methods degrade drastically or even fails completely
CVNov 1, 2023
fMRI-PTE: A Large-scale fMRI Pretrained Transformer Encoder for Multi-Subject Brain Activity DecodingXuelin Qian, Yun Wang, Jingyang Huo et al.
The exploration of brain activity and its decoding from fMRI data has been a longstanding pursuit, driven by its potential applications in brain-computer interfaces, medical diagnostics, and virtual reality. Previous approaches have primarily focused on individual subject analysis, highlighting the need for a more universal and adaptable framework, which is the core motivation behind our work. In this work, we propose fMRI-PTE, an innovative auto-encoder approach for fMRI pre-training, with a focus on addressing the challenges of varying fMRI data dimensions due to individual brain differences. Our approach involves transforming fMRI signals into unified 2D representations, ensuring consistency in dimensions and preserving distinct brain activity patterns. We introduce a novel learning strategy tailored for pre-training 2D fMRI images, enhancing the quality of reconstruction. fMRI-PTE's adaptability with image generators enables the generation of well-represented fMRI features, facilitating various downstream tasks, including within-subject and cross-subject brain activity decoding. Our contributions encompass introducing fMRI-PTE, innovative data transformation, efficient training, a novel learning strategy, and the universal applicability of our approach. Extensive experiments validate and support our claims, offering a promising foundation for further research in this domain.
CVMar 27, 2024Code
NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level ModulationJingyang Huo, Yikai Wang, Xuelin Qian et al.
Recent fMRI-to-image approaches mainly focused on associating fMRI signals with specific conditions of pre-trained diffusion models. These approaches, while producing high-quality images, capture only a limited aspect of the complex information in fMRI signals and offer little detailed control over image creation. In contrast, this paper proposes to directly modulate the generation process of diffusion models using fMRI signals. Our approach, NeuroPictor, divides the fMRI-to-image process into three steps: i) fMRI calibrated-encoding, to tackle multi-individual pre-training for a shared latent space to minimize individual difference and enable the subsequent multi-subject training; ii) fMRI-to-image multi-subject pre-training, perceptually learning to guide diffusion model with high- and low-level conditions across different individuals; iii) fMRI-to-image single-subject refining, similar with step ii but focus on adapting to particular individual. NeuroPictor extracts high-level semantic features from fMRI signals that characterizing the visual stimulus and incrementally fine-tunes the diffusion model with a low-level manipulation network to provide precise structural instructions. By training with about 67,000 fMRI-image pairs from various individuals, our model enjoys superior fMRI-to-image decoding capacity, particularly in the within-subject setting, as evidenced in benchmark datasets. Our code and model are available at https://jingyanghuo.github.io/neuropictor/.
CVDec 15, 2025
LongVie 2: Multimodal Controllable Ultra-Long Video World ModelJianxiong Gao, Zhaoxi Chen, Xian Liu et al.
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach-first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
IVNov 21, 2023
HiFi-Syn: Hierarchical Granularity Discrimination for High-Fidelity Synthesis of MR Images with Structure PreservationZiqi Yu, Botao Zhao, Shengjie Zhang et al.
Synthesizing medical images while preserving their structural information is crucial in medical research. In such scenarios, the preservation of anatomical content becomes especially important. Although recent advances have been made by incorporating instance-level information to guide translation, these methods overlook the spatial coherence of structural-level representation and the anatomical invariance of content during translation. To address these issues, we introduce hierarchical granularity discrimination, which exploits various levels of semantic information present in medical images. Our strategy utilizes three levels of discrimination granularity: pixel-level discrimination using a Brain Memory Bank, structure-level discrimination on each brain structure with a re-weighting strategy to focus on hard samples, and global-level discrimination to ensure anatomical consistency during translation. The image translation performance of our strategy has been evaluated on three independent datasets (UK Biobank, IXI, and BraTS 2018), and it has outperformed state-of-the-art algorithms. Particularly, our model excels not only in synthesizing normal structures but also in handling abnormal (pathological) structures, such as brain tumors, despite the variations in contrast observed across different imaging modalities due to their pathological characteristics. The diagnostic value of synthesized MR images containing brain tumors has been evaluated by radiologists. This indicates that our model may offer an alternative solution in scenarios where specific MR modalities of patients are unavailable. Extensive experiments further demonstrate the versatility of our method, providing unique insights into medical image translation.
CVDec 26, 2025
SLIM-Brain: A Data- and Training-Efficient Foundation Model for fMRI Data AnalysisMo Wang, Junfeng Xia, Wenhao Ye et al.
Foundation models are emerging as a powerful paradigm for fMRI analysis, but current approaches face a dual bottleneck of data- and training-efficiency. Atlas-based methods aggregate voxel signals into fixed regions of interest, reducing data dimensionality but discarding fine-grained spatial details, and requiring extremely large cohorts to train effectively as general-purpose foundation models. Atlas-free methods, on the other hand, operate directly on voxel-level information - preserving spatial fidelity but are prohibitively memory- and compute-intensive, making large-scale pre-training infeasible. We introduce SLIM-Brain (Sample-efficient, Low-memory fMRI Foundation Model for Human Brain), a new atlas-free foundation model that simultaneously improves both data- and training-efficiency. SLIM-Brain adopts a two-stage adaptive design: (i) a lightweight temporal extractor captures global context across full sequences and ranks data windows by saliency, and (ii) a 4D hierarchical encoder (Hiera-JEPA) learns fine-grained voxel-level representations only from the top-$k$ selected windows, while deleting about 70% masked patches. Extensive experiments across seven public benchmarks show that SLIM-Brain establishes new state-of-the-art performance on diverse tasks, while requiring only 4 thousand pre-training sessions and approximately 30% of GPU memory comparing to traditional voxel-level methods.
CVJan 21
The Pictorial Cortex: Zero-Shot Cross-Subject fMRI-to-Image Reconstruction via Compositional Latent ModelingJingyang Huo, Yikai Wang, Yanwei Fu et al.
Decoding visual experiences from human brain activity remains a central challenge at the intersection of neuroscience, neuroimaging, and artificial intelligence. A critical obstacle is the inherent variability of cortical responses: neural activity elicited by the same visual stimulus differs across individuals and trials due to anatomical, functional, cognitive, and experimental factors, making fMRI-to-image reconstruction non-injective. In this paper, we tackle a challenging yet practically meaningful problem: zero-shot cross-subject fMRI-to-image reconstruction, where the visual experience of a previously unseen individual must be reconstructed without subject-specific training. To enable principled evaluation, we present a unified cortical-surface dataset -- UniCortex-fMRI, assembled from multiple visual-stimulus fMRI datasets to provide broad coverage of subjects and stimuli. Our UniCortex-fMRI is particularly processed by standardized data formats to make it possible to explore this possibility in the zero-shot scenario of cross-subject fMRI-to-image reconstruction. To tackle the modeling challenge, we propose PictorialCortex, which models fMRI activity using a compositional latent formulation that structures stimulus-driven representations under subject-, dataset-, and trial-related variability. PictorialCortex operates in a universal cortical latent space and implements this formulation through a latent factorization-composition module, reinforced by paired factorization and re-factorizing consistency regularization. During inference, surrogate latents synthesized under multiple seen-subject conditions are aggregated to guide diffusion-based image synthesis for unseen subjects. Extensive experiments show that PictorialCortex improves zero-shot cross-subject visual reconstruction, highlighting the benefits of compositional latent modeling and multi-dataset training.
LGJun 2, 2025Code
Latent Structured Hopfield Network for Semantic Association and RetrievalChong Li, Xiangyang Xue, Jianfeng Feng et al.
Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. While large pretrained models have shown remarkable progress in modeling semantic memory, the mechanisms for forming associative structures that support episodic memory remain underexplored. Inspired by hippocampal CA3 dynamics and its role in associative memory, we propose the Latent Structured Hopfield Network (LSHN), a biologically inspired framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. LSHN mimics the cortical-hippocampal pathway: a semantic encoder extracts compact latent representations, a latent Hopfield network performs associative refinement through attractor convergence, and a decoder reconstructs perceptual input. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval. Experiments on MNIST, CIFAR-10, and a simulated episodic memory task demonstrate superior performance in recalling corrupted inputs under occlusion and noise, outperforming existing associative memory models. Our work provides a computational perspective on how semantic elements can be dynamically bound into episodic memory traces through biologically grounded attractor mechanisms. Code: https://github.com/fudan-birlab/LSHN.
LGMay 22, 2025Code
Stochastic Forward-Forward Learning through Representational Dimensionality CompressionZhichao Zhu, Yang Qi, Hengyuan Ma et al.
The Forward-Forward (FF) learning algorithm provides a bottom-up alternative to backpropagation (BP) for training neural networks, relying on a layer-wise "goodness" function with well-designed negative samples for contrastive learning. Existing goodness functions are typically defined as the sum of squared postsynaptic activations, neglecting correlated variability between neurons. In this work, we propose a novel goodness function termed dimensionality compression that uses the effective dimensionality (ED) of fluctuating neural responses to incorporate second-order statistical structure. Our objective minimizes ED for noisy copies of individual inputs while maximizing it across the sample distribution, promoting structured representations without the need to prepare negative samples.We demonstrate that this formulation achieves competitive performance compared to other non-BP methods. Moreover, we show that noise plays a constructive role that can enhance generalization and improve inference when predictions are derived from the mean of squared output, which is equivalent to making predictions based on an energy term. Our findings contribute to the development of more biologically plausible learning algorithms and suggest a natural fit for neuromorphic computing, where stochasticity is a computational resource rather than a nuisance. The code is available at https://github.com/ZhichaoZhu/StochasticForwardForward
LGFeb 20, 2022Code
Clustering by the Probability Distributions from Extreme Value TheorySixiao Zheng, Ke Fan, Yanxi Hou et al.
Clustering is an essential task to unsupervised learning. It tries to automatically separate instances into coherent subsets. As one of the most well-known clustering algorithms, k-means assigns sample points at the boundary to a unique cluster, while it does not utilize the information of sample distribution or density. Comparably, it would potentially be more beneficial to consider the probability of each sample in a possible cluster. To this end, this paper generalizes k-means to model the distribution of clusters. Our novel clustering algorithm thus models the distributions of distances to centroids over a threshold by Generalized Pareto Distribution (GPD) in Extreme Value Theory (EVT). Notably, we propose the concept of centroid margin distance, use GPD to establish a probability model for each cluster, and perform a clustering algorithm based on the covering probability function derived from GPD. Such a GPD k-means thus enables the clustering algorithm from the probabilistic perspective. Correspondingly, we also introduce a naive baseline, dubbed as Generalized Extreme Value (GEV) k-means. GEV fits the distribution of the block maxima. In contrast, the GPD fits the distribution of distance to the centroid exceeding a sufficiently large threshold, leading to a more stable performance of GPD k-means. Notably, GEV k-means can also estimate cluster structure and thus perform reasonably well over classical k-means. Thus, extensive experiments on synthetic datasets and real datasets demonstrate that GPD k-means outperforms competitors. The github codes are released in https://github.com/sixiaozheng/EVT-K-means.
CVMar 30, 2021Code
Depth-conditioned Dynamic Message Propagation for Monocular 3D Object DetectionLi Wang, Liang Du, Xiaoqing Ye et al.
The objective of this paper is to learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection. We make following contributions: (i) rather than appealing to the complicated pseudo-LiDAR based approach, we propose a depth-conditioned dynamic message propagation (DDMP) network to effectively integrate the multi-scale depth information with the image context;(ii) this is achieved by first adaptively sampling context-aware nodes in the image context and then dynamically predicting hybrid depth-dependent filter weights and affinity matrices for propagating information; (iii) by augmenting a center-aware depth encoding (CDE) task, our method successfully alleviates the inaccurate depth prior; (iv) we thoroughly demonstrate the effectiveness of our proposed approach and show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset. Particularly, we rank $1^{st}$ in the highly competitive KITTI monocular 3D object detection track on the submission day (November 16th, 2020). Code and models are released at \url{https://github.com/fudan-zvg/DDMP}
CVFeb 5, 2021Code
GaitSet: Cross-view Gait Recognition through Utilizing Gait as a Deep SetHanqing Chao, Kun Wang, Yiwei He et al.
Gait is a unique biometric feature that can be recognized at a distance; thus, it has broad applications in crime prevention, forensic identification, and social security. To portray a gait, existing gait recognition methods utilize either a gait template which makes it difficult to preserve temporal information, or a gait sequence that maintains unnecessary sequential constraints and thus loses the flexibility of gait recognition. In this paper, we present a novel perspective that utilizes gait as a deep set, which means that a set of gait frames are integrated by a global-local fused deep network inspired by the way our left- and right-hemisphere processes information to learn information that can be used in identification. Based on this deep set perspective, our method is immune to frame permutations, and can naturally integrate frames from different videos that have been acquired under different scenarios, such as diverse viewing angles, different clothes, or different item-carrying conditions. Experiments show that under normal walking conditions, our single-model method achieves an average rank-1 accuracy of 96.1% on the CASIA-B gait dataset and an accuracy of 87.9% on the OU-MVLP gait dataset. Under various complex scenarios, our model also exhibits a high level of robustness. It achieves accuracies of 90.8% and 70.3% on CASIA-B under bag-carrying and coat-wearing walking conditions respectively, significantly outperforming the best existing methods. Moreover, the proposed method maintains a satisfactory accuracy even when only small numbers of frames are available in the test samples; for example, it achieves 85.0% on CASIA-B even when using only 7 frames. The source code has been released at https://github.com/AbnerHqC/GaitSet.
CVNov 15, 2018Code
GaitSet: Regarding Gait as a Set for Cross-View Gait RecognitionHanqing Chao, Yiwei He, Junping Zhang et al.
As a unique biometric feature that can be recognized at a distance, gait has broad applications in crime prevention, forensic identification and social security. To portray a gait, existing gait recognition methods utilize either a gait template, where temporal information is hard to preserve, or a gait sequence, which must keep unnecessary sequential constraints and thus loses the flexibility of gait recognition. In this paper we present a novel perspective, where a gait is regarded as a set consisting of independent frames. We propose a new network named GaitSet to learn identity information from the set. Based on the set perspective, our method is immune to permutation of frames, and can naturally integrate frames from different videos which have been filmed under different scenarios, such as diverse viewing angles, different clothes/carrying conditions. Experiments show that under normal walking conditions, our single-model method achieves an average rank-1 accuracy of 95.0% on the CASIA-B gait dataset and an 87.1% accuracy on the OU-MVLP gait dataset. These results represent new state-of-the-art recognition accuracy. On various complex scenarios, our model exhibits a significant level of robustness. It achieves accuracies of 87.2% and 70.4% on CASIA-B under bag-carrying and coat-wearing walking conditions, respectively. These outperform the existing best methods by a large margin. The method presented can also achieve a satisfactory accuracy with a small number of frames in a test sample, e.g., 82.5% on CASIA-B with only 7 frames. The source code has been released at https://github.com/AbnerHqC/GaitSet.
CVDec 12, 2023
MinD-3D: Reconstruct High-quality 3D objects in Human BrainJianxiong Gao, Yuqian Fu, Yun Wang et al.
In this paper, we introduce Recon3DMind, an innovative task aimed at reconstructing 3D visuals from Functional Magnetic Resonance Imaging (fMRI) signals, marking a significant advancement in the fields of cognitive neuroscience and computer vision. To support this pioneering task, we present the fMRI-Shape dataset, which includes data from 14 participants and features 360-degree videos of 3D objects to enable comprehensive fMRI signal capture across various settings, thereby laying a foundation for future research. Furthermore, we propose MinD-3D, a novel and effective three-stage framework specifically designed to decode the brain's 3D visual information from fMRI signals, demonstrating the feasibility of this challenging task. The framework begins by extracting and aggregating features from fMRI frames through a neuro-fusion encoder, subsequently employs a feature bridge diffusion model to generate visual features, and ultimately recovers the 3D object via a generative transformer decoder. We assess the performance of MinD-3D using a suite of semantic and structural metrics and analyze the correlation between the features extracted by our model and the visual regions of interest (ROIs) in fMRI signals. Our findings indicate that MinD-3D not only reconstructs 3D objects with high semantic relevance and spatial similarity but also significantly enhances our understanding of the human brain's capabilities in processing 3D visual information. Project page at: https://jianxgao.github.io/MinD-3D.
CVDec 30, 2024
4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D PrimitivesZeyu Yang, Zijie Pan, Xiatian Zhu et al.
Dynamic 3D scene representation and novel view synthesis are crucial for enabling immersive experiences required by AR/VR and metaverse applications. It is a challenging task due to the complexity of unconstrained real-world scenes and their temporal dynamics. In this paper, we reformulate the reconstruction of a time-varying 3D scene as approximating its underlying spatiotemporal 4D volume by optimizing a collection of native 4D primitives, i.e., 4D Gaussians, with explicit geometry and appearance modeling. Equipped with a tailored rendering pipeline, our representation can be end-to-end optimized using only photometric supervision while free viewpoint viewing at interactive frame rate, making it suitable for representing real world scene with complex dynamic. This approach has been the first solution to achieve real-time rendering of high-resolution, photorealistic novel views for complex dynamic scenes. To facilitate real-world applications, we derive several compact variants that effectively reduce the memory footprint to address its storage bottleneck. Extensive experiments validate the superiority of 4DGS in terms of visual quality and efficiency across a range of dynamic scene-related tasks (e.g., novel view synthesis, 4D generation, scene understanding) and scenarios (e.g., single object, indoor scenes, driving environments, synthetic and real data).
CVMar 10, 2025
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative ProcessingJianxiong Gao, Yichang Liu, Baofeng Yang et al.
In this paper, we introduce CineBrain, the first large-scale dataset featuring simultaneous EEG and fMRI recordings during dynamic audiovisual stimulation. Recognizing the complementary strengths of EEG's high temporal resolution and fMRI's deep-brain spatial coverage, CineBrain provides approximately six hours of narrative-driven content from the popular television series The Big Bang Theory for each of six participants. Building upon this unique dataset, we propose CineSync, an innovative multimodal decoding framework integrates a Multi-Modal Fusion Encoder with a diffusion-based Neural Latent Decoder. Our approach effectively fuses EEG and fMRI signals, significantly improving the reconstruction quality of complex audiovisual stimuli. To facilitate rigorous evaluation, we introduce Cine-Benchmark, a comprehensive evaluation protocol that assesses reconstructions across semantic and perceptual dimensions. Experimental results demonstrate that CineSync achieves state-of-the-art video reconstruction performance and highlight our initial success in combining fMRI and EEG for reconstructing both video and audio stimuli. Project Page: https://jianxgao.github.io/CineBrain.
CVMar 18, 2024
OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy RepresentationHaochen Jiang, Yueming Xu, Yihan Zeng et al.
3D reconstruction has been widely used in autonomous navigation fields of mobile robotics. However, the former research can only provide the basic geometry structure without the capability of open-world scene understanding, limiting advanced tasks like human interaction and visual navigation. Moreover, traditional 3D scene understanding approaches rely on expensive labeled 3D datasets to train a model for a single task with supervision. Thus, geometric reconstruction with zero-shot scene understanding i.e. Open vocabulary 3D Understanding and Reconstruction, is crucial for the future development of mobile robots. In this paper, we propose OpenOcc, a novel framework unifying the 3D scene reconstruction and open vocabulary understanding with neural radiance fields. We model the geometric structure of the scene with occupancy representation and distill the pre-trained open vocabulary model into a 3D language field via volume rendering for zero-shot inference. Furthermore, a novel semantic-aware confidence propagation (SCP) method has been proposed to relieve the issue of language field representation degeneracy caused by inconsistent measurements in distilled features. Experimental results show that our approach achieves competitive performance in 3D scene understanding tasks, especially for small and long-tail objects.
CVAug 5, 2025
LongVie: Multimodal-Guided Controllable Ultra-Long Video GenerationJianxiong Gao, Zhaoxi Chen, Xian Liu et al.
Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.
MLMar 13, 2024
Efficient Combinatorial Optimization via Heat DiffusionHengyuan Ma, Wenlian Lu, Jianfeng Feng
Combinatorial optimization problems are widespread but inherently challenging due to their discrete nature. The primary limitation of existing methods is that they can only access a small fraction of the solution space at each iteration, resulting in limited efficiency for searching the global optimal. To overcome this challenge, diverging from conventional efforts of expanding the solver's search scope, we focus on enabling information to actively propagate to the solver through heat diffusion. By transforming the target function while preserving its optima, heat diffusion facilitates information flow from distant regions to the solver, providing more efficient navigation. Utilizing heat diffusion, we propose a framework for solving general combinatorial optimization problems. The proposed methodology demonstrates superior performance across a range of the most challenging and widely encountered combinatorial optimizations. Echoing recent advancements in harnessing thermodynamics for generative artificial intelligence, our study further reveals its significant potential in advancing combinatorial optimization.
CVJan 16, 2025
Making Your Dreams A Reality: Decoding the Dreams into a Coherent Video Story from fMRI SignalsYanwei Fu, Jianxiong Gao, Baofeng Yang et al.
This paper studies the brave new idea for Multimedia community, and proposes a novel framework to convert dreams into coherent video narratives using fMRI data. Essentially, dreams have intrigued humanity for centuries, offering glimpses into our subconscious minds. Recent advancements in brain imaging, particularly functional magnetic resonance imaging (fMRI), have provided new ways to explore the neural basis of dreaming. By combining subjective dream experiences with objective neurophysiological data, we aim to understand the visual aspects of dreams and create complete video narratives. Our process involves three main steps: reconstructing visual perception, decoding dream imagery, and integrating dream stories. Using innovative techniques in fMRI analysis and language modeling, we seek to push the boundaries of dream research and gain deeper insights into visual experiences during sleep. This technical report introduces a novel approach to visually decoding dreams using fMRI signals and weaving dream visuals into narratives using language models. We gather a dataset of dreams along with descriptions to assess the effectiveness of our framework.
CVApr 1, 2025
DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video DecodingChong Li, Jingyang Huo, Weikang Gong et al.
Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components - semantic, spatial, and motion - then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterpart, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.
CVJan 4, 2024
Bayesian Unsupervised Disentanglement of Anatomy and Geometry for Deep Groupwise Image RegistrationXinzhe Luo, Xin Wang, Linda Shapiro et al.
This article presents a general Bayesian learning framework for multi-modal groupwise image registration. The method builds on probabilistic modelling of the image generative process, where the underlying common anatomy and geometric variations of the observed images are explicitly disentangled as latent variables. Therefore, groupwise image registration is achieved via hierarchical Bayesian inference. We propose a novel hierarchical variational auto-encoding architecture to realise the inference procedure of the latent variables, where the registration parameters can be explicitly estimated in a mathematically interpretable fashion. Remarkably, this new paradigm learns groupwise image registration in an unsupervised closed-loop self-reconstruction process, sparing the burden of designing complex image-based similarity measures. The computationally efficient disentangled network architecture is also inherently scalable and flexible, allowing for groupwise registration on large-scale image groups with variable sizes. Furthermore, the inferred structural representations from multi-modal images via disentanglement learning are capable of capturing the latent anatomy of the observations with visual semantics. Extensive experiments were conducted to validate the proposed framework, including four different datasets from cardiac, brain, and abdominal medical images. The results have demonstrated the superiority of our method over conventional similarity-based approaches in terms of accuracy, efficiency, scalability, and interpretability.
CLMar 17, 2024
Decoding Continuous Character-based Language from Non-invasive Brain RecordingsCenyuan Zhang, Xiaoqing Zheng, Ruicheng Yin et al.
Deciphering natural language from brain activity through non-invasive devices remains a formidable challenge. Previous non-invasive decoders either require multiple experiments with identical stimuli to pinpoint cortical regions and enhance signal-to-noise ratios in brain activity, or they are limited to discerning basic linguistic elements such as letters and words. We propose a novel approach to decoding continuous language from single-trial non-invasive fMRI recordings, in which a three-dimensional convolutional network augmented with information bottleneck is developed to automatically identify responsive voxels to stimuli, and a character-based decoder is designed for the semantic reconstruction of continuous language characterized by inherent character structures. The resulting decoder can produce intelligible textual sequences that faithfully capture the meaning of perceived speech both within and across subjects, while existing decoders exhibit significantly inferior performance in cross-subject contexts. The ability to decode continuous language from single trials across subjects demonstrates the promising applications of non-invasive language brain-computer interfaces in both healthcare and neuroscience.
LGMay 30, 2023
Probabilistic computation and uncertainty quantification with emerging covarianceHengyuan Ma, Yang Qi, Li Zhang et al.
Building robust, interpretable, and secure AI system requires quantifying and representing uncertainty under a probabilistic perspective to mimic human cognitive abilities. However, probabilistic computation presents significant challenges for most conventional artificial neural network, as they are essentially implemented in a deterministic manner. In this paper, we develop an efficient probabilistic computation framework by truncating the probabilistic representation of neural activation up to its mean and covariance and construct a moment neural network that encapsulates the nonlinear coupling between the mean and covariance of the underlying stochastic network. We reveal that when only the mean but not the covariance is supervised during gradient-based learning, the unsupervised covariance spontaneously emerges from its nonlinear coupling with the mean and faithfully captures the uncertainty associated with model predictions. Our findings highlight the inherent simplicity of probabilistic computation by seamlessly incorporating uncertainty into model prediction, paving the way for integrating it into large-scale AI systems.
CVDec 28, 2021
The Devil is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object DetectionZhikang Zou, Xiaoqing Ye, Liang Du et al.
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving, whereas its accuracy is still far from satisfactory. In this paper, we dig into the 3D object detection task and reformulate it as the sub-tasks of object localization and appearance perception, which benefits to a deep excavation of reciprocal information underlying the entire task. We introduce a Dynamic Feature Reflecting Network, named DFR-Net, which contains two novel standalone modules: (i) the Appearance-Localization Feature Reflecting module (ALFR) that first separates taskspecific features and then self-mutually reflects the reciprocal features; (ii) the Dynamic Intra-Trading module (DIT) that adaptively realigns the training processes of various sub-tasks via a self-learning manner. Extensive experiments on the challenging KITTI dataset demonstrate the effectiveness and generalization of DFR-Net. We rank 1st among all the monocular 3D object detectors in the KITTI test set (till March 16th, 2021). The proposed method is also easy to be plug-and-play in many cutting-edge 3D detection frameworks at negligible cost to boost performance. The code will be made publicly available.
CVDec 3, 2021
SGM3D: Stereo Guided Monocular 3D Object DetectionZheyuan Zhou, Liang Du, Xiaoqing Ye et al.
Monocular 3D object detection aims to predict the object location, dimension and orientation in 3D space alongside the object category given only a monocular image. It poses a great challenge due to its ill-posed property which is critically lack of depth information in the 2D image plane. While there exist approaches leveraging off-the-shelve depth estimation or relying on LiDAR sensors to mitigate this problem, the dependence on the additional depth model or expensive equipment severely limits their scalability to generic 3D perception. In this paper, we propose a stereo-guided monocular 3D object detection framework, dubbed SGM3D, adapting the robust 3D features learned from stereo inputs to enhance the feature for monocular detection. We innovatively present a multi-granularity domain adaptation (MG-DA) mechanism to exploit the network's ability to generate stereo-mimicking features given only on monocular cues. Coarse BEV feature-level, as well as the fine anchor-level domain adaptation, are both leveraged for guidance in the monocular domain.In addition, we introduce an IoU matching-based alignment (IoU-MA) method for object-level domain adaptation between the stereo and monocular predictions to alleviate the mismatches while adopting the MG-DA. Extensive experiments demonstrate state-of-the-art results on KITTI and Lyft datasets.
CVDec 31, 2020
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with TransformersSixiao Zheng, Jiachen Lu, Hengshuang Zhao et al.
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (ie, without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
CVJun 8, 2020
Associate-3Ddet: Perceptual-to-Conceptual Association for 3D Point Cloud Object DetectionLiang Du, Xiaoqing Ye, Xiao Tan et al.
Object detection from 3D point clouds remains a challenging task, though recent studies pushed the envelope with the deep learning techniques. Owing to the severe spatial occlusion and inherent variance of point density with the distance to sensors, appearance of a same object varies a lot in point cloud data. Designing robust feature representation against such appearance changes is hence the key issue in a 3D object detection method. In this paper, we innovatively propose a domain adaptation like approach to enhance the robustness of the feature representation. More specifically, we bridge the gap between the perceptual domain where the feature comes from a real scene and the conceptual domain where the feature is extracted from an augmented scene consisting of non-occlusion point cloud rich of detailed information. This domain adaptation approach mimics the functionality of the human brain when proceeding object perception. Extensive experiments demonstrate that our simple yet effective approach fundamentally boosts the performance of 3D point cloud object detection and achieves the state-of-the-art results.
ROMar 1, 2020
3DCFS: Fast and Robust Joint 3D Semantic-Instance Segmentation via Coupled Feature SelectionLiang Du, Jingang Tan, Xiangyang Xue et al.
We propose a novel fast and robust 3D point clouds segmentation framework via coupled feature selection, named 3DCFS, that jointly performs semantic and instance segmentation. Inspired by the human scene perception process, we design a novel coupled feature selection module, named CFSM, that adaptively selects and fuses the reciprocal semantic and instance features from two tasks in a coupled manner. To further boost the performance of the instance segmentation task in our 3DCFS, we investigate a loss function that helps the model learn to balance the magnitudes of the output embedding dimensions during training, which makes calculating the Euclidean distance more reliable and enhances the generalizability of the model. Extensive experiments demonstrate that our 3DCFS outperforms state-of-the-art methods on benchmark datasets in terms of accuracy, speed and computational cost.
CVOct 28, 2017
Dual Skipping NetworksChangmao Cheng, Yanwei Fu, Yu-Gang Jiang et al.
Inspired by the recent neuroscience studies on the left-right asymmetry of the human brain in processing low and high spatial frequency information, this paper introduces a dual skipping network which carries out coarse-to-fine object categorization. Such a network has two branches to simultaneously deal with both coarse and fine-grained classification tasks. Specifically, we propose a layer-skipping mechanism that learns a gating network to predict which layers to skip in the testing stage. This layer-skipping mechanism endows the network with good flexibility and capability in practice. Evaluations are conducted on several widely used coarse-to-fine object categorization benchmarks, and promising results are achieved by our proposed network model.