CVJun 4, 2023Code
Scale Guided Hypernetwork for Blind Super-Resolution Image Quality AssessmentJun Fu
With the emergence of image super-resolution (SR) algorithm, how to blindly evaluate the quality of super-resolution images has become an urgent task. However, existing blind SR image quality assessment (IQA) metrics merely focus on visual characteristics of super-resolution images, ignoring the available scale information. In this paper, we reveal that the scale factor has a statistically significant impact on subjective quality scores of SR images, indicating that the scale information can be used to guide the task of blind SR IQA. Motivated by this, we propose a scale guided hypernetwork framework that evaluates SR image quality in a scale-adaptive manner. Specifically, the blind SR IQA procedure is divided into three stages, i.e., content perception, evaluation rule generation, and quality prediction. After content perception, a hypernetwork generates the evaluation rule used in quality prediction based on the scale factor of the SR image. We apply the proposed scale guided hypernetwork framework to existing representative blind IQA metrics, and experimental results show that the proposed framework not only boosts the performance of these IQA metrics but also enhances their generalization abilities. Source code will be available at https://github.com/JunFu1995/SGH.
MMJul 13, 2022
RTN: Reinforced Transformer Network for Coronary CT Angiography Vessel-level Image Quality AssessmentYiting Lu, Jun Fu, Xin Li et al.
Coronary CT Angiography (CCTA) is susceptible to various distortions (e.g., artifacts and noise), which severely compromise the exact diagnosis of cardiovascular diseases. The appropriate CCTA Vessel-level Image Quality Assessment (CCTA VIQA) algorithm can be used to reduce the risk of error diagnosis. The primary challenges of CCTA VIQA are that the local part of coronary that determines final quality is hard to locate. To tackle the challenge, we formulate CCTA VIQA as a multiple-instance learning (MIL) problem, and exploit Transformer-based MIL backbone (termed as T-MIL) to aggregate the multiple instances along the coronary centerline into the final quality. However, not all instances are informative for final quality. There are some quality-irrelevant/negative instances intervening the exact quality assessment(e.g., instances covering only background or the coronary in instances is not identifiable). Therefore, we propose a Progressive Reinforcement learning based Instance Discarding module (termed as PRID) to progressively remove quality-irrelevant/negative instances for CCTA VIQA. Based on the above two modules, we propose a Reinforced Transformer Network (RTN) for automatic CCTA VIQA based on end-to-end optimization. Extensive experimental results demonstrate that our proposed method achieves the state-of-the-art performance on the real-world CCTA dataset, exceeding previous MIL methods by a large margin.
LGJun 20, 2023
Variational Disentangled Graph Auto-Encoders for Link PredictionJun Fu, Xiaojuan Zhang, Shuang Li et al. · tsinghua
With the explosion of graph-structured data, link prediction has emerged as an increasingly important task. Embedding methods for link prediction utilize neural networks to generate node embeddings, which are subsequently employed to predict links between nodes. However, the existing embedding methods typically take a holistic strategy to learn node embeddings and ignore the entanglement of latent factors. As a result, entangled embeddings fail to effectively capture the underlying information and are vulnerable to irrelevant information, leading to unconvincing and uninterpretable link prediction results. To address these challenges, this paper proposes a novel framework with two variants, the disentangled graph auto-encoder (DGAE) and the variational disentangled graph auto-encoder (VDGAE). Our work provides a pioneering effort to apply the disentanglement strategy to link prediction. The proposed framework infers the latent factors that cause edges in the graph and disentangles the representation into multiple channels corresponding to unique latent factors, which contributes to improving the performance of link prediction. To further encourage the embeddings to capture mutually exclusive latent factors, we introduce mutual information regularization to enhance the independence among different channels. Extensive experiments on various real-world benchmarks demonstrate that our proposed methods achieve state-of-the-art results compared to a variety of strong baselines on link prediction tasks. Qualitative analysis on the synthetic dataset also illustrates that the proposed methods can capture distinct latent factors that cause links, providing empirical evidence that our models are able to explain the results of link prediction to some extent. All code will be made publicly available upon publication of the paper.
LGJun 20, 2023
Contrastive Disentangled Learning on Graph for Node ClassificationXiaojuan Zhang, Jun Fu, Shuang Li · tsinghua
Contrastive learning methods have attracted considerable attention due to their remarkable success in analyzing graph-structured data. Inspired by the success of contrastive learning, we propose a novel framework for contrastive disentangled learning on graphs, employing a disentangled graph encoder and two carefully crafted self-supervision signals. Specifically, we introduce a disentangled graph encoder to enforce the framework to distinguish various latent factors corresponding to underlying semantic information and learn the disentangled node embeddings. Moreover, to overcome the heavy reliance on labels, we design two self-supervision signals, namely node specificity and channel independence, which capture informative knowledge without the need for labeled data, thereby guiding the automatic disentanglement of nodes. Finally, we perform node classification tasks on three citation networks by using the disentangled node embeddings, and the relevant analysis is provided. Experimental results validate the effectiveness of the proposed framework compared with various baselines.
IVJun 7, 2022
Parotid Gland MRI Segmentation Based on Swin-Unet and Multimodal ImagesZi'an Xu, Yin Dai, Fayu Liu et al.
Background and objective: Parotid gland tumors account for approximately 2% to 10% of head and neck tumors. Preoperative tumor localization, differential diagnosis, and subsequent selection of appropriate treatment for parotid gland tumors are critical. However, the relative rarity of these tumors and the highly dispersed tissue types have left an unmet need for a subtle differential diagnosis of such neoplastic lesions based on preoperative radiomics. Recently, deep learning methods have developed rapidly, especially Transformer beats the traditional convolutional neural network in computer vision. Many new Transformer-based networks have been proposed for computer vision tasks. Methods: In this study, multicenter multimodal parotid gland MR images were collected. The Swin-Unet which was based on Transformer was used. MR images of short time inversion recovery, T1-weighted and T2-weighted modalities were combined into three-channel data to train the network. We achieved segmentation of the region of interest for parotid gland and tumor. Results: The Dice-Similarity Coefficient of the model on the test set was 88.63%, Mean Pixel Accuracy was 99.31%, Mean Intersection over Union was 83.99%, and Hausdorff Distance was 3.04. Then a series of comparison experiments were designed in this paper to further validate the segmentation performance of the algorithm. Conclusions: Experimental results showed that our method has good results for parotid gland and tumor segmentation. The Transformer-based network outperforms the traditional convolutional neural network in the field of medical images.
17.7CVApr 20Code
ParkingScenes: A Structured Dataset for End-to-End Autonomous Parking in Simulation ScenesHaonan Chen, Kaiwen Xiao, Bin Tian et al.
Autonomous parking remains a critical yet challenging task in intelligent driving systems, particularly within constrained urban environments where maneuvering space is limited and precise control is essential. While recent advances in end-to-end learning have shown great promise, the lack of high-quality, structured datasets tailored for parking scenarios remains a significant bottleneck.To address this gap, we present ParkingScenes, a comprehensive multimodal dataset specifically designed for end-to-end autonomous parking in simulated scenes. Built on the CARLA simulator, ParkingScenes features structured parking trajectories generated by a Hybrid A* planner and a Model Predictive Controller (MPC), providing accurate and reproducible supervision signals. The dataset includes 16 reverse-in and 6 parallel parking scenarios, each executed under two pedestrian conditions (present and absent), resulting in 704 structured episodes and approximately 105000 frames. Each scenario is repeated 16 times to ensure consistent coverage. Each frame contains synchronized data from four RGB cameras, four depth sensors, vehicle motion states, and Bird's-Eye View (BEV) representations, enabling rich multimodal fusion and context-aware learning. To demonstrate the utility of our dataset, we compare models trained on ParkingScenes with those trained on unstructured, manually collected simulation data under identical conditions. Results show significant improvements in performance, underscoring the effectiveness of structured supervision for robust and accurate parking policy learning. By releasing both the dataset and the collection framework, ParkingScenes establishes a scalable and reproducible benchmark for advancing learning-based autonomous parking systems. The dataset and collection framework will be released at: https://github.com/haonan-ai/ParkingScenes
CVFeb 3, 2025Code
CLIP-DQA: Blindly Evaluating Dehazed Images from Global and Local Perspectives Using CLIPYirui Zeng, Jun Fu, Hadi Amirpour et al.
Blind dehazed image quality assessment (BDQA), which aims to accurately predict the visual quality of dehazed images without any reference information, is essential for the evaluation, comparison, and optimization of image dehazing algorithms. Existing learning-based BDQA methods have achieved remarkable success, while the small scale of DQA datasets limits their performance. To address this issue, in this paper, we propose to adapt Contrastive Language-Image Pre-Training (CLIP), pre-trained on large-scale image-text pairs, to the BDQA task. Specifically, inspired by the fact that the human visual system understands images based on hierarchical features, we take global and local information of the dehazed image as the input of CLIP. To accurately map the input hierarchical information of dehazed images into the quality score, we tune both the vision branch and language branch of CLIP with prompt learning. Experimental results on two authentic DQA datasets demonstrate that our proposed approach, named CLIP-DQA, achieves more accurate quality predictions over existing BDQA methods. The code is available at https://github.com/JunFu1995/CLIP-DQA.
CVJun 20, 2025Code
ParkFormer: A Transformer-Based Parking Policy with Goal Embedding and Pedestrian-Aware ControlJun Fu, Bin Tian, Haonan Chen et al.
Autonomous parking plays a vital role in intelligent vehicle systems, particularly in constrained urban environments where high-precision control is required. While traditional rule-based parking systems struggle with environmental uncertainties and lack adaptability in crowded or dynamic scenes, human drivers demonstrate the ability to park intuitively without explicit modeling. Inspired by this observation, we propose a Transformer-based end-to-end framework for autonomous parking that learns from expert demonstrations. The network takes as input surround-view camera images, goal-point representations, ego vehicle motion, and pedestrian trajectories. It outputs discrete control sequences including throttle, braking, steering, and gear selection. A novel cross-attention module integrates BEV features with target points, and a GRU-based pedestrian predictor enhances safety by modeling dynamic obstacles. We validate our method on the CARLA 0.9.14 simulator in both vertical and parallel parking scenarios. Experiments show our model achieves a high success rate of 96.57\%, with average positional and orientation errors of 0.21 meters and 0.41 degrees, respectively. The ablation studies further demonstrate the effectiveness of key modules such as pedestrian prediction and goal-point attention fusion. The code and dataset will be released at: https://github.com/little-snail-f/ParkFormer.
CVOct 21, 2025Code
Cross-Modal Scene Semantic Alignment for Image Complexity AssessmentYuqing Luo, Yixiao Li, Jiang Liu et al.
Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at https://github.com/XQ2K/First-Cross-Model-ICA.
CVJun 24, 2024Code
Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality AssessmentJun Fu, Wei Zhou, Qiuping Jiang et al.
Recently, textual prompt tuning has shown inspirational performance in adapting Contrastive Language-Image Pre-training (CLIP) models to natural image quality assessment. However, such uni-modal prompt learning method only tunes the language branch of CLIP models. This is not enough for adapting CLIP models to AI generated image quality assessment (AGIQA) since AGIs visually differ from natural images. In addition, the consistency between AGIs and user input text prompts, which correlates with the perceptual quality of AGIs, is not investigated to guide AGIQA. In this letter, we propose vision-language consistency guided multi-modal prompt learning for blind AGIQA, dubbed CLIP-AGIQA. Specifically, we introduce learnable textual and visual prompts in language and vision branches of CLIP models, respectively. Moreover, we design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts. Experimental results on two public AGIQA datasets demonstrate that the proposed method outperforms state-of-the-art quality assessment models. The source code is available at https://github.com/JunFu1995/CLIP-AGIQA.
CVSep 9, 2018Code
Dual Attention Network for Scene SegmentationJun Fu, Jing Liu, Haijie Tian et al.
In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the selfattention mechanism. Unlike previous works that capture contexts by multi-scale features fusion, we propose a Dual Attention Networks (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results. We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset. In particular, a Mean IoU score of 81.5% on Cityscapes test set is achieved without using coarse data. We make the code and trained model publicly available at https://github.com/junfu1115/DANet
CVAug 18, 2025
Structure-preserving Feature Alignment for Old Photo ColorizationYingxue Pang, Xin Jin, Jun Fu et al.
Deep learning techniques have made significant advancements in reference-based colorization by training on large-scale datasets. However, directly applying these methods to the task of colorizing old photos is challenging due to the lack of ground truth and the notorious domain gap between natural gray images and old photos. To address this issue, we propose a novel CNN-based algorithm called SFAC, i.e., Structure-preserving Feature Alignment Colorizer. SFAC is trained on only two images for old photo colorization, eliminating the reliance on big data and allowing direct processing of the old photo itself to overcome the domain gap problem. Our primary objective is to establish semantic correspondence between the two images, ensuring that semantically related objects have similar colors. We achieve this through a feature distribution alignment loss that remains robust to different metric choices. However, utilizing robust semantic correspondence to transfer color from the reference to the old photo can result in inevitable structure distortions. To mitigate this, we introduce a structure-preserving mechanism that incorporates a perceptual constraint at the feature level and a frozen-updated pyramid at the pixel level. Extensive experiments demonstrate the effectiveness of our method for old photo colorization, as confirmed by qualitative and quantitative metrics.
CVSep 8, 2025
Perception-oriented Bidirectional Attention Network for Image Super-resolution Quality AssessmentYixiao Li, Xiaoyuan Yang, Guanghui Yue et al.
Many super-resolution (SR) algorithms have been proposed to increase image resolution. However, full-reference (FR) image quality assessment (IQA) metrics for comparing and evaluating different SR algorithms are limited. In this work, we propose the Perception-oriented Bidirectional Attention Network (PBAN) for image SR FR-IQA, which is composed of three modules: an image encoder module, a perception-oriented bidirectional attention (PBA) module, and a quality prediction module. First, we encode the input images for feature representations. Inspired by the characteristics of the human visual system, we then construct the perception-oriented PBA module. Specifically, different from existing attention-based SR IQA methods, we conceive a Bidirectional Attention to bidirectionally construct visual attention to distortion, which is consistent with the generation and evaluation processes of SR images. To further guide the quality assessment towards the perception of distorted information, we propose Grouped Multi-scale Deformable Convolution, enabling the proposed method to adaptively perceive distortion. Moreover, we design Sub-information Excitation Convolution to direct visual perception to both sub-pixel and sub-channel attention. Finally, the quality prediction module is exploited to integrate quality-aware features and regress quality scores. Extensive experiments demonstrate that our proposed PBAN outperforms state-of-the-art quality assessment methods.
CVOct 20, 2024
MMDS: A Multimodal Medical Diagnosis System Integrating Image Analysis and Knowledge-based Departmental ConsultationYi Ren, HanZhi Zhang, Weibin Li et al.
We present MMDS, a system capable of recognizing medical images and patient facial details, and providing professional medical diagnoses. The system consists of two core components:The first component is the analysis of medical images and videos. We trained a specialized multimodal medical model capable of interpreting medical images and accurately analyzing patients' facial emotions and facial paralysis conditions. The model achieved an accuracy of 72.59% on the FER2013 facial emotion recognition dataset, with a 91.1% accuracy in recognizing the "happy" emotion. In facial paralysis recognition, the model reached an accuracy of 92%, which is 30% higher than that of GPT-4o. Based on this model, we developed a parser for analyzing facial movement videos of patients with facial paralysis, achieving precise grading of the paralysis severity. In tests on 30 videos of facial paralysis patients, the system demonstrated a grading accuracy of 83.3%.The second component is the generation of professional medical responses. We employed a large language model, integrated with a medical knowledge base, to generate professional diagnoses based on the analysis of medical images or videos. The core innovation lies in our development of a department-specific knowledge base routing management mechanism, in which the large language model categorizes data by medical departments and, during the retrieval process, determines the appropriate knowledge base to query. This significantly improves retrieval accuracy in the RAG (retrieval-augmented generation) process.
CVJan 24, 2022
Mutual Attention-based Hybrid Dimensional Network for Multimodal Imaging Computer-aided DiagnosisYin Dai, Yifan Gao, Fayu Liu et al.
Recent works on Multimodal 3D Computer-aided diagnosis have demonstrated that obtaining a competitive automatic diagnosis model when a 3D convolution neural network (CNN) brings more parameters and medical images are scarce remains nontrivial and challenging. Considering both consistencies of regions of interest in multimodal images and diagnostic accuracy, we propose a novel mutual attention-based hybrid dimensional network for MultiModal 3D medical image classification (MMNet). The hybrid dimensional network integrates 2D CNN with 3D convolution modules to generate deeper and more informative feature maps, and reduce the training complexity of 3D fusion. Besides, the pre-trained model of ImageNet can be used in 2D CNN, which improves the performance of the model. The stereoscopic attention is focused on building rich contextual interdependencies of the region in 3D medical images. To improve the regional correlation of pathological tissues in multimodal medical images, we further design a mutual attention framework in the network to build the region-wise consistency in similar stereoscopic regions of different image modalities, providing an implicit manner to instruct the network to focus on pathological tissues. MMNet outperforms many previous solutions and achieves results competitive to the state-of-the-art on three multimodal imaging datasets, i.e., Parotid Gland Tumor (PGT) dataset, the MRNet dataset, and the PROSTATEx dataset, and its advantages are validated by extensive experiments.
CVJan 9, 2022
ImageSubject: A Large-scale Dataset for Subject DetectionXin Miao, Jiayi Liu, Huayan Wang et al.
Main subjects usually exist in the images or videos, as they are the objects that the photographer wants to highlight. Human viewers can easily identify them but algorithms often confuse them with other objects. Detecting the main subjects is an important technique to help machines understand the content of images and videos. We present a new dataset with the goal of training models to understand the layout of the objects and the context of the image then to find the main subjects among them. This is achieved in three aspects. By gathering images from movie shots created by directors with professional shooting skills, we collect the dataset with strong diversity, specifically, it contains 107\,700 images from 21\,540 movie shots. We labeled them with the bounding box labels for two classes: subject and non-subject foreground object. We present a detailed analysis of the dataset and compare the task with saliency detection and object detection. ImageSubject is the first dataset that tries to localize the subject in an image that the photographer wants to highlight. Moreover, we find the transformer-based detection model offers the best result among other popular model architectures. Finally, we discuss the potential applications and conclude with the importance of the dataset.
CVNov 25, 2021
A Close Look at Few-shot Real Image Super-resolution from the Distortion Relation PerspectiveXin Li, Xin Jin, Jun Fu et al.
Collecting amounts of distorted/clean image pairs in the real world is non-trivial, which seriously limits the practical applications of these supervised learning-based methods on real-world image super-resolution (RealSR). Previous works usually address this problem by leveraging unsupervised learning-based technologies to alleviate the dependency on paired training samples. However, these methods typically suffer from unsatisfactory texture synthesis due to the lack of supervision of clean images. To overcome this problem, we are the first to have a close look at the under-explored direction for RealSR, i.e., few-shot real-world image super-resolution, which aims to tackle the challenging RealSR problem with few-shot distorted/clean image pairs. Under this brand-new scenario, we propose Distortion Relation guided Transfer Learning (DRTL) for the few-shot RealSR by transferring the rich restoration knowledge from auxiliary distortions (i.e., synthetic distortions) to the target RealSR under the guidance of distortion relation. Concretely, DRTL builds a knowledge graph to capture the distortion relation between auxiliary distortions and target distortion (i.e., real distortions in RealSR). Based on the distortion relation, DRTL adopts a gradient reweighting strategy to guide the knowledge transfer process between auxiliary distortions and target distortions. In this way, DRTL could quickly learn the most relevant knowledge from the synthetic distortions for the target distortion. We instantiate DRTL with two commonly-used transfer learning paradigms, including pre-training and meta-learning pipelines, to realize a distortion relation-aware Few-shot RealSR. Extensive experiments on multiple benchmarks and thorough ablation studies demonstrate the effectiveness of our DRTL.
CVOct 19, 2021
Fine-Grained Control of Artistic Styles in Image GenerationXin Miao, Huayan Wang, Jun Fu et al.
Recent advances in generative models and adversarial training have enabled artificially generating artworks in various artistic styles. It is highly desirable to gain more control over the generated style in practice. However, artistic styles are unlike object categories -- there are a continuous spectrum of styles distinguished by subtle differences. Few works have been explored to capture the continuous spectrum of styles and apply it to a style generation task. In this paper, we propose to achieve this by embedding original artwork examples into a continuous style space. The style vectors are fed to the generator and discriminator to achieve fine-grained control. Our method can be used with common generative adversarial networks (such as StyleGAN). Experiments show that our method not only precisely controls the fine-grained artistic style but also improves image quality over vanilla StyleGAN as measured by FID.
IVMay 19, 2021
Adaptive Hypergraph Convolutional Network for No-Reference 360-degree Image Quality AssessmentJun Fu, Chen Hou, Wei Zhou et al.
In no-reference 360-degree image quality assessment (NR 360IQA), graph convolutional networks (GCNs), which model interactions between viewports through graphs, have achieved impressive performance. However, prevailing GCN-based NR 360IQA methods suffer from three main limitations. First, they only use high-level features of the distorted image to regress the quality score, while the human visual system (HVS) scores the image based on hierarchical features. Second, they simplify complex high-order interactions between viewports in a pairwise fashion through graphs. Third, in the graph construction, they only consider spatial locations of viewports, ignoring its content characteristics. Accordingly, to address these issues, we propose an adaptive hypergraph convolutional network for NR 360IQA, denoted as AHGCN. Specifically, we first design a multi-level viewport descriptor for extracting hierarchical representations from viewports. Then, we model interactions between viewports through hypergraphs, where each hyperedge connects two or more viewports. In the hypergraph construction, we build a location-based hyperedge and a content-based hyperedge for each viewport. Experimental results on two public 360IQA databases demonstrate that our proposed approach has a clear advantage over state-of-the-art full-reference and no-reference IQA models.
LGApr 1, 2021
Bayesian Graph Convolutional Network for Traffic PredictionJun Fu, Wei Zhou, Zhibo Chen
Recently, adaptive graph convolutional network based traffic prediction methods, learning a latent graph structure from traffic data via various attention-based mechanisms, have achieved impressive performance. However, they are still limited to find a better description of spatial relationships between traffic conditions due to: (1) ignoring the prior of the observed topology of the road network; (2) neglecting the presence of negative spatial relationships; and (3) lacking investigation on uncertainty of the graph structure. In this paper, we propose a Bayesian Graph Convolutional Network (BGCN) framework to alleviate these issues. Under this framework, the graph structure is viewed as a random realization from a parametric generative model, and its posterior is inferred using the observed topology of the road network and traffic data. Specifically, the parametric generative model is comprised of two parts: (1) a constant adjacency matrix which discovers potential spatial relationships from the observed physical connections between roads using a Bayesian approach; (2) a learnable adjacency matrix that learns a global shared spatial correlations from traffic data in an end-to-end fashion and can model negative spatial correlations. The posterior of the graph structure is then approximated by performing Monte Carlo dropout on the parametric graph structure. We verify the effectiveness of our method on five real-world datasets, and the experimental results demonstrate that BGCN attains superior performance compared with state-of-the-art methods.
LGOct 15, 2020
Bayesian Spatio-Temporal Graph Convolutional Network for Traffic ForecastingJun Fu, Wei Zhou, Zhibo Chen
In traffic forecasting, graph convolutional networks (GCNs), which model traffic flows as spatio-temporal graphs, have achieved remarkable performance. However, existing GCN-based methods heuristically define the graph structure as the physical topology of the road network, ignoring potential dependence of the graph structure over traffic data. And the defined graph structure is deterministic, which lacks investigation of uncertainty. In this paper, we propose a Bayesian Spatio-Temporal Graph Convolutional Network (BSTGCN) for traffic prediction. The graph structure in our network is learned from the physical topology of the road network and traffic data in an end-to-end manner, which discovers a more accurate description of the relationship among traffic flows. Moreover, a parametric generative model is proposed to represent the graph structure, which enhances the generalization capability of GCNs. We verify the effectiveness of our method on two real-world datasets, and the experimental results demonstrate that BSTGCN attains superior performance compared with state-of-the-art methods.
MMSep 29, 2020
Sequential Reinforced 360-Degree Video Adaptive Streaming with Cross-user Attentive NetworkJun Fu, Zhibo Chen, Xiaoming Chen et al.
In the tile-based 360-degree video streaming, predicting user's future viewpoints and developing adaptive bitrate (ABR) algorithms are essential for optimizing user's quality of experience (QoE). Traditional single-user based viewpoint prediction methods fail to achieve good performance in long-term prediction, and the recently proposed reinforcement learning (RL) based ABR schemes applied in traditional video streaming can not be directly applied in the tile-based 360-degree video streaming due to the exponential action space. Therefore, we propose a sequential reinforced 360-degree video streaming scheme with cross-user attentive network. Firstly, considering different users may have the similar viewing preference on the same video, we propose a cross-user attentive network (CUAN), boosting the performance of long-term viewpoint prediction by selectively utilizing cross-user information. Secondly, we propose a sequential RL-based (360SRL) ABR approach, transforming action space size of each decision step from exponential to linear via introducing a sequential decision structure. We evaluate the proposed CUAN and 360SRL using trace-driven experiments and experimental results demonstrate that CUAN and 360SRL outperform existing viewpoint prediction and ABR approaches with a noticeable margin.
IVJun 1, 2020
Residual Squeeze-and-Excitation Network for Fast Image DerainingJun Fu, Jianfeng Xu, Kazuyuki Tasaka et al.
Image deraining is an important image processing task as rain streaks not only severely degrade the visual quality of images but also significantly affect the performance of high-level vision tasks. Traditional methods progressively remove rain streaks via different recurrent neural networks. However, these methods fail to yield plausible rain-free images in an efficient manner. In this paper, we propose a residual squeeze-and-excitation network called RSEN for fast image deraining as well as superior deraining performance compared with state-of-the-art approaches. Specifically, RSEN adopts a lightweight encoder-decoder architecture to conduct rain removal in one stage. Besides, both encoder and decoder adopt a novel residual squeeze-and-excitation block as the core of feature extraction, which contains a residual block for producing hierarchical features, followed by a squeeze-and-excitation block for channel-wisely enhancing the resulted hierarchical features. Experimental results demonstrate that our method can not only considerably reduce the computational complexity but also significantly improve the deraining performance compared with state-of-the-art methods.
CVNov 5, 2019
Adaptive Context Network for Scene ParsingJun Fu, Jing Liu, Yuhang Wang et al.
Recent works attempt to improve scene parsing performance by exploring different levels of contexts, and typically train a well-designed convolutional network to exploit useful contexts across all pixels equally. However, in this paper, we find that the context demands are varying from different pixels or regions in each image. Based on this observation, we propose an Adaptive Context Network (ACNet) to capture the pixel-aware contexts by a competitive fusion of global context and local context according to different per-pixel demands. Specifically, when given a pixel, the global context demand is measured by the similarity between the global feature and its local feature, whose reverse value can be used to measure the local context demand. We model the two demand measurements by the proposed global context module and local context module, respectively, to generate adaptive contextual features. Furthermore, we import multiple such modules to build several adaptive context blocks in different levels of network to obtain a coarse-to-fine result. Finally, comprehensive experimental evaluations demonstrate the effectiveness of the proposed ACNet, and new state-of-the-arts performances are achieved on all four public datasets, i.e. Cityscapes, ADE20K, PASCAL Context, and COCO Stuff.
CVAug 16, 2017
Stacked Deconvolutional Network for Semantic SegmentationJun Fu, Jing Liu, Yuhang Wang et al.
Recent progress in semantic segmentation has been driven by improving the spatial resolution under Fully Convolutional Networks (FCNs). To address this problem, we propose a Stacked Deconvolutional Network (SDN) for semantic segmentation. In SDN, multiple shallow deconvolutional networks, which are called as SDN units, are stacked one by one to integrate contextual information and guarantee the fine recovery of localization information. Meanwhile, inter-unit and intra-unit connections are designed to assist network training and enhance feature fusion since the connections improve the flow of information and gradient propagation throughout the network. Besides, hierarchical supervision is applied during the upsampling process of each SDN unit, which guarantees the discrimination of feature representations and benefits the network optimization. We carry out comprehensive experiments and achieve the new state-of-the-art results on three datasets, including PASCAL VOC 2012, CamVid, GATECH. In particular, our best model without CRF post-processing achieves an intersection-over-union score of 86.6% in the test set.
SYJun 25, 2015
Direct Adaptive Controller for Uncertain MIMO Dynamic Systems with Time-varying Delay and Dead-zone InputsZhijun Li, Ziting Chen, Jun Fu et al.
This paper presents an adaptive tracking control method for a class of nonlinearly parameterized MIMO dynamic systems with time-varying delay and unknown nonlinear dead-zone inputs. A new high dimensional integral Lyapunov-Krasovskii functional is introduced for the adaptive controller to guarantee global stability of the considered systems and also ensure convergence of the tracking errors to the origin. The proposed method provides an alternative to existing methods used for MIMO time-delay systems with dead-zone nonlinearities.