Ruoteng Li

CV
h-index: 11
10 papers
163 citations
Novelty: 55%
AI Score: 32

10 Papers

CV · Nov 27, 2022 · Code
Estimating Reflectance Layer from A Single Image: Integrating Reflectance Guidance and Shadow/Specular Aware Learning

Yeying Jin, Ruoteng Li, Wenhan Yang et al.

Estimating the reflectance layer from a single image is a challenging task. It becomes more challenging when the input image contains shadows or specular highlights, which often lead to an inaccurate estimate of the reflectance layer. Therefore, we propose a two-stage learning method, including reflectance guidance and a Shadow/Specular-Aware (S-Aware) network to tackle the problem. In the first stage, an initial reflectance layer free from shadows and specularities is obtained with the constraint of novel losses that are guided by prior-based shadow-free and specular-free images. To further enforce the reflectance layer to be independent of shadows and specularities in the second-stage refinement, we introduce an S-Aware network that distinguishes the reflectance image from the input image. Our network employs a classifier to categorize shadow/shadow-free, specular/specular-free classes, enabling the activation features to function as attention maps that focus on shadow/specular regions. Our quantitative and qualitative evaluations show that our method outperforms state-of-the-art methods in estimating a reflectance layer that is free from shadows and specularities. Code is at: https://github.com/jinyeying/S-Aware-network

CV · Jul 4, 2023
AdAM: Few-Shot Image Generation via Adaptation-Aware Kernel Modulation

Yunqing Zhao, Keshigeyan Chandrasegaran, Milad Abdollahzadeh et al. · tsinghua

Few-shot image generation (FSIG) aims to learn to generate new and diverse images given few (e.g., 10) training samples. Recent work has addressed FSIG by leveraging a GAN pre-trained on a large-scale source domain and adapting it to the target domain with few target samples. Central to recent FSIG methods are knowledge preservation criteria, which select a subset of source knowledge and preserve it in the adapted model. However, a major limitation of existing methods is that their knowledge preservation criteria consider only the source domain/task and fail to consider the target domain/adaptation in selecting source knowledge, casting doubt on their suitability for setups of different proximity between source and target domain. Our work makes two contributions. Firstly, we revisit recent FSIG works and their experiments. We reveal that under setups in which the assumption of close proximity between source and target domains is relaxed, many existing state-of-the-art (SOTA) methods, which consider only the source domain in knowledge preservation, perform no better than a baseline method. As our second contribution, we propose Adaptation-Aware kernel Modulation (AdAM) for general FSIG of different source-target domain proximity. Extensive experiments show that AdAM consistently achieves SOTA performance in FSIG, including challenging setups where source and target domains are farther apart.

CV · Jan 8, 2025
Feedback-Driven Vision-Language Alignment with Minimal Human Supervision

Giorgio Giannone, Ruoteng Li, Qianli Feng et al.

Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Sampling-based Visual Projection), a novel framework that enhances vision-language alignment without relying on manually curated text-image pairs or preference annotation. SVP leverages a small set of manually selected images, self-captioning and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, up to 12% increase in object recall, and significantly reduced hallucinations, while maintaining question-answering capabilities. Using SVP, a small VLM achieves hallucination reductions similar to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.

CV · Oct 15, 2020
Object Tracking Using Spatio-Temporal Future Prediction

Yuan Liu, Ruoteng Li, Robby T. Tan et al.

Occlusion is a long-standing problem that causes many modern tracking methods to be erroneous. In this paper, we address the occlusion problem by exploiting the current and future possible locations of the target object from its past trajectory. To achieve this, we introduce a learning-based tracking method that takes into account background motion modeling and trajectory prediction. Our trajectory prediction module predicts the target object's locations in the current and future frames based on the object's past trajectory. Since the target object's trajectory in the input video is affected not only by the object motion but also by the camera motion, our background motion module estimates the camera motion so that the object's trajectory can be made independent of it. To dynamically switch between the appearance-based tracker and the trajectory prediction, we employ a network that can assess how good a tracking prediction is, and we use the assessment scores to choose between the appearance-based tracker's prediction and the trajectory-based prediction. Comprehensive evaluations show that the proposed method sets a new state-of-the-art performance on commonly used tracking benchmarks.
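The switching idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the learned trajectory predictor is replaced by a constant-velocity extrapolation, the assessment network is replaced by a plain score compared against a threshold, and the names `predict_next`, `choose_location`, and `threshold` are hypothetical.

```python
def predict_next(trajectory):
    """Constant-velocity extrapolation from the last two (x, y) positions.

    Stand-in for the learned trajectory prediction module (assumption).
    """
    (x0, y0), (x1, y1) = trajectory[-2], trajectory[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def choose_location(appearance_location, appearance_score, trajectory, threshold=0.5):
    """Pick the appearance tracker's output when its score is high enough.

    appearance_score stands in for the assessment network's output; the
    threshold value is illustrative, not taken from the paper.
    """
    if appearance_score >= threshold:
        return appearance_location
    # Low confidence (e.g. during occlusion): fall back to the trajectory.
    return predict_next(trajectory)


past = [(0, 0), (1, 2)]
visible = choose_location((5, 5), 0.9, past)   # tracker is trusted
occluded = choose_location((5, 5), 0.1, past)  # trajectory is used instead
```

In this sketch the trajectory is assumed to be already expressed in camera-compensated coordinates; the abstract's background motion module is what would provide that in the actual system.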

CV · Apr 18, 2020
Realistic Large-Scale Fine-Depth Dehazing Dataset from 3D Videos

Ruoteng Li, Xiaoyi Zhang, Shaodi You et al.

Image dehazing is an important and popular topic in computer vision and machine learning. A real-time dehazing method with reliable performance is highly desired for many applications such as autonomous driving and security surveillance. While recent learning-based methods require datasets containing pairs of hazy images and clean ground truth, it is impossible to capture such pairs in real scenes. Many existing works sidestep this difficulty by rendering haze from depth on common RGBD datasets using the haze imaging model. However, there is still a gap between the synthetic datasets and real hazy images, as large datasets with high-quality depth are mostly indoor and depth maps for outdoor scenes are imprecise. In this paper, we complement the existing datasets with a new, large, and diverse dehazing dataset containing real outdoor scenes from High-Definition (HD) 3D movies. We select a large number of high-quality frames of real outdoor scenes and render haze on them using depth from stereo. Our dataset is clearly more realistic and more diversified with better visual quality than existing ones. More importantly, we demonstrate that using this dataset greatly improves the dehazing performance on real scenes. In addition to the dataset, we also evaluate a series of state-of-the-art methods on the proposed benchmarking datasets.
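Rendering haze from depth can be sketched as follows. The abstract does not spell out the haze imaging model, so this assumes the commonly used atmospheric scattering form I = J * t + A * (1 - t) with transmission t = exp(-beta * d). The function name `render_haze` and the values of `beta` and `airlight` are illustrative, not taken from the paper.

```python
import math


def render_haze(clean, depth, beta=1.0, airlight=0.9):
    """Apply I = J * t + A * (1 - t) with t = exp(-beta * d) per pixel.

    clean: pixel intensities in [0, 1]; depth: matching scene depths.
    beta (scattering coefficient) and airlight (A) are assumed values.
    """
    if len(clean) != len(depth):
        raise ValueError("clean and depth must have the same length")
    hazy = []
    for j, d in zip(clean, depth):
        t = math.exp(-beta * d)  # transmission decays with distance
        hazy.append(j * t + airlight * (1.0 - t))
    return hazy


clean_row = [0.2, 0.2, 0.2]
depth_row = [0.0, 1.0, 10.0]
hazy_row = render_haze(clean_row, depth_row)
```

Pixels at zero depth keep their original intensity, while distant pixels approach the airlight value, which is why imprecise outdoor depth maps translate directly into unrealistic haze.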

CV · Dec 8, 2019
Single image reflection removal via learning with multi-image constraints

Yingda Yin, Qingnan Fan, Dongdong Chen et al.

Reflections are very common phenomena in our daily photography, which distract people's attention from the scene behind the glass. The problem of removing reflection artifacts is important but challenging due to its ill-posed nature. The traditional approaches solve an optimization problem over the constraints induced from multiple images, at the expense of large computation costs. Recent learning-based approaches have demonstrated a significant improvement in both performance and running time for single image reflection removal, but are limited as they require a large number of synthetic reflection/clean image pairs for direct supervision to approximate the ground truth, at the risk of overfitting in the synthetic image domain and degrading in the real image domain. In this paper, we propose a novel learning-based solution that combines the advantages of the aforementioned approaches and overcomes their drawbacks. Our algorithm works by learning a deep neural network to optimize the target with joint constraints enhanced among multiple input images during the training phase, but is able to eliminate reflections only from a single input for evaluation. Our algorithm runs in real-time and achieves state-of-the-art reflection removal performance on real images. We further propose a strong network backbone that disentangles the background and reflection information into separate latent codes, which are embedded into a shared one-branch deep neural network for both background and reflection predictions. The proposed backbone experimentally performs better than the other common network implementations, and provides insightful knowledge to understand the reflection removal task.

LG · Jul 23, 2019
GraphX$^{NET}$ - Chest X-Ray Classification Under Extreme Minimal Supervision

Angelica I. Aviles-Rivero, Nicolas Papadakis, Ruoteng Li et al.

The task of classifying X-ray data is a problem of both theoretical and clinical interest. Whilst supervised deep learning methods rely upon huge amounts of labelled data, the critical problem of achieving a good classification accuracy when an extremely small amount of labelled data is available has yet to be tackled. In this work, we introduce a novel semi-supervised framework for X-ray classification which is based on a graph-based optimisation model. To the best of our knowledge, this is the first method that exploits graph-based semi-supervised learning for X-ray data classification. Furthermore, we introduce a new multi-class classification functional with carefully selected class priors which allows for a smooth solution that strengthens the synergy between the limited number of labels and the huge amount of unlabelled data. We demonstrate, through a set of numerical and visual experiments, that our method produces highly competitive results on the ChestX-ray14 data set whilst drastically reducing the need for annotated data.
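To give a feel for graph-based semi-supervised learning in general, here is a minimal sketch of classic label propagation on a tiny graph. It is not the functional proposed in the paper: it uses the standard averaging (quadratic, harmonic) scheme for a two-class problem, and the name `propagate_labels` and the toy graph are hypothetical.

```python
def propagate_labels(weights, labels, iterations=100):
    """Spread known labels over a graph by repeated neighbour averaging.

    weights: symmetric adjacency matrix as a list of lists.
    labels: dict mapping labelled node index to a score (0.0 or 1.0).
    Unlabelled nodes start at 0.5 and labelled nodes stay fixed.
    """
    n = len(weights)
    scores = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if i in labels:
                updated.append(labels[i])  # clamp the labelled nodes
                continue
            total = sum(weights[i])
            if total == 0:
                updated.append(scores[i])  # isolated node: leave unchanged
            else:
                weighted = sum(w * s for w, s in zip(weights[i], scores))
                updated.append(weighted / total)
        scores = updated
    return scores


# Chain graph 0 - 1 - 2 - 3 with only the two end nodes labelled.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = propagate_labels(chain, {0: 0.0, 3: 1.0})
predicted = [1 if s >= 0.5 else 0 for s in scores]
```

Two labels are enough to classify all four nodes here, which is the general appeal of graph-based methods when annotated data is scarce.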

LG · Jun 20, 2019
Energy Models for Better Pseudo-Labels: Improving Semi-Supervised Classification with the 1-Laplacian Graph Energy

Angelica I. Aviles-Rivero, Nicolas Papadakis, Ruoteng Li et al.

Semi-supervised classification is a great focus of interest, as in real-world scenarios obtaining labels is expensive, time-consuming and might require expert knowledge. This has motivated the fast development of semi-supervised techniques, whose performance is on a par with or better than supervised approaches. A current major challenge for semi-supervised techniques is how to better handle the network calibration and confirmation bias problems for improving performance. In this work, we argue that energy models are an effective way to address such problems. With this motivation in mind, we propose a hybrid framework for semi-supervised classification called the CREPE model (1-Lapla$\mathbf{C}$ian g$\mathbf{R}$aph $\mathbf{E}$nergy for $\mathbf{P}$seudo-lab$\mathbf{E}$ls). Firstly, we introduce a new energy model based on the non-smooth $\ell_1$ norm of the normalised graph 1-Laplacian. Our functional enforces a sufficiently smooth solution and strengthens the intrinsic relation between the labelled and unlabelled data. Secondly, we provide a theoretical analysis for our proposed scheme and show that the solution trajectory does converge to a non-constant steady point. Thirdly, we derive the connection of our energy model to pseudo-labelling. We show that our energy model produces more meaningful pseudo-labels than the ones generated directly by a deep network. We extensively evaluate our framework, through numerical and visual experiments, using six benchmarking datasets for natural and medical images. We demonstrate that our technique reports state-of-the-art results for semi-supervised classification.

CV · Dec 19, 2017
Single Image Deraining using Scale-Aware Multi-Stage Recurrent Network

Ruoteng Li, Loong-Fah Cheong, Robby T. Tan

Given a single input rainy image, our goal is to visually remove rain streaks and the veiling effect caused by scattering and transmission of rain streaks and rain droplets. We are particularly concerned with heavy rain, where rain streaks of various sizes and directions can overlap each other and the veiling effect reduces contrast severely. To achieve our goal, we introduce a scale-aware multi-stage convolutional neural network. Our main idea here is that different sizes of rain streaks visually degrade the scene in different ways. Large nearby streaks obstruct larger regions and are likely to reflect specular highlights more prominently than smaller distant streaks. These different effects of different streaks have their own characteristics in their image features, and thus need to be treated differently. To realize this, we create parallel sub-networks that are trained and made aware of these different scales of rain streaks. To our knowledge, this idea of parallel sub-networks that treat the same class of objects according to their unique sub-classes is novel, particularly in the context of rain removal. To verify our idea, we conducted experiments on both synthetic and real images, and found that our method is effective and outperforms the state-of-the-art methods.

CV · Apr 18, 2017
Robust Optical Flow Estimation in Rainy Scenes

Ruoteng Li, Robby T. Tan, Loong-Fah Cheong

Optical flow estimation in rainy scenes is challenging due to background degradation introduced by rain streaks and rain accumulation effects in the scene. The rain accumulation effect refers to poor visibility of remote objects due to intense rainfall. Most existing optical flow methods are erroneous when applied to rain sequences because the conventional brightness constancy constraint (BCC) and gradient constancy constraint (GCC) generally break down in this situation. Based on the observation that the RGB color channels receive raindrop radiance equally, we introduce a residue channel as a new data constraint to reduce the effect of rain streaks. To handle rain accumulation, our method decomposes the image into a piecewise-smooth background layer and a high-frequency detail layer. It also enforces the BCC on the background layer only. Results on both a synthetic dataset and real images show that our algorithm outperforms existing methods on different types of rain sequences. To our knowledge, this is the first optical flow method specifically dealing with rain.
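The observation about equal raindrop radiance across channels suggests why a channel difference is insensitive to rain streaks. The sketch below assumes the residue channel is the per-pixel maximum color channel minus the minimum one; the abstract does not give the formula, so treat this definition, along with the names `residue_channel` and `add_rain`, as an assumption.

```python
def residue_channel(pixel):
    """Maximum color channel minus minimum color channel (assumed definition)."""
    r, g, b = pixel
    return max(r, g, b) - min(r, g, b)


def add_rain(pixel, radiance):
    """Add the same rain radiance to every channel, with no clipping."""
    return tuple(c + radiance for c in pixel)


background = (0.5, 0.3, 0.1)
rainy = add_rain(background, 0.2)

clean_residue = residue_channel(background)
rainy_residue = residue_channel(rainy)
```

Because the same amount is added to each channel, it cancels in the difference, so both residues are equal up to floating-point rounding. The cancellation would no longer hold if the added radiance saturated a channel.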