87.0AIMay 29
SLAT: Segment-Level Adaptive Trimming for Efficient CoT ReasoningJian Yao, Xiongcai Luo, Ran Cheng et al.
Recent advances in Large Reasoning Models have significantly improved chain-of-thought (CoT) capabilities via reinforcement learning (RL). However, generated reasoning chains frequently suffer from structural redundancy (i.e., \emph{overthinking}), incurring high computational overhead without improving answer correctness. Existing mitigation strategies typically rely on token-uniform length penalties, which provide coarse, segment-agnostic pressure toward shorter outputs and can inadvertently suppress useful reasoning alongside redundancy. To address this, we demonstrate that inefficiency concentrates in high-probability segments with low marginal utility. We derive a theoretical characterization of segment suboptimality under the correctness-length trade-off objective and propose \textsc{SLAT} (Segment-Level Adaptive Trimming), an RL framework that selectively suppresses redundant segments based on this criterion. Empirical results on standard benchmarks indicate that \textsc{SLAT} establishes a superior accuracy-efficiency Pareto frontier, reducing reasoning length by $50\%$ relative to uncompressed baselines while maintaining competitive accuracy. Overall, our results suggest that theoretically grounded, segment-aware trimming is a promising direction for efficient CoT reasoning in large language models.
56.0IRJun 3
Beyond Retrieval: Learning Compact User Representations for Scalable LLM PersonalizationHeng Cao, Fan Zhang, Jian Yao et al.
Personalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing approaches typically personalize LLMs either at the input level, by retrieving user histories or constructing profile prompts, or at the parameter level, by maintaining user-specific parameter-efficient modules. The former makes personalization sensitive to retrieval quality and prompt design, whereas the latter incurs storage and maintenance costs that grow with the user population. To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommendation systems, TAP-PER decomposes user modeling into user-state and query-conditioned components, and incorporates temporal signals to capture the evolving nature of user interests. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines across classification, rating, and generation settings. Moreover, TAP-PER uses 130x fewer per-user parameters than OPPU and roughly half the total parameter footprint of PER-PCS at the 1,000-user scale, demonstrating that scalable LLM personalization can be achieved without explicit prompt construction or heavy per-user adapters.
CVOct 23, 2022
Self-supervised Amodal Video Object SegmentationJian Yao, Yuxin Hong, Chiyu Wang et al.
Amodal perception requires inferring the full shape of an object that is partially occluded. This task is particularly challenging on two levels: (1) it requires more information than what is contained in the instant retina or imaging sensor, (2) it is difficult to obtain enough well-annotated amodal labels for supervision. To this end, this paper develops a new framework of Self-supervised amodal Video object segmentation (SaVos). Our method efficiently leverages the visual information of video temporal sequences to infer the amodal mask of objects. The key intuition is that the occluded part of an object can be explained away if that part is visible in other frames, possibly deformed as long as the deformation can be reasonably learned. Accordingly, we derive a novel self-supervised learning paradigm that efficiently utilizes the visible object parts as the supervision to guide the training on videos. In addition to learning type prior to complete masks for known types, SaVos also learns the spatiotemporal prior, which is also useful for the amodal task and could generalize to unseen types. The proposed framework achieves the state-of-the-art performance on the synthetic amodal segmentation benchmark FISHBOWL and the real world benchmark KINS-Video-Car. Further, it lends itself well to being transferred to novel distributions using test-time adaptation, outperforming existing models even after the transfer to a new distribution.
CVSep 7, 2023
A boundary-aware point clustering approach in Euclidean and embedding spaces for roof plane segmentationLi Li, Qingqing Li, Guozheng Xu et al.
Roof plane segmentation from airborne LiDAR point clouds is an important technology for 3D building model reconstruction. One of the key issues of plane segmentation is how to design powerful features that can exactly distinguish adjacent planar patches. The quality of point feature directly determines the accuracy of roof plane segmentation. Most of existing approaches use handcrafted features to extract roof planes. However, the abilities of these features are relatively low, especially in boundary area. To solve this problem, we propose a boundary-aware point clustering approach in Euclidean and embedding spaces constructed by a multi-task deep network for roof plane segmentation. We design a three-branch network to predict semantic labels, point offsets and extract deep embedding features. In the first branch, we classify the input data as non-roof, boundary and plane points. In the second branch, we predict point offsets for shifting each point toward its respective instance center. In the third branch, we constrain that points of the same plane instance should have the similar embeddings. We aim to ensure that points of the same plane instance are close as much as possible in both Euclidean and embedding spaces. However, although deep network has strong feature representative ability, it is still hard to accurately distinguish points near plane instance boundary. Therefore, we first group plane points into many clusters in the two spaces, and then we assign the rest boundary points to their closest clusters to generate final complete roof planes. In this way, we can effectively reduce the influence of unreliable boundary points. In addition, we prepare a synthetic dataset and two real datasets to train and evaluate our approach. The experiments results show that the proposed approach significantly outperforms the existing state-of-the-art approaches.
GTJun 29, 2023
Policy Space Diversity for Non-Transitive GamesJian Yao, Weiming Liu, Haobo Fu et al.
Policy-Space Response Oracles (PSRO) is an influential algorithm framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have been trying to promote policy diversity in PSRO. A major weakness in existing diversity metrics is that a more diverse (according to their diversity metrics) population does not necessarily mean (as we proved in the paper) a better approximation to a NE. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to a NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best response solving in PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective in producing significantly less exploitable policies than state-of-the-art PSRO variants.
CVOct 11, 2022
DeepMLE: A Robust Deep Maximum Likelihood Estimator for Two-view Structure from MotionYuxi Xiao, Li Li, Xiaodi Li et al.
Two-view structure from motion (SfM) is the cornerstone of 3D reconstruction and visual SLAM (vSLAM). Many existing end-to-end learning-based methods usually formulate it as a brute regression problem. However, the inadequate utilization of traditional geometry model makes the model not robust in unseen environments. To improve the generalization capability and robustness of end-to-end two-view SfM network, we formulate the two-view SfM problem as a maximum likelihood estimation (MLE) and solve it with the proposed framework, denoted as DeepMLE. First, we propose to take the deep multi-scale correlation maps to depict the visual similarities of 2D image matches decided by ego-motion. In addition, in order to increase the robustness of our framework, we formulate the likelihood function of the correlations of 2D image matches as a Gaussian and Uniform mixture distribution which takes the uncertainty caused by illumination changes, image noise and moving objects into account. Meanwhile, an uncertainty prediction module is presented to predict the pixel-wise distribution parameters. Finally, we iteratively refine the depth and relative camera pose using the gradient-like information to maximize the likelihood function of the correlations. Extensive experimental results on several datasets prove that our method significantly outperforms the state-of-the-art end-to-end two-view SfM approaches in accuracy and generalization capability.
CVMar 4, 2022
PatchMVSNet: Patch-wise Unsupervised Multi-View Stereo for Weakly-Textured Surface ReconstructionHaonan Dong, Jian Yao
Learning-based multi-view stereo (MVS) has gained fine reconstructions on popular datasets. However, supervised learning methods require ground truth for training, which is hard to be collected, especially for the large-scale datasets. Though nowadays unsupervised learning methods have been proposed and have gotten gratifying results, those methods still fail to reconstruct intact results in challenging scenes, such as weakly-textured surfaces, as those methods primarily depend on pixel-wise photometric consistency which is subjected to various illuminations. To alleviate matching ambiguity in those challenging scenes, this paper proposes robust loss functions leveraging constraints beneath multi-view images: 1) Patch-wise photometric consistency loss, which expands the receptive field of the features in multi-view similarity measuring, 2) Robust twoview geometric consistency, which includes a cross-view depth consistency checking with the minimum occlusion. Our unsupervised strategy can be implemented with arbitrary depth estimation frameworks and can be trained with arbitrary large-scale MVS datasets. Experiments show that our method can decrease the matching ambiguity and particularly improve the completeness of weakly-textured reconstruction. Moreover, our method reaches the performance of the state-of-the-art methods on popular benchmarks, like DTU, Tanks and Temples and ETH3D. The code will be released soon.
CVDec 25, 2022
A Lightweight Reconstruction Network for Surface Defect InspectionChao Hu, Jian Yao, Weijie Wu et al.
Currently, most deep learning methods cannot solve the problem of scarcity of industrial product defect samples and significant differences in characteristics. This paper proposes an unsupervised defect detection algorithm based on a reconstruction network, which is realized using only a large number of easily obtained defect-free sample data. The network includes two parts: image reconstruction and surface defect area detection. The reconstruction network is designed through a fully convolutional autoencoder with a lightweight structure. Only a small number of normal samples are used for training so that the reconstruction network can be A defect-free reconstructed image is generated. A function combining structural loss and $\mathit{L}1$ loss is proposed as the loss function of the reconstruction network to solve the problem of poor detection of irregular texture surface defects. Further, the residual of the reconstructed image and the image to be tested is used as the possible region of the defect, and conventional image operations can realize the location of the fault. The unsupervised defect detection algorithm of the proposed reconstruction network is used on multiple defect image sample sets. Compared with other similar algorithms, the results show that the unsupervised defect detection algorithm of the reconstructed network has strong robustness and accuracy.
92.7CVMay 12
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D ImagesZishan Liu, Ruoxi Zang, Yanglin Zhang et al.
Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, with automatic verification to ensure data quality. Based on this pipeline, we build SpatialForge-10M, a large-scale dataset containing 10 million spatial QA pairs. Extensive experiments across multiple spatial reasoning benchmarks demonstrate that training on SpatialForge-10M significantly improves the spatial reasoning ability of standard VLMs, highlighting the effectiveness of scaling 2D data for 3D-aware spatial reasoning.
CVMar 26, 2021Code
DDR-Net: Learning Multi-Stage Multi-View Stereo With Dynamic Depth RangePuyuan Yi, Shengkun Tang, Jian Yao
To obtain high-resolution depth maps, some previous learning-based multi-view stereo methods build a cost volume pyramid in a coarse-to-fine manner. These approaches leverage fixed depth range hypotheses to construct cascaded plane sweep volumes. However, it is inappropriate to set identical range hypotheses for each pixel since the uncertainties of previous per-pixel depth predictions are spatially varying. Distinct from these approaches, we propose a Dynamic Depth Range Network (DDR-Net) to determine the depth range hypotheses dynamically by applying a range estimation module (REM) to learn the uncertainties of range hypotheses in the former stages. Specifically, in our DDR-Net, we first build an initial depth map at the coarsest resolution of an image across the entire depth range. Then the range estimation module (REM) leverages the probability distribution information of the initial depth to estimate the depth range hypotheses dynamically for the following stages. Moreover, we develop a novel loss strategy, which utilizes learned dynamic depth ranges to generate refined depth maps, to keep the ground truth value of each pixel covered in the range hypotheses of the next stage. Extensive experimental results show that our method achieves superior performance over other state-of-the-art methods on the DTU benchmark and obtains comparable results on the Tanks and Temples benchmark. The code is available at https://github.com/Tangshengku/DDR-Net.
CVMar 30, 2018Code
Regularizing RNNs for Caption Generation by Reconstructing The Past with The PresentXinpeng Chen, Lin Ma, Wenhao Jiang et al.
Recently, caption generation with an encoder-decoder framework has been extensively studied and applied in different domains, such as image captioning, code captioning, and so on. In this paper, we propose a novel architecture, namely Auto-Reconstructor Network (ARNet), which, coupling with the conventional encoder-decoder framework, works in an end-to-end fashion to generate captions. ARNet aims at reconstructing the previous hidden state with the present one, besides behaving as the input-dependent transition operator. Therefore, ARNet encourages the current hidden state to embed more information from the previous one, which can help regularize the transition dynamics of recurrent neural networks (RNNs). Extensive experimental results show that our proposed ARNet boosts the performance over the existing encoder-decoder models on both image captioning and source code captioning tasks. Additionally, ARNet remarkably reduces the discrepancy between training and inference processes for caption generation. Furthermore, the performance on permuted sequential MNIST demonstrates that ARNet can effectively regularize RNN, especially on modeling long-term dependencies. Our code is available at: https://github.com/chenxinpeng/ARNet
58.1CVMar 12
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video UnderstandingSonglin Li, Xin Zhu, Zechao Guan et al.
Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance responses and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single sample distillation methods. We additionally include an original SFT+RL 4B baseline under the same training budget, which shows only marginal gains, while our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).
LGMay 29, 2025
Diversity-Aware Policy Optimization for Large Language Model ReasoningJian Yao, Ran Cheng, Xingyu Wu et al.
The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and Potential at k (a novel metric quantifying an LLM's reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity and reformulate it into a practical objective, then we selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.
LGOct 21, 2024
Diverse Policies Recovering via Pointwise Mutual Information Weighted Imitation LearningHanlin Yang, Jian Yao, Weiming Liu et al.
Recovering a spectrum of diverse policies from a set of expert trajectories is an important research topic in imitation learning. After determining a latent style for a trajectory, previous diverse policies recovering methods usually employ a vanilla behavioral cloning learning objective conditioned on the latent style, treating each state-action pair in the trajectory with equal importance. Based on an observation that in many scenarios, behavioral styles are often highly relevant with only a subset of state-action pairs, this paper presents a new principled method in diverse polices recovery. In particular, after inferring or assigning a latent style for a trajectory, we enhance the vanilla behavioral cloning by incorporating a weighting mechanism based on pointwise mutual information. This additional weighting reflects the significance of each state-action pair's contribution to learning the style, thus allowing our method to focus on state-action pairs most representative of that style. We provide theoretical justifications for our new objective, and extensive empirical evaluations confirm the effectiveness of our method in recovering diverse policies from expert data.
GRSep 28, 2025
DFG-PCN: Point Cloud Completion with Degree-Flexible Point GraphZhenyu Shu, Jian Yao, Shiqing Xin
Point cloud completion is a vital task focused on reconstructing complete point clouds and addressing the incompleteness caused by occlusion and limited sensor resolution. Traditional methods relying on fixed local region partitioning, such as k-nearest neighbors, which fail to account for the highly uneven distribution of geometric complexity across different regions of a shape. This limitation leads to inefficient representation and suboptimal reconstruction, especially in areas with fine-grained details or structural discontinuities. This paper proposes a point cloud completion framework called Degree-Flexible Point Graph Completion Network (DFG-PCN). It adaptively assigns node degrees using a detail-aware metric that combines feature variation and curvature, focusing on structurally important regions. We further introduce a geometry-aware graph integration module that uses Manhattan distance for edge aggregation and detail-guided fusion of local and global features to enhance representation. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms state-of-the-art approaches.
AIJul 17, 2025
VAR-MATH: Probing True Mathematical Reasoning in LLMS via Symbolic Multi-Instance BenchmarksJian Yao, Ran Cheng, Kay Chen Tan
Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of LLMs, as measured by standard benchmarks. Yet these gains often persist even when models are trained with flawed signals, such as random or inverted rewards. This raises a fundamental question: do such improvements reflect genuine reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To answer this question, we adopt an evaluation-centric perspective and highlight two critical shortcomings in existing protocols. First, benchmark contamination arises because test problems are publicly available, thereby increasing the risk of data leakage. Second, evaluation fragility results from reliance on single-instance assessments, which are sensitive to stochastic outputs and fail to capture reasoning consistency. These limitations suggest the need for a new evaluation paradigm that can probe reasoning ability beyond memorization and one-off success. As response, we propose VAR-MATH, a symbolic evaluation framework that converts fixed numerical problems into parameterized templates and requires models to solve multiple instantiations of each. This design enforces consistency across structurally equivalent variants, mitigates contamination, and enhances robustness through bootstrapped metrics. We apply VAR-MATH to transform three popular benchmarks, AMC23, AIME24, and AIME25, into their symbolic counterparts, VAR-AMC23, VAR-AIME24, and VAR-AIME25. Experimental results show substantial performance drops for RL-trained models on these variabilized benchmarks, especially for smaller models, with average declines of 47.9\% on AMC23, 58.8\% on AIME24, and 72.9\% on AIME25. These findings indicate that some existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms.
CVAug 26, 2025
RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentationSiyuan You, Guozheng Xu, Pengwei Zhou et al.
Roof plane segmentation is one of the key procedures for reconstructing three-dimensional (3D) building models at levels of detail (LoD) 2 and 3 from airborne light detection and ranging (LiDAR) point clouds. The majority of current approaches for roof plane segmentation rely on the manually designed or learned features followed by some specifically designed geometric clustering strategies. Because the learned features are more powerful than the manually designed features, the deep learning-based approaches usually perform better than the traditional approaches. However, the current deep learning-based approaches have three unsolved problems. The first is that most of them are not truly end-to-end, the plane segmentation results may be not optimal. The second is that the point feature discriminability near the edges is relatively low, leading to inaccurate planar edges. The third is that the planar geometric characteristics are not sufficiently considered to constrain the network training. To solve these issues, a novel edge-aware transformer-based network, named RoofSeg, is developed for segmenting roof planes from LiDAR point clouds in a truly end-to-end manner. In the RoofSeg, we leverage a transformer encoder-decoder-based framework to hierarchically predict the plane instance masks with the use of a set of learnable plane queries. To further improve the segmentation accuracy of edge regions, we also design an Edge-Aware Mask Module (EAMM) that sufficiently incorporates planar geometric prior of edges to enhance its discriminability for plane instance mask refinement. In addition, we propose an adaptive weighting strategy in the mask loss to reduce the influence of misclassified points, and also propose a new plane geometric loss to constrain the network training.
CVJul 8, 2020
Deformable spatial propagation network for depth completionZheyuan Xu, Hongche Yin, Jian Yao
Depth completion has attracted extensive attention recently due to the development of autonomous driving, which aims to recover dense depth map from sparse depth measurements. Convolutional spatial propagation network (CSPN) is one of the state-of-the-art methods in this task, which adopt a linear propagation model to refine coarse depth maps with local context. However, the propagation of each pixel occurs in a fixed receptive field. This may not be the optimal for refinement since different pixel needs different local context. To tackle this issue, in this paper, we propose a deformable spatial propagation network (DSPN) to adaptively generates different receptive field and affinity matrix for each pixel. It allows the network obtain information with much fewer but more relevant pixels for propagation. Experimental results on KITTI depth completion benchmark demonstrate that our proposed method achieves the state-of-the-art performance.
CVApr 6, 2020
Vanishing Point Guided Natural Image StitchingKai Chen, Jian Yao, Jingmin Tu et al.
Recently, works on improving the naturalness of stitching images gain more and more extensive attention. Previous methods suffer the failures of severe projective distortion and unnatural rotation, especially when the number of involved images is large or images cover a very wide field of view. In this paper, we propose a novel natural image stitching method, which takes into account the guidance of vanishing points to tackle the mentioned failures. Inspired by a vital observation that mutually orthogonal vanishing points in Manhattan world can provide really useful orientation clues, we design a scheme to effectively estimate prior of image similarity. Given such estimated prior as global similarity constraints, we feed it into a popular mesh deformation framework to achieve impressive natural stitching performances. Compared with other existing methods, including APAP, SPHP, AANAP, and GSP, our method achieves state-of-the-art performance in both quantitative and qualitative experiments on natural image stitching.
CVMar 15, 2020
GMM-UNIT: Unsupervised Multi-Domain and Multi-Modal Image-to-Image Translation via Attribute Gaussian Mixture ModelingYahui Liu, Marco De Nadai, Jian Yao et al.
Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images. Recent studies have shown remarkable success for multiple domains but they suffer from two main limitations: they are either built from several two-domain mappings that are required to be learned independently, or they generate low-diversity results, a problem known as mode collapse. To overcome these limitations, we propose a method named GMM-UNIT, which is based on a content-attribute disentangled representation where the attribute space is fitted with a GMM. Each GMM component represents a domain, and this simple assumption has two prominent advantages. First, it can be easily extended to most multi-domain and multi-modal image-to-image translation tasks. Second, the continuous domain encoding allows for interpolation between domains and for extrapolation to unseen domains and translations. Additionally, we show how GMM-UNIT can be constrained down to different methods in the literature, meaning that GMM-UNIT is a unifying framework for unsupervised image-to-image translation.
SDAug 21, 2019
Coarse-to-fine Optimization for Speech EnhancementJian Yao, Ahmad Al-Dahle
In this paper, we propose the coarse-to-fine optimization for the task of speech enhancement. Cosine similarity loss [1] has proven to be an effective metric to measure similarity of speech signals. However, due to the large variance of the enhanced speech with even the same cosine similarity loss in high dimensional space, a deep neural network learnt with this loss might not be able to predict enhanced speech with good quality. Our coarse-to-fine strategy optimizes the cosine similarity loss for different granularities so that more constraints are added to the prediction from high dimension to relatively low dimension. In this way, the enhanced speech will better resemble the clean speech. Experimental results show the effectiveness of our proposed coarse-to-fine optimization in both discriminative models and generative models. Moreover, we apply the coarse-to-fine strategy to the adversarial loss in generative adversarial network (GAN) and propose dynamic perceptual loss, which dynamically computes the adversarial loss from coarse resolution to fine resolution. Dynamic perceptual loss further improves the accuracy and achieves state-of-the-art results compared with other generative models.
CVSep 18, 2018
Generalized Content-Preserving Warps for Image StitchingKai Chen, Jingmin Tu, Jian Yao
Local misalignment caused by global homography is a common issue in image stitching task. Content-Preserving Warping (CPW) is a typical method to deal with this issue, in which geometric and photometric constraints are imposed to guide the warping process. One of its essential condition however, is colour consistency, and an elusive goal in real world applications. In this paper, we propose a Generalized Content-Preserving Warping (GCPW) method to alleviate this problem. GCPW extends the original CPW by applying a colour model that expresses the colour transformation between images locally, thus meeting the photometric constraint requirements for effective image stitching. We combine the photometric and geometric constraints and jointly estimate the colour transformation and the warped mesh vertexes, simultaneously. We align images locally with an optimal grid mesh generated by our GCPW method. Experiments on both synthetic and real images demonstrate that our new method is robust to colour variations, outperforming other state-of-the-art CPW-based image stitching methods.
CVSep 18, 2018
Multiple Combined Constraints for Image StitchingKai Chen, Jingmin Tu, Binbin Xiang et al.
Several approaches to image stitching use different constraints to estimate the motion model between image pairs. These constraints can be roughly divided into two categories: geometric constraints and photometric constraints. In this paper, geometric and photometric constraints are combined to improve the alignment quality, which is based on the observation that these two kinds of constraints are complementary. On the one hand, geometric constraints (e.g., point and line correspondences) are usually spatially biased and are insufficient in some extreme scenes, while photometric constraints are always evenly and densely distributed. On the other hand, photometric constraints are sensitive to displacements and are not suitable for images with large parallaxes, while geometric constraints are usually imposed by feature matching and are more robust to handle parallaxes. The proposed method therefore combines them together in an efficient mesh-based image warping framework. It achieves better alignment quality than methods only with geometric constraints, and can handle larger parallax than photometric-constraint-based method. Experimental results on various images illustrate that the proposed method outperforms representative state-of-the-art image stitching methods reported in the literature.
CVApr 4, 2018
Fine-grained Video Attractiveness Prediction Using Multimodal Deep Learning on a Large Real-world DatasetXinpeng Chen, Jingyuan Chen, Lin Ma et al.
Nowadays, billions of videos are online ready to be viewed and shared. Among an enormous volume of videos, some popular ones are widely viewed by online users while the majority attract little attention. Furthermore, within each video, different segments may attract significantly different numbers of views. This phenomenon leads to a challenging yet important problem, namely fine-grained video attractiveness prediction. However, one major obstacle for such a challenging problem is that no suitable benchmark dataset currently exists. To this end, we construct the first fine-grained video attractiveness dataset, which is collected from one of the most popular video websites in the world. In total, the constructed FVAD consists of 1,019 drama episodes with 780.6 hours covering different categories and a wide variety of video contents. Apart from the large amount of videos, hundreds of millions of user behaviors during watching videos are also included, such as "view counts", "fast-forward", "fast-rewind", and so on, where "view counts" reflects the video attractiveness while other engagements capture the interactions between the viewers and videos. First, we demonstrate that video attractiveness and different engagements present different relationships. Second, FVAD provides us an opportunity to study the fine-grained video attractiveness prediction problem. We design different sequential models to perform video attractiveness prediction by relying solely on video contents. The sequential models exploit the multimodal relationships between visual and audio components of the video contents at different levels. Experimental results demonstrate the effectiveness of our proposed sequential models with different visual and audio representations, the necessity of incorporating the two modalities, and the complementary behaviors of the sequential prediction models at different levels.
CVMay 12, 2017
Learning to Refine Object Contours with a Top-Down Fully Convolutional Encoder-Decoder NetworkYahui Liu, Jian Yao, Li Li et al.
We develop a novel deep contour detection algorithm with a top-down fully convolutional encoder-decoder network. Our proposed method, named TD-CEDN, solves two important issues in this low-level vision problem: (1) learning multi-scale and multi-level features; and (2) applying an effective top-down refined approach in the networks. TD-CEDN performs the pixel-wise prediction by means of leveraging features at all layers of the net. Unlike skip connections and previous encoder-decoder methods, we first learn a coarse feature map after the encoder stage in a feedforward pass, and then refine this feature map in a top-down strategy during the decoder stage utilizing features at successively lower layers. Therefore, the deconvolutional process is conducted stepwise, which is guided by Deeply-Supervision Net providing the integrated direct supervision. The above proposed technologies lead to a more precise and clearer prediction. Our proposed algorithm achieved the state-of-the-art on the BSDS500 dataset (ODS F-score of 0.788), the PASCAL VOC2012 dataset (ODS F-score of 0.588), and and the NYU Depth dataset (ODS F-score of 0.735).
CVMar 31, 2016
Accurate Text Localization in Natural Image with Cascaded Convolutional Text NetworkTong He, Weilin Huang, Yu Qiao et al.
We introduce a new top-down pipeline for scene text detection. We propose a novel Cascaded Convolutional Text Network (CCTN) that joints two customized convolutional networks for coarse-to-fine text localization. The CCTN fast detects text regions roughly from a low-resolution image, and then accurately localizes text lines from each enlarged region. We cast previous character based detection into direct text region estimation, avoiding multiple bottom- up post-processing steps. It exhibits surprising robustness and discriminative power by considering whole text region as detection object which provides strong semantic information. We customize convolutional network by develop- ing rectangle convolutions and multiple in-network fusions. This enables it to handle multi-shape and multi-scale text efficiently. Furthermore, the CCTN is computationally efficient by sharing convolutional computations, and high-level property allows it to be invariant to various languages and multiple orientations. It achieves 0.84 and 0.86 F-measures on the ICDAR 2011 and ICDAR 2013, delivering substantial improvements over state-of-the-art results [23, 1].
CVOct 12, 2015
Text-Attentional Convolutional Neural Networks for Scene Text DetectionTong He, Weilin Huang, Yu Qiao et al.
Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature computed globally from a whole image component (patch), where the cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this work, we present a new system for scene text detection by proposing a novel Text-Attentional Convolutional Neural Network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level and rich supervised information, including text region mask, character label, and binary text/nontext information. The rich supervision information enables the Text-CNN with a strong capability for discriminating ambiguous texts, and also increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates main task of text/non-text classification. In addition, a powerful low-level detector called Contrast- Enhancement Maximally Stable Extremal Regions (CE-MSERs) is developed, which extends the widely-used MSERs by enhancing intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 dataset, with a F-measure of 0.82, improving the state-of-the-art results substantially.