CVMar 5, 2023Code
Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video RecognitionJunyan Wang, Zhenhong Sun, Yichen Qian et al.
3D convolution neural networks (CNNs) have been the prevailing option for video recognition. To capture the temporal information, 3D convolutions are computed along the sequences, leading to cubically growing and expensive computations. To reduce the computational cost, previous methods resort to manually designed 3D/2D CNN structures with approximations or automatic search, which sacrifice the modeling ability or make training time-consuming. In this work, we propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach tailored for 3D CNNs considering the model complexity. To measure the expressiveness of 3D CNNs efficiently, we formulate a 3D CNN as an information system and derive an analytic entropy score, based on the Maximum Entropy Principle. Specifically, we propose a spatio-temporal entropy score (STEntr-Score) with a refinement factor to handle the discrepancy of visual information in spatial and temporal dimensions, through dynamically leveraging the correlation between the feature map size and kernel size depth-wisely. Highly efficient and expressive 3D CNN architectures, \ie entropy-based 3D CNNs (E3D family), can then be efficiently searched by maximizing the STEntr-Score under a given computational budget, via an evolutionary algorithm without training the network parameters. Extensive experiments on Something-Something V1\&V2 and Kinetics400 demonstrate that the E3D family achieves state-of-the-art performance with higher computational efficiency. Code is available at https://github.com/alibaba/lightweight-neural-architecture-search.
ROFeb 4
HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online DistillationPuyue Wang, Jiawei Hu, Yan Gao et al.
Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two-stage learning framework for robust humanoid control under domain shift. First, we train a high-performance teacher policy via history-conditioned reinforcement learning, where the policy infers latent dynamics context from recent state--action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher's robust control capabilities into a transformer-based student policy that operates on sparse root-relative 3D joint keypoint trajectories. By combining history-conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero-shot to unseen domains without per-domain retraining. Extensive experiments show HoRD outperforms strong baselines in robustness and transfer, especially under unseen domains and external perturbations. Code and project page are available at https://tonywang-0517.github.io/hord/.
CLJun 3, 2025Code
FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language ModelsYan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques et al.
Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning on pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely under explored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.
CVMar 17, 2025Code
Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video PredictionZheyuan Liu, Junyan Wang, Zicheng Duan et al.
Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. The project page is at https://github.com/Cuberick-Orion/FCA .
LGFeb 4
Rethinking Perplexity: Revealing the Impact of Input Length on Perplexity Evaluation in LLMsLetian Cheng, Junyan Wang, Yan Gao et al.
Perplexity is a widely adopted metric for assessing the predictive quality of large language models (LLMs) and often serves as a reference metric for downstream evaluations. However, recent evidence shows that perplexity can be unreliable, especially when irrelevant long inputs are used, raising concerns for both benchmarking and system deployment. While prior efforts have employed selective input filtering and curated datasets, the impact of input length on perplexity has not been systematically studied from a systems perspective and input length has rarely been treated as a first-class system variable affecting both fairness and efficiency. In this work, we close this gap by introducing LengthBenchmark, a system-conscious evaluation framework that explicitly integrates input length, evaluation protocol design, and system-level costs, evaluating representative LLMs under two scoring protocols (direct accumulation and fixed window sliding) across varying context lengths. Unlike prior work that focuses solely on accuracy-oriented metrics, LengthBenchmark additionally measures latency, memory footprint, and evaluation cost, thereby linking predictive metrics to deployment realities. We further incorporate quantized variants not as a main contribution, but as robustness checks, showing that length-induced biases persist across both full-precision and compressed models. This design disentangles the effects of evaluation logic, quantization, and input length, and demonstrates that length bias is a general phenomenon that undermines fair cross-model comparison. Our analysis yields two key observations: (i) sliding window evaluation consistently inflates performance on short inputs, and (ii) both full-precision and quantized models appear to realise gains as the evaluated segment length grows.
CVFeb 9, 2022Code
GiraffeDet: A Heavy-Neck Paradigm for Object DetectionYiqi Jiang, Zhiyu Tan, Junyan Wang et al.
In conventional object detection frameworks, a backbone body inherited from image recognition models extracts deep latent features and then a neck module fuses these latent features to capture information at different scales. As the resolution in object detection is much larger than in image recognition, the computational cost of the backbone often dominates the total inference cost. This heavy-backbone design paradigm is mostly due to the historical legacy when transferring image recognition models to object detection rather than an end-to-end optimized design for object detection. In this work, we show that such paradigm indeed leads to sub-optimal object detection models. To this end, we propose a novel heavy-neck paradigm, GiraffeDet, a giraffe-like network for efficient object detection. The GiraffeDet uses an extremely lightweight backbone and a very deep and large neck module which encourages dense information exchange among different spatial scales as well as different levels of latent semantics simultaneously. This design paradigm allows detectors to process the high-level semantic information and low-level spatial information at the same priority even in the early stage of the network, making it more effective in detection tasks. Numerical evaluations on multiple popular object detection benchmarks show that GiraffeDet consistently outperforms previous SOTA models across a wide spectrum of resource constraints. The source code is available at https://github.com/jyqi/GiraffeDet.
CVAug 8, 2021Code
Discriminative Latent Semantic Graph for Video CaptioningYang Bai, Junyan Wang, Yang Long et al.
Video captioning aims to automatically generate natural language sentences that can describe the visual contents of a given video. Existing generative models like encoder-decoder frameworks cannot explicitly explore the object-level interactions and frame-level information from complex spatio-temporal data to generate semantic-rich captions. Our main contribution is to identify three key problems in a joint framework for future video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional Graph that can fuse spatio-temporal information into latent object proposal. 2) Visual Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words with higher semantic levels. 3) Sentence Validation: A novel Discriminative Language Validator is proposed to verify generated captions so that key semantic concepts can be effectively preserved. Our experiments on two public datasets (MVSD and MSR-VTT) manifest significant improvements over state-of-the-art approaches on all metrics, especially for BLEU-4 and CIDEr. Our code is available at https://github.com/baiyang4/D-LSG-Video-Caption.
CVJul 20, 2019Code
Order Matters: Shuffling Sequence Generation for Video PredictionJunyan Wang, Bingzhang Hu, Yang Long et al.
Predicting future frames in natural video sequences is a new challenge that is receiving increasing attention in the computer vision community. However, existing models suffer from severe loss of temporal information when the predicted sequence is long. Compared to previous methods focusing on generating more realistic contents, this paper extensively studies the importance of sequential order information for video generation. A novel Shuffling sEquence gEneration network (SEE-Net) is proposed that can learn to discriminate unnatural sequential orders by shuffling the video frames and comparing them to the real video sequence. Systematic experiments on three datasets with both synthetic and real-world videos manifest the effectiveness of shuffling sequence generation for video prediction in our proposed model and demonstrate state-of-the-art performance by both qualitative and quantitative evaluations. The source code is available at https://github.com/andrewjywang/SEENet.
CVDec 6, 2024
LiFT: Leveraging Human Feedback for Text-to-Video Model AlignmentYibin Wang, Zhiyu Tan, Junyan Wang et al.
Recent advances in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are subjective and challenging to formalize as objective functions. Existing studies train video quality assessment models that rely on human-annotated ratings for video evaluation but overlook the reasoning behind evaluations, limiting their ability to capture nuanced human criteria. Moreover, aligning T2V model using video-based human feedback remains unexplored. Therefore, this paper proposes LiFT, the first method designed to leverage human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this, we train a reward model LiFT-Critic to learn reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms the CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.
CVMar 8, 2024
Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image GenerationJunyan Wang, Zhenhong Sun, Zhiyu Tan et al.
Vanilla text-to-image diffusion models struggle with generating accurate human images, commonly resulting in imperfect anatomies such as unnatural postures or disproportionate limbs.Existing methods address this issue mostly by fine-tuning the model with extra images or adding additional controls -- human-centric priors such as pose or depth maps -- during the image generation phase. This paper explores the integration of these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at the inference stage. We realize this idea by proposing a human-centric alignment loss to strengthen human-related information from the textual prompts within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, according to an in-depth analysis of the cross-attention layer. Extensive experiments show that our method largely improves over state-of-the-art text-to-image models to synthesize high-quality human images based on user-written prompts. Project page: \url{https://hcplayercvpr2024.github.io}.
CVMar 11, 2025
SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion ModelsHesen Chen, Junyan Wang, Zhiyu Tan et al.
Modern diffusion models encounter a fundamental trade-off between training efficiency and generation quality. While existing representation alignment methods, such as REPA, accelerate convergence through patch-wise alignment, they often fail to capture structural relationships within visual representations and ensure global distribution consistency between pretrained encoders and denoising networks. To address these limitations, we introduce SARA, a hierarchical alignment framework that enforces multi-level representation constraints: (1) patch-wise alignment to preserve local semantic details, (2) autocorrelation matrix alignment to maintain structural consistency within representations, and (3) adversarial distribution alignment to mitigate global representation discrepancies. Unlike previous approaches, SARA explicitly models both intra-representation correlations via self-similarity matrices and inter-distribution coherence via adversarial alignment, enabling comprehensive alignment across local and global scales. Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods. This work establishes a systematic paradigm for optimizing diffusion training through hierarchical representation alignment.
CVFeb 28, 2025
Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating VideosZhiyu Tan, Junyan Wang, Hao Yang et al.
Text-to-video generation has demonstrated promising progress with the advent of diffusion models, yet existing approaches are limited by dataset quality and computational resources. To address these limitations, this paper presents a comprehensive approach that advances both data curation and model design. We introduce CFC-VIDS-1M, a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. The pipeline first evaluates video quality across multiple dimensions, followed by a fine-grained stage that leverages vision-language models to enhance text-video alignment and semantic richness. Building upon the curated dataset's emphasis on visual quality and temporal coherence, we develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms. The model is trained through a progressive four-stage strategy designed to efficiently handle the complexities of video generation. Extensive experiments demonstrate that our integrated approach of high-quality data curation and efficient training strategy generates visually appealing and temporally coherent videos while maintaining computational efficiency. We will release our dataset, code, and models.
CVApr 14, 2025
Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion TransformersChunyang Zhang, Zhenhong Sun, Zhicheng Zhang et al.
Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS), where they must accurately depict multiple distinct instances in a single image based on complex prompts detailing individual features. Traditional MIS control methods for UNet architectures like SD v1.5/SDXL fail to adapt to DiT-based models like FLUX and SD v3.5, which rely on integrated attention between image and text tokens rather than text-image cross-attention. To enhance MIS in DiT, we first analyze the mixed attention mechanism in DiT. Our token-wise and layer-wise analysis of attention maps reveals a hierarchical response structure: instance tokens dominate early layers, background tokens in middle layers, and attribute tokens in later layers. Building on this observation, we propose a training-free approach for enhancing MIS in DiT-based models with hierarchical and step-layer-wise attention specialty tuning (AST). AST amplifies key regions while suppressing irrelevant areas in distinct attention maps across layers and steps, guided by the hierarchical structure. This optimizes multimodal interactions by hierarchically decoupling the complex prompts with instance-based sketches. We evaluate our approach using upgraded sketch-based layouts for the T2I-CompBench and customized complex scenes. Both quantitative and qualitative results confirm our method enhances complex layout generation, ensuring precise instance placement and attribute representation in MIS.
LGMay 28, 2025
SDPO: Importance-Sampled Direct Preference Optimization for Stable Diffusion TrainingXiaomeng Yang, Zhiyu Tan, Junyan Wang et al.
Preference learning has become a central technique for aligning generative models with human expectations. Recently, it has been extended to diffusion models through methods like Direct Preference Optimization (DPO). However, existing approaches such as Diffusion-DPO suffer from two key challenges: timestep-dependent instability, caused by a mismatch between the reverse and forward diffusion processes and by high gradient variance in early noisy timesteps, and off-policy bias arising from the mismatch between optimization and data collection policies. We begin by analyzing the reverse diffusion trajectory and observe that instability primarily occurs at early timesteps with low importance weights. To address these issues, we first propose DPO-C\&M, a practical strategy that improves stability by clipping and masking uninformative timesteps while partially mitigating off-policy bias. Building on this, we introduce SDPO (Importance-Sampled Direct Preference Optimization), a principled framework that incorporates importance sampling into the objective to fully correct for off-policy bias and emphasize informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B demonstrate that both methods outperform standard Diffusion-DPO, with SDPO achieving superior VBench scores, human preference alignment, and training robustness. These results highlight the importance of timestep-aware, distribution-corrected optimization in diffusion-based preference learning.
CVFeb 23, 2022
Absolute Zero-Shot LearningRui Gao, Fan Wan, Daniel Organisciak et al.
Considering the increasing concerns about data copyright and privacy issues, we present a novel Absolute Zero-Shot Learning (AZSL) paradigm, i.e., training a classifier with zero real data. The key innovation is to involve a teacher model as the data safeguard to guide the AZSL model training without data leaking. The AZSL model consists of a generator and student network, which can achieve date-free knowledge transfer while maintaining the performance of the teacher network. We investigate `black-box' and `white-box' scenarios in AZSL task as different levels of model security. Besides, we also provide discussion of teacher model in both inductive and transductive settings. Despite embarrassingly simple implementations and data-missing disadvantages, our AZSL framework can retain state-of-the-art ZSL and GZSL performance under the `white-box' scenario. Extensive qualitative and quantitative analysis also demonstrates promising results when deploying the model under `black-box' scenario.
CVSep 2, 2021
AnANet: Modeling Association and Alignment for Cross-modal Correlation ClassificationNan Xu, Junyan Wang, Yuan Tian et al.
The explosive increase of multimodal data makes a great demand in many cross-modal applications that follow the strict prior related assumption. Thus researchers study the definition of cross-modal correlation category and construct various classification systems and predictive models. However, those systems pay more attention to the fine-grained relevant types of cross-modal correlation, ignoring lots of implicit relevant data which are often divided into irrelevant types. What's worse is that none of previous predictive models manifest the essence of cross-modal correlation according to their definition at the modeling stage. In this paper, we present a comprehensive analysis of the image-text correlation and redefine a new classification system based on implicit association and explicit alignment. To predict the type of image-text correlation, we propose the Association and Alignment Network according to our proposed definition (namely AnANet) which implicitly represents the global discrepancy and commonality between image and text and explicitly captures the cross-modal local relevance. The experimental results on our constructed new image-text correlation dataset show the effectiveness of our model.
CVAug 19, 2020
Query Twice: Dual Mixture Attention Meta Learning for Video SummarizationJunyan Wang, Yang Bai, Yang Long et al.
Video summarization aims to select representative frames to retain high-level information, which is usually solved by predicting the segment-wise importance score via a softmax function. However, softmax function suffers in retaining high-rank representations for complex visual or sequential information, which is known as the Softmax Bottleneck problem. In this paper, we propose a novel framework named Dual Mixture Attention (DMASum) model with Meta Learning for video summarization that tackles the softmax bottleneck problem, where the Mixture of Attention layer (MoA) effectively increases the model capacity by employing twice self-query attention that can capture the second-order changes in addition to the initial query-key attention, and a novel Single Frame Meta Learning rule is then introduced to achieve more generalization to small datasets with limited training sources. Furthermore, the DMASum significantly exploits both visual and sequential attention that connects local key-frame and global attention in an accumulative way. We adopt the new evaluation protocol on two public datasets, SumMe, and TVSum. Both qualitative and quantitative experiments manifest significant improvements over the state-of-the-art methods.
CVFeb 16, 2016
Segmentation Rectification for Video Cutout via One-Class Structured LearningJunyan Wang, Sai-kit Yeung, Jue Wang et al.
Recent works on interactive video object cutout mainly focus on designing dynamic foreground-background (FB) classifiers for segmentation propagation. However, the research on optimally removing errors from the FB classification is sparse, and the errors often accumulate rapidly, causing significant errors in the propagated frames. In this work, we take the initial steps to addressing this problem, and we call this new task \emph{segmentation rectification}. Our key observation is that the possibly asymmetrically distributed false positive and false negative errors were handled equally in the conventional methods. We, alternatively, propose to optimally remove these two types of errors. To this effect, we propose a novel bilayer Markov Random Field (MRF) model for this new task. We also adopt the well-established structured learning framework to learn the optimal model from data. Additionally, we propose a novel one-class structured SVM (OSSVM) which greatly speeds up the structured learning process. Our method naturally extends to RGB-D videos as well. Comprehensive experiments on both RGB and RGB-D data demonstrate that our simple and effective method significantly outperforms the segmentation propagation methods adopted in the state-of-the-art video cutout systems, and the results also suggest the potential usefulness of our method in image cutout system.
CVMar 23, 2015
Superpixelizing Binary MRF for Image Labeling ProblemsJunyan Wang, Sai-Kit Yeung
Superpixels have become prevalent in computer vision. They have been used to achieve satisfactory performance at a significantly smaller computational cost for various tasks. People have also combined superpixels with Markov random field (MRF) models. However, it often takes additional effort to formulate MRF on superpixel-level, and to the best of our knowledge there exists no principled approach to obtain this formulation. In this paper, we show how generic pixel-level binary MRF model can be solved in the superpixel space. As the main contribution of this paper, we show that a superpixel-level MRF can be derived from the pixel-level MRF by substituting the superpixel representation of the pixelwise label into the original pixel-level MRF energy. The resultant superpixel-level MRF energy also remains submodular for a submodular pixel-level MRF. The derived formula hence gives us a handy way to formulate MRF energy in superpixel-level. In the experiments, we demonstrate the efficacy of our approach on several computer vision problems.
CVDec 29, 2014
Rigid and Non-rigid Shape Evolutions for Shape Alignment and Recovery in ImagesJunyan Wang, Kap-Luk Chan
The same type of objects in different images may vary in their shapes because of rigid and non-rigid shape deformations, occluding foreground as well as cluttered background. The problem concerned in this work is the shape extraction in such challenging situations. We approach the shape extraction through shape alignment and recovery. This paper presents a novel and general method for shape alignment and recovery by using one example shapes based on deterministic energy minimization. Our idea is to use general model of shape deformation in minimizing active contour energies. Given \emph{a priori} form of the shape deformation, we show how the curve evolution equation corresponding to the shape deformation can be derived. The curve evolution is called the prior variation shape evolution (PVSE). We also derive the energy-minimizing PVSE for minimizing active contour energies. For shape recovery, we propose to use the PVSE that deforms the shape while preserving its shape characteristics. For choosing such shape-preserving PVSE, a theory of shape preservability of the PVSE is established. Experimental results validate the theory and the formulations, and they demonstrate the effectiveness of our method.
CVApr 9, 2014
A Compact Linear Programming Relaxation for Binary Sub-modular MRFJunyan Wang, Sai-Kit Yeung
We propose a novel compact linear programming (LP) relaxation for binary sub-modular MRF in the context of object segmentation. Our model is obtained by linearizing an $l_1^+$-norm derived from the quadratic programming (QP) form of the MRF energy. The resultant LP model contains significantly fewer variables and constraints compared to the conventional LP relaxation of the MRF energy. In addition, unlike QP which can produce ambiguous labels, our model can be viewed as a quasi-total-variation minimization problem, and it can therefore preserve the discontinuities in the labels. We further establish a relaxation bound between our LP model and the conventional LP model. In the experiments, we demonstrate our method for the task of interactive object segmentation. Our LP model outperforms QP when converting the continuous labels to binary labels using different threshold values on the entire Oxford interactive segmentation dataset. The computational complexity of our LP is of the same order as that of the QP, and it is significantly lower than the conventional LP relaxation.
CVJul 24, 2013
Matching-Constrained Active ContoursJunyan Wang, Kap Luk Chan
In object segmentation by active contours, the initial contour is often required. Conventionally, the initial contour is provided by the user. This paper extends the conventional active contour model by incorporating feature matching in the formulation, which gives rise to a novel matching-constrained active contour. The numerical solution to the new optimization model provides an automated framework of object segmentation without user intervention. The main idea is to incorporate feature point matching as a constraint in active contour models. To this effect, we obtain a mathematical model of interior points to boundary contour such that matching of interior feature points gives contour alignment, and we formulate the matching score as a constraint to active contour model such that the feature matching of maximum score that gives the contour alignment provides the initial feasible solution to the constrained optimization model of segmentation. The constraint also ensures that the optimal contour does not deviate too much from the initial contour. Projected-gradient descent equations are derived to solve the constrained optimization. In the experiments, we show that our method is capable of achieving the automatic object segmentation, and it outperforms the related methods.
CVJul 21, 2012
Piecewise Linear Patch Reconstruction for Segmentation and Description of Non-smooth Image StructuresJunyan Wang, Kap Luk Chan
In this paper, we propose a unified energy minimization model for the segmentation of non-smooth image structures. The energy of piecewise linear patch reconstruction is considered as an objective measure of the quality of the segmentation of non-smooth structures. The segmentation is achieved by minimizing the single energy without any separate process of feature extraction. We also prove that the error of segmentation is bounded by the proposed energy functional, meaning that minimizing the proposed energy leads to reducing the error of segmentation. As a by-product, our method produces a dictionary of optimized orthonormal descriptors for each segmented region. The unique feature of our method is that it achieves the simultaneous segmentation and description for non-smooth image structures under the same optimization framework. The experiments validate our theoretical claims and show the clear superior performance of our methods over other related methods for segmentation of various image textures. We show that our model can be coupled with the piecewise smooth model to handle both smooth and non-smooth structures, and we demonstrate that the proposed model is capable of coping with multiple different regions through the one-against-all strategy.
CVJun 17, 2012
The Stability of Convergence of Curve Evolutions in Vector FieldsJunyan Wang, Kap Luk Chan
Curve evolution is often used to solve computer vision problems. If the curve evolution fails to converge, we would not be able to solve the targeted problem in a lifetime. This paper studies the theoretical aspect of the convergence of a type of general curve evolutions. We establish a theory for analyzing and improving the stability of the convergence of the general curve evolutions. Based on this theory, we ascertain that the convergence of a known curve evolution is marginal stable. We propose a way of modifying the original curve evolution equation to improve the stability of the convergence according to our theory. Numerical experiments show that the modification improves the convergence of the curve evolution, which validates our theory.
CVApr 29, 2012
Active Contour with A Tangential ComponentJunyan Wang, Kap Luk Chan
Conventional edge-based active contours often require the normal component of an edge indicator function on the optimal contours to approximate zero, while the tangential component can still be significant. In real images, the full gradients of the edge indicator function along the object boundaries are often small. Hence, the curve evolution of edge-based active contours can terminate early before converging to the object boundaries with a careless contour initialization. We propose a novel Geodesic Snakes (GeoSnakes) active contour that requires the full gradients of the edge indicator to vanish at the optimal solution. Besides, the conventional curve evolution approach for minimizing active contour energy cannot fully solve the Euler-Lagrange (EL) equation of our GeoSnakes active contour, causing a Pseudo Stationary Phenomenon (PSP). To address the PSP problem, we propose an auxiliary curve evolution equation, named the equilibrium flow (EF) equation. Based on the EF and the conventional curve evolution, we obtain a solution to the full EL equation of GeoSnakes active contour. Experimental results validate the proposed geometrical interpretation of the early termination problem, and they also show that the proposed method overcomes the problem.