CV | Sep 18, 2022
RVSL: Robust Vehicle Similarity Learning in Real Hazy Scenes Based on Semi-supervised Learning
Wei-Ting Chen, I-Hsiang Chen, Chih-Yuan Yeh et al.
Vehicle similarity learning, also called re-identification (ReID), has recently attracted significant attention in computer vision. Several algorithms have been developed and have achieved considerable success. However, most existing methods perform poorly in hazy scenes because of poor visibility. Although some strategies can address this problem, they leave room for improvement because of their limited performance in real-world scenarios and the lack of real-world clear ground truth. To resolve this problem, inspired by CycleGAN, we construct a training paradigm called RVSL, which integrates ReID and domain transformation techniques. The network is trained in a semi-supervised fashion and requires neither ID labels nor the corresponding clear ground truths to learn hazy vehicle ReID in real-world haze scenes. To further constrain the unsupervised learning process effectively, several losses are developed. Experimental results on synthetic and real-world datasets indicate that the proposed method achieves state-of-the-art performance on hazy vehicle ReID problems. Notably, although the proposed method is trained without real-world label information, it achieves competitive performance compared to existing supervised methods trained on complete label information.
AS | May 17
SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement
Kuan-Yu Chen, Jeng-Lin Li, De-Yan Lu et al.
With the rapid development of zero-shot text-to-speech technologies, it is now possible to generate high-quality speech signals that are indistinguishable from real ones. Speech editing, including speech insertion and replacement, appeals to researchers because of its potential applications. However, existing studies have considered only clean speech scenarios. In real-world applications, environmental noise can significantly degrade generation quality. In this study, we propose SeamlessEdit, a noise-resilient framework for editing noisy speech. SeamlessEdit adopts a frequency-band-aware noise suppression module and an in-context refinement strategy. It handles the scenario where the frequency bands of voice and background noise are not separated. The proposed SeamlessEdit framework outperforms state-of-the-art approaches in multiple quantitative and qualitative evaluations.
SD | May 29
AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing
Chih-Heng Chang, Keng-Seng Ho, Chih-Yu Tsai et al.
Controllable music editing aims to modify high-level attributes while strictly preserving rhythmic and melodic structures. However, this task is challenged by a semantic-structural entanglement: steering methods often degrade structure to achieve editing performance, while structural adaptors suppress semantic responsiveness. We propose AnchorSteer, a framework that resolves this tension by coupling structural anchoring with self-discovered semantic steering. The proposed approach probes internal representations to extract interpretable, label-free concept vectors via a self-supervised reconstruction objective, isolating attributes without curated data. During editing, these portable, plug-and-play concept vectors are injected into diffusion hidden manifolds while a structural adaptor enforces consistency. Variants for unconditioned and conditioned injection are provided to balance robustness and semantic strength. Experiments on ZoME-Bench and subjective tests show that the proposed framework outperforms both steering-only and anchoring-only baselines, enabling significant semantic transformations with high-fidelity structural preservation.
SP | Mar 21 | Code
The Binding Effect: Analyzing How Multi-Dimensional Cues Form Gender Bias in Instruction TTS
Kuan-Yu Chen, Yi-Cheng Lin, Po-Chung Hsieh et al.
Current bias evaluations in Instruction Text-to-Speech (ITTS) often rely on univariate testing, overlooking the compositional structure of social cues. In this work, we investigate gender bias by modeling prompts as combinations of Social Status, Career stereotypes, and Persona descriptors. Analyzing open-source ITTS models, we uncover systematic interaction effects where social dimensions modulate one another, creating complex bias patterns missed by univariate baselines. Crucially, our findings indicate that these biases extend beyond surface-level artifacts, demonstrating strong associations with the semantic priors of pre-trained text encoders and the skewed distributions inherent in training data. We further demonstrate that generic diversity prompting is insufficient to override these entrenched patterns, underscoring the need for compositional analysis to diagnose latent risks in generative speech.
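The compositional prompt design described in this abstract can be illustrated with a minimal sketch. The descriptor values, the prompt template, and the function name `build_prompts` are hypothetical placeholders; the abstract does not list the actual levels used for each dimension.

```python
from itertools import product

# Hypothetical placeholder levels; the abstract does not give the real descriptors.
SOCIAL_STATUS = ["high-status", "low-status"]
CAREER = ["nurse", "engineer"]
PERSONA = ["confident", "gentle"]

def build_prompts(statuses, careers, personas):
    """Return one style prompt per (status, career, persona) combination."""
    return [
        {
            "status": s,
            "career": c,
            "persona": p,
            # The sentence template is an assumption made for illustration.
            "prompt": f"A {p}, {s} {c} is speaking.",
        }
        for s, c, p in product(statuses, careers, personas)
    ]

prompts = build_prompts(SOCIAL_STATUS, CAREER, PERSONA)
```

Enumerating the full grid, rather than varying one dimension at a time, is what makes interaction effects between dimensions measurable.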
CV | Nov 25, 2021 | Code
ContourletNet: A Generalized Rain Removal Architecture Using Multi-Direction Hierarchical Representation
Wei-Ting Chen, Cheng-Che Tsai, Hao-Yu Fang et al.
Images acquired from rainy scenes usually suffer from poor visibility, which may degrade the performance of computer vision applications. Rainy scenarios can be categorized into two classes: moderate rain and heavy rain. Moderate rain scenes mainly consist of rain streaks, while heavy rain scenes contain both rain streaks and the veiling effect (similar to haze). Although existing methods have achieved excellent performance on these two cases individually, a general architecture that addresses both heavy rain and moderate rain scenarios effectively is still lacking. In this paper, we construct a hierarchical multi-direction representation network by using the contourlet transform (CT) to address both moderate rain and heavy rain scenarios. The CT divides the image into the multi-direction subbands (MS) and the semantic subband (SS). First, the rain streak information is extracted into the MS based on the multi-orientation property of the CT. Second, a hierarchical architecture is proposed to reconstruct the background information, including damaged semantic information and the veiling effect, in the SS. Last, a multi-level subband discriminator with a feedback error map is proposed. With this module, all subbands can be well optimized. This is the first architecture that can address both scenarios effectively. The code is available at https://github.com/cctakaet/ContourletNet-BMVC2021.
LG | Jun 6, 2019
Nonconvex Approach for Sparse and Low-Rank Constrained Models with Dual Momentum
Cho-Ying Wu, Jian-Jiun Ding
In this manuscript, we study the behavior of surrogates for the rank function on different image processing problems, together with their optimization algorithms. We first propose a novel nonconvex rank surrogate for the general rank minimization problem and apply it to the corrupted image completion problem. Then, we show that nonconvex rank surrogates can be introduced into two well-known sparse and low-rank models: Robust Principal Component Analysis (RPCA) and Low-Rank Representation (LRR). For optimization, we use the alternating direction method of multipliers (ADMM) and propose a technique called the dual momentum: we add the weighted difference of the dual variable between the current and the previous iteration. This technique can avoid the local minimum problem and make the algorithm converge to a solution with smaller recovery error in the nonconvex optimization problem. It can also speed up convergence when the variable updates too slowly. We also give a rigorous proof that the proposed algorithms are convergent. Several experiments are then conducted, including image completion, denoising, and spectral clustering with outlier detection. These experiments show that the proposed methods are effective in image and signal processing applications and achieve the best performance compared with state-of-the-art methods.
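As a rough illustration of the dual momentum idea, the sketch below adds a weighted difference of the two most recent dual variables to a standard ADMM dual ascent step. The exact update form, the function name `dual_update`, and the parameters `mu` (penalty) and `beta` (momentum weight) are assumptions based only on the one-sentence description in the abstract, not the authors' formulation.

```python
import numpy as np

def dual_update(y, y_prev, residual, mu, beta):
    """One dual step: standard ADMM ascent plus a momentum term.

    y, y_prev : dual variable at the current and previous iteration
    residual  : constraint violation at the current primal iterate
    mu        : ADMM penalty parameter
    beta      : momentum weight (assumed name); beta = 0 gives plain ADMM
    """
    return y + mu * residual + beta * (y - y_prev)
```

With `beta = 0` the function reduces to the usual ADMM dual update, so the momentum term can be switched off for comparison.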
IV | Jun 6, 2019
Occluded Face Recognition Using Low-rank Regression with Generalized Gradient Direction
Cho-Ying Wu, Jian-Jiun Ding
In this paper, we propose an effective method for recognizing faces under contiguous occlusion. It combines robust image gradient direction features with a variety of mapping functions and adopts a hierarchical sparse and low-rank regression model. This model unites the sparse representation in dictionary learning with a low-rank representation of the error term, which is usually messy in the gradient domain. We call this the "weak low-rankness" optimization problem, which can be solved efficiently within the framework of the Alternating Direction Method of Multipliers (ADMM). The optimum of the error term has a weak low-rank structure similar to that of the reference error map, and weak low-rankness optimization substantially improves recognition performance. Extensive experiments are conducted on real-world disguise / occlusion data and synthesized contiguous occlusion data. These experiments show that the proposed gradient direction-based hierarchical adaptive sparse and low-rank (GD-HASLR) algorithm achieves the best performance compared to state-of-the-art methods, including popular convolutional neural network-based methods.
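The gradient direction features mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming finite-difference gradients and a cosine/sine mapping; the abstract does not specify which gradient operator or mapping functions the authors use, and the function names are hypothetical.

```python
import numpy as np

def gradient_direction(image):
    """Per-pixel gradient orientation in radians, in [-pi, pi]."""
    # np.gradient returns derivatives along axis 0 (rows), then axis 1 (columns).
    gy, gx = np.gradient(image.astype(float))
    return np.arctan2(gy, gx)

def mapped_features(image):
    """Stack cos and sin of the orientation (one illustrative mapping choice)."""
    theta = gradient_direction(image)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)
```

Because only the direction is kept, multiplying the image by a positive constant leaves these features unchanged.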
CV | Jun 3, 2017
Discrete Gyrator Transforms: Computational Algorithms and Applications
Soo-Chang Pei, Shih-Gu Huang, Jian-Jiun Ding
As an extension of the 2D fractional Fourier transform (FRFT) and a special case of the 2D linear canonical transform (LCT), the gyrator transform was introduced to produce rotations in twisted space/spatial-frequency planes. It is a useful tool in optics, signal processing, and image processing. In this paper, we develop discrete gyrator transforms (DGTs) based on the 2D LCT. Taking advantage of the additivity property of the 2D LCT, we propose three kinds of DGTs, each of which is a cascade of low-complexity operators. These DGTs have different constraints, characteristics, and properties, and are realized by different computational algorithms. In addition, we propose a DGT based on the eigenfunctions of the gyrator transform. This DGT is an orthonormal transform, and its comprehensive properties, especially the additivity property, make it more useful in many applications. We also develop an efficient computational algorithm that significantly reduces the complexity of this DGT. Finally, we briefly review some important applications of the DGTs, including mode conversion, sampling and reconstruction, watermarking, and image encryption.