ASNov 11, 2022
Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representationsYoori Oh, Juheon Lee, Yoseob Han et al.
Recent text-to-speech models have reached the level of generating natural speech similar to what humans say. But there still have limitations in terms of expressiveness. The existing emotional speech synthesis models have shown controllability using interpolated features with scaling parameters in emotional latent space. However, the emotional latent space generated from the existing models is difficult to control the continuous emotional intensity because of the entanglement of features like emotions, speakers, etc. In this paper, we propose a novel method to control the continuous intensity of emotions using semi-supervised learning. The model learns emotions of intermediate intensity using pseudo-labels generated from phoneme-level sequences of speech information. An embedding space built from the proposed model satisfies the uniform grid geometry with an emotional basis. The experimental results showed that the proposed method was superior in controllability and naturalness.
58.8SDMay 21
RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow MatchingJinhyeok Yang, Hyeongju Kim, Yechan Yu et al.
While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48\% to 0.35\% and Korean CER from 0.81\% to 0.57\%. Audio samples: https://robustspeechflow.github.io/
SDDec 19, 2025Code
Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS TrackJune Young Yi, Hyeongju Kim, Juheon Lee
This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, \textit{Supertonic}\footnote{\url{https://github.com/supertone-inc/supertonic}}, with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text--speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.
CVJul 4, 2024
Detect Closer Surfaces that can be Seen: New Modeling and Evaluation in Cross-domain 3D Object DetectionRuixiao Zhang, Yihong Wu, Juheon Lee et al.
The performance of domain adaptation technologies has not yet reached an ideal level in the current 3D object detection field for autonomous driving, which is mainly due to significant differences in the size of vehicles, as well as the environments they operate in when applied across domains. These factors together hinder the effective transfer and application of knowledge learned from specific datasets. Since the existing evaluation metrics are initially designed for evaluation on a single domain by calculating the 2D or 3D overlap between the prediction and ground-truth bounding boxes, they often suffer from the overfitting problem caused by the size differences among datasets. This raises a fundamental question related to the evaluation of the 3D object detection models' cross-domain performance: Do we really need models to maintain excellent performance in their original 3D bounding boxes after being applied across domains? From a practical application perspective, one of our main focuses is actually on preventing collisions between vehicles and other obstacles, especially in cross-domain scenarios where correctly predicting the size of vehicles is much more difficult. In other words, as long as a model can accurately identify the closest surfaces to the ego vehicle, it is sufficient to effectively avoid obstacles. In this paper, we propose two metrics to measure 3D object detection models' ability of detecting the closer surfaces to the sensor on the ego vehicle, which can be used to evaluate their cross-domain performance more comprehensively and reasonably. Furthermore, we propose a refinement head, named EdgeHead, to guide models to focus more on the learnable closer surfaces, which can greatly improve the cross-domain performance of existing models not only under our new metrics, but even also under the original BEV/3D metrics.
CVAug 22, 2024
Revisiting Cross-Domain Problem for LiDAR-based 3D Object DetectionRuixiao Zhang, Juheon Lee, Xiaohao Cai et al.
Deep learning models such as convolutional neural networks and transformers have been widely applied to solve 3D object detection problems in the domain of autonomous driving. While existing models have achieved outstanding performance on most open benchmarks, the generalization ability of these deep networks is still in doubt. To adapt models to other domains including different cities, countries, and weather, retraining with the target domain data is currently necessary, which hinders the wide application of autonomous driving. In this paper, we deeply analyze the cross-domain performance of the state-of-the-art models. We observe that most models will overfit the training domains and it is challenging to adapt them to other domains directly. Existing domain adaptation methods for 3D object detection problems are actually shifting the models' knowledge domain instead of improving their generalization ability. We then propose additional evaluation metrics -- the side-view and front-view AP -- to better analyze the core issues of the methods' heavy drops in accuracy levels. By using the proposed metrics and further evaluating the cross-domain performance in each dimension, we conclude that the overfitting problem happens more obviously on the front-view surface and the width dimension which usually faces the sensor and has more 3D points surrounding it. Meanwhile, our experiments indicate that the density of the point cloud data also significantly influences the models' cross-domain performance.
CVJul 5, 2024
Neural varifolds: an aggregate representation for quantifying the geometry of point cloudsJuheon Lee, Xiaohao Cai, Carola-Bibian Schönlieb et al.
Point clouds are popular 3D representations for real-life objects (such as in LiDAR and Kinect) due to their detailed and compact representation of surface-based geometry. Recent approaches characterise the geometry of point clouds by bringing deep learning based techniques together with geometric fidelity metrics such as optimal transportation costs (e.g., Chamfer and Wasserstein metrics). In this paper, we propose a new surface geometry characterisation within this realm, namely a neural varifold representation of point clouds. Here the surface is represented as a measure/distribution over both point positions and tangent spaces of point clouds. The varifold representation quantifies not only the surface geometry of point clouds through the manifold-based discrimination, but also subtle geometric consistencies on the surface due to the combined product space. This study proposes neural varifold algorithms to compute the varifold norm between two point clouds using neural networks on point clouds and their neural tangent kernel representations. The proposed neural varifold is evaluated on three different sought-after tasks -- shape matching, few-shot shape classification and shape reconstruction. Detailed evaluation and comparison to the state-of-the-art methods demonstrate that the proposed versatile neural varifold is superior in shape matching and few-shot shape classification, and is competitive for shape reconstruction.
LGApr 17, 2024
Virtual Foundry Graphnet for Metal Sintering Deformation PredictionRachel, Chen, Juheon Lee et al.
Metal Sintering is a necessary step for Metal Injection Molded parts and binder jet such as HP's metal 3D printer. The metal sintering process introduces large deformation varying from 25 to 50% depending on the green part porosity. In this paper, we use a graph-based deep learning approach to predict the part deformation, which can speed up the deformation simulation substantially at the voxel level. Running a well-trained Metal Sintering inferencing engine only takes a range of seconds to obtain the final sintering deformation value. The tested accuracy on example complex geometry achieves 0.7um mean deviation for a 63mm testing part.
ASSep 14, 2025
Length-Aware Rotary Position Embedding for Text-Speech AlignmentHyeongju Kim, Juheon Lee, Jinhyeok Yang et al.
Many recent text-to-speech (TTS) systems are built on transformer architectures and employ cross-attention mechanisms for text-speech alignment. Within these systems, rotary position embedding (RoPE) is commonly used to encode positional information in text and speech representations. In this work, we introduce length-aware RoPE (LARoPE), a simple yet effective extension of RoPE that improves text-speech alignment. Unlike RoPE, which relies on absolute indices, LARoPE computes relative distances between query and key positions using length-normalized indices. Experimental results show that LARoPE consistently outperforms RoPE, offering faster loss convergence, more accurate text-speech alignment, and higher overall TTS quality. Furthermore, LARoPE demonstrates greater resilience to variations in utterance duration and maintains stable performance in extended speech generation up to 30 seconds, whereas RoPE suffers from notable degradation. Notably, our method achieves a state-of-the-art word error rate on a standard zero-shot TTS benchmark.
ASSep 23, 2025
Training Flow Matching Models with Reliable Labels via Self-PurificationHyeongju Kim, Yechan Yu, June Young Yi et al.
Training datasets are inherently imperfect, often containing mislabeled samples due to human annotation errors, limitations of tagging models, and other sources of noise. Such label contamination can significantly degrade the performance of a trained model. In this work, we introduce Self-Purifying Flow Matching (SPFM), a principled approach to filtering unreliable data within the flow-matching framework. SPFM identifies suspicious data using the model itself during the training process, bypassing the need for pretrained models or additional modules. Our experiments demonstrate that models trained with SPFM generate samples that accurately adhere to the specified conditioning, even when trained on noisy labels. Furthermore, we validate the robustness of SPFM on the TITW dataset, which consists of in-the-wild speech data, achieving performance that surpasses existing baselines.
ASMar 29, 2025
SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech SystemHyeongju Kim, Jinhyeok Yang, Yechan Yu et al.
We introduce SupertonicTTS, a novel text-to-speech (TTS) system designed for efficient and streamlined speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. The TTS pipeline is further simplified by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we propose context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment with minimal memory and I/O overhead. Experimental results demonstrate that SupertonicTTS delivers performance comparable to contemporary zero-shot TTS models with only 44M parameters, while significantly reducing architectural complexity and computational cost. Audio samples are available at: https://supertonictts.github.io/.
CVFeb 11, 2025
GraphCompNet: A Position-Aware Model for Predicting and Compensating Shape Deviations in 3D PrintingLei, Chen, Juheon Lee et al.
This paper introduces a data-driven algorithm for modeling and compensating shape deviations in additive manufacturing (AM), addressing challenges in geometric accuracy and batch production. While traditional methods, such as analytical models and metrology, laid the groundwork for geometric precision, they are often impractical for large-scale production. Recent advancements in machine learning (ML) have improved compensation precision, but issues remain in generalizing across complex geometries and adapting to position-dependent variations. We present a novel approach for powder bed fusion (PBF) processes, using GraphCompNet, which is a computational framework combining graph-based neural networks with a generative adversarial network (GAN)-inspired training process. By leveraging point cloud data and dynamic graph convolutional neural networks (DGCNNs), GraphCompNet models complex shapes and incorporates position-specific thermal and mechanical factors. A two-stage adversarial training procedure iteratively refines compensated designs via a compensator-predictor architecture, offering real-time feedback and optimization. Experimental validation across diverse shapes and positions shows the framework significantly improves compensation accuracy (35 to 65 percent) across the entire print space, adapting to position-dependent variations. This work advances the development of Digital Twin technology for AM, enabling scalable, real-time monitoring and compensation, and addressing critical gaps in AM process control. The proposed method supports high-precision, automated industrial-scale design and manufacturing systems.
SDOct 27, 2021
Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised RepresentationsHyeong-Seok Choi, Juheon Lee, Wansoo Kim et al.
We present a neural analysis and synthesis (NANSY) framework that can manipulate voice, pitch, and speed of an arbitrary speech signal. Most of the previous works have focused on using information bottleneck to disentangle analysis features for controllable synthesis, which usually results in poor reconstruction quality. We address this issue by proposing a novel training strategy based on information perturbation. The idea is to perturb information in the original input signal (e.g., formant, pitch, and frequency response), thereby letting synthesis networks selectively take essential attributes to reconstruct the input signal. Because NANSY does not need any bottleneck structures, it enjoys both high reconstruction quality and controllability. Furthermore, NANSY does not require any labels associated with speech data such as text and speaker information, but rather uses a new set of analysis features, i.e., wav2vec feature and newly proposed pitch feature, Yingram, which allows for fully self-supervised training. Taking advantage of fully self-supervised training, NANSY can be easily extended to a multilingual setting by simply training it with a multilingual dataset. The experiments show that NANSY can achieve significant improvement in performance in several applications such as zero-shot voice conversion, pitch shift, and time-scale modification.
SDOct 29, 2019
Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis SystemJuheon Lee, Hyeong-Seok Choi, Junghyun Koo et al.
In this study, we define the identity of the singer with two independent concepts - timbre and singing style - and propose a multi-singer singing synthesis system that can model them separately. To this end, we extend our single-singer model into a multi-singer model in the following ways: first, we design a singer identity encoder that can adequately reflect the identity of a singer. Second, we use encoded singer identity to condition the two independent decoders that model timbre and singing style, respectively. Through a user study with the listening tests, we experimentally verify that the proposed framework is capable of generating a natural singing voice of high quality while independently controlling the timbre and singing style. Also, by using the method of changing singing styles while fixing the timbre, we suggest that our proposed network can produce a more expressive singing voice.
SDAug 6, 2019
Adversarially Trained End-to-end Korean Singing Voice Synthesis SystemJuheon Lee, Hyeong-Seok Choi, Chang-Bin Jeon et al.
In this paper, we propose an end-to-end Korean singing voice synthesis system from lyrics and a symbolic melody using the following three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch to the super-resolution network, and 3) conditional adversarial training. The proposed system consists of two main modules; a mel-synthesis network that generates a mel-spectrogram from the given input information, and a super-resolution network that upsamples the generated mel-spectrogram into a linear-spectrogram. In the mel-synthesis network, phonetic enhancement masking is applied to generate implicit formant masks solely from the input text, which enables a more accurate phonetic control of singing voice. In addition, we show that two other proposed methods -- local conditioning of text and pitch, and conditional adversarial training -- are crucial for a realistic generation of the human singing voice in the super-resolution process. Finally, both quantitative and qualitative evaluations are conducted, confirming the validity of all proposed methods.
CVMar 20, 2019
Three-dimensional Segmentation of Trees Through a Flexible Multi-Class Graph Cut Algorithm (MCGC)Jonathan Williams, Carola-Bibiane Schönlieb, Tom Swinfield et al.
Developing a robust algorithm for automatic individual tree crown (ITC) detection from laser scanning datasets is important for tracking the responses of trees to anthropogenic change. Such approaches allow the size, growth and mortality of individual trees to be measured, enabling forest carbon stocks and dynamics to be tracked and understood. Many algorithms exist for structurally simple forests including coniferous forests and plantations. Finding a robust solution for structurally complex, species-rich tropical forests remains a challenge; existing segmentation algorithms often perform less well than simple area-based approaches when estimating plot-level biomass. Here we describe a Multi-Class Graph Cut (MCGC) approach to tree crown delineation. This uses local three-dimensional geometry and density information, alongside knowledge of crown allometries, to segment individual tree crowns from LiDAR point clouds. Our approach robustly identifies trees in the top and intermediate layers of the canopy, but cannot recognise small trees. From these three-dimensional crowns, we are able to measure individual tree biomass. Comparing these estimates to those from permanent inventory plots, our algorithm is able to produce robust estimates of hectare-scale carbon density, demonstrating the power of ITC approaches in monitoring forests. The flexibility of our method to add additional dimensions of information, such as spectral reflectance, make this approach an obvious avenue for future development and extension to other sources of three-dimensional data, such as structure from motion datasets.
MMNov 2, 2018
Listen to Dance: Music-driven choreography generation using Autoregressive Encoder-Decoder NetworkJuheon Lee, Seohyun Kim, Kyogu Lee
Automatic choreography generation is a challenging task because it often requires an understanding of two abstract concepts - music and dance - which are realized in the two different modalities, namely audio and video, respectively. In this paper, we propose a music-driven choreography generation system using an auto-regressive encoder-decoder network. To this end, we first collect a set of multimedia clips that include both music and corresponding dance motion. We then extract the joint coordinates of the dancer from video and the mel-spectrogram of music from audio, and train our network using music-choreography pairs as input. Finally, a novel dance motion is generated at the inference time when only music is given as an input. We performed a user study for a qualitative evaluation of the proposed method, and the results show that the proposed model is able to generate musically meaningful and natural dance movements given an unheard song.
IRAug 31, 2018
Content-based feature exploration for transparent music recommendation using self-attentive genre classificationSeungjin Lee, Juheon Lee, Kyogu lee
Interpretation of retrieved results is an important issue in music recommender systems, particularly from a user perspective. In this study, we investigate the methods for providing interpretability of content features using self-attention. We extract lyric features with the self-attentive genre classification model trained on 140,000 tracks of lyrics. Likewise, we extract acoustic features using the acoustic model with self-attention trained on 120,000 tracks of acoustic signals. The experimental results show that the proposed methods provide the characteristics that are interpretable in terms of both lyrical and musical contents. We demonstrate this by visualizing the attention weights, and by presenting the most similar songs found using lyric or audio features.
SDDec 1, 2017
Audio Cover Song Identification using Convolutional Neural NetworkSungkyun Chang, Juheon Lee, Sang Keun Choe et al.
In this paper, we propose a new approach to cover song identification using a CNN (convolutional neural network). Most previous studies extract the feature vectors that characterize the cover song relation from a pair of songs and used it to compute the (dis)similarity between the two songs. Based on the observation that there is a meaningful pattern between cover songs and that this can be learned, we have reformulated the cover song identification problem in a machine learning framework. To do this, we first build the CNN using as an input a cross-similarity matrix generated from a pair of songs. We then construct the data set composed of cover song pairs and non-cover song pairs, which are used as positive and negative training samples, respectively. The trained CNN outputs the probability of being in the cover song relation given a cross-similarity matrix generated from any two pieces of music and identifies the cover song by ranking on the probability. Experimental results show that the proposed algorithm achieves performance better than or comparable to the state-of-the-art.
CVJan 24, 2017
A graph cut approach to 3D tree delineation, using integrated airborne LiDAR and hyperspectral imageryJuheon Lee, David Coomes, Carola-Bibiane Schonlieb et al.
Recognising individual trees within remotely sensed imagery has important applications in forest ecology and management. Several algorithms for tree delineation have been suggested, mostly based on locating local maxima or inverted basins in raster canopy height models (CHMs) derived from Light Detection And Ranging (LiDAR) data or photographs. However, these algorithms often lead to inaccurate estimates of forest stand characteristics due to the limited information content of raster CHMs. Here we develop a 3D tree delineation method which uses graph cut to delineate trees from the full 3D LiDAR point cloud, and also makes use of any optical imagery available (hyperspectral imagery in our case). First, conventional methods are used to locate local maxima in the CHM and generate an initial map of trees. Second, a graph is built from the LiDAR point cloud, fused with the hyperspectral data. For computational efficiency, the feature space of hyperspectral imagery is reduced using robust PCA. Third, a multi-class normalised cut is applied to the graph, using the initial map of trees to constrain the number of clusters and their locations. Finally, recursive normalised cut is used to subdivide, if necessary, each of the clusters identified by the initial analysis. We call this approach Multiclass Cut followed by Recursive Cut (MCRC). The effectiveness of MCRC was tested using three datasets: i) NewFor, ii) a coniferous forest in the Italian Alps, and iii) a deciduous woodland in the UK. The performance of MCRC was usually superior to that of other delineation methods, and was further improved by including high-resolution optical imagery. Since MCRC delineates the entire LiDAR point cloud in 3D, it allows individual crown characteristics to be measured. By making full use of the data available, graph cut has the potential to considerably improve the accuracy of tree delineation.
CVJul 28, 2014
Non-parametric Image Registration of Airborne LiDAR, Hyperspectral and Photographic Imagery of ForestsJuheon Lee, Xiaohao Cai, Carola-Bibiane Schonlieb et al.
There is much current interest in using multi-sensor airborne remote sensing to monitor the structure and biodiversity of forests. This paper addresses the application of non-parametric image registration techniques to precisely align images obtained from multimodal imaging, which is critical for the successful identification of individual trees using object recognition approaches. Non-parametric image registration, in particular the technique of optimizing one objective function containing data fidelity and regularization terms, provides flexible algorithms for image registration. Using a survey of woodlands in southern Spain as an example, we show that non-parametric image registration can be successful at fusing datasets when there is little prior knowledge about how the datasets are interrelated (i.e. in the absence of ground control points). The validity of non-parametric registration methods in airborne remote sensing is demonstrated by a series of experiments. Precise data fusion is a prerequisite to accurate recognition of objects within airborne imagery, so non-parametric image registration could make a valuable contribution to the analysis pipeline.