Siyang Wang

NA
h-index: 36
21 papers
219 citations
Novelty: 44%
AI Score: 53

21 Papers

AS Jun 15, 2023
Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

Shivam Mehta, Siyang Wang, Simon Alexanderson et al.

With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach. Please see https://shivammehta25.github.io/Diff-TTSG/ for video examples, data, and code.

6.3 NA May 28
Enriched higher-order multiscale approaches with applications to wave propagation

Balaje Kalyanaraman, Felix Krumbiegel, Roland Maier et al.

We consider the numerical solution of partial differential equations with coefficients that are strongly heterogeneous in space. We provide an overview of higher-order localized orthogonal decomposition (LOD) methods for the elliptic setting, including recent advancements, and then present a generalization of the strategy to linear hyperbolic multiscale problems. We address the limitations of earlier constructions for the wave equation, which only achieve second-order convergence in space, independent of the chosen polynomial degree. Building on the methodology of enriched corrections recently developed for parabolic multiscale problems, we motivate and propose an enriched higher-order LOD method for the wave equation. The enriched corrections exhibit exponential decay and can be computed on patches. Under minimal assumptions on the coefficient and standard well-preparedness conditions on the data, we derive a priori error estimates that achieve optimal high-order convergence rates, thereby overcoming the previously observed saturation of the convergence rate. With the fifth-order Rosenbrock-Wanner (ROW) time integrator, we conduct a series of numerical examples to verify our theoretical results. We provide examples showing the optimal spatial convergence of the method including the localization errors for different polynomial orders. We also present examples showing the optimal convergence rates of the time discretization.

AS Jul 11, 2023
On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis

Siyang Wang, Gustav Eje Henter, Joakim Gustafson et al.

Self-supervised learning (SSL) speech representations learned from large amounts of diverse, mixed-quality speech data without transcriptions are gaining ground in many speech technology applications. Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech. However, it is still not clear which SSL model, and which layer of each model, is best suited for spontaneous TTS. We address this shortcoming by extending the scope of comparison for SSL in spontaneous TTS to 6 different SSLs and 3 layers within each SSL. Furthermore, SSL has also shown potential in predicting the mean opinion scores (MOS) of synthesized speech, but this has only been done in read-speech MOS prediction. We extend an SSL-based MOS prediction framework previously developed for scoring read speech synthesis and evaluate its performance on synthesized spontaneous speech. All experiments are repeated on two different spontaneous corpora in order to find generalizable trends. Overall, we present comprehensive experimental results on the use of SSL in spontaneous TTS and MOS prediction to further quantify and understand how SSL can be used in spontaneous TTS. Audio samples: https://www.speech.kth.se/tts-demos/sp_ssl_tts

NA Nov 21, 2017
High-order numerical methods for 2D parabolic problems in single and composite domains

Gustav Ludvigsson, Kyle R. Steffen, Simon Sticko et al.

In this work, we discuss and compare three methods for the numerical approximation of constant- and variable-coefficient diffusion equations in both single and composite domains with possible discontinuity in the solution/flux at interfaces, considering (i) the Cut Finite Element Method; (ii) the Difference Potentials Method; and (iii) the summation-by-parts Finite Difference Method. First, we give a brief introduction to each of the three methods. Next, we propose benchmark problems and consider numerical tests, with respect to accuracy and convergence, for linear parabolic problems on a single domain, and continue with similar tests for linear parabolic problems on a composite domain (with the interface defined either explicitly or implicitly). Lastly, a comparative discussion of the methods and numerical results is given.

AS Mar 5, 2023
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

Siyang Wang, Gustav Eje Henter, Joakim Gustafson et al.

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

14.5 NA May 14
Optimal higher-order convergence rates for parabolic multiscale problems

Balaje Kalyanaraman, Felix Krumbiegel, Roland Maier et al.

In this paper, we introduce a higher-order multiscale method for time-dependent problems with highly oscillatory coefficients. Building on the localized orthogonal decomposition (LOD) framework, we construct enriched correction operators to enrich the multiscale spaces, ensuring higher-order convergence without requiring assumptions on the coefficient beyond boundedness. This approach addresses the challenge of a reduction of convergence rates when applying higher-order LOD methods to time-dependent problems. Addressing a parabolic equation as a model problem, we prove the exponential decay of these enriched corrections and establish rigorous a priori error estimates. Numerical experiments confirm our theoretical results.

NA Mar 15, 2019
An Energy Based Discontinuous Galerkin Method for Coupled Elasto-Acoustic Wave Equations in Second Order Form

Daniel Appelö, Siyang Wang

We consider wave propagation in a coupled fluid-solid region, separated by a static but possibly curved interface. The wave propagation is modeled by the acoustic wave equation in terms of a velocity potential in the fluid, and the elastic wave equation for the displacement in the solid. At the fluid-solid interface, we impose suitable interface conditions to couple the two equations. We use a recently developed, energy based discontinuous Galerkin method to discretize the governing equations in space. Both energy conserving and upwind numerical fluxes are derived to impose the interface conditions. The highlights of the developed scheme include provable energy stability and high order accuracy. We present numerical experiments to illustrate the accuracy and robustness of the developed scheme.

NA Jun 5, 2018
Order Preserving Interpolation for Summation-by-Parts Operators at Non-Conforming Grid Interfaces

Martin Almquist, Siyang Wang, Jonatan Werpers

We study non-conforming grid interfaces for summation-by-parts finite difference methods applied to partial differential equations with second derivatives in space. To maintain energy stability, previous efforts have been forced to accept a reduction of the global convergence rate by one order, due to large truncation errors at the non-conforming interface. We avoid the order reduction by generalizing the interface treatment and introducing order preserving interpolation operators. We prove that, given two diagonal-norm summation-by-parts schemes, order preserving interpolation operators with the necessary properties are guaranteed to exist, regardless of the grid-point distributions along the interface. The new methods retain the stability and global accuracy properties of the underlying schemes for conforming interfaces.

NA Apr 12, 2018
An improved high order finite difference method for non-conforming grid interfaces for the wave equation

Siyang Wang

This paper presents an extension of a recently developed high order finite difference method for the wave equation on a grid with non-conforming interfaces. The stability proof of the existing methods relies on the interpolation operators being norm-contracting, which is satisfied by the second and fourth order operators, but not by the sixth order operator. We construct new penalty terms to impose interface conditions such that the stability proof does not require the norm-contracting condition. As a consequence, the sixth order accurate scheme is also provably stable. Numerical experiments demonstrate the improved stability and accuracy properties.

CV Jul 19, 2024
Enhancing Layout Hotspot Detection Efficiency with YOLOv8 and PCA-Guided Augmentation

Dongyang Wu, Siyang Wang, Mehdi Kamal et al.

In this paper, we present a YOLO-based framework for layout hotspot detection, aiming to enhance the efficiency and performance of the design rule checking (DRC) process. Our approach leverages the YOLOv8 vision model to detect multiple hotspots within each layout image, even when dealing with large layout image sizes. Additionally, to enhance pattern-matching effectiveness, we introduce a novel approach to augment the layout image using information extracted through Principal Component Analysis (PCA). The core of our proposed method is an algorithm that utilizes PCA to extract valuable auxiliary information from the layout image. This extracted information is then incorporated into the layout image as an additional color channel. This augmentation significantly improves the accuracy of multi-hotspot detection while reducing the false alarm rate of the object detection algorithm. We evaluate the effectiveness of our framework using four datasets generated from layouts found in the ICCAD-2019 benchmark dataset. The results demonstrate that our framework achieves a precision (recall) of approximately 83% (86%) while maintaining a false alarm rate of less than 7.4%. The studies also show that the proposed augmentation approach could improve the detection ability of never-seen-before (NSB) hotspots by about 10%.

CV Dec 31, 2025
From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation

Siyang Wang, Hanting Li, Wei Li et al.

Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference efficiency. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors, a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are systematically grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism. This mechanism dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
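
The ring-wise ordering described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `radial_rings` is hypothetical, and the use of Chebyshev distance to define the rings is an assumption, since the abstract only says tokens are grouped "according to their spatial distances" from the starting token.

```python
from collections import defaultdict


def radial_rings(height, width, center):
    """Group token positions of a height x width grid into concentric rings.

    Assumption: a token's ring index is its Chebyshev distance to `center`.
    The abstract does not state which distance RadAR actually uses.
    """
    cy, cx = center
    rings = defaultdict(list)
    for y in range(height):
        for x in range(width):
            ring_index = max(abs(y - cy), abs(x - cx))
            rings[ring_index].append((y, x))
    # Inner rings first: each ring would be predicted in one parallel step.
    return [rings[d] for d in sorted(rings)]


rings = radial_rings(4, 4, center=(1, 1))
ring_sizes = [len(ring) for ring in rings]
print(ring_sizes)  # [1, 8, 7]: 3 decoding steps instead of 16 raster-scan steps
```

Under this assumption, a 4 x 4 token grid is covered in three ring-wise steps, which is where the claimed parallelism would come from; the nested attention correction mentioned in the abstract is not modelled here.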

RO Sep 15, 2025
Learning to Generate Pointing Gestures in Situated Embodied Conversational Agents

Anna Deichler, Siyang Wang, Simon Alexanderson et al.

One of the main goals of robotics and intelligent agent research is to enable natural communication with humans in physically situated settings. While recent work has focused on verbal modes such as language and speech, non-verbal communication is crucial for flexible interaction. We present a framework for generating pointing gestures in embodied agents by combining imitation and reinforcement learning. Using a small motion capture dataset, our method learns a motor control policy that produces physically valid, naturalistic gestures with high referential accuracy. We evaluate the approach against supervised learning and retrieval baselines in both objective metrics and a virtual reality referential game with human users. Results show that our system achieves higher naturalness and accuracy than state-of-the-art supervised models, highlighting the promise of imitation-RL for communicative gesture generation and its potential application to robots.

RO Sep 16, 2025
Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation

Anna Deichler, Siyang Wang, Simon Alexanderson et al.

Pointing is a key mode of interaction with robots, yet most prior work has focused on recognition rather than generation. We present a motion capture dataset of human pointing gestures covering diverse styles, handedness, and spatial targets. Using reinforcement learning with motion imitation, we train policies that reproduce human-like pointing while maximizing precision. Results show our approach enables context-aware pointing behaviors in simulation, balancing task performance with natural dynamics.

CV Dec 30, 2024
Navigating Image Restoration with VAR's Distribution Alignment Prior

Siyang Wang, Feng Zhao

Generative models trained on extensive high-quality datasets effectively capture the structural and statistical properties of clean images, rendering them powerful priors for transforming degraded features into clean ones in image restoration. VAR, a novel image generative paradigm, surpasses diffusion models in generation quality by applying a next-scale prediction approach. It progressively captures both global structures and fine-grained details through the autoregressive process, consistent with the multi-scale restoration principle widely acknowledged in the restoration community. Furthermore, we observe that during the image reconstruction process utilizing VAR, scale predictions automatically modulate the input, facilitating the alignment of representations at subsequent scales with the distribution of clean images. To harness VAR's adaptive distribution alignment capability in image restoration tasks, we formulate the multi-scale latent representations within VAR as the restoration prior, thus advancing our delicately designed VarFormer framework. The strategic application of these priors enables our VarFormer to achieve remarkable generalization on unseen tasks while also reducing training computational costs. Extensive experiments underscore that our VarFormer outperforms existing multi-task image restoration methods across various restoration tasks.

HC Aug 25, 2021
Integrated Speech and Gesture Synthesis

Siyang Wang, Simon Alexanderson, Joakim Gustafson et al.

Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified from state-of-the-art neural speech-synthesis engines to achieve this goal. We evaluate the models in three carefully-designed user studies, two of which evaluate the synthesized speech and gesture in isolation, plus a combined study that evaluates the models as they would be used in real-world applications -- speech and gesture presented together. The results show that participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system we compare against, in all three tests. The model is able to achieve this with faster synthesis time and greatly reduced parameter count compared to the pipeline system, illustrating some of the potential benefits of treating speech and gesture synthesis together as a single, unified problem. Videos and code are available on our project page at https://swatsw.github.io/isg_icmi21/

CV Oct 9, 2019
Unaligned Image-to-Sequence Transformation with Loop Consistency

Siyang Wang, Justin Lazarow, Kwonjoon Lee et al.

We tackle the problem of modeling sequential visual phenomena. Given examples of a phenomenon that can be divided into discrete time steps, we aim to take an input from any such time and realize this input at all other time steps in the sequence. Furthermore, we aim to do this without ground-truth aligned sequences -- avoiding the difficulties of gathering aligned data. This generalizes the unpaired image-to-image problem from generating pairs to generating sequences. We extend cycle consistency to loop consistency and alleviate difficulties associated with learning in the resulting long chains of computation. We show competitive results compared to existing image-to-image techniques when modeling several different data sets including the Earth's seasons and aging of human faces.

CV Jun 6, 2019
Attention is all you need for Videos: Self-attention based Video Summarization using Universal Transformers

Manjot Bilkhu, Siyang Wang, Tushar Dobhal

Video captioning and summarization have become very popular in recent years due to advances in sequence modelling, with the resurgence of Long Short-Term Memory networks (LSTMs) and the introduction of Gated Recurrent Units (GRUs). Existing architectures extract spatio-temporal features using CNNs and use either GRUs or LSTMs to model dependencies with soft attention layers. These attention layers do help in attending to the most prominent features and improve upon the recurrent units; however, these models suffer from the inherent drawbacks of the recurrent units themselves. The introduction of the Transformer model has driven the sequence modelling field in a new direction. In this project, we implement a Transformer-based model for video captioning, utilizing 3D CNN architectures like C3D and Two-stream I3D for video feature extraction. We also apply dimensionality reduction techniques to keep the overall size of the model within limits. Finally, we present our results on the MSVD and ActivityNet datasets for single and dense video captioning tasks, respectively.

CV Dec 6, 2017
Controllable Top-down Feature Transformer

Zhiwei Jia, Haoshen Hong, Siyang Wang et al.

We study the intrinsic transformation of feature maps across convolutional network layers with explicit top-down control. To this end, we develop the top-down feature transformer (TFT), which, under controllable parameters, is able to account for the hidden layer transformation while maintaining the overall consistency across layers. The learned generators capture the underlying feature transformation processes that are independent of particular training images. Our proposed TFT framework brings insights to, and helps the understanding of, the important problem of studying CNN internal feature representation and transformation under top-down processes. In the case of spatial transformations, we demonstrate the significant advantage of TFT over existing data-driven approaches in building data-independent transformations. We also show that it can be adopted in other applications such as data augmentation and image style transfer.

NA May 18, 2017
Convergence of finite difference methods for the wave equation in two space dimensions

Siyang Wang, Anna Nissen, Gunilla Kreiss

When using a finite difference method to solve an initial-boundary-value problem, the truncation error is often of lower order at a few grid points near boundaries than in the interior. Normal mode analysis is a powerful tool to analyze the effect of the large truncation error near boundaries on the overall convergence rate, and has been used extensively in the literature for different equations. However, existing work only concerns problems in one space dimension. In this paper, we extend the analysis to problems in two space dimensions. The two dimensional analysis is based on a diagonalization procedure that decomposes a two dimensional problem into many one dimensional problems of the same type. We present a general framework for analyzing convergence for such one dimensional problems, and explain how to obtain the result for the corresponding two dimensional problem. In particular, we consider two kinds of truncation errors in two space dimensions: the truncation error along an entire boundary, and the truncation error localized at a few grid points close to a corner of the computational domain. The accuracy analysis is in a general framework, here applied to the second order wave equation. Numerical experiments corroborate our accuracy analysis.
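
Convergence rates of the kind discussed in this abstract are usually checked numerically by comparing errors on successively refined grids. The sketch below shows one common way to do that, assuming the error behaves like e(h) ≈ C h^p. The function names are hypothetical, and the error values are synthetic, generated from the model 3 h^4; they are not results from the paper.

```python
import math


def observed_order(h_coarse, e_coarse, h_fine, e_fine):
    """Estimate p in e(h) ~ C * h**p from errors measured on two grids."""
    return math.log(e_coarse / e_fine) / math.log(h_coarse / h_fine)


def synthetic_error(h, order=4, constant=3.0):
    """Synthetic error model (an assumption for illustration, not measured data)."""
    return constant * h ** order


grid_sizes = [0.1, 0.05, 0.025]
errors = [synthetic_error(h) for h in grid_sizes]
rates = [
    observed_order(grid_sizes[i], errors[i], grid_sizes[i + 1], errors[i + 1])
    for i in range(len(grid_sizes) - 1)
]
print(rates)  # both entries are approximately 4
```

With real data, the measured rates would be compared against the rate predicted by the accuracy analysis rather than against a known model.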

NA Sep 25, 2015
High order finite difference methods for the wave equation with non-conforming grid interfaces

Siyang Wang, Kristoffer Virta, Gunilla Kreiss

We use high order finite difference methods to solve the wave equation in the second order form. The spatial discretization is performed by finite difference operators satisfying a summation-by-parts property. The focus of this work is on the numerical treatment of non-conforming grid interfaces. The interface conditions are imposed weakly by the simultaneous approximation term technique in combination with interface operators, which move the discrete solutions between the grids on the interface. In particular, we consider interpolation operators and projection operators. A norm-compatibility condition, which leads to stability for first order hyperbolic systems, does not suffice for second order wave equations. An extra constraint on the interface operators must be satisfied to derive an energy estimate for stability. We carry out eigenvalue analyses to investigate the additional constraint and how it is related to stability, and find that the projection operators have better stability properties than the interpolation operators. In addition, a truncation error analysis is performed to study the convergence property of the numerical schemes. In the numerical experiments, the stability and accuracy properties of the numerical schemes are further explored, and the practical usefulness of non-conforming grid interfaces is presented and discussed in two efficiency studies.

NA Sep 3, 2015
Convergence of summation-by-parts finite difference methods for the wave equation

Siyang Wang, Gunilla Kreiss

In this paper, we consider finite difference approximations of the second order wave equation. We use finite difference operators satisfying the summation-by-parts property to discretize the equation in space. Boundary conditions and grid interface conditions are imposed by the simultaneous-approximation-term technique. Typically, the truncation error is larger at the grid points near a boundary or grid interface than that in the interior. Normal mode analysis can be used to analyze how the large truncation error affects the convergence rate of the underlying stable numerical scheme. If the semi-discretized equation satisfies a determinant condition, two orders are gained from the large truncation error. However, many interesting second order equations do not satisfy the determinant condition. We then carefully analyze the solution of the boundary system to derive a sharp estimate for the error in the solution and acquire the gain in convergence rate. The result shows that stability does not automatically yield a gain of two orders in convergence rate. The accuracy analysis is verified by numerical experiments.
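
The summation-by-parts (SBP) property mentioned in this abstract can be made concrete with a small sketch. It builds the classical second-order diagonal-norm SBP first-derivative operator D = H^{-1} Q and checks that Q + Q^T = diag(-1, 0, ..., 0, 1), which is the discrete analogue of integration by parts. This operator is chosen only for illustration; the abstract does not say which SBP operators the paper uses, and the function names are hypothetical.

```python
def sbp_second_order(n, h):
    """Return (H_diag, Q) for the second-order SBP first-derivative operator.

    H_diag holds the diagonal of the norm matrix H; D = H^{-1} Q.
    """
    H_diag = [h] * n
    H_diag[0] = H_diag[-1] = h / 2
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        Q[i][i + 1] = 0.5
        Q[i + 1][i] = -0.5
    Q[0][0] = -0.5
    Q[-1][-1] = 0.5
    return H_diag, Q


def apply_derivative(H_diag, Q, u):
    """Compute D u = H^{-1} Q u for a grid function u."""
    n = len(u)
    return [sum(Q[i][j] * u[j] for j in range(n)) / H_diag[i] for i in range(n)]


n, h = 6, 0.2
H_diag, Q = sbp_second_order(n, h)

# SBP property: Q + Q^T should equal diag(-1, 0, ..., 0, 1).
B = [[Q[i][j] + Q[j][i] for j in range(n)] for i in range(n)]

# The operator differentiates linear functions exactly, including at the boundaries.
u = [2.0 * i * h + 1.0 for i in range(n)]
du = apply_derivative(H_diag, Q, u)
```

The boundary rows of D reduce to one-sided differences, which is the source of the larger truncation error near boundaries that the abstract analyzes.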