Jiadong Liang

ML
h-index9
14papers
221citations
Novelty49%
AI Score43

14 Papers

CVAug 12, 2022
Layout-Bridging Text-to-Image Synthesis

Jiadong Liang, Wenjie Pei, Feng Lu

The crux of text-to-image synthesis stems from the difficulty of preserving the cross-modality semantic consistency between the input text and the synthesized image. Typical methods, which seek to model the text-to-image mapping directly, could only capture keywords in the text that indicates common objects or actions but fail to learn their spatial distribution patterns. An effective way to circumvent this limitation is to generate an image layout as guidance, which is attempted by a few methods. Nevertheless, these methods fail to generate practically effective layouts due to the diversity of input text and object location. In this paper we push for effective modeling in both text-to-layout generation and layout-to-image synthesis. Specifically, we formulate the text-to-layout generation as a sequence-to-sequence modeling task, and build our model upon Transformer to learn the spatial relationships between objects by modeling the sequential dependencies between them. In the stage of layout-to-image synthesis, we focus on learning the textual-visual semantic alignment per object in the layout to precisely incorporate the input text into the layout-to-image synthesizing process. To evaluate the quality of generated layout, we design a new metric specifically, dubbed Layout Quality Score, which considers both the absolute distribution errors of bounding boxes in the layout and the mutual spatial relationships between them. Extensive experiments on three datasets demonstrate the superior performance of our method over state-of-the-art methods on both predicting the layout and synthesizing the image from the given text.

MLSep 29, 2023
Estimation and Inference in Distributional Reinforcement Learning

Liangyu Zhang, Yang Peng, Jiadong Liang et al.

In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, aiming to estimate the complete return distribution (denoted $η^π$) attained by a given policy $π$. We use the certainty-equivalence method to construct our estimator $\hatη^π$, given a generative model is available. In this circumstance we need a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2p}(1-γ)^{2p+2}}\right)$ to guarantee the $p$-Wasserstein metric between $\hatη^π$ and $η^π$ less than $\varepsilon$ with high probability. This implies the distributional policy evaluation problem can be solved with sample efficiency. Also, we show that under different mild assumptions a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-γ)^{4}}\right)$ suffices to ensure the Kolmogorov metric and total variation metric between $\hatη^π$ and $η^π$ is below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hatη^π$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hatη^π-η^π)$ converges weakly to a Gaussian process in the space of bounded functionals on Lipschitz function class $\ell^\infty(\mathcal{F}_{\text{W}})$, also in the space of bounded functionals on indicator function class $\ell^\infty(\mathcal{F}_{\text{KS}})$ and bounded measurable function class $\ell^\infty(\mathcal{F}_{\text{TV}})$ when some mild conditions hold. Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $η^π$.

CVMar 20
PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing

Jiadong Liang, Bojun Xiong, Jie Tian et al.

This paper primarily investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in animation and film industries. Most existing research mainly focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, it remains challenging for them to disentangle the facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method which is dedicated to recast the performance in existing film and animation. The key insight of our method comes from the characteristics of 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of 3D face mesh with separate parameters. Therefore, we improve the keypoints transformation formula in previous methods to make it more consistent with 3DMM model, which achieves a better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid the misalignment around the boundary of face in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results which are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. Our code, data and trained models are available at https://youku-aigc.github.io/PerformRecast.

STFeb 18
Optimal training-conditional regret for online conformal prediction

Jiadong Liang, Zhimei Ren, Yuxin Chen

We study online conformal prediction for non-stationary data streams subject to unknown distribution drift. While most prior work studied this problem under adversarial settings and/or assessed performance in terms of gaps of time-averaged marginal coverage, we instead evaluate performance through training-conditional cumulative regret. We specifically focus on independently generated data with two types of distribution shift: abrupt change points and smooth drift. When non-conformity score functions are pretrained on an independent dataset, we propose a split-conformal style algorithm that leverages drift detection to adaptively update calibration sets, which provably achieves minimax-optimal regret. When non-conformity scores are instead trained online, we develop a full-conformal style algorithm that again incorporates drift detection to handle non-stationarity; this approach relies on stability - rather than permutation symmetry - of the model-fitting algorithm, which is often better suited to online learning under evolving environments. We establish non-asymptotic regret guarantees for our online full conformal algorithm, which match the minimax lower bound under appropriate restrictions on the prediction sets. Numerical experiments corroborate our theoretical findings.

MLJan 22, 2025
Low-dimensional adaptation of diffusion models: Convergence in total variation

Jiadong Liang, Zhihan Huang, Yuxin Chen

This paper investigates how diffusion generative models leverage (unknown) low-dimensional structure to accelerate sampling. Focusing on two mainstream samplers -- the denoising diffusion implicit model (DDIM) and the denoising diffusion probabilistic model (DDPM) -- and assuming accurate score estimates, we prove that their iteration complexities are no greater than the order of $k/\varepsilon$ (up to some log factor), where $\varepsilon$ is the precision in total variation distance and $k$ is some intrinsic dimension of the target distribution. Our results are applicable to a broad family of target distributions without requiring smoothness or log-concavity assumptions. Further, we develop a lower bound that suggests the (near) necessity of the coefficients introduced by Ho et al.(2020) and Song et al.(2020) in facilitating low-dimensional adaptation. Our findings provide the first rigorous evidence for the adaptivity of the DDIM-type samplers to unknown low-dimensional structure, and improve over the state-of-the-art DDPM theory regarding total variation convergence.

MLOct 19, 2024
Asymptotic Time-Uniform Inference for Parameters in Averaged Stochastic Approximation

Chuhan Xie, Kaicheng Jin, Jiadong Liang et al.

We study time-uniform statistical inference for parameters in stochastic approximation (SA), which encompasses a bunch of applications in optimization and machine learning. To that end, we analyze the almost-sure convergence rates of the averaged iterates to a scaled sum of Gaussians in both linear and nonlinear SA problems. We then construct three types of asymptotic confidence sequences that are valid uniformly across all times with coverage guarantees, in an asymptotic sense that the starting time is sufficiently large. These coverage guarantees remain valid if the unknown covariance matrix is replaced by its plug-in estimator, and we conduct experiments to validate our methodology.

CLNov 14, 2024
Semantic, Orthographic, and Phonological Biases in Humans' Wordle Gameplay

Jiadong Liang, Adam Kabbara, Jiaying Liu et al.

We show that human players' gameplay in the game of Wordle is influenced by the semantics, orthography, and phonology of the player's previous guesses. We compare actual human players' guesses with near-optimal guesses using NLP techniques. We study human language use in the constrained environment of Wordle, which is situated between natural language use and the artificial word association task

CVJun 12, 2024
Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Jiadong Liang, Feng Lu

Vivid talking face generation holds immense potential applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expressions, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, these relocated landmarks are mapped to latent key points using self-supervised learning and then input into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.

MLDec 29, 2021
A Statistical Analysis of Polyak-Ruppert Averaged Q-learning

Xiang Li, Wenhao Yang, Jiadong Liang et al.

We study Q-learning with Polyak-Ruppert averaging in a discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iteration $\bar{\boldsymbol{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The functional central limit theorem implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is the regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-learning without the Lipschitz condition.

MLSep 3, 2021
Statistical Estimation and Inference via Local SGD in Federated Learning

Xiang Li, Jiadong Liang, Xiangyu Chang et al.

Federated Learning (FL) makes a large amount of edge computing devices (e.g., mobile phones) jointly learn a global model without data sharing. In FL, data are generated in a decentralized manner with high heterogeneity. This paper studies how to perform statistical estimation and inference in the federated setting. We analyze the so-called Local SGD, a multi-round estimation procedure that uses intermittent communication to improve communication efficiency. We first establish a {\it functional central limit theorem} that shows the averaged iterates of Local SGD weakly converge to a rescaled Brownian motion. We next provide two iterative inference methods: the {\it plug-in} and the {\it random scaling}. Random scaling constructs an asymptotically pivotal statistic for inference by using the information along the whole Local SGD path. Both the methods are communication efficient and applicable to online data. Our theoretical and empirical results show that Local SGD simultaneously achieves both statistical efficiency and communication efficiency.

MLAug 9, 2020
Intervention Generative Adversarial Networks

Jiadong Liang, Liangyu Zhang, Cheng Zhang et al.

In this paper we propose a novel approach for stabilizing the training process of Generative Adversarial Networks as well as alleviating the mode collapse problem. The main idea is to introduce a regularization term that we call intervention loss into the objective. We refer to the resulting generative model as Intervention Generative Adversarial Networks (IVGAN). By perturbing the latent representations of real images obtained from an auxiliary encoder network with Gaussian invariant interventions and penalizing the dissimilarity of the distributions of the resulting generated images, the intervention loss provides more informative gradient for the generator, significantly improving GAN's training stability. We demonstrate the effectiveness and efficiency of our methods via solid theoretical analysis and thorough evaluation on standard real-world datasets as well as the stacked MNIST dataset.

CVDec 18, 2019
CPGAN: Full-Spectrum Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis

Jiadong Liang, Wenjie Pei, Feng Lu

Typical methods for text-to-image synthesis seek to design effective generative architecture to model the text-to-image mapping directly. It is fairly arduous due to the cross-modality translation. In this paper we circumvent this problem by focusing on parsing the content of both the input text and the synthesized image thoroughly to model the text-to-image consistency in the semantic level. Particularly, we design a memory structure to parse the textual content by exploring semantic correspondence between each word in the vocabulary to its various visual contexts across relevant images during text encoding. Meanwhile, the synthesized image is parsed to learn its semantics in an object-aware manner. Moreover, we customize a conditional discriminator to model the fine-grained correlations between words and image sub-regions to push for the text-image semantic alignment. Extensive experiments on COCO dataset manifest that our model advances the state-of-the-art performance significantly (from 35.69 to 52.73 in Inception Score).

LGFeb 15, 2019
Lipschitz Generative Adversarial Nets

Zhiming Zhou, Jiadong Liang, Yuxuan Song et al.

In this paper, we study the convergence of generative adversarial networks (GANs) from the perspective of the informativeness of the gradient of the optimal discriminative function. We show that GANs without restriction on the discriminative function space commonly suffer from the problem that the gradient produced by the discriminator is uninformative to guide the generator. By contrast, Wasserstein GAN (WGAN), where the discriminative function is restricted to 1-Lipschitz, does not suffer from such a gradient uninformativeness problem. We further show in the paper that the model with a compact dual form of Wasserstein distance, where the Lipschitz condition is relaxed, may also theoretically suffer from this issue. This implies the importance of Lipschitz condition and motivates us to study the general formulation of GANs with Lipschitz constraint, which leads to a new family of GANs that we call Lipschitz GANs (LGANs). We show that LGANs guarantee the existence and uniqueness of the optimal discriminative function as well as the existence of a unique Nash equilibrium. We prove that LGANs are generally capable of eliminating the gradient uninformativeness problem. According to our empirical analysis, LGANs are more stable and generate consistently higher quality samples compared with WGAN.

LGJul 2, 2018
Understanding the Effectiveness of Lipschitz-Continuity in Generative Adversarial Nets

Zhiming Zhou, Yuxuan Song, Lantao Yu et al.

In this paper, we investigate the underlying factor that leads to failure and success in the training of GANs. We study the property of the optimal discriminative function and show that in many GANs, the gradient from the optimal discriminative function is not reliable, which turns out to be the fundamental cause of failure in training of GANs. We further demonstrate that a well-defined distance metric does not necessarily guarantee the convergence of GANs. Finally, we prove in this paper that Lipschitz-continuity condition is a general solution to make the gradient of the optimal discriminative function reliable, and characterized the necessary condition where Lipschitz-continuity ensures the convergence, which leads to a broad family of valid GAN objectives under Lipschitz-continuity condition, where Wasserstein distance is one special case. We experiment with several new objectives, which are sound according to our theorems, and we found that, compared with Wasserstein distance, the outputs of the discriminator with new objectives are more stable and the final qualities of generated samples are also consistently higher than those produced by Wasserstein distance.