Yilin Yang

CL
h-index88
21papers
4,767citations
Novelty44%
AI Score53

21 Papers

CLAug 22, 2023Code
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Seamless Communication, Loïc Barrault, Yu-An Chung et al. · meta-ai, mit

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

CLNov 11, 2022Code
Speech-to-Speech Translation For A Real-world Unwritten Language

Peng-Jen Chen, Kevin Tran, Yilin Yang et al. · meta-ai

We study speech-to-speech translation (S2ST) that translates speech from one language into another language and focuses on building systems to support languages without standard text writing systems. We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection, modeling choices to benchmark dataset release. First, we present efforts on creating human annotated data, automatically mining data from large unlabeled speech datasets, and adopting pseudo-labeling to produce weakly supervised data. On the modeling, we take advantage of recent advances in applying self-supervised discrete representations as target for prediction in S2ST and show the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Finally, we release an S2ST benchmark set to facilitate future research in this field. The demo can be found at https://huggingface.co/spaces/facebook/Hokkien_Translation .

NIDec 5, 2022
Differentiated Federated Reinforcement Learning Based Traffic Offloading on Space-Air-Ground Integrated Networks

Yeguang Qin, Yilin Yang, Fengxiao Tang et al. · mila

The Space-Air-Ground Integrated Network (SAGIN) plays a pivotal role as a comprehensive foundational network communication infrastructure, presenting opportunities for highly efficient global data transmission. Nonetheless, given SAGIN's unique characteristics as a dynamically heterogeneous network, conventional network optimization methodologies encounter challenges in satisfying the stringent requirements for network latency and stability inherent to data transmission within this network environment. Therefore, this paper proposes the use of differentiated federated reinforcement learning (DFRL) to solve the traffic offloading problem in SAGIN, i.e., using multiple agents to generate differentiated traffic offloading policies. Considering the differentiated characteristics of each region of SAGIN, DFRL models the traffic offloading policy optimization process as the process of solving the Decentralized Partially Observable Markov Decision Process (DEC-POMDP) problem. The paper proposes a novel Differentiated Federated Soft Actor-Critic (DFSAC) algorithm to solve the problem. The DFSAC algorithm takes the network packet delay as the joint reward value and introduces the global trend model as the joint target action-value function of each agent to guide the update of each agent's policy. The simulation results demonstrate that the traffic offloading policy based on the DFSAC algorithm achieves better performance in terms of network throughput, packet loss rate, and packet delay compared to the traditional federated reinforcement learning approach and other baseline approaches.

CLMay 30
From Empathy to Personalized Empathy: Adapting Empathetic Strategies to Individual Users

Wuqiang Zheng, Chengbing Wang, Yilin Yang et al.

As Large Language Models (LLMs) are increasingly deployed in long-term interactions with users, empathy has become an increasingly important capability. However, existing research overlooks the influence of users' personality traits on empathetic strategies during long-term interactions. To address this gap, we introduce the task of personalized empathy, which focuses on adapting empathetic strategies according to users' personalized characteristics derived from history. To study and enhance this capability, we construct PersonaEmp, a personalized empathy dataset built from long-term user-AI interactions, featuring rich user histories, persona information, and empathy-seeking queries. We further propose PereGRM, a reward modeling framework that combines the empathy evaluation structure with dynamic evaluation criteria generation for fine-grained reward modeling. Experimental results across different settings and multiple judge models show that PereGRM consistently achieves the strongest performance improvements, indicating its effectiveness for enhancing personalized empathetic capabilities.

CVSep 9, 2024Code
Prototype-Driven Multi-Feature Generation for Visible-Infrared Person Re-identification

Jiarui Li, Zhen Qiu, Yilin Yang et al.

The primary challenges in visible-infrared person re-identification arise from the differences between visible (vis) and infrared (ir) images, including inter-modal and intra-modal variations. These challenges are further complicated by varying viewpoints and irregular movements. Existing methods often rely on horizontal partitioning to align part-level features, which can introduce inaccuracies and have limited effectiveness in reducing modality discrepancies. In this paper, we propose a novel Prototype-Driven Multi-feature generation framework (PDM) aimed at mitigating cross-modal discrepancies by constructing diversified features and mining latent semantically similar features for modal alignment. PDM comprises two key components: Multi-Feature Generation Module (MFGM) and Prototype Learning Module (PLM). The MFGM generates diversity features closely distributed from modality-shared features to represent pedestrians. Additionally, the PLM utilizes learnable prototypes to excavate latent semantic similarities among local features between visible and infrared modalities, thereby facilitating cross-modal instance-level alignment. We introduce the cosine heterogeneity loss to enhance prototype diversity for extracting rich local features. Extensive experiments conducted on the SYSU-MM01 and LLCM datasets demonstrate that our approach achieves state-of-the-art performance. Our codes are available at https://github.com/mmunhappy/ICASSP2025-PDM.

LGMar 3, 2023
Differentially Private Neural Tangent Kernels for Privacy-Preserving Data Generation

Yilin Yang, Kamil Adamczewski, Danica J. Sutherland et al.

Maximum mean discrepancy (MMD) is a particularly useful distance metric for differentially private data generation: when used with finite-dimensional features it allows us to summarize and privatize the data distribution once, which we can repeatedly use during generator training without further privacy loss. An important question in this framework is, then, what features are useful to distinguish between real and synthetic data distributions, and whether those enable us to generate quality synthetic data. This work considers the using the features of $\textit{neural tangent kernels (NTKs)}$, more precisely $\textit{empirical}$ NTKs (e-NTKs). We find that, perhaps surprisingly, the expressiveness of the untrained e-NTK features is comparable to that of the features taken from pre-trained perceptual features using public data. As a result, our method improves the privacy-accuracy trade-off compared to other state-of-the-art methods, without relying on any public data, as demonstrated on several tabular and image benchmark datasets.

CLDec 8, 2023Code
Seamless: Multilingual Expressive and Streaming Speech Translation

Seamless Communication, Loïc Barrault, Yu-An Chung et al. · meta-ai, stanford

Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication

CLMay 13, 2022
Sentiment Analysis of Covid-related Reddits

Yilin Yang, Tomas Fieg, Marina Sokolova

This paper focuses on Sentiment Analysis of Covid-19 related messages from the r/Canada and r/Unitedkingdom subreddits of Reddit. We apply manual annotation and three Machine Learning algorithms to analyze sentiments conveyed in those messages. We use VADER and TextBlob to label messages for Machine Learning experiments. Our results show that removal of shortest and longest messages improves VADER and TextBlob agreement on positive sentiments and F-score of sentiment classification by all the three algorithms

CVFeb 11
HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images

Yilin Yang, Zhenghui Guo, Yuke Wang et al.

Large Vision-Language Models (VLMs) have achieved remarkable success across diverse multimodal tasks but remain vulnerable to hallucinations rooted in inherent language bias. Despite recent progress, existing hallucination mitigation methods often overlook the underlying hallucination patterns driven by language bias. In this work, we design a novel pipeline to accurately synthesize Hallucination-Inducing Images (HIIs). Using synthesized HIIs, we reveal a consistent scene-conditioned hallucination pattern: models tend to mention objects that are highly typical of the scene even when visual evidence is removed. To quantify the susceptibility of VLMs to this hallucination pattern, we establish the Masked-Object-Hallucination (MOH) benchmark to rigorously evaluate existing state-of-the-art alignment frameworks. Finally, we leverage HIIs to construct high-quality preference datasets for fine-grained alignment. Experimental results demonstrate that our approach effectively mitigates hallucinations while preserving general model capabilities. Specifically, our method achieves up to a 38% improvement over the current state-of-the-art on standard hallucination benchmarks.

CLAug 11, 2024
Language-Informed Beam Search Decoding for Multilingual Machine Translation

Yilin Yang, Stefan Lee, Prasad Tadepalli

Beam search decoding is the de-facto method for decoding auto-regressive Neural Machine Translation (NMT) models, including multilingual NMT where the target language is specified as an input. However, decoding multilingual NMT models commonly produces ``off-target'' translations -- yielding translation outputs not in the intended language. In this paper, we first conduct an error analysis of off-target translations for a strong multilingual NMT model and identify how these decodings are produced during beam search. We then propose Language-informed Beam Search (LiBS), a general decoding algorithm incorporating an off-the-shelf Language Identification (LiD) model into beam search decoding to reduce off-target translations. LiBS is an inference-time procedure that is NMT-model agnostic and does not require any additional parallel data. Results show that our proposed LiBS algorithm on average improves +1.1 BLEU and +0.9 BLEU on WMT and OPUS datasets, and reduces off-target rates from 22.9\% to 7.7\% and 65.8\% to 25.3\% respectively.

CVJun 27, 2025
Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero et al.

Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. We present a variant with speech from an LLM model and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, as well as generating more semantically-relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, which are demonstrating the potential for more intuitive and responsive human-AI interactions.

CLMar 19, 2024
MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Yifan Peng, Ilia Kulikov, Yilin Yang et al.

There have been emerging research interest and advances in speech-to-speech translation (S2ST), translating utterances from one language to another. This work proposes Multitask Speech Language Model (MSLM), which is a decoder-only speech language model trained in a multitask setting. Without reliance on text training data, our model is able to support multilingual S2ST with speaker style preserved.

CLMar 19, 2024
An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis

Yifan Peng, Ilia Kulikov, Yilin Yang et al.

Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning. A typical speech LM takes discrete semantic units as content and a short utterance as prompt, and synthesizes speech which preserves the content's semantics but mimics the prompt's style. However, there is no systematic understanding on how the synthesized audio is controlled by the prompt and content. In this work, we conduct an empirical study of the widely used autoregressive (AR) and non-autoregressive (NAR) speech LMs and provide insights into the prompt design and content semantic units. Our analysis reveals that heterogeneous and nonstationary prompts hurt the audio quality in contrast to the previous finding that longer prompts always lead to better synthesis. Moreover, we find that the speaker style of the synthesized audio is also affected by the content in addition to the prompt. We further show that semantic units carry rich acoustic information such as pitch, tempo, volume and speech emphasis, which might be leaked from the content to the synthesized audio.

CLSep 10, 2021
Improving Multilingual Translation by Representation and Gradient Regularization

Yilin Yang, Akiko Eriguchi, Alexandre Muzio et al.

Multilingual Neural Machine Translation (NMT) enables one model to serve all translation directions, including ones that are unseen during training, i.e. zero-shot translation. Despite being theoretically attractive, current models often produce low quality translations -- commonly failing to even produce outputs in the right target language. In this work, we observe that off-target translation is dominant even in strong multilingual systems, trained on massive multilingual corpora. To address this issue, we propose a joint approach to regularize NMT models at both representation-level and gradient-level. At the representation level, we leverage an auxiliary target language prediction task to regularize decoder outputs to retain information about the target language. At the gradient level, we leverage a small amount of direct data (in thousands of sentence pairs) to regularize model gradients. Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance by +5.59 and +10.38 BLEU on WMT and OPUS datasets respectively. Moreover, experiments show that our method also works well when the small amount of direct data is not available.

CLOct 6, 2020
On the Sub-Layer Functionalities of Transformer Decoder

Yilin Yang, Longyue Wang, Shuming Shi et al.

There have been significant efforts to interpret the encoder of Transformer-based encoder-decoder architectures for neural machine translation (NMT); meanwhile, the decoder remains largely unexamined despite its critical role. During translation, the decoder must predict output tokens by considering both the source-language text from the encoder and the target-language prefix produced in previous steps. In this work, we study how Transformer-based decoders leverage information from the source and target languages -- developing a universal probe task to assess how information is propagated through each module of each decoder layer. We perform extensive experiments on three major translation datasets (WMT En-De, En-Fr, and En-Zh). Our analysis provides insight on when and where decoders leverage different sources. Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance -- a significant reduction in computation and number of parameters, and consequently a significant boost to both training and inference speed.

LGMay 6, 2020
Towards Frequency-Based Explanation for Robust CNN

Zifan Wang, Yilin Yang, Ankit Shrivastava et al.

Current explanation techniques towards a transparent Convolutional Neural Network (CNN) mainly focuses on building connections between the human-understandable input features with models' prediction, overlooking an alternative representation of the input, the frequency components decomposition. In this work, we present an analysis of the connection between the distribution of frequency components in the input dataset and the reasoning process the model learns from the data. We further provide quantification analysis about the contribution of different frequency components toward the model's prediction. We show that the vulnerability of the model against tiny distortions is a result of the model is relying on the high-frequency features, the target features of the adversarial (black and white-box) attackers, to make the prediction. We further show that if the model develops stronger association between the low-frequency component with true labels, the model is more robust, which is the explanation of why adversarially trained models are more robust against tiny distortions.

HCMar 20, 2020
EchoLock: Towards Low Effort Mobile User Identification

Yilin Yang, Chen Wang, Yingying Chen et al.

User identification plays a pivotal role in how we interact with our mobile devices. Many existing authentication approaches require active input from the user or specialized sensing hardware, and studies on mobile device usage show significant interest in less inconvenient procedures. In this paper, we propose EchoLock, a low effort identification scheme that validates the user by sensing hand geometry via commodity microphones and speakers. These acoustic signals produce distinct structure-borne sound reflections when contacting the user's hand, which can be used to differentiate between different people based on how they hold their mobile devices. We process these reflections to derive unique acoustic features in both the time and frequency domain, which can effectively represent physiological and behavioral traits, such as hand contours, finger sizes, holding strength, and gesture. Furthermore, learning-based algorithms are developed to robustly identify the user under various environments and conditions. We conduct extensive experiments with 20 participants using different hardware setups in key use case scenarios and study various attack models to demonstrate the performance of our proposed system. Our results show that EchoLock is capable of verifying users with over 90% accuracy, without requiring any active input from the user.

SEOct 8, 2019
Software Engineering Practice in the Development of Deep Learning Applications

Xufan Zhang, Yilin Yang, Yang Feng et al.

Deep-Learning(DL) applications have been widely employed to assist in various tasks. They are constructed based on a data-driven programming paradigm that is different from conventional software applications. Given the increasing popularity and importance of DL applications, software engineering practitioners have some techniques specifically for them. However, little research is conducted to identify the challenges and lacks in practice. To fill this gap, in this paper, we surveyed 195 practitioners to understand their insight and experience in the software engineering practice of DL applications. Specifically, we asked the respondents to identify lacks and challenges in the practice of the development life cycle of DL applications. The results present 13 findings that provide us with a better understanding of software engineering practice of DL applications. Further, we distil these findings into 7 actionable recommendations for software engineering researchers and practitioners to improve the development of DL applications.

CLMay 23, 2019
Deep learning based mood tagging for Chinese song lyrics

Jie Wang, Yilin Yang

Nowadays, listening music has been and will always be an indispensable part of our daily life. In recent years, sentiment analysis of music has been widely used in the information retrieval systems, personalized recommendation systems and so on. Due to the development of deep learning, this paper commits to find an effective approach for mood tagging of Chinese song lyrics. To achieve this goal, both machine-learning and deep-learning models have been studied and compared. Eventually, a CNN-based model with pre-trained word embedding has been demonstrated to effectively extract the distribution of emotional features of Chinese lyrics, with at least 15 percentage points higher than traditional machine-learning methods (i.e. TF-IDF+SVM and LIWC+SVM), and 7 percentage points higher than other deep-learning models (i.e. RNN, LSTM). In this paper, more than 160,000 lyrics corpus has been leveraged for pre-training word embedding for mood tagging boost.

CLAug 31, 2018
Ensemble Sequence Level Training for Multimodal MT: OSU-Baidu WMT18 Multimodal Machine Translation System Report

Renjie Zheng, Yilin Yang, Mingbo Ma et al.

This paper describes multimodal machine translation systems developed jointly by Oregon State University and Baidu Research for WMT 2018 Shared Task on multimodal translation. In this paper, we introduce a simple approach to incorporate image information by feeding image features to the decoder side. We also explore different sequence level training methods including scheduled sampling and reinforcement learning which lead to substantial improvements. Our systems ensemble several models using different architectures and training methods and achieve the best performance for three subtasks: En-De and En-Cs in task 1 and (En+De+Fr)-Cs task 1B.

CLAug 28, 2018
Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation

Yilin Yang, Liang Huang, Mingbo Ma

Beam search is widely used in neural machine translation, and usually improves translation quality compared to greedy search. It has been widely observed that, however, beam sizes larger than 5 hurt translation quality. We explain why this happens, and propose several methods to address this problem. Furthermore, we discuss the optimal stopping criteria for these methods. Results show that our hyperparameter-free methods outperform the widely-used hyperparameter-free heuristic of length normalization by +2.0 BLEU, and achieve the best results among all methods on Chinese-to-English translation.