Liyang Lu

CL
h-index4
6papers
466citations
Novelty63%
AI Score37

6 Papers

CLAug 21, 2022
Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization

Pengcheng He, Baolin Peng, Liyang Lu et al. · microsoft-research

This paper presents Z-Code++, a new pre-trained language model optimized for abstractive text summarization. The model extends the state of the art encoder-decoder model using three techniques. First, we use a two-phase pre-training process to improve model's performance on low-resource summarization tasks. The model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation. Second, we replace self-attention layers in the encoder with disentangled attention layers, where each word is represented using two vectors that encode its content and position, respectively. Third, we use fusion-in-encoder, a simple yet effective method of encoding long sequences in a hierarchical manner. Z-Code++ creates new state of the art on 9 out of 13 text summarization tasks across 5 languages. Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum. In zero-shot and few-shot settings, our model substantially outperforms the competing models.

LGMay 3, 2022
i-Code: An Integrative and Composable Multimodal Learning Framework

Ziyi Yang, Yuwei Fang, Chenguang Zhu et al. · gatech, stanford

Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.

CLSep 22, 2021Code
Scalable and Efficient MoE Training for Multitask Multilingual Models

Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio et al.

The Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models that have sublinear compute costs with respect to their parameters. In contrast with dense models, the sparse architecture of MoE offers opportunities for drastically growing model size with significant accuracy gain while consuming much lower compute budget. However, supporting large scale MoE training also has its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve inference time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation which results in a great improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support of efficient MoE training has been implemented and open-sourced with the DeepSpeed library.

CLApr 9, 2020Code
Improving Readability for Automatic Speech Recognition Transcription

Junwei Liao, Sefik Emre Eskimez, Liyang Lu et al.

Modern Automatic Speech Recognition (ASR) systems can achieve high performance in terms of recognition accuracy. However, a perfectly accurate transcript still can be challenging to read due to grammatical errors, disfluency, and other errata common in spoken communication. Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and ASR system alike will be propagated to the next task in the pipeline. In this work, we propose a novel NLP task called ASR post-processing for readability (APR) that aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker. In addition, we describe a method to address the lack of task-specific data by synthesizing examples for the APR task using the datasets collected for Grammatical Error Correction (GEC) followed by text-to-speech (TTS) and ASR. Furthermore, we propose metrics borrowed from similar tasks to evaluate performance on the APR task. We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method. Our results suggest that finetuned models improve the performance on the APR task significantly, hinting at the potential benefits of using APR systems. We hope that the read, understand, and rewrite approach of our work can serve as a basis that many NLP tasks and human readers can benefit from.

SPJan 20, 2025
Collaborative Channel Access and Transmission for NR Sidelink and Wi-Fi Coexistence over Unlicensed Spectrum

Zhuangzhuang Yan, Xinyu Gu, Zhenyu Liu et al.

With the rapid development of various internet of things (IoT) applications, including industrial IoT (IIoT) and visual IoT (VIoT), the demand for direct device-to-device communication to support high data rates continues to grow. To address this demand, 5G-Advanced has introduced sidelink communication over the unlicensed spectrum (SL-U) to increase data rates. However, the primary challenge of SL-U in the unlicensed spectrum is ensuring fair coexistence with other incumbent systems, such as Wi-Fi. In this paper, we address the challenge by designing channel access mechanisms and power control strategies to mitigate interference and ensure fair coexistence. First, we propose a novel collaborative channel access (CCHA) mechanism that integrates channel access with resource allocation through collaborative interactions between base stations (BS) and SL-U users. This mechanism ensures fair coexistence with incumbent systems while improving resource utilization. Second, to further enhance the performance of the coexistence system, we develop a cooperative subgoal-based hierarchical deep reinforcement learning (C-GHDRL) algorithm framework. The framework enables SL-U users to make globally optimal decisions by leveraging cooperative operations between the BS and SL-U users, effectively overcoming the limitations of traditional optimization methods in solving joint optimization problems with nonlinear constraints. Finally, we mathematically model the joint channel access and power control problem and balance the trade-off between fairness and transmission rate in the coexistence system by defining a suitable reward function in the C-GHDRL algorithm. Simulation results demonstrate that the proposed scheme significantly enhances the performance of the coexistence system while ensuring fair coexistence between SL-U and Wi-Fi users.

CLFeb 22, 2021
Generating Human Readable Transcript for Automatic Speech Recognition with Pre-trained Language Model

Junwei Liao, Yu Shi, Ming Gong et al.

Modern Automatic Speech Recognition (ASR) systems can achieve high performance in terms of recognition accuracy. However, a perfectly accurate transcript still can be challenging to read due to disfluency, filter words, and other errata common in spoken communication. Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and ASR system alike will be propagated to the next task in the pipeline. In this work, we propose an ASR post-processing model that aims to transform the incorrect and noisy ASR output into a readable text for humans and downstream tasks. We leverage the Metadata Extraction (MDE) corpus to construct a task-specific dataset for our study. Since the dataset is small, we propose a novel data augmentation method and use a two-stage training strategy to fine-tune the RoBERTa pre-trained model. On the constructed test set, our model outperforms a production two-step pipeline-based post-processing method by a large margin of 13.26 on readability-aware WER (RA-WER) and 17.53 on BLEU metrics. Human evaluation also demonstrates that our method can generate more human-readable transcripts than the baseline method.