24.9CLApr 27
IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical ReasoningNavya Gupta, Rishitej Reddy Vyalla, Avinash Anand et al.
Curriculum learning helps language models tackle complex reasoning by gradually increasing task difficulty. However, it often fails to generate consistent step-by-step reasoning, especially in multilingual and low-resource settings where cross-lingual transfer from English to Indian languages remains limited. We propose IRIS: Interleaved Reinforcement with Incremental Staged Curriculum, a two-axis framework that combines Supervised Fine-Tuning on progressively harder problems (vertical axis) with Reverse Curriculum Reinforcement Learning to reduce reliance on step-by-step guidance (horizontal axis). We design a composite reward combining correctness, step-wise alignment, continuity, and numeric incentives, optimized via Group Relative Policy Optimization (GRPO). We release CL-Math, a dataset of 29k problems with step-level annotations in English, Hindi, and Marathi. Across standard benchmarks and curated multilingual test sets, IRIS consistently improves performance, with strong results on math reasoning tasks and substantial gains in low-resource and bilingual settings, alongside modest improvements in high-resource languages.
CVOct 26, 2021
ViDA-MAN: Visual Dialog with Digital HumansTong Shen, Jiawei Zuo, Fan Shi et al.
We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which offers realtime audio-visual responses to instant speech inquiries. Compared to traditional text or voice-based system, ViDA-MAN offers human-like interactions (e.g, vivid voice, natural facial expression and body gestures). Given a speech request, the demonstration is able to response with high quality videos in sub-second latency. To deliver immersive user experience, ViDA-MAN seamlessly integrates multi-modal techniques including Acoustic Speech Recognition (ASR), multi-turn dialog, Text To Speech (TTS), talking heads video generation. Backed with large knowledge base, ViDA-MAN is able to chat with users on a number of topics including chit-chat, weather, device control, News recommendations, booking hotels, as well as answering questions via structured knowledge.
ASOct 8, 2021
SCaLa: Supervised Contrastive Learning for End-to-End Speech RecognitionLi Fu, Xiaoxiao Li, Runyu Wang et al.
End-to-end Automatic Speech Recognition (ASR) models are usually trained to optimize the loss of the whole token sequence, while neglecting explicit phonemic-granularity supervision. This could result in recognition errors due to similar-phoneme confusion or phoneme reduction. To alleviate this problem, we propose a novel framework based on Supervised Contrastive Learning (SCaLa) to enhance phonemic representation learning for end-to-end ASR systems. Specifically, we extend the self-supervised Masked Contrastive Predictive Coding (MCPC) to a fully-supervised setting, where the supervision is applied in the following way. First, SCaLa masks variable-length encoder features according to phoneme boundaries given phoneme forced-alignment extracted from a pre-trained acoustic model; it then predicts the masked features via contrastive learning. The forced-alignment can provide phoneme labels to mitigate the noise introduced by positive-negative pairs in self-supervised MCPC. Experiments on reading and spontaneous speech datasets show that our proposed approach achieves 2.8 and 1.4 points Character Error Rate (CER) absolute reductions compared to the baseline, respectively.
ASNov 6, 2020
Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech SynthesisGuanghui Xu, Wei Song, Zhengchen Zhang et al.
Despite prosody is related to the linguistic information up to the discourse structure, most text-to-speech (TTS) systems only take into account that within each sentence, which makes it challenging when converting a paragraph of texts into natural and expressive speech. In this paper, we propose to use the text embeddings of the neighboring sentences to improve the prosody generation for each utterance of a paragraph in an end-to-end fashion without using any explicit prosody features. More specifically, cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pre-trained BERT model, are used to augment the input of the Tacotron2 decoder. Two types of BERT embeddings are investigated, which leads to the use of different CU encoder structures. Experimental results on a Mandarin audiobook dataset and the LJ-Speech English audiobook dataset demonstrate the use of CU information can improve the naturalness and expressiveness of the synthesized speech. Subjective listening testing shows most of the participants prefer the voice generated using the CU encoder over that generated using standard Tacotron2. It is also found that the prosody can be controlled indirectly by changing the neighbouring sentences.
ASMay 11, 2020
Incremental Learning for End-to-End Automatic Speech RecognitionLi Fu, Xiaoxiao Li, Libo Zi et al.
In this paper, we propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR) which enables an ASR system to perform well on new tasks while maintaining the performance on its originally learned ones. To mitigate catastrophic forgetting during incremental learning, we design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions. Our method works without access to the training data of original tasks, which addresses the cases where the previous data is no longer available or joint training is costly. Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting. Furthermore, in two practical scenarios, compared to the target-reference joint training method, the performance drop of our method is 0.02% Character Error Rate (CER), which is 97% smaller than the drops of the baseline methods.
CLDec 15, 2016
Transition-based Parsing with Context Enhancement and Future Reward RerankingFugen Zhou, Fuxiang Wu, Zhengchen Zhang et al.
This paper presents a novel reranking model, future reward reranking, to re-score the actions in a transition-based parser by using a global scorer. Different to conventional reranking parsing, the model searches for the best dependency tree in all feasible trees constraining by a sequence of actions to get the future reward of the sequence. The scorer is based on a first-order graph-based parser with bidirectional LSTM, which catches different parsing view compared with the transition-based parser. Besides, since context enhancement has shown substantial improvement in the arc-stand transition-based parsing over the parsing accuracy, we implement context enhancement on an arc-eager transition-base parser with stack LSTMs, the dynamic oracle and dropout supporting and achieve further improvement. With the global scorer and context enhancement, the results show that UAS of the parser increases as much as 1.20% for English and 1.66% for Chinese, and LAS increases as much as 1.32% for English and 1.63% for Chinese. Moreover, we get state-of-the-art LASs, achieving 87.58% for Chinese and 93.37% for English.
HCJul 8, 2012
Telerobotic Pointing Gestures Shape Human Spatial CognitionJohn-John Cabibihan, Wing-Chee So, Sujin Saj et al.
This paper aimed to explore whether human beings can understand gestures produced by telepresence robots. If it were the case, they can derive meaning conveyed in telerobotic gestures when processing spatial information. We conducted two experiments over Skype in the present study. Participants were presented with a robotic interface that had arms, which were teleoperated by an experimenter. The robot could point to virtual locations that represented certain entities. In Experiment 1, the experimenter described spatial locations of fictitious objects sequentially in two conditions: speech condition (SO, verbal descriptions clearly indicated the spatial layout) and speech and gesture condition (SR, verbal descriptions were ambiguous but accompanied by robotic pointing gestures). Participants were then asked to recall the objects' spatial locations. We found that the number of spatial locations recalled in the SR condition was on par with that in the SO condition, suggesting that telerobotic pointing gestures compensated ambiguous speech during the process of spatial information. In Experiment 2, the experimenter described spatial locations non-sequentially in the SR and SO conditions. Surprisingly, the number of spatial locations recalled in the SR condition was even higher than that in the SO condition, suggesting that telerobotic pointing gestures were more powerful than speech in conveying spatial information when information was presented in an unpredictable order. The findings provide evidence that human beings are able to comprehend telerobotic gestures, and importantly, integrate these gestures with co-occurring speech. This work promotes engaging remote collaboration among humans through a robot intermediary.