CLOct 6, 2022
ReAct: Synergizing Reasoning and Acting in Language ModelsShunyu Yao, Jeffrey Zhao, Dian Yu et al. · deepmind
While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io
CLSep 30, 2023
SLM: Bridge the thin gap between speech and text foundation modelsMingqiu Wang, Wei Han, Izhak Shafran et al. · deepmind
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserves their capabilities, and only trains a simple adapter with just 1\% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achieve strong performance on conventional tasks such as speech recognition (ASR) and speech translation (AST), but also introduces the novel capability of zero-shot instruction-following for more diverse tasks: given a speech input and a text instruction, SLM is able to perform unseen generation tasks including contextual biasing ASR using real-time context, dialog generation, speech continuation, and question answering, etc. Our approach demonstrates that the representational gap between pretrained speech and language models might be narrower than one would expect, and can be bridged by a simple adaptation mechanism. As a result, SLM is not only efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.
CLDec 20, 2022
AnyTOD: A Programmable Task-Oriented Dialog SystemJeffrey Zhao, Yuan Cao, Raghav Gupta et al. · deepmind
We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of handling unseen tasks without task-specific training. We view TOD as a program executed by a language model (LM), where program logic and ontology is provided by a designer as a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach. A neural LM keeps track of events occurring during a conversation and a symbolic program implementing the dialog policy is executed to recommend next actions AnyTOD should take. This approach drastically reduces data annotation and model training requirements, addressing the enduring challenge of rapidly adapting a TOD system to unseen tasks and domains. We demonstrate state-of-the-art results on STAR, ABCD and SGD benchmarks. We also demonstrate strong zero-shot transfer ability in low-resource settings, such as zero-shot on MultiWOZ. In addition, we release STARv2, an updated version of the STAR dataset with richer annotations, for benchmarking zero-shot end-to-end TOD models.
ASJun 8, 2023
Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech UnderstandingMingqiu Wang, Izhak Shafran, Hagen Soltau et al. · deepmind
Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misaligned between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into text token embedding space without speech information loss. Additionally, using a CTC-based blank-filtering, we can reduce the speech sequence length to that of text. In speech MultiWoz dataset (DSTC11 challenge), SLM largely improves the dialog state tracking (DST) performance (24.7% to 28.4% accuracy). Further to address errors on rare entities, we augment SLM with a Speech2Entity retriever, which uses speech to retrieve relevant entities, and then adds them to the original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), the DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with the dialog understanding task improves the ASR performance from 9.4% to 8.5% WER.
CLMay 9, 2022
Unsupervised Slot Schema Induction for Task-oriented DialogDian Yu, Mingqiu Wang, Yuan Cao et al. · deepmind
Carefully-designed schemas describing how to collect and annotate dialog corpora are a prerequisite towards building task-oriented dialog systems. In practical applications, manually designing schemas can be error-prone, laborious, iterative, and slow, especially when the schema is complicated. To alleviate this expensive and time consuming process, we propose an unsupervised approach for slot schema induction from unlabeled dialog corpora. Leveraging in-domain language models and unsupervised parsing structures, our data-driven approach extracts candidate slots without constraints, followed by coarse-to-fine clustering to induce slot types. We compare our method against several strong supervised baselines, and show significant performance improvement in slot schema induction on MultiWoz and SGD datasets. We also demonstrate the effectiveness of induced schemas on downstream applications including dialog state tracking and response generation.
AIDec 16, 2022
Speech Aware Dialog System Technology Challenge (DSTC11)Hagen Soltau, Izhak Shafran, Mingqiu Wang et al. · deepmind
Most research on task oriented dialog modeling is based on written text input. However, users interact with practical dialog systems often using speech as input. Typically, systems convert speech into text using an Automatic Speech Recognition (ASR) system, introducing errors. Furthermore, these systems do not address the differences in written and spoken language. The research on this topic is stymied by the lack of a public corpus. Motivated by these considerations, our goal in hosting the speech-aware dialog state tracking challenge was to create a public corpus or task which can be used to investigate the performance gap between the written and spoken forms of input, develop models that could alleviate this gap, and establish whether Text-to-Speech-based (TTS) systems is a reasonable surrogate to the more-labor intensive human data collection. We created three spoken versions of the popular written-domain MultiWoz task -- (a) TTS-Verbatim: written user inputs were converted into speech waveforms using a TTS system, (b) Human-Verbatim: humans spoke the user inputs verbatim, and (c) Human-paraphrased: humans paraphrased the user inputs. Additionally, we provided different forms of ASR output to encourage wider participation from teams that may not have access to state-of-the-art ASR systems. These included ASR transcripts, word time stamps, and latent representations of the audio (audio encoder outputs). In this paper, we describe the corpus, report results from participating teams, provide preliminary analyses of their results, and summarize the current state-of-the-art in this domain.
LGFeb 24, 2023
MUX-PLMs: Data Multiplexing for High-throughput Language ModelsVishvak Murahari, Ameet Deshpande, Carlos E. Jimenez et al. · deepmind, princeton
The widespread adoption of large language models such as ChatGPT and Bard has led to unprecedented demand for these technologies. The burgeoning cost of inference for ever-increasing model sizes coupled with hardware shortages has limited affordable access and poses a pressing need for efficiency approaches geared towards high throughput and performance. Multi-input multi-output (MIMO) algorithms such as data multiplexing, offer a promising solution with a many-fold increase in throughput by performing inference for multiple inputs at the cost of a single input. Yet these approaches are not currently performant enough to be deployed in modern systems. We change that by developing MUX-PLMs, a class of high throughput pre-trained language models (PLMs) trained with data multiplexing, that can be fine-tuned for any downstream task to yield high-throughput high-performance. Our novel multiplexing and demultiplexing modules proficiently entangle and disentangle inputs, and enable high-performance high throughput \muxplms{} that are competitive with vanilla PLMs while achieving 2x/5x inference speedup with only a $1-4\%$ drop on a broad suite of tasks.
CLOct 13, 2022
Knowledge-grounded Dialog State TrackingDian Yu, Mingqiu Wang, Yuan Cao et al. · deepmind
Knowledge (including structured knowledge such as schema and ontology, and unstructured knowledge such as web corpus) is a critical part of dialog understanding, especially for unseen tasks and domains. Traditionally, such domain-specific knowledge is encoded implicitly into model parameters for the execution of downstream tasks, which makes training inefficient. In addition, such models are not easily transferable to new tasks with different schemas. In this work, we propose to perform dialog state tracking grounded on knowledge encoded externally. We query relevant knowledge of various forms based on the dialog context where such information can ground the prediction of dialog states. We demonstrate superior performance of our proposed method over strong baselines, especially in the few-shot learning setting.
ASOct 16, 2023
Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that Leverages a Universal Speech ModelHagen Soltau, Izhak Shafran, Alex Ottenwess et al.
We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders. We combine this classifier with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings. Our model compresses long sequences into a small set of class-specific latent representations and a factorized projection is used to predict different attributes of the disordered input speech. The benefit of our approach is that it allows us to model different regions of the input for different classes and is at the same time data efficient. We evaluated the proposed model extensively on a curated corpus from the Mayo Clinic. Our model outperforms standard transformer (80.9%) and perceiver (81.8%) models and achieves an average accuracy of 83.1%. With limited task-specific data, we find that pretraining is important and surprisingly pretraining with the unrelated automatic speech recognition (ASR) task is also beneficial. Encodings from the middle layers provide a mix of both acoustic and phonetic information and achieve best prediction results compared to just using the final layer encodings (83.1% vs. 79.6%). The results are promising and with further refinements may help clinicians detect speech abnormalities without needing access to highly specialized speech-language pathologists.
CLMar 8, 2024
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of contextGemini Team, Petko Georgiev, Ving Ian Lei et al. · deepmind, mila
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic CapabilitiesGheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
CLMay 22, 2024Code
Knowledge Graph Reasoning with Self-supervised Reinforcement LearningYing Ma, Owen Burns, Mingqiu Wang et al.
Reinforcement learning (RL) is an effective method of finding reasoning pathways in incomplete knowledge graphs (KGs). To overcome the challenges of a large action space, a self-supervised pre-training method is proposed to warm up the policy network before the RL training stage. To alleviate the distributional mismatch issue in general self-supervised RL (SSRL), in our supervised learning (SL) stage, the agent selects actions based on the policy network and learns from generated labels; this self-generation of labels is the intuition behind the name self-supervised. With this training framework, the information density of our SL objective is increased and the agent is prevented from getting stuck with the early rewarded paths. Our self-supervised RL (SSRL) method improves the performance of RL by pairing it with the wide coverage achieved by SL during pretraining, since the breadth of the SL objective makes it infeasible to train an agent with that alone. We show that our SSRL model meets or exceeds current state-of-the-art results on all Hits@k and mean reciprocal rank (MRR) metrics on four large benchmark KG datasets. This SSRL method can be used as a plug-in for any RL architecture for a KGR task. We adopt two RL architectures, i.e., MINERVA and MultiHopKG as our baseline RL models and experimentally show that our SSRL model consistently outperforms both baselines on all of these four KG reasoning tasks. Full code for the paper available at https://github.com/owenonline/Knowledge-Graph-Reasoning-with-Self-supervised-Reinforcement-Learning.
CLMay 17, 2023Code
Tree of Thoughts: Deliberate Problem Solving with Large Language ModelsShunyu Yao, Dian Yu, Jeffrey Zhao et al.
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
CLFeb 2, 2024
Retrieval Augmented End-to-End Spoken Dialog ModelsMingqiu Wang, Izhak Shafran, Hagen Soltau et al. · deepmind
We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal. Task-oriented dialogs often contain domain-specific entities, i.e., restaurants, hotels, train stations, and city names, which are difficult to recognize, however, critical for the downstream applications. Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval augmented SLM (ReSLM) that overcomes this weakness. We first train a speech retriever to retrieve text entities mentioned in the audio. The retrieved entities are then added as text inputs to the underlying SLM to bias model predictions. We evaluated ReSLM on speech MultiWoz task (DSTC-11 challenge), and found that this retrieval augmentation boosts model performance, achieving joint goal accuracy (38.6% vs 32.7%), slot error rate (20.6% vs 24.8%) and ASR word error rate (5.5% vs 6.7%). While demonstrated on dialog state tracking, our approach is broadly applicable to other speech tasks requiring contextual information or domain-specific entities, such as contextual ASR with biasing capability.
CLDec 19, 2023
Gemini: A Family of Highly Capable Multimodal ModelsGemini Team, Rohan Anil, Sebastian Borgeaud et al.
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
CLFeb 8, 2022
RNN Transducers for Nested Named Entity Recognition with constraints on alignment for long sequencesHagen Soltau, Izhak Shafran, Mingqiu Wang et al.
Popular solutions to Named Entity Recognition (NER) include conditional random fields, sequence-to-sequence models, or utilizing the question-answering framework. However, they are not suitable for nested and overlapping spans with large ontologies and for predicting the position of the entities. To fill this gap, we introduce a new model for NER task -- an RNN transducer (RNN-T). These models are trained using paired input and output sequences without explicitly specifying the alignment between them, similar to other seq-to-seq models. RNN-T models learn the alignment using a loss function that sums over all alignments. In NER tasks, however, the alignment between words and target labels are available from the human annotations. We propose a fixed alignment RNN-T model that utilizes the given alignment, while preserving the benefits of RNN-Ts such as modeling output dependencies. As a more general case, we also propose a constrained alignment model where users can specify a relaxation of the given input alignment and the model will learn an alignment within the given constraints. In other words, we propose a family of seq-to-seq models which can leverage alignments between input and target sequences when available. Through empirical experiments on a challenging real-world medical NER task with multiple nested ontologies, we demonstrate that our fixed alignment model outperforms the standard RNN-T model, improving F1-score from 0.70 to 0.74.
CLJan 21, 2022
Description-Driven Task-Oriented Dialog ModelingJeffrey Zhao, Raghav Gupta, Yuan Cao et al.
Task-oriented dialogue (TOD) systems are required to identify key information from conversations for the completion of given tasks. Such information is conventionally specified in terms of intents and slots contained in task-specific ontology or schemata. Since these schemata are designed by system developers, the naming convention for slots and intents is not uniform across tasks, and may not convey their semantics effectively. This can lead to models memorizing arbitrary patterns in data, resulting in suboptimal performance and generalization. In this paper, we propose that schemata should be modified by replacing names or notations entirely with natural language descriptions. We show that a language description-driven system exhibits better understanding of task specifications, higher performance on state tracking, improved data efficiency, and effective zero-shot transfer to unseen tasks. Following this paradigm, we present a simple yet effective Description-Driven Dialog State Tracking (D3ST) model, which relies purely on schema descriptions and an "index-picking" mechanism. We demonstrate the superiority in quality, data efficiency and robustness of our approach as measured on the MultiWOZ (Budzianowski et al.,2018), SGD (Rastogi et al., 2020), and the recent SGD-X (Lee et al., 2021) benchmarks.
CLSep 28, 2021
Word-level confidence estimation for RNN transducersMingqiu Wang, Hagen Soltau, Laurent El Shafey et al.
Confidence estimate is an often requested feature in applications such as medical transcription where errors can impact patient care and the confidence estimate could be used to alert medical professionals to verify potential errors in recognition. In this paper, we present a lightweight neural confidence model tailored for Automatic Speech Recognition (ASR) system with Recurrent Neural Network Transducers (RNN-T). Compared to other existing approaches, our model utilizes: (a) the time information associated with recognized words, which reduces the computational complexity, and (b) a simple and elegant trick for mapping between sub-word and word sequences. The mapping addresses the non-unique tokenization and token deletion problems while amplifying differences between confusable words. Through extensive empirical evaluations on two different long-form test sets, we demonstrate that the model achieves a performance of 0.4 Normalized Cross Entropy (NCE) and 0.05 Expected Calibration Error (ECE). It is robust across different ASR configurations, including target types (graphemes vs. morphemes), traffic conditions (streaming vs. non-streaming), and encoder types. We further discuss the importance of evaluation metrics to reflect practical applications and highlight the need for further work in improving Area Under the Curve (AUC) for Negative Precision Rate (NPV) and True Negative Rate (TNR).
CLMay 10, 2021
R2D2: Relational Text Decoding with TransformersAryan Arbabi, Mingqiu Wang, Laurent El Shafey et al.
We propose a novel framework for modeling the interaction between graphical structures and the natural language text associated with their nodes and edges. Existing approaches typically fall into two categories. On group ignores the relational structure by converting them into linear sequences and then utilize the highly successful Seq2Seq models. The other side ignores the sequential nature of the text by representing them as fixed-dimensional vectors and apply graph neural networks. Both simplifications lead to information loss. Our proposed method utilizes both the graphical structure as well as the sequential nature of the texts. The input to our model is a set of text segments associated with the nodes and edges of the graph, which are then processed with a transformer encoder-decoder model, equipped with a self-attention mechanism that is aware of the graphical relations between the nodes containing the segments. This also allows us to use BERT-like models that are already trained on large amounts of text. While the proposed model has wide applications, we demonstrate its capabilities on data-to-text generation tasks. Our approach compares favorably against state-of-the-art methods in four tasks without tailoring the model architecture. We also provide an early demonstration in a novel practical application -- generating clinical notes from the medical entities mentioned during clinical visits.
LGApr 6, 2021
Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information ExtractionHagen Soltau, Mingqiu Wang, Izhak Shafran et al.
In this paper, we describe novel components for extracting clinically relevant information from medical conversations which will be available as Google APIs. We describe a transformer-based Recurrent Neural Network Transducer (RNN-T) model tailored for long-form audio, which can produce rich transcriptions including speaker segmentation, speaker role labeling, punctuation and capitalization. On a representative test set, we compare performance of RNN-T models with different encoders, units and streaming constraints. Our transformer-based streaming model performs at about 20% WER on the ASR task, 6% WDER on the diarization task, 43% SER on periods, 52% SER on commas, 43% SER on question marks and 30% SER on capitalization. Our recognizer is paired with a confidence model that utilizes both acoustic and lexical features from the recognizer. The model performs at about 0.37 NCE. Finally, we describe a RNN-T based tagging model. The performance of the model depends on the ontologies, with F-scores of 0.90 for medications, 0.76 for symptoms, 0.75 for conditions, 0.76 for diagnosis, and 0.61 for treatments. While there is still room for improvement, our results suggest that these models are sufficiently accurate for practical applications.
CRSep 2, 2020
Google COVID-19 Search Trends Symptoms Dataset: Anonymization Process Description (version 1.0)Shailesh Bavadekar, Andrew Dai, John Davis et al.
This report describes the aggregation and anonymization process applied to the initial version of COVID-19 Search Trends symptoms dataset (published at https://goo.gle/covid19symptomdataset on September 2, 2020), a publicly available dataset that shows aggregated, anonymized trends in Google searches for symptoms (and some related topics). The anonymization process is designed to protect the daily symptom search activity of every user with $\varepsilon$-differential privacy for $\varepsilon$ = 1.68.
CLMar 12, 2020
The Medical Scribe: Corpus Development and Model Performance AnalysesIzhak Shafran, Nan Du, Linh Tran et al.
There is a growing interest in creating tools to assist in clinical note generation using the audio of provider-patient encounters. Motivated by this goal and with the help of providers and medical scribes, we developed an annotation scheme to extract relevant clinical concepts. We used this annotation scheme to label a corpus of about 6k clinical encounters. This was used to train a state-of-the-art tagging model. We report ontologies, labeling results, model performances, and detailed analyses of the results. Our results show that the entities related to medications can be extracted with a relatively high accuracy of 0.90 F-score, followed by symptoms at 0.72 F-score, and conditions at 0.57 F-score. In our task, we not only identify where the symptoms are mentioned but also map them to canonical forms as they appear in the clinical notes. Of the different types of errors, in about 19-38% of the cases, we find that the model output was correct, and about 17-32% of the errors do not impact the clinical note. Taken together, the models developed in this work are more useful than the F-scores reflect, making it a promising approach for practical applications.
CLAug 30, 2019
Learning to Infer Entities, Properties and their Relations from Clinical ConversationsNan Du, Mingqiu Wang, Linh Tran et al.
Recently we proposed the Span Attribute Tagging (SAT) Model (Du et al., 2019) to infer clinical entities (e.g., symptoms) and their properties (e.g., duration). It tackles the challenge of large label space and limited training data using a hierarchical two-stage approach that identifies the span of interest in a tagging step and assigns labels to the span in a classification step. We extend the SAT model to jointly infer not only entities and their properties but also relations between them. Most relation extraction models restrict inferring relations between tokens within a few neighboring sentences, mainly to avoid high computational complexity. In contrast, our proposed Relation-SAT (R-SAT) model is computationally efficient and can infer relations over the entire conversation, spanning an average duration of 10 minutes. We evaluate our model on a corpus of clinical conversations. When the entities are given, the R-SAT outperforms baselines in identifying relations between symptoms and their properties by about 32% (0.82 vs 0.62 F-score) and by about 50% (0.60 vs 0.41 F-score) on medications and their properties. On the more difficult task of jointly inferring entities and relations, the R-SAT model achieves a performance of 0.34 and 0.45 for symptoms and medications respectively, which is significantly better than 0.18 and 0.35 for the baseline model. The contributions of different components of the model are quantified using ablation analysis.
CLJul 9, 2019
Joint Speech Recognition and Speaker Diarization via Sequence TransductionLaurent El Shafey, Hagen Soltau, Izhak Shafran
Speech applications dealing with conversations require not only recognizing the spoken words, but also determining who spoke when. The task of assigning words to speakers is typically addressed by merging the outputs of two separate systems, namely, an automatic speech recognition (ASR) system and a speaker diarization (SD) system. The two systems are trained independently with different objective functions. Often the SD systems operate directly on the acoustics and are not constrained to respect word boundaries and this deficiency is overcome in an ad hoc manner. Motivated by recent advances in sequence to sequence learning, we propose a novel approach to tackle the two tasks by a joint ASR and SD system using a recurrent neural network transducer. Our approach utilizes both linguistic and acoustic cues to infer speaker roles, as opposed to typical SD systems, which only use acoustic cues. We evaluated the performance of our approach on a large corpus of medical conversations between physicians and patients. Compared to a competitive conventional baseline, our approach improves word-level diarization error rate from 15.8% to 2.2%.
LGJun 5, 2019
Complex Evolution Recurrent Neural Networks (ceRNNs)Izhak Shafran, Tom Bagby, R. J. Skerry-Ryan
Unitary Evolution Recurrent Neural Networks (uRNNs) have three attractive properties: (a) the unitary property, (b) the complex-valued nature, and (c) their efficient linear operators. The literature so far does not address -- how critical is the unitary property of the model? Furthermore, uRNNs have not been evaluated on large tasks. To study these shortcomings, we propose the complex evolution Recurrent Neural Networks (ceRNNs), which is similar to uRNNs but drops the unitary property selectively. On a simple multivariate linear regression task, we illustrate that dropping the constraints improves the learning trajectory. In copy memory task, ceRNNs and uRNNs perform identically, demonstrating that their superior performance over LSTMs is due to complex-valued nature and their linear operators. In a large scale real-world speech recognition, we find that pre-pending a uRNN degrades the performance of our baseline LSTM acoustic models, while pre-pending a ceRNN improves the performance over the baseline by 0.8% absolute WER.
CLJun 5, 2019
Extracting Symptoms and their Status from Clinical ConversationsNan Du, Kai Chen, Anjuli Kannan et al.
This paper describes novel models tailored for a new application, that of extracting the symptoms mentioned in clinical conversations along with their status. Lack of any publicly available corpus in this privacy-sensitive domain led us to develop our own corpus, consisting of about 3K conversations annotated by professional medical scribes. We propose two novel deep learning approaches to infer the symptom names and their status: (1) a new hierarchical span-attribute tagging (\SAT) model, trained using curriculum learning, and (2) a variant of sequence-to-sequence model which decodes the symptoms and their status from a few speaker turns within a sliding window over the conversation. This task stems from a realistic application of assisting medical providers in capturing symptoms mentioned by patients from their clinical conversations. To reflect this application, we define multiple metrics. From inter-rater agreement, we find that the task is inherently difficult. We conduct comprehensive evaluations on several contrasting conditions and observe that the performance of the models range from an F-score of 0.5 to 0.8 depending on the condition. Our analysis not only reveals the inherent challenges of the task, but also provides useful directions to improve the models.
CLMar 17, 2019
Audio De-identification: A New Entity Recognition TaskIdo Cohn, Itay Laish, Genady Beryozkin et al.
Named Entity Recognition (NER) has been mostly studied in the context of written text. Specifically, NER is an important step in de-identification (de-ID) of medical records, many of which are recorded conversations between a patient and a doctor. In such recordings, audio spans with personal information should be redacted, similar to the redaction of sensitive character spans in de-ID for written text. The application of NER in the context of audio de-identification has yet to be fully investigated. To this end, we define the task of audio de-ID, in which audio spans with entity mentions should be detected. We then present our pipeline for this task, which involves Automatic Speech Recognition (ASR), NER on the transcript text, and text-to-audio alignment. Finally, we introduce a novel metric for audio de-ID and a new evaluation benchmark consisting of a large labeled segment of the Switchboard and Fisher audio datasets and detail our pipeline's results on it.
CVNov 18, 2016
Fuzzy Statistical Matrices for Cell ClassificationGuillaume Thibault, Izhak Shafran
In this paper, we generalize image (texture) statistical descriptors and propose algorithms that improve their efficacy. Recently, a new method showed how the popular Co-Occurrence Matrix (COM) can be modified into a fuzzy version (FCOM) which is more effective and robust to noise. Here, we introduce new fuzzy versions of two additional higher order statistical matrices: the Run Length Matrix (RLM) and the Size Zone Matrix (SZM). We define the fuzzy zones and propose an efficient algorithm to compute the descriptors. We demonstrate the advantage of the proposed improvements over several state-of-the-art methods on three tasks from quantitative cell biology: analyzing and classifying Human Epithelial type 2 (HEp-2) cells using Indirect Immunofluorescence protocol (IFF).