AIMar 17, 2025
The Amazon Nova Family of Models: Technical Report and Model CardAmazon AGI, Aaron Langford, Aayush Shah et al. · amazon-science
We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.
CLJun 15, 2022
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding SystemsJack FitzGerald, Shankar Ananthakrishnan, Konstantine Arkoudas et al. · amazon-science, gatech
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistillBERT (42M params) by 4.23% to 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
CLAug 2, 2022
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq ModelSaleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald et al. · amazon-science, gatech
In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show in zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.
SYSep 20, 2017
REACT to Cyber Attacks on Power GridsSaleh Soltan, Mihalis Yannakakis, Gil Zussman
Motivated by the recent cyber attack on the Ukrainian power grid, we study cyber attacks on power grids that affect both the physical infrastructure and the data at the control center. In particular, we assume that an adversary attacks an area by: (i) remotely disconnecting some lines within the attacked area, and (ii) modifying the information received from the attacked area to mask the line failures and hide the attacked area from the control center. For the latter, we consider two types of attacks: (i) data distortion: which distorts the data by adding powerful noise to the actual data, and (ii) data replay: which replays a locally consistent old data instead of the actual data. We use the DC power flow model and prove that the problem of finding the set of line failures given the phase angles of the nodes outside of the attacked area is strongly NP-hard, even when the attacked area is known. However, we introduce the polynomial time REcurrent Attack Containment and deTection (REACT) Algorithm to approximately detect the attacked area and line failures after a cyber attack. We numerically show that it performs very well in detecting the attacked area, and detecting single, double, and triple line failures in small and large attacked areas.
CLOct 13, 2022
CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic ParsingAndy Rosenbaum, Saleh Soltan, Wael Hamza et al. · amazon-science
A bottleneck to developing Semantic Parsing (SP) models is the need for a large volume of human-labeled training data. Given the complexity and cost of human annotation for SP, labeled data is often scarce, particularly in multilingual settings. Large Language Models (LLMs) excel at SP given only a few examples, however LLMs are unsuitable for runtime systems which require low latency. In this work, we propose CLASP, a simple method to improve low-resource SP for moderate-sized models: we generate synthetic data from AlexaTM 20B to augment the training set for a model 40x smaller (500M parameters). We evaluate on two datasets in low-resource settings: English PIZZA, containing either 348 or 16 real examples, and mTOP cross-lingual zero-shot, where training data is available only in English, and the model must generalize to four new languages. On both datasets, we show significant improvements over strong baseline methods.
CLSep 20, 2022
LINGUIST: Language Model Instruction Tuning to Generate Annotated Utterances for Intent Classification and Slot TaggingAndy Rosenbaum, Saleh Soltan, Wael Hamza et al. · amazon-science
We present LINGUIST, a method for generating annotated data for Intent Classification and Slot Tagging (IC+ST), via fine-tuning AlexaTM 5B, a 5-billion-parameter multilingual sequence-to-sequence (seq2seq) model, on a flexible instruction prompt. In a 10-shot novel intent setting for the SNIPS dataset, LINGUIST surpasses state-of-the-art approaches (Back-Translation and Example Extrapolation) by a wide margin, showing absolute improvement for the target intents of +1.9 points on IC Recall and +2.5 points on ST F1 Score. In the zero-shot cross-lingual setting of the mATIS++ dataset, LINGUIST out-performs a strong baseline of Machine Translation with Slot Alignment by +4.14 points absolute on ST F1 Score across 6 languages, while matching performance on IC. Finally, we verify our results on an internal large-scale multilingual dataset for conversational agent IC+ST and show significant improvements over a baseline which uses Back-Translation, Paraphrasing and Slot Catalog Resampling. To our knowledge, we are the first to demonstrate instruction fine-tuning of a large-scale seq2seq model to control the outputs of multilingual intent- and slot-labeled data generation.
SYAug 11, 2018
Protecting the Grid against IoT Botnets of High-Wattage DevicesSaleh Soltan, Prateek Mittal, H. Vincent Poor
We provide methods to prevent line failures in the power grid caused by a newly revealed MAnipulation of Demand (MAD) attacks via an IoT botnet of high-wattage devices. In particular, we develop two algorithms named Securing Additional margin For generators in Economic dispatch (SAFE) Algorithm and Iteratively MiniMize and boUNd Economic dispatch (IMMUNE) Algorithm for finding robust operating points for generators during the economic dispatch such that no lines are overloaded after automatic primary control response to any MAD attacks. In situations that the operating cost of the grid in a robust state is costly (or no robust operating points exist), we provide efficient methods to verify--in advance--if possible line overloads can be cleared during the secondary control after any MAD attacks. We then define the $αD$-robustness notion for the grids indicating that any line failures can be cleared during the secondary control if an adversary can increase/decrease the demands by $α$ fraction. We demonstrate that practical upper and lower bounds on the maximum $α$ for which the grid is $αD$-robust can be found efficiently in polynomial time. Finally, we evaluate the performance of the developed algorithms and methods on realistic power grid test cases. Our work provides the first methods for protecting the grid against potential line failures caused by MAD attacks.
SYSep 21, 2017
EXPOSE the Line Failures following a Cyber-Physical Attack on the Power GridSaleh Soltan, Gil Zussman
Recent attacks on power grids demonstrated the vulnerability of the grids to cyber and physical attacks. To analyze this vulnerability, we study cyber-physical attacks that affect both the power grid physical infrastructure and its underlying Supervisory Control And Data Acquisition (SCADA) system. We assume that an adversary attacks an area by: (i) disconnecting some lines within that area, and (ii) obstructing the information (e.g., status of the lines and voltage measurements) from within the area to reach the control center. We leverage the algebraic properties of the AC power flows to introduce the efficient EXPOSE Algorithm for detecting line failures and recovering voltages inside that attacked area after such an attack. The EXPOSE Algorithm outperforms the state-of-the-art algorithm for detecting line failures using partial information under the AC power flow model in terms of scalability and accuracy. The main advantages of the EXPOSE Algorithm are that its running time is independent of the size of the grid and number of line failures, and that it provides accurate information recovery under some conditions on the attacked area. Moreover, it approximately recovers the information and provides the confidence of the solution when these conditions do not hold.
CLJun 14, 2023
Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq ModelsSaleh Soltan, Andy Rosenbaum, Tobias Falke et al. · amazon-science
Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have advantages, however training both model types from scratch is computationally expensive. We explore recipes to improve pre-training efficiency by initializing one model from the other. (1) Extracting the encoder from a seq2seq model, we show it under-performs a Masked Language Modeling (MLM) encoder, particularly on sequence labeling tasks. Variations of masking during seq2seq training, reducing the decoder size, and continuing with a small amount of MLM training do not close the gap. (2) Conversely, using an encoder to warm-start seq2seq training, we show that by unfreezing the encoder partway through training, we can match task performance of a from-scratch seq2seq model. Overall, this two-stage approach is an efficient recipe to obtain both a multilingual encoder and a seq2seq model, matching the performance of training each model from scratch while reducing the total compute cost by 27%.
CLApr 14, 2024
GeMQuAD : Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot LearningAmani Namboori, Shivam Mangale, Andy Rosenbaum et al.
The emergence of Large Language Models (LLMs) with capabilities like In-Context Learning (ICL) has ushered in new possibilities for data generation across various domains while minimizing the need for extensive data collection and modeling techniques. Researchers have explored ways to use this generated synthetic data to optimize smaller student models for reduced deployment costs and lower latency in downstream tasks. However, ICL-generated data often suffers from low quality as the task specificity is limited with few examples used in ICL. In this paper, we propose GeMQuAD - a semi-supervised learning approach, extending the WeakDAP framework, applied to a dataset generated through ICL with just one example in the target language using AlexaTM 20B Seq2Seq LLM. Through our approach, we iteratively identify high-quality data to enhance model performance, especially for low-resource multilingual setting in the context of Extractive Question Answering task. Our framework outperforms the machine translation-augmented model by 0.22/1.68 F1/EM (Exact Match) points for Hindi and 0.82/1.37 F1/EM points for Spanish on the MLQA dataset, and it surpasses the performance of model trained on an English-only dataset by 5.05/6.50 F1/EM points for Hindi and 3.81/3.69 points F1/EM for Spanish on the same dataset. Notably, our approach uses a pre-trained LLM for generation with no fine-tuning (FT), utilizing just a single annotated example in ICL to generate data, providing a cost-effective development process.
CLOct 8, 2020
Don't Parse, Insert: Multilingual Semantic Parsing with Insertion Based DecodingQile Zhu, Haidar Khan, Saleh Soltan et al.
Semantic parsing is one of the key components of natural language understanding systems. A successful parse transforms an input utterance to an action that is easily understood by the system. Many algorithms have been proposed to solve this problem, from conventional rulebased or statistical slot-filling systems to shiftreduce based neural parsers. For complex parsing tasks, the state-of-the-art method is based on autoregressive sequence to sequence models to generate the parse directly. This model is slow at inference time, generating parses in O(n) decoding steps (n is the length of the target sequence). In addition, we demonstrate that this method performs poorly in zero-shot cross-lingual transfer learning settings. In this paper, we propose a non-autoregressive parser which is based on the insertion transformer to overcome these two issues. Our approach 1) speeds up decoding by 3x while outperforming the autoregressive model and 2) significantly improves cross-lingual transfer in the low-resource setting by 37% compared to autoregressive baseline. We test our approach on three well-known monolingual datasets: ATIS, SNIPS and TOP. For cross lingual semantic parsing, we use the MultiATIS++ and the multilingual TOP datasets.
SYAug 18, 2015
Generation of Synthetic Spatially Embedded Power Grid NetworksSaleh Soltan, Gil Zussman
The development of algorithms for enhancing the resilience and efficiency of the power grid requires performance evaluation with real topologies of power transmission networks. However, due to security reasons, such topologies and particularly the locations of the substations and the lines are usually not publicly available. Therefore, we study the structural properties of the North American grids and present an algorithm for generating synthetic spatially embedded networks with similar properties to a given grid. The algorithm uses the Gaussian Mixture Model (GMM) for density estimation of the node positions and generates a set of nodes with similar spatial distribution to the nodes in a given network. Then, it uses two procedures, which are inspired by the historical evolution of the grids, to connect the nodes. The algorithm has several tunable parameters that allow generating grids similar to any given grid. Particularly, we apply it to the Western Interconnection (WI) and to grids that operate under the SERC Reliability Corporation (SERC) and the Florida Reliability Coordinating Council (FRCC), and show that it generates grids with similar structural and spatial properties to these grids. To the best of our knowledge, this is the first attempt to consider the spatial distribution of the nodes and lines and its importance in generating synthetic power grids.