Shuai Tang

LG
h-index61
25papers
3,686citations
Novelty48%
AI Score42

25 Papers

AIMar 17, 2025
The Amazon Nova Family of Models: Technical Report and Model Card

Amazon AGI, Aaron Langford, Aayush Shah et al. · amazon-science

We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.

LGJul 7, 2023
Scalable Membership Inference Attacks via Quantile Regression

Martin Bertran, Shuai Tang, Michael Kearns et al. · amazon-science

Membership inference attacks are designed to determine, using black box access to trained models, whether a particular example was used in training or not. Membership inference can be formalized as a hypothesis testing problem. The most effective existing attacks estimate the distribution of some test statistic (usually the model's confidence on the true label) on points that were (and were not) used in training by training many \emph{shadow models} -- i.e. models of the same architecture as the model being attacked, trained on a random subsample of data. While effective, these attacks are extremely computationally expensive, especially when the model under attack is large. We introduce a new class of attacks based on performing quantile regression on the distribution of confidence scores induced by the model under attack on points that are not used in training. We show that our method is competitive with state-of-the-art shadow model attacks, while requiring substantially less compute because our attack requires training only a single model. Moreover, unlike shadow model attacks, our proposed attack does not require any knowledge of the architecture of the model under attack and is therefore truly ``black-box". We show the efficacy of this approach in an extensive series of experiments on various datasets and model architectures.

LGSep 15, 2022
Private Synthetic Data for Multitask Learning and Marginal Queries

Giuseppe Vietri, Cedric Archambeau, Sergul Aydore et al. · amazon-science

We provide a differentially private algorithm for producing synthetic data simultaneously useful for multiple tasks: marginal queries and multitask machine learning (ML). A key innovation in our algorithm is the ability to directly handle numerical features, in contrast to a number of related prior approaches which require numerical features to be first converted into {high cardinality} categorical features via {a binning strategy}. Higher binning granularity is required for better accuracy, but this negatively impacts scalability. Eliminating the need for binning allows us to produce synthetic data preserving large numbers of statistical queries such as marginals on numerical features, and class conditional linear threshold queries. Preserving the latter means that the fraction of points of each class label above a particular half-space is roughly the same in both the real and synthetic data. This is the property that is needed to train a linear classifier in a multitask setting. Our algorithm also allows us to produce high quality synthetic data for mixed marginal queries, that combine both categorical and numerical features. Our method consistently runs 2-5x faster than the best comparable techniques, and provides significant accuracy improvements in both marginal queries and linear prediction tasks for mixed-type datasets.

LGMar 6, 2023
Improved Differentially Private Regression via Gradient Boosting

Shuai Tang, Sergul Aydore, Michael Kearns et al. · amazon-science

We revisit the problem of differentially private squared error linear regression. We observe that existing state-of-the-art methods are sensitive to the choice of hyperparameters -- including the ``clipping threshold'' that cannot be set optimally in a data-independent way. We give a new algorithm for private linear regression based on gradient boosting. We show that our method consistently improves over the previous state of the art when the clipping threshold is taken to be fixed without knowledge of the data, rather than optimized in a non-private way -- and that even when we optimize the hyperparameters of competitor algorithms non-privately, our algorithm is no worse and often better. In addition to a comprehensive set of experiments, we give theoretical insights to explain this behavior.

CVAug 13, 2024
Imagen 3

Imagen-Team-Google, Jason Baldridge, Jakob Bauer et al.

We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models.

MLMar 2, 2021Code
Fast Adaptation with Linearized Neural Networks

Wesley J. Maddox, Shuai Tang, Pablo Garcia Moreno et al.

The inductive biases of trained neural networks are difficult to understand and, consequently, to adapt to new settings. We study the inductive biases of linearizations of neural networks, which we show to be surprisingly good summaries of the full network functions. Inspired by this finding, we propose a technique for embedding these inductive biases into Gaussian processes through a kernel designed from the Jacobian of the network. In this setting, domain adaptation takes the form of interpretable posterior inference, with accompanying uncertainty estimation. This inference is analytic and free of local optima issues found in standard techniques such as fine-tuning neural network weights to a new task. We develop significant computational speed-ups based on matrix multiplies, including a novel implementation for scalable Fisher vector products. Our experiments on both image classification and regression demonstrate the promise and convenience of this framework for transfer learning, compared to neural network fine-tuning. Code is available at https://github.com/amzn/xfer/tree/master/finite_ntk.

LGDec 8, 2023
Membership Inference Attacks on Diffusion Models via Quantile Regression

Shuai Tang, Zhiwei Steven Wu, Sergul Aydore et al.

Recently, diffusion models have become popular tools for image synthesis because of their high-quality outputs. However, like other large-scale models, they may leak private information about their training data. Here, we demonstrate a privacy vulnerability of diffusion models through a \emph{membership inference (MI) attack}, which aims to identify whether a target example belongs to the training set when given the trained diffusion model. Our proposed MI attack learns quantile regression models that predict (a quantile of) the distribution of reconstruction loss on examples not used in training. This allows us to define a granular hypothesis test for determining the membership of a point in the training set, based on thresholding the reconstruction loss of that point using a custom threshold tailored to the example. We also provide a simple bootstrap technique that takes a majority membership prediction over ``a bag of weak attackers'' which improves the accuracy over individual quantile regression models. We show that our attack outperforms the prior state-of-the-art attack while being substantially less computationally expensive -- prior attacks required training multiple ``shadow models'' with the same architecture as the model under attack, whereas our attack requires training only much smaller models.

AIMar 10, 2025
Encoding Argumentation Frameworks to Propositional Logic Systems

Shuai Tang, Jiachao Wu, Ning Zhou

This paper generalizes the encoding of argumentation frameworks beyond the classical 2-valued propositional logic system ($PL_2$) to 3-valued propositional logic systems ($PL_3$s) and fuzzy propositional logic systems ($PL_{[0,1]}s$), employing two key encodings: normal encoding ($ec_1$) and regular encoding ($ec_2$). Specifically, via $ec_1$ and $ec_2$, we establish model relationships between Dung's classical semantics (stable and complete semantics) and the encoded semantics associated with Kleene's $PL_3$ and Łukasiewicz's $PL_3$. Through $ec_1$, we also explore connections between Gabbay's real equational semantics and the encoded semantics of $PL_{[0,1]}s$, including showing that Gabbay's $Eq_{\text{max}}^R$ and $Eq_{\text{inverse}}^R$ correspond to the fuzzy encoded semantics of $PL_{[0,1]}^G$ and $PL_{[0,1]}^P$ respectively. Additionally, we propose a new fuzzy encoded semantics ($Eq^L$) associated with Łukasiewicz's $PL_{[0,1]}$ and investigate interactions between complete semantics and fuzzy encoded semantics. This work strengthens the links between argumentation frameworks and propositional logic systems, providing a framework for constructing new argumentation semantics.

CVNov 24, 2025
LAA3D: A Benchmark of Detecting and Tracking Low-Altitude Aircraft in 3D Space

Hai Wu, Shuai Tang, Jiale Wang et al.

Perception of Low-Altitude Aircraft (LAA) in 3D space enables precise 3D object localization and behavior understanding. However, datasets tailored for 3D LAA perception remain scarce. To address this gap, we present LAA3D, a large-scale dataset designed to advance 3D detection and tracking of low-altitude aerial vehicles. LAA3D contains 15,000 real images and 600,000 synthetic frames, captured across diverse scenarios, including urban and suburban environments. It covers multiple aerial object categories, including electric Vertical Take-Off and Landing (eVTOL) aircraft, Micro Aerial Vehicles (MAVs), and Helicopters. Each instance is annotated with 3D bounding box, class label, and instance identity, supporting tasks such as 3D object detection, 3D multi-object tracking (MOT), and 6-DoF pose estimation. Besides, we establish the LAA3D Benchmark, integrating multiple tasks and methods with unified evaluation protocols for comparison. Furthermore, we propose MonoLAA, a monocular 3D detection baseline, achieving robust 3D localization from zoom cameras with varying focal lengths. Models pretrained on synthetic images transfer effectively to real-world data with fine-tuning, demonstrating strong sim-to-real generalization. Our LAA3D provides a comprehensive foundation for future research in low-altitude 3D object perception.

IRFeb 25, 2022
MAMDR: A Model Agnostic Learning Method for Multi-Domain Recommendation

Linhao Luo, Yumeng Li, Buyu Gao et al.

Large-scale e-commercial platforms in the real-world usually contain various recommendation scenarios (domains) to meet demands of diverse customer groups. Multi-Domain Recommendation (MDR), which aims to jointly improve recommendations on all domains and easily scales to thousands of domains, has attracted increasing attention from practitioners and researchers. Existing MDR methods usually employ a shared structure and several specific components to respectively leverage reusable features and domain-specific information. However, data distribution differs across domains, making it challenging to develop a general model that can be applied to all circumstances. Additionally, during training, shared parameters often suffer from the domain conflict while specific parameters are inclined to overfitting on data sparsity domains. we first present a scalable MDR platform served in Taobao that enables to provide services for thousands of domains without specialists involved. To address the problems of MDR methods, we propose a novel model agnostic learning framework, namely MAMDR, for the multi-domain recommendation. Specifically, we first propose a Domain Negotiation (DN) strategy to alleviate the conflict between domains. Then, we develop a Domain Regularization (DR) to improve the generalizability of specific parameters by learning from other domains. We integrate these components into a unified framework and present MAMDR, which can be applied to any model structure to perform multi-domain recommendation. Finally, we present a large-scale implementation of MAMDR in the Taobao application and construct various public MDR benchmark datasets which can be used for following studies. Extensive experiments on both benchmark datasets and industry datasets demonstrate the effectiveness and generalizability of MAMDR.

SPFeb 9, 2022
Spectrally Adaptive Common Spatial Patterns

Mahta Mousavi, Eric Lybrand, Shuangquan Feng et al.

The method of Common Spatial Patterns (CSP) is widely used for feature extraction of electroencephalography (EEG) data, such as in motor imagery brain-computer interface (BCI) systems. It is a data-driven method estimating a set of spatial filters so that the power of the filtered EEG signal is maximized for one motor imagery class and minimized for the other. This method, however, is prone to overfitting and is known to suffer from poor generalization especially with limited calibration data. Additionally, due to the high heterogeneity in brain data and the non-stationarity of brain activity, CSP is usually trained for each user separately resulting in long calibration sessions or frequent re-calibrations that are tiring for the user. In this work, we propose a novel algorithm called Spectrally Adaptive Common Spatial Patterns (SACSP) that improves CSP by learning a temporal/spectral filter for each spatial filter so that the spatial filters are concentrated on the most relevant temporal frequencies for each user. We show the efficacy of SACSP in providing better generalizability and higher classification accuracy from calibration to online control compared to existing methods. Furthermore, we show that SACSP provides neurophysiologically relevant information about the temporal frequencies of the filtered signals. Our results highlight the differences in the motor imagery signal among BCI users as well as spectral differences in the signals generated for each class, and show the importance of learning robust user-specific features in a data-driven manner.

LGJun 11, 2020
Deep Transfer Learning with Ridge Regression

Shuai Tang, Virginia R. de Sa

The large amount of online data and vast array of computing resources enable current researchers in both industry and academia to employ the power of deep learning with neural networks. While deep models trained with massive amounts of data demonstrate promising generalisation ability on unseen data from relevant domains, the computational cost of finetuning gradually becomes a bottleneck in transfering the learning to new domains. We address this issue by leveraging the low-rank property of learnt feature vectors produced from deep neural networks (DNNs) with the closed-form solution provided in kernel ridge regression (KRR). This frees transfer learning from finetuning and replaces it with an ensemble of linear systems with many fewer hyperparameters. Our method is successful on supervised and semi-supervised transfer learning tasks.

LGMar 25, 2020
Similarity of Neural Networks with Gradients

Shuai Tang, Wesley J. Maddox, Charlie Dickens et al.

A suitable similarity index for comparing learnt neural networks plays an important role in understanding the behaviour of the highly-nonlinear functions, and can provide insights on further theoretical analysis and empirical studies. We define two key steps when comparing models: firstly, the representation abstracted from the learnt model, where we propose to leverage both feature vectors and gradient ones (which are largely ignored in prior work) into designing the representation of a neural network. Secondly, we define the employed similarity index which gives desired invariance properties, and we facilitate the chosen ones with sketching techniques for comparing various datasets efficiently. Empirically, we show that the proposed approach provides a state-of-the-art method for computing similarity of neural networks that are trained independently on different datasets and the tasks defined by the datasets.

CVOct 21, 2019
Improving Style Transfer with Calibrated Metrics

Mao-Chuang Yeh, Shuai Tang, Anand Bhattad et al.

Style transfer methods produce a transferred image which is a rendering of a content image in the manner of a style image. We seek to understand how to improve style transfer. To do so requires quantitative evaluation procedures, but the current evaluation is qualitative, mostly involving user studies. We describe a novel quantitative evaluation procedure. Our procedure relies on two statistics: the Effectiveness (E) statistic measures the extent that a given style has been transferred to the target, and the Coherence (C) statistic measures the extent to which the original image's content is preserved. Our statistics are calibrated to human preference: targets with larger values of E (resp C) will reliably be preferred by human subjects in comparisons of style (resp. content). We use these statistics to investigate the relative performance of a number of Neural Style Transfer(NST) methods, revealing several intriguing properties. Admissible methods lie on a Pareto frontier (i.e. improving E reduces C or vice versa). Three methods are admissible: Universal style transfer produces very good C but weak E; modifying the optimization used for Gatys' loss produces a method with strong E and strong C; and a modified cross-layer method has slightly better E at strong cost in C. While the histogram loss improves the E statistics of Gatys' method, it does not make the method admissible. Surprisingly, style weights have relatively little effect in improving EC scores, and most variability in the transfer is explained by the style itself (meaning experimenters can be misguided by selecting styles).

LGMay 27, 2019
An Empirical Study on Post-processing Methods for Word Embeddings

Shuai Tang, Mahta Mousavi, Virginia R. de Sa

Word embeddings learnt from large corpora have been adopted in various applications in natural language processing and served as the general input representations to learning systems. Recently, a series of post-processing methods have been proposed to boost the performance of word embeddings on similarity comparison and analogy retrieval tasks, and some have been adapted to compose sentence representations. The general hypothesis behind these methods is that by enforcing the embedding space to be more isotropic, the similarity between words can be better expressed. We view these methods as an approach to shrink the covariance/gram matrix, which is estimated by learning word vectors, towards a scaled identity matrix. By optimising an objective in the semi-Riemannian manifold with Centralised Kernel Alignment (CKA), we are able to search for the optimal shrinkage parameter, and provide a post-processing method to smooth the spectrum of learnt word vectors which yields improved performance on downstream tasks.

NEOct 29, 2018
A Simple Recurrent Unit with Reduced Tensor Product Representations

Shuai Tang, Paul Smolensky, Virginia R. de Sa

idely used recurrent units, including Long-short Term Memory (LSTM) and the Gated Recurrent Unit (GRU), perform well on natural language tasks, but their ability to learn structured representations is still questionable. Exploiting reduced Tensor Product Representations (TPRs) --- distributed representations of symbolic structure in which vector-embedded symbols are bound to vector-embedded structural positions --- we propose the TPRU, a simple recurrent unit that, at each time step, explicitly executes structural-role binding and unbinding operations to incorporate structural information into learning. A gradient analysis of our proposed TPRU is conducted to support our model design, and its performance on multiple datasets shows the effectiveness of our design choices. Furthermore, observations on a linguistically grounded study demonstrate the interpretability of our TPRU.

CLOct 2, 2018
Improving Sentence Representations with Consensus Maximisation

Shuai Tang, Virginia R. de Sa

Consensus maximisation learning can provide self-supervision when different views are available of the same data. The distributional hypothesis provides another form of useful self-supervision from adjacent sentences which are plentiful in large unlabelled corpora. Motivated by the observation that different learning architectures tend to emphasise different aspects of sentence meaning, we present a new self-supervised learning framework for learning sentence representations which minimises the disagreement between two views of the same sentence where one view encodes the sentence with a recurrent neural network (RNN), and the other view encodes the same sentence with a simple linear model. After learning, the individual views (networks) result in higher quality sentence representations than their single-view learnt counterparts (learnt using only the distributional hypothesis) as judged by performance on standard downstream tasks. An ensemble of both views provides even better generalisation on both supervised and unsupervised downstream tasks. Also, importantly the ensemble of views trained with consensus maximisation between the two different architectures performs better on downstream tasks than an analogous ensemble made from the single-view trained counterparts.

NESep 8, 2018
Exploiting Invertible Decoders for Unsupervised Sentence Representation Learning

Shuai Tang, Virginia R. de Sa

The encoder-decoder models for unsupervised sentence representation learning tend to discard the decoder after being trained on a large unlabelled corpus, since only the encoder is needed to map the input sentence into a vector representation. However, parameters learnt in the decoder also contain useful information about language. In order to utilise the decoder after learning, we present two types of decoding functions whose inverse can be easily derived without expensive inverse calculation. Therefore, the inverse of the decoding function serves as another encoder that produces sentence representations. We show that, with careful design of the decoding functions, the model learns good sentence representations, and the ensemble of the representations produced from the encoder and the inverse of the decoder demonstrate even better generalisation ability and solid transferability.

CLMay 18, 2018
Multi-view Sentence Representation Learning

Shuai Tang, Virginia R. de Sa

Multi-view learning can provide self-supervision when different views are available of the same data. The distributional hypothesis provides another form of useful self-supervision from adjacent sentences which are plentiful in large unlabelled corpora. Motivated by the asymmetry in the two hemispheres of the human brain as well as the observation that different learning architectures tend to emphasise different aspects of sentence meaning, we create a unified multi-view sentence representation learning framework, in which, one view encodes the input sentence with a Recurrent Neural Network (RNN), and the other view encodes it with a simple linear model, and the training objective is to maximise the agreement specified by the adjacent context information between two views. We show that, after training, the vectors produced from our multi-view training provide improved representations over the single-view training, and the combination of different views gives further representational improvement and demonstrates solid transferability on standard downstream tasks.

CVMar 31, 2018
Quantitative Evaluation of Style Transfer

Mao-Chuang Yeh, Shuai Tang, Anand Bhattad et al.

Style transfer methods produce a transferred image which is a rendering of a content image in the manner of a style image. There is a rich literature of variant methods. However, evaluation procedures are qualitative, mostly involving user studies. We describe a novel quantitative evaluation procedure. One plots effectiveness (a measure of the extent to which the style was transferred) against coherence (a measure of the extent to which the transferred image decomposes into objects in the same way that the content image does) to obtain an EC plot. We construct EC plots comparing a number of recent style transfer methods. Most methods control within-layer gram matrices, but we also investigate a method that controls cross-layer gram matrices. These EC plots reveal a number of intriguing properties of recent style transfer methods. The style used has a strong effect on the outcome, for all methods. Using large style weights does not necessarily improve effectiveness, and can produce worse results. Cross-layer gram matrices easily beat all other methods, but some styles remain difficult for all methods. Ensemble methods show real promise. It is likely that, for current methods, each style requires a different choice of weights to obtain the best results, so that automated weight setting methods are desirable. Finally, we show evidence comparing our EC evaluations to human evaluations.

CVJan 5, 2018
Improved Style Transfer by Respecting Inter-layer Correlations

Mao-Chuang Yeh, Shuai Tang

A popular series of style transfer methods apply a style to a content image by controlling mean and covariance of values in early layers of a feature stack. This is insufficient for transferring styles that have strong structure across spatial scales like, e.g., textures where dots lie on long curves. This paper demonstrates that controlling inter-layer correlations yields visible improvements in style transfer methods. We achieve this control by computing cross-layer, rather than within-layer, gram matrices. We find that (a) cross-layer gram matrices are sufficient to control within-layer statistics. Inter-layer correlations improves style transfer and texture synthesis. The paper shows numerous examples on "hard" real style transfer problems (e.g. long scale and hierarchical patterns); (b) a fast approximate style transfer method can control cross-layer gram matrices; (c) we demonstrate that multiplicative, rather than additive style and content loss, results in very good style transfer. Multiplicative loss produces a visible emphasis on boundaries, and means that one hyper-parameter can be eliminated.

NEOct 28, 2017
Speeding up Context-based Sentence Representation Learning with Non-autoregressive Convolutional Decoding

Shuai Tang, Hailin Jin, Chen Fang et al.

Context plays an important role in human language understanding, thus it may also be useful for machines learning vector representations of language. In this paper, we explore an asymmetric encoder-decoder structure for unsupervised context-based sentence representation learning. We carefully designed experiments to show that neither an autoregressive decoder nor an RNN decoder is required. After that, we designed a model which still keeps an RNN as the encoder, while using a non-autoregressive convolutional decoder. We further combine a suite of effective designs to significantly improve model efficiency while also achieving better performance. Our model is trained on two different large unlabelled corpora, and in both cases the transferability is evaluated on a set of downstream NLP tasks. We empirically show that our model is simple and fast while producing rich sentence representations that excel in downstream tasks.

CLJun 9, 2017
Trimming and Improving Skip-thought Vectors

Shuai Tang, Hailin Jin, Chen Fang et al.

The skip-thought model has been proven to be effective at learning sentence representations and capturing sentence semantics. In this paper, we propose a suite of techniques to trim and improve it. First, we validate a hypothesis that, given a current sentence, inferring the previous and inferring the next sentence provide similar supervision power, therefore only one decoder for predicting the next sentence is preserved in our trimmed skip-thought model. Second, we present a connection layer between encoder and decoder to help the model to generalize better on semantic relatedness tasks. Third, we found that a good word embedding initialization is also essential for learning better sentence representations. We train our model unsupervised on a large corpus with contiguous sentences, and then evaluate the trained model on 7 supervised tasks, which includes semantic relatedness, paraphrase detection, and text classification benchmarks. We empirically show that, our proposed model is a faster, lighter-weight and equally powerful alternative to the original skip-thought model.

CLJun 9, 2017
Rethinking Skip-thought: A Neighborhood based Approach

Shuai Tang, Hailin Jin, Chen Fang et al.

We study the skip-thought model with neighborhood information as weak supervision. More specifically, we propose a skip-thought neighbor model to consider the adjacent sentences as a neighborhood. We train our skip-thought neighbor model on a large corpus with continuous sentences, and then evaluate the trained model on 7 tasks, which include semantic relatedness, paraphrase detection, and classification benchmarks. Both quantitative comparison and qualitative investigation are conducted. We empirically show that, our skip-thought neighbor model performs as well as the skip-thought model on evaluation tasks. In addition, we found that, incorporating an autoencoder path in our model didn't aid our model to perform better, while it hurts the performance of the skip-thought model.

NENov 23, 2015
What Happened to My Dog in That Network: Unraveling Top-down Generators in Convolutional Neural Networks

Patrick W. Gallagher, Shuai Tang, Zhuowen Tu

Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".