Jing Qian

h-index: 45

30 papers

8,997 citations

Novelty: 52%

AI Score: 59

Ranked #11,328 of 205,806 authors (top 6%); #2,695 in CL (top 8%)

30 Papers

CL Oct 9, 2022

Controllable Dialogue Simulation with In-Context Learning

Zekun Li, Wenhu Chen, Shiyang Li et al.

Building dialogue systems requires a large corpus of annotated dialogues. Such datasets are usually created via crowdsourcing, which is expensive and time-consuming. In this paper, we propose Dialogic, a novel dialogue simulation method based on large language model in-context learning to automate dataset creation. Seeded with a few annotated dialogues, Dialogic automatically selects in-context examples for demonstration and prompts GPT-3 to generate new dialogues and annotations in a controllable way. Our method can rapidly expand a small set of dialogue data with minimum or zero human involvement and parameter update and is thus much more cost-efficient and time-saving than crowdsourcing. Experimental results on the MultiWOZ dataset demonstrate that training a model on the simulated dialogues leads to even better performance than using the same amount of human-generated dialogues under the challenging low-resource settings, with as few as 85 dialogues as a seed. When enough data is available, our method can still serve as an effective data augmentation method. Human evaluation results also show that our simulated dialogues have near-human fluency and annotation accuracy. The code and data are available at https://github.com/Leezekun/dialogic.

CL Oct 13, 2022

Explanations from Large Language Models Make Small Reasoners Better

Shiyang Li, Jianshu Chen, Yelong Shen et al.

Integrating free-text explanations to in-context learning of large language models (LLM) is shown to elicit strong reasoning capabilities along with reasonable explanations. In this paper, we consider the problem of leveraging the explanations generated by LLM to improve the training of small reasoners, which are more favorable in real-production deployment due to their low cost. We systematically explore three explanation generation approaches from LLM and utilize a multi-task learning framework to facilitate small models to acquire strong reasoning power together with explanation generation capabilities. Experiments on multiple reasoning tasks show that our method can consistently and significantly outperform finetuning baselines across different settings, and even perform better than finetuning/prompting a 60x larger GPT-3 (175B) model by up to 9.5% in accuracy. As a side benefit, human evaluation further shows that our method can generate high-quality explanations to justify its predictions, moving towards the goal of explainable AI.

CL Aug 9, 2022

Limitations of Language Models in Arithmetic and Symbolic Induction

Jing Qian, Hong Wang, Zekun Li et al.

Recent work has shown that large pretrained Language Models (LMs) can not only perform remarkably well on a range of Natural Language Processing (NLP) tasks but also start improving on reasoning tasks such as arithmetic induction, symbolic manipulation, and commonsense reasoning with increasing size of models. However, it is still unclear what the underlying capabilities of these LMs are. Surprisingly, we find that these models have limitations on certain basic symbolic manipulation tasks such as copy, reverse, and addition. When the total number of symbols or repeating symbols increases, the model performance drops quickly. We investigate the potential causes behind this phenomenon and examine a set of possible methods, including explicit positional markers, fine-grained computation steps, and LMs with callable programs. Experimental results show that none of these techniques can solve the simplest addition induction problem completely. In the end, we introduce LMs with tutor, which demonstrates every single step of teaching. LMs with tutor is able to deliver 100% accuracy in situations of OOD and repeating symbols, shedding new insights on the boundary of large LMs in induction.

CL Jan 25, 2023

Language Model Detoxification in Dialogue with Contextualized Stance Control

Jing Qian, Xifeng Yan

To reduce the toxic degeneration in a pretrained Language Model (LM), previous work on Language Model detoxification has focused on reducing the toxicity of the generation itself (self-toxicity) without consideration of the context. As a result, a type of implicit offensive language where the generations support the offensive language in the context is ignored. Different from the LM controlling tasks in previous work, where the desired attributes are fixed for generation, the desired stance of the generation depends on the offensiveness of the context. Therefore, we propose a novel control method to do context-dependent detoxification with the stance taken into consideration. We introduce meta prefixes to learn the contextualized stance control strategy and to generate the stance control prefix according to the input context. The generated stance prefix is then combined with the toxicity control prefix to guide the response generation. Experimental results show that our proposed method can effectively learn the context-dependent stance control strategies while keeping a low self-toxicity of the underlying LM.

CL Feb 4

ERNIE 5.0 Technical Report

Haifeng Wang, Hua Wu, Tian Wu et al.

In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.

91.4 SD Apr 19

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

Junyi Wang, Chi Zhang, Jing Qian et al.

In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 200 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 3.68% WER on the LibriSpeech test-clean set at 200 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.20% on test-clean and 8.93% on test-other, corresponding to a 13% relative reduction while preserving perceptual quality.
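
The relative reduction quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the 13% figure refers to the test-clean set (3.68% WER without RL, 3.20% with RL); the helper name `relative_reduction` is illustrative and not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction of `after` relative to `before`."""
    return (before - after) / before


# WER figures on LibriSpeech test-clean as quoted in the abstract (percent).
wer_without_rl = 3.68
wer_with_rl = 3.20

reduction = relative_reduction(wer_without_rl, wer_with_rl)
print(f"{reduction:.1%}")  # prints 13.0%
```

The absolute improvement is 0.48 percentage points; dividing by the 3.68% starting point gives the relative figure.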

82.7 SD May 19

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

Junyi Wang, Chi Zhang, Jing Qian et al.

In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 10.4% on test-other, corresponding to a 23% relative reduction while preserving perceptual quality.

HC Feb 29, 2024

ARTiST: Automated Text Simplification for Task Guidance in Augmented Reality

Guande Wu, Jing Qian, Sonia Castelo et al.

Text presented in augmented reality provides in-situ, real-time information for users. However, this content can be challenging to apprehend quickly when engaging in cognitively demanding AR tasks, especially when it is presented on a head-mounted display. We propose ARTiST, an automatic text simplification system that uses a few-shot prompt and GPT-3 models to specifically optimize the text length and semantic content for augmented reality. Developed out of a formative study that included seven users and three experts, our system combines a customized error calibration model with a few-shot prompt to integrate the syntactic, lexical, elaborative, and content simplification techniques, and generate simplified AR text for head-worn displays. Results from a 16-user empirical study showed that ARTiST lightens the cognitive load and improves performance significantly over both unmodified text and text modified via traditional methods. Our work constitutes a step towards automating the optimization of batch text data for readability and performance in augmented reality.

HC Oct 22, 2024

Satori: Towards Proactive AR Assistant with Belief-Desire-Intention User Modeling

Chenyi Li, Guande Wu, Gromit Yeuk-Yin Chan et al.

Augmented Reality (AR) assistance is increasingly used for supporting users with physical tasks like assembly and cooking. However, most systems rely on reactive responses triggered by user input, overlooking rich contextual and user-specific information. To address this, we present Satori, a novel AR system that proactively guides users by modeling both their mental states and environmental contexts. Satori integrates the Belief-Desire-Intention (BDI) framework with a state-of-the-art multi-modal large language model (LLM) to deliver contextually appropriate guidance. Our system is designed based on two formative studies involving twelve experts. We evaluated the system with a sixteen-participant within-subject study and found that Satori matches the performance of designer-created Wizard-of-Oz (WoZ) systems, without manual configurations or heuristics, thereby improving generalizability, reusability, and expanding the potential of AR assistance.

LG Nov 26, 2025

Generative Early Stage Ranking

Juhee Hong, Meng Liu, Shengzhi Wang et al.

Large-scale recommendations commonly adopt a multi-stage cascading ranking system paradigm to balance effectiveness and efficiency. Early Stage Ranking (ESR) systems utilize the "user-item decoupling" approach, where independently learned user and item representations are only combined at the final layer. While efficient, this design is limited in effectiveness, as it struggles to capture fine-grained user-item affinities and cross-signals. To address these, we propose the Generative Early Stage Ranking (GESR) paradigm, introducing the Mixture of Attention (MoA) module which leverages diverse attention mechanisms to bridge the effectiveness gap: the Hard Matching Attention (HMA) module encodes explicit cross-signals by computing raw match counts between user and item features; the Target-Aware Self Attention module generates target-aware user representations conditioned on the item, enabling more personalized learning; and the Cross Attention modules facilitate early and more enriched interactions between user-item features. MoA's specialized attention encodings are further refined in the final layer through a Multi-Logit Parameterized Gating (MLPG) module, which integrates the newly learned embeddings via gating and produces secondary logits that are fused with the primary logit. To address the efficiency and latency challenges, we have introduced a comprehensive suite of optimization techniques. These span from custom kernels that maximize the capabilities of the latest hardware to efficient serving solutions powered by caching mechanisms. The proposed GESR paradigm has shown substantial improvements in topline metrics, engagement, and consumption tasks, as validated by both offline and online experiments. To the best of our knowledge, this marks the first successful deployment of full target-aware attention sequence modeling within an ESR stage at such a scale.

IR Aug 4, 2025

Realizing Scaling Laws in Recommender Systems: A Foundation-Expert Paradigm for Hyperscale Model Deployment

Dai Li, Kevin Course, Wei Li et al.

While scaling laws promise significant performance gains for recommender systems, efficiently deploying hyperscale models remains a major unsolved challenge. In contrast to fields where foundation models (FMs) are already widely adopted, such as natural language processing and computer vision, progress in recommender systems is hindered by unique challenges including the need to learn from online streaming data under shifting data distributions, the need to adapt to different recommendation surfaces with a wide diversity in their downstream tasks and their input distributions, and stringent latency and computational constraints. To bridge this gap, we propose to leverage the Foundation-Expert Paradigm: a framework designed for the development and deployment of hyperscale recommendation FMs. In our approach, a central FM is trained on lifelong, cross-surface, multi-modal user data to learn generalizable knowledge. This knowledge is then efficiently transferred to various lightweight, surface-specific "expert" models via target-aware embeddings, allowing them to adapt to local data distributions and optimization goals with minimal overhead. To meet our training, inference and development needs, we built HyperCast, a production-grade infrastructure system that re-engineers training, serving, logging and iteration to power this decoupled paradigm. Our approach is now deployed at Meta serving tens of billions of user requests daily, demonstrating online metric improvements over our previous one-stage production system while improving developer velocity and maintaining infrastructure efficiency. To the best of our knowledge, this work represents the first successful deployment of a Foundation-Expert paradigm at this scale, offering a proven, compute-efficient, and developer-friendly blueprint to realize the promise of scaling laws in recommender systems.

CL Feb 27, 2022

Controllable Natural Language Generation with Contrastive Prefixes

Jing Qian, Li Dong, Yelong Shen et al.

To guide the generation of large pretrained language models (LM), previous work has focused on directly fine-tuning the language model or utilizing an attribute discriminator. In this work, we propose a novel lightweight framework for controllable GPT2 generation, which utilizes a set of small attribute-specific vectors, called prefixes, to steer natural language generation. Different from prefix-tuning, where each prefix is trained independently, we take the relationship among prefixes into consideration and train multiple prefixes simultaneously. We propose a novel supervised method and also an unsupervised method to train the prefixes for single-aspect control while the combination of these two methods can achieve multi-aspect control. Experimental results on both single-aspect and multi-aspect control show that our methods can guide generation towards the desired attributes while keeping high linguistic quality.

HC Jul 20, 2021

Readability Research: An Interdisciplinary Approach

Sofie Beier, Sam Berlow, Esat Boucaud et al.

Readability is on the cusp of a revolution. Fixed text is becoming fluid as a proliferation of digital reading devices rewrites what a document can do. As past constraints make way for more flexible opportunities, there is great need to understand how reading formats can be tuned to the situation and the individual. We aim to provide a firm foundation for readability research, a comprehensive framework for modern, multi-disciplinary readability research. Readability refers to aspects of visual information design which impact information flow from the page to the reader. Readability can be enhanced by changes to the set of typographical characteristics of a text. These aspects can be modified on-demand, instantly improving the ease with which a reader can process and derive meaning from text. We call on a multi-disciplinary research community to take up these challenges to elevate reading outcomes and provide the tools to do so effectively.

CL Jun 5, 2021

Lifelong Learning of Hate Speech Classification on Social Media

Jing Qian, Hong Wang, Mai ElSherief et al.

Existing work on automated hate speech classification assumes that the dataset is fixed and the classes are pre-defined. However, the amount of data in social media increases every day, and hot topics change rapidly, requiring the classifiers to be able to continuously adapt to new data without forgetting the previously learned knowledge. This ability, referred to as lifelong learning, is crucial for the real-world application of hate speech classifiers in social media. In this work, we propose lifelong learning of hate speech classification on social media. To alleviate catastrophic forgetting, we propose to use Variational Representation Learning (VRL) along with a memory module based on LB-SOINN (Load-Balancing Self-Organizing Incremental Neural Network). Experimentally, we show that combining variational representation learning and the LB-SOINN memory module achieves better performance than the commonly-used lifelong learning techniques.

COMP-PH Apr 25, 2021

Revisiting the dynamics of Bose-Einstein condensates in a double well by deep learning with a hybrid network

Shurui Li, Jianqin Xu, Jing Qian et al.

Deep learning, built on elaborate neural networks, has recently been developed as an efficient and powerful tool for solving diverse problems in physics and other sciences. In the present work, we propose a novel learning method based on a hybrid network integrating two different kinds of neural networks, Long Short-Term Memory (LSTM) and Deep Residual Network (ResNet), in order to overcome the difficulty of numerically simulating strongly oscillating dynamical evolutions of physical systems. Taking the dynamics of Bose-Einstein condensates in a double-well potential as an example, we show that our new method achieves highly efficient pre-learning and high-fidelity prediction of the whole dynamics. This benefits from combining the LSTM and the ResNet and cannot be achieved by a single network in the case of direct learning. Our method can be applied to simulating complex cooperative dynamics in a system with fast multiple-frequency oscillations with the aid of auxiliary spectrum analysis.

HC Apr 13, 2021

Lets Make A Story Measuring MR Child Engagement

Duotun Wang, Jennifer Healey, Jing Qian et al.

We present the result of a pilot study measuring child engagement with the Lets Make A Story system, a novel mixed reality (MR) collaborative storytelling system designed for grandparents and grandchildren. We compare our MR experience against an equivalent paper story experience. The goal of our pilot was to test the system with actual child users and assess the suitability of time, user-generated story content, and facial expression analysis as metrics of child engagement. We find that multiple confounding variables make these metrics problematic, including attribution of engagement time, spontaneous non-story-related conversation, and the need to keep the child's full forward face continuously in view during the story. We present our platform and experiences and our finding that the strongest metric was user comments in the post-experiential interview.

LG Nov 9, 2019

Towards Understanding Gender Bias in Relation Extraction

Andrew Gaut, Tony Sun, Shirlyn Tang et al.

Recent developments in Neural Relation Extraction (NRE) have made significant strides towards Automated Knowledge Base Construction (AKBC). While much attention has been dedicated towards improvements in accuracy, there have been no attempts in the literature to our knowledge to evaluate social biases in NRE systems. We create WikiGenderBias, a distantly supervised dataset with a human annotated test set. WikiGenderBias has sentences specifically curated to analyze gender bias in relation extraction systems. We use WikiGenderBias to evaluate systems for bias and find that NRE systems exhibit gender biased predictions and lay groundwork for future evaluation of bias in NRE. We also analyze how name anonymization, hard debiasing for word embeddings, and counterfactual data augmentation affect gender bias in predictions and performance.

CL Sep 10, 2019

A Benchmark Dataset for Learning to Intervene in Online Hate Speech

Jing Qian, Anna Bethke, Yinyin Liu et al.

Countering online hate speech is a critical yet challenging task, but one which can be aided by the use of Natural Language Processing (NLP) techniques. Previous research has primarily focused on the development of NLP methods to automatically and effectively detect online hate speech while disregarding further action needed to calm and discourage individuals from using hate speech in the future. In addition, most existing hate speech datasets treat each post as an isolated instance, ignoring the conversational context. In this paper, we propose a novel task of generative hate speech intervention, where the goal is to automatically generate responses to intervene during online conversations that contain hate speech. As a part of this work, we introduce two fully-labeled large-scale hate speech intervention datasets collected from Gab and Reddit. These datasets provide conversation segments, hate speech labels, as well as intervention responses written by Mechanical Turk Workers. In this paper, we also analyze the datasets to understand the common intervention strategies and explore the performance of common automatic response generation methods on these new datasets to provide a benchmark for future research.

CL Apr 4, 2019

Learning to Decipher Hate Symbols

Jing Qian, Mai ElSherief, Elizabeth Belding et al.

Existing computational models to understand hate speech typically frame the problem as a simple classification task, bypassing the understanding of hate symbols (e.g., 14 words, kigy) and their secret connotations. In this paper, we propose a novel task of deciphering hate symbols. To do this, we leverage the Urban Dictionary and collect a new, symbol-rich Twitter corpus of hate speech. We investigate neural network latent context models for deciphering hate symbols. More specifically, we study Sequence-to-Sequence models and show how they are able to crack the ciphers based on context. Furthermore, we propose a novel Variational Decipher and show how it can generalize better to unseen hate symbols in a more challenging testing setting.

CL Nov 2, 2018

A Survey on Natural Language Processing for Fake News Detection

Ray Oshikawa, Jing Qian, William Yang Wang

Fake news detection is a critical yet challenging problem in Natural Language Processing (NLP). The rapid rise of social networking platforms has not only yielded a vast increase in information accessibility but has also accelerated the spread of fake news. Thus, the effect of fake news has been growing, sometimes extending to the offline world and threatening public safety. Given the massive amount of Web content, automatic fake news detection is a practical NLP problem useful to all online content providers, in order to reduce the human time and effort to detect and prevent the spread of fake news. In this paper, we describe the challenges involved in fake news detection and also describe related tasks. We systematically review and compare the task formulations, datasets and NLP solutions that have been developed for this task, and also discuss their potential and limitations. Based on our insights, we outline promising research directions, including more fine-grained, detailed, fair, and practical detection models. We also highlight the difference between fake news detection and other related tasks, and the importance of NLP solutions for fake news detection.

CL Aug 31, 2018

Hierarchical CVAE for Fine-Grained Hate Speech Classification

Jing Qian, Mai ElSherief, Elizabeth Belding et al.

Existing work on automated hate speech detection typically focuses on binary classification or on differentiating among a small set of categories. In this paper, we propose a novel method on a fine-grained hate speech classification task, which focuses on differentiating among 40 hate groups of 13 different hate group categories. We first explore the Conditional Variational Autoencoder (CVAE) as a discriminative model and then extend it to a hierarchical architecture to utilize the additional hate category information for more accurate prediction. Experimentally, we show that incorporating the hate category information for training can significantly improve the classification performance and our proposed model outperforms commonly-used discriminative models.

CL Apr 9, 2018

Leveraging Intra-User and Inter-User Representation Learning for Automated Hate Speech Detection

Jing Qian, Mai ElSherief, Elizabeth M. Belding et al.

Hate speech detection is a critical, yet challenging problem in Natural Language Processing (NLP). Despite the existence of numerous studies dedicated to the development of NLP hate speech detection approaches, the accuracy is still poor. The central problem is that social media posts are short and noisy, and most existing hate speech detection solutions take each post as an isolated input instance, which is likely to yield high false positive and negative rates. In this paper, we radically improve automated hate speech detection by presenting a novel model that leverages intra-user and inter-user representation learning for robust hate speech detection on Twitter. In addition to the target Tweet, we collect and analyze the user's historical posts to model intra-user Tweet representations. To suppress the noise in a single Tweet, we also model the similar Tweets posted by all other users with reinforced inter-user representation learning techniques. Experimentally, we show that leveraging these two representations can significantly improve the f-score of a strong bidirectional LSTM baseline model by 10.1%.

ML Aug 26, 2016

Clustering and Community Detection with Imbalanced Clusters

Cem Aksoylar, Jing Qian, Venkatesh Saligrama

Spectral clustering methods which are frequently used in clustering and community detection applications are sensitive to the specific graph constructions particularly when imbalanced clusters are present. We show that ratio cut (RCut) or normalized cut (NCut) objectives are not tailored to imbalanced cluster sizes since they tend to emphasize cut sizes over cut values. We propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on partitions to deal with imbalanced cluster sizes. Our approach parameterizes a family of graphs by adaptively modulating node degrees on a fixed node set, yielding a set of parameter dependent cuts reflecting varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach and demonstrate the superiority of our method through experiments on synthetic and real datasets for data clustering, semi-supervised learning and community detection.

ML Jan 22, 2016

Learning Minimum Volume Sets and Anomaly Detectors from KNN Graphs

Jonathan Root, Venkatesh Saligrama, Jing Qian

We propose a non-parametric anomaly detection algorithm for high dimensional data. We first rank scores derived from nearest neighbor graphs on $n$-point nominal training data. We then train limited complexity models to imitate these scores based on the max-margin learning-to-rank framework. A test-point is declared as an anomaly at $α$-false alarm level if the predicted score is in the $α$-percentile. The resulting anomaly detector is shown to be asymptotically optimal in that for any false alarm rate $α$, its decision region converges to the $α$-percentile minimum volume level set of the unknown underlying density. In addition, we test both the statistical performance and computational efficiency of our algorithm on a number of synthetic and real-data experiments. Our results demonstrate the superiority of our algorithm over existing $K$-NN based anomaly detection algorithms, with significant computational savings.

LG Feb 6, 2015

Learning Efficient Anomaly Detectors from $K$-NN Graphs

Jing Qian, Jonathan Root, Venkatesh Saligrama

We propose a non-parametric anomaly detection algorithm for high dimensional data. We score each datapoint by its average $K$-NN distance, and rank them accordingly. We then train limited complexity models to imitate these scores based on the max-margin learning-to-rank framework. A test-point is declared as an anomaly at $α$-false alarm level if the predicted score is in the $α$-percentile. The resulting anomaly detector is shown to be asymptotically optimal in that for any false alarm rate $α$, its decision region converges to the $α$-percentile minimum volume level set of the unknown underlying density. In addition, we test both the statistical performance and computational efficiency of our algorithm on a number of synthetic and real-data experiments. Our results demonstrate the superiority of our algorithm over existing $K$-NN based anomaly detection algorithms, with significant computational savings.
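
The scoring step described in this abstract (average $K$-NN distance per point) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it thresholds the raw scores directly at the empirical (1 - α) quantile of the nominal scores and omits the learning-to-rank model that the paper trains to imitate them. The function name `avg_knn_distance`, the synthetic Gaussian data, and the choices K = 5 and α = 0.05 are all hypothetical.

```python
import numpy as np


def avg_knn_distance(train: np.ndarray, queries: np.ndarray, k: int,
                     exclude_self: bool = False) -> np.ndarray:
    """Average Euclidean distance from each query to its k nearest training points."""
    # Pairwise distances, shape (n_queries, n_train).
    diffs = queries[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)
    # When queries are the training points themselves, skip the zero self-distance.
    start = 1 if exclude_self else 0
    return dists[:, start:start + k].mean(axis=1)


# Synthetic nominal data (an assumption for illustration only).
rng = np.random.default_rng(0)
nominal = rng.normal(size=(200, 2))
train_scores = avg_knn_distance(nominal, nominal, k=5, exclude_self=True)

# Flag a test point when its score exceeds the (1 - alpha) quantile of nominal scores.
alpha = 0.05
threshold = np.quantile(train_scores, 1 - alpha)

test_points = np.array([[0.0, 0.0], [8.0, 8.0]])
test_scores = avg_knn_distance(nominal, test_points, k=5)
is_anomaly = test_scores > threshold
print(is_anomaly)
```

Here a larger average distance means a sparser neighbourhood, so a point far from the nominal cloud scores higher than one near its centre. The pairwise distance matrix costs O(n²) memory, which is one reason a cheaper learned model that imitates the scores is attractive at test time.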

ML Nov 23, 2014

Efficient Minimax Signal Detection on Graphs

Jing Qian, Venkatesh Saligrama

Several problems such as network intrusion, community detection, and disease outbreak can be described by observations attributed to nodes or edges of a graph. In these applications, the presence of intrusion, community or disease outbreak is characterized by novel observations on some unknown connected subgraph. These problems can be formulated in terms of optimization of suitable objectives on connected subgraphs, a problem which is generally computationally difficult. We overcome the combinatorics of connectivity by embedding connected subgraphs into linear matrix inequalities (LMI). Computationally efficient tests are then realized by optimizing convex objective functions subject to these LMI constraints. We prove, by means of a novel Euclidean embedding argument, that our tests are minimax optimal for the exponential family of distributions on 1-D and 2-D lattices. We show that internal conductance of the connected subgraph family plays a fundamental role in characterizing detectability.

MLMay 2, 2014

A Rank-SVM Approach to Anomaly Detection

Jing Qian, Jonathan Root, Venkatesh Saligrama et al.

We propose a novel non-parametric adaptive anomaly detection algorithm for high-dimensional data based on rank-SVM. Data points are first ranked based on scores derived from nearest neighbor graphs on $n$-point nominal data. We then train a rank-SVM using this ranked data. A test point is declared an anomaly at the $α$-false alarm level if its predicted score is in the $α$-percentile. The resulting anomaly detector is shown to be asymptotically optimal and adaptive in that, for any false alarm rate $α$, its decision region converges to the $α$-percentile level set of the unknown underlying density. In addition, we illustrate through a number of synthetic and real-data experiments both the statistical performance and the computational efficiency of our anomaly detector.

MLSep 9, 2013

Spectral Clustering with Imbalanced Data

Jing Qian, Venkatesh Saligrama

Spectral clustering is sensitive to how graphs are constructed from data, particularly when proximal and imbalanced clusters are present. We show that Ratio-Cut (RCut) and normalized cut (NCut) objectives are not tailored to imbalanced data since they tend to emphasize cut sizes over cut values. We propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on partitions to deal with imbalanced data. Our approach parameterizes a family of graphs, by adaptively modulating node degrees on a fixed node set, to yield a set of parameter-dependent cuts reflecting varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach. We demonstrate the superiority of our method through unsupervised and semi-supervised experiments on synthetic and real data sets.
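To make the point about RCut and NCut concrete, the sketch below evaluates both objectives on a toy weighted path graph, using the standard two-way definitions RCut = cut(A, B)(1/|A| + 1/|B|) and NCut = cut(A, B)(1/vol(A) + 1/vol(B)). The graph, its weights, and the function names are hypothetical choices for illustration. In this example the imbalanced partition has the smaller cut value, yet both objectives favor the balanced one.

```python
import numpy as np

def cut_value(W, mask):
    """Total weight of edges between the nodes in mask and the remaining nodes."""
    return W[np.ix_(mask, ~mask)].sum()

def ratio_cut(W, mask):
    """RCut: cut value scaled by inverse cluster sizes."""
    return cut_value(W, mask) * (1.0 / mask.sum() + 1.0 / (~mask).sum())

def normalized_cut(W, mask):
    """NCut: cut value scaled by inverse cluster volumes (sums of node degrees)."""
    deg = W.sum(axis=1)
    return cut_value(W, mask) * (1.0 / deg[mask].sum() + 1.0 / deg[~mask].sum())

# Hypothetical 4-node path graph 0-1-2-3; the edge 0-1 is the weakest link.
W = np.zeros((4, 4))
for i, j, w in [(0, 1, 0.8), (1, 2, 1.0), (2, 3, 1.0)]:
    W[i, j] = W[j, i] = w

imbalanced = np.array([True, False, False, False])  # {0} vs {1, 2, 3}
balanced = np.array([True, True, False, False])     # {0, 1} vs {2, 3}

for name, mask in [("imbalanced", imbalanced), ("balanced", balanced)]:
    print(name, cut_value(W, mask), ratio_cut(W, mask), normalized_cut(W, mask))
```

The size and volume factors penalize the single-node cluster enough to outweigh its lower cut value, which is the kind of bias toward balanced partitions that the abstract refers to.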

MLFeb 20, 2013

Spectral Clustering with Unbalanced Data

Jing Qian, Venkatesh Saligrama

Spectral clustering (SC) and graph-based semi-supervised learning (SSL) algorithms are sensitive to how graphs are constructed from data. In particular, if the data have proximal and unbalanced clusters, these algorithms can perform poorly on well-known graphs such as $k$-NN, full-RBF, and $ε$-graphs. This is because objectives such as Ratio-Cut (RCut) or normalized cut (NCut) attempt to trade off cut values against cluster sizes, and so are not tailored to unbalanced data. We propose a novel graph partitioning framework, which parameterizes a family of graphs by adaptively modulating node degrees in a $k$-NN graph. We then propose a model selection scheme to choose sizable clusters that are separated by the smallest cut values. Our framework is able to adapt to varying levels of unbalancedness in the data and can be naturally used for small cluster detection. We theoretically justify our ideas through limit cut analysis. Unsupervised and semi-supervised experiments on synthetic and real data sets demonstrate the superiority of our method.

MLMay 7, 2012

Graph-based Learning with Unbalanced Clusters

Jing Qian, Venkatesh Saligrama, Manqi Zhao

Graph construction is a crucial step in spectral clustering (SC) and graph-based semi-supervised learning (SSL). Spectral methods applied to standard graphs such as full-RBF, $ε$-graphs, and $k$-NN graphs can lead to poor performance in the presence of proximal and unbalanced data. This is because spectral methods based on minimizing RatioCut or normalized cut on these graphs tend to prioritize balancing cluster sizes over reducing cut values. We propose a novel graph construction technique and show that the RatioCut solution on this new graph is able to handle proximal and unbalanced data. Our method is based on adaptively modulating the neighborhood degrees in a $k$-NN graph, which tends to sparsify neighborhoods in low-density regions. Our method adapts to data with varying levels of unbalancedness and can be naturally used for small cluster detection. We justify our ideas through limit cut analysis. Unsupervised and semi-supervised experiments on synthetic and real data sets demonstrate the superiority of our method.
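One way to picture degree modulation in a $k$-NN graph is sketched below. The abstract does not give the modulation rule, so the rule used here is an assumption for illustration only: each point's neighborhood size is scaled by the rank of a crude density estimate, so that points in low-density regions receive fewer neighbors. The function name and the parameters `k` and `lam` are hypothetical.

```python
import numpy as np

def modulated_knn_graph(X, k=6, lam=0.5):
    """Symmetric k-NN style graph whose per-node degree shrinks in low-density regions.

    Assumed rule (not taken from the abstract): k_i = round(k * (lam + 2 * (1 - lam) * rank_i)),
    where rank_i in (0, 1] is high for points in dense regions.
    """
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                  # a point is never its own neighbor
    order = np.argsort(dists, axis=1)[:, : n - 1]    # neighbors sorted by distance
    # Crude density proxy: a small average k-NN distance indicates a dense region.
    avg_knn = np.take_along_axis(dists, order[:, :k], axis=1).mean(axis=1)
    # rank_i = fraction of points that are at least as sparse as point i.
    rank = (avg_knn[None, :] >= avg_knn[:, None]).mean(axis=1)
    degrees = np.clip(np.rint(k * (lam + 2 * (1 - lam) * rank)).astype(int), 1, n - 1)
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        A[i, order[i, : degrees[i]]] = True
    return A | A.T, degrees                          # symmetrize the directed graph

# Hypothetical unbalanced data: a large dense cluster and a small sparse one.
rng = np.random.default_rng(0)
dense = rng.normal(scale=0.3, size=(80, 2))
sparse = rng.normal(loc=5.0, scale=1.0, size=(10, 2))
A, degrees = modulated_knn_graph(np.vstack([dense, sparse]))
```

With `lam = 1` the rule reduces to an ordinary $k$-NN graph, and smaller values of `lam` sparsify low-density neighborhoods more aggressively.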