LG · Jul 12, 2024
Your Diffusion Model is Secretly a Noise Classifier and Benefits from Contrastive Training
Yunshu Wu, Yingtao Luo, Xianghao Kong et al. · cmu
Diffusion models learn to denoise data and the trained denoiser is then used to generate new samples from the data distribution. In this paper, we revisit the diffusion sampling process and identify a fundamental cause of sample quality degradation: the denoiser is poorly estimated in regions that are far Outside Of the training Distribution (OOD), and the sampling process inevitably evaluates in these OOD regions. This can become problematic for all sampling methods, especially when we move to parallel sampling which requires us to initialize and update the entire sample trajectory of dynamics in parallel, leading to many OOD evaluations. To address this problem, we introduce a new self-supervised training objective that differentiates the levels of noise added to a sample, leading to improved OOD denoising performance. The approach is based on our observation that diffusion models implicitly define a log-likelihood ratio that distinguishes distributions with different amounts of noise, and this expression depends on denoiser performance outside the standard training distribution. We show by diverse experiments that the proposed contrastive diffusion training is effective for both sequential and parallel settings, and it improves the performance and speed of parallel samplers significantly.
LG · Nov 25, 2022
Link Prediction with Non-Contrastive Learning
William Shiao, Zhichun Guo, Tong Zhao et al.
A recent focal area in the space of graph neural networks (GNNs) is graph self-supervised learning (SSL), which aims to derive useful node representations without labeled data. Notably, many state-of-the-art graph SSL methods are contrastive methods, which use a combination of positive and negative samples to learn node representations. Owing to challenges in negative sampling (slowness and model sensitivity), recent literature introduced non-contrastive methods, which instead only use positive samples. Though such methods have shown promising performance in node-level tasks, their suitability for link prediction tasks, which are concerned with predicting link existence between pairs of nodes (and have broad applicability to recommendation systems contexts) is yet unexplored. In this work, we extensively evaluate the performance of existing non-contrastive methods for link prediction in both transductive and inductive settings. While most existing non-contrastive methods perform poorly overall, we find that, surprisingly, BGRL generally performs well in transductive settings. However, it performs poorly in the more realistic inductive settings where the model has to generalize to links to/from unseen nodes. We find that non-contrastive models tend to overfit to the training graph and use this analysis to propose T-BGRL, a novel non-contrastive framework that incorporates cheap corruptions to improve the generalization ability of the model. This simple modification strongly improves inductive performance in 5/6 of our datasets, with up to a 120% improvement in Hits@50--all with comparable speed to other non-contrastive baselines and up to 14x faster than the best-performing contrastive baseline. Our work imparts interesting findings about non-contrastive learning for link prediction and paves the way for future researchers to further expand upon this area.
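The abstract reports results in Hits@50 without defining it. A minimal sketch of the metric follows, assuming the common link-prediction definition (the fraction of positive edges scored strictly above the K-th highest-scored negative edge); the function name `hits_at_k` and the toy scores are illustrative and not taken from the paper.

```python
import numpy as np

def hits_at_k(pos_scores, neg_scores, k=50):
    """Fraction of positive edges scored strictly above the k-th best negative edge.

    Assumed definition; the abstract itself does not spell the metric out.
    """
    pos_scores = np.asarray(pos_scores, dtype=float)
    neg_scores = np.asarray(neg_scores, dtype=float)
    if len(neg_scores) < k:
        # Fewer than k negatives: every positive trivially lands in the top k.
        return 1.0
    # Score of the k-th highest-scored negative edge.
    kth_neg = np.sort(neg_scores)[-k]
    return float(np.mean(pos_scores > kth_neg))

# Toy scores (made up for illustration): 100 negative edges, 4 positive edges.
neg = np.linspace(0.0, 0.99, 100)
pos = np.array([0.995, 0.70, 0.30, 0.999])
print(hits_at_k(pos, neg, k=50))
```

Under this definition a model is rewarded only for ranking true links above most of the sampled non-links, which is why the metric is sensitive to how well a model generalizes to unseen nodes.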
LG · May 25, 2022
MAVIPER: Learning Decision Tree Policies for Interpretable Multi-Agent Reinforcement Learning
Stephanie Milani, Zhicheng Zhang, Nicholay Topin et al.
Many recent breakthroughs in multi-agent reinforcement learning (MARL) require the use of deep neural networks, which are challenging for human experts to interpret and understand. On the other hand, existing work on interpretable reinforcement learning (RL) has shown promise in extracting more interpretable decision tree-based policies from neural networks, but only in the single-agent setting. To fill this gap, we propose the first set of algorithms that extract interpretable decision-tree policies from neural networks trained with MARL. The first algorithm, IVIPER, extends VIPER, a recent method for single-agent interpretable RL, to the multi-agent setting. We demonstrate that IVIPER learns high-quality decision-tree policies for each agent. To better capture coordination between agents, we propose a novel centralized decision-tree training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees, and uses resampling to focus on states that are critical for its interactions with other agents. We show that both algorithms generally outperform the baselines and that MAVIPER-trained agents achieve better-coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments.
LG · Jun 12, 2023
CARL-G: Clustering-Accelerated Representation Learning on Graphs
William Shiao, Uday Singh Saini, Yozen Liu et al.
Self-supervised learning on graphs has made large strides in achieving great performance in various downstream tasks. However, many state-of-the-art methods suffer from a number of impediments, which prevent them from realizing their full potential. For instance, contrastive methods typically require negative sampling, which is often computationally costly. While non-contrastive methods avoid this expensive step, most existing methods rely on either overly complex architectures or dataset-specific augmentations. In this paper, we ask: Can we borrow from classical unsupervised machine learning literature in order to overcome those obstacles? We are guided by our key insight that the goal of distance-based clustering closely resembles that of contrastive learning: both attempt to pull representations of similar items together and push dissimilar items apart. Accordingly, we propose CARL-G, a novel clustering-based framework for graph representation learning that uses a loss inspired by Cluster Validation Indices (CVIs), i.e., internal measures of cluster quality (no ground truth required). CARL-G is adaptable to different clustering methods and CVIs, and we show that with the right choice of clustering method and CVI, CARL-G outperforms node classification baselines on 4/5 datasets with up to a 79x training speedup compared to the best-performing baseline. CARL-G also performs on par with or better than baselines in node clustering and similarity search tasks, training up to 1,500x faster than the best-performing baseline. Finally, we also provide theoretical foundations for the use of CVI-inspired losses in graph representation learning.
LG · Jun 19, 2022
FRAPPE: Fast Rank Approximation with Explainable Features for Tensors
William Shiao, Evangelos E. Papalexakis
Tensor decompositions have proven to be effective in analyzing the structure of multidimensional data. However, most of these methods require a key parameter: the number of desired components. In the case of the CANDECOMP/PARAFAC decomposition (CPD), the ideal value for the number of components is known as the canonical rank and greatly affects the quality of the decomposition results. Existing methods use heuristics or Bayesian methods to estimate this value by repeatedly calculating the CPD, making them extremely computationally expensive. In this work, we propose FRAPPE, the first method to estimate the canonical rank of a tensor without having to compute the CPD. This method is the result of two key ideas. First, it is much cheaper to generate synthetic data with known rank compared to computing the CPD. Second, we can greatly improve the generalization ability and speed of our model by generating synthetic data that matches a given input tensor in terms of size and sparsity. We can then train a specialized single-use regression model on a synthetic set of tensors engineered to match a given input tensor and use that to estimate the canonical rank of the tensor - all without computing the expensive CPD. FRAPPE is over 24 times faster than the best-performing baseline and exhibits a 10% improvement in MAPE on a synthetic dataset. It also performs as well as or better than the baselines on real-world datasets.
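The first key idea above (synthetic tensors of known rank are cheap to generate) can be sketched as follows. This is an illustration under stated assumptions, not the authors' generator: the helper names `synthetic_cp_tensor` and `mape`, the Gaussian factors, and the random sparsity mask are all hypothetical choices. Note that the rank label refers to the dense tensor before masking.

```python
import numpy as np

def synthetic_cp_tensor(shape, rank, density=1.0, seed=0):
    """Build a 3-way tensor as a sum of `rank` random rank-one terms,
    then zero out entries so that roughly `density` of them remain."""
    rng = np.random.default_rng(seed)
    # One random factor matrix per mode, each with `rank` columns.
    A, B, C = (rng.standard_normal((dim, rank)) for dim in shape)
    # Sum over r of the outer products a_r (x) b_r (x) c_r.
    dense = np.einsum("ir,jr,kr->ijk", A, B, C)
    # Random mask to mimic the sparsity of a given input tensor.
    mask = rng.random(shape) < density
    return dense * mask

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))

# A tiny labelled training set: (tensor, known rank) pairs matched in size and density.
train_set = [(synthetic_cp_tensor((6, 5, 4), r, density=0.5, seed=r), r) for r in range(1, 4)]
print(len(train_set), train_set[0][0].shape)
print(mape([2, 4], [3, 4]))  # 25.0
```

A regression model trained on features of such tensors could then be scored with `mape` against the known ranks, which is the sense in which no CPD has to be computed to obtain labels.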
IM · Dec 18, 2025
Graph Neural Networks for Interferometer Simulations
Sidharth Kannan, Pooyan Goodarzi, Evangelos E. Papalexakis et al.
In recent years, graph neural networks (GNNs) have shown tremendous promise in solving problems in high energy physics, materials science, and fluid dynamics. In this work, we introduce a new application for GNNs in the physical sciences: instrumentation design. As a case study, we apply GNNs to simulate models of the Laser Interferometer Gravitational-Wave Observatory (LIGO) and show that they are capable of accurately capturing the complex optical physics at play, while achieving runtimes 815 times faster than state-of-the-art simulation packages. We discuss the unique challenges this problem poses for machine learning models. In addition, we provide a dataset of high-fidelity optical physics simulations for three interferometer topologies, which can be used as a benchmarking suite for future work in this direction.
LG · Feb 18
Discrete Stochastic Localization for Non-autoregressive Generation
Yunshu Wu, Jiayi Cheng, Partha Thakuria et al.
Non-autoregressive (NAR) generation reduces decoding latency by predicting many tokens in parallel, but iterative refinement often suffers from error accumulation and distribution shift under self-generated drafts. Masked diffusion language models (MDLMs) and their remasking samplers (e.g., ReMDM) can be viewed as modern NAR iterative refinement, where generation repeatedly revises a partially observed draft. In this work we show that training alone can substantially improve the step-efficiency of MDLM/ReMDM sampling. We propose DSL (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with ~4x fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.
LG · Feb 17
Extracting and Analyzing Rail Crossing Behavior Signatures from Videos using Tensor Methods
Dawon Ahn, Het Patel, Aemal Khattak et al.
Railway crossings present complex safety challenges where driver behavior varies by location, time, and conditions. Traditional approaches analyze crossings individually, limiting the ability to identify shared behavioral patterns across locations. We propose a multi-view tensor decomposition framework that captures behavioral similarities across three temporal phases: Approach (warning activation to gate lowering), Waiting (gates down to train passage), and Clearance (train passage to gate raising). We analyze railway crossing videos from multiple locations using TimeSformer embeddings to represent each phase. By constructing phase-specific similarity matrices and applying non-negative symmetric CP decomposition, we discover latent behavioral components with distinct temporal signatures. Our tensor analysis reveals that crossing location appears to be a stronger determinant of behavior patterns than time of day, and that approach-phase behavior provides particularly discriminative signatures. Visualization of the learned component space confirms location-based clustering, with certain crossings forming distinct behavioral clusters. This automated framework enables scalable pattern discovery across multiple crossings, providing a foundation for grouping locations by behavioral similarity to inform targeted safety interventions.
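The step of constructing phase-specific similarity matrices and arranging them for decomposition can be sketched as below. This is only an illustration: the abstract does not say which similarity measure is used, so cosine similarity (clipped at zero to keep entries non-negative) is an assumption, as are the helper names and the random stand-in embeddings. The decomposition itself is not shown.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between rows, clipped to be non-negative."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)   # guard against zero vectors
    return np.clip(Xn @ Xn.T, 0.0, None)

def build_similarity_tensor(phase_embeddings):
    """Stack one (videos x videos) similarity matrix per phase into a 3-way tensor."""
    slices = [cosine_similarity_matrix(E) for E in phase_embeddings]
    return np.stack(slices, axis=-1)

# Stand-in data: 5 videos, 8-dimensional embeddings, 3 phases
# (Approach, Waiting, Clearance). Real inputs would be TimeSformer embeddings.
rng = np.random.default_rng(0)
phases = [rng.standard_normal((5, 8)) for _ in range(3)]
tensor = build_similarity_tensor(phases)
print(tensor.shape)  # (5, 5, 3)
```

Each frontal slice is symmetric, which is the structure a symmetric CP model with a shared video factor and a separate phase factor is designed to exploit.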
CR · Feb 12 · Code
Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems. Adversarial users exploit these models through carefully crafted prompts to elicit restricted or unsafe outputs, a phenomenon commonly referred to as Jailbreaking. Despite numerous proposed defense mechanisms, attackers continue to develop adaptive prompting strategies, and existing models remain vulnerable. This motivates approaches that examine the internal behavior of LLMs rather than relying solely on prompt-level defenses. In this work, we study jailbreaking from both security and interpretability perspectives by analyzing how internal representations differ between jailbreak and benign prompts. We conduct a systematic layer-wise analysis across multiple open-source models, including GPT-J, LLaMA, Mistral, and the state-space model Mamba, and identify consistent latent-space patterns associated with harmful inputs. We then propose a tensor-based latent representation framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that the latent signals can be used to actively disrupt jailbreak execution at inference time. On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocks 78% of jailbreak attempts while preserving benign behavior on 94% of benign prompts. This intervention operates entirely at inference time and introduces minimal overhead, providing a scalable foundation for achieving stronger coverage by incorporating additional attack distributions or more refined susceptibility thresholds. Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures and suggest a complementary, architecture-agnostic direction for improving LLM security.
LG · Feb 11
Tensor Methods: A Unified and Interpretable Approach for Material Design
Shaan Pakala, Aldair E. Gongora, Brian Giera et al.
When designing new materials, it is often necessary to tailor the material design (with respect to its design parameters) to have some desired properties (e.g. Young's modulus). As the set of design parameters grows, the search space grows exponentially, making the actual synthesis and evaluation of all material combinations virtually impossible. Even traditional computational methods such as Finite Element Analysis become too computationally heavy to search the design space. Recent methods use machine learning (ML) surrogate models to more efficiently determine optimal material designs; unfortunately, these methods often (i) are notoriously difficult to interpret and (ii) underperform when the training data comes from a non-uniform sampling of the design space. We suggest the use of tensor completion methods as an all-in-one approach for interpretability and predictions. We observe that classical tensor methods are able to compete with traditional ML in predictions, with the added benefit of their interpretable tensor factors (which are given completely for free, as a result of the prediction). In our experiments, we are able to rediscover physical phenomena via the tensor factors, indicating that our predictions are aligned with the true underlying physics of the problem. This also means these tensor factors could be used by experimentalists to identify potentially novel patterns, given we are able to rediscover existing ones. We also study the effects of both types of surrogate models when we encounter training data from a non-uniform sampling of the design space. We observe that more specialized tensor methods can give better generalization in these non-uniform sampling scenarios. We find the best generalization comes from a tensor model, which is able to improve upon the baseline ML methods by up to 5% on aggregate $R^2$, and halve the error in some out-of-distribution regions.
78.3 · LG · Apr 21
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Het Patel, Tiejin Chen, Hua Wei et al.
Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.
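The selective-abstention result (higher accuracy at reduced coverage) can be illustrated with a small sketch. It assumes a per-example score predicting correctness is already available; in the paper that score comes from a probe on three confounded SAE features, whereas here `confidence` is a hypothetical stand-in and the numbers are toy values, not the paper's data.

```python
import numpy as np

def selective_accuracy(confidence, correct, coverage):
    """Accuracy on the most confident `coverage` fraction of examples;
    the model abstains on the rest."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Number of examples the model still answers (at least one).
    n_keep = max(1, int(round(coverage * len(confidence))))
    # Indices of the n_keep highest-confidence examples.
    keep = np.argsort(-confidence)[:n_keep]
    return float(correct[keep].mean())

# Toy data: six predictions with a correctness score and ground-truth correctness.
confidence = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
correct = [1, 1, 0, 1, 0, 0]
print(selective_accuracy(confidence, correct, coverage=1.0))  # 0.5
print(selective_accuracy(confidence, correct, coverage=0.5))
```

Sweeping `coverage` from 1.0 downward traces the accuracy-coverage trade-off that the abstract summarizes with a single operating point.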
75.0 · LG · May 13
Discrete Stochastic Localization for Non-autoregressive Generation
Yunshu Wu, Jiayi Cheng, Longxuan Yu et al.
Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from T=128 to T=1024, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as T=48 total steps -- without distillation or retraining.
88.8 · LG · Apr 9
Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition
Tiejin Chen, Huaiyuan Yao, Jia Chen et al.
While Large Language Model-based Multi-Agent Systems (MAS) consistently outperform single-agent systems on complex tasks, their intricate interactions introduce critical reliability challenges arising from communication dynamics and role dependencies. Existing Uncertainty Quantification methods, typically designed for single-turn outputs, fail to address the unique complexities of the MAS. Specifically, these methods struggle with three distinct challenges: the cascading uncertainty in multi-step reasoning, the variability of inter-agent communication paths, and the diversity of communication topologies. To bridge this gap, we introduce MATU, a novel framework that quantifies uncertainty through tensor decomposition. MATU moves beyond analyzing final text outputs by representing entire reasoning trajectories as embedding matrices and organizing multiple execution runs into a higher-order tensor. By applying tensor decomposition, we disentangle and quantify distinct sources of uncertainty, offering a comprehensive reliability measure that is generalizable across different agent structures. We provide comprehensive experiments to show that MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies.
LG · Oct 22, 2025 · Code
Preliminary Use of Vision Language Model Driven Extraction of Mouse Behavior Towards Understanding Fear Expression
Paimon Goulart, Jordan Steinhauser, Kylene Shuler et al.
Integration of diverse data will be a pivotal step towards improving scientific explorations in many disciplines. This work establishes a vision-language model (VLM) that encodes videos with text input in order to classify various behaviors of a mouse existing in and engaging with its environment. Importantly, this model produces a behavioral vector over time for each subject and for each session the subject undergoes. The output is a valuable dataset that few programs are able to produce with comparably high accuracy and minimal user input. Specifically, we use the open-source Qwen2.5-VL model and enhance its performance through prompts, in-context learning (ICL) with labeled examples, and frame-level preprocessing. We found that each of these methods contributes to improved classification, and that combining them results in strong F1 scores across all behaviors, including rare classes like freezing and fleeing, without any model fine-tuning. Overall, this model will support interdisciplinary researchers studying mouse behavior by enabling them to integrate diverse behavioral features, measured across multiple time points and environments, into a comprehensive dataset that can address complex research questions.
CL · Oct 8, 2025 · Code
Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
Jailbreaking large language models (LLMs) has emerged as a pressing concern with the increasing prevalence and accessibility of conversational LLMs. Adversarial users often exploit these models through carefully engineered prompts to elicit restricted or sensitive outputs, a strategy widely referred to as jailbreaking. While numerous defense mechanisms have been proposed, attackers continuously develop novel prompting techniques, and no existing model can be considered fully resistant. In this study, we investigate the jailbreak phenomenon by examining the internal representations of LLMs, with a focus on how hidden layers respond to jailbreak versus benign prompts. Specifically, we analyze the open-source LLM GPT-J and the state-space model Mamba2, presenting preliminary findings that highlight distinct layer-wise behaviors. Our results suggest promising directions for further research on leveraging internal model dynamics for robust jailbreak detection and defense.
LG · Aug 20, 2025 · Code
Multi-view Graph Condensation via Tensor Decomposition
Nícolas Roque dos Santos, Dawon Ahn, Diego Minatel et al.
Graph Neural Networks (GNNs) have demonstrated remarkable results in various real-world applications, including drug discovery, object detection, social media analysis, recommender systems, and text classification. In contrast to their vast potential, training them on large-scale graphs presents significant computational challenges due to the resources required for their storage and processing. Graph Condensation has emerged as a promising solution to reduce these demands by learning a synthetic compact graph that preserves the essential information of the original one while maintaining the GNN's predictive performance. Despite their efficacy, current graph condensation approaches frequently rely on a computationally intensive bi-level optimization. Moreover, they fail to maintain a mapping between synthetic and original nodes, limiting the interpretability of the model's decisions. In this sense, a wide range of decomposition techniques have been applied to learn linear or multi-linear functions from graph data, offering a more transparent and less resource-intensive alternative. However, their applicability to graph condensation remains unexplored. This paper addresses this gap and proposes a novel method called Multi-view Graph Condensation via Tensor Decomposition (GCTD) to investigate to what extent such techniques can synthesize an informative smaller graph and achieve comparable downstream task performance. Extensive experiments on six real-world datasets demonstrate that GCTD effectively reduces graph size while preserving GNN performance, achieving up to a 4.0% improvement in accuracy on three out of six datasets and competitive performance on large graphs compared to existing approaches. Our code is available at https://anonymous.4open.science/r/gctd-345A.
LG · Feb 19
Transforming Behavioral Neuroscience Discovery with In-Context Learning and AI-Enhanced Tensor Methods
Paimon Goulart, Jordan Steinhauser, Dawon Ahn et al.
Scientific discovery pipelines typically involve complex, rigid, and time-consuming processes, from data preparation to analyzing and interpreting findings. Recent advances in AI have the potential to transform such pipelines in a way that domain experts can focus on interpreting and understanding findings, rather than debugging rigid pipelines or manually annotating data. As part of an active collaboration between data science/AI researchers and behavioral neuroscientists, we showcase an example AI-enhanced pipeline, specifically designed to transform and accelerate the way that the domain experts in the team are able to gain insights from experimental data. The application at hand is in the domain of behavioral neuroscience, studying fear generalization in mice, an important problem whose progress can advance our understanding of clinically significant and often debilitating conditions such as PTSD (Post-Traumatic Stress Disorder). We identify the emerging paradigm of "In-Context Learning" (ICL) as a suitable interface for domain experts to automate parts of their pipeline without the need for or familiarity with AI model training and fine-tuning, and showcase its remarkable efficacy in data preparation and pattern interpretation. We also introduce novel AI enhancements to the tensor decomposition model, which allow for more seamless pattern discovery from the heterogeneous data in our application. We thoroughly evaluate our proposed pipeline experimentally, showcasing its superior performance compared to what is standard practice in the domain, as well as against reasonable ML baselines that do not fall under the ICL paradigm, to ensure that we are not compromising performance in our quest for a seamless and easy-to-use interface for domain experts. Finally, we demonstrate effective discovery, with results validated by the domain experts in the team.
CL · Mar 12, 2024
GPT-generated Text Detection: Benchmark Dataset and Tensor-based Detection Method
Zubair Qazi, William Shiao, Evangelos E. Papalexakis
As natural language models like ChatGPT become increasingly prevalent in applications and services, the need for robust and accurate methods to detect their output is of paramount importance. In this paper, we present GPT Reddit Dataset (GRiD), a novel Generative Pretrained Transformer (GPT)-generated text detection dataset designed to assess the performance of detection models in identifying generated responses from ChatGPT. The dataset consists of a diverse collection of context-prompt pairs based on Reddit, with human-generated and ChatGPT-generated responses. We provide an analysis of the dataset's characteristics, including linguistic diversity, context complexity, and response quality. To showcase the dataset's utility, we benchmark several detection methods on it, demonstrating their efficacy in distinguishing between human and ChatGPT-generated responses. This dataset serves as a resource for evaluating and advancing detection techniques in the context of ChatGPT and contributes to the ongoing efforts to ensure responsible and trustworthy AI-driven communication on the internet. Finally, we propose GpTen, a novel tensor-based GPT text detection method that is semi-supervised in nature since it only has access to human-generated text and performs on par with fully-supervised baselines.
CL · Aug 5, 2025
CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
The widespread use of Large Language Models (LLMs) in many applications marks a significant advance in research and practice. However, their complexity and hard-to-understand nature make them vulnerable to attacks, especially jailbreaks designed to produce harmful responses. To counter these threats, developing strong detection methods is essential for the safe and reliable use of LLMs. This paper studies this detection problem using the Contextual Co-occurrence Matrix, a structure recognized for its efficacy in data-scarce environments. We propose a novel method leveraging the latent space characteristics of Contextual Co-occurrence Matrices and Tensors for the effective identification of adversarial and jailbreak prompts. Our evaluations show that this approach achieves a notable F1 score of 0.83 using only 0.5% of labeled prompts, which is a 96.6% improvement over baselines. This result highlights the strength of our learned patterns, especially when labeled data is scarce. Our method is also significantly faster, with speedups ranging from 2.3x to 128.4x over the baseline models.
CL · Mar 4, 2025
ExpertGenQA: Open-ended QA generation in Specialized Domains
Haz Sameen Shahgir, Chansong Lim, Jia Chen et al.
Generating high-quality question-answer pairs for specialized technical domains remains challenging, with existing approaches facing a tradeoff between leveraging expert examples and achieving topical diversity. We present ExpertGenQA, a protocol that combines few-shot learning with structured topic and style categorization to generate comprehensive domain-specific QA pairs. Using U.S. Federal Railroad Administration documents as a test bed, we demonstrate that ExpertGenQA achieves twice the efficiency of baseline few-shot approaches while maintaining 94.4% topic coverage. Through systematic evaluation, we show that current LLM-based judges and reward models exhibit strong bias toward superficial writing styles rather than content quality. Our analysis using Bloom's Taxonomy reveals that ExpertGenQA better preserves the cognitive complexity distribution of expert-written questions compared to template-based approaches. When used to train retrieval models, our generated queries improve top-1 accuracy by 13.02% over baseline performance, demonstrating their effectiveness for downstream applications in technical domains.
LG · Oct 8, 2025
Surrogate Modeling for the Design of Optimal Lattice Structures using Tensor Completion
Shaan Pakala, Aldair E. Gongora, Brian Giera et al.
When designing new materials, it is often necessary to design a material with specific desired properties. Unfortunately, as new design variables are added, the search space grows exponentially, which makes synthesizing and validating the properties of each material very impractical and time-consuming. In this work, we focus on the design of optimal lattice structures with regard to mechanical performance. Computational approaches, including the use of machine learning (ML) methods, have shown improved success in accelerating materials design. However, these ML methods are still lacking in scenarios when training data (i.e. experimentally validated materials) come from a non-uniformly random sampling across the design space. For example, an experimentalist might synthesize and validate certain materials more frequently because of convenience. For this reason, we suggest the use of tensor completion as a surrogate model to accelerate the design of materials in these atypical supervised learning scenarios. In our experiments, we show that tensor completion is superior to classic ML methods such as Gaussian Process and XGBoost with biased sampling of the search space, with around 5% increased $R^2$. Furthermore, tensor completion still gives comparable performance with a uniformly random sampling of the entire search space.
CV · Sep 19, 2025
Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks · Het Patel, Muzammil Allie, Qian Zhang et al.
Vision language models (VLMs) excel in multimodal understanding but are prone to adversarial attacks. Existing defenses often demand costly retraining or significant architecture changes. We introduce a lightweight defense using tensor decomposition suitable for any pre-trained VLM, requiring no retraining. By decomposing and reconstructing vision encoder representations, it filters adversarial noise while preserving meaning. Experiments with CLIP on COCO and Flickr30K show improved robustness. On Flickr30K, it restores 12.3\% performance lost to attacks, raising Recall@1 accuracy from 7.5\% to 19.8\%. On COCO, it recovers 8.1\% performance, improving accuracy from 3.8\% to 11.9\%. Analysis shows Tensor Train decomposition with low rank (8-32) and low residual strength ($α=0.1-0.2$) is optimal. This method is a practical, plug-and-play solution with minimal overhead for existing VLMs.
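The decompose-and-reconstruct idea can be illustrated with a minimal sketch. It is not the authors' implementation: a truncated SVD stands in for the Tensor Train decomposition, and the blending rule `recon + alpha * (feats - recon)` is only an assumed reading of "residual strength". The function name `lowrank_filter` and its parameters are hypothetical.

```python
import numpy as np

def lowrank_filter(feats, rank=8, alpha=0.1):
    """Low-rank reconstruction of a (tokens x dim) feature matrix.

    Assumption: truncated SVD replaces Tensor Train, and `alpha` blends
    a fraction of the discarded residual back into the reconstruction.
    """
    U, s, Vt = np.linalg.svd(feats, full_matrices=False)
    # Keep only the top-`rank` singular directions.
    recon = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # alpha = 0 returns the pure low-rank part; alpha = 1 returns the input.
    return recon + alpha * (feats - recon)
```

With `alpha=0` the output has rank at most `rank`; with `alpha=1` the input passes through unchanged, so `alpha` trades filtering strength against fidelity to the original features.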
LG · Jul 28, 2025
Improving Group Fairness in Tensor Completion via Imbalance Mitigating Entity Augmentation · Dawon Ahn, Jun-Gi Jang, Evangelos E. Papalexakis
Group fairness is important to consider in tensor decomposition to prevent discrimination based on social grounds such as gender or age. Although a few works have studied group fairness in tensor decomposition, they suffer from performance degradation. To address this, we propose STAFF (Sparse Tensor Augmentation For Fairness) to improve group fairness by minimizing the gap in completion errors of different groups while reducing the overall tensor completion error. Our main idea is to augment a tensor with augmented entities including sufficient observed entries to mitigate imbalance and group bias in the sparse tensor. We evaluate STAFF on tensor completion with various datasets under conventional and deep learning-based tensor models. STAFF consistently shows the best trade-off between completion error and group fairness; at most, it yields 36% lower MSE and 59% lower MADE than the second-best baseline.
LG · Mar 26, 2025
Global and Local Structure Learning for Sparse Tensor Completion · Dawon Ahn, Evangelos E. Papalexakis
How can we accurately complete tensors by learning relationships of dimensions along each mode? Tensor completion, a widely studied problem, is to predict missing entries in incomplete tensors. Tensor decomposition methods, fundamental tensor analysis tools, have been actively developed to solve tensor completion tasks. However, standard tensor decomposition models have not been designed to learn relationships of dimensions along each mode, which limits the accuracy of tensor completion. Also, previously developed tensor decomposition models have required prior knowledge of the relations within dimensions in order to model them, and this knowledge is expensive to obtain. This paper proposes TGL (Tensor Decomposition Learning Global and Local Structures) to accurately predict missing entries in tensors. TGL reconstructs a tensor with factor matrices which learn local structures with a GNN without prior knowledge. Extensive experiments are conducted to evaluate TGL with baselines and datasets.
LG · Jan 30, 2025
ACTGNN: Assessment of Clustering Tendency with Synthetically-Trained Graph Neural Networks · Yiran Luo, Evangelos E. Papalexakis
Determining clustering tendency in datasets is a fundamental but challenging task, especially in noisy or high-dimensional settings where traditional methods, such as the Hopkins Statistic and Visual Assessment of Tendency (VAT), often struggle to produce reliable results. In this paper, we propose ACTGNN, a graph-based framework designed to assess clustering tendency by leveraging graph representations of data. Node features are constructed using Locality-Sensitive Hashing (LSH), which captures local neighborhood information, while edge features incorporate multiple similarity metrics, such as the Radial Basis Function (RBF) kernel, to model pairwise relationships. A Graph Neural Network (GNN) is trained exclusively on synthetic datasets, enabling robust learning of clustering structures under controlled conditions. Extensive experiments demonstrate that ACTGNN significantly outperforms baseline methods on both synthetic and real-world datasets, exhibiting superior performance in detecting faint clustering structures, even in high-dimensional or noisy data. Our results highlight the generalizability and effectiveness of the proposed approach, making it a promising tool for robust clustering tendency assessment.
LG · Jan 20, 2025
Multi-View Spectral Clustering for Graphs with Multiple View Structures · Yorgos Tsitsikas, Evangelos E. Papalexakis
Despite the fundamental importance of clustering, to this day, much of the relevant research is still based on ambiguous foundations, leading to an unclear understanding of whether or how the various clustering methods are connected with each other. In this work, we provide an additional stepping stone towards resolving such ambiguities by presenting a general clustering framework that subsumes a series of seemingly disparate clustering methods, including various methods belonging to the widely popular spectral clustering framework. In fact, the generality of the proposed framework is additionally capable of shedding light on the largely unexplored area of multi-view graphs where each view may have differently clustered nodes. In turn, we propose GenClus: a method that is simultaneously an instance of this framework and a generalization of spectral clustering, while also being closely related to k-means. This results in a principled alternative to the few existing methods studying this special type of multi-view graphs. Then, we conduct in-depth experiments, which demonstrate that GenClus is more computationally efficient than existing methods, while also attaining similar or better clustering performance. Lastly, a qualitative real-world case study further demonstrates the ability of GenClus to produce meaningful clusterings.
LG · Nov 24, 2024
Can a Large Language Model Learn Matrix Functions In Context? · Paimon Goulart, Evangelos E. Papalexakis
Large Language Models (LLMs) have demonstrated the ability to solve complex tasks through In-Context Learning (ICL), where models learn from a few input-output pairs without explicit fine-tuning. In this paper, we explore the capacity of LLMs to solve non-linear numerical computations, with specific emphasis on functions of the Singular Value Decomposition. Our experiments show that while LLMs perform comparably to traditional models such as Stochastic Gradient Descent (SGD) based Linear Regression and Neural Networks (NN) for simpler tasks, they outperform these models on more complex tasks, particularly in the case of top-k Singular Values. Furthermore, LLMs demonstrate strong scalability, maintaining high accuracy even as the matrix size increases. Additionally, we found that LLMs can achieve high accuracy with minimal prior examples, converging quickly and avoiding the overfitting seen in classical models. These results suggest that LLMs could provide an efficient alternative to classical methods for solving high-dimensional problems. Future work will focus on extending these findings to larger matrices and more complex matrix operations while exploring the effect of using different numerical representations in ICL.
CL · Jun 25, 2024
TRAWL: Tensor Reduced and Approximated Weights for Large Language Models · Yiran Luo, Het Patel, Yu Fu et al.
Recent research has shown that pruning large-scale language models for inference is an effective approach to improving model efficiency, significantly reducing model weights with minimal impact on performance. Interestingly, pruning can sometimes even enhance accuracy by removing noise that accumulates during training, particularly through matrix decompositions. However, recent work has primarily focused on single matrix decompositions or lower precision techniques, which may fail to fully capture structural patterns. To address these limitations, we introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns. Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
CV · Aug 15, 2021
Deepfake Representation with Multilinear Regression · Sara Abdali, M. Alex O. Vasilescu, Evangelos E. Papalexakis
Generative neural network architectures such as GANs may be used to generate synthetic instances to compensate for the lack of real data. However, they may also be employed to create media that may cause social, political, or economic upheaval. One such emerging type of media is the "Deepfake". Techniques that can discriminate between such media are indispensable. In this paper, we propose a modified multilinear (tensor) method, a combination of linear and multilinear regressions for representing fake and real data. We test our approach by representing Deepfakes with our modified multilinear (tensor) approach and perform SVM classification with encouraging results.
LG · Jul 2, 2021
Subspace Clustering Based Analysis of Neural Networks · Uday Singh Saini, Pravallika Devineni, Evangelos E. Papalexakis
Tools to analyze the latent space of deep neural networks provide a step towards better understanding them. In this work, we motivate sparse subspace clustering (SSC) with an aim to learn affinity graphs from the latent structure of a given neural network layer trained over a set of inputs. We then use tools from Community Detection to quantify structures present in the input. These experiments reveal that as we go deeper in a network, inputs tend to have an increasing affinity to other inputs of the same class. Subsequently, we utilise matrix similarity measures to perform layer-wise comparisons between affinity graphs. In doing so we first demonstrate that when comparing a given layer currently under training to its final state, shallower layers of the network converge more quickly than deeper layers. When performing a pairwise analysis of the entire network architecture, we observe that, as the network increases in size, it reorganises from a state where each layer is moderately similar to its neighbours, to a state where layers within a block are more similar to each other than to layers in other blocks. Finally, we analyze the learned affinity graphs of the final convolutional layer of the network and demonstrate how an input's local neighbourhood affects its classification by the network.
LG · Feb 15, 2021
KNH: Multi-View Modeling with K-Nearest Hyperplanes Graph for Misinformation Detection · Sara Abdali, Neil Shah, Evangelos E. Papalexakis
Graphs are one of the most efficacious structures for representing datapoints and their relations, and they have been largely exploited for different applications. Previously, the higher-order relations between the nodes have been modeled by a generalization of graphs known as hypergraphs. In hypergraphs, the edges, i.e., hyperedges, are defined by sets of nodes to capture the higher-order relationships between the data. However, there is no explicit higher-order generalization for the nodes themselves. In this work, we introduce a novel generalization of graphs, the K-Nearest Hyperplanes graph (KNH), where the nodes are defined by higher-order Euclidean subspaces for multi-view modeling of the nodes. In fact, in KNH, nodes are hyperplanes, or more precisely m-flats, instead of datapoints. We experimentally evaluate the KNH graph on two multi-aspect datasets for misinformation detection. The experimental results suggest that multi-view modeling of articles using the KNH graph outperforms the classic KNN graph in terms of classification performance.
LG · Feb 15, 2021
Identifying Misinformation from Website Screenshots · Sara Abdali, Rutuja Gurav, Siddharth Menon et al.
Can the look and the feel of a website give information about the trustworthiness of an article? In this paper, we propose to use a promising, yet neglected aspect in detecting misinformativeness: the overall look of the domain webpage. To capture this overall look, we take screenshots of news articles served by either misinformative or trustworthy web domains and leverage a tensor decomposition based semi-supervised classification technique. The proposed approach, VizFake, is insensitive to a number of image transformations such as converting the image to grayscale, vectorizing the image and losing some parts of the screenshots. VizFake leverages a very small amount of known labels, mirroring realistic and practical scenarios, where labels (especially for known misinformative articles) are scarce and quickly become dated. The F1 score of VizFake on a dataset of 50k screenshots of news articles spanning more than 500 domains is roughly 85% using only 5% of ground truth labels. Furthermore, tensor representations of VizFake, obtained in an unsupervised manner, allow for exploratory analysis of the data that provides valuable insights into the problem. Finally, we compare VizFake with deep transfer learning, since it is a very popular black-box approach for image classification, and also with well-known text-based methods. VizFake achieves competitive accuracy with deep transfer learning models while being two orders of magnitude faster and not requiring laborious hyper-parameter tuning.
LG · Dec 23, 2020
Analyzing Representations inside Convolutional Neural Networks · Uday Singh Saini, Evangelos E. Papalexakis
How can we discover and succinctly summarize the concepts that a neural network has learned? Such a task is of great importance in applications of networks in areas of inference that involve classification, like medical diagnosis based on fMRI/x-ray etc. In this work, we propose a framework to categorize the concepts a network learns based on the way it clusters a set of input examples, clusters neurons based on the examples they activate for, and input features all in the same latent space. This framework is unsupervised and can work without any labels for input features; it only needs access to internal activations of the network for each input example, which makes it widely applicable. We extensively evaluate the proposed method and demonstrate that it produces human-understandable and coherent concepts that a ResNet-18 has learned on the CIFAR-100 dataset.
IR · Nov 14, 2020
RecTen: A Recursive Hierarchical Low Rank Tensor Factorization Method to Discover Hierarchical Patterns in Multi-modal Data · Risul Islam, Md Omar Faruk Rokon, Evangelos E. Papalexakis et al.
How can we expand the tensor decomposition to reveal a hierarchical structure of the multi-modal data in a self-adaptive way? Current tensor decomposition provides only a single layer of clusters. We argue that with the abundance of multimodal data and time-evolving networks nowadays, the ability to identify emerging hierarchies is important. To this effect, we propose RecTen, a recursive hierarchical soft clustering approach based on tensor decomposition. Our approach enables us to: (a) recursively decompose clusters identified in the previous step, and (b) identify the right conditions for terminating this process. In the absence of proper ground truth, we evaluate our approach with synthetic data and test its sensitivity to different parameters. We also apply RecTen on five real datasets which involve the activities of users in online discussion platforms, such as security forums. This analysis helps us reveal clusters of users with interesting behaviors, including but not limited to early detection of some real events like ransomware outbreaks, the emergence of a blackmarket of decryption tools, and romance scamming. To maximize the usefulness of our approach, we develop a tool which can help the data analysts and overall research community by identifying hierarchical structures. RecTen is an unsupervised approach which can be used to take the pulse of the large multi-modal data and let the data discover its own hidden structures by itself.
CR · Nov 14, 2020
TenFor: A Tensor-Based Tool to Extract Interesting Events from Security Forums · Risul Islam, Md Omar Faruk Rokon, Evangelos E. Papalexakis et al.
How can we get a security forum to "tell" us its activities and events of interest? We take a unique angle: we want to identify these activities without any a priori knowledge, which is a key difference compared to most of the previous problem formulations. Despite some recent efforts, mining security forums to extract useful information has received relatively little attention, while most of them are usually searching for specific information. We propose TenFor, an unsupervised tensor-based approach, to systematically identify important events in a three-dimensional space: (a) user, (b) thread, and (c) time. Our method consists of three high-level steps: (a) a tensor-based clustering across the three dimensions, (b) an extensive cluster profiling that uses both content and behavioral features, and (c) a deeper investigation, where we identify key users and threads within the events of interest. In addition, we implement our approach as a powerful and easy-to-use platform for practitioners. In our evaluation, we find that 83% of our clusters capture meaningful events and we find more meaningful clusters compared to previous approaches. Our approach and our platform constitute an important step towards detecting activities of interest from a forum in an unsupervised learning fashion in practice.
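The three-dimensional (user, thread, time) space can be made concrete with a small sketch of how such a tensor might be assembled from post records. This is an illustrative assumption, not TenFor's actual pipeline: the record layout `(user, thread, timestamp)`, the fixed-width time binning, and the name `build_activity_tensor` are all hypothetical.

```python
import numpy as np

def build_activity_tensor(posts, bin_width):
    """Count posts per (user, thread, time bin).

    posts: list of (user, thread, timestamp) tuples with numeric timestamps.
    bin_width: width of one time bin, in the same unit as the timestamps.
    """
    users = sorted({u for u, _, _ in posts})
    threads = sorted({t for _, t, _ in posts})
    u_idx = {u: i for i, u in enumerate(users)}
    t_idx = {t: i for i, t in enumerate(threads)}
    t0 = min(ts for _, _, ts in posts)
    # Map each timestamp to a zero-based bin index.
    bins = [int((ts - t0) // bin_width) for _, _, ts in posts]
    X = np.zeros((len(users), len(threads), max(bins) + 1))
    for (u, t, _), b in zip(posts, bins):
        X[u_idx[u], t_idx[t], b] += 1
    return X, users, threads
```

A decomposition of `X` then yields components whose user, thread, and time factors can be read jointly as candidate clusters.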
LG · Aug 17, 2020
Ensemble Node Embeddings using Tensor Decomposition: A Case-Study on DeepWalk · Jia Chen, Evangelos E. Papalexakis
Node embeddings have been attracting increasing attention during the past years. In this context, we propose a new ensemble node embedding approach, called TenSemble2Vec, by first generating multiple embeddings using the existing techniques and taking them as multiview data input of the state-of-the-art tensor decomposition model, namely PARAFAC2, to learn the shared lower-dimensional representations of the nodes. Contrary to other embedding methods, our TenSemble2Vec takes advantage of the complementary information from different methods or the same method with different hyper-parameters, which bypasses the challenge of choosing models. Extensive tests using real-world data validate the efficiency of the proposed method.
SI · May 8, 2020
Semi-Supervised Multi-aspect Detection of Misinformation using Hierarchical Joint Decomposition · Sara Abdali, Neil Shah, Evangelos E. Papalexakis
Distinguishing between misinformation and real information is one of the most challenging problems in today's interconnected world. The vast majority of the state-of-the-art in detecting misinformation is fully supervised, requiring a large number of high-quality human annotations. However, the availability of such annotations cannot be taken for granted, since it is very costly, time-consuming, and challenging to produce them in a way that keeps up with the proliferation of misinformation. In this work, we are interested in exploring scenarios where the number of annotations is limited. In such scenarios, we investigate how tapping into a diverse set of resources that characterize a news article, henceforth referred to as "aspects", can compensate for the lack of labels. In particular, our contributions in this paper are twofold: 1) We propose the use of three different aspects: article content, context of social sharing behaviors, and host website/domain features, and 2) We introduce a principled tensor-based embedding framework that combines all those aspects effectively. We propose HiJoD, a 2-level decomposition pipeline which not only outperforms state-of-the-art methods with F1-scores of 74% and 81% on Twitter and Politifact datasets respectively but also is an order of magnitude faster than similar ensemble approaches.
LG · Feb 18, 2020
TensorShield: Tensor-based Defense Against Adversarial Attacks on Images · Negin Entezari, Evangelos E. Papalexakis
Recent studies have demonstrated that machine learning approaches like deep neural networks (DNNs) are easily fooled by adversarial attacks. Subtle and imperceptible perturbations of the data are able to change the result of deep neural networks. Leveraging vulnerable machine learning methods raises many concerns, especially in domains where security is an important factor. Therefore, it is crucial to design defense mechanisms against adversarial attacks. For the task of image classification, unnoticeable perturbations mostly occur in the high-frequency spectrum of the image. In this paper, we utilize tensor decomposition techniques as a preprocessing step to find a low-rank approximation of images which can significantly discard high-frequency perturbations. Recently, a defense framework called Shield was shown to "vaccinate" Convolutional Neural Networks (CNNs) against adversarial examples by performing random-quality JPEG compressions on local patches of images on the ImageNet dataset. Our tensor-based defense mechanism outperforms the SLQ method from Shield by 14% against Fast Gradient Sign Method (FGSM) adversarial attacks, while maintaining comparable speed.
CL · Jan 8, 2020
REST: A Thread Embedding Approach for Identifying and Classifying User-specified Information in Security Forums · Joobin Gharibshah, Evangelos E. Papalexakis, Michalis Faloutsos
How can we extract useful information from a security forum? We focus on identifying threads of interest to a security professional: (a) alerts of worrisome events, such as attacks, (b) offering of malicious services and products, (c) hacking information to perform malicious acts, and (d) useful security-related experiences. The analysis of security forums is in its infancy despite several promising recent works. Novel approaches are needed to address the challenges in this domain: (a) the difficulty in specifying the "topics" of interest efficiently, and (b) the unstructured and informal nature of the text. We propose REST, a systematic methodology to: (a) identify threads of interest based on a, possibly incomplete, bag of words, and (b) classify them into one of the four classes above. The key novelty of the work is a multi-step weighted embedding approach: we project words, threads and classes in appropriate embedding spaces and establish relevance and similarity there. We evaluate our method with real data from three security forums with a total of 164K posts and 21K threads. First, REST is robust to the initial keyword selection: it can extend the user-provided keyword set and thus recover from missing keywords. Second, REST categorizes the threads into the classes of interest with superior accuracy compared to five other methods: REST exhibits an accuracy between 63.3% and 76.9%. We see our approach as a first step for harnessing the wealth of information of online forums in a user-friendly way, since the user can loosely specify her keywords of interest.
LG · Dec 19, 2019
Adaptive Granularity in Tensors: A Quest for Interpretable Structure · Ravdeep Pasricha, Ekta Gujral, Evangelos E. Papalexakis
Data collected at very frequent intervals is usually extremely sparse and has no structure that is exploitable by modern tensor decomposition algorithms. Thus the utility of such tensors is low, in terms of the amount of interpretable and exploitable structure that one can extract from them. In this paper, we introduce the problem of finding a tensor of adaptive aggregated granularity that can be decomposed to reveal meaningful latent concepts (structures) from datasets that, in their original form, are not amenable to tensor analysis. Such datasets fall under the broad category of sparse point processes that evolve over space and/or time. To the best of our knowledge, this is the first work that explores adaptive granularity aggregation in tensors. Furthermore, we formally define the problem, discuss what different definitions of "good structure" can be in practice, and show that the optimal solution is of prohibitive combinatorial complexity. Subsequently, we propose an efficient and effective greedy algorithm called IceBreaker, which follows a number of intuitive decision criteria that locally maximize the "goodness of structure", resulting in high-quality tensors. We evaluate our method on synthetic, semi-synthetic and real datasets. In all the cases, our proposed method constructs tensors that have very high structure quality.
LG · Nov 18, 2018
The core consistency of a compressed tensor · Georgios Tsitsikas, Evangelos E. Papalexakis
Tensor decomposition on big data has attracted significant attention recently. Among the most popular methods is a class of algorithms that leverages compression in order to reduce the size of the tensor and potentially parallelize computations. A fundamental requirement for such methods to work properly is that the low-rank tensor structure is retained upon compression. In the absence of efficient and realistic means of computing and studying the effects of compression on the low rank of a tensor, we study the effects of compression on the core consistency, a widely used heuristic that serves as a proxy for estimating that low rank. We provide theoretical analysis, where we identify sufficient conditions for the compression such that the core consistency is preserved, and we conduct extensive experiments that validate our analysis. Further, we explore popular compression schemes and how they affect the core consistency.
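For readers unfamiliar with the heuristic, the core consistency diagnostic is commonly computed by comparing the least-squares Tucker core implied by a CP model against the ideal superdiagonal core. The sketch below follows that standard definition for a three-way tensor; it assumes factor matrices with full column rank and component weights absorbed into the factors, and the function name is hypothetical.

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core consistency (in percent) of a rank-R CP model (A, B, C) of X."""
    R = A.shape[1]
    # Least-squares Tucker core: multiply each mode by the factor's pseudoinverse.
    G = np.einsum(
        "ijk,pi,qj,rk->pqr",
        X, np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C),
    )
    # Ideal core of a perfect CP model: ones on the superdiagonal, zeros elsewhere.
    T = np.zeros((R, R, R))
    idx = np.arange(R)
    T[idx, idx, idx] = 1.0
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / R)
```

A tensor that is exactly described by the given factors scores 100; values that drop sharply as the rank grows are usually taken as a sign that the model has too many components.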
LG · Nov 5, 2018
Representation Learning by Reconstructing Neighborhoods · Chin-Chia Michael Yeh, Yan Zhu, Evangelos E. Papalexakis et al.
Since its introduction, unsupervised representation learning has attracted a lot of attention from the research community, as it is demonstrated to be highly effective and easy-to-apply in tasks such as dimension reduction, clustering, visualization, information retrieval, and semi-supervised learning. In this work, we propose a novel unsupervised representation learning framework called neighbor-encoder, in which domain knowledge can be easily incorporated into the learning process without modifying the general encoder-decoder architecture of the classic autoencoder. In contrast to an autoencoder, which reconstructs the input data itself, neighbor-encoder reconstructs the input data's neighbors. As the proposed representation learning problem is essentially a neighbor reconstruction problem, domain knowledge can be easily incorporated in the form of an appropriate definition of similarity between objects. Based on that observation, our framework can leverage any off-the-shelf similarity search algorithms or side information to find the neighbor of an input object. Applications of other algorithms (e.g., association rule mining) in our framework are also possible, given that the appropriate definition of neighbor can vary in different contexts. We have demonstrated the effectiveness of our framework in many diverse domains, including images, text, and time series, and for various data mining tasks including classification, clustering, and visualization. Experimental results show that neighbor-encoder not only outperforms autoencoder in most of the scenarios we consider, but also achieves state-of-the-art performance on text document clustering.
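The difference from an autoencoder lies only in the reconstruction target, which a short sketch can make explicit. The code below builds (input, target) pairs where each target is the input's nearest neighbor; Euclidean distance is an assumed similarity here (the abstract leaves the definition of "neighbor" open), and `neighbor_pairs` is a hypothetical helper, not part of the authors' code.

```python
import numpy as np

def neighbor_pairs(X):
    """Return (inputs, targets, nn_index) for neighbor reconstruction.

    An autoencoder would train on (X, X); here the target of each row
    is its nearest other row under Euclidean distance.
    """
    # Pairwise squared distances between all rows.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Exclude each point from being its own neighbor.
    np.fill_diagonal(d2, np.inf)
    nn = np.argmin(d2, axis=1)
    return X, X[nn], nn
```

Any encoder-decoder model can then be trained on these pairs in place of the usual identity targets; swapping in a different similarity search changes only how `nn` is computed.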
MM · Aug 23, 2018
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval · Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis et al.
Cross-modal retrieval between visual data and natural language description remains a long-standing challenge in multimedia. While recent image-text retrieval methods offer great promise by learning deep representations aligned across modalities, most of these methods are plagued by the issue of training with small-scale datasets covering a limited number of images with ground-truth sentences. Moreover, it is extremely expensive to create a larger dataset by annotating millions of images with sentences and may lead to a biased model. Inspired by the recent success of webly supervised learning in deep neural networks, we capitalize on readily-available web images with noisy annotations to learn robust image-text joint representation. Specifically, our main idea is to leverage web images and corresponding tags, along with fully annotated datasets, in training for learning the visual-semantic joint embedding. We propose a two-stage approach for the task that can augment a typical supervised pair-wise ranking loss based formulation with weakly-annotated web images to learn a more robust visual-semantic embedding. Experiments on two standard benchmark datasets demonstrate that our method achieves a significant performance gain in image-text retrieval compared to state-of-the-art approaches.
LG · Jul 3, 2018
OCTen: Online Compression-based Tensor Decomposition · Ekta Gujral, Ravdeep Pasricha, Tianxiong Yang et al.
Tensor decompositions are powerful tools for large data analytics as they jointly model multiple aspects of data into one framework and enable the discovery of the latent structures and higher-order correlations within the data. One of the most widely studied and used decompositions, especially in data mining and machine learning, is the Canonical Polyadic or CP decomposition. However, today's datasets are not static: they are often dynamically growing and changing with time. To operate on such large data, we present OCTen, the first-ever compression-based online parallel implementation for the CP decomposition. We conduct an extensive empirical analysis of the algorithms in terms of fitness, memory used and CPU time, and in order to demonstrate the compression and scalability of the method, we apply OCTen to big tensor data. Indicatively, OCTen performs on par with or better than state-of-the-art online methods in terms of decomposition accuracy and efficiency, while saving up to 40-200% memory space.
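As background for the CP decomposition mentioned above, here is a minimal sketch of the classic batch alternating least squares (ALS) procedure for a three-way tensor. It is plain ALS on a static tensor, not OCTen's online, compression-based algorithm; the helper names and defaults are illustrative assumptions.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (I x R) and V (J x R): (I*J) x R."""
    R = U.shape[1]
    return np.einsum("ir,jr->ijr", U, V).reshape(-1, R)

def unfold(X, mode):
    """Mode-n unfolding; the remaining modes are flattened in C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, rank, n_iter=200, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((n, rank)) for n in X.shape)
    for _ in range(n_iter):
        # Each update solves a linear least-squares problem in one factor.
        A = unfold(X, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = unfold(X, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = unfold(X, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C
```

Because every update touches the full unfolded tensor, this batch form is costly when new slices keep arriving, which is the setting online methods are designed for.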
IR · Jun 30, 2018
A Constrained Coupled Matrix-Tensor Factorization for Learning Time-evolving and Emerging Topics · Sanaz Bahargam, Evangelos E. Papalexakis
Topic discovery has witnessed significant growth as a field of data mining at large. In particular, time-evolving topic discovery, where the evolution of a topic is taken into account, has been instrumental in understanding the historical context of an emerging topic in a dynamic corpus. Traditionally, time-evolving topic discovery has focused on this notion of time. However, especially in settings where content is contributed by a community or a crowd, an orthogonal notion of time is the one that pertains to the level of expertise of the content creator: the more experienced the creator, the more advanced the topic. In this paper, we propose a novel time-evolving topic discovery method which, in addition to the extracted topics, is able to identify the evolution of each topic over time, as well as the level of difficulty of that topic, as it is inferred by the level of expertise of its main contributors. Our method is based on a novel formulation of Constrained Coupled Matrix-Tensor Factorization, which adopts constraints that are well-motivated for, and, as we demonstrate, essential for high-quality topic discovery. We qualitatively evaluate our approach using real data from the Physics and Programming Stack Exchange forums, and we were able to identify topics of varying levels of difficulty which can be linked to external events, such as the announcement of gravitational waves by the LIGO lab in the Physics forum. We provide a quantitative evaluation of our method by conducting a user study where experts were asked to judge the coherence and quality of the extracted topics. Finally, our proposed method has implications for automatic curriculum design using the extracted topics, where the notion of the level of difficulty is necessary for the proper modeling of prerequisites and advanced concepts.
LG · Jun 6, 2018
A Peek Into the Hidden Layers of a Convolutional Neural Network Through a Factorization Lens · Uday Singh Saini, Evangelos E. Papalexakis
Despite their increasing popularity and success in a variety of supervised learning problems, deep neural networks are extremely hard to interpret and debug: given an already trained deep neural network and a set of test inputs, how can we gain insight into how those inputs interact with different layers of the neural network? Furthermore, can we characterize a given deep neural network based on its observed behavior on different inputs? In this paper we propose a novel factorization-based approach to understanding how different deep neural networks operate. In our preliminary results, we identify fascinating patterns that link the factorization rank (typically used as a measure of interestingness in unsupervised data analysis) with how well or poorly the deep network has been trained. Finally, our proposed approach can help provide visual insights on how high-level, interpretable patterns of the network's input behave inside the hidden layers of the deep network.
LG · May 3, 2018
t-PINE: Tensor-based Predictable and Interpretable Node Embeddings · Saba A. Al-Sayouri, Ekta Gujral, Danai Koutra et al.
Graph representations have grown increasingly popular in recent years. Existing representation learning approaches explicitly encode network structure. Despite their good performance in downstream processes (e.g., node classification, link prediction), there is still room for improvement in different aspects, like efficacy, visualization, and interpretability. In this paper, we propose t-PINE, a method that addresses these limitations. Contrary to baseline methods, which generally learn explicit graph representations by solely using an adjacency matrix, t-PINE leverages a multi-view information graph, in which the adjacency matrix represents the first view and a nearest-neighbor adjacency, computed over the node features, is the second view, in order to learn explicit and implicit node representations using the Canonical Polyadic (a.k.a. CP) decomposition. We argue that the implicit and the explicit mapping from a higher-dimensional to a lower-dimensional vector space is the key to learning more useful, highly predictable, and gracefully interpretable representations. Having good interpretable representations provides good guidance for understanding how each view contributes to the representation learning process. In addition, it helps us to exclude unrelated dimensions. Extensive experiments show that t-PINE drastically outperforms baseline methods by up to 158.6% with respect to Micro-F1 in several multi-label classification problems, while offering high visualization and interpretability utility.
LG · May 3, 2018
RECS: Robust Graph Embedding Using Connection Subgraphs · Saba A. Al-Sayouri, Danai Koutra, Evangelos E. Papalexakis et al.
The success of graph embeddings or node representation learning in a variety of downstream tasks, such as node classification, link prediction, and recommendation systems, has led to their popularity in recent years. Representation learning algorithms aim to preserve local and global network structure by identifying node neighborhood notions. However, many existing algorithms generate embeddings that fail to properly preserve the network structure, or lead to unstable representations due to random processes (e.g., random walks to generate context) and, thus, cannot generalize to multi-graph problems. In this paper, we propose RECS, a novel, stable graph embedding algorithmic framework. RECS learns graph representations using connection subgraphs by employing the analogy of graphs with electrical circuits. It preserves both local and global connectivity patterns, and addresses the issue of high-degree nodes. Further, it exploits the strength of weak ties and meta-data that have been neglected by baselines. The experiments show that RECS outperforms state-of-the-art algorithms by up to 36.85% on multi-label classification problems. Further, in contrast to baselines, RECS, being deterministic, is completely stable.
LG · Apr 25, 2018
Identifying and Alleviating Concept Drift in Streaming Tensor Decomposition · Ravdeep Pasricha, Ekta Gujral, Evangelos E. Papalexakis
Tensor decompositions are used in various data mining applications, from social networks to medical applications, and are extremely useful in discovering latent structures or concepts in the data. Many real-world applications are dynamic in nature and so are their data. To deal with this dynamic nature of data, there exist a variety of online tensor decomposition algorithms. A central assumption in all those algorithms is that the number of latent concepts remains fixed throughout the entire stream. However, this need not be the case. Every incoming batch in the stream may have a different number of latent concepts, and the difference in latent concepts from one tensor batch to another can provide insights into how our findings in a particular application behave and deviate over time. In this paper, we define "concept" and "concept drift" in the context of streaming tensor decomposition, as the manifestation of the variability of latent concepts throughout the stream. Furthermore, we introduce SeekAndDestroy, an algorithm that detects concept drift in streaming tensor decomposition and is able to produce results robust to that drift. To the best of our knowledge, this is the first work that investigates concept drift in streaming tensor decomposition. We extensively evaluate SeekAndDestroy on synthetic datasets, which exhibit a wide variety of realistic drift. Our experiments demonstrate the effectiveness of SeekAndDestroy, both in the detection of concept drift and in the alleviation of its effects, producing results with similar quality to decomposing the entire tensor in one shot. Additionally, on real datasets, SeekAndDestroy outperforms other streaming baselines, while discovering novel useful components.
LG · Apr 24, 2018
Semi-supervised Content-based Detection of Misinformation via Tensor Embeddings · Gisel Bastidas Guacho, Sara Abdali, Neil Shah et al.
Fake news may be intentionally created to promote economic, political and social interests, and can lead to negative impacts on human beliefs and decisions. Hence, detection of fake news is an emerging problem that has become extremely prevalent during the last few years. Most existing works on this topic focus on manual feature extraction and supervised classification models leveraging a large number of labeled (fake or real) articles. In contrast, we focus on content-based detection of fake news articles, while assuming that we have a small amount of labels, made available by manual fact-checkers or automated sources. We argue this is a more realistic setting in the presence of massive amounts of content, most of which cannot be easily fact-checked. To that end, we represent collections of news articles as multi-dimensional tensors, leverage tensor decomposition to derive concise article embeddings that capture spatial/contextual information about each news article, and use those embeddings to create an article-by-article graph on which we propagate limited labels. Results on three real-world datasets show that our method performs on par with or better than existing models that are fully supervised, in that we achieve better detection accuracy using fewer labels. In particular, our proposed method achieves 75.43% accuracy using only 30% of the labels of a public dataset, while an SVM-based classifier achieved 67.43%. Furthermore, our method achieves 70.92% accuracy on a large dataset using only 2% of the labels.
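The last two steps of the pipeline described above (building an article-by-article graph from embeddings, then propagating a few labels over it) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the abstract does not specify the graph construction or the propagation algorithm, so the k-nearest-neighbor graph with Euclidean distance and the simple neighbor-averaging update used here are stand-ins, and the names `knn_graph` and `propagate` are hypothetical. The embeddings are toy 2-D points rather than tensor-derived vectors.

```python
import math

# Illustrative sketch only. Assumptions (not from the abstract): Euclidean
# k-NN graph, neighbor-averaging propagation, toy 2-D embeddings.

def knn_graph(embeddings, k):
    """Return a symmetric k-nearest-neighbor graph as a list of neighbor sets."""
    n = len(embeddings)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        dists = sorted((math.dist(embeddings[i], embeddings[j]), j)
                       for j in range(n) if j != i)
        for _, j in dists[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def propagate(neighbors, seeds, iters=100):
    """Spread seed scores (+1.0 = real, -1.0 = fake) to unlabeled nodes."""
    scores = [seeds.get(i, 0.0) for i in range(len(neighbors))]
    for _ in range(iters):
        updated = []
        for i, nbrs in enumerate(neighbors):
            if i in seeds:
                updated.append(seeds[i])  # labeled nodes stay clamped
            elif nbrs:
                updated.append(sum(scores[j] for j in nbrs) / len(nbrs))
            else:
                updated.append(0.0)  # isolated node: no evidence either way
        scores = updated
    return scores


# Usage: two well-separated clusters, one labeled article in each.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(embeddings, k=2)
scores = propagate(graph, seeds={0: 1.0, 3: -1.0})
predicted_real = [s > 0 for s in scores]
```

In this toy setup the sign of each score serves as the predicted label, so the unlabeled articles inherit the label of the seed in their own cluster.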