GEO-PHJun 21, 2023
$\mathbf{\mathbb{E}^{FWI}}$: Multi-parameter Benchmark Datasets for Elastic Full Waveform Inversion of Geophysical PropertiesShihang Feng, Hanchen Wang, Chengyuan Deng et al.
Elastic geophysical properties (such as P- and S-wave velocities) are of great importance to various subsurface applications like CO$_2$ sequestration and energy exploration (e.g., hydrogen and geothermal). Elastic full waveform inversion (FWI) is widely applied for characterizing reservoir properties. In this paper, we introduce $\mathbf{\mathbb{E}^{FWI}}$, a comprehensive benchmark dataset that is specifically designed for elastic FWI. $\mathbf{\mathbb{E}^{FWI}}$ encompasses 8 distinct datasets that cover diverse subsurface geologic structures (flat, curve, faults, etc). The benchmark results produced by three different deep learning methods are provided. In contrast to our previously presented dataset (pressure recordings) for acoustic FWI (referred to as OpenFWI), the seismic dataset in $\mathbf{\mathbb{E}^{FWI}}$ has both vertical and horizontal components. Moreover, the velocity maps in $\mathbf{\mathbb{E}^{FWI}}$ incorporate both P- and S-wave velocities. While the multicomponent data and the added S-wave velocity make the data more realistic, more challenges are introduced regarding the convergence and computational cost of the inversion. We conduct comprehensive numerical experiments to explore the relationship between P-wave and S-wave velocities in seismic data. The relation between P- and S-wave velocities provides crucial insights into the subsurface properties such as lithology, porosity, fluid content, etc. We anticipate that $\mathbf{\mathbb{E}^{FWI}}$ will facilitate future research on multiparameter inversions and stimulate endeavors in several critical research topics of carbon-zero and new energy exploration. All datasets, codes and relevant information can be accessed through our website at https://efwi-lanl.github.io/
LGJul 12, 2024
RIO-CPD: A Riemannian Geometric Method for Correlation-aware Online Change Point DetectionChengyuan Deng, Zhengzhang Chen, Xujiang Zhao et al.
Change point detection aims to identify abrupt shifts occurring at multiple points within a data sequence. This task becomes particularly challenging in the online setting, where different types of changes can occur, including shifts in both the marginal and joint distributions of the data. In this paper, we address these challenges by tracking the Riemannian geometry of correlation matrices, allowing Riemannian metrics to compute the geodesic distance as an accurate measure of correlation dynamics. We introduce Rio-CPD, a non-parametric, correlation-aware online change point detection framework that integrates the Riemannian geometry of the manifold of symmetric positive definite matrices with the cumulative sum (CUSUM) statistic for detecting change points. Rio-CPD employs a novel CUSUM design by computing the geodesic distance between current observations and the Fréchet mean of prior observations. With appropriate choices of Riemannian metrics, Rio-CPD offers a simple yet effective and computationally efficient algorithm. Experimental results on both synthetic and real-world datasets demonstrate that Rio-CPD outperforms existing methods on detection accuracy, average detection delay and efficiency.
AIJun 8, 2024Code
LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause AnalysisLecheng Zheng, Zhengzhang Chen, Dongjie Wang et al.
Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap, we introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. LEMMA-RCA features various real-world fault scenarios from IT and OT operation systems, encompassing microservices, water distribution, and water treatment systems, with hundreds of system entities involved. We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset under various settings, including offline and online modes as well as single and multiple modalities. Our experimental results demonstrate the high quality of LEMMA-RCA. The dataset is publicly available at https://lemma-rca.github.io/.
LGNov 4, 2021Code
OpenFWI: Large-Scale Multi-Structural Benchmark Datasets for Seismic Full Waveform InversionChengyuan Deng, Shihang Feng, Hanchen Wang et al.
Full waveform inversion (FWI) is widely used in geophysics to reconstruct high-resolution velocity maps from seismic data. The recent success of data-driven FWI methods results in a rapidly increasing demand for open datasets to serve the geophysics community. We present OpenFWI, a collection of large-scale multi-structural benchmark datasets, to facilitate diversified, rigorous, and reproducible research on FWI. In particular, OpenFWI consists of 12 datasets (2.1TB in total) synthesized from multiple sources. It encompasses diverse domains in geophysics (interface, fault, CO2 reservoir, etc.), covers different geological subsurface structures (flat, curve, etc.), and contains various amounts of data samples (2K - 67K). It also includes a dataset for 3D FWI. Moreover, we use OpenFWI to perform benchmarking over four deep learning methods, covering both supervised and unsupervised learning regimes. Along with the benchmarks, we implement additional experiments, including physics-driven methods, complexity analysis, generalization study, uncertainty quantification, and so on, to sharpen our understanding of datasets and methods. The studies either provide valuable insights into the datasets and the performance, or uncover their current limitations. We hope OpenFWI supports prospective research on FWI and inspires future open-source efforts on AI for science. All datasets and related information can be accessed through our website at https://openfwi-lanl.github.io/
CGMar 20
Locality Sensitive Hashing in Hyperbolic SpaceChengyuan Deng, Jie Gao, Kevin Lu et al.
For a metric space $(X, d)$, a family $\mathcal{H}$ of locality sensitive hash functions is called $(r, cr, p_1, p_2)$ sensitive if a randomly chosen function $h\in \mathcal{H}$ has probability at least $p_1$ (at most $p_2$) to map any $a, b\in X$ in the same hash bucket if $d(a, b)\leq r$ (or $d(a, b)\geq cr$). Locality Sensitive Hashing (LSH) is one of the most popular techniques for approximate nearest-neighbor search in high-dimensional spaces, and has been studied extensively for Hamming, Euclidean, and spherical geometries. An $(r, cr, p_1, p_2)$-sensitive hash function enables approximate nearest neighbor search (i.e., returning a point within distance $cr$ from a query $q$ if there exists a point within distance $r$ from $q$) with space $O(n^{1+Ï})$ and query time $O(n^Ï)$ where $Ï=\frac{\log 1/p_1}{\log 1/p_2}$. But LSH for hyperbolic spaces $\mathbb{H}^d$ remains largely unexplored. In this work, we present the first LSH construction native to hyperbolic space. For the hyperbolic plane $(d=2)$, we show a construction achieving $Ï\leq 1/c$, based on the hyperplane rounding scheme. For general hyperbolic spaces $(d \geq 3)$, we use dimension reduction from $\mathbb{H}^d$ to $\mathbb{H}^2$ and the 2D hyperbolic LSH to get $Ï\leq 1.59/c$. On the lower bound side, we show that the lower bound on $Ï$ of Euclidean LSH extends to the hyperbolic setting via local isometry, therefore giving $Ï\geq 1/c^2$.
LGNov 16, 2024
Neuc-MDS: Non-Euclidean Multidimensional Scaling Through Bilinear FormsChengyuan Deng, Jie Gao, Kevin Lu et al.
We introduce Non-Euclidean-MDS (Neuc-MDS), an extension of classical Multidimensional Scaling (MDS) that accommodates non-Euclidean and non-metric inputs. The main idea is to generalize the standard inner product to symmetric bilinear forms to utilize the negative eigenvalues of dissimilarity Gram matrices. Neuc-MDS efficiently optimizes the choice of (both positive and negative) eigenvalues of the dissimilarity Gram matrix to reduce STRESS, the sum of squared pairwise error. We provide an in-depth error analysis and proofs of the optimality in minimizing lower bounds of STRESS. We demonstrate Neuc-MDS's ability to address limitations of classical MDS raised by prior research, and test it on various synthetic and real-world datasets in comparison with both linear and non-linear dimension reduction methods.
DSApr 22, 2025
On the Price of Differential Privacy for Hierarchical ClusteringChengyuan Deng, Jie Gao, Jalaj Upadhyay et al.
Hierarchical clustering is a fundamental unsupervised machine learning task with the aim of organizing data into a hierarchy of clusters. Many applications of hierarchical clustering involve sensitive user information, therefore motivating recent studies on differentially private hierarchical clustering under the rigorous framework of Dasgupta's objective. However, it has been shown that any privacy-preserving algorithm under edge-level differential privacy necessarily suffers a large error. To capture practical applications of this problem, we focus on the weight privacy model, where each edge of the input graph is at least unit weight. We present a novel algorithm in the weight privacy model that shows significantly better approximation than known impossibility results in the edge-level DP setting. In particular, our algorithm achieves $O(\log^{1.5}n/\varepsilon)$ multiplicative error for $\varepsilon$-DP and runs in polynomial time, where $n$ is the size of the input graph, and the cost is never worse than the optimal additive error in existing work. We complement our algorithm by showing if the unit-weight constraint does not apply, the lower bound for weight-level DP hierarchical clustering is essentially the same as the edge-level DP, i.e. $Ω(n^2/\varepsilon)$ additive error. As a result, we also obtain a new lower bound of $\tildeΩ(1/\varepsilon)$ additive error for balanced sparsest cuts in the weight-level DP model, which may be of independent interest. Finally, we evaluate our algorithm on synthetic and real-world datasets. Our experimental results show that our algorithm performs well in terms of extra cost and has good scalability to large graphs.
CLJun 8, 2024
Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas: A SurveyChengyuan Deng, Yiqun Duan, Xin Jin et al.
Large Language Models (LLMs) have achieved unparalleled success across diverse language modeling tasks in recent years. However, this progress has also intensified ethical concerns, impacting the deployment of LLMs in everyday contexts. This paper provides a comprehensive survey of ethical challenges associated with LLMs, from longstanding issues such as copyright infringement, systematic bias, and data privacy, to emerging problems like truthfulness and social norms. We critically analyze existing research aimed at understanding, examining, and mitigating these ethical risks. Our survey underscores integrating ethical standards and societal values into the development of LLMs, thereby guiding the development of responsible and ethically aligned language models.
CLMay 30, 2023
Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive SurveyChen Ling, Xujiang Zhao, Jiaying Lu et al.
Large language models (LLMs) have significantly advanced the field of natural language processing (NLP), providing a highly useful, task-agnostic foundation for a wide range of applications. However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles, caused by the heterogeneity of domain data, the sophistication of domain knowledge, the uniqueness of domain objectives, and the diversity of the constraints (e.g., various social norms, cultural conformity, religious beliefs, and ethical standards in the domain applications). Domain specification techniques are key to make large language models disruptive in many applications. Specifically, to solve these hurdles, there has been a notable increase in research and practices conducted in recent years on the domain specialization of LLMs. This emerging field of study, with its substantial potential for impact, necessitates a comprehensive and systematic review to better summarize and guide ongoing work in this area. In this article, we present a comprehensive survey on domain specification techniques for large language models, an emerging direction critical for large language model applications. First, we propose a systematic taxonomy that categorizes the LLM domain-specialization techniques based on the accessibility to LLMs and summarizes the framework for all the subcategories as well as their relations and differences to each other. Second, we present an extensive taxonomy of critical application domains that can benefit dramatically from specialized LLMs, discussing their practical significance and open challenges. Last, we offer our insights into the current research status and future trends in this area.
LGMay 4, 2023
Impossibility of Depth Reduction in Explainable ClusteringChengyuan Deng, Surya Teja Gavva, Karthik C. S. et al.
Over the last few years Explainable Clustering has gathered a lot of attention. Dasgupta et al. [ICML'20] initiated the study of explainable $k$-means and $k$-median clustering problems where the explanation is captured by a threshold decision tree which partitions the space at each node using axis parallel hyperplanes. Recently, Laber et al. [Pattern Recognition'23] made a case to consider the depth of the decision tree as an additional complexity measure of interest. In this work, we prove that even when the input points are in the Euclidean plane, then any depth reduction in the explanation incurs unbounded loss in the $k$-means and $k$-median cost. Formally, we show that there exists a data set $X\subseteq \mathbb{R}^2$, for which there is a decision tree of depth $k-1$ whose $k$-means/$k$-median cost matches the optimal clustering cost of $X$, but every decision tree of depth less than $k-1$ has unbounded cost w.r.t. the optimal cost of clustering. We extend our results to the $k$-center objective as well, albeit with weaker guarantees.
LGNov 28, 2021
On the Robustness and Generalization of Deep Learning Driven Full Waveform InversionChengyuan Deng, Youzuo Lin
The data-driven approach has been demonstrated as a promising technique to solve complicated scientific problems. Full Waveform Inversion (FWI) is commonly epitomized as an image-to-image translation task, which motivates the use of deep neural networks as an end-to-end solution. Despite being trained with synthetic data, the deep learning-driven FWI is expected to perform well when evaluated with sufficient real-world data. In this paper, we study such properties by asking: how robust are these deep neural networks and how do they generalize? For robustness, we prove the upper bounds of the deviation between the predictions from clean and noisy data. Moreover, we demonstrate an interplay between the noise level and the additional gain of loss. For generalization, we prove a norm-based generalization error upper bound via a stability-generalization framework. Experimental results on seismic FWI datasets corroborate with the theoretical results, shedding light on a better understanding of utilizing Deep Learning for complicated scientific applications.
LGOct 21, 2020
On the Global Self-attention Mechanism for Graph Convolutional NetworksChen Wang, Chengyuan Deng
Applying Global Self-attention (GSA) mechanism over features has achieved remarkable success on Convolutional Neural Networks (CNNs). However, it is not clear if Graph Convolutional Networks (GCNs) can similarly benefit from such a technique. In this paper, inspired by the similarity between CNNs and GCNs, we study the impact of the Global Self-attention mechanism on GCNs. We find that consistent with the intuition, the GSA mechanism allows GCNs to capture feature-based vertex relations regardless of edge connections; As a result, the GSA mechanism can introduce extra expressive power to the GCNs. Furthermore, we analyze the impacts of the GSA mechanism on the issues of overfitting and over-smoothing. We prove that the GSA mechanism can alleviate both the overfitting and the over-smoothing issues based on some recent technical developments. Experiments on multiple benchmark datasets illustrate both superior expressive power and less significant overfitting and over-smoothing problems for the GSA-augmented GCNs, which corroborate the intuitions and the theoretical results.
LGNov 27, 2019
SAG-VAE: End-to-end Joint Inference of Data Representations and Feature RelationsChen Wang, Chengyuan Deng, Vladimir Ivanov
Variational Autoencoders (VAEs) are powerful in data representation inference, but it cannot learn relations between features with its vanilla form and common variations. The ability to capture relations within data can provide the much needed inductive bias necessary for building more robust Machine Learning algorithms with more interpretable results. In this paper, inspired by recent advances in relational learning using Graph Neural Networks, we propose the Self-Attention Graph Variational AutoEncoder (SAG-VAE) network which can simultaneously learn feature relations and data representations in an end-to-end manner. SAG-VAE is trained by jointly inferring the posterior distribution of two types of latent variables, which denote the data representation and a shared graph structure, respectively. Furthermore, we introduce a novel self-attention graph network that improves the generative capabilities of SAG-VAE by parameterizing the generative distribution allowing SAG-VAE to generate new data via graph convolution, while still trainable via backpropagation. A learnable relational graph representation enhances SAG-VAE's robustness to perturbation and noise, while also providing deeper intuition into model performance. Experiments based on graphs show that SAG-VAE is capable of approximately retrieving edges and links between nodes based entirely on feature observations. Finally, results on image data illustrate that SAG-VAE is fairly robust against perturbations in image reconstruction and sampling.
LGAug 9, 2019
Adaptive Ensemble of Classifiers with Regularization for Imbalanced Data ClassificationChen Wang, Chengyuan Deng, Zhoulu Yu et al.
The dynamic ensemble selection of classifiers is an effective approach for processing label-imbalanced data classifications. However, such a technique is prone to overfitting, owing to the lack of regularization methods and the dependence of the aforementioned technique on local geometry. In this study, focusing on binary imbalanced data classification, a novel dynamic ensemble method, namely adaptive ensemble of classifiers with regularization (AER), is proposed, to overcome the stated limitations. The method solves the overfitting problem through implicit regularization. Specifically, it leverages the properties of stochastic gradient descent to obtain the solution with the minimum norm, thereby achieving regularization; furthermore, it interpolates the ensemble weights by exploiting the global geometry of data to further prevent overfitting. According to our theoretical proofs, the seemingly complicated AER paradigm, in addition to its regularization capabilities, can actually reduce the asymptotic time and memory complexities of several other algorithms. We evaluate the proposed AER method on seven benchmark imbalanced datasets from the UCI machine learning repository and one artificially generated GMM-based dataset with five variations. The results show that the proposed algorithm outperforms the major existing algorithms based on multiple metrics in most cases, and two hypothesis tests (McNemar's and Wilcoxon tests) verify the statistical significance further. In addition, the proposed method has other preferred properties such as special advantages in dealing with highly imbalanced data, and it pioneers the research on the regularization for dynamic ensemble methods.
LGAug 5, 2019
Imbalance-XGBoost: Leveraging Weighted and Focal Losses for Binary Label-Imbalanced Classification with XGBoostChen Wang, Chengyuan Deng, Suzhen Wang
The paper presents Imbalance-XGBoost, a Python package that combines the powerful XGBoost software with weighted and focal losses to tackle binary label-imbalanced classification tasks. Though a small-scale program in terms of size, the package is, to the best of the authors' knowledge, the first of its kind which provides an integrated implementation for the two losses on XGBoost and brings a general-purpose extension on XGBoost for label-imbalanced scenarios. In this paper, the design and usage of the package are described with exemplar code listings, and its convenience to be integrated into Python-driven Machine Learning projects is illustrated. Furthermore, as the first- and second-order derivatives of the loss functions are essential for the implementations, the algebraic derivation is discussed and it can be deemed as a separate algorithmic contribution. The performances of the algorithms implemented in the package are empirically evaluated on Parkinson's disease classification data set, and multiple state-of-the-art performances have been observed. Given the scalable nature of XGBoost, the package has great potentials to be applied to real-life binary classification tasks, which are usually of large-scale and label-imbalanced.