CL · Apr 16, 2023 · Code
Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation
Yunjie Ji, Yan Gong, Yong Deng et al.
Recently, significant public efforts have been directed towards developing low-cost models with capabilities akin to ChatGPT, thereby fostering the growth of open-source conversational models. However, there remains a scarcity of comprehensive and in-depth evaluations of these models' performance. In this study, we examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance. Our analysis is grounded in several publicly accessible, high-quality instruction datasets, as well as our own Chinese multi-turn conversations. We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios. Our goal is to supplement manual evaluations with quantitative analyses, offering valuable insights for the continued advancement of open-source chat models. Furthermore, to enhance the performance and the training and inference efficiency of models in the Chinese domain, we extend the vocabulary of LLaMA, the open-source model whose performance is closest to proprietary language models like GPT-3, and conduct secondary pre-training on 3.4B Chinese words. We make our model, data, and code publicly available.
CL · Mar 26, 2023
Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases
Yunjie Ji, Yong Deng, Yan Gong et al.
The success of ChatGPT has recently attracted numerous efforts to replicate it, with instruction-tuning strategies being a key factor in achieving remarkable results. Instruction-tuning not only significantly enhances the model's performance and generalization but also makes the model's generated results more consistent with human speech patterns. However, current research rarely studies the impact of different amounts of instruction data on model performance, especially in real-world use cases. In this paper, we explore the performance of instruction-tuned large language models across different scales of instruction data. We construct an evaluation dataset covering 12 major online use cases. With Bloomz-7B1-mt as the base model, the results show that 1) merely increasing the amount of instruction data leads to continuous improvement in tasks such as open-ended generation, and 2) in tasks such as math and code, the model performance curve remains quite flat as the data size increases. We further analyze the possible causes of these phenomena and propose potential future research directions, such as effectively selecting high-quality training data, scaling base models, and developing training methods specialized for hard tasks. We will release our training and evaluation datasets, as well as model checkpoints.
GEO-PH · Oct 2, 2023
SeisT: A foundational deep learning model for earthquake monitoring tasks
Sen Li, Xu Yang, Anye Cao et al.
Seismograms, the fundamental seismic records, have revolutionized earthquake research and monitoring. Recent advancements in deep learning have further enhanced seismic signal processing, leading to even more precise and effective earthquake monitoring capabilities. This paper introduces a foundational deep learning model, the Seismogram Transformer (SeisT), designed for a variety of earthquake monitoring tasks. SeisT combines multiple modules tailored to different tasks and exhibits impressive out-of-distribution generalization performance, outperforming or matching state-of-the-art models in tasks like earthquake detection, seismic phase picking, first-motion polarity classification, magnitude estimation, back-azimuth estimation, and epicentral distance estimation. The performance scores on the tasks are 0.96, 0.96, 0.68, 0.95, 0.86, 0.55, and 0.81, respectively. The most significant improvements, in comparison to existing models, are observed in phase-P picking, phase-S picking, and magnitude estimation, with gains of 1.7%, 9.5%, and 8.0%, respectively. Our study, through rigorous experiments and evaluations, suggests that SeisT has the potential to contribute to the advancement of seismic signal processing and earthquake research.
NA · May 24
Arnoldi-Enhanced Multivariate Hermite Interpolation of Manifold-Valued Data
Yuxuan Li, Qiang Niu, Wubin Zhou
This paper presents a robust enhancement of the Tangent space Hermite Interpolation (THI) method for manifold-valued data by integrating the multivariate Arnoldi process. To circumvent the inherent numerical instability of multivariate confluent Vandermonde matrices, we use a $G$-Arnoldi-based recurrence to construct a discrete orthogonal polynomial basis directly on the tangent space. This yields better numerical conditioning for high-order approximations. We analyze the convergence rates for both $C^0$ and $C^1$ errors in the multivariate setting. When only function values are used, the $C^0$ approximation error decays as $\mathcal{O}\left(\sqrt{M} n^{-m}\right)$. For the $C^1$ error without derivative data, the rate becomes $\mathcal{O}\left(\sqrt{M} h^{-1} n^{-m}\right)$, where $h$ is the fill distance of the sampling set. When derivative data are additionally available, the $C^1$ error is $\mathcal{O}\left(\sqrt{M} n^{-(m-1)}\right)$. In all cases, $n$ is the polynomial degree, $m$ denotes the regularity of the target function, and $M$ is the number of sampling points. Importantly, as $n$ increases, the required number of points $M$ must also increase. This reveals the interplay among approximation order, sampling density ($M$), fill distance ($h$), dimension ($d$), and the regularity ($m$) of the target function. Extensive numerical experiments conducted on the special orthogonal group $SO(3)$ and the unit sphere $S^2$ show that the Arnoldi-enhanced THI method outperforms Kriging-based approaches in terms of both computational efficiency and accuracy.
CO · Aug 17, 2022
CSGO: Constrained-Softassign Gradient Optimization For Large Graph Matching
Binrui Shen, Qiang Niu, Shengxin Zhu
Graph matching aims to find correspondences between two graphs. This paper integrates several well-known graph matching algorithms into a single framework: the constrained gradient method. The primary differences among these algorithms lie in how they tune a step size parameter and which constraining operator they use. Building on these insights, we propose an adaptive step size parameter that guarantees the convergence of the underlying algorithms while enhancing their efficiency and robustness. For the constraining operator, we introduce a scalable softassign for large graph matching problems. Compared to the original softassign, our approach offers increased speed, improved robustness, and reduced risk of overflow. The advanced constraining operator enables CSGO, a constrained-softassign gradient optimization method for large graph matching, which outperforms state-of-the-art methods in experiments. Notably, in attributed graph matching tasks, CSGO achieves a more than 10X speedup over current constrained gradient algorithms.
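For readers unfamiliar with softassign, the following is a minimal sketch of the classical operator (exponentiate a score matrix, then alternate row and column normalization), written in the log domain so that large scores do not overflow. It is not the scalable softassign proposed in the paper; the function names, the `beta` parameter, and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def _logsumexp(L, axis):
    """Stable log(sum(exp(L))) along an axis, keeping dimensions."""
    m = L.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(L - m).sum(axis=axis, keepdims=True))

def log_softassign(X, beta=1.0, n_iter=50):
    """Map a square score matrix X to an (approximately) doubly stochastic matrix.

    Hypothetical helper: works on log-values so exp(beta * X) is never formed
    directly, which avoids overflow when beta * X is large.
    """
    L = beta * np.asarray(X, dtype=float)
    for _ in range(n_iter):
        L = L - _logsumexp(L, axis=1)  # normalize rows in log space
        L = L - _logsumexp(L, axis=0)  # normalize columns in log space
    return np.exp(L)

# Scores this large would overflow a naive np.exp(beta * X).
X = np.array([[1000.0, 999.0],
              [999.0, 1000.0]])
S = log_softassign(X, beta=1.0)
```

Because the last step in each sweep normalizes columns, the column sums are one up to rounding; the row sums approach one as the iteration converges.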
CL · Dec 2, 2024 · Code
Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data
Shuaijiang Zhao, Tingwei Guo, Bajian Xiang et al.
GPT-4o represents a significant milestone in enabling real-time interaction with large language models (LLMs) through speech. Its remarkably low latency and high fluency not only capture attention but also stimulate research interest in the field. This real-time speech interaction is particularly valuable in scenarios requiring rapid feedback and immediate responses, dramatically enhancing user experience. However, there is a notable lack of research focused on real-time large speech language models, particularly for Chinese. In this work, we present KE-Omni, a seamless large speech language model built upon Ke-SpeechChat, a large-scale, high-quality synthetic speech interaction dataset consisting of 7 million Chinese and English conversations, featuring 42,002 speakers and totaling over 60,000 hours. This contributes significantly to the advancement of research and development in this field. The demos can be accessed at https://huggingface.co/spaces/KE-Team/KE-Omni
CP · Jul 4, 2022
Modeling Randomly Walking Volatility with Chained Gamma Distributions
Di Zhang, Qiang Niu, Youzhou Zhou
Volatility clustering is a common phenomenon in financial time series. Typically, linear models can be used to describe the temporal autocorrelation of the (logarithmic) variance of returns. Considering the difficulty in estimating this model, we construct a Dynamic Bayesian Network that utilizes the normal-gamma and gamma-gamma conjugate prior relations, so that its posterior form locally remains unchanged at each node. This makes it possible to find approximate solutions quickly using variational methods. Furthermore, by inserting dummy gamma nodes between adjacent time steps, we ensure that the volatility expressed by the model is an independent incremental process. We have found that this model has two advantages: 1) It can be proved that, compared to popular linear models, it can express heavier tails than Gaussians, i.e., positive excess kurtosis. 2) If variational inference (VI) is used for state estimation, it runs much faster than Monte Carlo (MC) methods, since the calculation of the posterior uses only basic arithmetic operations, and its convergence process is deterministic. We tested the model, named Gam-Chain, using recent Crypto, Nasdaq, and Forex records of varying resolutions. The results show that: 1) When both use MC, this model achieves state estimation results comparable to those of the regular lognormal chain; 2) When only VI is used, this model obtains accuracy that is slightly worse than with MC, but still acceptable in practice; 3) Using only VI, the running time of Gam-Chain can, in general, be reduced to below 5% of that of the lognormal chain via MC.
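The remark that the posterior "uses only basic arithmetic operations" can be illustrated with the textbook normal-gamma conjugate update for a single node. This is a minimal sketch of that standard identity only, not the Gam-Chain model itself (which chains nodes through gamma-gamma links and dummy nodes); the function name and the prior values are assumptions.

```python
def update_precision(shape, rate, x, mean=0.0):
    """Conjugate update for a Gaussian precision tau with a Gamma(shape, rate) prior.

    Given one observation x ~ N(mean, 1/tau), the posterior is again a Gamma
    distribution, with parameters obtained by addition and multiplication only.
    """
    return shape + 0.5, rate + 0.5 * (x - mean) ** 2

# Hypothetical prior and a single observed return.
post_shape, post_rate = update_precision(2.0, 1.0, 2.0)

# Posterior mean of the variance 1/tau, defined when the shape exceeds 1.
expected_variance = post_rate / (post_shape - 1.0)
```

Since the posterior stays in the gamma family, the same update can be reapplied at each node without changing the functional form, which is the property the abstract relies on.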
NA · Mar 25
Stable High-Order Interpolation on the Grassmann Manifold by Maximum-Volume Coordinates and Arnoldi Orthogonalization
Qiang Niu, Wen Jiang, Jie Fei et al.
High-order interpolation on the Grassmann manifold $\mathrm{Gr}(n, p)$ is often hindered by the computational overhead and derivative instability of SVD-based geometric mappings. To address these challenges, we propose a stabilized framework that combines Maximum-Volume (MV) local coordinates with Arnoldi-orthogonalized polynomial bases. First, manifold data are mapped to a well-conditioned Euclidean domain via MV coordinates. This approach bypasses the costly matrix factorizations inherent to traditional Riemannian normal coordinates. Within the coordinate space, we use the Vandermonde-with-Arnoldi (V+A) method for Lagrange interpolation and its confluent extension (CV+A) for derivative-enriched Hermite interpolation. By constructing discrete orthogonal bases directly from the parameter nodes, we avoid solving ill-conditioned linear systems. Theoretical bounds are established to verify the stability of the geometric mapping and the polynomial approximation. Extensive numerical experiments demonstrate that the proposed MV-(C)V+A framework produces highly accurate approximations in high-degree polynomial interpolation.
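As background for the V+A step, here is a minimal univariate sketch of Vandermonde with Arnoldi for scalar data on an interval. It covers neither the MV coordinates, the Grassmann setting, nor the confluent (CV+A) extension described above; the function names, node choice, and test function are assumptions for illustration.

```python
import numpy as np

def polyfit_arnoldi(x, f, n):
    """Fit a degree-n polynomial to samples (x, f) in an Arnoldi-generated basis."""
    M = len(x)
    Q = np.ones((M, n + 1))       # columns: discrete orthogonal polynomials at x
    H = np.zeros((n + 1, n))      # Hessenberg matrix of recurrence coefficients
    for k in range(n):
        q = x * Q[:, k]           # multiply the previous basis polynomial by x
        for j in range(k + 1):    # modified Gram-Schmidt against earlier columns
            H[j, k] = Q[:, j] @ q / M
            q = q - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(q) / np.sqrt(M)
        Q[:, k + 1] = q / H[k + 1, k]
    coeffs, *_ = np.linalg.lstsq(Q, f, rcond=None)
    return coeffs, H

def polyval_arnoldi(coeffs, H, s):
    """Evaluate the fitted polynomial at new points s by replaying the recurrence."""
    n = H.shape[1]
    W = np.ones((len(s), n + 1))
    for k in range(n):
        w = s * W[:, k]
        for j in range(k + 1):
            w = w - H[j, k] * W[:, j]
        W[:, k + 1] = w / H[k + 1, k]
    return W @ coeffs

# Example: degree-30 fit of exp(x) sampled at 100 Chebyshev nodes on [-1, 1].
M, n = 100, 30
x = np.cos(np.pi * (np.arange(M) + 0.5) / M)
coeffs, H = polyfit_arnoldi(x, np.exp(x), n)
s = np.linspace(-1.0, 1.0, 500)
max_err = np.max(np.abs(polyval_arnoldi(coeffs, H, s) - np.exp(s)))
```

The key design choice is that the monomial Vandermonde matrix is never formed: the basis is orthogonalized against the sample nodes as it is built, and evaluation reuses the stored recurrence coefficients in `H`.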
LG · Apr 1, 2021
Sub-GMN: The Neural Subgraph Matching Network Model
Zixun Lan, Limin Yu, Linglong Yuan et al.
As one of the most fundamental tasks in graph theory, subgraph matching is crucial in many fields, including information retrieval, computer vision, biology, chemistry, and natural language processing. Yet the subgraph matching problem is NP-complete. This study proposes an end-to-end, learning-based approximate method for the subgraph matching task, called the subgraph matching network (Sub-GMN). Sub-GMN first uses graph representation learning to map nodes to node-level embeddings. It then combines metric learning and attention mechanisms to model the relationship between matched nodes in the data graph and the query graph. To test the performance of the proposed method, we applied it to two datasets, using two existing methods, GNN and FGNN, as baselines for comparison. Our experiments show that, on dataset 1, the average accuracy of Sub-GMN is 12.21% and 3.2% higher than that of GNN and FGNN, respectively. In terms of average running time, Sub-GMN runs 20-40 times faster than FGNN. In addition, the average F1-score of Sub-GMN across all experiments on dataset 2 reached 0.95, which demonstrates that Sub-GMN outputs more correct node-to-node matches. Compared with previous GNN-based methods for the subgraph matching task, Sub-GMN allows the query and data graphs to vary in the test/application stage, whereas most previous GNN-based methods can only find a matched subgraph in the data graph for the same query graph used in the training stage. Another advantage of Sub-GMN is that it can output a list of node-to-node matches, while most existing end-to-end GNN-based methods cannot provide the matched node pairs.
CV · Jan 16, 2020
Fabricated Pictures Detection with Graph Matching
Binrui Shen, Qiang Niu, Shengxin Zhu
Fabricating experimental pictures in research work is serious academic misconduct that should ideally be detected during the reviewing process. However, due to the large number of submissions, detecting whether a picture is fabricated or reused is laborious for reviewers, and such manipulation is sometimes indistinguishable to the human eye. A tool for detecting similarity between images may help to alleviate this problem. Some methods based on local feature point matching work most of the time, but they may produce messy matchings because they ignore the global relationships between features. We present a framework to detect similar, or perhaps fabricated, pictures with graph matching techniques. A new iterative method is proposed, and experiments show that such a graph matching technique is better in some cases than methods based only on local features.