Experiments with truth using Machine Learning: Spectral analysis and explainable classification of synthetic, false, and genuine information
The paper addresses the societal problem of misinformation, which LLMs have exacerbated, but its contribution is incremental: it applies existing methods to analyze and explain classification and introduces no new techniques.
This paper tackles the problem of distinguishing synthetic, false, and genuine information using machine learning. It finds that misinformation is closely intertwined with genuine information and that existing algorithms do not separate the two effectively, despite claims in the literature.
Misinformation is still a major societal problem, and the arrival of Large Language Models (LLMs) has only added to it. This paper analyzes synthetic, false, and genuine textual information from spectral analysis, visualization, and explainability perspectives to explain why the problem remains unsolved despite years of research and a plethora of solutions in the literature. Various embedding techniques are applied to multiple datasets to represent the information. The spectral and non-spectral methods applied to these embeddings include t-distributed Stochastic Neighbor Embedding (t-SNE), Principal Component Analysis (PCA), and Variational Autoencoders (VAEs). Classification is performed with multiple machine learning algorithms. Local Interpretable Model-Agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), and Integrated Gradients are used to explain the classifications. The analysis and the generated explanations show that misinformation is closely intertwined with genuine information and that machine learning algorithms are not as effective at separating the two as the literature claims.
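As a rough illustration of one of the explanation methods named in the abstract, the sketch below computes Integrated Gradients for a toy logistic classifier in pure Python. Everything specific in it is an assumption made for illustration and is not taken from the paper: the three feature names, the weights and bias, the all-zero baseline, and the number of integration steps. It is not the authors' model or pipeline; it only shows how per-feature attributions are obtained and that they sum approximately to the change in the predicted probability between the baseline and the input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def gradient(weights, bias, x):
    """Analytic gradient of the predicted probability w.r.t. each input feature."""
    p = predict(weights, bias, x)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(weights, bias, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients.

    Gradients are averaged along the straight path from `baseline` to `x`,
    then scaled by the per-feature difference (x_i - baseline_i).
    """
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(weights, bias, point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


# Hypothetical toy model: feature names and weights are illustrative only.
FEATURES = ["shocking", "reportedly", "according"]
WEIGHTS = [1.5, -0.4, -1.1]
BIAS = 0.2

x = [2.0, 1.0, 0.0]         # term counts for one made-up document
baseline = [0.0, 0.0, 0.0]  # all-zero baseline (an assumption)

attributions = integrated_gradients(WEIGHTS, BIAS, x, baseline)
for name, a in zip(FEATURES, attributions):
    print(f"{name:>10s}: {a:+.4f}")

# Completeness check: attributions should sum to f(x) - f(baseline).
gap = predict(WEIGHTS, BIAS, x) - predict(WEIGHTS, BIAS, baseline)
print(f"sum of attributions = {sum(attributions):+.4f}")
print(f"f(x) - f(baseline)  = {gap:+.4f}")
```

In this toy setting a feature absent from the document receives zero attribution, and the sign of each remaining attribution follows the sign of its weight. With real text embeddings the baseline choice is less obvious and can change the attributions considerably, so it would need to be reported alongside any explanation.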