Andrei Chernov

LG
h-index3
5papers
6citations
Novelty47%
AI Score27

5 Papers

LGSep 18, 2024
Fine-Tuning a Time Series Foundation Model with Wasserstein Loss

Andrei Chernov

Inspired by recent advancements in large language models (LLMs) for Natural Language Processing (NLP), there has been a surge in research focused on developing foundational models for time series forecasting. One approach involves training LLM architectures on tokenized time series data using cross-entropy loss. Although this method has demonstrated promising results, cross-entropy loss is primarily designed for classification tasks and does not account for the distance between classes. To address this limitation, we propose using the Wasserstein loss for such architectures. To validate our approach, we fine-tuned a foundational time series model on $22$ zero-shot datasets, comparing the performance of cross-entropy loss with that of Wasserstein loss. Our results demonstrate that replacing cross-entropy loss with Wasserstein loss significantly improves point estimation.

LGFeb 5, 2025
(GG) MoE vs. MLP on Tabular Data

Andrei Chernov

In recent years, significant efforts have been directed toward adapting modern neural network architectures for tabular data. However, despite their larger number of parameters and longer training and inference times, these models often fail to consistently outperform vanilla multilayer perceptron (MLP) neural networks. Moreover, MLP-based ensembles have recently demonstrated superior performance and efficiency compared to advanced deep learning methods. Therefore, rather than focusing on building deeper and more complex deep learning models, we propose investigating whether MLP neural networks can be replaced with more efficient architectures without sacrificing performance. In this paper, we first introduce GG MoE, a mixture-of-experts (MoE) model with a Gumbel-Softmax gating function. We then demonstrate that GG MoE with an embedding layer achieves the highest performance across $38$ datasets compared to standard MoE and MLP models. Finally, we show that both MoE and GG MoE utilize significantly fewer parameters than MLPs, making them a promising alternative for scaling and ensemble methods.

LGFeb 24, 2025
The Empirical Impact of Reducing Symmetries on the Performance of Deep Ensembles and MoE

Andrei Chernov, Oleg Novitskij

Recent studies have shown that reducing symmetries in neural networks enhances linear mode connectivity between networks without requiring parameter space alignment, leading to improved performance in linearly interpolated neural networks. However, in practical applications, neural network interpolation is rarely used; instead, ensembles of networks are more common. In this paper, we empirically investigate the impact of reducing symmetries on the performance of deep ensembles and Mixture of Experts (MoE) across five datasets. Additionally, to explore deeper linear mode connectivity, we introduce the Mixture of Interpolated Experts (MoIE). Our results show that deep ensembles built on asymmetric neural networks achieve significantly better performance as ensemble size increases compared to their symmetric counterparts. In contrast, our experiments do not provide conclusive evidence on whether reducing symmetries affects both MoE and MoIE architectures.

CLFeb 24, 2025
Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks

Andrei Chernov

Recently, Large Language Models (LLMs) with Mixture of Experts (MoE) layers have gained significant attention. Currently, state-of-the-art LLMs utilize this architecture. There is a substantial amount of research on how to train such models and how to select hyperparameters for this architecture. However, there is a lack of studies focusing on post-evaluation analysis of MoE layer properties. In this paper, we take a first step toward closing this gap by evaluating expert contributions on the quiz-based MMLU benchmark. We show that most experts were never activated during inference on this benchmark. Additionally, the output distribution of gating networks is much closer to uniform than sparse. Finally, we demonstrate that the average performance of some experts within the same layer varies significantly.

LGMay 30, 2025
BinConv: A Neural Architecture for Ordinal Encoding in Time-Series Forecasting

Andrei Chernov, Vitaliy Pozdnyakov, Ilya Makarov

Recent work in time series forecasting has explored reformulating regression as a classification task. By discretizing the continuous target space into bins and predicting over a fixed set of classes, these approaches benefit from more stable training, improved uncertainty modeling, and compatibility with modern deep learning architectures. However, most existing methods rely on one-hot encoding, which ignores the inherent ordinal structure of the target values. As a result, they fail to convey information about the relative distance between predicted and true values during training. In this paper, we address this limitation by applying \textbf{Cumulative Binary Encoding} (CBE), a monotonic binary representation that transforms both model inputs and outputs. CBE implicitly preserves ordinal and magnitude information, allowing models to learn distance aware representations while operating within a classification framework. To leverage CBE effectively, we propose \textbf{BinConv}, a fully convolutional neural network architecture designed for probabilistic forecasting. We demonstrate that standard fully connected layers are not only less computationally efficient than convolutional layers when used with CBE, but also degrade forecasting performance. Our experiments on standard benchmark datasets show that BinConv achieves superior performance compared to widely used baselines in both point and probabilistic forecasting, while requiring fewer parameters and enabling faster training.