Akira Nakagawa

ML
4 papers
14 citations
Novelty 51%
AI Score 40

4 Papers

CL May 28
Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model

Thang Dang, Akira Nakagawa, Kenichi Kobayashi et al.

Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is the optimization of the compression ratio, a critical factor that dictates model performance when byte data is processed in chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach utilizes curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), allowing chunk-size evolution to be tracked throughout the training phase. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks compared to models using fixed compression ratios, while maintaining the inherent robustness and flexibility of byte-level processing.
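
The curriculum idea in the abstract above (moving the target compression ratio from low to high during training) can be illustrated with a minimal sketch. This is not the paper's schedule: the function name `target_compression_ratio`, the linear shape, and the default ratios of 2.0 and 6.0 bytes per chunk are all assumptions chosen for illustration.

```python
def target_compression_ratio(step, total_steps, start_ratio=2.0, end_ratio=6.0):
    """Hypothetical linear curriculum for a target compression ratio.

    Returns the target average number of bytes per chunk at a training step.
    The linear shape and the default ratios are illustrative assumptions,
    not values taken from the ATDC paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the schedule stay in range.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_ratio + progress * (end_ratio - start_ratio)


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, target_compression_ratio(step, total_steps=1000))
```

In a training loop, a value like this would typically feed whatever loss term or controller steers the chunker toward the target ratio; how ATDC does that is not described in the abstract.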

ML Nov 25, 2022
Toward Unlimited Self-Learning MCMC with Parallel Adaptive Annealing

Yuma Ichikawa, Akira Nakagawa, Hiromoto Masayuki et al.

Self-learning Monte Carlo (SLMC) methods were recently proposed to accelerate Markov chain Monte Carlo (MCMC) methods using a machine learning model. With latent generative models, SLMC methods realize efficient Monte Carlo updates with less autocorrelation. However, SLMC methods are difficult to apply directly to multimodal distributions, for which training data are difficult to obtain. To address this limitation, we propose parallel adaptive annealing, which allows SLMC methods to be applied directly to multimodal distributions by gradually training the proposal while annealing the target distribution. Parallel adaptive annealing is based on (i) sequential learning with annealing to inherit and update the model parameters, (ii) adaptive annealing to automatically detect under-learning, and (iii) parallel annealing to mitigate mode collapse of proposal models. We also propose the VAE-SLMC method, which utilizes a variational autoencoder (VAE) as the SLMC proposal to make efficient parallel proposals independent of any previous state, using recently clarified quantitative properties of the VAE. Experiments validate that our method can proficiently obtain accurate samples from multiple multimodal toy distributions and practical multimodal posterior distributions, which is difficult to achieve with existing SLMC methods.
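
As a rough illustration of proposals that are "independent of any previous state", the sketch below runs an independence Metropolis-Hastings sampler on a tempered bimodal target. It is not the VAE-SLMC method: a fixed wide Gaussian stands in for the trained proposal model, and the target, the inverse temperature `beta`, and all function names are assumptions made for this example.

```python
import math
import random


def log_target(x, beta=1.0):
    """Unnormalized log-density of a tempered two-mode Gaussian mixture.

    beta < 1 flattens the target (annealing); beta = 1 is the original target.
    The mixture itself (modes at -3 and +3) is an illustrative assumption.
    """
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)  # log-sum-exp for numerical stability
    return beta * (m + math.log(math.exp(a - m) + math.exp(b - m)))


def log_proposal(x, scale=4.0):
    """Log-density of N(0, scale^2) up to a constant that cancels in the ratio."""
    return -0.5 * (x / scale) ** 2


def independence_mh(n_steps, beta=1.0, scale=4.0, seed=0):
    """Independence Metropolis-Hastings: proposals ignore the current state."""
    rng = random.Random(seed)
    x = 0.0
    samples, accepted = [], 0
    for _ in range(n_steps):
        x_new = rng.gauss(0.0, scale)  # stand-in for a learned proposal
        log_alpha = (log_target(x_new, beta) - log_target(x, beta)
                     + log_proposal(x, scale) - log_proposal(x_new, scale))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = x_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps


if __name__ == "__main__":
    # Anneal from a flattened target (small beta) toward the true one (beta = 1).
    for beta in (0.1, 0.5, 1.0):
        _, rate = independence_mh(2000, beta=beta)
        print(f"beta={beta}: acceptance rate {rate:.2f}")
```

In a self-learning setting, the proposal would be retrained on samples collected at each temperature before moving to the next; a falling acceptance rate is one plausible signal of an under-trained proposal, although the abstract does not say which criterion the authors use.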

ML Jul 30, 2020
Quantitative Understanding of VAE as a Non-linearly Scaled Isometric Embedding

Akira Nakagawa, Keizo Kato, Taiji Suzuki

A variational autoencoder (VAE) estimates the posterior parameters (mean and variance) of the latent variables corresponding to each input data point. While it is used for many tasks, the transparency of the model is still an underlying issue. This paper provides a quantitative understanding of VAE properties through differential geometric and information-theoretic interpretations of the VAE. According to rate-distortion theory, optimal transform coding is achieved by using an orthonormal transform with a PCA basis, where the transform space is isometric to the input. Considering the analogy between transform coding and the VAE, we clarify theoretically and experimentally that the VAE can be mapped to an implicit isometric embedding with a scale factor derived from the posterior parameters. As a result, we can estimate the data probabilities in the input space from the prior, loss metrics, and corresponding posterior parameters; further, the quantitative importance of each latent variable can be evaluated like the eigenvalues of PCA.

LG Oct 10, 2019
Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

Keizo Kato, Jing Zhou, Tomotake Sasaki et al.

To analyze high-dimensional and complex data in the real world, deep generative models, such as the variational autoencoder (VAE), embed data in a low-dimensional space (latent space) and learn a probabilistic model in the latent space. However, they struggle to accurately reproduce the probability distribution function (PDF) in the input space from that in the latent space. If the embedding were isometric, this issue could be solved, because the relation between the PDFs would become tractable. To achieve the isometric property, we propose a Rate-Distortion Optimization guided autoencoder inspired by orthonormal transform coding. We show that our method has the following properties: (i) the Jacobian matrix between the input space and a Euclidean latent space forms a constantly scaled orthonormal system and enables isometric data embedding; (ii) the relation between the PDFs in the two spaces becomes tractable, for example a proportional relation. Furthermore, our method outperforms state-of-the-art methods in unsupervised anomaly detection on four public datasets.