CL | May 28
Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model
Thang Dang, Akira Nakagawa, Kenichi Kobayashi et al.
Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is optimizing the compression ratio, a critical factor that dictates model performance when byte data is processed in chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach uses curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), which allows chunk-size evolution to be tracked throughout training. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks compared to models using fixed compression ratios, while maintaining the inherent robustness and flexibility of byte-level processing.
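The curriculum idea in this abstract (moving the target compression ratio from low to high during training) can be illustrated with a minimal sketch. The abstract does not give the schedule's shape or values, so the linear ramp, the function name `target_compression_ratio`, and the default ratios 2.0 and 6.0 below are all hypothetical, chosen only for illustration.

```python
def target_compression_ratio(step, total_steps, start_ratio=2.0, end_ratio=6.0):
    """Hypothetical linear curriculum: ramp the target ratio from start to end.

    The linear shape and the default ratios are assumptions for illustration;
    they are not taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the ramp keep the boundary value.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_ratio + (end_ratio - start_ratio) * progress


# Sample the schedule at a few points of a 1000-step ramp.
schedule = [target_compression_ratio(s, 1000) for s in (0, 250, 500, 750, 1000)]
```

A schedule like this would be queried once per training step and passed to whatever mechanism controls chunk boundaries; how ATDC actually consumes the target is not described in the abstract.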
SE | Jun 10, 2013 | Code
Feature-Gathering Dependency-Based Software Clustering Using Dedication and Modularity
Kenichi Kobayashi, Manabu Kamimura, Koki Kato et al.
Software clustering is one of the important techniques for comprehending software systems. However, the techniques presented to date require human interaction to refine clustering results. In this paper, we propose a novel dependency-based software clustering algorithm, SArF. SArF has two characteristics. First, SArF eliminates the need for the omnipresent-module-removing step, which requires human interaction. Second, the objective of SArF is to gather relevant software features or functionalities into a cluster. To achieve these, we defined the Dedication score to infer the importance of dependencies and used Modularity Maximization to cluster weighted directed graphs. Two case studies and extensive comparative evaluations using open source and industrial systems show that SArF could successfully decompose the systems in a way that fits the authoritative decompositions from a feature viewpoint without any tailored setup, and that SArF was superior to existing dependency-based software clustering studies. In addition, the case studies show that there exist measurable authoritativeness limits and that SArF nearly reached the limits.
SE | Jun 5, 2013 | Code
SArF Map: Visualizing Software Architecture from Feature and Layer Viewpoints
Kenichi Kobayashi, Manabu Kamimura, Keisuke Yano et al.
To facilitate understanding of the architecture of a software system, we developed the SArF Map technique, which visualizes software architecture from feature and layer viewpoints using a city metaphor. SArF Map visualizes implicit software features using our previous work, the SArF dependency-based software clustering algorithm. Since features are high-level abstraction units of software, a generated map can be used directly for high-level decision making such as reuse, and also for communication between developers and non-developer stakeholders. In SArF Map, each feature is visualized as a city block, and the classes in the feature are laid out as buildings reflecting their software layer. Relevance between features is represented as streets. Dependency links are visualized lucidly. Through open source and industrial case studies, we show that the architecture of the target systems can be easily overviewed and that the quality of their packaging designs can be quickly assessed.
LG | Dec 6, 2024
Direct Quantized Training of Language Models with Stochastic Rounding
Kaiyan Zhao, Tsuguchika Tabaru, Kenichi Kobayashi et al.
Although recent quantized Large Language Models (LLMs), such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weights required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore directly updating the quantized low-precision weights without relying on straight-through estimation during backpropagation, aiming to save memory usage during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models of various sizes indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values; (2) extending the bit width to 8 bits achieves performance on par with BitNet b1.58; (3) our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support inference using ternary weights, showcasing their flexibility in deployment.
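Stochastic rounding, the technique this abstract relies on, can be sketched in a few lines. This is a generic illustration of the rounding rule (round up with probability equal to the fractional part, so the result is unbiased in expectation), not the authors' training code. The helper names `stochastic_round` and `quantize_ternary` and the `scale` parameter are hypothetical.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x to floor(x) or floor(x) + 1, rounding up with probability frac(x)."""
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if rng.random() < frac else 0)


def quantize_ternary(weights, scale, rng):
    """Hypothetical helper: stochastically round w / scale, then clip to {-1, 0, 1}."""
    quantized = []
    for w in weights:
        q = stochastic_round(w / scale, rng)
        quantized.append(max(-1, min(1, q)))
    return quantized


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [stochastic_round(0.3, rng) for _ in range(10000)]
mean = sum(samples) / len(samples)  # close to 0.3 in expectation
ternary = quantize_ternary([0.4, -0.7, 2.5, 0.0], scale=1.0, rng=rng)
```

Because the expected value of the rounded result equals the input, small updates that deterministic rounding would discard still shift the weights on average, which is the usual motivation for using stochastic rounding in low-bit training.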
LG | Dec 25, 2021
Neural Network Module Decomposition and Recomposition
Hiroaki Kingetsu, Kenichi Kobayashi, Taiji Suzuki
We propose a modularization method that decomposes a deep neural network (DNN) into small modules from a functionality perspective and recomposes them into a new model for some other task. Decomposed modules are expected to have the advantages of interpretability and verifiability due to their small size. In contrast to existing studies based on reusing models that involve retraining, such as a transfer learning model, the proposed method does not require retraining and has wide applicability as it can be easily combined with existing functional modules. The proposed method extracts modules using weight masks and can be applied to arbitrary DNNs. Unlike existing studies, it requires no assumption about the network architecture. To extract modules, we designed a learning method and a loss function to maximize shared weights among modules. As a result, the extracted modules can be recomposed without a large increase in the size. We demonstrate that the proposed method can decompose and recompose DNNs with high compression ratio and high accuracy and is superior to the existing method through sharing weights between modules.