CLMar 4
$V_1$: Unifying Generation and Self-Verification for Parallel ReasonersHarman Singh, Xiuyu Li, Kusha Sareen et al. · berkeley
Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, by methods such as independently sampling and aggregating multiple solutions, results in significantly better task outcomes. However, a critical bottleneck is verification: sampling is only effective if correct solutions can be reliably identified among candidates. While existing approaches typically evaluate candidates independently via scalar scoring, we demonstrate that models are substantially stronger at pairwise self-verification. Leveraging this insight, we introduce $V_1$, a framework that unifies generation and verification through efficient pairwise ranking. $V_1$ comprises two components: $V_1$-Infer, an uncertainty-guided algorithm using a tournament-based ranking that dynamically allocates self-verification compute to candidate pairs whose relative correctness is most uncertain; and $V_1$-PairRL, an RL framework that jointly trains a single model as both generator and pairwise self-verifier, ensuring the verifier adapts to the generator's evolving distribution. On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, $V_1$-Infer improves Pass@1 by up to $10%$ over pointwise verification and outperforms recent test-time scaling methods while being significantly more efficient. Furthermore, $V_1$-PairRL achieves $7$--$9%$ test-time scaling gains over standard RL and pointwise joint training, and improves base Pass@1 by up to 8.7% over standard RL in a code-generation setting.
LGFeb 2, 2023
STEP: Learning N:M Structured Sparsity Masks from Scratch with PreconditionYucheng Lu, Shivani Agrawal, Suvinay Subramanian et al.
Recent innovations on hardware (e.g. Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g. SR-STE) are proposed for non-adaptive optimizers like momentum SGD, while incurring non-trivial accuracy drop for Adam-trained models like attention-based LLMs. In this paper, we first demonstrate such gap origins from poorly estimated second moment (i.e. variance) in Adam states given by the masked weights. We conjecture that learning N:M masks with Adam should take the critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks with two phases: first, STEP calculates a reliable variance estimate (precondition phase) and subsequently, the variance remains fixed and is used as a precondition to learn N:M masks (mask-learning phase). STEP automatically identifies the switching point of two phases by dynamically sampling variance changes over the training trajectory and testing the sample concentration. Empirically, we evaluate STEP and other baselines such as ASP and SR-STE on multiple tasks including CIFAR classification, machine translation and LLM fine-tuning (BERT-Base, GPT-2). We show STEP mitigates the accuracy drop of baseline recipes and is robust to aggressive structured sparsity ratios.
LGMay 22, 2022
GraB: Finding Provably Better Data Permutations than Random ReshufflingYucheng Lu, Wentao Guo, Christopher De Sa
Random reshuffling, which randomly permutes the dataset each epoch, is widely adopted in model training because it yields faster convergence than with-replacement sampling. Recent studies indicate greedily chosen data orderings can further speed up convergence empirically, at the cost of using more computation and memory. However, greedy ordering lacks theoretical justification and has limited utility due to its non-trivial memory and computation overhead. In this paper, we first formulate an example-ordering framework named herding and answer affirmatively that SGD with herding converges at the rate $O(T^{-2/3})$ on smooth, non-convex objectives, faster than the $O(n^{1/3}T^{-2/3})$ obtained by random reshuffling, where $n$ denotes the number of data points and $T$ denotes the total number of iterations. To reduce the memory overhead, we leverage discrepancy minimization theory to propose an online Gradient Balancing algorithm (GraB) that enjoys the same rate as herding, while reducing the memory usage from $O(nd)$ to just $O(d)$ and computation from $O(n^2)$ to $O(n)$, where $d$ denotes the model dimension. We show empirically on applications including MNIST, CIFAR10, WikiText and GLUE that GraB can outperform random reshuffling in terms of both training and validation performance, and even outperform state-of-the-art greedy ordering while reducing memory usage over $100\times$.
CVAug 24, 2022
RZSR: Reference-based Zero-Shot Super-Resolution with Depth Guided Self-ExemplarsJun-Sang Yoo, Dong-Wook Kim, Yucheng Lu et al.
Recent methods for single image super-resolution (SISR) have demonstrated outstanding performance in generating high-resolution (HR) images from low-resolution (LR) images. However, most of these methods show their superiority using synthetically generated LR images, and their generalizability to real-world images is often not satisfactory. In this paper, we pay attention to two well-known strategies developed for robust super-resolution (SR), i.e., reference-based SR (RefSR) and zero-shot SR (ZSSR), and propose an integrated solution, called reference-based zero-shot SR (RZSR). Following the principle of ZSSR, we train an image-specific SR network at test time using training samples extracted only from the input image itself. To advance ZSSR, we obtain reference image patches with rich textures and high-frequency details which are also extracted only from the input image using cross-scale matching. To this end, we construct an internal reference dataset and retrieve reference image patches from the dataset using depth information. Using LR patches and their corresponding HR reference patches, we train a RefSR network that is embodied with a non-local attention module. Experimental results demonstrate the superiority of the proposed RZSR compared to the previous ZSSR methods and robustness to unseen images compared to other fully supervised SISR methods.
IVApr 21, 2023
Cross-domain Denoising for Low-dose Multi-frame Spiral Computed TomographyYucheng Lu, Zhixin Xu, Moon Hyung Choi et al.
Computed tomography (CT) has been used worldwide as a non-invasive test to assist in diagnosis. However, the ionizing nature of X-ray exposure raises concerns about potential health risks such as cancer. The desire for lower radiation doses has driven researchers to improve reconstruction quality. Although previous studies on low-dose computed tomography (LDCT) denoising have demonstrated the effectiveness of learning-based methods, most were developed on the simulated data. However, the real-world scenario differs significantly from the simulation domain, especially when using the multi-slice spiral scanner geometry. This paper proposes a two-stage method for the commercially available multi-slice spiral CT scanners that better exploits the complete reconstruction pipeline for LDCT denoising across different domains. Our approach makes good use of the high redundancy of multi-slice projections and the volumetric reconstructions while leveraging the over-smoothing problem in conventional cascaded frameworks caused by aggressive denoising. The dedicated design also provides a more explicit interpretation of the data flow. Extensive experiments on various datasets showed that the proposed method could remove up to 70\% of noise without compromised spatial resolution, and subjective evaluations by two experienced radiologists further supported its superior performance against state-of-the-art methods in clinical practice.
LGFeb 2, 2023
Coordinating Distributed Example Orders for Provably Accelerated TrainingA. Feder Cooper, Wentao Guo, Khiem Pham et al.
Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings for SGD that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples -- achieving a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale-up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms distributed RR on a variety of benchmark tasks.
CVJul 16, 2024
Exploring connections of spectral analysis and transfer learning in medical imagingYucheng Lu, Dovile Juodelyte, Jonathan D. Victor et al.
In this paper, we use spectral analysis to investigate transfer learning and study model sensitivity to frequency shortcuts in medical imaging. By analyzing the power spectrum density of both pre-trained and fine-tuned model gradients, as well as artificially generated frequency shortcuts, we observe notable differences in learning priorities between models pre-trained on natural vs medical images, which generally persist during fine-tuning. We find that when a model's learning priority aligns with the power spectrum density of an artifact, it results in overfitting to that artifact. Based on these observations, we show that source data editing can alter the model's resistance to shortcut learning.
CVMar 7, 2024Code
Source Matters: Source Dataset Impact on Model Robustness in Medical ImagingDovile Juodelyte, Yucheng Lu, Amelia Jiménez-Sánchez et al.
Transfer learning has become an essential part of medical imaging classification algorithms, often leveraging ImageNet weights. The domain shift from natural to medical images has prompted alternatives such as RadImageNet, often showing comparable classification performance. However, it remains unclear whether the performance gains from transfer learning stem from improved generalization or shortcut learning. To address this, we conceptualize confounders by introducing the Medical Imaging Contextualized Confounder Taxonomy (MICCAT) and investigate a range of confounders across it -- whether synthetic or sampled from the data -- using two public chest X-ray and CT datasets. We show that ImageNet and RadImageNet achieve comparable classification performance, yet ImageNet is much more prone to overfitting to confounders. We recommend that researchers using ImageNet-pretrained models reexamine their model robustness by conducting similar experiments. Our code and experiments are available at https://github.com/DovileDo/source-matters.
MAApr 2
SimCity: Multi-Agent Urban Development Simulation with Rich InteractionsYeqi Feng, Yucheng Lu, Hongyu Su et al.
Large Language Models (LLMs) open new possibilities for constructing realistic and interpretable macroeconomic simulations. We present SimCity, a multi-agent framework that leverages LLMs to model an interpretable macroeconomic system with heterogeneous agents and rich interactions. Unlike classical equilibrium models that limit heterogeneity for tractability, or traditional agent-based models (ABMs) that rely on hand-crafted decision rules, SimCity enables flexible, adaptive behavior with transparent natural-language reasoning. Within SimCity, four core agent types (households, firms, a central bank, and a government) deliberate and participate in a frictional labor market, a heterogeneous goods market, and a financial market. Furthermore, a Vision-Language Model (VLM) determines the geographic placement of new firms and renders a mapped virtual city, allowing us to study both macroeconomic regularities and urban expansion dynamics within a unified environment. To evaluate the framework, we compile a checklist of canonical macroeconomic phenomena, including price elasticity of demand, Engel's Law, Okun's Law, the Phillips Curve, and the Beveridge Curve, and show that SimCity naturally reproduces these empirical patterns while remaining robust across simulation runs.
CVDec 28, 2024Code
On dataset transferability in medical image classificationDovile Juodelyte, Enzo Ferrante, Yucheng Lu et al.
Current transferability estimation methods designed for natural image datasets are often suboptimal in medical image classification. These methods primarily focus on estimating the suitability of pre-trained source model features for a target dataset, which can lead to unrealistic predictions, such as suggesting that the target dataset is the best source for itself. To address this, we propose a novel transferability metric that combines feature quality with gradients to evaluate both the suitability and adaptability of source model features for target tasks. We evaluate our approach in two new scenarios: source dataset transferability for medical image classification and cross-domain transferability. Our results show that our method outperforms existing transferability metrics in both settings. We also provide insight into the factors influencing transfer performance in medical image classification, as well as the dynamics of cross-domain transfer from natural to medical images. Additionally, we provide ground-truth transfer performance benchmarking results to encourage further research into transferability estimation for medical image classification. Our code and experiments are available at https://github.com/DovileDo/transferability-in-medical-imaging.
GRJun 27, 2025
A Design Space for Visualization Transitions of 3D Spatial Data in Hybrid AR-Desktop EnvironmentsYucheng Lu, Tobias Rau, Benjamin Lee et al.
We present a design space for animated transitions of the appearance of 3D spatial datasets in a hybrid Augmented Reality (AR)-desktop context. Such hybrid interfaces combine both traditional and immersive displays to facilitate the exploration of 2D and 3D data representations in the environment in which they are best displayed. One key aspect is to introduce transitional animations that change between the different dimensionalities to illustrate the connection between the different representations and to reduce the potential cognitive load on the user. The specific transitions to be used depend on the type of data, the needs of the application domain, and other factors. We summarize these as a transition design space to simplify the decision-making process and provide inspiration for future designs. First, we discuss 3D visualizations from a spatial perspective: a spatial encoding pipeline, where 3D data sampled from the physical world goes through various transformations, being mapped to visual representations, and then being integrated into a hybrid AR-desktop environment. The transition design then focuses on interpolating between two spatial encoding pipelines to provide a smooth experience. To illustrate the use of our design space, we apply it to three case studies that focus on applications in astronomy, radiology, and chemistry; we then discuss lessons learned from these applications.
LGMay 12
Multi-Token Residual PredictionYufeng Xu, Zishuo Bao, Qian Wang et al.
Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward at a fraction of the cost. We deploy MRP in two inference modes: direct decoding, which uses the corrected logits without verification for a tunable quality--speed tradeoff; and speculative decoding, which verifies MRP's proposals against the backbone for lossless acceleration. Experiments on SDAR models at the 1.7B, 4B, and 8B scales across reasoning and code generation benchmarks demonstrate up to $1.42\times$ lossless speedup in SGLang.
CLFeb 1
Distilling Token-Trained Models into Byte-Level ModelsZishuo Bao, Jiaqi Leng, Junxiong Wang et al.
Byte Language Models (BLMs) have emerged as a promising direction for scaling language models beyond tokenization. However, existing BLMs typically require training from scratch on trillions of bytes, making them prohibitively expensive. In this paper, we propose an efficient distillation recipe that converts existing token-trained LLMs into BLMs while retaining comparable capabilities. Our recipe follows a two-stage curriculum: (1) Progressive Knowledge Distillation, which aligns byte-level representations with the embeddings of the token-trained teacher model; and (2) Byte-Level Supervised Fine-Tuning, which enables end-to-end generation entirely in the byte space. We validate our approach across multiple model families, including Llama, Qwen, and OLMo, and demonstrate that the distilled BLMs retain most of the teacher models' performance using only approximately 125B bytes.
CLOct 20, 2025
Understanding and Improving Length Generalization in Hierarchical Sparse Attention ModelsJiaqi Leng, Xiang Hu, Junxiong Wang et al.
Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
CVOct 5, 2025
MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene GenerationZhenyu Pan, Yucheng Lu, Han Liu
We present MetaFind, a scene-aware tri-modal compositional retrieval framework designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories. MetaFind addresses two core challenges: (i) inconsistent asset retrieval that overlooks spatial, semantic, and stylistic constraints, and (ii) the absence of a standardized retrieval paradigm specifically tailored for 3D asset retrieval, as existing approaches mainly rely on general-purpose 3D shape representation models. Our key innovation is a flexible retrieval mechanism that supports arbitrary combinations of text, image, and 3D modalities as queries, enhancing spatial reasoning and style consistency by jointly modeling object-level features (including appearance) and scene-level layout structures. Methodologically, MetaFind introduces a plug-and-play equivariant layout encoder ESSGNN that captures spatial relationships and object appearance features, ensuring retrieved 3D assets are contextually and stylistically coherent with the existing scene, regardless of coordinate frame transformations. The framework supports iterative scene construction by continuously adapting retrieval results to current scene updates. Empirical evaluations demonstrate the improved spatial and stylistic consistency of MetaFind in various retrieval tasks compared to baseline methods.
CVOct 1, 2025
Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image ClassificationYucheng Lu, Hubert Dariusz Zając, Veronika Cheplygina et al.
Transfer learning is crucial for medical imaging, yet the selection of source datasets - which can impact the generalizability of algorithms, and thus patient outcomes - often relies on researchers' intuition rather than systematic principles. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-centered HCI perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data embedding), or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional "more similar is better" view. Participants often used ambiguous terminology, which suggests a need for clearer definitions and HCI tools to make them explicit and usable. By clarifying these heuristics, this work provides practical insights for more systematic source selection in transfer learning.
CLApr 21, 2025
Feeding LLM Annotations to BERT Classifiers at Your Own RiskYucheng Lu, Kazimier Smith
Using LLM-generated labels to fine-tune smaller encoder-only models for text classification has gained popularity in various settings. While this approach may be justified in simple and low-stakes applications, we conduct empirical analysis to demonstrate how the perennial curse of training on synthetic data manifests itself in this specific setup. Compared to models trained on gold labels, we observe not only the expected performance degradation in accuracy and F1 score, but also increased instability across training runs and premature performance plateaus. These findings cast doubts on the reliability of such approaches in real-world applications. We contextualize the observed phenomena through the lens of error propagation and offer several practical mitigation strategies, including entropy-based filtering and ensemble techniques. Although these heuristics offer partial relief, they do not fully resolve the inherent risks of propagating non-random errors from LLM annotations to smaller classifiers, underscoring the need for caution when applying this workflow in high-stakes text classification tasks.
CVApr 14, 2025
Learning to Harmonize Cross-vendor X-ray Images by Non-linear Image Dynamics CorrectionYucheng Lu, Shunxin Wang, Dovile Juodelyte et al.
In this paper, we explore how conventional image enhancement can improve model robustness in medical image analysis. By applying commonly used normalization methods to images from various vendors and studying their influence on model generalization in transfer learning, we show that the nonlinear characteristics of domain-specific image dynamics cannot be addressed by simple linear transforms. To tackle this issue, we reformulate the image harmonization task as an exposure correction problem and propose a method termed Global Deep Curve Estimation (GDCE) to reduce domain-specific exposure mismatch. GDCE performs enhancement via a pre-defined polynomial function and is trained with a "domain discriminator", aiming to improve model transparency in downstream tasks compared to existing black-box methods.
LGFeb 12, 2022
Maximizing Communication Efficiency for Large-scale Training via 0/1 AdamYucheng Lu, Conglong Li, Minjia Zhang et al.
1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question on Adam-based large model pre-training (e.g. BERT and GPT). In this paper, we demonstrate the non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are individually applied. To alleviate this limitation, we propose 0/1 Adam that linearizes each Adam step via approximating its optimizer states using their stale estimates and linear correlation. 0/1 Adam performs an Adam-like step to preserve the adaptivity, while its linearity allows utilizing 1-bit compression and local steps simultaneously for wall-clock time speed up. We provide convergence guarantee for 0/1 Adam on smooth non-convex objectives. On various large-scale benchmarks such as BERT-Base, BERT-Large, GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam is able to reduce up to 87% of data volume, 54% of communication rounds, and achieve up to 2$\times$ higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam; while enjoying the same statistical convergence speed and end task model accuracy on GLUE dataset and ImageNet validation set.
IVJun 28, 2021
Progressive Joint Low-light Enhancement and Noise Removal for Raw ImagesYucheng Lu, Seung-Won Jung
Low-light imaging on mobile devices is typically challenging due to insufficient incident light coming through the relatively small aperture, resulting in a low signal-to-noise ratio. Most of the previous works on low-light image processing focus either only on a single task such as illumination adjustment, color enhancement, or noise removal; or on a joint illumination adjustment and denoising task that heavily relies on short-long exposure image pairs collected from specific camera models, and thus these approaches are less practical and generalizable in real-world settings where camera-specific joint enhancement and restoration is required. To tackle this problem, in this paper, we propose a low-light image processing framework that performs joint illumination adjustment, color enhancement, and denoising. Considering the difficulty in model-specific data collection and the ultra-high definition of the captured images, we design two branches: a coefficient estimation branch as well as a joint enhancement and denoising branch. The coefficient estimation branch works in a low-resolution space and predicts the coefficients for enhancement via bilateral learning, whereas the joint enhancement and denoising branch works in a full-resolution space and progressively performs joint enhancement and denoising. In contrast to existing methods, our framework does not need to recollect massive data when being adapted to another camera model, which significantly reduces the efforts required to fine-tune our approach for practical usage. Through extensive experiments, we demonstrate its great potential in real-world low-light imaging applications when compared with current state-of-the-art methods.
LGMar 2, 2021
Variance Reduced Training with Stratified Sampling for Forecasting ModelsYucheng Lu, Youngsuk Park, Lifan Chen et al.
In large-scale time series forecasting, one often encounters the situation where the temporal patterns of time series, while drifting over time, differ from one another in the same dataset. In this paper, we provably show under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g. SGD) potentially suffers large variance on gradient estimation, and thus incurs long-time training. We show that this issue can be efficiently alleviated via stratification, which allows the optimizer to sample from pre-grouped time series strata. For better trading-off gradient variance and computation complexity, we further propose SCott (Stochastic Stratified Control Variate Gradient Descent), a variance reduced SGD-style optimizer that utilizes stratified sampling via control variate. In theory, we provide the convergence guarantee of SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and demonstrate SCott converges faster with respect to both iterations and wall clock time.
LGFeb 5, 2021
Hyperparameter Optimization Is Deceiving Us, and How to Stop ItA. Feder Cooper, Yucheng Lu, Jessica Zosa Forde et al.
Recent empirical work shows that inconsistent results based on choice of hyperparameter optimization (HPO) configuration are a widespread problem in ML research. When comparing two algorithms J and K searching one subspace can yield the conclusion that J outperforms K, whereas searching another can entail the opposite. In short, the way we choose hyperparameters can deceive us. We provide a theoretical complement to this prior work, arguing that, to avoid such deception, the process of drawing conclusions from HPO should be made more rigorous. We call this process epistemic hyperparameter optimization (EHPO), and put forth a logical framework to capture its semantics and how it can lead to inconsistent conclusions about performance. Our framework enables us to prove EHPO methods that are guaranteed to be defended against deception, given bounded compute time budget t. We demonstrate our framework's utility by proving and empirically validating a defended variant of random search.
LGJun 15, 2020
Optimal Complexity in Decentralized TrainingYucheng Lu, Christopher De Sa
Decentralization is a promising method of scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction this lower bound is tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithm gap. Empirically, we compare DeTAG with other decentralized algorithms on image classification tasks, and we show DeTAG enjoys faster convergence compared to baselines, especially on unshuffled data and in sparse networks.
LGMay 14, 2020
MixML: A Unified Analysis of Weakly Consistent Parallel LearningYucheng Lu, Jack Nash, Christopher De Sa
Parallelism is a ubiquitous method for accelerating machine learning algorithms. However, theoretical analysis of parallel learning is usually done in an algorithm- and protocol-specific setting, giving little insight about how changes in the structure of communication could affect convergence. In this paper we propose MixML, a general framework for analyzing convergence of weakly consistent parallel machine learning. Our framework includes: (1) a unified way of modeling the communication process among parallel workers; (2) a new parameter, the mixing time tmix, that quantifies how the communication process affects convergence; and (3) a principled way of converting a convergence proof for a sequential algorithm into one for a parallel version that depends only on tmix. We show MixML recovers and improves on known convergence bounds for asynchronous and/or decentralized versions of many algorithms, includingSGD and AMSGrad. Our experiments substantiate the theory and show the dependency of convergence on the underlying mixing time.
LGFeb 26, 2020
Moniqua: Modulo Quantized Communication in Decentralized SGDYucheng Lu, Christopher De Sa
Running Stochastic Gradient Descent (SGD) in a decentralized fashion has shown promising results. In this paper we propose Moniqua, a technique that allows decentralized SGD to use quantized communication. We prove in theory that Moniqua communicates a provably bounded number of bits per iteration, while converging at the same asymptotic rate as the original algorithm does with full-precision communication. Moniqua improves upon prior works in that it (1) requires zero additional memory, (2) works with 1-bit quantization, and (3) is applicable to a variety of decentralized algorithms. We demonstrate empirically that Moniqua converges faster with respect to wall clock time than other quantized decentralized algorithms. We also show that Moniqua is robust to very low bit-budgets, allowing 1-bit-per-parameter communication without compromising validation accuracy when training ResNet20 and ResNet110 on CIFAR10.