CLApr 10, 2025
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement LearningByteDance Seed, Jiaze Chen, Tiantian Fan et al. · bytedance
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
AIJun 7, 2023Code
MobileNMT: Enabling Translation in 15MB and 30msYe Lin, Xiaohui Wang, Zhexi Zhang et al.
Deploying NMT models on mobile devices is essential for privacy, low latency, and offline scenarios. For high model capacity, NMT models are rather large. Running these models on devices is challenging with limited storage, memory, computation, and power consumption. Existing work either only focuses on a single metric such as FLOPs or general engine which is not good at auto-regressive decoding. In this paper, we present MobileNMT, a system that can translate in 15MB and 30ms on devices. We propose a series of principles for model compression when combined with quantization. Further, we implement an engine that is friendly to INT8 and decoding. With the co-design of model and engine, compared with the existing system, we speed up 47.0x and save 99.5% of memory with only 11.6% loss of BLEU. The code is publicly available at https://github.com/zjersey/Lightseq-ARM.
LGJun 15, 2023
Understanding Parameter Sharing in TransformersYe Lin, Mingxuan Wang, Zhexi Zhang et al.
Parameter sharing has proven to be a parameter-efficient approach. Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth. In this paper, we study why this approach works from two perspectives. First, increasing model depth makes the model more complex, and we hypothesize that the reason is related to model complexity (referring to FLOPs). Secondly, since each shared parameter will participate in the network computation several times in forward propagation, its corresponding gradient will have a different range of values from the original model, which will affect the model convergence. Based on this, we hypothesize that training convergence may also be one of the reasons. Through further analysis, we show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity. Inspired by this, we tune the training hyperparameters related to model convergence in a targeted manner. Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
CVFeb 12
JEPA-VLA: Video Predictive Embedding is Needed for VLA ModelsShangchen Miao, Ningya Feng, Jialong Wu et al.
Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
DCJan 7
A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP SystemsQi Wu, Chao Fang, Jiayuan Chen et al.
Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition and compute large expert parameters across multiple NDP units simultaneously towards edge low-batch scenarios. Second, a load-balancing-aware scheduling algorithm distributes expert computations across NDP units and GPU to maximize resource utilization. Third, a dataset-free pre-fetching strategy proactively loads frequently accessed experts to minimize activation delays. Experimental results show that our framework enables GPU-NDP systems to achieve 2.41x on average and up to 2.56x speedup in end-to-end latency compared to state-of-the-art approaches, significantly enhancing MoE inference efficiency in resource-constrained environments.
CLSep 16, 2021Code
The NiuTrans System for WNGT 2020 Efficiency TaskChi Hu, Bei Li, Ye Lin et al.
This paper describes the submissions of the NiuTrans Team to the WNGT 2020 Efficiency Shared Task. We focus on the efficient implementation of deep Transformer models \cite{wang-etal-2019-learning, li-etal-2019-niutrans} using NiuTensor (https://github.com/NiuTrans/NiuTensor), a flexible toolkit for NLP tasks. We explored the combination of deep encoder and shallow decoder in Transformer models via model compression and knowledge distillation. The neural machine translation decoding also benefits from FP16 inference, attention caching, dynamic batching, and batch pruning. Our systems achieve promising results in both translation quality and efficiency, e.g., our fastest system can translate more than 40,000 tokens per second with an RTX 2080 Ti while maintaining 42.9 BLEU on \textit{newstest2018}. The code, models, and docker images are available at NiuTrans.NMT (https://github.com/NiuTrans/NiuTrans.NMT).
CLSep 16, 2021Code
The NiuTrans System for the WMT21 Efficiency TaskChenglong Wang, Chi Hu, Yongyu Mu et al.
This paper describes the NiuTrans system for the WMT21 translation efficiency task (http://statmt.org/wmt21/efficiency-task.html). Following last year's work, we explore various techniques to improve efficiency while maintaining translation quality. We investigate the combinations of lightweight Transformer architectures and knowledge distillation strategies. Also, we improve the translation efficiency with graph optimization, low precision, dynamic batching, and parallel pre/post-processing. Our system can translate 247,000 words per second on an NVIDIA A100, being 3$\times$ faster than last year's system. Our system is the fastest and has the lowest memory consumption on the GPU-throughput track. The code, model, and pipeline will be available at NiuTrans.NMT (https://github.com/NiuTrans/NiuTrans.NMT).
LGSep 9, 2021Code
Bag of Tricks for Optimizing Transformer EfficiencyYe Lin, Yanyang Li, Tong Xiao et al.
Improving Transformer efficiency has become increasingly attractive recently. A wide range of methods has been proposed, e.g., pruning, quantization, new architectures and etc. But these methods are either sophisticated in implementation or dependent on hardware. In this paper, we show that the efficiency of Transformer can be improved by combining some simple and hardware-agnostic methods, including tuning hyper-parameters, better design choices and training strategies. On the WMT news translation tasks, we improve the inference efficiency of a strong Transformer system by 3.80X on CPU and 2.52X on GPU. The code is publicly available at https://github.com/Lollipop321/mini-decoder-network.
NAJul 4, 2024
Green Multigrid NetworkYe Lin, Young Ju Lee, Jiwei Jia
GreenLearning networks (GL) directly learn Green's function in physical space, making them an interpretable model for capturing unknown solution operators of partial differential equations (PDEs). For many PDEs, the corresponding Green's function exhibits asymptotic smoothness. In this paper, we propose a framework named Green Multigrid networks (GreenMGNet), an operator learning algorithm designed for a class of asymptotically smooth Green's functions. Compared with the pioneering GL, the new framework presents itself with better accuracy and efficiency, thereby achieving a significant improvement. GreenMGNet is composed of two technical novelties. First, Green's function is modeled as a piecewise function to take into account its singular behavior in some parts of the hyperplane. Such piecewise function is then approximated by a neural network with augmented output(AugNN) so that it can capture singularity accurately. Second, the asymptotic smoothness property of Green's function is used to leverage the Multi-Level Multi-Integration (MLMI) algorithm for both the training and inference stages. Several test cases of operator learning are presented to demonstrate the accuracy and effectiveness of the proposed method. On average, GreenMGNet achieves $3.8\%$ to $39.15\%$ accuracy improvement. To match the accuracy level of GL, GreenMGNet requires only about $10\%$ of the full grid data, resulting in a $55.9\%$ and $92.5\%$ reduction in training time and GPU memory cost for one-dimensional test problems, and a $37.7\%$ and $62.5\%$ reduction for two-dimensional test problems.
62.1NAMar 27
Boundary neuron method for solving partial differential equationsYe Lin, Wentao Liu, Young Ju Lee et al.
We propose a boundary neuron method with random features (BNM-RF) for solving partial differential equations. The method approximates the unknown boundary function by a shallow network within the boundary integral formulation. With randomly sampled and fixed hidden parameters, the computation reduces to a linear least squares problem for the output coefficients, which avoids gradient based nonconvex optimization. This construction retains the dimensionality reduction of boundary integral equations and the linear solution structure of the random feature method. For elliptic problems, we establish convergence analysis by combining kernel-based method with random feature approximation, and obtain error bounds on both the boundary and the interior solution. Numerical experiments on Laplace and Helmholtz problems, including interior and exterior cases, show that the proposed method achieves competitive accuracy relative to the boundary element method and favorable performance relative to boundary integral neural networks in the tested settings with only few neurons. Overall, the proposed method provides a practical framework for combining boundary integral equations with neural network for problems on complex geometries and unbounded domains.
CVDec 12, 2025
RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly DetectionRongcheng Wu, Hao Zhu, Shiying Zhang et al.
Unsupervised industrial anomaly detection requires accurately identifying defects without labeled data. Traditional autoencoder-based methods often struggle with incomplete anomaly suppression and loss of fine details, as their single-pass decoding fails to effectively handle anomalies with varying severity and scale. We propose a recursive architecture for autoencoder (RcAE), which performs reconstruction iteratively to progressively suppress anomalies while refining normal structures. Unlike traditional single-pass models, this recursive design naturally produces a sequence of reconstructions, progressively exposing suppressed abnormal patterns. To leverage this reconstruction dynamics, we introduce a Cross Recursion Detection (CRD) module that tracks inconsistencies across recursion steps, enhancing detection of both subtle and large-scale anomalies. Additionally, we incorporate a Detail Preservation Network (DPN) to recover high-frequency textures typically lost during reconstruction. Extensive experiments demonstrate that our method significantly outperforms existing non-diffusion methods, and achieves performance on par with recent diffusion models with only 10% of their parameters and offering substantially faster inference. These results highlight the practicality and efficiency of our approach for real-world applications.
NAJan 6, 2025
Orthogonal greedy algorithm for linear operator learning with shallow neural networkYe Lin, Jiwei Jia, Young Ju Lee et al.
Greedy algorithms, particularly the orthogonal greedy algorithm (OGA), have proven effective in training shallow neural networks for fitting functions and solving partial differential equations (PDEs). In this paper, we extend the application of OGA to the tasks of linear operator learning, which is equivalent to learning the kernel function through integral transforms. Firstly, a novel greedy algorithm is developed for kernel estimation rate in a new semi-inner product, which can be utilized to approximate the Green's function of linear PDEs from data. Secondly, we introduce the OGA for point-wise kernel estimation to further improve the approximation rate, achieving orders of accuracy improvement across various tasks and baseline models. In addition, we provide a theoretical analysis on the kernel estimation problem and the optimal approximation rates for both algorithms, establishing their efficacy and potential for future applications in PDEs and operator learning tasks.
CLMay 10, 2023
Multi-Path Transformer is Better: A Case Study on Neural Machine TranslationYe Lin, Shuhan Zhou, Yanyang Li et al.
For years the model performance in machine learning obeyed a power-law relationship with the model size. For the consideration of parameter efficiency, recent studies focus on increasing model depth rather than width to achieve better performance. In this paper, we study how model width affects the Transformer model through a parameter-efficient multi-path structure. To better fuse features extracted from different paths, we add three additional operations to each sublayer: a normalization at the end of each path, a cheap operation to produce more features, and a learnable weighted mechanism to fuse all features flexibly. Extensive experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model. It reveals that we should pay more attention to the multi-path structure, and there should be a balance between the model depth and width to train a better large-scale Transformer.
CLJan 3, 2021
An Efficient Transformer Decoder with Compressed Sub-layersYanyang Li, Ye Lin, Tong Xiao et al.
The large attention-based encoder-decoder network (Transformer) has become prevailing recently due to its effectiveness. But the high computation complexity of its decoder raises the inefficiency issue. By examining the mathematic formulation of the decoder, we show that under some mild conditions, the architecture could be simplified by compressing its sub-layers, the basic building block of Transformer, and achieves a higher parallelism. We thereby propose Compressed Attention Network, whose decoder layer consists of only one sub-layer instead of three. Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42x faster with performance on par with a strong baseline. This strong baseline is already 2x faster than the widely used standard baseline without loss in performance.
CLNov 30, 2020
A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary InductionYanyang Li, Yingfeng Luo, Ye Lin et al.
Unsupervised Bilingual Dictionary Induction methods based on the initialization and the self-learning have achieved great success in similar language pairs, e.g., English-Spanish. But they still fail and have an accuracy of 0% in many distant language pairs, e.g., English-Japanese. In this work, we show that this failure results from the gap between the actual initialization performance and the minimum initialization performance for the self-learning to succeed. We propose Iterative Dimension Reduction to bridge this gap. Our experiments show that this simple method does not hamper the performance of similar language pairs and achieves an accuracy of 13.64~55.53% between English and four distant languages, i.e., Chinese, Japanese, Vietnamese and Thai.
CLSep 19, 2020
Weight Distillation: Transferring the Knowledge in Neural Network ParametersYe Lin, Yanyang Li, Ziyang Wang et al.
Knowledge distillation has been proven to be effective in model acceleration and compression. It allows a small network to learn to generalize in the same way as a large network. Recent successes in pre-training suggest the effectiveness of transferring model parameters. Inspired by this, we investigate methods of model acceleration and compression in another line of research. We propose Weight Distillation to transfer the knowledge in the large network parameters through a parameter generator. Our experiments on WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks show that weight distillation can train a small network that is 1.88~2.94x faster than the large network but with competitive performance. With the same sized small network, weight distillation can outperform knowledge distillation by 0.51~1.82 BLEU points.
CLSep 17, 2020
Towards Fully 8-bit Integer Inference for the Transformer ModelYe Lin, Yanyang Li, Tengbo Liu et al.
8-bit integer inference, as a promising direction in reducing both the latency and storage of deep neural networks, has made great progress recently. On the other hand, previous systems still rely on 32-bit floating point for certain functions in complex models (e.g., Softmax in Transformer), and make heavy use of quantization and de-quantization. In this work, we show that after a principled modification on the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm Scale Propagation could be derived. De-quantization is adopted when necessary, which makes the network more efficient. Our experiments on WMT16 En<->Ro, WMT14 En<->De and En->Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves comparable performance with the floating point baseline but requires nearly 4x less memory footprint.
LGMay 27, 2020
General-Purpose User Embeddings based on Mobile App UsageJunqi Zhang, Bing Bai, Ye Lin et al.
In this paper, we report our recent practice at Tencent for user modeling based on mobile app usage. User behaviors on mobile app usage, including retention, installation, and uninstallation, can be a good indicator for both long-term and short-term interests of users. For example, if a user installs Snapseed recently, she might have a growing interest in photographing. Such information is valuable for numerous downstream applications, including advertising, recommendations, etc. Traditionally, user modeling from mobile app usage heavily relies on handcrafted feature engineering, which requires onerous human work for different downstream applications, and could be sub-optimal without domain experts. However, automatic user modeling based on mobile app usage faces unique challenges, including (1) retention, installation, and uninstallation are heterogeneous but need to be modeled collectively, (2) user behaviors are distributed unevenly over time, and (3) many long-tailed apps suffer from serious sparsity. In this paper, we present a tailored AutoEncoder-coupled Transformer Network (AETN), by which we overcome these challenges and achieve the goals of reducing manual efforts and boosting performance. We have deployed the model at Tencent, and both online/offline experiments from multiple domains of downstream applications have demonstrated the effectiveness of the output user embeddings.
IRApr 7, 2020
CSRN: Collaborative Sequential Recommendation Networks for News RetrievalBing Bai, Guanhua Zhang, Ye Lin et al.
Nowadays, news apps have taken over the popularity of paper-based media, providing a great opportunity for personalization. Recurrent Neural Network (RNN)-based sequential recommendation is a popular approach that utilizes users' recent browsing history to predict future items. This approach is limited that it does not consider the societal influences of news consumption, i.e., users may follow popular topics that are constantly changing, while certain hot topics might be spreading only among specific groups of people. Such societal impact is difficult to predict given only users' own reading histories. On the other hand, the traditional User-based Collaborative Filtering (UserCF) makes recommendations based on the interests of the "neighbors", which provides the possibility to supplement the weaknesses of RNN-based methods. However, conventional UserCF only uses a single similarity metric to model the relationships between users, which is too coarse-grained and thus limits the performance. In this paper, we propose a framework of deep neural networks to integrate the RNN-based sequential recommendations and the key ideas from UserCF, to develop Collaborative Sequential Recommendation Networks (CSRNs). Firstly, we build a directed co-reading network of users, to capture the fine-grained topic-specific similarities between users in a vector space. Then, the CSRN model encodes users with RNNs, and learns to attend to neighbors and summarize what news they are reading at the moment. Finally, news articles are recommended according to both the user's own state and the summarized state of the neighbors. Experiments on two public datasets show that the proposed model outperforms the state-of-the-art approaches significantly.
CVNov 28, 2019
Unsupervised Many-to-Many Image-to-Image Translation Across Multiple DomainsYe Lin, Keren Fu, Shenggui Ling et al.
Unsupervised multi-domain image-to-image translation aims to synthesis images among multiple domains without labeled data, which is more general and complicated than one-to-one image mapping. However, existing methods mainly focus on reducing the large costs of modeling and do not pay enough attention to the quality of generated images. In some target domains, their translation results may not be expected or even it has model collapse. To improve the image quality, we propose an effective many-to-many mapping framework for unsupervised multi-domain image-to-image translation. There are two key aspects in our method. The first is a proposed many-to-many architecture with only one domain-shared encoder and several domain-specialized decoders to effectively and simultaneously translate images across multiple domains. The second is two proposed constraints extended from one-to-one mappings to further help improve the generation. All the evaluations demonstrate our framework is superior to existing methods and provides an effective solution for multi-domain image-to-image translation.