AINov 30, 2024
FullStack Bench: Evaluating LLMs as Full Stack CodersBytedance-Seed-Foundation-Code-Team, Yao Cheng, Jianfeng Chen et al. · bytedance
As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To address this gap, we have developed a comprehensive code evaluation dataset FullStack Bench focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). Besides, to assess multilingual programming capabilities, in FullStack Bench, we design real-world instructions and corresponding unit test cases from 16 widely-used programming languages to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming languages and packages to evaluate the performance of our FullStack Bench efficiently. Comprehensive experimental results on our FullStack Bench demonstrate the necessity and effectiveness of our FullStack Bench and SandboxFusion.
CVOct 5, 2023Code
Expedited Training of Visual Conditioned Language Generation via Redundancy ReductionYiren Jian, Tingkai Liu, Yunzhe Tao et al.
In this paper, we introduce $\text{EVL}_{\text{Gen}}$, a streamlined framework designed for the pre-training of visually conditioned language generation models with high computational demands, utilizing frozen pre-trained large language models (LLMs). The conventional approach in vision-language pre-training (VLP) typically involves a two-stage optimization process: an initial resource-intensive phase dedicated to general-purpose vision-language representation learning, focused on extracting and consolidating relevant visual features. This is followed by a subsequent phase that emphasizes end-to-end alignment between visual and linguistic modalities. Our novel one-stage, single-loss framework bypasses the computationally demanding first training stage by gradually merging similar visual tokens during training, while avoiding model collapse caused by single-stage training of BLIP-2 type models. The gradual merging process effectively condenses visual information while preserving semantic richness, resulting in rapid convergence without compromising performance. Our experimental findings demonstrate that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance. Furthermore, we illustrate that our models significantly narrow the performance gap to current vision-language models using only 1/10 of the data. Finally, we showcase how our image-text models can seamlessly adapt to video-conditioned language generation tasks through novel soft attentive temporal token contextualizing modules. Code is available at \url{https://github.com/yiren-jian/EVLGen}.
CVOct 8, 2023Code
DeVAn: Dense Video Annotation for Video-Language ModelsTingkai Liu, Yunzhe Tao, Haogeng Liu et al.
We present a novel human annotated dataset for evaluating the ability for visual-language models to generate both short and long descriptions for real-world video clips, termed DeVAn (Dense Video Annotation). The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests. Each video clip is independently annotated by 5 human annotators, producing both captions (1 sentence) and summaries (3-10 sentences). Given any video selected from the dataset and its corresponding ASR information, we evaluate visuallanguage models on either caption or summary generation that is grounded in both the visual and auditory content of the video. Additionally, models are also evaluated on caption- and summary-based retrieval tasks, where the summary-based retrieval task requires the identification of a target video given excerpts of a given summary. Given the novel nature of the paragraph-length video summarization task, we compared different existing evaluation metrics and their alignment with human preferences and found that model-based evaluation metrics provide more semantically-oriented and human-aligned evaluation. Finally, we benchmarked a wide range of current video-language models on DeVAn, and we aim for DeVAn to serve as a useful evaluation set in the age of large language models and complex multi-modal tasks. Code is available at https: //github.com/TK-21st/DeVAn.
CVOct 8, 2023
Video-Teller: Enhancing Cross-Modal Generation with Fusion and DecouplingHaogeng Liu, Qihang Fan, Tingkai Liu et al.
This paper proposes Video-Teller, a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment to significantly enhance the video-to-text generation task. Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules. It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions. To effectively integrate visual and auditory information, Video-Teller builds upon the image-based BLIP-2 model and introduces a cascaded Q-Former which fuses information across frames and ASR texts. To better guide video summarization, we introduce a fine-grained modality alignment objective, where the cascaded Q-Former's output embedding is trained to align with the caption/summary embedding created by a pretrained text auto-encoder. Experimental results demonstrate the efficacy of our proposed video-language foundation model in accurately comprehending videos and generating coherent and precise language descriptions. It is worth noting that the fine-grained alignment enhances the model's capabilities (4% improvement of CIDEr score on MSR-VTT) with only 13% extra parameters in training and zero additional cost in inference.
CLOct 4, 2023
$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program SynthesisZishun Yu, Yunzhe Tao, Liyu Chen et al.
Program synthesis aims to create accurate, executable programs from problem specifications, specifically from natural language descriptions in our context. Recent studies have leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. The application of RL focuses on directly optimizing for functional correctness, offering an advantage over conventional supervised methods. Despite policy-based RL methods dominating the literature on RL for program synthesis, the nature of program synthesis tasks hints at a natural alignment with value-based methods. This stems from the rich collection of off-policy programs, including those developed by human programmers and also historical samples, coupled with the straightforward verification of generated programs through automated unit testing, meaning rewards are easy to obtain. Diverging from the dominant use of policy-based algorithms, our work explores the feasibility of value-based approaches, leading to the development of our $\mathcal{B}$-Coder (pronounced Bellman coder). Yet, training value-based methods presents challenges due to the enormous search space inherent to program synthesis. To this end, we introduce an initialization protocol for RL agents utilizing pre-trained LMs and a conservative Bellman operator to reduce training complexities. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrated $\mathcal{B}$-Coder's capability in achieving state-of-the-art performance when compared to policy-based methods. Remarkably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs.
CVMar 3, 2024Code
InfiMM-HD: A Leap Forward in High-Resolution Multimodal UnderstandingHaogeng Liu, Quanzeng You, Xiaotian Han et al.
Multimodal Large Language Models (MLLMs) have experienced significant advancements recently. Nevertheless, challenges persist in the accurate recognition and comprehension of intricate details within high-resolution images. Despite being indispensable for the development of robust MLLMs, this area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architecture specifically designed for processing images of different resolutions with low computational overhead. This innovation facilitates the enlargement of MLLMs to higher-resolution capabilities. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computation costs. By integrating this architectural design with a four-stage training pipeline, our model attains improved visual perception efficiently and cost-effectively. Empirical study underscores the robustness and effectiveness of InfiMM-HD, opening new avenues for exploration in related areas. Codes and models can be found at https://huggingface.co/Infi-MM/infimm-hd
LGNov 5, 2019Code
DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement LearningBharathan Balaji, Sunil Mallya, Sahika Genc et al.
DeepRacer is a platform for end-to-end experimentation with RL and can be used to systematically investigate the key challenges in developing intelligent control systems. Using the platform, we demonstrate how a 1/18th scale car can learn to drive autonomously using RL with a monocular camera. It is trained in simulation with no additional tuning in physical world and demonstrates: 1) formulation and solution of a robust reinforcement learning algorithm, 2) narrowing the reality gap through joint perception and dynamics, 3) distributed on-demand compute architecture for training optimal policies, and 4) a robust evaluation method to identify when to stop training. It is the first successful large-scale deployment of deep reinforcement learning on a robotic control agent that uses only raw camera images as observations and a model-free learning method to perform robust path planning. We open source our code and video demo on GitHub: https://git.io/fjxoJ.
CVMar 27, 2024
ViTAR: Vision Transformer with Any ResolutionQihang Fan, Quanzeng You, Xiaotian Han et al.
This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. Our work introduces two key innovations to address this issue. Firstly, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration. Secondly, we introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions, thereby preventing overfitting to any single training resolution. Our resulting model, ViTAR (Vision Transformer with Any Resolution), demonstrates impressive adaptability, achieving 83.3\% top-1 accuracy at a 1120x1120 resolution and 80.4\% accuracy at a 4032x4032 resolution, all while reducing computational costs. ViTAR also shows strong performance in downstream tasks such as instance and semantic segmentation and can easily combined with self-supervised learning techniques like Masked AutoEncoder. Our work provides a cost-effective solution for enhancing the resolution scalability of ViTs, paving the way for more versatile and efficient high-resolution image processing.
LGMar 11, 2024
$\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking Reinforcement Learning Algorithms in Generative Language ModelYufeng Zhang, Liyu Chen, Boyi Liu et al.
Recent advances in reinforcement learning (RL) algorithms aim to enhance the performance of language models at scale. Yet, there is a noticeable absence of a cost-effective and standardized testbed tailored to evaluating and comparing these algorithms. To bridge this gap, we present a generalized version of the 24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a target value $K$ with $N$ integers. We evaluate the effectiveness of established RL algorithms such as Proximal Policy Optimization (PPO), alongside novel approaches like Identity Policy Optimization (IPO) and Direct Policy Optimization (DPO).
CLJun 10, 2021
A Template-guided Hybrid Pointer Network for Knowledge-basedTask-oriented Dialogue SystemsDingmin Wang, Ziyao Chen, Wanwei He et al.
Most existing neural network based task-oriented dialogue systems follow encoder-decoder paradigm, where the decoder purely depends on the source texts to generate a sequence of words, usually suffering from instability and poor readability. Inspired by the traditional template-based generation approaches, we propose a template-guided hybrid pointer network for the knowledge-based task-oriented dialogue system, which retrieves several potentially relevant answers from a pre-constructed domain-specific conversational repository as guidance answers, and incorporates the guidance answers into both the encoding and decoding processes. Specifically, we design a memory pointer network model with a gating mechanism to fully exploit the semantic correlation between the retrieved answers and the ground-truth response. We evaluate our model on four widely used task-oriented datasets, including one simulated and three manually created datasets. The experimental results demonstrate that the proposed model achieves significantly better performance than the state-of-the-art methods over different automatic evaluation metrics.
LGNov 24, 2020
REPAINT: Knowledge Transfer in Deep Reinforcement LearningYunzhe Tao, Sahika Genc, Jonathan Chung et al.
Accelerating learning processes for complex tasks by leveraging previously learned tasks has been one of the most challenging problems in reinforcement learning, especially when the similarity between source and target tasks is low. This work proposes REPresentation And INstance Transfer (REPAINT) algorithm for knowledge transfer in deep reinforcement learning. REPAINT not only transfers the representation of a pre-trained teacher policy in the on-policy learning, but also uses an advantage-based experience selection approach to transfer useful samples collected following the teacher policy in the off-policy learning. Our experimental results on several benchmark tasks show that REPAINT significantly reduces the total training time in generic cases of task similarity. In particular, when the source tasks are dissimilar to, or sub-tasks of, the target tasks, REPAINT outperforms other baselines in both training-time reduction and asymptotic performance of return scores.
LGJan 2, 2020
Zero-Shot Reinforcement Learning with Deep Attention Convolutional Neural NetworksSahika Genc, Sunil Mallya, Sravan Bodapati et al.
Simulation-to-simulation and simulation-to-real world transfer of neural network models have been a difficult problem. To close the reality gap, prior methods to simulation-to-real world transfer focused on domain adaptation, decoupling perception and dynamics and solving each problem separately, and randomization of agent parameters and environment conditions to expose the learning agent to a variety of conditions. While these methods provide acceptable performance, the computational complexity required to capture a large variation of parameters for comprehensive scenarios on a given task such as autonomous driving or robotic manipulation is high. Our key contribution is to theoretically prove and empirically demonstrate that a deep attention convolutional neural network (DACNN) with specific visual sensor configuration performs as well as training on a dataset with high domain and parameter variation at lower computational complexity. Specifically, the attention network weights are learned through policy optimization to focus on local dependencies that lead to optimal actions, and does not require tuning in real-world for generalization. Our new architecture adapts perception with respect to the control objective, resulting in zero-shot learning without pre-training a perception network. To measure the impact of our new deep network architecture on domain adaptation, we consider autonomous driving as a use case. We perform an extensive set of experiments in simulation-to-simulation and simulation-to-real scenarios to compare our approach to several baselines including the current state-of-art models.
CLOct 25, 2019
FineText: Text Classification via Attention-based Language Model Fine-tuningYunzhe Tao, Saurabh Gupta, Satyapriya Krishna et al.
Training deep neural networks from scratch on natural language processing (NLP) tasks requires significant amount of manually labeled text corpus and substantial time to converge, which usually cannot be satisfied by the customers. In this paper, we aim to develop an effective transfer learning algorithm by fine-tuning a pre-trained language model. The goal is to provide expressive and convenient-to-use feature extractors for downstream NLP tasks, and achieve improvement in terms of accuracy, data efficiency, and generalization to new domains. Therefore, we propose an attention-based fine-tuning algorithm that automatically selects relevant contextualized features from the pre-trained language model and uses those features on downstream text classification tasks. We test our methods on six widely-used benchmarking datasets, and achieve new state-of-the-art performance on all of them. Moreover, we then introduce an alternative multi-task learning approach, which is an end-to-end algorithm given the pre-trained model. By doing multi-task learning, one can largely reduce the total training time by trading off some classification accuracy.
LGDec 1, 2018
Stochastic Training of Residual Networks: a Differential Equation ViewpointQi Sun, Yunzhe Tao, Qiang Du
During the last few years, significant attention has been paid to the stochastic training of artificial neural networks, which is known as an effective regularization approach that helps improve the generalization capability of trained models. In this work, the method of modified equations is applied to show that the residual network and its variants with noise injection can be regarded as weak approximations of stochastic differential equations. Such observations enable us to bridge the stochastic training processes with the optimal control of backward Kolmogorov's equations. This not only offers a novel perspective on the effects of regularization from the loss landscape viewpoint but also sheds light on the design of more reliable and efficient stochastic training strategies. As an example, we propose a new way to utilize Bernoulli dropout within the plain residual network architecture and conduct experiments on a real-world image classification task to substantiate our theoretical findings.
LGNov 7, 2018
Explaining Deep Learning Models - A Bayesian Non-parametric ApproachWenbo Guo, Sui Huang, Yunzhe Tao et al.
Understanding and interpreting how machine learning (ML) models make decisions have been a big challenge. While recent research has proposed various technical approaches to provide some clues as to how an ML model makes individual predictions, they cannot provide users with an ability to inspect a model as a complete entity. In this work, we propose a novel technical approach that augments a Bayesian non-parametric regression mixture model with multiple elastic nets. Using the enhanced mixture model, we can extract generalizable insights for a target model through a global approximation. To demonstrate the utility of our approach, we evaluate it on different ML models in the context of image recognition. The empirical results indicate that our proposed approach not only outperforms the state-of-the-art techniques in explaining individual decisions but also provides users with an ability to discover the vulnerabilities of the target ML models.
LGJun 2, 2018
Hierarchical Attention-Based Recurrent Highway Networks for Time Series PredictionYunzhe Tao, Lin Ma, Weizhong Zhang et al.
Time series prediction has been studied in a variety of domains. However, it is still challenging to predict future series given historical observations and past exogenous data. Existing methods either fail to consider the interactions among different components of exogenous variables which may affect the prediction accuracy, or cannot model the correlations between exogenous data and target data. Besides, the inherent temporal dynamics of exogenous data are also related to the target series prediction, and thus should be considered as well. To address these issues, we propose an end-to-end deep learning model, i.e., Hierarchical attention-based Recurrent Highway Network (HRHN), which incorporates spatio-temporal feature extraction of exogenous variables and temporal dynamics modeling of target variables into a single framework. Moreover, by introducing the hierarchical attention mechanism, HRHN can adaptively select the relevant exogenous features in different semantic levels. We carry out comprehensive empirical evaluations with various methods over several datasets, and show that HRHN outperforms the state of the arts in time series prediction, especially in capturing sudden changes and sudden oscillations of time series.
LGJun 2, 2018
Nonlocal Neural Networks, Nonlocal Diffusion and Nonlocal ModelingYunzhe Tao, Qi Sun, Qiang Du et al.
Nonlocal neural networks have been proposed and shown to be effective in several computer vision tasks, where the nonlocal operations can directly capture long-range dependencies in the feature space. In this paper, we study the nature of diffusion and damping effect of nonlocal networks by doing spectrum analysis on the weight matrices of the well-trained networks, and then propose a new formulation of the nonlocal block. The new block not only learns the nonlocal interactions but also has stable dynamics, thus allowing deeper nonlocal structures. Moreover, we interpret our formulation from the general nonlocal modeling perspective, where we make connections between the proposed nonlocal network and other nonlocal models, such as nonlocal diffusion process and Markov jump process.
CLMay 9, 2018
A Reinforced Topic-Aware Convolutional Sequence-to-Sequence Model for Abstractive Text SummarizationLi Wang, Junlin Yao, Yunzhe Tao et al.
In this paper, we propose a deep learning approach to tackle the automatic summarization tasks by incorporating topic information into the convolutional sequence-to-sequence (ConvS2S) model and using self-critical sequence training (SCST) for optimization. Through jointly attending to topics and word-level alignment, our approach can improve coherence, diversity, and informativeness of generated summaries via a biased probability generation mechanism. On the other hand, reinforcement training, like SCST, directly optimizes the proposed model with respect to the non-differentiable metric ROUGE, which also avoids the exposure bias during inference. We carry out the experimental evaluation with state-of-the-art methods over the Gigaword, DUC-2004, and LCSTS datasets. The empirical results demonstrate the superiority of our proposed method in the abstractive summarization.