100.0LGMar 26Code
Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion ScaleYicheng Zou, Dongsheng Zhu, Lin Zhu et al.
We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.
CVAug 29, 2023Code
On-the-Fly Guidance Training for Medical Image RegistrationYuelin Xin, Yicheng Chen, Shengxiang Ji et al.
This study introduces a novel On-the-Fly Guidance (OFG) training framework for enhancing existing learning-based image registration models, addressing the limitations of weakly-supervised and unsupervised methods. Weakly-supervised methods struggle due to the scarcity of labeled data, and unsupervised methods directly depend on image similarity metrics for accuracy. Our method proposes a supervised fashion for training registration models, without the need for any labeled data. OFG generates pseudo-ground truth during training by refining deformation predictions with a differentiable optimizer, enabling direct supervised learning. OFG optimizes deformation predictions efficiently, improving the performance of registration models without sacrificing inference speed. Our method is tested across several benchmark datasets and leading models, it significantly enhanced performance, providing a plug-and-play solution for training learning-based registration models. Code available at: https://github.com/cilix-ai/on-the-fly-guidance
89.6AIMay 26
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World IntelligenceMiniMax, Aili Chen, Aonian Li et al.
We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.
ROSep 19, 2023
Learning to Initialize Trajectory Optimization for Vision-Based Autonomous Flight in Unknown EnvironmentsYicheng Chen, Jinjie Li, Wenyuan Qin et al.
Autonomous flight in unknown environments requires precise spatial and temporal trajectory planning, often involving computationally expensive nonconvex optimization prone to local optima. To overcome these challenges, we present the Neural-Enhanced Trajectory Planner (NEO-Planner), a novel approach that leverages a Neural Network (NN) Planner to provide informed initial values for trajectory optimization. The NN-Planner is trained on a dataset generated by an expert planner using batch sampling, capturing multimodal trajectory solutions. It learns to predict spatial and temporal parameters for trajectories directly from raw sensor observations. NEO-Planner starts optimization from these predictions, accelerating computation speed while maintaining explainability. Furthermore, we introduce a robust online replanning framework that accommodates planning latency for smooth trajectory tracking. Extensive simulations demonstrate that NEO-Planner reduces optimization iterations by 20%, leading to a 26% decrease in computation time compared with pure optimization-based methods. It maintains trajectory quality comparable to baseline approaches and generalizes well to unseen environments. Real-world experiments validate its effectiveness for autonomous drone navigation in cluttered, unknown environments.
CVJan 4, 2024Code
An Open and Comprehensive Pipeline for Unified Object Grounding and DetectionXiangyu Zhao, Yicheng Chen, Shilin Xu et al.
Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community. Codes and trained models are released at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.
CLApr 12, 2025Code
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and FutureJialun Zhong, Wei Shen, Yanzeng Li et al.
Reward Model (RM) has demonstrated impressive potential for enhancing Large Language Models (LLM), as RM can serve as a proxy for human preferences, providing signals to guide LLMs' behavior in various tasks. In this paper, we provide a comprehensive overview of relevant research, exploring RMs from the perspectives of preference collection, reward modeling, and usage. Next, we introduce the applications of RMs and discuss the benchmarks for evaluation. Furthermore, we conduct an in-depth analysis of the challenges existing in the field and dive into the potential research directions. This paper is dedicated to providing beginners with a comprehensive introduction to RMs and facilitating future studies. The resources are publicly available at github\footnote{https://github.com/JLZhong23/awesome-reward-models}.
LGSep 24, 2022
Communication-Efficient {Federated} Learning Using Censored Heavy Ball DescentYicheng Chen, Rick S. Blum, Brian M. Sadler
Distributed machine learning enables scalability and computational offloading, but requires significant levels of communication. Consequently, communication efficiency in distributed learning settings is an important consideration, especially when the communications are wireless and battery-driven devices are employed. In this paper we develop a censoring-based heavy ball (CHB) method for distributed learning in a server-worker architecture. Each worker self-censors unless its local gradient is sufficiently different from the previously transmitted one. The significant practical advantages of the HB method for learning problems are well known, but the question of reducing communications has not been addressed. CHB takes advantage of the HB smoothing to eliminate reporting small changes, and provably achieves a linear convergence rate equivalent to that of the classical HB method for smooth and strongly convex objective functions. The convergence guarantee of CHB is theoretically justified for both convex and nonconvex cases. In addition we prove that, under some conditions, at least half of all communications can be eliminated without any impact on convergence rate. Extensive numerical results validate the communication efficiency of CHB on both synthetic and real datasets, for convex, nonconvex, and nondifferentiable cases. Given a target accuracy, CHB can significantly reduce the number of communications compared to existing algorithms, achieving the same accuracy without slowing down the optimization process.
SEOct 6, 2023
Reverse Chain: A Generic-Rule for LLMs to Master Multi-API PlanningYinger Zhang, Hui Cai, Xeirui Song et al.
While enabling large language models to implement function calling (known as APIs) can greatly enhance the performance of Large Language Models (LLMs), function calling is still a challenging task due to the complicated relations between different APIs, especially in a context-learning setting without fine-tuning. This paper introduces ``Reverse Chain'', a controllable, target-driven approach designed to empower LLMs with the capability to operate external APIs only via prompts. Recognizing that most LLMs have limited tool-use capabilities, Reverse Chain limits LLMs to executing simple tasks, e.g., API Selection and Argument Completion. Furthermore, to manage a controllable multi-function calling, Reverse Chain adopts a generic rule based on a backward reasoning process. This rule determines when to do API selection or Argument completion. To evaluate the multi-tool-use capability of LLMs, we have released a compositional multi-tool task dataset, available at \url{https://anonymous.4open.science/r/reverse-chain-8681}. Extensive numerical experiments validate the remarkable proficiency of Reverse Chain in managing multiple API calls.
97.1AIApr 15
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based ExplorationZerun Ma, Guoqiang Wang, Xinchen Xie et al.
While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules-the Researcher and the Executor-the system seamlessly performs requirement analysis, open-domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi-round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high-level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT-Bench, a benchmark comprising 10 tasks derived from real-world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain-specific tasks. Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.
98.6CLMar 30
Kernel-Smith: A Unified Recipe for Evolutionary Kernel OptimizationHe Du, Qiming Ge, Jiakai Hu et al.
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
CLApr 18, 2025Code
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic SpaceYicheng Chen, Yining Li, Kai Hu et al.
Data quality and diversity are key to the construction of effective instruction-tuning datasets. % With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. % Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. % However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. % Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. % To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. % Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to \textbf{M}aximize the \textbf{I}nformation \textbf{G}ain (MIG) in semantic space. % Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. % Notably, the model fine-tuned with 5\% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73\% on AlpacaEval and +6.89\% on Wildbench.
CLJan 4Code
From Failure to Mastery: Generating Hard Samples for Tool-use AgentsBingguang Hao, Zengzhuang Xu, Yuntao Wen et al.
The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. Firstly, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Secondly, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with a closed-loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B parameter model trained with our curated dataset achieves superior performance compared to several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.
CLFeb 11
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement LearningYicheng Chen, Zerun Ma, Xinchen Xie et al.
In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the \emph{data recipe}, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate \emph{end-to-end data recipe generation} for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
ROMar 3
CoFL: Continuous Flow Fields for Language-Conditioned NavigationHaokun Liu, Zhaoqi Ma, Yicheng Chen et al.
Language-conditioned navigation pipelines often rely on brittle modular components or costly action-sequence generation. To address these limitations, we present CoFL, an end-to-end policy that directly maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. Instead of predicting discrete action tokens or sampling action chunks via iterative denoising, CoFL outputs instantaneous velocities that can be queried at arbitrary 2D projected locations. Trajectories are obtained by numerical integration of the predicted field, producing smooth motion that remains reactive under closed-loop execution. To enable large-scale training, we build a dataset of over 500k BEV image-instruction pairs, each procedurally annotated with a flow field and a trajectory derived from BEV semantic maps built on Matterport3D and ScanNet. By training on a mixed distribution, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and generative policy baselines on strictly unseen scenes. Finally, we deploy CoFL zero-shot in real-world experiments with overhead BEV observations across multiple layouts, maintaining reliable closed-loop control and a high success rate.
CLFeb 19, 2024
Are LLM-based Evaluators Confusing NLG Quality Criteria?Xinyu Hu, Mingqi Gao, Sen Hu et al.
Some prior work has shown that LLMs perform well in NLG evaluation for different tasks. However, we discover that LLMs seem to confuse different evaluation criteria, which reduces their reliability. For further verification, we first consider avoiding issues of inconsistent conceptualization and vague expression in existing NLG quality criteria themselves. So we summarize a clear hierarchical classification system for 11 common aspects with corresponding different criteria from previous studies involved. Inspired by behavioral testing, we elaborately design 18 types of aspect-targeted perturbation attacks for fine-grained analysis of the evaluation behaviors of different LLMs. We also conduct human annotations beyond the guidance of the classification system to validate the impact of the perturbations. Our experimental results reveal confusion issues inherent in LLMs, as well as other noteworthy phenomena, and necessitate further research and improvements for LLM-based evaluation.
ROFeb 25
Hierarchical Trajectory Planning of Floating-Base Multi-Link Robot for Maneuvering in Confined EnvironmentsYicheng Chen, Jinjie Li, Haokun Liu et al.
Floating-base multi-link robots can change their shape during flight, making them well-suited for applications in confined environments such as autonomous inspection and search and rescue. However, trajectory planning for such systems remains an open challenge because the problem lies in a high-dimensional, constraint-rich space where collision avoidance must be addressed together with kinematic limits and dynamic feasibility. This work introduces a hierarchical trajectory planning framework that integrates global guidance with configuration-aware local optimization. First, we exploit the dual nature of these robots - the root link as a rigid body for guidance and the articulated joints for flexibility - to generate global anchor states that decompose the planning problem into tractable segments. Second, we design a local trajectory planner that optimizes each segment in parallel with differentiable objectives and constraints, systematically enforcing kinematic feasibility and maintaining dynamic feasibility by avoiding control singularities. Third, we implement a complete system that directly processes point-cloud data, eliminating the need for handcrafted obstacle models. Extensive simulations and real-world experiments confirm that this framework enables an articulated aerial robot to exploit its morphology for maneuvering that rigid robots cannot achieve. To the best of our knowledge, this is the first planning framework for floating-base multi-link robots that has been demonstrated on a real robot to generate continuous, collision-free, and dynamically feasible trajectories directly from raw point-cloud inputs, without relying on handcrafted obstacle models.
LGDec 26, 2023
AdapterDistillation: Non-Destructive Task Composition with Knowledge DistillationJunjie Wang, Yicheng Chen, Wangshu Zhang et al.
Leveraging knowledge from multiple tasks through introducing a small number of task specific parameters into each transformer layer, also known as adapters, receives much attention recently. However, adding an extra fusion layer to implement knowledge composition not only increases the inference time but also is non-scalable for some applications. To avoid these issues, we propose a two-stage knowledge distillation algorithm called AdapterDistillation. In the first stage, we extract task specific knowledge by using local data to train a student adapter. In the second stage, we distill the knowledge from the existing teacher adapters into the student adapter to help its inference. Extensive experiments on frequently asked question retrieval in task-oriented dialog systems validate the efficiency of AdapterDistillation. We show that AdapterDistillation outperforms existing algorithms in terms of accuracy, resource consumption and inference time.
CLDec 27, 2023
S2M: Converting Single-Turn to Multi-Turn Datasets for Conversational Question AnsweringBaokui Li, Sen Zhang, Wangshu Zhang et al.
Supplying data augmentation to conversational question answering (CQA) can effectively improve model performance. However, there is less improvement from single-turn datasets in CQA due to the distribution gap between single-turn and multi-turn datasets. On the other hand, while numerous single-turn datasets are available, we have not utilized them effectively. To solve this problem, we propose a novel method to convert single-turn datasets to multi-turn datasets. The proposed method consists of three parts, namely, a QA pair Generator, a QA pair Reassembler, and a question Rewriter. Given a sample consisting of context and single-turn QA pairs, the Generator obtains candidate QA pairs and a knowledge graph based on the context. The Reassembler utilizes the knowledge graph to get sequential QA pairs, and the Rewriter rewrites questions from a conversational perspective to obtain a multi-turn dataset S2M. Our experiments show that our method can synthesize effective training resources for CQA. Notably, S2M ranks 1st place on the QuAC leaderboard at the time of submission (Aug 24th, 2022).
AIOct 28, 2025
FunReason-MT Technical Report: Advanced Data Synthesis Solution for Real-world Multi-Turn Tool-useZengzhuang Xu, Bingguang Hao, Zechuan Wang et al.
Function calling (FC) empowers large language models (LLMs) and autonomous agents to interface with external tools, a critical capability for solving complex, real-world problems. As this ability becomes increasingly central to advanced AI systems, the need for high-quality, multi-turn training data to develop and refine it cannot be overstated. Existing data synthesis methods, such as random environment sampling or multi-agent role-playing, are not powerful enough to generate high-quality data in real-world environments. Practical challenges come in three folds: targeted data synthesis, hard query construction, and multi-turn logical dependency. To address these structural deficiencies, we present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use. FunReason-MT resolves the complexity barrier in multi-turn FC data by employing 1) Environment-API Graph Interactions to gather varied high-quality trajectories with targeted tool, 2) Advanced Tool-Query Synthesis to simplify hard query construction, and 3) Guided Iterative Chain for sophisticated CoT generation. Evaluations on Berkeley Function-Calling Leaderboard (BFCLv3) demonstrate the power of our framework: a 4B model built upon FunReason-MT generated data achieves state-of-the-art performance among comparable-sized models. Further performance improvements on BFCLv4 confirm that FunReason-MT provides a reliable and robust source for agentic learning.
LGAug 7, 2025
Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function CallingBingguang Hao, Zengzhuang Xu, Maolin Wang et al.
The effective training of Large Language Models (LLMs) for function calling faces a critical challenge: balancing exploration of complex reasoning paths with stable policy optimization. Standard methods like Supervised Fine-Tuning (SFT) fail to instill robust reasoning, and traditional Reinforcement Learning (RL) struggles with inefficient exploration. We propose \textbf{EGPO}, a new RL framework built upon Group Relative Policy Optimization (GRPO), designed to address this challenge directly. The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model's Chain-of-Thought (CoT) into the policy gradient computation. This encourages the generation of diverse reasoning strategies. To maintain optimization direction, the entropy bonus is carefully constrained by a clipping mechanism. Complemented by a strict, binary reward signal, EGPO effectively guides the model towards discovering structured and accurate tool invocation patterns. On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B-parameter model trained with EGPO sets a new state-of-the-art among models of comparable size, surpassing a range of strong competitors, including GPT-4o and Gemini-2.5.
ROJun 5, 2025
Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic SystemHaokun Liu, Zhaoqi Ma, Yunong Li et al.
Heterogeneous multirobot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, existing methods that rely on static or task-specific models often lack generalizability across diverse tasks and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine-tuned vision-language model (VLM). At the system level, the LLM performs hierarchical task decomposition and constructs a global semantic map, while the VLM provides semantic perception and object localization, where the proposed GridMask significantly enhances the VLM's spatial accuracy for reliable fine-grained manipulation. The aerial robot leverages this global map to generate semantic paths and guide the ground robot's local navigation and manipulation, ensuring robust coordination even in target-absent or ambiguous scenarios. We validate the framework through extensive simulation and real-world experiments on long-horizon object arrangement tasks, demonstrating zero-shot adaptability, robust semantic navigation, and reliable manipulation in dynamic environments. To the best of our knowledge, this work presents the first heterogeneous aerial-ground robotic system that integrates VLM-based perception with LLM-driven reasoning for global high-level task planning and execution.
CVJun 28, 2024
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by LanguageYicheng Chen, Xiangtai Li, Yining Li et al.
Diffusion models can generate realistic and diverse images, potentially facilitating data availability for data-intensive perception tasks. However, leveraging these models to boost performance on downstream tasks with synthetic data poses several challenges, including aligning with real data distribution, scaling synthetic sample volumes, and ensuring their quality. To bridge these gaps, we present \textbf{A}uto \textbf{C}herry-\textbf{P}icker (ACP), a novel framework that generates high-quality cross-modality training samples at scale to augment perception and multi-modal training. ACP first uses LLMs to sample descriptions and layouts based on object combinations from real data priors, eliminating the need for ground truth image captions or annotations. Next, we use an off-the-shelf controllable diffusion model to generate multiple images. Then, the generated data are refined using a comprehensively designed metric, Composite Layout and Image Score (CLIS), to ensure quality. Our customized synthetic high-quality samples boost performance in various scenarios, especially in addressing challenges associated with long-tailed distribution and imbalanced datasets. Experiment results on downstream tasks demonstrate that ACP can significantly improve the performance of existing models. In addition, we find a positive correlation between CLIS and performance gains in downstream tasks. This finding shows the potential for evaluation metrics as the role for various visual perception and MLLM tasks.
LGFeb 5, 2022
Communication Efficient Federated Learning via Ordered ADMM in a Fully Decentralized SettingYicheng Chen, Rick S. Blum, Brian M. Sadler
The challenge of communication-efficient distributed optimization has attracted attention in recent years. In this paper, a communication efficient algorithm, called ordering-based alternating direction method of multipliers (OADMM) is devised in a general fully decentralized network setting where a worker can only exchange messages with neighbors. Compared to the classical ADMM, a key feature of OADMM is that transmissions are ordered among workers at each iteration such that a worker with the most informative data broadcasts its local variable to neighbors first, and neighbors who have not transmitted yet can update their local variables based on that received transmission. In OADMM, we prohibit workers from transmitting if their current local variables are not sufficiently different from their previously transmitted value. A variant of OADMM, called SOADMM, is proposed where transmissions are ordered but transmissions are never stopped for each node at each iteration. Numerical results demonstrate that given a targeted accuracy, OADMM can significantly reduce the number of communications compared to existing algorithms including ADMM. We also show numerically that SOADMM can accelerate convergence, resulting in communication savings compared to the classical ADMM.
LGFeb 5, 2022
Distributed Learning With Sparsified Gradient DifferencesYicheng Chen, Rick S. Blum, Martin Takac et al.
A very large number of communications are typically required to solve distributed learning tasks, and this critically limits scalability and convergence speed in wireless communications applications. In this paper, we devise a Gradient Descent method with Sparsification and Error Correction (GD-SEC) to improve the communications efficiency in a general worker-server architecture. Motivated by a variety of wireless communications learning scenarios, GD-SEC reduces the number of bits per communication from worker to server with no degradation in the order of the convergence rate. This enables larger-scale model learning without sacrificing convergence or accuracy. At each iteration of GD-SEC, instead of directly transmitting the entire gradient vector, each worker computes the difference between its current gradient and a linear combination of its previously transmitted gradients, and then transmits the sparsified gradient difference to the server. A key feature of GD-SEC is that any given component of the gradient difference vector will not be transmitted if its magnitude is not sufficiently large. An error correction technique is used at each worker to compensate for the error resulting from sparsification. We prove that GD-SEC is guaranteed to converge for strongly convex, convex, and nonconvex optimization problems with the same order of convergence rate as GD. Furthermore, if the objective function is strongly convex, GD-SEC has a fast linear convergence rate. Numerical results not only validate the convergence rate of GD-SEC but also explore the communication bit savings it provides. Given a target accuracy, GD-SEC can significantly reduce the communications load compared to the best existing algorithms without slowing down the optimization process.
MAFeb 3, 2021
Multi-UAV Mobile Edge Computing and Path Planning Platform based on Reinforcement LearningHuan Chang, Yicheng Chen, Baochang Zhang et al.
Unmanned Aerial vehicles (UAVs) are widely used as network processors in mobile networks, but more recently, UAVs have been used in Mobile Edge Computing as mobile servers. However, there are significant challenges to use UAVs in complex environments with obstacles and cooperation between UAVs. We introduce a new multi-UAV Mobile Edge Computing platform, which aims to provide better Quality-of-Service and path planning based on reinforcement learning to address these issues. The contributions of our work include: 1) optimizing the quality of service for mobile edge computing and path planning in the same reinforcement learning framework; 2) using a sigmoid-like function to depict the terminal users' demand to ensure a higher quality of service; 3) applying synthetic considerations of the terminal users' demand, risk and geometric distance in reinforcement learning reward matrix to ensure the quality of service, risk avoidance, and the cost-savings. Simulations have shown the effectiveness and feasibility of our platform, which can help advance related researches.
IVMay 8, 2019
QSMGAN: Improved Quantitative Susceptibility Mapping using 3D Generative Adversarial Networks with Increased Receptive FieldYicheng Chen, Angela Jakary, Sivakami Avadiappan et al.
Quantitative susceptibility mapping (QSM) is a powerful MRI technique that has shown great potential in quantifying tissue susceptibility in numerous neurological disorders. However, the intrinsic ill-posed dipole inversion problem greatly affects the accuracy of the susceptibility map. We propose QSMGAN: a 3D deep convolutional neural network approach based on a 3D U-Net architecture with increased receptive field of the input phase compared to the output and further refined the network using the WGAN with gradient penalty training strategy. Our method generates accurate QSM maps from single orientation phase maps efficiently and performs significantly better than traditional non-learning-based dipole inversion algorithms. The generalization capability was verified by applying the algorithm to an unseen pathology--brain tumor patients with radiation-induced cerebral microbleeds.