DCJun 3
Clownfish: Scaling DAG-based BFT Consensus via Sparse EdgesFeifan Wang, Jingfan Yu, Zixi Cai et al.
Directed Acyclic Graph (DAG) based BFT protocols have demonstrated the capability to achieve significantly high throughput in practice. Recent advancements focused on minimizing the good-case latency of these protocols, approaching the theoretical lower bound. However, the high communication complexity inherent in existing DAG-based protocols limits their scalability. This primarily arises because each vertex in the DAG must include a linear number of edges (references) to vertices from previous rounds. We present Clownfish, a partially synchronous DAG-based BFT protocol designed to address the scalability bottleneck. Clownfish achieves lower communication complexity by selectively reducing the number of edges in DAG vertices. When using a communication-optimal consistent broadcast, Clownfish attains quadratic total communication complexity per round, outperforming prior DAG-based protocols. Clownfish also reduces the additional latency in failure cases by optimizing the round advancement rule. Additionally, Clownfish supports multiple leaders per round to reduce average latency while maintaining its lower communication complexity. Our experimental evaluation demonstrates that Clownfish provides significantly better scalability than existing DAG-based protocols.
CVDec 19, 2022
From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera CalibrationZekun Qian, Ruize Han, Wei Feng et al.
We tackle a new problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration. This is a very challenging problem since its only input is several RGB images from different first-person views (FPVs) for a multi-person scene, without the BEV image and the calibration of the FPVs, while the output is a unified plane with the localization and orientation of both the subjects and cameras in a BEV. We propose an end-to-end framework solving this problem, whose main idea can be divided into following parts: i) creating a view-transform subject detection module to transform the FPV to a virtual BEV including localization and orientation of each pedestrian, ii) deriving a geometric transformation based method to estimate camera localization and view direction, i.e., the camera registration in a unified BEV, iii) making use of spatial and appearance information to aggregate the subjects into the unified BEV. We collect a new large-scale synthetic dataset with rich annotations for evaluation. The experimental results show the remarkable effectiveness of our proposed method.
CVOct 23, 2025Code
EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied IntelligenceDing Zou, Feifan Wang, Mengyu Ge et al.
The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augumented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. For enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models. Towards paving the way for the next generation of generalist embodied agents, we open-source all of our data, model weight, and evaluating methods, which are available at https://zterobot.github.io/EmbodiedBrain.github.io.
DCFeb 14, 2025
Hybrid Offline-online Scheduling Method for Large Language Model Inference OptimizationBowen Pang, Kai Li, Ruifeng She et al.
With the development of large language models (LLMs), it has become increasingly important to optimize hardware usage and improve throughput. In this paper, we study the inference optimization of the serving system that deploys LLMs. To optimize system throughput and maximize hardware utilization, we formulate the inference optimization problem as a mixed-integer programming (MIP) model and propose a hybrid offline-online method as solution. The offline method improves large-scale inference systems by introducing a Minimizing Makespan Bin Packing Problem. We further provide a theoretical lower bound computation method. Then, we propose an online sorting and preemptive scheduling method to better utilize hardware. In the online iteration scheduling process, a Lagrangian method is applied to evaluate the cost efficiency of inserting prefill stages versus decode stages at each iteration and dynamically determine when to preempt decoding tasks and insert prefill tasks. Experiments using real-world data from the LLaMA-65B model and the GSM8K dataset demonstrate that system utilization improves from 80.2% to 89.1%, and the total inference time decreases from 201.00 to 190.58 seconds. A 100-cases study shows that our method consistently outperforms the baseline method and improves the utilization rate by 8.0% on average. Finally, we discuss potential future extensions, including stochastic modeling, reinforcement learning-based schedulers, and dynamic decision-making strategies for system throughput and hardware utilization.
CVJan 31, 2024
Unveiling the Power of Self-supervision for Multi-view Multi-human Association and TrackingWei Feng, Feifan Wang, Ruize Han et al.
Multi-view multi-human association and tracking (MvMHAT), is a new but important problem for multi-person scene video surveillance, aiming to track a group of people over time in each view, as well as to identify the same person across different views at the same time, which is different from previous MOT and multi-camera MOT tasks only considering the over-time human tracking. This way, the videos for MvMHAT require more complex annotations while containing more information for self learning. In this work, we tackle this problem with a self-supervised learning aware end-to-end network. Specifically, we propose to take advantage of the spatial-temporal self-consistency rationale by considering three properties of reflexivity, symmetry and transitivity. Besides the reflexivity property that naturally holds, we design the self-supervised learning losses based on the properties of symmetry and transitivity, for both appearance feature learning and assignment matrix optimization, to associate the multiple humans over time and across views. Furthermore, to promote the research on MvMHAT, we build two new large-scale benchmarks for the network training and testing of different algorithms. Extensive experiments on the proposed benchmarks verify the effectiveness of our method. We have released the benchmark and code to the public.
LGMay 14, 2025
Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data GenerationFeifan Wang, Tengfei Song, Minggui He et al.
Facial emotion perception in the vision large language model (VLLM) is crucial for achieving natural human-machine interaction. However, creating high-quality annotations for both coarse- and fine-grained facial emotion analysis demands costly expertise. The lack of such high-quality instruction data limits the performance of VLLMs in facial emotion perception. To address this, we propose a self-verification approach with emotion knowledge enhancement (SEKE), which generates high-quality instruction data for multi-grained emotion analysis cost-effectively using closed-source VLLM. This approach integrates prior human knowledge to VLLM inference, guided by the inherent correlations between three grained levels of emotion descriptions, i.e., discrete expression, valence-arousal, and action unit, to reliably generate comprehensive annotations. A self-verification strategy with Uncertainty-Aware Monte Carlo sampling (SV-UAMC) is further embedded to efficiently extract more accurate VLLM predictions, further improving annotation reliability. Consequently, we construct a facial emotion instruction dataset (FEID) containing three comprehensive descriptions, which provides coarse- and fine-grained emotional information for effective model training. Additionally, we introduce a facial emotion analysis benchmark (FEAB) to measure the VLLM's corresponding ability. Our method significantly outperforms state-of-the-art methods on three downstream facial emotion analysis tasks.
CVOct 19, 2021
Towards Toxic and Narcotic Medication Detection with Rotated Object DetectorJiao Peng, Feifan Wang, Zhongqiang Fu et al.
Recent years have witnessed the advancement of deep learning vision technologies and applications in the medical industry. Intelligent devices for special medication management are in great need of, which requires more precise detection algorithms to identify the specifications and locations. In this work, YOLO (You only look once) based object detectors are tailored for toxic and narcotic medications detection tasks. Specifically, a more flexible annotation with rotated degree ranging from $0^\circ$ to $90^\circ$ and a mask-mapping-based non-maximum suppression method are proposed to achieve a feasible and efficient medication detector aiming at arbitrarily oriented bounding boxes. Extensive experiments demonstrate that the rotated YOLO detectors are more suitable for identifying densely arranged drugs. The best shot mean average precision of the proposed network reaches 0.811 while the inference time is less than 300ms.
IVMay 13, 2020
Neural Architecture Search for Gliomas Segmentation on Multimodal Magnetic Resonance ImagingFeifan Wang
Past few years have witnessed the artificial intelligence inspired evolution in various medical fields. The diagnosis and treatment of gliomas -- one of the most commonly seen brain tumors with low survival rate -- rely heavily on the computer assisted segmentation process undertaken on the magnetic resonance imaging (MRI) scans. Although the encoder-decoder shaped deep learning networks have been the de facto standard style for semantic segmentation tasks in medical imaging analysis, enormous effort is still required to be spent on designing the detailed architecture of the down-sampling and up-sampling blocks. In this work, we propose a neural architecture search (NAS) based solution to brain tumor segmentation tasks on multimodal volumetric MRI scans. Three sets of candidate operations are composed respectively for three kinds of basic building blocks in which each operation is assigned with a specific probabilistic parameter to be learned. Through alternately updating the weights of operations and the other parameters in the network, the searching mechanism ends up with two optimal structures for the upward and downward blocks. Moreover, the developed solution also integrates normalization and patching strategies tailored for brain MRI processing. Extensive comparative experiments on the BraTS 2019 dataset demonstrate that the proposed algorithm not only could relieve the pressure of fabricating block architectures but also possesses competitive feasibility and scalability.
IVSep 15, 2019
3D U-Net Based Brain Tumor Segmentation and Survival Days PredictionFeifan Wang, Runzhou Jiang, Liqin Zheng et al.
Past few years have witnessed the prevalence of deep learning in many application scenarios, among which is medical image processing. Diagnosis and treatment of brain tumors requires an accurate and reliable segmentation of brain tumors as a prerequisite. However, such work conventionally requires brain surgeons significant amount of time. Computer vision techniques could provide surgeons a relief from the tedious marking procedure. In this paper, a 3D U-net based deep learning model has been trained with the help of brain-wise normalization and patching strategies for the brain tumor segmentation task in the BraTS 2019 competition. Dice coefficients for enhancing tumor, tumor core, and the whole tumor are 0.737, 0.807 and 0.894 respectively on the validation dataset. These three values on the test dataset are 0.778, 0.798 and 0.852. Furthermore, numerical features including ratio of tumor size to brain size and the area of tumor surface as well as age of subjects are extracted from predicted tumor labels and have been used for the overall survival days prediction task. The accuracy could be 0.448 on the validation dataset, and 0.551 on the final test dataset.