h-index66
275papers
26,632citations
Novelty51%
AI Score65

275 Papers

CVJul 12, 2023Code
MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang et al. · pku

Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development in this domain. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and robust evaluation metrics. Meanwhile, subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, which is not scalable and may display significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of the following key features: 1. MMBench is meticulously curated with well-designed quality control schemes, surpassing existing similar benchmarks in terms of the number and variety of evaluation questions and abilities; 2. MMBench introduces a rigorous CircularEval strategy and incorporates large language models to convert free-form predictions into pre-defined choices, which helps to yield accurate evaluation results for models with limited instruction-following capabilities. 3. MMBench incorporates multiple-choice questions in both English and Chinese versions, enabling an apples-to-apples comparison of VLMs' performance under a bilingual context. To summarize, MMBench is a systematically designed objective benchmark for a robust and holistic evaluation of vision-language models. We hope MMBench will assist the research community in better evaluating their models and facilitate future progress in this area. The evalutation code of MMBench has been integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.

CVMay 19, 2022Code
PYSKL: Towards Good Practices for Skeleton Action Recognition

Haodong Duan, Jiaqi Wang, Kai Chen et al. · pku

We present PYSKL: an open-source toolbox for skeleton-based action recognition based on PyTorch. The toolbox supports a wide variety of skeleton action recognition algorithms, including approaches based on GCN and CNN. In contrast to existing open-source skeleton action recognition projects that include only one or two algorithms, PYSKL implements six different algorithms under a unified framework with both the latest and original good practices to ease the comparison of efficacy and efficiency. We also provide an original GCN-based skeleton action recognition model named ST-GCN++, which achieves competitive recognition performance without any complicated attention schemes, serving as a strong baseline. Meanwhile, PYSKL supports the training and testing of nine skeleton-based action recognition benchmarks and achieves state-of-the-art recognition performance on eight of them. To facilitate future research on skeleton action recognition, we also provide a large number of trained models and detailed benchmark results to give some insights. PYSKL is released at https://github.com/kennymckormick/pyskl and is actively maintained. We will update this report when we add new features or benchmarks. The current version corresponds to PYSKL v0.2.

CVSep 26, 2023Code
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

Pan Zhang, Xiaoyi Dong, Bin Wang et al. · pku

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a writing instruction, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on an extensive multi-modal multilingual database with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench (Chinese Cultural Benchmark), QBench and Tiny LVLM. Owing to the absence of established metrics for quantitatively assessing text-image composition, we have devised a robust evaluation procedure that comprises both human and GPT4-Vision (GPT4-V) to ensure reliability. Notably, our InternLM-XComposer achieves competitive text-image composition scores compared to public solutions, including GPT4-V and GPT3.5. Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series are publicly available at https://github.com/InternLM/InternLM-XComposer.

CVJul 16, 2024Code
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Haodong Duan, Xinyu Fang, Junming Yang et al. · pku

We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 200+ different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 80 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released on https://github.com/open-compass/VLMEvalKit and is actively maintained.

CVJul 3, 2024Code
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Pan Zhang, Xiaoyi Dong, Yuhang Zang et al. · pku

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.

CVAug 26, 2022Code
Voxurf: Voxel-based Efficient and Accurate Neural Surface Reconstruction

Tong Wu, Jiaqi Wang, Xingang Pan et al.

Neural surface reconstruction aims to reconstruct accurate 3D surfaces based on multi-view images. Previous methods based on neural volume rendering mostly train a fully implicit model with MLPs, which typically require hours of training for a single scene. Recent efforts explore the explicit volumetric representation to accelerate the optimization via memorizing significant information with learnable voxel grids. However, existing voxel-based methods often struggle in reconstructing fine-grained geometry, even when combined with an SDF-based volume rendering scheme. We reveal that this is because 1) the voxel grids tend to break the color-geometry dependency that facilitates fine-geometry learning, and 2) the under-constrained voxel grids lack spatial coherence and are vulnerable to local minima. In this work, we present Voxurf, a voxel-based surface reconstruction approach that is both efficient and accurate. Voxurf addresses the aforementioned issues via several key designs, including 1) a two-stage training procedure that attains a coherent coarse shape and recovers fine details successively, 2) a dual color network that maintains color-geometry dependency, and 3) a hierarchical geometry feature to encourage information propagation across voxels. Extensive experiments show that Voxurf achieves high efficiency and high quality at the same time. On the DTU benchmark, Voxurf achieves higher reconstruction quality with a 20x training speedup compared to previous fully implicit methods. Our code is available at https://github.com/wutong16/Voxurf.

CVFeb 17, 2023Code
Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences

Yujie Zhou, Haodong Duan, Anyi Rao et al. · pku

Self-supervised learning has demonstrated remarkable capability in representation learning for skeleton-based action recognition. Existing methods mainly focus on applying global data augmentation to generate different views of the skeleton sequence for contrastive learning. However, due to the rich action clues in the skeleton sequences, existing methods may only take a global perspective to learn to discriminate different skeletons without thoroughly leveraging the local relationship between different skeleton joints and video frames, which is essential for real-world applications. In this work, we propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship from a partial skeleton sequences built by a unique spatio-temporal masking strategy. Specifically, we construct a negative-sample-free triplet steam structure that is composed of an anchor stream without any masking, a spatial masking stream with Central Spatial Masking (CSM), and a temporal masking stream with Motion Attention Temporal Masking (MATM). The feature cross-correlation matrix is measured between the anchor stream and the other two masking streams, respectively. (1) Central Spatial Masking discards selected joints from the feature calculation process, where the joints with a higher degree of centrality have a higher possibility of being selected. (2) Motion Attention Temporal Masking leverages the motion of action and remove frames that move faster with a higher possibility. Our method achieves SOTA performance on NTU-60, NTU-120 and PKU-MMD under various downstream tasks. A practical evaluation is performed where some skeleton joints are lost in downstream tasks. In contrast to previous methods that suffer from large performance drops, our PSTL can still achieve remarkable results, validating the robustness of our method. Code: https://github.com/YujieOuO/PSTL.git.

CVJan 18, 2023Code
Semi-Supervised Semantic Segmentation via Gentle Teaching Assistant

Ying Jin, Jiaqi Wang, Dahua Lin

Semi-Supervised Semantic Segmentation aims at training the segmentation model with limited labeled data and a large amount of unlabeled data. To effectively leverage the unlabeled data, pseudo labeling, along with the teacher-student framework, is widely adopted in semi-supervised semantic segmentation. Though proved to be effective, this paradigm suffers from incorrect pseudo labels which inevitably exist and are taken as auxiliary training data. To alleviate the negative impact of incorrect pseudo labels, we delve into the current Semi-Supervised Semantic Segmentation frameworks. We argue that the unlabeled data with pseudo labels can facilitate the learning of representative features in the feature extractor, but it is unreliable to supervise the mask predictor. Motivated by this consideration, we propose a novel framework, Gentle Teaching Assistant (GTA-Seg) to disentangle the effects of pseudo labels on feature extractor and mask predictor of the student model. Specifically, in addition to the original teacher-student framework, our method introduces a teaching assistant network which directly learns from pseudo labels generated by the teacher network. The gentle teaching assistant (GTA) is coined gentle since it only transfers the beneficial feature representation knowledge in the feature extractor to the student model in an Exponential Moving Average (EMA) manner, protecting the student model from the negative influences caused by unreliable pseudo labels in the mask predictor. The student model is also supervised by reliable labeled data to train an accurate mask predictor, further facilitating feature representation. Extensive experiment results on benchmark datasets validate that our method shows competitive performance against previous methods. Code is available at https://github.com/Jin-Ying/GTA-Seg.

CVJul 1, 2024
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Yubo Ma, Yuhang Zang, Liangyu Chen et al. · pku, stanford

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: https://mayubo2333.github.io/MMLongBench-Doc

CVNov 29, 2023Code
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang et al.

Hallucination, posed as a pervasive challenge of multi-modal large language models (MLLMs), has significantly impeded their real-world usage that demands precise judgment. Existing methods mitigate this issue with either training with specific designed data or inferencing with external knowledge from other sources, incurring inevitable additional costs. In this paper, we present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy, serving as a nearly free lunch to alleviate the hallucination issue without additional data, knowledge, or training. Our approach begins with an interesting observation that, most hallucinations are closely tied to the knowledge aggregation patterns manifested in the self-attention matrix, i.e., MLLMs tend to generate new tokens by focusing on a few summary tokens, but not all the previous tokens. Such partial over-trust inclination results in the neglecting of image tokens and describes the image content with hallucination. Based on the observation, OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens in the previously generated tokens, and re-allocate the token selection if necessary. With extensive experiments, OPERA shows significant hallucination-mitigating performance on different MLLMs and metrics, proving its effectiveness and generality. Our code is available at: https://github.com/shikiw/OPERA.

CVMar 22, 2023Code
Dense Distinct Query for End-to-End Object Detection

Shilong Zhang, Xinjiang Wang, Jiaqi Wang et al.

One-to-one label assignment in object detection has successfully obviated the need for non-maximum suppression (NMS) as postprocessing and makes the pipeline end-to-end. However, it triggers a new dilemma as the widely used sparse queries cannot guarantee a high recall, while dense queries inevitably bring more similar queries and encounter optimization difficulties. As both sparse and dense queries are problematic, then what are the expected queries in end-to-end object detection? This paper shows that the solution should be Dense Distinct Queries (DDQ). Concretely, we first lay dense queries like traditional detectors and then select distinct ones for one-to-one assignments. DDQ blends the advantages of traditional and recent end-to-end detectors and significantly improves the performance of various detectors including FCN, R-CNN, and DETRs. Most impressively, DDQ-DETR achieves 52.1 AP on MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting. DDQ also shares the benefit of end-to-end detectors in crowded scenes and achieves 93.8 AP on CrowdHuman. We hope DDQ can inspire researchers to consider the complementarity between traditional methods and end-to-end detectors. The source code can be found at \url{https://github.com/jshilong/DDQ}.

50.9ROMay 30
Highly Deformable Proprioceptive Membrane for Real-Time 3D Shape Reconstruction

Guanyu Xu, Jiaqi Wang, Dezhong Tong et al.

Reconstructing the three-dimensional (3D) geometry of object surfaces is essential for robot perception, yet vision-based approaches degrade under low illumination or occlusion. This limitation motivates the design of a proprioceptive membrane that conforms to the surface of interest and infers 3D geometry by reconstructing its own deformation. Conventional deformation-aware membranes typically rely on resistive, capacitive, or magneto-sensitive mechanisms, but can suffer from structural complexity, limited compliance during large-scale deformation, and susceptibility to electromagnetic interference. This work presents a soft, flexible, and stretchable proprioceptive silicone membrane based on optical waveguide sensing. The membrane integrates edge-mounted LEDs and centrally-distributed photodiodes (PDs) within a multilayer elastomeric composite. Rich deformation-dependent light-intensity signals are decoded by a data-driven model to recover the membrane geometry. Real-time reconstruction is demonstrated on a customized 140 mm square membrane at an end-to-end update rate of 90 Hz, achieving an average reconstruction error of 1.307 mm for out-of-plane deformation of up to 25 mm. The proposed sensor also demonstrates accurate reconstruction under large in-plane deformation, achieving reliable shape recovery up to 75% strain with an average Chamfer distance of 1.214 mm. The proposed framework provides a scalable, robust, and low-profile solution for global shape perception in deformable robotic systems.

CLNov 13, 2023Code
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu et al.

Despite making significant progress in multi-modal tasks, current Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations, which may lead to harmful consequences. Therefore, evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment. Previous works are limited in high evaluation costs (e.g., relying on humans or advanced LLMs) and insufficient evaluation dimensions (e.g., types of tasks and hallucinations). In this paper, we propose an LLM-free multi-dimensional benchmark AMBER, which can be used to evaluate both generative task and discriminative task including existence, attribute and relation hallucination. Based on AMBER, we design a low-cost and efficient evaluation pipeline. Additionally, we conduct a comprehensive evaluation and detailed analysis of mainstream MLLMs including GPT-4V(ision), and also give guideline suggestions for mitigating hallucinations. The data and code of AMBER are available at https://github.com/junyangwang0410/AMBER.

CVOct 12, 2022
DG-STGCN: Dynamic Spatial-Temporal Modeling for Skeleton-based Action Recognition

Haodong Duan, Jiaqi Wang, Kai Chen et al. · pku

Graph convolution networks (GCN) have been widely used in skeleton-based action recognition. We note that existing GCN-based approaches primarily rely on prescribed graphical structures (ie., a manually defined topology of skeleton joints), which limits their flexibility to capture complicated correlations between joints. To move beyond this limitation, we propose a new framework for skeleton-based action recognition, namely Dynamic Group Spatio-Temporal GCN (DG-STGCN). It consists of two modules, DG-GCN and DG-TCN, respectively, for spatial and temporal modeling. In particular, DG-GCN uses learned affinity matrices to capture dynamic graphical structures instead of relying on a prescribed one, while DG-TCN performs group-wise temporal convolutions with varying receptive fields and incorporates a dynamic joint-skeleton fusion module for adaptive multi-level temporal modeling. On a wide range of benchmarks, including NTURGB+D, Kinetics-Skeleton, BABEL, and Toyota SmartHome, DG-STGCN consistently outperforms state-of-the-art methods, often by a notable margin.

CVJan 18, 2023
OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

Tong Wu, Jiarui Zhang, Xiao Fu et al.

Recent advances in modeling 3D objects mostly rely on synthetic datasets due to the lack of large-scale realscanned 3D databases. To facilitate the development of 3D perception, reconstruction, and generation in the real world, we propose OmniObject3D, a large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects. OmniObject3D has several appealing properties: 1) Large Vocabulary: It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popular 2D datasets (e.g., ImageNet and LVIS), benefiting the pursuit of generalizable 3D representations. 2) Rich Annotations: Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos. 3) Realistic Scans: The professional scanners support highquality object scans with precise shapes and realistic appearances. With the vast exploration space offered by OmniObject3D, we carefully set up four evaluation tracks: a) robust 3D perception, b) novel-view synthesis, c) neural surface reconstruction, and d) 3D object generation. Extensive studies are performed on these four benchmarks, revealing new observations, challenges, and opportunities for future research in realistic 3D vision.

CLAug 21, 2023Code
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

Conghui He, Zhenjiang Jin, Chao Xu et al.

The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models(LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further developments within the community. As a response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.

CVJan 31, 2023Code
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

Dachuan Shi, Chaofan Tao, Ying Jin et al.

Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavier models, \textit{e}.\textit{g}., Transformers, have attracted the attention of researchers to model compression. However, how to compress multimodal models, especially vison-language Transformers, is still under-explored. This paper proposes the \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (\textbf{\emph{UPop}}) as a universal vison-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; 2) progressively searching and retraining the subnet, which maintains convergence between the search and retrain to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.

LGMar 5, 2023Code
Knowledge-Enhanced Semi-Supervised Federated Learning for Aggregating Heterogeneous Lightweight Clients in IoT

Jiaqi Wang, Shenglai Zeng, Zewei Long et al.

Federated learning (FL) enables multiple clients to train models collaboratively without sharing local data, which has achieved promising results in different areas, including the Internet of Things (IoT). However, end IoT devices do not have abilities to automatically annotate their collected data, which leads to the label shortage issue at the client side. To collaboratively train an FL model, we can only use a small number of labeled data stored on the server. This is a new yet practical scenario in federated learning, i.e., labels-at-server semi-supervised federated learning (SemiFL). Although several SemiFL approaches have been proposed recently, none of them can focus on the personalization issue in their model design. IoT environments make SemiFL more challenging, as we need to take device computational constraints and communication cost into consideration simultaneously. To tackle these new challenges together, we propose a novel SemiFL framework named pFedKnow. pFedKnow generates lightweight personalized client models via neural network pruning techniques to reduce communication cost. Moreover, it incorporates pretrained large models as prior knowledge to guide the aggregation of personalized client models and further enhance the framework performance. Experiment results on both image and text datasets show that the proposed pFedKnow outperforms state-of-the-art baselines as well as reducing considerable communication cost. The source code of the proposed pFedKnow is available at https://github.com/JackqqWang/pfedknow/tree/master.

CVJun 1, 2023Code
BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image

Tao Chu, Pan Zhang, Qiong Liu et al.

Understanding and modeling the 3D scene from a single image is a practical problem. A recent advance proposes a panoptic 3D scene reconstruction task that performs both 3D reconstruction and 3D panoptic segmentation from a single image. Although having made substantial progress, recent works only focus on top-down approaches that fill 2D instances into 3D voxels according to estimated depth, which hinders their performance by two ambiguities. (1) instance-channel ambiguity: The variable ids of instances in each scene lead to ambiguity during filling voxel channels with 2D information, confusing the following 3D refinement. (2) voxel-reconstruction ambiguity: 2D-to-3D lifting with estimated single view depth only propagates 2D information onto the surface of 3D regions, leading to ambiguity during the reconstruction of regions behind the frontal view surface. In this paper, we propose BUOL, a Bottom-Up framework with Occupancy-aware Lifting to address the two issues for panoptic 3D scene reconstruction from a single image. For instance-channel ambiguity, a bottom-up framework lifts 2D information to 3D voxels based on deterministic semantic assignments rather than arbitrary instance id assignments. The 3D voxels are then refined and grouped into 3D instances according to the predicted 2D instance centers. For voxel-reconstruction ambiguity, the estimated multi-plane occupancy is leveraged together with depth to fill the whole regions of things and stuff. Our method shows a tremendous performance advantage over state-of-the-art methods on synthetic dataset 3D-Front and real-world dataset Matterport3D. Code and models are available in https://github.com/chtsy/buol.

CVJul 3, 2023
Review of Large Vision Models and Visual Prompt Engineering

Jiaqi Wang, Zhengliang Liu, Lin Zhao et al.

Visual prompt engineering is a fundamental technology in the field of visual and image Artificial General Intelligence, serving as a key component for achieving zero-shot capabilities. As the development of large vision models progresses, the importance of prompt engineering becomes increasingly evident. Designing suitable prompts for specific visual tasks has emerged as a meaningful research direction. This review aims to summarize the methods employed in the computer vision domain for large vision models and visual prompt engineering, exploring the latest advancements in visual prompt engineering. We present influential large models in the visual domain and a range of prompt engineering methods employed on these models. It is our hope that this review provides a comprehensive and systematic description of prompt engineering methods based on large visual models, offering valuable insights for future researchers in their exploration of this field.

CVApr 7, 2023
V3Det: Vast Vocabulary Visual Detection Dataset

Jiaqi Wang, Pan Zhang, Tao Chu et al.

Recent advances in detecting arbitrary objects in the real world are trained and evaluated on object detection datasets with a relatively restricted vocabulary. To facilitate the development of more general visual object detection, we propose V3Det, a vast vocabulary visual detection dataset with precisely annotated bounding boxes on massive images. V3Det has several appealing properties: 1) Vast Vocabulary: It contains bounding boxes of objects from 13,204 categories on real-world images, which is 10 times larger than the existing large vocabulary object detection dataset, e.g., LVIS. 2) Hierarchical Category Organization: The vast vocabulary of V3Det is organized by a hierarchical category tree which annotates the inclusion relationship among categories, encouraging the exploration of category relationships in vast and open vocabulary object detection. 3) Rich Annotations: V3Det comprises precisely annotated objects in 243k images and professional descriptions of each category written by human experts and a powerful chatbot. By offering a vast exploration space, V3Det enables extensive benchmarks on both vast and open vocabulary object detection, leading to new observations, practices, and insights for future research. It has the potential to serve as a cornerstone dataset for developing more general visual perception systems. V3Det is available at https://v3det.openxlab.org.cn/.

CVNov 21, 2023
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jinsong Li, Xiaoyi Dong et al.

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT) phase, by substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions, significantly enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that has remarkable performance across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the LMMs community.

CVAug 7, 2023Code
Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization

Yujie Zhou, Wenwen Qiang, Anyi Rao et al.

Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between visual and semantic space from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a singular feature vector, with subsequent mapping the features to an identical anchor point within the embedded space. Their performance is hindered by 1) the ignorance of the global visual/semantic distribution alignment, which results in a limitation to capture the true interdependence between the two spaces. 2) the negligence of temporal information since the frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between visual and semantic space for distribution alignment; 2) we leverage the temporal information for estimating the MI by encouraging MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Code: https://github.com/YujieOuO/SMIE.

AIApr 28, 2023
Prompt Engineering for Healthcare: Methodologies and Applications

Jiaqi Wang, Enze Shi, Sigang Yu et al.

Prompt engineering is a critical technique in the field of natural language processing that involves designing and optimizing the prompts used to input information into models, aiming to enhance their performance on specific tasks. With the recent advancements in large language models, prompt engineering has shown significant superiority across various domains and has become increasingly important in the healthcare domain. However, there is a lack of comprehensive reviews specifically focusing on prompt engineering in the medical field. This review will introduce the latest advances in prompt engineering in the field of natural language processing for the medical field. First, we will provide the development of prompt engineering and emphasize its significant contributions to healthcare natural language processing applications such as question-answering systems, text summarization, and machine translation. With the continuous improvement of general large language models, the importance of prompt engineering in the healthcare domain is becoming increasingly prominent. The aim of this article is to provide useful resources and bridges for healthcare natural language processing researchers to better explore the application of prompt engineering in this field. We hope that this review can provide new ideas and inspire for research and application in medical natural language processing.

LGAug 25, 2023Code
MLLM-DataEngine: An Iterative Refinement Approach for MLLM

Zhiyuan Zhao, Linke Ouyang, Bin Wang et al.

Despite the great advance of Multimodal Large Language Models (MLLMs) in both instruction dataset building and benchmarking, the independence of training and evaluation makes current MLLMs hard to further improve their capability under the guidance of evaluation results with a relatively low human cost. In this paper, we propose MLLM-DataEngine, a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop iteration, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results, then generate a proper incremental dataset for the next training iteration and enhance the model capability iteratively. Compared with previous data collection methods which are separate from the benchmarking, the data generated by MLLM-DataEngine shows better targeting, quality, and correctness. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data within each incremental dataset based on the benchmarking results. For quality, we resort to GPT-4 to generate high-quality data with each given data type. For correctness, prompt design is critical for the data generation results. Rather than previous hand-crafted prompt, we propose an Interactive Prompt Optimization strategy, which optimizes the prompt with the multi-round interaction between human and GPT, and improve the correctness of generated data greatly. Through extensive experiments, we find our MLLM-DataEngine could boost the MLLM capability in a targeted and automatic manner, with only a few human participation. We hope it could be a general solution for the following MLLMs building. The MLLM-DataEngine has been open-sourced and is now available at https://github.com/opendatalab/MLLM-DataEngine.

CVJun 2, 2022Code
What Are Expected Queries in End-to-End Object Detection?

Shilong Zhang, Xinjiang Wang, Jiaqi Wang et al.

End-to-end object detection is rapidly progressed after the emergence of DETR. DETRs use a set of sparse queries that replace the dense candidate boxes in most traditional detectors. In comparison, the sparse queries cannot guarantee a high recall as dense priors. However, making queries dense is not trivial in current frameworks. It not only suffers from heavy computational cost but also difficult optimization. As both sparse and dense queries are imperfect, then \emph{what are expected queries in end-to-end object detection}? This paper shows that the expected queries should be Dense Distinct Queries (DDQ). Concretely, we introduce dense priors back to the framework to generate dense queries. A duplicate query removal pre-process is applied to these queries so that they are distinguishable from each other. The dense distinct queries are then iteratively processed to obtain final sparse outputs. We show that DDQ is stronger, more robust, and converges faster. It obtains 44.5 AP on the MS COCO detection dataset with only 12 epochs. DDQ is also robust as it outperforms previous methods on both object detection and instance segmentation tasks on various datasets. DDQ blends advantages from traditional dense priors and recent end-to-end detectors. We hope it can serve as a new baseline and inspires researchers to revisit the complementarity between traditional methods and end-to-end detectors. The source code is publicly available at \url{https://github.com/jshilong/DDQ}.

CLApr 17, 2023
An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT

Chong Ma, Zihao Wu, Jiaqi Wang et al.

The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians, and it is typically written by radiologists based on the 'Findings' section. However, writing numerous impressions can be laborious and error-prone for radiologists. Although recent studies have achieved promising results in automatic impression generation using large-scale medical text data for pre-training and fine-tuning pre-trained language models, such models often require substantial amounts of medical text data and have poor generalization performance. While large language models (LLMs) like ChatGPT have shown strong generalization capabilities and performance, their performance in specific domains, such as radiology, remains under-investigated and potentially limited. To address this limitation, we propose ImpressionGPT, which leverages the in-context learning capability of LLMs by constructing dynamic contexts using domain-specific, individualized data. This dynamic prompt approach enables the model to learn contextual knowledge from semantically similar examples from existing data. Additionally, we design an iterative optimization algorithm that performs automatic evaluation on the generated impression results and composes the corresponding instruction prompts to further optimize the model. The proposed ImpressionGPT model achieves state-of-the-art performance on both MIMIC-CXR and OpenI datasets without requiring additional training data or fine-tuning the LLMs. This work presents a paradigm for localizing LLMs that can be applied in a wide range of similar application scenarios, bridging the gap between general-purpose LLMs and the specific language processing needs of various domains.

CVMar 11, 2023
Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation

Zhao Yang, Jiaqi Wang, Yansong Tang et al.

Referring image segmentation segments an image from a language expression. With the aim of producing high-quality masks, existing methods often adopt iterative learning approaches that rely on RNNs or stacked attention layers to refine vision-language features. Despite their complexity, RNN-based methods are subject to specific encoder choices, while attention-based methods offer limited gains. In this work, we introduce a simple yet effective alternative for progressively learning discriminative multi-modal features. The core idea of our approach is to leverage a continuously updated query as the representation of the target object and at each iteration, strengthen multi-modal features strongly correlated to the query while weakening less related ones. As the query is initialized by language features and successively updated by object features, our algorithm gradually shifts from being localization-centric to segmentation-centric. This strategy enables the incremental recovery of missing object parts and/or removal of extraneous parts through iteration. Compared to its counterparts, our method is more versatile$\unicode{x2014}$it can be plugged into prior arts straightforwardly and consistently bring improvements. Experimental results on the challenging datasets of RefCOCO, RefCOCO+, and G-Ref demonstrate its advantage with respect to the state-of-the-art methods.

CVNov 19, 2023Code
Adversarial Prompt Tuning for Vision-Language Models

Jiaming Zhang, Xingjun Ma, Xin Wang et al.

With the rapid advancement of multimodal learning, pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capacities in bridging the gap between visual and language modalities. However, these models remain vulnerable to adversarial attacks, particularly in the image modality, presenting considerable security risks. This paper introduces Adversarial Prompt Tuning (AdvPT), a novel technique to enhance the adversarial robustness of image encoders in VLMs. AdvPT innovatively leverages learnable text prompts and aligns them with adversarial image embeddings, to address the vulnerabilities inherent in VLMs without the need for extensive parameter training or modification of the model architecture. We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques, further boosting defensive capabilities. Comprehensive experimental analyses provide insights into adversarial prompt tuning, a novel paradigm devoted to improving resistance to adversarial images through textual input modifications, paving the way for future robust multimodal learning research. These findings open up new possibilities for enhancing the security of VLMs. Our code is available at https://github.com/jiamingzhang94/Adversarial-Prompt-Tuning.

CLJul 25, 2023
Evaluating Large Language Models for Radiology Natural Language Processing

Zhengliang Liu, Tianyang Zhong, Yiwei Li et al.

The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.

93.9LGJun 2
Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

Ziyue Wang, Aomufei Yuan, Yongfu Zhu et al.

Reinforcement Learning from Verifiable Rewards (RLVR) has become the dominant approach for improving mathematical reasoning in large language models, yet current methods reduce each correct rollout to a single reward bit, ignoring the geometric structure shared among their hidden states. Investigating this structure, we find that at the anchor token (the position immediately before the answer marker), correct rollouts converge naturally because they must produce the same answer (cosine similarity ~0.84), yet each retains residual variance from its unique reasoning path. Encouraging full alignment at this point pushes the model to extract a unified "correct decision" representation, reducing sensitivity to which reasoning path was taken. Based on this observation, we propose Hidden-Align, an auxiliary loss function that aligns the last-layer hidden states of correct rollouts at the anchor token during RL training, with zero overhead in both training and inference. On eight mathematical reasoning benchmarks, Hidden-Align improves average pass@1 over the DAPO baseline by 3.8, 6.2, and 5.4 percentage points on Qwen3-1.7B, 4B, and 14B respectively, with consistent pass@k gains across all three scales, supported by ablations on loss type, anchor position, layer depth, and loss weight.

LGSep 22, 2022
In Differential Privacy, There is Truth: On Vote Leakage in Ensemble Private Learning

Jiaqi Wang, Roei Schuster, Ilia Shumailov et al. · deepmind

When learning from sensitive data, care must be taken to ensure that training algorithms address privacy concerns. The canonical Private Aggregation of Teacher Ensembles, or PATE, computes output labels by aggregating the predictions of a (possibly distributed) collection of teacher models via a voting mechanism. The mechanism adds noise to attain a differential privacy guarantee with respect to the teachers' training data. In this work, we observe that this use of noise, which makes PATE predictions stochastic, enables new forms of leakage of sensitive information. For a given input, our adversary exploits this stochasticity to extract high-fidelity histograms of the votes submitted by the underlying teachers. From these histograms, the adversary can learn sensitive attributes of the input such as race, gender, or age. Although this attack does not directly violate the differential privacy guarantee, it clearly violates privacy norms and expectations, and would not be possible at all without the noise inserted to obtain differential privacy. In fact, counter-intuitively, the attack becomes easier as we add more noise to provide stronger differential privacy. We hope this encourages future work to consider privacy holistically rather than treat differential privacy as a panacea.

CLSep 27, 2024
Evaluation of OpenAI o1: Opportunities and Challenges of AGI

Tianyang Zhong, Zhengliang Liu, Yi Pan et al.

This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: -83.3% success rate in solving complex competitive programming problems, surpassing many human experts. -Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. -100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. -Advanced natural language inference capabilities across general and specialized domains like medicine. -Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis. -Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. -Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills. -Effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.

98.2CVJun 1
AdaCodec: A Predictive Visual Code for Video MLLMs

Haowen Hou, Zhen Huang, Zheming Liang et al.

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a \emph{predictive visual code}, and instantiate it for video MLLMs as \textbf{AdaCodec}. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at $1/7$ the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.

CVNov 28, 2023
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Zhiyuan Zhao, Bin Wang, Linke Ouyang et al.

Multimodal large language models have made significant advancements in recent years, yet they still suffer from a common issue known as the "hallucination problem", in which the models generate textual descriptions that inaccurately depict or entirely fabricate content from associated images. This paper introduces a novel solution, Hallucination-Aware Direct Preference Optimization (HA-DPO), which reframes the hallucination problem as a preference selection task. The model is trained to favor the non-hallucinating response when presented with two responses of the same image (one accurate and one hallucinatory). Furthermore, this paper proposes an efficient pipeline for constructing positive~(non-hallucinatory) and negative~(hallucinatory) sample pairs, ensuring a high-quality, style-consistent dataset for robust preference learning. When applied to three mainstream multimodal models, HA-DPO significantly reduced hallucination issues and amplified the models' generalization capabilities. Notably, the MiniGPT-4 model, when enhanced with HA-DPO, demonstrated a substantial improvement: POPE accuracy rose from 51.13% to 86.13% (an absolute improvement of 35%), and the MME score surged from 932.00 to 1326.46 (a relative improvement of 42.32%). The codes, models, and datasets are made accessible at https://opendatalab.github.io/HA-DPO.

AIAug 2, 2024
A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

Jiaqi Wang, Hanqi Jiang, Yiheng Liu et al.

In an era defined by the explosive growth of data and rapid technological advancements, Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence (AI) systems. Designed to seamlessly integrate diverse data types-including text, images, videos, audio, and physiological sequences-MLLMs address the complexities of real-world applications far beyond the capabilities of single-modality systems. In this paper, we systematically sort out the applications of MLLM in multimodal tasks such as natural language, vision, and audio. We also provide a comparative analysis of the focus of different MLLMs in the tasks, and provide insights into the shortcomings of current MLLMs, and suggest potential directions for future research. Through these discussions, this paper hopes to provide valuable insights for the further development and application of MLLM.

89.9CVApr 18Code
EasyVideoR1: Easier RL for Video Understanding

Chuanyu Qin, Chenxu Yang, Qingyi Si et al.

Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present \textbf{EasyVideoR1}, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47 $\times$ throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.

14.9AIMay 31
Brain-Atlas-Guided Generative Counterfactual Attention for Explainable Cognitive Decline Diagnosis Using Multimodal Connectomes

Xiongri Shen, Jiaqi Wang, Zhenxi Song et al.

Mild cognitive impairment (MCI) and subjective cognitive decline (SCD) are closely associated with the early Alzheimer's disease continuum, where accurate and explainable diagnosis is important for early risk assessment and intervention. Existing connectome-based deep learning models can improve classification performance but often provide limited insight into disease-related functional and structural connectivity changes. This paper proposes an atlas-knowledge-guided Generative Counterfactual Attention-guided Network (GCAN) for explainable cognitive decline diagnosis using multimodal brain connectomes. GCAN formulates diagnosis as a source-to-target counterfactual generation problem, where target-label connectomes are generated from source-label inputs and their differences are used to construct counterfactual attention maps. To preserve connectome topology, an Atlas-aware Bidirectional Transformer (AABT) performs network-level token encoding and decoding under brain-atlas constraints. The framework is further extended from functional connectivity (FC) to joint functional and structural connectivity (SC) modeling, enabling counterfactual analysis of complementary functional reorganization and structural topology changes. Experiments on hospital-collected and ADNI datasets show that GCAN achieves competitive performance across HC vs. SCD, HC vs. MCI, and SCD vs. MCI classification tasks. Visualization, circular connectome analysis, CAM-based comparison, ablation studies, and confidence interval analysis further support the interpretability and reliability of the proposed framework. Modality-specific FC and SC pre-trained classifiers are used to provide target-state priors for counterfactual generation while being separated from the downstream diagnostic classifier to prevent data leakage.

75.3AIApr 20Code
One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction

Jihong Guan, Jiaqi Wang, Wengen Li et al.

Knowledge Graphs (KGs) are composed of triples, and the goal of Knowledge Graph Completion (KGC) is to infer the missing factual triples. Traditional KGC tasks predict missing elements in a triple given one or two of its elements. As a more realistic task, the Triple Set Prediction (TSP) task aims to infer the set of missing triples conditioned only on the observed knowledge graph, without assuming any partial information about the missing triples. Existing TSP methods predict the set of missing triples in a triple-by-triple manner, falling short in capturing the dependencies among the predicted triples to ensure consistency. To address this issue, we propose a novel discrete diffusion model termed DiffTSP that treats TSP as a generative task. DiffTSP progressively adds noise to the KG through a discrete diffusion process, achieved by masking relational edges. The reverse process then gradually recovers the complete KG conditioned on the incomplete graph. To this end, we design a structure-aware denoising network that integrates a relational context encoder with a relational graph diffusion transformer for knowledge graph generation. DiffTSP can generate the complete set of triples in a one-pass manner while ensuring the dependencies among the predicted triples. Our approach achieves state-of-the-art performance on three public datasets. Code: https://github.com/ADMIS-TONGJI/DiffTSP.

CVMay 6, 2022
MINI: Mining Implicit Novel Instances for Few-Shot Object Detection

Yuhang Cao, Jiaqi Wang, Yiqi Lin et al.

Learning from a few training samples is a desirable ability of an object detector, inspiring the explorations of Few-Shot Object Detection (FSOD). Most existing approaches employ a pretrain-transfer paradigm. The model is first pre-trained on base classes with abundant data and then transferred to novel classes with a few annotated samples. Despite the substantial progress, the FSOD performance is still far behind satisfactory. During pre-training, due to the co-occurrence between base and novel classes, the model is learned to treat the co-occurred novel classes as backgrounds. During transferring, given scarce samples of novel classes, the model suffers from learning discriminative features to distinguish novel instances from backgrounds and base classes. To overcome the obstacles, we propose a novel framework, Mining Implicit Novel Instances (MINI), to mine the implicit novel instances as auxiliary training samples, which widely exist in abundant base data but are not annotated. MINI comprises an offline mining mechanism and an online mining mechanism. The offline mining mechanism leverages a self-supervised discriminative model to collaboratively mine implicit novel instances with a trained FSOD network. Taking the mined novel instances as auxiliary training samples, the online mining mechanism takes a teacher-student framework to simultaneously update the FSOD network and the mined implicit novel instances on the fly. Extensive experiments on PASCAL VOC and MS-COCO datasets show MINI achieves new state-of-the-art performance on any shot and split. The significant performance improvements demonstrate the superiority of our method.

CVAug 24, 2023
VIGC: Visual Instruction Generation and Correction

Bin Wang, Fan Wu, Xiao Han et al.

The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, such as LLaVA, relies on language-only GPT-4 to generate data, which requires pre-annotated image captions and detection bounding boxes, suffering from understanding image details. A practical solution to this problem would be to utilize the available multimodal large language models (MLLMs) to generate instruction data for vision-language tasks. However, it's worth noting that the currently accessible MLLMs are not as powerful as their LLM counterparts, as they tend to produce inadequate responses and generate false information. As a solution for addressing the current issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework that enables multimodal large language models to generate instruction-tuning data and progressively enhance its quality on-the-fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct any inaccuracies in data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality based on various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances the benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC.

LGSep 19, 2024Code
FoME: A Foundation Model for EEG using Adaptive Temporal-Lateral Attention Scaling

Enze Shi, Kui Zhao, Qilong Yuan et al.

Electroencephalography (EEG) is a vital tool to measure and record brain activity in neuroscience and clinical applications, yet its potential is constrained by signal heterogeneity, low signal-to-noise ratios, and limited labeled datasets. In this paper, we propose FoME (Foundation Model for EEG), a novel approach using adaptive temporal-lateral attention scaling to address above-mentioned challenges. FoME is pre-trained on a diverse 1.7TB dataset of scalp and intracranial EEG recordings, comprising 745M parameters trained for 1,096k steps. Our model introduces two key innovations: a time-frequency fusion embedding technique and an adaptive time-lateral attention scaling (ATLAS) mechanism. These components synergistically capture complex temporal and spectral EEG dynamics, enabling FoME to adapt to varying patterns across diverse data streams and facilitate robust multi-channel modeling. Evaluations across four downstream tasks demonstrate FoME's superior performance in classification and forecasting applications, consistently achieving state-of-the-art results. To conclude, FoME establishes a new paradigm for EEG analysis, offering a versatile foundation that advances brain-computer interfaces, clinical diagnostics, and cognitive research across neuroscience and related fields. Our code will be available at https://github.com/1061413241/FoME.

CLOct 8, 2023
ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data

Tianyang Zhong, Wei Zhao, Yutong Zhang et al.

Radiology report generation, as a key step in medical image analysis, is critical to the quantitative analysis of clinically informed decision-making levels. However, complex and diverse radiology reports with cross-source heterogeneity pose a huge generalizability challenge to the current methods under massive data volume, mainly because the style and normativity of radiology reports are obviously distinctive among institutions, body regions inspected and radiologists. Recently, the advent of large language models (LLM) offers great potential for recognizing signs of health conditions. To resolve the above problem, we collaborate with the Second Xiangya Hospital in China and propose ChatRadio-Valuer based on the LLM, a tailored model for automatic radiology report generation that learns generalizable representations and provides a basis pattern for model adaptation in sophisticated analysts' cases. Specifically, ChatRadio-Valuer is trained based on the radiology reports from a single institution by means of supervised fine-tuning, and then adapted to disease diagnosis tasks for human multi-system evaluation (i.e., chest, abdomen, muscle-skeleton, head, and maxillofacial $\&$ neck) from six different institutions in clinical-level events. The clinical dataset utilized in this study encompasses a remarkable total of \textbf{332,673} observations. From the comprehensive results on engineering indicators, clinical efficacy and deployment cost metrics, it can be shown that ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4 et al., in terms of the diseases diagnosis from radiology reports. ChatRadio-Valuer provides an effective avenue to boost model generalization performance and alleviate the annotation workload of experts to enable the promotion of clinical AI applications in radiology reports.

24.3LGMay 30
scBatchProx: Federated-Inspired Refinement for Stable Cell-Type Discriminability under Heterogeneous Batch Compositions

Quang-Huy Nguyen, Jiaqi Wang, Wei-Shinn Ku

Single-cell integration workflows often construct low-dimensional cell embeddings and then refine them with post-hoc methods to reduce batch effects. This refinement process can become unstable when cell-type compositions vary across batches, with some populations underrepresented or absent in particular batches. The problem becomes more consequential in dynamic single-cell data systems, where newly acquired batches can change both technical conditions and cell-type composition. Such instability can reduce downstream cell-type classification performance and weaken stability under imbalance perturbations. We introduce scBatchProx, a lightweight post-hoc refinement method for stabilizing single-cell latent embeddings in these heterogeneous and evolving settings. scBatchProx operates directly on precomputed embeddings and treats each batch or study as a client in a federated-inspired optimization procedure. A batch-conditioned FiLM adapter learns local latent updates, while proximal and identity-preserving regularization keep these updates conservative. Experiments on multi-batch and cross-study single-cell datasets show that scBatchProx improves downstream cell-type classification across different upstream embeddings. In controlled imbalance perturbations, scBatchProx maintains more stable affected-cell-type F1 when selected populations are downsampled or ablated from one batch. In cumulative retraining and continual integration settings, scBatchProx remains effective as new datasets arrive over time. Together, these results suggest that conservative, federated-inspired refinement can help maintain stable single-cell embeddings as batch compositions change across datasets and over time.

LGAug 16, 2023
Towards Personalized Federated Learning via Heterogeneous Model Reassembly

Jiaqi Wang, Xingyi Yang, Suhan Cui et al.

This paper focuses on addressing the practical yet challenging problem of model heterogeneity in federated learning, where clients possess models with different network structures. To track this problem, we propose a novel framework called pFedHR, which leverages heterogeneous model reassembly to achieve personalized federated learning. In particular, we approach the problem of heterogeneous model personalization as a model-matching optimization task on the server side. Moreover, pFedHR automatically and dynamically generates informative and diverse personalized candidates with minimal human intervention. Furthermore, our proposed heterogeneous model reassembly technique mitigates the adverse impact introduced by using public data with different distributions from the client data to a certain extent. Experimental results demonstrate that pFedHR outperforms baselines on three datasets under both IID and Non-IID settings. Additionally, pFedHR effectively reduces the adverse impact of using different public data and dynamically generates diverse personalized models in an automated manner.

IVNov 10, 2023
Holistic Evaluation of GPT-4V for Biomedical Imaging

Zhengliang Liu, Hanqi Jiang, Tianyang Zhong et al.

In this paper, we present a large-scale evaluation probing GPT-4V's capabilities and limitations for biomedical image analysis. GPT-4V represents a breakthrough in artificial general intelligence (AGI) for computer vision, with applications in the biomedical domain. We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more. Tasks include modality recognition, anatomy localization, disease diagnosis, report generation, and lesion detection. The extensive experiments provide insights into GPT-4V's strengths and weaknesses. Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization. GPT-4V excels at diagnostic report generation, indicating strong image captioning skills. While promising for biomedical imaging AI, GPT-4V requires further enhancement and validation before clinical deployment. We emphasize responsible development and testing for trustworthy integration of biomedical AGI. This rigorous evaluation of GPT-4V on diverse medical images advances understanding of multimodal large language models (LLMs) and guides future work toward impactful healthcare applications.

BMJul 17, 2023
Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding

Zihan Liu, Jiaqi Wang, Yun Luo et al.

In recent years, there has been an explosion of research on the application of deep learning to the prediction of various peptide properties, due to the significant development and market potential of peptides. Molecular dynamics has enabled the efficient collection of large peptide datasets, providing reliable training data for deep learning. However, the lack of systematic analysis of the peptide encoding, which is essential for AI-assisted peptide-related tasks, makes it an urgent problem to be solved for the improvement of prediction accuracy. To address this issue, we first collect a high-quality, colossal simulation dataset of peptide self-assembly containing over 62,000 samples generated by coarse-grained molecular dynamics (CGMD). Then, we systematically investigate the effect of peptide encoding of amino acids into sequences and molecular graphs using state-of-the-art sequential (i.e., RNN, LSTM, and Transformer) and structural deep learning models (i.e., GCN, GAT, and GraphSAGE), on the accuracy of peptide self-assembly prediction, an essential physiochemical process prior to any peptide-related applications. Extensive benchmarking studies have proven Transformer to be the most powerful sequence-encoding-based deep learning model, pushing the limit of peptide self-assembly prediction to decapeptides. In summary, this work provides a comprehensive benchmark analysis of peptide encoding with advanced deep learning models, serving as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.

SPSep 16, 2023
Intelligent machines work in unstructured environments by differential neuromorphic computing

Shengbo Wang, Shuo Gao, Chenyu Tang et al.

Efficient operation of intelligent machines in the real world requires methods that allow them to understand and predict the uncertainties presented by the unstructured environments with good accuracy, scalability and generalization, similar to humans. Current methods rely on pretrained networks instead of continuously learning from the dynamic signal properties of working environments and suffer inherent limitations, such as data-hungry procedures, and limited generalization capabilities. Herein, we present a memristor-based differential neuromorphic computing, perceptual signal processing and learning method for intelligent machines. The main features of environmental information such as amplification (>720%) and adaptation (<50%) of mechanical stimuli encoded in memristors, are extracted to obtain human-like processing in unstructured environments. The developed method takes advantage of the intrinsic multi-state property of memristors and exhibits good scalability and generalization, as confirmed by validation in two different application scenarios: object grasping and autonomous driving. In the former, a robot hand experimentally realizes safe and stable grasping through fast learning (in ~1 ms) the unknown object features (e.g., sharp corner and smooth surface) with a single memristor. In the latter, the decision-making information of 10 unstructured environments in autonomous driving (e.g., overtaking cars, pedestrians) is accurately (94%) extracted with a 40*25 memristor array. By mimicking the intrinsic nature of human low-level perception mechanisms, the electronic memristive neuromorphic circuit-based method, presented here shows the potential for adapting to diverse sensing technologies and helping intelligent machines generate smart high-level decisions in the real world.

DBJul 5, 2023
Real-time Workload Pattern Analysis for Large-scale Cloud Databases

Jiaqi Wang, Tianyi Li, Anni Wang et al.

Hosting database services on cloud systems has become a common practice. This has led to the increasing volume of database workloads, which provides the opportunity for pattern analysis. Discovering workload patterns from a business logic perspective is conducive to better understanding the trends and characteristics of the database system. However, existing workload pattern discovery systems are not suitable for large-scale cloud databases which are commonly employed by the industry. This is because the workload patterns of large-scale cloud databases are generally far more complicated than those of ordinary databases. In this paper, we propose Alibaba Workload Miner (AWM), a real-time system for discovering workload patterns in complicated large-scale workloads. AWM encodes and discovers the SQL query patterns logged from user requests and optimizes the querying processing based on the discovered patterns. First, Data Collection & Preprocessing Module collects streaming query logs and encodes them into high-dimensional feature embeddings with rich semantic contexts and execution features. Next, Online Workload Mining Module separates encoded queries by business groups and discovers the workload patterns for each group. Meanwhile, Offline Training Module collects labels and trains the classification model using the labels. Finally, Pattern-based Optimizing Module optimizes query processing in cloud databases by exploiting discovered patterns. Extensive experimental results on one synthetic dataset and two real-life datasets (extracted from Alibaba Cloud databases) show that AWM enhances the accuracy of pattern discovery by 66% and reduce the latency of online inference by 22%, compared with the state-of-the-arts.

CVJun 2, 2023
OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection

Zhangyang Qi, Jiaqi Wang, Xiaoyang Wu et al.

Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost. Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm, which benefits from both BEV's strong perception power and end-to-end pipeline. Despite achieving substantial progress, existing works model objects via globally leveraging temporal and spatial information of BEV features, resulting in problems when handling the challenging complex and dynamic autonomous driving scenarios. In this paper, we proposed an Object-Centric query-BEV detector OCBEV, which can carve the temporal and spatial cues of moving targets more effectively. OCBEV comprises three designs: Object Aligned Temporal Fusion aligns the BEV feature based on ego-motion and estimated current locations of moving objects, leading to a precise instance-level feature fusion. Object Focused Multi-View Sampling samples more 3D features from an adaptive local height ranges of objects for each scene to enrich foreground information. Object Informed Query Enhancement replaces part of pre-defined decoder queries in common DETR-style decoders with positional features of objects on high-confidence locations, introducing more direct object positional priors. Extensive experimental evaluations are conducted on the challenging nuScenes dataset. Our approach achieves a state-of-the-art result, surpassing the traditional BEVFormer by 1.5 NDS points. Moreover, we have a faster convergence speed and only need half of the training iterations to get comparable performance, which further demonstrates its effectiveness.