CVJul 28, 2022Code
Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion TransformerHao Shao, Letian Wang, RuoBing Chen et al. · tsinghua
Large-scale deployment of autonomous vehicles has been continually delayed due to safety concerns. On the one hand, comprehensive scene understanding is indispensable, a lack of which would result in vulnerability to rare but complex traffic situations, such as the sudden emergence of unknown objects. However, reasoning from a global context requires access to sensors of multiple types and adequate fusion of multi-modal sensor signals, which is difficult to achieve. On the other hand, the lack of interpretability in learning models also hampers the safety with unverifiable failure causes. In this paper, we propose a safety-enhanced autonomous driving framework, named Interpretable Sensor Fusion Transformer(InterFuser), to fully process and fuse information from multi-modal multi-view sensors for achieving comprehensive scene understanding and adversarial event detection. Besides, intermediate interpretable features are generated from our framework, which provide more semantics and are exploited to better constrain actions to be within the safe sets. We conducted extensive experiments on CARLA benchmarks, where our model outperforms prior methods, ranking the first on the public CARLA Leaderboard. Our code will be made available at https://github.com/opendilab/InterFuser
CVMay 26, 2022Code
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision TransformersJihao Liu, Xin Huang, Jinliang Zheng et al. · tsinghua
In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim at reconstructing original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down the training and causes pretraining-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). On the other hand, MAE does not introduce [MASK] tokens at its encoder at all but is not applicable for hierarchical Vision Transformers. To solve the issue and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the two original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores using Swin Transformer with a large window size and scales up to huge model size (to reach 600M parameters). Empirical results demonstrate that MixMAE can learn high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs. Besides, its transfer performances on the other 6 datasets show that MixMAE has better FLOPs / performance tradeoff than previous popular MIM methods. Code is available at https://github.com/Sense-X/MixMIM.
CVJul 18, 2022Code
TokenMix: Rethinking Image Mixing for Data Augmentation in Vision TransformersJihao Liu, Boxiao Liu, Hang Zhou et al.
CutMix is a popular augmentation technique commonly used for training modern convolutional and transformer vision networks. It was originally designed to encourage Convolution Neural Networks (CNNs) to focus more on an image's global context instead of local information, which greatly improves the performance of CNNs. However, we found it to have limited benefits for transformer-based architectures that naturally have a global receptive field. In this paper, we propose a novel data augmentation technique TokenMix to improve the performance of vision transformers. TokenMix mixes two images at token level via partitioning the mixing region into multiple separated parts. Besides, we show that the mixed learning target in CutMix, a linear combination of a pair of the ground truth labels, might be inaccurate and sometimes counter-intuitive. To obtain a more suitable target, we propose to assign the target score according to the content-based neural activation maps of the two images from a pre-trained teacher model, which does not need to have high performance. With plenty of experiments on various vision transformer architectures, we show that our proposed TokenMix helps vision transformers focus on the foreground area to infer the classes and enhances their robustness to occlusion, with consistent performance gains. Notably, we improve DeiT-T/S/B with +1% ImageNet top-1 accuracy. Besides, TokenMix enjoys longer training, which achieves 81.2% top-1 accuracy on ImageNet with DeiT-S trained for 400 epochs. Code is available at https://github.com/Sense-X/TokenMix.
CVSep 19, 2024Code
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search EnginesDongzhi Jiang, Renrui Zhang, Ziyu Guo et al. · pku
The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting the multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which involves no overlap with the current LMMs' training data, ensuring the correct answer can only be obtained within searching. By using MMSearch-Engine, the LMMs are evaluated by performing three individual tasks (requery, rerank, and summarization), and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, which surpasses the commercial product, Perplexity Pro, in the end-to-end task, demonstrating the effectiveness of our proposed pipeline. We further present error analysis to unveil current LMMs still struggle to fully grasp the multimodal search tasks, and conduct ablation study to indicate the potential of scaling test-time computation for AI search engine. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engine. Project Page: https://mmsearch.github.io
CVApr 3, 2023Code
Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object PredictionZhuofan Zong, Dongzhi Jiang, Guanglu Song et al.
In this paper, we propose a new paradigm, named Historical Object Prediction (HoP) for multi-view 3D detection to leverage temporal information more effectively. The HoP approach is straightforward: given the current timestamp t, we generate a pseudo Bird's-Eye View (BEV) feature of timestamp t-k from its adjacent frames and utilize this feature to predict the object set at timestamp t-k. Our approach is motivated by the observation that enforcing the detector to capture both the spatial location and temporal motion of objects occurring at historical timestamps can lead to more accurate BEV feature learning. First, we elaborately design short-term and long-term temporal decoders, which can generate the pseudo BEV feature for timestamp t-k without the involvement of its corresponding camera images. Second, an additional object decoder is flexibly attached to predict the object targets using the generated pseudo BEV feature. Note that we only perform HoP during training, thus the proposed method does not introduce extra overheads during inference. As a plug-and-play approach, HoP can be easily incorporated into state-of-the-art BEV detection frameworks, including BEVFormer and BEVDet series. Furthermore, the auxiliary HoP approach is complementary to prevalent temporal modeling methods, leading to significant performance gains. Extensive experiments are conducted to evaluate the effectiveness of the proposed HoP on the nuScenes dataset. We choose the representative methods, including BEVFormer and BEVDet4D-Depth to evaluate our method. Surprisingly, HoP achieves 68.5% NDS and 62.4% mAP with ViT-L on nuScenes test, outperforming all the 3D object detectors on the leaderboard. Codes will be available at https://github.com/Sense-X/HoP.
CVNov 22, 2022Code
DETRs with Collaborative Hybrid Assignments TrainingZhuofan Zong, Guanglu Song, Yu Liu
In this paper, we provide the observation that too few queries assigned as positive samples in DETR with one-to-one set matching leads to sparse supervision on the encoder's output which considerably hurt the discriminative feature learning of the encoder and vice visa for attention learning in the decoder. To alleviate this, we present a novel collaborative hybrid assignments training scheme, namely $\mathcal{C}$o-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. This new training scheme can easily enhance the encoder's learning ability in end-to-end detectors by training the multiple parallel auxiliary heads supervised by one-to-many label assignments such as ATSS and Faster RCNN. In addition, we conduct extra customized positive queries by extracting the positive coordinates from these auxiliary heads to improve the training efficiency of positive samples in the decoder. In inference, these auxiliary heads are discarded and thus our method introduces no additional parameters and computational cost to the original detector while requiring no hand-crafted non-maximum suppression (NMS). We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and DINO-Deformable-DETR. The state-of-the-art DINO-Deformable-DETR with Swin-L can be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, incorporated with ViT-L backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, outperforming previous methods by clear margins with much fewer model sizes. Codes are available at \url{https://github.com/Sense-X/Co-DETR}.
CVJun 2Code
FCUS-rPPG: A Fast-Converging Unsupervised Framework for Remote Photoplethysmography via Gradient Oscillation SuppressionJiajie Li, Yu Liu, Rencheng Song et al.
Remote photoplethysmography (rPPG) enables non-contact extraction of blood volume pulse (BVP) signals using consumer-grade cameras. Recent unsupervised rPPG methods learn BVP representations without requiring ground-truth physiological annotations, yet their optimization is often hindered by noisy and unstable gradients, resulting in slow convergence and limited cross-domain generalization. In this paper, we propose FCUS-rPPG, a fast-converging unsupervised rPPG framework with strong generalization capability. Motivated by the observation that BVP representations exhibit both multi-spectral covariation and low-dimensional manifold structure, we design a spectrally shared backbone that facilitates BVP feature disentanglement while improving optimization efficiency. To jointly enhance convergence stability and generalization performance, we further develop a unified optimization framework operating at the gradient, loss-landscape, and feature-representation levels. Specifically, a post-verification masking mechanism filters out misleading gradients according to the weak-amplitude physiological prior of BVP signals; a perturbation-based loss landscape smoothing strategy steers optimization toward more generalizable flat minima; and a noise-aware null-space regularization constrains feature updates to the orthogonal complement of the noise subspace, thereby mitigating noise-induced representation drift. Extensive experiments on five datasets demonstrate that FCUS-rPPG requires only one training epoch, whereas existing methods typically require tens to hundreds of epochs. Notably, FCUS-rPPG consistently achieves state-of-the-art (SOTA) performance in cross-dataset evaluations. This study provides an efficient and robust solution to the real-world deployment of unsupervised rPPG. The source code will be publicly available at https://github.com/JiaJieLee/FCUS-rPPG.
CVJul 12, 2022Code
UniNet: Unified Architecture Search with Convolution, Transformer, and MLPJihao Liu, Xin Huang, Guanglu Song et al.
Recently, transformer and multi-layer perceptron (MLP) architectures have achieved impressive results on various vision tasks. However, how to effectively combine those operators to form high-performance hybrid visual architectures still remains a challenge. In this work, we study the learnable combination of convolution, transformer, and MLP by proposing a novel unified architecture search approach. Our approach contains two key designs to achieve the search for high-performance networks. First, we model the very different searchable operators in a unified form, and thus enable the operators to be characterized with the same set of configuration parameters. In this way, the overall search space size is significantly reduced, and the total search cost becomes affordable. Second, we propose context-aware downsampling modules (DSMs) to mitigate the gap between the different types of operators. Our proposed DSMs are able to better adapt features from different types of operators, which is important for identifying high-performance hybrid architectures. Finally, we integrate configurable operators and DSMs into a unified search space and search with a Reinforcement Learning-based search algorithm to fully explore the optimal combination of the operators. To this end, we search a baseline network and scale it up to obtain a family of models, named UniNets, which achieve much better accuracy and efficiency than previous ConvNets and Transformers. In particular, our UniNet-B5 achieves 84.9% top-1 accuracy on ImageNet, outperforming EfficientNet-B7 and BoTNet-T7 with 44% and 55% fewer FLOPs respectively. By pretraining on the ImageNet-21K, our UniNet-B6 achieves 87.4%, outperforming Swin-L with 51% fewer FLOPs and 41% fewer parameters. Code is available at https://github.com/Sense-X/UniNet.
CVJul 7, 2023Code
GPT4RoI: Instruction Tuning Large Language Model on Region-of-InterestShilong Zhang, Peize Sun, Shoufa Chen et al.
Visual instruction tuning large language model(LLM) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancements to fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces the reference to the region-of-interest(RoI) in the instruction. Before sending to LLM, the reference is replaced by RoI features and interleaved with language embeddings as a sequence. Our model GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: Users can interact with our model by both language and drawing bounding boxes to flexibly adjust the referring granularity. (2) Versatile multimodal abilities: A variety of attribute information within each RoI can be mined by GPT4RoI, e.g., color, shape, material, action, etc. Furthermore, it can reason about multiple RoIs based on common sense. On the Visual Commonsense Reasoning(VCR) dataset, GPT4RoI achieves a remarkable accuracy of 81.6%, surpassing all existing models by a significant margin (the second place is 75.6%) and almost reaching human-level performance of 85.0%. The code and model can be found at https://github.com/jshilong/GPT4RoI.
IRMay 27Code
Fine-Tuned LLM as a Complementary Predictor Improving Ads SystemHui Yang, Daiwei He, Kevin Jiang et al.
Recommendation systems power engagement and monetization across feeds, ads, and short-video platforms, but translating the latest advances in Large Language Models into Recommendation Systems (RecSys) gains remains rare, particularly in advertising and production-scale real-world industry setups. Prior real-world LLM successes typically fall into three buckets: (a) generative retrieval that directly predicts the next items for candidate generation, (b) late-stage re-ranking that uses LLMs, and (c) auxiliary signal enrichment with LLMs. We introduce a complementary paradigm for ads: a fine-tuned open-source LLM used not as a ranker, but as an ads-specific ancillary predictor, forecasting likely advertisers from user profiles and histories. This LLM-driven advertiser prediction augments conventional candidate generation and provides informative priors to downstream ranking. Developed in a large-scale production advertising system, our approach produces substantial offline improvements and measurable online business impact, demonstrating that LLM world knowledge and predictive capacity can be efficiently harnessed. Beyond validating LLMs for ads applications, our results show that targeted ancillary predictions can unlock end-to-end gains across both retrieval and late-stage ranking, offering a practical path to LLM-enhanced recommendation at scale.
AIApr 30, 2022
HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene EncodingXiaosong Jia, Penghao Wu, Li Chen et al. · pku
Encoding a driving scene into vector representations has been an essential task for autonomous driving that can benefit downstream tasks e.g. trajectory prediction. The driving scene often involves heterogeneous elements such as the different types of objects (agents, lanes, traffic signs) and the semantic relations between objects are rich and diverse. Meanwhile, there also exist relativity across elements, which means that the spatial relation is a relative concept and need be encoded in a ego-centric manner instead of in a global coordinate system. Based on these observations, we propose Heterogeneous Driving Graph Transformer (HDGT), a backbone modelling the driving scene as a heterogeneous graph with different types of nodes and edges. For heterogeneous graph construction, we connect different types of nodes according to diverse semantic relations. For spatial relation encoding, the coordinates of the node as well as its in-edges are in the local node-centric coordinate system. For the aggregation module in the graph neural network (GNN), we adopt the transformer structure in a hierarchical way to fit the heterogeneous nature of inputs. Experimental results show that HDGT achieves state-of-the-art performance for the task of trajectory prediction, on INTERACTION Prediction Challenge and Waymo Open Motion Challenge.
AISep 19, 2023Code
Towards Generative Modeling of Urban Flow through Knowledge-enhanced Denoising DiffusionZhilun Zhou, Jingtao Ding, Yu Liu et al.
Although generative AI has been successful in many areas, its ability to model geospatial data is still underexplored. Urban flow, a typical kind of geospatial data, is critical for a wide range of urban applications. Existing studies mostly focus on predictive modeling of urban flow that predicts the future flow based on historical flow data, which may be unavailable in data-sparse areas or newly planned regions. Some other studies aim to predict OD flow among regions but they fail to model dynamic changes of urban flow over time. In this work, we study a new problem of urban flow generation that generates dynamic urban flow for regions without historical flow data. To capture the effect of multiple factors on urban flow, such as region features and urban environment, we employ diffusion model to generate urban flow for regions under different conditions. We first construct an urban knowledge graph (UKG) to model the urban environment and relationships between regions, based on which we design a knowledge-enhanced spatio-temporal diffusion model (KSTDiff) to generate urban flow for each region. Specifically, to accurately generate urban flow for regions with different flow volumes, we design a novel diffusion process guided by a volume estimator, which is learnable and customized for each region. Moreover, we propose a knowledge-enhanced denoising network to capture the spatio-temporal dependencies of urban flow as well as the impact of urban environment in the denoising process. Extensive experiments on four real-world datasets validate the superiority of our model over state-of-the-art baselines in urban flow generation. Further in-depth studies demonstrate the utility of generated urban flow data and the ability of our model for long-term flow generation and urban flow prediction. Our code is released at: https://github.com/tsinghua-fib-lab/KSTDiff-Urban-flow-generation.
SOC-PHJun 1
Global evidence for a consistent spatial footprint of intra-urban centersShuai Pang, Junlong Zhang, Yu Liu et al.
Urban space is highly heterogeneous, with economic and population activities concentrating in localized centers. However, the global organization of such intra-urban centers remains poorly understood due to the lack of consistent, comparable data. Here we develop a scalable geospatial framework using nighttime light observations to identify over 15,000 intra-urban centers worldwide. We uncover a robust regularity: despite differences in city size, geography, and development context, total urban area scales linearly with the number of centers, implying a roughly constant spatial footprint per center. This macroscopic regularity is underpinned by two independent sublinear scaling laws -- center number and urban area both scale with population at closely matched rates -- whose ratio cancels to produce the observed linear relationship. At the within-city level, this constancy manifests as a characteristic Voronoi coverage area per center that is consistent across regions, and centers are more regularly spaced than spatial null models predict. As a consequence, polycentric cities maintain stable accessibility as they expand. These findings provide a new empirical foundation for understanding the spatial organization of urban growth.
CVMar 20, 2023Code
GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D UnderstandingJihao Liu, Tai Wang, Boxiao Liu et al.
Multi-view camera-based 3D detection is a challenging problem in computer vision. Recent works leverage a pretrained LiDAR detection model to transfer knowledge to a camera-based student network. However, we argue that there is a major domain gap between the LiDAR BEV features and the camera-based BEV features, as they have different characteristics and are derived from different sources. In this paper, we propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune paradigm for improving the multi-view camera-based 3D detection. GeoMIM is a multi-camera vision transformer with Cross-View Attention (CVA) blocks that uses LiDAR BEV features encoded by the pretrained BEV model as learning targets. During pretraining, GeoMIM's decoder has a semantic branch completing dense perspective-view features and the other geometry branch reconstructing dense perspective-view depth maps. The depth branch is designed to be camera-aware by inputting the camera's parameters for better transfer capability. Extensive results demonstrate that GeoMIM outperforms existing methods on nuScenes benchmark, achieving state-of-the-art performance for camera-based 3D object detection and 3D segmentation. Code and pretrained models are available at https://github.com/Sense-X/GeoMIM.
CLMay 31Code
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and DialectsSicheng Yang, Shulan Ruan, Shiwei Wu et al.
While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess `native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.
CVNov 22, 2022Code
Teach-DETR: Better Training DETR with TeachersLinjiang Huang, Kaixin Lu, Guanglu Song et al.
In this paper, we present a novel training scheme, namely Teach-DETR, to learn better DETR-based detectors from versatile teacher detectors. We show that the predicted boxes from teacher detectors are effective medium to transfer knowledge of teacher detectors, which could be either RCNN-based or DETR-based detectors, to train a more accurate and robust DETR model. This new training scheme can easily incorporate the predicted boxes from multiple teacher detectors, each of which provides parallel supervisions to the student DETR. Our strategy introduces no additional parameters and adds negligible computational cost to the original detector during training. During inference, Teach-DETR brings zero additional overhead and maintains the merit of requiring no non-maximum suppression. Extensive experiments show that our method leads to consistent improvement for various DETR-based detectors. Specifically, we improve the state-of-the-art detector DINO with Swin-Large backbone, 4 scales of feature maps and 36-epoch training schedule, from 57.8% to 58.9% in terms of mean average precision on MSCOCO 2017 validation set. Code will be available at https://github.com/LeonHLJ/Teach-DETR.
IVMay 17, 2022
HoVer-Trans: Anatomy-aware HoVer-Transformer for ROI-free Breast Cancer Diagnosis in Ultrasound ImagesYuhao Mo, Chu Han, Yu Liu et al. · pku
Ultrasonography is an important routine examination for breast cancer diagnosis, due to its non-invasive, radiation-free and low-cost properties. However, the diagnostic accuracy of breast cancer is still limited due to its inherent limitations. It would be a tremendous success if we can precisely diagnose breast cancer by breast ultrasound images (BUS). Many learning-based computer-aided diagnostic methods have been proposed to achieve breast cancer diagnosis/lesion classification. However, most of them require a pre-define ROI and then classify the lesion inside the ROI. Conventional classification backbones, such as VGG16 and ResNet50, can achieve promising classification results with no ROI requirement. But these models lack interpretability, thus restricting their use in clinical practice. In this study, we propose a novel ROI-free model for breast cancer diagnosis in ultrasound images with interpretable feature representations. We leverage the anatomical prior knowledge that malignant and benign tumors have different spatial relationships between different tissue layers, and propose a HoVer-Transformer to formulate this prior knowledge. The proposed HoVer-Trans block extracts the inter- and intra-layer spatial information horizontally and vertically. We conduct and release an open dataset GDPH&SYSUCC for breast cancer diagnosis in BUS. The proposed model is evaluated in three datasets by comparing with four CNN-based models and two vision transformer models via five-fold cross validation. It achieves state-of-the-art classification performance with the best model interpretability. In the meanwhile, our proposed model outperforms two senior sonographers on the breast cancer diagnosis when only one BUS image is given.
AIJun 4
Towards World Models in Biomedical ResearchGuangyu Wang, Jingkun Yue, Siqi Zhang et al.
A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention-conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. Biomedical world models may provide a foundation for simulation-guided, closed-loop and experimentally actionable biomedical discovery.
CVJun 20, 2023
Lipschitz Singularities in Diffusion ModelsZhantao Yang, Ruili Feng, Han Zhang et al. · tsinghua
Diffusion models, which employ stochastic differential equations to sample images through integrals, have emerged as a dominant class of generative models. However, the rationality of the diffusion process itself receives limited attention, leaving the question of whether the problem is well-posed and well-conditioned. In this paper, we explore a perplexing tendency of diffusion models: they often display the infinite Lipschitz property of the network with respect to time variable near the zero point. We provide theoretical proofs to illustrate the presence of infinite Lipschitz constants and empirical results to confirm it. The Lipschitz singularities pose a threat to the stability and accuracy during both the training and inference processes of diffusion models. Therefore, the mitigation of Lipschitz singularities holds great potential for enhancing the performance of diffusion models. To address this challenge, we propose a novel approach, dubbed E-TSDM, which alleviates the Lipschitz singularities of the diffusion model near the zero point of timesteps. Remarkably, our technique yields a substantial improvement in performance. Moreover, as a byproduct of our method, we achieve a dramatic reduction in the Fréchet Inception Distance of acceleration methods relying on network Lipschitz, including DDIM and DPM-Solver, by over 33%. Extensive experiments on diverse datasets validate our theory and method. Our work may advance the understanding of the general diffusion process, and also provide insights for the design of diffusion models.
CVFeb 20, 2023
Composer: Creative and Controllable Image Synthesis with Composable ConditionsLianghua Huang, Di Chen, Yu Liu et al.
Recent large-scale generative models learned on big data are capable of synthesizing incredible images yet suffer from limited controllability. This work offers a new generation paradigm that allows flexible control of the output image, such as spatial layout and palette, while maintaining the synthesis quality and model creativity. With compositionality as the core idea, we first decompose an image into representative factors, and then train a diffusion model with all these factors as the conditions to recompose the input. At the inference stage, the rich intermediate representations work as composable elements, leading to a huge design space (i.e., exponentially proportional to the number of decomposed factors) for customizable content creation. It is noteworthy that our approach, which we call Composer, supports various levels of conditions, such as text description as the global information, depth map and sketch as the local guidance, color histogram for low-level details, etc. Besides improving controllability, we confirm that Composer serves as a general framework and facilitates a wide range of classical generative tasks without retraining. Code and models will be made available.
CVOct 20, 2022Code
Large-batch Optimization for Dense Visual PredictionsZeyue Xue, Jianming Liang, Guanglu Song et al.
Training a large-scale deep neural network in a large-scale dataset is challenging and time-consuming. The recent breakthrough of large-batch optimization is a promising way to tackle this challenge. However, although the current advanced algorithms such as LARS and LAMB succeed in classification models, the complicated pipelines of dense visual predictions such as object detection and segmentation still suffer from the heavy performance drop in the large-batch training regime. To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch size, enabling several benefits more appealing than prior arts. Firstly, AGVM can align the gradient variances between different modules in the dense visual predictors, such as backbone, feature pyramid network (FPN), detection, and segmentation heads. We show that training with a large batch size can fail with the gradient variances misaligned among them, which is a phenomenon primarily overlooked in previous work. Secondly, AGVM is a plug-and-play module that generalizes well to many different architectures (e.g., CNNs and Transformers) and different tasks (e.g., object detection, instance segmentation, semantic segmentation, and panoptic segmentation). It is also compatible with different optimizers (e.g., SGD and AdamW). Thirdly, a theoretical analysis of AGVM is provided. Extensive experiments on the COCO and ADE20K datasets demonstrate the superiority of AGVM. For example, it can train Faster R-CNN+ResNet50 in 4 minutes without losing performance. AGVM enables training an object detector with one billion parameters in just 3.5 hours, reducing the training time by 20.9x, whilst achieving 62.2 mAP on COCO. The deliverables are released at https://github.com/Sense-X/AGVM.
CVMay 30
MUSCLE-NET: Predicted-Multiscale-Aware Network for Pedestrian Trajectory ForecastingYu Liu, Ming Huang, Xiao Ren et al.
Accurate pedestrian trajectory prediction is essential for safe navigation in autonomous driving and intelligent transportation systems. Despite substantial progress made by recent methods, most existing approaches are limited in fully exploiting diverse observations and often overlook the scale dependency of future motion, treating multiscale features uniformly regardless of underlying motion dynamics. This limits their robustness across diverse pedestrian behaviors. To address these challenges, we propose a Predicted-MUltiSCale-Aware Network (MUSCLE-NET) for Pedestrian Trajectory Forecasting that integrates complementary multimodal cues with scale-adaptive prediction mechanisms. The proposed framework is built upon a Multiscale Multimodal Feature Extraction (MMFE) module, which combines multiscale representation, modality-aware recalibration, and directional cross-modal fusion to construct semantically aligned representations from bounding boxes, velocities, and pose information. Building on these features, a Multiscale Enhanced Hierarchical Prediction (MEHP) module performs prediction-aware future-motion refinement via a probabilistic coarse predictor, scale-aligned fusion, and progressive refinement, adaptively selecting scale-relevant cues to mitigate spatial drift. Extensive experiments on the JAAD and PIE benchmarks demonstrate that the proposed MUSCLE-Net achieves competitive performance and consistent gains compared with state-of-the-art trajectory prediction methods.
CVMar 9, 2023
Cones: Concept Neurons in Diffusion Models for Customized GenerationZhiheng Liu, Ruili Feng, Kai Zhu et al.
Human brains respond to semantic features of presented stimuli with different neurons. It is then curious whether modern deep neural networks admit a similar behavior pattern. Specifically, this paper finds a small cluster of neurons in a diffusion model corresponding to a particular subject. We call those neurons the concept neurons. They can be identified by statistics of network gradients to a stimulation connected with the given subject. The concept neurons demonstrate magnetic properties in interpreting and manipulating generation results. Shutting them can directly yield the related subject contextualized in different scenes. Concatenating multiple clusters of concept neurons can vividly generate all related concepts in a single image. A few steps of further fine-tuning can enhance the multi-concept capability, which may be the first to manage to generate up to four different subjects in a single image. For large-scale applications, the concept neurons are environmentally friendly as we only need to store a sparse cluster of int index instead of dense float32 values of the parameters, which reduces storage consumption by 90\% compared with previous subject-driven generation methods. Extensive qualitative and quantitative studies on diverse scenarios show the superiority of our method in interpreting and manipulating diffusion models.
CVJul 18, 2023
AnyDoor: Zero-shot Object-level Image CustomizationXi Chen, Lianghua Huang, Yu Liu et al.
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving. Project page is https://damo-vilab.github.io/AnyDoor-Page/.
LGNov 29, 2022
ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-DependencyChuming Li, Jie Liu, Yinmin Zhang et al.
Multi-agent reinforcement learning (MARL) suffers from the non-stationarity problem, which is the ever-changing targets at every iteration when multiple agents update their policies at the same time. Starting from first principle, in this paper, we manage to solve the non-stationarity problem by proposing bidirectional action-dependent Q-learning (ACE). Central to the development of ACE is the sequential decision-making process wherein only one agent is allowed to take action at one time. Within this process, each agent maximizes its value function given the actions taken by the preceding agents at the inference stage. In the learning phase, each agent minimizes the TD error that is dependent on how the subsequent agents have reacted to their chosen action. Given the design of bidirectional dependency, ACE effectively turns a multiagent MDP into a single-agent MDP. We implement the ACE framework by identifying the proper network representation to formulate the action dependency, so that the sequential decision process is computed implicitly in one forward pass. To validate ACE, we compare it with strong baselines on two MARL benchmarks. Empirical experiments demonstrate that ACE outperforms the state-of-the-art algorithms on Google Research Football and StarCraft Multi-Agent Challenge by a large margin. In particular, on SMAC tasks, ACE achieves 100% success rate on almost all the hard and super-hard maps. We further study extensive research problems regarding ACE, including extension, generalization, and practicability. Code is made available to facilitate further research.
CLJun 2
G^2C-MT: Graph-Guided Context Selection for Document-Level Machine TranslationBaijun Ji, Zixuan Zhou, Xiangyu Duan et al.
Effective document-level machine translation (DocMT) requires capturing long-range discourse dependencies. Recent work has explored retrieval-based and discourse-aware context selection. However, these approaches often lack an explicit mechanism for modeling structured discourse dependencies between distant paragraphs in a document. In this paper, we propose G^2C-MT (Graph-Guided Context for Machine Translation), which views DocMT context selection as a structured path discovery problem on a lightweight discourse graph, rather than retrieving unstructured context sets or relying on expensive LLM-based discourse modeling. In detail, we represent each paragraph as a node and model the relationship between each pair of nodes, considering their semantic similarity, adjacency, and keyword overlap. Furthermore, we propose a depth-biased random walk over the graph to sample a backward context path for each target paragraph. The context path will be used to prompt a large language model (LLM) for translation. This framework naturally supports multi-path context sampling, which can improve robustness by aggregating diverse translation candidates for discourse-ambiguous inputs. Experiments conducted across various domains show that G^2C-MT outperforms strong baselines on multiple LLMs, including DeepSeek-V3, Gemini-2.5-Flash-lite, and the Qwen-2.5/3 series.
CVAug 18, 2022Code
Unifying Visual Perception by Dispersible Points LearningJianming Liang, Guanglu Song, Biao Leng et al.
We present a conceptually simple, flexible, and universal visual perception head for variant visual tasks, e.g., classification, object detection, instance segmentation and pose estimation, and different frameworks, such as one-stage or two-stage pipelines. Our approach effectively identifies an object in an image while simultaneously generating a high-quality bounding box or contour-based segmentation mask or set of keypoints. The method, called UniHead, views different visual perception tasks as the dispersible points learning via the transformer encoder architecture. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations by transformer encoder. It directly outputs the final set of predictions in the form of multiple points, allowing us to perform different visual tasks in different frameworks with the same head design. We show extensive evaluations on ImageNet classification and all three tracks of the COCO suite of challenges, including object detection, instance segmentation and pose estimation. Without bells and whistles, UniHead can unify these visual tasks via a single visual head design and achieve comparable performance compared to expert models developed for each task.We hope our simple and universal UniHead will serve as a solid baseline and help promote universal visual perception research. Code and models are available at https://github.com/Sense-X/UniHead.
LGOct 12, 2023Code
LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision ScenariosYazhe Niu, Yuan Pu, Zhenjie Yang et al.
Building agents based on tree-search planning capabilities with learned models has achieved remarkable success in classic decision-making problems, such as Go and Atari. However, it has been deemed challenging or even infeasible to extend Monte Carlo Tree Search (MCTS) based algorithms to diverse real-world applications, especially when these environments involve complex action spaces and significant simulation costs, or inherent stochasticity. In this work, we introduce LightZero, the first unified benchmark for deploying MCTS/MuZero in general sequential decision scenarios. Specificially, we summarize the most critical challenges in designing a general MCTS-style decision-making solver, then decompose the tightly-coupled algorithm and system design of tree-search RL methods into distinct sub-modules. By incorporating more appropriate exploration and optimization strategies, we can significantly enhance these sub-modules and construct powerful LightZero agents to tackle tasks across a wide range of domains, such as board games, Atari, MuJoCo, MiniGrid and GoBigger. Detailed benchmark results reveal the significant potential of such methods in building scalable and efficient decision intelligence. The code is available as part of OpenDILab at https://github.com/opendilab/LightZero.
CVJul 26, 2023Code
3D Semantic Subspace Traverser: Empowering 3D Generative Model with Shape Editing CapabilityRuowei Wang, Yu Liu, Pei Su et al.
Shape generation is the practice of producing 3D shapes as various representations for 3D content creation. Previous studies on 3D shape generation have focused on shape quality and structure, without or less considering the importance of semantic information. Consequently, such generative models often fail to preserve the semantic consistency of shape structure or enable manipulation of the semantic attributes of shapes during generation. In this paper, we proposed a novel semantic generative model named 3D Semantic Subspace Traverser that utilizes semantic attributes for category-specific 3D shape generation and editing. Our method utilizes implicit functions as the 3D shape representation and combines a novel latent-space GAN with a linear subspace model to discover semantic dimensions in the local latent space of 3D shapes. Each dimension of the subspace corresponds to a particular semantic attribute, and we can edit the attributes of generated shapes by traversing the coefficients of those dimensions. Experimental results demonstrate that our method can produce plausible shapes with complex structures and enable the editing of semantic attributes. The code and trained models are available at https://github.com/TrepangCat/3D_Semantic_Subspace_Traverser
CVJun 5, 2023Code
Video Diffusion Models with Local-Global Context GuidanceSiyuan Yang, Lu Zhang, Yu Liu et al.
Diffusion models have emerged as a powerful paradigm in video synthesis tasks including prediction, generation, and interpolation. Due to the limitation of the computational budget, existing methods usually implement conditional diffusion models with an autoregressive inference pipeline, in which the future fragment is predicted based on the distribution of adjacent past frames. However, only the conditions from a few previous frames can't capture the global temporal coherence, leading to inconsistent or even outrageous results in long-term video prediction. In this paper, we propose a Local-Global Context guided Video Diffusion model (LGC-VD) to capture multi-perception conditions for producing high-quality videos in both conditional/unconditional settings. In LGC-VD, the UNet is implemented with stacked residual blocks with self-attention units, avoiding the undesirable computational cost in 3D Conv. We construct a local-global context guidance strategy to capture the multi-perceptual embedding of the past fragment to boost the consistency of future prediction. Furthermore, we propose a two-stage training strategy to alleviate the effect of noisy frames for more stable predictions. Our experiments demonstrate that the proposed method achieves favorable performance on video prediction, interpolation, and unconditional video generation. We release code at https://github.com/exisas/LGC-VD.
LGMay 24Code
Mitigating Gradient Pathology in PINNs through Aligned ConstraintYichen Luo, Peiyu Zhu, Dongxiao Hu et al.
While Physics-Informed Neural Networks (PINNs) are powerful for solving Partial Differential Equations (PDEs), their training is often paralyzed by gradient pathology. The gradients from the PDE residuals and boundary constraints oppose each other, trapping the model in local minima. Current solutions, such as adaptive weighting or hard constraints, either fail to fundamentally resolve this ill-conditioning or are limited to simple geometries. In this study, we systematically analyze the possible causes of this gradient pathology from the perspectives of loss landscapes and optimization dynamics. Based on the obtained conclusion, we propose Constraint-Aligned loss with Manifold Lifting (CAML). By reformulating all zeroth-order terms into aligned constraints, our method effectively mitigates gradient conflicts. In addition, we introduce a delay factor to help the optimizer skip the high-curvature area. Experiments demonstrate that our CAML significantly enhances numerical stability and efficiency in highly complex PINN problems. Our code is open-sourced on https://github.com/YichenLuo-0/CAML.
IVMar 11, 2022
Dual-Domain Reconstruction Networks with V-Net and K-Net for fast MRIXiaohan Liu, Yanwei Pang, Ruiqi Jin et al.
Purpose: To introduce a dual-domain reconstruction network with V-Net and K-Net for accurate MR image reconstruction from undersampled k-space data. Methods: Most state-of-the-art reconstruction methods apply U-Net or cascaded U-Nets in image domain and/or k-space domain. Nevertheless, these methods have following problems: (1) Directly applying U-Net in k-space domain is not optimal for extracting features in k-space domain; (2) Classical image-domain oriented U-Net is heavy-weight and hence is inefficient to be cascaded many times for yielding good reconstruction accuracy; (3) Classical image-domain oriented U-Net does not fully make use information of encoder network for extracting features in decoder network; and (4) Existing methods are ineffective in simultaneously extracting and fusing features in image domain and its dual k-space domain. To tackle these problems, we propose in this paper (1) an image-domain encoder-decoder sub-network called V-Net which is more light-weight for cascading and effective in fully utilizing features in the encoder for decoding, (2) a k-space domain sub-network called K-Net which is more suitable for extracting hierarchical features in k-space domain, and (3) a dual-domain reconstruction network where V-Nets and K-Nets are parallelly and effectively combined and cascaded. Results: Extensive experimental results on the challenging fastMRI dataset demonstrate that the proposed KV-Net can reconstruct high-quality images and outperform current state-of-the-art approaches with fewer parameters. Conclusions: To reconstruct images effectively and efficiently from incomplete k-space data, we have presented a parallel dual-domain KV-Net to combine K-Nets and V-Nets. The KV-Net is more lightweight than state-of-the-art methods but achieves better reconstruction performance.
CVDec 17, 2022
Hyperbolic Hierarchical Contrastive HashingRukai Wei, Yu Liu, Jingkuan Song et al.
Hierarchical semantic structures, naturally existing in real-world datasets, can assist in capturing the latent distribution of data to learn robust hash codes for retrieval systems. Although hierarchical semantic structures can be simply expressed by integrating semantically relevant data into a high-level taxon with coarser-grained semantics, the construction, embedding, and exploitation of the structures remain tricky for unsupervised hash learning. To tackle these problems, we propose a novel unsupervised hashing method named Hyperbolic Hierarchical Contrastive Hashing (HHCH). We propose to embed continuous hash codes into hyperbolic space for accurate semantic expression since embedding hierarchies in hyperbolic space generates less distortion than in hyper-sphere space and Euclidean space. In addition, we extend the K-Means algorithm to hyperbolic space and perform the proposed hierarchical hyperbolic K-Means algorithm to construct hierarchical semantic structures adaptively. To exploit the hierarchical semantic structures in hyperbolic space, we designed the hierarchical contrastive learning algorithm, including hierarchical instance-wise and hierarchical prototype-wise contrastive learning. Extensive experiments on four benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art unsupervised hashing methods. Codes will be released.
AIMar 29, 2023
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIsYaobo Liang, Chenfei Wu, Ting Song et al.
Artificial Intelligence (AI) has made incredible progress recently. On the one hand, advanced foundation models like ChatGPT can offer powerful conversation, in-context learning and code generation abilities on a broad range of open-domain tasks. They can also generate high-level solution outlines for domain-specific tasks based on the common sense knowledge they have acquired. However, they still face difficulties with some specialized tasks because they lack enough domain-specific data during pre-training or they often have errors in their neural network computations on those tasks that need accurate executions. On the other hand, there are also many existing models and systems (symbolic-based or neural-based) that can do some domain-specific tasks very well. However, due to the different implementation or working mechanisms, they are not easily accessible or compatible with foundation models. Therefore, there is a clear and pressing need for a mechanism that can leverage foundation models to propose task solution outlines and then automatically match some of the sub-tasks in the outlines to the off-the-shelf models and systems with special functionalities to complete them. Inspired by this, we introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion. Unlike most previous work that aimed to improve a single AI model, TaskMatrix.AI focuses more on using existing foundation models (as a brain-like central system) and APIs of other AI models and systems (as sub-task solvers) to achieve diversified tasks in both digital and physical domains. As a position paper, we will present our vision of how to build such an ecosystem, explain each key component, and use study cases to illustrate both the feasibility of this vision and the main challenges we need to address next.
CVJan 2, 2023
Urban Visual Intelligence: Studying Cities with AI and Street-level ImageryFan Zhang, Arianna Salazar Miranda, Fábio Duarte et al.
The visual dimension of cities has been a fundamental subject in urban studies, since the pioneering work of scholars such as Sitte, Lynch, Arnheim, and Jacobs. Several decades later, big data and artificial intelligence (AI) are revolutionizing how people move, sense, and interact with cities. This paper reviews the literature on the appearance and function of cities to illustrate how visual information has been used to understand them. A conceptual framework, Urban Visual Intelligence, is introduced to systematically elaborate on how new image data sources and AI techniques are reshaping the way researchers perceive and measure cities, enabling the study of the physical environment and its interactions with socioeconomic environments at various scales. The paper argues that these new approaches enable researchers to revisit the classic urban theories and themes, and potentially help cities create environments that are more in line with human behaviors and aspirations in the digital age.
CVOct 29, 2023
CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video HashingRukai Wei, Yu Liu, Jingkuan Song et al.
Compressing videos into binary codes can improve retrieval speed and reduce storage overhead. However, learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between video frames, especially in the absence of labels. Existing self-supervised video hashing methods have been effective in designing expressive temporal encoders, but have not fully utilized the temporal dynamics and spatial appearance of videos due to less challenging and unreliable learning tasks. To address these challenges, we begin by utilizing the contrastive learning task to capture global spatio-temporal information of videos for hashing. With the aid of our designed augmentation strategies, which focus on spatial and temporal variations to create positive pairs, the learning framework can generate hash codes that are invariant to motion, scale, and viewpoint. Furthermore, we incorporate two collaborative learning tasks, i.e., frame order verification and scene change regularization, to capture local spatio-temporal details within video frames, thereby enhancing the perception of temporal structure and the modeling of spatio-temporal relationships. Our proposed Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets. Our codes will be released.
CVApr 16, 2023
Long-term Visual Localization with Mobile SensorsShen Yan, Yu Liu, Long Wang et al.
Despite the remarkable advances in image matching and pose estimation, image-based localization of a camera in a temporally-varying outdoor environment is still a challenging problem due to huge appearance disparity between query and reference images caused by illumination, seasonal and structural changes. In this work, we propose to leverage additional sensors on a mobile phone, mainly GPS, compass, and gravity sensor, to solve this challenging problem. We show that these mobile sensors provide decent initial poses and effective constraints to reduce the searching space in image matching and final pose estimation. With the initial pose, we are also able to devise a direct 2D-3D matching network to efficiently establish 2D-3D correspondences instead of tedious 2D-2D matching in existing systems. As no public dataset exists for the studied problem, we collect a new dataset that provides a variety of mobile sensor data and significant scene appearance variations, and develop a system to acquire ground-truth poses for query images. We benchmark our method as well as several state-of-the-art baselines and demonstrate the effectiveness of the proposed approach. The code and dataset will be released publicly.
LGNov 29, 2022
Dimensionality-Varying Diffusion ProcessHan Zhang, Ruili Feng, Zhantao Yang et al.
Diffusion models, which learn to reverse a signal destruction process to generate new data, typically require the signal at each step to have the same dimension. We argue that, considering the spatial redundancy in image signals, there is no need to maintain a high dimensionality in the evolution process, especially in the early generation phase. To this end, we make a theoretical generalization of the forward diffusion process via signal decomposition. Concretely, we manage to decompose an image into multiple orthogonal components and control the attenuation of each component when perturbing the image. That way, along with the noise strength increasing, we are able to diminish those inconsequential components and thus use a lower-dimensional signal to represent the source, barely losing information. Such a reformulation allows to vary dimensions in both training and inference of diffusion models. Extensive experiments on a range of datasets suggest that our approach substantially reduces the computational cost and achieves on-par or even better synthesis performance compared to baseline methods. We also show that our strategy facilitates high-resolution image synthesis and improves FID of diffusion model trained on FFHQ at $1024\times1024$ resolution from 52.40 to 10.46. Code and models will be made publicly available.
CVOct 17, 2022
Improving Object-centric Learning with Query OptimizationBaoxiong Jia, Yu Liu, Siyuan Huang
The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In the recent culmination of unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design and fostered many powerful variants. These methods, however, have been exceedingly difficult to train without supervision and are ambiguous in the notion of object, especially for complex natural scenes. In this paper, we propose to address these issues by investigating the potential of learnable queries as initializations for Slot-Attention learning, uniting it with efforts from existing attempts on improving Slot-Attention learning with bi-level optimization. With simple code adjustments on Slot-Attention, our model, Bi-level Optimized Query Slot Attention, achieves state-of-the-art results on 3 challenging synthetic and 7 complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin. We provide thorough ablative studies to validate the necessity and effectiveness of our design. Additionally, our model exhibits great potential for concept binding and zero-shot learning. Our work is made publicly available at https://bo-qsa.github.io
CVAug 29, 2022
Towards Robust Face Recognition with Comprehensive SearchManyuan Zhang, Guanglu Song, Yu Liu et al.
Data cleaning, architecture, and loss function design are important factors contributing to high-performance face recognition. Previously, the research community tries to improve the performance of each single aspect but failed to present a unified solution on the joint search of the optimal designs for all three aspects. In this paper, we for the first time identify that these aspects are tightly coupled to each other. Optimizing the design of each aspect actually greatly limits the performance and biases the algorithmic design. Specifically, we find that the optimal model architecture or loss function is closely coupled with the data cleaning. To eliminate the bias of single-aspect research and provide an overall understanding of the face recognition model design, we first carefully design the search space for each aspect, then a comprehensive search method is introduced to jointly search optimal data cleaning, architecture, and loss function design. In our framework, we make the proposed comprehensive search as flexible as possible, by using an innovative reinforcement learning based approach. Extensive experiments on million-level face recognition benchmarks demonstrate the effectiveness of our newly-designed search space for each aspect and the comprehensive search. We outperform expert algorithms developed for each single research track by large margins. More importantly, we analyze the difference between our searched optimal design and the independent design of the single factors. We point out that strong models tend to optimize with more difficult training datasets and loss functions. Our empirical study can provide guidance in future research towards more robust face recognition systems.
LGJan 6, 2023
Centralized Cooperative Exploration Policy for Continuous Control TasksChao Li, Chen Gong, Qiang He et al.
The deep reinforcement learning (DRL) algorithm works brilliantly on solving various complex control tasks. This phenomenal success can be partly attributed to DRL encouraging intelligent agents to sufficiently explore the environment and collect diverse experiences during the agent training process. Therefore, exploration plays a significant role in accessing an optimal policy for DRL. Despite recent works making great progress in continuous control tasks, exploration in these tasks has remained insufficiently investigated. To explicitly encourage exploration in continuous control tasks, we propose CCEP (Centralized Cooperative Exploration Policy), which utilizes underestimation and overestimation of value functions to maintain the capacity of exploration. CCEP first keeps two value functions initialized with different parameters, and generates diverse policies with multiple exploration styles from a pair of value functions. In addition, a centralized policy framework ensures that CCEP achieves message delivery between multiple policies, furthermore contributing to exploring the environment cooperatively. Extensive experimental results demonstrate that CCEP achieves higher exploration capacity. Empirical analysis shows diverse exploration styles in the learned policies by CCEP, reaping benefits in more exploration regions. And this exploration capacity of CCEP ensures it outperforms the current state-of-the-art methods across multiple continuous control tasks shown in experiments.
CVJul 17, 2024Code
Fast Context-Based Low-Light Image Enhancement via Neural Implicit RepresentationsTomáš Chobola, Yu Liu, Hanyi Zhang et al.
Current deep learning-based low-light image enhancement methods often struggle with high-resolution images, and fail to meet the practical demands of visual perception across diverse and unseen scenarios. In this paper, we introduce a novel approach termed CoLIE, which redefines the enhancement process through mapping the 2D coordinates of an underexposed image to its illumination component, conditioned on local context. We propose a reconstruction of enhanced-light images within the HSV space utilizing an implicit neural function combined with an embedded guided filter, thereby significantly reducing computational overhead. Moreover, we introduce a single image-based training loss function to enhance the model's adaptability to various scenes, further enhancing its practical applicability. Through rigorous evaluations, we analyze the properties of our proposed framework, demonstrating its superiority in both image quality and scene adaptability. Furthermore, our evaluation extends to applications in downstream tasks within low-light scenarios, underscoring the practical utility of CoLIE. The source code is available at https://github.com/ctom2/colie.
CVFeb 13, 2023
Render-and-Compare: Cross-View 6 DoF Localization from Noisy PriorShen Yan, Xiaoya Cheng, Yuxiang Liu et al.
Despite the significant progress in 6-DoF visual localization, researchers are mostly driven by ground-level benchmarks. Compared with aerial oblique photography, ground-level map collection lacks scalability and complete coverage. In this work, we propose to go beyond the traditional ground-level setting and exploit the cross-view localization from aerial to ground. We solve this problem by formulating camera pose estimation as an iterative render-and-compare pipeline and enhancing the robustness through augmenting seeds from noisy initial priors. As no public dataset exists for the studied problem, we collect a new dataset that provides a variety of cross-view images from smartphones and drones and develop a semi-automatic system to acquire ground-truth poses for query images. We benchmark our method as well as several state-of-the-art baselines and demonstrate that our method outperforms other approaches by a large margin.
IRJul 3, 2023
OpenSiteRec: An Open Dataset for Site RecommendationXinhang Li, Xiangyu Zhao, Yejing Wang et al. · tsinghua
As a representative information retrieval task, site recommendation, which aims at predicting the optimal sites for a brand or an institution to open new branches in an automatic data-driven way, is beneficial and crucial for brand development in modern business. However, there is no publicly available dataset so far and most existing approaches are limited to an extremely small scope of brands, which seriously hinders the research on site recommendation. Therefore, we collect, construct and release an open comprehensive dataset, namely OpenSiteRec, to facilitate and promote the research on site recommendation. Specifically, OpenSiteRec leverages a heterogeneous graph schema to represent various types of real-world entities and relations in four international metropolises. To evaluate the performance of the existing general methods on the site recommendation task, we conduct benchmarking experiments of several representative recommendation models on OpenSiteRec. Furthermore, we also highlight the potential application directions to demonstrate the wide applicability of OpenSiteRec. We believe that our OpenSiteRec dataset is significant and anticipated to encourage the development of advanced methods for site recommendation. OpenSiteRec is available online at https://OpenSiteRec.github.io/.
CVOct 4, 2023Code
Magicremover: Tuning-free Text-guided Image inpainting with Diffusion ModelsSiyuan Yang, Lu Zhang, Liqian Ma et al.
Image inpainting aims to fill in the missing pixels with visually coherent and semantically plausible content. Despite the great progress brought from deep generative models, this task still suffers from i. the difficulties in large-scale realistic data collection and costly model training; and ii. the intrinsic limitations in the traditionally user-defined binary masks on objects with unclear boundaries or transparent texture. In this paper, we propose MagicRemover, a tuning-free method that leverages the powerful diffusion models for text-guided image inpainting. We introduce an attention guidance strategy to constrain the sampling process of diffusion models, enabling the erasing of instructed areas and the restoration of occluded content. We further propose a classifier optimization algorithm to facilitate the denoising stability within less sampling steps. Extensive comparisons are conducted among our MagicRemover and state-of-the-art methods including quantitative evaluation and user study, demonstrating the significant improvement of MagicRemover on high-quality image inpainting. We will release our code at https://github.com/exisas/Magicremover.
CVMar 9, 2023
Presentation Attack Detection with Advanced CNN Models for Noncontact-based Fingerprint SystemsSandip Purnapatra, Conor Miller-Lynch, Stephen Miner et al.
Touch-based fingerprint biometrics is one of the most popular biometric modalities with applications in several fields. Problems associated with touch-based techniques such as the presence of latent fingerprints and hygiene issues due to many people touching the same surface motivated the community to look for non-contact-based solutions. For the last few years, contactless fingerprint systems are on the rise and in demand because of the ability to turn any device with a camera into a fingerprint reader. Yet, before we can fully utilize the benefit of noncontact-based methods, the biometric community needs to resolve a few concerns such as the resiliency of the system against presentation attacks. One of the major obstacles is the limited publicly available data sets with inadequate spoof and live data. In this publication, we have developed a Presentation attack detection (PAD) dataset of more than 7500 four-finger images and more than 14,000 manually segmented single-fingertip images, and 10,000 synthetic fingertips (deepfakes). The PAD dataset was collected from six different Presentation Attack Instruments (PAI) of three different difficulty levels according to FIDO protocols, with five different types of PAI materials, and different smartphone cameras with manual focusing. We have utilized DenseNet-121 and NasNetMobile models and our proposed dataset to develop PAD algorithms and achieved PAD accuracy of Attack presentation classification error rate (APCER) 0.14\% and Bonafide presentation classification error rate (BPCER) 0.18\%. We have also reported the test results of the models against unseen spoof types to replicate uncertain real-world testing scenarios.
CVMay 19Code
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video GenerationYujie Wei, Yujin Han, Zhekai Chen et al.
Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
CVFeb 25, 2023
Knowledge-infused Contrastive Learning for Urban Imagery-based Socioeconomic PredictionYu Liu, Xin Zhang, Jingtao Ding et al.
Monitoring sustainable development goals requires accurate and timely socioeconomic statistics, while ubiquitous and frequently-updated urban imagery in web like satellite/street view images has emerged as an important source for socioeconomic prediction. Especially, recent studies turn to self-supervised contrastive learning with manually designed similarity metrics for urban imagery representation learning and further socioeconomic prediction, which however suffers from effectiveness and robustness issues. To address such issues, in this paper, we propose a Knowledge-infused Contrastive Learning (KnowCL) model for urban imagery-based socioeconomic prediction. Specifically, we firstly introduce knowledge graph (KG) to effectively model the urban knowledge in spatiality, mobility, etc., and then build neural network based encoders to learn representations of an urban image in associated semantic and visual spaces, respectively. Finally, we design a cross-modality based contrastive learning framework with a novel image-KG contrastive loss, which maximizes the mutual information between semantic and visual representations for knowledge infusion. Extensive experiments of applying the learnt visual representations for socioeconomic prediction on three datasets demonstrate the superior performance of KnowCL with over 30\% improvements on $R^2$ compared with baselines. Especially, our proposed KnowCL model can apply to both satellite and street imagery with both effectiveness and transferability achieved, which provides insights into urban imagery-based socioeconomic prediction.
CVAug 1, 2023
A Satellite Imagery Dataset for Long-Term Sustainable Development in United States CitiesYanxin Xi, Yu Liu, Tong Li et al.
Cities play an important role in achieving sustainable development goals (SDGs) to promote economic growth and meet social needs. Especially satellite imagery is a potential data source for studying sustainable urban development. However, a comprehensive dataset in the United States (U.S.) covering multiple cities, multiple years, multiple scales, and multiple indicators for SDG monitoring is lacking. To support the research on SDGs in U.S. cities, we develop a satellite imagery dataset using deep learning models for five SDGs containing 25 sustainable development indicators. The proposed dataset covers the 100 most populated U.S. cities and corresponding Census Block Groups from 2014 to 2023. Specifically, we collect satellite imagery and identify objects with state-of-the-art object detection and semantic segmentation models to observe cities' bird's-eye view. We further gather population, nighttime light, survey, and built environment data to depict SDGs regarding poverty, health, education, inequality, and living environment. We anticipate the dataset to help urban policymakers and researchers to advance SDGs-related studies, especially applying satellite imagery to monitor long-term and multi-scale SDGs in cities.
CVOct 1, 2023
Liveness Detection Competition -- Noncontact-based Fingerprint Algorithms and Systems (LivDet-2023 Noncontact Fingerprint)Sandip Purnapatra, Humaira Rezaie, Bhavin Jawade et al.
Liveness Detection (LivDet) is an international competition series open to academia and industry with the objec-tive to assess and report state-of-the-art in Presentation Attack Detection (PAD). LivDet-2023 Noncontact Fingerprint is the first edition of the noncontact fingerprint-based PAD competition for algorithms and systems. The competition serves as an important benchmark in noncontact-based fingerprint PAD, offering (a) independent assessment of the state-of-the-art in noncontact-based fingerprint PAD for algorithms and systems, and (b) common evaluation protocol, which includes finger photos of a variety of Presentation Attack Instruments (PAIs) and live fingers to the biometric research community (c) provides standard algorithm and system evaluation protocols, along with the comparative analysis of state-of-the-art algorithms from academia and industry with both old and new android smartphones. The winning algorithm achieved an APCER of 11.35% averaged overall PAIs and a BPCER of 0.62%. The winning system achieved an APCER of 13.0.4%, averaged over all PAIs tested over all the smartphones, and a BPCER of 1.68% over all smartphones tested. Four-finger systems that make individual finger-based PAD decisions were also tested. The dataset used for competition will be available 1 to all researchers as per data share protocol