CVMay 28
CapTalk: Text-Guided Stylization and Speech-Driven 3D Head AnimationXuangeng Chu, Yuan Gan, Ziteng Cui et al.
Audio-driven 3D facial animation aims to generate synchronized lip movements and vivid facial expressions from arbitrary audio clips. While existing methods can produce synchronized lip motions, they often rely on predefined identity or style latent features, which limits users' ability to freely control speaking styles. Moreover, applying a fixed style or identity to an entire audio segment typically results in facial animation styles that do not adapt to the emotional content of the audio. To address these challenges, we revisit the entanglement between style and emotion, construct a large-scale dataset with textual descriptions of both style and emotion, and propose a novel talking head generation framework that enables separate control over style and emotion. Our model takes as input both textual descriptions of speaking style and character emotion, as well as the driving audio stream, enabling real-time generation of highly synchronized lip movements and facial expressions that match the provided descriptions. Furthermore, our model supports dynamic emotion control during inference, allowing it to handle scenarios where the target emotion changes throughout the speech.
CVAug 10, 2022
Multi-scale Feature Aggregation for Crowd CountingXiaoheng Jiang, Xinyi Wu, Hisham Cholakkal et al. · ibm-research
Convolutional Neural Network (CNN) based crowd counting methods have achieved promising results in the past few years. However, the scale variation problem is still a huge challenge for accurate count estimation. In this paper, we propose a multi-scale feature aggregation network (MSFANet) that can alleviate this problem to some extent. Specifically, our approach consists of two feature aggregation modules: the short aggregation (ShortAgg) and the skip aggregation (SkipAgg). The ShortAgg module aggregates the features of the adjacent convolution blocks. Its purpose is to make features with different receptive fields fused gradually from the bottom to the top of the network. The SkipAgg module directly propagates features with small receptive fields to features with much larger receptive fields. Its purpose is to promote the fusion of features with small and large receptive fields. Especially, the SkipAgg module introduces the local self-attention features from the Swin Transformer blocks to incorporate rich spatial information. Furthermore, we present a local-and-global based counting loss by considering the non-uniform crowd distribution. Extensive experiments on four challenging datasets (ShanghaiTech dataset, UCF_CC_50 dataset, UCF-QNRF Dataset, WorldExpo'10 dataset) demonstrate the proposed easy-to-implement MSFANet can achieve promising results when compared with the previous state-of-the-art approaches.
CLJan 31, 2023
TopoBERT: Plug and Play Toponym Recognition Module Harnessing Fine-tuned BERTBing Zhou, Lei Zou, Yingjie Hu et al.
Extracting precise geographical information from textual contents is crucial in a plethora of applications. For example, during hazardous events, a robust and unbiased toponym extraction framework can provide an avenue to tie the location concerned to the topic discussed by news media posts and pinpoint humanitarian help requests or damage reports from social media. Early studies have leveraged rule-based, gazetteer-based, deep learning, and hybrid approaches to address this problem. However, the performance of existing tools is deficient in supporting operations like emergency rescue, which relies on fine-grained, accurate geographic information. The emerging pretrained language models can better capture the underlying characteristics of text information, including place names, offering a promising pathway to optimize toponym recognition to underpin practical applications. In this paper, TopoBERT, a toponym recognition module based on a one dimensional Convolutional Neural Network (CNN1D) and Bidirectional Encoder Representation from Transformers (BERT), is proposed and fine-tuned. Three datasets (CoNLL2003-Train, Wikipedia3000, WNUT2017) are leveraged to tune the hyperparameters, discover the best training strategy, and train the model. Another two datasets (CoNLL2003-Test and Harvey2017) are used to evaluate the performance. Three distinguished classifiers, linear, multi-layer perceptron, and CNN1D, are benchmarked to determine the optimal model architecture. TopoBERT achieves state-of-the-art performance (f1-score=0.865) compared to the other five baseline models and can be applied to diverse toponym recognition tasks without additional training.
CVDec 16, 2025Code
TalkVerse: Democratizing Minute-Long Audio-Driven Video GenerationZhenzhi Wang, Jian Wang, Ke Ma et al.
We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10$\times$ lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: https://zhenzhiwang.github.io/talkverse/
LGMay 3, 2022
Multi-Spatio-temporal Fusion Graph Recurrent Network for Traffic forecastingWei Zhao, Shiqi Zhang, Bing Zhou et al.
Traffic forecasting is essential for the traffic construction of smart cities in the new era. However, traffic data's complex spatial and temporal dependencies make traffic forecasting extremely challenging. Most existing traffic forecasting methods rely on the predefined adjacency matrix to model the Spatio-temporal dependencies. Nevertheless, the road traffic state is highly real-time, so the adjacency matrix should change dynamically with time. This article presents a new Multi-Spatio-temporal Fusion Graph Recurrent Network (MSTFGRN) to address the issues above. The network proposes a data-driven weighted adjacency matrix generation method to compensate for real-time spatial dependencies not reflected by the predefined adjacency matrix. It also efficiently learns hidden Spatio-temporal dependencies by performing a new two-way Spatio-temporal fusion operation on parallel Spatio-temporal relations at different moments. Finally, global Spatio-temporal dependencies are captured simultaneously by integrating a global attention mechanism into the Spatio-temporal fusion module. Extensive trials on four large-scale, real-world traffic datasets demonstrate that our method achieves state-of-the-art performance compared to alternative baselines.
CVSep 20, 2024
T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated DataMingdian Liu, Yilin Liu, Gurunandan Krishnan et al.
The generation of humanoid animation from text prompts can profoundly impact animation production and AR/VR experiences. However, existing methods only generate body motion data, excluding facial expressions and hand movements. This limitation, primarily due to a lack of a comprehensive whole-body motion dataset, inhibits their readiness for production use. Recent attempts to create such a dataset have resulted in either motion inconsistency among different body parts in the artificially augmented data or lower quality in the data extracted from RGB videos. In this work, we propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data. T2M-X trains three separate Vector Quantized Variational AutoEncoders (VQ-VAEs) for body, hand, and face on respective high-quality data sources to ensure high-quality motion outputs, and a Multi-indexing Generative Pretrained Transformer (GPT) model with motion consistency loss for motion generation and coordination among different body parts. Our results show significant improvements over the baselines both quantitatively and qualitatively, demonstrating its robustness against the dataset limitations.
CVNov 13, 2025
AHA! Animating Human Avatars in Diverse Scenes with Gaussian SplattingAymen Mir, Jian Wang, Riza Alp Guler et al.
We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces the use of 3DGS as the 3D representation to the problem of animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows for geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that the rendering can be decoupled from the motion synthesis and each sub-problem can be addressed independently, without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion without explicit scene geometry, using opacity-based cues and projected Gaussian structures to guide human placement and pose alignment. To ensure natural interactions, we further propose a human-scene Gaussian refinement optimization that enforces realistic contact and navigation. We evaluate our approach on scenes from Scannet++ and the SuperSplat library, and on avatars reconstructed from sparse and dense multi-view human capture. Finally, we demonstrate that our framework allows for novel applications such as geometry-consistent free-viewpoint rendering of edited monocular RGB videos with new animated humans, showcasing the unique advantage of 3DGS for monocular video-based human animation.
LGMar 21, 2022
STCGAT: A Spatio-temporal Causal Graph Attention Network for traffic flow prediction in Intelligent Transportation SystemsWei Zhao, Shiqi Zhang, Bing Zhou et al.
Air pollution and carbon emissions caused by modern transportation are closely related to global climate change. With the help of next-generation information technology such as Internet of Things (IoT) and Artificial Intelligence (AI), accurate traffic flow prediction can effectively solve problems such as traffic congestion and mitigate environmental pollution and climate change. It further promotes the development of Intelligent Transportation Systems (ITS) and smart cities. However, the strong spatial and temporal correlation of traffic data makes the task of accurate traffic forecasting a significant challenge. Existing methods are usually based on graph neural networks using predefined spatial adjacency graphs of traffic networks to model spatial dependencies, ignoring the dynamic correlation of relationships between road nodes. In addition, they usually use independent Spatio-temporal components to capture Spatio-temporal dependencies and do not effectively model global Spatio-temporal dependencies. This paper proposes a new Spatio-temporal Causal Graph Attention Network (STCGAT) for traffic prediction to address the above challenges. In STCGAT, we use a node embedding approach that can adaptively generate spatial adjacency subgraphs at each time step without a priori geographic knowledge and fine-grained modeling of the topology of dynamically generated graphs for different time steps. Meanwhile, we propose an efficient causal temporal correlation component that contains node adaptive learning, graph convolution, and local and global causal temporal convolution modules to learn local and global Spatio-temporal dependencies jointly. Extensive experiments on four real, large traffic datasets show that our model consistently outperforms all baseline models.
CVMar 26
Unleashing Guidance Without Classifiers for Human-Object Interaction AnimationZiyin Wang, Sirui Xu, Chuan Guo et al.
Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.
CVMar 16
DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View ImageryYifan Yang, Lei Zou, Wenjing Gong et al.
Analyzing street-view imagery with computer vision models for rapid, hyperlocal damage assessment is becoming popular and valuable in emergency response and recovery, but traditional models often act like black boxes, lacking interpretability and reliability. This study proposes a multimodal disagreement-driven Arbitration framework powered by Contrastive Language-Image Pre-training (CLIP) models, DamageArbiter, to improve the accuracy, interpretability, and robustness of damage estimation from street-view imagery. DamageArbiter leverages the complementary strengths of unimodal and multimodal models, employing a lightweight logistic regression meta-classifier to arbitrate cases of disagreement. Using 2,556 post-disaster street-view images, paired with both manually generated and large language model (LLM)-generated text descriptions, we systematically compared the performance of unimodal models (including image-only and text-only models), multimodal CLIP-based models, and DamageArbiter. Notably, DamageArbiter improved the accuracy from 74.33% (ViT-B/32, image-only) to 82.79%, surpassing the 80% accuracy threshold and achieving an absolute improvement of 8.46% compared to the strongest baseline model. Beyond improvements in overall accuracy, compared to visual models relying solely on images, DamageArbiter, through arbitration of discrepancies between unimodal and multimodal predictions, mitigates common overconfidence errors in visual models, especially in situations where disaster visual cues are ambiguous or subject to interference, reducing overconfidence but incorrect predictions. We further mapped and analyzed geo-referenced predictions and misclassifications to compare model performance across locations. Overall, this work advances street-view-based disaster assessment from coarse severity classification toward a more reliable and interpretable framework.
CVMar 30
HandX: Scaling Bimanual Motion and Interaction GenerationZimu Zhang, Yucheng Zhang, Xiyan Xu et al.
Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, finger articulation, contact timing, and inter-hand coordination, and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
LGJan 22
Predicting Healthcare System Visitation Flow by Integrating Hospital Attributes and Population Socioeconomics with Human Mobility DataBinbin Lin, Lei Zou, Hao Tian et al.
Healthcare visitation patterns are influenced by a complex interplay of hospital attributes, population socioeconomics, and spatial factors. However, existing research often adopts a fragmented approach, examining these determinants in isolation. This study addresses this gap by integrating hospital capacities, occupancy rates, reputation, and popularity with population SES and spatial mobility patterns to predict visitation flows and analyze influencing factors. Utilizing four years of SafeGraph mobility data and user experience data from Google Maps Reviews, five flow prediction models, Naive Regression, Gradient Boosting, Multilayer Perceptrons (MLPs), Deep Gravity, and Heterogeneous Graph Neural Networks (HGNN),were trained and applied to simulate visitation flows in Houston, Texas, U.S. The Shapley additive explanation (SHAP) analysis and the Partial Dependence Plot (PDP) method were employed to examine the combined impacts of different factors on visitation patterns. The findings reveal that Deep Gravity outperformed other models. Hospital capacities, ICU occupancy rates, ratings, and popularity significantly influence visitation patterns, with their effects varying across different travel distances. Short-distance visits are primarily driven by convenience, whereas long-distance visits are influenced by hospital ratings. White-majority areas exhibited lower sensitivity to hospital ratings for short-distance visits, while Asian populations and those with higher education levels prioritized hospital rating in their visitation decisions. SES further influence these patterns, as areas with higher proportions of Hispanic, Black, under-18, and over-65 populations tend to have more frequent hospital visits, potentially reflecting greater healthcare needs or limited access to alternative medical services.
CVMay 12
ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion GenerationInwoo Hwang, Hojun Jang, Bing Zhou et al.
We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-emporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.
LGOct 14, 2025Code
Unveiling the Vulnerability of Graph-LLMs: An Interpretable Multi-Dimensional Adversarial Attack on TAGsBowen Fan, Zhilin Guo, Xunkai Li et al.
Graph Neural Networks (GNNs) have become a pivotal framework for modeling graph-structured data, enabling a wide range of applications from social network analysis to molecular chemistry. By integrating large language models (LLMs), text-attributed graphs (TAGs) enhance node representations with rich textual semantics, significantly boosting the expressive power of graph-based learning. However, this sophisticated synergy introduces critical vulnerabilities, as Graph-LLMs are susceptible to adversarial attacks on both their structural topology and textual attributes. Although specialized attack methods have been designed for each of these aspects, no work has yet unified them into a comprehensive approach. In this work, we propose the Interpretable Multi-Dimensional Graph Attack (IMDGA), a novel human-centric adversarial attack framework designed to orchestrate multi-level perturbations across both graph structure and textual features. IMDGA utilizes three tightly integrated modules to craft attacks that balance interpretability and impact, enabling a deeper understanding of Graph-LLM vulnerabilities. Through rigorous theoretical analysis and comprehensive empirical evaluations on diverse datasets and architectures, IMDGA demonstrates superior interpretability, attack effectiveness, stealthiness, and robustness compared to existing methods. By exposing critical weaknesses in TAG representation learning, this work uncovers a previously underexplored semantic dimension of vulnerability in Graph-LLMs, offering valuable insights for improving their resilience. Our code and resources are publicly available at https://anonymous.4open.science/r/IMDGA-7289.
SPJun 12, 2024Code
A Multi-Resolution Mutual Learning Network for Multi-Label ECG ClassificationWei Huang, Ning Wang, Panpan Feng et al.
Electrocardiograms (ECG), which record the electrophysiological activity of the heart, have become a crucial tool for diagnosing these diseases. In recent years, the application of deep learning techniques has significantly improved the performance of ECG signal classification. Multi-resolution feature analysis, which captures and processes information at different time scales, can extract subtle changes and overall trends in ECG signals, showing unique advantages. However, common multi-resolution analysis methods based on simple feature addition or concatenation may lead to the neglect of low-resolution features, affecting model performance. To address this issue, this paper proposes the Multi-Resolution Mutual Learning Network (MRM-Net). MRM-Net includes a dual-resolution attention architecture and a feature complementary mechanism. The dual-resolution attention architecture processes high-resolution and low-resolution features in parallel. Through the attention mechanism, the high-resolution and low-resolution branches can focus on subtle waveform changes and overall rhythm patterns, enhancing the ability to capture critical features in ECG signals. Meanwhile, the feature complementary mechanism introduces mutual feature learning after each layer of the feature extractor. This allows features at different resolutions to reinforce each other, thereby reducing information loss and improving model performance and robustness. Experiments on the PTB-XL and CPSC2018 datasets demonstrate that MRM-Net significantly outperforms existing methods in multi-label ECG classification performance. The code for our framework will be publicly available at https://github.com/wxhdf/MRM.
AIMay 3
NORA: A Harness-Engineered Autonomous Research Agent for End-to-End Spatial Data ScienceBing Zhou, Xiao Huang, Huan Ning et al.
The automation of scientific research workflows has emerged as a transformative frontier in artificial intelligence, yet existing autonomous research agents remain largely domain-agnostic, lacking the specialized reasoning, method selection, and data acquisition capabilities required for rigorous spatial data science. This paper introduces NORA (Night Owl Research Agent), a harness-engineered, multi-agent autonomous research system purpose-built for GIScience and spatial data science. NORA orchestrates the complete research lifecycle through a skills-first architecture comprising 21 domain-specialized workflow skills, 9 specialist sub-agents, and custom Model Context Protocol (MCP) servers. Central to the system's design are two novel domain-specialized skills: a spatial analysis skill unit that encodes decision frameworks for exploratory spatial data analysis, spatial regression, and diagnostics; and a spatial data download skill that supports reproducible acquisition from authoritative geospatial data sources. We formalize the concept of harness engineering for scientific research agents, demonstrating how lifecycle hooks, safety gates, generator-evaluator separation, human-in-the-loop, and state persistence ensure reliable and reproducible autonomous research. We evaluate NORA through case studies by 6 domain specialists and 3 LLM reviewers across seven dimensions (novelty, quality, rigor, etc). Results demonstrate that domain-specialized harness engineering substantially improves the efficiency and quality of research output compared to general-purpose agent configurations.
CVMar 17, 2025
A Survey on Human Interaction Motion GenerationKewei Sui, Anindita Ghosh, Inwoo Hwang et al.
Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.
GRJun 23, 2025
DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked ModelingAnindita Ghosh, Bing Zhou, Rishabh Dabral et al.
We present DuetGen, a novel framework for generating interactive two-person dances from music. The key challenge of this task lies in the inherent complexities of two-person dance interactions, where the partners need to synchronize both with each other and with the music. Inspired by the recent advances in motion synthesis, we propose a two-stage solution: encoding two-person motions into discrete tokens and then generating these tokens from music. To effectively capture intricate interactions, we represent both dancers' motions as a unified whole to learn the necessary motion tokens, and adopt a coarse-to-fine learning strategy in both the stages. Our first stage utilizes a VQ-VAE that hierarchically separates high-level semantic features at a coarse temporal resolution from low-level details at a finer resolution, producing two discrete token sequences at different abstraction levels. Subsequently, in the second stage, two generative masked transformers learn to map music signals to these dance tokens: the first producing high-level semantic tokens, and the second, conditioned on music and these semantic tokens, producing the low-level tokens. We train both transformers to learn to predict randomly masked tokens within the sequence, enabling them to iteratively generate motion tokens by filling an empty token sequence during inference. Through the hierarchical masked modeling and dedicated interaction representation, DuetGen achieves the generation of synchronized and interactive two-person dances across various genres. Extensive experiments and user studies on a benchmark duet dance dataset demonstrate state-of-the-art performance of DuetGen in motion realism, music-dance alignment, and partner coordination.
CVMar 25, 2025
Dance Like a Chicken: Low-Rank Stylization for Human Motion DiffusionHaim Sawdayee, Chuan Guo, Guy Tevet et al.
Text-to-motion generative models span a wide range of 3D human actions but struggle with nuanced stylistic attributes such as a "Chicken" style. Due to the scarcity of style-specific data, existing approaches pull the generative prior towards a reference style, which often results in out-of-distribution low quality generations. In this work, we introduce LoRA-MDM, a lightweight framework for motion stylization that generalizes to complex actions while maintaining editability. Our key insight is that adapting the generative prior to include the style, while preserving its overall distribution, is more effective than modifying each individual motion during generation. Building on this idea, LoRA-MDM learns to adapt the prior to include the reference style using only a few samples. The style can then be used in the context of different textual prompts for generation. The low-rank adaptation shifts the motion manifold in a semantically meaningful way, enabling realistic style infusion even for actions not present in the reference samples. Moreover, preserving the distribution structure enables advanced operations such as style blending and motion editing. We compare LoRA-MDM to state-of-the-art stylized motion generation methods and demonstrate a favorable balance between text fidelity and style consistency.
CVApr 12, 2025
Hyperlocal disaster damage assessment using bi-temporal street-view imagery and pre-trained vision modelsYifan Yang, Lei Zou, Bing Zhou et al.
Street-view images offer unique advantages for disaster damage estimation as they capture impacts from a visual perspective and provide detailed, on-the-ground insights. Despite several investigations attempting to analyze street-view images for damage estimation, they mainly focus on post-disaster images. The potential of time-series street-view images remains underexplored. Pre-disaster images provide valuable benchmarks for accurate damage estimations at building and street levels. These images could aid annotators in objectively labeling post-disaster impacts, improving the reliability of labeled data sets for model training, and potentially enhancing the model performance in damage evaluation. The goal of this study is to estimate hyperlocal, on-the-ground disaster damages using bi-temporal street-view images and advanced pre-trained vision models. Street-view images before and after 2024 Hurricane Milton in Horseshoe Beach, Florida, were collected for experiments. The objectives are: (1) to assess the performance gains of incorporating pre-disaster street-view images as a no-damage category in fine-tuning pre-trained models, including Swin Transformer and ConvNeXt, for damage level classification; (2) to design and evaluate a dual-channel algorithm that reads pair-wise pre- and post-disaster street-view images for hyperlocal damage assessment. The results indicate that incorporating pre-disaster street-view images and employing a dual-channel processing framework can significantly enhance damage assessment accuracy. The accuracy improves from 66.14% with the Swin Transformer baseline to 77.11% with the dual-channel Feature-Fusion ConvNeXt model. This research enables rapid, operational damage assessments at hyperlocal spatial resolutions, providing valuable insights to support effective decision-making in disaster management and resilience planning.
CVMar 20, 2025
SceneMI: Motion In-betweening for Modeling Human-Scene InteractionsInwoo Hwang, Bing Zhou, Young Min Kim et al.
Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening - a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize on noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability in HSI reconstruction from monocular videos.
LGApr 3, 2025
Toward General and Robust LLM-enhanced Text-attributed Graph LearningZihao Zhang, Xunkai Li, Rong-Hua Li et al.
Recent advancements in Large Language Models (LLMs) and the proliferation of Text-Attributed Graphs (TAGs) across various domains have positioned LLM-enhanced TAG learning as a critical research area. By utilizing rich graph descriptions, this paradigm leverages LLMs to generate high-quality embeddings, thereby enhancing the representational capacity of Graph Neural Networks (GNNs). However, the field faces significant challenges: (1) the absence of a unified framework to systematize the diverse optimization perspectives arising from the complex interactions between LLMs and GNNs, and (2) the lack of a robust method capable of handling real-world TAGs, which often suffer from texts and edge sparsity, leading to suboptimal performance. To address these challenges, we propose UltraTAG, a unified pipeline for LLM-enhanced TAG learning. UltraTAG provides a unified comprehensive and domain-adaptive framework that not only organizes existing methodologies but also paves the way for future advancements in the field. Building on this framework, we propose UltraTAG-S, a robust instantiation of UltraTAG designed to tackle the inherent sparsity issues in real-world TAGs. UltraTAG-S employs LLM-based text propagation and text augmentation to mitigate text sparsity, while leveraging LLM-augmented node selection techniques based on PageRank and edge reconfiguration strategies to address edge sparsity. Our extensive experiments demonstrate that UltraTAG-S significantly outperforms existing baselines, achieving improvements of 2.12\% and 17.47\% in ideal and sparse settings, respectively. Moreover, as the data sparsity ratio increases, the performance improvement of UltraTAG-S also rises, which underscores the effectiveness and robustness of UltraTAG-S.
LGApr 15, 2024
PRIME: A CyberGIS Platform for Resilience Inference Measurement and EnhancementDebayan Mandal, Lei Zou, Rohan Singh Wilkho et al.
In an era of increased climatic disasters, there is an urgent need to develop reliable frameworks and tools for evaluating and improving community resilience to climatic hazards at multiple geographical and temporal scales. Defining and quantifying resilience in the social domain is relatively subjective due to the intricate interplay of socioeconomic factors with disaster resilience. Meanwhile, there is a lack of computationally rigorous, user-friendly tools that can support customized resilience assessment considering local conditions. This study aims to address these gaps through the power of CyberGIS with three objectives: 1) To develop an empirically validated disaster resilience model - Customized Resilience Inference Measurement designed for multi-scale community resilience assessment and influential socioeconomic factors identification, 2) To implement a Platform for Resilience Inference Measurement and Enhancement module in the CyberGISX platform backed by high-performance computing, 3) To demonstrate the utility of PRIME through a representative study. CRIM generates vulnerability, adaptability, and overall resilience scores derived from empirical hazard parameters. Computationally intensive Machine Learning methods are employed to explain the intricate relationships between these scores and socioeconomic driving factors. PRIME provides a web-based notebook interface guiding users to select study areas, configure parameters, calculate and geo-visualize resilience scores, and interpret socioeconomic factors shaping resilience capacities. A representative study showcases the efficiency of the platform while explaining how the visual results obtained may be interpreted. The essence of this work lies in its comprehensive architecture that encapsulates the requisite data, analytical and geo-visualization functions, and ML models for resilience assessment.
CVJul 12, 2025
SnapMoGen: Human Motion Generation from Expressive TextsChuan Guo, Inwoo Hwang, Jian Wang et al.
Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs. 12 words of HumanML3D). Importantly, these motion clips preserve original temporal continuity as they were in long sequences, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into multi-scale token sequences that better exploit the token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the ability to process casual user prompts by employing an LLM to reformat inputs to align with the expressivity and narration style of SnapMoGen. Project webpage: https://snap-research.github.io/SnapMoGen/
GRMar 15, 2025
Snapmoji: Instant Generation of Animatable Dual-Stylized AvatarsEric M. Chen, Di Liu, Sizhuo Ma et al. · mit
The increasing popularity of personalized avatar systems, such as Snapchat Bitmojis and Apple Memojis, highlights the growing demand for digital self-representation. Despite their widespread use, existing avatar platforms face significant limitations, including restricted expressivity due to predefined assets, tedious customization processes, or inefficient rendering requirements. Addressing these shortcomings, we introduce Snapmoji, an avatar generation system that instantly creates animatable, dual-stylized avatars from a selfie. We propose Gaussian Domain Adaptation (GDA), which is pre-trained on large-scale Gaussian models using 3D data from sources such as Objaverse and fine-tuned with 2D style transfer tasks, endowing it with a rich 3D prior. This enables Snapmoji to transform a selfie into a primary stylized avatar, like the Bitmoji style, and apply a secondary style, such as Plastic Toy or Alien, all while preserving the user's identity and the primary style's integrity. Our system is capable of producing 3D Gaussian avatars that support dynamic animation, including accurate facial expression transfer. Designed for efficiency, Snapmoji achieves selfie-to-avatar conversion in just 0.9 seconds and supports real-time interactions on mobile devices at 30 to 40 frames per second. Extensive testing confirms that Snapmoji outperforms existing methods in versatility and speed, making it a convenient tool for automatic avatar creation in various styles.
CVJan 4
Animated 3DGS Avatars in Diverse Scenes with Consistent Lighting and ShadowsAymen Mir, Riza Alp Guler, Jian Wang et al.
We present a method for consistent lighting and shadows when animated 3D Gaussian Splatting (3DGS) avatars interact with 3DGS scenes or with dynamic objects inserted into otherwise static scenes. Our key contribution is Deep Gaussian Shadow Maps (DGSM), a modern analogue of the classical shadow mapping algorithm tailored to the volumetric 3DGS representation. Building on the classic deep shadow mapping idea, we show that 3DGS admits closed form light accumulation along light rays, enabling volumetric shadow computation without meshing. For each estimated light, we tabulate transmittance over concentric radial shells and store them in octahedral atlases, which modern GPUs can sample in real time per query to attenuate affected scene Gaussians and thus cast and receive shadows consistently. To relight moving avatars, we approximate the local environment illumination with HDRI probes represented in a spherical harmonic (SH) basis and apply a fast per Gaussian radiance transfer, avoiding explicit BRDF estimation or offline optimization. We demonstrate environment consistent lighting for avatars from AvatarX and ActorsHQ, composited into ScanNet++, DL3DV, and SuperSplat scenes, and show interactions with inserted objects. Across single and multi avatar settings, DGSM and SH relighting operate fully in the volumetric 3DGS representation, yielding coherent shadows and relighting while avoiding meshing.
CVOct 16, 2025
Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction AnimationShaowei Liu, Chuan Guo, Bing Zhou et al.
Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.
LGOct 10, 2025
MagicDock: Toward Docking-oriented De Novo Ligand Design via Gradient InversionZekai Chen, Xunkai Li, Sirui Zhang et al.
De novo ligand design is a fundamental task that seeks to generate protein or molecule candidates that can effectively dock with protein receptors and achieve strong binding affinity entirely from scratch. It holds paramount significance for a wide spectrum of biomedical applications. However, most existing studies are constrained by the \textbf{Pseudo De Novo}, \textbf{Limited Docking Modeling}, and \textbf{Inflexible Ligand Type}. To address these issues, we propose MagicDock, a forward-looking framework grounded in the progressive pipeline and differentiable surface modeling. (1) We adopt a well-designed gradient inversion framework. To begin with, general docking knowledge of receptors and ligands is incorporated into the backbone model. Subsequently, the docking knowledge is instantiated as reverse gradient flows by binding prediction, which iteratively guide the de novo generation of ligands. (2) We emphasize differentiable surface modeling in the docking process, leveraging learnable 3D point-cloud representations to precisely capture binding details, thereby ensuring that the generated ligands preserve docking validity through direct and interpretable spatial fingerprints. (3) We introduce customized designs for different ligand types and integrate them into a unified gradient inversion framework with flexible triggers, thereby ensuring broad applicability. Moreover, we provide rigorous theoretical guarantees for each component of MagicDock. Extensive experiments across 9 scenarios demonstrate that MagicDock achieves average improvements of 27.1\% and 11.7\% over SOTA baselines specialized for protein or molecule ligand design, respectively.
LGOct 10, 2025
When LLM Agents Meet Graph Optimization: An Automated Data Quality Improvement ApproachZhihan Zhang, Xunkai Li, Yilong Zuo et al.
Text-attributed graphs (TAGs) have become a key form of graph-structured data in modern data management and analytics, combining structural relationships with rich textual semantics for diverse applications. However, the effectiveness of analytical models, particularly graph neural networks (GNNs), is highly sensitive to data quality. Our empirical analysis shows that both conventional and LLM-enhanced GNNs degrade notably under textual, structural, and label imperfections, underscoring TAG quality as a key bottleneck for reliable analytics. Existing studies have explored data-level optimization for TAGs, but most focus on specific degradation types and target a single aspect like structure or label, lacking a systematic and comprehensive perspective on data quality improvement. To address this gap, we propose LAGA (Large Language and Graph Agent), a unified multi-agent framework for comprehensive TAG quality optimization. LAGA formulates graph quality control as a data-centric process, integrating detection, planning, action, and evaluation agents into an automated loop. It holistically enhances textual, structural, and label aspects through coordinated multi-modal optimization. Extensive experiments on 5 datasets and 16 baselines across 9 scenarios demonstrate the effectiveness, robustness and scalability of LAGA, confirming the importance of data-centric quality optimization for reliable TAG analytics.
CVOct 7, 2025
Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction GenerationQingxuan Wu, Zhiyang Dou, Chuan Guo et al.
Modeling human-human interactions from text remains challenging because it requires not only realistic individual dynamics but also precise, text-consistent spatiotemporal coupling between agents. Currently, progress is hindered by 1) limited two-person training data, inadequate to capture the diverse intricacies of two-person interactions; and 2) insufficiently fine-grained text-to-interaction modeling, where language conditioning collapses rich, structured prompts into a single sentence embedding. To address these limitations, we propose our Text2Interact framework, designed to generate realistic, text-aligned human-human interactions through a scalable high-fidelity interaction data synthesizer and an effective spatiotemporal coordination pipeline. First, we present InterCompose, a scalable synthesis-by-composition pipeline that aligns LLM-generated interaction descriptions with strong single-person motion priors. Given a prompt and a motion for an agent, InterCompose retrieves candidate single-person motions, trains a conditional reaction generator for another agent, and uses a neural motion evaluator to filter weak or misaligned samples-expanding interaction coverage without extra capture. Second, we propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues (initiation, response, contact ordering) and an adaptive interaction loss that emphasizes contextually relevant inter-person joint pairs, improving coupling and physical plausibility for fine-grained interaction modeling. Extensive experiments show consistent gains in motion diversity, fidelity, and generalization, including out-of-distribution scenarios and user studies. We will release code and models to facilitate reproducibility.
LGSep 10, 2025
Two Facets of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation ModelsXunkai Li, Daohan Su, Sicheng Liu et al.
Inspired by the success of LLMs, GFMs are designed to learn the optimal embedding functions from multi-domain text-attributed graphs for the downstream cross-task generalization capability. Among the diverse architectures, graph VQ-MAE stands out among the increasingly diverse landscape of GFM. This is attributed to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause imperceptible pitfalls. In this paper, we instantiate two of them, and they are just like two sides of the same GFM optimization coin - Side 1 Model Degradation: The encoder and codebook fail to capture the diversity of inputs; Side 2 Representation Collapse: The hidden embedding and codebook vector fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and generate the low-quality reconstructed supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute the above challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT - (1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity. (2) Regularization Tinker for Optimization Coin, which utilizes two additional regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFM, offering a controllable model scale. Compared to SOTA baselines, experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios.
LGJul 28, 2025
HIAL: A New Paradigm for Hypergraph Active Learning via Influence MaximizationYanheng Hou, Xunkai Li, Zhenjun Li et al.
In recent years, Hypergraph Neural Networks (HNNs) have demonstrated immense potential in handling complex systems with high-order interactions. However, acquiring large-scale, high-quality labeled data for these models is costly, making Active Learning (AL) a critical technique. Existing Graph Active Learning (GAL) methods, when applied to hypergraphs, often rely on techniques like "clique expansion," which destroys the high-order structural information crucial to a hypergraph's success, thereby leading to suboptimal performance. To address this challenge, we introduce HIAL (Hypergraph Active Learning), a native active learning framework designed specifically for hypergraphs. We innovatively reformulate the Hypergraph Active Learning (HAL) problem as an Influence Maximization task. The core of HIAL is a dual-perspective influence function that, based on our novel "High-Order Interaction-Aware (HOI-Aware)" propagation mechanism, synergistically evaluates a node's feature-space coverage (via Magnitude of Influence, MoI) and its topological influence (via Expected Diffusion Value, EDV). We prove that this objective function is monotone and submodular, thus enabling the use of an efficient greedy algorithm with a formal (1-1/e) approximation guarantee. Extensive experiments on seven public datasets demonstrate that HIAL significantly outperforms state-of-the-art baselines in terms of performance, efficiency, generality, and robustness, establishing an efficient and powerful new paradigm for active learning on hypergraphs.
SEMar 24, 2025
Toward building next-generation Geocoding systems: a systematic reviewZhengcong Yin, Daniel W. Goldberg, Binbin Lin et al.
Geocoding systems are widely used in both scientific research for spatial analysis and everyday life through location-based services. The quality of geocoded data significantly impacts subsequent processes and applications, underscoring the need for next-generation systems. In response to this demand, this review first examines the evolving requirements for geocoding inputs and outputs across various scenarios these systems must address. It then provides a detailed analysis of how to construct such systems by breaking them down into key functional components and reviewing a broad spectrum of existing approaches, from traditional rule-based methods to advanced techniques in information retrieval, natural language processing, and large language models. Finally, we identify opportunities to improve next-generation geocoding systems in light of recent technological advances.
HCDec 11, 2021
Acoustic Sensing-based Hand Gesture Detection for Wearable Device InteractionBing Zhou, Matias Aiskovich, Sinem Guven
Hand gesture recognition attracts great attention for interaction since it is intuitive and natural to perform. In this paper, we explore a novel method for interaction by using bone-conducted sound generated by finger movements while performing gestures. We design a set of gestures that generate unique sound features, and capture the resulting sound from the wrist using a commodity microphone. Next, we design a sound event detector and a recognition model to classify the gestures. Our system achieves an overall accuracy of 90.13% in quiet environments and 85.79% under noisy conditions. This promising technology can be deployed on existing smartwatches as a low power service at no additional cost, and can be used for interaction in augmented and virtual reality applications.
CVDec 10, 2021
Sparse Depth Completion with Semantic Mesh Deformation OptimizationBing Zhou, Matias Aiskovich, Sinem Guven
Sparse depth measurements are widely available in many applications such as augmented reality, visual inertial odometry and robots equipped with low cost depth sensors. Although such sparse depth samples work well for certain applications like motion tracking, a complete depth map is usually preferred for broader applications, such as 3D object recognition, 3D reconstruction and autonomous driving. Despite the recent advancements in depth prediction from single RGB images with deeper neural networks, the existing approaches do not yield reliable results for practical use. In this work, we propose a neural network with post-optimization, which takes an RGB image and sparse depth samples as input and predicts the complete depth map. We make three major contributions to advance the state-of-the-art: an improved backbone network architecture named EDNet, a semantic edge-weighted loss function and a semantic mesh deformation optimization method. Our evaluation results outperform the existing work consistently on both indoor and outdoor datasets, and it significantly reduces the mean average error by up to 19.5% under the same settings of 200 sparse samples on NYU-Depth-V2 dataset.
CVJun 14, 2021
User-Guided Personalized Image Aesthetic Assessment based on Deep Reinforcement LearningPei Lv, Jianqi Fan, Xixi Nie et al.
Personalized image aesthetic assessment (PIAA) has recently become a hot topic due to its usefulness in a wide variety of applications such as photography, film and television, e-commerce, fashion design and so on. This task is more seriously affected by subjective factors and samples provided by users. In order to acquire precise personalized aesthetic distribution by small amount of samples, we propose a novel user-guided personalized image aesthetic assessment framework. This framework leverages user interactions to retouch and rank images for aesthetic assessment based on deep reinforcement learning (DRL), and generates personalized aesthetic distribution that is more in line with the aesthetic preferences of different users. It mainly consists of two stages. In the first stage, personalized aesthetic ranking is generated by interactive image enhancement and manual ranking, meanwhile two policy networks will be trained. The images will be pushed to the user for manual retouching and simultaneously to the enhancement policy network. The enhancement network utilizes the manual retouching results as the optimization goals of DRL. After that, the ranking process performs the similar operations like the retouching mentioned before. These two networks will be trained iteratively and alternatively to help to complete the final personalized aesthetic assessment automatically. In the second stage, these modified images are labeled with aesthetic attributes by one style-specific classifier, and then the personalized aesthetic distribution is generated based on the multiple aesthetic attributes of these images, which conforms to the aesthetic preference of users better.
LGApr 29, 2021
Emotional Contagion-Aware Deep Reinforcement Learning for Antagonistic Crowd SimulationPei Lv, Qingqing Yu, Boya Xu et al.
The antagonistic behavior in the crowd usually exacerbates the seriousness of the situation in sudden riots, where the antagonistic emotional contagion and behavioral decision making play very important roles. However, the complex mechanism of antagonistic emotion influencing decision making, especially in the environment of sudden confrontation, has not yet been explored very clearly. In this paper, we propose an Emotional contagion-aware Deep reinforcement learning model for Antagonistic Crowd Simulation (ACSED). Firstly, we build a group emotional contagion module based on the improved Susceptible Infected Susceptible (SIS) infection disease model, and estimate the emotional state of the group at each time step during the simulation. Then, the tendency of crowd antagonistic action is estimated based on Deep Q Network (DQN), where the agent learns the action autonomously, and leverages the mean field theory to quickly calculate the influence of other surrounding individuals on the central one. Finally, the rationality of the predicted actions by DQN is further analyzed in combination with group emotion, and the final action of the agent is determined. The proposed method in this paper is verified through several experiments with different settings. The results prove that the antagonistic emotion has a vital impact on the group combat, and positive emotional states are more conducive to combat. Moreover, by comparing the simulation results with real scenes, the feasibility of our method is further confirmed, which can provide good reference to formulate battle plans and improve the win rate of righteous groups in a variety of situations.
CVJan 26, 2021
Probability Trajectory: One New Movement Description for Trajectory PredictionPei Lv, Hui Wei, Tianxin Gu et al.
Trajectory prediction is a fundamental and challenging task for numerous applications, such as autonomous driving and intelligent robots. Currently, most of existing work treat the pedestrian trajectory as a series of fixed two-dimensional coordinates. However, in real scenarios, the trajectory often exhibits randomness, and has its own probability distribution. Inspired by this observed fact, also considering other movement characteristics of pedestrians, we propose one simple and intuitive movement description, probability trajectory, which maps the coordinate points of pedestrian trajectory into two-dimensional Gaussian distribution in images. Based on this unique description, we develop one novel trajectory prediction method, called social probability. The method combines the new probability trajectory and powerful convolution recurrent neural networks together. Both the input and output of our method are probability trajectories, which provide the recurrent neural network with sufficient spatial and random information of moving pedestrians. And the social probability extracts spatio-temporal features directly on the new movement description to generate robust and accurate predicted results. The experiments on public benchmark datasets show the effectiveness of the proposed method.
GRNov 1, 2019
Personality-Aware Probabilistic Map for Trajectory Prediction of PedestriansChaochao Li, Pei Lv, Mingliang Xu et al.
We present a novel trajectory prediction algorithm for pedestrians based on a personality-aware probabilistic feature map. This map is computed using a spatial query structure and each value represents the probability of the predicted pedestrian passing through various positions in the crowd space. We update this map dynamically based on the agents in the environment and prior trajectory of a pedestrian. Furthermore, we estimate the personality characteristics of each pedestrian and use them to improve the prediction by estimating the shortest path in this map. Our approach is general and works well on crowd videos with low and high pedestrian density. We evaluate our model on standard human-trajectory datasets. In practice, our prediction algorithm improves the accuracy by 5-9% over prior algorithms.
CVSep 24, 2019
Multi-scale discriminative Region Discovery for Weakly-Supervised Object LocalizationPei Lv, Haiyu Yu, Junxiao Xue et al.
Localizing objects with weak supervision in an image is a key problem of the research in computer vision community. Many existing Weakly-Supervised Object Localization (WSOL) approaches tackle this problem by estimating the most discriminative regions with feature maps (activation maps) obtained by Deep Convolutional Neural Network, that is, only the objects or parts of them with the most discriminative response will be located. However, the activation maps often display different local maximum responses or relatively weak response when one image contains multiple objects with the same type or small objects. In this paper, we propose a simple yet effective multi-scale discriminative region discovery method to localize not only more integral objects but also as many as possible with only image-level class labels. The gradient weights flowing into different convolutional layers of CNN are taken as the input of our method, which is different from previous methods only considering that of the final convolutional layer. To mine more discriminative regions for the task of object localization, the multiple local maximum from the gradient weight maps are leveraged to generate the localization map with a parallel sliding window. Furthermore, multi-scale localization maps from different convolutional layers are fused to produce the final result. We evaluate the proposed method with the foundation of VGGnet on the ILSVRC 2016, CUB-200-2011 and PASCAL VOC 2012 datasets. On ILSVRC 2016, the proposed method yields the Top-1 localization error of 48.65\%, which outperforms previous results by 2.75\%. On PASCAL VOC 2012, our approach achieve the highest localization accuracy of 0.43. Even for CUB-200-2011 dataset, our method still achieves competitive results.
MAJan 31, 2019
ACSEE: Antagonistic Crowd Simulation Model with Emotional Contagion and Evolutionary Game TheoryChaochao Li, Pei Lv, Dinesh Manocha et al.
Antagonistic crowd behaviors are often observed in cases of serious conflict. Antagonistic emotions, which is the typical psychological state of agents in different roles (i.e. cops, activists, and civilians) in crowd violent scenes, and the way they spread through contagion in a crowd are important causes of crowd antagonistic behaviors. Moreover, games, which refers to the interaction between opposing groups adopting different strategies to obtain higher benefits and less casualties, determine the level of crowd violence. We present an antagonistic crowd simulation model, ACSEE, which is integrated with antagonistic emotional contagion and evolutionary game theories. Our approach models the antagonistic emotions between agents in different roles using two components: mental emotion and external emotion. We combine enhanced susceptible-infectious-susceptible (SIS) and game approaches to evaluate the role of antagonistic emotional contagion in crowd violence. Our evolutionary game theoretic approach incorporates antagonistic emotional contagion through deterrent force, which is modelled by a mixture of emotional forces and physical forces defeating the opponents. Antagonistic emotional contagion and evolutionary game theories influence each other to determine antagonistic crowd behaviors. We evaluate our approach on real-world scenarios consisting of different kinds of agents. We also compare the simulated crowd behaviors with real-world crowd videos and use our approach to predict the trends of crowd movements in violence incidents. We investigate the impact of various factors (number of agents, emotion, strategy, etc.) on the outcome of crowd violence. We present results from user studies suggesting that our model can simulate antagonistic crowd behaviors similar to those seen in real-world scenarios.
CVAug 21, 2018
Abnormal Event Detection and Location for Dense Crowds using Repulsive Forces and Sparse ReconstructionPei Lv, Shunhua Liu, Mingliang Xu et al.
This paper proposes a method based on repulsive forces and sparse reconstruction for the detection and location of abnormal events in crowded scenes. In order to avoid the challenging problem of accurately tracking each specific individual in a dense or complex scene, we divide each frame of the surveillance video into a fixed number of grids and select a single representative point in each grid as the individual to track. The repulsive force model, which can accurately reflect interactive behaviors of crowds, is used to calculate the interactive forces between grid particles in crowded scenes and to construct a force flow matrix using these discrete forces from a fixed number of continuous frames. The force flow matrix, which contains spatial and temporal information, is adopted to train a group of visual dictionaries by sparse coding. To further improve the detection efficiency and avoid concept drift, we propose a fully unsupervised global and local dynamic updating algorithm, based on sparse reconstruction and a group of word pools. For anomaly location, since our method is based on a fixed grid, we can judge whether anomalies occur in a region intuitively according to the reconstruction error of the corresponding visual words. We experimentally verify the proposed method using the UMN dataset, the UCSD dataset and the Web dataset separately. The results indicate that our method can not only detect abnormal events accurately, but can also pinpoint the location of anomalies.
CVMay 18, 2018
MDSSD: Multi-scale Deconvolutional Single Shot Detector for Small ObjectsLisha Cui, Rui Ma, Pei Lv et al.
For most of the object detectors based on multi-scale feature maps, the shallow layers are rich in fine spatial information and thus mainly responsible for small object detection. The performance of small object detection, however, is still less than satisfactory because of the deficiency of semantic information on shallow feature maps. In this paper, we design a Multi-scale Deconvolutional Single Shot Detector (MDSSD), especially for small object detection. In MDSSD, multiple high-level feature maps at different scales are upsampled simultaneously to increase the spatial resolution. Afterwards, we implement the skip connections with low-level feature maps via Fusion Block. The fusion feature maps, named Fusion Module, are of strong feature representational power of small instances. It is noteworthy that these high-level feature maps utilized in Fusion Block preserve both strong semantic information and some fine details of small instances, rather than the top-most layer where the representation of fine details for small objects are potentially wiped out. The proposed framework achieves 77.6% mAP for small object detection on the challenging dataset TT100K with 512 x 512 input, outperforming other detectors with a large margin. Moreover, it can also achieve state-of-the-art results for general object detection on PASCAL VOC2007 test and MS COCO test-dev2015, especially achieving 2 to 5 points improvement on small object categories.
CVMay 3, 2018
USAR: an Interactive User-specific Aesthetic Ranking Framework for ImagesPei Lv, Meng Wang, Yongbo Xu et al.
When assessing whether an image is of high or low quality, it is indispensable to take personal preference into account. Existing aesthetic models lay emphasis on hand-crafted features or deep features commonly shared by high quality images, but with limited or no consideration for personal preference and user interaction. To that end, we propose a novel and user-friendly aesthetic ranking framework via powerful deep neural network and a small amount of user interaction, which can automatically estimate and rank the aesthetic characteristics of images in accordance with users' preference. Our framework takes as input a series of photos that users prefer, and produces as output a reliable, user-specific aesthetic ranking model matching with users' preference. Considering the subjectivity of personal preference and the uncertainty of user's single selection, a unique and exclusive dataset will be constructed interactively to describe the preference of one individual by retrieving the most similar images with regard to those specified by users. Based on this unique user-specific dataset and sufficient well-designed aesthetic attributes, a customized aesthetic distribution model can be learned, which concatenates both personalized preference and aesthetic rules. We conduct extensive experiments and user studies on two large-scale public datasets, and demonstrate that our framework outperforms those work based on conventional aesthetic assessment or ranking model.
CVMay 2, 2018
Bi-directional Graph Structure Information Model for Multi-Person Pose EstimationJing Wang, Ze Peng, Pei Lv et al.
In this paper, we propose a novel multi-stage network architecture with two branches in each stage to estimate multi-person poses in images. The first branch predicts the confidence maps of joints and uses a geometrical transform kernel to propagate information between neighboring joints at the confidence level. The second branch proposes a bi-directional graph structure information model (BGSIM) to encode rich contextual information and to infer the occlusion relationship among different joints. We dynamically determine the joint point with highest response of the confidence maps as base point of passing message in BGSIM. Based on the proposed network structure, we achieve an average precision of 62.9 on the COCO Keypoint Challenge dataset and 77.6 on the MPII (multi-person) dataset. Compared with other state-of-art methods, our method can achieve highly promising results on our selected multi-person dataset without extra training.
MMMar 6, 2018
ART-UP: A Novel Method for Generating Scanning-robust Aesthetic QR codesMingliang Xu, Qingfeng Li, Jianwei Niu et al.
QR codes are usually scanned in different environments, so they must be robust to variations in illumination, scale, coverage, and camera angles. Aesthetic QR codes improve the visual quality, but subtle changes in their appearance may cause scanning failure. In this paper, a new method to generate scanning-robust aesthetic QR codes is proposed, which is based on a module-based scanning probability estimation model that can effectively balance the tradeoff between visual quality and scanning robustness. Our method locally adjusts the luminance of each module by estimating the probability of successful sampling. The approach adopts the hierarchical, coarse-to-fine strategy to enhance the visual quality of aesthetic QR codes, which sequentially generate the following three codes: a binary aesthetic QR code, a grayscale aesthetic QR code, and the final color aesthetic QR code. Our approach also can be used to create QR codes with different visual styles by adjusting some initialization parameters. User surveys and decoding experiments were adopted for evaluating our method compared with state-of-the-art algorithms, which indicates that the proposed approach has excellent performance in terms of both visual quality and scanning robustness.
CVMar 3, 2018
Depth Information Guided Crowd Counting for Complex Crowd ScenesMingliang Xu, Zhaoyang Ge, Xiaoheng Jiang et al.
It is important to monitor and analyze crowd events for the sake of city safety. In an EDOF (extended depth of field) image with a crowded scene, the distribution of people is highly imbalanced. People far away from the camera look much smaller and often occlude each other heavily, while people close to the camera look larger. In such a case, it is difficult to accurately estimate the number of people by using one technique. In this paper, we propose a Depth Information Guided Crowd Counting (DigCrowd) method to deal with crowded EDOF scenes. DigCrowd first uses the depth information of an image to segment the scene into a far-view region and a near-view region. Then Digcrowd maps the far-view region to its crowd density map and uses a detection method to count the people in the near-view region. In addition, we introduce a new crowd dataset that contains 1000 images. Experimental results demonstrate the effectiveness of our DigCrowd method
MMMar 3, 2018
Stylize Aesthetic QR CodeMingliang Xu, Hao Su, Yafei Li et al.
With the continued proliferation of smart mobile devices, Quick Response (QR) code has become one of the most-used types of two-dimensional code in the world. Aiming at beautifying the visual-unpleasant appearance of QR codes, existing works have developed a series of techniques. However, these works still leave much to be desired, such as personalization, artistry, and robustness. To address these issues, in this paper, we propose a novel type of aesthetic QR codes, SEE (Stylize aEsthEtic) QR code, and a three-stage approach to automatically produce such robust style-oriented codes. Specifically, in the first stage, we propose a method to generate an optimized baseline aesthetic QR code, which reduces the visual contrast between the noise-like black/white modules and the blended image. In the second stage, to obtain art style QR code, we tailor an appropriate neural style transformation network to endow the baseline aesthetic QR code with artistic elements. In the third stage, we design an error-correction mechanism by balancing two competing terms, visual quality and readability, to ensure the performance robust. Extensive experiments demonstrate that SEE QR code has high quality in terms of both visual appearance and robustness, and also offers a greater variety of personalized choices to users.
CVFeb 28, 2015
Improved Image Deblurring based on Salient-region SegmentationChongyang Zhang, Weiyao Lin, Wei Li et al.
Image deblurring techniques play important roles in many image processing applications. As the blur varies spatially across the image plane, it calls for robust and effective methods to deal with the spatially-variant blur problem. In this paper, a Saliency-based Deblurring (SD) approach is proposed based on the saliency detection for salient-region segmentation and a corresponding compensate method for image deblurring. We also propose a PDE-based deblurring method which introduces an anisotropic Partial Differential Equation (PDE) model for latent image prediction and employs an adaptive optimization model in the kernel estimation and deconvolution steps. Experimental results demonstrate the effectiveness of the proposed algorithm.
GRFeb 28, 2015
Facial Expression Cloning with Elastic and Muscle ModelsYihao Zhang, Weiyao Lin, Bing Zhou et al.
Expression cloning plays an important role in facial expression synthesis. In this paper, a novel algorithm is proposed for facial expression cloning. The proposed algorithm first introduces a new elastic model to balance the global and local warping effects, such that the impacts from facial feature diversity among people can be minimized, and thus more effective geometric warping results can be achieved. Furthermore, a muscle-distribution-based (MD) model is proposed, which utilizes the muscle distribution of the human face and results in more accurate facial illumination details. In addition, we also propose a new distance-based metric to automatically select the optimal parameters such that the global and local warping effects in the elastic model can be suitably balanced. Experimental results show that our proposed algorithm outperforms the existing methods.