CV · Nov 30, 2023 · Code
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani et al. · cmu, gatech
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/
92.7 · IR · Jun 3
Bridging Short Videos and Live Streams: Reasoning-Guided Multimodal LLMs for Cross-Domain Representation Learning
Le Zhang, Xiaolan Zhu, Yuchen Wang et al.
As live streaming services grow, many platforms offer short videos and live streams to meet diverse needs. Short videos carry substantial traffic and rich behavior signals, whereas live streaming is a core conversion scenario with sparse behavior data, making cold start severe. Transferring user interests from short videos to live streaming recommendation can alleviate these issues. Meanwhile, short videos and live streams are complex multimodal items, and integrating multimodal signals improves recommendation performance. Although Multimodal Large Language Models (MLLMs) show strong multimodal understanding and reasoning, their application to cross-domain recommendation remains underexplored. To this end, we propose Reasoning-Guided Cross-Domain Representation Learning (RGCD-Rep), a reasoning-guided framework for cross-domain recommendation from short videos to live streams. RGCD-Rep introduces MLLM reasoning resource-efficiently and learns transferable item representations guided by behavioral collaboration via two-stage training. First, reasoning-aware distillation lets a frozen teacher MLLM generate structured cross-domain reasoning knowledge and distills it into a lightweight student MLLM. Second, transferability-guided cross-domain representation learning decomposes item representations into transferable and domain residual representations. The resulting representations are computed offline and integrated into downstream retrieval tasks, enabling low-cost industrial deployment. Extensive offline experiments demonstrate RGCD-Rep's superiority. After deployment in Kuaishou's live streaming recommendation system, A/B tests show significant gains across multiple core business metrics, confirming its effectiveness and practicality in real industrial scenarios. RGCD-Rep is fully deployed and serves over 400 million users daily.
AI · Feb 21, 2023
Label Information Enhanced Fraud Detection against Low Homophily in Graphs
Yuchen Wang, Jinghui Zhang, Zhengjie Huang et al.
Node classification is a substantial problem in graph-based fraud detection. Many existing works adopt Graph Neural Networks (GNNs) to enhance fraud detectors. While promising, most current GNN-based fraud detectors fail to generalize to the low homophily setting. In addition, label utilization has been shown to be a significant factor in node classification, but we find that existing label utilization methods are less effective in fraud detection tasks due to the low homophily in graphs. In this work, we propose GAGA, a novel Group AGgregation enhanced TrAnsformer, to tackle the above challenges. Specifically, the group aggregation provides a portable method to cope with the low homophily issue. Such an aggregation explicitly integrates the label information to generate distinguishable neighborhood information. Along with group aggregation, an attempt towards end-to-end trainable group encoding is proposed, which augments the original feature space with the class labels. Meanwhile, we devise two additional learnable encodings to recognize the structural and relational context. Then, we combine the group aggregation and the learnable encodings into a Transformer encoder to capture the semantic information. Experimental results clearly show that GAGA outperforms other competitive graph-based fraud detectors by up to 24.39% on two trending public datasets and a real-world industrial dataset from Anonymous. Moreover, the group aggregation is demonstrated to outperform other label utilization methods (e.g., C&S, BoT/UniMP) in the low homophily setting.
CV · Jun 27, 2022
Key-frame Guided Network for Thyroid Nodule Recognition using Ultrasound Videos
Yuchen Wang, Zhongyu Li, Xiangxiang Cui et al.
Ultrasound examination is widely used in the clinical diagnosis of thyroid nodules (benign/malignant). However, the accuracy relies heavily on radiologist experience. Although deep learning techniques have been investigated for thyroid nodule recognition, current solutions are mainly based on static ultrasound images, use limited temporal information, and are inconsistent with clinical diagnosis. This paper proposes a novel method for the automated recognition of thyroid nodules through an exhaustive exploration of ultrasound videos and key-frames. We first propose a detection-localization framework to automatically identify the clinical key-frame with a typical nodule in each ultrasound video. Based on the localized key-frame, we develop a key-frame guided video classification model for thyroid nodule recognition. In addition, we introduce a motion attention module to help the network focus on significant frames in an ultrasound video, which is consistent with clinical diagnosis. The proposed thyroid nodule recognition framework is validated on clinically collected ultrasound videos, demonstrating superior performance compared with other state-of-the-art methods.
83.9 · CL · Apr 3 · Code
Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding
Yuchen Wang, Haonan Wang, Yu Guo et al.
Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental limitations: Semantic Bias (mode collapse into generic templates), Signal Neglect (hallucination based on linguistic priors rather than neural inputs), and the BLEU Trap, where evaluation metrics are artificially inflated by high-frequency stopwords, masking a lack of true semantic fidelity. To address these challenges, we propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We redesign the interaction between the neural encoder and the Large Language Model (LLM) by injecting semantic prompts as Queries and EEG embeddings as Key-Value pairs, strictly forcing the model to attend to neural inputs. Furthermore, we move beyond standard translation metrics by adopting N-way Retrieval Accuracy and Fréchet Distance to rigorously assess diversity and alignment. Extensive experiments demonstrate that our approach effectively eliminates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.
CL · Sep 23, 2022
Whodunit? Learning to Contrast for Authorship Attribution
Bo Ai, Yuchen Wang, Yugin Tan et al.
Authorship attribution is the task of identifying the author of a given text. The key is finding representations that can differentiate between authors. Existing approaches typically use manually designed features that capture a dataset's content and style, but these approaches are dataset-dependent and yield inconsistent performance across corpora. In this work, we propose learning author-specific representations by fine-tuning pre-trained generic language representations with a contrastive objective (Contra-X). We show that Contra-X learns representations that form highly separable clusters for different authors. It advances the state-of-the-art on multiple human and machine authorship attribution benchmarks, enabling improvements of up to 6.8% over cross-entropy fine-tuning. However, we find that Contra-X improves overall accuracy at the cost of sacrificing performance for some authors. Resolving this tension will be an important direction for future work. To the best of our knowledge, we are the first to integrate contrastive learning with pre-trained language model fine-tuning for authorship attribution.
86.9 · SE · May 6 · Code
PARNESS: A Paper Harness for End-to-End Automated Scientific Research with Dynamic Workflows, Full-Text Indexing, and Cross-Run Knowledge Accumulation
Yuchen Wang, Zhongzhi Luan
Recent autonomous research systems -- AI-Scientist, PaperOrchestra, AutoSOTA, DeepResearch, InternAgent, ResearchAgent and others -- show LLM agents can ideate, run experiments and write papers, but each fixes a particular control-flow shape (linear pipeline, state machine, single-agent loop, or fixed-recipe skill pack) at the framework level. We argue this rigidity has five roots: (1) workflows are dynamic and discipline-specific (lab work, surveys, simulations, theory all loop differently); (2) ideation is bounded by LLM context and cross-domain ideation needs knowledge a single context cannot hold; (3) summary-only views miss the paper body, yet full-text access is uneven, so the cumulative corpus must do the work; (4) a paper's open-source repository is often the only complete specification of its experimental scheme, but the paper-to-code link is neglected; (5) no tool persists cross-run knowledge retrievably into a finite LLM context. We present PARNESS, an open-source framework built on four design moves. (i) A thin DAG kernel with a four-field Agent contract decouples scheduling from domain semantics, so any discipline's loop is expressible as user-editable YAML. (ii) A full-text PDF-parsing and literature-library subsystem indexes paper bodies, figures and tables as typed objects, with graceful abstract-only fall-back. (iii) A knowledge-graph index over papers, ideas, experiments and code repositories, with scenario-typed retrieval (similar / contradictory / cross-domain / counter-intuitive), surfaces a focused slice into each LLM call. (iv) A small extension surface lets any modern coding agent (Claude Code, Cursor, Copilot, OpenCode) add or replace any module. To our knowledge PARNESS is the first open-source system combining declarative pipelines, full-PDF and code-repository indexing, and cross-run knowledge. Source: https://github.com/gtrhythm/PARNESS
CV · Aug 8, 2024
Dual-branch PolSAR Image Classification Based on GraphMAE and Local Feature Extraction
Yuchen Wang, Ziyi Guo, Haixia Bi et al.
The annotation of polarimetric synthetic aperture radar (PolSAR) images is a labor-intensive and time-consuming process. Therefore, classifying PolSAR images with limited labels is a challenging task in the remote sensing domain. In recent years, self-supervised learning approaches have proven effective in PolSAR image classification with sparse labels. However, we observe a lack of research on generative self-supervised learning for this task. Motivated by this, we propose a dual-branch classification model based on generative self-supervised learning. The first branch is a superpixel-branch, which learns superpixel-level polarimetric representations using a generative self-supervised graph masked autoencoder. To acquire finer classification results, a convolutional neural network-based pixel-branch is further incorporated to learn pixel-level features. Classification with fused dual-branch features is finally performed to obtain the predictions. Experimental results on the benchmark Flevoland dataset demonstrate that our approach yields promising classification results.
CV · Jan 23 · Code
HA2F: Dual-module Collaboration-Guided Hierarchical Adaptive Aggregation Framework for Remote Sensing Change Detection
Shuying Li, Yuchen Wang, San Zhang et al.
Remote sensing change detection (RSCD) aims to identify the spatio-temporal changes of land cover, providing critical support for multi-disciplinary applications (e.g., environmental monitoring, disaster assessment, and climate change studies). Existing methods either extract features from localized patches or process entire images holistically, which leads to cross-temporal feature matching deviation and sensitivity to radiometric and geometric noise. To address these issues, we propose a dual-module collaboration-guided hierarchical adaptive aggregation framework, namely HA2F, which consists of a dynamic hierarchical feature calibration module (DHFCM) and a noise-adaptive feature refinement module (NAFRM). The former dynamically fuses adjacent-level features through perceptual feature selection, suppressing irrelevant discrepancies to address multi-temporal feature alignment deviations. The latter uses a dual feature selection mechanism to highlight change-sensitive regions and generate spatial masks, suppressing the interference of irrelevant regions or shadows. Extensive experiments verify the effectiveness of the proposed HA2F, which achieves state-of-the-art performance on the LEVIR-CD, WHU-CD, and SYSU-CD datasets, surpassing existing comparative methods in terms of both precision metrics and computational efficiency. In addition, ablation experiments show that DHFCM and NAFRM are effective. The official HA2F code is available at https://huggingface.co/InPeerReview/RemoteSensingChangeDetection-RSCD.HA2F
92.5 · RO · May 21
Action with Visual Primitives
Weilong Guo, Yuchen Wang, Renping Zhou et al.
Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements a visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector kinematics. Real-robot experiments on general pick-and-place tasks show that AVP improves the success rate by 27.61% over pi_0.5 and outperforms other recent methods, with consistent gains in data efficiency, spatial-compositional generalization, and object-level transfer.
91.8 · CV · May 3 · Code
Exploring Data-Free LoRA Transferability for Video Diffusion Models
Yuchen Wang, Wenliang Zhong, Lichen Bai et al.
Video diffusion models leveraging step distillation or causal distillation have achieved remarkable performance. However, adapting existing LoRAs to these variants remains a critical challenge due to weight space mismatches. We observe that direct application leads to style degradation and structural collapse, yet the underlying mechanisms remain poorly understood. To fill this gap, we delve into the weight space and identify that the incompatibility stems from spectral interference within shared functional clusters defined over singular subspaces. Specifically, our analysis reveals that while both paradigms respect spectral rigidity, they establish conflicting routing pathways that clash through constructive overload or destructive cancellation. To address this issue, we propose Cluster-Aware Spectral Arbitration (CASA), a data-free framework that dynamically arbitrates between safeguarding the target's manifold and restoring LoRA alignment based on spectral density. Extensive experiments demonstrate that CASA effectively mitigates artifacts and revives LoRA functionality. Our code is available at https://github.com/Noahwangyuchen/CASA
64.6 · SI · Mar 10
Modeling Trend Dynamics with Variational Neural ODEs for Information Popularity Prediction
Yuchen Wang, Dongpeng Hou, Weikai Jing et al.
Predicting the future popularity of information in online social networks is a crucial yet challenging task, due to the complex spatiotemporal dynamics underlying information diffusion. Existing methods typically use structural or sequential patterns within the observation window as direct inputs for subsequent popularity prediction. However, most approaches lack the ability to explicitly model the overall trend of popularity up to the prediction time, which leads to limited predictive capability. To address these limitations, we propose VNOIP, a novel method based on variational neural Ordinary Differential Equations (ODEs) for information popularity prediction. Specifically, VNOIP introduces bidirectional jump ODEs with attention mechanisms to capture long-range dependencies and bidirectional context within cascade sequences. Furthermore, by jointly considering both cascade patterns and overall trend temporal patterns, VNOIP explicitly models the continuous-time dynamics of popularity trend trajectories with variational neural ODEs. Additionally, a knowledge distillation loss is employed to align the evolution of prior and posterior latent variables. Extensive experiments on real-world datasets demonstrate that VNOIP is highly competitive in both prediction accuracy and efficiency compared to state-of-the-art baselines.
LG · Oct 7, 2023
Robustness-enhanced Uplift Modeling with Adversarial Feature Desensitization
Zexu Sun, Bowei He, Ming Ma et al.
Uplift modeling has shown very promising results in online marketing. However, most existing works are prone to the robustness challenge in some practical applications. In this paper, we first present a possible explanation for the above phenomenon. We verify that there is a feature sensitivity problem in online marketing using different real-world datasets, where the perturbation of some key features will seriously affect the performance of the uplift model and even cause the opposite trend. To solve the above problem, we propose a novel robustness-enhanced uplift modeling framework with adversarial feature desensitization (RUAD). Specifically, our RUAD can more effectively alleviate the feature sensitivity of the uplift model through two customized modules, including a feature selection module with joint multi-label modeling to identify a key subset from the input features and an adversarial feature desensitization module using adversarial training and soft interpolation operations to enhance the robustness of the model against this selected subset of features. Finally, we conduct extensive experiments on a public dataset and a real product dataset to verify the effectiveness of our RUAD in online marketing. In addition, we also demonstrate the robustness of our RUAD to the feature sensitivity, as well as the compatibility with different uplift models.
CV · Nov 26, 2024 · Code
AnchorCrafter: Animate Cyber-Anchors Selling Your Products via Human-Object Interacting Video Generation
Ziyi Xu, Ziyao Huang, Juan Cao et al.
The generation of anchor-style product promotion videos presents promising opportunities in e-commerce, advertising, and consumer engagement. Despite advancements in pose-guided human video generation, creating product promotion videos remains challenging. In addressing this challenge, we identify the integration of human-object interactions (HOI) into pose-guided human video generation as a core issue. To this end, we introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object, achieving high visual fidelity and controllable interactions. Specifically, we propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives and disentangles object and human appearance, and the HOI-motion injection, which enables complex human-object interactions by overcoming challenges in object trajectory conditioning and inter-occlusion management. Extensive experiments show that our system improves object appearance preservation by 7.5% and doubles the object localization accuracy compared to existing state-of-the-art approaches. It also outperforms existing approaches in maintaining human motion consistency and high-quality video generation. Project page including data, code, and Huggingface demo: https://github.com/cangcz/AnchorCrafter.
CL · Mar 21, 2023
Understand Legal Documents with Contextualized Large Language Models
Xin Jin, Yuchen Wang
The growth of pending legal cases in populous countries, such as India, has become a major issue. Developing effective techniques to process and understand legal documents is extremely useful in resolving this problem. In this paper, we present our systems for SemEval-2023 Task 6: understanding legal texts (Modi et al., 2023). Specifically, we first develop the Legal-BERT-HSLN model, which considers comprehensive context information at both the intra- and inter-sentence levels to predict rhetorical roles (subtask A), and then train a Legal-LUKE model, which is legal-contextualized and entity-aware, to recognize legal entities (subtask B). Our evaluations demonstrate that our designed models are more accurate than baselines, e.g., with up to a 15.0% better F1 score in subtask B. We achieved notable performance on the task leaderboard, e.g., a 0.834 micro F1 score, and ranked No. 5 out of 27 teams in subtask A.
CV · Jun 3, 2025 · Code
A Foundation Model for Spatial Proteomics
Muhammad Shaban, Yuzhou Chang, Huaying Qiu et al.
Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons and serving as a reverse image search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at https://github.com/mahmoodlab/KRONOS.
92.2 · DB · Mar 17
MFTune: An Efficient Multi-fidelity Framework for Spark SQL Configuration Tuning
Beicheng Xu, Lingching Tung, Yuchen Wang et al.
Apache Spark SQL is a cornerstone of modern big data analytics. However, optimizing Spark SQL performance is challenging due to its vast configuration space and the prohibitive cost of evaluating massive workloads. Existing tuning methods predominantly rely on full-fidelity evaluations, which are extremely time-consuming, often leading to suboptimal performance within practical budgets. While multi-fidelity optimization offers a potential solution, directly applying standard techniques, such as data volume reduction or early stopping, proves ineffective for Spark SQL, as they fail to preserve performance correlations or represent true system bottlenecks. To address these challenges, we propose MFTune, an efficient multi-fidelity framework that introduces a query-based fidelity partitioning strategy, utilizing representative SQL subsets to provide accurate, low-cost proxies. To navigate the huge search space, MFTune incorporates a density-based optimization mechanism for automated knob and range compression, alongside an adapted transfer learning approach and a two-phase warm start to further accelerate the tuning process. Experimental results on TPC-H and TPC-DS benchmarks demonstrate that MFTune significantly outperforms five state-of-the-art tuning methods, identifying superior configurations within practical time constraints.
CV · Dec 21, 2024 · Code
Complementary Advantages: Exploiting Cross-Field Frequency Correlation for NIR-Assisted Image Denoising
Yuchen Wang, Hongyuan Wang, Lizhi Wang et al.
Existing single-image denoising algorithms often struggle to restore details when dealing with complex noisy images. The introduction of near-infrared (NIR) images offers new possibilities for RGB image denoising. However, due to the inconsistency between NIR and RGB images, existing works still struggle to balance the contributions of the two fields during image fusion. In response, we develop a cross-field Frequency Correlation Exploiting Network (FCENet) for NIR-assisted image denoising. We first propose the frequency correlation prior based on an in-depth statistical frequency analysis of NIR-RGB image pairs. The prior reveals the complementary correlation of NIR and RGB images in the frequency domain. Leveraging the frequency correlation prior, we then establish a frequency learning framework composed of a Frequency Dynamic Selection Mechanism (FDSM) and a Frequency Exhaustive Fusion Mechanism (FEFM). FDSM dynamically selects complementary information from NIR and RGB images in the frequency domain, and FEFM strengthens the control of common and differential features during the fusion of NIR and RGB features. Extensive experiments on simulated and real data validate that the proposed method outperforms other state-of-the-art methods. The code will be released at https://github.com/yuchenwang815/FCENet.
47.9 · AI · Mar 18
Understanding the Theoretical Foundations of Deep Neural Networks through Differential Equations
Hongjue Zhao, Yizhuo Chen, Yuchen Wang et al.
Deep neural networks (DNNs) have achieved remarkable empirical success, yet the absence of a principled theoretical foundation continues to hinder their systematic development. In this survey, we present differential equations as a theoretical foundation for understanding, analyzing, and improving DNNs. We organize the discussion around three guiding questions: i) how differential equations offer a principled understanding of DNN architectures, ii) how tools from differential equations can be used to improve DNN performance in a principled way, and iii) what real-world applications benefit from grounding DNNs in differential equations. We adopt a two-fold perspective spanning the model level, which interprets the whole DNN as a differential equation, and the layer level, which models individual DNN components as differential equations. From these two perspectives, we review how this framework connects model design, theoretical analysis, and performance improvement. We further discuss real-world applications, as well as key challenges and opportunities for future research.
96.1 · CV · Apr 9 · Code
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
Qihui Zhu, Tao Zhang, Yuchen Wang et al.
In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.
LG · Jul 14, 2025 · Code
A Generalizable Physics-Enhanced State Space Model for Long-Term Dynamics Forecasting in Complex Environments
Yuchen Wang, Hongjue Zhao, Haohong Lin et al.
This work aims to address the problem of long-term dynamic forecasting in complex environments where data are noisy and irregularly sampled. While recent studies have introduced some methods to improve prediction performance, these approaches still face a significant challenge in handling long-term extrapolation tasks under such complex scenarios. To overcome this challenge, we propose Phy-SSM, a generalizable method that integrates partial physics knowledge into state space models (SSMs) for long-term dynamics forecasting in complex environments. Our motivation is that SSMs can effectively capture long-range dependencies in sequential data and model continuous dynamical systems, while the incorporation of physics knowledge improves generalization ability. The key challenge lies in how to seamlessly incorporate partially known physics into SSMs. To achieve this, we decompose partially known system dynamics into known and unknown state matrices, which are integrated into a Phy-SSM unit. To further enhance long-term prediction performance, we introduce a physics state regularization term to make the estimated latent states align with system dynamics. Besides, we theoretically analyze the uniqueness of the solutions for our method. Extensive experiments on three real-world applications, including vehicle motion prediction, drone state prediction, and COVID-19 epidemiology forecasting, demonstrate the superior performance of Phy-SSM over the baselines in both long-term interpolation and extrapolation tasks. The code is available at https://github.com/511205787/Phy_SSM-ICML2025.
AI · Feb 15 · Code
A Generalizable Physics-guided Causal Model for Trajectory Prediction in Autonomous Driving
Zhenyu Zong, Yuchen Wang, Haohong Lin et al.
Trajectory prediction for traffic agents is critical for safe autonomous driving. However, achieving effective zero-shot generalization in previously unseen domains remains a significant challenge. Motivated by the consistent nature of kinematics across diverse domains, we aim to incorporate domain-invariant knowledge to enhance zero-shot trajectory prediction capabilities. The key challenges include: 1) effectively extracting domain-invariant scene representations, and 2) integrating invariant features with kinematic models to enable generalized predictions. To address these challenges, we propose a novel generalizable Physics-guided Causal Model (PCM), which comprises two core components: a Disentangled Scene Encoder, which adopts intervention-based disentanglement to extract domain-invariant features from scenes, and a CausalODE Decoder, which employs a causal attention mechanism to effectively integrate kinematic models with meaningful contextual information. Extensive experiments on real-world autonomous driving datasets demonstrate our method's superior zero-shot generalization performance in unseen cities, significantly outperforming competitive baselines. The source code is released at https://github.com/ZY-Zong/Physics-guided-Causal-Model.
48.6LGApr 3
Earth Embeddings Reveal Diverse Urban Signals from SpaceWenjing Gong, Udbhav Srivastava, Yuchen Wang et al.
Conventional urban indicators derived from censuses, surveys, and administrative records are often costly, spatially inconsistent, and slow to update. Recent geospatial foundation models enable Earth embeddings, compact satellite image representations transferable across downstream tasks, but their utility for neighborhood-scale urban monitoring remains unclear. Here, we benchmark three Earth embedding families, AlphaEarth, Prithvi, and Clay, for urban signal prediction across six U.S. metropolitan areas from 2020 to 2023. Using a unified supervised-learning framework, we predict 14 neighborhood-level indicators spanning crime, income, health, and travel behavior, and evaluate performance under four settings: global, city-wise, year-wise, and city-year. Results show that Earth embeddings capture substantial urban variation, with the highest predictive skill for outcomes more directly tied to built-environment structure, including chronic health burdens and dominant commuting modes. By contrast, indicators shaped more strongly by fine-scale behavior and local policy, such as cycling, remain difficult to infer. Predictive performance varies markedly across cities but remains comparatively stable across years, indicating strong spatial heterogeneity alongside temporal robustness. Exploratory analysis suggests that cross-city variation in predictive performance is associated with urban form in task-specific ways. Controlled dimensionality experiments show that representation efficiency is critical: compact 64-dimensional AlphaEarth embeddings remain more informative than 64-dimensional reductions of Prithvi and Clay. This study establishes a benchmark for evaluating Earth embeddings in urban remote sensing and demonstrates their potential as scalable, low-cost features for SDG-aligned neighborhood-scale urban monitoring.
CVDec 8, 2025
Dropout Prompt Learning: Towards Robust and Adaptive Vision-Language ModelsBiao Chen, Lin Zuo, Mengmeng Jing et al.
Dropout is a widely used regularization technique which improves the generalization ability of a model by randomly dropping neurons. In light of this, we propose Dropout Prompt Learning, which aims for applying dropout to improve the robustness of the vision-language models. Different from the vanilla dropout, we apply dropout on the tokens of the textual and visual branches, where we evaluate the token significance considering both intra-modal context and inter-modal alignment, enabling flexible dropout probabilities for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method's effectiveness in challenging scenarios like low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods including KgCoOp by 5.10% and PromptSRC by 2.13% in performance on base-to-novel generalization.
CVJul 20, 2025Code
FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language ModelsDong Shu, Haoyang Yuan, Yuchen Wang et al.
Large vision-language models (LVLMs) have made significant progress in chart understanding. However, financial charts, characterized by complex temporal structures and domain-specific terminology, remain notably underexplored. We introduce FinChart-Bench, the first benchmark specifically focused on real-world financial charts. FinChart-Bench comprises 1,200 financial chart images collected from 2015 to 2024, each annotated with True/False (TF), Multiple Choice (MC), and Question Answering (QA) questions, totaling 7,016 questions. We conduct a comprehensive evaluation of 25 state-of-the-art LVLMs on FinChart-Bench. Our evaluation reveals critical insights: (1) the performance gap between open-source and closed-source models is narrowing, (2) performance degradation occurs in upgraded models within families, (3) many models struggle with instruction following, (4) both advanced models show significant limitations in spatial reasoning abilities, and (5) current LVLMs are not reliable enough to serve as automated evaluators. These findings highlight important limitations in current LVLM capabilities for financial chart understanding. The FinChart-Bench dataset is available at https://huggingface.co/datasets/Tizzzzy/FinChart-Bench.
CVMay 4, 2025Code
Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated MarginYuchen Wang, Xuefeng Bai, Xiucheng Li et al.
Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudolabels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudolabels, achieving a relative improvement of 6.29% over the SoTA method. Our code is available at https://anonymous.4open.science/r/CAP-C642/
CVMar 25, 2021Code
Stepwise Goal-Driven Networks for Trajectory PredictionChuhua Wang, Yuchen Wang, Mingze Xu et al.
We propose to predict the future trajectories of observed agents (e.g., pedestrians or vehicles) by estimating and using their goals at multiple time scales. We argue that the goal of a moving agent may change over time, and modeling goals continuously provides more accurate and detailed information for future trajectory estimation. To this end, we present a recurrent network for trajectory prediction, called Stepwise Goal-Driven Network (SGNet). Unlike prior work that models only a single, long-term goal, SGNet estimates and uses goals at multiple temporal scales. In particular, it incorporates an encoder that captures historical information, a stepwise goal estimator that predicts successive goals into the future, and a decoder that predicts future trajectory. We evaluate our model on three first-person traffic datasets (HEV-I, JAAD, and PIE) as well as on three bird's eye view datasets (NuScenes, ETH, and UCY), and show that our model achieves state-of-the-art results on all datasets. Code has been made available at: https://github.com/ChuhuaW/SGNet.pytorch.
68.8LGMar 15
WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic SystemsYuchen Wang, Jiangtao Kong, Sizhe Wei et al.
Trajectory world models play a crucial role in robotic dynamics learning, planning, and control. While recent works have explored trajectory world models for diverse robotic systems, they struggle to scale to a large number of distinct system dynamics and overlook domain knowledge of physical structures. To address these limitations, we introduce WestWorld, a knoWledge-Encoded Scalable Trajectory World model for diverse robotic systems. To tackle the scalability challenge, we propose a novel system-aware Mixture-of-Experts (Sys-MoE) that dynamically combines and routes specialized experts for different robotic systems via a learnable system embedding. To further enhance zero-shot generalization, we incorporate domain knowledge of robot physical structures by introducing a structural embedding that aligns trajectory representations with morphological information. After pretraining on 89 complex environments spanning diverse morphologies across both simulation and real-world settings, WestWorld achieves significant improvements over competitive baselines in zero- and few-shot trajectory prediction. Additionally, it shows strong scalability across a wide range of robotic environments and significantly improves performance on downstream model-based control for different robots. Finally, we deploy our model on a real-world Unitree Go1, where it demonstrates stable locomotion performance (see our demo on the website: https://westworldrobot.github.io/). The code will be available upon publication.
CVDec 17, 2025
GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker DetectionYu Wang, Juhyung Ha, Frangil M. Ramirez et al.
Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
LGApr 29, 2025
Multidimensional precipitation index prediction based on CNN-LSTM hybrid frameworkYuchen Wang, Pengfei Jia, Zhitao Shu et al.
With the intensification of global climate change, accurate prediction of weather indicators is of great significance in disaster prevention and mitigation, agricultural production, and transportation. Precipitation, as one of the key meteorological indicators, plays a crucial role in water resource management, agricultural production, and urban flood control. This study proposes a multidimensional precipitation index prediction model based on a CNN-LSTM hybrid framework, aiming to improve the accuracy of precipitation forecasts. The dataset is sourced from Pune, Maharashtra, India, and covers monthly mean precipitation from 1972 to 2002. These nearly 31 years of monthly averages reflect the long-term fluctuations and seasonal variations of precipitation in the region. By analyzing these time series data, the CNN-LSTM model effectively captures local features and long-term dependencies. Experimental results show that the model achieves a root mean square error (RMSE) of 6.752, which demonstrates a significant advantage over traditional time series prediction methods in terms of prediction accuracy and generalization ability. Furthermore, this study provides new research ideas for precipitation prediction. However, the model requires high computational resources when dealing with large-scale datasets, and its predictive ability for multidimensional precipitation data still needs improvement. Future research could extend the model to support and predict multidimensional precipitation data, thereby promoting the development of more accurate and efficient meteorological prediction technologies.
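For readers unfamiliar with the metric reported in this abstract, the following minimal sketch shows how a root mean square error is computed from observed and predicted values. The function name and the sample numbers are illustrative assumptions and are unrelated to the paper's data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up monthly precipitation values, for illustration only.
observed = [0.0, 0.0]
predicted = [3.0, 4.0]
error = rmse(observed, predicted)  # sqrt((9 + 16) / 2)
```

Because the errors are squared before averaging, a few large misses (for example, a badly predicted monsoon month) raise RMSE more than many small ones.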
SEApr 2, 2025
From Code Generation to Software Testing: AI Copilot with Context-Based RAGYuchen Wang, Shangxin Guo, Chee Wei Tan
The rapid pace of large-scale software development places increasing demands on traditional testing methodologies, often leading to bottlenecks in efficiency, accuracy, and coverage. We propose a novel perspective on software testing by positing bug detection and coding with fewer bugs as two interconnected problems that share a common goal, which is reducing bugs with limited resources. We extend our previous work on AI-assisted programming, which supports code auto-completion and chatbot-powered Q&A, to the realm of software testing. We introduce Copilot for Testing, an automated testing system that synchronizes bug detection with codebase updates, leveraging context-based Retrieval Augmented Generation (RAG) to enhance the capabilities of large language models (LLMs). Our evaluation demonstrates a 31.2% improvement in bug detection accuracy, a 12.6% increase in critical test coverage, and a 10.5% higher user acceptance rate, highlighting the transformative potential of AI-driven technologies in modern software development practices.
24.0AIApr 23
Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language ModelsChee Wei Tan, Yuchen Wang, Shangxin Guo
This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability. In rigorously solvable games, it employs mathematical reasoning to compute optimal strategies and generates human-readable explanations for its decisions. For heuristic-based games, it synthesizes strategies by combining insights from classical minimax algorithms (see, e.g., shannon1950chess) with crowd-sourced data. Finally, in learning-based games, it utilizes reinforcement learning with human feedback and self-critique to iteratively refine strategies through trial-and-error and imitation learning. Nemobot amplifies this framework by offering a programmable environment where users can experiment with tool-augmented generation and fine-tuning of strategic game agents. From strategic games to role-playing games, Nemobot demonstrates how AI agents can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own logic. This represents a step toward the long-term goal of self-programming AI.
LGNov 13, 2025
Robust Decentralized Multi-armed Bandits: From Corruption-Resilience to Byzantine-ResilienceZicheng Hu, Yuchen Wang, Cheng Chen
Decentralized cooperative multi-agent multi-armed bandits (DeCMA2B) considers how multiple agents collaborate in a decentralized multi-armed bandit setting. Though this problem has been extensively studied in previous work, most existing methods remain susceptible to various adversarial attacks. In this paper, we first study DeCMA2B with adversarial corruption, where an adversary can corrupt reward observations of all agents with a limited corruption budget. We propose a robust algorithm, called DeMABAR, which ensures that each agent's individual regret suffers only an additive term proportional to the corruption budget. Then we consider a more realistic scenario where the adversary can only attack a small number of agents. Our theoretical analysis shows that the DeMABAR algorithm can also almost completely eliminate the influence of adversarial attacks and is inherently robust in the Byzantine setting, where an unknown fraction of the agents can be Byzantine, i.e., may arbitrarily select arms and communicate wrong information. We also conduct numerical experiments to illustrate the robustness and effectiveness of the proposed method.
CRMay 14, 2024
Achieving Resolution-Agnostic DNN-based Image Watermarking: A Novel Perspective of Implicit Neural RepresentationYuchen Wang, Xingyu Zhu, Guanhui Ye et al.
DNN-based watermarking methods are rapidly developing and delivering impressive performance. Recent advances achieve resolution-agnostic image watermarking by reducing the variable-resolution watermarking problem to a fixed-resolution watermarking problem. However, such a reduction process can introduce artifacts and lower robustness. To address this issue, we propose what is, to the best of our knowledge, the first Resolution-Agnostic Image WaterMarking (RAIMark) framework that watermarks the implicit neural representation (INR) of an image. Unlike previous methods, our method does not rely on the reduction process: it directly watermarks the continuous signal instead of image pixels, thus achieving resolution-agnostic watermarking. Specifically, given an arbitrary-resolution image, we fit an INR to the target image. As a continuous signal, such an INR can be sampled to obtain images at various resolutions. Then, we quickly fine-tune the fitted INR to get a watermarked INR conditioned on a binary secret message. A pre-trained watermark decoder extracts the hidden message from any sampled image of arbitrary resolution. By directly watermarking the INR, we achieve resolution-agnostic watermarking with increased robustness. Extensive experiments show that our method outperforms previous methods with significant improvements: bit accuracy improves by 7% to 29% on average. Notably, we observe that previous methods are vulnerable to at least one watermarking attack (e.g. JPEG, crop, resize), while ours is robust against all watermarking attacks.
AIDec 23, 2025
MAR:Multi-Agent Reflexion Improves Reasoning Abilities in LLMsOnat Ozer, Grace Wu, Yuchen Wang et al.
LLMs have shown the capacity to improve their performance on reasoning tasks by reflecting on their mistakes and acting with these reflections in mind. However, continual reflection by the same LLM on itself exhibits degeneration of thought, where the LLM keeps repeating the same errors even with the knowledge that it is wrong. To address this problem, we instead introduce multi-agent reflection with multi-persona debaters as the method to generate reflections. Through our extensive experimentation, we have found that this leads to better diversity in the reflections generated by the LLM agent. We demonstrate an accuracy of 47% EM on HotPot QA (question answering) and 82.7% on HumanEval (programming), both surpassing reflection with a single LLM.
LGApr 21, 2024
Fermi-Bose Machine achieves both generalization and adversarial robustnessMingshan Xie, Yuchen Wang, Haiping Huang
Distinct from human cognitive processing, deep neural networks trained by backpropagation can be easily fooled by adversarial examples. To design semantically meaningful representation learning, we discard backpropagation and instead propose local contrastive learning, where the representations of inputs bearing the same label shrink together (akin to bosons) in hidden layers, while those of different labels repel (akin to fermions). This layer-wise learning is local in nature and thus biologically plausible. A statistical mechanics analysis shows that the target fermion-pair-distance is a key parameter. Moreover, applying this local contrastive learning to the MNIST benchmark dataset demonstrates that the adversarial vulnerability of a standard perceptron can be greatly mitigated by tuning the target distance, i.e., controlling the geometric separation of prototype manifolds.
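The attract-and-repel idea in this abstract can be sketched as a pairwise loss on one layer's representations. The version below is only an assumed form, not the paper's objective: same-label pairs are penalized by their squared distance, and different-label pairs are penalized with a squared hinge whenever they sit closer than a target distance. The function names and toy inputs are hypothetical.

```python
import math


def pair_distance(u, v):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def local_contrastive_loss(reps, labels, target_distance):
    """Pull same-label pairs together; push different-label pairs apart
    until they are at least target_distance away (assumed hinge form)."""
    loss = 0.0
    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            d = pair_distance(reps[i], reps[j])
            if labels[i] == labels[j]:
                loss += d ** 2  # "boson-like" attraction
            else:
                loss += max(0.0, target_distance - d) ** 2  # "fermion-like" repulsion
    return loss


# Toy layer outputs: two samples of class 0 and one sample of class 1.
reps = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
labels = [0, 0, 1]
loss_wide = local_contrastive_loss(reps, labels, target_distance=6.0)
loss_narrow = local_contrastive_loss(reps, labels, target_distance=4.0)
```

In this sketch a larger target distance demands more separation between classes, which mirrors the abstract's remark that the target distance controls the geometric separation of prototype manifolds.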
LGMar 5, 2025
TrafficKAN-GCN: Graph Convolutional-based Kolmogorov-Arnold Network for Traffic Flow OptimizationJiayi Zhang, Yiming Zhang, Yuan Zheng et al.
Urban traffic optimization is critical for improving transportation efficiency and alleviating congestion, particularly in large-scale dynamic networks. Traditional methods, such as Dijkstra's and Floyd's algorithms, provide effective solutions in static settings, but they struggle with the spatial-temporal complexity of real-world traffic flows. In this work, we propose TrafficKAN-GCN, a hybrid deep learning framework combining Kolmogorov-Arnold Networks (KAN) with Graph Convolutional Networks (GCN), designed to enhance urban traffic flow optimization. By integrating KAN's adaptive nonlinear function approximation with GCN's spatial graph learning capabilities, TrafficKAN-GCN captures both complex traffic patterns and topological dependencies. We evaluate the proposed framework using real-world traffic data from the Baltimore Metropolitan area. Compared with baseline models such as MLP-GCN, standard GCN, and Transformer-based approaches, TrafficKAN-GCN achieves competitive prediction accuracy while demonstrating improved robustness in handling noisy and irregular traffic data. Our experiments further highlight the framework's ability to redistribute traffic flow, mitigate congestion, and adapt to disruptive events, such as the Francis Scott Key Bridge collapse. This study contributes to the growing body of work on hybrid graph learning for intelligent transportation systems, highlighting the potential of combining KAN and GCN for real-time traffic optimization. Future work will focus on reducing computational overhead and integrating Transformer-based temporal modeling for enhanced long-term traffic prediction. The proposed TrafficKAN-GCN framework offers a promising direction for data-driven urban mobility management, balancing predictive accuracy, robustness, and computational efficiency.
CLFeb 9, 2024
Detection of Opioid Users from Reddit Posts via an Attention-based Bidirectional Recurrent Neural NetworkYuchen Wang, Zhengyu Fang, Wei Du et al.
The opioid epidemic, referring to the growing hospitalizations and deaths caused by opioid overdose and addiction, has become a severe health problem in the United States. Many strategies have been developed by the federal and local governments and health communities to combat this crisis. Among them, improving our understanding of the epidemic through better health surveillance is one of the top priorities. In addition to direct testing, machine learning approaches may also allow us to detect opioid users by analyzing data from social media, because many opioid users may choose not to be tested but may share their experiences on social media anonymously. In this paper, we take advantage of recent advances in machine learning to collect and analyze user posts from the popular social network Reddit with the goal of identifying opioid users. Posts from more than 1,000 users who have posted on three subreddits over a period of one month have been collected. In addition to posts that contain keywords such as opioid, opiate, or heroin, we have also collected posts that contain slang terms for opioids such as black or chocolate. We apply an attention-based bidirectional long short-term memory model to identify opioid users. Experimental results show that the approach significantly outperforms competitive algorithms in terms of F1-score. Furthermore, the model allows us to extract the most informative words, such as opiate, opioid, and black, from posts via the attention layer, which provides more insight into how the machine learning algorithm distinguishes drug users from non-drug users.
AISep 28, 2025
Future-Proofing Programmers: Optimal Knowledge Tracing for AI-Assisted Personalized EducationYuchen Wang, Pei-Duo Yu, Chee Wei Tan
Learning to learn is becoming a science, driven by the convergence of knowledge tracing, signal processing, and generative AI to model student learning states and optimize education. We propose CoTutor, an AI-driven model that enhances Bayesian Knowledge Tracing with signal processing techniques to improve student progress modeling and deliver adaptive feedback and strategies. Deployed as an AI copilot, CoTutor combines generative AI with adaptive learning technology. In university trials, it has demonstrated measurable improvements in learning outcomes while outperforming conventional educational tools. Our results highlight its potential for AI-driven personalization, scalability, and future opportunities for advancing privacy and ethical considerations in educational technology. Inspired by Richard Hamming's vision of computer-aided 'learning to learn,' CoTutor applies convex optimization and signal processing to automate and scale up learning analytics, while reserving pedagogical judgment for humans, ensuring AI facilitates the process of knowledge tracing while enabling learners to uncover new insights.
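For context on the baseline that CoTutor builds on, the following sketch shows the classic Bayesian Knowledge Tracing update for a single skill. It covers only standard BKT, not CoTutor's signal processing extensions, and the parameter names and values are illustrative assumptions.

```python
def bkt_update(p_known, correct, p_slip, p_guess, p_transit):
    """One step of classic Bayesian Knowledge Tracing.

    p_known:   prior probability the student has mastered the skill
    correct:   whether the observed answer was correct
    p_slip:    probability of answering wrong despite mastery
    p_guess:   probability of answering right without mastery
    p_transit: probability of learning the skill after this attempt
    """
    if correct:
        numerator = p_known * (1 - p_slip)
        denominator = numerator + (1 - p_known) * p_guess
    else:
        numerator = p_known * p_slip
        denominator = numerator + (1 - p_known) * (1 - p_guess)
    posterior = numerator / denominator  # Bayes' rule on the observation
    # Account for the chance of learning between attempts.
    return posterior + (1 - posterior) * p_transit


# Hypothetical parameters for one skill.
after_correct = bkt_update(0.5, True, p_slip=0.1, p_guess=0.2, p_transit=0.1)
after_wrong = bkt_update(0.5, False, p_slip=0.1, p_guess=0.2, p_transit=0.1)
```

Calling the function once per answered question, feeding each result back in as the next `p_known`, yields a running mastery estimate that a tutoring system could use to pick the next exercise.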
LGAug 26, 2025
The Sound of Risk: A Multimodal Physics-Informed Acoustic Model for Forecasting Market Volatility and Enhancing Market InterpretabilityXiaoliang Chen, Xin Yu, Le Chang et al.
Information asymmetry in financial markets, often amplified by strategically crafted corporate narratives, undermines the effectiveness of conventional textual analysis. We propose a novel multimodal framework for financial risk assessment that integrates textual sentiment with paralinguistic cues derived from executive vocal tract dynamics in earnings calls. Central to this framework is the Physics-Informed Acoustic Model (PIAM), which applies nonlinear acoustics to robustly extract emotional signatures from raw teleconference sound subject to distortions such as signal clipping. Both acoustic and textual emotional states are projected onto an interpretable three-dimensional Affective State Label (ASL) space: Tension, Stability, and Arousal. Using a dataset of 1,795 earnings calls (approximately 1,800 hours), we construct features capturing dynamic shifts in executive affect between scripted presentation and spontaneous Q&A exchanges. Our key finding reveals a pronounced divergence in predictive capacity: while multimodal features do not forecast directional stock returns, they explain up to 43.8% of the out-of-sample variance in 30-day realized volatility. Importantly, volatility predictions are strongly driven by emotional dynamics during executive transitions from scripted to spontaneous speech, particularly reduced textual stability and heightened acoustic instability from CFOs, and significant arousal variability from CEOs. An ablation study confirms that our multimodal approach substantially outperforms a financials-only baseline, underscoring the complementary contributions of acoustic and textual modalities. By decoding latent markers of uncertainty from verifiable biometric signals, our methodology provides investors and regulators a powerful tool for enhancing market interpretability and identifying hidden corporate uncertainty.
CVMay 31, 2025
Sequence-Based Identification of First-Person Camera Wearers in Third-Person ViewsZiwei Zhao, Xizi Wang, Yuchen Wang et al.
The increasing popularity of egocentric cameras has generated growing interest in studying multi-camera interactions in shared environments. Although large-scale datasets such as Ego4D and Ego-Exo4D have propelled egocentric vision research, interactions between multiple camera wearers remain underexplored, a key gap for applications like immersive learning and collaborative robotics. To bridge this, we present TF2025, an expanded dataset with synchronized first- and third-person views. In addition, we introduce a sequence-based method to identify first-person wearers in third-person footage, combining motion cues and person re-identification.
GEO-PHOct 21, 2024
Seismic Phase PickingYuchen Wang, Ruihuan Wang
Seismic phase picking, which aims to determine the arrival times of P- and S-waves from seismic waveforms, is fundamental to earthquake monitoring. Generally, manual phase picking is trustworthy, but with the increasing number of stations and seismic monitors worldwide, it becomes more challenging for humans to complete the task comprehensively. In this work, we explore multiple ways to do automatic phase picking, including traditional and learning-based methods.
AIOct 20, 2024
Contextual Augmented Multi-Model Programming (CAMP): A Hybrid Local-Cloud Copilot FrameworkYuchen Wang, Shangxin Guo, Chee Wei Tan
The advancements in cloud-based Large Language Models (LLMs) have revolutionized AI-assisted programming. However, their integration into certain local development environments, such as those within the Apple software ecosystem (e.g., iOS apps, macOS), remains challenging due to computational demands and sandboxed constraints. This paper presents CAMP, a multi-model AI-assisted programming framework built around a local model that employs Retrieval-Augmented Generation (RAG) to retrieve contextual information from the codebase. This context supports context-aware prompt construction, which optimizes the performance of the cloud model and extends LLMs' capabilities to local Integrated Development Environments (IDEs). The methodology is realized in Copilot for Xcode, an AI-assisted programming tool built for Xcode that employs the RAG module to address software constraints and enables diverse generative programming tasks, including automatic code completion, documentation, error detection, and intelligent user-agent interaction. The results from objective experiments on generated code quality and subjective experiments on user adoption collectively demonstrate the pilot success of the proposed system and mark its significant contributions to the realm of AI-assisted programming.
SDJun 19, 2024
Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword SpottingShuai Wang, Dehao Zhang, Kexin Shi et al.
Thanks to Deep Neural Networks (DNNs), the accuracy of Keyword Spotting (KWS) has made substantial progress. However, as KWS systems are usually implemented on edge devices, energy efficiency becomes a critical requirement besides performance. Here, we take advantage of spiking neural networks' energy efficiency and propose an end-to-end lightweight KWS model. The model consists of two innovative modules: 1) Global-Local Spiking Convolution (GLSC) module and 2) Bottleneck-PLIF module. Compared to the hand-crafted feature extraction methods, the GLSC module achieves speech feature extraction that is sparser, more energy-efficient, and yields better performance. The Bottleneck-PLIF module further processes the signals from GLSC with the aim to achieve higher accuracy with fewer parameters. Extensive experiments are conducted on the Google Speech Commands Dataset (V1 and V2). The results show our method achieves competitive performance among SNN-based KWS models with fewer parameters.
IRJun 15, 2024
MMBee: Live Streaming Gift-Sending Recommendations via Multi-Modal Fusion and Behaviour ExpansionJiaxin Deng, Shiyao Wang, Yuchen Wang et al.
Live streaming services are becoming increasingly popular due to real-time interactions and entertainment. Viewers can chat and send comments or virtual gifts to express their preferences for the streamers. Accurately modeling the gifting interaction not only enhances users' experience but also increases streamers' revenue. Previous studies on live streaming gifting prediction treat this task as a conventional recommendation problem, and model users' preferences using categorical data and observed historical behaviors. However, it is challenging to precisely describe the real-time content changes in live streaming using limited categorical information. Moreover, due to the sparsity of gifting behaviors, capturing the preferences and intentions of users is quite difficult. In this work, we propose MMBee based on real-time Multi-Modal Fusion and Behaviour Expansion to address these issues. Specifically, we first present a Multi-modal Fusion Module with Learnable Query (MFQ) to perceive the dynamic content of streaming segments and process complex multi-modal interactions, including images, text comments and speech. To alleviate the sparsity issue of gifting behaviors, we present a novel Graph-guided Interest Expansion (GIE) approach that learns both user and streamer representations on large-scale gifting graphs with multi-modal attributes. Comprehensive experiment results show that MMBee achieves significant performance improvements on both public datasets and Kuaishou real-world streaming datasets and the effectiveness has been further validated through online A/B experiments. MMBee has been deployed and is serving hundreds of millions of users at Kuaishou.
CL · Dec 7, 2021
UNITER-Based Situated Coreference Resolution with Rich Multimodal Input
Yichen Huang, Yuchen Wang, Yik-Cheung Tam
We present our work on the multimodal coreference resolution task of the Situated and Interactive Multimodal Conversation 2.0 (SIMMC 2.0) dataset as a part of the tenth Dialog System Technology Challenge (DSTC10). We propose a UNITER-based model utilizing rich multimodal context such as textual dialog history, object knowledge base and visual dialog scenes to determine whether each object in the current scene is mentioned in the current dialog turn. Results show that the proposed approach outperforms the official DSTC10 baseline substantially, with the object F1 score boosted from 36.6% to 77.3% on the development set, demonstrating the effectiveness of the proposed object representations from rich multimodal input. Our model ranks second in the official evaluation on the object coreference resolution task with an F1 score of 73.3% after model ensembling.
CV · Oct 13, 2021
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman, Andrew Westbury, Eugene Byrne et al.
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/
LG · Dec 1, 2020
Forecasting Black Sigatoka Infection Risks with Latent Neural ODEs
Yuchen Wang, Matthieu Chan Chee, Ziyad Edher et al.
Black Sigatoka disease severely decreases global banana production, and climate change aggravates the problem by altering fungal species distributions. Due to the heavy financial burden of managing this infectious disease, farmers in developing countries face significant banana crop losses. Though scientists have produced mathematical models of infectious diseases, adapting these models to incorporate climate effects is difficult. We present MR. NODE (Multiple predictoR Neural ODE), a neural network that models the dynamics of black Sigatoka infection learnt directly from data via Neural Ordinary Differential Equations. Our method encodes external predictor factors into the latent space in addition to the variable that we infer, and it can also predict the infection risk at an arbitrary point in time. Empirically, we demonstrate on historical climate data that our method has superior generalization performance on time points up to one month in the future and unseen irregularities. We believe that our method can be a useful tool to control the spread of black Sigatoka.
CV · Oct 8, 2020
Deep Tiered Image Segmentation For Detecting Internal Ice Layers in Radar Imagery
Yuchen Wang, Mingze Xu, John Paden et al.
Understanding the structure of Earth's polar ice sheets is important for modeling how global warming will impact polar ice and, in turn, the Earth's climate. Ground-penetrating radar is able to collect observations of the internal structure of snow and ice, but the process of manually labeling these observations is slow and laborious. Recent work has developed automatic techniques for finding the boundaries between the ice and the bedrock, but finding internal layers - the subtle boundaries that indicate where one year's ice accumulation ended and the next began - is much more challenging because the number of layers varies and the boundaries often merge and split. In this paper, we propose a novel deep neural network for solving a general class of tiered segmentation problems. We then apply it to detecting internal layers in polar ice, evaluating on a large-scale dataset of polar ice radar data with human-labeled annotations as ground truth.
CR · Sep 9, 2019
Puncturable Signatures and Applications in Proof-of-Stake Blockchain Protocol
Xinyu Li, Jing Xu, Xiong Fan et al.
Proof-of-stake blockchain protocols are becoming one of the most promising alternatives to the energy-consuming proof-of-work protocols. However, one particularly critical threat in the PoS setting is the well-known long-range attack caused by secret key leakage (LRSL attack). Specifically, an adversary can attempt to control or compromise accounts that possessed substantial stake at some past moment in order to double-spend or erase past transactions, violating the fundamental persistence property of blockchains. Puncturable signatures provide a satisfying way to construct practical proof-of-stake blockchains resilient to the LRSL attack, although existing constructions are not efficient enough for practical deployment. In this paper, we provide an in-depth study of puncturable signatures and explore their applications in proof-of-stake blockchains. We formalize a security model that allows the adversary to make adaptive signing and puncturing queries, and show a construction with efficient puncturing operations based on the Bloom filter data structure and the strong Diffie-Hellman assumption. The puncturing functionality we desire applies to a particular part of the message, such as a prefix, rather than to the whole message. Furthermore, we use puncturable signatures to construct practical proof-of-stake blockchain protocols that are resilient to the LRSL attack, whereas previously forward-secure signatures were used to counter this attack. We implement our scheme and provide experimental results showing that, in comparison with forward-secure signatures, our construction performs substantially better on signature size and on signing and verification efficiency, and significantly better on key update efficiency.
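The abstract above names the Bloom filter as the data structure behind its puncturing operations. As a rough illustration of that data structure only (not the paper's signature scheme, which involves cryptographic keys this sketch omits), a minimal Bloom filter might look like the following. The class name, the parameters `num_bits` and `num_hashes`, the SHA-256-based hashing, and the `slot-41` tag are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: set membership with false positives but no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by prefixing the item with a counter
        # (an illustrative choice; real schemes fix their own hash functions).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item: bytes) -> bool:
        # True if every position is set; may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical usage: record a tag (e.g. a message prefix) as punctured.
punctured = BloomFilter()
punctured.add(b"slot-41")
```

An added item is always reported as present, while an item that was never added is reported as present only with a small probability that depends on the filter size, the number of hash functions, and how many items have been inserted.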