Zeyu Xiong

CV
h-index: 17
11 papers
393 citations
Novelty: 44%
AI Score: 48

11 Papers

59.9 CV May 26
Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

Xiang Fang, Zeyu Xiong, Wanlong Fang et al.

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework, which uses contrastive learning and a reconstruction paradigm to score pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding the moment boundaries. 2) Complex moment proposals: their performance relies heavily on the quality of proposals, which are also time-consuming and complicated to select. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, we creatively model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both the Charades-STA and ActivityNet Captions datasets.
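To make the idea of treating frames and words as players of a cooperative game concrete, here is a minimal sketch that computes exact Shapley values for a toy game. It is an illustration only: the abstract does not specify the interaction index the authors use, and the player names, `PAIR_SCORES`, and `toy_similarity` are hypothetical.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical toy game (not from the paper): a frame-word pair contributes
# to the similarity score only when both of its members are in the coalition.
PAIR_SCORES = {("frame_1", "word_run"): 0.8, ("frame_2", "word_dog"): 0.4}

def toy_similarity(coalition):
    return sum(score for (frame, word), score in PAIR_SCORES.items()
               if frame in coalition and word in coalition)

players = ["frame_1", "frame_2", "word_run", "word_dog"]
phi = shapley_values(players, toy_similarity)
```

In this toy game each pair's score is split evenly between its frame and its word (about 0.4 each for the first pair and 0.2 each for the second), and the values sum to the score of the full coalition. Exact enumeration grows factorially with the number of players, so it is only practical for tiny examples like this one.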

56.7 RO Apr 7 Code
RoboBPP: Benchmarking Robotic Online Bin Packing with Physics-based Simulation

Zhoufeng Wang, Hang Zhao, Juzhan Xu et al.

Physical feasibility in 3D bin packing is a key requirement in modern industrial logistics and robotic automation. With the growing adoption of industrial automation, online bin packing has gained increasing attention. However, inconsistencies in problem settings, test datasets, and evaluation metrics have hindered progress in the field, and there is a lack of a comprehensive benchmarking system. Direct testing on real hardware is costly, and building a realistic simulation environment is also challenging. To address these limitations, we introduce RoboBPP, a benchmarking system designed for robotic online bin packing. RoboBPP integrates a physics-based simulator to assess physical feasibility. In our simulation environment, we introduce a robotic arm and boxes at real-world scales to replicate real industrial packing workflows. By simulating conditions that arise in real industrial applications, we ensure that evaluated algorithms are practically deployable. In addition, prior studies often rely on synthetic datasets whose distributions differ from real-world industrial data. To address this issue, we collect three datasets from real industrial workflows, including assembly-line production, logistics packing, and furniture manufacturing. The benchmark comprises three carefully designed test settings and extends existing evaluation metrics with new metrics for structural stability and operational safety. We design a scoring system and derive a range of insights from the evaluation results. RoboBPP is fully open-source and is equipped with visualization tools and an online leaderboard, providing a reproducible and extensible foundation for future research and industrial applications (https://robot-bin-packing-benchmark.github.io).

CV Jan 2, 2023
Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

Jiahao Zhu, Daizong Liu, Pan Zhou et al.

Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video given a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
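The boundary-bias described above can be illustrated with a small sketch. It assumes uniform sparse sampling and assumes each annotated boundary is remapped to the nearest sampled frame; the function names, the 300-frame video, and the annotated boundaries are hypothetical and not taken from the paper.

```python
def uniform_sample_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over [0, num_frames - 1]."""
    if num_samples == 1:
        return [0]
    return [round(i * (num_frames - 1) / (num_samples - 1))
            for i in range(num_samples)]

def nearest_sampled(index, sampled):
    """Return the sampled frame index closest to the annotated one."""
    return min(sampled, key=lambda s: abs(s - index))

# Hypothetical video of 300 frames, sparsely sampled down to 16 frames.
sampled = uniform_sample_indices(num_frames=300, num_samples=16)

# Hypothetical annotated boundaries that are not among the sampled frames.
true_start, true_end = 47, 133
new_start = nearest_sampled(true_start, sampled)
new_end = nearest_sampled(true_end, sampled)
```

With these numbers the annotated boundaries 47 and 133 are not sampled, so the segment is effectively relabeled as 40 to 140, a shift of 7 frames on each side. That shift is the kind of error the abstract refers to as boundary-bias.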

CV Jul 2, 2022
Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding

Zeyu Xiong, Daizong Liu, Pan Zhou

Spatio-Temporal Video Grounding (STVG) is a challenging task which aims to localize the spatio-temporal tube of the object of interest according to a natural language query. Most previous works not only rely heavily on the anchor boxes extracted by Faster R-CNN, but also simply regard the video as a series of individual frames, thus lacking temporal modeling. Instead, in this paper, we are the first to propose an anchor-free framework for STVG, called Gaussian Kernel-based Cross Modal Network (GKCMN). Specifically, we utilize the learned Gaussian Kernel-based heatmaps of each video frame to locate the query-related object. A mixed serial and parallel connection network is further developed to leverage both spatial and temporal relations among frames for better grounding. Experimental results on the VidSTG dataset demonstrate the effectiveness of our proposed GKCMN.
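As a rough illustration of how a Gaussian kernel-based heatmap can mark an object location without anchor boxes, the sketch below builds a 2D Gaussian centered on a hypothetical object center and reads the location back as the peak. This is not the GKCMN implementation: in the paper the heatmaps are learned, whereas here the target is constructed by hand, and all names and sizes are assumptions.

```python
import math

def gaussian_heatmap(height, width, center_x, center_y, sigma):
    """Heatmap whose value decays as a Gaussian with distance from the center."""
    return [
        [math.exp(-((x - center_x) ** 2 + (y - center_y) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def argmax_2d(heatmap):
    """Return the (x, y) position of the largest heatmap value."""
    _, y, x = max((value, y, x)
                  for y, row in enumerate(heatmap)
                  for x, value in enumerate(row))
    return x, y

# Hypothetical 8x10 frame grid with the object centered at x=6, y=3.
heatmap = gaussian_heatmap(height=8, width=10, center_x=6, center_y=3, sigma=1.5)
peak = argmax_2d(heatmap)
```

The peak of the heatmap recovers the object center directly, which is the basic appeal of anchor-free localization: no set of pre-defined boxes has to be enumerated and scored.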

SD Aug 26, 2023
A Comprehensive Survey for Evaluation Methodologies of AI-Generated Music

Zeyu Xiong, Weitao Wang, Jing Yu et al.

In recent years, AI-generated music has made significant progress, with several models performing well in multimodal and complex musical genres and scenes. While objective metrics can be used to evaluate generative music, they often lack interpretability for musical evaluation. Therefore, researchers often resort to subjective user studies to assess the quality of the generated works, which can be resource-intensive and less reproducible than objective metrics. This study aims to comprehensively evaluate the subjective, objective, and combined methodologies for assessing AI-generated music, highlighting the advantages and disadvantages of each approach. Ultimately, this study provides a valuable reference for unifying generative AI in the field of music evaluation.

CV Feb 21, 2023
Tracking Objects and Activities with Attention for Temporal Sentence Grounding

Zeyu Xiong, Daizong Liu, Pan Zhou et al.

Temporal sentence grounding (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed video. Most existing methods extract frame-grained features or object-grained features by 3D ConvNet or detection network under a conventional TSG framework, failing to capture the subtle differences between frames or to model the spatio-temporal behavior of core persons/objects. In this paper, we introduce a new perspective to address the TSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal behaviors. Specifically, we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and search space, filtering objects and activities, and (B) a Temporal Sentence Tracker to track multi-modal targets for modeling the targets' behavior and to predict the query-related segment. Extensive experiments and comparisons with state-of-the-art methods are conducted on two challenging benchmarks: Charades-STA and TACoS. Our TSTNet achieves leading performance at a considerable real-time speed.

HC May 7, 2024
Sketch Then Generate: Providing Incremental User Feedback and Guiding LLM Code Generation through Language-Oriented Code Sketches

Chen Zhu-Tian, Zeyu Xiong, Xiaoshuo Yao et al.

Crafting effective prompts for code generation or editing with Large Language Models (LLMs) is not an easy task. Particularly, the absence of immediate, stable feedback during prompt crafting hinders effective interaction, as users are left to mentally imagine possible outcomes until the code is generated. In response, we introduce Language-Oriented Code Sketching, an interactive approach that provides instant, incremental feedback in the form of code sketches (i.e., incomplete code outlines) during prompt crafting. This approach converts a prompt into a code sketch by leveraging the inherent linguistic structures within the prompt and applying classic natural language processing techniques. The sketch then serves as an intermediate placeholder that not only previews the intended code structure but also guides the LLM towards the desired code, thereby enhancing human-LLM interaction. We conclude by discussing the approach's applicability and future plans.

CL Aug 28, 2025
Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search

Zeyu Xiong, Yixuan Nan, Li Gao et al.

In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search. Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle ~50,000 queries per second under 55 ms average latency per query.
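The deployment figures quoted above imply a per-GPU load that is easy to work out. The sketch below is back-of-the-envelope arithmetic only: it treats the approximate 50,000 queries per second and the 55 ms latency bound as exact, assumes load is spread evenly across GPUs, and uses Little's law (in-flight requests = arrival rate × time in system) to estimate concurrency. The variable names are illustrative.

```python
# Figures taken from the abstract; the throughput is approximate there.
num_gpus = 334
queries_per_second = 50_000
avg_latency_seconds = 0.055  # stated as an upper bound on average latency

# Average throughput each GPU must sustain, assuming an even split.
qps_per_gpu = queries_per_second / num_gpus

# Little's law: average number of queries in flight across the whole fleet.
in_flight_total = queries_per_second * avg_latency_seconds
in_flight_per_gpu = in_flight_total / num_gpus

print(f"{qps_per_gpu:.1f} queries/s per GPU")
print(f"{in_flight_total:.0f} queries in flight in total")
print(f"{in_flight_per_gpu:.2f} queries in flight per GPU")
```

Under these assumptions each GPU serves roughly 150 queries per second with about 8 queries in flight at a time. Because the latency figure is a bound, the concurrency estimate is an upper bound as well.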

CR Feb 14, 2022
HUT: Enabling High-UTility, Batched Queries under Differential Privacy Protection for Internet-of-Vehicles

Junyu Liu, Wangkai Jin, Zhenyong He et al.

The emerging trends of Internet-of-Vehicles (IoV) demand centralized servers to collect and process sensitive data, given the limited computational resources on a single vehicle. Such centralization of sensitive data demands practical privacy protection. One widely applied paradigm, Differential Privacy (DP), can provide strong guarantees over sensitive data by adding noise. However, directly applying DP to IoV incurs significant challenges for data utility and effective protection. We observe that the key issue in DP-enabled protection for IoV lies in how to synergistically combine DP with the special characteristics of IoV, whose query sequences are usually formed as unbalanced batches due to frequent interactions between centralized servers and edge vehicles. To this end, we propose HUT, a new algorithm to enable High UTility for DP-enabled protection in IoV. Our key insight is to leverage the inherent characteristics of IoV: the unbalanced batches. Our key idea is to aggregate local batches and apply Order Constraints, so that information loss from DP protection can be mitigated. We evaluate the effectiveness of HUT against state-of-the-art DP protection mechanisms. The results show that HUT can reduce information loss by 95.69% while still providing strong, mathematically guaranteed protection over sensitive data.
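For readers unfamiliar with how Differential Privacy protects data by adding noise, here is a minimal sketch of the standard Laplace mechanism for a counting query. It is the textbook baseline, not the HUT algorithm, whose batch aggregation and Order Constraints are not detailed in the abstract. The function names, the example count, and the epsilon value are assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with epsilon-DP using the Laplace mechanism.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity defaults to 1 and the noise scale is sensitivity / epsilon.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
# Hypothetical query: how many vehicles reported a given event.
noisy = private_count(true_count=120, epsilon=0.5, rng=rng)
```

A smaller epsilon gives stronger privacy but a larger noise scale, which is the utility cost the abstract describes and the loss that HUT aims to mitigate.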

HC Feb 14, 2022
BROOK Dataset: A Playground for Exploiting Data-Driven Techniques in Human-Vehicle Interactive Designs

Wangkai Jin, Yicun Duan, Junyu Liu et al.

Emerging Autonomous Vehicles (AV) offer great potential to exploit data-driven techniques for adaptive and personalized Human-Vehicle Interactions. However, the lack of rich, high-quality data limits the opportunities to explore the design space of data-driven techniques and to validate the effectiveness of concrete mechanisms. Our goal is to initiate efforts to deliver a building block for exploring data-driven Human-Vehicle Interaction designs. To this end, we present the BROOK dataset, a multi-modal dataset with facial video records. We first outline our rationale for building the BROOK dataset. Then, we describe how we built the current version of the BROOK dataset via a year-long study, and give an overview of the dataset. Next, we present three example studies using BROOK to demonstrate the applicability of the dataset. We also identify key lessons learned from building the BROOK dataset, and discuss how it can foster a wide range of follow-up studies.

GEO-PH Apr 12, 2021
Equivariant geometric learning for digital rock physics: estimating formation factor and effective permeability tensors from Morse graph

Chen Cai, Nikolaos Vlassis, Lucas Magee et al.

We present an SE(3)-equivariant graph neural network (GNN) approach that directly predicts the formation factor and effective permeability from micro-CT images. FFT solvers are established to compute both the formation factor and effective permeability, while the topology and geometry of the pore space are represented by a persistence-based Morse graph. Together, they constitute the database for training, validating, and testing the neural networks. While the graph and Euclidean convolutional approaches both employ neural networks to generate a low-dimensional latent space to represent the features of the micro-structures for forward predictions, the SE(3)-equivariant neural network is found to generate more accurate predictions, especially when the training data is limited. Numerical experiments have also shown that the new SE(3) approach leads to predictions that satisfy material frame indifference, whereas the predictions from classical convolutional neural networks (CNN) may suffer from spurious dependence on the coordinate system of the training data. Comparisons among predictions inferred from training the CNN and those from graph convolutional neural networks (GNN) with and without the equivariant constraint indicate that the equivariant graph neural network seems to perform better than the CNN and the GNN without equivariant constraints.