Sixiao Zheng

CV
h-index13
13papers
3,778citations
Novelty58%
AI Score54

13 Papers

CVJul 19, 2022
Vision Transformers: From Semantic Segmentation to Dense Prediction

Li Zhang, Jiachen Lu, Sixiao Zheng et al.

The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representation at full receptive field per layer across all the image patches, in comparison to the increasing receptive fields of CNNs across layers and other alternatives (e.g., large kernels and atrous convolution). In this work, for the first time we explore the global context learning potentials of ViTs for dense visual prediction (e.g., semantic segmentation). Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information, critical for dense prediction tasks. We first demonstrate that encoding an image as a sequence of patches, a vanilla ViT without local convolution and resolution reduction can yield stronger visual representation for semantic segmentation. For example, our model, termed as SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on the day of submission) and performs competitively on Cityscapes. However, the basic ViT architecture falls short in broader dense prediction applications, such as object detection and instance segmentation, due to its lack of a pyramidal structure, high computational demand, and insufficient local context. For tackling general dense visual prediction tasks in a cost-effective manner, we further formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture. Extensive experiments show that our methods achieve appealing performance on a variety of dense prediction tasks (e.g., object detection and instance segmentation and semantic segmentation) as well as image classification.

CVJul 13, 2024Code
ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context

Sixiao Zheng, Yanwei Fu

Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, struggle with high memory usage, slow generation speeds, and limited context integration. To address these issues, we propose ContextualStory, a novel framework designed to generate coherent story frames and extend frames for visual storytelling. ContextualStory utilizes Spatially-Enhanced Temporal Attention to capture spatial and temporal dependencies, handling significant character movements effectively. Additionally, we introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames for guiding the model. Extensive experiments on PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation. Code is available at https://github.com/sixiaozheng/ContextualStory.

CVJan 8
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

Sixiao Zheng, Minghao Yin, Wenbo Hu et al.

Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.

83.6CVMar 16
Mind-of-Director: Multi-modal Agent-Driven Film Previsualization via Collaborative Decision-Making

Shufeng Nan, Mengtian Li, Sixiao Zheng et al.

We present Mind-of-Director, a multi-modal agent-driven framework for film previz that models the collaborative decision-making process of a film production team. Given a creative idea, Mind-of-Director orchestrates multiple specialized agents to produce previz sequences within the game engine. The framework consists of four cooperative modules: Script Development, where agents draft and refine the screenplay iteratively; Virtual Scene Design, which transforms text into semantically aligned 3D environments; Character Behaviour Control, which determines character blocking and motion; and Camera Planning, which optimizes framing, movement, and composition for cinematic camera effects. A real-time visual editing system built in the game engine further enables interactive inspection and synchronized timeline adjustment across scenes, behaviours, and cameras. Extensive experiments and human evaluations show that Mind-of-Director generates high-quality, semantically grounded previz sequences in approximately 25 minutes per idea, demonstrating the effectiveness of agent collaboration for both automated prototyping and human-in-the-loop filmmaking.

LGFeb 20, 2022Code
Clustering by the Probability Distributions from Extreme Value Theory

Sixiao Zheng, Ke Fan, Yanxi Hou et al.

Clustering is an essential task to unsupervised learning. It tries to automatically separate instances into coherent subsets. As one of the most well-known clustering algorithms, k-means assigns sample points at the boundary to a unique cluster, while it does not utilize the information of sample distribution or density. Comparably, it would potentially be more beneficial to consider the probability of each sample in a possible cluster. To this end, this paper generalizes k-means to model the distribution of clusters. Our novel clustering algorithm thus models the distributions of distances to centroids over a threshold by Generalized Pareto Distribution (GPD) in Extreme Value Theory (EVT). Notably, we propose the concept of centroid margin distance, use GPD to establish a probability model for each cluster, and perform a clustering algorithm based on the covering probability function derived from GPD. Such a GPD k-means thus enables the clustering algorithm from the probabilistic perspective. Correspondingly, we also introduce a naive baseline, dubbed as Generalized Extreme Value (GEV) k-means. GEV fits the distribution of the block maxima. In contrast, the GPD fits the distribution of distance to the centroid exceeding a sufficiently large threshold, leading to a more stable performance of GPD k-means. Notably, GEV k-means can also estimate cluster structure and thus perform reasonably well over classical k-means. Thus, extensive experiments on synthetic datasets and real datasets demonstrate that GPD k-means outperforms competitors. The github codes are released in https://github.com/sixiaozheng/EVT-K-means.

CVFeb 11, 2025
VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Sixiao Zheng, Zimian Peng, Yanpeng Zhou et al.

Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. In content creation workflows, precise and simultaneous control over camera motion, object motion, and lighting direction enhances both accuracy and flexibility. However, existing approaches typically treat these control signals separately, largely due to the scarcity of datasets with high-quality joint annotations and mismatched control spaces across modalities. We present VidCRAFT3, a unified and flexible I2V framework that supports both independent and joint control over camera motion, object motion, and lighting direction by integrating three core components. Image2Cloud reconstructs a 3D point cloud from the reference image to enable precise camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale optical flow features to guide object motion. The Spatial Triple-Attention Transformer integrates lighting direction embeddings via parallel cross-attention. To address the scarcity of jointly annotated data, we curate the VideoLightingDirection (VLD) dataset of synthetic static-scene video clips with per-frame lighting-direction labels, and adopt a three-stage training strategy that enables robust learning without fully joint annotations. Extensive experiments show that VidCRAFT3 outperforms existing methods in control precision and visual coherence. Code and data will be released. Project page: https://sixiaozheng.github.io/VidCRAFT3/.

CVMar 30, 2025
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

Zhenyang Liu, Yikai Wang, Sixiao Zheng et al.

Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle diverse semantics and common knowledge required for effective reasoning. In this work, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLM) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from the SAM and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views. We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.

CVJul 9, 2025
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding

Zhenyang Liu, Sixiao Zheng, Siyu Chen et al.

Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as ``the book on the chair.'' This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.

CVFeb 24, 2024
Intelligent Director: An Automatic Framework for Dynamic Visual Composition using ChatGPT

Sixiao Zheng, Jingyang Huo, Yu Wang et al.

With the rise of short video platforms represented by TikTok, the trend of users expressing their creativity through photos and videos has increased dramatically. However, ordinary users lack the professional skills to produce high-quality videos using professional creation software. To meet the demand for intelligent and user-friendly video creation tools, we propose the Dynamic Visual Composition (DVC) task, an interesting and challenging task that aims to automatically integrate various media elements based on user requirements and create storytelling videos. We propose an Intelligent Director framework, utilizing LENS to generate descriptions for images and video frames and combining ChatGPT to generate coherent captions while recommending appropriate music names. Then, the best-matched music is obtained through music retrieval. Then, materials such as captions, images, videos, and music are integrated to seamlessly synthesize the video. Finally, we apply AnimeGANv2 for style transfer. We construct UCF101-DVC and Personal Album datasets and verified the effectiveness of our framework in solving DVC through qualitative and quantitative comparisons, along with user studies, demonstrating its substantial potential.

CVJun 4, 2021
NMS-Loss: Learning with Non-Maximum Suppression for Crowded Pedestrian Detection

Zekun Luo, Zheng Fang, Sixiao Zheng et al.

Non-Maximum Suppression (NMS) is essential for object detection and affects the evaluation results by incorporating False Positives (FP) and False Negatives (FN), especially in crowd occlusion scenes. In this paper, we raise the problem of weak connection between the training targets and the evaluation metrics caused by NMS and propose a novel NMS-Loss making the NMS procedure can be trained end-to-end without any additional network parameters. Our NMS-Loss punishes two cases when FP is not suppressed and FN is wrongly eliminated by NMS. Specifically, we propose a pull loss to pull predictions with the same target close to each other, and a push loss to push predictions with different targets away from each other. Experimental results show that with the help of NMS-Loss, our detector, namely NMS-Ped, achieves impressive results with Miss Rate of 5.92% on Caltech dataset and 10.08% on CityPersons dataset, which are both better than state-of-the-art competitors.

CVMar 23, 2021
Incrementally Zero-Shot Detection by an Extreme Value Analyzer

Sixiao Zheng, Yanwei Fu, Yanxi Hou

Human beings not only have the ability to recognize novel unseen classes, but also can incrementally incorporate the new classes to existing knowledge preserved. However, zero-shot learning models assume that all seen classes should be known beforehand, while incremental learning models cannot recognize unseen classes. This paper introduces a novel and challenging task of Incrementally Zero-Shot Detection (IZSD), a practical strategy for both zero-shot learning and class-incremental learning in real-world object detection. An innovative end-to-end model -- IZSD-EVer was proposed to tackle this task that requires incrementally detecting new classes and detecting the classes that have never been seen. Specifically, we propose a novel extreme value analyzer to detect objects from old seen, new seen, and unseen classes, simultaneously. Additionally and technically, we propose two innovative losses, i.e., background-foreground mean squared error loss alleviating the extreme imbalance of the background and foreground of images, and projection distance loss aligning the visual space and semantic spaces of old seen classes. Experiments demonstrate the efficacy of our model in detecting objects from both the seen and unseen classes, outperforming the alternative models on Pascal VOC and MSCOCO datasets.

CVDec 31, 2020
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Sixiao Zheng, Jiachen Lu, Hengshuang Zhao et al.

Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (ie, without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.

NCMar 4, 2018
Could Interaction with Social Robots Facilitate Joint Attention of Children with Autism Spectrum Disorder?

Wei Cao, Wenxu Song, Xinge Li et al.

This research addressed whether interactions with social robots could facilitate joint attention of the autism spectrum disorder (ASD). Two conditions of initiators, namely 'Human' vs. 'Robot' were measured with 15 children with ASD and 15 age-matched typically developing (TD) children. Apart from fixation and gaze transition, a new longest common subsequence (LCS) approach was proposed to analyze eye-movement traces. Results revealed that children with ASD showed deficits of joint attention. Compared to the human agent, robot facilitate less fixations towards the targets, but it attracted more attention and allowed the children to show gaze transition and to follow joint attention logic. This results highlight both potential application of LCS analysis on eye-tracking studies and of social robot to intervention.