Sibo Zhang

CV
h-index: 10
Papers: 10
Citations: 585
Novelty: 32%
AI Score: 38

10 Papers

CL | Sep 23, 2023 | Code
A Survey on Image-text Multimodal Models

Ruifeng Guo, Jingxuan Wei, Linzhuang Sun et al.

With the significant advancements of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. Current surveys on image-text multimodal models mainly focus on representative models or application domains, but lack a review on how general technical models influence the development of domain-specific models, which is crucial for domain researchers. Based on this, this paper first reviews the technological evolution of image-text multimodal models, from early explorations of feature space to visual language encoding structures, and then to the latest large model architectures. Next, from the perspective of technological evolution, we explain how the development of general image-text multimodal technologies promotes the progress of multimodal technologies in the biomedical field, as well as the importance and complexity of specific datasets in the biomedical domain. Then, centered on the tasks of image-text multimodal models, we analyze their common components and challenges. After that, we summarize the architecture, components, and data of general image-text multimodal models, and introduce the applications and improvements of image-text multimodal models in the biomedical field. Finally, we categorize the challenges faced in the development and application of general models into external factors and intrinsic factors, further refining them into 2 external factors and 5 intrinsic factors, and propose targeted solutions, providing guidance for future research directions. For more details and data, please visit our GitHub page: https://github.com/i2vec/A-survey-on-image-text-multimodal-models.

AI | Dec 12, 2023 | Code
Brain-inspired Computing Based on Deep Learning for Human-computer Interaction: A Review

Bihui Yu, Sibo Zhang, Lili Zhou et al.

The continuous development of artificial intelligence has a profound impact on biomedicine and other fields, providing new research ideas and technical methods. Brain-inspired computing is an important intersection between multimodal technology and the biomedical field. Focusing on the application scenarios of decoding text and speech from brain signals in human-computer interaction, this paper presents a comprehensive review of brain-inspired computing models based on deep learning (DL), tracking their evolution, application value, challenges, and potential research trends. We first review the basic concepts and development history of the field and divide its evolution into two stages, recent machine learning and current deep learning, emphasizing the importance of each stage in the research of brain-inspired computing for human-computer interaction. In addition, the latest progress of deep learning in different tasks of brain-inspired computing for human-computer interaction is reviewed from five perspectives, including datasets and different brain signals, and the application of key technologies in the models is elaborated in detail. Despite significant advances in brain-inspired computational models, challenges remain in fully exploiting their capabilities, and we provide insights into possible directions for future academic research. For more detailed information, please visit our GitHub page: https://github.com/ultracoolHub/brain-inspired-computing.

LG | Feb 17
CDRL: A Reinforcement Learning Framework Inspired by Cerebellar Circuits and Dendritic Computational Strategies

Sibo Zhang, Rui Jing, Liangfu Lv et al.

Reinforcement learning (RL) has achieved notable performance in high-dimensional sequential decision-making tasks, yet remains limited by low sample efficiency, sensitivity to noise, and weak generalization under partial observability. Most existing approaches address these issues primarily through optimization strategies, while the role of architectural priors in shaping representation learning and decision dynamics is less explored. Inspired by structural principles of the cerebellum, we propose a biologically grounded RL architecture that incorporates large expansion, sparse connectivity, sparse activation, and dendritic-level modulation. Experiments on noisy, high-dimensional RL benchmarks show that both the cerebellar architecture and dendritic modulation consistently improve sample efficiency, robustness, and generalization compared to conventional designs. Sensitivity analysis of architectural parameters suggests that cerebellum-inspired structures can offer optimized performance for RL with constrained model parameters. Overall, our work underscores the value of cerebellar structural priors as effective inductive biases for RL.
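The four structural ingredients named in this abstract can be illustrated with a small NumPy sketch. This is not the authors' architecture: the function names, the layer sizes, the use of k-winners-take-all for sparse activation, and the per-unit multiplicative gate standing in for dendritic modulation are all assumptions made for illustration.

```python
import numpy as np

def make_sparse_expansion(n_in, n_hidden, fan_in, rng):
    """Sparse connectivity: each hidden unit reads from only `fan_in` inputs."""
    weights = np.zeros((n_hidden, n_in))
    for j in range(n_hidden):
        idx = rng.choice(n_in, size=fan_in, replace=False)
        weights[j, idx] = rng.standard_normal(fan_in)
    return weights

def k_winners_take_all(x, k):
    """Sparse activation: keep the k largest values and zero the rest."""
    out = np.zeros_like(x)
    top = np.argpartition(x, -k)[-k:]
    out[top] = x[top]
    return out

def forward(obs, weights, gate, k):
    """Expansion, then a per-unit gate (assumed stand-in for dendritic
    modulation), then ReLU and k-winners-take-all."""
    pre = weights @ obs
    modulated = pre * gate
    return k_winners_take_all(np.maximum(modulated, 0.0), k)

# Hypothetical sizes: an 8-dimensional observation expanded to 256 units.
rng = np.random.default_rng(0)
n_in, n_hidden, fan_in, k = 8, 256, 4, 16
W = make_sparse_expansion(n_in, n_hidden, fan_in, rng)
gate = rng.uniform(0.5, 1.5, size=n_hidden)
h = forward(rng.standard_normal(n_in), W, gate, k)
```

In this sketch the hidden code `h` has at most `k` active units out of 256, and it would feed a policy or value head in an RL agent.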

CV | Oct 6, 2021
Construction Site Safety Monitoring and Excavator Activity Analysis System

Sibo Zhang, Liangjun Zhang

With recent advancements in deep learning and computer vision, AI-powered construction machines such as autonomous excavators have made significant progress. Safety is the most important aspect of modern construction, where construction machines are increasingly automated. In this paper, we propose a vision-based excavator perception, activity analysis, and safety monitoring system. Our perception system can detect multi-class construction machines and humans in real time while estimating the poses and actions of the excavator. Then, we present a novel safety monitoring and excavator activity analysis system based on the perception result. To evaluate the performance of our method, we collect a dataset using the Autonomous Excavator System (AES) that includes multiple classes of objects in different lighting conditions with human annotations. We also evaluate our method on a benchmark construction dataset. The results show that our YOLO v5 multi-class object detection model improves inference speed by 8 times (YOLO v5 x-large) to 34 times (YOLO v5 small) compared with the Faster R-CNN/YOLO v3 models. Furthermore, the accuracy of the YOLO v5 models is improved by 2.7% (YOLO v5 x-large), while model size is reduced by 63.9% (YOLO v5 x-large) to 93.9% (YOLO v5 small). The experimental results show that the proposed action recognition approach outperforms state-of-the-art approaches on top-1 accuracy by about 5.18%. The proposed real-time safety monitoring system is designed not only for our Autonomous Excavator System (AES) in solid waste scenes but can also be applied to general construction scenarios.

CV | Apr 29, 2021
Text2Video: Text-driven Talking-head Video Synthesis with Personalized Phoneme-Pose Dictionary

Sibo Zhang, Jiahong Yuan, Miao Liao et al.

With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It needs only a fraction of the training data used by an audio-driven approach; 2) It is more flexible and not subject to vulnerability due to speaker variation; 3) It significantly reduces the preprocessing, training, and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
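The idea of looking up key poses in a phoneme-pose dictionary and interpolating between them can be sketched as follows. The abstract does not say which interpolation scheme is used, so linear blending is an assumption here, and the dictionary contents, the two-landmark pose format, and the function name are hypothetical.

```python
import numpy as np

# Hypothetical dictionary: phoneme -> key pose (2 landmarks, (x, y) each).
PHONEME_POSES = {
    "HH": np.array([[0.0, 0.0], [1.0, 0.0]]),
    "AY": np.array([[0.0, 0.4], [1.0, 0.4]]),
}

def interpolate_poses(phonemes, frames_per_phoneme):
    """Linearly blend consecutive key poses into a per-frame pose sequence."""
    keys = [PHONEME_POSES[p] for p in phonemes]
    frames = []
    for start, end in zip(keys[:-1], keys[1:]):
        # endpoint=False so the next key pose is emitted only once.
        for t in np.linspace(0.0, 1.0, frames_per_phoneme, endpoint=False):
            frames.append((1.0 - t) * start + t * end)
    frames.append(keys[-1])
    return np.stack(frames)

sequence = interpolate_poses(["HH", "AY", "HH"], frames_per_phoneme=4)
```

In a full pipeline, a pose sequence like `sequence` would be the conditioning input to the video generator; that stage is omitted here.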

CV | Jul 17, 2020
Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses

Miao Liao, Sibo Zhang, Peng Wang et al.

In this paper, we propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person, where the output video has synchronized, realistic, expressive, and rich body dynamics. We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN), and then synthesizing the output video via a conditional generative adversarial network (GAN). To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process in both learning and testing pipelines. The former prevents the generation of unreasonable body distortion, while the latter helps our model quickly learn meaningful body movement from a few recorded videos. To produce photo-realistic and high-resolution video with motion details, we propose to insert part attention mechanisms in the conditional GAN, where each detailed part, e.g., head and hand, is automatically zoomed in and given its own discriminator. To validate our approach, we collect a dataset with 20 high-quality videos from 1 male and 1 female model reading various documents on different topics. Compared with previous SoTA pipelines handling similar tasks, our approach achieves better results in a user study.

CV | Jul 17, 2020
DVI: Depth Guided Video Inpainting for Autonomous Driving

Miao Liao, Feixiang Lu, Dingfu Zhou et al.

To obtain clear street views and photo-realistic simulation in autonomous driving, we present an automatic video inpainting algorithm that can remove traffic agents from videos and synthesize missing regions with the guidance of depth/point cloud. By building a dense 3D map from stitched point clouds, frames within a video are geometrically correlated via this common 3D map. In order to fill a target inpainting area in a frame, it is straightforward to transform pixels from other frames into the current one with correct occlusion. Furthermore, we are able to fuse multiple videos through 3D point cloud registration, making it possible to inpaint a target video with multiple source videos. The motivation is to solve the long-time occlusion problem, where an occluded area has never been visible in the entire video. To our knowledge, we are the first to fuse multiple videos for video inpainting. To verify the effectiveness of our approach, we build a large inpainting dataset in a real urban road environment with synchronized images and Lidar data, including many challenging scenes, e.g., long-time occlusion. The experimental results show that the proposed approach outperforms the state-of-the-art approaches on all criteria; in particular, the RMSE (Root Mean Squared Error) is reduced by about 13%.
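The step of transforming pixels into the current frame "with correct occlusion" can be illustrated by projecting colored 3D map points through a pinhole camera and keeping the nearest point per pixel. This is a generic z-buffer sketch, not the authors' implementation: the function name, the intrinsics `K`, the pose `(R, t)`, and the tiny 4x4 image are assumptions for illustration.

```python
import numpy as np

def project_points(points_world, colors, K, R, t, height, width):
    """Project colored 3D points into a camera; the nearest point wins per pixel."""
    cam = points_world @ R.T + t            # world -> camera coordinates
    in_front = cam[:, 2] > 0                # drop points behind the camera
    cam, colors = cam[in_front], colors[in_front]
    uvw = cam @ K.T                         # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    image = np.zeros((height, width, 3))
    depth = np.full((height, width), np.inf)
    for ui, vi, zi, ci in zip(u, v, cam[:, 2], colors):
        inside = 0 <= ui < width and 0 <= vi < height
        if inside and zi < depth[vi, ui]:   # z-buffer test handles occlusion
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return image, depth

# Hypothetical camera at the origin looking down +z.
K = np.array([[10.0, 0.0, 2.0], [0.0, 10.0, 2.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
points = np.array([[0.0, 0.0, 5.0], [0.0, 0.0, 2.0], [0.0, 0.0, -1.0]])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
image, depth = project_points(points, colors, K, R, t, height=4, width=4)
```

Here the first two points land on the same pixel and the nearer (green) one is kept, while the third lies behind the camera and is discarded. Pixels left at infinite depth are the ones that would still need filling from other frames or videos.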

CV | Apr 6, 2020
CVPR 2019 WAD Challenge on Trajectory Prediction and 3D Perception

Sibo Zhang, Yuexin Ma, Ruigang Yang

This paper reviews the CVPR 2019 challenge on Autonomous Driving. Baidu's Robotics and Autonomous Driving Lab (RAL) provided a 150-minute labeled Trajectory and 3D Perception dataset, including about 80k lidar point clouds and 1000 km of trajectories for urban traffic. The challenge has two tasks: (1) Trajectory Prediction and (2) 3D Lidar Object Detection. More than 200 teams submitted results to the leaderboard, and more than 1000 participants attended the workshop.

CV | Nov 6, 2018
TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents

Yuexin Ma, Xinge Zhu, Sibo Zhang et al.

To safely and efficiently navigate in complex urban traffic, autonomous vehicles must make responsible predictions in relation to surrounding traffic-agents (vehicles, bicycles, pedestrians, etc.). A challenging and critical task is to explore the movement patterns of different traffic-agents and predict their future trajectories accurately to help the autonomous vehicle make reasonable navigation decisions. To solve this problem, we propose a long short-term memory-based (LSTM-based) real-time traffic prediction algorithm, TrafficPredict. Our approach uses an instance layer to learn instances' movements and interactions and a category layer to learn the similarities of instances belonging to the same type to refine the prediction. In order to evaluate its performance, we collected trajectory datasets in a large city consisting of varying conditions and traffic densities. The dataset includes many challenging scenarios where vehicles, bicycles, and pedestrians move among one another. We evaluate the performance of TrafficPredict on our new dataset and highlight its higher accuracy for trajectory prediction by comparing it with prior prediction methods.

IR | Aug 19, 2017
Event-Radar: Real-time Local Event Detection System for Geo-Tagged Tweet Streams

Sibo Zhang, Yuan Cheng, Deyuan Ke

Local event detection uses geotagged messages posted on social networks to reveal ongoing events and their locations. Recent studies have demonstrated that the geo-tagged tweet stream serves as an unprecedentedly valuable source for local event detection. Nevertheless, how to effectively extract local events from large geo-tagged tweet streams in real time remains challenging. A robust and efficient cloud-based real-time local event detection software system would benefit many aspects of society, from shopping recommendation for customer service providers to disaster alerting for emergency departments. We take the earlier GeoBurst work, which proposed a novel method to detect local events, as a starting point. GeoBurst+ leverages a novel cross-modal authority measure to identify several pivots in the query window. Such pivots reveal different geo-topical activities and naturally attract related tweets to form candidate events. It further summarises the continuous stream and compares the candidates against the historical summaries to pinpoint truly interesting local events. We mainly implement a website demonstration system, Event-Radar, with an improved algorithm to show real-time local events online for the public. Better still, as the query window shifts, our method can update the event list with little time cost, thus achieving continuous monitoring of the stream.
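The step of comparing candidates against historical summaries can be illustrated with a simple burstiness score. The abstract does not specify the ranking function, so the z-score used here is an assumption, and the function names, the threshold, and the tweet counts are hypothetical.

```python
from statistics import mean, pstdev

def burstiness(current_count, historical_counts):
    """Z-score of the current activity against past windows (assumed measure)."""
    mu = mean(historical_counts)
    sigma = pstdev(historical_counts)
    if sigma == 0:
        # Flat history: any increase counts as a burst.
        return float("inf") if current_count > mu else 0.0
    return (current_count - mu) / sigma

def rank_candidates(candidates, history, threshold=3.0):
    """Keep candidates whose score reaches the threshold, most bursty first."""
    scored = [(name, burstiness(count, history[name]))
              for name, count in candidates.items()]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical tweet counts per candidate in past query windows.
history = {"stadium": [10, 12, 11, 9, 8], "cafe": [5, 5, 6, 4, 5]}
candidates = {"stadium": 40, "cafe": 6}
events = rank_candidates(candidates, history)
```

With these made-up counts only the "stadium" candidate stands out from its own history, which mirrors the goal of filtering routine activity from genuinely unusual local events.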