Changbin Yu

CV
11 papers
180 citations
Novelty 47%
AI Score 25

11 Papers

SY Sep 14, 2011
A Statistically Modelling Method for Performance Limits in Sensor Localization

Baoqi Huang, Tao Li, Brian D. O. Anderson et al.

In this paper, we study performance limits of sensor localization from a novel perspective. Specifically, we consider the Cramér-Rao Lower Bound (CRLB) in single-hop sensor localization using measurements from received signal strength (RSS), time of arrival (TOA) and bearing, respectively. Unlike existing work, we statistically analyze the trace of the associated CRLB matrix (i.e., a scalar metric for performance limits of sensor localization) by assuming anchor locations are random. By the Central Limit Theorems for $U$-statistics, we show that as the number of anchors increases, this scalar metric is asymptotically normal in the RSS/bearing case, and converges to a random variable which is an affine transformation of a chi-square random variable with 2 degrees of freedom in the TOA case. Moreover, we provide formulas quantitatively describing the relationship among the mean and standard deviation of the scalar metric, the number of anchors, the parameters of communication channels, the noise statistics in measurements and the spatial distribution of the anchors. These formulas, though asymptotic in the number of anchors, in many cases turn out to be remarkably accurate in predicting performance limits, even if the number is small. Simulations are carried out to confirm our results.
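
As a rough illustration of the scalar metric discussed in this abstract, the sketch below computes the trace of the 2-D CRLB matrix for range (TOA-style) measurements and samples it over random anchor placements. It assumes the textbook model d_i = ||sensor - anchor_i|| + n_i with i.i.d. Gaussian noise of standard deviation sigma; the function names, the uniform anchor distribution, and the noise model are assumptions made here and are not taken from the paper.

```python
import math
import random


def toa_crlb_trace(sensor, anchors, sigma):
    """Trace of the 2-D CRLB matrix for range measurements.

    Assumed model: d_i = ||sensor - anchor_i|| + n_i, n_i ~ N(0, sigma^2).
    The Fisher information is (1 / sigma^2) * sum_i u_i u_i^T, where u_i is
    the unit vector from the sensor to anchor i.
    """
    fxx = fxy = fyy = 0.0
    for ax, ay in anchors:
        dx, dy = ax - sensor[0], ay - sensor[1]
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            raise ValueError("an anchor coincides with the sensor")
        ux, uy = dx / dist, dy / dist
        fxx += ux * ux
        fxy += ux * uy
        fyy += uy * uy
    det = fxx * fyy - fxy * fxy
    if det < 1e-12:
        raise ValueError("anchors are collinear with the sensor")
    # For a 2x2 matrix M, trace(M^-1) = trace(M) / det(M).
    return sigma ** 2 * (fxx + fyy) / det


def random_anchor_traces(num_anchors, trials, sigma=1.0, seed=0):
    """Sample the CRLB trace with anchors drawn uniformly from [-1, 1]^2."""
    rng = random.Random(seed)
    traces = []
    for _ in range(trials):
        anchors = [(rng.uniform(-1, 1), rng.uniform(-1, 1))
                   for _ in range(num_anchors)]
        traces.append(toa_crlb_trace((0.0, 0.0), anchors, sigma))
    return traces
```

Under this model the trace can never fall below 4 * sigma^2 / N for N anchors, and the bound is attained when the anchors are spread evenly in angle around the sensor. The spread of the values returned by `random_anchor_traces` gives a feel for the kind of distribution the abstract characterizes.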

CV Dec 15, 2022
Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration

Liqi Yan, Qifan Wang, Siqi Ma et al.

Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos, namely STC-Seg. Concretely, STC-Seg makes four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance the mask generation, we devise a puzzle loss, which enables end-to-end training using box-level annotations. Third, our tracking module jointly utilizes bounding-box diagonal points with spatio-temporal discrepancy to model movements, which largely improves the robustness to different object appearances. Finally, our framework is flexible and enables image-level instance segmentation methods to operate on the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it reflects the tip of the iceberg of innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.

SY Jan 31, 2011
Control of Multi-Agent Formations with Only Shape Constraints

Huang Huang, Changbin Yu, Qinghe Wu

This paper considers a novel problem of how to choose an appropriate geometry for a group of agents with only shape constraints but with a flexible scale. Instead of assigning the formation system a specific geometry, here the only requirement on the desired geometry is a shape without any location, rotation and, most importantly, scale constraints. Optimal rigid transformation between two different geometries is discussed with special focus on the scaling operation, and the cooperative performance of the system is evaluated by what we call the geometries' degrees of similarity (DOS) with respect to the desired shape during the entire convergence process. The design of the scale when measuring the DOS is discussed from constant-value and time-varying-function perspectives, respectively. Fixed-structure nonlinear control laws that are functions of the scale are developed to guarantee the exponential convergence of the system to the assigned shape. Our research originates from a three-agent formation system and is further extended to multiple (n > 3) agents by defining a triangular complement graph. Simulations demonstrate that the formation system with the time-varying scale function outperforms the one with an arbitrary constant scale, and the relationship between the underlying topology and the system performance is further discussed based on the simulation observations. Moreover, the control scheme is applied to bearing-only sensor-target localization to show its application potential.

SY Jan 24, 2011
Parameter Optimization of Multi-Agent Formations based on LQR Design

Huang Huang, Changbin Yu

In this paper we study the optimal formation control of multiple agents whose interaction parameters are adjusted according to a cost function consisting of both the control energy and the geometrical performance. By optimizing the interaction parameters and by the linear quadratic regulation (LQR) controllers, the upper bound of the cost function is minimized. For systems with homogeneous agents interconnected over sparse graphs, distributed controllers are proposed that inherit the same underlying graph as the one among agents. For the more general case, a relaxed optimization problem is considered so as to eliminate the nonlinear constraints. Using the subgradient method, interaction parameters among agents are optimized under the constraint of a sparse graph, and the optimum of the cost function is a better result than the one obtained when agents interact only through the control channel. Numerical examples are provided to validate the effectiveness of the method and to illustrate the geometrical performance of the system.
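
For readers unfamiliar with LQR, the following minimal sketch solves the simplest possible instance: a scalar system x' = a*x + b*u with cost equal to the integral of q*x^2 + r*u^2. It is only meant to show the trade-off between state error and control energy that an LQR cost encodes; it is not the multi-agent design of the paper, and the function name and parameters are illustrative assumptions.

```python
import math


def scalar_lqr_gain(a, b, q, r):
    """Closed-form LQR solution for the scalar system x' = a*x + b*u.

    Returns (p, k), where p is the positive root of the scalar Riccati
    equation 2*a*p - (b**2 / r) * p**2 + q = 0 and u = -k * x is the
    optimal feedback law.
    """
    if b == 0 or q <= 0 or r <= 0:
        raise ValueError("need b != 0, q > 0 and r > 0")
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    k = b * p / r
    return p, k
```

With this gain the closed-loop pole is a - b*k = -sqrt(a^2 + b^2 * q / r), so raising the control penalty r pulls the pole toward -|a| (less control effort, slower error decay), while raising q pushes it further into the left half-plane.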

CV Jan 2, 2021
Video Captioning in Compressed Video

Mingjian Zhu, Chenrui Duan, Changbin Yu

Existing approaches in video captioning concentrate on exploring global frame features in the uncompressed videos, while the critical saliency information already encoded, at no extra cost, in the compressed videos is generally neglected. We propose a video captioning method which operates directly on the stored compressed videos. To learn a discriminative visual representation for video captioning, we design a residuals-assisted encoder (RAE), which spots regions of interest in I-frames with the assistance of the residual frames. First, we obtain the spatial attention weights by extracting features of residuals as the saliency value of each location in the I-frame and design a spatial attention module to refine the attention weights. We further propose a temporal gate module to determine how much the attended features contribute to the caption generation, which enables the model to resist the disturbance of some noisy signals in the compressed videos. Finally, Long Short-Term Memory is utilized to decode the visual representations into descriptions. We evaluate our method on two benchmark datasets and demonstrate the effectiveness of our approach.

CV Dec 1, 2020
Dynamic Feature Pyramid Networks for Object Detection

Mingjian Zhu, Kai Han, Changbin Yu et al.

Feature pyramid network (FPN) is a critical component in modern object detection frameworks. The performance gain in most of the existing FPN variants is mainly attributed to the increase of computational burden. One way to enhance the FPN is to enrich the spatial information by expanding the receptive fields, which promises to substantially improve the detection accuracy. In this paper, we first investigate how expanding the receptive fields affects the accuracy and computational costs of FPN. We explore a baseline model called inception FPN in which each lateral connection contains convolution filters with different kernel sizes. Moreover, we point out that not all objects need such a complicated calculation and propose a new dynamic FPN (DyFPN). The output features of DyFPN are calculated by using the adaptively selected branch according to a dynamic gating operation. Therefore, the proposed method can provide a more efficient dynamic inference for achieving a better trade-off between accuracy and computational cost. Extensive experiments conducted on the MS-COCO benchmark demonstrate that the proposed DyFPN significantly improves performance with the optimal allocation of computation resources. For instance, replacing inception FPN with DyFPN reduces about 40% of its FLOPs while maintaining similar high performance.

CV Sep 1, 2020
Multimodal Aggregation Approach for Memory Vision-Voice Indoor Navigation with Meta-Learning

Liqi Yan, Dongfang Liu, Yaoxian Song et al.

Vision and voice are two vital keys for agents' interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information of visual observation in order to enhance robots' environment understanding. We make use of single RGB images taken by a first-view monocular camera. We also apply a self-attention mechanism to keep the agent focusing on key areas. Memory is important for the agent to avoid repeating certain tasks unnecessarily and to adapt adequately to new scenes; therefore, we make use of meta-learning. We have experimented with various functional features extracted from visual observation. Comparative experiments show that our methods outperform state-of-the-art baselines.

CV Nov 13, 2019
Crowd Video Captioning

Liqi Yan, Mingjian Zhu, Changbin Yu

Describing a video automatically with natural language is a challenging task in computer vision. In most cases, the on-site situation at major events is reported in the news, but the situation of the off-site spectators at the entrances and exits, which also arouses people's interest, is neglected. Since deploying reporters at the entrances and exits costs a great deal of manpower, automatically describing the behavior of a crowd of off-site spectators is a significant and unsolved problem. To tackle this problem, we propose a new task called crowd video captioning (CVC), which aims to describe the crowd of spectators. We also provide baseline methods for this task and evaluate them on the WorldExpo'10 dataset. Our experimental results show that captioning models have a fairly deep understanding of the crowd in video and perform satisfactorily in the CVC task.

RO Sep 14, 2019
Deep Robotic Prediction with hierarchical RGB-D Fusion

Yaoxian Song, Jun Wen, Yuejiao Fei et al.

Robotic arm grasping is a fundamental operation in robotic control tasks. Most current methods for robotic grasping focus on RGB-D policy in the table surface scenario or 3D point cloud analysis and inference in the 3D space. Compared to these methods, we propose a novel real-time multimodal hierarchical encoder-decoder neural network that fuses RGB and depth data to realize robotic humanoid grasping in 3D space with only partial observation. The quantification of raw depth data's uncertainty and depth estimation fusing RGB is considered. We develop a general labeling method to label ground-truth on common RGB-D datasets. We evaluate the effectiveness and performance of our method on a physical robot setup and our method achieves over 90% success rate in both table surface and 3D space scenarios.

SY Jan 11, 2019
Cooperative event-based rigid formation control

Zhiyong Sun, Qingchen Liu, Na Huang et al.

This paper discusses cooperative stabilization control of rigid formations via an event-based approach. We first design a centralized event-based formation control system, in which a central event controller determines the next triggering time and broadcasts the event signal to all the agents for control input update. We then build on this approach to propose a distributed event control strategy, in which each agent can use its local event trigger and local information to update the control input at its own event time. For both cases, the triggering condition, event function and triggering behavior are discussed in detail, and the exponential convergence of the event-based formation system is guaranteed.

SY Sep 29, 2018
Collaborative target-tracking control using multiple autonomous fixed-wing UAVs with constant speeds

Zhiyong Sun, Hector Garcia de Marina, Brian D. O. Anderson et al.

This paper considers a collaborative tracking control problem using a group of fixed-wing unmanned aerial vehicles (UAVs) with constant and non-identical speeds. The dynamics of fixed-wing UAVs are modelled by unicycle-type equations with nonholonomic constraints, assuming that UAVs fly at constant altitudes in the nominal operation mode. The controller is designed such that all fixed-wing UAVs as a group can collaboratively track a desired target's position and velocity. We first present conditions on the relative speeds of tracking UAVs and the target to ensure that the tracking objective can be achieved when UAVs are subject to constant speed constraints. We construct a reference velocity that includes both the target's velocity and position as feedback, which is to be tracked by the group centroid. In this way, all vehicles' headings are controlled such that the group centroid follows a reference trajectory that successfully tracks the target's trajectory. A spacing controller is further devised to ensure that all vehicles stay close to the group centroid trajectory. Trade-offs in the controller design and performance limitations of the target tracking control due to the constant-speed constraint are also discussed in detail. Experimental results with three fixed-wing UAVs tracking a target rotorcraft are provided.
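
The unicycle-type model with a constant-speed constraint mentioned in this abstract can be sketched as follows. This is a plain forward-Euler simulation of x' = v*cos(heading), y' = v*sin(heading), heading' = turn_rate, where the turn rate is the only control input; the function names and the integration scheme are illustrative assumptions, and the paper's tracking and spacing controllers are not reproduced here.

```python
import math


def unicycle_step(state, speed, turn_rate, dt):
    """Advance a constant-speed unicycle by one forward-Euler step.

    state is (x, y, heading); speed is fixed, so turn_rate is the only
    control input.
    """
    x, y, heading = state
    return (
        x + speed * math.cos(heading) * dt,
        y + speed * math.sin(heading) * dt,
        heading + turn_rate * dt,
    )


def simulate(state, speed, turn_rates, dt):
    """Apply a sequence of turn-rate commands and return the trajectory."""
    trajectory = [state]
    for turn_rate in turn_rates:
        state = unicycle_step(state, speed, turn_rate, dt)
        trajectory.append(state)
    return trajectory
```

Because the speed is fixed, every step covers the same distance `speed * dt` regardless of the turn-rate command. A vehicle modelled this way cannot slow down or stop to wait for a target, which is the kind of limitation that makes conditions on relative speeds necessary.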