CVMar 1, 2023Code
Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal SupervisionFangqiang Ding, Andras Palffy, Dariu M. Gavrila et al.
This work proposes a novel approach to 4D radar-based scene flow estimation via cross-modal learning. Our approach is motivated by the co-located sensing redundancy in modern autonomous vehicles. Such redundancy implicitly provides various forms of supervision cues to the radar scene flow estimation. Specifically, we introduce a multi-task model architecture for the identified cross-modal learning problem and propose loss functions to opportunistically engage scene flow estimation using multiple cross-modal constraints for effective model training. Extensive experiments show the state-of-the-art performance of our method and demonstrate the effectiveness of cross-modal supervised learning to infer more accurate 4D radar scene flow. We also show its usefulness to two subtasks - motion segmentation and ego-motion estimation. Our source code will be available on https://github.com/Toytiny/CMFlow.
CVSep 18, 2023Code
RaTrack: Moving Object Detection and Tracking with 4D Radar Point CloudZhijun Pan, Fangqiang Ding, Hantao Zhong et al.
Mobile autonomy relies on the precise perception of dynamic environments. Robustly tracking moving objects in 3D world thus plays a pivotal role for applications like trajectory prediction, obstacle avoidance, and path planning. While most current methods utilize LiDARs or cameras for Multiple Object Tracking (MOT), the capabilities of 4D imaging radars remain largely unexplored. Recognizing the challenges posed by radar noise and point sparsity in 4D radar data, we introduce RaTrack, an innovative solution tailored for radar-based tracking. Bypassing the typical reliance on specific object types and 3D bounding boxes, our method focuses on motion segmentation and clustering, enriched by a motion estimation module. Evaluated on the View-of-Delft dataset, RaTrack showcases superior tracking precision of moving objects, largely surpassing the performance of the state of the art. We release our code and model at https://github.com/LJacksonPan/RaTrack.
CVJun 29, 2023Code
milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion SensingFangqiang Ding, Zhen Luo, Peijun Zhao et al.
Human motion sensing plays a crucial role in smart systems for decision-making, user interaction, and personalized services. Extensive research that has been conducted is predominantly based on cameras, whose intrusive nature limits their use in smart home applications. To address this, mmWave radars have gained popularity due to their privacy-friendly features. In this work, we propose milliFlow, a novel deep learning approach to estimate scene flow as complementary motion information for mmWave point cloud, serving as an intermediate level of features and directly benefiting downstream human motion sensing tasks. Experimental results demonstrate the superior performance of our method when compared with the competing approaches. Furthermore, by incorporating scene flow information, we achieve remarkable improvements in human activity recognition and human parsing and support human body part tracking. Code and dataset are available at https://github.com/Toytiny/milliFlow.
98.1CVMar 20Code
Physion-Eval: Evaluating Physical Realism in Generated Video via Human ReasoningQin Zhang, Peiyu Jing, Hong-Xing Yu et al.
Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process, and annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at https://huggingface.co/datasets/PhysionLabs/Physion-Eval.
CVMar 2, 2022
Self-Supervised Scene Flow Estimation with 4-D Automotive RadarFangqiang Ding, Zhijun Pan, Yimin Deng et al.
Scene flow allows autonomous vehicles to reason about the arbitrary motion of multiple independent objects which is the key to long-term mobile autonomy. While estimating the scene flow from LiDAR has progressed recently, it remains largely unknown how to estimate the scene flow from a 4-D radar - an increasingly popular automotive sensor for its robustness against adverse weather and lighting conditions. Compared with the LiDAR point clouds, radar data are drastically sparser, noisier and in much lower resolution. Annotated datasets for radar scene flow are also in absence and costly to acquire in the real world. These factors jointly pose the radar scene flow estimation as a challenging problem. This work aims to address the above challenges and estimate scene flow from 4-D radar point clouds by leveraging self-supervised learning. A robust scene flow estimation architecture and three novel losses are bespoken designed to cope with intractable radar data. Real-world experimental results validate that our method is able to robustly estimate the radar scene flow in the wild and effectively supports the downstream task of motion segmentation.
CVMar 3, 2022
Ad2Attack: Adaptive Adversarial Attack on Real-Time UAV TrackingChanghong Fu, Sihang Li, Xinnan Yuan et al.
Visual tracking is adopted to extensive unmanned aerial vehicle (UAV)-related applications, which leads to a highly demanding requirement on the robustness of UAV trackers. However, adding imperceptible perturbations can easily fool the tracker and cause tracking failures. This risk is often overlooked and rarely researched at present. Therefore, to help increase awareness of the potential risk and the robustness of UAV tracking, this work proposes a novel adaptive adversarial attack approach, i.e., Ad$^2$Attack, against UAV object tracking. Specifically, adversarial examples are generated online during the resampling of the search patch image, which leads trackers to lose the target in the following frames. Ad$^2$Attack is composed of a direct downsampling module and a super-resolution upsampling module with adaptive stages. A novel optimization function is proposed for balancing the imperceptibility and efficiency of the attack. Comprehensive experiments on several well-known benchmarks and real-world conditions show the effectiveness of our attack method, which dramatically reduces the performance of the most advanced Siamese trackers.
97.7ROApr 4
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic ManipulationYuanzhe Liu, Jingyuan Zhu, Yuchen Mo et al.
Recent advancements in vision-language-action (VLA) models have shown promise in robotic manipulation, yet they continue to struggle with long-horizon, multi-step tasks. Existing methods lack internal reasoning mechanisms that can identify task-relevant interaction cues or track progress within a subtask, leading to critical execution errors such as repeated actions, missed steps, and premature termination. To address these challenges, we introduce PALM, a VLA framework that structures policy learning around interaction-centric affordance reasoning and subtask progress cues. PALM distills complementary affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, and serve as task-relevant anchors for visuomotor control. To further stabilize long-horizon execution, PALM predicts continuous within-subtask progress, enabling seamless subtask transitions. Across extensive simulation and real-world experiments, PALM consistently outperforms baselines, achieving a 91.8% success rate on LIBERO-LONG, a 12.5% improvement in average length on CALVIN ABC->D, and a 2x improvement over real-world baselines across three long-horizon generalization settings.
CVDec 19, 2025
RadarGen: Automotive Radar Point Cloud Generation from CamerasTomer Borreda, Fangqiang Ding, Sanja Fidler et al.
We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements in bird's-eye-view form that encodes spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. To better align generation with the visual scene, RadarGen incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns. Conditioning on images makes the approach broadly compatible, in principle, with existing visual datasets and simulation frameworks, offering a scalable direction for multimodal generative simulation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, marking a step toward unified generative simulation across sensing modalities.
CVDec 28, 2025
Wavelet-based Multi-View Fusion of 4D Radar Tensor and Camera for Robust 3D Object DetectionRunwei Guan, Jianan Liu, Shaofeng Liang et al.
4D millimeter-wave (mmWave) radar has been widely adopted in autonomous driving and robot perception due to its low cost and all-weather robustness. However, point-cloud-based radar representations suffer from information loss due to multi-stage signal processing, while directly utilizing raw 4D radar tensors incurs prohibitive computational costs. To address these challenges, we propose WRCFormer, a novel 3D object detection framework that efficiently fuses raw 4D radar cubes with camera images via decoupled multi-view radar representations. Our approach introduces two key components: (1) A Wavelet Attention Module embedded in a wavelet-based Feature Pyramid Network (FPN), which enhances the representation of sparse radar signals and image data by capturing joint spatial-frequency features, thereby mitigating information loss while maintaining computational efficiency. (2) A Geometry-guided Progressive Fusion mechanism, a two-stage query-based fusion strategy that progressively aligns multi-view radar and visual features through geometric priors, enabling modality-agnostic and efficient integration without overwhelming computational overhead. Extensive experiments on the K-Radar benchmark show that WRCFormer achieves state-of-the-art performance, surpassing the best existing model by approximately 2.4% in all scenarios and 1.6% in sleet conditions, demonstrating strong robustness in adverse weather.
CVMar 29, 2020Code
AutoTrack: Towards High-Performance Visual Tracking for UAV with Automatic Spatio-Temporal RegularizationYiming Li, Changhong Fu, Fangqiang Ding et al.
Most existing trackers based on discriminative correlation filters (DCF) try to introduce predefined regularization term to improve the learning of target objects, e.g., by suppressing background learning or by restricting change rate of correlation filters. However, predefined parameters introduce much effort in tuning them and they still fail to adapt to new situations that the designer did not think of. In this work, a novel approach is proposed to online automatically and adaptively learn spatio-temporal regularization term. Spatially local response map variation is introduced as spatial regularization to make DCF focus on the learning of trust-worthy parts of the object, and global response map variation determines the updating rate of the filter. Extensive experiments on four UAV benchmarks have proven the superiority of our method compared to the state-of-the-art CPU- and GPU-based trackers, with a speed of ~60 frames per second running on a single CPU. Our tracker is additionally proposed to be applied in UAV localization. Considerable tests in the indoor practical scenarios have proven the effectiveness and versatility of our localization method. The code is available at https://github.com/vision4robotics/AutoTrack.
71.0CVMar 15
Seeking Physics in Diffusion NoiseChujun Tang, Lei Zhong, Fangqiang Ding
Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.
ROFeb 12
HyperDet: 3D Object Detection with Hyper 4D Radar Point CloudsYichun Xiao, Runwei Guan, Fangqiang Ding
4D mmWave radar provides weather-robust, velocity-aware measurements and is more cost-effective than LiDAR. However, radar-only 3D detection still trails LiDAR-based systems because radar point clouds are sparse, irregular, and often corrupted by multipath noise, yielding weak and unstable geometry. We present HyperDet, a detector-agnostic radar-only 3D detection framework that constructs a task-aware hyper 4D radar point cloud for standard LiDAR-oriented detectors. HyperDet aggregates returns from multiple surround-view 4D radars over consecutive frames to improve coverage and density, then applies geometry-aware cross-sensor consensus validation with a lightweight self-consistency check outside overlap regions to suppress inconsistent returns. It further integrates a foreground-focused diffusion module with training-time mixed radar-LiDAR supervision to densify object structures while lifting radar attributes (e.g., Doppler, RCS); the model is distilled into a consistency model for single-step inference. On MAN TruckScenes, HyperDet consistently improves over raw radar inputs with VoxelNeXt and CenterPoint, partially narrowing the radar-LiDAR gap. These results show that input-level refinement enables radar to better leverage LiDAR-oriented detectors without architectural modifications.
CVMay 22, 2024
RadarOcc: Robust 3D Occupancy Prediction with 4D Imaging RadarFangqiang Ding, Xiangyu Wen, Yunzhou Zhu et al.
3D occupancy-based perception pipeline has significantly advanced autonomous driving by capturing detailed scene descriptions and demonstrating strong generalizability across various object categories and shapes. Current methods predominantly rely on LiDAR or camera inputs for 3D occupancy prediction. These methods are susceptible to adverse weather conditions, limiting the all-weather deployment of self-driving cars. To improve perception robustness, we leverage the recent advances in automotive radars and introduce a novel approach that utilizes 4D imaging radar sensors for 3D occupancy prediction. Our method, RadarOcc, circumvents the limitations of sparse radar point clouds by directly processing the 4D radar tensor, thus preserving essential scene details. RadarOcc innovatively addresses the challenges associated with the voluminous and noisy 4D radar data by employing Doppler bins descriptors, sidelobe-aware spatial sparsification, and range-wise self-attention mechanisms. To minimize the interpolation errors associated with direct coordinate transformations, we also devise a spherical-based feature encoding followed by spherical-to-Cartesian feature aggregation. We benchmark various baseline methods based on distinct modalities on the public K-Radar dataset. The results demonstrate RadarOcc's state-of-the-art performance in radar-based 3D occupancy prediction and promising results even when compared with LiDAR- or camera-based methods. Additionally, we present qualitative evidence of the superior performance of 4D radar in adverse weather conditions and explore the impact of key pipeline components through ablation studies.
ROApr 4, 2025
Dexterous Manipulation through Imitation Learning: A SurveyShan An, Ziyu Meng, Chao Tang et al.
Dexterous manipulation, which refers to the ability of a robotic hand or multi-fingered end-effector to skillfully control, reorient, and manipulate objects through precise, coordinated finger movements and adaptive force modulation, enables complex interactions similar to human hand dexterity. With recent advances in robotics and machine learning, there is a growing demand for these systems to operate in complex and unstructured environments. Traditional model-based approaches struggle to generalize across tasks and object variations due to the high dimensionality and complex contact dynamics of dexterous manipulation. Although model-free methods such as reinforcement learning (RL) show promise, they require extensive training, large-scale interaction data, and carefully designed rewards for stability and effectiveness. Imitation learning (IL) offers an alternative by allowing robots to acquire dexterous manipulation skills directly from expert demonstrations, capturing fine-grained coordination and contact dynamics while bypassing the need for explicit modeling and large-scale trial-and-error. This survey provides an overview of dexterous manipulation methods based on imitation learning, details recent advances, and addresses key challenges in the field. Additionally, it explores potential research directions to enhance IL-driven dexterous manipulation. Our goal is to offer researchers and practitioners a comprehensive introduction to this rapidly evolving domain.
CVDec 13, 2025
M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh ReconstructionJunqiao Fan, Yunjiao Zhou, Yizhuo Yang et al.
Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the current largest-scale (661K-frame) ($9\times$ prior largest) multimodal benchmark, featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper publication.
CVMar 14, 2024
ThermoHands: A Benchmark for 3D Hand Pose Estimation from Egocentric Thermal ImagesFangqiang Ding, Yunzhou Zhu, Xiangyu Wen et al.
Designing egocentric 3D hand pose estimation systems that can perform reliably in complex, real-world scenarios is crucial for downstream applications. Previous approaches using RGB or NIR imagery struggle in challenging conditions: RGB methods are susceptible to lighting variations and obstructions like handwear, while NIR techniques can be disrupted by sunlight or interference from other NIR-equipped devices. To address these limitations, we present ThermoHands, the first benchmark focused on thermal image-based egocentric 3D hand pose estimation, demonstrating the potential of thermal imaging to achieve robust performance under these conditions. The benchmark includes a multi-view and multi-spectral dataset collected from 28 subjects performing hand-object and hand-virtual interactions under diverse scenarios, accurately annotated with 3D hand poses through an automated process. We introduce a new baseline method, TherFormer, utilizing dual transformer modules for effective egocentric 3D hand pose estimation in thermal imagery. Our experimental results highlight TherFormer's leading performance and affirm thermal imaging's effectiveness in enabling robust 3D hand pose estimation in adverse conditions.
CVJun 15, 2021
Mutation Sensitive Correlation Filter for Real-Time UAV Tracking with Adaptive Hybrid LabelGuangze Zheng, Changhong Fu, Junjie Ye et al.
Unmanned aerial vehicle (UAV) based visual tracking has been confronted with numerous challenges, e.g., object motion and occlusion. These challenges generally introduce unexpected mutations of target appearance and result in tracking failure. However, prevalent discriminative correlation filter (DCF) based trackers are insensitive to target mutations due to a predefined label, which concentrates on merely the centre of the training region. Meanwhile, appearance mutations caused by occlusion or similar objects usually lead to the inevitable learning of wrong information. To cope with appearance mutations, this paper proposes a novel DCF-based method to enhance the sensitivity and resistance to mutations with an adaptive hybrid label, i.e., MSCF. The ideal label is optimized jointly with the correlation filter and remains temporal consistency. Besides, a novel measurement of mutations called mutation threat factor (MTF) is applied to correct the label dynamically. Considerable experiments are conducted on widely used UAV benchmarks. The results indicate that the performance of MSCF tracker surpasses other 26 state-of-the-art DCF-based and deep-based trackers. With a real-time speed of _38 frames/s, the proposed approach is sufficient for UAV tracking commissions.
CVJun 4, 2021
ADTrack: Target-Aware Dual Filter Learning for Real-Time Anti-Dark UAV TrackingBowen Li, Changhong Fu, Fangqiang Ding et al.
Prior correlation filter (CF)-based tracking methods for unmanned aerial vehicles (UAVs) have virtually focused on tracking in the daytime. However, when the night falls, the trackers will encounter more harsh scenes, which can easily lead to tracking failure. In this regard, this work proposes a novel tracker with anti-dark function (ADTrack). The proposed method integrates an efficient and effective low-light image enhancer into a CF-based tracker. Besides, a target-aware mask is simultaneously generated by virtue of image illumination variation. The target-aware mask can be applied to jointly train a target-focused filter that assists the context filter for robust tracking. Specifically, ADTrack adopts dual regression, where the context filter and the target-focused filter restrict each other for dual filter learning. Exhaustive experiments are conducted on typical dark sceneries benchmark, consisting of 37 typical night sequences from authoritative benchmarks, i.e., UAVDark, and our newly constructed benchmark UAVDark70. The results have shown that ADTrack favorably outperforms other state-of-the-art trackers and achieves a real-time speed of 34 frames/s on a single CPU, greatly extending robust UAV tracking to night scenes.
CVJan 21, 2021
All-Day Object Tracking for Unmanned Aerial VehicleBowen Li, Changhong Fu, Fangqiang Ding et al.
Visual object tracking, which is representing a major interest in image processing field, has facilitated numerous real world applications. Among them, equipping unmanned aerial vehicle (UAV) with real time robust visual trackers for all day aerial maneuver, is currently attracting incremental attention and has remarkably broadened the scope of applications of object tracking. However, prior tracking methods have merely focused on robust tracking in the well-illuminated scenes, while ignoring trackers' capabilities to be deployed in the dark. In darkness, the conditions can be more complex and harsh, easily posing inferior robust tracking or even tracking failure. To this end, this work proposed a novel discriminative correlation filter based tracker with illumination adaptive and anti dark capability, namely ADTrack. ADTrack firstly exploits image illuminance information to enable adaptability of the model to the given light condition. Then, by virtue of an efficient and effective image enhancer, ADTrack carries out image pretreatment, where a target aware mask is generated. Benefiting from the mask, ADTrack aims to solve a dual regression problem where dual filters, i.e., the context filter and target focused filter, are trained with mutual constraint. Thus ADTrack is able to maintain continuously favorable performance in all-day conditions. Besides, this work also constructed one UAV nighttime tracking benchmark UAVDark135, comprising of more than 125k manually annotated frames, which is also very first UAV nighttime tracking benchmark. Exhaustive experiments are extended on authoritative daytime benchmarks, i.e., UAV123 10fps, DTB70, and the newly built dark benchmark UAVDark135, which have validated the superiority of ADTrack in both bright and dark conditions on a single CPU.
CVOct 13, 2020
Correlation Filters for Unmanned Aerial Vehicle-Based Aerial Tracking: A Review and Experimental EvaluationChanghong Fu, Bowen Li, Fangqiang Ding et al.
Aerial tracking, which has exhibited its omnipresent dedication and splendid performance, is one of the most active applications in the remote sensing field. Especially, unmanned aerial vehicle (UAV)-based remote sensing system, equipped with a visual tracking approach, has been widely used in aviation, navigation, agriculture,transportation, and public security, etc. As is mentioned above, the UAV-based aerial tracking platform has been gradually developed from research to practical application stage, reaching one of the main aerial remote sensing technologies in the future. However, due to the real-world onerous situations, e.g., harsh external challenges, the vibration of the UAV mechanical structure (especially under strong wind conditions), the maneuvering flight in complex environment, and the limited computation resources onboard, accuracy, robustness, and high efficiency are all crucial for the onboard tracking methods. Recently, the discriminative correlation filter (DCF)-based trackers have stood out for their high computational efficiency and appealing robustness on a single CPU, and have flourished in the UAV visual tracking community. In this work, the basic framework of the DCF-based trackers is firstly generalized, based on which, 23 state-of-the-art DCF-based trackers are orderly summarized according to their innovations for solving various issues. Besides, exhaustive and quantitative experiments have been extended on various prevailing UAV tracking benchmarks, i.e., UAV123, UAV123@10fps, UAV20L, UAVDT, DTB70, and VisDrone2019-SOT, which contain 371,903 frames in total. The experiments show the performance, verify the feasibility, and demonstrate the current challenges of DCF-based trackers onboard UAV tracking.
CVAug 10, 2020
Automatic Failure Recovery and Re-Initialization for Online UAV Tracking with Joint Scale and Aspect Ratio OptimizationFangqiang Ding, Changhong Fu, Yiming Li et al.
Current unmanned aerial vehicle (UAV) visual tracking algorithms are primarily limited with respect to: (i) the kind of size variation they can deal with, (ii) the implementation speed which hardly meets the real-time requirement. In this work, a real-time UAV tracking algorithm with powerful size estimation ability is proposed. Specifically, the overall tracking task is allocated to two 2D filters: (i) translation filter for location prediction in the space domain, (ii) size filter for scale and aspect ratio optimization in the size domain. Besides, an efficient two-stage re-detection strategy is introduced for long-term UAV tracking tasks. Large-scale experiments on four UAV benchmarks demonstrate the superiority of the presented method which has computation feasibility on a low-cost CPU.
CVAug 10, 2020
DR^2Track: Towards Real-Time Visual Tracking for UAV via Distractor Repressed Dynamic RegressionChanghong Fu, Fangqiang Ding, Yiming Li et al.
Visual tracking has yielded promising applications with unmanned aerial vehicle (UAV). In literature, the advanced discriminative correlation filter (DCF) type trackers generally distinguish the foreground from the background with a learned regressor which regresses the implicit circulated samples into a fixed target label. However, the predefined and unchanged regression target results in low robustness and adaptivity to uncertain aerial tracking scenarios. In this work, we exploit the local maximum points of the response map generated in the detection phase to automatically locate current distractors. By repressing the response of distractors in the regressor learning, we can dynamically and adaptively alter our regression target to leverage the tracking robustness as well as adaptivity. Substantial experiments conducted on three challenging UAV benchmarks demonstrate both excellent performance and extraordinary speed (~50fps on a cheap CPU) of our tracker.
CVSep 24, 2019
Augmented Memory for Correlation Filters in Real-Time UAV TrackingYiming Li, Changhong Fu, Fangqiang Ding et al.
The outstanding computational efficiency of discriminative correlation filter (DCF) fades away with various complicated improvements. Previous appearances are also gradually forgotten due to the exponential decay of historical views in traditional appearance updating scheme of DCF framework, reducing the model's robustness. In this work, a novel tracker based on DCF framework is proposed to augment memory of previously appeared views while running at real-time speed. Several historical views and the current view are simultaneously introduced in training to allow the tracker to adapt to new appearances as well as memorize previous ones. A novel rapid compressed context learning is proposed to increase the discriminative ability of the filter efficiently. Substantial experiments on UAVDT and UAV123 datasets have validated that the proposed tracker performs competitively against other 26 top DCF and deep-based trackers with over 40 FPS on CPU.