CVAug 22, 2024
BihoT: A Large-Scale Dataset and Benchmark for Hyperspectral Camouflaged Object TrackingHanzheng Wang, Wei Li, Xiang-Gen Xia et al.
Hyperspectral object tracking (HOT) has exhibited potential in various applications, particularly in scenes where objects are camouflaged. Existing trackers can effectively retrieve objects via band regrouping because of the bias in existing HOT datasets, where most objects tend to have distinguishing visual appearances rather than spectral characteristics. This bias allows the tracker to directly use the visual features obtained from the false-color images generated by hyperspectral images without the need to extract spectral features. To tackle this bias, we find that the tracker should focus on the spectral information when object appearance is unreliable. Thus, we provide a new task called hyperspectral camouflaged object tracking (HCOT) and meticulously construct a large-scale HCOT dataset, termed BihoT, which consists of 41,912 hyperspectral images covering 49 video sequences. The dataset covers various artificial camouflage scenes where objects have similar appearances, diverse spectrums, and frequent occlusion, making it a very challenging dataset for HCOT. Besides, a simple but effective baseline model, named spectral prompt-based distractor-aware network (SPDAN), is proposed, comprising a spectral embedding network (SEN), a spectral prompt-based backbone network (SPBN), and a distractor-aware module (DAM). Specifically, the SEN extracts spectral-spatial features via 3-D and 2-D convolutions. Then, the SPBN fine-tunes powerful RGB trackers with spectral prompts and alleviates the insufficiency of training samples. Moreover, the DAM utilizes a novel statistic to capture the distractor caused by occlusion from objects and background. Extensive experiments demonstrate that our proposed SPDAN achieves state-of-the-art performance on the proposed BihoT and other HOT datasets.
CVMar 9, 2024
SSF-Net: Spatial-Spectral Fusion Network with Spectral Angle Awareness for Hyperspectral Object TrackingHanzheng Wang, Wei Li, Xiang-Gen Xia et al.
Hyperspectral video (HSV) offers valuable spatial, spectral, and temporal information simultaneously, making it highly suitable for handling challenges such as background clutter and visual similarity in object tracking. However, existing methods primarily focus on band regrouping and rely on RGB trackers for feature extraction, resulting in limited exploration of spectral information and difficulties in achieving complementary representations of object features. In this paper, a spatial-spectral fusion network with spectral angle awareness (SST-Net) is proposed for hyperspectral (HS) object tracking. Firstly, to address the issue of insufficient spectral feature extraction in existing networks, a spatial-spectral feature backbone ($S^2$FB) is designed. With the spatial and spectral extraction branch, a joint representation of texture and spectrum is obtained. Secondly, a spectral attention fusion module (SAFM) is presented to capture the intra- and inter-modality correlation to obtain the fused features from the HS and RGB modalities. It can incorporate the visual information into the HS spectral context to form a robust representation. Thirdly, to ensure a more accurate response of the tracker to the object position, a spectral angle awareness module (SAAM) investigates the region-level spectral similarity between the template and search images during the prediction stage. Furthermore, we develop a novel spectral angle awareness loss (SAAL) to offer guidance for the SAAM based on similar regions. Finally, to obtain the robust tracking results, a weighted prediction method is considered to combine the HS and RGB predicted motions of objects to leverage the strengths of each modality. Extensive experiments on the HOTC dataset demonstrate the effectiveness of the proposed SSF-Net, compared with state-of-the-art trackers.
CVMar 7
PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide GenerationXin-Sheng Chen, Jiayu Zhu, Pei-lin Li et al.
Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment. In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks. Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.