CVFeb 14, 2023
DualStreamFoveaNet: A Dual Stream Fusion Architecture with Anatomical Awareness for Robust Fovea LocalizationSifan Song, Jinfeng Wang, Zilong Wang et al.
Accurate fovea localization is essential for analyzing retinal diseases to prevent irreversible vision loss. While current deep learning-based methods outperform traditional ones, they still face challenges such as the lack of local anatomical landmarks around the fovea, the inability to robustly handle diseased retinal images, and the variations in image conditions. In this paper, we propose a novel transformer-based architecture called DualStreamFoveaNet (DSFN) for multi-cue fusion. This architecture explicitly incorporates long-range connections and global features using retina and vessel distributions for robust fovea localization. We introduce a spatial attention mechanism in the dual-stream encoder to extract and fuse self-learned anatomical information, focusing more on features distributed along blood vessels and significantly reducing computational costs by decreasing token numbers. Our extensive experiments show that the proposed architecture achieves state-of-the-art performance on two public datasets and one large-scale private dataset. Furthermore, we demonstrate that the DSFN is more robust on both normal and diseased retina images and has better generalization capacity in cross-dataset experiments.
CVMar 10, 2025
Aligning Instance-Semantic Sparse Representation towards Unsupervised Object Segmentation and Shape Abstraction with Repeatable PrimitivesJiaxin Li, Hongxing Wang, Jiawei Tan et al.
Understanding 3D object shapes necessitates shape representation by object parts abstracted from results of instance and semantic segmentation. Promising shape representations enable computers to interpret a shape with meaningful parts and identify their repeatability. However, supervised shape representations depend on costly annotation efforts, while current unsupervised methods work under strong semantic priors and involve multi-stage training, thereby limiting their generalization and deployment in shape reasoning and understanding. Driven by the tendency of high-dimensional semantically similar features to lie in or near low-dimensional subspaces, we introduce a one-stage, fully unsupervised framework towards semantic-aware shape representation. This framework produces joint instance segmentation, semantic segmentation, and shape abstraction through sparse representation and feature alignment of object parts in a high-dimensional space. For sparse representation, we devise a sparse latent membership pursuit method that models each object part feature as a sparse convex combination of point features at either the semantic or instance level, promoting part features in the same subspace to exhibit similar semantics. For feature alignment, we customize an attention-based strategy in the feature space to align instance- and semantic-level object part features and reconstruct the input shape using both of them, ensuring geometric reusability and semantic consistency of object parts. To firm up semantic disambiguation, we construct cascade unfrozen learning on geometric parameters of object parts.
CVDec 23, 2024
Modality-Aware Shot Relating and Comparing for Video Scene DetectionJiawei Tan, Hongxing Wang, Kang Dang et al.
Video scene detection involves assessing whether each shot and its surroundings belong to the same scene. Achieving this requires meticulously correlating multi-modal cues, $\it{e.g.}$ visual entity and place modalities, among shots and comparing semantic changes around each shot. However, most methods treat multi-modal semantics equally and do not examine contextual differences between the two sides of a shot, leading to sub-optimal detection performance. In this paper, we propose the $\bf{M}$odality-$\bf{A}$ware $\bf{S}$hot $\bf{R}$elating and $\bf{C}$omparing approach (MASRC), which enables relating shots per their own characteristics of visual entity and place modalities, as well as comparing multi-shots similarities to have scene changes explicitly encoded. Specifically, to fully harness the potential of visual entity and place modalities in modeling shot relations, we mine long-term shot correlations from entity semantics while simultaneously revealing short-term shot correlations from place semantics. In this way, we can learn distinctive shot features that consolidate coherence within scenes and amplify distinguishability across scenes. Once equipped with distinctive shot features, we further encode the relations between preceding and succeeding shots of each target shot by similarity convolution, aiding in the identification of scene ending shots. We validate the broad applicability of the proposed components in MASRC. Extensive experimental results on public benchmark datasets demonstrate that the proposed MASRC significantly advances video scene detection.
AIDec 21, 2024
STAMPsy: Towards SpatioTemporal-Aware Mixed-Type Dialogues for Psychological CounselingJieyi Wang, Yue Huang, Zeming Liu et al.
Online psychological counseling dialogue systems are trending, offering a convenient and accessible alternative to traditional in-person therapy. However, existing psychological counseling dialogue systems mainly focus on basic empathetic dialogue or QA with minimal professional knowledge and without goal guidance. In many real-world counseling scenarios, clients often seek multi-type help, such as diagnosis, consultation, therapy, console, and common questions, but existing dialogue systems struggle to combine different dialogue types naturally. In this paper, we identify this challenge as how to construct mixed-type dialogue systems for psychological counseling that enable clients to clarify their goals before proceeding with counseling. To mitigate the challenge, we collect a mixed-type counseling dialogues corpus termed STAMPsy, covering five dialogue types, task-oriented dialogue for diagnosis, knowledge-grounded dialogue, conversational recommendation, empathetic dialogue, and question answering, over 5,000 conversations. Moreover, spatiotemporal-aware knowledge enables systems to have world awareness and has been proven to affect one's mental health. Therefore, we link dialogues in STAMPsy to spatiotemporal state and propose a spatiotemporal-aware mixed-type psychological counseling dataset. Additionally, we build baselines on STAMPsy and develop an iterative self-feedback psychological dialogue generation framework, named Self-STAMPsy. Results indicate that clarifying dialogue goals in advance and utilizing spatiotemporal states are effective.