CV · Apr 25

A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis

arXiv:2604.23415 · 1.9 · h-index: 5
Predicted impact: top 98% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For video action recognition researchers, the paper provides a systematic comparison of fusion strategies in a heterogeneous two-stream setup, but the results are incremental and limited to small datasets.

The paper proposes a heterogeneous two-stream architecture for video action recognition, using different backbones for RGB (ViT-Tiny) and optical flow (MobileNetV2), and evaluates five fusion strategies on UCF11 and UCF50. Cross-attention achieves 98.12% on UCF11, while weighted fusion reaches 96.86% on UCF50, suggesting that the optimal fusion strategy depends on dataset scale.

Most two-stream action recognition networks apply the same convolutional backbone to both the RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context; treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework (late fusion, concatenation, cross-attention, weighted fusion, and gated fusion) and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446), arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.
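To make the projection and weighted-fusion steps concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The feature widths (192 for ViT-Tiny, 1280 for MobileNetV2), the shared width of 64, the softmax parameterisation of the stream weights, and every function name are assumptions chosen for illustration; the abstract only states that a learned projection maps both streams to a common dimensionality and that the learned weights on UCF50 are 0.554 (RGB) and 0.446 (flow).

```python
import math
import random

RGB_DIM = 192     # assumed ViT-Tiny embedding width (not stated in the abstract)
FLOW_DIM = 1280   # assumed MobileNetV2 pooled feature width (not stated)
COMMON_DIM = 64   # hypothetical shared width, chosen only for this demo


def softmax(xs):
    """Numerically stable softmax over a short list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear(in_dim, out_dim, rng):
    """Randomly initialised (weight, bias) pair standing in for a learned layer."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, params):
    """Apply y = W x + b to a single feature vector."""
    weight, bias = params
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def weighted_fusion(rgb_feat, flow_feat, stream_logits):
    """Convex combination of two equally sized feature vectors.

    The two scalar logits are normalised with softmax so the stream
    weights are positive and sum to 1 (an assumed parameterisation).
    """
    w_rgb, w_flow = softmax(stream_logits)
    fused = [w_rgb * r + w_flow * f for r, f in zip(rgb_feat, flow_feat)]
    return fused, (w_rgb, w_flow)


rng = random.Random(0)
rgb_proj = make_linear(RGB_DIM, COMMON_DIM, rng)
flow_proj = make_linear(FLOW_DIM, COMMON_DIM, rng)

# Random vectors stand in for the backbone outputs of one video clip.
rgb_feat = [rng.gauss(0.0, 1.0) for _ in range(RGB_DIM)]
flow_feat = [rng.gauss(0.0, 1.0) for _ in range(FLOW_DIM)]

# Logits chosen so that softmax reproduces the UCF50 weights from the abstract.
stream_logits = [math.log(0.554), math.log(0.446)]

fused, (w_rgb, w_flow) = weighted_fusion(
    linear(rgb_feat, rgb_proj), linear(flow_feat, flow_proj), stream_logits
)
print(len(fused), round(w_rgb, 3), round(w_flow, 3))
```

Under this parameterisation only the difference between the two logits matters: the UCF50 weights correspond to a logit gap of ln(0.554 / 0.446) ≈ 0.217, and the UCF11 weights to ln(0.507 / 0.493) ≈ 0.028, which is one way to read the "near-equal" versus "slightly RGB-favouring" contrast in the abstract.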

