CVApr 9
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video CaptioningDaichi Yashima, Shuhei Kurita, Yusuke Oda et al.
In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.
CVApr 12
Neural Stochastic Processes for Satellite Precipitation RefinementShunya Nagashima, Takumi Bannai, Shuitsu Koyama et al.
Accurate precipitation estimation is critical for flood forecasting, water resource management, and disaster preparedness. Satellite products provide global hourly coverage but contain systematic biases; ground-based gauges are accurate at point locations but too sparse for direct gridded correction. Existing methods fuse these sources by interpolating gauge observations onto the satellite grid, but treat each time step independently and therefore discard temporal structure in precipitation fields. We propose Neural Stochastic Process (NSP), a model that pairs a Neural Process encoder conditioning on arbitrary sets of gauge observations with a latent Neural SDE on a 2D spatial representation. NSP is trained under a single variational objective with simulation-free cost. We also introduce QPEBench, a benchmark of 43{,}756 hourly samples over the Contiguous United States (2021--2025) with four aligned data sources and six evaluation metrics. On QPEBench, NSP outperforms 13 baselines across all six metrics and surpasses JAXA's operational gauge-calibrated product. An additional experiment on Kyushu, Japan confirms generalization to a different region with independent data sources.
NCDec 19, 2025
MEGState: Phoneme Decoding from Magnetoencephalography SignalsShuntaro Suzuki, Chia-Chun Dan Hsu, Yu Tsao et al.
Decoding linguistically meaningful representations from non-invasive neural recordings remains a central challenge in neural speech decoding. Among available neuroimaging modalities, magnetoencephalography (MEG) provides a safe and repeatable means of mapping speech-related cortical dynamics, yet its low signal-to-noise ratio and high temporal dimensionality continue to hinder robust decoding. In this work, we introduce MEGState, a novel architecture for phoneme decoding from MEG signals that captures fine-grained cortical responses evoked by auditory stimuli. Extensive experiments on the LibriBrain dataset demonstrate that MEGState consistently surpasses baseline model across multiple evaluation metrics. These findings highlight the potential of MEG-based phoneme decoding as a scalable pathway toward non-invasive brain-computer interfaces for speech.
SPJan 23
PENGUIN: General Vital Sign Reconstruction from PPG with Flow Matching State Space ModelShuntaro Suzuki, Shuitsu Koyama, Shinnosuke Hirano et al.
Photoplethysmography (PPG) plays a crucial role in continuous cardiovascular health monitoring as a non-invasive and cost-effective modality. However, PPG signals are susceptible to motion artifacts and noise, making accurate estimation of vital signs such as arterial blood pressure (ABP) challenging. Existing estimation methods are often restricted to a single-task or environment, limiting their generalizability across diverse PPG decoding scenarios. Moreover, recent general-purpose approaches typically rely on predictions over multi-second intervals, discarding the morphological characteristics of vital signs. To address these challenges, we propose PENGUIN, a generative flow-matching framework that extends deep state space models, enabling fine-grained conditioning on PPG for reconstructing multiple vital signs as continuous waveforms. We evaluate PENGUIN using six real-world PPG datasets across three distinct vital sign reconstruction tasks (electrocardiogram reconstruction, respiratory monitoring, and ABP monitoring). Our method consistently outperformed both task-specific and general-purpose baselines, demonstrating PENGUIN as a general framework for robust vital sign reconstruction from PPG.
LGFeb 5
A Decomposition-based State Space Model for Multivariate Time-Series ForecastingShunya Nagashima, Shuntaro Suzuki, Shuitsu Koyama et al.
Multivariate time series (MTS) forecasting is crucial for decision-making in domains such as weather, energy, and finance. It remains challenging because real-world sequences intertwine slow trends, multi-rate seasonalities, and irregular residuals. Existing methods often rely on rigid, hand-crafted decompositions or generic end-to-end architectures that entangle components and underuse structure shared across variables. To address these limitations, we propose DecompSSM, an end-to-end decomposition framework using three parallel deep state space model branches to capture trend, seasonal, and residual components. The model features adaptive temporal scales via an input-dependent predictor, a refinement module for shared cross-variable context, and an auxiliary loss that enforces reconstruction and orthogonality. Across standard benchmarks (ECL, Weather, ETTm2, and PEMS04), DecompSSM outperformed strong baselines, indicating the effectiveness of combining component-wise deep state space models and global context refinement.
NCOct 21, 2025
Condition-Invariant fMRI Decoding of Speech Intelligibility with Deep State Space ModelChing-Chih Sung, Shuntaro Suzuki, Francis Pingfan Chien et al.
Clarifying the neural basis of speech intelligibility is critical for computational neuroscience and digital speech processing. Recent neuroimaging studies have shown that intelligibility modulates cortical activity beyond simple acoustics, primarily in the superior temporal and inferior frontal gyri. However, previous studies have been largely confined to clean speech, leaving it unclear whether the brain employs condition-invariant neural codes across diverse listening environments. To address this gap, we propose a novel architecture built upon a deep state space model for decoding intelligibility from fMRI signals, specifically tailored to their high-dimensional temporal structure. We present the first attempt to decode intelligibility across acoustically distinct conditions, showing our method significantly outperforms classical approaches. Furthermore, region-wise analysis highlights contributions from auditory, frontal, and parietal regions, and cross-condition transfer indicates the presence of condition-invariant neural codes, thereby advancing understanding of abstract linguistic representations in the brain.
CVOct 17, 2025
Cortical-SSM: A Deep State Space Model for EEG and ECoG Motor Imagery DecodingShuntaro Suzuki, Shunya Nagashima, Masayuki Hirata et al.
Classification of electroencephalogram (EEG) and electrocorticogram (ECoG) signals obtained during motor imagery (MI) has substantial application potential, including for communication assistance and rehabilitation support for patients with motor impairments. These signals remain inherently susceptible to physiological artifacts (e.g., eye blinking, swallowing), which pose persistent challenges. Although Transformer-based approaches for classifying EEG and ECoG signals have been widely adopted, they often struggle to capture fine-grained dependencies within them. To overcome these limitations, we propose Cortical-SSM, a novel architecture that extends deep state space models to capture integrated dependencies of EEG and ECoG signals across temporal, spatial, and frequency domains. We validated our method across three benchmarks: 1) two large-scale public MI EEG datasets containing more than 50 subjects, and 2) a clinical MI ECoG dataset recorded from a patient with amyotrophic lateral sclerosis. Our method outperformed baseline methods on the three benchmarks. Furthermore, visual explanations derived from our model indicate that it effectively captures neurophysiologically relevant regions of both EEG and ECoG signals.