AS · SD · May 22

Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech

arXiv:2605.23619

Predicted impact: top 80% in AS · last 90 days

Originality: Synthesis-oriented

AI Analysis

For researchers developing hearing aid algorithms, this work provides a practical fusion method that improves prediction accuracy, though the gains are incremental over simpler baselines.

The paper investigates non-intrusive intelligibility prediction for hearing-aid-processed speech, comparing strategies for fusing two frozen pretrained encoders, Canary and WavLM. The best system, frame-aligned fusion with a learnable strided convolution, reaches an Eval RMSE of 24.96 and an Eval Corr of 0.796.
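For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how an RMSE and a correlation score could be computed between predicted and measured intelligibility. It assumes that scores lie on a 0 to 100 percent-correct scale and that "Corr" denotes Pearson correlation; neither is stated in the summary, and the function names and sample values are purely illustrative.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length score lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson(pred, target):
    """Pearson correlation (assumed here to be what 'Corr' refers to)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Made-up scores for illustration only (assumed 0-100 scale).
predicted = [10.0, 50.0, 90.0]
measured = [20.0, 50.0, 80.0]
print(rmse(predicted, measured), pearson(predicted, measured))
```

Lower RMSE and higher correlation both indicate better agreement with listener scores, which is the direction in which the reported 24.96 and 0.796 should be read.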

Non-intrusive intelligibility prediction estimates how well hearing-impaired listeners understand hearing-aid-processed speech without a clean reference. We study this task in the 3rd Clarity Prediction Challenge using two frozen speech encoders, Canary and WavLM. The central question is not only whether complementary pretrained representations should be combined, but where their interaction should occur. We compare single-backbone baselines, uniform score averaging, pool-late fusion, cross-attention, frame-aligned fusion, and reverse alignment under a shared left/right-preserving binaural framework. Among the compared systems, the best model temporally prepares WavLM with a learnable strided convolution and fuses it with Canary on the coarser Canary timeline before pooling, reaching Eval RMSE 24.96$\pm$0.06 and Eval Corr 0.796$\pm$0.001. Severity, enhancement-system, layer-window, and temporal-shift analyses indicate that coarse local temporal correspondence before pooling is a useful inductive bias for this task.
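The frame-aligned idea described above can be illustrated with a small dependency-free sketch: a strided 1-D convolution brings the finer WavLM sequence onto the coarser Canary timeline, the two sequences are combined frame by frame, and pooling happens only afterwards. This is not the authors' implementation. The stride of 4, the feature sizes, concatenation as the fusion operator, mean pooling, and all function names are assumptions made for illustration, and the fixed averaging kernel stands in for weights that would be learned in practice.

```python
def strided_conv1d(frames, weight, bias, stride):
    """Strided 1-D convolution over time.

    frames: list of T input vectors (each of length D_in)
    weight: list of K matrices, each D_out x D_in (one per kernel tap)
    bias:   list of D_out values
    Returns floor((T - K) / stride) + 1 output vectors of length D_out.
    """
    kernel = len(weight)
    out = []
    for start in range(0, len(frames) - kernel + 1, stride):
        vec = list(bias)
        for j in range(kernel):
            x = frames[start + j]
            for o, row in enumerate(weight[j]):
                vec[o] += sum(w * xi for w, xi in zip(row, x))
        out.append(vec)
    return out


def fuse_frame_aligned(canary, wavlm, weight, bias, stride):
    """Downsample WavLM to the Canary timeline, then concatenate per frame."""
    wavlm_coarse = strided_conv1d(wavlm, weight, bias, stride)
    steps = min(len(canary), len(wavlm_coarse))  # guard against off-by-one lengths
    return [canary[i] + wavlm_coarse[i] for i in range(steps)]


def mean_pool(frames):
    """Average over time to get one utterance-level vector."""
    return [sum(col) / len(frames) for col in zip(*frames)]


# Toy inputs: 16 fine-rate "WavLM" frames and 4 coarse-rate "Canary" frames.
STRIDE = 4  # assumed frame-rate ratio, not taken from the abstract
DIM = 3     # toy feature size
wavlm = [[float(t)] * DIM for t in range(16)]
canary = [[1.0, -1.0] for _ in range(4)]

# Stand-in for learned weights: each tap is 0.25 * identity (a local average).
weight = [[[0.25 if o == i else 0.0 for i in range(DIM)] for o in range(DIM)]
          for _ in range(STRIDE)]
bias = [0.0] * DIM

fused = fuse_frame_aligned(canary, wavlm, weight, bias, STRIDE)
utterance_vector = mean_pool(fused)
```

In a pool-late variant, each encoder would instead be pooled separately before the two vectors are combined, so no frame-level correspondence between the encoders would be available to the fusion step.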
