Łukasz Bondaruk

h-index1

3papers

4citations

Novelty30%

AI Score33

Ranked #119,188 of 194,257 authors (top 61%)#939 in SD (top 52%)

3 Papers

2.2SDJan 27

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models

Iwona Christop, Mateusz Czyżnikiewicz, Paweł Skórzewski et al.

The present benchmarks for testing the audio modality of multimodal large language models concentrate on testing various audio tasks such as speaker diarization or gender identification in isolation. Whether a multimodal model can answer the questions that require reasoning skills to combine audio tasks of different categories, cannot be verified with their use. To address this issue, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over audio signal.

12.9SDFeb 11, 2025

LoRP-TTS: Low-Rank Personalized Text-To-Speech

Łukasz Bondaruk, Jakub Kubiak

Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advancements have led to the development of zero-shot systems that generate realistic speech from a wide range of speakers using their voices as additional prompts. However, they still struggle with imitating non-studio-quality samples that differ significantly from the training datasets. In this work, we demonstrate that utilizing Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts. This approach enhances speaker similarity by up to $30pp$ while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, that is crucial in all speech-related tasks.

4.9CLSep 15, 2025

Preservation of Language Understanding Capabilities in Speech-aware Large Language Models

Marek Kubis, Paweł Skórzewski, Iwona Christop et al.

The paper presents C3T (Cross-modal Capabilities Conservation Test), a new benchmark for assessing the performance of speech-aware large language models. The benchmark utilizes textual tasks and a voice cloning text-to-speech model to quantify the extent to which language understanding capabilities are preserved when the model is accessed via speech input. C3T quantifies the fairness of the model for different categories of speakers and its robustness across text and speech modalities.