Piotr Syga

SD
h-index16
16papers
332citations
Novelty52%
AI Score56

16 Papers

SDMay 18
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Nicolas M. Müller, Piotr Kawa, Wei Herng Choong et al.

This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), version 10: a dataset of synthetic audio to train and evaluate audio deepfake detection models. It features 175 Text-to-Speech (TTS) models, comprising a total of 1002.9 hours of synthetic voice in 54 different languages. To evaluate this dataset, we train three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance to comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing the dataset and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.

SDJun 2, 2023
Improved DeepFake Detection Using Whisper Features

Piotr Kawa, Marcin Plata, Michał Czuba et al.

With a recent influx of voice generation methods, the threat introduced by audio DeepFake (DF) is ever-increasing. Several different detection methods have been presented as a countermeasure. Many methods are based on so-called front-ends, which, by transforming the raw audio, emphasize features crucial for assessing the genuineness of the audio sample. Our contribution contains investigating the influence of the state-of-the-art Whisper automatic speech recognition model as a DF detection front-end. We compare various combinations of Whisper and well-established front-ends by training 3 detection models (LCNN, SpecRNet, and MesoNet) on a widely used ASVspoof 2021 DF dataset and later evaluating them on the DF In-The-Wild dataset. We show that using Whisper-based features improves the detection for each model and outperforms recent results on the In-The-Wild dataset by reducing Equal Error Rate by 21%.

SDJun 27, 2022
Attack Agnostic Dataset: Towards Generalization and Stabilization of Audio DeepFake Detection

Piotr Kawa, Marcin Plata, Piotr Syga

Audio DeepFakes allow the creation of high-quality, convincing utterances and therefore pose a threat due to its potential applications such as impersonation or fake news. Methods for detecting these manipulations should be characterized by good generalization and stability leading to robustness against attacks conducted with techniques that are not explicitly included in the training. In this work, we introduce Attack Agnostic Dataset - a combination of two audio DeepFakes and one anti-spoofing datasets that, thanks to the disjoint use of attacks, can lead to better generalization of detection methods. We present a thorough analysis of current DeepFake detection methods and consider different audio features (front-ends). In addition, we propose a model based on LCNN with LFCC and mel-spectrogram front-end, which not only is characterized by a good generalization and stability results but also shows improvement over LFCC-based mode - we decrease standard deviation on all folds and EER in two folds by up to 5%.

SDOct 12, 2022
SpecRNet: Towards Faster and More Accessible Audio DeepFake Detection

Piotr Kawa, Marcin Plata, Piotr Syga

Audio DeepFakes are utterances generated with the use of deep neural networks. They are highly misleading and pose a threat due to use in fake news, impersonation, or extortion. In this work, we focus on increasing accessibility to the audio DeepFake detection methods by providing SpecRNet, a neural network architecture characterized by a quick inference time and low computational requirements. Our benchmark shows that SpecRNet, requiring up to about 40% less time to process an audio sample, provides performance comparable to LCNN architecture - one of the best audio DeepFake detection models. Such a method can not only be used by online multimedia services to verify a large bulk of content uploaded daily but also, thanks to its low requirements, by average citizens to evaluate materials on their devices. In addition, we provide benchmarks in three unique settings that confirm the correctness of our model. They reflect scenarios of low-resource datasets, detection on short utterances and limited attacks benchmark in which we take a closer look at the influence of particular attacks on given architectures.

SDDec 30, 2022
Defense Against Adversarial Attacks on Audio DeepFake Detection

Piotr Kawa, Marcin Plata, Piotr Syga

Audio DeepFakes (DF) are artificially generated utterances created using deep learning, with the primary aim of fooling the listeners in a highly convincing manner. Their quality is sufficient to pose a severe threat in terms of security and privacy, including the reliability of news or defamation. Multiple neural network-based methods to detect generated speech have been proposed to prevent the threats. In this work, we cover the topic of adversarial attacks, which decrease the performance of detectors by adding superficial (difficult to spot by a human) changes to input data. Our contribution contains evaluating the robustness of 3 detection architectures against adversarial attacks in two scenarios (white-box and using transferability) and enhancing it later by using adversarial training performed by our novel adaptive training. Moreover, one of the investigated architectures is RawNet3, which, to the best of our knowledge, we adapted for the first time to DeepFake detection.

SDMar 4, 2025Code
As Good as It KAN Get: High-Fidelity Audio Representation

Patryk Marszałek, Maciej Rut, Piotr Kawa et al.

Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-SpectralDistance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.

SDMay 11
APEX: Audio Prototype EXplanations for Classification Tasks

Piotr Kawa, Kornel Howil, Piotr Borycki et al.

Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.

CVFeb 10Code
XSPLAIN: XAI-enabling Splat-based Prototype Learning for Attribute-aware INterpretability

Dominik Galus, Julia Farganus, Tymoteusz Zapala et al.

3D Gaussian Splatting (3DGS) has rapidly become a standard for high-fidelity 3D reconstruction, yet its adoption in multiple critical domains is hindered by the lack of interpretability of the generation models as well as classification of the Splats. While explainability methods exist for other 3D representations, like point clouds, they typically rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives. We introduce XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3DGS classification. Our approach leverages a voxel-aggregated PointNet backbone and a novel, invertible orthogonal transformation that disentangles feature channels for interpretability while strictly preserving the original decision boundaries. Explanations are grounded in representative training examples, enabling intuitive ``this looks like that'' reasoning without any degradation in classification performance. A rigorous user study (N=51) demonstrates a decisive preference for our approach: participants selected XSPLAIN explanations 48.4\% of the time as the best, significantly outperforming baselines $(p<0.001)$, showing that XSPLAIN provides transparency and user trust. The source code for this work is available at: https://github.com/Solvro/ml-splat-xai

SDApr 22
Enhancing Speaker Verification with Whispered Speech via Post-Processing

Magdalena Gołębiowska, Piotr Syga

Speaker verification is a task of confirming an individual's identity through the analysis of their voice. Whispered speech differs from phonated speech in acoustic characteristics, which degrades the performance of speaker verification systems in real-life scenarios, including avoiding fully phonated speech to protect privacy, disrupt others, or when the lack of full vocalization is dictated by a disease. In this paper we propose a model with a training recipe to obtain more robust representations against whispered speech hindrances. The proposed system employs an encoder--decoder structure built atop a fine-tuned speaker verification backbone, optimized jointly using cosine similarity--based classification and triplet loss. We gain relative improvement of 22.26\% compared to the baseline (baseline 6.77\% vs ours 5.27\%) in normal vs whispered speech trials, achieving AUC of 98.16\%. In tests comparing whispered to whispered, our model attains an EER of 1.88\% with AUC equal to 99.73\%, which represents a 15\% relative enhancement over the prior leading ReDimNet-B2. We also offer a summary of the most popular and state-of-the-art speaker verification models in terms of their performance with whispered speech. Additionally, we evaluate how these models perform under noisy audios, obtaining that generally the same relative level of noise degrades the performance of speaker verification more significantly on whispered speech than on normal speech.

GRMar 25
FaceParts: Segmentation and Editing of Gaussian Splatting

Tymoteusz Zapała, Julia Farganus, Dominik Galus et al.

Facial editing is an important task with applications in entertainment, virtual reality, and digital avatars. Most existing approaches rely on generative models in the 2D image domain, while in 3D the task is typically performed through labor-intensive manual editing. We propose FaceParts, a framework for unsupervised segmentation and editing of Gaussian Splatting avatars. Unlike existing 2D or mesh-assisted methods, our approach operates directly in the Gaussian domain, decomposing avatars into semantically coherent facial parts without supervision. The method integrates feature disentanglement, density-based clustering, and FLAME-anchored part transfer, enabling precise editing and cross-avatar part swapping. Experiments on the NeRSemble dataset with 11 subjects demonstrate robust isolation of features such as beards, eyebrows, eyes and mustaches. Quantitative evaluation confirms that transferred segments adapt to pose and expression, while maintaining identity consistency (ID = 0.943), low Average Expression Distance (AED = 0.021) and low Average Pose Distance (APD = 0.004).

CVNov 21, 2025
AEGIS: Preserving privacy of 3D Facial Avatars with Adversarial Perturbations

Dawid Wolkiewicz, Anastasiya Pechko, Przemysław Spurek et al.

The growing adoption of photorealistic 3D facial avatars, particularly those utilizing efficient 3D Gaussian Splatting representations, introduces new risks of online identity theft, especially in systems that rely on biometric authentication. While effective adversarial masking methods have been developed for 2D images, a significant gap remains in achieving robust, viewpoint-consistent identity protection for dynamic 3D avatars. To address this, we present AEGIS, the first privacy-preserving identity masking framework for 3D Gaussian Avatars that maintains the subject's perceived characteristics. Our method aims to conceal identity-related facial features while preserving the avatar's perceptual realism and functional integrity. AEGIS applies adversarial perturbations to the Gaussian color coefficients, guided by a pre-trained face verification network, ensuring consistent protection across multiple viewpoints without retraining or modifying the avatar's geometry. AEGIS achieves complete de-identification, reducing face retrieval and verification accuracy to 0%, while maintaining high perceptual quality (SSIM = 0.9555, PSNR = 35.52 dB). It also preserves key facial attributes such as age, race, gender, and emotion, demonstrating strong privacy protection with minimal visual distortion.

CVJan 19, 2025
Counteracting temporal attacks in Video Copy Detection

Katarzyna Fojcik, Piotr Syga

Video Copy Detection (VCD) plays a crucial role in copyright protection and content verification by identifying duplicates and near-duplicates in large-scale video databases. The META AI Challenge on video copy detection provided a benchmark for evaluating state-of-the-art methods, with the Dual-level detection approach emerging as a winning solution. This method integrates Video Editing Detection and Frame Scene Detection to handle adversarial transformations and large datasets efficiently. However, our analysis reveals significant limitations in the VED component, particularly in its ability to handle exact copies. Moreover, Dual-level detection shows vulnerability to temporal attacks. To address it, we propose an improved frame selection strategy based on local maxima of interframe differences, which enhances robustness against adversarial temporal modifications while significantly reducing computational overhead. Our method achieves an increase of 1.4 to 5.8 times in efficiency over the standard 1 FPS approach. Compared to Dual-level detection method, our approach maintains comparable micro-average precision ($μ$AP) while also demonstrating improved robustness against temporal attacks. Given 56\% reduced representation size and the inference time of more than 2 times faster, our approach is more suitable to real-world resource restriction.

CVJun 9, 2020
A Note on Deepfake Detection with Low-Resources

Piotr Kawa, Piotr Syga

Deepfakes are videos that include changes, quite often substituting face of a portrayed individual with a different face using neural networks. Even though the technology gained its popularity as a carrier of jokes and parodies it raises a serious threat to ones security - via biometric impersonation or besmearing. In this paper we present two methods that allow detecting Deepfakes for a user without significant computational power. In particular, we enhance MesoNet by replacing the original activation functions allowing a nearly 1% improvement as well as increasing the consistency of the results. Moreover, we introduced and verified a new activation function - Pish that at the cost of slight time overhead allows even higher consistency. Additionally, we present a preliminary results of Deepfake detection method based on Local Feature Descriptors (LFD), that allows setting up the system even faster and without resorting to GPU computation. Our method achieved Equal Error Rate of 0.28, with both accuracy and recall exceeding 0.7.

MMJun 6, 2020
Robust watermarking with double detector-discriminator approach

Marcin Plata, Piotr Syga

In this paper we present a novel deep framework for a watermarking - a technique of embedding a transparent message into an image in a way that allows retrieving the message from a (perturbed) copy, so that copyright infringement can be tracked. For this technique, it is essential to extract the information from the image even after imposing some digital processing operations on it. Our framework outperforms recent methods in the context of robustness against not only spectrum of attacks (e.g. rotation, resizing, Gaussian smoothing) but also against compression, especially JPEG. The bit accuracy of our method is at least 0.86 for all types of distortions. We also achieved 0.90 bit accuracy for JPEG while recent methods provided at most 0.83. Our method retains high transparency and capacity as well. Moreover, we present our double detector-discriminator approach - a scheme to detect and discriminate if the image contains the embedded message or not, which is crucial for real-life watermarking systems and up to now was not investigated using neural networks. With this, we design a testing formula to validate our extended approach and compared it with a common procedure. We also present an alternative method of balancing between image quality and robustness on attacks which is easily applicable to the framework.

MMMay 24, 2020
Robust Spatial-spread Deep Neural Image Watermarking

Marcin Plata, Piotr Syga

Watermarking is an operation of embedding an information into an image in a way that allows to identify ownership of the image despite applying some distortions on it. In this paper, we presented a novel end-to-end solution for embedding and recovering the watermark in the digital image using convolutional neural networks. The method is based on spreading the message over the spatial domain of the image, hence reducing the "local bits per pixel" capacity. To obtain the model we used adversarial training and applied noiser layers between the encoder and the decoder. Moreover, we broadened the spectrum of typically considered attacks on the watermark and by grouping the attacks according to their scope, we achieved high general robustness, most notably against JPEG compression, Gaussian blurring, subsampling or resizing. To help us in the models training we also proposed a precise differentiable approximation of JPEG.

CRFeb 12, 2016
Practical Fault-Tolerant Data Aggregation

Krzysztof Grining, Marek Klonowski, Piotr Syga

During Financial Cryptography 2012 Chan et al. presented a novel privacy-protection fault-tolerant data aggregation protocol. Comparing to previous work, their scheme guaranteed provable privacy of individuals and could work even if some number of users refused to participate. In our paper we demonstrate that despite its merits, their method provides unacceptably low accuracy of aggregated data for a wide range of assumed parameters and cannot be used in majority of real-life systems. To show this we use both precise analytic and experimental methods. Additionally, we present a precise data aggregation protocol that provides provable level of security even facing massive failures of nodes. Moreover, the protocol requires significantly less computation (limited exploiting of heavy cryptography) than most of currently known fault tolerant aggregation protocols and offers better security guarantees that make it suitable for systems of limited resources (including sensor networks). To obtain our result we relax however the model and allow some limited communication between the nodes.