Stefano Tubaro

h-index 48

42 papers

1,607 citations

Novelty 41%

AI Score 47

Ranked #31,585 of 194,257 authors (top 16%); #11,300 in CV (top 19%)

42 Papers

12.8 CV Jul 15, 2024 Code

When Synthetic Traces Hide Real Content: Analysis of Stable Diffusion Image Laundering

Sara Mandelli, Paolo Bestagini, Stefano Tubaro

In recent years, methods for producing highly realistic synthetic images have significantly advanced, allowing the creation of high-quality images from text prompts that describe the desired content. Even more impressively, Stable Diffusion (SD) models now provide users with the option of creating synthetic images in an image-to-image translation fashion, modifying images in the latent space of advanced autoencoders. This striking evolution, however, brings an alarming consequence: it is possible to pass an image through SD autoencoders to reproduce a synthetic copy of the image with high realism and almost no visual artifacts. This process, known as SD image laundering, can transform real images into lookalike synthetic ones and risks complicating forensic analysis for content authenticity verification. Our paper investigates the forensic implications of image laundering, revealing a serious potential to obscure traces of real content, including sensitive and harmful materials that could be mistakenly classified as synthetic, thereby undermining the protection of individuals depicted. To address this issue, we propose a two-stage detection pipeline that effectively differentiates between pristine, laundered, and fully synthetic images (those generated from text prompts), showing robustness across various conditions. Finally, we highlight another alarming property of image laundering, which appears to mask the unique artifacts exploited by forensic detectors to solve the camera model identification task, strongly undermining their performance. Our experimental code is available at https://github.com/polimi-ispl/synthetic-image-detection.

13.2 CV May 3, 2022

Splicing Detection and Localization In Satellite Imagery Using Conditional GANs

Emily R. Bartusiak, Sri Kalyan Yarlagadda, David Güera et al.

The widespread availability of image editing tools and improvements in image processing techniques make image manipulation very easy. Oftentimes, easy-to-use yet sophisticated image manipulation tools yield distortions/changes imperceptible to the human observer. Distribution of forged images can have drastic ramifications, especially when coupled with the speed and vastness of the Internet. Therefore, verifying image integrity poses an immense and important challenge to the digital forensic community. Satellite images specifically can be modified in a number of ways, including the insertion of objects to hide existing scenes and structures. In this paper, we describe the use of a Conditional Generative Adversarial Network (cGAN) to identify the presence of such spliced forgeries within satellite images. Additionally, we identify their locations and shapes. Trained on pristine and falsified images, our method achieves high success on these detection and localization objectives.

21.5 CV Mar 4, 2022 Code

Detecting GAN-generated Images by Orthogonal Training of Multiple CNNs

Sara Mandelli, Nicolò Bonettini, Paolo Bestagini et al.

In the last few years, we have witnessed the rise of a series of deep learning methods to generate synthetic images that look extremely realistic. These techniques prove useful in the movie industry and for artistic purposes. However, they also prove dangerous if used to spread fake news or to generate fake online accounts. For this reason, detecting if an image is an actual photograph or has been synthetically generated is becoming an urgent necessity. This paper proposes a detector of synthetic images based on an ensemble of Convolutional Neural Networks (CNNs). We consider the problem of detecting images generated with techniques not available at training time. This is a common scenario, given that new image generators are published more and more frequently. To solve this issue, we leverage two main ideas: (i) CNNs should provide orthogonal results to better contribute to the ensemble; (ii) original images are better defined than synthetic ones, thus they should be better trusted at testing time. Experiments show that pursuing these two ideas improves the detector accuracy on NVIDIA's newly generated StyleGAN3 images, never used in training.

5.7 CV Sep 19, 2022

An Overview on the Generation and Detection of Synthetic and Manipulated Satellite Images

Lydia Abady, Edoardo Daniele Cannas, Paolo Bestagini et al.

Due to the reduction of technological costs and the increase in satellite launches, satellite images are becoming more popular and easier to obtain. Besides serving benevolent purposes, satellite data can also be used for malicious reasons such as misinformation. As a matter of fact, satellite images can be easily manipulated relying on general image editing tools. Moreover, with the surge of Deep Neural Networks (DNNs) that can generate realistic synthetic imagery belonging to various domains, additional threats related to the diffusion of synthetically generated satellite images are emerging. In this paper, we review the State of the Art (SOTA) on the generation and manipulation of satellite images. In particular, we focus on both the generation of synthetic satellite imagery from scratch, and the semantic manipulation of satellite images by means of image-transfer technologies, including the transformation of images obtained from one type of sensor to another one. We also describe forensic detection techniques that have been researched so far to classify and detect synthetic image forgeries. While we focus mostly on forensic techniques explicitly tailored to the detection of AI-generated synthetic contents, we also review some methods designed for general splicing detection, which can in principle also be used to spot AI-manipulated images.

9.4 SD Oct 31, 2022

Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection

Luigi Attorresi, Davide Salvi, Clara Borrelli et al.

The rapid spread of media content synthesis technology and the potentially damaging impact of audio and video deepfakes on people's lives have raised the need to implement systems able to detect these forgeries automatically. In this work we present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice. On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task. On the other side, voice prosody, intended as variations in rhythm, pitch or accent in speech, is extracted through a specialized encoder. We show that the combination of these two embeddings fed to a supervised binary classifier allows the detection of deepfake speech generated with both Text-to-Speech and Voice Conversion techniques. Our results show improvements over the considered baselines, good generalization properties over multiple datasets and robustness to audio compression.

10.6 SD Apr 6, 2023

DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection

Amit Kumar Singh Yadav, Kratika Bhagtani, Ziyue Xiang et al.

Tools to generate high-quality synthetic speech signals that are perceptually indistinguishable from speech recorded from human speakers are easily available. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods as a black box without providing reasoning for the decisions they make. This limits the interpretability of these approaches. In this paper, we propose Disentangled Spectrogram Variational Auto Encoder (DSVAE), which is a two-stage trained variational autoencoder that processes spectrograms of speech using disentangled representation learning to generate interpretable representations of a speech signal for detecting synthetic speech. DSVAE also creates an activation map to highlight the spectrogram regions that discriminate synthetic and bona fide human speech signals. We evaluated the representations obtained from DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (>98%) in detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers. We also visualize the representation obtained from DSVAE for 17 different speech synthesizers and verify that they are indeed interpretable and discriminate bona fide and synthetic speech from each of the synthesizers.

8.7 CV Aug 1, 2024

Deepfake Media Forensics: State of the Art and Challenges Ahead

Irene Amerini, Mauro Barni, Sebastiano Battiato et al.

AI-generated synthetic media, also called Deepfakes, have significantly influenced many domains, from entertainment to cybersecurity. Generative Adversarial Networks (GANs) and Diffusion Models (DMs) are the main frameworks used to create Deepfakes, producing highly realistic yet fabricated content. While these technologies open up new creative possibilities, they also bring substantial ethical and security risks due to their potential misuse. The rise of such advanced media has led to the development of a cognitive bias known as Impostor Bias, where individuals doubt the authenticity of multimedia due to the awareness of AI's capabilities. As a result, Deepfake detection has become a vital area of research, focusing on identifying subtle inconsistencies and artifacts with machine learning techniques, especially Convolutional Neural Networks (CNNs). Research in forensic Deepfake technology encompasses five main areas: detection, attribution and recognition, passive authentication, detection in realistic scenarios, and active authentication. This paper reviews the primary algorithms that address these challenges, examining their advantages, limitations, and future prospects.

8.9 IV Feb 15, 2023

Super-Resolution of BVOC Maps by Adapting Deep Learning Methods

Antonio Giganti, Sara Mandelli, Paolo Bestagini et al.

Biogenic Volatile Organic Compounds (BVOCs) play a critical role in biosphere-atmosphere interactions, being a key factor in the physical and chemical properties of the atmosphere and climate. Acquiring large and fine-grained BVOC emission maps is expensive and time-consuming, so most available BVOC data are obtained on a loose and sparse sampling grid or on small regions. However, high-resolution BVOC data are desirable in many applications, such as air quality, atmospheric chemistry, and climate monitoring. In this work, we investigate the possibility of enhancing BVOC acquisitions, further explaining the relationships between the environment and these compounds. We do so by comparing the performances of several state-of-the-art neural networks proposed for image Super-Resolution (SR), adapting them to overcome the challenges posed by the large dynamic range of the emission and reduce the impact of outliers in the prediction. Moreover, we also consider realistic scenarios, considering both temporal and geographical constraints. Finally, we present possible future developments regarding SR generalization, considering the scale-invariance property and super-resolving emissions from unseen compounds.

1.4 CV Oct 20, 2022

H4VDM: H.264 Video Device Matching

Ziyue Xiang, Paolo Bestagini, Stefano Tubaro et al.

Methods that can determine if two given video sequences are captured by the same device (e.g., mobile telephone or digital camera) can be used in many forensics tasks. In this paper we refer to this as "video device matching". In open-set video forensics scenarios it is easier to determine if two video sequences were captured with the same device than identifying the specific device. In this paper, we propose a technique for open-set video device matching. Given two H.264 compressed video sequences, our method can determine if they are captured by the same device, even if our method has never encountered the device in training. We denote our proposed technique as H.264 Video Device Matching (H4VDM). H4VDM uses H.264 compression information extracted from video sequences to make decisions. It is more robust against artifacts that alter camera sensor fingerprints, and it can be used to analyze relatively small fragments of the H.264 sequence. We trained and tested our method on a publicly available video forensics dataset consisting of 35 devices, where our proposed method demonstrated good performance.

5.3 IV Jun 22, 2023 Code

Super-Resolution of BVOC Emission Maps Via Domain Adaptation

Antonio Giganti, Sara Mandelli, Paolo Bestagini et al.

Enhancing the resolution of Biogenic Volatile Organic Compound (BVOC) emission maps is a critical task in remote sensing. Recently, some Super-Resolution (SR) methods based on Deep Learning (DL) have been proposed, leveraging data from numerical simulations for their training process. However, when dealing with data derived from satellite observations, the reconstruction is particularly challenging due to the scarcity of measurements to train SR algorithms with. In our work, we aim at super-resolving low resolution emission maps derived from satellite observations by leveraging the information of emission maps obtained through numerical simulations. To do this, we combine a SR method based on DL with Domain Adaptation (DA) techniques, harmonizing the different aggregation strategies and spatial information used in simulated and observed domains to ensure compatibility. We investigate the effectiveness of DA strategies at different stages by systematically varying the number of simulated and observed emissions used, exploring the implications of data scarcity on the adaptation strategies. To the best of our knowledge, there are no prior investigations of DA in satellite-derived BVOC maps enhancement. Our work represents a first step toward the development of robust strategies for the reconstruction of observed BVOC emissions.

5.8 SD May 6

Viola Negroni, Davide Salvi, Daniele Ugo Leonzio et al.

In this paper, we introduce the concept of forensic similarity in the speech deepfake detection domain, which aims to determine whether two audio segments share the same underlying forensic traces. Our approach is inspired by prior work in the image domain. To transfer this idea to the audio domain, we propose a two-stage deep learning framework consisting of a Siamese-based feature extractor and a core decision module, referred to as the similarity network. The system's goal is to assess whether two speech samples originate from the same source by comparing their forensic characteristics. In practice, the model maps pairs of audio segments to a similarity score indicating whether they contain identical or different forensic traces. We evaluate the proposed method on the emerging task of source verification, demonstrating its ability to determine whether two speech samples were generated by the same model. In addition, we explore its applicability to audio splicing detection as a complementary use case. Experimental results show that the proposed approach generalizes well to previously unseen forensic traces, highlighting its robustness, flexibility, and practical relevance for digital audio forensics.

14.8 SD Sep 24, 2024

Leveraging Mixture of Experts for Improved Speech Deepfake Detection

Viola Negroni, Davide Salvi, Alessandro Ilic Mezza et al.

Speech deepfakes pose a significant threat to personal security and content authenticity. Several detectors have been proposed in the literature, and one of the primary challenges these systems have to face is the generalization over unseen data to identify fake signals across a wide range of datasets. In this paper, we introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture. The Mixture of Experts framework is well-suited for the speech deepfake detection task due to its ability to specialize in different input types and handle data variability efficiently. This approach offers superior generalization and adaptability to unseen data compared to traditional single models or ensemble methods. Additionally, its modular structure supports scalable updates, making it more flexible in managing the evolving complexity of deepfake techniques while maintaining high detection accuracy. We propose an efficient, lightweight gating mechanism to dynamically assign expert weights for each input, optimizing detection performance. Experimental results across multiple datasets demonstrate the effectiveness and potential of our proposed approach.
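
The gating idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' architecture: the linear gate, the two constant "experts", and every name here (`moe_score`, `gate_weights`) are hypothetical, chosen only to show how per-input expert weights combine expert outputs.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_score(features, experts, gate_weights):
    # Hypothetical linear gate: one logit per expert, computed from the input.
    logits = [sum(w * x for w, x in zip(row, features)) for row in gate_weights]
    weights = softmax(logits)
    # Each expert returns its own "fake" score for the same input.
    scores = [expert(features) for expert in experts]
    # Final score is the gate-weighted sum of the expert scores.
    return sum(w * s for w, s in zip(weights, scores)), weights

# Toy experts (assumptions): constant scores stand in for trained detectors.
experts = [lambda f: 0.9, lambda f: 0.1]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
score, weights = moe_score([2.0, 0.0], experts, gate_weights)
```

Here the input activates the first gate logit more strongly, so the first expert dominates the combined score. In a real system both the gate and the experts would be trained networks.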

8.3 SD Aug 25, 2024

Analyzing the Impact of Splicing Artifacts in Partially Fake Speech Signals

Viola Negroni, Davide Salvi, Paolo Bestagini et al.

Speech deepfake detection has recently gained significant attention within the multimedia forensics community. Related issues have also been explored, such as the identification of partially fake signals, i.e., tracks that include both real and fake speech segments. However, generating high-quality spliced audio is not as straightforward as it may appear. Spliced signals are typically created through basic signal concatenation. This process could introduce noticeable artifacts that can make the generated data easier to detect. We analyze spliced audio tracks resulting from signal concatenation, investigate their artifacts and assess whether such artifacts introduce any bias in existing datasets. Our findings reveal that by analyzing splicing artifacts, we can achieve a detection EER of 6.16% and 7.36% on PartialSpoof and HAD datasets, respectively, without needing to train any detector. These results underscore the complexities of generating reliable spliced audio data and lead to discussions that can help improve future research in this area.
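
For readers unfamiliar with the Equal Error Rate (EER) quoted above, the following is a minimal sketch of how such a figure can be computed from detector scores. It assumes that higher scores mean "more likely bona fide" and uses a simple threshold sweep; the function name and the toy scores are illustrative and are not taken from the paper.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold."""
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # False acceptance rate: spoofed samples scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores (assumptions), with one overlapping sample on each side.
eer_overlap = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
eer_separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
```

An EER of 0 corresponds to perfectly separable scores, while 0.5 corresponds to chance for a binary detector.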

3.7 CV Jul 9, 2024

Hiding Local Manipulations on SAR Images: a Counter-Forensic Attack

Sara Mandelli, Edoardo Daniele Cannas, Paolo Bestagini et al.

The vast accessibility of Synthetic Aperture Radar (SAR) images through online portals has propelled research across various fields. This widespread use and easy availability have unfortunately made SAR data susceptible to malicious alterations, such as local editing applied to the images for inserting or covering the presence of sensitive targets. Vulnerability is further emphasized by the fact that most SAR products, despite their original complex nature, are often released as amplitude-only information, allowing even inexperienced attackers to edit and easily alter the pixel content. To counter malicious manipulations, in recent years the forensic community has begun to dig into the SAR manipulation issue, proposing detectors that effectively localize the tampering traces in amplitude images. Nonetheless, in this paper we demonstrate that an expert practitioner can exploit the complex nature of SAR data to obscure any signs of manipulation within a locally altered amplitude image. We refer to this approach as a counter-forensic attack. To achieve the concealment of manipulation traces, the attacker can simulate a re-acquisition of the manipulated scene by the SAR system that initially generated the pristine image. In doing so, the attacker can obscure any evidence of manipulation, making it appear as if the image was legitimately produced by the system. This attack has unique features that make it both highly generalizable and relatively easy to apply. First, it is a black-box attack, meaning it is not designed to deceive a specific forensic detector. Furthermore, it does not require a training phase and is not based on adversarial operations. We assess the effectiveness of the proposed counter-forensic approach across diverse scenarios, examining various manipulation operations.

4.1 LG Nov 10, 2025

Enhanced Water Leak Detection with Convolutional Neural Networks and One-Class Support Vector Machine

Daniele Ugo Leonzio, Paolo Bestagini, Marco Marcon et al.

Water is a critical resource that must be managed efficiently. However, a substantial amount of water is lost each year due to leaks in Water Distribution Networks (WDNs). This underscores the need for reliable and effective leak detection and localization systems. In recent years, various solutions have been proposed, with data-driven approaches gaining increasing attention due to their superior performance. In this paper, we propose a new method for leak detection. The method is based on water pressure measurements acquired at a series of nodes of a WDN. Our technique is a fully data-driven solution that uses only the knowledge of the WDN topology and a series of pressure data acquisitions obtained in the absence of leaks. The proposed solution is based on a feature extractor and a one-class Support Vector Machine (SVM) trained on no-leak data, so that leaks are detected as anomalies. The results achieved on a simulated dataset using the Modena WDN demonstrate that the proposed solution outperforms recent methods for leak detection.
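
The "train on no-leak data, flag anomalies" idea can be sketched as follows. This is not the paper's method: a per-node z-score threshold is swapped in for the CNN feature extractor and one-class SVM so that the example stays dependency-free, and all names and pressure values are hypothetical.

```python
import statistics

def fit_no_leak_model(no_leak_readings):
    """Estimate per-node mean and standard deviation from leak-free data.

    no_leak_readings: list of pressure vectors, one value per network node.
    """
    columns = list(zip(*no_leak_readings))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return means, stds

def is_anomalous(reading, model, z_threshold=3.0):
    # Flag the reading if any node deviates strongly from its no-leak behavior.
    means, stds = model
    return any(abs(x - m) / s > z_threshold
               for x, m, s in zip(reading, means, stds))

# Toy leak-free pressure readings for two nodes (assumed values).
model = fit_no_leak_model([[50.0, 40.0], [50.2, 39.8], [49.8, 40.2], [50.1, 39.9]])
normal_flag = is_anomalous([50.0, 40.1], model)
leak_flag = is_anomalous([47.0, 40.0], model)
```

As in the one-class setting described above, the model never sees leak examples during fitting. A leak is anything that falls outside the region learned from normal operation.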

4.6 LG Apr 8, 2024 Code

Back to the Future: GNN-based NO$_2$ Forecasting via Future Covariates

Antonio Giganti, Sara Mandelli, Paolo Bestagini et al.

Due to recent environmental concerns about keeping contaminant emissions at bay in urban areas, air pollution forecasting has risen to the forefront of research worldwide. When predicting pollutant concentrations, it is common to include the effects of environmental factors that influence these concentrations within an extended period, like traffic, meteorological conditions and geographical information. Most of the existing approaches exploit this information as past covariates, i.e., past exogenous variables that affected the pollutant but were not affected by it. In this paper, we present a novel forecasting methodology to predict NO$_2$ concentration via both past and future covariates. Future covariates are represented by weather forecasts and future calendar events, which are already known at prediction time. In particular, we deal with air quality observations in a city-wide network of ground monitoring stations, modeling the data structure and estimating the predictions with a Spatiotemporal Graph Neural Network (STGNN). We propose a conditioning block that embeds past and future covariates into the current observations. After extracting meaningful spatiotemporal representations, these are fused together and projected into the forecasting horizon to generate the final prediction. To the best of our knowledge, it is the first time that future covariates are included in time series predictions in a structured way. Remarkably, we find that conditioning on future weather information has a greater impact than considering past traffic conditions. We release our code implementation at https://github.com/polimi-ispl/MAGCRN.

14.5 IV Dec 4, 2024 Code

Is JPEG AI going to change image forensics?

Edoardo Daniele Cannas, Sara Mandelli, Nataša Popović et al.

In this paper, we investigate the counter-forensic effects of the new JPEG AI standard based on neural image compression, focusing on two critical areas: deepfake image detection and image splicing localization. Neural image compression leverages advanced neural network algorithms to achieve higher compression rates while maintaining image quality. However, it introduces artifacts that closely resemble those generated by image synthesis techniques and image splicing pipelines, complicating the work of researchers when discriminating pristine from manipulated content. We comprehensively analyze JPEG AI's counter-forensic effects through extensive experiments on several state-of-the-art detectors and datasets. Our results demonstrate a reduction in the performance of leading forensic detectors when analyzing content processed through JPEG AI. By exposing the vulnerabilities of the available forensic tools, we aim to highlight the urgent need for multimedia forensics researchers to include JPEG AI images in their experimental setups and develop robust forensic techniques to distinguish between neural compression artifacts and actual manipulations.

4.9 SD Feb 22, 2024

Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer

Amit Kumar Singh Yadav, Ziyue Xiang, Kratika Bhagtani et al.

Many deep learning synthetic speech generation tools are readily available. The use of synthetic speech has caused financial fraud, impersonation of people, and misinformation to spread. For this reason, forensic methods that can detect synthetic speech have been proposed. Existing methods often overfit on one dataset and their performance reduces substantially in practical scenarios such as detecting synthetic speech shared on social platforms. In this paper, we propose Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT), a synthetic speech detector that converts a time domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network. We evaluate the detection performance of PS3DT on the ASVspoof2019 dataset. Our experiments show that PS3DT performs well on the ASVspoof2019 dataset compared to other approaches using spectrograms for synthetic speech detection. We also investigate the generalization performance of PS3DT on the In-the-Wild dataset. PS3DT generalizes better than several existing methods on detecting synthetic speech from an out-of-distribution dataset. We also evaluate the robustness of PS3DT in detecting telephone quality synthetic speech and synthetic speech shared on social platforms (compressed speech). PS3DT is robust to compression and can detect telephone quality synthetic speech better than several existing methods.

7.3 IV May 23, 2023

Multi-BVOC Super-Resolution Exploiting Compounds Inter-Connection

Antonio Giganti, Sara Mandelli, Paolo Bestagini et al.

Biogenic Volatile Organic Compounds (BVOCs) emitted from the terrestrial ecosystem into the Earth's atmosphere are an important component of atmospheric chemistry. Due to the scarcity of measurements, a reliable enhancement of BVOC emission maps can aid in providing denser data for atmospheric chemical, climate, and air quality models. In this work, we propose a strategy to super-resolve coarse BVOC emission maps by simultaneously exploiting the contributions of different compounds. To this purpose, we first accurately investigate the spatial inter-connections between several BVOC species. Then, we exploit the found similarities to build a Multi-Image Super-Resolution (MISR) system, in which a number of emission maps associated with diverse compounds are aggregated to boost Super-Resolution (SR) performance. We compare different configurations regarding the species and the number of joined BVOCs. Our experimental results show that incorporating BVOCs' relationship into the process can substantially improve the accuracy of the super-resolved maps. Interestingly, the best results are achieved when we aggregate the emission maps of strongly uncorrelated compounds. This peculiarity seems to confirm what was already guessed for other data-domains, i.e., joined uncorrelated information is more helpful than correlated information to boost MISR performance. Nonetheless, the proposed work represents the first attempt in SR of BVOC emissions through the fusion of multiple different compounds.

6.6 IV Jan 7, 2022

Amplitude SAR Imagery Splicing Localization

Edoardo Daniele Cannas, Nicolò Bonettini, Sara Mandelli et al.

Synthetic Aperture Radar (SAR) images are a valuable asset for a wide variety of tasks. In the last few years, many websites have been offering them for free in the form of easy to manage products, favoring their widespread diffusion and research work in the SAR field. The drawback of these opportunities is that such images might be exposed to forgeries and manipulations by malicious users, raising new concerns about their integrity and trustworthiness. Up to now, the multimedia forensics literature has proposed various techniques to localize manipulations in natural photographs, but the integrity assessment of SAR images was never investigated. This task poses new challenges, since SAR images are generated with a processing chain completely different from that of natural photographs. This implies that many forensics methods developed for natural images are not guaranteed to succeed. In this paper, we investigate the problem of amplitude SAR imagery splicing localization. Our goal is to localize regions of an amplitude SAR image that have been copied and pasted from another image, possibly undergoing some kind of editing in the process. To do so, we leverage a Convolutional Neural Network (CNN) to extract a fingerprint highlighting inconsistencies in the processing traces of the analyzed input. Then, we examine this fingerprint to produce a binary tampering mask indicating the pixel region under splicing attack. Results show that our proposed method, tailored to the nature of SAR signals, provides better performances than state-of-the-art forensic tools developed for natural images.

9.4 CV Dec 16, 2021

Forensic Analysis of Synthetically Generated Western Blot Images

Sara Mandelli, Davide Cozzolino, Edoardo D. Cannas et al.

The widespread diffusion of synthetically generated content is a serious threat that needs urgent countermeasures. As a matter of fact, the generation of synthetic content is not restricted to multimedia data like videos, photographs or audio sequences, but covers a significantly vast area that can include biological images as well, such as western blot and microscopic images. In this paper, we focus on the detection of synthetically generated western blot images. These images are largely explored in the biomedical literature and it has been already shown they can be easily counterfeited with little hope of spotting manipulations by visual inspection or by using standard forensics detectors. To overcome the absence of publicly available data for this task, we create a new dataset comprising more than 14K original western blot images and 24K synthetic western blot images, generated using four different state-of-the-art generation methods. We investigate different strategies to detect synthetic western blots, exploring binary classification methods as well as one-class detectors. In both scenarios, we never exploit synthetic western blot images at training stage. The achieved results show that synthetically generated western blot images can be spotted with good accuracy, even though the exploited detectors are not optimized over synthetic versions of these scientific images. We also test the robustness of the developed detectors against post-processing operations commonly performed on scientific images, showing that we can be robust to JPEG compression and that some generative models are easily recognizable, even though editing might alter the artifacts they leave.

3.3 MM May 13, 2021

Forensic Analysis of Video Files Using Metadata

Ziyue Xiang, János Horváth, Sriram Baireddy et al.

The unprecedented ease and ability to manipulate video content has led to a rapid spread of manipulated media. The availability of video editing tools greatly increased in recent years, allowing one to easily generate photo-realistic alterations. Such manipulations can leave traces in the metadata embedded in video files. This metadata information can be used to determine video manipulations, brand of video recording device, the type of video editing tool, and other important evidence. In this paper, we focus on the metadata contained in the popular MP4 video wrapper/container. We describe our metadata extractor, which uses the MP4's tree structure. Our approach for analyzing the video metadata produces a more compact representation. We will describe how we construct features from the metadata and then use dimensionality reduction and nearest neighbor classification for forensic analysis of a video file. Our approach allows one to visually inspect the distribution of metadata features and make decisions. The experimental results confirm that the performance of our approach surpasses other methods.
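
The nearest neighbor step mentioned above can be sketched in a few lines. The example assumes the metadata features have already been reduced to low-dimensional vectors. The 2-D points, the labels, and the function name are hypothetical and do not come from the paper.

```python
import math

def nearest_neighbor_label(query, train_features, train_labels):
    # Pick the training sample with the smallest Euclidean distance to the query.
    best = min(range(len(train_features)),
               key=lambda i: math.dist(query, train_features[i]))
    return train_labels[best]

# Toy 2-D features (assumptions), e.g. after dimensionality reduction.
train_features = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.8, 5.0)]
train_labels = ["camera_original", "camera_original", "edited", "edited"]

pred_a = nearest_neighbor_label((0.1, 0.2), train_features, train_labels)
pred_b = nearest_neighbor_label((4.9, 4.7), train_features, train_labels)
```

Because the reduced features are low-dimensional, the same points can also be plotted, which matches the abstract's remark about visually inspecting the feature distribution.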

2.3SPJan 27, 2021Code

Anti-Aliasing Add-On for Deep Prior Seismic Data Interpolation

Francesco Picetti, Vincenzo Lipari, Paolo Bestagini et al.

Data interpolation is a fundamental step in any seismic processing workflow. Among the machine learning techniques recently proposed to solve data interpolation as an inverse problem, the Deep Prior paradigm aims at employing a convolutional neural network to capture priors on the data in order to regularize the inversion. However, this technique lacks reconstruction precision when interpolating highly decimated data due to the presence of aliasing. In this work, we propose to improve Deep Prior inversion by adding a directional Laplacian as a regularization term to the problem. This regularizer drives the optimization towards solutions that honor the slopes estimated from the low frequencies of the interpolated data. We provide some numerical examples to showcase the methodology devised in this manuscript, showing that our results are less prone to aliasing even in the presence of noisy and corrupted data.

2.3MMDec 7, 2020Code

DIPPAS: A Deep Image Prior PRNU Anonymization Scheme

Francesco Picetti, Sara Mandelli, Paolo Bestagini et al.

Source device identification is an important topic in image forensics since it allows one to trace back the origin of an image. Its forensic counterpart is source device anonymization, that is, masking any trace on the image that can be useful for identifying the source device. A typical trace exploited for source device identification is the Photo Response Non-Uniformity (PRNU), a noise pattern left by the device on the acquired images. In this paper, we devise a methodology for suppressing such a trace from natural images without significant impact on image quality. Specifically, we turn PRNU anonymization into an optimization problem in a Deep Image Prior (DIP) framework. In a nutshell, a Convolutional Neural Network (CNN) acts as a generator and returns an image that is anonymized with respect to the source PRNU, while still maintaining high visual quality. Unlike widely adopted deep learning paradigms, our proposed CNN is not trained on a set of input-target pairs of images. Instead, it is optimized to reconstruct the PRNU-free image from the original image under analysis itself. This makes the approach particularly suitable in scenarios where large heterogeneous databases are analyzed and prevents any problem due to lack of generalization. Through numerical examples on publicly available datasets, we prove our methodology to be effective compared to state-of-the-art techniques.

11.6CVNov 16, 2020

Training Strategies and Data Augmentations in CNN-based DeepFake Video Detection

Luca Bondi, Edoardo Daniele Cannas, Paolo Bestagini et al.

The fast and continuous growth in number and quality of deepfake videos calls for the development of reliable detection systems capable of automatically warning users on social media and on the Internet about the potential untruthfulness of such contents. While algorithms, software, and smartphone apps are getting better every day in generating manipulated videos and swapping faces, the accuracy of automated systems for face forgery detection in videos is still quite limited and generally biased toward the dataset used to design and train a specific detection system. In this paper we analyze how different training strategies and data augmentation techniques affect CNN-based deepfake detectors when training and testing on the same dataset or across different datasets.

9.1CVSep 25, 2020

Training CNNs in Presence of JPEG Compression: Multimedia Forensics vs Computer Vision

Sara Mandelli, Nicolò Bonettini, Paolo Bestagini et al.

Convolutional Neural Networks (CNNs) have proved very accurate in multiple computer vision image classification tasks that required visual inspection in the past (e.g., object recognition, face detection, etc.). Motivated by these astonishing results, researchers have also started using CNNs to cope with image forensic problems (e.g., camera model identification, tampering detection, etc.). However, in computer vision, image classification methods typically rely on visual cues easily detectable by human eyes. Conversely, forensic solutions rely on almost invisible traces that are often very subtle and lie in the fine details of the image under analysis. For this reason, training a CNN to solve a forensic task requires some special care, as common processing operations (e.g., resampling, compression, etc.) can strongly hinder forensic traces. In this work, we focus on the effect that JPEG has on CNN training considering different computer vision and forensic image classification problems. Specifically, we consider the issues that arise from JPEG compression and misalignment of the JPEG grid. We show that it is necessary to consider these effects when generating a training dataset in order to properly train a forensic detector without losing generalization capability, whereas it is possible to almost ignore these effects for computer vision tasks.

4.3MMMay 20, 2020

A Modified Fourier-Mellin Approach for Source Device Identification on Stabilized Videos

Sara Mandelli, Fabrizio Argenti, Paolo Bestagini et al.

To decide whether a digital video has been captured by a given device, multimedia forensic tools usually exploit characteristic noise traces left by the camera sensor on the acquired frames. This analysis requires that the noise pattern characterizing the camera and the noise pattern extracted from the video frames under analysis are geometrically aligned. However, in many practical scenarios this does not occur, thus a re-alignment or synchronization has to be performed. Current solutions often require a time-consuming search of the realignment transformation parameters. In this paper, we propose to overcome this limitation by searching scaling and rotation parameters in the frequency domain. The proposed algorithm, tested on real videos from a well-known state-of-the-art dataset, shows promising results.

11.6CVApr 16, 2020Code

On the use of Benford's law to detect GAN-generated images

Nicolò Bonettini, Paolo Bestagini, Simone Milani et al.

The advent of Generative Adversarial Network (GAN) architectures has given anyone the ability to generate incredibly realistic synthetic imagery. The malicious diffusion of GAN-generated images may lead to serious social and political consequences (e.g., fake news spreading, opinion formation, etc.). It is therefore important to regulate the widespread distribution of synthetic imagery by developing solutions able to detect them. In this paper, we study the possibility of using Benford's law to discriminate GAN-generated images from natural photographs. Benford's law describes the distribution of the most significant digit for quantized Discrete Cosine Transform (DCT) coefficients. Extending and generalizing this property, we show that it is possible to extract a compact feature vector from an image. This feature vector can be fed to an extremely simple classifier for GAN-generated image detection.
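The first-digit statistic behind this idea can be illustrated with a short sketch. It is a minimal example, assuming the classic base-10 Benford distribution p(d) = log10(1 + 1/d) and a plain list of integer quantized DCT coefficients as input. The function names and the squared-error divergence are hypothetical choices made for illustration; they are not the feature extraction described in the paper, which extends and generalizes the law.

```python
import math
from collections import Counter


def benford_pmf():
    """Classic Benford probabilities for first digits 1..9."""
    return [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(n):
    """Most significant decimal digit of a nonzero integer."""
    n = abs(n)
    while n >= 10:
        n //= 10
    return n


def first_digit_histogram(coeffs):
    """Empirical first-digit frequencies of the nonzero coefficients."""
    digits = [first_digit(c) for c in coeffs if c != 0]
    if not digits:
        raise ValueError("need at least one nonzero coefficient")
    counts = Counter(digits)
    return [counts.get(d, 0) / len(digits) for d in range(1, 10)]


def benford_divergence(coeffs):
    """Sum of squared differences between empirical and Benford frequencies.

    Illustrative scalar feature only (an assumption, not the paper's feature).
    """
    empirical = first_digit_histogram(coeffs)
    return sum((e - p) ** 2 for e, p in zip(empirical, benford_pmf()))


if __name__ == "__main__":
    # Hypothetical quantized DCT coefficients, for demonstration only.
    sample = [12, -3, 0, 105, 1, -27, 9, 14, 0, 2]
    print(first_digit_histogram(sample))
    print(benford_divergence(sample))
```

A small divergence means the first digits follow the reference distribution closely; a classifier could take such divergences, computed per DCT frequency, as input features.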

22.6CVApr 16, 2020Code

Video Face Manipulation Detection Through Ensemble of CNNs

Nicolò Bonettini, Edoardo Daniele Cannas, Sara Mandelli et al.

In the last few years, several techniques for facial manipulation in videos have been successfully developed and made available to the masses (e.g., FaceSwap, deepfake, etc.). These methods enable anyone to easily edit faces in video sequences with incredibly realistic results and very little effort. Despite the usefulness of these tools in many fields, if used maliciously, they can have a significantly bad impact on society (e.g., fake news spreading, cyber bullying through fake revenge porn). The ability to objectively detect whether a face has been manipulated in a video sequence is then a task of utmost importance. In this paper, we tackle the problem of face manipulation detection in video sequences targeting modern facial manipulation techniques. In particular, we study the ensembling of different trained Convolutional Neural Network (CNN) models. In the proposed solution, different models are obtained starting from a base network (i.e., EfficientNetB4) making use of two different concepts: (i) attention layers; (ii) siamese training. We show that combining these networks leads to promising face manipulation detection results on two publicly available datasets with more than 119000 videos.

8.6LGJun 20, 2019Code

We Need No Pixels: Video Manipulation Detection Using Stream Descriptors

David Güera, Sriram Baireddy, Paolo Bestagini et al.

Manipulating video content is easier than ever. Due to the misuse potential of manipulated content, multiple detection techniques that analyze the pixel data from the videos have been proposed. However, clever manipulators should also carefully forge the metadata and auxiliary header information, which is harder to do for videos than images. In this paper, we propose to identify forged videos by analyzing their multimedia stream descriptors with simple binary classifiers, completely avoiding the pixel space. Using well-known datasets, our results show that this scalable approach can achieve a high manipulation detection score if the manipulators have not done a careful data sanitization of the multimedia stream descriptors.

5.4CVApr 11, 2019

An In-Depth Study on Open-Set Camera Model Identification

Pedro Ribeiro Mendes Júnior, Luca Bondi, Paolo Bestagini et al.

Camera model identification refers to the problem of linking a picture to the camera model used to shoot it. As this might be an enabling factor in different forensic applications to single out possible suspects (e.g., detecting the author of child abuse or terrorist propaganda material), many accurate camera model attribution methods have been developed in the literature. One of their main drawbacks, however, is the typical closed-set assumption of the problem. This means that an investigated photograph is always assigned to one camera model within a set of known ones present during investigation, i.e., training time, and the fact that the picture can come from a completely unrelated camera model during actual testing is usually ignored. Under realistic conditions, it is not possible to assume that every picture under analysis belongs to one of the available camera models. To deal with this issue, in this paper, we present the first in-depth study on the possibility of solving the camera model identification problem in open-set scenarios. Given a photograph, we aim at detecting whether it comes from one of the known camera models of interest or from an unknown one. We compare different feature extraction algorithms and classifiers specially targeting open-set recognition. We also evaluate possible open-set training protocols that can be applied along with any open-set classifier, observing that a simple one among those alternatives obtains the best results. Thorough testing on independent datasets shows that it is possible to leverage a recently proposed convolutional neural network as a feature extractor, paired with a properly trained open-set classifier, to solve the open-set camera model attribution problem even on small-scale image patches, improving over state-of-the-art available solutions.

13.5NEJan 23, 2019

Interpolation and Denoising of Seismic Data using Convolutional Neural Networks

Sara Mandelli, Vincenzo Lipari, Paolo Bestagini et al.

Seismic data processing algorithms greatly benefit from regularly sampled and reliable data. Therefore, interpolation and denoising play a fundamental role as one of the starting steps of most seismic processing workflows. We exploit convolutional neural networks for the joint tasks of interpolation and random noise attenuation of 2D common shot gathers. Inspired by the great contributions achieved in image processing and computer vision, we investigate a particular architecture of convolutional neural network referred to as U-net, which implements a convolutional autoencoder able to describe the complex features of clean and regularly sampled data for reconstructing the corrupted ones. In the training phase we exploit part of the data for tailoring the network to the specific tasks of interpolation, denoising and joint denoising/interpolation, while during system deployment we are able to recover the remaining corrupted shot gathers in a computationally efficient procedure. We consider a plurality of data corruptions in our numerical experiments, including different noise models and different distributions of missing traces. Several examples on synthetic and field data illustrate the appealing features of the aforementioned strategy. Comparative examples show improvements with respect to recently proposed solutions for joint denoising and interpolation.

10.3MMNov 5, 2018

Facing Device Attribution Problem for Stabilized Video Sequences

Sara Mandelli, Paolo Bestagini, Luisa Verdoliva et al.

A problem deeply investigated by multimedia forensics researchers is that of detecting which device has been used to capture a video. This makes it possible to trace the owner of a video sequence, which proves extremely helpful to solve copyright infringement cases as well as to fight distribution of illicit material (e.g., underage clips, terroristic threats, etc.). Currently, the most promising methods to tackle this task exploit unique noise traces left by camera sensors on acquired images. However, given the recent advancements in motion stabilization of video content, the robustness of sensor pattern noise-based techniques is strongly hindered. Indeed, video stabilization introduces geometric transformations between video frames, thus making camera fingerprint estimation problematic with classical approaches. In this paper, we deal with the challenging problem of attributing stabilized videos to their recording device. Specifically, we propose: (i) a strategy to extract the characteristic fingerprint of a device, starting from either a set of images or stabilized video sequences; (ii) a strategy to match a stabilized video sequence with a given fingerprint in order to solve the device attribution problem. The proposed methodology is tested on videos coming from a set of different smartphones, taken from the modern publicly available Vision Dataset. The conducted experiments also provide an interesting insight into the effect of modern smartphone video stabilization algorithms on specific video frames.

1.5LGOct 2, 2018

Landmine Detection Using Autoencoders on Multi-polarization GPR Volumetric Data

Paolo Bestagini, Federico Lombardi, Maurizio Lualdi et al.

Buried landmines and unexploded remnants of war are a constant threat for the population of many countries that have been hit by wars in the past years. The huge number of human lives lost due to this phenomenon has been a strong motivation for the research community toward the development of safe and robust techniques designed for landmine clearance. Nonetheless, being able to detect and localize buried landmines with high precision in an automatic fashion is still considered a challenging task due to the many different boundary conditions that characterize this problem (e.g., several kinds of objects to detect, different soils and meteorological conditions, etc.). In this paper, we propose a novel technique for buried object detection tailored to unexploded landmine discovery. The proposed solution exploits a specific kind of convolutional neural network (CNN) known as an autoencoder to analyze volumetric data acquired with ground penetrating radar (GPR) using different polarizations. This method works in an anomaly detection framework: we only train the autoencoder on GPR data acquired on landmine-free areas. The system then recognizes landmines as objects that are dissimilar to the soil used during the training step. Experiments conducted on real data show that the proposed technique requires little training and no ad-hoc data pre-processing to achieve accuracy higher than 93% on challenging datasets.

8.3CVMay 6, 2018

A Counter-Forensic Method for CNN-Based Camera Model Identification

David Güera, Yu Wang, Luca Bondi et al.

An increasing number of digital images are being shared and accessed through websites, media, and social applications. Many of these images have been modified and are not authentic. Recent advances in the use of deep convolutional neural networks (CNNs) have facilitated the task of analyzing the veracity and authenticity of largely distributed image datasets. We examine in this paper the problem of identifying the camera model or type that was used to take an image and that can be spoofed. Due to the linear nature of CNNs and the high-dimensionality of images, neural networks are vulnerable to attacks with adversarial examples. These examples are imperceptibly different from correctly classified images but are misclassified with high confidence by CNNs. In this paper, we describe a counter-forensic method capable of subtly altering images to change their estimated camera model when they are analyzed by any CNN-based camera model detector. Our method can use either the Fast Gradient Sign Method (FGSM) or the Jacobian-based Saliency Map Attack (JSMA) to craft these adversarial images and does not require direct access to the CNN. Our results show that even advanced deep learning architectures trained to analyze images and obtain camera model information are still vulnerable to our proposed method.
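As a rough illustration of the Fast Gradient Sign Method mentioned above, the sketch below perturbs an input by a step of size eps in the direction of the sign of the loss gradient, x_adv = x + eps * sign(grad_x loss). It is a minimal example under strong assumptions: a toy logistic-regression classifier stands in for the CNN, the attacker has full access to its weights (unlike the setting in the abstract, which does not require direct access), and all names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a toy logistic model (stand-in for a CNN)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def input_gradient(w, b, x, y):
    """Gradient of the binary cross-entropy loss with respect to the input x.

    For a logistic model this is (p - y) * w, where p is the predicted
    probability and y is the true label (0 or 1).
    """
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each input component by eps along the gradient sign."""
    grad = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


if __name__ == "__main__":
    # Hypothetical weights and input, for demonstration only.
    w, b = [2.0, -1.0], 0.0
    x, y = [0.5, 0.2], 1
    x_adv = fgsm(w, b, x, y, eps=0.3)
    print(predict(w, b, x), predict(w, b, x_adv))
```

Each component changes by at most eps, so the perturbation is bounded in the max norm while the loss on the true label increases.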

3.9CVMay 4, 2018

Reliability Map Estimation For CNN-Based Camera Model Attribution

David Güera, Sri Kalyan Yarlagadda, Paolo Bestagini et al.

Among the image forensic issues investigated in the last few years, great attention has been devoted to blind camera model attribution. This refers to the problem of detecting which camera model has been used to acquire an image by only exploiting pixel information. Solving this problem has great impact on image integrity assessment as well as on authenticity verification. Recent advancements that use convolutional neural networks (CNNs) in the media forensic field have enabled camera model attribution methods to work well even on small image patches. These improvements are also important for determining forgery localization. Some patches of an image may not contain enough information related to the camera model (e.g., saturated patches). In this paper, we propose a CNN-based solution to estimate the camera model attribution reliability of a given image patch. We show that we can estimate a reliability-map indicating which portions of the image contain reliable camera traces. Testing using a well known dataset confirms that by using this information, it is possible to increase small patch camera model attribution accuracy by more than 8% on a single patch.

12.1CVFeb 13, 2018

Satellite Image Forgery Detection and Localization Using GAN and One-Class Classifier

Sri Kalyan Yarlagadda, David Güera, Paolo Bestagini et al.

Current satellite imaging technology enables shooting high-resolution pictures of the ground. Like any other kind of digital image, overhead pictures can also be easily forged. However, common image forensic techniques are often developed for consumer camera images, which strongly differ in their nature from satellite ones (e.g., compression schemes, post-processing, sensors, etc.). Therefore, many accurate state-of-the-art forensic algorithms are bound to fail if blindly applied to overhead image analysis. Development of novel forensic tools for satellite images is paramount to assess their authenticity and integrity. In this paper, we propose an algorithm for satellite image forgery detection and localization. Specifically, we consider the scenario in which pixels within a region of a satellite image are replaced to add or remove an object from the scene. Our algorithm works under the assumption that no forged images are available for training. Using a generative adversarial network (GAN), we learn a feature representation of pristine satellite images. A one-class support vector machine (SVM) is trained on these features to determine their distribution. Finally, image forgeries are detected as anomalies. The proposed algorithm is validated against different kinds of satellite images containing forgeries of different size and shape.

27.5CRAug 2, 2017

Aligned and Non-Aligned Double JPEG Detection Using Convolutional Neural Networks

Mauro Barni, Luca Bondi, Nicolò Bonettini et al.

Due to the wide diffusion of the JPEG coding standard, the image forensic community has devoted significant attention to the development of double JPEG (DJPEG) compression detectors through the years. The ability to detect whether an image has been compressed twice provides paramount information toward image authenticity assessment. Given the popularity recently gained by convolutional neural networks (CNNs) in many computer vision tasks, in this paper we propose to use CNNs for aligned and non-aligned double JPEG compression detection. In particular, we explore the capability of CNNs to capture DJPEG artifacts directly from images. Results show that the proposed CNN-based detectors achieve good performance even with small images (i.e., 64x64), outperforming state-of-the-art solutions, especially in the non-aligned case. Besides, good results are also achieved in the commonly recognized challenging case in which the first quality factor is larger than the second one.

1.2ITOct 14, 2016

A Geometrical-Statistical approach to outlier removal for TDOA measurements

Marco Compagnoni, Alessia Pini, Antonio Canclini et al.

The curse of outlier measurements in estimation problems is a well-known issue in a variety of fields. Therefore, outlier removal procedures, which enable the identification of spurious measurements within a set, have been developed for many different scenarios and applications. In this paper, we propose a statistically motivated outlier removal algorithm for time differences of arrival (TDOAs), or equivalently range differences (RDs), acquired at sensor arrays. The method exploits the TDOA-space formalism and works by only knowing relative sensor positions. As the proposed method is completely independent of the application for which measurements are used, it can be reliably used to identify outliers within a set of TDOA/RD measurements in different fields (e.g., acoustic source localization, sensor synchronization, radar, remote sensing, etc.). The proposed outlier removal algorithm is validated by means of synthetic simulations and real experiments.
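The equivalence between TDOAs and range differences can be made concrete with a short sketch: a range difference is the TDOA multiplied by the propagation speed. The example below is a minimal illustration, assuming an acoustic setting with a propagation speed of 343 m/s and a reference sensor at index 0. It also includes a simple feasibility check based on the reverse triangle inequality (a noise-free range difference cannot exceed the distance between the two sensors). That check is a basic geometric sanity test chosen for illustration and is not the statistical algorithm proposed in the paper; all names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed propagation speed for the acoustic case


def range_differences(source, sensors, ref=0):
    """Range differences of each non-reference sensor relative to the reference."""
    d_ref = math.dist(source, sensors[ref])
    return [math.dist(source, s) - d_ref
            for i, s in enumerate(sensors) if i != ref]


def tdoas(source, sensors, c=SPEED_OF_SOUND, ref=0):
    """Noise-free TDOAs: range differences divided by the propagation speed."""
    return [rd / c for rd in range_differences(source, sensors, ref)]


def violates_triangle_bound(tdoa, sensor, ref_sensor, c=SPEED_OF_SOUND, tol=1e-9):
    """True if |c * tdoa| exceeds the distance between the two sensors.

    Such a measurement cannot come from any source position, so it is a
    candidate outlier (illustrative check, not the paper's method).
    """
    return abs(c * tdoa) > math.dist(sensor, ref_sensor) + tol


if __name__ == "__main__":
    # Hypothetical 2D geometry, for demonstration only.
    source = (0.0, 0.0)
    sensors = [(3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
    print(range_differences(source, sensors))
    print(tdoas(source, sensors))
```

Only relative sensor positions enter the feasibility check, which matches the abstract's remark that the method works by knowing relative sensor positions alone.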

16.5CVMar 3, 2016

First Steps Toward Camera Model Identification with Convolutional Neural Networks

Luca Bondi, Luca Baroffio, David Güera et al.

Detecting the camera model used to shoot a picture makes it possible to solve a wide series of forensic problems, from copyright infringement to ownership attribution. For this reason, the forensic community has developed a set of camera model identification algorithms that exploit characteristic traces left on acquired images by the processing pipelines specific to each camera model. In this paper, we investigate a novel approach to solve the camera model identification problem. Specifically, we propose a data-driven algorithm based on convolutional neural networks, which learns features characterizing each camera model directly from the acquired pictures. Results on a well-known dataset of 18 camera models show that: (i) the proposed method outperforms up-to-date state-of-the-art algorithms on classification of 64x64 color image patches; (ii) features learned by the proposed network generalize to camera models never used for training.

7.7CVOct 13, 2015Code

Deep convolutional neural networks for pedestrian detection

Denis Tomè, Federico Monti, Luca Baroffio et al.

Pedestrian detection is a popular research topic due to its paramount importance for a number of applications, especially in the fields of automotive, surveillance and robotics. Despite the significant improvements, pedestrian detection is still an open challenge that calls for more and more accurate algorithms. In the last few years, deep learning and in particular convolutional neural networks emerged as the state of the art in terms of accuracy for a number of computer vision tasks such as image classification, object detection and segmentation, often outperforming the previous gold standards by a large margin. In this paper, we propose a pedestrian detection system based on deep learning, adapting a general-purpose convolutional network to the task at hand. By thoroughly analyzing and optimizing each step of the detection pipeline we propose an architecture that outperforms traditional methods, achieving a task accuracy close to that of state-of-the-art approaches, while requiring a low computational time. Finally, we tested the system on an NVIDIA Jetson TK1, a 192-core platform that is envisioned to be a forerunner computational brain of future self-driving cars.

1.3CVMar 24, 2015

Fast keypoint detection in video sequences

Luca Baroffio, Matteo Cesana, Alessandro Redondi et al.

A number of computer vision tasks exploit a succinct representation of the visual content in the form of sets of local features. Given an input image, feature extraction algorithms identify a set of keypoints and assign to each of them a description vector, based on the characteristics of the visual content surrounding the interest point. Several tasks might require local features to be extracted from a video sequence, on a frame-by-frame basis. Although temporal downsampling has been proven to be an effective solution for mobile augmented reality and visual search, high temporal resolution is a key requirement for time-critical applications such as object tracking, event recognition, pedestrian detection, and surveillance. In recent years, more and more computationally efficient visual feature detectors and descriptors have been proposed. Nonetheless, such approaches are tailored to still images. In this paper we propose a fast keypoint detection algorithm for video sequences that exploits the temporal coherence of the sequence of keypoints. According to the proposed method, each frame is preprocessed so as to identify the parts of the input frame for which keypoint detection and description need to be performed. Our experiments show that it is possible to achieve a reduction in computational time of up to 40%, without significantly affecting the task accuracy.