80.7CVJun 2Code
Training-Free Multi-Concept LoRA Composition with Prompt-Aware WeightingGeorgios Tsoumplekas, Stella Bounareli, Vasileios Argyriou
Low-Rank Adaptation (LoRA) successfully enables personalization in text-to-image generation by adapting pre-trained diffusion models to specific visual concepts and styles. However, extending such models to multi-concept customization remains challenging. Naively combining multiple LoRA weights or their outputs often leads to interference among concepts, resulting in degraded visual quality and reduced fidelity to the reference images of individual concepts. This paper proposes a simple yet effective approach for multi-concept customization by optimally combining the outputs of multiple LoRA modules. We leverage the relative importance of each concept during generation, as inferred from its corresponding prompt tokens and introduce two methods, W-Switch and W-Composite, that employ a prompt-aware importance weighting strategy in which each LoRA is weighted according to the semantic influence of its trigger words in the target prompt. In addition, we extend existing quantitative evaluation metrics by proposing a new image-based similarity evaluation framework that assesses image fidelity and identity preservation through comparisons between real-world reference images and automatically segmented concept regions from generated images. We evaluate our approach on the ComposLoRA testbed and demonstrate consistent improvements over existing state-of-the-art methods in terms of visual quality, identity preservation and compositionality. Qualitative evaluations, including a Large Language Model (LLM) based assessment and a user study, further validate the effectiveness of the proposed methods and align with the newly introduced quantitative image-based metrics. Our code is available at https://github.com/GeorgeTsoumplekas/Prompt-Aware-Multi-LoRA-Composition.
CVJul 20, 2023Code
HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget FacesStella Bounareli, Christos Tzelepis, Vasileios Argyriou et al.
In this paper, we present our method for neural face reenactment, called HyperReenact, that aims to generate realistic talking head images of a source identity, driven by a target facial pose. Existing state-of-the-art face reenactment methods train controllable generative models that learn to synthesize realistic facial images, yet producing reenacted faces that are prone to significant visual artifacts, especially under the challenging condition of extreme head pose changes, or requiring expensive few-shot fine-tuning to better preserve the source identity characteristics. We propose to address these limitations by leveraging the photorealistic generation ability and the disentangled properties of a pretrained StyleGAN2 generator, by first inverting the real images into its latent space and then using a hypernetwork to perform: (i) refinement of the source identity characteristics and (ii) facial pose re-targeting, eliminating this way the dependence on external editing methods that typically produce artifacts. Our method operates under the one-shot setting (i.e., using a single source frame) and allows for cross-subject reenactment, without requiring any subject-specific fine-tuning. We compare our method both quantitatively and qualitatively against several state-of-the-art techniques on the standard benchmarks of VoxCeleb1 and VoxCeleb2, demonstrating the superiority of our approach in producing artifact-free images, exhibiting remarkable robustness even under extreme head pose changes. We make the code and the pretrained models publicly available at: https://github.com/StelaBou/HyperReenact .
CVSep 27, 2022Code
StyleMask: Disentangling the Style Space of StyleGAN2 for Neural Face ReenactmentStella Bounareli, Christos Tzelepis, Vasileios Argyriou et al.
In this paper we address the problem of neural face reenactment, where, given a pair of a source and a target facial image, we need to transfer the target's pose (defined as the head pose and its facial expressions) to the source image, by preserving at the same time the source's identity characteristics (e.g., facial shape, hair style, etc), even in the challenging case where the source and the target faces belong to different identities. In doing so, we address some of the limitations of the state-of-the-art works, namely, a) that they depend on paired training data (i.e., source and target faces have the same identity), b) that they rely on labeled data during inference, and c) that they do not preserve identity in large head pose changes. More specifically, we propose a framework that, using unpaired randomly generated facial images, learns to disentangle the identity characteristics of the face from its pose by incorporating the recently introduced style space $\mathcal{S}$ of StyleGAN2, a latent representation space that exhibits remarkable disentanglement properties. By capitalizing on this, we learn to successfully mix a pair of source and target style codes using supervision from a 3D model. The resulting latent code, that is subsequently used for reenactment, consists of latent units corresponding to the facial pose of the target only and of units corresponding to the identity of the source only, leading to notable improvement in the reenactment performance compared to recent state-of-the-art methods. In comparison to state of the art, we quantitatively and qualitatively show that the proposed method produces higher quality results even on extreme pose variations. Finally, we report results on real images by first embedding them on the latent space of the pretrained generator. We make the code and pretrained models publicly available at: https://github.com/StelaBou/StyleMask
65.7ROApr 16
Foundation Models in Robotics: A Comprehensive Review of Methods, Models, Datasets, Challenges and Future Research DirectionsAggelos Psiris, Vasileios Argyriou, Evangelos K. Markakis et al.
Over the recent years, the field of robotics has been undergoing a transformative paradigm shift from fixed, single-task, domain-specific solutions towards adaptive, multi-function, general-purpose agents, capable of operating in complex, open-world, and dynamic environments. This tremendous advancement is primarily driven by the emergence of Foundation Models (FMs), i.e., large-scale neural-network architectures trained on massive, heterogeneous datasets that provide unprecedented capabilities in multi-modal understanding and reasoning, long-horizon planning, and cross-embodiment generalization. In this context, the current study provides a holistic, systematic, and in-depth review of the research landscape of FMs in robotics. In particular, the evolution of the field is initially delineated through five distinct research phases, spanning from the early incorporation of Natural Language Processing (NLP) and Computer Vision (CV) models to the current frontier of multi-sensory generalization and real-world deployment. Subsequently, a highly-granular taxonomic investigation of the literature is performed, examining the following key aspects: a) the employed FM types, including LLMs, VFMs, VLMs, and VLAs, b) the underlying neural-network architectures, c) the adopted learning paradigms, d) the different learning stages of knowledge incorporation, e) the major robotic tasks, and f) the main real-world application domains. For each aspect, comparative analysis and critical insights are provided. Moreover, a report on the publicly available datasets used for model training and evaluation across the considered robotic tasks is included. Furthermore, a hierarchical discussion on the current open challenges and promising future research directions in the field is incorporated.
77.4HCMay 4
Interactive Augmented Reality-enabled Outdoor Scene Visualization For Enhanced Real-time Disaster ResponseDimitrios Apostolakis, Georgios Angelidis, Vasileios Argyriou et al.
A user-centered AR interface for disaster response is presented in this work that uses 3D Gaussian Splatting (3DGS) to visualize detailed scene reconstructions, while maintaining situational awareness and keeping cognitive load low. The interface relies on a lightweight interaction approach, combining World-in-Miniature (WIM) navigation with semantic Points of Interest (POIs) that can be filtered as needed, and it is supported by an architecture designed to stream updates as reconstructions evolve. User feedback from a preliminary evaluation indicates that this design is easy to use and supports real-time coordination, with participants highlighting the value of interaction and POIs for fast decision-making in context. Thorough user-centric performance evaluation demonstrates strong usability of the developed interface and high acceptance ratios.
CVAug 20, 2024
A Closer Look at Data Augmentation Strategies for Finetuning-Based Low/Few-Shot Object DetectionVladislav Li, Georgios Tsoumplekas, Ilias Siniosoglou et al.
Current methods for low- and few-shot object detection have primarily focused on enhancing model performance for detecting objects. One common approach to achieve this is by combining model finetuning with data augmentation strategies. However, little attention has been given to the energy efficiency of these approaches in data-scarce regimes. This paper seeks to conduct a comprehensive empirical study that examines both model performance and energy efficiency of custom data augmentations and automated data augmentation selection strategies when combined with a lightweight object detector. The methods are evaluated in three different benchmark datasets in terms of their performance and energy consumption, and the Efficiency Factor is employed to gain insights into their effectiveness considering both performance and efficiency. Consequently, it is shown that in many cases, the performance gains of data augmentation strategies are overshadowed by their increased energy usage, necessitating the development of more energy efficient data augmentation strategies to address data scarcity.
IVJul 24, 2024
Introducing DEFORMISE: A deep learning framework for dementia diagnosis in the elderly using optimized MRI slice selectionNikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda et al.
Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce DEFORMISE, a novel DEep learning Framework for dementia diagnOsis of eldeRly patients using 3D brain Magnetic resonance Imaging (MRI) scans with Optimized Slice sElection. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three novel deep learning models. Tested on the Open OASIS datasets, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the ADNI dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.
CVJun 29, 2023
Evaluation of Environmental Conditions on Object Detection using Oriented Bounding Boxes for AR ApplicationsVladislav Li, Barbara Villarini, Jean-Christophe Nebel et al.
The objective of augmented reality (AR) is to add digital content to natural images and videos to create an interactive experience between the user and the environment. Scene analysis and object recognition play a crucial role in AR, as they must be performed quickly and accurately. In this study, a new approach is proposed that involves using oriented bounding boxes with a detection and recognition deep network to improve performance and processing time. The approach is evaluated using two datasets: a real image dataset (DOTA dataset) commonly used for computer vision tasks, and a synthetic dataset that simulates different environmental, lighting, and acquisition conditions. The focus of the evaluation is on small objects, which are difficult to detect and recognise. The results indicate that the proposed approach tends to produce better Average Precision and greater accuracy for small objects in most of the tested conditions.
LGSep 10, 2024
Applied Federated Model Personalisation in the Industrial Domain: A Comparative StudyIlias Siniosoglou, Vasileios Argyriou, George Fragulis et al.
The time-consuming nature of training and deploying complicated Machine and Deep Learning (DL) models for a variety of applications continues to pose significant challenges in the field of Machine Learning (ML). These challenges are particularly pronounced in the federated domain, where optimizing models for individual nodes poses significant difficulty. Many methods have been developed to tackle this problem, aiming to reduce training expenses and time while maintaining efficient optimisation. Three suggested strategies to tackle this challenge include Active Learning, Knowledge Distillation, and Local Memorization. These methods enable the adoption of smaller models that require fewer computational resources and allow for model personalization with local insights, thereby improving the effectiveness of current models. The present study delves into the fundamental principles of these three approaches and proposes an advanced Federated Learning System that utilises different Personalisation methods towards improving the accuracy of AI models and enhancing user experience in real-time NG-IoT applications, investigating the efficacy of these techniques in the local and federated domain. The results of the original and optimised models are then compared in both local and federated contexts using a comparison analysis. The post-analysis shows encouraging outcomes when it comes to optimising and personalising the models with the suggested techniques.
ROFeb 19
FR-GESTURE: An RGBD Dataset For Gesture-based Human-Robot Interaction In First Responder OperationsKonstantinos Foteinos, Georgios Angelidis, Aggelos Psiris et al.
The ever increasing intensity and number of disasters make even more difficult the work of First Responders (FRs). Artificial intelligence and robotics solutions could facilitate their operations, compensating these difficulties. To this end, we propose a dataset for gesture-based UGV control by FRs, introducing a set of 12 commands, drawing inspiration from existing gestures used by FRs and tactical hand signals and refined after incorporating feedback from experienced FRs. Then we proceed with the data collection itself, resulting in 3312 RGBD pairs captured from 2 viewpoints and 7 distances. To the best of our knowledge, this is the first dataset especially intended for gesture-based UGV guidance by FRs. Finally we define evaluation protocols for our RGBD dataset, termed FR-GESTURE, and we perform baseline experiments, which are put forward for improvement. We have made data publicly available to promote future research on the domain: https://doi.org/10.5281/zenodo.18131333.
CVMar 1
Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene UnderstandingAnna Michailidou, Georgios Angelidis, Vasileios Argyriou et al.
Aerial imagery is critical for large-scale post-disaster damage assessment. Automated interpretation remains challenging due to clutter, visual variability, and strong cross-event domain shift, while supervised approaches still rely on costly, task-specific annotations with limited coverage across disaster types and regions. Recent open-vocabulary and foundation vision models offer an appealing alternative, by reducing dependence on fixed label sets and extensive task-specific annotations. Instead, they leverage large-scale pretraining and vision-language representations. These properties are particularly relevant for post-disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade-offs between different learning paradigms, providing insight into their applicability for real-world disaster response. The most notable remark across all evaluated benchmarks is that supervised training remains the most reliable approach (i.e., when the label space is fixed and annotations are available), especially for small objects and fine boundary delineation in cluttered scenes.
CVMay 1, 2025Code
X-ray illicit object detection using hybrid CNN-transformer neural network architecturesJorgen Cani, Christos Diou, Spyridon Evangelatos et al.
In the field of X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed pose a great challenge for detection, whether by human observation or through advanced technological applications. While certain Deep Learning (DL) architectures demonstrate strong performance in processing local information, such as Convolutional Neural Networks (CNNs), others excel in handling distant information, e.g., transformers. In X-ray security imaging the literature has been dominated by the use of CNN-based methods, while the integration of the two aforementioned leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN (HGNetV2) and a hybrid CNN-transformer (Next-ViT-S) backbone are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally shown to be advantageous on the HiXray and PIDray datasets, when a domain distribution shift is incorporated in the X-ray images (as happens in the EDS datasets), hybrid CNN-transformer architectures exhibit increased robustness. Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at https://github.com/jgenc/xray-comparative-evaluation.
CVJul 23, 2025Code
Illicit object detection in X-ray imaging using deep learning techniques: A comparative evaluationJorgen Cani, Christos Diou, Spyridon Evangelatos et al.
Automated X-ray inspection is crucial for efficient and unobtrusive security screening in various public settings. However, challenges such as object occlusion, variations in the physical properties of items, diversity in X-ray scanning devices, and limited training data hinder accurate and reliable detection of illicit items. Despite the large body of research in the field, reported experimental evaluations are often incomplete, with frequently conflicting outcomes. To shed light on the research landscape and facilitate further research, a systematic, detailed, and thorough comparative evaluation of recent Deep Learning (DL)-based methods for X-ray object detection is conducted. For this, a comprehensive evaluation framework is developed, composed of: a) Six recent, large-scale, and widely used public datasets for X-ray illicit item detection (OPIXray, CLCXray, SIXray, EDS, HiXray, and PIDray), b) Ten different state-of-the-art object detection schemes covering all main categories in the literature, including generic Convolutional Neural Network (CNN), custom CNN, generic transformer, and hybrid CNN-transformer architectures, and c) Various detection (mAP50 and mAP50:95) and time/computational-complexity (inference time (ms), parameter size (M), and computational load (GFLOPS)) metrics. A thorough analysis of the results leads to critical observations and insights, emphasizing key aspects such as: a) Overall behavior of the object detection schemes, b) Object-level detection performance, c) Dataset-specific observations, and d) Time efficiency and computational complexity analysis. To support reproducibility of the reported experimental results, the evaluation code and model weights are made publicly available at https://github.com/jgenc/xray-comparative-evaluation.
CRApr 4, 2025Code
Malware Detection in Docker Containers: An Image is Worth a Thousand LogsAkis Nousias, Efklidis Katsaros, Evangelos Syrmos et al.
Malware detection is increasingly challenged by evolving techniques like obfuscation and polymorphism, limiting the effectiveness of traditional methods. Meanwhile, the widespread adoption of software containers has introduced new security challenges, including the growing threat of malicious software injection, where a container, once compromised, can serve as entry point for further cyberattacks. In this work, we address these security issues by introducing a method to identify compromised containers through machine learning analysis of their file systems. We cast the entire software containers into large RGB images via their tarball representations, and propose to use established Convolutional Neural Network architectures on a streaming, patch-based manner. To support our experiments, we release the COSOCO dataset--the first of its kind--containing 3364 large-scale RGB images of benign and compromised software containers at https://huggingface.co/datasets/k3ylabs/cosoco-image-dataset. Our method detects more malware and achieves higher F1 and Recall scores than all individual and ensembles of VirusTotal engines, demonstrating its effectiveness and setting a new standard for identifying malware-compromised software containers.
CVJan 31, 2022Code
Finding Directions in GAN's Latent Space for Neural Face ReenactmentStella Bounareli, Vasileios Argyriou, Georgios Tzimiropoulos
This paper is on face/head reenactment where the goal is to transfer the facial pose (3D head orientation and expression) of a target face to a source face. Previous methods focus on learning embedding networks for identity and pose disentanglement which proves to be a rather hard task, degrading the quality of the generated images. We take a different approach, bypassing the training of such networks, by using (fine-tuned) pre-trained GANs which have been shown capable of producing high-quality facial images. Because GANs are characterized by weak controllability, the core of our approach is a method to discover which directions in latent GAN space are responsible for controlling facial pose and expression variations. We present a simple pipeline to learn such directions with the aid of a 3D shape model which, by construction, already captures disentangled directions for facial pose, identity and expression. Moreover, we show that by embedding real images in the GAN latent space, our method can be successfully used for the reenactment of real-world faces. Our method features several favorable properties including using a single source image (one-shot) and enabling cross-person reenactment. Our qualitative and quantitative results show that our approach often produces reenacted faces of significantly higher quality than those produced by state-of-the-art methods for the standard benchmarks of VoxCeleb1 & 2. Source code is available at: https://github.com/StelaBou/stylegan_directions_face_reenactment
CVSep 14, 2023
PRE: Vision-Language Prompt Learning with Reparameterization EncoderThi Minh Anh Pham, An Duc Nguyen, Cephas Svosve et al.
Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context is worse generalizable to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all achieved within a good training time.
CVFeb 5, 2024
One-shot Neural Face Reenactment via Finding Directions in GAN's Latent SpaceStella Bounareli, Christos Tzelepis, Vasileios Argyriou et al.
In this paper, we present our framework for neural face/head reenactment whose goal is to transfer the 3D head orientation and expression of a target face to a source face. Previous methods focus on learning embedding networks for identity and head pose/expression disentanglement which proves to be a rather hard task, degrading the quality of the generated images. We take a different approach, bypassing the training of such networks, by using (fine-tuned) pre-trained GANs which have been shown capable of producing high-quality facial images. Because GANs are characterized by weak controllability, the core of our approach is a method to discover which directions in latent GAN space are responsible for controlling head pose and expression variations. We present a simple pipeline to learn such directions with the aid of a 3D shape model which, by construction, inherently captures disentangled directions for head pose, identity, and expression. Moreover, we show that by embedding real images in the GAN latent space, our method can be successfully used for the reenactment of real-world faces. Our method features several favorable properties including using a single source image (one-shot) and enabling cross-person reenactment. Extensive qualitative and quantitative results show that our approach typically produces reenacted faces of notably higher quality than those produced by state-of-the-art methods for the standard benchmarks of VoxCeleb1 & 2.
CVMar 25, 2024
DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face ReenactmentStella Bounareli, Christos Tzelepis, Vasileios Argyriou et al.
Video-driven neural face reenactment aims to synthesize realistic facial images that successfully preserve the identity and appearance of a source face, while transferring the target head pose and facial expressions. Existing GAN-based methods suffer from either distortions and visual artifacts or poor reconstruction quality, i.e., the background and several important appearance details, such as hair style/color, glasses and accessories, are not faithfully reconstructed. Recent advances in Diffusion Probabilistic Models (DPMs) enable the generation of high-quality realistic images. To this end, in this paper we present DiffusionAct, a novel method that leverages the photo-realistic image generation of diffusion models to perform neural face reenactment. Specifically, we propose to control the semantic space of a Diffusion Autoencoder (DiffAE), in order to edit the facial pose of the input images, defined as the head pose orientation and the facial expressions. Our method allows one-shot, self, and cross-subject reenactment, without requiring subject-specific fine-tuning. We compare against state-of-the-art GAN-, StyleGAN2-, and diffusion-based methods, showing better or on-par reenactment performance.
CVDec 10, 2024
Enhancing 3D Object Detection in Autonomous Vehicles Based on Synthetic Virtual Environment AnalysisVladislav Li, Ilias Siniosoglou, Thomai Karamitsou et al.
Autonomous Vehicles (AVs) use natural images and videos as input to understand the real world by overlaying and inferring digital elements, facilitating proactive detection in an effort to assure safety. A crucial aspect of this process is real-time, accurate object recognition through automatic scene analysis. While traditional methods primarily concentrate on 2D object detection, exploring 3D object detection, which involves projecting 3D bounding boxes into the three-dimensional environment, holds significance and can be notably enhanced using the AR ecosystem. This study examines an AI model's ability to deduce 3D bounding boxes in the context of real-time scene analysis while producing and evaluating the model's performance and processing time, in the virtual domain, which is then applied to AVs. This work also employs a synthetic dataset that includes artificially generated images mimicking various environmental, lighting, and spatiotemporal states. This evaluation is oriented in handling images featuring objects in diverse weather conditions, captured with varying camera settings. These variations pose more challenging detection and recognition scenarios, which the outcomes of this work can help achieve competitive results under most of the tested conditions.
CVApr 7, 2024
Dynamic Distinction Learning: Adaptive Pseudo Anomalies for Video Anomaly DetectionDemetris Lappas, Vasileios Argyriou, Dimitrios Makris
We introduce Dynamic Distinction Learning (DDL) for Video Anomaly Detection, a novel video anomaly detection methodology that combines pseudo-anomalies, dynamic anomaly weighting, and a distinction loss function to improve detection accuracy. By training on pseudo-anomalies, our approach adapts to the variability of normal and anomalous behaviors without fixed anomaly thresholds. Our model showcases superior performance on the Ped2, Avenue and ShanghaiTech datasets, where individual models are tailored for each scene. These achievements highlight DDL's effectiveness in advancing anomaly detection, offering a scalable and adaptable solution for video surveillance challenges.
CLApr 22, 2024
Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional ApproachesDimitris Asimopoulos, Ilias Siniosoglou, Vasileios Argyriou et al.
In the realm of data privacy, the ability to effectively anonymise text is paramount. With the proliferation of deep learning and, in particular, transformer architectures, there is a burgeoning interest in leveraging these advanced models for text anonymisation tasks. This paper presents a comprehensive benchmarking study comparing the performance of transformer-based models and Large Language Models(LLM) against traditional architectures for text anonymisation. Utilising the CoNLL-2003 dataset, known for its robustness and diversity, we evaluate several models. Our results showcase the strengths and weaknesses of each approach, offering a clear perspective on the efficacy of modern versus traditional methods. Notably, while modern models exhibit advanced capabilities in capturing con textual nuances, certain traditional architectures still keep high performance. This work aims to guide researchers in selecting the most suitable model for their anonymisation needs, while also shedding light on potential paths for future advancements in the field.
LGMar 11, 2024
Evaluating the Energy Efficiency of Few-Shot Learning for Object Detection in Industrial SettingsGeorgios Tsoumplekas, Vladislav Li, Ilias Siniosoglou et al.
In the ever-evolving era of Artificial Intelligence (AI), model performance has constituted a key metric driving innovation, leading to an exponential growth in model size and complexity. However, sustainability and energy efficiency have been critical requirements during deployment in contemporary industrial settings, necessitating the use of data-efficient approaches such as few-shot learning. In this paper, to alleviate the burden of lengthy model training and minimize energy consumption, a finetuning approach to adapt standard object detection models to downstream tasks is examined. Subsequently, a thorough case study and evaluation of the energy demands of the developed models, applied in object detection benchmark datasets from volatile industrial environments is presented. Specifically, different finetuning strategies as well as utilization of ancillary evaluation data during training are examined, and the trade-off between performance and efficiency is highlighted in this low-data regime. Finally, this paper introduces a novel way to quantify this trade-off through a customized Efficiency Factor metric.
CVMay 17, 2024
Autonomous AI-enabled Industrial Sorting Pipeline for Advanced Textile RecyclingYannis Spyridis, Vasileios Argyriou, Antonios Sarigiannidis et al.
The escalating volumes of textile waste globally necessitate innovative waste management solutions to mitigate the environmental impact and promote sustainability in the fashion industry. This paper addresses the inefficiencies of traditional textile sorting methods by introducing an autonomous textile analysis pipeline. Utilising robotics, spectral imaging, and AI-driven classification, our system enhances the accuracy, efficiency, and scalability of textile sorting processes, contributing to a more sustainable and circular approach to waste management. The integration of a Digital Twin system further allows critical evaluation of technical and economic feasibility, providing valuable insights into the sorting system's accuracy and reliability. The proposed framework, inspired by Industry 4.0 principles, comprises five interconnected layers facilitating seamless data exchange and coordination within the system. Preliminary results highlight the potential of our holistic approach to mitigate environmental impact and foster a positive shift towards recycling in the textile industry.
CVMar 20, 2024
An AI-Assisted Skincare Routine Recommendation System in XRGowravi Malalur Rajegowda, Yannis Spyridis, Barbara Villarini et al.
In recent years, there has been an increasing interest in the use of artificial intelligence (AI) and extended reality (XR) in the beauty industry. In this paper, we present an AI-assisted skin care recommendation system integrated into an XR platform. The system uses a convolutional neural network (CNN) to analyse an individual's skin type and recommend personalised skin care products in an immersive and interactive manner. Our methodology involves collecting data from individuals through a questionnaire and conducting skin analysis using a provided facial image in an immersive environment. This data is then used to train the CNN model, which recognises the skin type and existing issues and allows the recommendation engine to suggest personalised skin care products. We evaluate our system in terms of the accuracy of the CNN model, which achieves an average score of 93% in correctly classifying existing skin issues. Being integrated into an XR system, this approach has the potential to significantly enhance the beauty industry by providing immersive and engaging experiences to users, leading to more efficient and consistent skincare routines.
LGFeb 5, 2024
A Complete Survey on Contemporary Methods, Emerging Paradigms and Hybrid Approaches for Few-Shot LearningGeorgios Tsoumplekas, Vladislav Li, Panagiotis Sarigiannidis et al.
Despite the widespread success of deep learning, its intense requirements for vast amounts of data and extensive training make it impractical for various real-world applications where data is scarce. In recent years, Few-Shot Learning (FSL) has emerged as a learning paradigm that aims to address these limitations by leveraging prior knowledge to enable rapid adaptation to novel learning tasks. Due to its properties that highly complement deep learning's data-intensive needs, FSL has seen significant growth in the past few years. This survey provides a comprehensive overview of both well-established methods as well as recent advancements in the FSL field. The presented taxonomy extends previously proposed ones by incorporating emerging FSL paradigms, such as in-context learning, along with novel categories within the meta-learning paradigm for FSL, including neural processes and probabilistic meta-learning. Furthermore, a holistic overview of FSL is provided by discussing hybrid FSL approaches that extend FSL beyond the typically examined supervised learning setting. The survey also explores FSL's diverse applications across various domains. Finally, recent trends shaping the field, outstanding challenges, and promising future research directions are discussed.
CRMay 20, 2024
StatAvg: Mitigating Data Heterogeneity in Federated Learning for Intrusion Detection SystemsPavlos S. Bouzinis, Panagiotis Radoglou-Grammatikis, Ioannis Makris et al.
Federated learning (FL) is a decentralized learning technique that enables participating devices to collaboratively build a shared Machine Leaning (ML) or Deep Learning (DL) model without revealing their raw data to a third party. Due to its privacy-preserving nature, FL has sparked widespread attention for building Intrusion Detection Systems (IDS) within the realm of cybersecurity. However, the data heterogeneity across participating domains and entities presents significant challenges for the reliable implementation of an FL-based IDS. In this paper, we propose an effective method called Statistical Averaging (StatAvg) to alleviate non-independently and identically (non-iid) distributed features across local clients' data in FL. In particular, StatAvg allows the FL clients to share their individual data statistics with the server, which then aggregates this information to produce global statistics. The latter are shared with the clients and used for universal data normalisation. It is worth mentioning that StatAvg can seamlessly integrate with any FL aggregation strategy, as it occurs before the actual FL training process. The proposed method is evaluated against baseline approaches using datasets for network and host Artificial Intelligence (AI)-powered IDS. The experimental results demonstrate the efficiency of StatAvg in mitigating non-iid feature distributions across the FL clients compared to the baseline methods.
ROMay 19, 2024
Enhancing Vehicle Aerodynamics with Deep Reinforcement Learning in Voxelised ModelsJignesh Patel, Yannis Spyridis, Vasileios Argyriou
Aerodynamic design optimisation plays a crucial role in improving the performance and efficiency of automotive vehicles. This paper presents a novel approach for aerodynamic optimisation in car design using deep reinforcement learning (DRL). Traditional optimisation methods often face challenges in handling the complexity of the design space and capturing non-linear relationships between design parameters and aerodynamic performance metrics. This study addresses these challenges by employing DRL to learn optimal aerodynamic design strategies in a voxelised model representation. The proposed approach utilises voxelised models to discretise the vehicle geometry into a grid of voxels, allowing for a detailed representation of the aerodynamic flow field. The Proximal Policy Optimisation (PPO) algorithm is then employed to train a DRL agent to optimise the design parameters of the vehicle with respect to drag force, kinetic energy, and voxel collision count. Experimental results demonstrate the effectiveness and efficiency of the proposed approach in achieving significant results in aerodynamic performance. The findings highlight the potential of DRL techniques for addressing complex aerodynamic design optimisation problems in automotive engineering, with implications for improving vehicle performance, fuel efficiency, and environmental sustainability.
CVJul 6, 2025
Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research DirectionsKonstantinos Foteinos, Jorgen Cani, Manousos Linardakis et al.
The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the always important field of visual hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction using cameras. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the right combination of data, model, and approach for each task. The current survey aims to fill this gap by presenting a comprehensive overview of this computer vision field. With a systematic research methodology that identifies the state-of-the-art works and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to choose the right strategy for handling a VHGR task. Starting with the methodology used to locate the related literature, the survey identifies and organizes the key VHGR approaches in a taxonomy-based format, and presents the various dimensions that affect the final method choice, such as input modality, task type, and application domain. The state-of-the-art techniques are grouped across three primary VHGR tasks: static gesture recognition, isolated dynamic gestures, and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed. To support the experimental evaluation of future methods in the field, the study reviews commonly used datasets and presents the standard performance metrics. Our survey concludes by identifying the major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.
LGFeb 2, 2025
Emotion Recognition and Generation: A Comprehensive Review of Face, Speech, and Text ModalitiesRebecca Mobbs, Dimitrios Makris, Vasileios Argyriou
Emotion recognition and generation have emerged as crucial topics in Artificial Intelligence research, playing a significant role in enhancing human-computer interaction within healthcare, customer service, and other fields. Although several reviews have been conducted on emotion recognition and generation as separate entities, many of these works are either fragmented or limited to specific methodologies, lacking a comprehensive overview of recent developments and trends across different modalities. In this survey, we provide a holistic review aimed at researchers beginning their exploration in emotion recognition and generation. We introduce the fundamental principles underlying emotion recognition and generation across facial, vocal, and textual modalities. This work categorises recent state-of-the-art research into distinct technical approaches and explains the theoretical foundations and motivations behind these methodologies, offering a clearer understanding of their application. Moreover, we discuss evaluation metrics, comparative analyses, and current limitations, shedding light on the challenges faced by researchers in the field. Finally, we propose future research directions to address these challenges and encourage further exploration into developing robust, effective, and ethically responsible emotion recognition and generation systems.
LGFeb 14, 2025
NeuroXVocal: Detection and Explanation of Alzheimer's Disease through Non-invasive Analysis of Picture-prompted SpeechNikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda et al.
The early diagnosis of Alzheimer's Disease (AD) through non invasive methods remains a significant healthcare challenge. We present NeuroXVocal, a novel dual-component system that not only classifies but also explains potential AD cases through speech analysis. The classification component (Neuro) processes three distinct data streams: acoustic features capturing speech patterns and voice characteristics, textual features extracted from speech transcriptions, and precomputed embeddings representing linguistic patterns. These streams are fused through a custom transformer-based architecture that enables robust cross-modal interactions. The explainability component (XVocal) implements a Retrieval-Augmented Generation (RAG) approach, leveraging Large Language Models combined with a domain-specific knowledge base of AD research literature. This architecture enables XVocal to retrieve relevant clinical studies and research findings to generate evidence-based context-sensitive explanations of the acoustic and linguistic markers identified in patient speech. Using the IS2021 ADReSSo Challenge benchmark dataset, our system achieved state-of-the-art performance with 95.77% accuracy in AD classification, significantly outperforming previous approaches. The explainability component was qualitatively evaluated using a structured questionnaire completed by medical professionals, validating its clinical relevance. NeuroXVocal's unique combination of high-accuracy classification and interpretable, literature-grounded explanations demonstrates its potential as a practical tool for supporting clinical AD diagnosis.
CVMar 10, 2025
Public space security management using digital twin technologiesStylianos Zindros, Christos Chronis, Panagiotis Radoglou-Grammatikis et al.
As the security of public spaces remains a critical issue in today's world, Digital Twin technologies have emerged in recent years as a promising solution for detecting and predicting potential future threats. The applied methodology leverages a Digital Twin of a metro station in Athens, Greece, using the FlexSim simulation software. The model encompasses points of interest and passenger flows, and sets their corresponding parameters. These elements influence and allow the model to provide reasonable predictions on the security management of the station under various scenarios. Experimental tests are conducted with different configurations of surveillance cameras and optimizations of camera angles to evaluate the effectiveness of the space surveillance setup. The results show that the strategic positioning of surveillance cameras and the adjustment of their angles significantly improves the detection of suspicious behaviors and with the use of the DT it is possible to evaluate different scenarios and find the optimal camera setup for each case. In summary, this study highlights the value of Digital Twins in real-time simulation and data-driven security management. The proposed approach contributes to the ongoing development of smart security solutions for public spaces and provides an innovative framework for threat detection and prevention.
CVMar 4, 2025
State of play and future directions in industrial computer vision AI standardsArtemis Stefanidou, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou et al.
The recent tremendous advancements in the areas of Artificial Intelligence (AI) and Deep Learning (DL) have also resulted into corresponding remarkable progress in the field of Computer Vision (CV), showcasing robust technological solutions in a wide range of application sectors of high industrial interest (e.g., healthcare, autonomous driving, automation, etc.). Despite the outstanding performance of CV systems in specific domains, their development and exploitation at industrial-scale necessitates, among other, the addressing of requirements related to the reliability, transparency, trustworthiness, security, safety, and robustness of the developed AI models. The latter raises the imperative need for the development of efficient, comprehensive and widely-adopted industrial standards. In this context, this study investigates the current state of play regarding the development of industrial computer vision AI standards, emphasizing on critical aspects, like model interpretability, data quality, and regulatory compliance. In particular, a systematic analysis of launched and currently developing CV standards, proposed by the main international standardization bodies (e.g. ISO/IEC, IEEE, DIN, etc.) is performed. The latter is complemented by a comprehensive discussion on the current challenges and future directions observed in this regularization endeavor.
CLMay 17, 2024
Empowering Prior to Court Legal Analysis: A Transparent and Accessible Dataset for Defensive Statement Classification and InterpretationYannis Spyridis, Jean-Paul, Haneen Deeb et al.
The classification of statements provided by individuals during police interviews is a complex and significant task within the domain of natural language processing (NLP) and legal informatics. The lack of extensive domain-specific datasets raises challenges to the advancement of NLP methods in the field. This paper aims to address some of the present challenges by introducing a novel dataset tailored for classification of statements made during police interviews, prior to court proceedings. Utilising the curated dataset for training and evaluation, we introduce a fine-tuned DistilBERT model that achieves state-of-the-art performance in distinguishing truthful from deceptive statements. To enhance interpretability, we employ explainable artificial intelligence (XAI) methods to offer explainability through saliency maps, that interpret the model's decision-making process. Lastly, we present an XAI interface that empowers both legal professionals and non-specialists to interact with and benefit from our system. Our model achieves an accuracy of 86%, and is shown to outperform a custom transformer architecture in a comparative study. This holistic approach advances the accessibility, transparency, and effectiveness of statement analysis, with promising implications for both legal practice and research.
CLMay 9, 2024
Evaluating the Efficacy of AI Techniques in Textual Anonymization: A Comparative StudyDimitris Asimopoulos, Ilias Siniosoglou, Vasileios Argyriou et al.
In the digital era, with escalating privacy concerns, it's imperative to devise robust strategies that protect private data while maintaining the intrinsic value of textual information. This research embarks on a comprehensive examination of text anonymisation methods, focusing on Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), Embeddings from Language Models (ELMo), and the transformative capabilities of the Transformers architecture. Each model presents unique strengths since LSTM is modeling long-term dependencies, CRF captures dependencies among word sequences, ELMo delivers contextual word representations using deep bidirectional language models and Transformers introduce self-attention mechanisms that provide enhanced scalability. Our study is positioned as a comparative analysis of these models, emphasising their synergistic potential in addressing text anonymisation challenges. Preliminary results indicate that CRF, LSTM, and ELMo individually outperform traditional methods. The inclusion of Transformers, when compared alongside with the other models, offers a broader perspective on achieving optimal text anonymisation in contemporary settings.
CVJun 17, 2019
Content-aware Density Map for Crowd Counting and Density EstimationMahdi Maktabdar Oghaz, Anish R Khadka, Vasileios Argyriou et al.
Precise knowledge about the size of a crowd, its density and flow can provide valuable information for safety and security applications, event planning, architectural design and to analyze consumer behavior. Creating a powerful machine learning model, to employ for such applications requires a large and highly accurate and reliable dataset. Unfortunately the existing crowd counting and density estimation benchmark datasets are not only limited in terms of their size, but also lack annotation, in general too time consuming to implement. This paper attempts to address this very issue through a content aware technique, uses combinations of Chan-Vese segmentation algorithm, two-dimensional Gaussian filter and brute-force nearest neighbor search. The results shows that by simply replacing the commonly used density map generators with the proposed method, higher level of accuracy can be achieved using the existing state of the art models.
CVJun 6, 2019
Iterative Self-Learning: Semi-Supervised Improvement to Dataset Volumes and Model AccuracyRobert Dupre, Jiri Fajtl, Vasileios Argyriou et al.
A novel semi-supervised learning technique is introduced based on a simple iterative learning cycle together with learned thresholding techniques and an ensemble decision support system. State-of-the-art model performance and increased training data volume are demonstrated, through the use of unlabelled data when training deeply learned classification models. Evaluation of the proposed approach is performed on commonly used datasets when evaluating semi-supervised learning techniques as well as a number of more challenging image classification datasets (CIFAR-100 and a 200 class subset of ImageNet).
CVJun 6, 2019
Scene and Environment Monitoring Using Aerial Imagery and Deep LearningMahdi Maktabdar Oghaz, Manzoor Razaak, Hamideh Kerdegari et al.
Unmanned Aerial vehicles (UAV) are a promising technology for smart farming related applications. Aerial monitoring of agriculture farms with UAV enables key decision-making pertaining to crop monitoring. Advancements in deep learning techniques have further enhanced the precision and reliability of aerial imagery based analysis. The capabilities to mount various kinds of sensors (RGB, spectral cameras) on UAV allows remote crop analysis applications such as vegetation classification and segmentation, crop counting, yield monitoring and prediction, crop mapping, weed detection, disease and nutrient deficiency detection and others. A significant amount of studies are found in the literature that explores UAV for smart farming applications. In this paper, a review of studies applying deep learning on UAV imagery for smart farming is presented. Based on the application, we have classified these studies into five major groups including: vegetation identification, classification and segmentation, crop counting and yield predictions, crop mapping, weed detection and crop disease and nutrient deficiency detection. An in depth critical analysis of each study is provided.
CVJun 6, 2019
Smart IoT Cameras for Crowd Analysis based on augmentation for automatic pedestrian detection, simulation and annotationAntoine Rimboux, Rob Dupre, Thomas Lagkas et al.
Smart video sensors for applications related to surveillance and security are IOT-based as they use Internet for various purposes. Such applications include crowd behaviour monitoring and advanced decision support systems operating and transmitting information over internet. The analysis of crowd and pedestrian behaviour is an important task for smart IoT cameras and in particular video processing. In order to provide related behavioural models, simulation and tracking approaches have been considered in the literature. In both cases ground truth is essential to train deep models and provide a meaningful quantitative evaluation. We propose a framework for crowd simulation and automatic data generation and annotation that supports multiple cameras and multiple targets. The proposed approach is based on synthetically generated human agents, augmented frames and compositing techniques combined with path finding and planning methods. A number of popular crowd and pedestrian data sets were used to validate the model, and scenarios related to annotation and simulation were considered.
IVMay 24, 2019
Semi-supervised GAN for Classification of Multispectral Imagery Acquired by UAVsHamideh Kerdegari, Manzoor Razaak, Vasileios Argyriou et al.
Unmanned aerial vehicles (UAV) are used in precision agriculture (PA) to enable aerial monitoring of farmlands. Intelligent methods are required to pinpoint weed infestations and make optimal choice of pesticide. UAV can fly a multispectral camera and collect data. However, the classification of multispectral images using supervised machine learning algorithms such as convolutional neural networks (CNN) requires large amount of training data. This is a common drawback in deep learning we try to circumvent making use of a semi-supervised generative adversarial networks (GAN), providing a pixel-wise classification for all the acquired multispectral images. Our algorithm consists of a generator network that provides photo-realistic images as extra training data to a multi-class classifier, acting as a discriminator and trained on small amounts of labeled data. The performance of the proposed method is evaluated on the weedNet dataset consisting of multispectral crop and weed images collected by a micro aerial vehicle (MAV). The results by the proposed semi-supervised GAN achieves high classification accuracy and demonstrates the potential of GAN-based methods for the challenging task of multispectral image classification.
CVDec 12, 2018
Features Extraction Based on an Origami Representation of 3D LandmarksJuan Manuel Fernandez Montenegro, Mahdi Maktab Dar Oghaz, Athanasios Gkelias et al.
Feature extraction analysis has been widely investigated during the last decades in computer vision community due to the large range of possible applications. Significant work has been done in order to improve the performance of the emotion detection methods. Classification algorithms have been refined, novel preprocessing techniques have been applied and novel representations from images and videos have been introduced. In this paper, we propose a preprocessing method and a novel facial landmarks' representation aiming to improve the facial emotion detection accuracy. We apply our novel methodology on the extended Cohn-Kanade (CK+) dataset and other datasets for affect classification based on Action Units (AU). The performance evaluation demonstrates an improvement on facial emotion classification (accuracy and F1 score) that indicates the superiority of the proposed methodology.
CVDec 9, 2018
A Comparison of Embedded Deep Learning Methods for Person DetectionChloe Eunhyang Kim, Mahdi Maktab Dar Oghaz, Jiri Fajtl et al.
Recent advancements in parallel computing, GPU technology and deep learning provide a new platform for complex image processing tasks such as person detection to flourish. Person detection is fundamental preliminary operation for several high level computer vision tasks. One industry that can significantly benefit from person detection is retail. In recent years, various studies attempt to find an optimal solution for person detection using neural networks and deep learning. This study conducts a comparison among the state of the art deep learning base object detector with the focus on person detection performance in indoor environments. Performance of various implementations of YOLO, SSD, RCNN, R-FCN and SqueezeDet have been assessed using our in-house proprietary dataset which consists of over 10 thousands indoor images captured form shopping malls, retails and stores. Experimental results indicate that, Tiny YOLO-416 and SSD (VGG-300) are the fastest and Faster-RCNN (Inception ResNet-v2) and R-FCN (ResNet-101) are the most accurate detectors investigated in this study. Further analysis shows that YOLO v3-416 delivers relatively accurate result in a reasonable amount of time, which makes it an ideal model for person detection in embedded platforms.
CVDec 5, 2018
Summarizing Videos with AttentionJiri Fajtl, Hajar Sadeghi Sokeh, Vasileios Argyriou et al.
In this work we propose a novel method for supervised, keyshots based video summarization by applying a conceptually simple and computationally efficient soft, self-attention mechanism. Current state of the art methods leverage bi-directional recurrent networks such as BiLSTM combined with attention. These networks are complex to implement and computationally demanding compared to fully connected networks. To that end we propose a simple, self-attention based network for video summarization which performs the entire sequence to sequence transformation in a single feed forward pass and single backward pass during training. Our method sets a new state of the art results on two benchmarks TvSum and SumMe, commonly used in this domain.
CVNov 6, 2018
Object 3D Reconstruction based on Photometric Stereo and Inverted RenderingAnish R. Khadka, Paolo Remagnino, Vasileios Argyriou
Methods for 3D reconstruction such as Photometric stereo recover the shape and reflectance properties using multiple images of an object taken with variable lighting conditions from a fixed viewpoint. Photometric stereo assumes that a scene is illuminated only directly by the illumination source. As result, indirect illumination effects due to inter-reflections introduce strong biases in the recovered shape. Our suggested approach is to recover scene properties in the presence of indirect illumination. To this end, we proposed an iterative PS method combined with a reverted Monte-Carlo ray tracing algorithm to overcome the inter-reflection effects aiming to separate the direct and indirect lighting. This approach iteratively reconstructs a surface considering both the environment around the object and its concavities. We demonstrate and evaluate our approach using three datasets and the overall results illustrate improvement over the classic PS approaches.
HCNov 2, 2018
Machine learning architectures to predict motion sickness using a Virtual Reality rollercoaster simulation toolStefan Hell, Vasileios Argyriou
Virtual Reality (VR) can cause an unprecedented immersion and feeling of presence yet a lot of users experience motion sickness when moving through a virtual environment. Rollercoaster rides are popular in Virtual Reality but have to be well designed to limit the amount of nausea the user may feel. This paper describes a novel framework to get automated ratings on motion sickness using Neural Networks. An application that lets users create rollercoasters directly in VR, share them with other users and ride and rate them is used to gather real-time data related to the in-game behaviour of the player, the track itself and users' ratings based on a Simulator Sickness Questionnaire (SSQ) integrated into the application. Machine learning architectures based on deep neural networks are trained using this data aiming to predict motion sickness levels. While this paper focuses on rollercoasters this framework could help to rate any VR application on motion sickness and intensity that involves camera movement. A new well defined dataset is provided in this paper and the performance of the proposed architectures are evaluated in a comparative study.
CVNov 1, 2018
Asymmetric Bilateral Phase Correlation for Optical Flow Estimation in the Frequency DomainVasileios Argyriou
We address the problem of motion estimation in images operating in the frequency domain. A method is presented which extends phase correlation to handle multiple motions present in an area. Our scheme is based on a novel Bilateral-Phase Correlation (BLPC) technique that incorporates the concept and principles of Bilateral Filters retaining the motion boundaries by taking into account the difference both in value and distance in a manner very similar to Gaussian convolution. The optical flow is obtained by applying the proposed method at certain locations selected based on the present motion differences and then performing non-uniform interpolation in a multi-scale iterative framework. Experiments with several well-known datasets with and without ground-truth show that our scheme outperforms recently proposed state-of-the-art phase correlation based optical flow methods.
CVApr 18, 2018
Superframes, A Temporal Video SegmentationHajar Sadeghi Sokeh, Vasileios Argyriou, Dorothy Monekosso et al.
The goal of video segmentation is to turn video data into a set of concrete motion clusters that can be easily interpreted as building blocks of the video. There are some works on similar topics like detecting scene cuts in a video, but there is few specific research on clustering video data into the desired number of compact segments. It would be more intuitive, and more efficient, to work with perceptually meaningful entity obtained from a low-level grouping process which we call it superframe. This paper presents a new simple and efficient technique to detect superframes of similar content patterns in videos. We calculate the similarity of content-motion to obtain the strength of change between consecutive frames. With the help of existing optical flow technique using deep models, the proposed method is able to perform more accurate motion estimation efficiently. We also propose two criteria for measuring and comparing the performance of different algorithms on various databases. Experimental results on the videos from benchmark databases have demonstrated the effectiveness of the proposed method.
AIApr 9, 2018
AMNet: Memorability Estimation with AttentionJiri Fajtl, Vasileios Argyriou, Dorothy Monekosso et al.
In this paper we present the design and evaluation of an end-to-end trainable, deep neural network with a visual attention mechanism for memorability estimation in still images. We analyze the suitability of transfer learning of deep models from image classification to the memorability task. Further on we study the impact of the attention mechanism on the memorability estimation and evaluate our network on the SUN Memorability and the LaMem datasets. Our network outperforms the existing state of the art models on both datasets in terms of the Spearman's rank correlation as well as the mean squared error, closely matching human consistency.
CVJul 9, 2017
A Human and Group Behaviour Simulation Evaluation Framework utilising Composition and Video AnalysisRob Dupre, Vasileios Argyriou
In this work we present the modular Crowd Simulation Evaluation through Composition framework (CSEC) which provides a quantitative comparison between different pedestrian and crowd simulation approaches. Evaluation is made based on the comparison of source footage against synthetic video created through novel composition techniques. The proposed framework seeks to reduce the complexity of simulation evaluation and provide a platform from which the comparison of differing simulation algorithms as well as parametric tuning can be conducted to improve simulation accuracy or providing measures of similarity between crowd simulation algorithms and source data. Through the use of features designed to mimic the Human Visual System (HVS), specific simulation properties can be evaluated relative to sample footage. Validation was performed on a number of popular crowd datasets and through comparisons of multiple pedestrian and crowd simulation algorithms.
CVMar 22, 2017
Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN RegressionAaron S. Jackson, Adrian Bulat, Vasileios Argyriou et al.
3D face reconstruction is a fundamental Computer Vision problem of extraordinary difficulty. Current systems often assume the availability of multiple facial images (sometimes from the same subject) as input, and must address a number of methodological challenges such as establishing dense correspondences across large facial poses, expressions, and non-uniform illumination. In general these methods require complex and inefficient pipelines for model building and fitting. In this work, we propose to address many of these limitations by training a Convolutional Neural Network (CNN) on an appropriate dataset consisting of 2D images and 3D facial models or scans. Our CNN works with just a single 2D facial image, does not require accurate alignment nor establishes dense correspondence between images, works for arbitrary facial poses and expressions, and can be used to reconstruct the whole 3D facial geometry (including the non-visible parts of the face) bypassing the construction (during training) and fitting (during testing) of a 3D Morphable Model. We achieve this via a simple CNN architecture that performs direct regression of a volumetric representation of the 3D facial geometry from a single 2D image. We also demonstrate how the related task of facial landmark localization can be incorporated into the proposed framework and help improve reconstruction quality, especially for the cases of large poses and facial expressions. Testing code will be made available online, along with pre-trained models http://aaronsplace.co.uk/papers/jackson2017recon