CV | Jun 29, 2022 | Code
vMFNet: Compositionality Meets Domain-generalised Segmentation
Xiao Liu, Spyridon Thermos, Pedro Sanchez et al.
Training medical image segmentation models usually requires a large amount of labeled data. By contrast, humans can quickly learn to accurately recognise anatomy of interest from medical (e.g. MRI and CT) images with some limited guidance. Such recognition ability can easily generalise to new images from different clinical centres. This rapid and generalisable learning ability is mostly due to the compositional structure of image patterns in the human brain, which is less incorporated in medical image segmentation. In this paper, we model the compositional components (i.e. patterns) of human anatomy as learnable von-Mises-Fisher (vMF) kernels, which are robust to images collected from different domains (e.g. clinical centres). The image features can be decomposed to (or composed by) the components with the composing operations, i.e. the vMF likelihoods. The vMF likelihoods tell how likely each anatomical part is at each position of the image. Hence, the segmentation mask can be predicted based on the vMF likelihoods. Moreover, with a reconstruction module, unlabeled data can also be used to learn the vMF kernels and likelihoods by recombining them to reconstruct the input image. Extensive experiments show that the proposed vMFNet achieves improved generalisation performance on two benchmarks, especially when annotations are limited. Code is publicly available at: https://github.com/vios-s/vMFNet.
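As a rough illustration of the vMF likelihoods described above, the sketch below scores unit-normalised feature vectors against unit-normalised kernel means with `exp(kappa * mu . z)` and normalises the scores across kernels at each position. The function names, the fixed concentration `kappa`, and the normalisation across kernels are assumptions made for this example and are not taken from the vMFNet code.

```python
import math


def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def vmf_responsibilities(features, kernels, kappa=10.0):
    """Return, for each feature vector, a distribution over kernels.

    Each unnormalised vMF likelihood is exp(kappa * mu . z) with unit-norm
    mu and z. With a shared kappa the vMF normalising constant is the same
    for every kernel, so it cancels in the normalisation below.
    """
    mus = [unit(k) for k in kernels]
    result = []
    for f in features:
        z = unit(f)
        logits = [kappa * sum(a * b for a, b in zip(z, mu)) for mu in mus]
        top = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        result.append([w / total for w in weights])
    return result
```

Calling `vmf_responsibilities` on the feature vectors of every spatial position would give one map per kernel, which is the kind of per-position "how likely is each part here" signal the abstract refers to.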
CV | Jun 13, 2023
Compositionally Equivariant Representation Learning
Xiao Liu, Pedro Sanchez, Spyridon Thermos et al.
Deep learning models often need sufficient supervision (i.e. labelled data) in order to be trained effectively. By contrast, humans can swiftly learn to identify important anatomy in medical images like MRI and CT scans, with minimal guidance. This recognition capability easily generalises to new images from different medical facilities and to new tasks in different settings. This rapid and generalisable learning ability is largely due to the compositional structure of image patterns in the human brain, which is not well represented in current medical models. In this paper, we study the utilisation of compositionality in learning more interpretable and generalisable representations for medical image segmentation. Overall, we propose that the underlying generative factors that are used to generate the medical images satisfy a compositional equivariance property, where each factor is compositional (e.g. corresponds to the structures in human anatomy) and also equivariant to the task. Hence, a good representation that approximates the ground truth factor well has to be compositionally equivariant. By modelling the compositional representations with learnable von-Mises-Fisher (vMF) kernels, we explore how different design and learning biases can be used to enforce the representations to be more compositionally equivariant under un-, weakly-, and semi-supervised settings. Extensive results show that our methods achieve the best performance over several strong baselines on the task of semi-supervised domain-generalised medical image segmentation. Code will be made publicly available upon acceptance at https://github.com/vios-s.
CV | Aug 6, 2022
HSIC-InfoGAN: Learning Unsupervised Disentangled Representations by Maximising Approximated Mutual Information
Xiao Liu, Spyridon Thermos, Pedro Sanchez et al.
Learning disentangled representations requires either supervision or the introduction of specific model designs and learning constraints as biases. InfoGAN is a popular disentanglement framework that learns unsupervised disentangled representations by maximising the mutual information between latent representations and their corresponding generated images. Maximisation of mutual information is achieved by introducing an auxiliary network and training with a latent regression loss. In this short exploratory paper, we study the use of the Hilbert-Schmidt Independence Criterion (HSIC) to approximate mutual information between latent representation and image, termed HSIC-InfoGAN. Directly optimising the HSIC loss avoids the need for an additional auxiliary network. We qualitatively compare the level of disentanglement in each model, suggest a strategy to tune the hyperparameters of HSIC-InfoGAN, and discuss the potential of HSIC-InfoGAN for medical applications.
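For readers unfamiliar with HSIC, the following is a minimal sketch of the standard biased empirical estimator, `trace(K H L H) / (n - 1)^2`, where `K` and `L` are kernel Gram matrices of the two sample sets and `H` is the centring matrix. The Gaussian kernel, the shared bandwidth `sigma`, and the function names are assumptions for this example; the paper's kernel choice and hyperparameters may differ.

```python
import math


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel over a list of equal-length vectors."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [[math.exp(-sqdist(a, b) / (2.0 * sigma ** 2)) for b in xs] for a in xs]


def center(K):
    """Double-centre a square matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired samples xs[i] and ys[i]."""
    n = len(xs)
    Kc = center(gaussian_gram(xs, sigma))
    L = gaussian_gram(ys, sigma)
    # trace(H K H L) equals the elementwise sum because L is symmetric.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2
```

In a training loop, the negative of such a quantity, computed between latent codes and generated images with a differentiable tensor library, could serve as the loss term; this pure-Python version only illustrates the arithmetic.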
CV | Sep 25, 2023
Noise-in, Bias-out: Balanced and Real-time MoCap Solving
Georgios Albanis, Nikolaos Zioulis, Spyridon Thermos et al.
Real-time optical Motion Capture (MoCap) systems have not benefited from the advances in modern data-driven modeling. In this work we apply machine learning to solve noisy unstructured marker estimates in real-time and deliver robust marker-based MoCap even when using sparse affordable sensors. To achieve this we focus on a number of challenges related to model training, namely the sourcing of training data and their long-tailed distribution. Leveraging representation learning we design a technique for imbalanced regression that requires no additional data or labels and improves the performance of our model in rare and challenging poses. By relying on a unified representation, we show that training such a model is not bound to high-end MoCap training data acquisition, and exploit the advances in marker-less MoCap to acquire the necessary data. Finally, we take a step towards richer and affordable MoCap by adapting a body model-based inverse kinematics solution to account for measurement and inference uncertainty, further improving performance and robustness. Project page: https://moverseai.github.io/noise-tail
CV | Jun 10, 2025 | Code
Monocular 3D Hand Pose Estimation with Implicit Camera Alignment
Christos Pantazopoulos, Spyridon Thermos, Gerasimos Potamianos
Estimating the 3D hand articulation from a single color image is an important problem with applications in Augmented Reality (AR), Virtual Reality (VR), Human-Computer Interaction (HCI), and robotics. Apart from the absence of depth information, occlusions, articulation complexity, and the need for camera parameters knowledge pose additional challenges. In this work, we propose an optimization pipeline for estimating the 3D hand articulation from 2D keypoint input, which includes a keypoint alignment step and a fingertip loss to overcome the need to know or estimate the camera parameters. We evaluate our approach on the EgoDexter and Dexter+Object benchmarks to showcase that it performs competitively with the state-of-the-art, while also demonstrating its robustness when processing "in-the-wild" images without any prior camera knowledge. Our quantitative analysis highlights the sensitivity of the 2D keypoint estimation accuracy, despite the use of hand priors. Code is available at the project page https://cpantazop.github.io/HandRepo/
IV | Jul 4, 2021 | Code
Controllable cardiac synthesis via disentangled anatomy arithmetic
Spyridon Thermos, Xiao Liu, Alison O'Neil et al.
Acquiring annotated data at scale with rare diseases or conditions remains a challenge. It would be extremely useful to have a method that controllably synthesizes images that can correct such underrepresentation. Assuming a proper latent representation, the idea of a "latent vector arithmetic" could offer the means of achieving such synthesis. A proper representation must encode the fidelity of the input data, preserve invariance and equivariance, and permit arithmetic operations. Motivated by the ability to disentangle images into spatial anatomy (tensor) factors and accompanying imaging (vector) representations, we propose a framework termed "disentangled anatomy arithmetic", in which a generative model learns to combine anatomical factors of different input images such that when they are re-entangled with the desired imaging modality (e.g. MRI), plausible new cardiac images are created with the target characteristics. To encourage a realistic combination of anatomy factors after the arithmetic step, we propose a localized noise injection network that precedes the generator. Our model is used to generate realistic images, pathology labels, and segmentation masks that are used to augment the existing datasets and subsequently improve post-hoc classification and segmentation tasks. Code is publicly available at https://github.com/vios-s/DAA-GAN.
CV | Sep 3, 2019 | Code
Self-Supervised Deep Depth Denoising
Vladimiros Sterzentsenko, Leonidas Saroglou, Anargyros Chatzitofis et al.
Depth perception is considered an invaluable source of information for various vision tasks. However, depth maps acquired using consumer-level sensors still suffer from non-negligible noise. This fact has recently motivated researchers to exploit traditional filters, as well as the deep learning paradigm, in order to suppress the aforementioned non-uniform noise, while preserving geometric details. Despite the effort, deep depth denoising is still an open challenge mainly due to the lack of clean data that could be used as ground truth. In this paper, we propose a fully convolutional deep autoencoder that learns to denoise depth maps, overcoming the lack of ground truth data. Specifically, the proposed autoencoder exploits multiple views of the same scene from different points of view in order to learn to suppress noise in a self-supervised end-to-end manner using depth and color information during training, yet only depth during inference. To enforce self-supervision, we leverage a differentiable rendering technique to exploit photometric supervision, which is further regularized using geometric and surface priors. As the proposed approach relies on raw data acquisition, a large RGB-D corpus is collected using Intel RealSense sensors. Complementary to a quantitative evaluation, we demonstrate the effectiveness of the proposed self-supervised denoising approach on established 3D reconstruction applications. Code is available at https://github.com/VCL3D/DeepDepthDenoising
CV | Sep 14, 2025
On the Skinning of Gaussian Avatars
Nikolaos Zioulis, Nikolaos Kotarelas, Georgios Albanis et al.
Radiance field-based methods have recently been used to reconstruct human avatars, showing that we can significantly downscale the systems needed for creating animated human avatars. Although this progress has been initiated by neural radiance fields, their slow rendering and backward mapping from the observation space to the canonical space have been the main challenges. With Gaussian splatting overcoming both challenges, a new family of approaches has emerged that are faster to train and render, while also straightforward to implement using forward skinning from the canonical to the observation space. However, the linear blend skinning required for the deformation of the Gaussians does not provide valid results for their non-linear rotation properties. To address such artifacts, recent works use mesh properties to rotate the non-linear Gaussian properties or train models to predict corrective offsets. Instead, we propose a weighted rotation blending approach that leverages quaternion averaging. This leads to simpler vertex-based Gaussians that can be efficiently animated and integrated in any engine by only modifying the linear blend skinning technique, and using any Gaussian rasterizer.
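To make the idea of blending rotations with skinning weights concrete, here is a small sketch of a common approximation to weighted quaternion averaging: quaternions are flipped into the same hemisphere, summed with their weights, and renormalised. This is an illustrative stand-in, not the method from the paper; the `(w, x, y, z)` component order and the function name are assumptions.

```python
import math


def blend_rotations(quats, weights):
    """Blend unit quaternions (w, x, y, z) with a normalised weighted sum.

    q and -q encode the same rotation, so each quaternion is first flipped
    into the hemisphere of the first one. The result approximates the
    weighted average rotation when the inputs are reasonably close.
    """
    ref = quats[0]
    acc = [0.0, 0.0, 0.0, 0.0]
    for q, w in zip(quats, weights):
        sign = 1.0 if sum(a * b for a, b in zip(q, ref)) >= 0.0 else -1.0
        for k in range(4):
            acc[k] += sign * w * q[k]
    norm = math.sqrt(sum(a * a for a in acc))
    return [a / norm for a in acc]
```

For example, blending the identity with a 90 degree rotation about the z axis at equal weights yields a 22.5 degree half-angle, i.e. a 45 degree rotation about z, whereas blending rotation matrices linearly would not in general produce a valid rotation.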
GR | Aug 25, 2025
Controllable Single-shot Animation Blending with Temporal Conditioning
Eleni Tselepi, Spyridon Thermos, Gerasimos Potamianos
Training a generative model on a single human skeletal motion sequence without being bound to a specific kinematic tree has drawn significant attention from the animation community. Unlike text-to-motion generation, single-shot models allow animators to controllably generate variations of existing motion patterns without requiring additional data or extensive retraining. However, existing single-shot methods do not explicitly offer a controllable framework for blending two or more motions within a single generative pass. In this paper, we present the first single-shot motion blending framework that enables seamless blending by temporally conditioning the generation process. Our method introduces a skeleton-aware normalization mechanism to guide the transition between motions, allowing smooth, data-driven control over when and how motions blend. We perform extensive quantitative and qualitative evaluations across various animation styles and different kinematic skeletons, demonstrating that our approach produces plausible, smooth, and controllable motion blends in a unified and efficient manner.
CV | Jun 3, 2024
Towards Practical Single-shot Motion Synthesis
Konstantinos Roditakis, Spyridon Thermos, Nikolaos Zioulis
Despite recent advances in so-called "cold start" generation from text prompts, the data and computing resources these models need, as well as the ambiguities around intellectual property and privacy, raise counterarguments to their utility. An interesting and relatively unexplored alternative has been the introduction of unconditional synthesis from a single sample, which has led to interesting generative applications. In this paper we focus on single-shot motion generation and more specifically on accelerating the training time of a Generative Adversarial Network (GAN). In particular, we tackle the challenge of GAN's equilibrium collapse when using mini-batch training by carefully annealing the weights of the loss functions that prevent mode collapse. Additionally, we perform statistical analysis in the generator and discriminator models to identify correlations between training stages and enable transfer learning. Our improved GAN achieves competitive quality and diversity on the Mixamo benchmark when compared to the original GAN architecture and a single-shot diffusion model, while being up to 6.8x faster to train than the former and 1.75x faster than the latter. Finally, we demonstrate the ability of our improved GAN to mix and compose motion with a single forward pass. Project page available at https://moverseai.github.io/single-shot.
CV | Aug 26, 2021
Learning Disentangled Representations in the Imaging Domain
Xiao Liu, Pedro Sanchez, Spyridon Thermos et al.
Disentangled representation learning has been proposed as an approach to learning general representations even in the absence of, or with limited, supervision. A good general representation can be fine-tuned for new target tasks using modest amounts of data, or used directly in unseen domains achieving remarkable performance in the corresponding task. This alleviation of the data and annotation requirements offers tantalising prospects for applications in computer vision and healthcare. In this tutorial paper, we motivate the need for disentangled representations, revisit key concepts, and describe practical building blocks and criteria for learning such representations. We survey applications in medical imaging emphasising choices made in exemplar key works, and then discuss links to computer vision applications. We conclude by presenting limitations, challenges, and opportunities.
CV | Jun 24, 2021
Semi-supervised Meta-learning with Disentanglement for Domain-generalised Medical Image Segmentation
Xiao Liu, Spyridon Thermos, Alison O'Neil et al.
Generalising deep models to new data from new centres (termed here domains) remains a challenge. This is largely attributed to shifts in data statistics (domain shifts) between source and unseen domains. Recently, gradient-based meta-learning approaches where the training data are split into meta-train and meta-test sets to simulate and handle the domain shifts during training have shown improved generalisation performance. However, the current fully supervised meta-learning approaches are not scalable for medical image segmentation, where large effort is required to create pixel-wise annotations. Meanwhile, in a low data regime, the simulated domain shifts may not approximate the true domain shifts well across source and unseen domains. To address this problem, we propose a novel semi-supervised meta-learning framework with disentanglement. We explicitly model the representations related to domain shifts. Disentangling the representations and combining them to reconstruct the input image allows unlabeled data to be used to better approximate the true domain shifts for meta-learning. Hence, the model can achieve better generalisation performance, especially when there is a limited amount of labeled data. Experiments show that the proposed method is robust on different segmentation tasks and achieves state-of-the-art generalisation performance on two public benchmarks.
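The meta-train/meta-test split mentioned above can be pictured with a short sketch: at each training episode the source domains are shuffled and a few are held out to simulate an unseen domain. The function name, the domain labels, and the default of holding out one domain are hypothetical choices for this example, not details from the paper.

```python
import random


def split_domains(domains, n_meta_test=1, rng=None):
    """Randomly partition source domains into (meta_train, meta_test) lists."""
    rng = rng or random.Random()
    shuffled = list(domains)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled[n_meta_test:], shuffled[:n_meta_test]


# One simulated domain shift per episode: update on meta_train,
# then evaluate the updated model on the held-out meta_test domains.
meta_train, meta_test = split_domains(["centre_A", "centre_B", "centre_C"])
```

Repeating the split every episode exposes the model to many simulated shifts, which is the mechanism the abstract says becomes unreliable when labelled data are scarce.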
LG | Feb 26, 2021
GaNDLF: A Generally Nuanced Deep Learning Framework for Scalable End-to-End Clinical Workflows in Medical Imaging
Sarthak Pati, Siddhesh P. Thakur, İbrahim Ethem Hamamcı et al.
Deep Learning (DL) has the potential to optimize machine learning in both the scientific and clinical communities. However, greater expertise is required to develop DL algorithms, and the variability of implementations hinders their reproducibility, translation, and deployment. Here we present the community-driven Generally Nuanced Deep Learning Framework (GaNDLF), with the goal of lowering these barriers. GaNDLF makes the mechanism of DL development, training, and inference more stable, reproducible, interpretable, and scalable, without requiring an extensive technical background. GaNDLF aims to provide an end-to-end solution for all DL-related tasks in computational precision medicine. We demonstrate the ability of GaNDLF to analyze both radiology and histology images, with built-in support for k-fold cross-validation, data augmentation, multiple modalities and output classes. Our quantitative performance evaluation on numerous use cases, anatomies, and computational tasks supports GaNDLF as a robust application framework for deployment in clinical workflows.
CV | Aug 27, 2020
Measuring the Biases and Effectiveness of Content-Style Disentanglement
Xiao Liu, Spyridon Thermos, Gabriele Valvano et al.
A recent spate of state-of-the-art semi- and un-supervised solutions disentangle and encode image "content" into a spatial tensor and image appearance or "style" into a vector, to achieve good performance in spatially equivariant tasks (e.g. image-to-image translation). To achieve this, they employ different model design, learning objective, and data biases. While considerable effort has been made to measure disentanglement in vector representations, and assess its impact on task performance, such analysis for (spatial) content-style disentanglement is lacking. In this paper, we conduct an empirical study to investigate the role of different biases in content-style disentanglement settings and unveil the relationship between the degree of disentanglement and task performance. In particular, we consider the setting where we: (i) identify key design choices and learning constraints for three popular content-style disentanglement models; (ii) relax or remove such constraints in an ablation fashion; and (iii) use two metrics to measure the degree of disentanglement and assess its effect on the performance of each task. Our experiments reveal that there is a "sweet spot" between disentanglement, task performance and, surprisingly, content interpretability, suggesting that blindly forcing higher disentanglement can hurt model performance and the semanticness of the content factors. Our findings, as well as the task-independent metrics used, can guide the design and selection of new models for tasks where content-style representations are useful.
IV | Aug 26, 2020
Disentangled Representations for Domain-generalized Cardiac Segmentation
Xiao Liu, Spyridon Thermos, Agisilaos Chartsias et al.
Robust cardiac image segmentation is still an open challenge due to the inability of the existing methods to achieve satisfactory performance on unseen data of different domains. Since the acquisition and annotation of medical data are costly and time-consuming, recent work focuses on domain adaptation and generalization to bridge the gap between data from different populations and scanners. In this paper, we propose two data augmentation methods that focus on improving the domain adaptation and generalization abilities of state-of-the-art cardiac segmentation models. In particular, our "Resolution Augmentation" method generates more diverse data by rescaling images to different resolutions within a range spanning different scanner protocols. Subsequently, our "Factor-based Augmentation" method generates more diverse data by projecting the original samples onto disentangled latent spaces, and combining the learned anatomy and modality factors from different domains. Our extensive experiments demonstrate the importance of efficient adaptation between seen and unseen domains, as well as model generalization ability, to robust cardiac image segmentation.
CV | Apr 18, 2020
A Deep Learning Approach to Object Affordance Segmentation
Spyridon Thermos, Petros Daras, Gerasimos Potamianos
Learning to understand and infer object functionalities is an important step towards robust visual intelligence. Significant research efforts have recently focused on segmenting the object parts that enable specific types of human-object interaction, the so-called "object affordances". However, most works treat it as a static semantic segmentation problem, focusing solely on object appearance and relying on strong supervision and object detection. In this paper, we propose a novel approach that exploits the spatio-temporal nature of human-object interaction for affordance segmentation. In particular, we design an autoencoder that is trained using ground-truth labels of only the last frame of the sequence, and is able to infer pixel-wise affordance labels in both videos and static images. Our model surpasses the need for object labels and bounding boxes by using a soft-attention mechanism that enables the implicit localization of the interaction hotspot. For evaluation purposes, we introduce the SOR3D-AFF corpus, which consists of human-object interaction sequences and supports 9 types of affordances in terms of pixel-wise annotation, covering typical manipulations of tool-like objects. We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF, while being able to predict affordances for similar unseen objects in two affordance image-only datasets.
CV | Mar 23, 2020
Deep Soft Procrustes for Markerless Volumetric Sensor Alignment
Vladimiros Sterzentsenko, Alexandros Doumanoglou, Spyridon Thermos et al.
With the advent of consumer-grade depth sensors, low-cost volumetric capture systems are easier to deploy. Their wider adoption though depends on their usability and by extension on the practicality of spatially aligning multiple sensors. Most existing alignment approaches employ visual patterns, e.g. checkerboards, or markers and require high user involvement and technical knowledge. More user-friendly and easier-to-use approaches rely on markerless methods that exploit geometric patterns of a physical structure. However, current state-of-the-art approaches are bounded by restrictions in the placement and the number of sensors. In this work, we improve markerless data-driven correspondence estimation to achieve more robust and flexible multi-sensor spatial alignment. In particular, we incorporate geometric constraints in an end-to-end manner into a typical segmentation-based model and bridge the intermediate dense classification task with the targeted pose estimation one. This is accomplished by a soft, differentiable Procrustes analysis that regularizes the segmentation and achieves higher extrinsic calibration performance in expanded sensor placement configurations, while being unrestricted by the number of sensors of the volumetric capture system. Our model is experimentally shown to achieve similar results to marker-based methods and outperform the markerless ones, while also being robust to the pose variations of the calibration structure. Code and pretrained models are available at https://vcl3d.github.io/StructureNet/.
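As a simplified picture of what a weighted Procrustes step computes, the sketch below solves the 2D rigid alignment problem in closed form: given corresponding points and per-correspondence weights, it returns the rotation angle and translation minimising the weighted squared error. The paper works with 3D sensor poses inside a network; the restriction to 2D, the function name, and the use of plain confidence weights are assumptions made to keep the example short.

```python
import math


def weighted_procrustes_2d(src, dst, weights):
    """Return (theta, t) minimising sum_i w_i * |R(theta) src_i + t - dst_i|^2."""
    total = sum(weights)
    # Weighted centroids of both point sets.
    cs = [sum(w * p[k] for p, w in zip(src, weights)) / total for k in range(2)]
    cd = [sum(w * p[k] for p, w in zip(dst, weights)) / total for k in range(2)]
    dot = cross = 0.0
    for p, q, w in zip(src, dst, weights):
        px, py = p[0] - cs[0], p[1] - cs[1]
        qx, qy = q[0] - cd[0], q[1] - cd[1]
        dot += w * (px * qx + py * qy)
        cross += w * (px * qy - py * qx)
    # The objective reduces to maximising cos(theta)*dot + sin(theta)*cross.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation maps the rotated source centroid onto the target centroid.
    t = [cd[0] - (c * cs[0] - s * cs[1]), cd[1] - (s * cs[0] + c * cs[1])]
    return theta, t
```

Every operation here is smooth in the point coordinates and weights away from degenerate configurations, which hints at why such a solver can sit inside an end-to-end trained model; the 3D case typically replaces the `atan2` step with a singular value decomposition.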
CV | Apr 10, 2017
Deep Affordance-grounded Sensorimotor Object Recognition
Spyridon Thermos, Georgios Th. Papadopoulos, Petros Daras et al.
It is well-established by cognitive neuroscience that human perception of objects constitutes a complex process, where object appearance information is combined with evidence about the so-called object "affordances", namely the types of actions that humans typically perform when interacting with them. This fact has recently motivated the "sensorimotor" approach to the challenging task of automatic object recognition, where both information sources are fused to improve robustness. In this work, the aforementioned paradigm is adopted, surpassing current limitations of sensorimotor object recognition research. Specifically, the deep learning paradigm is introduced to the problem for the first time, developing a number of novel neuro-biologically and neuro-physiologically inspired architectures that utilize state-of-the-art neural networks for fusing the available information sources in multiple ways. The proposed methods are evaluated using a large RGB-D corpus, which is specifically collected for the task of sensorimotor object recognition and is made publicly available. Experimental results demonstrate the utility of affordance information to object recognition, with its inclusion yielding up to a 29% relative error reduction.