17.1CRMay 25Code
Operational Runtime Behavior Mining for Open-Source Supply Chain SecurityZhuoran Tan, Ke Xiao, Jeremy Singer et al.
Open-source software (OSS) is a critical component of modern software systems, yet supply chain security remains challenging in practice due to unavailable or obfuscated source code. Consequently, security teams often rely on runtime observations collected from sandboxed executions to investigate suspicious third-party components. We present HeteroGAT-Rank, an industry-oriented runtime behavior mining system that supports analyst-in-the-loop supply chain threat investigation. The system models execution-time behaviors of OSS packages as lightweight heterogeneous graphs and applies attention-based graph learning to rank behavioral patterns that are most relevant for security analysis. Rather than aiming for fully automated detection, HeteroGAT-Rank surfaces actionable runtime signals - such as file, network, and command activities - to guide manual investigation and threat hunting. To operate at ecosystem scale, the system decouples offline behavior mining from online analysis and integrates parallel graph construction for efficient processing across multiple ecosystems. An evaluation on a large-scale OSS execution dataset shows that HeteroGAT-Rank effectively highlights meaningful and interpretable behavioral indicators aligned with real-world vulnerability and attack trends, supporting practical security workflows under realistic operational constraints.
CVOct 14, 2022Code
Optimizing Vision Transformers for Medical Image SegmentationQianying Liu, Chaitanya Kaul, Jun Wang et al.
For medical image semantic segmentation (MISS), Vision Transformers have emerged as strong alternatives to convolutional neural networks thanks to their inherent ability to capture long-range correlations. However, existing research uses off-the-shelf vision Transformer blocks based on linear projections and feature processing which lack spatial and local context to refine organ boundaries. Furthermore, Transformers do not generalize well on small medical imaging datasets and rely on large-scale pre-training due to limited inductive biases. To address these problems, we demonstrate the design of a compact and accurate Transformer network for MISS, CS-Unet, which introduces convolutions in a multi-stage design for hierarchically enhancing spatial and local modeling ability of Transformers. This is mainly achieved by our well-designed Convolutional Swin Transformer (CST) block which merges convolutions with Multi-Head Self-Attention and Feed-Forward Networks for providing inherent localized spatial context and inductive biases. Experiments demonstrate CS-Unet without pre-training outperforms other counterparts by large margins on multi-organ and cardiac datasets with fewer parameters and achieves state-of-the-art performance. Our code is available at Github.
IRMay 12, 2022Code
Improving Sequential Query Recommendation with Immediate User FeedbackShameem A Puthiya Parambath, Christos Anagnostopoulos, Roderick Murray-Smith
We propose an algorithm for next query recommendation in interactive data exploration settings, like knowledge discovery for information gathering. The state-of-the-art query recommendation algorithms are based on sequence-to-sequence learning approaches that exploit historical interaction data. Due to the supervision involved in the learning process, such approaches fail to adapt to immediate user feedback. We propose to augment the transformer-based causal language models for query recommendations to adapt to the immediate user feedback using multi-armed bandit (MAB) framework. We conduct a large-scale experimental study using log files from a popular online literature discovery service and demonstrate that our algorithm improves the per-round regret substantially, with respect to the state-of-the-art transformer-based query recommendation models, which do not make use of immediate user feedback. Our data model and source code are available at https://github.com/shampp/exp3_ss
LGSep 13, 2023Code
FedDIP: Federated Learning with Extreme Dynamic Pruning and Incremental RegularizationQianyu Long, Christos Anagnostopoulos, Shameem Puthiya Parambath et al.
Federated Learning (FL) has been successfully adopted for distributed training and inference of large-scale Deep Neural Networks (DNNs). However, DNNs are characterized by an extremely large number of parameters, thus, yielding significant challenges in exchanging these parameters among distributed nodes and managing the memory. Although recent DNN compression methods (e.g., sparsification, pruning) tackle such challenges, they do not holistically consider an adaptively controlled reduction of parameter exchange while maintaining high accuracy levels. We, therefore, contribute with a novel FL framework (coined FedDIP), which combines (i) dynamic model pruning with error feedback to eliminate redundant information exchange, which contributes to significant performance improvement, with (ii) incremental regularization that can achieve \textit{extreme} sparsity of models. We provide convergence analysis of FedDIP and report on a comprehensive performance and comparative assessment against state-of-the-art methods using benchmark data sets and DNN models. Our results showcase that FedDIP not only controls the model sparsity but efficiently achieves similar or better performance compared to other model pruning methods adopting incremental regularization during distributed model training. The code is available at: https://github.com/EricLoong/feddip.
LGFeb 9
An arithmetic method algorithm optimizing k-nearest neighbors compared to regression algorithms and evaluated on real world data sourcesTheodoros Anagnostopoulos, Evanthia Zervoudi, Christos Anagnostopoulos et al.
Linear regression analysis focuses on predicting a numeric regressand value based on certain regressor values. In this context, k-Nearest Neighbors (k-NN) is a common non-parametric regression algorithm, which achieves efficient performance when compared with other algorithms in literature. In this research effort an optimization of the k-NN algorithm is proposed by exploiting the potentiality of an introduced arithmetic method, which can provide solutions for linear equations involving an arbitrary number of real variables. Specifically, an Arithmetic Method Algorithm (AMA) is adopted to assess the efficiency of the introduced arithmetic method, while an Arithmetic Method Regression (AMR) algorithm is proposed as an optimization of k-NN adopting the potentiality of AMA. Such algorithm is compared with other regression algorithms, according to an introduced optimal inference decision rule, and evaluated on certain real world data sources, which are publicly available. Results are promising since the proposed AMR algorithm has comparable performance with the other algorithms, while in most cases it achieves better performance than the k-NN. The output results indicate that introduced AMR is an optimization of k-NN.
LGApr 24, 2024Code
Decentralized Personalized Federated Learning based on a Conditional Sparse-to-Sparser SchemeQianyu Long, Qiyuan Wang, Christos Anagnostopoulos et al.
Decentralized Federated Learning (DFL) has become popular due to its robustness and avoidance of centralized coordination. In this paradigm, clients actively engage in training by exchanging models with their networked neighbors. However, DFL introduces increased costs in terms of training and communication. Existing methods focus on minimizing communication often overlooking training efficiency and data heterogeneity. To address this gap, we propose a novel \textit{sparse-to-sparser} training scheme: DA-DPFL. DA-DPFL initializes with a subset of model parameters, which progressively reduces during training via \textit{dynamic aggregation} and leads to substantial energy savings while retaining adequate information during critical learning periods. Our experiments showcase that DA-DPFL substantially outperforms DFL baselines in test accuracy, while achieving up to $5$ times reduction in energy costs. We provide a theoretical analysis of DA-DPFL's convergence by solidifying its applicability in decentralized and personalized learning. The code is available at:https://github.com/EricLoong/da-dpfl
CVSep 10, 2025Code
Multi-Modal Robust Enhancement for Coastal Water Segmentation: A Systematic HSV-Guided FrameworkZhen Tian, Christos Anagnostopoulos, Qiyuan Wang et al.
Coastal water segmentation from satellite imagery presents unique challenges due to complex spectral characteristics and irregular boundary patterns. Traditional RGB-based approaches often suffer from training instability and poor generalization in diverse maritime environments. This paper introduces a systematic robust enhancement framework, referred to as Robust U-Net, that leverages HSV color space supervision and multi-modal constraints for improved coastal water segmentation. Our approach integrates five synergistic components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Through comprehensive ablation studies, we demonstrate that HSV supervision provides the highest impact (0.85 influence score), while the complete framework achieves superior training stability (84\% variance reduction) and enhanced segmentation quality. Our method shows consistent improvements across multiple evaluation metrics while maintaining computational efficiency. For reproducibility, our training configurations and code are available here: https://github.com/UofgCoastline/ICASSP-2026-Robust-Unet.
CRAug 31, 2025Code
Integrated Simulation Framework for Adversarial Attacks on Autonomous VehiclesChristos Anagnostopoulos, Ioulia Kapsali, Alexandros Gkillas et al.
Autonomous vehicles (AVs) rely on complex perception and communication systems, making them vulnerable to adversarial attacks that can compromise safety. While simulation offers a scalable and safe environment for robustness testing, existing frameworks typically lack comprehensive supportfor modeling multi-domain adversarial scenarios. This paper introduces a novel, open-source integrated simulation framework designed to generate adversarial attacks targeting both perception and communication layers of AVs. The framework provides high-fidelity modeling of physical environments, traffic dynamics, and V2X networking, orchestrating these components through a unified core that synchronizes multiple simulators based on a single configuration file. Our implementation supports diverse perception-level attacks on LiDAR sensor data, along with communication-level threats such as V2X message manipulation and GPS spoofing. Furthermore, ROS 2 integration ensures seamless compatibility with third-party AV software stacks. We demonstrate the framework's effectiveness by evaluating the impact of generated adversarial scenarios on a state-of-the-art 3D object detector, revealing significant performance degradation under realistic conditions.
22.6CRMar 17
SynthChain: A Synthetic Benchmark and Forensic Analysis of Advanced and Stealthy Software Supply Chain AttacksZhuoran Tan, Wenbo Guo, Taylor Brierley et al.
Advanced software supply chain (SSC) attacks are increasingly runtime-only and leave fragmented evidence across hosts, services, and build/dependency layers, so any single telemetry stream is inherently insufficient to reconstruct full compromise chains under realistic access and budget limits. We present SynthChain, a near-production testbed and a multi-source runtime dataset with chain-level ground truth, derived from real-world malicious packages and exploit campaigns. SynthChain covers seven representative supply-chain exploit scenarios across PyPI, npm, and a native C/C++ supply-chain case, spanning Windows and Linux, and involving four hosts and one containerized environment. Scenarios span realistic time windows from minutes to hours and are annotated with 14 MITRE ATT&CK tactics and 161 techniques (29-104 techniques per scenario). Beyond releasing the data, we quantify observability constraints by mapping each chain step to the minimum evidence needed for detection and cross-source correlation. With realistic trace availability, no single source is chain-complete: the best single source reaches only 0.391 weighted tag/step coverage and 0.403 mean chain reconstruction. Even minimal two-source fusion boosts coverage to 0.636 and reconstruction to 0.639 (approximately 1.6x gain), with consistent chain coverage/recall improvements (0.545). The corpus contains approximately 0.58M raw multi-source events and 1.50M evaluation rows, enabling controlled studies of detection under constrained telemetry. We release the dataset, ground truth, and artifacts to support reproducible, forensic-aware runtime defenses and to guide efficient detection for software supply chains.
37.0CRMar 30
Attesting LLM Pipelines: Enforcing Verifiable Training and Release ClaimsZhuoran Tan, Jeremy Singer, Christos Anagnostopoulos
Modern Large Language Model (LLM) systems are assembled from third-party artifacts such as pre-trained weights, fine-tuning adapters, datasets, dependency packages, and container images, fetched through automated pipelines. This speed comes with supply-chain risks, including compromised dependencies, malicious hub artifacts, unsafe deserialization, forged provenance, and backdoored models. A core gap is that training and release claims (e.g., data and code lineage, build environment, and security scanning results) are rarely cryptographically bound to the artifacts they describe, making enforcement inconsistent across teams and stages. We propose an attestation-aware promotion gate: before an artifact is admitted into trusted environments (training, fine-tuning, deployment), the gate verifies claim evidence, enforces safe loading and static scanning policies, and applies secure-by-default deployment constraints. When organizations operate runtime security tooling, the same gate can optionally ingest standardized dynamic signals via plugins to reduce uncertainty for high-risk artifacts. We outline a practical claims-to-controls mapping and an evaluation blueprint using representative supply-chain scenarios and operational metrics (coverage and decisions), charting a path toward a full research paper.
RONov 6, 2024
Federated Data-Driven Kalman Filtering for State EstimationNikos Piperigkos, Alexandros Gkillas, Christos Anagnostopoulos et al.
This paper proposes a novel localization framework based on collaborative training or federated learning paradigm, for highly accurate localization of autonomous vehicles. More specifically, we build on the standard approach of KalmanNet, a recurrent neural network aiming to estimate the underlying system uncertainty of traditional Extended Kalman Filtering, and reformulate it by the adapt-then-combine concept to FedKalmanNet. The latter is trained in a distributed manner by a group of vehicles (or clients), with local training datasets consisting of vehicular location and velocity measurements, through a global server aggregation operation. The FedKalmanNet is then used by each vehicle to localize itself, by estimating the associated system uncertainty matrices (i.e, Kalman gain). Our aim is to actually demonstrate the benefits of collaborative training for state estimation in autonomous driving, over collaborative decision-making which requires rich V2X communication resources for measurement exchange and sensor fusion under real-time constraints. An extensive experimental and evaluation study conducted in CARLA autonomous driving simulator highlights the superior performance of FedKalmanNet over state-of-the-art collaborative decision-making approaches, in localizing vehicles without the need of real-time V2X communication.
LGJul 8, 2025
FedPhD: Federated Pruning with Hierarchical Learning of Diffusion ModelsQianyu Long, Qiyuan Wang, Christos Anagnostopoulos et al.
Federated Learning (FL), as a distributed learning paradigm, trains models over distributed clients' data. FL is particularly beneficial for distributed training of Diffusion Models (DMs), which are high-quality image generators that require diverse data. However, challenges such as high communication costs and data heterogeneity persist in training DMs similar to training Transformers and Convolutional Neural Networks. Limited research has addressed these issues in FL environments. To address this gap and challenges, we introduce a novel approach, FedPhD, designed to efficiently train DMs in FL environments. FedPhD leverages Hierarchical FL with homogeneity-aware model aggregation and selection policy to tackle data heterogeneity while reducing communication costs. The distributed structured pruning of FedPhD enhances computational efficiency and reduces model storage requirements in clients. Our experiments across multiple datasets demonstrate that FedPhD achieves high model performance regarding Fréchet Inception Distance (FID) scores while reducing communication costs by up to $88\%$. FedPhD outperforms baseline methods achieving at least a $34\%$ improvement in FID, while utilizing only $56\%$ of the total computation and communication resources.
AIJan 16, 2025
Artificial Intelligence-Driven Clinical Decision Support SystemsMuhammet Alkan, Idris Zakariyya, Samuel Leighton et al.
As artificial intelligence (AI) becomes increasingly embedded in healthcare delivery, this chapter explores the critical aspects of developing reliable and ethical Clinical Decision Support Systems (CDSS). Beginning with the fundamental transition from traditional statistical models to sophisticated machine learning approaches, this work examines rigorous validation strategies and performance assessment methods, including the crucial role of model calibration and decision curve analysis. The chapter emphasizes that creating trustworthy AI systems in healthcare requires more than just technical accuracy; it demands careful consideration of fairness, explainability, and privacy. The challenge of ensuring equitable healthcare delivery through AI is stressed, discussing methods to identify and mitigate bias in clinical predictive models. The chapter then delves into explainability as a cornerstone of human-centered CDSS. This focus reflects the understanding that healthcare professionals must not only trust AI recommendations but also comprehend their underlying reasoning. The discussion advances in an analysis of privacy vulnerabilities in medical AI systems, from data leakage in deep learning models to sophisticated attacks against model explanations. The text explores privacy-preservation strategies such as differential privacy and federated learning, while acknowledging the inherent trade-offs between privacy protection and model performance. This progression, from technical validation to ethical considerations, reflects the multifaceted challenges of developing AI systems that can be seamlessly and reliably integrated into daily clinical practice while maintaining the highest standards of patient care and data protection.
ROAug 25, 2025
A holistic perception system of internal and external monitoring for ground autonomous vehicles: AutoTRUST paradigmAlexandros Gkillas, Christos Anagnostopoulos, Nikos Piperigkos et al.
This paper introduces a holistic perception system for internal and external monitoring of autonomous vehicles, with the aim of demonstrating a novel AI-leveraged self-adaptive framework of advanced vehicle technologies and solutions that optimize perception and experience on-board. Internal monitoring system relies on a multi-camera setup designed for predicting and identifying driver and occupant behavior through facial recognition, exploiting in addition a large language model as virtual assistant. Moreover, the in-cabin monitoring system includes AI-empowered smart sensors that measure air-quality and perform thermal comfort analysis for efficient on and off-boarding. On the other hand, external monitoring system perceives the surrounding environment of vehicle, through a LiDAR-based cost-efficient semantic segmentation approach, that performs highly accurate and efficient super-resolution on low-quality raw 3D point clouds. The holistic perception framework is developed in the context of EU's Horizon Europe programm AutoTRUST, and has been integrated and deployed on a real electric vehicle provided by ALKE. Experimental validation and evaluation at the integration site of Joint Research Centre at Ispra, Italy, highlights increased performance and efficiency of the modular blocks of the proposed perception architecture.
LGApr 14, 2025
STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential DataMaximilian Forstenhäusler, Daniel Külzer, Christos Anagnostopoulos et al.
Accurate predictions using sequential spatiotemporal data are crucial for various applications. Utilizing real-world data, we aim to learn the intent of a smart device user within confined areas of a vehicle's surroundings. However, in real-world scenarios, environmental factors and sensor limitations result in non-stationary and irregularly sampled data, posing significant challenges. To address these issues, we developed a Transformer-based approach, STaRFormer, which serves as a universal framework for sequential modeling. STaRFormer employs a novel, dynamic attention-based regional masking scheme combined with semi-supervised contrastive learning to enhance task-specific latent representations. Comprehensive experiments on 15 datasets varying in types (including non-stationary and irregularly sampled), domains, sequence lengths, training samples, and applications, demonstrate the efficacy and practicality of STaRFormer. We achieve notable improvements over state-of-the-art approaches. Code and data will be made available.
CVNov 7, 2024
Personalized Federated Learning for Cross-view Geo-localizationChristos Anagnostopoulos, Alexandros Gkillas, Nikos Piperigkos et al.
In this paper we propose a methodology combining Federated Learning (FL) with Cross-view Image Geo-localization (CVGL) techniques. We address the challenges of data privacy and heterogeneity in autonomous vehicle environments by proposing a personalized Federated Learning scenario that allows selective sharing of model parameters. Our method implements a coarse-to-fine approach, where clients share only the coarse feature extractors while keeping fine-grained features specific to local environments. We evaluate our approach against traditional centralized and single-client training schemes using the KITTI dataset combined with satellite imagery. Results demonstrate that our federated CVGL method achieves performance close to centralized training while maintaining data privacy. The proposed partial model sharing strategy shows comparable or slightly better performance than classical FL, offering significant reduced communication overhead without sacrificing accuracy. Our work contributes to more robust and privacy-preserving localization systems for autonomous vehicles operating in diverse environments
LGAug 31, 2021
Max-Utility Based Arm Selection Strategy For Sequential Query RecommendationsShameem A. Puthiya Parambath, Christos Anagnostopoulos, Roderick Murray-Smith et al.
We consider the query recommendation problem in closed loop interactive learning settings like online information gathering and exploratory analytics. The problem can be naturally modelled using the Multi-Armed Bandits (MAB) framework with countably many arms. The standard MAB algorithms for countably many arms begin with selecting a random set of candidate arms and then applying standard MAB algorithms, e.g., UCB, on this candidate set downstream. We show that such a selection strategy often results in higher cumulative regret and to this end, we propose a selection strategy based on the maximum utility of the arms. We show that in tasks like online information gathering, where sequential query recommendations are employed, the sequences of queries are correlated and the number of potentially optimal queries can be reduced to a manageable size by selecting queries with maximum utility with respect to the currently executing query. Our experimental results using a recent real online literature discovery service log file demonstrate that the proposed arm selection strategy improves the cumulative regret substantially with respect to the state-of-the-art baseline algorithms. % and commonly used random selection strategy for a variety of contextual multi-armed bandit algorithms. Our data model and source code are available at ~\url{https://anonymous.4open.science/r/0e5ad6b7-ac02-4577-9212-c9d505d3dbdb/}.
LGJul 22, 2021
A Proactive Management Scheme for Data Synopses at the EdgeKostas Kolomvatsos, Christos Anagnostopoulos
The combination of the infrastructure provided by the Internet of Things (IoT) with numerous processing nodes present at the Edge Computing (EC) ecosystem opens up new pathways to support intelligent applications. Such applications can be provided upon humongous volumes of data collected by IoT devices being transferred to the edge nodes through the network. Various processing activities can be performed on the discussed data and multiple collaborative opportunities between EC nodes can facilitate the execution of the desired tasks. In order to support an effective interaction between edge nodes, the knowledge about the geographically distributed data should be shared. Obviously, the migration of large amounts of data will harm the stability of the network stability and its performance. In this paper, we recommend the exchange of data synopses than real data between EC nodes to provide them with the necessary knowledge about peer nodes owning similar data. This knowledge can be valuable when considering decisions such as data/service migration and tasks offloading. We describe an continuous reasoning model that builds a temporal similarity map of the available datasets to get nodes understanding the evolution of data in their peers. We support the proposed decision making mechanism through an intelligent similarity extraction scheme based on an unsupervised machine learning model, and, at the same time, combine it with a statistical measure that represents the trend of the so-called discrepancy quantum. Our model can reveal the differences in the exchanged synopses and provide a datasets similarity map which becomes the appropriate knowledge base to support the desired processing activities. We present the problem under consideration and suggest a solution for that, while, at the same time, we reveal its advantages and disadvantages through a large number of experiments.
DBAug 12, 2020
An Intelligent Edge-Centric Queries Allocation Scheme based on Ensemble ModelsKostas Kolomvatsos, Christos Anagnostopoulos
The combination of Internet of Things (IoT) and Edge Computing (EC) can assist in the delivery of novel applications that will facilitate end users activities. Data collected by numerous devices present in the IoT infrastructure can be hosted into a set of EC nodes becoming the subject of processing tasks for the provision of analytics. Analytics are derived as the result of various queries defined by end users or applications. Such queries can be executed in the available EC nodes to limit the latency in the provision of responses. In this paper, we propose a meta-ensemble learning scheme that supports the decision making for the allocation of queries to the appropriate EC nodes. Our learning model decides over queries' and nodes' characteristics. We provide the description of a matching process between queries and nodes after concluding the contextual information for each envisioned characteristic adopted in our meta-ensemble scheme. We rely on widely known ensemble models, combine them and offer an additional processing layer to increase the performance. The aim is to result a subset of EC nodes that will host each incoming query. Apart from the description of the proposed model, we report on its evaluation and the corresponding results. Through a large set of experiments and a numerical analysis, we aim at revealing the pros and cons of the proposed scheme.
DCAug 1, 2020
Data Synopses Management based on a Deep Learning ModelPanagiotis Fountas, Kostas Kolomvatsos, Christos Anagnostopoulos
Pervasive computing involves the placement of processing services close to end users to support intelligent applications. With the advent of the Internet of Things (IoT) and the Edge Computing (EC), one can find room for placing services at various points in the interconnection of the aforementioned infrastructures. Of significant importance is the processing of the collected data. Such a processing can be realized upon the EC nodes that exhibit increased computational capabilities compared to IoT devices. An ecosystem of intelligent nodes is created at the EC giving the opportunity to support cooperative models. Nodes become the hosts of geo-distributed datasets formulated by the IoT devices reports. Upon the datasets, a number of queries/tasks can be executed. Queries/tasks can be offloaded for performance reasons. However, an offloading action should be carefully designed being always aligned with the data present to the hosting node. In this paper, we present a model to support the cooperative aspect in the EC infrastructure. We argue on the delivery of data synopses to EC nodes making them capable to take offloading decisions fully aligned with data present at peers. Nodes exchange data synopses to inform their peers. We propose a scheme that detects the appropriate time to distribute synopses trying to avoid the network overloading especially when synopses are frequently extracted due to the high rates at which IoT devices report data to EC nodes. Our approach involves a Deep Learning model for learning the distribution of calculated synopses and estimate future trends. Upon these trends, we are able to find the appropriate time to deliver synopses to peer nodes. We provide the description of the proposed mechanism and evaluate it based on real datasets. An extensive experimentation upon various scenarios reveals the pros and cons of the approach by giving numerical results.
LGJul 29, 2020
On the Use of Interpretable Machine Learning for the Management of Data QualityAnna Karanika, Panagiotis Oikonomou, Kostas Kolomvatsos et al.
Data quality is a significant issue for any application that requests for analytics to support decision making. It becomes very important when we focus on Internet of Things (IoT) where numerous devices can interact to exchange and process data. IoT devices are connected to Edge Computing (EC) nodes to report the collected data, thus, we have to secure data quality not only at the IoT but also at the edge of the network. In this paper, we focus on the specific problem and propose the use of interpretable machine learning to deliver the features that are important to be based for any data processing activity. Our aim is to secure data quality, at least, for those features that are detected as significant in the collected datasets. We have to notice that the selected features depict the highest correlation with the remaining in every dataset, thus, they can be adopted for dimensionality reduction. We focus on multiple methodologies for having interpretability in our learning models and adopt an ensemble scheme for the final decision. Our scheme is capable of timely retrieving the final result and efficiently select the appropriate features. We evaluate our model through extensive simulations and present numerical results. Our aim is to reveal its performance under various experimental scenarios that we create varying a set of parameters adopted in our mechanism.
DBAug 13, 2019
Adaptive Learning of Aggregate Analytics under Dynamic WorkloadsFotis Savva, Christos Anagnostopoulos, Peter Triantafillou
Large organizations have seamlessly incorporated data-driven decision making in their operations. However, as data volumes increase, expensive big data infrastructures are called to rescue. In this setting, analytics tasks become very costly in terms of query response time, resource consumption, and money in cloud deployments, especially when base data are stored across geographically distributed data centers. Therefore, we introduce an adaptive Machine Learning mechanism which is light-weight, stored client-side, can estimate the answers of a variety of aggregate queries and can avoid the big data backend. The estimations are performed in milliseconds are inexpensive and accurate as the mechanism learns from past analytical-query patterns. However, as analytic queries are ad-hoc and analysts' interests change over time we develop solutions that can swiftly and accurately detect such changes and adapt to new query patterns. The capabilities of our approach are demonstrated using extensive evaluation with real and synthetic datasets.
DBDec 29, 2018
Explaining Aggregates for Exploratory AnalyticsFotis Savva, Christos Anagnostopoulos, Peter Triantafillou
Analysts wishing to explore multivariate data spaces, typically pose queries involving selection operators, i.e., range or radius queries, which define data subspaces of possible interest and then use aggregation functions, the results of which determine their exploratory analytics interests. However, such aggregate query (AQ) results are simple scalars and as such, convey limited information about the queried subspaces for exploratory analysis. We address this shortcoming aiding analysts to explore and understand data subspaces by contributing a novel explanation mechanism coined XAXA: eXplaining Aggregates for eXploratory Analytics. XAXA's novel AQ explanations are represented using functions obtained by a three-fold joint optimization problem. Explanations assume the form of a set of parametric piecewise-linear functions acquired through a statistical learning model. A key feature of the proposed solution is that model training is performed by only monitoring AQs and their answers on-line. In XAXA, explanations for future AQs can be computed without any database (DB) access and can be used to further explore the queried data subspaces, without issuing any more queries to the DB. We evaluate the explanation accuracy and efficiency of XAXA through theoretically grounded metrics over real-world and synthetic datasets and query workloads.