CY Nov 26, 2021
Data Fusion Challenges Privacy: What Can Privacy Regulation Do?
Gábor Erdélyi, Olivia J. Erdélyi, Andreas W. Kempa-Liehr
This paper focuses on some shortcomings in current privacy and data protection regulations' ability to adequately address the ramifications of AI-driven data processing practices, in particular where data sets are combined and processed by AI systems. We draw attention to two regulatory anomalies related to two fundamental assumptions underlying traditional privacy and data protection approaches: (1) Only Personally Identifiable Information (PII) and Personal Data (PD) require privacy protection: Privacy and data protection regulations are triggered only with respect to PII/PD, not anonymous data. This is problematic not only because determining whether data falls into the former or latter category is no longer straightforward, but also because privacy risks associated with data processing may exist whether or not an individual can be identified. (2) Given sufficient information provided in a transparent and understandable manner, individuals are able to adequately assess the privacy implications of their actions and protect their privacy interests: However, relying on human privacy expectations fails to address important privacy threats, because those expectations are at odds with the actual privacy implications of data processing practices, as most people lack the necessary technical literacy to understand the sophisticated technologies at play and to correctly assess their privacy implications. To tackle these anomalies we recommend regulatory reform in two directions: (1) abolishing the distinction between personal and anonymized data for the purposes of triggering the application of privacy and data protection regulations and (2) developing methods to prioritize regulatory intervention based on the level of privacy risk posed by individual data processing actions.
LG Dec 18, 2019
Feature engineering workflow for activity recognition from synchronized inertial measurement units
Andreas W. Kempa-Liehr, Jonty Oram, Andrew Wong et al.
The ubiquitous availability of wearable sensors is responsible for driving the Internet-of-Things but is also making an impact on sport sciences and precision medicine. While human activity recognition from smartphone data or other types of inertial measurement units (IMU) has evolved into one of the most prominent daily life examples of machine learning, the underlying process of time-series feature engineering still seems to be time-consuming. This lengthy process inhibits the development of IMU-based machine learning applications in sport science and precision medicine. This contribution discusses a feature engineering workflow, which automates the extraction of time-series features based on the FRESH algorithm (FeatuRe Extraction based on Scalable Hypothesis tests) to identify statistically significant features from synchronized IMU sensors (IMeasureU Ltd, NZ). The feature engineering workflow has five main steps: time-series engineering, automated time-series feature extraction, optimized feature extraction, fitting of a specialized classifier, and deployment of the optimized machine learning pipeline. The workflow is discussed for the case of a user-specific running-walking classification, and the generalization to a multi-user multi-activity classification is demonstrated.
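To make the shape of such a workflow concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It covers only three of the five steps (windowing as a stand-in for time-series engineering, feature extraction, and fitting a trivial classifier) and omits hypothesis-test-based feature selection and deployment. All function names, the synthetic sine-wave "acceleration" signals, the window size, and the single-feature threshold classifier are assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def sliding_windows(signal: Sequence[float], size: int, step: int) -> List[List[float]]:
    """Cut a 1-D signal into fixed-size windows (hypothetical helper)."""
    return [list(signal[i:i + size]) for i in range(0, len(signal) - size + 1, step)]


def extract_features(window: Sequence[float]) -> Dict[str, float]:
    """Compute a few simple time-series features for one window."""
    return {
        "mean": mean(window),
        "std": pstdev(window),
        "abs_energy": sum(x * x for x in window),
    }


def fit_threshold(walk_values: Sequence[float], run_values: Sequence[float]) -> float:
    """Fit a one-feature classifier: the midpoint between the two class means."""
    return (mean(walk_values) + mean(run_values)) / 2.0


def classify(window: Sequence[float], threshold: float) -> str:
    """Label a window as running if its standard deviation exceeds the threshold."""
    return "running" if extract_features(window)["std"] > threshold else "walking"


# Synthetic stand-ins for IMU acceleration (assumption: running has larger amplitude).
walking = [1.0 * math.sin(2 * math.pi * i / 20) for i in range(100)]
running = [3.0 * math.sin(2 * math.pi * i / 10) for i in range(100)]

walk_windows = sliding_windows(walking, size=20, step=20)
run_windows = sliding_windows(running, size=20, step=20)

threshold = fit_threshold(
    [extract_features(w)["std"] for w in walk_windows],
    [extract_features(w)["std"] for w in run_windows],
)

# Classify two unseen windows with different amplitudes.
new_fast = [2.5 * math.sin(2 * math.pi * i / 10) for i in range(20)]
new_slow = [0.8 * math.sin(2 * math.pi * i / 20) for i in range(20)]
print(classify(new_fast, threshold), classify(new_slow, threshold))
```

A real pipeline would extract many more features per sensor axis and keep only those that pass a significance test, which is the part the FRESH algorithm automates.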
LG Oct 25, 2016
Distributed and parallel time series feature extraction for industrial big data applications
Maximilian Christ, Andreas W. Kempa-Liehr, Michael Feindt
The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we propose an efficient, scalable feature extraction algorithm for time series, which filters the available features at an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, can be applied to a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable, and is based on well-studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive as well as time series from a production line optimization project and simulated stochastic processes with underlying qualitative change of dynamics.
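Controlling the expected percentage of selected but irrelevant features amounts to controlling the false discovery rate over the per-feature hypothesis tests. The abstract does not name the procedure used, so the sketch below assumes one standard choice, the Benjamini-Yekutieli step-up procedure, applied to p-values that are taken as given. The function name and the example p-values are hypothetical.

```python
from typing import List


def benjamini_yekutieli(p_values: List[float], fdr_level: float = 0.05) -> List[bool]:
    """Return a keep/reject flag per feature under the Benjamini-Yekutieli procedure.

    p_values[i] is assumed to come from a per-feature significance test
    against the target; computing those tests is outside this sketch.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction factor c(m) = sum_{i=1..m} 1/i, valid under
    # arbitrary dependence between the tests.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the features, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k / (m * c(m)) * fdr_level.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / (m * c_m) * fdr_level:
            max_rank = rank
    # Keep every feature whose p-value rank is at most max_rank.
    selected = [False] * m
    for idx in order[:max_rank]:
        selected[idx] = True
    return selected


# Hypothetical p-values for four extracted features.
example_p = [0.001, 0.8, 0.01, 0.5]
print(benjamini_yekutieli(example_p, fdr_level=0.05))  # → [True, False, True, False]
```

Because each feature's p-value is computed independently of the others, the testing stage parallelizes trivially. Only this final ranking step needs all p-values at once.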