LGSep 17, 2022
AdaCC: Cumulative Cost-Sensitive Boosting for Imbalanced ClassificationVasileios Iosifidis, Symeon Papadopoulos, Bodo Rosenhahn et al.
Class imbalance poses a major challenge for machine learning as most supervised learning models might exhibit bias towards the majority class and under-perform in the minority class. Cost-sensitive learning tackles this problem by treating the classes differently, formulated typically via a user-defined fixed misclassification cost matrix provided as input to the learner. Such parameter tuning is a challenging task that requires domain knowledge and moreover, wrong adjustments might lead to overall predictive performance deterioration. In this work, we propose a novel cost-sensitive boosting approach for imbalanced data that dynamically adjusts the misclassification costs over the boosting rounds in response to model's performance instead of using a fixed misclassification cost matrix. Our method, called AdaCC, is parameter-free as it relies on the cumulative behavior of the boosting model in order to adjust the misclassification costs for the next boosting round and comes with theoretical guarantees regarding the training error. Experiments on 27 real-world datasets from different domains with high class imbalance demonstrate the superiority of our method over 12 state-of-the-art cost-sensitive boosting approaches exhibiting consistent improvements in different measures, for instance, in the range of [0.3%-28.56%] for AUC, [3.4%-21.4%] for balanced accuracy, [4.8%-45%] for gmean and [7.4%-85.5%] for recall.
LGNov 9, 2022
Discrimination and Class Imbalance Aware Online Naive BayesMaryam Badar, Marco Fisichella, Vasileios Iosifidis et al.
Fairness-aware mining of massive data streams is a growing and challenging concern in the contemporary domain of machine learning. Many stream learning algorithms are used to replace humans at critical decision-making points e.g., hiring staff, assessing credit risk, etc. This calls for handling massive incoming information with minimum response delay while ensuring fair and high quality decisions. Recent discrimination-aware learning methods are optimized based on overall accuracy. However, the overall accuracy is biased in favor of the majority class; therefore, state-of-the-art methods mainly diminish discrimination by partially or completely ignoring the minority class. In this context, we propose a novel adaptation of Naïve Bayes to mitigate discrimination embedded in the streams while maintaining high predictive performance for both the majority and minority classes. Our proposed algorithm is simple, fast, and attains multi-objective optimization goals. To handle class imbalance and concept drifts, a dynamic instance weighting module is proposed, which gives more importance to recent instances and less importance to obsolete instances based on their membership in minority or majority class. We conducted experiments on a range of streaming and static datasets and deduced that our proposed methodology outperforms existing state-of-the-art fairness-aware methods in terms of both discrimination score and balanced accuracy.
LGJan 4, 2022
Parity-based Cumulative Fairness-aware BoostingVasileios Iosifidis, Arjun Roy, Eirini Ntoutsi
Data-driven AI systems can lead to discrimination on the basis of protected attributes like gender or race. One reason for this behavior is the encoded societal biases in the training data (e.g., females are underrepresented), which is aggravated in the presence of unbalanced class distributions (e.g., "granted" is the minority class). State-of-the-art fairness-aware machine learning approaches focus on preserving the \emph{overall} classification accuracy while improving fairness. In the presence of class-imbalance, such methods may further aggravate the problem of discrimination by denying an already underrepresented group (e.g., \textit{females}) the fundamental rights of equal social privileges (e.g., equal credit opportunity). To this end, we propose AdaFair, a fairness-aware boosting ensemble that changes the data distribution at each round, taking into account not only the class errors but also the fairness-related performance of the model defined cumulatively based on the partial ensemble. Except for the in-training boosting of the group discriminated over each round, AdaFair directly tackles imbalance during the post-training phase by optimizing the number of ensemble learners for balanced error performance (BER). AdaFair can facilitate different parity-based fairness notions and mitigate effectively discriminatory outcomes. Our experiments show that our approach can achieve parity in terms of statistical parity, equal opportunity, and disparate mistreatment while maintaining good predictive performance for all classes.
LGOct 1, 2021
A survey on datasets for fairness-aware machine learningTai Le Quy, Arjun Roy, Vasileios Iosifidis et al.
As decision-making increasingly relies on Machine Learning (ML) and (big) data, the issue of fairness in data-driven Artificial Intelligence (AI) systems is receiving increasing attention from both research and industry. A large variety of fairness-aware machine learning solutions have been proposed which involve fairness-related interventions in the data, learning algorithms and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware machine learning. We focus on tabular data as the most common data representation for fairness-aware machine learning. We start our analysis by identifying relationships between the different attributes, particularly w.r.t. protected attributes and class attribute, using a Bayesian network. For a deeper understanding of bias in the datasets, we investigate the interesting relationships using exploratory analysis.
LGAug 13, 2021
Online Fairness-Aware Learning with Imbalanced Data StreamsVasileios Iosifidis, Wenbin Zhang, Eirini Ntoutsi
Data-driven learning algorithms are employed in many online applications, in which data become available over time, like network monitoring, stock price prediction, job applications, etc. The underlying data distribution might evolve over time calling for model adaptation as new instances arrive and old instances become obsolete. In such dynamic environments, the so-called data streams, fairness-aware learning cannot be considered as a one-off requirement, but rather it should comprise a continual requirement over the stream. Recent fairness-aware stream classifiers ignore the problem of class imbalance, which manifests in many real-life applications, and mitigate discrimination mainly because they "reject" minority instances at large due to their inability to effectively learn all classes. In this work, we propose \ours, an online fairness-aware approach that maintains a valid and fair classifier over the stream. \ours~is an online boosting approach that changes the training distribution in an online fashion by monitoring stream's class imbalance and tweaks its decision boundary to mitigate discriminatory outcomes over the stream. Experiments on 8 real-world and 1 synthetic datasets from different domains with varying class imbalance demonstrate the superiority of our method over state-of-the-art fairness-aware stream approaches with a range (relative) increase [11.2\%-14.2\%] in balanced accuracy, [22.6\%-31.8\%] in gmean, [42.5\%-49.6\%] in recall, [14.3\%-25.7\%] in kappa and [89.4\%-96.6\%] in statistical parity (fairness).
LGApr 27, 2021
Multi-fairness under class-imbalanceArjun Roy, Vasileios Iosifidis, Eirini Ntoutsi
Recent studies showed that datasets used in fairness-aware machine learning for multiple protected attributes (referred to as multi-discrimination hereafter) are often imbalanced. The class-imbalance problem is more severe for the often underrepresented protected group (e.g. female, non-white, etc.) in the critical minority class. Still, existing methods focus only on the overall error-discrimination trade-off, ignoring the imbalance problem, thus amplifying the prevalent bias in the minority classes. Therefore, solutions are needed to solve the combined problem of multi-discrimination and class-imbalance. To this end, we introduce a new fairness measure, Multi-Max Mistreatment (MMM), which considers both (multi-attribute) protected group and class membership of instances to measure discrimination. To solve the combined problem, we propose a boosting approach that incorporates MMM-costs in the distribution update and post-training selects the optimal trade-off among accurate, balanced, and fair solutions. The experimental results show the superiority of our approach against state-of-the-art methods in producing the best balanced performance across groups and classes and the best accuracy for the protected groups in the minority class.
CLMar 27, 2021
LSTM Based Sentiment Analysis for Cryptocurrency PredictionXin Huang, Wenbin Zhang, Xuejiao Tang et al.
Recent studies in big data analytics and natural language processing develop automatic techniques in analyzing sentiment in the social media information. In addition, the growing user base of social media and the high volume of posts also provide valuable sentiment information to predict the price fluctuation of the cryptocurrency. This research is directed to predicting the volatile price movement of cryptocurrency by analyzing the sentiment in social media and finding the correlation between them. While previous work has been developed to analyze sentiment in English social media posts, we propose a method to identify the sentiment of the Chinese social media posts from the most popular Chinese social media platform Sina-Weibo. We develop the pipeline to capture Weibo posts, describe the creation of the crypto-specific sentiment dictionary, and propose a long short-term memory (LSTM) based recurrent neural network along with the historical cryptocurrency price movement to predict the price trend for future time frames. The conducted experiments demonstrate the proposed approach outperforms the state of the art auto regressive based model by 18.5% in precision and 15.4% in recall.
AIDec 6, 2020
A Data-driven Human Responsibility Management SystemXuejiao Tang, Jiong Qiu, Ruijun Chen et al.
An ideal safe workplace is described as a place where staffs fulfill responsibilities in a well-organized order, potential hazardous events are being monitored in real-time, as well as the number of accidents and relevant damages are minimized. However, occupational-related death and injury are still increasing and have been highly attended in the last decades due to the lack of comprehensive safety management. A smart safety management system is therefore urgently needed, in which the staffs are instructed to fulfill responsibilities as well as automating risk evaluations and alerting staffs and departments when needed. In this paper, a smart system for safety management in the workplace based on responsibility big data analysis and the internet of things (IoT) are proposed. The real world implementation and assessment demonstrate that the proposed systems have superior accountability performance and improve the responsibility fulfillment through real-time supervision and self-reminder.
IVDec 6, 2020
Using Machine Learning to Automate Mammogram Images AnalysisXuejiao Tang, Liuhua Zhang, Wenbin Zhang et al.
Breast cancer is the second leading cause of cancer-related death after lung cancer in women. Early detection of breast cancer in X-ray mammography is believed to have effectively reduced the mortality rate. However, a relatively high false positive rate and a low specificity in mammography technology still exist. In this work, a computer-aided automatic mammogram analysis system is proposed to process the mammogram images and automatically discriminate them as either normal or cancerous, consisting of three consecutive image processing, feature selection, and image classification stages. In designing the system, the discrete wavelet transforms (Daubechies 2, Daubechies 4, and Biorthogonal 6.8) and the Fourier cosine transform were first used to parse the mammogram images and extract statistical features. Then, an entropy-based feature selection method was implemented to reduce the number of features. Finally, different pattern recognition methods (including the Back-propagation Network, the Linear Discriminant Analysis, and the Naive Bayes Classifier) and a voting classification scheme were employed. The performance of each classification strategy was evaluated for sensitivity, specificity, and accuracy and for general performance using the Receiver Operating Curve. Our method is validated on the dataset from the Eastern Health in Newfoundland and Labrador of Canada. The experimental results demonstrated that the proposed automatic mammogram analysis system could effectively improve the classification performances.
LGApr 5, 2020
FairNN- Conjoint Learning of Fair Representations for Fair DecisionsTongxin Hu, Vasileios Iosifidis, Wentong Liao et al.
In this paper, we propose FairNN a neural network that performs joint feature representation and classification for fairness-aware learning. Our approach optimizes a multi-objective loss function in which (a) learns a fair representation by suppressing protected attributes (b) maintains the information content by minimizing a reconstruction loss and (c) allows for solving a classification task in a fair manner by minimizing the classification error and respecting the equalized odds-based fairness regularized. Our experiments on a variety of datasets demonstrate that such a joint approach is superior to separate treatment of unfairness in representation learning or supervised learning. Additionally, our regularizers can be adaptively weighted to balance the different components of the loss function, thus allowing for a very general framework for conjoint fair representation learning and decision making.
AIFeb 3, 2020
FAE: A Fairness-Aware Ensemble FrameworkVasileios Iosifidis, Besnik Fetahu, Eirini Ntoutsi
Automated decision making based on big data and machine learning (ML) algorithms can result in discriminatory decisions against certain protected groups defined upon personal data like gender, race, sexual orientation etc. Such algorithms designed to discover patterns in big data might not only pick up any encoded societal biases in the training data, but even worse, they might reinforce such biases resulting in more severe discrimination. The majority of thus far proposed fairness-aware machine learning approaches focus solely on the pre-, in- or post-processing steps of the machine learning process, that is, input data, learning algorithms or derived models, respectively. However, the fairness problem cannot be isolated to a single step of the ML process. Rather, discrimination is often a result of complex interactions between big data and algorithms, and therefore, a more holistic approach is required. The proposed FAE (Fairness-Aware Ensemble) framework combines fairness-related interventions at both pre- and postprocessing steps of the data analysis process. In the preprocessing step, we tackle the problems of under-representation of the protected group (group imbalance) and of class-imbalance by generating balanced training samples. In the post-processing step, we tackle the problem of class overlapping by shifting the decision boundary in the direction of fairness.
LGSep 17, 2019
AdaFair: Cumulative Fairness Adaptive BoostingVasileios Iosifidis, Eirini Ntoutsi
The widespread use of ML-based decision making in domains with high societal impact such as recidivism, job hiring and loan credit has raised a lot of concerns regarding potential discrimination. In particular, in certain cases it has been observed that ML algorithms can provide different decisions based on sensitive attributes such as gender or race and therefore can lead to discrimination. Although, several fairness-aware ML approaches have been proposed, their focus has been largely on preserving the overall classification accuracy while improving fairness in predictions for both protected and non-protected groups (defined based on the sensitive attribute(s)). The overall accuracy however is not a good indicator of performance in case of class imbalance, as it is biased towards the majority class. As we will see in our experiments, many of the fairness-related datasets suffer from class imbalance and therefore, tackling fairness requires also tackling the imbalance problem. To this end, we propose AdaFair, a fairness-aware classifier based on AdaBoost that further updates the weights of the instances in each boosting round taking into account a cumulative notion of fairness based upon all current ensemble members, while explicitly tackling class-imbalance by optimizing the number of ensemble members for balanced classification error. Our experiments show that our approach can achieve parity in true positive and true negative rates for both protected and non-protected groups, while it significantly outperforms existing fairness-aware methods up to 25% in terms of balanced error.
LGJul 16, 2019
Fairness-enhancing interventions in stream classificationVasileios Iosifidis, Thi Ngoc Han Tran, Eirini Ntoutsi
The wide spread usage of automated data-driven decision support systems has raised a lot of concerns regarding accountability and fairness of the employed models in the absence of human supervision. Existing fairness-aware approaches tackle fairness as a batch learning problem and aim at learning a fair model which can then be applied to future instances of the problem. In many applications, however, the data comes sequentially and its characteristics might evolve with time. In such a setting, it is counter-intuitive to "fix" a (fair) model over the data stream as changes in the data might incur changes in the underlying model therefore, affecting its fairness. In this work, we propose fairness-enhancing interventions that modify the input data so that the outcome of any stream classifier applied to that data will be fair. Experiments on real and synthetic data show that our approach achieves good predictive performance and low discrimination scores over the course of the stream.
SIOct 24, 2018
Tracking the History and Evolution of Entities: Entity-centric Temporal Analysis of Large Social Media ArchivesPavlos Fafalios, Vasileios Iosifidis, Kostas Stefanidis et al.
How did the popularity of the Greek Prime Minister evolve in 2015? How did the predominant sentiment about him vary during that period? Were there any controversial sub-periods? What other entities were related to him during these periods? To answer these questions, one needs to analyze archived documents and data about the query entities, such as old news articles or social media archives. In particular, user-generated content posted in social networks, like Twitter and Facebook, can be seen as a comprehensive documentation of our society, and thus meaningful analysis methods over such archived data are of immense value for sociologists, historians and other interested parties who want to study the history and evolution of entities and events. To this end, in this paper we propose an entity-centric approach to analyze social media archives and we define measures that allow studying how entities were reflected in social media in different time periods and under different aspects, like popularity, attitude, controversiality, and connectedness with other entities. A case study using a large Twitter archive of four years illustrates the insights that can be gained by such an entity-centric and multi-aspect analysis.
IROct 23, 2018
TweetsKB: A Public and Large-Scale RDF Corpus of Annotated TweetsPavlos Fafalios, Vasileios Iosifidis, Eirini Ntoutsi et al.
Publicly available social media archives facilitate research in a variety of fields, such as data science, sociology or the digital humanities, where Twitter has emerged as one of the most prominent sources. However, obtaining, archiving and annotating large amounts of tweets is costly. In this paper, we describe TweetsKB, a publicly available corpus of currently more than 1.5 billion tweets, spanning almost 5 years (Jan'13-Nov'17). Metadata information about the tweets as well as extracted entities, hashtags, user mentions and sentiment information are exposed using established RDF/S vocabularies. Next to a description of the extraction and annotation process, we present use cases to illustrate scenarios for entity-centric information exploration, data integration and knowledge discovery facilitated by TweetsKB.
IROct 23, 2018
Time-Aware and Corpus-Specific Entity RelatednessNilamadhaba Mohapatra, Vasileios Iosifidis, Asif Ekbal et al.
Entity relatedness has emerged as an important feature in a plethora of applications such as information retrieval, entity recommendation and entity linking. Given an entity, for instance a person or an organization, entity relatedness measures can be exploited for generating a list of highly-related entities. However, the relation of an entity to some other entity depends on several factors, with time and context being two of the most important ones (where, in our case, context is determined by a particular corpus). For example, the entities related to the International Monetary Fund are different now compared to some years ago, while these entities also may highly differ in the context of a USA news portal compared to a Greek news portal. In this paper, we propose a simple but flexible model for entity relatedness which considers time and entity aware word embeddings by exploiting the underlying corpus. The proposed model does not require external knowledge and is language independent, which makes it widely useful in a variety of applications.