CL, Jun 2
ACAT: A Collaborative Platform for Efficient Aspect-Based Sentiment Dataset Annotation
Ana-Maria Luisa Mocanu, Ciprian-Octavian Truică, Elena-Simona Apostol
Aspect-Based Sentiment Analysis (ABSA) requires high-quality datasets to train reliable models. However, existing annotation tools treat output as flat files, leaving researchers to manually consolidate multi-annotator data, reconstruct relational structures, and compute reliability metrics through custom scripts. This paper introduces ACAT (Aspect-based sentiment analysis Collaborative Annotation Tool), a web-based platform natively supporting four ABSA workflows: (1) Aspect-Category Sentiment Analysis, (2) Clause-Level Segmentation, (3) Aspect-Term Sentiment Analysis with character-level position tracking, and (4) Aspect Sentiment Triplet Extraction with dual span offset preservation. Its core contribution is an automated Extract, Transform, Load (ETL) pipeline that aligns collaborative annotations and computes Inter-Annotator Agreement (IAA) metrics directly at export, yielding training-ready datasets. In a preliminary validation on 1,002 restaurant reviews with two annotators of differing expertise, ACAT achieves a median annotation time of 31.58 seconds and a raw IAA ranging from 0.78 to 0.86 across all tasks.
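The agreement figure reported above can be illustrated with a minimal sketch. It assumes that "raw IAA" means observed agreement, i.e., the fraction of items on which both annotators chose the same label; the abstract does not spell out the formula, and the function name and sample labels below are hypothetical.

```python
from typing import Sequence


def raw_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical aligned sentiment labels for five reviews (not from the paper).
annotator_1 = ["positive", "negative", "neutral", "positive", "negative"]
annotator_2 = ["positive", "negative", "positive", "positive", "negative"]

agreement = raw_agreement(annotator_1, annotator_2)
print(f"raw agreement: {agreement:.2f}")
```

Raw agreement does not correct for chance; a chance-corrected coefficient such as Cohen's kappa would typically be reported alongside it.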
AI, Feb 1, 2023
DANES: Deep Neural Network Ensemble Architecture for Social and Textual Context-aware Fake News Detection
Ciprian-Octavian Truică, Elena-Simona Apostol, Panagiotis Karras
The growing popularity of social media platforms has simplified the creation and distribution of news articles but also creates a conduit for spreading fake news. In consequence, the need arises for effective context-aware fake news detection mechanisms, where the contextual information can be built either from the textual content of posts or from available social data (e.g., information about the users, reactions to posts, or the social network). In this paper, we propose DANES, a Deep Neural Network Ensemble Architecture for Social and Textual Context-aware Fake News Detection. DANES comprises a Text Branch for a textual content-based context and a Social Branch for the social context. These two branches are used to create a novel Network Embedding. Preliminary ablation results on 3 real-world datasets, i.e., BuzzFace, Twitter15, and Twitter16, are promising, with an accuracy that outperforms state-of-the-art solutions when employing both social and textual content features.
SI, Jan 30, 2023
ContCommRTD: A Distributed Content-based Misinformation-aware Community Detection System for Real-Time Disaster Reporting
Elena-Simona Apostol, Ciprian-Octavian Truică, Adrian Paschke
Real-time social media data can provide useful information on evolving hazards. Alongside traditional methods of disaster detection, the integration of social media data can considerably enhance disaster management. In this paper, we investigate the problem of detecting geolocation-content communities on Twitter and propose a novel distributed system that provides near-real-time information on hazard-related events and their evolution. We show that content-based community analysis leads to better and faster dissemination of reports on hazards. Our distributed disaster reporting system analyzes the social relationships among worldwide geolocated tweets and applies topic modeling to group tweets by topic. Using the user, timestamp, geolocation, retweets, and replies of each tweet, we create a publisher-subscriber distribution model for topics. We use content similarity and the proximity of nodes to create a new model for geolocation-content-based communities. Users can subscribe to different topics in specific geographical areas or worldwide and receive real-time reports regarding these topics. As misinformation can lead to increased damage if propagated in hazard-related tweets, we propose a new deep learning model to detect fake news. The misinformed tweets are then removed from display. We also show empirically the scalability of the proposed system.
SI, Feb 23, 2023
MCWDST: a Minimum-Cost Weighted Directed Spanning Tree Algorithm for Real-Time Fake News Mitigation in Social Media
Ciprian-Octavian Truică, Elena-Simona Apostol, Radu-Cătălin Nicolescu et al.
The widespread availability of internet access and handheld devices confers to social media a power similar to the one newspapers used to have. People seek affordable information on social media and can reach it within seconds. Yet this convenience comes with dangers; any user may freely post whatever they please and the content can stay online for a long period, regardless of its truthfulness. A need to detect untruthful information, also known as fake news, arises. In this paper, we present an end-to-end solution that accurately detects fake news and immunizes network nodes that spread them in real-time. To detect fake news, we propose two new stack deep learning architectures that utilize convolutional and bidirectional LSTM layers. To mitigate the spread of fake news, we propose a real-time network-aware strategy that (1) constructs a minimum-cost weighted directed spanning tree for a detected node, and (2) immunizes nodes in that tree by scoring their harmfulness using a novel ranking function. We demonstrate the effectiveness of our solution on five real-world datasets.
CL, Apr 16, 2023
It's All in the Embedding! Fake News Detection Using Document Embeddings
Ciprian-Octavian Truică, Elena-Simona Apostol
With the current shift in the mass media landscape from journalistic rigor to social media, personalized social media is becoming the new norm. Although the digitalization progress of the media brings many advantages, it also increases the risk of spreading disinformation, misinformation, and malinformation through the use of fake news. The emergence of this harmful phenomenon has managed to polarize society and manipulate public opinion on particular topics, e.g., elections, vaccinations, etc. Such information propagated on social media can distort public perceptions and generate social unrest while lacking the rigor of traditional journalism. Natural Language Processing and Machine Learning techniques are essential for developing efficient tools that can detect fake news. Models that use the context of textual data are essential for resolving the fake news detection problem, as they manage to encode linguistic features within the vector representation of words. In this paper, we propose a new approach that uses document embeddings to build multiple models that accurately label news articles as reliable or fake. We also present a benchmark on different architectures that detect fake news using binary or multi-labeled classification. We evaluated the models on five large news corpora using accuracy, precision, and recall. We obtained better results than more complex state-of-the-art Deep Neural Network models. We observe that the most important factor for obtaining high accuracy is the document encoding, not the classification model's complexity.
AI, Jan 30, 2023
EDSA-Ensemble: an Event Detection Sentiment Analysis Ensemble Architecture
Alexandru Petrescu, Ciprian-Octavian Truică, Elena-Simona Apostol et al.
As global digitization continues to grow, technology becomes more affordable and easier to use, and social media platforms thrive, becoming the new means of spreading information and news. Communities are built around sharing and discussing current events. Within these communities, users are enabled to share their opinions about each event. Using Sentiment Analysis to understand the polarity of each message belonging to an event, as well as the entire event, can help to better understand the general and individual feelings of significant trends and the dynamics on online social networks. In this context, we propose a new ensemble architecture, EDSA-Ensemble (Event Detection Sentiment Analysis Ensemble), that uses Event Detection and Sentiment Analysis to improve the detection of the polarity for current events from Social Media. For Event Detection, we use techniques based on Information Diffusion taking into account both the time span and the topics. To detect the polarity of each event, we preprocess the text and employ several Machine and Deep Learning models to create an ensemble model. The preprocessing step includes several word representation models, i.e., raw frequency, TFIDF, Word2Vec, and Transformers. The proposed EDSA-Ensemble architecture improves the event sentiment classification over the individual Machine and Deep Learning models.
CL, Apr 16, 2023
MisRoBÆRTa: Transformers versus Misinformation
Ciprian-Octavian Truică, Elena-Simona Apostol
Misinformation is considered a threat to our democratic values and principles. The spread of such content on social media polarizes society and undermines public discourse by distorting public perceptions and generating social unrest while lacking the rigor of traditional journalism. Transformers and transfer learning proved to be state-of-the-art methods for multiple well-known natural language processing tasks. In this paper, we propose MisRoBÆRTa, a novel transformer-based deep neural ensemble architecture for misinformation detection. MisRoBÆRTa takes advantage of two transformers (BART & RoBERTa) to improve the classification performance. We also benchmarked and evaluated the performances of multiple transformers on the task of misinformation detection. For training and testing, we used a large real-world news articles dataset labeled with 10 classes, addressing two shortcomings in the current research: increasing the size of the dataset from small to large, and moving the focus of fake news detection from binary classification to multi-class classification. For this dataset, we manually verified the content of the news articles to ensure that they were correctly labeled. The experimental results show that the accuracy of transformers on the misinformation detection problem was significantly influenced by the method employed to learn the context, dataset size, and vocabulary dimension. We observe empirically that the best accuracy performance among the classification models that use only one transformer is obtained by BART, while DistilRoBERTa obtains the best accuracy in the least amount of time required for fine-tuning and training. The proposed MisRoBÆRTa outperforms the other transformer models in the task of misinformation detection. To arrive at this conclusion, we performed ample ablation and sensitivity testing with MisRoBÆRTa on two datasets.
SI, Mar 3, 2023
CONTAIN: A Community-based Algorithm for Network Immunization
Elena-Simona Apostol, Özgur Coban, Ciprian-Octavian Truică
Network immunization is an automated task in the field of network analysis that involves protecting a network (modeled as a graph) from being infected by an undesired arbitrary diffusion. In this article, we consider the spread of harmful content in social networks, and we propose CONTAIN, a novel COmmuNiTy-based Algorithm for network ImmuNization. Our solution uses the network information to (1) detect harmful content spreaders, and (2) generate partitions and rank them for immunization using the subgraphs induced by each spreader, i.e., employing CONTAIN. The experimental results obtained on real-world datasets show that CONTAIN outperforms state-of-the-art solutions, i.e., NetShield and SparseShield, by immunizing the network in fewer iterations, thus, converging significantly faster than the state-of-the-art algorithms. We also compared our solution in terms of scalability with the state-of-the-art tree-based mitigation algorithm MCWDST, as well as with NetShield and SparseShield. We can conclude that our solution outperforms MCWDST and NetShield.
CL, Apr 14, 2023
SimpLex: a lexical text simplification architecture
Ciprian-Octavian Truică, Andrei-Ionut Stan, Elena-Simona Apostol
Text simplification (TS) is the process of generating easy-to-understand sentences from a given sentence or piece of text. The aim of TS is to reduce both the lexical (which refers to vocabulary complexity and meaning) and syntactic (which refers to the sentence structure) complexity of a given text or sentence without the loss of meaning or nuance. In this paper, we present SimpLex, a novel simplification architecture for generating simplified English sentences. To generate a simplified sentence, the proposed architecture uses either word embeddings (i.e., Word2Vec) and perplexity, or sentence transformers (i.e., BERT, RoBERTa, and GPT2) and cosine similarity. The solution is incorporated into user-friendly, simple-to-use software. We evaluate our system using two metrics, i.e., SARI and Perplexity Decrease. Experimentally, we observe that the transformer models outperform the other models in terms of the SARI score. However, in terms of Perplexity, the Word-Embeddings-based models achieve the biggest decrease. Thus, the main contributions of this paper are: (1) We propose a new Word Embedding and Transformer based algorithm for text simplification; (2) We design SimpLex, a modular novel text simplification system that can provide a baseline for further research; and (3) We perform an in-depth analysis of our solution and compare our results with two state-of-the-art models, i.e., LightLS [19] and NTS-w2v [44]. We also make the code publicly available online.
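The cosine-similarity step mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the vectors are toy values rather than real sentence-transformer embeddings, and `best_candidate` and the candidate names are hypothetical, not part of the SimpLex code.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def best_candidate(original: Sequence[float],
                   candidates: Dict[str, Sequence[float]]) -> str:
    """Return the candidate whose embedding is closest to the original's."""
    return max(candidates, key=lambda name: cosine_similarity(original, candidates[name]))


# Toy embeddings (assumed values) for an original sentence and two rewrites.
original_vec = [1.0, 0.0, 1.0]
candidate_vecs = {
    "rewrite_a": [1.0, 0.0, 0.9],  # nearly parallel to the original
    "rewrite_b": [0.0, 1.0, 0.0],  # orthogonal to the original
}

chosen = best_candidate(original_vec, candidate_vecs)
print(chosen)
```

Selecting the most similar candidate favors rewrites that preserve meaning; how the architecture balances similarity against simplicity is not stated in the abstract.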
CL, Jul 29, 2023
ATESA-BÆRT: A Heterogeneous Ensemble Learning Model for Aspect-Based Sentiment Analysis
Elena-Simona Apostol, Alin-Georgian Pisică, Ciprian-Octavian Truică
The increasing volume of online reviews has made possible the development of sentiment analysis models for determining the opinion of customers regarding different products and services. Until now, sentiment analysis has proven to be an effective tool for determining the overall polarity of reviews. To improve the granularity at the aspect level for a better understanding of the service or product, the task of aspect-based sentiment analysis aims to first identify aspects and then determine the user's opinion about them. The complexity of this task lies in the fact that the same review can present multiple aspects, each with its own polarity. Current solutions have poor performance on such data. We address this problem by proposing ATESA-BÆRT, a heterogeneous ensemble learning model for Aspect-Based Sentiment Analysis. Firstly, we divide our problem into two sub-tasks, i.e., Aspect Term Extraction and Aspect Term Sentiment Analysis. Secondly, we use the argmax multi-class classification on six transformers-based learners for each sub-task. Initial experiments on two datasets prove that ATESA-BÆRT outperforms current state-of-the-art solutions while solving the many aspects problem.
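One plausible reading of the argmax step over six learners is sketched below. The combination rule is an assumption: class probabilities are summed across learners and the highest-scoring class is selected. The abstract does not state how the learners' outputs are aggregated, and the function name and probabilities are hypothetical.

```python
from typing import Sequence


def argmax_ensemble(prob_rows: Sequence[Sequence[float]]) -> int:
    """Sum per-class probabilities across learners; return the top class index."""
    if not prob_rows:
        raise ValueError("need at least one learner")
    n_classes = len(prob_rows[0])
    if any(len(row) != n_classes for row in prob_rows):
        raise ValueError("all learners must score the same classes")
    totals = [sum(row[c] for row in prob_rows) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)


CLASSES = ["negative", "neutral", "positive"]

# Hypothetical outputs of six learners for a single aspect term.
learner_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]

predicted = CLASSES[argmax_ensemble(learner_probs)]
print(predicted)
```

Summing probabilities lets confident learners outweigh uncertain ones; majority voting over each learner's own argmax would be an alternative design.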
LG, Nov 2, 2023
Efficient Machine Learning Ensemble Methods for Detecting Gravitational Wave Glitches in LIGO Time Series
Elena-Simona Apostol, Ciprian-Octavian Truică
Gravitational Wave (GW) analysis has grown in popularity as technology has advanced and the process of observing gravitational waves has become more precise. Although the sensitivity and the frequency of observation of GW signals are constantly improving, the possibility of noise in the collected GW data remains. In this paper, we propose two new Machine and Deep learning ensemble approaches (i.e., ShallowWaves and DeepWaves Ensembles) for detecting different types of noise and patterns in datasets from GW observatories. Our research also investigates various Machine and Deep Learning techniques for multi-class classification and provides a comprehensive benchmark, emphasizing the best results in terms of three commonly used performance metrics (i.e., accuracy, precision, and recall). We train and test our models on a dataset consisting of annotated time series from real-world data collected by the Advanced Laser Interferometer GW Observatory (LIGO). We empirically show that the best overall accuracy is obtained by the proposed DeepWaves Ensemble, followed closely by the ShallowWaves Ensemble.
CL, Aug 23, 2023
Semantic Change Detection for the Romanian Language
Ciprian-Octavian Truică, Victor Tudose, Elena-Simona Apostol
Automatic semantic change methods try to identify the changes that appear over time in the meaning of words by analyzing their usage in diachronic corpora. In this paper, we analyze different strategies to create static and contextual word embedding models, i.e., Word2Vec and ELMo, on real-world English and Romanian datasets. To test our pipeline and determine the performance of our models, we first evaluate both word embedding models on an English dataset (SEMEVAL-CCOHA). Afterward, we focus our experiments on a Romanian dataset, and we underline different aspects of semantic changes in this low-resource language, such as meaning acquisition and loss. The experimental results show that, depending on the corpus, the most important factors to consider are the choice of model and the distance to calculate a score for detecting semantic change.
IV, Apr 29, 2023
Advanced Medical Image Representation for Efficient Processing and Transfer in Multisite Clouds
Elena-Simona Apostol, Ciprian-Octavian Truică
An important topic in medical research is the process of improving the images obtained from medical devices. As a consequence, there is also a need to improve medical image resolution and analysis. Another issue in this field is the large amount of stored medical data [16]. Human brain databases at medical institutes, for example, can accumulate tens of Terabytes of data per year. In this paper, we propose a novel medical image format representation based on multiple data structures that improve the information maintained in the medical images. The new representation keeps additional metadata information, such as the image class or tags for the objects found in the image. We defined our own ontology to help us classify the objects found in medical images using a multilayer neural network. As we generally deal with large data sets, we used the MapReduce paradigm in the Cloud environment to speed up the image processing. To optimize the transfer between Cloud nodes and to reduce the preprocessing time, we also propose a data compression method based on deduplication. We test our solution for image representation and efficient data transfer in a multisite cloud environment. Our proposed solution optimizes the data transfer with a time improvement of 27% on average.
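The deduplication idea can be illustrated with a minimal sketch. It assumes fixed-size chunking with SHA-256 fingerprints, which is one common way to deduplicate data; the abstract does not describe the actual method, and the function names, chunk size, and payload here are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def deduplicate(data: bytes, chunk_size: int = 4) -> Tuple[Dict[str, bytes], List[str]]:
    """Split data into fixed-size chunks and store each distinct chunk once."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    store: Dict[str, bytes] = {}   # fingerprint -> chunk content
    recipe: List[str] = []         # ordered fingerprints to rebuild the data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def restore(store: Dict[str, bytes], recipe: List[str]) -> bytes:
    """Rebuild the original bytes from the chunk store and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Toy payload with a repeated block (stand-in for redundant image data).
payload = b"ABCDABCDABCDWXYZ"
store, recipe = deduplicate(payload)
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

Only the distinct chunks and the recipe need to be transferred between nodes, so the savings grow with the amount of repeated content.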
SI, Nov 9, 2024
StopHC: A Harmful Content Detection and Mitigation Architecture for Social Media Platforms
Ciprian-Octavian Truică, Ana-Teodora Constantinescu, Elena-Simona Apostol
The mental health of social media users is increasingly put at risk by harmful, hateful, and offensive content. In this paper, we propose StopHC, a harmful content detection and mitigation architecture for social media platforms. Our aim with StopHC is to create more secure online environments. Our solution contains two modules: one that employs a deep neural network architecture for harmful content detection, and one that uses a network immunization algorithm to block toxic nodes and stop the spread of harmful content. The efficacy of our solution is demonstrated by experiments conducted on two real-world datasets.
AI, Dec 2, 2024
GETAE: Graph information Enhanced deep neural NeTwork ensemble ArchitecturE for fake news detection
Ciprian-Octavian Truică, Elena-Simona Apostol, Marius Marogel et al.
In today's digital age, fake news has become a major problem that has serious consequences, ranging from social unrest to political upheaval. To address this issue, new methods for detecting and mitigating fake news are required. In this work, we propose to incorporate contextual and network-aware features into the detection process. This involves analyzing not only the content of a news article but also the context in which it was shared and the network of users who shared it, i.e., the information diffusion. Thus, we propose GETAE, Graph Information Enhanced Deep Neural NeTwork Ensemble ArchitecturE for Fake News Detection, a novel ensemble architecture that uses textual content together with the social interactions to improve fake news detection. GETAE contains two Branches: the Text Branch and the Propagation Branch. The Text Branch uses Word and Transformer Embeddings and a Deep Neural Network based on feed-forward and bidirectional Recurrent Neural Networks ([Bi]RNN) for learning novel contextual features and creating a novel Text Content Embedding. The Propagation Branch considers the information propagation within the graph network and proposes a Deep Learning architecture that employs Node Embeddings to create a novel Propagation Embedding. GETAE Ensemble combines the two novel embeddings, i.e., Text Content Embedding and Propagation Embedding, to create a novel Propagation-Enhanced Content Embedding, which is afterward used for classification. The experimental results obtained on two real-world publicly available datasets, i.e., Twitter15 and Twitter16, prove that using this approach improves fake news detection and outperforms state-of-the-art models.
CL, Jan 19, 2024
Advancements in eHealth Data Analytics through Natural Language Processing and Deep Learning
Elena-Simona Apostol, Ciprian-Octavian Truică
The healthcare environment is commonly referred to as "information-rich" but also "knowledge poor". Healthcare systems collect huge amounts of data from various sources: lab reports, medical letters, logs of medical tools or programs, medical prescriptions, etc. These massive sets of data can provide great knowledge and information that can improve the medical services, and overall the healthcare domain, such as disease prediction by analyzing the patient's symptoms or disease prevention, by facilitating the discovery of behavioral factors for diseases. Unfortunately, only a relatively small volume of the textual eHealth data is processed and interpreted, an important factor being the difficulty in efficiently performing Big Data operations. In the medical field, detecting domain-specific multi-word terms is a crucial task as they can define an entire concept with a few words. A term can be defined as a linguistic structure or a concept, and it is composed of one or more words with a specific meaning to a domain. All the terms of a domain create its terminology. This chapter offers a critical study of the current, most performant solutions for analyzing unstructured (image and textual) eHealth data. This study also provides a comparison of the current Natural Language Processing and Deep Learning techniques in the eHealth context. Finally, we examine and discuss some of the current issues, and we define a set of research directions in this area.
CL, May 24, 2023
A Distributed Automatic Domain-Specific Multi-Word Term Recognition Architecture using Spark Ecosystem
Ciprian-Octavian Truică, Neculai-Ovidiu Istrate, Elena-Simona Apostol
Automatic Term Recognition is used to extract domain-specific terms that belong to a given domain. In order to be accurate, these corpus and language-dependent methods require large volumes of textual data that need to be processed to extract candidate terms that are afterward scored according to a given metric. To improve text preprocessing and candidate terms extraction and scoring, we propose a distributed Spark-based architecture to automatically extract domain-specific terms. The main contributions are as follows: (1) propose a novel distributed automatic domain-specific multi-word term recognition architecture built on top of the Spark ecosystem; (2) perform an in-depth analysis of our architecture in terms of accuracy and scalability; (3) design an easy-to-integrate Python implementation that enables the use of Big Data processing in fields such as Computational Linguistics and Natural Language Processing. We prove empirically the feasibility of our architecture by performing experiments on two real-world datasets.