LGJul 16, 2019Code
Selection Heuristics on Semantic Genetic Programming for Classification ProblemsClaudia N. Sánchez, Mario Graff
Individual's semantics have been used for guiding the learning process of Genetic Programming solving supervised learning problems. The semantics has been used to proposed novel genetic operators as well as different ways of performing parent selection. The latter is the focus of this contribution by proposing three heuristics for parent selection that replace the fitness function on the selection mechanism entirely. These heuristics complement previous work by being inspired in the characteristics of the addition, Naive Bayes, and Nearest Centroid functions and applying them only when the function is used to create an offspring. These heuristics use different similarity measures among the parents to decide which of them is more appropriate given a function. The similarity functions considered are the cosine similarity, Pearson's correlation, and agreement. We analyze these heuristics' performance against random selection, state-of-the-art selection schemes, and 18 classifiers, including auto-machine-learning techniques, on 30 classification problems with a variable number of samples, variables, and classes. The result indicated that the combination of parent selection based on agreement and random selection to replace an individual in the population produces statistically better results than the classical selection and state-of-the-art schemes, and it is competitive with state-of-the-art classifiers. Finally, the code is released as open-source software.
CLNov 29, 2018Code
EvoMSA: A Multilingual Evolutionary Approach for Sentiment AnalysisMario Graff, Sabino Miranda-Jiménez, Eric S. Tellez et al.
Sentiment analysis (SA) is a task related to understanding people's feelings in written text; the starting point would be to identify the polarity level (positive, neutral or negative) of a given text, moving on to identify emotions or whether a text is humorous or not. This task has been the subject of several research competitions in a number of languages, e.g., English, Spanish, and Arabic, among others. In this contribution, we propose an SA system, namely EvoMSA, that unifies our participating systems in various SA competitions, making it domain independent and multilingual by processing text using only language-independent techniques. EvoMSA is a classifier, based on Genetic Programming, that works by combining the output of different text classifiers and text models to produce the final prediction. We analyze EvoMSA on different SA competitions to provide a global overview of its performance, and as the results show, EvoMSA is competitive obtaining top rankings in several SA competitions. Furthermore, we performed an analysis of EvoMSA's components to measure their contribution to the performance; the idea is to facilitate a practitioner or newcomer to implement a competitive SA classifier. Finally, it is worth to mention that EvoMSA is available as open-source software.
LGMar 7, 2024
Analysis of Systems' Performance in Natural Language Processing CompetitionsSergio Nava-Muñoz, Mario Graff, Hugo Jair Escalante
Collaborative competitions have gained popularity in the scientific and technological fields. These competitions involve defining tasks, selecting evaluation scores, and devising result verification methods. In the standard scenario, participants receive a training set and are expected to provide a solution for a held-out dataset kept by organizers. An essential challenge for organizers arises when comparing algorithms' performance, assessing multiple participants, and ranking them. Statistical tools are often used for this purpose; however, traditional statistical methods often fail to capture decisive differences between systems' performance. This manuscript describes an evaluation methodology for statistically analyzing competition results and competition. The methodology is designed to be universally applicable; however, it is illustrated using eight natural language competitions as case studies involving classification and regression problems. The proposed methodology offers several advantages, including off-the-shell comparisons with correction mechanisms and the inclusion of confidence intervals. Furthermore, we introduce metrics that allow organizers to assess the difficulty of competitions. Our analysis shows the potential usefulness of our methodology for effectively evaluating competition results.
CLOct 12, 2021
Regionalized models for Spanish language variations based on TwitterEric S. Tellez, Daniela Moctezuma, Sabino Miranda et al.
Spanish is one of the most spoken languages in the globe, but not necessarily Spanish is written and spoken in the same way in different countries. Understanding local language variations can help to improve model performances on regional tasks, both understanding local structures and also improving the message's content. For instance, think about a machine learning engineer who automatizes some language classification task on a particular region or a social scientist trying to understand a regional event with echoes on social media; both can take advantage of dialect-based language models to understand what is happening with more contextual information hence more precision. This manuscript presents and describes a set of regionalized resources for the Spanish language built on four-year Twitter public messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings based on FastText, language models based on BERT, and per-region sample corpora. We also provide a broad comparison among regions covering lexical and semantical similarities; as well as examples of using regional resources on message classification tasks.
CLJun 3, 2021
A Case Study of Spanish Text Transformations for Twitter Sentiment AnalysisEric S. Tellez, Sabino Miranda-Jiménez, Mario Graff et al.
Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., its positiveness or negativeness. Recently, it has received a lot of attention given the interest in opinion mining in micro-blogging platforms. These new forms of textual expressions present new challenges to analyze text given the use of slang, orthographic and grammatical errors, among others. Along with these challenges, a practical sentiment classifier should be able to handle efficiently large workloads. The aim of this research is to identify which text transformations (lemmatization, stemming, entity removal, among others), tokenizers (e.g., words $n$-grams), and tokens weighting schemes impact the most the accuracy of a classifier (Support Vector Machine) trained on two Spanish corpus. The methodology used is to exhaustively analyze all the combinations of the text transformations and their respective parameters to find out which characteristics the best performing classifiers have in common. Furthermore, among the different text transformations studied, we introduce a novel approach based on the combination of word based $n$-grams and character based $q$-grams. The results show that this novel combination of words and characters produces a classifier that outperforms the traditional word based combination by $11.17\%$ and $5.62\%$ on the INEGI and TASS'15 dataset, respectively.
CLSep 3, 2020
A Python Library for Exploratory Data Analysis on Twitter Data based on Tokens and Aggregated Origin-Destination InformationMario Graff, Daniela Moctezuma, Sabino Miranda-Jiménez et al.
Twitter is perhaps the social media more amenable for research. It requires only a few steps to obtain information, and there are plenty of libraries that can help in this regard. Nonetheless, knowing whether a particular event is expressed on Twitter is a challenging task that requires a considerable collection of tweets. This proposal aims to facilitate, to a researcher interested, the process of mining events on Twitter by opening a collection of processed information taken from Twitter since December 2015. The events could be related to natural disasters, health issues, and people's mobility, among other studies that can be pursued with the library proposed. Different applications are presented in this contribution to illustrate the library's capabilities: an exploratory analysis of the topics discovered in tweets, a study on similarity among dialects of the Spanish language, and a mobility report on different countries. In summary, the Python library presented is applied to different domains and retrieves a plethora of information in terms of frequencies by day of words and bi-grams of words for Arabic, English, Spanish, and Russian languages. As well as mobility information related to the number of travels among locations for more than 200 countries or territories.
LGJul 14, 2019
Improving classification performance by feature space transformations and model selectionJose Ortiz-Bejar, Eric S. Tellez, Mario Graff
Improving the performance of classifiers is the realm of feature mapping, prototype selection, and kernel function transformations; these techniques aim for reducing the complexity, and also, improving the accuracy of models. In particular, our objective is to combine them to transform data's shape into another more convenient distribution; such that some simple algorithms, such as Naïve Bayes or k-Nearest Neighbors, can produce competitive classifiers. In this paper, we introduce a family of classifiers based on feature mapping and kernel functions, orchestrated by a model selection scheme that excels in performance. We provide an extensive experimental comparison of our methods with sixteen popular classifiers on more than thirty benchmarks supporting our claims. In addition to their competitive performance, our statistical tests also found that our methods are different among them, supporting our claim of a compelling family of classifiers.
DSMay 29, 2017
A scalable solution to the nearest neighbor search problem through local-search methods on neighbor graphsEric S. Tellez, Guillermo Ruiz, Edgar Chavez et al.
Near neighbor search (NNS) is a powerful abstraction for data access; however, data indexing is troublesome even for approximate indexes. For intrinsically high-dimensional data, high-quality fast searches demand either indexes with impractically large memory usage or preprocessing time. In this paper, we introduce an algorithm to solve a nearest-neighbor query $q$ by minimizing a kernel function defined by the distance from $q$ to each object in the database. The minimization is performed using metaheuristics to solve the problem rapidly; even when some methods in the literature use this strategy behind the scenes, our approach is the first one using it explicitly. We also provide two approaches to select edges in the graph's construction stage that limit memory footprint and reduce the number of free parameters simultaneously. We carry out a thorough experimental comparison with state-of-the-art indexes through synthetic and real-world datasets; we found out that our contributions achieve competitive performances regarding speed, accuracy, and memory in almost any of our benchmarks.
CLApr 6, 2017
An Automated Text Categorization Framework based on Hyperparameter OptimizationEric S. Tellez, Daniela Moctezuma, Sabino Miranda-Jímenez et al.
A great variety of text tasks such as topic or spam identification, user profiling, and sentiment analysis can be posed as a supervised learning problem and tackle using a text classifier. A text classifier consists of several subprocesses, some of them are general enough to be applied to any supervised learning problem, whereas others are specifically designed to tackle a particular task, using complex and computational expensive processes such as lemmatization, syntactic analysis, etc. Contrary to traditional approaches, we propose a minimalistic and wide system able to tackle text classification tasks independent of domain and language, namely microTC. It is composed by some easy to implement text transformations, text representations, and a supervised learning algorithm. These pieces produce a competitive classifier even in the domain of informally written text. We provide a detailed description of microTC along with an extensive experimental comparison with relevant state-of-the-art methods. mircoTC was compared on 30 different datasets. Regarding accuracy, microTC obtained the best performance in 20 datasets while achieves competitive results in the remaining 10. The compared datasets include several problems like topic and polarity classification, spam detection, user profiling and authorship attribution. Furthermore, it is important to state that our approach allows the usage of the technology even without knowledge of machine learning and natural language processing.
CLDec 15, 2016
A Simple Approach to Multilingual Polarity Classification in TwitterEric S. Tellez, Sabino Miranda Jiménez, Mario Graff et al.
Recently, sentiment analysis has received a lot of attention due to the interest in mining opinions of social media users. Sentiment analysis consists in determining the polarity of a given text, i.e., its degree of positiveness or negativeness. Traditionally, Sentiment Analysis algorithms have been tailored to a specific language given the complexity of having a number of lexical variations and errors introduced by the people generating content. In this contribution, our aim is to provide a simple to implement and easy to use multilingual framework, that can serve as a baseline for sentiment analysis contests, and as starting point to build new sentiment analysis systems. We compare our approach in eight different languages, three of them have important international contests, namely, SemEval (English), TASS (Spanish), and SENTIPOLC (Italian). Within the competitions our approach reaches from medium to high positions in the rankings; whereas in the remaining languages our approach outperforms the reported results.
NEOct 2, 2014
Term-Weighting Learning via Genetic Programming for Text ClassificationHugo Jair Escalante, Mauricio A. García-Limón, Alicia Morales-Reyes et al.
This paper describes a novel approach to learning term-weighting schemes (TWSs) in the context of text classification. In text mining a TWS determines the way in which documents will be represented in a vector space model, before applying a classifier. Whereas acceptable performance has been obtained with standard TWSs (e.g., Boolean and term-frequency schemes), the definition of TWSs has been traditionally an art. Further, it is still a difficult task to determine what is the best TWS for a particular problem and it is not clear yet, whether better schemes, than those currently available, can be generated by combining known TWS. We propose in this article a genetic program that aims at learning effective TWSs that can improve the performance of current schemes in text classification. The genetic program learns how to combine a set of basic units to give rise to discriminative TWSs. We report an extensive experimental study comprising data sets from thematic and non-thematic text classification as well as from image classification. Our study shows the validity of the proposed method; in fact, we show that TWSs learned with the genetic program outperform traditional schemes and other TWSs proposed in recent works. Further, we show that TWSs learned from a specific domain can be effectively used for other tasks.