CL May 1, 2024
WIBA: What Is Being Argued? A Comprehensive Approach to Argument Mining
Arman Irani, Ju Yeon Park, Kevin Esterling et al.
We propose WIBA, a novel framework and suite of methods that enable the comprehensive understanding of "What Is Being Argued" across contexts. Our approach develops a comprehensive framework that detects: (a) the existence, (b) the topic, and (c) the stance of an argument, correctly accounting for the logical dependence among the three tasks. Our algorithm leverages fine-tuning and prompt engineering of Large Language Models. We evaluate our approach and show that it performs well on all three capabilities. First, we develop and release an Argument Detection model that can classify a piece of text as an argument with an F1 score between 79% and 86% on three different benchmark datasets. Second, we release a language model that can identify the topic being argued in a sentence, be it implicit or explicit, with an average similarity score of 71%, outperforming current naive methods by nearly 40%. Finally, we develop a method for Argument Stance Classification and show that it achieves a classification F1 score between 71% and 78% across three diverse benchmark datasets. Our evaluation demonstrates that WIBA allows the comprehensive understanding of What Is Being Argued in large corpora across diverse contexts, which is of core interest to many applications in linguistics, communication, and social and computer science. To make these advancements accessible, we release WIBA as a free, open-access platform (wiba.dev).
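The logical dependence among the three tasks can be illustrated with a minimal sketch. It is not WIBA's API: the three callables (`detect`, `extract_topic`, `classify_stance`) are hypothetical stand-ins for the released models, and the ordering shown (topic and stance are only computed once an argument has been detected, and stance is taken relative to the extracted topic) is an assumption about how the dependence might be encoded.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ArgumentAnalysis:
    """Result of the three dependent tasks for one piece of text."""
    is_argument: bool
    topic: Optional[str] = None   # only set when is_argument is True
    stance: Optional[str] = None  # only set when is_argument is True


def analyze(
    text: str,
    detect: Callable[[str], bool],              # hypothetical argument detector
    extract_topic: Callable[[str], str],        # hypothetical topic extractor
    classify_stance: Callable[[str, str], str], # hypothetical stance classifier
) -> ArgumentAnalysis:
    # Topic and stance are undefined for non-arguments, so stop early.
    if not detect(text):
        return ArgumentAnalysis(is_argument=False)
    topic = extract_topic(text)
    # Assumption: stance is expressed toward the extracted topic.
    stance = classify_stance(text, topic)
    return ArgumentAnalysis(is_argument=True, topic=topic, stance=stance)
```

Passing the models in as callables keeps the dependency structure separate from any particular model choice, so toy functions can be substituted when experimenting.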
SE Jul 11, 2021
Repo2Vec: A Comprehensive Embedding Approach for Determining Repository Similarity
Md Omar Faruk Rokon, Pei Yan, Risul Islam et al.
How can we identify similar repositories and clusters among a large online archive, such as GitHub? Determining repository similarity is an essential building block in studying the dynamics and the evolution of such software ecosystems. The key challenge is to determine the right representation for the diverse repository features in a way that: (a) it captures all aspects of the available information, and (b) it is readily usable by ML algorithms. We propose Repo2Vec, a comprehensive embedding approach to represent a repository as a distributed vector by combining features from three types of information sources. As our key novelty, we consider three types of information: (a) metadata, (b) the structure of the repository, and (c) the source code. We also introduce a series of embedding approaches to represent and combine these information types into a single embedding. We evaluate our method with two real datasets from GitHub for a combined 1013 repositories. First, we show that our method outperforms previous methods in terms of precision (93% vs 78%), with nearly twice as many Strongly Similar repositories and 30% fewer False Positives. Second, we show how Repo2Vec provides a solid basis for: (a) distinguishing between malware and benign repositories, and (b) identifying a meaningful hierarchical clustering. For example, we achieve 98% precision and 96% recall in distinguishing malware and benign repositories. Overall, our work is a fundamental building block for enabling many repository analysis functions such as repository categorization by target platform or intention, detecting code-reuse and clones, and identifying lineage and evolution.
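As a rough illustration of combining three per-source embeddings into one repository vector, the sketch below normalizes each part and concatenates them, then compares repositories by cosine similarity. The abstract does not say how Repo2Vec combines its embeddings, so concatenation, L2 normalization, and the function names here are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine(metadata_vec: Sequence[float],
            structure_vec: Sequence[float],
            source_vec: Sequence[float]) -> List[float]:
    """Concatenate the three normalized parts (an assumed combination rule)."""
    combined: List[float] = []
    for part in (metadata_vec, structure_vec, source_vec):
        # Normalizing first keeps any one source from dominating by scale.
        combined.extend(l2_normalize(part))
    return combined


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two repositories would then be compared with `cosine_similarity(combine(...), combine(...))`; a weighted concatenation or a learned projection are equally plausible alternatives.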
LG Nov 14, 2020
Mobility Map Inference from Thermal Modeling of a Building
Risul Islam, Andrey Lokhov, Nathan Lemons et al.
We consider the problem of inferring the mobility map, which is the distribution of the building occupants at each timestamp, from the temperatures of the rooms. We also want to explore how factors such as noise in the temperature measurements and the room layout affect the reconstruction of the movement of people within the building. Our proposed algorithm tackles these challenges using a parameter learner, the modified Least Square Estimator. In the absence of a complete data set with mobility map, room and ambient temperatures, and HVAC data in the public domain, we simulate a physics-based thermal model of the rooms in a building and evaluate the performance of our inference algorithm on this simulated data. We find an upper bound on the noise standard deviation (<= 1 °F) in the input temperature data of our model. Within this bound, our algorithm can reconstruct the mobility map with a reasonable reconstruction error. Our work can be used in a wide range of applications, for example, ensuring the physical security of office buildings, elderly and infant monitoring, building resources management, emergency building evacuation, and vulnerability assessment of HVAC data. Our work brings together multiple research areas, namely Thermal Modeling and Parameter Estimation, towards the common goal of inferring the distribution of people within a large office building.
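The idea of learning thermal parameters by least squares and then inverting the model for occupancy can be sketched on a toy example. Everything below is an assumption made for illustration: a single room, a first-order model T[t+1] = T[t] + a*(T_amb - T[t]) + b*n[t], and ordinary least squares rather than the paper's modified Least Square Estimator. The names `simulate`, `fit_parameters`, and `infer_occupants` are hypothetical.

```python
from typing import List, Sequence, Tuple


def simulate(temp0: float, ambient: float, occupants: Sequence[int],
             a: float, b: float) -> List[float]:
    """Noise-free room temperatures under the assumed first-order model."""
    temps = [temp0]
    for n in occupants:
        t = temps[-1]
        temps.append(t + a * (ambient - t) + b * n)
    return temps


def fit_parameters(temps: Sequence[float], ambient: float,
                   occupants: Sequence[int]) -> Tuple[float, float]:
    """Ordinary least squares for (a, b) via the 2x2 normal equations."""
    s11 = s12 = s22 = s1y = s2y = 0.0
    for t_now, t_next, n in zip(temps, temps[1:], occupants):
        x1 = ambient - t_now   # heat exchange with the surroundings
        x2 = float(n)          # heat contributed by occupants
        y = t_next - t_now     # observed temperature change
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        s1y += x1 * y
        s2y += x2 * y
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear; parameters not identifiable")
    a = (s1y * s22 - s12 * s2y) / det
    b = (s11 * s2y - s12 * s1y) / det
    return a, b


def infer_occupants(temps: Sequence[float], ambient: float,
                    a: float, b: float) -> List[int]:
    """Invert the model for n[t] and round to the nearest whole person."""
    return [round((t_next - t_now - a * (ambient - t_now)) / b)
            for t_now, t_next in zip(temps, temps[1:])]
```

In this toy setting, adding Gaussian noise to `temps` before calling `infer_occupants` gives a quick way to see how measurement noise degrades the reconstruction.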
IR Nov 14, 2020
RecTen: A Recursive Hierarchical Low Rank Tensor Factorization Method to Discover Hierarchical Patterns in Multi-modal Data
Risul Islam, Md Omar Faruk Rokon, Evangelos E. Papalexakis et al.
How can we extend tensor decomposition to reveal a hierarchical structure in multi-modal data in a self-adaptive way? Current tensor decomposition provides only a single layer of clusters. We argue that with the abundance of multi-modal data and time-evolving networks nowadays, the ability to identify emerging hierarchies is important. To this effect, we propose RecTen, a recursive hierarchical soft clustering approach based on tensor decomposition. Our approach enables us to: (a) recursively decompose clusters identified in the previous step, and (b) identify the right conditions for terminating this process. In the absence of proper ground truth, we evaluate our approach with synthetic data and test its sensitivity to different parameters. We also apply RecTen to five real datasets which involve the activities of users in online discussion platforms, such as security forums. This analysis helps us reveal clusters of users with interesting behaviors, including but not limited to early detection of some real events like ransomware outbreaks, the emergence of a black market of decryption tools, and romance scamming. To maximize the usefulness of our approach, we develop a tool which can help data analysts and the overall research community identify hierarchical structures. RecTen is an unsupervised approach which can be used to take the pulse of large multi-modal data and let the data reveal its own hidden structures.
CR Nov 14, 2020
TenFor: A Tensor-Based Tool to Extract Interesting Events from Security Forums
Risul Islam, Md Omar Faruk Rokon, Evangelos E. Papalexakis et al.
How can we get a security forum to "tell" us its activities and events of interest? We take a unique angle: we want to identify these activities without any a priori knowledge, which is a key difference compared to most of the previous problem formulations. Despite some recent efforts, mining security forums to extract useful information has received relatively little attention, and most existing efforts search for specific information. We propose TenFor, an unsupervised tensor-based approach, to systematically identify important events in a three-dimensional space: (a) user, (b) thread, and (c) time. Our method consists of three high-level steps: (a) a tensor-based clustering across the three dimensions, (b) an extensive cluster profiling that uses both content and behavioral features, and (c) a deeper investigation, where we identify key users and threads within the events of interest. In addition, we implement our approach as a powerful and easy-to-use platform for practitioners. In our evaluation, we find that 83% of our clusters capture meaningful events and that we find more meaningful clusters than previous approaches. Our approach and our platform constitute an important step towards detecting activities of interest in a forum in an unsupervised fashion in practice.
CR Nov 14, 2020
HackerScope: The Dynamics of a Massive Hacker Online Ecosystem
Risul Islam, Md Omar Faruk Rokon, Ahmad Darki et al.
Authors of malicious software are not hiding as much as one would assume: they have a visible online footprint. Apart from online forums, this footprint appears in software development platforms, where authors create publicly accessible malware repositories to share and collaborate. With the exception of a few recent efforts, the existence and the dynamics of this community have received surprisingly limited attention. The goal of our work is to analyze this ecosystem of hackers in order to: (a) understand their collaborative patterns, and (b) identify and profile its most influential authors. We develop HackerScope, a systematic approach for analyzing the dynamics of this hacker ecosystem. Leveraging our targeted data collection, we conduct an extensive study of 7389 authors of malware repositories on GitHub, which we combine with their activity on four security forums. From a modeling point of view, we study the ecosystem using three network representations: (a) the author-author network, (b) the author-repository network, and (c) cross-platform egonets. Our analysis leads to the following key observations: (a) the ecosystem is growing at an accelerating rate, as the number of new malware authors per year triples every 2 years, (b) it is highly collaborative, more so than the rest of GitHub authors, and (c) it includes influential and professional hackers. We find that 30 authors maintain an online "brand" across GitHub and our security forums. Our study is a significant step towards using public online information for understanding the malicious hacker community.
CR May 28, 2020
SourceFinder: Finding Malware Source-Code from Publicly Available Repositories
Md Omar Faruk Rokon, Risul Islam, Ahmad Darki et al.
Where can we find malware source code? This question is motivated by a real need: there is a dearth of malware source code, which impedes various types of security research. Our work is driven by the following insight: public archives, like GitHub, have a surprising number of malware repositories. Capitalizing on this opportunity, we propose SourceFinder, a supervised-learning approach to identify repositories of malware source code efficiently. We evaluate and apply our approach using 97K repositories from GitHub. First, we show that our approach identifies malware repositories with 89% precision and 86% recall using a labeled dataset. Second, we use SourceFinder to identify 7504 malware source code repositories, which arguably constitute the largest malware source code database. Finally, we study the fundamental properties and trends of the malware repositories and their authors. The number of such repositories appears to be growing by an order of magnitude every 4 years, and 18 malware authors seem to be "professionals" with a well-established online reputation. We argue that our approach and our large repository of malware source code can be a catalyst for research studies that are currently not possible.
CL Jan 8, 2020
REST: A Thread Embedding Approach for Identifying and Classifying User-specified Information in Security Forums
Joobin Gharibshah, Evangelos E. Papalexakis, Michalis Faloutsos
How can we extract useful information from a security forum? We focus on identifying threads of interest to a security professional: (a) alerts of worrisome events, such as attacks, (b) offering of malicious services and products, (c) hacking information to perform malicious acts, and (d) useful security-related experiences. The analysis of security forums is in its infancy despite several promising recent works. Novel approaches are needed to address the challenges in this domain: (a) the difficulty in specifying the "topics" of interest efficiently, and (b) the unstructured and informal nature of the text. We propose REST, a systematic methodology to: (a) identify threads of interest based on a, possibly incomplete, bag of words, and (b) classify them into one of the four classes above. The key novelty of the work is a multi-step weighted embedding approach: we project words, threads and classes in appropriate embedding spaces and establish relevance and similarity there. We evaluate our method with real data from three security forums with a total of 164K posts and 21K threads. First, REST is robust to the initial keyword selection: it can extend the user-provided keyword set and thus recover from missing keywords. Second, REST categorizes the threads into the classes of interest with superior accuracy compared to five other methods: REST exhibits an accuracy between 63.3% and 76.9%. We see our approach as a first step for harnessing the wealth of information of online forums in a user-friendly way, since the user can loosely specify her keywords of interest.
IR Apr 13, 2018
RIPEx: Extracting malicious IP addresses from security forums using cross-forum learning
Joobin Gharibshah, Evangelos E. Papalexakis, Michalis Faloutsos
Is it possible to extract malicious IP addresses reported in security forums in an automatic way? This is the question at the heart of our work. We focus on security forums, where security professionals and hackers share knowledge and information, and often report misbehaving IP addresses. So far, there have only been a few efforts to extract information from such security forums. We propose RIPEx, a systematic approach to identify and label IP addresses in security forums by utilizing a cross-forum learning method. In more detail, the challenge is twofold: (a) distinguishing IP addresses from other numerical entities, such as software version numbers, and (b) classifying each IP address as benign or malicious. We propose an integrated solution that tackles both these problems. A novelty of our approach is that it does not require training data for each new forum. Our approach transfers knowledge across forums: we use a classifier from our source forums to identify seed information for training a classifier on the target forum. We evaluate our method using data collected from five security forums with a total of 31K users and 542K posts. First, RIPEx can distinguish IP addresses from other numeric expressions with 95% precision and above 93% recall on average. Second, RIPEx identifies malicious IP addresses with an average precision of 88% and over 78% recall, using our cross-forum learning. Our work is a first step towards harnessing the wealth of useful information that can be found in security forums.
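To make the first half of the challenge concrete, the sketch below extracts IPv4 candidates from forum text using a regular expression plus validation with Python's standard `ipaddress` module. This is only a syntactic filter and not the RIPEx method: a four-part version number such as 1.2.3.4 still passes, which is exactly the ambiguity a learned classifier would have to resolve. The function name `candidate_ips` and the pattern are assumptions made for illustration.

```python
import ipaddress
import re
from typing import List

# Four dot-separated groups of 1-3 digits, not embedded in a longer
# dotted-number run (e.g. a five-part build number).
CANDIDATE = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?!\d|\.\d)")


def candidate_ips(text: str) -> List[str]:
    """Return syntactically valid IPv4 strings found in text, in order."""
    found = []
    for match in CANDIDATE.finditer(text):
        token = match.group()
        try:
            # Rejects out-of-range octets such as 999.1.1.1.
            ipaddress.IPv4Address(token)
        except ipaddress.AddressValueError:
            continue
        found.append(token)
    return found


post = "Attack from 203.0.113.7. Update to 2.4.1 or build 10.0.19041.1234.5."
print(candidate_ips(post))  # prints ['203.0.113.7']
```

The surrounding words of each candidate (for example "attack from" versus "update to") are the kind of context a downstream classifier could use to separate addresses from version numbers and benign from malicious reports.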