Ashish Sureka

SE
15 papers
367 citations
Novelty 17%
AI Score 20

15 Papers

SE | Apr 14, 2017 | Code
Using Source Code Metrics and Ensemble Methods for Fault Proneness Prediction

Lov Kumar, Santanu Rath, Ashish Sureka

Software fault prediction models are employed to optimize testing resource allocation by identifying fault-prone classes before the testing phases. Several researchers have validated the use of different classification techniques to develop predictive models for fault prediction. The performance of these statistical models has been shown to be influenced by the training and testing datasets. Ensemble learning methods have been widely used because they combine the capabilities of their constituent models on a dataset to achieve potentially higher performance than individual models (improved generalizability). In the study presented in this paper, three different ensemble methods have been applied to develop a model for predicting fault proneness. The efficacy and usefulness of a fault prediction model also depend on the source code metrics which are considered as the input for the model. In this paper, we propose a framework to validate the source code metrics and select the right set of metrics with the objective of improving the performance of the fault prediction model. The fault prediction models are then validated using a cost evaluation framework. We conduct a series of experiments on 45 open source project datasets. Key conclusions from our experiments are: (1) Majority Voting Ensemble (MVE) methods outperformed other methods; (2) the set of source code metrics selected using the suggested source code metrics validation framework, when used as the input, achieves better results compared to all other metrics; (3) the fault prediction method is effective for software projects with a percentage of faulty classes lower than the threshold value (low - 54.82%, medium - 41.04%, high - 28.10%).
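
The majority voting idea named in the abstract above can be illustrated with a minimal sketch. It assumes three hypothetical base classifiers whose binary predictions (1 = fault-prone, 0 = not fault-prone) are already available; the function name `majority_vote` and the sample predictions are illustrative assumptions, not taken from the paper, and the paper's actual ensemble construction may differ.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label list.

    predictions: list of equal-length lists, one list per base classifier.
    Returns the most frequent label at each position.
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(p[i] for p in predictions)
        # most_common(1) returns [(label, count)] for the top label
        combined.append(votes.most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base classifiers on four classes
clf_a = [1, 0, 1, 0]
clf_b = [1, 1, 0, 0]
clf_c = [0, 1, 1, 0]

ensemble = majority_vote([clf_a, clf_b, clf_c])
print(ensemble)
```

Using an odd number of binary classifiers, as here, avoids tied votes; with an even number a tie-breaking rule would have to be chosen.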

SE | Dec 31, 2016 | Code
Parichayana: An Eclipse Plugin for Detecting Exception Handling Anti-Patterns and Code Smells in Java Programs

Ashish Sureka

Anti-patterns and code-smells are signs in the source code which are not defects (they do not prevent the program from functioning and do not cause compile errors) but are rather indicators of deeper and bigger problems. Exception handling is a programming construct designed to handle the occurrence of anomalous or exceptional conditions (which change the normal flow of program execution). In this paper, we present an Eclipse plug-in (called Parichayana) for detecting exception handling anti-patterns and code smells in Java programs. Parichayana is capable of automatically detecting several commonly occurring exception handling programming mistakes. We extend the Eclipse IDE and create new menu entries and associated actions via the Parichayana plug-in (free and open-source, hosted on GitHub). We compare and contrast Parichayana with several code smell detection tools and demonstrate that our tool provides unique capabilities relative to existing tools. We have created an update site, and developers can use the Eclipse update manager to install Parichayana from our site. We used Parichayana on several large open-source Java-based projects and detected the presence of exception handling anti-patterns.

SE | Nov 22, 2015 | Code
Anvaya: An Algorithm and Case-Study on Improving the Goodness of Software Process Models generated by Mining Event-Log Data in Issue Tracking System

Prerna Juneja, Divya Kundra, Ashish Sureka

Issue Tracking Systems (ITS) such as Bugzilla can be viewed as Process Aware Information Systems (PAIS) generating event-logs during the life-cycle of a bug report. Process Mining consists of mining event logs generated from PAIS for process model discovery, conformance and enhancement. We apply process map discovery techniques to mine event trace data generated from the ITS of the open source Firefox browser project to generate and study process models. The bug life-cycle exhibits diversity and variance. Therefore, the process models generated from the event-logs are spaghetti-like, with a large number of edges, inter-connections and nodes. Such models are complex to analyse and difficult for a process analyst to comprehend. We improve the Goodness (fitness and structural complexity) of the process models by splitting the event-log into homogeneous subsets by clustering structurally similar traces. We adapt the K-Medoid clustering algorithm with two different distance metrics: Longest Common Subsequence (LCS) and Dynamic Time Warping (DTW). We evaluate the goodness of the process models generated from the clusters using complexity and fitness metrics. We study back-forth & self-loops, bug reopening, and bottlenecks in the clusters obtained and show that clustering enables better analysis. We also propose an algorithm to automate the clustering process: the algorithm takes as input the event log and returns the best cluster set.
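
The trace-clustering step described in the abstract above can be sketched as follows. This is an illustrative sketch only: the 0/1 mismatch cost inside DTW, the "first k traces" initialisation, the function names `dtw_distance` and `k_medoids`, and the sample bug-status traces are all assumptions made here, not details confirmed by the paper.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two event traces.

    Assumption: the local cost is 0 for identical events and 1 otherwise.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def k_medoids(traces, k, dist, max_iter=20):
    """Simple k-medoids: returns {medoid index: [member indices]}."""
    medoids = list(range(k))  # assumption: deterministic initialisation
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every trace to its nearest medoid
        clusters = {m: [] for m in medoids}
        for i, trace in enumerate(traces):
            nearest = min(medoids, key=lambda m: dist(trace, traces[m]))
            clusters[nearest].append(i)
        # Update step: pick the member with the lowest total distance
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep a medoid that attracted no traces
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(dist(traces[c], traces[o]) for o in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return clusters


# Hypothetical bug life-cycle traces (status sequences)
traces = [
    ["NEW", "ASSIGNED", "RESOLVED", "VERIFIED"],
    ["NEW", "RESOLVED", "REOPENED", "RESOLVED", "REOPENED", "RESOLVED"],
    ["NEW", "ASSIGNED", "RESOLVED"],
    ["NEW", "RESOLVED", "REOPENED", "RESOLVED"],
]
clusters = k_medoids(traces, k=2, dist=dtw_distance)
print(clusters)
```

Because the distance function is passed in as a parameter, an LCS-based distance could be substituted for DTW without changing the clustering loop.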

SE | Jun 4, 2015 | Code
Survey Results on Threats To External Validity, Generalizability Concerns, Data Sharing and University-Industry Collaboration in Mining Software Repository (MSR) Research

Ashish Sureka, Ambika Tripathi, Savita Dabral

Mining Software Repositories (MSR) is an applied and practice-oriented field aimed at solving real problems encountered by practitioners and bringing value to Industry. Replication of results and findings, generalizability and external validity, University-Industry collaboration, data sharing and the creation of dataset repositories are important issues in MSR research. Research consisting of bibliometric analysis of MSR papers shows a lack of University-Industry collaboration, a deficiency of studies on closed or proprietary source datasets and a lack of data as well as tool sharing by researchers. We conduct a survey of authors from the past three years of the MSR conference (2012, 2013 and 2014) to collect data on their views and suggestions to address the stated concerns. We asked more than 100 authors 20 questions and received responses from 39 authors. Our results show that about one-third of the respondents always make their dataset publicly available and about one-third believe that data sharing should be a mandatory condition for publication in MSR conferences. Our survey reveals that more than 50% of authors used solely open-source software (OSS) datasets for their research. More than 50% of the respondents mentioned that difficulty in sharing Industrial datasets outside the company is one of the major impediments to University-Industry collaboration.

SE | Dec 21, 2017
A Comparative Study of Different Source Code Metrics and Machine Learning Algorithms for Predicting Change Proneness of Object Oriented Systems

Lov Kumar, Ashish Sureka

Change-prone classes or modules are defined as software components in the source code which are likely to change in the future. Change-proneness prediction is useful to the maintenance team as they can optimize and focus their testing resources on the modules which have a higher likelihood of change. A change-proneness prediction model can be built by using source code metrics as predictors or features within a machine learning classification framework. In this paper, twenty-one source code metrics are computed to develop a statistical model for predicting change-prone modules. Since the performance of the change-proneness model depends on the source code metrics, they are used as independent variables or predictors for the change-proneness model. Eleven different feature selection techniques (including the usage of all the 21 proposed source code metrics described in the paper) are used to remove irrelevant features and select the best set of features. The effectiveness of the set of source code metrics is evaluated using eighteen different classification techniques and three ensemble techniques. Experimental results demonstrate that the model based on the selected set of source code metrics after applying feature selection techniques achieves better results than the model using all source code metrics as predictors. Our experimental results reveal that the predictive model developed using LSSVM-RBF yields better results than other classification techniques.

IR | Jan 18, 2017
Investigating the Application of Common-Sense Knowledge-Base for Identifying Term Obfuscation in Adversarial Communication

Swati Agarwal, Ashish Sureka

Word obfuscation or substitution means replacing one word with another word in a sentence to conceal the textual content or communication. Word obfuscation is used in adversarial communication by terrorists or criminals for conveying their messages without getting red-flagged by security and intelligence agencies intercepting or scanning messages (such as emails and telephone conversations). ConceptNet is a freely available semantic network represented as a directed graph consisting of nodes as concepts and edges as assertions of common sense about these concepts. We present a solution approach exploiting the vast amount of semantic knowledge in ConceptNet for addressing the technically challenging problem of word substitution in adversarial communication. We frame the given problem as a textual reasoning and context inference task and utilize ConceptNet's natural-language-processing tool-kit for determining word substitution. We use ConceptNet to compute the conceptual similarity between any two given terms and define a Mean Average Conceptual Similarity (MACS) metric to identify out-of-context terms. The test-bed to evaluate our proposed approach consists of the Enron email dataset (having over 600,000 emails generated by 158 employees of Enron Corporation) and the Brown corpus (totaling about a million words drawn from a wide variety of sources). We implement word substitution techniques used by previous researchers to generate a test dataset. We conduct a series of experiments consisting of word substitution methods used in the past to evaluate our approach. Experimental results reveal that the proposed approach is effective.
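
As a rough illustration of how a similarity-based metric could flag an out-of-context term, the sketch below uses a leave-one-out scheme: each term is removed in turn, and the term whose removal leaves the most mutually similar set is flagged. This is only one plausible reading of the idea; the abstract does not give the exact MACS definition. The similarity values come from a hand-made toy table, not from ConceptNet, and the names `toy_similarity`, `mean_pairwise_similarity` and `out_of_context_term` are hypothetical.

```python
from itertools import combinations

# Toy, hand-made similarity scores (assumption: NOT ConceptNet output)
TOY_SIMILARITY = {
    ("tree", "garden"): 0.8,
    ("tree", "flower"): 0.7,
    ("garden", "flower"): 0.9,
    ("tree", "pistol"): 0.1,
    ("garden", "pistol"): 0.1,
    ("flower", "pistol"): 0.05,
}


def toy_similarity(a, b):
    """Symmetric lookup in the toy table; unknown pairs score 0.0."""
    return TOY_SIMILARITY.get((a, b), TOY_SIMILARITY.get((b, a), 0.0))


def mean_pairwise_similarity(terms, sim):
    """Average similarity over all unordered pairs of terms."""
    pairs = list(combinations(terms, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def out_of_context_term(terms, sim):
    """Return the term whose removal maximises the remaining coherence.

    Requires at least three terms so that a pair remains after removal.
    """
    scores = {
        t: mean_pairwise_similarity([o for o in terms if o != t], sim)
        for t in terms
    }
    return max(scores, key=scores.get)


terms = ["tree", "garden", "flower", "pistol"]
suspect = out_of_context_term(terms, toy_similarity)
print(suspect)
```

In a real setting the `sim` argument would be backed by a semantic resource such as ConceptNet rather than a fixed table.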

IR | Jan 18, 2017
Characterizing Linguistic Attributes for Automatic Classification of Intent Based Racist/Radicalized Posts on Tumblr Micro-Blogging Website

Swati Agarwal, Ashish Sureka

Research shows that many like-minded people use popular microblogging websites for posting hateful speech against various religions and races. Automatic identification of racist and hate-promoting posts is required for building social media intelligence and security informatics based solutions. However, keyword spotting techniques alone cannot accurately identify the intent of a post. In this paper, we address the challenge of ambiguity in such posts by identifying the intent of the author. We conduct our study on the Tumblr microblogging website and develop a cascaded ensemble learning classifier for identifying posts having racist or radicalized intent. We train our model by identifying various semantic, sentiment and linguistic features from free-form text. Our experimental results show that the proposed approach is effective and that the emotional tone, social tendencies, language cues and personality traits of a narrative are discriminatory features for identifying the racist intent behind a post.

SE | Oct 30, 2016
A Bibliometric Study of Asia Pacific Software Engineering Conference from 2010 to 2015

Lov Kumar, Saikrishna Sripada, Ashish Sureka

The Asia-Pacific Software Engineering Conference (APSEC) is a reputed and long-running conference which has successfully completed more than two decades as of year 2015. We conduct a bibliometric and scientific publication mining based study of how the conference has evolved over the past six years (2010 to 2015). Our objective is to perform an in-depth examination of the state of APSEC so that the APSEC community can identify strengths, areas for improvement and future directions for the conference. Our empirical analysis is based on various perspectives such as: paper submission acceptance rate trends, conference location, scholarly productivity and contributions from various countries, analysis of keynotes, workshops, conference organizers and sponsors, tutorials, identification of prolific authors, computation of citation impact of papers and contributing authors, internal and external collaboration, university and industry participation and collaboration, measurement of gender imbalance, topical analysis, yearly author churn and program committee characteristics.

SE | Sep 20, 2016
Thirteen Years of Mining Software Repositories (MSR) Conference - What is the Bibliography Data Telling Us?

Lov Kumar, Ashish Sureka

The Mining Software Repositories (MSR) conference is a reputed, long-running flagship conference in the area of Software Analytics which has successfully completed more than one decade as of year 2016. We conduct a bibliometric and scientific publication mining based study to examine how the conference has evolved over the past 13 years (from 2004 to 2007 as a workshop and then from 2008 to 2016 as a conference). Our objective is to perform an examination of the state of MSR so that the MSR community can identify strengths, areas for improvement and future directions for the conference.

SE | Jul 5, 2015
Kernel Based Sequential Data Anomaly Detection in Business Process Event Logs

Ashish Sureka

Business Process Management Systems (BPMS) log events and traces of activities during the execution of a process. Anomalies are defined as deviations or departures from the normal or common order. Anomaly detection in business process logs has several applications such as fraud detection and understanding the causes of process errors. In this paper, we present a novel approach for anomaly detection in business process logs. We model the event logs as sequential data and apply kernel based anomaly detection techniques to identify outliers and discordant observations. Our technique is unsupervised (it does not require a pre-annotated training dataset) and employs a kNN (k-nearest neighbor) kernel based technique with a normalized longest common subsequence (LCS) similarity measure. We conduct experiments on a recent, large, real-world incident management dataset of an enterprise and demonstrate that our approach is effective.
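
The combination of kNN and normalized LCS similarity mentioned in the abstract above can be sketched roughly as below. The normalization by the geometric mean of the two trace lengths, the anomaly score defined as one minus the similarity to the k-th nearest neighbor, the function names and the sample incident traces are all assumptions for illustration; the abstract does not specify these details.

```python
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def nlcs_similarity(a, b):
    """Assumed normalization: |LCS| / sqrt(|a| * |b|), a value in [0, 1]."""
    return lcs_length(a, b) / sqrt(len(a) * len(b))


def knn_anomaly_scores(traces, k):
    """Score each trace as 1 - similarity to its k-th most similar trace."""
    scores = []
    for i, trace in enumerate(traces):
        sims = sorted(
            (nlcs_similarity(trace, other)
             for j, other in enumerate(traces) if j != i),
            reverse=True,
        )
        scores.append(1.0 - sims[k - 1])
    return scores


# Hypothetical incident-management traces; the last one is unusual
traces = [
    ["open", "assign", "resolve", "close"],
    ["open", "assign", "resolve", "close"],
    ["open", "assign", "work", "resolve", "close"],
    ["open", "escalate", "reopen", "escalate", "reopen"],
]
scores = knn_anomaly_scores(traces, k=2)
most_anomalous = scores.index(max(scores))
print(most_anomalous, [round(s, 3) for s in scores])
```

No labels are needed at any point, which matches the unsupervised setting the abstract describes; choosing k and a score threshold would still have to be done empirically.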

SE | Jul 4, 2015
Intention-Oriented Process Model Discovery from Incident Management Event Logs

Ashish Sureka

Intention-oriented process mining is based on the belief that the fundamental nature of processes is mostly intentional (unlike activity-oriented processes) and aims at discovering strategy and intentional process models from event-logs recorded during process enactment. In this paper, we present an application of intention-oriented process mining to the domain of incident management of an Information Technology Infrastructure Library (ITIL) process. We apply the Map Miner Method (MMM) on a large real-world dataset for discovering hidden and unobservable user behavior, strategies and intentions. We first discover user strategies from the given activity sequence data by applying a Hidden Markov Model (HMM) based unsupervised learning technique. We then process the emission and transition matrices of the discovered HMM to generate a coarse-grained Map Process Model. We present the first application or study of the new and emerging field of intention-oriented process mining on an incident management event-log dataset and discuss its applicability, effectiveness and challenges.

IR | Jan 2, 2014
Chaff from the Wheat : Characterization and Modeling of Deleted Questions on Stack Overflow

Denzil Correa, Ashish Sureka

Stack Overflow is the most popular community question answering (CQA) website for programmers on the web, with 2.05M users, 5.1M questions and 9.4M answers. Stack Overflow has explicit, detailed guidelines on how to post questions and an ebullient moderation community. Despite these precise communications and safeguards, questions posted on Stack Overflow can be extremely off topic or very poor in quality. Such questions can be deleted from Stack Overflow at the discretion of experienced community members and moderators. We present the first study of deleted questions on Stack Overflow. We divide our study into two parts: (i) characterization of deleted questions over approx. 5 years (2008-2013) of data, and (ii) prediction of deletion at the time of question creation. Our characterization study reveals multiple insights on the question deletion phenomenon. We observe a significant increase in the number of deleted questions over time. We find that it takes substantial time to vote a question to be deleted but, once voted, the community takes swift action. We also see that question authors delete their questions to salvage reputation points. We notice some instances of accidental deletion of good quality questions, but such questions are quickly voted to be undeleted. We discover a pyramidal structure of question quality on Stack Overflow and find that deleted questions lie at the bottom (lowest quality) of the pyramid. We also build a predictive model to detect the deletion of a question at creation time. We experiment with 47 features based on User Profile, Community Generated, Question Content and Syntactic Style and report an accuracy of 66%. Our feature analysis reveals that all four categories of features are important for the prediction task. Our findings reveal important suggestions for content quality maintenance on community based question answering websites.

CY | Sep 2, 2013
A Case-Study on Teaching Undergraduate-Level Software Engineering Course Using Inverted-Classroom, Large-Group, Real-Client and Studio-Based Instruction Model

Ashish Sureka, Monika Gupta, Dipto Sarkar et al.

We present a case-study on teaching an undergraduate-level course on Software Engineering (second year and fifth semester of a bachelor's program in Computer Science) at a State University (New Delhi, India) using a novel teaching instruction model. Our approach has four main elements: inverted or flipped classroom, studio-based learning, real-client projects and deployment, and large team and peer evaluation. We present our motivation and approach, challenges encountered, pedagogical benefits, findings (both positive and negative) and recommendations. Our motivation was to teach Software Engineering using active learning (significantly increasing the engagement and collaboration with the Instructor and other students in the class), team-work, a balance between theory and practice, imparting both technical and managerial skills encountered in the real world, and problem-based learning (through an intensive semester-long project). We conduct a detailed survey (anonymous, optional and online) and present the results of student responses. Survey results reveal that for nearly every student (class size: 89) the instruction model was new and interesting and had a positive impact on motivation, in addition to meeting the learning outcome of the course.

SI | Jul 27, 2013
Fit or Unfit : Analysis and Prediction of 'Closed Questions' on Stack Overflow

Denzil Correa, Ashish Sureka

Stack Overflow is widely regarded as the most popular Community driven Question Answering (CQA) website for programmers. Questions posted on Stack Overflow which are not related to programming topics are marked as 'closed' by experienced users and community moderators. A question can be 'closed' for five reasons: duplicate, off-topic, subjective, not a real question and too localized. In this work, we present the first study of 'closed' questions on Stack Overflow. We download 4 years of publicly available data which contains 3.4 Million questions. We first analyze and characterize the complete set of 0.1 Million 'closed' questions. Next, we use a machine learning framework and build a predictive model to identify a 'closed' question at the time of question creation. One of our key findings is that despite being marked as 'closed', subjective questions contain high information value and are very popular with the users. We observe an increasing trend in the percentage of closed questions over time and find that this increase is positively correlated to the number of newly registered users. In addition, we also see a decrease in community participation to mark a 'closed' question, which has led to an increase in moderation job time. We also find that questions closed with the Duplicate and Off Topic labels are relatively more prone to reputation gaming. For the 'closed' question prediction task, we make use of multiple genres of feature sets based on user profile, community process, textual style and question content. We use a state-of-the-art machine learning classifier based on an ensemble learning technique and achieve an overall accuracy of 73%. To the best of our knowledge, this is the first experimental study to analyze and predict 'closed' questions on Stack Overflow.

IR | Jan 21, 2013
Solutions to Detect and Analyze Online Radicalization : A Survey

Denzil Correa, Ashish Sureka

Online Radicalization (also called Cyber-Terrorism or Extremism or Cyber-Racism or Cyber-Hate) is widespread and has become a major and growing concern to society, governments and law enforcement agencies around the world. Research shows that various platforms on the Internet (which have a low barrier to publishing content, allow anonymity, and provide exposure to millions of users and the potential for very quick and widespread diffusion of messages) such as YouTube (a popular video sharing website), Twitter (an online micro-blogging service), Facebook (a popular social networking website), online discussion forums and the blogosphere are being misused for malicious intent. Such platforms are being used to form hate groups and racist communities, spread extremist agendas, incite anger or violence, promote radicalization, recruit members and create virtual organizations and communities. Automatic detection of online radicalization is a technically challenging problem because of the vast amount of data, unstructured and noisy user-generated content, dynamically changing content and adversary behavior. There are several solutions proposed in the literature aiming to combat and counter cyber-hate and cyber-extremism. In this survey, we review solutions to detect and analyze online radicalization. We review 40 papers published at 12 venues from June 2003 to November 2011. We present a novel classification scheme to classify these papers. We analyze these techniques, perform trend analysis, discuss limitations of existing techniques and identify research gaps.