Sanjay K. Sahay

CR
26papers
614citations
Novelty32%
AI Score22

26 Papers

CRJun 6, 2022
Deep Reinforcement Learning for Cybersecurity Threat Detection and Protection: A Review

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

The cybersecurity threat landscape has lately become overly complex. Threat actors leverage weaknesses in the network and endpoint security in a very coordinated manner to perpetuate sophisticated attacks that could bring down the entire network and many critical hosts in the network. Increasingly advanced deep and machine learning-based solutions have been used in threat detection and protection. The application of these techniques has been reviewed well in the scientific literature. Deep Reinforcement Learning has shown great promise in developing AI-based solutions for areas that had earlier required advanced human cognizance. Different techniques and algorithms under deep reinforcement learning have shown great promise in applications ranging from games to industrial processes, where it is claimed to augment systems with general AI capabilities. These algorithms have recently also been used in cybersecurity, especially in threat detection and endpoint protection, where these are showing state-of-the-art results. Unlike supervised machines and deep learning, deep reinforcement learning is used in more diverse ways and is empowering many innovative applications in the threat defense landscape. However, there does not exist any comprehensive review of these unique applications and accomplishments. Therefore, in this paper, we intend to fill this gap and provide a comprehensive review of the different applications of deep reinforcement learning in cybersecurity threat detection and protection.

CROct 13, 2021
Privacy-Preserving Mutual Authentication and Key Agreement Scheme for Multi-Server Healthcare System

Trupil Limbasiya, Sanjay K. Sahay, Bharath Sridharan

The usage of different technologies and smart devices helps people to get medical services remotely for multiple benefits. Thus, critical and sensitive data is exchanged between a user and a doctor. When health data is transmitted over a common channel, it becomes essential to preserve various privacy and security properties in the system. Further, the number of users for remote services is increasing day-by-day exponentially, and thus, it is not adequate to deal with all users using the one server due to the verification overhead, server failure, and scalability issues. Thus, researchers proposed various authentication protocols for multi-server architecture, but most of them are vulnerable to different security attacks and require high computational resources during the implementation. To Tackle privacy and security issues using less computational resources, we propose a privacy-preserving mutual authentication and key agreement protocol for a multi-server healthcare system. We discuss the proposed scheme's security analysis and performance results to understand its security strengths and the computational resource requirement, respectively. Further, we do the comparison of security and performance results with recent relevant authentication protocols.

CRSep 23, 2021
LSTM Hyper-Parameter Selection for Malware Detection: Interaction Effects and Hierarchical Selection Approach

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

Long-Short-Term-Memory (LSTM) networks have shown great promise in artificial intelligence (AI) based language modeling. Recently, LSTM networks have also become popular for designing AI-based Intrusion Detection Systems (IDS). However, its applicability in IDS is studied largely in the default settings as used in language models. Whereas security applications offer distinct conditions and hence warrant careful consideration while applying such recurrent networks. Therefore, we conducted one of the most exhaustive works on LSTM hyper-parameters for IDS and experimented with approx. 150 LSTM configurations to determine its hyper-parameters relative importance, interaction effects, and optimal selection approach for designing an IDS. We conducted multiple analyses of the results of these experiments and empirically controlled for the interaction effects of different hyper-parameters covariate levels. We found that for security applications, especially for designing an IDS, neither similar relative importance as applicable to language models is valid, nor is the standard linear method for hyper-parameter selection ideal. We ascertained that the interaction effect plays a crucial role in determining the relative importance of hyper-parameters. We also discovered that after controlling for the interaction effect, the correct relative importance for LSTMs for an IDS is batch-size, followed by dropout ratio and padding. The findings are significant because when LSTM was first used for language models, the focus had mostly been on increasing the number of layers to enhance performance.

CRSep 23, 2021
ADVERSARIALuscator: An Adversarial-DRL Based Obfuscator and Metamorphic Malware SwarmGenerator

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

Advanced metamorphic malware and ransomware, by using obfuscation, could alter their internal structure with every attack. If such malware could intrude even into any of the IoT networks, then even if the original malware instance gets detected, by that time it can still infect the entire network. It is challenging to obtain training data for such evasive malware. Therefore, in this paper, we present ADVERSARIALuscator, a novel system that uses specialized Adversarial-DRL to obfuscate malware at the opcode level and create multiple metamorphic instances of the same. To the best of our knowledge, ADVERSARIALuscator is the first-ever system that adopts the Markov Decision Process-based approach to convert and find a solution to the problem of creating individual obfuscations at the opcode level. This is important as the machine language level is the least at which functionality could be preserved so as to mimic an actual attack effectively. ADVERSARIALuscator is also the first-ever system to use efficient continuous action control capable of deep reinforcement learning agents like the Proximal Policy Optimization in the area of cyber security. Experimental results indicate that ADVERSARIALuscator could raise the metamorphic probability of a corpus of malware by >0.45. Additionally, more than 33% of metamorphic instances generated by ADVERSARIALuscator were able to evade the most potent IDS. If such malware could intrude even into any of the IoT networks, then even if the original malware instance gets detected, by that time it can still infect the entire network. Hence ADVERSARIALuscator could be used to generate data representative of a swarm of very potent and coordinated AI-based metamorphic malware attacks. The so generated data and simulations could be used to bolster the defenses of an IDS against an actual AI-based metamorphic attack from advanced malware and ransomware.

CRSep 12, 2021
DRo: A data-scarce mechanism to revolutionize the performance of Deep Learning based Security Systems

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

Supervised Deep Learning requires plenty of labeled data to converge, and hence perform optimally for task-specific learning. Therefore, we propose a novel mechanism named DRo (for Deep Routing) for data-scarce domains like security. The DRo approach builds upon some of the recent developments in Deep-Clustering. In particular, it exploits the self-augmented training mechanism using synthetically generated local perturbations. DRo not only allays the challenges with sparse-labeled data but also offers many unique advantages. We also developed a system named DRoID that uses the DRo mechanism for enhancing the performance of an existing Malware Detection System that uses (low information features like the) Android implicit Intent(s) as the only features. We conduct experiments on DRoID using a popular and standardized Android malware dataset and found that the DRo mechanism could successfully reduce the false-alarms generated by the downstream classifier by 67.9%, and also simultaneously boosts its accuracy by 11.3%. This is significant not only because the gains achieved are unparalleled but also because the features used were never considered rich enough to train a classifier on; and hence no decent performance could ever be reported by any malware classification system till-date using these features in isolation. Owing to the results achieved, the DRo mechanism claims a dominant position amongst all known systems that aims to enhance the classification performance of deep learning models with sparse-labeled data.

CRFeb 28, 2021
Identification of Significant Permissions for Efficient Android Malware Detection

Hemant Rathore, Sanjay K. Sahay, Ritvik Rajvanshi et al.

Since Google unveiled Android OS for smartphones, malware are thriving with 3Vs, i.e. volume, velocity, and variety. A recent report indicates that one out of every five business/industry mobile application leaks sensitive personal data. Traditional signature/heuristic-based malware detection systems are unable to cope up with current malware challenges and thus threaten the Android ecosystem. Therefore recently researchers have started exploring machine learning and deep learning based malware detection systems. In this paper, we performed a comprehensive feature analysis to identify the significant Android permissions and propose an efficient Android malware detection system using machine learning and deep neural network. We constructed a set of $16$ permissions ($8\%$ of the total set) derived from variance threshold, auto-encoders, and principal component analysis to build a malware detection engine that consumes less train and test time without significant compromise on the model accuracy. Our experimental results show that the Android malware detection model based on the random forest classifier is most balanced and achieves the highest area under curve score of $97.7\%$, which is better than the current state-of-art systems. We also observed that deep neural networks attain comparable accuracy to the baseline results but with a massive computational penalty.

CRFeb 28, 2021
Detection of Malicious Android Applications: Classical Machine Learning vs. Deep Neural Network Integrated with Clustering

Hemant Rathore, Sanjay K. Sahay, Shivin Thukral et al.

Today anti-malware community is facing challenges due to the ever-increasing sophistication and volume of malware attacks developed by adversaries. Traditional malware detection mechanisms are not able to cope-up with next-generation malware attacks. Therefore in this paper, we propose effective and efficient Android malware detection models based on machine learning and deep learning integrated with clustering. We performed a comprehensive study of different feature reduction, classification and clustering algorithms over various performance metrics to construct the Android malware detection models. Our experimental results show that malware detection models developed using Random Forest eclipsed deep neural network and other classifiers on the majority of performance metrics. The baseline Random Forest model without any feature reduction achieved the highest AUC of 99.4%. Also, the segregating of vector space using clustering integrated with Random Forest further boosted the AUC to 99.6% in one cluster and direct detection of Android malware in another cluster, thus reducing the curse of dimensionality. Additionally, we found that feature reduction in detection models does improve the model efficiency (training and testing time) many folds without much penalty on the effectiveness of the detection model.

CRFeb 1, 2021
DRLDO: A novel DRL based De-ObfuscationSystem for Defense against Metamorphic Malware

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

In this paper, we propose a novel mechanism to normalize metamorphic and obfuscated malware down at the opcode level and hence create an advanced metamorphic malware de-obfuscation and defense system. We name this system DRLDO, for Deep Reinforcement Learning based De-Obfuscator. With the inclusion of the DRLDO as a sub-component, an existing Intrusion Detection System could be augmented with defensive capabilities against 'zero-day' attacks from obfuscated and metamorphic variants of existing malware. This gains importance, not only because there exists no system to date that uses advanced DRL to intelligently and automatically normalize obfuscation down even to the opcode level, but also because the DRLDO system does not mandate any changes to the existing IDS. The DRLDO system does not even mandate the IDS' classifier to be retrained with any new dataset containing obfuscated samples. Hence DRLDO could be easily retrofitted into any existing IDS deployment. We designed, developed, and conducted experiments on the system to evaluate the same against multiple-simultaneous attacks from obfuscations generated from malware samples from a standardized dataset that contains multiple generations of malware. Experimental results prove that DRLDO was able to successfully make the otherwise un-detectable obfuscated variants of the malware detectable by an existing pre-trained malware classifier. The detection probability was raised well above the cut-off mark to 0.6 for the classifier to detect the obfuscated malware unambiguously. Further, the de-obfuscated variants generated by DRLDO achieved a very high correlation (of 0.99) with the base malware. This observation validates that the DRLDO system is actually learning to de-obfuscate and not exploiting a trivial trick.

CRJan 27, 2021
Robust Android Malware Detection System against Adversarial Attacks using Q-Learning

Hemant Rathore, Sanjay K. Sahay, Piyush Nikam et al.

The current state-of-the-art Android malware detection systems are based on machine learning and deep learning models. Despite having superior performance, these models are susceptible to adversarial attacks. Therefore in this paper, we developed eight Android malware detection models based on machine learning and deep neural network and investigated their robustness against adversarial attacks. For this purpose, we created new variants of malware using Reinforcement Learning, which will be misclassified as benign by the existing Android malware detection models. We propose two novel attack strategies, namely single policy attack and multiple policy attack using reinforcement learning for white-box and grey-box scenario respectively. Putting ourselves in the adversary's shoes, we designed adversarial attacks on the detection models with the goal of maximizing fooling rate, while making minimum modifications to the Android application and ensuring that the app's functionality and behavior do not change. We achieved an average fooling rate of 44.21% and 53.20% across all the eight detection models with a maximum of five modifications using a single policy attack and multiple policy attack, respectively. The highest fooling rate of 86.09% with five changes was attained against the decision tree-based model using the multiple policy approach. Finally, we propose an adversarial defense strategy that reduces the average fooling rate by threefold to 15.22% against a single policy attack, thereby increasing the robustness of the detection models i.e. the proposed model can effectively detect variants (metamorphic) of malware. The experimental analysis shows that our proposed Android malware detection system using reinforcement learning is more robust against adversarial attacks.

CRDec 26, 2020
Assessment of the Relative Importance of different hyper-parameters of LSTM for an IDS

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

Recurrent deep learning language models like the LSTM are often used to provide advanced cyber-defense for high-value assets. The underlying assumption for using LSTM networks for malware-detection is that the op-code sequence of malware could be treated as a (spoken) language representation. There are differences between any spoken-language (sequence of words/sentences) and the machine-language (sequence of op-codes). In this paper, we demonstrate that due to these inherent differences, an LSTM model with its default configuration as tuned for a spoken-language, may not work well to detect malware (using its op-code sequence) unless the network's essential hyper-parameters are tuned appropriately. In the process, we also determine the relative importance of all the different hyper-parameters of an LSTM network as applied to malware detection using their op-code sequence representations. We experimented with different configurations of LSTM networks, and altered hyper-parameters like the embedding-size, number of hidden layers, number of LSTM-units in a hidden layer, pruning/padding-length of the input-vector, activation-function, and batch-size. We discovered that owing to the enhanced complexity of the malware/machine-language, the performance of an LSTM network configured for an Intrusion Detection System, is very sensitive towards the number-of-hidden-layers, input sequence-length, and the choice of the activation-function. Also, for (spoken) language-modeling, the recurrent architectures by-far outperform their non-recurrent counterparts. Therefore, we also assess how sequential DL architectures like the LSTM compare against their non-sequential counterparts like the MLP-DNN for the purpose of malware-detection.

CROct 16, 2020
DOOM: A Novel Adversarial-DRL-Based Op-Code Level Metamorphic Malware Obfuscator for the Enhancement of IDS

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

We designed and developed DOOM (Adversarial-DRL based Opcode level Obfuscator to generate Metamorphic malware), a novel system that uses adversarial deep reinforcement learning to obfuscate malware at the op-code level for the enhancement of IDS. The ultimate goal of DOOM is not to give a potent weapon in the hands of cyber-attackers, but to create defensive-mechanisms against advanced zero-day attacks. Experimental results indicate that the obfuscated malware created by DOOM could effectively mimic multiple-simultaneous zero-day attacks. To the best of our knowledge, DOOM is the first system that could generate obfuscated malware detailed to individual op-code level. DOOM is also the first-ever system to use efficient continuous action control based deep reinforcement learning in the area of malware generation and defense. Experimental results indicate that over 67% of the metamorphic malware generated by DOOM could easily evade detection from even the most potent IDS. This achievement gains significance, as with this, even IDS augment with advanced routing sub-system can be easily evaded by the malware generated by DOOM.

CROct 16, 2020
DeepIntent: ImplicitIntent based Android IDS with E2E Deep Learning architecture

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

The Intent in Android plays an important role in inter-process and intra-process communications. The implicit Intent that an application could accept are declared in its manifest and are amongst the easiest feature to extract from an apk. Implicit Intents could even be extracted online and in real-time. So far neither the feasibility of developing an Intrusion Detection System solely on implicit Intent has been explored, nor are any benchmarks available of a malware classifier that is based on implicit Intent alone. We demonstrate that despite Intent is implicit and well declared, it can provide very intuitive insights to distinguish malicious from non-malicious applications. We conducted exhaustive experiments with over 40 different end-to-end Deep Learning configurations of Auto-Encoders and Multi-Layer-Perceptron to create a benchmark for a malware classifier that works exclusively on implicit Intent. Using the results from the experiments we create an intrusion detection system using only the implicit Intents and end-to-end Deep Learning architecture. We obtained an area-under-curve statistic of 0.81, and accuracy of 77.2% along with false-positive-rate of 0.11 on Drebin dataset.

CRApr 19, 2020
Secure and Energy-Efficient Key-Agreement Protocol for Multi-Server Architecture

Trupil Limbasiya, Sanjay K. Sahay

Authentication schemes are practised globally to verify the legitimacy of users and servers for the exchange of data in different facilities. Generally, the server verifies a user to provide resources for different purposes. But due to the large network system, the authentication process has become complex and therefore, time-to-time different authentication protocols have been proposed for the multi-server architecture. However, most of the protocols are vulnerable to various security attacks and their performance is not efficient. In this paper, we propose a secure and energy-efficient remote user authentication protocol for multi-server systems. The results show that the proposed protocol is comparatively ~44% more efficient and needs ~38% less communication cost. We also demonstrate that with only two-factor authentication, the proposed protocol is more secure from the earlier related authentication schemes.

CRDec 30, 2019
Secure Communication Protocol for Smart Transportation Based on Vehicular Cloud

Trupil Limbasiya, Debasis Das, Sanjay K. Sahay

The pioneering concept of connected vehicles has transformed the way of thinking for researchers and entrepreneurs by collecting relevant data from nearby objects. However, this data is useful for a specific vehicle only. Moreover, vehicles get a high amount of data (e.g., traffic, safety, and multimedia infotainment) on the road. Thus, vehicles expect adequate storage devices for this data, but it is infeasible to have a large memory in each vehicle. Hence, the vehicular cloud computing (VCC) framework came into the picture to provide a storage facility by connecting a road-side-unit (RSU) with the vehicular cloud (VC). In this, data should be saved in an encrypted form to preserve security, but there is a challenge to search for information over encrypted data. Next, we understand that many of vehicular communication schemes are inefficient for data transmissions due to its poor performance results and vulnerable to different fundamental security attacks. Accordingly, on-device performance is critical, but data damages and secure on-time connectivity are also significant challenges in a public environment. Therefore, we propose reliable data transmission protocols for cutting-edge architecture to search data from the storage, to resist against various security attacks, and provide better performance results. Thus, the proposed data transmission protocol is useful in diverse smart city applications (business, safety, and entertainment) for the benefits of society.

CRMay 30, 2019
An Efficient Detection of Malware by Naive Bayes Classifier Using GPGPU

Sanjay K. Sahay, Mayank Chaudhari

Due to continuous increase in the number of malware (according to AV-Test institute total ~8 x 10^8 malware are already known, and every day they register ~2.5 x 10^4 malware) and files in the computational devices, it is very important to design a system which not only effectively but can also efficiently detect the new or previously unseen malware to prevent/minimize the damages. Therefore, this paper presents a novel group-wise approach for the efficient detection of malware by parallelizing the classification using the power of GPGPU and shown that by using the Naive Bayes classifier the detection speed-up can be boosted up to 200x. The investigation also shows that the classification time increases significantly with the number of features.

CRMay 30, 2019
A Survey on the Detection of Android Malicious Apps

Sanjay K. Sahay, Ashu Sharma

Android-based smart devices are exponentially growing, and due to the ubiquity of the Internet, these devices are globally connected to the different devices/networks. Its popularity, attractive features, and mobility make malware creator to put a number of malicious apps in the market to disrupt and annoy the victims. Although to identify the malicious apps, time-to-time various techniques are proposed. However, it appears that malware developers are always ahead of the anti-malware group, and the proposed techniques by the anti-malware groups are not sufficient to counter the advanced malicious apps. Therefore, to understand the various techniques proposed/used for the identification of Android malicious apps, in this paper, we present a survey conducted by us on the work done by the researchers in this field.

CRApr 21, 2019
Android Malicious Application Classification Using Clustering

Hemant Rathore, Sanjay K. Sahay, Palash Chaturvedi et al.

Android malware have been growing at an exponential pace and becomes a serious threat to mobile users. It appears that most of the anti-malware still relies on the signature-based detection system which is generally slow and often not able to detect advanced obfuscated malware. Hence time-to-time various authors have proposed different machine learning solutions to identify sophisticated malware. However, it appears that detection accuracy can be improved by using the clustering method. Therefore in this paper, we propose a novel scalable and effective clustering method to improve the detection accuracy of the malicious android application and obtained a better overall accuracy (98.34%) by random forest classifier compared to regular method, i.e., taking the data altogether to detect the malware. However, as far as true positive and true negative are concerned, by clustering method, true positive is best obtained by decision tree (97.59%) and true negative by support vector machine (99.96%) which is the almost same result obtained by the random forest true positive (97.30%) and true negative (99.38%) respectively. The reason that overall accuracy of random forest is high because the true positive of support vector machine and true negative of the decision tree is significantly less than the random forest.

CRApr 4, 2019
Malware Detection using Machine Learning and Deep Learning

Hemant Rathore, Swati Agarwal, Sanjay K. Sahay et al.

Research shows that over the last decade, malware has been growing exponentially, causing substantial financial losses to various organizations. Different anti-malware companies have been proposing solutions to defend attacks from these malware. The velocity, volume, and the complexity of malware are posing new challenges to the anti-malware community. Current state-of-the-art research shows that recently, researchers and anti-virus organizations started applying machine learning and deep learning methods for malware analysis and detection. We have used opcode frequency as a feature vector and applied unsupervised learning in addition to supervised learning for malware classification. The focus of this tutorial is to present our work on detecting malware with 1) various machine learning algorithms and 2) deep learning models. Our results show that the Random Forest outperforms Deep Neural Network with opcode frequency as a feature. Also in feature reduction, Deep Auto-Encoders are overkill for the dataset, and elementary function like Variance Threshold perform better than others. In addition to the proposed methodologies, we will also discuss the additional issues and the unique challenges in the domain, open research problems, limitations, and future directions.

CRApr 3, 2019
Group-wise classification approach to improve Android malicious apps detection accuracy

Ashu Sharma, Sanjay K. Sahay

In the fast-growing smart devices, Android is the most popular OS, and due to its attractive features, mobility, ease of use, these devices hold sensitive information such as personal data, browsing history, shopping history, financial details, etc. Therefore, any security gap in these devices means that the information stored or accessing the smart devices are at high risk of being breached by the malware. These malware are continuously growing and are also used for military espionage, disrupting the industry, power grids, etc. To detect these malware, traditional signature matching techniques are widely used. However, such strategies are not capable to detect the advanced Android malicious apps because malware developer uses several obfuscation techniques. Hence, researchers are continuously addressing the security issues in the Android based smart devices. Therefore, in this paper using Drebin benchmark malware dataset we experimentally demonstrate how to improve the detection accuracy by analyzing the apps after grouping the collected data based on the permissions and achieved 97.15% overall average accuracy. Our results outperform the accuracy obtained without grouping data (79.27%, 2017), Arp, et al. (94%, 2014), Annamalai et al. (84.29%, 2016), Bahman Rashidi et al. (82%, 2017)) and Ali Feizollah, et al. (95.5%, 2017). The analysis also shows that among the groups, Microphone group detection accuracy is least while Calendar group apps are detected with the highest accuracy, and with the highest accuracy, and for the best performance, one shall take 80-100 features.

CRMar 7, 2019
Detection of Advanced Malware by Machine Learning Techniques

Sanjay Sharma, C. Rama Krishna, Sanjay K. Sahay

In today's digital world most of the anti-malware tools are signature based which is ineffective to detect advanced unknown malware viz. metamorphic malware. In this paper, we study the frequency of opcode occurrence to detect unknown malware by using machine learning technique. For the purpose, we have used kaggle Microsoft malware classification challenge dataset. The top 20 features obtained from fisher score, information gain, gain ratio, chi-square and symmetric uncertainty feature selection methods are compared. We also studied multiple classifier available in WEKA GUI based machine learning tool and found that five of them (Random Forest, LMT, NBT, J48 Graft and REPTree) detect malware with almost 100% accuracy.

CRSep 16, 2018
Comparison of Deep Learning and the Classical Machine Learning Algorithm for the Malware Detection

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

Recently, Deep Learning has been showing promising results in various Artificial Intelligence applications like image recognition, natural language processing, language modeling, neural machine translation, etc. Although, in general, it is computationally more expensive as compared to classical machine learning techniques, their results are found to be more effective in some cases. Therefore, in this paper, we investigated and compared one of the Deep Learning Architecture called Deep Neural Network (DNN) with the classical Random Forest (RF) machine learning algorithm for the malware classification. We studied the performance of the classical RF and DNN with 2, 4 & 7 layers architectures with the four different feature sets, and found that irrespective of the features inputs, the classical RF accuracy outperforms the DNN.

CRSep 16, 2018
An investigation of a deep learning based malware detection system

Mohit Sewak, Sanjay K. Sahay, Hemant Rathore

We investigate a Deep Learning based system for malware detection. In the investigation, we experiment with different combination of Deep Learning architectures including Auto-Encoders, and Deep Neural Networks with varying layers over Malicia malware dataset on which earlier studies have obtained an accuracy of (98%) with an acceptable False Positive Rates (1.07%). But these results were done using extensive man-made custom domain features and investing corresponding feature engineering and design efforts. In our proposed approach, besides improving the previous best results (99.21% accuracy and a False Positive Rate of 0.19%) indicates that Deep Learning based systems could deliver an effective defense against malware. Since it is good in automatically extracting higher conceptual features from the data, Deep Learning based systems could provide an effective, general and scalable mechanism for detection of existing and unknown malware.

CRFeb 23, 2018
An investigation of the classifiers to detect android malicious apps

Ashu Sharma, Sanjay K. Sahay

Android devices are growing exponentially and are connected through the internet accessing billion of online websites. The popularity of these devices encourages malware developer to penetrate the market with malicious apps to annoy and disrupt the victim. Although, for the detection of malicious apps different approaches are discussed. However, proposed approaches are not suffice to detect the advanced malware to limit/prevent the damages. In this, very few approaches are based on opcode occurrence to classify the malicious apps. Therefore, this paper investigates the five classifiers using opcodes occurrence as the prominent features for the detection of malicious apps. For the analysis, we use WEKA tool and found that FT detection accuracy (79.27%) is best among the investigated classifiers. However, true positives rate i.e. malware detection rate is highest (99.91%) by RF and fluctuate least with the different number of prominent features compared to other studied classifiers. The analysis shows that overall accuracy is majorly affected by the false positives of the classifier.

CRJun 22, 2016
Improving the detection accuracy of unknown malware by partitioning the executables in groups

Ashu Sharma, Sanjay K. Sahay, Abhishek Kumar

Detection of unknown malware with high accuracy is always a challenging task. Therefore, in this paper, we study the classification of unknown malware by two methods. In the first/regular method, similar to other authors [17][16][20] approaches we select the features by taking all dataset in one group and in the second method, we select the features by partitioning the dataset in the range of file 5 KB size. We find that the second method to detect the malware with ~8.7% more accurate than the first/regular method.

CRJun 22, 2016
Grouping the executables to detect malware with high accuracy

Sanjay K. Sahay, Ashu Sharma

The metamorphic malware variants with the same malicious behavior (family), can obfuscate themselves to look different from each other. This variation in structure leads to a huge signature database for traditional signature matching techniques to detect them. In order to effective and efficient detection of malware in large amounts of executables, we need to partition these files into groups which can identify their respective families. In addition, the grouping criteria should be chosen such a way that, it can also be applied to unknown files encounter on computers for classification. This paper discusses the study of malware and benign executables in groups to detect unknown malware with high accuracy. We studied sizes of malware generated by three popular second generation malware (metamorphic malware) creator kits viz. G2, PS-MPC and NGVCK, and observed that the size variation in any two generated malware from same kit is not much. Hence, we grouped the executables on the basis of malware sizes by using Optimal k-Means Clustering algorithm and used these obtained groups to select promising features for training (Random forest, J48, LMT, FT and NBT) classifiers to detect variants of malware or unknown malware. We find that detection of malware on the basis of their respected file sizes gives accuracy up to 99.11% from the classifiers.

CRJun 22, 2016
An effective approach for classification of advanced malware with high accuracy

Ashu Sharma, Sanjay K. Sahay

Combating malware is very important for software/systems security, but to prevent the software/systems from the advanced malware, viz. metamorphic malware is a challenging task, as it changes the structure/code after each infection. Therefore in this paper, we present a novel approach to detect the advanced malware with high accuracy by analyzing the occurrence of opcodes (features) by grouping the executables. These groups are made on the basis of our earlier studies [1] that the difference between the sizes of any two malware generated by popular advanced malware kits viz. PS-MPC, G2 and NGVCK are within 5 KB. On the basis of obtained promising features, we studied the performance of thirteen classifiers using N-fold cross-validation available in machine learning tool WEKA. Among these thirteen classifiers we studied in-depth top five classifiers (Random forest, LMT, NBT, J48 and FT) and obtain more than 96.28% accuracy for the detection of unknown malware, which is better than the maximum detection accuracy (95.9%) reported by Santos et al (2013). In these top five classifiers, our approach obtained a detection accuracy of 97.95% by the Random forest.