
Abstract- In the ever-developing world of computers, the internet and cyber
activities, cyber-attacks and malware stand as a serious and ever-growing
threat to the security of cyberspace, making the detection of these attacks and
malware a great concern. Many research efforts have been made to create
intelligent cyber-attack and malware detection using machine learning and data
mining methods. Although these techniques have produced a large number of
results, many of them rely on shallow learning frameworks, which to a great
extent do not satisfy the demands of cyber-attack and malware detection.

In this paper, I propose a deep learning based technique to implement an
effective and flexible Network Intrusion Detection System (NIDS), using
Self-Taught Learning (STL), a deep learning based method, on NSL-KDD, a
benchmark dataset for network intrusion and cyber-attacks.

  Key words: cyber-attacks,
STL, malware, network security, NIDS, sparse auto-encoder, deep learning,
NSL-KDD

 

   1. Introduction

     Cyber-attacks involve the use of malware, which is malicious software
aimed at compromising the integrity, secrecy and overall functionality of a
system [8]. Malware includes viruses, Trojans, worms, backdoors, spyware, etc.
With computers, the internet and cyberspace being essential in our everyday
life, malware stands as a serious security threat. Malware poses not only an
emotional but also a financial threat. According to a recent report from
Kaspersky Lab, up to one billion dollars was stolen in roughly two years from
financial institutions worldwide due to malware attacks [7]. Therefore, the
detection of malware is of major concern to both the anti-malware industry and
researchers. To protect legitimate users from attacks, the majority of
anti-malware software products (e.g., Comodo, Symantec, Kaspersky) use the
signature-based method of detection [10], [9]. A signature is a short string of
bytes that is unique to each known malware, so that future examples of it can
be correctly classified with a small error rate [5]. However, this technique
can be easily evaded by malware attackers through techniques such as
encryption, polymorphism and obfuscation [15], [2]. Furthermore, malicious
files are being disseminated at a rate of thousands per day [6], making it
difficult for the signature-based method to be effective. In order to combat
malware attacks, intelligent malware detection techniques need to be
investigated.

The importance of a Network Intrusion Detection System (NIDS) cannot be
overstated, as such systems are important tools for network users and
administrators to detect various security breaches in and around their network.
A NIDS monitors and analyzes network traffic entering or exiting the network
devices of an organization and raises alarms when it detects suspicious
activity. Based on the method of intrusion detection, NIDSs are categorized
into two classes:

i) Signature (misuse) based NIDS (SNIDS): monitors packets on the network and
matches them against a database of signatures or rules of known malicious
threats.

 ii) Anomaly detection based NIDS (ADNIDS): monitors network traffic and
compares it against an established baseline of normal traffic. Any deviation
from the baseline alerts the administrator or user, indicating anomalous
behavior. The rate of false positives is high because not all anomalies are
intrusions. These IDSs require system administrators to distinguish real
attacks from false positives, since incoming traffic packets may deviate from
the trained pattern in several ways [3].
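
The difference between the two classes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the byte signatures, the names `signature_match` and `AnomalyDetector`, the packets-per-second baseline and the three-standard-deviation threshold are hypothetical and do not come from any real NIDS product or from the system studied in this paper.

```python
from statistics import mean, stdev
from typing import List

# Hypothetical byte signatures of "known" threats (illustrative only).
SIGNATURES = {
    "demo-worm": b"\xde\xad\xbe\xef",
    "demo-trojan": b"EVIL_PAYLOAD",
}


def signature_match(payload: bytes) -> List[str]:
    """SNIDS-style check: names of known signatures found in the payload."""
    return [name for name, sig in SIGNATURES.items() if sig in payload]


class AnomalyDetector:
    """ADNIDS-style check: flag values far from a baseline of normal traffic."""

    def __init__(self, baseline: List[float], threshold: float = 3.0):
        self.mu = mean(baseline)
        self.sigma = stdev(baseline)
        self.threshold = threshold  # allowed deviation, in standard deviations

    def is_anomalous(self, value: float) -> bool:
        return abs(value - self.mu) > self.threshold * self.sigma


# A known attack is caught by its signature; a novel one is missed.
known = signature_match(b"GET /index.html EVIL_PAYLOAD")
novel = signature_match(b"payload of a never-before-seen attack")

# Assumed baseline: packets per second observed during normal operation.
detector = AnomalyDetector([100, 102, 98, 101, 99, 103, 97])
normal_flag = detector.is_anomalous(101)  # within normal variation
attack_flag = detector.is_anomalous(500)  # large deviation, flagged
burst_flag = detector.is_anomalous(110)   # a benign burst may also be flagged
```

In this sketch the signature check returns nothing for the unseen payload, while the anomaly check flags any sufficiently large deviation, including one that might be harmless. That is the trade-off between missed novel attacks and false positives described above.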

SNIDS is most effective at detecting known attacks, showing high detection
accuracy and a low false alarm rate; however, its performance suffers when it
must detect unknown or new attacks. ADNIDS is well suited to detecting unknown
and new attacks. Although ADNIDS produces a high rate of false positives, its
theoretical capability to identify novel attacks has led to its wide acceptance
in the research community.

In order to curb the siege of cyber-attacks, intelligent intrusion detection
techniques need to be researched. Over the years, many researchers have
conducted malware detection by applying machine learning and data mining
techniques, including Decision Trees, Artificial Neural Networks (ANN), Naïve
Bayes (NB), Support Vector Machines (SVM), Random Forests (RF), Self-Organizing
Maps (SOM), etc. However, most of these methods are based on shallow learning
architectures [11], [12], [13]. Although these methods have had isolated
successes in cyber-attack and malware detection, shallow learning architectures
still do not satisfy the demands of the cyber-attack and malware detection
problem.

Based on this limitation, a new frontier in data mining and machine learning
called deep learning is beginning to gain prominence in academic and industrial
research for different applications. Deep learning architectures overcome the
learning difficulty through layer-wise pre-training, i.e. pre-training multiple
layers of feature detectors from the lowest level to the highest to create the
final classification model [14].

  In this paper, a deep learning architecture using self-taught learning (STL),
based on a sparse auto-encoder and soft-max regression, is studied on the
NSL-KDD intrusion dataset to develop an NIDS for malware detection.
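
The two-stage idea behind STL can be sketched as follows: a sparse auto-encoder first learns a feature representation from unlabeled data, and a soft-max classifier is then trained on the features of labeled data. This is only a rough illustration under stated assumptions, not the configuration studied in this paper. It assumes NumPy is available, uses synthetic random data instead of NSL-KDD, and every function name and hyperparameter (`rho`, `beta`, `lam`, learning rates, layer sizes) is a hypothetical choice.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def sparse_ae_loss_and_grads(params, X, rho=0.05, beta=3.0, lam=1e-4):
    """Reconstruction error + weight decay + KL sparsity penalty."""
    W1, b1, W2, b2 = params
    m = X.shape[0]
    H = sigmoid(X @ W1 + b1)       # hidden activations (learned features)
    X_hat = sigmoid(H @ W2 + b2)   # reconstruction of the input
    rho_hat = H.mean(axis=0)       # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    loss = (0.5 * np.sum((X_hat - X) ** 2) / m
            + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
            + beta * kl)
    # Back-propagation through the two sigmoid layers.
    d_out = (X_hat - X) * X_hat * (1 - X_hat) / m
    sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / m
    d_hid = (d_out @ W2.T + sparsity) * H * (1 - H)
    grads = (X.T @ d_hid + lam * W1, d_hid.sum(axis=0),
             H.T @ d_out + lam * W2, d_out.sum(axis=0))
    return loss, grads


def train_sparse_autoencoder(X, n_hidden, lr=0.1, steps=300, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    params = [rng.normal(0, 0.1, (d, n_hidden)), np.zeros(n_hidden),
              rng.normal(0, 0.1, (n_hidden, d)), np.zeros(d)]
    losses = []
    for _ in range(steps):
        loss, grads = sparse_ae_loss_and_grads(params, X)
        losses.append(loss)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, losses


def train_softmax(F, y, n_classes, lr=0.5, steps=300):
    W = np.zeros((F.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]       # one-hot labels
    for _ in range(steps):
        G = (softmax(F @ W + b) - Y) / F.shape[0]
        W -= lr * (F.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


# Synthetic stand-in data (NOT NSL-KDD): 8 features scaled to [0, 1].
rng = np.random.default_rng(42)
X_unlabeled = rng.random((200, 8))
X_labeled = rng.random((60, 8))
y_labeled = (X_labeled[:, 0] > 0.5).astype(int)  # toy "normal vs attack" label

# Stage 1: unsupervised feature learning on unlabeled data.
params, losses = train_sparse_autoencoder(X_unlabeled, n_hidden=4)
# Stage 2: soft-max regression on the learned features of labeled data.
features = sigmoid(X_labeled @ params[0] + params[1])
W, b = train_softmax(features, y_labeled, n_classes=2)
probs = softmax(features @ W + b)
pred = probs.argmax(axis=1)
```

The sketch keeps the two stages separate on purpose: the auto-encoder never sees labels, and the classifier only sees the encoded features, which is the property that lets STL make use of unlabeled traffic.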

This paper is organized into 5 sections. Section 2 reviews related work,
Section 3 presents an overview of self-taught learning (STL) and the NSL-KDD
intrusion dataset, Section 4 presents performance, results and comparative
analysis, and finally Section 5 concludes the work.

 

   2. Related Works

     In the cyber-attack and malware industry, the signature-based method is
widely used [9], [10]. However, most cyber attackers and malware creators can
easily bypass it using techniques such as polymorphism, encryption and
obfuscation [15].

Previous work in the literature used an Artificial Neural Network (ANN) with
improved resilient back-propagation for the design of an NIDS [16], where only
the training dataset was used, split into 70% for training and 15% each for
validation and testing. The use of unlabeled data for testing resulted in
reduced performance. A more recent work used a J48 decision tree classifier,
tested with 10-fold cross-validation on the training dataset only [17]. That
work used a reduced set of 22 features instead of the full set of 41 features.
Another related work used various popular supervised tree-based models;
performance was

Post Author: admin