Umuzi Tech Department > Projects > Data-science-specifics > Data Modelling Challenges > Network Intrusion

Network Intrusion

Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections.

Background

The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided.

Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.

The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.

A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.

This data was adjusted from the original and was taken from the 1999 KDD Cup. The data set contains 43 features per record, with 41 of the features referring to the traffic input itself and the last two are labels (whether it is a normal or attack) and Score (the severity of the traffic input itself).

Instructions

Create a classifier to predict good vs bad connections. Go through all the relevant steps of CRISP Data Mining to decide on the best model to use and to build the classifier.

The compressed data file can be found here.

Outcome

Attacks fall into four main categories:

DOS: denial-of-service, e.g. syn flood
R2L: unauthorized access from a remote machine, e.g. guessing password
U2R: unauthorized access to local superuser (root) privileges, e.g., various - “buffer overflow” attacks
probing: surveillance and other probing, e.g., port scanning

Attack types:

back
buffer_overflow
ftp_write
guess_passwd
imap
ipsweep
land
loadmodule
multihop
neptune
nmap
normal
perl
phf
pod
portsweep
rootkit
satan
smurf
spy
teardrop
warezclient
warezmaster

Features:

Table 1: Basic features of individual TCP connections.

feature name	description	type
duration	length (number of seconds) of the connection	continuous
protocol_type	type of the protocol, e.g. tcp, udp, etc.	discrete
service	network service on the destination, e.g., http, telnet, etc.	discrete
src_bytes	number of data bytes from source to destination	continuous
dst_bytes	number of data bytes from destination to source	continuous
flag	normal or error status of the connection	discrete
land	1 if connection is from/to the same host/port; 0 otherwise	discrete
wrong_fragment	number of ‘wrong” fragments	continuous
urgent	number of urgent packets	continuous

Table 2: Content features within a connection suggested by domain knowledge.

feature name	description	type
hot	number of ‘hot’ indicators	continuous
num_failed_logins	number of failed login attempts	continuous
logged_in	1 if successfully logged in; 0 otherwise	discrete
num_compromised	number of ‘compromised’ conditions	continuous
root_shell	1 if root shell is obtained; 0 otherwise	discrete
su_attempted	1 if ‘su root’ command attempted; 0 otherwise	discrete
num_root	number of ‘root’ accesses	continuous
num_file_creations	number of file creation operations	continuous
num_shells	number of shell prompts	continuous
num_access_files	number of operations on access control files	continuous
num_outbound_cmds	number of outbound commands in an ftp session	continuous
is_hot_login	1 if the login belongs to the ‘hot’ list; 0 otherwise	discrete
is_guest_login	1 if the login is a ‘guest’ login; 0 otherwise	discrete

Table 3: Traffic features computed using a two-second time window.

feature name	description	type
count	number of connections to the same host as the current connection in the past two seconds	continuous
Note: The following features refer to these same-host connections.
serror_rate	% of connections that have SYN errors	continuous
rerror_rate	% of connections that have REJ errors	continuous
same_srv_rate	% of connections to the same service	continuous
diff_srv_rate	% of connections to different services	continuous
srv_count	number of connections to the same service as the current connection in the past two seconds	continuous
Note: The following features refer to these same-service connections.
srv_serror_rate	% of connections that have SYN errors	continuous
srv_rerror_rate	% of connections that have REJ errors	continuous
srv_diff_host_rate	% of connections to different hosts	continuous