Real-time intrusion detection is nearly impossible due to the huge amount of data flowing on the Internet. Feature selection can reduce computation and model complexity. Research on feature selection started in the early 1960s. Feature selection is the technique of selecting a subset of relevant features, by removing the most irrelevant and redundant features from the data, in order to build robust learning models.

Process of Feature Selection

A typical feature selection method involves four basic steps, shown in Figure 1: a generation procedure to generate the next candidate subset; an evaluation function to evaluate the subset under examination; a stopping criterion to decide when to stop; and a validation procedure to check whether the subset is valid. Figure 1 illustrates this process of determining and validating a best feature subset.

Figure 1 : Feature selection process with validation.
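The four steps above can be sketched as a generic greedy search loop. This is only an illustrative skeleton: the evaluation function, the feature set and the toy scoring rule below are invented placeholders, not taken from any of the surveyed methods.

```python
def feature_selection(features, evaluate, max_size):
    """Generic four-step loop: generate candidate subsets, evaluate them,
    stop when no candidate improves, then hand back for validation."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_size:                    # stopping criterion (size cap)
        # Generation: grow the current subset by one unused feature.
        candidates = [selected + [f] for f in features if f not in selected]
        # Evaluation: score every candidate subset.
        scored = [(evaluate(c), c) for c in candidates]
        score, best = max(scored)
        if score <= best_score:                        # stopping criterion (no gain)
            break
        best_score, selected = score, best
    return selected, best_score                        # caller validates on held-out data

# Toy evaluation: pretend features 3 and 5 are the only informative ones,
# with a small penalty per extra feature.
toy_eval = lambda subset: len(set(subset) & {3, 5}) - 0.01 * len(subset)
subset, score = feature_selection(range(1, 7), toy_eval, max_size=4)
```

The search stops as soon as adding any feature no longer improves the score, which is why the loop terminates with only the two informative features selected.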


Blum and Langley divide feature selection methods into three categories: filter, wrapper and hybrid (embedded). All three are currently used in intrusion detection. The filter method selects feature subsets based on the general characteristics of the data and is independent of any classification algorithm. The wrapper method, in contrast, "wraps around" the learning algorithm: it uses an external, predetermined classifier [18] to evaluate features or feature subsets. A wrapper algorithm uses a search algorithm to explore the space of possible feature subsets and evaluates each one by running a model on it; many subsets are evaluated on classification performance and the best one is selected. This makes the wrapper method more computationally expensive than the filter method. The hybrid method combines the wrapper and filter approaches to achieve the best possible performance with a particular learning algorithm. For feature selection over large dimensionality, hybrid algorithms need more efficient search strategies and evaluation criteria to approach the time complexity of filter algorithms. These methods are discussed in detail in Section 5 and summarized in Section 6.
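To make the filter/wrapper distinction concrete, here is a minimal sketch on a made-up two-class dataset. The mean-gap score and the leave-one-out 1-nearest-neighbour evaluator are simple illustrative stand-ins, not the algorithms of [18] or of any surveyed paper.

```python
# Toy dataset: rows of (feature vector, class label); feature 0 separates
# the two classes, feature 1 is constant, feature 2 is noise.
rows = [([0.1, 5.0, 3.3], 0), ([0.2, 5.1, 7.9], 0),
        ([0.9, 4.9, 3.1], 1), ([1.0, 5.2, 8.2], 1)]

def filter_scores(rows, n_features):
    """Filter: rank each feature by a data characteristic alone
    (here, the gap between per-class means) -- no classifier involved."""
    scores = []
    for j in range(n_features):
        m0 = sum(x[j] for x, y in rows if y == 0) / sum(1 for _, y in rows if y == 0)
        m1 = sum(x[j] for x, y in rows if y == 1) / sum(1 for _, y in rows if y == 1)
        scores.append(abs(m0 - m1))
    return scores

def wrapper_score(rows, subset):
    """Wrapper: evaluate a subset by actually running a classifier on it
    (leave-one-out 1-nearest-neighbour accuracy)."""
    correct = 0
    for i, (x, y) in enumerate(rows):
        dist = lambda a, b: sum((a[j] - b[j]) ** 2 for j in subset)
        nearest = min((r for k, r in enumerate(rows) if k != i),
                      key=lambda r: dist(x, r[0]))
        correct += nearest[1] == y
    return correct / len(rows)

scores = filter_scores(rows, 3)     # filter: score features independently
acc = wrapper_score(rows, [0])      # wrapper: run the classifier on a subset
```

The filter pass is one cheap sweep over the data; the wrapper pass must retrain or re-run the classifier for every candidate subset, which is exactly why wrappers cost more.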


In this section, we discuss in detail the different feature selection methods used in intrusion detection, organized into filter, wrapper and hybrid methods, along with the number of features selected, the feature numbers (according to Table 1), performance on the KDD Cup'99 dataset, and the strengths, limitations and future work reported in the literature.

Filter Method

A feature selection algorithm, FSMDB, based on the DB index criterion is proposed in (Zhang et al., 2004). The criterion function is constructed according to the characteristics of the DB index. 24 features {feature no.: 6, 5, 1, 34, 33, 36, 32, 8, 27, 29, 28, 30, 26, 38, 39, 35, 13, 24, 23, 11, 3, 10, 12 and 4} are selected and tested using two classifiers, a BP network and SVM; the reported results of the FSMDB algorithm with the BP network and SVM are 0.1017 and 0.056 respectively. This method can be used for supervised or unsupervised classification problems, but has high computational complexity in unsupervised learning mode. Future Work: To find a better approach to reduce the high computational complexity in unsupervised learning mode.

Two neural network methods, (1) neural network principal component analysis (NNPCA) and (2) nonlinear component analysis (NLCA), are presented in [22] (Kuchimanchi et al., 2004). The numbers of significant features extracted by the methods PCA, NNPCA and NLCA are 19, 19 and 12 respectively. The first selected features, based on the results of the Scree test and critical eigenvalues test, are {feature no.: 5, 6, 1, 22, 21, 31, 30, 3, 4, 2, 16, 10, 13, 34, 32, 27, 24, 37, 23 and 36}. The performance of the non-linear classifier (NC) and the CART decision tree classifier (DC) is tested on four datasets (Table 3). DC has relatively high detection accuracies and low false positive rates. Future Work: This work can be extended with quantitative measures to find optimal combinations of classifiers and feature extractors for IDS.

[Table 3]
RICGA (ReliefF Immune Clonal Genetic Algorithm), a combined feature subset selection method based on the ReliefF algorithm, the Immune Clonal selection algorithm and GA, is proposed in [23] (Zhu et al., 2005). A BP network is used as the classifier. RICGA achieves higher classification accuracy (86.47%) for a small feature subset (8 features) than ReliefF-GA. The selected features are not listed in the paper.

[Table 4]
This paper (Zainal et al., 2006) investigated the effectiveness of Rough Set (RS) theory both in identifying important features and as a classifier. The 6 significant features obtained are {feature no.: 41, 32, 24, 4, 5 and 3}. Classification results obtained by Rough Set are compared with Multivariate Adaptive Regression Splines (MARS), Support Vector Decision Function (SVDF) and Linear Genetic Programming (LGP). The classification accuracy of RS is ranked second for the normal category and is almost the same as MARS and SVDF for the attack category. Future Work: This work can be extended in terms of accuracy by focusing on fusion of classifiers after an optimum feature subset is obtained.

Wong and Lai (2006) combined Discriminant Analysis (DA) and Support Vector Machine (SVM) to detect intrusions in an anomaly-based network IDS. Nine features {feature no.: 12, 23, 32, 2, 24, 36, 31, 29 and 39} are extracted by Discriminant Analysis and evaluated by SVM. The TN, FP, FN and TP of the proposed method are 99.58%, 0.42%, 9.93% and 90.07% respectively. Future Work: Multiple Discriminant Analysis (MDA) can be applied to find the optimal feature set for each type of attack.

Li et al. (2006) proposed a lightweight intrusion detection model. Information Gain and the Chi-Square approach are used to extract important features, and a classic Maximum Entropy (ME) model is used to learn and detect intrusions. The top 12 important features selected by both methods are {feature no.: 3, 5, 6, 10, 13, 23, 24, 27, 28, 37, 40 and 41}. Experimental results are shown in Table 4. Future Work: This model can be applied in a realistic environment to verify its real-time performance and effectiveness.
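The Information Gain ranking step can be sketched as follows on synthetic records (the Maximum Entropy detection model itself is not shown, and the feature values below are invented for illustration):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG(feature) = H(labels) - sum_v p(v) * H(labels | feature = v)."""
    n = len(labels)
    cond = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

# Toy records: feature A predicts the label perfectly, feature B is noise.
labels = ['normal', 'normal', 'attack', 'attack']
feat_a = ['tcp', 'tcp', 'udp', 'udp']
feat_b = ['s0', 's1', 's0', 's1']
ig_a = information_gain(feat_a, labels)   # high: perfectly informative
ig_b = information_gain(feat_b, labels)   # zero: no information about the label
```

Ranking features by `ig` and keeping the top k is the selection step; continuous KDD features would first need discretization, which is omitted here.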

[Table 5]
Tamilarasan et al. (2006) applied different feature selection and ranking methods to the KDD Cup'99 dataset. Experiments using Chi-Square analysis, logistic regression, the normal distribution and the beta distribution are performed for feature selection. The 25 most significant features ranked by the Chi-Square test are {feature no.: 35, 27, 41, 28, 40, 30, 34, 3, 33, 12, 37, 24, 29, 2, 13, 8, 36, 10, 26, 39, 22, 25, 5, 1 and 38}. Experiments are performed for normal, probe, DoS, U2R and R2L using a resilient back-propagation neural network. The overall classification accuracy is 97.04%, with an FPR of 2.76% and an FNR of 0.20%.
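Of the four ranking experiments above, the Chi-Square test is the simplest to sketch: it measures how far the observed feature/class contingency table deviates from independence. The data below is synthetic, and the logistic-regression and distribution-based rankings are not shown.

```python
def chi_square(values, labels):
    """Chi-square statistic between a categorical feature and the class label:
    sum over contingency-table cells of (observed - expected)^2 / expected."""
    n = len(labels)
    stat = 0.0
    for v in sorted(set(values)):
        for l in sorted(set(labels)):
            observed = sum(1 for x, y in zip(values, labels) if x == v and y == l)
            expected = values.count(v) * labels.count(l) / n
            stat += (observed - expected) ** 2 / expected
    return stat

labels = ['normal', 'normal', 'attack', 'attack']
chi_perfect = chi_square(['tcp', 'tcp', 'udp', 'udp'], labels)  # fully dependent
chi_noise = chi_square(['s0', 's1', 's0', 's1'], labels)        # independent
```

A higher statistic means a stronger association with the class, so features are ranked by `chi_square` score in descending order.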

Fadaeieslam et al. (2007) proposed a feature selection method based on Decision Dependent Correlation (DDC). The mutual information of each feature with the decision is calculated, and the top 20 important features {feature no.: 3, 5, 40, 24, 2, 10, 41, 36, 8, 13, 27, 28, 22, 11, 14, 17, 18, 7, 9 and 15} are selected and evaluated by an SVM classifier. The classification accuracy is 93.46%, outperforming Principal Component Analysis (PCA).

Shina Sheen and R Rajesh (2008) considered three feature selection methods: Chi-Square, Information Gain and ReliefF. The top 20 features {feature no.: 2, 3, 4, 5, 12, 22, 23, 24, 27, 28, 30, 31, 32, 33, 34, 35, 37, 38, 40 and 41} are selected and evaluated using a decision tree (C4.5). The classification accuracies of Chi-Square, Information Gain and ReliefF are 95.8506%, 95.8506% and 95.6432% respectively.

In (Kiziloren and German, 2009), Principal Component Analysis (PCA) is used for feature selection to increase the quality of extracted feature vectors, and a Self-Organizing Map (SOM) is used as a classifier to detect network anomalies. The highest success rate of the system, 98.83%, is obtained when the feature vector size equals 10. The selected features are not listed in the paper. The average success rate of the system without PCA is 97.76%. PCA also provides faster classification, which is important for a real-time system.

Suebsing and Hiransakolwong (2009) proposed a combination of Euclidean Distance and Cosine Similarity to select robust feature subsets of smaller size. Euclidean Distance is used to select features for detecting known attacks, and Cosine Similarity is used to select features for detecting unknown attacks, to build a model. The known-attack detection method extracts 30 important features {feature no.: 1, 2, 12, 25, 26, 27, 28, 30, 31, 35, 37, 38, 39, 40, 41, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19 and 22}. The unknown-attack detection method extracts 24 important features {feature no.: 1, 2, 12, 25, 26, 27, 28, 30, 31, 35, 37, 38, 39, 40, 41, 3, 4, 23, 24, 29, 32, 33, 34 and 36}. 15 features {feature no.: 1, 2, 12, 25, 26, 27, 28, 30, 31, 35, 37, 38, 39, 40 and 41} are selected by both methods. The C5.0 method is used as a classifier. The experimental results are shown in Table 5.
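The two measures themselves are standard and easy to sketch. How the paper applies them to score individual features is specific to its setup, so the "profiles" below are hypothetical vectors invented for illustration only.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical mean profiles of a feature under normal vs. attack traffic;
# a large distance / low similarity suggests the feature discriminates well.
normal_profile = [0.2, 0.9, 0.1]
attack_profile = [0.8, 0.1, 0.7]
d = euclidean(normal_profile, attack_profile)
s = cosine_similarity(normal_profile, attack_profile)
```

Euclidean distance is sensitive to magnitude while cosine similarity only compares direction, which is one plausible reason for using the former against known attacks and the latter against unseen ones.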

[Table 6]
A new approach named Quantitative Intrusion Intensity Assessment (QIIA) is proposed in [32] (Lee et al., 2009). QIIA evaluates the proximity of each instance of audit data using proximity metrics based on Random Forests (RF), and uses the numerical feature importance of RF to select important features. Two approaches, QIIA1 and QIIA2, are proposed to determine the threshold parameter values. The top 5 important features selected are {feature no.: 23, 32, 10, 6 and 3}. Only DoS attacks are used, since the other attack types have a very small number of instances. The experimental results show that the detection rates (DR) of QIIA1 and QIIA2 are 97.94% and 99.37% respectively.

An entropy-based traffic profiling scheme for detecting security attacks is presented in (Lee and He, 2009). The paper focuses only on denial-of-service (DoS) attacks. The top six features ranked by accuracy are {feature no.: 5, 6, 31, 32, 36 and 37}. The true positive rate (TPR) of this scheme is 91%.

(Xiao et al., 2009) presented a two-step feature selection algorithm that eliminates irrelevant features in the first step and redundant features in the second step. 21 features {feature no.: 1, 3, 4, 5, 6, 8, 11, 12, 13, 23, 25, 26, 27, 28, 29, 30, 32, 33, 34, 36 and 39} are selected and evaluated using the C4.5 algorithm and Support Vector Machine (SVM). The detection rate (%), false alarm rate (%) and processing time with the selected features (all features) are 86.3 (87.0), 1.89 (1.85) and 15.163 sec (21.891 sec) respectively.
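The two-step idea of removing irrelevant then redundant features can be sketched with mutual information as both the relevance and redundancy measure. The thresholds, the redundancy ratio and the tiny dataset below are invented for illustration; the actual criteria of (Xiao et al., 2009) may differ.

```python
from math import log2
from collections import Counter

def mutual_info(xs, ys):
    """Mutual information (bits) between two discrete variables."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * log2((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in pxy.items())

def two_step_select(features, labels, rel_thresh=0.01, red_thresh=0.9):
    """Step 1: drop features carrying (almost) no information about the label.
    Step 2: drop features redundant with an already-selected one."""
    relevant = [j for j, col in enumerate(features)            # step 1
                if mutual_info(col, labels) > rel_thresh]
    selected = []
    for j in relevant:                                         # step 2
        h = mutual_info(features[j], features[j])              # entropy of feature j
        if all(mutual_info(features[j], features[k]) / h < red_thresh
               for k in selected):
            selected.append(j)
    return selected

# f0 is informative, f1 duplicates f0 (redundant), f2 is irrelevant noise.
labels = [0, 0, 1, 1]
f0, f1, f2 = [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]
result = two_step_select([f0, f1, f2], labels)
```

Step 1 prunes `f2` (zero mutual information with the label), and step 2 prunes `f1` (fully redundant with the already-kept `f0`), leaving one feature.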

A novel approach for selecting features and comparing the performance of various BN classifiers is proposed in (Khor et al., 2009). Two feature selection algorithms, Correlation-based Feature Selection Subset Evaluator (CFSE) and Consistency Subset Evaluator (CSE), together with domain experts, are utilised to form the proposed feature set. This feature set contains 7 features: {feature no.: 3, 6, 12, 23, 32*, 14* and 40*} (*: features selected based on domain knowledge). A Bayesian Network (BN) is employed as the classifier. The classification accuracies of the BN for the Normal, DoS, Probe, R2L and U2R types are 99.8%, 99.9%, 89.4%, 91.5% and 69.2% respectively.

Bahrololum et al. (2009) used three machine learning methods for feature selection: Decision Tree (DT), Flexible Neural Tree (FNT) and Particle Swarm Optimization (PSO). The five important features {feature no.: 10, 17, 14, 13 and 11} are selected based on the contribution of the variables to the construction of the decision tree. The experimental results are shown in Table 6.

An automatic feature selection method based on the filter approach is proposed by Nguyen et al. (2010). The globally optimal subset of relevant features is found by Correlation Feature Selection (CFS) and evaluated by C4.5 and BayesNet. The selected features for Normal&DoS are 3 {feature no.: 5, 6 and 12}; for Normal&Probe, 6 {feature no.: 5, 6, 12, 29, 37 and 41}; for Normal&U2R, 1 {feature no.: 14}; and for Normal&R2L, 2 {feature no.: 10 and 22}. The average classification accuracies of C4.5 and BayesNet are 99.41% and 98.82% respectively.
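CFS scores a candidate subset with a merit heuristic that rewards correlation with the class and penalizes correlation among the features themselves. The sketch below uses Pearson correlation for both terms; the exact correlation measure used by Nguyen et al. may differ, and the data is synthetic.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cfs_merit(subset, features, labels):
    """CFS merit = k * r_cf / sqrt(k + k*(k-1) * r_ff), where r_cf is the mean
    feature-class correlation and r_ff the mean feature-feature correlation."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[j], labels)) for j in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

# f0 predicts the label; f2 is uninformative and uncorrelated with f0.
labels = [0, 0, 1, 1]
features = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
m_single = cfs_merit([0], features, labels)
m_pair = cfs_merit([0, 2], features, labels)
```

Adding the uninformative feature lowers the merit, which is how a search over subsets is steered toward small, predictive, non-redundant sets.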

Chen et al. (2010) proposed a novel inconsistency-based feature selection method. Data consistency is applied to find the optimal features, which are evaluated by the decision tree method (C4.5). The proposed method is compared with CFS (Table 7).
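The data-consistency criterion can be sketched as an inconsistency rate: two records conflict if they agree on the selected features but carry different labels. This is a generic formulation of the idea, not necessarily the exact measure of Chen et al., and the records below are invented.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, subset):
    """Fraction of rows that cannot be explained by the majority label of the
    group of rows matching them on the selected features."""
    groups = defaultdict(list)
    for x, y in zip(rows, labels):
        groups[tuple(x[j] for j in subset)].append(y)
    inconsistent = sum(len(g) - max(Counter(g).values()) for g in groups.values())
    return inconsistent / len(rows)

# Feature 0 determines the label; feature 1 alone leaves conflicting groups.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [0, 0, 1, 1]
rate_f0 = inconsistency_rate(rows, labels, [0])   # consistent subset
rate_f1 = inconsistency_rate(rows, labels, [1])   # inconsistent subset
```

A search procedure then looks for the smallest subset whose inconsistency rate stays at (or below) that of the full feature set.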

[Table 7]
varGDLF, a novel unsupervised statistical variational framework for the GD mixture model with localized feature selection (GDLF), is proposed in (Fan et al., 2011) for detecting network-based attacks. Eleven features {feature no.: 1, 5, 12, 15, 18, 21, 22, 29, 33, 38 and 41} are selected. The varGDLF approach is compared with four other variational mixture models and outperforms them, with the highest accuracy rate (85.2%), the lowest FP rate (7.3%) and the most accurately detected number of components (4.95). The accuracy rates for Normal, DoS, R2L, U2R and Probing are 99.5%, 96.5%, 75.4%, 69.6% and 85.1% respectively; the FP rates are 11.5%, 0.8%, 1.4%, 11.5% and 11.3% respectively.

An improved information gain (IIG) algorithm based on feature redundancy is proposed in [40] (Xian et al., 2011). Twenty-two features are selected after applying the Information Gain (IG) algorithm, and then 12 features {feature no.: 2, 3, 5, 6, 8, 10, 12, 23, 25, 36, 37 and 38} are selected after applying IIG. Naive Bayes (NB) is used to carry out experiments on three feature sets: the original feature set (41 features), feature subset 1 (22 features) and feature subset 2 (12 features). The processing times (s) of the three feature sets are 8.34, 4.16 and 2.08; the detection rates (DR) (%) are 96.187, 96.407 and 96.801; and the false positive rates (FPR) (%) are 5.22, 2.58 and 1.02 respectively.