Comparison of Gain Ratio and Chi-Square Feature Selection Methods in Improving SVM Performance on IDS

An intrusion detection system (IDS) is a security technology designed to identify and monitor suspicious activity in a computer network or system and to detect potential attacks or security breaches. Accuracy is critical in an IDS, given that the response to every alert or activity generated by the system must be precise and measurable. However, achieving high accuracy in an IDS is not straightforward: the complexity of network environments and the diversity of attacks pose significant challenges in developing an IDS. The application of algorithms and optimization techniques therefore needs to be considered to improve IDS accuracy. The support vector machine (SVM) is a data mining method with a high level of accuracy in classifying network data packet patterns. An optimal classification process requires a feature selection stage, which can also be applied to SVM. Feature selection is an essential step in the data preprocessing phase, and optimizing the data input can improve the performance of the SVM algorithm. This study therefore compares the performance of two feature selection algorithms, Information Gain Ratio and Chi-Square, and then classifies IDS data using the SVM algorithm. The outcome underlines the importance of selecting the right features to develop an effective IDS.


Introduction
An intrusion detection system (IDS) is a security technology designed to identify and monitor suspicious activity in a computer network or system and to detect potential attacks or security breaches. In network activity, an IDS identifies suspicious activity and recognizes attacks on the network. When the IDS discovers such an attack or activity, it sends reports and notifications to the network administrator [1]. The increase in online threats and attacks shows that developing an intrusion detection system is imperative to protect networks and computer systems [2]. An IDS is an effective tool for monitoring networks, especially for detecting malicious attacks [3]. An IDS can detect anomalous network behavior such as Denial of Service (DoS), Probe, SSH Brute Force (SBF), Brute Force Web (BFW), SQL Injection (SQLI), and other types of attacks [4]. Accuracy is critical in an IDS, given that the response to every alert or activity generated by the system must be precise and measurable. High accuracy ensures that limited resources are not wasted on unnecessary investigations or excessive responses to false alerts. In addition, high accuracy contributes to a better understanding of potentially emerging attack patterns, which ultimately helps in taking better precautions and designing more resilient defense systems. However, achieving high accuracy in an IDS is not straightforward. The complex and evolving network environment and the ever-changing diversity of attacks pose significant challenges in developing detection algorithms that distinguish between normal and suspicious activity. The growth of data and networks also makes the data to be processed larger, placing it in the Big Data category.
The application of algorithms and optimization techniques needs to be considered to improve the accuracy of IDS. In recent years, researchers using various publicly accessible datasets such as KDDCUP, NSL-KDD, DARPA, and other public datasets have tried various machine learning-based intrusion detection methods, applying algorithms such as PSO [5], SMOTE, and Random Forest [6], including their use in IoT [7]. One frequently used algorithm is the Support Vector Machine (SVM), which can learn patterns from training data and apply them to new data to detect suspicious activity. The SVM is a data mining method with a high level of accuracy in classifying data. A previous study comparing SVM and ANN classifiers for COVID-19 prediction showed that SVM achieves slightly better accuracy than ANN [8]. Regarding accuracy, a comparison of SVM and KNN showed that the SVM algorithm achieves higher accuracy when tested with normalization, outperforming the KNN algorithm in both normalized and non-normalized conditions, while the KNN algorithm consistently demonstrates lower accuracy: SVM reached an accuracy of 84.61% versus 64.83% for KNN [9]. In another study using intrusion detection datasets, SVM achieved high accuracy, recall, precision, and F1 scores, with 93.75% accuracy on the UNSW-NB15 dataset and 98.92% accuracy on the CICIDS2017 dataset [10].
At the classification stage, a feature selection stage is needed. Feature selection is an essential step in the data preprocessing phase. This phase entails choosing a subset of pertinent features from a broader set of available features. Examples of research applying this technique include the application of IG-R to improve IDS performance [11], research using a combination of PSO and CFS for feature selection [12], and the Farmland Fertility Algorithm [13]. The importance of feature selection can be seen from previous studies proving that optimization algorithms can help increase SVM accuracy by up to 36.2% compared to SVM without feature selection optimization [14]. One feature selection method is the Gain Ratio model, which can improve the accuracy of classification models [15]. Other studies indicate that the Gain Ratio method can enhance the efficacy of the Support Vector Machine (SVM) algorithm. This improvement is observed when utilizing from 100% down to 5% of the features, with optimal precision achieved at 50% of the features; however, the highest accuracy and recall are attained when utilizing only 5% of the features. Another line of research on feature selection methods ranks feature sets on microarray datasets. Five feature selection methods were compared: Chi-Square, Relief, Gain Ratio, Information Gain, and Symmetrical Uncertainty. Four classification methods were applied in the classification stage, each run ten times with cross-validation. In that study, the feature selection method that excelled on several microarray datasets was the Gain Ratio, which was superior on the Breast, Colon, and Ovarian datasets, with recognition rates of 84.69, 82.25, and 87.91, respectively [16]. The Gain Ratio is a widely used feature selection method that is useful for improving the accuracy of classification models, assisting in selecting relevant features and reducing complexity.
Another widely used feature selection method with its own advantages is the Chi-Square (CHI) method. This method is statistically robust; it was developed to measure the relationship between two categorical variables in contingency tables. Features with a strong relationship to the target may be considered for inclusion in the model, while less informative features may be omitted. Related research using the Chi-Square (CHI) method for feature selection [17] states that Chi-Square can help optimize the threshold for NeighShrink, a denoising algorithm for reducing additive white Gaussian noise; the experimental results show that the proposed algorithm is simple and efficient, provides noise reduction, and preserves edges and detail well. Other research applied Chi-Square in Arabic text classification to improve classification performance. This combination significantly enhanced the performance of the Arabic text classification model on a dataset of 5,070 Arabic documents classified into six independent classes. The best F-measure obtained for this model was 90.50% with 900 features [18]. The Chi-Square method is helpful for feature selection in machine learning and data analysis; it helps identify the most informative features by evaluating their relationship with the target variable.
Previous research showed that optimizing the data input can improve the performance of classification algorithms, and demonstrated the ability of SVM in data classification, including classification of network data packet patterns, as evidenced by several studies using KDDCUP, NSL-KDD, DARPA, and other public datasets. This study therefore compares the performance of two feature selection algorithms, Information Gain Ratio and Chi-Square. We use these two feature selection algorithms because the datasets used contain various data types, such as categorical, continuous, and numerical data. Data processed by feature selection is then classified using the SVM algorithm. This study uses the NSL-KDD, UNSW, and CSE CIC IDS2018 datasets, and the results of this comparison will serve as the basis for further research in the classification optimization stage. Details about the datasets are shown in Table 1.

Research Methods
This study compares the classification results obtained with feature selection using Gain Ratio and Chi-Square. The process starts with acquiring the IDS datasets: NSL-KDD, UNSW, and CSE CIC IDS2018. This research uses normal and abnormal classes, representing normal by 0 and abnormal by 1.
For the NSL-KDD data, a transformation is applied that encodes the value 'normal' as 0 and all other values as 1 in the attack attribute. In the UNSW dataset, the labels (classes) are already 0 and 1, where class 0 represents normal network traffic and 1 represents an attack. In the CSE CIC dataset, we transformed the label (class) Benign to 0, meaning normal, and the two types of brute force attacks, FTP-BruteForce and SSH-BruteForce, to 1. Next, this attribute is used as the target for a binary model to identify any attack. The subsequent phase involves applying feature selection to the dataset using the Gain Ratio and Chi-Square methods. As a result of feature selection, not all attributes of the original dataset are used by these two techniques. From the feature selection results, we trained on a dataset containing only the selected features, and the classification process was then tested using the SVM algorithm. The final step compares the results presented in the confusion matrix.
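The binary label encoding described above can be sketched in a few lines. This is an illustrative sketch, not the study's actual code; the label values shown are examples taken from the datasets' class vocabularies:

```python
# Sketch of the binary label transformation: normal/benign traffic -> 0, attack -> 1.
# The set of "normal" spellings is an assumption based on the datasets described above.

def encode_label(value: str) -> int:
    """Map a raw class label to the binary scheme used in this study."""
    return 0 if value.lower() in {"normal", "benign"} else 1

raw_labels = ["normal", "neptune", "Benign", "FTP-BruteForce", "SSH-BruteForce"]
binary = [encode_label(v) for v in raw_labels]
print(binary)  # [0, 1, 0, 1, 1]
```

The same helper covers all three datasets because each uses a single "normal" class name and treats every other value as an attack.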

Data Mining
Data mining is the process of extracting knowledge from large or complex datasets. There are two common techniques in data mining: descriptive and predictive methods [19]. In descriptive approaches, algorithms identify patterns that describe the data by examining the relationships among data labels or attributes. Clustering, association rule mining, and sequential pattern discovery are three model learning methods characterized by their descriptive nature in data mining. Predictive methods use the values of several features to predict a particular value or trait in the future, based on data held in the past or present. This technique is also known as supervised learning, and common algorithms include classification, regression, and anomaly detection [19].

SVM
Support vector machines (SVM) are a subset of supervised machine learning techniques that construct a binary classification framework for addressing intricate, highly non-linear challenges [20]. Commonly applied to regression and classification problems, the SVM was conceived by Vapnik as part of the machine learning toolkit. SVM identifies the most effective separator to distinguish between two distinct classes. Additionally, SVM offers a cohesive framework enabling the classification of diverse data through the selected kernel, which is considered one of the advantages of SVM [21].

Gain Ratio
The Gain Ratio is one of the metrics used in classifying data or selecting features in machine learning and data analysis. This metric measures how well a feature or attribute separates the classes in a dataset, and it is often used in the context of decision tree algorithms. The Gain Ratio is a refinement of Information Gain that normalizes the gain value of a feature in the context of classification. The Gain Ratio was chosen because of its ability to produce higher accuracy than other filter techniques [22]. To calculate the Gain Ratio, the Information Gain must be calculated first; the Gain Ratio is then obtained as

GainRatio(A) = Gain(A) / SplitInfo(A) (3)

where Gain(A) is the Information Gain of attribute A and SplitInfo(A) is the split entropy of A.
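The Gain Ratio computation for a categorical attribute can be sketched directly from the definitions above. This is an illustrative implementation on a toy protocol-type feature, not the study's code:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature, labels):
    """GainRatio(A) = Gain(A) / SplitInfo(A) for a categorical feature A."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Gain(A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    # SplitInfo(A) = -sum_v |S_v|/|S| * log2(|S_v|/|S|)
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

# Toy example: a feature that perfectly separates a balanced binary class
feature = ["tcp", "tcp", "udp", "udp"]
labels = [0, 0, 1, 1]
print(gain_ratio(feature, labels))  # 1.0
```

A constant feature yields SplitInfo = 0 and is scored 0.0, which is why the normalization guards against attributes with no splitting power.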

Chi-Square
The chi-square statistical test measures the difference between a theoretical (expected) distribution and an observed distribution. The test is generally used in quantitative research that works with categorical data. For a feature, the statistic is computed over the contingency table of observed counts O and expected counts E as chi-square = sum((O - E)^2 / E). Based on this equation, each feature is given a value for each class, and the maximum final value is then obtained by combining all these values [18].
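The contingency-table computation can be sketched as follows. This is an illustrative implementation for one categorical feature against the binary class, not the study's code:

```python
def chi_square(feature, labels):
    """Chi-square statistic between a categorical feature and the class label:
    sum over the contingency table of (observed - expected)^2 / expected."""
    n = len(labels)
    f_vals = sorted(set(feature))
    c_vals = sorted(set(labels))
    obs = {(f, c): 0 for f in f_vals for c in c_vals}
    for f, c in zip(feature, labels):
        obs[(f, c)] += 1
    f_tot = {f: sum(obs[(f, c)] for c in c_vals) for f in f_vals}
    c_tot = {c: sum(obs[(f, c)] for f in f_vals) for c in c_vals}
    chi2 = 0.0
    for f in f_vals:
        for c in c_vals:
            expected = f_tot[f] * c_tot[c] / n  # under independence
            if expected > 0:
                chi2 += (obs[(f, c)] - expected) ** 2 / expected
    return chi2

# Perfect association between feature and class on 4 samples
print(chi_square(["tcp", "tcp", "udp", "udp"], [0, 0, 1, 1]))  # 4.0
# No association: the statistic drops to 0
print(chi_square(["a", "b", "a", "b"], [0, 0, 1, 1]))  # 0.0
```

Features are then ranked by this score, with larger values indicating a stronger dependence on the class.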

Evaluation Methods
The evaluation method in this study uses the confusion matrix for measuring precision, recall, and accuracy. The confusion matrix compares the classification results of the model or algorithm with the actual classifications. The matrix is described in the following table:
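The three metrics follow directly from the confusion matrix cells; a minimal sketch (with illustrative label vectors) is:

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall from the binary confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Example: 3 TP, 3 TN, 1 FP, 1 FN
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(confusion_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75)
```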

Result and Discussion
The datasets used in the classification process in this study are NSL-KDD, UNSW, and CSE CIC IDS2018. These datasets contain network traffic log data labeled as normal or intrusive. We transform the attributes that will become class labels into Normal and Not Normal.
Subsequently, the preprocessed dataset undergoes the Gain Ratio and Chi-Square feature selection methods. These methods rank features based on their relevance to the target variable (normal or not normal). The data that has gone through preprocessing then proceeds to the training phase. For training, we used the dedicated training sets provided by NSL-KDD and UNSW, while for the CSE CIC dataset the training process used 70% of the data as the training source. The last stage is classification with SVM, whose performance is evaluated using the performance metrics accuracy, precision, and recall. The process is carried out for all three datasets, and the results are compared.
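The ranking-and-split steps above can be sketched as follows. The feature names and scores are illustrative placeholders, not values produced by this study:

```python
import random

def select_top_features(scores, k):
    """Keep the k feature names with the highest relevance scores
    (scores would come from Gain Ratio or Chi-Square in practice)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle and split rows into train/test parts (70/30, as used for CSE CIC)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# Hypothetical relevance scores for a few NSL-KDD-style attribute names
scores = {"Service": 0.41, "Flag": 0.38, "Duration": 0.05, "Land": 0.01}
print(select_top_features(scores, 2))  # ['Service', 'Flag']

train, test = train_test_split(range(100))
print(len(train), len(test))  # 70 30
```

NSL-KDD and UNSW skip the split step because they ship separate training files; only the CSE CIC data is split this way.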
Before feature selection, the IDS dataset is transformed, especially the attributes that will become labels (classes). The NSL-KDD data has attributes such as Protocol_Type, Src_Bytes, Attack, and Level, from a total of 43 attributes. The Attack attribute (attribute 42) takes several values, including normal, neptune, nmap, and spy; the transformation encodes the value 'normal' as 0 and all other values as 1.

Figure 2. NSL KDD Features
At the feature selection stage, we compared the results of the two feature selection methods, Gain Ratio and Chi-Square. The feature selection results of the Gain Ratio method include attributes such as Service, Flag, Logged_In, Count, Dst_Host_Srv_Count, and Dst_Host_Count. From the Gain Ratio feature selection results, 19 attributes, including the label, are used in the next stage; 24 features are removed by the Gain Ratio method. In the Chi-Square selection process, 36 features were used and only seven features were discarded. The attributes obtained from the Chi-Square feature selection include Duration, Protocol_Type, Service, Flag, Land, and 31 other features.
In the next stage, we applied the feature selection results from the NSL-KDD dataset with SVM using the dot kernel. The SVM performance results with Gain Ratio feature selection are shown below: testing with the Gain Ratio feature selection model obtained an accuracy of 75.36%. The SVM performance results using Chi-Square feature selection follow. In further experiments, we used the UNSW and CSE CIC IDS2018 datasets with the same feature selection methods. For the UNSW dataset, which includes 49 features plus the label, the Gain Ratio feature selection stage yielded 12 optimal features, whereas the Chi-Square method identified 42 features for use. The detailed UNSW dataset features are shown in Figure 3. Analyzing the accuracy outcomes shown in Figure 5, it becomes evident that Chi-Square feature selection yields advantages over Gain Ratio in the experiments conducted on two datasets, NSL-KDD and UNSW. Conversely, Gain Ratio feature selection proves superior in the tests on the CSE CIC dataset. Comparing the differences in accuracy across the three datasets, the largest advantage of Chi-Square is on the NSL-KDD dataset, with a margin of 18%. A table of accuracy differences for the three datasets is given below. The experimental results shown in Figure 6 indicate that both feature selection methods improve SVM performance compared to using all features, and that on the NSL-KDD and UNSW datasets SVM accuracy improves more with Chi-Square feature selection than with Gain Ratio.

Conclusion
This study has compared the Gain Ratio and Chi-Square feature selection methods in the context of IDS to improve SVM performance. The results show that both methods can improve SVM performance in detecting intrusions. In the comparison, the accuracy with Gain Ratio is lower than with Chi-Square on two of the datasets tested, so the Chi-Square method is more recommendable for feature selection because it provides slightly better accuracy. Based on the accuracy results, the two feature selection methods work optimally when the selected features are not too few; in other words, there must be enough features to serve as the basis for the classification process. This can be seen from the accuracy when Gain Ratio and SVM are used on the CSE CIC dataset, which yields an accuracy of 89.05%: the features selected by the Gain Ratio are more numerous than those selected by Chi-Square and are sufficient for classification. The same pattern appears when Chi-Square and SVM produce higher accuracy than Gain Ratio and SVM on the NSL-KDD and UNSW datasets. This outcome underlines the importance of selecting the right features to develop an effective IDS.

Figure 3. UNSW Dataset Features
Regarding the CSE CIC IDS2018 dataset, which encompasses 80 attributes, including the label, the Gain Ratio feature selection stage produced the 14 best features. In contrast, the Chi-Square feature selection method identified the six best features, as shown in Figure 4.

Table 3. Gain Ratio and SVM Performance

Table 4. Chi-Square and SVM Performance

Table 8. SVM + Chi-Square on the CSE CIC Dataset
These results showed that SVM performance on test data with the Chi-Square feature selection model resulted in an accuracy of 79.50% for the UNSW dataset and 74.35% for the CSE CIC dataset. Figure 5 shows a comparison of the accuracy results of Gain Ratio and Chi-Square feature selection for all datasets used:

Table 9. Accuracy Difference