The Optimization of the ARP Poisoning Attack Detection Model Using a Similar Approach Based on NetFlow Analysis

Information security threats are a persistent concern in the cyber era. One such threat is the ARP poisoning attack, which falsifies a computer's identity to gain illegal access to confidential information on a target computer. It can also cause service deadlocks in the network. Previous studies have introduced ARP attack detection models using rule-based and mining-based approaches. Still, these models do not achieve optimal detection performance and produce high false positive rates. This paper proposes a detection model for ARP poisoning attacks using a similarity measurement approach that adopts cosine similarity. The goal is to measure how similar host activities are to ARP poisoning attacks. The experimental results showed that the model achieved an accuracy of 97.25%, a recall of 96.43%, and a precision of 81% with a similarity threshold value of 0.488. The model showed higher detection accuracy than previous studies and several classification methods. This indicates that the model can improve intrusion detection performance and help network administrators analyze ARP poisoning attacks in computer networks.


Introduction
Nowadays, system security in computer networks needs to be handled appropriately. Malicious activities that can damage computer networks and credential data must be avoided and detected. Such malicious activity is often referred to as intrusion [1]. To deal with intrusion in computer networks, an intrusion detection system (IDS) [2] needs to be implemented to strengthen communication services between interconnected computers.
Generally, IDS systems are built with two approaches, namely misuse detection and anomaly detection [3]. These approaches can be implemented using rule-based formation, as in the SNORT application [4]-[7]. However, this IDS model cannot detect new attack variants in the network, and its detection accuracy depends on the accuracy of the rule base. One type of attack with a high detection error in intrusion detection systems is the Address Resolution Protocol (ARP) poisoning attack. ARP is a communication protocol that maps the IP address of each computing device in a network to its Media Access Control (MAC) address. ARP attack detection has been studied in previous research, including detection models that apply mining-based analysis [1]-[3], [8], [9], anomaly behavior analysis [10]-[13], and supporting applications [4], [14], [15]. In [16], an ARP poisoning detection model is introduced that separates ARP service access and uses a listing technique in the form of static MAC address addressing. The experimental results show that the model can detect suspected ARP poisoning attack activity. Still, the model is very dependent on resources in the ARP control flow model.
Selvarajan et al. [5] proposed an ARP poisoning detection model involving communication analysis on manipulated ICMP echo requests. The detection process involves rule-based analysis that checks ICMP ping messages that lead to ARP spoofing activity. The test results show that the model can detect this behavior using a statistical approach based on the behavioral characteristics of suspected ARP poisoning perpetrators who communicate using certain protocol services. However, this research requires an attack scenario that uses a specific protocol, whereas ARP poisoning attacks generally involve several communication protocols. The study in [13] proposed an ARP attack detection model that distinguishes detection and prevention mechanisms. ARP attacks are divided into two activity processes: ARP spoofing and ARP poisoning attacks. The activity detection mechanism creates a static list of communications in the network, namely IP address and MAC address mapping tables. Then, legal access in the permission list is registered for communication on the network. This model detects and prevents ARP poisoning attacks but requires a validation process on the network, which can slow down activity on the network. Besides, the model uses statistical rules implemented for the hosts declared to be registered. In some cases, new hosts that have not yet been registered require a legal communication process.

LONTAR KOMPUTER
ARP poisoning detection models often analyze network traffic data obtained from recordings [15] or use public datasets that have been processed into netflows [8], [10]. Netflows are commonly used to detect attacks or anomalous malicious activity [17], including ARP poisoning attacks. ARP poisoning is often used to cause computer network deadlocks by illegally forging computer addresses to obtain confidential and credential information [8]. Thus, a proper detection technique is required to detect the attack accurately in a computer network. This paper proposes a new approach for ARP poisoning attack detection using similarity analysis based on NetFlow analysis. This research extends [8], in which the detection model used a classification approach. ARP poisoning attack detection using classification achieves high detection accuracy but also a high detection error. Therefore, the proposed detection model aims to improve detection accuracy by suppressing detection errors based on NetFlow analysis. The novelty of this research is the analysis of a dynamic similarity threshold value. In addition, this research builds a knowledge base from the characteristics of attack patterns. The similarity measurement results are expected to show the closeness between the activity patterns of suspected attackers and the characteristics in the ARP poisoning attack knowledge base. This paper is organized into several sections. The process stages of the proposed model are introduced in Section II. Section III presents the experimental results and the discussion. Finally, the conclusions are drawn in Section IV.

Research Methods
ARP poisoning attack detection has been studied by previous researchers. Some applied mining-based analysis [1]-[3], [8], [9], anomaly behavior analysis [10]-[13], or supporting applications [4], [14], [15]. ARP poisoning studies often analyzed network traffic data obtained from recordings [15] or used public datasets that had been processed into network traffic flows (netflows) [8], [10]. The proposed ARP poisoning detection model is shown in Figure 1.

Dataset
This research used network traffic data obtained from a computer network recording process. The recorded data took the form of .pcap files, which were processed into .csv files to form a dataset by adopting the techniques in [8] and standardized based on the IDMEF standard [18]. The NetFlow data had a recording duration of 1 hour with 2819 traffic records: 279 ARP poisoning attack records and 2540 normal activity records. Six of the 418 hosts were attackers. The ARP poisoning attacks involved malware performing ARP broadcasts, ARP packet flooding, and MAC flooding. Normal activities were conducted by hosts through browsing, e-mail sending, DNS access, and FTP access.

Data Preprocessing
At this phase, the data in .csv format went through preprocessing: data cleansing, normalization, and feature selection. The data cleansing stage removed data records that had no values and records that were duplicated. Duplicate data can appear due to errors in the recording process. In addition, the cleansing process filled null feature values with the value "0", aiming to standardize the value of each attribute in the data record. After the cleansing process, the data were normalized to a value range of 0 to 1, adopting the approach in [19].
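The cleansing and normalization steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The record layout (a list of dictionaries), the function names `clean_records` and `min_max_normalize`, the column names, and the treatment of empty strings as null are all assumptions, and the values are assumed to be numeric already.

```python
def clean_records(records):
    """Drop empty and duplicate records, then replace null values with 0."""
    seen = set()
    cleaned = []
    for rec in records:
        # Skip records in which every field is missing.
        if all(v is None or v == "" for v in rec.values()):
            continue
        # Fill remaining null fields with 0 so every attribute has a value.
        filled = {k: (0 if v is None or v == "" else v) for k, v in rec.items()}
        key = tuple(sorted(filled.items()))
        if key in seen:  # duplicate record
            continue
        seen.add(key)
        cleaned.append(filled)
    return cleaned


def min_max_normalize(records, columns):
    """Scale each listed numeric column to the range [0, 1]."""
    if not records:
        return []
    normalized = [dict(rec) for rec in records]
    for col in columns:
        values = [rec[col] for rec in records]
        lo, hi = min(values), max(values)
        span = hi - lo
        for rec in normalized:
            # A constant column carries no information; map it to 0.
            rec[col] = (rec[col] - lo) / span if span else 0.0
    return normalized


# Hypothetical records with made-up column names and values.
raw = [
    {"length": 60, "tcp_port": 80},
    {"length": 60, "tcp_port": 80},      # duplicate, dropped
    {"length": 120, "tcp_port": None},   # null filled with 0
    {"length": None, "tcp_port": None},  # empty record, dropped
]
rows = min_max_normalize(clean_records(raw), ["length", "tcp_port"])
```

Filling nulls before checking for duplicates is a design choice made here so that two records differing only in how a missing value was written are treated as the same record.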

Data Splitting
Data splitting divides the data into two types: training data and testing data. This stage is often used in the learning process of machine learning-based detection models [20]-[23]; here, the composition was 70% training data and 30% testing data. In the testing data, the attack label of each record was removed, and the data were used to test the learning model.
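A random 70/30 split with the labels held back from the testing data can be sketched as below. The helper names `split_dataset` and `strip_labels`, the `label` field, and the fixed seed are assumptions made for illustration; the source does not say how the split was implemented.

```python
import random


def split_dataset(records, train_ratio=0.7, seed=42):
    """Shuffle the records and split them into training and testing subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def strip_labels(records, label_key="label"):
    """Return the records without their label, plus the held-back labels."""
    features = [{k: v for k, v in rec.items() if k != label_key} for rec in records]
    labels = [rec[label_key] for rec in records]
    return features, labels


# Hypothetical labeled records.
records = [{"id": i, "label": "attack" if i % 5 == 0 else "normal"} for i in range(10)]
train, test = split_dataset(records)
test_features, test_labels = strip_labels(test)
```

The held-back labels are kept separately so that the detection output can later be compared against them during evaluation.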

Classification
The classification stage is the stage in which ARP attack activity is detected through the learning process. The detection results record the attack activity and are stored in the knowledge base. The classification stage used five classification models, namely k-NN, Logistic Regression, Naïve Bayes, Random Forest, and Decision Tree. The five models were evaluated, and the results of the model with the best detection accuracy, precision, and recall were stored and labeled as the ARP poisoning pattern NetFlow.

ARP Poisoning Pattern NetFlow
The ARP poisoning pattern NetFlow is the knowledge base of ARP attack activities. Each data record is labeled as an activity and sorted by the time of the ARP attack activity. Thus, the knowledge base holds the ARP attack activities in sequential order.

Dynamic Threshold Analysis
Dynamic threshold analysis is a stage to determine the relationship between ARP attack activities. The intended relationship is the similarity between two attacker activities. If two different attackers are represented as nodes A and B, the similarity from A to B may differ from the similarity from B to A [24]. To determine whether two attacking objects are similar and have a substantial similarity value, it is necessary to analyze the similarity threshold value [25]. In this paper, the similarity measurement ($Sim$) between two ARP attack patterns adopted the cosine similarity approach. The threshold value ($T_s$) was determined dynamically from the characteristics of the data using (1):

$$T_s = \frac{Sim_{min} + Sim_{max}}{2} \qquad (1)$$

where $Sim_{min}$ is the lowest similarity value among all the attacker activity similarity measurements and $Sim_{max}$ is the highest measured similarity value. The threshold value is updated whenever new attack characteristics are added to the knowledge base, namely the ARP poisoning pattern NetFlow.
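The dynamic threshold can be illustrated with a short sketch. It assumes that the threshold in (1) is the midpoint between the lowest and highest similarity values. That reading is an assumption, although it is consistent with the lowest (0.033) and highest (0.862) values reported with Table 6, which give approximately 0.448. The function name `dynamic_threshold` is hypothetical.

```python
def dynamic_threshold(similarities):
    """Midpoint between the lowest and highest similarity values (assumed form of (1))."""
    if not similarities:
        raise ValueError("at least one similarity value is required")
    return (min(similarities) + max(similarities)) / 2


# 0.033 and 0.862 are the extremes reported with Table 6;
# 0.5 is a made-up intermediate value that does not affect the result.
threshold = dynamic_threshold([0.033, 0.5, 0.862])
```

Because only the extremes enter the calculation, the threshold changes only when a new pattern in the knowledge base produces a similarity below the current minimum or above the current maximum.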

Feature Extraction
The feature extraction stage extracts features from each data attribute, or primary feature, in the NetFlow traffic. The previous process used ten primary attributes: source port, source IP address, destination port, destination IP address, length, UDP port, TCP port, source MAC address, protocol, and destination MAC address. Suppose the traffic in a computer network ($N$) consists of host activities containing traffic records ($R$), defined as $R = \{r_1, r_2, \ldots, r_n\}$. Each record contains a feature tuple ($ft$) made up of source IP address ($ip_{src}$), destination IP address ($ip_{dst}$), protocol ($proto$), TCP port ($port_{tcp}$), UDP port ($port_{udp}$), length ($length$), source port ($port_{src}$), source MAC address ($mac_{src}$), and destination MAC address ($mac_{dst}$). Thus $ft \in R$, where $ft = \{(ip_{src}, ip_{dst}, proto, port_{tcp}, port_{udp}, length, port_{src}, mac_{src}, mac_{dst})\}$.
Each feature tuple ($ft$) value was re-extracted to get six new features, denoted as $f$, by calculating the activity type of each host's interaction. The type of host activities can take the form of spreading activity and shows the host's communication behavior pattern in the network. The types, or variants, of each feature are calculated by grouping each feature based on the source IP address ($ip_{src}$). Thus, each of the six features $f_1, f_2, \ldots, f_6$ is computed from one primary feature of the tuple; $f_3$, for example, is computed from the length feature.
The feature extraction results produced a feature pattern denoted as $P$, with $P = \{f_1, f_2, f_3, f_4, f_5, f_6\}$. The feature patterns ($P$) obtained from the classification results are stored in the ARP poisoning attack activity knowledge base, denoted by $P_{kb}$.
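The grouping step can be sketched as follows, under two assumptions that are not confirmed by the text: that the "variants" of a feature are its distinct values per source IP address, and that the six counted features are the ones listed in `COUNTED_FEATURES`, which is a hypothetical choice. The field names are also made up for illustration.

```python
from collections import defaultdict

# Hypothetical choice of six primary features whose variants are counted per host.
COUNTED_FEATURES = ("dst_ip", "protocol", "length", "src_port", "src_mac", "dst_mac")


def extract_patterns(records, group_key="src_ip", features=COUNTED_FEATURES):
    """Return {source IP: [number of distinct values of each feature]}."""
    variants = defaultdict(lambda: [set() for _ in features])
    for rec in records:
        for idx, name in enumerate(features):
            variants[rec[group_key]][idx].add(rec[name])
    return {host: [len(s) for s in sets] for host, sets in variants.items()}


# Made-up traffic records: one host contacting two targets, one host contacting one.
records = [
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.1", "protocol": "ARP", "length": 60,
     "src_port": 0, "src_mac": "aa:aa", "dst_mac": "ff:ff"},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.2", "protocol": "ARP", "length": 60,
     "src_port": 0, "src_mac": "aa:aa", "dst_mac": "ff:ff"},
    {"src_ip": "10.0.0.9", "dst_ip": "10.0.0.1", "protocol": "TCP", "length": 74,
     "src_port": 5000, "src_mac": "bb:bb", "dst_mac": "cc:cc"},
]
patterns = extract_patterns(records)
```

Each resulting list is a six-element pattern that can be compared with other patterns as a vector.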

Similarity Measurement
This stage measured the similarity between the ARP attack traffic in the knowledge base, used as training data, and the network traffic used as testing data. The similarity measurement adopted the cosine similarity shown in (2):

$$\cos(A, B) = \frac{A \cdot B}{|A| \, |B|} \qquad (2)$$

where the inner product is symbolized by the sign "$\cdot$" and calculated as

$$A \cdot B = \sum_{i=1}^{n} a_i b_i \qquad (3)$$

$|A|$ is the length of vector $A$:

$$|A| = \sqrt{\sum_{i=1}^{n} a_i^2} \qquad (4)$$

and $|B|$ is the length of vector $B$:

$$|B| = \sqrt{\sum_{i=1}^{n} b_i^2} \qquad (5)$$

If the traffic testing data is denoted as $D_{test}$, then the feature tuple extraction on the testing data becomes $ft_{test}$, and its features ($f_{test}$) are formed into a feature pattern denoted as $P_{test}$. Thus, the cosine similarity between the ARP poisoning attack feature pattern in the knowledge base ($P_{kb}$) and the traffic in the testing data ($P_{test}$) becomes (6):

$$Sim(P_{kb}, P_{test}) = \frac{P_{kb} \cdot P_{test}}{|P_{kb}| \, |P_{test}|} \qquad (6)$$
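The cosine similarity described above, with its inner product and vector lengths, can be written as a short standard-library sketch. The function names are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here, not something stated in the text.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    denom = norm(a) * norm(b)
    return dot(a, b) / denom if denom else 0.0


orthogonal = cosine_similarity([1, 0], [0, 1])      # no shared direction
parallel = cosine_similarity([1, 2, 3], [2, 4, 6])  # same direction, different scale
```

Because the measure depends only on direction, two hosts with proportional feature patterns are treated as highly similar even if one generated far more traffic.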
A similarity measurement result $Sim(P_{kb}, P_{test})$ above the threshold value $T_s$ marks $P_{test}$ as ARP poisoning attack activity. Thus, it is expressed in (7):

$$label(P_{test}) = \begin{cases} \text{ARP poisoning attack}, & Sim(P_{kb}, P_{test}) > T_s \\ \text{normal}, & \text{otherwise} \end{cases} \qquad (7)$$

Evaluation
In this stage, the evaluation uses the F-measure by measuring precision, recall, and detection accuracy. To calculate them, the True Positive, True Negative, False Negative, and False Positive values are counted. True Negative (TN) is the number of normal activities detected as normal activities. False Positive (FP) is the number of normal activities detected as ARP poisoning attack activity. Meanwhile, True Positive (TP) is the number of ARP poisoning attack activities correctly detected as ARP poisoning attack activity. False Negative (FN) is the opposite of True Positive: the number of ARP poisoning attack activities detected as normal activity.
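The evaluation measures follow directly from the four counts, using the standard definitions of accuracy, precision, recall, and F-measure. In the sketch below, TP = 82 and TN = 709 are the counts given with Table 6, while FP = 7 and FN = 2 are not stated in the text; they are assumed here only because they reproduce the rates reported for the similarity measurement. The function name `evaluate` is hypothetical.

```python
def evaluate(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F-measure from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# tp and tn come from the text; fp and fn are assumed values (see the note above).
metrics = evaluate(tp=82, tn=709, fp=7, fn=2)
```

With an imbalanced test set, accuracy is dominated by the normal class, which is why precision and recall are reported alongside it.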

Result and Discussion
This research used the ARP poisoning attack dataset used in [8]. Network traffic data was captured with the Wireshark application [26], which produced .pcap files. The recording was carried out for 1 hour. The experiment and the traffic data collection on the network used a computer with an Intel Core i5-9300H processor, 16 GB of RAM, and a 500 GB SSD.

Experiment
The recording data produced by the application had the .pcap extension. This data was processed into a comma-separated value (.csv) file to convert the traffic data from unstructured to structured tabular data, with columns separated by a ";" separator. The conversion was done using the command line-based Tshark application, as shown in Figure 2. After that, preprocessing was carried out, namely data cleansing, data normalization, and feature selection. In the data cleansing stage, 153 traffic records were deleted because they were redundant. Redundant data could arise between recording and conversion because two applications, Wireshark and Tshark, were used. In addition to removing redundant data, null values in each column were set to 0. There were 201 data records with null-valued columns that were converted to 0. The data cleansing process resulted in a traffic reduction of 5.43%. The details of the cleansing results are shown in Table 1. After data cleansing, the values in each categorical data column were converted into numerical values. This conversion used an encoder, as was also done in [27]. Furthermore, all values in each column were normalized to a scale between 0 and 1.
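The conversion of categorical columns to numbers can be sketched with a simple integer label encoder. The text does not say which encoder was used, so the integer mapping in sorted order, the function names, and the `protocol` column are assumptions for illustration only.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode_column(records, column):
    """Return copies of the records with one categorical column replaced by codes."""
    mapping = fit_label_encoder(rec[column] for rec in records)
    encoded = [{**rec, column: mapping[rec[column]]} for rec in records]
    return encoded, mapping


# Made-up records with a categorical protocol column.
records = [
    {"protocol": "ARP", "length": 60},
    {"protocol": "TCP", "length": 74},
    {"protocol": "ARP", "length": 60},
]
encoded, mapping = encode_column(records, "protocol")
```

Returning the mapping alongside the encoded records allows the same codes to be reused when the testing data is encoded.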
Each normalized column was expressed as a traffic data feature. The Tshark application in Figure 2 produced 7 basic features. In this research, six features were selected, and the selection was conducted manually. One feature that was not used was the time feature. It was excluded because it did not match the characteristics of ARP poisoning attacks, in which attack activities occur randomly rather than continuously. ARP attacks tend to be characterized by spreading and high-intensity occurrences. The features used were source IP ($ip_{src}$), destination IP ($ip_{dst}$), TCP port ($port_{tcp}$), UDP port ($port_{udp}$), length ($length$), source port ($port_{src}$), protocol ($proto$), source MAC address ($mac_{src}$), and destination MAC address ($mac_{dst}$).
From the preprocessing results, the data was divided into two types, namely training data, which became the knowledge base of ARP attacks ($P_{kb}$) after the classification stage, and testing data, denoted by $P_{test}$. The data were divided randomly, with the composition shown in Table 2. In this research, the classification was performed using five classification methods, namely k-NN, Logistic Regression, Naïve Bayes, Random Forest, and Decision Tree. The classification results are shown in Figure 3. The method with the best detection accuracy was used to build the ARP poisoning attack knowledge base. The results of the five classification methods are shown in Table 3. The classification results showed that the Decision Tree method obtained the highest detection accuracy, producing 196 traffic records as ARP poisoning attack patterns. Furthermore, the traffic records were extracted into feature tuples ($ft$), where the variants of each feature were grouped based on the source IP ($ip_{src}$), producing six features ($n = 6$). The feature extraction results formed feature patterns $P = \{f_1, f_2, f_3, f_4, f_5, f_6\}$, where each $f_i$ is computed from one primary feature. Examples of the knowledge base ($P_{kb}$) of ARP poisoning attack patterns are shown in Table 4. In the similarity measurement, the testing data used was the 30% of the dataset set aside at the data splitting stage. Examples of testing data that have been formed into feature patterns $P_{test}$ are shown in Table 5.
The similarity measurement between the ARP poisoning attack feature patterns in the knowledge base ($P_{kb}$) and the traffic in the testing data ($P_{test}$) produced several similarity values. Each data record in $P_{test}$ was measured against the 196 data records in $P_{kb}$. The value taken as the similarity result was the average of the similarity values against all $P_{kb}$ data records. The results of the similarity measurement are shown in Table 6. The similarity measurement successfully detected ARP poisoning traffic records with a detection accuracy of 98.88%, precision of 92.13%, and recall of 97.62%. This showed that the proposed model performs well in detecting ARP poisoning attacks.
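The averaging and thresholding step can be sketched as below: each testing pattern is compared with every knowledge-base pattern, the similarities are averaged, and the average is compared with the threshold. The function names and the small example vectors are hypothetical; 0.448 is the threshold value reported with Table 6.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 for an all-zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / denom if denom else 0.0


def average_similarity(pattern, knowledge_base):
    """Mean cosine similarity between one pattern and every knowledge-base pattern."""
    if not knowledge_base:
        raise ValueError("the knowledge base must not be empty")
    total = sum(cosine_similarity(pattern, kb) for kb in knowledge_base)
    return total / len(knowledge_base)


def detect(test_patterns, knowledge_base, threshold):
    """Flag each test pattern whose average similarity exceeds the threshold."""
    return [average_similarity(p, knowledge_base) > threshold for p in test_patterns]


# Made-up two-feature patterns; real patterns would have six features.
knowledge_base = [[1.0, 0.0], [1.0, 0.2]]
test_patterns = [[1.0, 0.1], [0.0, 1.0]]
flags = detect(test_patterns, knowledge_base, threshold=0.448)
```

Averaging over the whole knowledge base means a single close match is not enough to flag a host; the pattern has to resemble the attack patterns as a group.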

Analysis and Discussion
This research proposed a new ARP poisoning attack detection model with a similarity measurement approach and dynamic threshold value analysis. In the preprocessing stage, there was a traffic reduction of 5.43%, which affected only normal activity records. Some normal activities could be performed repeatedly by the user, causing them to be recorded as redundant data during Wireshark recording and Tshark conversion.
In analyzing the threshold value, in addition to using (1), this research also conducted a heuristic analysis to obtain the optimal $T_s$ value. The similarity threshold search found that the best $T_s$ value was 0.488, with the highest accuracy, precision, and recall values. Thus, $T_s = 0.488$ was determined to be the optimal threshold value for detecting ARP poisoning attacks on traffic records in the testing data. The threshold value can change dynamically if the knowledge base changes, which depends on the classification method. The higher the accuracy of the classification method, the better the knowledge base formed from ARP attack feature patterns. A better-formed knowledge base contains a larger amount of feature pattern data. The results of the threshold value search are shown in Figure 4.
In this paper, the proposed model could detect ARP poisoning attacks with a detection accuracy of 97.25%, precision of 81%, and recall of 96.43%. These results were higher than those of [8] and of several classification methods. The comparison results are shown in Table 8. Compared with previous research, the proposed model had a higher detection accuracy, at 98.88%. This detection accuracy was 0.1% higher than in the previous study. However, the precision and recall values were lower. The precision value was 6.87% lower than the highest precision value, produced by the Decision Tree method, while the recall value was 1.92% lower than the highest recall value, produced by the Logistic Regression classification method. The precision and recall values were lower because ARP attack activities made up only 10.5% of the traffic in the testing data, which was imbalanced compared with the number of normal activity data records. Besides, this paper has the novelty of a detection process that involves the analysis of dynamic similarity threshold values, which had not been used in previous research.
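A heuristic search for the threshold value can be sketched as a simple sweep over candidate thresholds. The text does not describe the search procedure, so the evenly spaced grid, the use of accuracy as the only selection criterion, and the names `accuracy_at` and `search_threshold` are all assumptions; the scores and labels are made up.

```python
def accuracy_at(threshold, scores, labels):
    """Fraction of records classified correctly when scores above threshold are attacks."""
    correct = sum((s > threshold) == y for s, y in zip(scores, labels))
    return correct / len(scores)


def search_threshold(scores, labels, steps=100):
    """Try evenly spaced thresholds between the lowest and highest score."""
    lo, hi = min(scores), max(scores)
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    # max() keeps the first (lowest) candidate among those with equal accuracy.
    return max(candidates, key=lambda t: accuracy_at(t, scores, labels))


# Made-up average similarity scores; True marks an ARP poisoning record.
scores = [0.10, 0.20, 0.30, 0.70, 0.80]
labels = [False, False, False, True, True]
best = search_threshold(scores, labels)
```

A sweep like this would need to be rerun whenever the knowledge base changes, since the similarity scores, and therefore the best threshold, depend on it.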

Conclusion
This paper proposed a model for detecting ARP poisoning attacks using a similarity analysis approach that adopts cosine similarity. The proposed model aimed to find substantial similarity between host activities in the network and the ARP poisoning attack activities in the knowledge base formed from the machine learning-based classification model. The novelty of the proposed model is an analysis that involves a dynamic threshold value to determine whether a host activity pattern is an ARP poisoning attack. In this research, the model successfully detected ARP poisoning attacks with a detection accuracy of 97.25%, precision of 81%, and recall of 96.43% with a similarity threshold value of 0.488. The detection accuracy was higher than that of Atmojo et al. [8] and of some machine learning-based classification methods. It was 0.1% higher than the highest accuracy in previous research, which was obtained by the k-NN classification method with $k = 2$. However, the recall and precision values were lower. The precision value was 6.87% lower than the highest value, produced by the Decision Tree method, while the recall value was 1.92% lower than the highest recall value, produced by the Logistic Regression classification method. These lower precision and recall values were caused by the data composition formed during network traffic recording. However, the composition of the traffic records captured with the Wireshark application corresponds to how ARP poisoning attacks actually occur on the network. The proposed model could be used to develop intrusion detection models and make it easier for network administrators to analyze ARP poisoning attacks in computer networks.
The model needs to be developed in future research by optimizing the features and the feature selection. The aim is to improve the precision and recall of the model without degrading the current detection accuracy. The precision and recall values need to be increased in further research by handling the data imbalance to reduce the false positive rate. In addition, a time-of-occurrence analysis could be developed to identify causal relationships between attacks. Thus, the ARP poisoning attack detection model could perform optimally.

Figure 1. The proposed model

Figure 3. Evaluation results of 5 classification methods

Figure 4. Optimal value tracing results from

Table 6. Example of Cosine Matrix Map Results

From the similarity measurement results, the lowest value of $Sim(P_{kb}, P_{test})$ was 0.033, and the highest value was 0.862. Thus, the value of $T_s$ obtained was 0.448 as the threshold value for determining ARP poisoning traffic. In this research, $T_s = 0.448$ successfully identified the six hosts that perpetrated ARP poisoning attacks based on the source IP ($ip_{src}$) grouping, correctly identifying 82 ARP poisoning attack traffic records and 709 normal activity records. Identification details of ARP poisoning attacks based on similarity measurements are shown in Table 7.