Analysis of Cyberbullying Level using Support Vector Machine Method

The number of internet users in Indonesia increases every year. The increase is caused by several factors, such as the increasingly even distribution of internet infrastructure in Indonesia. The internet has positive impacts, such as facilitating communication between individuals, but it also has negative impacts, such as the intimidation of individuals, known as cyberbullying. Cyberbullying has a serious impact on a person's mental health, causing victims to become angry, depressed, and anxious. This research aims to measure the level of cyberbullying in Indonesia on Twitter using TF-IDF and Support Vector Machine. Tweets are classified into two classes, namely cyberbullying and non-cyberbullying. The Twitter data used in this study comprised 3,344,782 tweets, of which 34.59% were classified as cyberbullying and 65.41% as non-cyberbullying. The best accuracy obtained was 85%.


Introduction
The number of internet users in Indonesia is increasing every year; by the second quarter of 2020 it had reached 73.7% of the total population. Several factors contribute to this increase, such as increasingly equitable internet infrastructure, online learning, and working from home [1]. The internet has a large impact on daily activities. Its positive impacts are that it facilitates communication between individuals in different places and makes it easier for businesses to market their products, because the internet reaches many people. Its negative impacts are a decreased desire to socialize with the surrounding environment, the spread of hoax news, and acts of humiliation against someone through comments on social media, known as cyberbullying [2].
Cyberbullying is an act of humiliation against an individual in cyberspace [3]. Cyberbullying affects the victim's mental state and can result in trauma, so that the person is often angry, depressed, anxious, afraid, and embarrassed [2]. This can make the victim of cyberbullying avoid the surrounding environment, take revenge on the perpetrator, and become a perpetrator of cyberbullying [4]. The development of information technology produces large amounts of data every day, known as Big Data [5]. Big Data is a massive data set with a large and complex structure, so a dedicated method is needed to process it [6]. The biggest sources of Big Data are social media such as Facebook and Twitter. Research related to Big Data analysis can be done using several methods such as Decision Tree, Naïve Bayes, and Support Vector Machine.
JURNAL ILMIAH MERPATI VOL. 10
Previous research addressed the classification of cyberbullying comments on Instagram posts uploaded by artists. The method used was K-Nearest Neighbor on 1,000 comments, divided into 500 bullying and 500 non-bullying comments. The study obtained its highest accuracy of 77% with a split of 90% training data and 10% test data [7].
Other research addressed the design of an English-language cyberbullying comment detection system. The method used was Naïve Bayes, with comments classified as bully or non-bully according to the highest probability value. This study achieved an accuracy of 80% [8].
Research comparing the Support Vector Machine, K-Nearest Neighbor, and Naïve Bayes classification methods on the level of bullying behavior in the WhatsApp application showed that the Support Vector Machine method achieved the best accuracy, 81.58% [9].
This study aims to measure the level of cyberbullying in Indonesia by analyzing tweet data from Twitter. Tweets are classified as cyberbullying or non-cyberbullying. The methods used in this research are TF-IDF and Support Vector Machine.

Research Method
The research methodology used to analyze the level of cyberbullying in Indonesia with the Support Vector Machine method is shown in Figure 1. Figure 1 shows the flow of this research, which begins with collecting tweet data through the Twitter API for use as training and testing data. The tweets are collected by an engine written in Python and then stored in MongoDB.
The next stage is data pre-processing, which begins with retrieving tweet data from the MongoDB database, followed by data cleaning, lowercasing, slang word replacement, stop word removal, and stemming. The pre-processed tweet data is then stored back in MongoDB.
The next stage is to retrieve 2,000 tweets to be labeled manually and used as training data, stored in a CSV file. The classification model is then determined by testing the labeled training data with TF-IDF and Support Vector Machine. The resulting model is used in the cyberbullying analysis stage, which takes tweet data from MongoDB and applies TF-IDF and Support Vector Machine to assign each tweet a class based on the labeled training data. The results of the analysis, the cyberbullying and non-cyberbullying classes, are stored in MongoDB. The last stage is visualization, which retrieves the analyzed tweet data from MongoDB and visualizes it in Tableau using the required filters and measures.

Literature Study

Cyberbullying
Cyberbullying is an act of humiliation or threat against someone on social media. It can take the form of repeatedly sending text messages to someone, shaming someone on social media, or insulting someone using a fake account. Several factors lead perpetrators to engage in cyberbullying, such as revenge against victims and their own pleasure [10]. Cyberbullying has a serious impact on the mental health of teenagers, because at that age emotions are unstable and teenagers are prone to mood swings when exposed to an unhealthy environment. As a result, victims may experience anxiety, anger, fear, and depression, and may avoid their social environment [2].

Twitter
Twitter is a social networking service in the form of micro-blogging, originally created as a short message service to facilitate communication in small groups [11]. Twitter is one of the most popular social media platforms in Indonesia, reaching 78 million users out of Indonesia's total population [12].

Python
Python is a scripting language that can be used to develop software. Python has several advantages, such as making it easier for data scientists to analyze data, perform calculations, and visualize data efficiently [13]. Python provides libraries that simplify the data analysis process, such as Pandas and scikit-learn [14].

TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a process for calculating the weight of each extracted word. It combines Term Frequency (TF), which weights a word by the number of times it occurs in a document, with Inverse Document Frequency (IDF), which is based on how many documents in the collection contain the word and indicates how common the word is [15]. The TF-IDF value is high when the frequency of the word in the document is high and the number of documents containing the word is low in the document set [16]. The TF value can be calculated using equation (1).
In equation (1), w is a word that appears in the document, d is a document in the dataset, and f_{w,d} is the frequency of word w in document d. The IDF value can be calculated using equation (2).
In equation (2), d is a document in the dataset, w is a word that appears in the document, D is the collection of all documents, and f(w, D) is the frequency of word w across the entire document set. The TF-IDF value can be calculated using equation (3).
$$\mathrm{tfidf}(w, d, D) = \mathrm{tf}(w, d) \times \mathrm{idf}(w, D) \tag{3}$$
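The weighting described above can be illustrated with a small sketch. Because the exact forms of equations (1) and (2) are not reproduced in this text, the sketch assumes one common textbook variant: a raw-count TF, tf(w, d) = f_{w,d}, and idf(w, D) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. The function names and the toy documents are illustrative only. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values will differ from this sketch.

```python
import math
from collections import Counter


def term_frequency(word, document):
    """Raw count of `word` in a tokenized document (assumed form of TF)."""
    return Counter(document)[word]


def inverse_document_frequency(word, documents):
    """log(N / df), where df is the number of documents containing `word`."""
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        return 0.0  # word never appears; treat its weight as zero
    return math.log(len(documents) / containing)


def tf_idf(word, document, documents):
    """Product of TF and IDF for one word in one document."""
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


# Toy, already pre-processed documents (hypothetical, not from the dataset).
docs = [
    ["kamu", "tolol", "tolol"],
    ["kamu", "hebat"],
    ["hari", "ini", "cerah"],
]

# Weight of every distinct word in the first document.
weights = {w: tf_idf(w, docs[0], docs) for w in set(docs[0])}
```

In this toy example "tolol" receives a higher weight than "kamu" because it occurs twice in the first document and in no other document, whereas "kamu" also appears in the second document.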

Support Vector Machine
Support Vector Machine (SVM) is an algorithm capable of analyzing data and identifying patterns [3]. SVM is a supervised learning method: it learns from labeled data to determine the pattern used to classify test data [17]. SVM performs classification by finding the best hyperplane separating two or more classes, which is the one that maximizes the margin between classes [18]. The margin is the distance between the hyperplane and the closest data points of each class. SVM offers several kernels for classifying data that cannot be separated linearly. The kernels available in SVM include Linear, Polynomial, and RBF [19]. The decision function is given in equation (4).
Based on the decision function, it is assumed that the two classes are separated by a hyperplane, which yields the equations and inequalities used to determine the hyperplane as a classification function.
After the hyperplane equation is obtained, it is substituted into the decision function sign(f(X)), as in equation (8).
Equation (8) is the rule used to classify the testing data. If the result of the decision function is 1, the data point is classified as positive. If the result is -1, it is classified as negative.
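For reference, the standard textbook formulation of a linear SVM is sketched below. The notation (weight vector w, bias b, and training pairs (x_i, y_i) with y_i in {+1, -1}) is an assumption, as is any correspondence to equations (4) to (8), since the original equations are not reproduced in this text.

```latex
\begin{align*}
f(\mathbf{x}) &= \mathbf{w} \cdot \mathbf{x} + b
  && \text{decision function} \\
\mathbf{w} \cdot \mathbf{x}_i + b &\ge +1
  && \text{for } y_i = +1 \\
\mathbf{w} \cdot \mathbf{x}_i + b &\le -1
  && \text{for } y_i = -1 \\
y_i \,(\mathbf{w} \cdot \mathbf{x}_i + b) &\ge 1
  && \text{both cases combined} \\
\hat{y} &= \operatorname{sign}\bigl(f(\mathbf{x})\bigr)
  && \text{predicted class}
\end{align*}
```

Under this scaling the distance from the hyperplane to the closest point of each class is 1/||w||, so maximizing the margin corresponds to minimizing ||w||.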

Confusion Matrix
A confusion matrix is a measurement method for determining the quality of a classification model. Consider a dataset with two classes, positive and negative. True positives (TP) are positive data predicted as positive. True negatives (TN) are negative data predicted as negative. False positives (FP) are negative data predicted as positive. False negatives (FN) are positive data predicted as negative [20]. The metrics calculated from the confusion matrix are accuracy, precision, recall, and f1score. Accuracy is the proportion of correct predictions out of the total data and can be calculated using equation (9).
$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$
Precision is the ratio of true positive predictions to the number of data predicted to be positive and can be calculated using equation (10).
$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{10}$$
Recall is the ratio of true positive predictions to the number of data that are actually positive and can be calculated using equation (11).
$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{11}$$
The f1score is the harmonic mean of precision and recall and can be calculated using equation (12).
$$\mathrm{f1score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{12}$$
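The four metrics can be computed directly from the confusion-matrix counts. The following is a minimal sketch using the standard definitions; the function name and the counts are hypothetical and do not reproduce the study's actual confusion matrix.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and f1score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1score = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1score": f1score,
    }


# Hypothetical counts for a 200-tweet test set (not taken from the study).
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

With these illustrative counts, accuracy is 170/200 and recall is 80/100, while precision is 80/90, so the f1score falls between precision and recall.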

Data Collection
Data were collected from Twitter using the Twitter API from June 2021 to April 2022. The keywords used were "goblok", "tolol", "bego", and "brengsek", which were obtained from an interview with a Clinical Psychologist, Devy Hestiwana, S.Psi., M.Psi., Psychologist.

Data Preprocessing
Data preprocessing is a series of processing steps that clean the text of the tweet data in order to simplify the cyberbullying analysis. Preprocessing consists of data cleaning, lowercasing, slang word replacement, stop word removal, and stemming.
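The five preprocessing steps can be sketched as a small pipeline. This is an illustration under stated assumptions, not the study's implementation: the slang and stop-word tables below are tiny hypothetical samples, the cleaning rules (removing URLs, mentions, and non-letter characters) are assumed, and stemming is represented by a pluggable function that defaults to a no-op, where a real pipeline would pass in an Indonesian stemmer.

```python
import re

# Hypothetical, tiny lookup tables for illustration only; a real pipeline
# would load full Indonesian slang and stop-word lists.
SLANG = {"bgt": "banget", "km": "kamu", "gk": "tidak"}
STOP_WORDS = {"yang", "dan", "di", "sih"}


def clean(text):
    """Remove URLs, user mentions, and every non-letter character."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # digits, punctuation, symbols
    return text


def preprocess(text, stem=lambda token: token):
    """Apply the five steps in order; `stem` defaults to a no-op placeholder."""
    tokens = clean(text).lower().split()                 # cleaning, lowercasing
    tokens = [SLANG.get(t, t) for t in tokens]           # replace slang words
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return " ".join(stem(t) for t in tokens)             # stemming


result = preprocess("@user123 Km TOLOL bgt sih!!! https://t.co/abc")
```

Here `result` is `"kamu tolol banget"`: the mention and URL are removed, the slang tokens are expanded, and the stop word is dropped.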

Training Data
Training data is data that already carries a classification label and serves as learning material for the classification method, from which the appropriate model for analyzing cyberbullying is determined. The labels are assigned manually to the cleaned tweet data. The classification labels are cyberbullying and non-cyberbullying. Table 4 shows an example of labeled training data. The training data used in this study consists of 2,000 tweets, divided into 1,000 labeled cyberbullying and 1,000 labeled non-cyberbullying.

Classification Model Test
Classification model testing is carried out on the training data to determine the best classification model for the cyberbullying analysis using the TF-IDF and Support Vector Machine methods. The test consists of 4 scenarios, each yielding accuracy, precision, recall, and f1score values. Table 5 lists the test scenarios on the training data. The test was carried out 4 times, and the best results were obtained by the model using a split of 90% training data and 10% testing data, with an accuracy of 85%, precision of 90%, recall of 80%, and f1score of 85%. Compared with the previous studies cited in the Introduction, which reported accuracies of 77% [7], 80% [8], and 81.58% [9], the Support Vector Machine implementation in this study produced a higher accuracy.
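A test of this kind can be sketched with scikit-learn, the library named in this paper. The sketch below is an assumption-laden illustration rather than the study's code: only the 90/10 split is stated in the text, so the other three ratios are assumed, the toy tweets and labels are hypothetical, and `evaluate_split` is an illustrative helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def evaluate_split(texts, labels, test_size, seed=42):
    """Train TF-IDF + linear SVM on one split and return the four metrics."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
    model.fit(x_train, y_train)
    predicted = model.predict(x_test)
    return {
        "accuracy": accuracy_score(y_test, predicted),
        "precision": precision_score(y_test, predicted, zero_division=0),
        "recall": recall_score(y_test, predicted, zero_division=0),
        "f1score": f1_score(y_test, predicted, zero_division=0),
    }


# Hypothetical toy data: 1 = cyberbullying, 0 = non-cyberbullying.
bullying = ["kamu tolol banget", "dasar goblok", "bego kamu", "brengsek kamu"]
neutral = ["selamat pagi semua", "terima kasih banyak", "hari ini cerah", "semoga sukses"]
texts = bullying * 5 + neutral * 5
labels = [1] * 20 + [0] * 20

# Assumed test-set ratios; only the 90/10 scenario is stated in the text.
results = {size: evaluate_split(texts, labels, size) for size in (0.1, 0.2, 0.3, 0.4)}
```

Stratifying the split keeps the two classes balanced in every scenario, which matches the balanced 1,000/1,000 training set described earlier. The metrics on this toy data say nothing about the study's results.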

Processing Data
Data processing is the stage of analyzing the data with the TF-IDF method and the Support Vector Machine. The methods are applied using the scikit-learn library. The kernel used in the Support Vector Machine method is the linear kernel. Figure 2 shows the result of the cyberbullying analysis. The analysis was carried out on tweet data using the tested classification model, with the training data serving as the source from which the classifier determines the class of each tweet.

Data Visualization
Data visualization is the stage in which the analyzed data is presented as line graphs and maps. Visualization aims to convey the results of the analysis clearly and efficiently. The tweet data used in this visualization covers June 2021 to April 2022.

Level of Cyberbullying in Indonesia
What distinguishes this study from previous research is that it reports the level of cyberbullying in each region of Indonesia. The 34 provinces of Indonesia are shown in Table 6, which presents the data for each province. A Twitter user's location is detected only when it matches the name of an Indonesian province, so the location cannot be detected for all data. The level of cyberbullying is highest in Maluku at 59.34%, and the level of non-cyberbullying is highest in Yogyakarta at 75.96%. In terms of the amount of analyzed data, Jakarta has the highest number of both cyberbullying and non-cyberbullying tweets.
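The per-province level can be sketched as a simple aggregation over classified tweets. The sketch assumes that a location is detected by a case-insensitive substring match against province names, which is one possible reading of the matching rule above. The province list is a hypothetical subset, and the sample records and function names are illustrative, not data or code from the study.

```python
from collections import defaultdict

# Hypothetical subset of province names; the study covers all 34.
PROVINCES = ["Jakarta", "Maluku", "Yogyakarta", "Bali"]


def detect_province(location):
    """Return the first province whose name occurs in the free-text location."""
    lowered = (location or "").lower()
    for province in PROVINCES:
        if province.lower() in lowered:
            return province
    return None  # location does not mention a known province


def cyberbullying_level(tweets):
    """Map province -> percentage of its tweets labeled 'cyberbullying'."""
    totals = defaultdict(int)
    bullying = defaultdict(int)
    for tweet in tweets:
        province = detect_province(tweet["location"])
        if province is None:
            continue  # skip tweets whose location cannot be detected
        totals[province] += 1
        if tweet["label"] == "cyberbullying":
            bullying[province] += 1
    return {p: 100 * bullying[p] / totals[p] for p in totals}


# Hypothetical classified tweets (not data from the study).
sample = [
    {"location": "Ambon, Maluku", "label": "cyberbullying"},
    {"location": "maluku", "label": "non-cyberbullying"},
    {"location": "DKI Jakarta", "label": "non-cyberbullying"},
    {"location": "somewhere", "label": "cyberbullying"},
]
levels = cyberbullying_level(sample)
```

A full province list would need to check longer names first, because some province names contain others (for example, Maluku Utara contains Maluku).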

Conclusion
This study uses 3,344,782 tweets collected from June 2021 to April 2022 that contain words related to cyberbullying in Indonesia. The cyberbullying analysis classified 34.59% of the tweets as cyberbullying and 65.41% as non-cyberbullying. The methods implemented in this research are TF-IDF and Support Vector Machine, which produced a best accuracy of 85% when tested on 2,000 labeled training tweets. A significant increase in tweets classified as cyberbullying occurred on September 5, 2021, at 149.8%, and a significant increase in tweets classified as non-cyberbullying occurred on November 11, 2021, at 188.7%.
Suggestions for further research are to use data from other social media such as Instagram and YouTube, and to optimize the preprocessing stage so that the classification method yields better results.