The Influence of Applying Stopword Removal and SMOTE on Indonesian Sentiment Classification

Information such as public opinions or responses can be obtained from Twitter tweets. These opinions can be expressed as sentiments, which can be positive, neutral, or negative. Sentiment analysis (opinion mining) on a text can be performed through text classification. This research aims to determine the influence of implementing Stopword Removal and SMOTE on the sentiment classification model for Indonesian tweets. The algorithms used in this research are Logistic Regression and Random Forest. Based on the evaluation, the best classification model in this research was achieved by implementing the Random Forest algorithm along with SMOTE, with an f1-score of 75.03%. Meanwhile, implementing the Random Forest algorithm with Stopword Removal produced the worst classification model, with an f1-score of 68.09%. Implementing Stopword Removal in both algorithms has a negative impact in the form of a decrease in the resulting f1-score, because Stopword Removal can reduce information and alter the meaning of processed tweets, causing a tweet to lose its sentiment. In contrast, applying SMOTE has a positive impact in the form of an increase in the resulting f1-score.


Introduction
In producing information, a collection of factual and actual data must be managed. Information now spreads very rapidly, owing to the abundance of digital channels and platforms available for expressing ideas and opinions. Social media is the most widely used digital platform, which makes disseminating information through social media more efficient and faster. Twitter is a popular social media platform in Indonesia. It provides a space for its users to interact and discuss through short text messages. The data generated are known as tweets, which, when processed, yield information. One such kind of information is public sentiment or opinion, which can serve as a reference for reciprocal societal responses. Sentiments can be classified into three categories: positive, neutral, and negative. Sentiment data is typically gathered manually, for example by distributing questionnaires. In practice, however, this can be quite time-consuming and labor-intensive, so retrieving tweet data is more efficient. Sentiment classification can be achieved by applying classification methods to text, known as sentiment analysis or opinion mining.
Generally, there are four stages of sentiment analysis: text preprocessing, data vectorization, modeling, and evaluation. The data cleaning and modeling stages play a crucial role in sentiment analysis, as they produce the dataset and the classification model, respectively. The resulting dataset is used as training data to obtain a classification model, and an accuracy evaluation is performed on this classification model. The text preprocessing stage is conducted on tweet data to handle noise or disturbances, such as non-standard words, abbreviations, and slang. Stopword Removal is one of the methods commonly applied at the text preprocessing stage [1]. Stopword Removal involves eliminating words that are very common and appear frequently but do not significantly affect the meaning of a text or sentence. The implementation of Stopword Removal is expected to yield a better dataset. Several studies have examined the performance of Stopword Removal. These include an experiment that applied Stopword Removal and Stemming with the LSTM (Long Short-Term Memory) algorithm, where the highest accuracy was obtained when Stopword Removal and Stemming were not applied, with an accuracy score and f1-score of 0.82 [1].
The modeling stage involves generating a classification model from the input dataset using a classification algorithm. The parameters of the algorithms are optimized by applying Hyperparameter Tuning. Grid Search is an algorithm that can select the combination of hyperparameters that achieves the highest accuracy value. The parameter values obtained are then implemented in the classification model. Classification modeling is performed using the Logistic Regression and Random Forest algorithms, because both algorithms can produce good accuracy values and process large amounts of data [2]. Research on implementing Grid Search for hyperparameter tuning compared the KNN and Logistic Regression algorithms for classifying emotions in Indonesian tweets using TF-IDF and Grid Search [3]. The highest accuracy was achieved with the Logistic Regression algorithm and the implementation of TF-IDF and Grid Search, yielding an accuracy and f1-score of 65% and 66%, respectively. In general, the performance of methods and algorithms applied for sentiment classification tends to be suboptimal when the dataset used is imbalanced [4]. In this research, imbalanced data is handled with SMOTE (Synthetic Minority Oversampling Technique). SMOTE is a technique used to address imbalanced data by generating new synthetic data from the minority class in the dataset, thus achieving a balance between classes [5]. With SMOTE, the dataset will not be biased towards the majority class. Therefore, it is expected to optimize the performance of classification methods and algorithms. Related research on the implementation of SMOTE analyzed sentiment in Tokopedia's Twitter tweets using the Naïve Bayes and Random Forest algorithms; it found that applying SMOTE increased the accuracy values for the Naïve Bayes and Random Forest algorithms by 3.4% and 1.55%, respectively. The highest accuracy value, achieved by the Random Forest algorithm with the implementation of SMOTE, was 86.89% [6].
Based on the presented description, this research analyzes the impact of applying Stopword Removal and SMOTE on the resulting f1-score values. This research employs two classification algorithms, Logistic Regression and Random Forest, for classification modeling. The f1-score values produced by both algorithms are then compared to determine which algorithm is more suitable for performing sentiment analysis on Indonesian-language tweets.

Research Methods
Figure 1 shows the steps of the methods carried out in this research. There are eight consecutive stages: data gathering, text preprocessing, vectorization, balancing dataset, splitting dataset, modeling, validation, and evaluation.

Data Gathering
In data gathering, the data collected consists of Indonesian text data taken from the research conducted by Ridi Ferdiana, Fahim Jatmiko, Desi Dwi Purwanti, Artmita Sekar Tri Ayu, and Wiliam Fajar Dicka in 2019, entitled 'Indonesian Dataset for Sentiment Analysis' [7]. The data was collected using the Twitter Streaming API over four months, from September to December 2018, using standard Indonesian conjunction words such as adalah, yaitu, juga, and seperti as keywords. This dataset consists of 10,820 labeled sentences categorized into three sentiment classes: 3,228 positive sentences, 3,556 neutral sentences, and 4,036 negative sentences. The Indonesian Sentiment Analysis Dataset is a tabular dataset stored in .csv format. The sentence samples in the dataset can be seen in Table 1.

Text Pre-processing
Text preprocessing refers to steps or techniques used to clean, organize, and transform raw text into a form that is more easily processed by natural language processing (NLP) models or other computational systems. The goal is to enhance the quality of text data and facilitate further analysis or processing. It can involve lowercasing, tokenization, text cleaning, stopword removal, stemming or lemmatization, normalization, and vectorization. The ultimate aim is to simplify and reduce the complexity of the text, making it easier for models to extract relevant patterns or information.
The text preprocessing steps in this research include the following.

a. Cleaning
In the cleaning process, the tweets in the Indonesian Dataset for Sentiment Analysis are cleaned by removing punctuation or delimiters, numbers, symbols, and usernames [1].

b. Case Folding
In the case folding process, the characters of each word in the data are standardized by converting all letters in each word to lowercase [8].

c. Normalization
In the normalization process, non-standard words, abbreviations, and colloquial or slang words are transformed into words that adhere to the proper rules of writing in the Indonesian language, as per the guidelines of the 'Kamus Besar Bahasa Indonesia' (KBBI) [9].

d. Tokenizing
In tokenizing, the sentences are split into words or tokens using whitespace as the delimiter [10].

e. Stopword Removal
In the stopword removal process, words that occur frequently but are insignificant and irrelevant are removed, such as conjunctions and possessive and personal pronouns [1].

f. Stemming
In the stemming process, the words are transformed into base forms by removing prefixes and suffixes [11].

g. Rejoin
In the rejoin process, the words or tokens resulting from the stemming process are recombined into a complete sentence [11].
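The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the lookup tables `SLANG_TO_STANDARD` and `STOPWORDS` are tiny hypothetical stand-ins for the real normalization dictionary and stoplist, the inclusion of the negation word "tidak" in the stoplist is an assumption made for illustration, link removal is assumed to count as cleaning, and stemming is left out because the source does not name the stemmer used.

```python
import re

# Hypothetical lookup tables for illustration only; the study's actual
# normalization dictionary and stoplist are not reproduced here.
SLANG_TO_STANDARD = {"gak": "tidak", "bgt": "banget", "yg": "yang"}
STOPWORDS = {"yang", "dan", "di", "itu", "saya", "tidak"}


def clean(text: str) -> str:
    """a. Cleaning: drop usernames, links, numbers, punctuation, and symbols."""
    text = re.sub(r"@\w+", " ", text)          # usernames
    text = re.sub(r"https?://\S+", " ", text)  # links (assumed to be noise)
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text: str, remove_stopwords: bool = False) -> str:
    text = clean(text)                                      # a. cleaning
    text = text.lower()                                     # b. case folding
    tokens = text.split()                                   # d. tokenizing on whitespace
    # c. normalization, applied per token here for simplicity
    tokens = [SLANG_TO_STANDARD.get(t, t) for t in tokens]
    if remove_stopwords:                                    # e. stopword removal
        tokens = [t for t in tokens if t not in STOPWORDS]
    # f. stemming is omitted in this sketch (stemmer not named in the source)
    return " ".join(tokens)                                 # g. rejoin


tweet = "@budi Filmnya bagus bgt, gak nyesel nonton!!! 10/10"
print(preprocess(tweet))
print(preprocess(tweet, remove_stopwords=True))
```

With this hypothetical stoplist, the second call drops the negation word, which illustrates how removing stopwords can change the apparent sentiment of a tweet.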

Frequency Distribution
Frequency Distribution is carried out to determine the number of occurrences, or frequency, of a particular word. This research performs the frequency counting process using the FreqDist function in the NLTK library. The frequency distributions for the text preprocessing data and the text preprocessing + stopword removal data can be seen in Figure 2.
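The counting step can be illustrated with a short sketch. NLTK's FreqDist is a subclass of the standard library's collections.Counter, so a plain Counter is used here as a dependency-free stand-in; the two example sentences are made up for illustration.

```python
from collections import Counter

# Made-up preprocessed sentences, for illustration only.
docs = [
    "filmnya bagus banget tidak nyesel nonton",
    "pelayanan tidak bagus",
]

# Flatten all documents into one token stream and count occurrences.
tokens = [tok for doc in docs for tok in doc.split()]
freq = Counter(tokens)

n_tokens = sum(freq.values())  # total number of tokens
n_vocab = len(freq)            # number of distinct words (vocabulary size)

print(n_tokens, n_vocab)
print(freq.most_common(2))     # the two most frequent words
```

The token total and vocabulary size computed this way correspond to the kind of figures a vocabulary and token summary table reports.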

Figure 2. Frequency Distribution
The number of vocabulary items and tokens in the text preprocessing data and the text preprocessing + stopword removal data can be found in Table 3.

Vectorization
The TF-IDF weighting [12] is performed because computers only understand and process data numerically. In this research, the TF-IDF vectorization process is performed using the TfidfVectorizer() and fit_transform() functions in the Scikit-Learn library. The results of weighting/vectorization for the Text Preprocessing dataset and the Text Preprocessing + Stopword Removal dataset can be seen in Figure 3. The output shows the document index, word index, TF-IDF score, and vocabulary content.
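To make the weighting concrete, the following pure-Python sketch computes TF-IDF scores by hand. It assumes the smoothed IDF variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with raw term counts and L2 row normalization, which is the variant scikit-learn documents as its default; the source does not state which settings were used, so this choice and the helper name `tfidf` are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows), where each row maps term -> TF-IDF weight."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed inverse document frequency (assumed variant).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalize the row; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows


vocab, rows = tfidf(["bagus banget", "tidak bagus"])
for doc_index, row in enumerate(rows):
    for term, score in row.items():
        # (document index, word index) followed by the TF-IDF score
        print((doc_index, vocab.index(term)), round(score, 4))
```

A term that appears in every document ("bagus") receives a lower weight than a term that appears in only one, which is the intended effect of the IDF factor.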

Balancing Dataset
The Indonesian Dataset for Sentiment Analysis is balanced using SMOTE. SMOTE, or Synthetic Minority Oversampling Technique, is a technique introduced by Nitesh V. Chawla to address imbalanced datasets [5]. By augmenting the minority class with synthetic data generated from the existing minority class instances, SMOTE balances the distribution of minority and majority class data in the dataset. The synthetic samples are obtained by finding the k-nearest neighbors of each data point in the minority class and then creating new data points that lie between that data point and its neighbors [13]. The result of dataset balancing can be seen in Figure 4.
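The neighbor-based generation step can be sketched as follows. This is a simplified illustration of the interpolation idea only, assuming Euclidean distance and dense numeric feature tuples; the function name `smote_sample` and its parameters are hypothetical, and the source does not say which SMOTE implementation or settings were used.

```python
import random


def smote_sample(minority, k=2, n_new=3, seed=0):
    """Create n_new synthetic points between minority points and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x (excluding x itself),
        # ranked by squared Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sample(minority, k=2, n_new=3)
print(new_points)
```

Each synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region already occupied by the minority class.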

Splitting Dataset
The dataset is divided into two parts: training data and testing data. Training data is used to train the system to recognize the desired patterns. Testing data is used to evaluate the trained system's performance. In this research, the dataset is divided into 90% training data and 10% testing data; the splitting of training and testing data in the dataset can be seen in Figure 5.

Tuning Hyperparameter
Hyperparameter Tuning is a process for optimizing the performance of a machine learning model by selecting the best hyperparameters. The chosen hyperparameters are then implemented in the machine learning classification model. In this research, the method used for Hyperparameter Tuning is Grid Search. Grid Search is an algorithm that selects the best parameter values by evaluating every combination of the input parameters. In its implementation, Grid Search typically starts by defining a dictionary that stores all the hyperparameter values to be combined or searched. The algorithm then fits and evaluates a model for every combination of the stored hyperparameters. After that, the best-performing hyperparameter combination for the model is selected based on the resulting f1-score values [14]. The hyperparameters that undergo Hyperparameter Tuning are the C value in Logistic Regression and the values for Estimators, Max_depth, Max_features, and Criterion in Random Forest.
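The exhaustive search described above can be sketched with the standard library alone. The grid values below and the scoring function `toy_score` are hypothetical placeholders: in practice the score would be a cross-validated f1-score of a model fitted with the given parameters (scikit-learn's GridSearchCV automates that loop), and the source does not list the candidate values here.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values; not the grid used in the study.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "criterion": ["gini", "entropy"],
}


def toy_score(params):
    # Placeholder for "fit the model with params and return its mean f1-score".
    score = 0.70
    score += 0.01 if params["n_estimators"] == 200 else 0.0
    score += 0.02 if params["max_depth"] == 20 else 0.0
    score += 0.005 if params["criterion"] == "entropy" else 0.0
    return score


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, round(best_score, 3))
```

The number of model fits grows with the product of the list lengths (2 x 3 x 2 = 12 combinations here), which is the main cost of Grid Search.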

Modelling
The modeling in this research employs two classification algorithms: Logistic Regression and Random Forest. Sixteen classification models are constructed, consisting of eight Logistic Regression models and eight Random Forest models.

Logistic Regression
Logistic Regression is a data analysis technique in statistics designed to determine the relationship between a dependent variable and one or more independent variables. This technique is also known as a regression model. In applying Logistic Regression, the dependent variable is categorical (nominal or ordinal), while the independent variables are categorical or continuous [15].
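As an illustration of how such a model turns independent variables into a categorical prediction, the sketch below computes class probabilities with a softmax over linear scores. The multinomial (softmax) form, the two-feature input, and all weights are assumptions chosen for illustration; they are not fitted values from this research.

```python
import math

CLASSES = ["negative", "neutral", "positive"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, weights, biases):
    """One linear score per class, followed by softmax."""
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical weights for two features, e.g. TF-IDF of "bagus" and "buruk".
weights = [[-1.0, 2.0], [0.0, 0.0], [2.0, -1.0]]
biases = [0.0, 0.0, 0.0]

probs = predict_proba([0.8, 0.0], weights, biases)
predicted = CLASSES[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Training adjusts the weights so that these probabilities match the labeled data; the hyperparameter C controls how strongly large weights are penalized during that fit.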

Random Forest
Random Forest is an ensemble classifier algorithm that combines multiple Decision Trees. It works by performing majority voting on the outcomes of the individual Decision Trees, which yields the final classification decision [16]. The Decision Trees constructed by Random Forest are formed through random sampling of the data and random selection of the features considered at each split. Decision Trees consist of root, internal, and leaf nodes, created by using information gain to determine the root node and the rules [17].
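The three ingredients named above (random data sampling, random feature selection, and majority voting) can be sketched as follows. The helper names are hypothetical and tree construction itself is omitted; note also that some libraries, including scikit-learn, combine trees by averaging predicted class probabilities rather than by hard voting.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement to train a single tree."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, max_features, rng):
    """Pick the feature indices a tree may consider at one split."""
    return sorted(rng.sample(range(n_features), max_features))


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(10))  # stand-in for ten training samples
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=8, max_features=3, rng=rng)

votes = ["positive", "negative", "positive", "neutral", "positive"]
print(sample, features, majority_vote(votes))
```

Because each tree sees a different sample and different candidate features, the trees make partly independent errors, and voting across them reduces the variance of the final prediction.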

Testing Scenario
The testing scenarios constructed in this research consist of four model scenarios for each algorithm. The details of the scenarios can be seen in Table 4.
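The four scenarios per algorithm can be enumerated as combinations of two on/off settings. The numbering used in this sketch (1 = default, 2 = Stopword Removal, 3 = SMOTE, 4 = both) is an assumption inferred from how the scenarios are labeled in the results discussion, and the dictionary keys are hypothetical names.

```python
from itertools import product

ALGORITHMS = ["Logistic Regression", "Random Forest"]

# product([False, True], repeat=2) yields (smote, stopword_removal) pairs in
# the order (F, F), (F, T), (T, F), (T, T), matching the assumed numbering.
scenarios = [
    {"scenario": i, "smote": smote, "stopword_removal": stop}
    for i, (smote, stop) in enumerate(product([False, True], repeat=2), start=1)
]

# One run per algorithm and scenario.
runs = [(algo, s) for algo in ALGORITHMS for s in scenarios]

for algo, s in runs:
    print(algo, s)
```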

Evaluation
In this research, the evaluation is conducted on the training data using 10-fold cross-validation (CV) and on the testing data using a confusion matrix. 10-fold cross-validation divides the dataset into ten parts (folds), where one fold serves as the validation data (validation fold) and the remaining nine folds serve as the training data (train folds). Measurements are repeated until each of the ten folds has been used as the validation fold. The average accuracy value of the 10-fold cross-validation conducted on the training dataset then serves as the benchmark for the validation results. The confusion matrix is used to evaluate the performance of a classification model. It contains the actual classes and the predictions made by the classification model, allowing accuracy, precision, and recall values to be calculated as benchmarks for the performance of the classification model [18].
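Both evaluation tools can be sketched in plain Python. The fold generator below splits indices without shuffling, and the f1-score is macro-averaged over classes; both choices, along with the function names, are assumptions, since the source does not state whether folds were shuffled or how the per-class scores were averaged.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx); each fold is the validation fold exactly once."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


def macro_f1(confusion):
    """confusion[i][j] = count of samples with true class i predicted as class j."""
    n = len(confusion)
    f1_scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, actually other
        fn = sum(confusion[c]) - tp                        # actually c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / n


folds = list(k_fold_indices(25, k=10))
print([len(val) for _, val in folds])

# Made-up 3-class confusion matrix (rows: actual, columns: predicted).
example = [[8, 1, 1], [2, 7, 1], [0, 2, 8]]
print(round(macro_f1(example), 4))
```

Averaging the validation score over the ten folds gives the cross-validation benchmark, while the confusion matrix on the held-out test data yields the precision, recall, and f1-score values.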

Result and Discussion
This section contains the results and discussion of the conducted research. Details and specific results of the methods and algorithms are presented as descriptions, charts, or figures.

Tuning Hyperparameter
The results of Hyperparameter Tuning using Grid Search on the Logistic Regression and Random Forest algorithms can be observed in Table 5. It includes detailed values for the parameter C in the Logistic Regression algorithm and the parameters Estimators, Max_depth, Max_features, and Criterion in the Random Forest algorithm.

Evaluation
The evaluation in this research refers to the results of 10-fold cross-validation and the confusion matrix. It is based on the achievement of the highest f1-score value, the comparison of each scenario to the default scenario, and the influence of applying each method in the scenario for each algorithm. Then, the results from both algorithms are compared to determine the best performance achieved between them.

Logistic Regression
The evaluation results of Logistic Regression are shown in Table 6. Overall, each scenario experiences an increase in the f1-score from testing data to training data, except for the fourth scenario. Therefore, the fourth scenario is considered to be experiencing overfitting.

The scenario that achieved the highest f1-score in Logistic Regression is the one with SMOTE (Scenario 3), at 72.70%. The scenario with the lowest f1-score is the one with Stopword Removal (Scenario 2), at 69.23%. Compared to the default scenario, the largest increase in f1-score is obtained with SMOTE (Scenario 3), at +0.80%, while the largest decrease is obtained with Stopword Removal (Scenario 2), at -2.67%.

Implementing Stopword Removal on the Logistic Regression algorithm decreases the resulting f1-score. The largest decrease attributable to Stopword Removal occurs when it is combined with SMOTE (Scenario 4), at -3.21%, while the smallest decrease occurs with Stopword Removal alone (Scenario 2), at -2.67%.

Implementing SMOTE on the Logistic Regression algorithm increases the resulting f1-score. The largest increase attributable to SMOTE occurs with SMOTE alone (Scenario 3), at +0.80%, while the smallest increase occurs when SMOTE is combined with Stopword Removal (Scenario 4), at +0.26%.

The best scenario for the Logistic Regression algorithm is therefore the implementation of SMOTE (Scenario 3), which results in an f1-score of 72.70%, an increase of +0.80%. The worst scenario is the implementation of Stopword Removal (Scenario 2), which yields an f1-score of 69.23%, a decrease of -2.67%.

Random Forest
The evaluation results of Random Forest are shown in Table 7. All scenarios experience an increase in the f1-score from testing data to training data, indicating that no scenario is experiencing overfitting.

The scenario that achieved the highest f1-score in Random Forest is the one with SMOTE (Scenario 3), at 75.03%. The scenario with the lowest f1-score is the one with Stopword Removal (Scenario 2), at 68.73%. Compared to the default scenario, the largest increase in f1-score is obtained with SMOTE (Scenario 3), at +6.11%, while the largest decrease is obtained with Stopword Removal (Scenario 2), at -0.19%.

Implementing Stopword Removal on the Random Forest algorithm decreases the resulting f1-score. The largest decrease attributable to Stopword Removal occurs when it is combined with SMOTE (Scenario 4), at -2.03%, while the smallest decrease occurs with Stopword Removal alone (Scenario 2), at -0.19%.

Implementing SMOTE on the Random Forest algorithm increases the resulting f1-score. The largest increase attributable to SMOTE occurs with SMOTE alone (Scenario 3), at +6.11%, while the smallest increase occurs when SMOTE is combined with Stopword Removal (Scenario 4), at +4.27%.

The best scenario for the Random Forest algorithm is therefore the implementation of SMOTE (Scenario 3), which results in an f1-score of 75.03%, an increase of +6.11%. The worst scenario is the implementation of Stopword Removal (Scenario 2), which yields an f1-score of 68.73%, a decrease of -0.19%.

Figure 1. Research Methods

Table 3. Detail Vocabulary & Token

In Vectorization, word occurrence vectors in documents are created using TF-IDF weighting. TF-IDF (Term Frequency-Inverse Document Frequency) is a method of weighting and vectorizing each word (term) in text data into numerical values by combining two modeling concepts, Term Frequency (TF) and Inverse Document Frequency (IDF). TF (Term Frequency) represents how often a term occurs in a document.

Table 6. Evaluation of Logistic Regression

Table 7. Evaluation of Random Forest

LONTAR KOMPUTER VOL. 14, NO. 3 DECEMBER 2023, p-ISSN 2088-1541, e-ISSN 2541-5832, DOI: 10.24843/LKJITI.2023.v14.i03.p05, Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021
Conclusion
Based on the research, the first conclusion is that the best sentiment classification model for Indonesian tweets in this research is achieved using the Random Forest algorithm with SMOTE applied. It resulted in an f1-score of 75.03%, an improvement of +6.11%. Meanwhile, the worst model is obtained using the Random Forest algorithm with Stopword Removal applied. It resulted in an f1-score of 68.73%, a decrease of -0.19%.

Secondly, implementing Stopword Removal on the Logistic Regression and Random Forest algorithms reduces the f1-score values. This is because Stopword Removal can reduce information and alter the meaning of the processed tweets, causing them to lose their sentiment. Furthermore, the NLTK stoplist used for Stopword Removal in this research is better suited to document classification than to sentiment classification, so using a stoplist more suitable for sentiment classification is an option. The largest decrease in f1-score is observed in the Logistic Regression algorithm when Stopword Removal and SMOTE are applied together. In contrast, the smallest decrease in f1-score is observed in the Random Forest algorithm with Stopword Removal alone. For both algorithms, applying Stopword Removal alone results in a smaller decrease in f1-score than applying it together with SMOTE.

Thirdly, implementing SMOTE on the Logistic Regression and Random Forest algorithms generally increases the f1-score values. The dataset used in each scenario that applies SMOTE has a balanced class distribution, which prevents bias in sentiment classification towards the majority class. The Random Forest algorithm obtained the largest increase in f1-score with the implementation of SMOTE, amounting to +6.11%. Meanwhile, the smallest increase in f1-score is observed in the Logistic Regression algorithm with the combined implementation of SMOTE and Stopword Removal, amounting to +0.26%.