Balancing Datasets for Classifying Comments on the Kampus Merdeka Program Using Synonym Replacement
Abstract
Classifying comments on the Kampus Merdeka program is an essential step in analyzing user sentiment towards the features and services the program offers. However, the dataset processed in this study is imbalanced: the number of samples differs substantially between classes, with a relatively high imbalance ratio of 5:1. Such imbalance generally leads to a classification model that favors the majority class and performs poorly on the minority class. Therefore, this study applies data augmentation with the Synonym Replacement method to generate variations of the minority-class data, thereby reducing the imbalance and improving classification performance. The method replaces words in comment sentences with their synonyms to enrich the dataset and its feature representation. The results show an increase in the F-Measure value from 0.6672 to 0.7875. Evaluation using ROC shows a maximum value of 0.96, whereas the class that did not receive augmentation tended to have lower ROC values, between 0.81 and 0.88.
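The augmentation idea described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the synonym dictionary `SYNONYMS`, the function names, and the choice to oversample every minority class up to the majority class size are all assumptions made for illustration. A real pipeline would likely draw synonyms from a proper Indonesian thesaurus.

```python
import random
from collections import Counter

# Hypothetical toy synonym dictionary (assumption, for illustration only).
SYNONYMS = {
    "bagus": ["baik", "hebat"],
    "sulit": ["susah", "sukar"],
    "program": ["kegiatan"],
}


def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def synonym_replace(text, n_replacements, rng):
    """Replace up to n_replacements words that have a known synonym."""
    tokens = text.split()
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n_replacements]:
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)


def balance_with_synonyms(texts, labels, seed=0, max_attempts=1000):
    """Augment each smaller class until it matches the largest class size.

    Stops early for a class if no new distinct variant is found within
    max_attempts tries, so the result may remain partly imbalanced.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        seen = set(pool)
        attempts = 0
        while count < target and attempts < max_attempts:
            attempts += 1
            variant = synonym_replace(rng.choice(pool), 1, rng)
            if variant in seen:
                continue  # skip unchanged or duplicate sentences
            seen.add(variant)
            out_texts.append(variant)
            out_labels.append(label)
            count += 1
    return out_texts, out_labels


if __name__ == "__main__":
    # Made-up example comments with a 5:1 class ratio.
    texts = ["layanan bagus"] * 5 + ["program ini bagus tapi sulit"]
    labels = ["positive"] * 5 + ["negative"]
    print("ratio before:", imbalance_ratio(labels))
    new_texts, new_labels = balance_with_synonyms(texts, labels)
    print("ratio after:", imbalance_ratio(new_labels))
```

Augmentation of this kind should be applied to the training split only, so that synthetic variants of a comment do not leak into the evaluation data.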