Text Classification System Using Text Mining with XGBoost Method



Introduction
The availability of massive data nowadays can be used for analysis to obtain important and useful knowledge in various domains. Text mining can be used for text analysis using computational methods, so that knowledge extraction can be carried out on large text data, including processing of unstructured text data written in natural language.
Text mining is a process of extracting information in which users interact with documents using analytical tools in the form of data mining components, including clustering components. Text mining adopts techniques from other fields, such as data mining, information retrieval, machine learning, statistics and mathematics, linguistics, natural language processing (NLP), and visualization. Research activities in text mining include text extraction and storage, preprocessing, statistical data collection, indexing, and content analysis [1]. Text mining tasks include text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, and entity-relation modeling. Text mining is used to process unstructured data, in contrast to data mining, which is used to process structured data [2].

Classification is the process of finding a model or a set of functions that describe and differentiate data classes. The goal of classification is for the model to predict the class of an object whose class is unknown [3]. Classification is often referred to as a supervised method because it uses previously labeled data as examples of correct data. Text classification is classification applied to textual data and has been carried out in several studies using various methods. Terms often used for text data include document, word, phrase, corpus, and lexicon. A document is a sequence of words and punctuation marks that follows the grammatical rules of the language; sentences, paragraphs, sections, chapters, books, web pages, emails, and more are some instances of documents in this context. A term is usually a single word but can also be a word pair or phrase. A corpus is a collection of documents, while a lexicon is the set of all unique words in the corpus. Popular methods used for text classification include Nearest Neighbor (NN), Naïve Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), and Neural Network methods [4].
Text classification utilizing embedded representations of words and word senses has been studied and produced a stable classification process, especially in classification with complex semantics [5]. Local features of phrases and global sentence semantics have been used for text classification using the AC-BiLSTM (attention-based bidirectional long short-term memory with convolution layer) method [6]. In addition, word embedding has also been used for text classification with various classification methods [7]. However, the most commonly used representation is TFIDF, a method that integrates term frequency and inverse document frequency. Term frequency is the frequency with which a term t appears in a document d, while inverse document frequency helps reduce the influence of words that are common across the corpus [8].
Text mining in this research analyzed text data through a text classification system using classification techniques. The classification method used was XGBoost (eXtreme Gradient Boosting), a tree-based machine learning method that utilizes tree boosting techniques [9]. This study employed XGBoost because it enables resource optimization through cache access patterns, data compression, and sharding. The data used in this research were Indonesian-language article data. TFIDF (Term Frequency Inverse Document Frequency) for feature extraction, ANOVA (Analysis of Variance) for feature selection, and PCA (Principal Component Analysis) for feature dimension reduction were the other techniques applied in this work. The XGBoost method has been studied in several classification contexts, including building a milk source classification model (dairy farming) [10], diabetes prediction [11], traffic accident prediction [12], gourami supply estimation [13], and landslide hazard mapping [14]. The utilization of XGBoost for text mining has been carried out in several studies, including hybrid model development for Ukrainian-language sentiment analysis [15], integrated technology analysis of patent data [16], classification of injury rates based on accident narrative data [17], and classification of proactive personality in social media users [18].

Research Method / Proposed Method

System development was carried out in two main stages: the training stage and the testing stage. A system overview can be seen in Figure 1. Documents that have gone through preprocessing were subjected to TFIDF weighting, or the term weighting process. Feature selection was applied to the feature extraction results to reduce the feature size by selecting the more relevant features; the feature selection method used was ANOVA (Analysis of Variance). Feature dimension reduction was then applied to the selected features to reduce the feature dimensions, using PCA (Principal Component Analysis). Training was carried out using the XGBoost method, applied to the TFIDF-weighted features of the training data documents, and testing was carried out on the test data features using the model produced during training. These stages illustrate the logical sequence followed to obtain the expected research output.
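The pipeline described above (TFIDF weighting, ANOVA feature selection, PCA dimension reduction, then a boosted-tree classifier) can be sketched with scikit-learn. This is a minimal illustration, not the authors' implementation: scikit-learn's GradientBoostingClassifier is substituted for XGBoost, the tiny corpus and its labels are invented for the example, and the densify step is needed because PCA requires a dense matrix.

```python
# Sketch of the paper's pipeline; corpus, labels, and all hyperparameters
# here are illustrative assumptions, not the paper's data or settings.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

docs = [
    "harga saham naik di bursa",          # economy
    "inflasi dan ekonomi nasional",       # economy
    "tim sepak bola menang besar",        # sports
    "pertandingan bola berakhir imbang",  # sports
]
labels = ["economy", "economy", "sports", "sports"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),             # feature extraction
    ("anova", SelectKBest(f_classif, k=5)),   # feature selection
    ("densify", FunctionTransformer(lambda X: X.toarray())),
    ("pca", PCA(n_components=2)),             # dimension reduction
    ("clf", GradientBoostingClassifier(n_estimators=20)),
])
pipeline.fit(docs, labels)
print(pipeline.predict(["bursa saham dan ekonomi"]))
```

In the paper's setup, the classifier step would instead be xgboost's XGBClassifier; the surrounding stages are unchanged.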
The text data used in this research were Indonesian articles obtained from the news site www.cnnindonesia.com.The data used consisted of five article topics: Economy, Sports, Entertainment, Lifestyle, and Technology.Each article topic consisted of 20 articles, so the total amount of data used was 100 article data.

Classification
Classification is a technique in data mining that is used to extract patterns/knowledge from text in text mining. The purpose of classification is knowledge extraction, i.e. building a model to predict the class/category of previously unseen data [3]. The model is formed in the training phase, while the accuracy or performance of the model is measured in the testing phase. Classification falls into the supervised learning category because it uses labeled data, i.e. data with a class column.

XGBoost (eXtreme Gradient Boosting) Method
XGBoost is a tree-based machine learning method that utilizes tree boosting techniques [9]. The use of cache access patterns, data compression, and sharding in XGBoost are the main components that support more optimal use of resources. The tree boosting technique implemented in XGBoost is scalable and effective for preventing overfitting [19].
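The tree boosting idea that XGBoost builds on can be illustrated with a toy additive model of depth-1 regression trees (stumps), each fitted to the residuals of the current ensemble. This is a didactic sketch only; XGBoost adds regularization, second-order gradient information, and the system-level optimizations described above.

```python
# Toy gradient boosting: an additive ensemble of regression stumps,
# each trained on the residuals left by the previous rounds.

def fit_stump(xs, residuals):
    """Find the threshold split minimizing squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lmean if x <= threshold else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def boost(xs, ys, rounds=30, learning_rate=0.5):
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + sum(learning_rate * s(x) for s in stumps)

model = boost([1, 2, 3, 4, 5, 6], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(round(model(2), 2), round(model(5), 2))  # → 1.0 5.0
```

Each round shrinks the remaining residual by the learning rate, which is why shallow trees plus shrinkage resist overfitting: no single tree can dominate the ensemble.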

TF-IDF
TFIDF is a word-weighting technique using term frequency, tf(t, d), and inverse document frequency, idf(t). The term frequency is obtained by counting the frequency of a term/word t in a document d, and the inverse document frequency is obtained from the logarithmic ratio between the number of documents in the corpus, N, and the number of documents that contain the term t, df(t). The inverse document frequency is useful for reducing the influence of words that are common across the corpus [8]. It is calculated using Equation (1), while the TFIDF weight is obtained using Equation (2).

idf(t) = log(N / df(t))          (1)

tfidf(t, d) = tf(t, d) × idf(t)  (2)
The function of TFIDF weighting is to obtain values that can be used to represent the documents of the training data.
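Equations (1) and (2) can be computed directly with the Python standard library. The three-document corpus below is an invented toy example; real pipelines typically also normalize the resulting vectors.

```python
import math

corpus = [
    "harga saham naik",
    "saham turun tajam",
    "tim bola menang",
]
N = len(corpus)                      # number of documents in the corpus
docs = [doc.split() for doc in corpus]

def tf(term, doc_tokens):
    # Raw term frequency: count of term t in document d.
    return doc_tokens.count(term)

def idf(term):
    # Equation (1): log of (documents in corpus / documents containing t).
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(N / df)

def tfidf(term, doc_tokens):
    # Equation (2): tf(t, d) * idf(t).
    return tf(term, doc_tokens) * idf(term)

# "saham" appears in 2 of 3 documents, so its weight is damped;
# "harga" appears in only 1, so it is weighted more heavily.
print(round(tfidf("saham", docs[0]), 3))  # → 0.405  (1 * log(3/2))
print(round(tfidf("harga", docs[0]), 3))  # → 1.099  (1 * log(3/1))
```

The example shows the damping effect of IDF: the corpus-wide word scores far lower than the rare one even though both occur once in the document.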

Feature Selection and Dimensionality Reduction
High-dimensional datasets with large feature sizes can cause various obstacles in the machine learning process. These obstacles include enlarging the search space and the data preparation needed for learning, as well as increasing computational complexity [20]. Feature selection is the process of selecting the attributes considered relevant to the machine learning task, including in text mining. Reducing the feature size saves training time and reduces model complexity, and can even improve model performance. One feature selection method is ANOVA (Analysis of Variance), which utilizes the variation between the feature means across classes/groups. Dimensionality reduction is used to reduce the feature size (dimensions); the method used in this study is PCA (Principal Component Analysis), which derives principal components from the directions of highest variance in the data.
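The ANOVA score behind this kind of feature selection can be computed by hand: a feature whose class means differ strongly relative to its within-class variance gets a high F-statistic and is kept. A stdlib sketch of the standard one-way ANOVA formula (the feature values below are invented for illustration):

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for one feature, given its values per class."""
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of samples
    grand = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of class means around the grand mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread of values inside each class
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# A feature that separates two classes scores far higher than one that doesn't
separating = [[0.1, 0.2, 0.15], [0.9, 1.0, 0.95]]
uninformative = [[0.4, 0.6, 0.5], [0.5, 0.45, 0.55]]
print(anova_f(separating) > anova_f(uninformative))  # → True
```

Ranking every TFIDF feature by this score and keeping the top k is exactly what ANOVA-based feature selection does.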
Preprocessing
Preprocessing was conducted on the training and testing data. The preprocessing stages consisted of case folding, tokenization, filtering, and stemming. An example of the initial article data before the preprocessing stage is shown in Figure 2. Case folding was applied to change the text data to lowercase. The resulting data from the case folding process is shown in Figure 3, where words that previously contained uppercase letters have been changed to lowercase.
Figure 3 The results of the case folding process

The results of the tokenization process are shown in Figure 4, where the data has been separated into word units (tokens). The data, which originally consisted of paragraphs, was first split into smaller parts, namely sentences, and from sentences into even smaller parts in the form of tokens/words.
Figure 4 The results of the tokenization process

The resulting data from the filtering process is shown in Figure 5. Non-alphabetical characters, such as numbers and punctuation marks, were removed, so the data only consisted of the letters a to z.

Figure 5 The omission of non-alphabetical characters
As shown in Figure 6, meaningless words (stopwords) were also removed through the filtering process. The omitted words included dengan, seluruh, mulai, dari, yang, akan, nanti, and seperti. Figure 7 shows the results of the stemming process, where each word is reduced to its base form. Examples of changed words included pembentukan to bentuk, pengasuhan to asuh, mengendalikan to kendali, kehidupan to hidup, and bersikap to sikap.
Figure 7 The results of the stemming process
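The preprocessing steps above can be sketched with the Python standard library. Note the assumptions: the stopword list below is only a small sample, and no stemmer is implemented here; in the paper both stopword filtering and Indonesian stemming are based on the Sastrawi algorithm.

```python
import re

# Small sample stopword list; the paper's filtering uses a much larger
# Indonesian stopword list via the Sastrawi algorithm.
STOPWORDS = {"dengan", "seluruh", "mulai", "dari", "yang",
             "akan", "nanti", "seperti", "dan"}

def preprocess(text):
    # 1. Case folding: lowercase everything
    text = text.lower()
    # 2. Filtering (characters): keep only the letters a-z and whitespace
    text = re.sub(r"[^a-z\s]", " ", text)
    # 3. Tokenization: split into word units (tokens)
    tokens = text.split()
    # 4. Filtering (stopwords): drop meaningless words
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 5. Stemming would then map each token to its base word
    #    (e.g. "pembentukan" -> "bentuk"); Sastrawi does this in the paper.
    return tokens

print(preprocess("Pembentukan Tim dimulai dari 5 kota, seperti Jakarta."))
# → ['pembentukan', 'tim', 'dimulai', 'kota', 'jakarta']
```

The number, the comma, the period, and the stopwords dari and seperti are removed, matching the filtering behavior shown in Figures 5 and 6.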

Feature Extraction
Feature extraction in this research was carried out using the TFIDF method and aimed to obtain features from the data. The TFIDF approach generated data features based on word frequency within each document as well as word distribution across the entire dataset. IDF represents the distribution of words across the data and damps common words, while TFIDF represents word frequency with the influence of common words reduced. Figures 9 to 13 show the ten highest weighted features for the Entertainment, Lifestyle, Economy, Technology, and Sports topics, respectively.

Feature Selection
Feature selection in this research aimed to select features based on their relevance to the topic. It was carried out using the ANOVA method, which utilizes the variation between the feature means across classes/groups. The ten highest-ranked features resulting from the feature selection process can be seen in Figure 14. The feature size after the feature selection process was 215.
Figure 14 The ten highest weighted features from the feature selection process

Feature Reduction
Dimensionality reduction aimed to reduce the data feature dimension, using the PCA method in this research. PCA derives principal components from the directions of highest variance in the data. Dimension reduction changed the feature size to 3 features. The visualization of the feature distribution resulting from dimension reduction is shown in Figure 15. The Entertainment and Sports topic features have sections separated from the other features in one direction, while the Economy topic features have a section separated from the other features in two directions. The Lifestyle topic features have a slightly separate section, while the Technology topic features remain mixed with the other features. The resulting optimum model used the parameter combination max_depth: 2; min_child_weight: 1; gamma: 0.0; subsample: 0.6; colsample_bytree: 0.1; learning_rate: 0.2; and n_estimators: 100.
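The tuned parameter combination reported above maps directly onto the xgboost library's constructor arguments. This is a configuration sketch only; it assumes the text's "learning_level" and "n_estimator" refer to xgboost's learning_rate and n_estimators parameters.

```python
# Tuned hyperparameters as reported in the paper, written as
# keyword arguments for xgboost's XGBClassifier.
best_params = {
    "max_depth": 2,           # shallow trees limit overfitting
    "min_child_weight": 1,    # minimum sum of instance weight per child
    "gamma": 0.0,             # minimum loss reduction required to split
    "subsample": 0.6,         # fraction of rows sampled per tree
    "colsample_bytree": 0.1,  # fraction of columns sampled per tree
    "learning_rate": 0.2,     # shrinkage applied to each tree's output
    "n_estimators": 100,      # number of boosting rounds
}

# With the xgboost package installed, the model would be built as:
# from xgboost import XGBClassifier
# clf = XGBClassifier(**best_params)
```

The very low colsample_bytree is plausible here because PCA has already reduced the features to a small number of dense components.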

Conclusion
The conducted research aimed to classify text data using the XGBoost classification method. Text mining was applied to classify text data through several stages: preprocessing, feature extraction using the TFIDF method, feature selection using the ANOVA method, feature dimension reduction using the PCA method, and classification using the XGBoost method. A text classification system was developed to classify article texts. The highest accuracy obtained from the test was 77%, with a precision of 81% and a recall of 77%. Misclassification occurred on the Economy and Sports topics, with the most errors on the Technology topic.

Figure 1

Figure 1 System overview

The training stage aimed to build a model, and the testing stage aimed to test the system performance through the model formed with the XGBoost classification method during training. Article classification, as shown in Figure 1, began with document preprocessing, which prepared the text into data that could be processed at the training stage. The preprocessing consisted of: (1) case folding, changing all letters in the article to lowercase; (2) tokenization, breaking the case-folded text into word units (tokens) by first dividing the document into paragraphs and then sentences; (3) filtering, keeping important words by removing non-alphabetic characters (numbers, punctuation, and whitespace) and stopwords, i.e. words unrelated to determining the topic, with filtering based on the Sastrawi algorithm; and (4) stemming, mapping and decomposing each word into its base word form, also using the Sastrawi algorithm.

Figure 2
Figure 2 Example of raw text data

Figure 6
Figure 6 The omission of meaningless words (stopwords)

Figure 8
Figure 8 The highest IDF scores for ten features/words

Feature extraction produced 2048 features/attributes for each data item. The ten words with the highest IDF scores can be seen in Figure 8, and the highest TFIDF scores for ten features/words on each topic are shown in Figures 9 to 13.

Figure 9 The ten highest weighted features for Entertainment Topic

Figure 15

Figure 15 The feature distribution resulting from dimensional reduction

The Model Testing Result
The test was carried out on the 30% test-ratio data, or 30 article data. Based on multiple experiments with various parameter combinations, the optimum model was found.
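The reported accuracy, precision, and recall can be derived from a confusion matrix such as the one in Figure 16. A stdlib sketch with hypothetical counts (the paper's actual matrix values are not reproduced here), using macro-averaged precision and recall:

```python
# Accuracy, macro-precision, and macro-recall from a confusion matrix.
# Rows = true class, columns = predicted class.  These counts are
# hypothetical, not the paper's actual confusion matrix.
matrix = [
    [5, 1, 0],
    [0, 6, 0],
    [1, 1, 4],
]

n_classes = len(matrix)
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n_classes))
accuracy = correct / total

precisions, recalls = [], []
for c in range(n_classes):
    predicted_c = sum(matrix[r][c] for r in range(n_classes))  # column sum
    actual_c = sum(matrix[c])                                  # row sum
    precisions.append(matrix[c][c] / predicted_c)  # of predicted c, how many right
    recalls.append(matrix[c][c] / actual_c)        # of actual c, how many found

print(round(accuracy, 3),
      round(sum(precisions) / n_classes, 3),
      round(sum(recalls) / n_classes, 3))  # → 0.833 0.861 0.833
```

Precision can exceed recall (as in the paper's 81% versus 77%) when some class columns collect few false positives even though some true instances of other classes are missed.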

Figure 16

Figure 16 The confusion matrix of the test result

Text Classification System Using Text Mining with XGBoost Method (Ni Kadek Dwi Rusjayanthi)