Stemming Algorithm for Indonesian Digital News Text Processing
Abstract
Stemming is the process of reducing a word in a text to its root word. The stemming algorithm developed by Nazief and Adriani is considered the best stemming algorithm for Indonesian, and it has been refined by Jelita Asian. However, references on the Nazief-Adriani stemming algorithm remain difficult to find because the algorithm was described in an internal publication. Therefore, this study builds a stemming algorithm for Indonesian digital news text based on the stemming algorithms of Nazief-Adriani and Jelita Asian. The algorithm was evaluated twice: before and after the addition of rules and a more complete root word dictionary. Both evaluations computed Precision, Recall, and F-Measure values by comparing automatic stemming results with manual stemming results. Preliminary tests of the Nazief-Adriani and Jelita Asian stemming algorithms found new root words, abbreviations, entities, and foreign terms that commonly appear in news text but were not yet stored in the root word dictionary. Furthermore, some affixed words were not recognized by the defined rules. Adding root words, abbreviations, entities, and foreign terms to the root word dictionary, along with adding rules, can improve the performance of the stemming algorithm built in this study. Thus, the completeness of the root word dictionary and the accuracy of the rules play a very important role in the success of an automatic stemming algorithm.
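The dependence on dictionary completeness can be illustrated with a minimal Python sketch. It is not the algorithm built in this study: the affix lists, the tiny dictionaries, the function names, and the set-based definition of Precision and Recall are all assumptions made for illustration. The stemmer only loosely follows the commonly described order of dictionary lookup, suffix removal, and prefix removal, and it omits recoding rules entirely.

```python
# Simplified affix lists (assumed for illustration; real rule sets are larger).
PARTICLES = ("lah", "kah", "pun")
POSSESSIVES = ("ku", "mu", "nya")
SUFFIXES = ("kan", "an", "i")
PREFIXES = ("mem", "men", "ber", "ter", "me", "di", "ke", "pe", "se")


def strip_suffix(word, suffixes, min_len=3):
    """Remove the first matching suffix if enough of the word remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: -len(suffix)]
    return word


def stem(word, roots):
    """Dictionary-driven stemming sketch; returns the word unchanged on failure."""
    word = word.lower()
    if word in roots:
        return word
    candidate = word
    # Strip particles, possessives, then derivational suffixes,
    # checking the dictionary after each step.
    for group in (PARTICLES, POSSESSIVES, SUFFIXES):
        candidate = strip_suffix(candidate, group)
        if candidate in roots:
            return candidate
    # Try a single prefix on whatever remains.
    for prefix in PREFIXES:
        if candidate.startswith(prefix) and candidate[len(prefix):] in roots:
            return candidate[len(prefix):]
    return word  # no root found in the dictionary


def precision_recall_f(auto_stems, manual_stems):
    """Compare unique automatic stems with unique manual stems (an assumed scheme)."""
    auto, manual = set(auto_stems), set(manual_stems)
    correct = len(auto & manual)
    precision = correct / len(auto) if auto else 0.0
    recall = correct / len(manual) if manual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical data: four news-text words and their manually assigned roots.
words = ["membaca", "makanan", "bukunya", "dibacakan"]
manual = ["baca", "makan", "buku", "baca"]

small_dict = {"baca", "makan", "main"}   # "buku" is missing
larger_dict = small_dict | {"buku"}      # dictionary after adding the root

for name, roots in (("small", small_dict), ("larger", larger_dict)):
    auto = [stem(w, roots) for w in words]
    print(name, auto, precision_recall_f(auto, manual))
```

In this toy setup, the word "bukunya" is left unstemmed while "buku" is absent from the dictionary, and it is stemmed correctly once the root is added, which raises all three scores.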