NEWS CLUSTERING USING THE TF-IDF AND K-MEANS ALGORITHMS WITH DATA CRAWLED FROM THE DETIK.COM SITE

  • I Made Arta Purniawan, Teknologi Informasi, Universitas Udayana
  • Gusti Made Arya Sasmita
  • I Putu Agus Eka Pratama

Abstract

News clustering aims to identify the news groups formed by applying the K-Means method to word weights computed with the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. The clustering process uses news crawled from the detik.com site over a period of one year (2018), totaling 124,509 news stories stored as a CSV (Comma-Separated Values) file. Before clustering, the dataset must go through a text-preprocessing stage consisting of case folding, tokenizing, stopword removal, and stemming. The TF-IDF method assigns a weight to each keyword in each category to measure the similarity of keywords to the available categories. The K-Means method then groups documents based on the similarity between them. K-Means is applied twice, using 16 and 12 centroids respectively, because the first run produces clusters that cannot be identified since they contain common words, so a second run is needed. Based on testing on the 124,509 news stories, 27 news groups were successfully identified, and the application showed adequate capability in processing large data.
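
The pipeline described in the abstract (preprocessing, TF-IDF weighting, then K-Means) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. The names STOPWORDS, HEADLINES, preprocess, tfidf, sq_dist, and kmeans are hypothetical, the sample headlines are invented, the stopword list is a tiny stand-in, the "stemmer" only strips the suffix "-nya" as a placeholder for a real Indonesian stemmer, the weight is assumed to be tf * log(N / df), and the farthest-first seeding of centroids is an assumption because the abstract does not say how centroids were initialised.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real run would use a full Indonesian list.
STOPWORDS = {"yang", "di", "dan", "ke", "dari", "untuk", "pada", "dengan"}

# Invented sample headlines (two about football, two about the stock market).
HEADLINES = [
    "Timnas Indonesia menang di laga sepak bola",
    "Pelatih timnas puji pemain sepak bola",
    "Harga saham naik di bursa efek",
    "Bursa saham ditutup dengan harga turun",
]

def preprocess(text):
    """Case folding, tokenizing, stopword removal, and placeholder stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())        # case folding + tokenizing
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    # Placeholder stemming: strip only the suffix "-nya" from longer words.
    return [t[:-3] if t.endswith("nya") and len(t) > 5 else t for t in tokens]

def tfidf(docs):
    """Return (vocabulary, vectors) weighted by tf * log(N / df), L2-normalised."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec)) or 1.0  # guard all-zero vectors
        vectors.append([w / norm for w in vec])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain K-Means; returns one cluster label per input vector."""
    # Assumed seeding: start from the first vector, then repeatedly add the
    # vector farthest from all centroids chosen so far (deterministic).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

if __name__ == "__main__":
    docs = [preprocess(h) for h in HEADLINES]
    _, vectors = tfidf(docs)
    print(kmeans(vectors, k=2))
```

On the four sample headlines with k=2, the two football headlines and the two stock-market headlines should receive different labels. For a corpus on the scale described in the abstract, k would be set to the centroid counts it reports (16 and 12), and a sparse-matrix library would be a more practical choice than the dense Python lists used here.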

Published
2022-01-25
How to Cite
PURNIAWAN, I Made Arta; SASMITA, Gusti Made Arya; PRATAMA, I Putu Agus Eka. CLUSTERING BERITA MENGGUNAKAN ALGORITMA TF-IDF DAN K-MEANS DENGAN MEMANFAATKAN SUMBER DATA CRAWLING PADA SITUS DETIK.COM. JITTER : Jurnal Ilmiah Teknologi dan Komputer, [S.l.], v. 3, n. 1, p. 821-830, jan. 2022. ISSN 2747-1233. Available at: <https://ojs.unud.ac.id/index.php/jitter/article/view/82983>. Date accessed: 19 apr. 2024. doi: https://doi.org/10.24843/JTRTI.2022.v03.i01.p18.
