Spam Comments Detection on Instagram Using Machine Learning and Deep Learning Methods
Abstract
The more popular a public figure is on Instagram (IG), the more followers the account attracts, and each post draws many comments from other users. Not all of these comments are relevant to the post: some are advertisements, links, or clickbait. Comments that are irrelevant to the post are usually called spam comments. Spam comments interfere with the flow of information and may spread misleading information. This research compares machine learning (ML) and deep learning (DL) classification methods on an Indonesian IG spam comment dataset collected by the authors. The research was conducted in the following steps: dataset preparation, pre-processing, simple normalization, feature generation using TF-IDF and word embeddings, application of ML and DL classification methods, performance evaluation, and comparison. The authors compare the accuracy, F1 score, precision, and recall of the ML and DL results. Among the ML methods, the Linear SVM, Extra Trees (ET), Regression, and Stochastic Gradient Descent algorithms reach an accuracy of 0.93. Among the DL methods, the SimpleTransformer BERT architecture achieves the highest accuracy, 0.94. The results show that the ML and DL methods do not differ significantly.
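As a rough illustration of two steps named in the abstract, TF-IDF feature generation and evaluation by accuracy, precision, recall, and F1 score, the sketch below uses only the Python standard library. It is not the authors' implementation: the toy comments, the labels, the predictions, and the function names `tfidf` and `binary_metrics` are hypothetical, and the smoothed IDF formula is one common variant chosen here as an assumption, since the abstract does not state the exact weighting used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term count * smoothed IDF, where
    IDF = ln((1 + N) / (1 + df)) + 1 (an assumed, commonly used variant).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical, already tokenized toy comments (not from the paper's dataset).
docs = [
    "promo murah cek bio".split(),
    "cantik banget kak".split(),
    "cek bio promo followers".split(),
]
weights = tfidf(docs)

# Hypothetical gold labels and classifier outputs (1 = spam, 0 = not spam).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)

print(weights[0])
print(scores)
```

In this sketch a term that occurs in fewer comments, such as "murah", receives a higher weight than one shared across several comments, such as "promo". The four scores are derived from the same counts of true positives, false positives, and false negatives, which is why they are usually reported together when comparing classifiers.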
The Authors submitting a manuscript do so on the understanding that if accepted for publication, the copyright of the article shall be assigned to Jurnal Lontar Komputer as the publisher of the journal. Copyright encompasses exclusive rights to reproduce and deliver the article in all forms and media, as well as translations. The reproduction of any part of this journal (printed or online) will be allowed only with written permission from Jurnal Lontar Komputer. The Editorial Board of Jurnal Lontar Komputer makes every effort to ensure that no wrong or misleading data, opinions, or statements be published in the journal.
This work is licensed under a Creative Commons Attribution 4.0 International License.