Dimensionality Reduction using PCA and K-Means Clustering for Breast Cancer Prediction
Abstract
Breast cancer is one of the leading causes of death among women. Detecting it at an early stage gives a greater chance of cure. This calls for a prediction tool that can classify a breast tumor as either harmful (malignant) or harmless (benign). In this paper, two machine learning algorithms, Support Vector Machine and Extreme Gradient Boosting, are compared for this classification task. Before classification, the number of data attributes is reduced by extracting features from the raw data using Principal Component Analysis. K-Means, a clustering method, is used as a second dimensionality reduction technique alongside Principal Component Analysis. The paper compares four models, formed by combining the two dimensionality reduction methods with the two classifiers, on the Wisconsin Breast Cancer Dataset. The comparison uses accuracy, sensitivity, and specificity computed from the confusion matrices. The experimental results indicate that K-Means, which is not usually used for dimensionality reduction, can perform well compared with the popular Principal Component Analysis.
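The three evaluation metrics named above can be illustrated with a minimal sketch. It assumes the standard definitions (accuracy = (TP + TN) / total, sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) and treats malignant as the positive class; the function names and the toy labels are hypothetical and are not taken from the paper or its results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary problem.

    `positive` marks the class treated as positive; here we assume
    1 = malignant and 0 = benign (an illustrative convention).
    """
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # malignant correctly flagged
            else:
                fn += 1  # malignant missed
        else:
            if pred == positive:
                fp += 1  # benign wrongly flagged
            else:
                tn += 1  # benign correctly cleared
    return tp, fn, fp, tn


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity and specificity from the counts."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fn + fp + tn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Toy labels invented for illustration only (not the paper's data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

In a screening setting, sensitivity is often the metric watched most closely, since a false negative corresponds to a missed malignant tumor; reporting specificity alongside it shows how many benign cases are flagged unnecessarily.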
The Authors submitting a manuscript do so on the understanding that if accepted for publication, the copyright of the article shall be assigned to Jurnal Lontar Komputer as the publisher of the journal. Copyright encompasses exclusive rights to reproduce and deliver the article in all forms and media, as well as translations. The reproduction of any part of this journal (printed or online) will be allowed only with written permission from Jurnal Lontar Komputer. The Editorial Board of Jurnal Lontar Komputer makes every effort to ensure that no wrong or misleading data, opinions, or statements be published in the journal.
This work is licensed under a Creative Commons Attribution 4.0 International License.