Customer Segmentation Based on RFM Model Using K-Means, K-Medoids, and DBSCAN Methods

A recurring problem in marketing is how to identify potential customers. Companies can identify their best customers through customer segmentation by applying concepts from Data Mining and Customer Relationship Management (CRM). This paper presents a Data Mining process that combines the RFM model with the K-Means, K-Medoids, and DBSCAN algorithms. It analyzes 334,641 transaction records and converts them into 1661 Recency, Frequency, and Monetary (RFM) records to identify potential customers. The K-Means, K-Medoids, and DBSCAN algorithms are very sensitive to initialization because it is done randomly. Clustering is performed with two to six clusters. The trials for K-Means and K-Medoids use random centroid values, and the trials for DBSCAN use random Epsilon and MinPts values, until clusters are obtained that reveal potential customers. Cluster validation uses the Davies-Bouldin Index and the Silhouette Index. The results show that K-Means had better validity than K-Medoids and DBSCAN, with a Davies-Bouldin Index of 0.33009058 and a Silhouette Index of 0.912671056. According to both the Davies-Bouldin Index and the Silhouette Index, the best number of clusters is 2, where the K-Means, K-Medoids, and DBSCAN algorithms yield the Dormant and Golden customer classes.


Introduction
A main goal of a company is to strengthen its relationships with customers in order to gain a significant profit in a competitive market. This means that companies must develop skills in identifying customers and meeting customer requirements [1]. Distribution companies need management practices that can identify the best customers and that increase the company's understanding of customer needs, so that customer loyalty can be maintained [2]. Customer Relationship Management (CRM) can support the customer segmentation process by implementing appropriate marketing strategies so that companies can identify the quality and behavior of customers. Customer segmentation is the process of dividing customers into groups, based on past data, that share the same demands, characteristics, and behavior [3]. Customer segmentation analysis of company transaction data is done to find profitable customers. The first step is to convert the company data to Recency, Frequency, and Monetary (RFM) values. RFM is a method used to analyze customer behavior: how recently a customer bought (Recency), how often a customer buys (Frequency), and how much money a customer spends on transactions (Monetary). The attributes of the RFM model are described by linguistic variables. For example, the Recency attribute is described using the terms 'old' and 'very new,' the Frequency attribute using the terms 'rarely' and 'often,' and the Monetary attribute using the terms 'low' and 'high' [4]. K-Means, K-Medoids, and DBSCAN are the algorithms combined with the RFM model in this study. These three methods are often used to segment customers because they are easy to understand. In addition, the three methods are applied in customer segmentation research to determine the diversity of customer classes and to find the best customer class so that companies can make use of it.
The K-Means algorithm is sensitive to outliers because objects with extremely large values can substantially distort the data distribution. Instead of taking the average value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster [5]. The basic strategy of the K-Medoids clustering algorithm is to find k clusters in n objects by first arbitrarily choosing a representative object (medoid) for each cluster [6]. The DBSCAN method takes two input parameters, the minimum number of points (minpts) and epsilon (eps). The parameter values are determined by trial and error, which means that they must be tested several times to obtain a given number of clusters [7]. This research examines the transaction data of a company engaged in food and beverage distribution. The transaction data are used to segment potential customers using the K-Means, K-Medoids, and DBSCAN methods. The company will use the resulting customer segmentation to identify its potential customers so that it can provide the best service to all customers based on the needs of each customer.

Research Method
Customer segmentation is done by inputting transaction data from January 2013 to December 2018, consisting of 334,641 rows of data. Figure 1 is a general description of the system for customer segmentation, where the data used are sales transaction data of PT. Cimory from January 2013 to December 2018. This paper analyzes the 334,641 transaction records and converts them into 1661 Recency, Frequency, and Monetary (RFM) records to identify potential customers. The data selection process is based on the characteristics of the RFM model: the recency attribute is the difference between the date of the last transaction and the date of the segmentation process, the frequency attribute is the number of transactions made by the customer, and the monetary attribute is the total value of the transactions made by the customer. In the data transformation process, the transaction data that have passed the data selection stage are converted into the RFM model. The transformed data are then normalized so that the values fall in a comparable range and the clustering results are more optimal. The clustering model is designed in the RapidMiner application using the K-Means, K-Medoids, and DBSCAN methods. In this paper, the three methods are used to form the optimal customer classes for use in distribution companies. Cluster validation is done using the Davies-Bouldin Index and the Silhouette Index. The data modeling process is then based on the results obtained from the clustering process. The clustering results group the data under five customer labels, namely Superstar, Golden, Every Day, Occasional, and Dormant.
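The RFM conversion described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the paper: the function name `build_rfm`, the `(customer_id, transaction_date, amount)` row layout, the choice of days as the recency unit, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

def build_rfm(transactions, segmentation_date):
    """Aggregate (customer_id, transaction_date, amount) rows into RFM values."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, tx_date, amount in transactions:
        # Keep the most recent transaction date seen for each customer.
        if customer_id not in last_purchase or tx_date > last_purchase[customer_id]:
            last_purchase[customer_id] = tx_date
        frequency[customer_id] += 1      # Frequency: number of transactions
        monetary[customer_id] += amount  # Monetary: total transaction value
    return {
        cid: {
            # Recency: days between the last transaction and the segmentation date
            # (the unit is an assumption; the text does not state one).
            "recency": (segmentation_date - last_purchase[cid]).days,
            "frequency": frequency[cid],
            "monetary": monetary[cid],
        }
        for cid in last_purchase
    }

# Made-up sample rows, not taken from the dataset analyzed in the paper.
sample = [
    ("C001", date(2018, 12, 1), 150.0),
    ("C001", date(2018, 6, 15), 50.0),
    ("C002", date(2015, 3, 10), 20.0),
]
rfm = build_rfm(sample, segmentation_date=date(2018, 12, 31))
print(rfm["C001"])
```

Each customer ends up as one RFM record, which is how many transaction rows collapse into a much smaller number of RFM rows.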

Normalization of Data
Normalization is the part of data transformation that converts data into values that are easier to interpret. It is used to improve the accuracy of numerical calculations by rescaling the data to the range 0 to 1 [8]. This study uses the min-max normalization technique, with the following equation.
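$$X' = \frac{X - \text{minA}}{\text{maxA} - \text{minA}} \left(\text{new\_maxA} - \text{new\_minA}\right) + \text{new\_minA} \tag{1}$$
Here X' is the normalized value, X is the actual data, minA is the lowest actual data, maxA is the highest actual data, new_maxA is the highest value of the new scale, which is 1, and new_minA is the lowest value of the new scale, which is 0 [9].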
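As a small worked example of min-max normalization, the sketch below rescales a list of values to the range 0 to 1. The function name `min_max_normalize` and the sample values are assumptions made for illustration; the handling of a constant column is a choice made here, not something the paper specifies.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Made-up monetary values for three customers.
normalized = min_max_normalize([10, 20, 40])
print(normalized)
```

The lowest value maps to 0, the highest to 1, and the middle value lands one third of the way between them because it sits one third of the way along the original range.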

Clustering
This paper uses the K-Means, K-Medoids, and DBSCAN algorithms to group data. The K-Means algorithm is very sensitive to the initialization of the cluster centers because it is done randomly [10]. The K-Means algorithm uses the average value as the center of the cluster. The following are the steps of the K-Means algorithm.
a. Choose k points at random as the initial cluster centers.
b. Assign each data point to the cluster whose center is nearest, using the Euclidean distance:
$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2} \tag{2}$$
where x is a data point, c is a cluster center, and n is the number of attributes.
c. Recalculate each cluster center as the average of the values in the cluster obtained.
d. Repeat steps b and c as long as cluster membership changes. The process stops when there are no more changes to the clusters.
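The K-Means steps above can be sketched as follows. This is a minimal illustration in plain Python, not the RapidMiner operator used in the study; the function names, the fixed `seed`, the `max_iter` cap, and the sample points are assumptions made for the example.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step a: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step b: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:  # step d: stop when nothing changes
            break
        assignment = new_assignment
        # Step c: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers

# Made-up normalized (R, F, M) points forming two well-separated groups.
points = [
    (0.0, 0.0, 0.0), (0.05, 0.05, 0.05), (0.1, 0.1, 0.1),
    (0.9, 0.9, 0.9), (0.95, 0.95, 0.95), (1.0, 1.0, 1.0),
]
assignment, centers = k_means(points, k=2)
print(assignment)
```

Because the initial centers are drawn at random, different seeds can give different cluster numbering and, on less cleanly separated data, different final clusters, which is the sensitivity to initialization noted above.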
The K-Medoids algorithm uses actual objects as representatives (medoids) for each cluster. The K-Medoids algorithm takes longer to run than K-Means: about 2 minutes in RapidMiner, whereas the K-Means method takes only about 1 second [11]. The steps of the K-Medoids algorithm are as follows.
a. Initialize the cluster centers according to the number of clusters (k).
b. Allocate each data object to the nearest cluster using the Euclidean distance.
c. Randomly select an object in each cluster as a new medoid candidate.
d. Calculate the distance from each object in each cluster to the new medoid candidate.
e. Calculate the total deviation (S) as the new total distance minus the old total distance. If S < 0, swap the object with the cluster medoid to form a new set of k objects as medoids.
f. Repeat steps c to e until the medoids no longer change, so that the clusters and their members are obtained.
DBSCAN is a grouping method that builds clusters based on density; objects that are not included in any cluster are considered noise. Using DBSCAN in practice takes a very long time because suitable Epsilon and MinPts values are searched for by random trial in order to obtain a particular set of clusters [12]. The steps of the DBSCAN algorithm are as follows.
a. Initialize the MinPts and eps parameters.
b. Choose a starting point p at random.
c. Repeat steps d to f until all points have been processed.
d. Find all points that are density-reachable from p with respect to eps.
e. If the number of points within eps of p is at least MinPts, then p is a core point and a cluster is formed.
f. If p is a border point and no points are density-reachable from p, the process continues with another point.
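The DBSCAN steps above can be sketched as follows. This is a minimal plain-Python illustration of the standard algorithm, not the RapidMiner operator used in the study. The function names, the `NOISE` marker, and the sample points are assumptions made for the example, and points are visited in index order rather than at random, which can affect cluster numbering and the assignment of shared border points.

```python
import math

NOISE = -1  # label used here for points that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster_id += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

# Made-up 2-D points: two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
    (5.0, 5.0),
]
labels = dbscan(points, eps=0.2, min_pts=3)
print(labels)
```

Unlike K-Means and K-Medoids, the number of clusters is not an input here; it follows from the chosen eps and MinPts values, which is why those values have to be tuned by trial to reach a target number of clusters.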

Data Modelling
The customer class of each cluster is determined through the process of data modeling. Data modeling is completed by comparing the average of each cluster with the ranges of RFM values so that the class of each cluster can be found. Each of the variables R, F, and M has three linguistic variables and domain values [13]. The linguistic variables and domain values for each variable are shown in Table 1. Each class in the RFM model has a customer label that states the characteristics of that customer class [14]. Class descriptions for each cluster can be seen in Table 2.

Results and Discussions
Clustering was tested with the K-Means, K-Medoids, and DBSCAN methods for 2 to 6 clusters. Some of the experimental results are presented below. Figure 2 shows the results of clustering using K-Means with the parameter value k = 2. The segmentation results for the formation of 2 clusters using K-Means are shown in Table 3.
Table 3. The formation of 2 clusters produces two customer classes, namely Dormant A and Dormant C.
Table 4. The formation of 2 clusters produces two customer classes, namely Dormant A and Dormant C.
Table 5. The formation of 2 clusters produces two customer classes, namely Dormant B and Golden B.
Table 6. The formation of 4 clusters produces four customer classes, namely Dormant A, Dormant B, Dormant C, and Golden B.
Table 7. The formation of 4 clusters produces four customer classes, namely Dormant A, Dormant B, Dormant C, and Golden A.
Table 8. The formation of 4 clusters produces four customer classes, namely Dormant A, Dormant A, Golden B, and Golden E.
For the Davies-Bouldin validity index, the optimum number of clusters is the one with the smallest Davies-Bouldin Index value [15], while for the Silhouette validity index, the optimum number of clusters is the one with the largest Silhouette Index value [16].
Based on the results in Figures 8, 9, and 10, the K-Means method has the smallest DBI value and the largest Silhouette value, so it can be concluded that the K-Means method produces better clusters than the other methods. Based on testing different numbers of clusters with the Davies-Bouldin Index and the Silhouette Index, the best number of clusters is 2, where the three methods give similar results in terms of customer characteristics.
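To make the two validity indices concrete, the sketch below computes both for a clustering given as a list of point lists. It is a plain-Python illustration of the standard definitions, not the RapidMiner implementation behind the reported values; the function names and the four sample points are assumptions made for the example, and degenerate inputs (such as clusters with identical centroids) are not handled.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def davies_bouldin(clusters):
    """Davies-Bouldin Index for a list of clusters; lower is better."""
    cents = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean distance from its members to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i
        )
    return total / k

def silhouette(clusters):
    """Mean silhouette coefficient over all points; higher is better."""
    scores = []
    for i, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # common convention for singleton clusters
                continue
            # a: mean distance to the other members of the point's own cluster.
            a = sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: smallest mean distance to the members of any other cluster.
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters) if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Made-up points grouped in a sensible way and in a poor way.
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
poor = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]
print(davies_bouldin(good), silhouette(good))
print(davies_bouldin(poor), silhouette(poor))
```

The sensible grouping gets the lower Davies-Bouldin value and the higher silhouette value, which is the selection rule applied when comparing numbers of clusters and methods.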

Conclusions
Based on this research, applying the K-Means and K-Medoids methods in the 2-cluster experiment did not produce the best customer class but only the Dormant customer class, whereas applying the DBSCAN method in the 2-cluster experiment produced the Golden customer class. In other words, in the 2-cluster experiment, DBSCAN performed better than the K-Means and K-Medoids methods. In the 4-cluster experiment, all three clustering methods produced a Golden customer class. This indicates that the more tests are carried out, the more varied the resulting customer classes become, so that the best customer classes, namely Superstar and Golden, are more likely to emerge. The results showed that K-Means had better validity than K-Medoids and DBSCAN, with a Davies-Bouldin Index of 0.33009058 and a Silhouette Index of 0.912671056. Based on testing different numbers of clusters with the Davies-Bouldin Index and the Silhouette Index, the best number of clusters is 2.