Implementation Of ETL E-Commerce For Customer Clustering Using RFM And K-Means Clustering

E-commerce is the activity of selling and buying goods through an online system or online. Customer loyalty is important factor in running a Company. To maintain the loyalty, the company can provide several different treatments to its customers so that they can maintain good relations with customers and can increase product purchase revenue. In this study, we are using ETL method with the K-Means clustering algorithm and RFM (Recency, Frequency, Monetary) feature to segment E-Commerce customer. The proposed method obtains the silhouette score is 0.470 in 4 optimum clusters. Thus, we believe that RFM with K-Means Clustering can be used for segmenting the Customer to increase the Customer loyalty.

The previous research that applied the K-Means algorithm and RFM method for customer segmentation was the research conducted by Rahma Wati Br Sembiring Berahmana, Fahd Agodzo Mohammed, and Kankamol Chairuang. Bouldin is 0.33009058 and the result of the Silhouette Index is 0.912671056 with the best number of clusters being 2, namely the Dormant cluster and the Golden cluster [15]. So this study wants to know customer segmentation using the K-Means clustering method by considering the feature construction into RFM on the E-Commerce Olist dataset.
Based on what has been described above, the formulation of the problem in this study is: (1) How to make feature construction from the E-Commerce Olist dataset feature into an RFM feature. (2) How is the performance of the K-Means algorithm in classifying existing customers in the E-Commerce Olist data.
The purpose of this research are: (1) Get the results of feature construction from the E Commerce Olist dataset feature in the form of the RFM feature. (2) Knowing the performance of the K-Means algorithm in classifying existing customers in the E-Commerce Olist data.

2
Research Method / Proposed Method Here, we are using Extract, Transform and Load Method in executing RFM Analysis. First, we extract the data from data source which consists of 9 tables. Then, we transform the data by merging the tables and perform data understanding. Then, we load the data to Cross-Industry Standard Process for Data Mining (CRISP DM) methods to perform data mining, such as: business understanding, data understanding, data preparation, modeling, evaluation and or deployment. The detailed procedure can be seen as follows:

Data Source
The data used in this research topic is secondary data taken from the Kaggle website (https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce) about performance data in an E-Commerce company in Brazil created by Olist . This data has information on one hundred thousand customer orders from 2016 to 2018. In this study there are 9 types of table data, namely payment type data, product data, purchase review data, purchase data, purchase item data, seller data, location data, product category name, and customer data.

Data Analysis Technique
At this stage the author writes down several steps used in analyzing the data held in order to get a solution to the research problem that the author is doing. Data analysis is carried out as follows:

Data Merger
At this stage, the author selects the required data and collects data by merging data from several tables into 1 dataset table with an inner join. The following is the process of joining data:

Data Understanding
In the data understanding process, the authors carry out the Exploratory Data Analysis (EDA) process by conducting descriptive statistics, correlation analysis, and visualization which aims to ensure that the existing data has no problematic data and find patterns and insights that can be used to develop strategies in modeling later.

Data Selection
At the data selection stage, the author selects attributes or features that are in accordance with the method proposed by the author, namely the RFM method. The following data features were selected by the authors:

Data Preprocessing
The data that has been selected and then carried out a preprocessing process includes:

Data Cleaning
Perform data selection through checking for missing values, checking for duplicate values, checking data types, and removing irrelevant features to be used in the model.

Feature Engineering
The feature engineering process is the process of creating new features or adding features. Indicates the data on the number of customer purchases or the number of purchase orders made by the customer.

Monetary
Indicates total customer expenditure data to purchase products.

Standardization
The standardization process is the process of changing the feature values. The standardization results have a mean value of 0 and a standard deviation of 1.

Proposed Method 2.3.1 Model Training & Evaluation Model
At this stage, the examiner uses data on the time of purchase in 2016 and 2017 as training data, while for the test data uses data on the time of purchase in 2018. Then the training data will be carried out in the training process using the K-Means clustering algorithm by conducting several trials of the number of clusters. To get the best number of clusters, you can use the evaluation of the Elbow Method and Silhouette Score models. After getting the best number of clusters, a training process will be carried out using a predetermined number of clusters and storing the model for use in the testing phase.

Testing Stage
At this stage, data on the time of purchase in 2018 was used and made predictions using the previously stored model to get the cluster label.

Data Merger
Merge data with the inner join method, according to the data needed to become one dataset. Obtained 115878 records which consist of 22 columns which include 9 features with numeric data type and 13 features with object / string data type. The following is an example of how to combine data and the results of data merging using the following coding:  Year of product purchase made by customer

Data Understanding
Data understanding is a process to find out information from a data that is owned. To find out the data information, several visualizations were made to gain insight as follows:

Average monthly active users and number of new customers
In this stage, visualization will be made to find out information about the average monthly active users and the number of new customers per year.  From this analysis, it is seen that monthly customer activity and also the number of new customers have increased.

Number of Customers Making Repeat Orders
At this stage the researcher wants to know data about the number of customers who make repeat orders every year.

Total product category revenue per year
At this stage the researcher wants to find out data information about what product category names have the highest amount of income each year  From the visualization above, it can be seen that the product category that provides the highest amount of revenue every year is always changing. Viewed from the side of the company's overall income has also increased every year.

Number of product category cancellations per year
At this stage the researcher wants to find out information about the name of the product category that has the highest number of cancellations of product purchases per year.  From the visualization above, each year there are different product categories that experience purchase cancellations. However, there is an interesting thing that in 2018 the health beauty product category was the name of the product category that provided the most revenue as well as the name of the product category that experienced the highest number of product purchase cancellations in 2018.

Favorite payment type per year
At this stage the researcher wants to know the types of payment types that are often used by customers to buy products at E-Commerce companies per year.  From the visualization above, it is found that the payment method that is more attractive to customers in the process of purchasing products at an E-Commerce company is a credit card.

Data Selection
After visualizing the data to be able to understand the content of information in the data held, then selecting the appropriate features to create RFM features. The following is an example of the results of the feature selection used:

Data Preprocessing
Data preprocessing is carried out before the modeling process so that the data makes the data more structured, here are some processes for conducting data preprocessing:

Data Cleaning
Data cleaning is the process of cleaning data from data that is incorrect, incomplete, etc.

Feature Engineering
Feature Engineering is the process of making existing features into RFM features. For making RFM, you can use the following coding:

Standardization
Standardization is the process of changing feature values which have a mean value of 0 and a standard deviation of 1.

Modelling Process
At this stage the researchers tried to make a model using the K-Means clustering algorithm with the number of clusters from 2 to 9. Then evaluated using a silhouette score to get the optimal number of clusters. Here it is found that the number of clusters is 4 clusters with a value of 0.470. Figure 9 Silhouette Score K-means Clustering

Conclusion
Based on the results of research conducted by the author on the Implementation of ETL E-Commerce For Customer Clustering Using RFM And K-Means Clustering, are as follows: 1. The result of feature construction from the E-Commerce Olist dataset feature in the form of the RFM feature has been successfully executed with the findings obtained in the form of the value on the frequency feature at most is 1, so it can be interpreted that most of the customers only make a product purchase once a year. 2. The performance of the K-Means clustering algorithm proves that this algorithm can be applied to segment customers. This is evidenced by the results of the silhouette score evaluation with a value of 0.470 with a cluster value of 4 clusters.