Medical Costs Estimation Using Linear Regression Method

Medical costs are a significant issue in the health sector. High healthcare cost lead to the need to anticipate financial risks for individuals and insurance providers. Therefore , medical cost data analysis is necessary to estimate future medical expenses. This research implements data mining techniques using Simple and Multiple Linear Regression methods to estimate medical costs. The dataset used consists of insurance claim data obtained from Kaggle, which includes attributes such as age, gender, body mass index, number of children, smoking habits, region


Introduction
Medical costs are a significant problem in the health sector.High medical costs lead to the need to anticipate financial risks, both for individuals or insurance providers.This can be a heavy burden for patients and their families, especially if they do not have insurance or the insurance does not cover all of the medical expenses.Knowing the possibility of future healthrelated expenses can help patients to choose the appropriate insurance and premium plans [1].On the other hand, for hospitals and insurance companies, high medical costs can also be a heavy burden in planning budgets and determine the right insurance premiums.[2].Therefore, it JURNAL ILMIAH MERPATI VOL.11, NO. 3 DECEMBER 2023 p-ISSN: 2252-3006 e-ISSN: 2685-2411 is necessary to analyze health-related insurance cost data to estimate future medical costs more accurately.
To answer this problem, this study used insurance claim data available through external sources such as Kaggle.This dataset includes attributes such as age, gender, body mass index (BMI), number of children, smoking habits, region, and medical charges.These attributes provide essential information needed to analyze and predict medical costs.By using data mining techniques, it is possible to analyze the data that has been collected to identify patterns or information that is useful in predicting the medical cost in hospitals.
Several previous studies have applied data mining techniques such as Linear Regression [3], Generalized Linear Models (GLM) [4], Random Forest [5], Support Vector Machine [6] and Neural Network (NN) [7] algorithms to make an estimates.The linear regression method can be used to estimate in various fields, such as estimating software development costs [8], health benefit costs for outpatient treatment of employees [9], motorcycle sales [10], and estimating the availability of landfills [11].In estimating software development costs, it shows that the linear regression model yielded a R2 value of 0.9476 and has a smaller MMRE value compared to other methods in previous research [8].The Linear Regression Algorithm also shows good performance in research on social media ads data insights to estimate ad duration.The Linear Regression Algorithm yields a RMSE (Root Mean Square Error) evaluation value of 1,690, which is better than Neural Networks, Support Vector Machines, Generalized Models and Deep Learning.[12].It appears that you are debating a research that compares different approaches and performance measures in the field of medical cost estimation.XGBoost, Decision Tree, Random Forest Regressor, Multiple Linear Regression, Support Vector Regression, Ridge Regressor, Stochastic Gradient Boosting, and k-Nearest Neighbors were among the techniques assessed in the study.Based on the presented data, Linear Regression yielded an accuracy of 74% and a Root Mean Square Error (RMSE) score of 0.479808.[13].
This study estimates medical costs using a linear regression method.Linear regression is a statistical technique used to model the linear relationship between the dependent variable and the related independent variables [14].By applying linear regression, models are expected to be developed that can describe the mathematical relationship between variables such as age, gender, body mass index (BMI), number of children, smoking habits, and region as well as predicting medical expenses with a high degree of accuracy.This approach is based on the assumption that the relationship between the dependent and independent variables is linear, and by using linear regression techniques, the best regression line that represents the relationship pattern can be found.Therefore, this article aims to apply the data mining method using linear regression in estimating medical costs.

Research Method
Research on estimating medical costs using linear regression method consists several stages start with performing data preprocessing, then dividing the dataset into train data and test data, creating the linear regression model, and performing evaluation to know the performance of the model.The stages carried out in this study are described in the flow shown in Figure 1.

Data Preparation
The data set used in this study is a health insurance claims dataset obtained from the Kaggle with a total of 1338 data with 7 attributes that are interconnected with one another.

Preprocessing Data
The dataset that has been obtained will be pre-processed before being analyzed with the model.In this stage, pre-processing data is performed by converting categorical data into numerical data so that it can make correlations of all the attributes in the dataset.

Data Correlation
In this process, data analysis is carried out before moving to the modeling phase.This stage aims to understand the distribution of existing data in the dataset and see the correlation between existing attributes.After preprocessing to convert categorical data into numeric data, then from this data we will be able to see correlations between columns before proceeding to modeling.

Splitting Data
In this study, the dataset will be divided into 80% for training data and 20% for testing data.The value of the dependent variable is determined by applying a certain algorithm to the training set.Then, the algorithm that has been trained using the training data will be used to predict the value of the dependent variable in the testing data using the existing independent variables.The prediction results of the dependent variable will be compared with the original values from the testing data, and which has the minimum error is selected as the estimation results.

Linear Regression Models 2.2.1 Simple Linear Regressions
Simple linear regression is a simple straight line model that connects an independent variable (X) with the dependent variable (Y) [15].Therefore, this model can be explained using the mathematical equation of straight lines as in the equation 1. (1)

Referring to the equation 1,
or intercept is the point on the y-axis that is passed by the line, whereas or the slope is the change in the number on the y-axis for every change of 1 unit on the x-axis.To take into account variability or uncertainty that cannot be described by an independent variables, error term (ε) is added, so that the linear regression equation can be written as follows on equation 2. (2) In the estimation model of medical expenses using simple linear regression, variable Y represents medical expenses (charges) and X is age (age).The intercept ( ) value is an estimated of charges when the patient's age is 0, the slope ( ) shows the average change in the charges for each additional 1 year of age.Using the given age value (X), it is possible to calculate the estimated cost of treatment based on the linear relationship between age and medical costs.

Multiple Linear Regressions
Multiple Linear Regressions is a regression method that uses more than one independent variable [16].Multiple linear regression equations are used in combination in the following equation 3. and understand the relative contribution of each independent variable to medical costs.The estimation results obtained are then presented in a scatter plot, which provides a visual picture of the relationship between charges and the independent variables.

Evaluation Metrics
The estimation results in this study evaluated using R-Squared (R2), Mean Absolute Error (MAE), and Mean Squared Error (MSE) to measure the success of the simple and multiple linear regression models in estimating medical costs.R2 is used to measure the extent to which variations in the dependent variable (medical costs) can be explained by the independent variables in the model.Meanwhile, the Mean Absolute Error (MAE) and Mean Squared Error (MSE) describe the difference between the predicted value and the actual value of medical expenses.These three methods can be explained by the following formula in equations 4, 5 and 6. ∑ The equations 4, 5 and 6 are the equations used to evaluate the estimation results of the linear regression model.
is the actual value of the dependent variable used in the calculation process at time i, is the predicted value of the dependent variable given by the linear regression model, ̅ is the average value of the dependent variable in the dataset, while n is the number of observations in the dataset.

Literature Study 3.1 Health Insurance Costs
The Affordable Care Act (ACA) of 2010, which altered the markets and plans for health insurance that impact the great majority of American individuals and families, has made health insurance a topic of discussion.To improve access to care, millions of Americans-many of whom were previously uninsured-are anticipated to sign up for health insurance programs.Benefits for those who already own insurance will alter (Patient Protection & Affordable Care Act, 2010) [17].
The imbalance of life and challenges to health problems cause individuals to need to prepare early to ensure they get the opportunity to get the best health services.There are still some people who are still not aware that health and medical insurance is a policy that covers private medical care costs which may be expensive such as hospitalization and surgery costs.Through medical health insurance, treatment costs and other medical expenses due to disability and chronic illness will be transferred to the insurance company.

Data Mining
Extensive datasets are analyzed using data mining to find patterns and correlations that may be used to solve problems.It's a branch of science where knowledge found in data is explored.Through this method, businesses may improve their decision-making and forecast future trends.Finding patterns, anomalies, and correlations using data mining enables the creation of useful insights.Data collection, which is the first step in the process, involves gathering and organizing relevant information.Data quality is ensured by data preparation, which includes procedures like data pre-processing and cleaning.Before examining the complete dataset, some algorithms are run over sample data.After that, the data is evaluated and processed to produce analytical models that will guide future company choices.Tasks like estimate, prediction, classification, grouping, and association are all included in data mining.

Estimation
Estimation is the process of optimizing the actual state of nature.In general, estimation is related to model building, namely finding the parameters that best describe the multivariate distribution of historical data.In estimation, the goal is to estimate population parameter values by working with samples [19].Ideally in estimation one can consider all possible samples according to a given sampling strategy and obtain a probability distribution.The probability distribution is nothing but a mapping between a list of all possible outcomes and their probabilities.

Linear Regressions
Linear regression is a data analysis technique that predicts unknown data values.Linear regression mathematically models unknown variables or dependent variables and known variables or independent variables as linear equations [20].For example, suppose one has data about last year's expenses and income.A linear regression technique analyzes this data and determines that expenses account for half of income.Next they calculate the unknown future expenditure by dividing the known future income by two

Result and Discussion
This section discusses the trials and results of implementing the linear regression method in estimating medical costs.Implementation in estimation using two linear regression models namely Simple Linear Regression and Multiple Linear Regression is as follows.

Data Preparation
As previously explained, the data used in this study is an insurance claims dataset consisting of 1338 data with 7 attributes.The following is a description of the dataset.The preprocessing stage is carried out to convert categorical data into numeric data on several attributes such as sex, smoker, and region.The results of preprocessing data can be seen in Table 2.
Next, a correlation analysis is performed to understand the relationship between the attributes in the dataset, after which all the columns are converted into numeric data.By identifying the correlation between variables, it can be determined which attributes have a significant effect on medical expenses and become the main predictor factor in the linear regression model.Figure 2 is a visualization of data correlation using HeatMap.This correlation can be observed from the color in each column with the other columns.If it gets lighter, the columns show a stronger relationship with each other, whereas if it gets darker, then the relationship between the columns has a weaker one.From the correlation analysis, it can be seen that the "age" attribute has the strongest correlation with the "charges" attribute.

Simple Linear Regression
The results of the analysis using the Simple Linear Regression method with the independent variable age (age) and the dependent variable medical costs (charges) are represented in the form of a scatter plot graphic to visualize the relationship between the two variables.Figure 3 shows a plot of the results from Simple Linear Regression.  3. Based on the evaluation results, it can be seen that the Simple Linear Regression model has a low R-squared (R2), this indicates the model's ability to explain variations in the data is low.In addition, this model produces high MSE and MAE, indicating that the level of deviation between the prediction and the actual value is also high.Thus, this model may not be sufficiently good at predicting medical costs based on age alone.

Multiple Linear Regression
The estimation of medical expenses then uses the Multiple Linear Regression method to explain the relationship between several independent variables and the dependent variable, namely "charges".The results of the analysis are represented in the form of a scatter plot graph with a red diagonal line which shows the same value between the actual value and the estimated predicted result.Figure 4 displays a plot of the results of Multiple Linear Regression.

Figure 4. Multiple Linear Regression Plots
Figure 4 is a plot of results using multiple linear regression which explains the relationship of independent variable with the dependent variable, namely charges.The red diagonal line in the plot shows the same value between the actual value and the estimate generated by the model.If the model estimation is very accurate, the points on the scatter plot will be very close to the diagonal line.It can be seen that most of the data points tend to follow a diagonal line pattern, this shows that the Multiple Linear Regression model tends to provide fairly good predictions.However, there are several data points that are spread far enough from the diagonal line, which indicates an error in the prediction for certain cases.This is also shown by the results of the model evaluation in From the evaluation table above, it can be seen that the Multiple Linear Regression model has a high R-squared (R2), indicating the model's ability to explain variations in good data.In addition, this model produces lower MSE and MAE than the previous model.Thus, this model is better at predicting medical costs by combining several independent variables, such as age, gender, body mass index, number of children, smoking habits, and region.
Based on the evaluation that has been done, it can be concluded that Multiple Linear Regression provides better performance compared to Simple Linear Regression with higher Rsquared (R2) values, and lower MSE and MAE.However, the MSE and MAE values in this model are still quite high, which indicates that there are still significant limitations and deviation levels in some predictions.Therefore, for estimating medical costs based on the dataset used, the Multiple Linear Regression model is more recommended than Simple Linear Regression.

Conclusion
Data mining is the process of extracting useful knowledge from large and complex data sets.In this study, data mining is used to estimate medical costs.Medical cost are a significant issue in the health sector, so an accurate estimate can assist in budget planning and choosing the right insurance plan.This study uses the Linear Regression approach to estimate medical costs.The health insurance claims dataset was obtained from the Kaggle website and used for training and testing using the Simple Linear Regression and Multiple Linear Regression methods.The stages of this analysis include data preprocessing, data separation, regression and evaluation.The research results show that Multiple Linear Regression is recommended for estimating the dataset used, by obtaining an R2 value of 79.99% compared to Simple Linear Regression which can only explain the relationship of one attribute and obtaining an R2 value of 7.86% with high MSE and MAE.The application of linear regression in the analysis of insurance claims data can provide significant benefits for patients, hospitals and insurance companies.Overall, this study shows that data mining using the linear regression method is an effective approach to estimating medical costs.

Figure 3 .
Figure 3. Simple Linear Regression PlotsFigure3shows the results of test data and train data from Simple Linear Regression using a scatter plot.From this figure it can be seen that there is a significant relationship between data age and medical cost data, as can be seen from the distribution of the points on

Table 3 .
Evaluation Results of the Simple Linear Regression Model

Table 4 .
Evaluation Results of the Multiple Linear Regression Model