Implementation of Equal-Width Interval Discretization in Naive Bayes Method for Increasing Accuracy of Students' Majors Prediction

Selecting majors for students is a positive step that focuses students in accordance with their potential. It is considered important because, with majors, students are expected to develop their academic ability in their field of interest. In previous research, the Naive Bayes method was tested to classify students' majors based on supporting criteria in a case study of Private Madrasah Aliyah PAB 6 Helvetia students, and the accuracy of the test on 100 student data was 90%. In this study, the researchers developed the previously used method by applying equal-width interval discretization, which transforms numerical or continuous criteria into categorical criteria with a predetermined value of k. Different k values were tested to find the best accuracy. From the 120 student data tested, the classification results show that applying equal-width interval discretization to the Naive Bayes method with k = 8 performs better and increases the accuracy from 91.7% to 93.3%.


INTRODUCTION
The role of education is very important in supporting the development of technology, which has penetrated almost all areas. It also affects the determination of majors for high school or equivalent students. Determining a student's major is a process of focusing the student on a particular field of interest, so that each student can learn more in the subjects that match the concentration specified for that student. The problem is that the current system at the private school Madrasah Aliyah PAB 2 Helvetia Medan, where the researchers conducted this research, is not entirely effective: students are given a questionnaire to determine which majors they are interested in, regardless of other criteria that may play a part in determining their eligibility for a major. Determining majors is an important step in preparing students to concentrate on the field they are interested in when they continue to the next level of education.

In previous research, the researchers mined information on the determination of student majors using the Naive Bayes method. That research tested 100 student data based on several criteria, including the average score of natural science subjects, the average score of social science subjects, the classroom teacher's recommendation, and the value of the questionnaire filled in by the students concerned. From the 100 data tested using the Naive Bayes method, the accuracy of determining student majors was 90%, with an error of 10% [1].

The Naive Bayes method was chosen because it has been widely implemented in various fields of science. For example, in the research of Xingxing Zhou (2016), the Naive Bayes method was used to classify images in order to improve the accuracy of brain diagnosis using NMR images, where a sensitivity of 94.5%, a further reported value of 91.70%, and an overall accuracy of 92.60% were obtained [2]. Naive Bayes is one of the top ten (10) data mining algorithms because of its simplicity and efficiency, as evidenced by its performance in classifying text [3], [4]. In addition, Naive Bayes is widely recognized as a simple and effective probabilistic classification method [5]-[7], and its performance is comparable to or better than that of the decision tree [8] and artificial neural networks [9]. However, the researchers wanted to extend their previous research by applying Unsupervised Discretization [10] to improve the performance of the Naive Bayes method, so that the prediction accuracy could increase compared with the previous result. Unsupervised Discretization techniques perform very well in transforming numerical criteria/attributes [11].

Naïve Bayes
Naive Bayes is a model-based classification method that offers competitive classification performance compared with other data-driven classification methods [12]-[15], such as neural networks, support vector machines (SVM), logistic regression, and k-nearest neighbors. Naive Bayes applies Bayes' theorem with the "naive" assumption that any pair of features is independent given the class. The classification decision is made using the maximum a posteriori (MAP) rule. Three distribution models, namely the Bernoulli, multinomial, and Poisson models, have commonly been incorporated into the Bayesian framework, resulting in the Bernoulli naive Bayes (BNB), multinomial naive Bayes (MNB), and Poisson naive Bayes (PNB) classifiers, respectively [4]. The formula of Bayes' theorem is [16]:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (1)$$

where X is the data with an unknown class, H is the hypothesis that the data belong to a specific class, P(H|X) is the probability of hypothesis H given condition X (posterior probability), P(H) is the probability of hypothesis H (prior probability), P(X|H) is the probability of X given the conditions in hypothesis H, and P(X) is the probability of X. For the Naive Bayes method, the formula above is adjusted as follows:

$$P(C \mid F_1, \ldots, F_n) = \frac{P(C)\,P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)} \qquad (2)$$

where C represents the class and F1 ... Fn represent the characteristics used for the classification process. The above formula can also be written simply as follows:

$$\text{Posterior} = \frac{\text{Prior} \times \text{Likelihood}}{\text{Evidence}} \qquad (3)$$
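The MAP rule described in this section can be illustrated with a minimal Python sketch for categorical features. It is not the authors' implementation: the function names `train` and `predict` and the four-row toy data set are hypothetical, and no smoothing is applied, so an unseen feature value yields a probability of zero.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, class)][value] -> number of training rows
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(row, class_counts, value_counts):
    """Return the class with the largest P(C) * prod_i P(F_i | C)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(C)
        for i, value in enumerate(row):
            score *= value_counts[(i, label)][value] / n_label  # P(F_i | C)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): (recommendation, questionnaire) -> major
rows = [("science", "science"), ("science", "social"),
        ("social", "social"), ("social", "social")]
labels = ["science", "science", "social", "social"]
model = train(rows, labels)
best, scores = predict(("science", "science"), *model)
print(best, scores)
```

The evidence term P(F1, ..., Fn) is omitted because it is the same for every class and therefore does not change which class attains the maximum.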

Unsupervised Discretization
Discretization is the process of converting the values of a continuous attribute into a limited number of intervals and associating each interval with a discrete numerical value. The discretization process is carried out before the learning process [17]. Unsupervised Discretization includes several simple methods (equal-width interval discretization and equal-frequency interval discretization) and more sophisticated ones based on clustering analysis, such as k-means discretization. The continuous range is divided into subranges by a user-specified width or frequency [18]. In this study, the researchers used the equal-width interval discretization technique, which is the simplest discretization method: it divides the observed range of values of each feature/attribute into k intervals of equal size. The process involves sorting the observed values of the continuous feature/attribute and finding the minimum (Vmin) and maximum (Vmax) values. The interval width is calculated by dividing the observed range of the variable into k parts of the same size using the following formula [18]:

$$\text{Interval} = \frac{V_{max} - V_{min}}{k} \qquad (4)$$

$$\text{Boundaries} = V_{min} + (i \times \text{Interval}) \qquad (5)$$

The boundaries can then be constructed for i = 1 ... k-1 using the above equation. This type of discretization does not depend on multi-relational data structures. However, this discretization method is sensitive to outliers, which can drastically skew the range. A further limitation of this method comes from the uneven distribution of data points: some intervals may contain more data points than others.
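The equal-width procedure described above can be sketched in Python as follows. The function names are hypothetical, and the range used in the example (Vmin = 70.0, Vmax = 85.3) is an assumption: the text does not state the observed minimum and maximum, and these values were chosen only because they give an interval width of 1.9125 for k = 8, which matches the boundaries reported later for the exact subjects. How a value lying exactly on a boundary is assigned is also an assumption.

```python
from bisect import bisect_right


def equal_width_boundaries(v_min, v_max, k):
    """Return the k-1 inner cut points of k equal-width intervals."""
    width = (v_max - v_min) / k
    return [v_min + i * width for i in range(1, k)]


def category(value, boundaries):
    """Map a value to a 1-based interval index.

    Assumption: a value exactly on a cut point goes to the upper interval.
    """
    return bisect_right(boundaries, value) + 1


# Assumed range, not stated in the text (see the note above).
cuts = equal_width_boundaries(70.0, 85.3, 8)
print([round(c, 4) for c in cuts])
print(category(84.0, cuts))  # a value above the last cut point
```

Because only Vmin, Vmax, and k are needed, the cut points can be computed once on the training data and then reused to categorize new records.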

Research Stages
In the Naïve Bayes method, categorical (string) data are treated differently from continuous numerical data. This difference appears when determining the probability value of each criterion, depending on whether the criterion has string values or numeric values. The stages of applying the Naive Bayes method in this study can be seen in Figure 1 below.

Data Collection
The training data are the academic data of the students who served as respondents. The sample consists of 120 student records, which contain the students' scores in Mathematics, Physics, Chemistry, Biology, Economics, Geography, History, and Sociology, the questionnaire filled in by the students, and the recommendation from the homeroom teacher.

Data Cleaning
After the data cleaning process, the data used in this research are the average score of exact subjects, the average score of non-exact subjects, the recommendation from the homeroom teacher, and the questionnaire filled in by the students.
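As a rough illustration of this step, the sketch below reduces one student's subject scores to the two numerical criteria. The grouping of subjects into exact and non-exact is an assumption (the text lists the eight subjects but does not state the grouping explicitly), and the function name `summarize` and the student record are hypothetical.

```python
from statistics import mean

# Assumed grouping: not stated explicitly in the text.
EXACT = ("Mathematics", "Physics", "Chemistry", "Biology")
NON_EXACT = ("Economics", "Geography", "History", "Sociology")


def summarize(scores):
    """Reduce one student's subject scores to the two numerical criteria."""
    return {
        "exact_avg": mean(scores[s] for s in EXACT),
        "non_exact_avg": mean(scores[s] for s in NON_EXACT),
    }


# Hypothetical student record, for illustration only.
student = {"Mathematics": 80, "Physics": 78, "Chemistry": 82, "Biology": 84,
           "Economics": 75, "Geography": 77, "History": 73, "Sociology": 79}
criteria = summarize(student)
print(criteria)
```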

Determining the Criteria
The criteria used, based on the collected data, are shown in table 1. There are four (4) criteria used in this research, namely the average score of exact subjects, the average score of non-exact subjects, the recommendation, and the questionnaire. Two (2) of them are numerical/continuous criteria and two (2) are categorical criteria. To improve the accuracy of the Naive Bayes method, discretization is performed on the numerical/continuous criteria using the unsupervised discretization technique. The goal is to transform the numerical/continuous criteria into categorical criteria using formulas 4 and 5. Table 2 shows the discretization of the numerical/continuous criteria.

Table 2 shows the results of the discretization process using the Unsupervised Discretization technique, where the average scores of exact and non-exact subjects, which are numerical or continuous, are transformed into categorical criteria with 8 categories. For the average score of exact subjects, the first category is below 71.9125, the second is between 71.9125-73.825, the third is between 73.825-75.7375, the fourth is between 75.7375-77.65, the fifth is between 77.65-79.5625, the sixth is between 79.5625-81.475, the seventh is between 81.475-83.3875, and the eighth is above 83.3875.

The average score of non-exact subjects is also divided into 8 categories: the first category is below 71.875, the second is between 71.875-73.75, the third is between 73.75-75.625, the fourth is between 75.625-77.5, the fifth is between 77.5-79.375, the sixth is between 79.375-81.25, the seventh is between 81.25-83.125, and the eighth is above 83.125.

The Probability of Each Criterion
Several criteria have been set as a reference for classifying students' majors using the Unsupervised Discretization technique with the Naive Bayes method. The next step is to determine the probability value of each criterion. As an example, the probability values shown for the average score of exact subjects are those obtained with k = 8.
The probability values for the average score of exact subjects can be seen in table 3.
Table 3. The Probability of the average score of exact subjects with k=8
Average score of exact subjects | Science: students | Science: probability | Social: students | Social: probability
below 71.9125 | 4 | 0.067 | 17 | 0.283
71.9125-73.825 | 3 | 0.05 | 8 | 0.133
73.825-75.7375 | 12 | 0.2 | 12 | 0.2
75.7375-77.65 | 1 | 0.017 | 3 | 0.05
77.65-79.5625 | 2 | 0.033 | 2 | 0.033
79.5625-81.475 | 13 | 0.217 | 9 | 0.15
81.475-83.3875 | 8 | 0.133 | 6 | 0.1
above 83.3875 | 17 | 0.283 | 3 | 0.05

From table 3 above, 60 students were placed in the science studies major and 60 students were placed in the social studies major. Each probability value is the number of students in a category who were placed in a major, relative to the 60 students in that major. For example, 4 students with an average score of exact subjects below 71.9125 were placed in the science studies major, giving a probability value of 0.067, while 17 students in the same category were placed in the social studies major, giving a probability value of 0.283.
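The probability values for the exact-subject criterion can be reproduced from the reported student counts with a short Python sketch. The counts are taken from the text above; the function name `conditional_probabilities` is hypothetical, and rounding to three decimals is assumed to match the reported precision.

```python
# Students per category (1..8) of the exact-subject average, as reported
# for Table 3.
counts = {
    "science": [4, 3, 12, 1, 2, 13, 8, 17],
    "social": [17, 8, 12, 3, 2, 9, 6, 3],
}


def conditional_probabilities(category_counts):
    """P(category | class) = count in category / students in the class."""
    total = sum(category_counts)
    return [round(c / total, 3) for c in category_counts]


probs = {major: conditional_probabilities(c) for major, c in counts.items()}
print(probs["science"])
print(probs["social"])
```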
The probability values for the average score of non-exact subjects with k = 8 are shown in table 4.

Table 4. The Probability of the average score of non-exact subjects with k=8
Average score of non-exact subjects | Science: students | Science: probability | Social: students | Social: probability
below 71.875 | 18 | 0.3 | 3 | 0.05
71.875-73.75 | 10 | 0.167 | 6 | 0.1
73.75-75.625 | 9 | 0.15 | 15 | 0.25
75.625-77.5 | 2 | 0.033 | 1 | 0.017
77.5-79.375 | 0 | 0 | 4 | 0.067
79.375-81.25 | 10 | 0.167 | 11 | 0.183
81.25-83.125 | 8 | 0.133 | 10 | 0.167
above 83.125 | 3 | 0.05 | 10 | 0.167

From table 4 above, 60 students were placed in the science studies major and 60 students were placed in the social studies major. For example, 18 students with an average score of non-exact subjects in the first category were placed in the science studies major, giving a probability value of 0.3, while 3 students in the same category were placed in the social studies major, giving a probability value of 0.05. No student in the fifth category was placed in the science studies major, so the corresponding probability value is 0.
The probability values for the recommendation criterion can be seen in table 5. The data cover 120 students who had been given a recommendation by their previous homeroom teacher; 60 students were placed in the science studies major and 60 students were placed in the social studies major. Based on these data, 59 students who were recommended for the science studies major were placed in the science studies major, while 1 student who was recommended for the social studies major was placed in the science studies major. Furthermore, 9 students who were recommended for the science studies major were placed in the social studies major, while 51 students who were recommended for the social studies major were placed in the social studies major. Thus, the probability of students being recommended for the science studies major and placed in the science studies major is 0.967, while the probability of students being recommended for the social studies major but placed in the science studies major is 0.033. The probability of students being recommended for the science studies major but placed in the social studies major is 0.15, and the probability of students being recommended for the social studies major and placed in the social studies major is 0.85.
The probability values for the questionnaire criterion can be seen in table 6.

Table 6. The Probability of the Questionnaire criteria with k=8

The data cover 120 students who had been given questionnaires; 60 students were placed in the science studies major and 60 students were placed in the social studies major. Based on these data, 50 students who chose the science studies major were placed in the science studies major, while 10 students who chose the social studies major were placed in the science studies major. In addition, 9 students who chose the science studies major were placed in the social studies major, while 51 students who chose the social studies major were placed in the social studies major. Thus, the probability of students choosing the science studies major and being placed in the science studies major is 0.833, and the probability of students choosing the social studies major but being placed in the science studies major is 0.167. The probability of students choosing the science studies major but being placed in the social studies major is 0.15, while the probability of students choosing the social studies major and being placed in the social studies major is 0.85.
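To show how the four probability tables combine into a prediction, the following Python sketch evaluates the Naive Bayes score for one hypothetical student. The student profile is an assumption made for illustration: an exact-subject average in the eighth category, a non-exact average in the second category, a recommendation for science, and a questionnaire choice of science. The probability values are the ones reported above, and the priors follow from the 60/60 split of the 120 students.

```python
from math import prod

# Reported class-conditional probabilities for the hypothetical student,
# in the order: exact average, non-exact average, recommendation,
# questionnaire.
likelihoods = {
    "science": [0.283, 0.167, 0.967, 0.833],
    "social": [0.05, 0.1, 0.15, 0.15],
}
priors = {"science": 60 / 120, "social": 60 / 120}

# Unnormalized posterior: P(C) * product of P(F_i | C).
scores = {c: priors[c] * prod(likelihoods[c]) for c in priors}

# Dividing by the evidence turns the scores into posterior probabilities.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}
prediction = max(scores, key=scores.get)
print(prediction, posteriors)
```

For this profile every criterion favors the science studies major, so the science score is the larger one. A category with a probability of 0, such as the fifth non-exact category for science, would force the whole product for that class to 0.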

Result and Discussion
To check the consistency of equal-width interval discretization in the Naive Bayes method, it was tested on several sample sizes. The results of testing the implementation of unsupervised discretization in the Naive Bayes method using 60 sample data can be seen in table 7.

From the test results using 90 sample data, the application of the equal-width interval discretization technique to the Naive Bayes method with k = 4 classified the data with 90% accuracy. For k = 6, the accuracy was 92.5%; for k = 8, the accuracy was 93.3%; and for k = 10, the accuracy was 92.5%. Testing was also done with 120 data; the test results can be seen in table 9 below.

Table 9. Testing Results with 120 data
From the test results using 120 sample data, the application of the equal-width interval discretization technique to the Naive Bayes method with k = 4 classified the data with 90% accuracy. For k = 6, the accuracy was 92.2%; for k = 8, the accuracy was 93.3%; and for k = 10, the accuracy was 88.9%.
The graph of the test results can be seen in Figure 2 below. Figure 2 shows the results of testing the application of equal-width interval discretization to the Naive Bayes method in predicting the suitability of students' majors. In the test with 60 sample data, k = 10 gave the best result, with 58 data classified correctly. In the test with 90 sample data, the best classification result was obtained with k = 8, with 84 data classified correctly. In the last test, with 120 sample data, the best result was also obtained with k = 8, with 112 data classified correctly.
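The accuracy figures for the best runs follow directly from the reported counts of correctly classified data, as the short Python sketch below shows. The function name `accuracy` is hypothetical; the counts and the per-k accuracies for the 120-data test are the values reported in the text.

```python
def accuracy(correct, total):
    """Percentage of correctly classified records, rounded to one decimal."""
    return round(100 * correct / total, 1)


# (best k, correctly classified data) for each sample size, as reported.
best_runs = {60: (10, 58), 90: (8, 84), 120: (8, 112)}
for n, (k, correct) in best_runs.items():
    print(f"n={n}, k={k}: {accuracy(correct, n)}%")

# Selecting the best k from the accuracies reported for the 120-data test.
reported_120 = {4: 90.0, 6: 92.2, 8: 93.3, 10: 88.9}
best_k = max(reported_120, key=reported_120.get)
print(best_k)
```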

Conclusion
The conclusion of this study is that applying Unsupervised Discretization to the Naive Bayes method has a considerable impact on the test results, where the criteria used for testing are the average score of exact subjects, the average score of non-exact subjects, the recommendation data, and the student questionnaire data. Applying Unsupervised Discretization, specifically equal-width interval discretization, to the Naive Bayes method in predicting the suitability of student majors increased the accuracy from 90% in the previous study to 93.3%.

Acknowledgments
The researchers would like to thank the Ministry of Research, Technology and Higher Education of the Republic of Indonesia (KEMENRISTEKDIKTI), which has supported this research morally and financially.

Figure 2. The test results of Unsupervised Discretization Implementation on the Naive Bayes method

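Table 2. The results of Discretization with k=8
Category | Average score of exact subjects | Average score of non-exact subjects
1 | below 71.9125 | below 71.875
2 | 71.9125-73.825 | 71.875-73.75
3 | 73.825-75.7375 | 73.75-75.625
4 | 75.7375-77.65 | 75.625-77.5
5 | 77.65-79.5625 | 77.5-79.375
6 | 79.5625-81.475 | 79.375-81.25
7 | 81.475-83.3875 | 81.25-83.125
8 | above 83.3875 | above 83.125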

Table 5. The Probability of the recommendation criteria with k=8

Table 7. Testing Results with 60 data

From the test results using 60 sample data, the application of the equal-width interval discretization technique to the Naive Bayes method with k = 4 classified the data with an accuracy of 91.7%. For k = 6, the accuracy was 91.7%; for k = 8, the accuracy was 93.3%; and for k = 10, the accuracy was 96.7%. Testing was also done with 90 data; the test results can be seen in table 8 below.

LONTAR KOMPUTER VOL. 9, NO. 2, AUGUST 2018, p-ISSN 2088-1541, e-ISSN 2541-5832, DOI: 10.24843/LKJITI.2018.v09.i02.p05, Accredited B by RISTEKDIKTI Decree No. 51/E/KPT/2017

Table 8. Testing Results with 90 data