The Classification of Acute Respiratory Infection (ARI) Bacteria Based on K-Nearest Neighbor

Acute Respiratory Infection (ARI) is an infectious disease. One of the performance indicators of infectious disease control and handling programs is disease discovery. However, discovery is often hampered by the limited number of medical analysts, the large number of patients, and differences in the analysts' experience in identifying bacteria, so that examinations take a relatively long time. To address these problems, an automatic and accurate system for classifying the bacteria that cause ARI was created. The research process consists of image preprocessing (color conversion and contrast stretching), segmentation, feature extraction, and KNN classification. The features used are bacterial count, area, perimeter, and shape factor. The best ratio of training data to test data is 90%:10% of 480 images. The KNN method classifies the bacteria well: the highest accuracy is 91.67%, with a precision of 92.4% and a recall of 91.7%, obtained with three values of K, namely K = 3, K = 5, and K = 7.


Introduction
Acute Respiratory Infections (ARI) are among the top ten infectious diseases whose incidence (disease prevalence) and mortality (a measure of the number of deaths in a population) are quite high worldwide [1]. ARI is divided into two groups, namely upper respiratory tract infections (URTIs) and lower respiratory tract infections (LRTIs). The upper respiratory tract consists of the ears, nose, and throat, while the lower respiratory tract consists of the trachea, bronchi, bronchioles, and lungs [2]. Examples of ARI diseases caused by bacteria are pneumonia, tuberculosis (TB), diphtheria, and pharyngitis [3].
Pneumonia is an infectious disease in which the lungs become inflamed. The causative pathogens (bacteria) are Streptococcus pneumoniae, Staphylococcus aureus, Haemophilus influenzae, Mycoplasma pneumoniae, Chlamydophila pneumoniae, and Legionella pneumophila [4]. Tuberculosis (TB) is one of the serious health problems in Indonesia. TB is an infection of the lower respiratory tract caused by Mycobacterium tuberculosis. Diphtheria is an acute infectious disease caused by Corynebacterium diphtheriae, which attacks the upper respiratory tract [2]. In East Java, the number of reported diphtheria sufferers has increased from year to year, reaching 358 cases in 2019 [5]. In addition, Neisseria gonorrhoeae is a bacterial pathogen that causes pharyngitis [4], which usually occurs without symptoms (asymptomatic) in sexually transmitted diseases (STDs) [3].
The performance indicators of infectious disease control and handling programs are discovery, treatment, and treatment success [5]. Generally, discovery is carried out by taking a specimen or sputum from the patient and examining it under a microscope. However, problems often occur: the number of medical analysts is limited, the number of patients is large, analysts differ in perception and experience when identifying bacteria in sputum/throat sputum samples, and the examination takes a relatively long time. To address these problems, the researchers created an automatic and accurate bacterial classification system for the early detection of acute respiratory infections (ARI).
Several earlier studies on identifying the bacteria that cause pneumonia and tuberculosis serve as references for this work. In 2016, a Streptococcus pneumoniae detection system was built from digital microscope images with an accuracy of 80% [6]. Bacterial segmentation was then developed using the Channel Area Thresholding (CAT) method, so that the system was able to identify bacilli with an accuracy of 97.58% on a sputum image dataset [8]. Meanwhile, Mycobacterium tuberculosis was identified in 2015 using image segmentation and the K-Means clustering method [7]. A later study compared two classification methods, backpropagation and K-Nearest Neighbor (KNN), and obtained an accuracy of 93.22% for backpropagation and 94.92% for KNN [9].
Based on the references above, this research uses the K-Nearest Neighbor (KNN) method. KNN is a general and straightforward classification method; because this is an early stage of research on ARI bacterial classification, the focus is on selecting the right features to classify ARI bacteria. This research differs from previous work in the types of bacteria studied: it adds Staphylococcus aureus and Streptococcus pneumoniae as the bacteria for pneumonia, Corynebacterium diphtheriae as the bacterium for diphtheria, and Neisseria gonorrhoeae as the pathogen for pharyngitis.

Research Methods
This study uses the researchers' own data, namely a dataset of bacterial images from throat sputum. The research was carried out in several stages: obtaining the bacterial images, image preprocessing, image segmentation, feature extraction, and bacterial classification using the KNN method, as shown in Figure 1.

Bacteria Images
Generally, bacteria are 0.4 to 2 µm in size and have three general forms, namely cocci, bacilli, and spirochetes [4]. Within these general forms there are more specific groups: Staphylococcus aureus belongs to the cocci in clusters group, Streptococcus pneumoniae to the cocci in chains group, Corynebacterium diphtheriae to the club-shaped and pleomorphic rods group, and Neisseria gonorrhoeae to the diplococci group [10].

Table 1 shows that the research data consisted of 5 classes, namely Staphylococcus aureus and Streptococcus pneumoniae as pneumonia bacteria, Corynebacterium diphtheriae as the diphtheria bacterium, Neisseria gonorrhoeae as the asymptomatic pharyngitis bacterium, and Mycobacterium tuberculosis as the tuberculosis (TB) bacterium.

Preprocessing Images
Data normalization is carried out at this stage, making the image size and the color space uniform before image segmentation. Initially, the bacterial images varied in size, from 1920x1080 pixels; because this is very large, the images were cropped to 151x151 pixels, as shown in Figure 3. The cropped image is part of the data normalization and represents the shape of the ARI bacteria. In addition, cropping reduces the computational load [12].
The cropped image is in the RGB color space, which consists of 3 color components, namely red, green, and blue. An RGB image is large and not easy to segment, so it needs to be converted to another color space [13], for example the HSV color space. The HSV color space also consists of 3 components, namely Hue, Saturation, and Value. The conversion from the RGB color space to the HSV color space follows the formulas given in [14]. The next step is contrast stretching, which spreads the light and dark intensities evenly over the entire intensity scale so that the image has a high contrast value.
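The color conversion and contrast stretching steps can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard-library `colorsys.rgb_to_hsv` conversion, which may differ in detail from the formulas in [14], and it assumes a simple linear min-max stretch, which the text does not specify. The function names `rgb_to_hsv_channels` and `contrast_stretch` are hypothetical.

```python
import colorsys


def rgb_to_hsv_channels(rgb_image):
    """Split an RGB image (rows of (r, g, b) tuples, 0-255) into
    hue, saturation, and value planes with values in [0, 1]."""
    hue, sat, val = [], [], []
    for row in rgb_image:
        converted = [
            colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            for r, g, b in row
        ]
        hue.append([h for h, _, _ in converted])
        sat.append([s for _, s, _ in converted])
        val.append([v for _, _, v in converted])
    return hue, sat, val


def contrast_stretch(channel):
    """Linearly rescale one channel so its values span the full [0, 1] range.
    Assumption: min-max stretching; the paper does not state the exact variant."""
    flat = [p for row in channel for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: nothing to stretch
        return [[0.0 for _ in row] for row in channel]
    return [[(p - lo) / (hi - lo) for p in row] for row in channel]
```

Under this sketch, a hue or saturation plane returned by `rgb_to_hsv_channels` would be passed to `contrast_stretch` before segmentation.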

Segmentation
At this stage, the aim is to separate the research object from the background. This stage uses a thresholding process, for which a threshold value must be found using the formula given in [15]. To find the threshold value (T), the histogram of the grayscale image is inspected to determine the gray-level values of the research object and of the background. In addition to thresholding, segmentation is also carried out using the chain-code technique. This method labels each binary object and then evaluates the proximity of pixel values in the direction of the 4 or 8 surrounding neighbors, as shown in Figure 4.
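A minimal sketch of histogram inspection and thresholding on a channel with values in [0, 1] is given below. It is an assumption-laden illustration, not the formula from [15]: whether the bacteria are brighter or darker than the background is not stated, so the hypothetical `object_is_bright` flag covers both cases, and the bin count is arbitrary.

```python
def histogram(channel, bins=10):
    """Count the pixels of a [0, 1] channel in equal-width bins,
    as an aid for choosing the threshold T."""
    counts = [0] * bins
    for row in channel:
        for p in row:
            counts[min(int(p * bins), bins - 1)] += 1
    return counts


def threshold(channel, t, object_is_bright=True):
    """Return a binary image: 1 (white) for object pixels, 0 (black) otherwise.
    Assumption: a single global threshold t separates object and background."""
    if object_is_bright:
        return [[1 if p >= t else 0 for p in row] for row in channel]
    return [[1 if p < t else 0 for p in row] for row in channel]
```

For example, `threshold(channel, 0.5)` marks every pixel with a value of at least 0.5 as object; the valleys between peaks in `histogram(channel)` suggest where T could be placed.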

Feature Extraction
At this stage, the aim is to find characteristic values that can distinguish one class from the other classes. The features extracted in this research are morphological or shape features, namely bacterial count, area, perimeter, and shape factor. The area and perimeter are determined using a chain code, where the area (A) represents the area of the bacteria, the perimeter or circumference (P) represents the edge length, and the shape factor (S) represents the shape of the bacteria. These three parameters are expressed by the formulas given in [16].
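The four shape features can be illustrated with the sketch below. Several choices here are assumptions rather than the paper's method: objects are labeled with a flood fill over 8-connected neighbors instead of a chain code, the perimeter is approximated by counting border pixels, and the shape factor is taken as 4πA/P², a common circularity measure that may differ from the formula in [16]. The names `label_objects` and `shape_features` are hypothetical.

```python
import math

# Neighbor offsets as (row, col) steps.
NEIGHBORS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
NEIGHBORS_4 = [(-1, 0), (0, -1), (0, 1), (1, 0)]


def label_objects(binary):
    """Group foreground pixels (value 1) into 8-connected objects.
    Returns a list of objects, each a list of (row, col) pixels."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            stack, pixels = [(r, c)], []
            seen[r][c] = True
            while stack:  # iterative flood fill
                y, x = stack.pop()
                pixels.append((y, x))
                for dy, dx in NEIGHBORS_8:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            objects.append(pixels)
    return objects


def shape_features(pixels):
    """Return (area, perimeter, shape_factor) for one labeled object.
    Area is the pixel count; perimeter counts pixels that have at least one
    4-neighbor outside the object (a crude stand-in for chain-code length)."""
    members = set(pixels)
    area = len(members)
    perimeter = sum(
        1 for (y, x) in members
        if any((y + dy, x + dx) not in members for dy, dx in NEIGHBORS_4)
    )
    shape_factor = 4 * math.pi * area / perimeter ** 2  # assumed definition
    return area, perimeter, shape_factor
```

In this sketch the bacterial count of an image is simply `len(label_objects(binary))`. Because the border-pixel perimeter underestimates the true contour length for small objects, the resulting shape factor can exceed 1; a chain-code perimeter would behave differently.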

K-Nearest Neighbor Classification
K-Nearest Neighbor (KNN) is a supervised learning classification method; in supervised learning, the classification target is known. KNN classifies a sample according to the training data closest to it. Because it builds no explicit model and defers all computation to classification time, it is often known as lazy learning. The basic principle of KNN is to choose a value of K, the number of nearest training samples that determine the classification result, and to measure closeness with the Euclidean distance (ED) [16]-[18]:
$$ED = \sqrt{\sum_{i} \left(X_{ir} - X_{ij}\right)^{2}} \qquad (8)$$
where X_{ir} is the testing data, X_{ij} is the training data, and the sum runs over the features i.
The dataset contains 480 images in total: 94 images of Corynebacterium diphtheriae, 91 images of Mycobacterium tuberculosis, 95 images of Neisseria gonorrhoeae, 92 images of Staphylococcus aureus, and 108 images of Streptococcus pneumoniae. The classification experiments look for the highest accuracy of the KNN method across different ratios of training data to testing data, namely 50%:50%, 60%:40%, 70%:30%, 80%:20%, and 90%:10%.
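The KNN procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, ties are broken in favor of the nearest neighbors' class, and the `split` helper simply cuts an already shuffled list, since the paper does not say how samples were assigned to the training and testing sets.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    # most_common keeps first-seen order among equal counts, so ties go to
    # the class whose member is nearest to the query.
    return votes.most_common(1)[0][0]


def split(samples, train_fraction):
    """Cut a (pre-shuffled) list into training and testing parts."""
    cut = int(round(len(samples) * train_fraction))
    return samples[:cut], samples[cut:]
```

With 480 samples, `split(samples, 0.9)` yields 432 training and 48 testing samples. Each feature vector would hold the four features (bacterial count, area, perimeter, shape factor) of one image.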

Result and Discussion
The bacterial images, originally in the RGB color space, were converted into the HSV color space using equations (1), (2), and (3) to find the HSV channel that best represents the shape of the bacteria, as shown in Figure 5, which includes the image of the saturation component. To clarify the shape of the bacteria, the next step is contrast stretching, which gives the image a high contrast value and therefore also changes its histogram. The image itself also changes after contrast stretching, as shown in Figure 6.
Figure 6 shows the difference between the hue image histogram before and after contrast stretching. The gray values of an HSV image range from 0 to 1, unlike those of a grayscale image, which range from 0 to 255. Before contrast stretching, the histogram has two peaks, at 0.58 and 0.78; after contrast stretching, the histogram again has two peaks. Contrast stretching spreads the light and dark intensities over the entire intensity scale, so the image histogram looks wider than before. In addition to the change in the histogram, Figure 6 also shows the change in the hue image itself. Contrast stretching helps segmentation: in image (a) the gray levels of the object and the background are similar, whereas in image (b) there is a significant difference between the object and the background, which eases segmentation with a threshold.
After contrast stretching, segmentation is carried out based on the threshold value using equation (4). Because this study used hue images and saturation images, the threshold value lies in the range 0.4 to 0.7, depending on whether the contrast-stretched image is dark or light.
The result of thresholding is a binary image, an image with two values, namely 0 (black) and 1 (white), as shown in Figure 8. Figure 7 shows that the threshold image can represent most forms of bacteria, but some bacteria, such as Neisseria gonorrhoeae and Mycobacterium tuberculosis, need to be resegmented because noise remains in the threshold-based segmentation image. Noise here means objects that are not part of the bacteria, such as stain residues and other objects like polymorphonuclear (PMN) cells. PMN cells are white blood cells that appear when there is an infection in the body. In the image of Neisseria gonorrhoeae, the PMN cells are also segmented, so segmentation based on area is needed. To perform it, the objects are labeled and their area is found using a chain code with 8 neighboring pixels. This process is known as the Channel Area Thresholding (CAT) segmentation technique [19].
Table 2 shows that the bacterium with the largest area, perimeter, and shape factor is Staphylococcus aureus, while the smallest is Corynebacterium diphtheriae. The highest number of bacteria is found for Neisseria gonorrhoeae, with 29 bacteria in one image, while the lowest is found for Mycobacterium tuberculosis, with one bacterium in one image.
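The area-based resegmentation step can be sketched as below, assuming the binary objects have already been labeled as lists of (row, col) pixel coordinates. The area limits `min_area` and `max_area` are hypothetical parameters; the paper does not report the values it used to separate bacteria from stain residue or PMN cells.

```python
def filter_by_area(objects, min_area, max_area):
    """Keep only labeled objects whose pixel count lies in [min_area, max_area].
    Small objects are treated as residue, large ones as non-bacterial cells."""
    return [pixels for pixels in objects if min_area <= len(pixels) <= max_area]


def objects_to_mask(objects, rows, cols):
    """Draw the kept objects into a new binary image (1 = object, 0 = background)."""
    mask = [[0] * cols for _ in range(rows)]
    for pixels in objects:
        for r, c in pixels:
            mask[r][c] = 1
    return mask
```

Applying `objects_to_mask(filter_by_area(objects, min_area, max_area), rows, cols)` gives a cleaned binary image from which the shape features can then be measured.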
These features are the input of the K-Nearest Neighbor (KNN) classification method. The basic principle of KNN is to choose a value of K, the number of nearest training samples that determine the classification result, and to calculate the closest distance using the Euclidean distance in equation (8). KNN is a supervised learning method, where the target is known beforehand. When a test sample (with unknown class label) is presented, the KNN algorithm looks for the training data closest to it, and the test sample is assigned the class of the training data with the closest Euclidean distance. This study uses 480 images divided into training data and testing data in ratios of 50%:50%, 60%:40%, 70%:30%, 80%:20%, and 90%:10%, with varying values of K; the resulting accuracy, precision, and recall are shown in Table 3.
Table 3 compares the ratios of training data to test data and the values of K to find the best accuracy, precision, and recall. For the 50%:50% ratio, the best accuracy is 87.5% with K = 1. For 60%:40%, the best accuracy is 88.54% with K = 5. For 70%:30%, the best accuracy is 90.28% with K = 5. For 80%:20%, the best accuracy is 90.63% with K = 9. For 90%:10%, the best accuracy is 91.67%, with a precision of 92.4% and a recall of 91.7%, obtained with three values of K, namely K = 3, K = 5, and K = 7. To examine the results of the KNN classification, a confusion matrix was made, as shown in Table 4.
Table 4. Confusion matrix with a data ratio of 90%:10% at K = 7
Table 4 shows that 10 samples were correctly classified as Corynebacterium diphtheriae. For Mycobacterium tuberculosis, 8 samples were correctly classified and 1 was misclassified as Corynebacterium diphtheriae. For Neisseria gonorrhoeae, 4 samples were correctly classified and 1 was misclassified as Streptococcus pneumoniae. For Staphylococcus aureus, 12 samples were correctly classified and 1 was misclassified as Mycobacterium tuberculosis. For Streptococcus pneumoniae, 10 samples were correctly classified and 1 was misclassified as Corynebacterium diphtheriae. These errors can occur because the values of the KNN input features (number of bacteria, area, perimeter, and shape factor) are close to each other across bacteria, as shown in Table 2. An example for the average perimeter feature is shown in Table 5.
Table 5 shows the closeness of the average perimeter values of Staphylococcus aureus, Streptococcus pneumoniae, and Neisseria gonorrhoeae, namely 1774, 1088, and 1021. This proximity affects the KNN classification results and causes the misclassifications between bacteria seen in the confusion matrix in Table 4.
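The reported metrics can be checked against the counts described for Table 4. In the sketch below, the matrix is reconstructed from the prose description (the original table layout is not available here), with rows as target classes and columns as output classes in the order Corynebacterium diphtheriae, Mycobacterium tuberculosis, Neisseria gonorrhoeae, Staphylococcus aureus, Streptococcus pneumoniae. Averaging precision and recall weighted by class support is an assumption; the paper does not state which averaging it used.

```python
# Rows = target class, columns = output class (reconstructed from the text).
confusion = [
    [10, 0, 0, 0, 0],   # Corynebacterium diphtheriae
    [1, 8, 0, 0, 0],    # Mycobacterium tuberculosis
    [0, 0, 4, 0, 1],    # Neisseria gonorrhoeae
    [0, 1, 0, 12, 0],   # Staphylococcus aureus
    [1, 0, 0, 0, 10],   # Streptococcus pneumoniae
]


def weighted_metrics(cm):
    """Return (accuracy, precision, recall) with support-weighted averaging."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    support = [sum(row) for row in cm]                       # samples per target class
    predicted = [sum(cm[r][c] for r in range(n)) for c in range(n)]
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precision = sum(
        support[i] * (cm[i][i] / predicted[i] if predicted[i] else 0.0)
        for i in range(n)
    ) / total
    recall = sum(
        support[i] * (cm[i][i] / support[i] if support[i] else 0.0)
        for i in range(n)
    ) / total
    return accuracy, precision, recall


accuracy, precision, recall = weighted_metrics(confusion)
print(f"accuracy={accuracy:.2%} precision={precision:.1%} recall={recall:.1%}")
# prints "accuracy=91.67% precision=92.4% recall=91.7%"
```

Under these assumptions the 48 test samples give 44 correct classifications, and the support-weighted averages agree with the accuracy, precision, and recall reported for the 90%:10% ratio.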
Compared with previous research [9], where KNN achieved an accuracy of 94.92%, the accuracy of KNN in this research is 91.67%. This difference occurs because the previous research classified only one bacterium, namely Mycobacterium tuberculosis, whereas this research adds four other bacteria, namely Staphylococcus aureus, Streptococcus pneumoniae, Corynebacterium diphtheriae, and Neisseria gonorrhoeae.

Conclusion
This research is a computer vision study that aims to classify acute respiratory infection (ARI) bacteria using the K-Nearest Neighbor (KNN) method. The features used are shape features, namely bacterial count, area, perimeter, and shape factor. The dataset contains 480 images, and the best ratio of training data to test data is 90%:10%. The KNN method can classify these bacteria with a highest accuracy of 91.67%, a precision of 92.4%, and a recall of 91.7%, obtained with 3 values of K, namely K = 3, K = 5, and K = 7. Future work needs to add other features and compare KNN with other classification methods to find the best method for classifying the bacteria that cause acute respiratory infections (ARI).