Spatial-Based Deep Learning Autonomous Wheel Robot Using CNN

Technology is developing rapidly, and robotics is among the fields most popular with scientists. Recently, robots have been created to mimic functions of the human brain: robots that can make decisions without human help, known as AI (Artificial Intelligence). This technology is now being developed for wheeled vehicles, so that these vehicles can drive while avoiding obstacles. In earlier research, Nvidia introduced an autonomous vehicle named Nvidia DAVE-2, which became popular and showed an accuracy rate of 90%. In this work, the CNN (Convolutional Neural Network) method is used in the track recognition process, with input in the form of trajectory images taken from several angles. The data are trained using a Jupyter notebook, and the training results are then used to automate the movement of the robot on the track from which the data were collected. The robot uses these results to determine the path it will take. The more images taken as data, the more precise the results will be, but the time needed to train on the image data will also be longer. From the data obtained, the highest train loss, 1.829455, occurred in the first epoch, and the highest test loss, 30.90127, occurred in the third epoch. Lower loss indicates better steering control, which means better stability.

packet loss is smaller. In addition, Nvidia's GPU also supports image data processing, so the data can be processed faster.
The input data is processed by the machine through two or more layers [4]. When more layers are used, the accuracy rate also increases [5]. These layers substitute for humans, allowing decisions to be made independently without human assistance. One deep learning method is the Convolutional Neural Network, commonly abbreviated as CNN [6]. A CNN works by scanning each section of the data to be used as a node; each node's value is the result of a matrix calculation. With this method, the robot can follow the track while avoiding obstacles, doing its work more efficiently and optimally.
Research related to autonomous driving has been done before in Artificial Intelligence laboratories [7] by simulating it with a program called the CARLA simulator, an open-source simulator for autonomous car driving. Aiming to continue that development to the next stage, this research focuses on making a prototype of an autonomous car that has three wheels: two regular wheels and one Omni wheel. The body is made of ABS filament printed with a 3D printer, with a camera module on the front to capture objects. The robot is thus expected to operate on the ground, reading the path area and obstacles captured by the camera.

Research Methods
The method used in this study is the ResNet model of convolutional neural network (CNN), as follows. The design of the developed research system is illustrated in Figure 1. The camera retrieves digital image data and passes it to the Nvidia Jetson Nano for processing. The received digital image data is processed on the Nvidia Jetson Nano using the Convolutional Neural Network method, which is used to detect the image data and train on it; the training step, however, is done separately on a personal computer using JupyterLab.
After the Nvidia Jetson Nano has been trained on the collected data, the trained model is used as a reference to control the motor driver that drives the DC motors as actuators.
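As an illustration of the control flow just described, the sketch below converts a trained model's steering prediction into left/right motor commands for a differential drive. This is a hypothetical minimal example, not code from the paper; the function name `steering_to_motor_speeds` and the `base_speed` and `gain` parameters are illustrative assumptions.

```python
# Hypothetical sketch: map a steering prediction to DC-motor speeds.
# Names and constants are illustrative, not taken from the paper.

def steering_to_motor_speeds(steering, base_speed=0.5, gain=0.4):
    """Map a steering value in [-1, 1] (negative = turn left) to
    differential-drive (left, right) motor speeds clamped to [0, 1]."""
    left = base_speed + gain * steering
    right = base_speed - gain * steering
    clamp = lambda v: max(0.0, min(1.0, v))
    return clamp(left), clamp(right)

if __name__ == "__main__":
    print(steering_to_motor_speeds(0.0))   # driving straight: equal speeds
    print(steering_to_motor_speeds(-1.0))  # hard left: right wheel faster
```

In a real loop, `steering` would come from the CNN's prediction on each camera frame, and the two speeds would be sent to the motor driver.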

Deep learning
In the deep learning method, it is necessary to address significant problems in statistical machine learning [8]. The selection of a feature space that fits the representation learning approach becomes a problem in machine learning, because the input space can be mapped to intermediate features. Deep neural networks have some difficulties [9], especially with high-dimensional input spaces, e.g., images. This problem encouraged researchers to adopt deep architectures, consisting of several layers with non-linear processing, to solve it. Although there was already evidence of successful cases of shallow networks [10] [11], researchers found that the curse of dimensionality becomes a problem in the case of multiple functions. It was also found that increasing the number of layers in a neural network can reduce the impact of backpropagation on the first layers: gradient descent then tends to stop within local minima or plateaus. This problem was solved in 2006 [12] [13] through the introduction of layer-wise unsupervised pre-training. In 2011, Graphics Processing Unit (GPU) speed increased significantly, which made it possible to train Convolutional Neural Network-based architectures; AlexNet won international competitions in 2011 and 2012. In the following years, with the advancement of CPUs and GPUs, data-hungry deep learning techniques became increasingly practical. The training and validation of motor sensor control models for urban driving in the real world were beyond the reach of most research groups [14]; therefore, simulation testing is an alternative.

Neural Networks
Feedforward neural networks, or Multilayer Perceptrons (MLPs) [15], are the base of the deep learning model. The main objective of a feed-forward network is to define a mapping y = f(x; θ) from inputs to categories and to estimate the value of the parameters θ that give the best function approximation [16] [17].

Figure 2. Example of MLPs With Hidden Layer
A feed-forward neural network has a structure consisting of many different functions chained together. For example, the network in Figure 2 consists of three layer functions composed as f(x) = f(3)(f(2)(f(1)(x))). In this case, f(1) is referred to as the input layer, f(2) is the second layer, or hidden layer, and f(3) is the output layer. The overall length of the chain is the depth of the model; it is from this depth that the name Deep Learning comes [18].
A feed-forward network can be seen as extending a linear model into a nonlinear function of x by applying the model to a transformed input φ(x), where φ is a non-linear transformation. It can be said that φ provides a set of features describing x, or a new representation of x. There are three general approaches [18] used to select the mapping φ: a. Use a very generic φ. b. Manually engineer φ. c. Learn φ, parametrizing the representation as φ(x; θ). The third option uses the feed-forward network as an application to study deterministic mappings, stochastic mappings, functions with feedback, and probability distributions on a single vector [18]. Most Neural Network models are designed using this principle.
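The chain f(x) = f(3)(f(2)(f(1)(x))) described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's network: the layer sizes are arbitrary and ReLU is assumed as the nonlinearity.

```python
# Minimal MLP forward pass: input -> hidden (ReLU) -> output.
# Layer sizes (4 -> 8 -> 2) are illustrative only.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, params):
    W1, b1, W2, b2 = params
    h = relu(x @ W1 + b1)   # hidden layer: affine map, then nonlinearity
    return h @ W2 + b2      # output layer: affine map

rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 8)), np.zeros(8),
          rng.normal(size=(8, 2)), np.zeros(2))
y = mlp_forward(rng.normal(size=(1, 4)), params)
print(y.shape)  # one input vector mapped to a 2-dimensional output
```

Here the hidden layer plays the role of the learned transformation φ(x; θ): its weights are parameters to be estimated during training.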

Convolutional Neural Network
CNN, introduced by LeCun, is mainly used to process data with a grid-like topology. It is simply a neural network that uses convolutions instead of general matrix multiplication. Usually, a convolutional network is composed of three phases. In the first phase, the convolutional layer carries out convolutions to produce a series of linear activations. In the second phase, each linear activation is run through a non-linear activation function such as ReLU; in the third phase, a pooling function further modifies the output.

Figure 3. CNN layers: input layer, hidden layers, output layer
With the general availability of data and escalating computing power, deep learning approaches such as convolutional neural networks (CNNs) evidently outperform traditional approaches [22]. A CNN consists of multiple layers, each of which has a simple Application Program Interface (API). In Figure 3, a CNN with a three-dimensional input block transforms it into a three-dimensional output through several differentiable functions that may or may not have parameters. A CNN arranges its neurons in three dimensions (length, width, and height) in each layer. The proposed system performance was evaluated based on mean square error (MSE) [23] [24].
In a CNN, there are two main processes, namely Feature Learning and Classification.

Feature Learning
Feature Learning consists of the layers that translate the input into features based on its characteristics, represented as numbers in vectors [25]. This feature extraction stage consists of a Convolutional Layer and a Pooling Layer:
a. The Convolutional Layer calculates the output of neurons connected to local regions of the input [26].
b. The Rectified Linear Unit (ReLU) counters the vanishing gradient by using the element-wise activation function f(x) = max(0, x) [27]: activations are thresholded at 0. Its advantage is that it speeds up stochastic gradient descent compared with the sigmoid/tanh functions, since ReLU is linear and does not use exponential operations; it simply thresholds the activation matrix at 0. Its disadvantage is that ReLU units can become fragile during training and "die": a large gradient flowing through a ReLU can cause a weight update after which the neuron is never again active on any data point, so the gradient flowing through that unit is zero from that point on.
c. The Pooling Layer reduces the dimensions of the feature map, a step better known as downsampling [28], which speeds up computation: fewer parameters need to be updated, and overfitting is reduced. Commonly used pooling is Max Pooling and Average Pooling: Max Pooling takes the maximum value within each filter window, while Average Pooling takes the average value.
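The ReLU and pooling operations just described can be shown concretely. The sketch below is a toy illustration (the 4 x 4 feature map is made up) using a non-overlapping 2 x 2 pooling window.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def pool2d(x, size=2, mode="max"):
    """Non-overlapping 2D pooling (stride equals the window size)."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]          # drop ragged edges
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))           # max of each window
    return blocks.mean(axis=(1, 3))              # average of each window

fmap = np.array([[1., -2., 3., 0.],
                 [4., 5., -6., 7.],
                 [0., 1., 2., 3.],
                 [8., -1., 4., 5.]])
print(pool2d(relu(fmap), mode="max"))   # negatives zeroed, then window maxima
print(pool2d(fmap, mode="avg"))         # window averages of the raw map
```

Note how the 4 x 4 map shrinks to 2 x 2: this is the downsampling that reduces the number of parameters downstream.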

Classification
This layer classifies the input using the features extracted previously. It consists of:
a. Flatten, which reshapes the feature map into a vector so that it can be used as input for the fully-connected layer [29].
b. Fully-connected: the FC layer calculates the class scores. As in an ordinary Neural Network, and as the name suggests, every neuron in this layer is connected to every number in the input volume.
c. Softmax, which converts the class scores into probabilities and helps to determine the target class for the given input. The advantage of using Softmax is that the output probabilities range from zero to one and sum to one. The softmax function, used for multi-class classification models, returns a probability for each class, with the target class receiving a high probability [30].
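The flatten, fully-connected, and softmax steps above fit in a few lines. The sketch below is illustrative (the feature map, weights, and biases are made-up values, not trained ones).

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()         # probabilities in [0, 1] that sum to 1

def classify(feature_map, W, b):
    x = feature_map.reshape(-1)   # flatten: feature map -> vector
    scores = x @ W + b            # fully-connected layer: class scores
    return softmax(scores)        # scores -> class probabilities

fmap = np.arange(6, dtype=float).reshape(2, 3)
W = np.zeros((6, 3))              # toy weights: 6 inputs, 3 classes
b = np.array([0.0, 1.0, 0.0])     # bias favors class 1
p = classify(fmap, W, b)
print(round(p.sum(), 6), int(p.argmax()))  # probabilities sum to 1; class 1 wins
```

With zero weights, the scores equal the biases, so class 1 gets the highest probability; with trained weights, the scores would depend on the extracted features.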
In the convolution layer, the convolutional algorithm converts the image into a vector without losing spatial information, which MLPs cannot do. Mathematically, the discrete convolution operation between two functions f and g, denoted by the operator *, can be defined as:

(f * g)(t) = Σ_a f(a) g(t − a)

For a 2-dimensional image as input, the formula can be written as follows:

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)    (1)

Since convolution is commutative, it can also be written as follows:

S(i, j) = (K * I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)    (2)

In these equations (1 and 2), I is the two-dimensional input, while K is the two-dimensional convolutional kernel.

Figure 4. 2D convolution between 3 x 4 input and 2 x 2 kernel
The principle of 2D convolution is to shift the convolutional kernel over the input. At each index position shown in Figure 4, element-wise multiplications are computed and then summed to give the result value. Referring to Figure 4, the kernel slides by the number of strides, which helps in downsampling the image. There is also a parameter called padding, which can be set to control the size of the output.
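The sliding-window operation described above, with its stride and padding parameters, can be sketched directly. Note that, as in most deep learning libraries, the sketch below does not flip the kernel (i.e., it computes cross-correlation rather than the textbook convolution), which does not matter when the kernel is learned.

```python
import numpy as np

def conv2d(I, K, stride=1, padding=0):
    """2D sliding-window 'convolution' as used in CNN libraries
    (cross-correlation: the kernel is not flipped)."""
    if padding:
        I = np.pad(I, padding)                   # zero-pad all four sides
    kh, kw = K.shape
    oh = (I.shape[0] - kh) // stride + 1         # output height
    ow = (I.shape[1] - kw) // stride + 1         # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = I[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * K).sum()        # multiply element-wise, sum
    return out

I = np.arange(12, dtype=float).reshape(3, 4)  # 3 x 4 input, as in Figure 4
K = np.array([[1., 0.], [0., 1.]])            # 2 x 2 kernel (toy values)
print(conv2d(I, K).shape)  # 2 x 3 output, matching Figure 4's geometry
```

With stride 1 and no padding, a 3 x 4 input and 2 x 2 kernel give a 2 x 3 output; increasing the stride shrinks the output further, while padding enlarges it.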

Result and Discussion
This chapter covers testing of the system on the device built according to the design, to find out whether the tool runs as planned. Testing is carried out to compare the theoretical design with the experimental results.

Result of Autonomous Wheel Robot
The robot developed in this project is an autonomous wheeled robot with three wheels; more details can be seen in the following image.

Figure 5. Autonomous wheel robot
Figure 5 shows the robot developed with three wheels: one is an Omni wheel that can move 360 degrees, and the other two are conventional wheels connected to DC motors that act as the robot's actuators. As its data receiver, the robot uses a camera mounted on the front to capture incoming data; as its brain, it uses the Nvidia Jetson Nano to process the stored data with the Convolutional Neural Network (CNN) method so that the robot can move smoothly along the track.

Result of Pi v2 Camera Module Capture
The camera used is the Pi v2 camera module, capturing images at a resolution of 256 x 256 pixels, which serves to capture objects with detailed and bright results. This camera device acts as the track detector whose output is processed on the Nvidia Jetson Nano. Figure 6 shows the camera device connected to the Jetson Nano, placed on the front as a trajectory detector; the digital image captured by this device is also shown in Figure 6.

Data training
This test uses the CNN ResNet-34 method as described in the previous chapter; this process determines how high a level of accuracy will be obtained in this research. The training data process is as follows. Figure 9 (a) is the cut that is used, but only the right-turning part; Figure 9 (b) is the cut used, but only a straight section, taking half the angle of the whole track; Figure 9 (c) is the cut used, but only the left-turning part; Figure 9 (d) is a cut of a straight track taken from the end of the track to its last point, showing the result of the track that has been marked using a JupyterLab notebook. The green pointer marks the point on the track that the Nvidia Jetson Nano uses to make a decision.
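The actual training runs a ResNet-34 in a Jupyter notebook, which is too heavy to reproduce here. As a hedged miniature of the same bookkeeping, the sketch below trains a stand-in linear model with an MSE loss and records the train loss and test loss once per epoch for 70 epochs, as the paper describes; the synthetic data and learning rate are illustrative assumptions.

```python
import numpy as np

# Stand-in for the paper's ResNet-34 regression: a linear model trained
# with gradient descent on an MSE loss over synthetic, noiseless data.
rng = np.random.default_rng(42)
X_train, X_test = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
true_w = np.array([0.5, -1.0, 2.0])
y_train, y_test = X_train @ true_w, X_test @ true_w

w, lr = np.zeros(3), 0.1
history = []                                   # (train_loss, test_loss) per epoch
for epoch in range(70):                        # 70 epochs, as in the paper
    pred = X_train @ w
    grad = 2 * X_train.T @ (pred - y_train) / len(y_train)
    w -= lr * grad                             # gradient descent step
    train_loss = np.mean((X_train @ w - y_train) ** 2)   # MSE on train split
    test_loss = np.mean((X_test @ w - y_test) ** 2)      # MSE on test split
    history.append((train_loss, test_loss))

print(f"final train loss: {history[-1][0]:.2e}")
```

Plotting `history` against the epoch index yields a curve of the same kind as the loss graph discussed in the next section.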

Model Test Results
The Loss Function is a function used in optimization problems to minimize the loss itself; the loss is the error incurred when training on data, while the Number of Epochs is the number of times a group of data is repeated. Figure 10 is a graph showing the results of training carried out 70 times using the ResNet-34 model, which uses 34 hidden layers. It compares training loss and test loss: for accurate results, the test loss should be equal to or only slightly higher than the training loss. The graph shows that for epochs ≤ 5 the loss function is quite high, with the peak around epoch = 3, after which it decreases monotonically toward zero.

Conclusion
In this experiment, the Convolutional Neural Network deep learning method with the ResNet-34 model was used in the trajectory recognition process. To move smoothly, the robot must capture images of the trajectory from several angles, not just one, because the robot does not always move exactly along its path; there will be times when the robot moves off the trajectory, and by then the robot must already have the data to make its own decision. The stored image data is then trained to obtain test loss and train loss values. From the data obtained, the highest train loss, 1.829455, was in the first epoch, and the highest test loss, 30.90127, was in the third epoch. The results are then used by the robot to determine the path it will take. By adding more training data, we can reduce the loss obtained, but the more data we train on, the longer the training will take.