Balinese Script Recognition Using Tesseract Mobile Framework
Abstract
One of the main factors behind the declining use of Balinese Script is that Balinese people have little interest in reading it: they are reluctant to learn a script whose characters are relatively complicated to recognize. Computer technology can help through automatic character recognition, known as Optical Character Recognition (OCR). Developing an OCR application for Balinese Script is a technology-side effort to help preserve the script and to provide a means of education about it. In this study, the application was developed with the Tesseract OCR engine in four stages: (1) preparing the dataset, (2) generating the dataset using the Web Scraping method, (3) training the OCR engine on the generated dataset, and (4) implementing the resulting language model in a mobile application. The results indicate that, when training requires a large dataset, generating it with the Web Scraping method is a better choice than the approach taken in several previous studies of non-Latin character recognition. Those studies used the jTessBox tools, which is time-consuming because every character in the dataset has to be selected individually. The best language model was trained on a hierarchical combination of character, word, sentence, and paragraph datasets and reached a coincidence rate of 66.67%. The more diverse and hierarchically structured the datasets, the higher the coincidence rate.
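The hierarchical dataset described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the scraped pages keep their text in `<p>` elements, that words are separated by whitespace, and that sentences end with Latin punctuation or the Balinese carik signs (U+1B5E, U+1B5F). The names `ParagraphExtractor` and `build_hierarchical_dataset` are hypothetical, and the "character" level here is simply the set of Unicode code points, which may differ from the units used in the study.

```python
import re
from html.parser import HTMLParser

# Assumed sentence terminators: Latin punctuation plus Balinese carik signs.
SENTENCE_END = r"(?<=[.!?\u1b5e\u1b5f])\s+"
PUNCTUATION = ".!?,;:\u1b5e\u1b5f"


class ParagraphExtractor(HTMLParser):
    """Collect the whitespace-normalized text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def build_hierarchical_dataset(html):
    """Split scraped HTML into paragraph, sentence, word, and character levels."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    paragraphs = parser.paragraphs
    sentences = [s for p in paragraphs for s in re.split(SENTENCE_END, p) if s]
    # Words: whitespace-separated tokens with surrounding punctuation removed.
    words = sorted({w.strip(PUNCTUATION) for s in sentences for w in s.split()} - {""})
    # Characters: distinct code points that occur in the words.
    characters = sorted({c for w in words for c in w})
    return {
        "character": characters,
        "word": words,
        "sentence": sentences,
        "paragraph": paragraphs,
    }


def training_lines(dataset):
    """Concatenate the four levels, smallest unit first, one entry per line."""
    order = ("character", "word", "sentence", "paragraph")
    return [line for level in order for line in dataset[level]]
```

Each returned line could then be rendered to an image with a Balinese font and paired with its text as ground truth for Tesseract training; that rendering step depends on the training tooling and is not shown here.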
This work is licensed under a Creative Commons Attribution 4.0 International License.
The Authors submitting a manuscript do so on the understanding that if accepted for publication, the copyright of the article shall be assigned to Jurnal Lontar Komputer as the publisher of the journal. Copyright encompasses exclusive rights to reproduce and deliver the article in all forms and media, as well as translations. The reproduction of any part of this journal (printed or online) will be allowed only with written permission from Jurnal Lontar Komputer. The Editorial Board of Jurnal Lontar Komputer makes every effort to ensure that no wrong or misleading data, opinions, or statements be published in the journal.