Annotation Error Detection and Correction for Indonesian POS Tagging Corpus
Abstract
A linguistic corpus is the primary material for training and evaluating machine learning models, especially for part-of-speech (POS) tagging. However, a human-annotated corpus is not free from annotation errors, and such errors degrade model performance. We therefore propose a procedure for annotation error detection and correction. We detect annotation errors in an Indonesian POS tagging corpus using the n-gram variation method, then correct the corpus using an expert-voting approach. Detection yielded 6,536 annotation error candidates, each of which is either (i) an ambiguous word or (ii) an incorrect annotation. In the correction step, a group of experts validated and corrected the candidates by majority vote, identifying and correcting 503 words from 1,918 sentences. Finally, we compared the performance of the POS tagging model with the corpus before and after correction. The results showed a significant improvement in F1-score (+9.69%) with the corrected corpus compared to the uncorrected corpus.
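The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy sentences, the tag labels (PRP, MD, NN, VB), the n-gram length, and the rule that an item without a strict majority is left undecided are all assumptions made for illustration. The detection step flags any word n-gram that appears in the corpus with more than one tag sequence; the correction step takes the tag chosen by a strict majority of expert votes.

```python
from collections import Counter, defaultdict


def variation_ngrams(corpus, n=3):
    """Return word n-grams that occur with more than one tag sequence.

    corpus: list of sentences, each a list of (word, tag) pairs.
    Each returned n-gram is only a candidate: the variation may be
    genuine ambiguity or an annotation error.
    """
    seen = defaultdict(Counter)
    for sentence in corpus:
        words = [word.lower() for word, _ in sentence]
        tags = [tag for _, tag in sentence]
        for i in range(len(sentence) - n + 1):
            seen[tuple(words[i:i + n])][tuple(tags[i:i + n])] += 1
    return {
        ngram: dict(taggings)
        for ngram, taggings in seen.items()
        if len(taggings) > 1
    }


def majority_vote(votes):
    """Return the tag with a strict majority of votes, or None if there is none."""
    if not votes:
        return None
    tag, count = Counter(votes).most_common(1)[0]
    return tag if count > len(votes) / 2 else None


# Toy corpus (hypothetical data, assumed tag labels).
corpus = [
    [("saya", "PRP"), ("bisa", "MD"), ("makan", "VB")],
    [("saya", "PRP"), ("bisa", "NN"), ("makan", "VB")],
    [("dia", "PRP"), ("suka", "VB"), ("makan", "VB")],
]

candidates = variation_ngrams(corpus, n=3)

# Hypothetical votes from three experts on the tag of "bisa" in this context.
resolved = majority_vote(["MD", "MD", "NN"])
```

In this toy corpus only the n-gram "saya bisa makan" is flagged, because it is the only one tagged in two different ways. A longer n-gram gives a stricter context and fewer candidates; a shorter one gives more candidates and more false alarms from genuinely ambiguous words.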

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors submitting a manuscript do so on the understanding that, if it is accepted for publication, the copyright of the article shall be assigned to Jurnal Lontar Komputer as the publisher of the journal. Copyright encompasses exclusive rights to reproduce and deliver the article in all forms and media, as well as translations. Reproduction of any part of this journal (printed or online) is allowed only with written permission from Jurnal Lontar Komputer. The Editorial Board of Jurnal Lontar Komputer makes every effort to ensure that no wrong or misleading data, opinions, or statements are published in the journal.