Annotation Error Detection and Correction for Indonesian POS Tagging Corpus

  • Muhammad Alfian Institut Teknologi Sepuluh Nopember
  • Umi Laili Yuhana Institut Teknologi Sepuluh Nopember
  • Daniel Siahaan Institut Teknologi Sepuluh Nopember
  • Harum Munazharoh Universitas Airlangga

Abstract

A linguistic corpus is the primary material for training and evaluating machine learning models, particularly for part-of-speech (POS) tagging. However, human-annotated corpora are not free from annotation errors, and such errors degrade model performance. We therefore propose a procedure for annotation error detection and correction. We detect annotation errors in an Indonesian POS tagging corpus using the n-gram variation method, then correct the corpus using an expert-voting approach. Detection yielded 6,536 annotation error candidates, each of which is either (i) an ambiguous word or (ii) an incorrect annotation. In the correction step, a group of experts validated and corrected the candidates by majority voting, identifying and correcting 503 words in 1,918 sentences. Finally, we compared the performance of the POS tagging model using the corpus before and after correction. The corrected corpus yielded a significant improvement in F1-score (+9.69%) over the uncorrected corpus.
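The two steps named in the abstract, n-gram variation detection followed by majority voting, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-sentence corpus, the tag labels, and the strict-majority rule are all assumptions made for illustration. The idea shown is that a word n-gram occurring more than once with different tags at the same position is flagged as a candidate, which may be either a truly ambiguous word or an annotation error, and a group of votes then decides the tag.

```python
from collections import Counter, defaultdict


def variation_ngrams(corpus, n=3):
    """Find word n-grams whose tags differ between occurrences.

    corpus: list of sentences, each a list of (word, tag) pairs.
    Returns {word n-gram: {position in n-gram: sorted list of tags seen}},
    keeping only positions where more than one tag was observed.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for sentence in corpus:
        words = [w for w, _ in sentence]
        tags = [t for _, t in sentence]
        for i in range(len(sentence) - n + 1):
            key = tuple(words[i:i + n])
            for k in range(n):
                seen[key][k].add(tags[i + k])

    candidates = {}
    for key, by_position in seen.items():
        varying = {k: sorted(ts) for k, ts in by_position.items() if len(ts) > 1}
        if varying:
            candidates[key] = varying
    return candidates


def majority_vote(votes):
    """Return the tag chosen by a strict majority of voters, else None."""
    if not votes:
        return None
    tag, count = Counter(votes).most_common(1)[0]
    return tag if count > len(votes) / 2 else None


# Hypothetical toy corpus: "bisa" can mean "can" (auxiliary) or "venom" (noun).
toy_corpus = [
    [("saya", "PRON"), ("bisa", "AUX"), ("makan", "VERB")],
    [("saya", "PRON"), ("bisa", "NOUN"), ("makan", "VERB")],  # suspicious tag
    [("ular", "NOUN"), ("bisa", "NOUN"), ("itu", "DET")],
]

candidates = variation_ngrams(toy_corpus, n=3)
decision = majority_vote(["AUX", "AUX", "NOUN"])  # three hypothetical experts
```

In this sketch only the trigram "saya bisa makan" is flagged, because it is the only one that recurs with conflicting tags; the third sentence uses "bisa" as a noun in a different context and is left alone. A tie among voters returns None, which a real workflow would presumably escalate for further discussion.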

Author Biographies

Umi Laili Yuhana, Institut Teknologi Sepuluh Nopember

Department of Informatics, Institut Teknologi Sepuluh Nopember

Daniel Siahaan, Institut Teknologi Sepuluh Nopember

Department of Informatics, Institut Teknologi Sepuluh Nopember

Harum Munazharoh, Universitas Airlangga

Department of Indonesian Language and Literature, Universitas Airlangga

Published
2025-06-05
How to Cite
ALFIAN, Muhammad et al. Annotation Error Detection and Correction for Indonesian POS Tagging Corpus. Lontar Komputer : Jurnal Ilmiah Teknologi Informasi, [S.l.], v. 16, n. 1, p. 41-52, june 2025. ISSN 2541-5832. Available at: <https://ojs.unud.ac.id/index.php/lontar/article/view/122682>. Date accessed: 03 nov. 2025. doi: https://doi.org/10.24843/LKJITI.2025.v16.i01.p04.