Implementation of VGG-16 Feature Extraction and LSTM Modeling for Automatic Image Caption Generation
Abstract
Image captioning, the task of automatically generating descriptive captions for images, has gained significant attention due to its potential applications in various domains. This paper addresses the challenges associated with integrating computer vision and natural language processing techniques to develop an effective image caption generator. The proposed solution leverages the VGG-16 model for feature extraction from images and an LSTM (Long Short-Term Memory) model for caption generation. The Flickr8k dataset, containing approximately 8000 images with five different captions per image, is utilized for training and evaluation. The methodology encompasses several steps, including data preprocessing, feature extraction, model training, and evaluation. Data preprocessing involves cleaning captions by removing punctuation, single characters, and numerical values, and adding start and end sequence tokens. Image features are extracted using the pre-trained VGG-16 model, and similar images are clustered to ensure accurate feature extraction. Subsequently, the captions are tokenized and paired with the corresponding image features for model training. The LSTM model is designed with input layers for image features and captions, as well as an output layer for caption generation. Extensive hyperparameter tuning is conducted to optimize the model's performance, varying the number of nodes and layers. The generated captions are evaluated using BLEU scores, where a score closer to 1 indicates higher similarity between the predicted and actual captions. The proposed system demonstrates promising results in generating meaningful captions for images, with potential applications in assisting visually impaired individuals, supporting medical image analysis, and automating tasks in the advertising industry.
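As a rough illustration of the pipeline summarized above, the sketch below shows how the main steps (caption cleaning, VGG-16 feature extraction, the merged image-caption LSTM decoder, and BLEU evaluation) could be implemented in Keras/TensorFlow with NLTK. It is a minimal sketch under assumed settings, not the exact configuration reported in the paper: the 256-node layer sizes, dropout rate, vocabulary size, maximum caption length, and the toy BLEU example are illustrative placeholders.

```python
import re
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from nltk.translate.bleu_score import corpus_bleu

# 1) Caption preprocessing: lowercase, drop punctuation/digits and single
#    characters, then wrap the caption in start/end sequence tokens.
def clean_caption(text):
    text = re.sub(r'[^a-z ]', ' ', text.lower())
    words = [w for w in text.split() if len(w) > 1]
    return 'startseq ' + ' '.join(words) + ' endseq'

# 2) Feature extraction: reuse pre-trained VGG-16 without its classification
#    head, so each image is encoded as a 4096-dimensional fc2 vector.
vgg = VGG16()
vgg = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_features(image_path):
    image = load_img(image_path, target_size=(224, 224))
    image = preprocess_input(np.expand_dims(img_to_array(image), axis=0))
    return vgg.predict(image, verbose=0)  # shape: (1, 4096)

# 3) Caption model: one input branch for the image feature vector, one for the
#    tokenized, padded partial caption; both are merged before the decoder.
vocab_size = 8000   # assumed: taken from the fitted tokenizer in practice
max_length = 35     # assumed: length of the longest cleaned caption

img_input = Input(shape=(4096,))
img_branch = Dropout(0.4)(img_input)
img_branch = Dense(256, activation='relu')(img_branch)

cap_input = Input(shape=(max_length,))
cap_branch = Embedding(vocab_size, 256, mask_zero=True)(cap_input)
cap_branch = Dropout(0.4)(cap_branch)
cap_branch = LSTM(256)(cap_branch)

decoder = add([img_branch, cap_branch])
decoder = Dense(256, activation='relu')(decoder)
output = Dense(vocab_size, activation='softmax')(decoder)

model = Model(inputs=[img_input, cap_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')

# 4) Evaluation: corpus-level BLEU-1 between reference and generated captions
#    (toy tokens here; in practice these come from the held-out test split).
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass']]]
candidates = [['a', 'dog', 'is', 'running', 'on', 'grass']]
print('BLEU-1:', corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
```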