Hyperparameter Analysis for Image Captioning
This incremental analysis is aimed at researchers in computer vision and NLP who work on optimizing image captioning models.
The paper studies hyperparameter sensitivity in image captioning for two architectures, CNN+LSTM and CNN+Transformer, on the Flickr8k dataset. Its main finding is that fine-tuning the CNN encoder outperformed the baseline and every other experimental configuration for both architectures.
In this paper, we perform a thorough sensitivity analysis of state-of-the-art image captioning approaches using two architectures: CNN+LSTM and CNN+Transformer. All experiments use the Flickr8k dataset. The main finding is that, for both architectures, fine-tuning the CNN encoder outperforms the baseline and all other configurations we tested.
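To make the experimental design concrete, the sketch below shows one common way to lay out a sensitivity analysis of this kind: start from a baseline configuration per architecture and vary one setting at a time. It is an illustration only, not the paper's actual protocol. The field names (`fine_tune_encoder`, `learning_rate`, `embedding_dim`), their values, and the assumption that the baseline keeps the CNN encoder frozen are all hypothetical; the abstract does not list the hyperparameters that were studied.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, List


@dataclass(frozen=True)
class CaptionConfig:
    """One experimental configuration (all fields except architecture are hypothetical)."""
    architecture: str                 # "cnn_lstm" or "cnn_transformer"
    fine_tune_encoder: bool = False   # assumption: the baseline keeps the CNN encoder frozen
    learning_rate: float = 1e-3       # hypothetical value
    embedding_dim: int = 256          # hypothetical value


# Hypothetical alternative values to try, one field at a time.
VARIATIONS: Dict[str, List[Any]] = {
    "fine_tune_encoder": [True],
    "learning_rate": [1e-4, 1e-2],
    "embedding_dim": [128, 512],
}


def one_at_a_time(baseline: CaptionConfig,
                  variations: Dict[str, List[Any]]) -> List[CaptionConfig]:
    """Return the baseline followed by configs that differ from it in exactly one field."""
    configs = [baseline]
    for name, values in variations.items():
        for value in values:
            if value != getattr(baseline, name):
                configs.append(replace(baseline, **{name: value}))
    return configs


def build_experiments() -> List[CaptionConfig]:
    """Build the one-at-a-time grid for both architectures named in the abstract."""
    experiments: List[CaptionConfig] = []
    for arch in ("cnn_lstm", "cnn_transformer"):
        experiments.extend(one_at_a_time(CaptionConfig(architecture=arch), VARIATIONS))
    return experiments


if __name__ == "__main__":
    for cfg in build_experiments():
        print(cfg)
```

Because each run differs from its baseline in a single field, any change in caption quality can be attributed to that field, which is what allows a statement such as "fine-tuning the CNN encoder outperforms the baseline" to be read off directly from the results.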