Lightweight Image Caption Approach through Integrated Generation and Retrieval Methods
Wang Ziyi, Yu Qiancheng, Li Weijun, Liu Xueyang, Ding Jianping, Liu Shixia
(North Minzu University)
DOI:
Received: 2024-02-27; Revised: 2024-03-31
Funding: National Natural Science Foundation of China (62066038, 61962001); Fundamental Research Funds for the Central Universities (2021JCYJ12); Ningxia Natural Science Foundation (2021AAC03215); North Minzu University Graduate Student Innovation Project (YCX23163)
Abstract:
Existing image captioning methods suffer from incomplete extraction of image information, a lack of textual prompt information, complex models, and high training costs. To address these issues, this paper proposes a lightweight image captioning method that combines the strengths of retrieval and generation. The method couples vector retrieval with a vision-language pre-trained model: ViLT (Vision-and-Language Transformer) serves as the encoder, jointly encoding the image with the retrieved captions and outputting fused image-text features, while the decoder uses OPT (Open Pre-trained Transformer Language Models) augmented with a cross-attention mechanism to generate the caption. On the MSCOCO and Flickr30k datasets, the method reaches BLEU-4 scores of 36.7% and 28.6% and ROUGE-L scores of 57.1% and 50.3%, respectively. Experimental results show that the method effectively improves caption quality while keeping the parameter count low, and that incorporating retrieval improves the model's robustness, producing captions that better match the image content.
Key words: image caption; vector retrieval; pre-trained models; lightweight; ViLT; OPT
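
The pipeline described in the abstract (retrieve similar captions, fuse them with the image in ViLT, and let an OPT decoder attend to the fused features) can be outlined as follows. This is a minimal sketch under assumed components: CLIP embeddings with a FAISS index stand in for the unspecified retrieval step, the public "dandelin/vilt-b32-mlm" and "facebook/opt-125m" checkpoints stand in for the encoder and decoder, "example.jpg" is a placeholder image, and a stand-alone nn.MultiheadAttention layer stands in for the cross-attention added inside the decoder; none of these choices are confirmed by the abstract.

```python
import faiss
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from transformers import (AutoTokenizer, CLIPModel, CLIPProcessor,
                          OPTModel, ViltModel, ViltProcessor)

# --- 1. Vector retrieval: find the caption whose embedding is closest to the image ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
caption_pool = [
    "a dog runs across a grassy field",
    "a man rides a bicycle down the street",
    "two children play with a soccer ball",
]
with torch.no_grad():
    txt = clip_proc(text=caption_pool, return_tensors="pt", padding=True)
    txt_emb = F.normalize(clip.get_text_features(**txt), dim=-1)
index = faiss.IndexFlatIP(txt_emb.shape[1])           # cosine similarity via inner product
index.add(txt_emb.numpy())

image = Image.open("example.jpg")                      # placeholder query image
with torch.no_grad():
    img = clip_proc(images=image, return_tensors="pt")
    img_emb = F.normalize(clip.get_image_features(**img), dim=-1)
_, ids = index.search(img_emb.numpy(), 1)
retrieved = caption_pool[ids[0][0]]                    # text prompt for the encoder

# --- 2. Joint encoding: ViLT fuses the image with the retrieved caption ---
vilt_proc = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
vilt = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")
enc = vilt_proc(images=image, text=retrieved, return_tensors="pt")
with torch.no_grad():
    fused = vilt(**enc).last_hidden_state              # (1, seq_vilt, 768)

# --- 3. Decoding: OPT hidden states attend to the fused features via cross-attention ---
tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
opt = OPTModel.from_pretrained("facebook/opt-125m")    # hidden size 768, matches ViLT base
dec_in = tok("a dog", return_tensors="pt")             # partially generated caption
with torch.no_grad():
    dec_hidden = opt(**dec_in).last_hidden_state       # (1, seq_opt, 768)

cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
attended, _ = cross_attn(query=dec_hidden, key=fused, value=fused)
print(attended.shape)                                  # (1, seq_opt, 768), fed onward to the LM head
```

In the paper's setting the cross-attention is part of the decoder itself and the decoder is trained to emit the next caption token from the attended states; the sketch only shows how the fused ViLT features and the OPT hidden states can be wired together dimensionally.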
