Authors: TAO Rui (陶锐), REN Honge (任洪娥), CAO Haiyan (曹海燕)
Abstract: Image captioning is the task of automatically generating a language description that matches the content of an image. When pre-trained models from computer vision and natural language processing are bridged to build an image captioning model, cross-modal semantic consistency is the core issue of shared-subspace embedding. In this paper, we introduce a method that divides an image into patches serving as visual semantic units for open-vocabulary cross-modal association with language features, breaking through the limitation of a finite set of visual feature categories. The method jointly applies two loss functions, masked language modeling and image-text matching, and selects hard negative samples to train a cross-modal skip-connection network that extracts consistent global semantics, improving the accuracy of matching highly similar image and text feature points within the subspace neighborhood. Experimental results on the MS COCO and Flickr30k datasets show performance gains over models that likewise use CLIP + GPT to generate image captions, as well as over other mainstream models, demonstrating the effectiveness of the proposed method.
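To make the joint objective concrete, below is a minimal PyTorch sketch of one plausible form of the image-text matching (ITM) loss with in-batch hard negative mining, combined with a masked-language-modeling (MLM) term. All names (itm_loss_with_hard_negatives, itm_head, img_emb, txt_emb, mlm_loss) and the batch-level mining strategy are illustrative assumptions, not the paper's actual implementation.

    # Sketch: ITM loss with in-batch hard negatives, plus an MLM term.
    # Illustrative only; not the authors' code.
    import torch
    import torch.nn.functional as F

    def itm_loss_with_hard_negatives(img_emb, txt_emb, itm_head):
        """img_emb, txt_emb: (B, D) embeddings of B matched image-text pairs;
        itm_head: binary classifier over concatenated pair embeddings."""
        B = img_emb.size(0)
        with torch.no_grad():  # negative mining needs no gradients
            # Similarity of every image to every caption in the batch.
            sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T
            sim.fill_diagonal_(float("-inf"))   # exclude the true pairs
            hard = sim.argmax(dim=1)            # most confusable caption per image
        pos = torch.cat([img_emb, txt_emb], dim=-1)         # matched pairs
        neg = torch.cat([img_emb, txt_emb[hard]], dim=-1)   # hard negatives
        logits = itm_head(torch.cat([pos, neg]))            # (2B, 2)
        labels = torch.cat([torch.ones(B), torch.zeros(B)]).long()
        return F.cross_entropy(logits, labels)

    # Toy usage: the total loss adds an MLM term computed elsewhere
    # (e.g. by a masked language-model head over the caption tokens).
    B, D = 8, 512
    itm_head = torch.nn.Linear(2 * D, 2)
    img_emb, txt_emb = torch.randn(B, D), torch.randn(B, D)
    mlm_loss = torch.tensor(0.0)  # placeholder for the MLM term
    total = mlm_loss + itm_loss_with_hard_negatives(img_emb, txt_emb, itm_head)

Mining the most similar non-matching caption for each image is what would force such a network to separate near-duplicate image-text feature points in the shared subspace, which is the matching-accuracy improvement the abstract describes.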