Unsupervised Image Captioning Based on Instance Segmentation for Person Re-Identification
Keywords: Image Captioning, Person Re-Identification, Unsupervised Learning, Deep Learning
Abstract. The emergence of vision-language models such as CLIP (Contrastive Language-Image Pre-training) has had a positive impact on various image classification tasks, including person re-identification (Re-ID), which aims to detect a person of interest across multiple non-overlapping cameras. This makes it possible to treat person re-identification as a multi-task problem involving both visual and textual descriptors. However, CLIP-based models suffer from coarse-grained alignment issues and rely on a supervised learning strategy, which is undesirable for real-time person Re-ID tasks. A multi-modal problem statement based on preliminary instance segmentation of the person helps to achieve fine-grained alignment of visual and textual descriptors. We propose an unsupervised image captioning method based on the CutLER detector, in which visual features are extracted only from the object of interest, without considering background data. Experiments were conducted on more than 8,000 human images selected from the MSCOCO dataset. Experimental results with CutLER pre-segmentation showed improved caption generation accuracy as measured by the BLEU-1 to BLEU-4, METEOR, ROUGE, CIDEr, and SPICE metrics.
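The core idea stated in the abstract, extracting visual features only from the segmented person while discarding background pixels, can be illustrated with a minimal sketch. The mask and the "feature" below are toy stand-ins (the paper obtains the mask from the CutLER detector and the features from a vision-language model); the helper names `mask_background` and `mean_intensity` are hypothetical and used only for illustration.

```python
# Toy sketch of background suppression before feature extraction.
# Assumption: the instance mask (here hand-written) would come from
# an unsupervised detector such as CutLER in the actual pipeline.

def mask_background(image, mask, fill=0):
    """Zero out pixels outside the instance mask so that downstream
    feature extraction ignores background data."""
    return [
        [pixel if keep else fill for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]

def mean_intensity(image):
    """Toy 'visual feature': mean over non-zero (foreground) pixels."""
    fg = [p for row in image for p in row if p != 0]
    return sum(fg) / len(fg) if fg else 0.0

# 3x3 grayscale image; the mask keeps only the centre column ("person").
image = [[10, 200, 10],
         [10, 220, 10],
         [10, 180, 10]]
mask  = [[0, 1, 0],
         [0, 1, 0],
         [0, 1, 0]]

masked = mask_background(image, mask)
print(mean_intensity(masked))  # statistic computed from foreground only: 200.0
```

In the actual method, the masked region would be fed to the captioning model, so the generated description is driven by the person rather than by scene context.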