Unsupervised Image Captioning Based on Instance Segmentation for Person Re-Identification
Keywords: Image Captioning, Person Re-Identification, Unsupervised Learning, Deep Learning
Abstract. The emergence of vision-language models such as CLIP (Contrastive Language-Image Pre-training) has had a positive impact on various image classification tasks, including person re-identification (Re-ID), which aims to detect a person of interest across multiple non-overlapping cameras. This makes it possible to treat person re-identification as a multi-task problem involving both visual and textual descriptors. However, CLIP-based models suffer from coarse-grained alignment issues and rely on a supervised learning strategy, which is undesirable for real-time person Re-ID tasks. A multi-modal problem statement based on preliminary instance segmentation of the person helps to achieve fine-grained alignment of visual and textual descriptors. We propose an unsupervised image captioning method based on the CutLER detector, in which visual features are extracted only from the object of interest, without considering background data. Experiments were conducted on more than 8,000 human images selected from the MSCOCO dataset. Experimental results with CutLER pre-segmentation showed improved caption generation accuracy as measured by the BLEU-1 to BLEU-4, METEOR, ROUGE, CIDEr, and SPICE metrics.
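The core idea stated in the abstract, extracting visual features only from the segmented person while discarding background pixels, can be illustrated with a minimal sketch. The mask and the "feature" below are toy stand-ins (the paper obtains the mask from the CutLER detector and the features from a vision-language model); the helper names `mask_background` and `mean_intensity` are hypothetical and used only for illustration.

```python
# Toy sketch of background suppression before feature extraction.
# Assumption: the instance mask (here hand-written) would come from
# an unsupervised detector such as CutLER in the actual pipeline.

def mask_background(image, mask, fill=0):
    """Zero out pixels outside the instance mask so that downstream
    feature extraction ignores background data."""
    return [
        [pixel if keep else fill for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]

def mean_intensity(image):
    """Toy 'visual feature': mean over non-zero (foreground) pixels."""
    fg = [p for row in image for p in row if p != 0]
    return sum(fg) / len(fg) if fg else 0.0

# 3x3 grayscale image; the mask keeps only the centre column ("person").
image = [[10, 200, 10],
         [10, 220, 10],
         [10, 180, 10]]
mask  = [[0, 1, 0],
         [0, 1, 0],
         [0, 1, 0]]

masked = mask_background(image, mask)
print(mean_intensity(masked))  # statistic computed from foreground only: 200.0
```

In the actual method, the masked region would be fed to the captioning model, so the generated description is driven by the person rather than by scene context.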