Generative AI-Based Application for Producing Tourism Video Blogs with Proximity and Direction to Points of Interest

Eguchi, Hayate; Sasaki, Iori; Lu, Min; Utsumi, Tomihiro; Sato, Ryo; Arikawa, Masatoshi

doi:https://doi.org/10.5194/isprs-archives-XLVIII-4-W16-2025-33-2025

Articles | Volume XLVIII-4/W16-2025


© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.


19 Sep 2025



Keywords: AI-Enabled Urban Tourism, Context-Aware Captioning, Tourism Video Blogs, Geofencing, Generative AI

Abstract. Taking and sharing videos of tourist attractions has become a common activity among tourists. When accompanied by captions and audio, these videos are an effective way to convey impressions of, and information about, the places visited through social media. Such content not only enriches individuals’ post-travel experience but also helps promote local tourism and stimulate inbound demand. However, producing highly informative video content is challenging because it requires editing skills and reliable information. This study establishes a method for automatically overlaying captions onto videos by (1) estimating the time periods during which points of interest (POIs) are captured within the camera’s field of view, and (2) generating explanatory comments whose word count suits the corresponding durations. The method was implemented in the authors’ video blog application to enable users to easily share the appeal of a region. In a field experiment simulating tourist movement and POI filming within a predefined guide area, the average error between the time a POI appeared in the video and the calculated caption display period was approximately 1.8 seconds, with a maximum error of 4.0 seconds. This level of accuracy is considered sufficient for viewers to associate each caption with the corresponding POI as it appears in the video. Furthermore, the text length of the generated captions was reasonable for the display duration, and a qualitative evaluation confirmed that their content was factually accurate. Future improvements should incorporate users’ personal experiences into caption generation.
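The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the algorithm, so the geofence radius (`radius_m`), the horizontal field of view (`fov_deg`), the reading speed (`words_per_second`), the track format, and all function names below are assumptions chosen for illustration. The sketch treats a POI as "in view" when it lies within the radius and its bearing falls inside the camera's field of view, then derives caption intervals and a word budget from those intervals.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def poi_in_view(sample, poi, radius_m=50.0, fov_deg=60.0):
    """True if the POI is close enough (proximity) and inside the FOV (direction).

    sample: (lat, lon, heading_deg); poi: (lat, lon).
    radius_m and fov_deg are assumed values, not taken from the paper.
    """
    lat, lon, heading = sample
    if haversine_m(lat, lon, poi[0], poi[1]) > radius_m:
        return False
    # Signed angular difference between bearing and heading, in [-180, 180).
    diff = (bearing_deg(lat, lon, poi[0], poi[1]) - heading + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2


def caption_intervals(track, poi, **kwargs):
    """Return (start, end) times during which the POI stays in view.

    track: time-sorted list of (t_seconds, lat, lon, heading_deg) samples.
    """
    intervals, start, prev_t = [], None, None
    for t, lat, lon, heading in track:
        visible = poi_in_view((lat, lon, heading), poi, **kwargs)
        if visible and start is None:
            start = t
        elif not visible and start is not None:
            intervals.append((start, prev_t))
            start = None
        prev_t = t
    if start is not None:
        intervals.append((start, prev_t))
    return intervals


def word_budget(duration_s, words_per_second=2.5):
    """Maximum caption length for a display duration (assumed reading speed)."""
    return max(1, int(duration_s * words_per_second))
```

In such a design, the word budget computed for each interval could be passed to the text generator as a length constraint, so that a caption can be read within the time the POI is on screen. The reading speed would need tuning per language and audience.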
