Unleashing the Reasoning Capabilities of Vision Language Models for Effective Image-based Roadside Tree 3D Measurement
Keywords: Vision Language Model, Deep Learning, Street View Images, Roadside Tree Measurement
Abstract. Diameter at Breast Height (DBH) and Tree Height (TH) are key morphological parameters of roadside trees; their accurate measurement is conducive to quantifying the various ecological benefits of trees. Compared with traditional field surveys or laser scanning methods, using low-cost street view images as an alternative data source is a promising measurement approach. However, existing methods rely on preset reference systems or manual interpretation, which results in poor generalization and low efficiency. Recently, Vision Language Models (VLMs) have shown potential in mimicking human visual reasoning, but their direct application fails to address 3D measurement tasks. To tackle this, we propose a VLM-based Tree 3D Measurement Network, named VLM-TMN. Our key idea is to adapt VLMs to focus on the semantic and geometric information of trees to achieve effective measurement. Specifically, VLM-TMN contains two key designs: 1) a Depth Projector Module that integrates explicit depth supervision and implicit depth encoding to enhance geometric understanding; 2) a Magnifying Glass Strategy that amplifies visual perception by dynamically focusing on critical tree regions. Built upon LLaVA-7B, our method reduces DBH measurement error from 24.39 cm to 7.08 cm RMSE, achieving a 7.57% improvement over standard supervised fine-tuning approaches and significantly outperforming existing methods (7.08 cm vs. 15 cm). These results demonstrate that VLMs can be effectively repurposed for urban ecological parameter quantification, providing a cost-effective solution for sustainable city planning.
