Multi-Object Tracking in UAV Videos: A YOLOv11 Fusion Method for Detection and Segmentation Optimization
Keywords: Multi-Object Tracking, YOLOv11, Fusion, Segmentation, Object Detection
Abstract. Deep learning has substantially advanced multi-object tracking (MOT) in UAV-based remote sensing applications. However, accurately detecting and tracking objects of varying sizes in complex UAV-captured environments remains a challenge. This research introduces a novel fusion-based approach that uses YOLOv11, a state-of-the-art object detection framework, to improve MOT performance on the VisDrone UAV dataset. The proposed method integrates two YOLOv11 configurations: (1) detection mode paired with the BoT-SORT tracker, optimized for large objects to ensure high precision and localization accuracy, and (2) segmentation mode paired with the ByteTrack tracker, designed to detect and track smaller, less prominent objects. Fusing the outputs of these two configurations provides object coverage across size ranges, improving detection and tracking accuracy as well as segmentation performance. The method addresses two limitations of existing models, low recall for small objects and imprecise localization for larger ones, both of which are aggravated in UAV datasets by varying altitudes, occlusions, and dynamic backgrounds. The fusion strategy combines Intersection over Union (IoU)-based matching, weighted bounding box fusion, and confidence thresholding to improve tracking reliability and accuracy. Experimental evaluations on the VisDrone dataset, using motion tracking metrics and the F1 score for detection and segmentation, demonstrate significant performance improvements across multiple UAV videos. The results show that the fused approach outperforms the individual configurations while maintaining consistent object identities over time. This research contributes to UAV-based remote sensing by providing a scalable and efficient MOT framework that is particularly valuable for applications such as surveillance, traffic monitoring, and disaster response, where precise object localization and tracking are crucial.
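The abstract names three fusion ingredients (IoU-based matching, weighted bounding box fusion, and confidence thresholding) without giving their exact formulation. The following Python sketch shows one plausible way these steps could fit together for a single frame. It is an illustration under stated assumptions, not the authors' implementation: the `Box` type, the function names, the greedy one-to-one matching, the confidence-weighted coordinate average, and the threshold values `iou_thr=0.5` and `conf_thr=0.25` are all hypothetical, and track identities are not handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned box (x1, y1, x2, y2) with a confidence score."""
    x1: float
    y1: float
    x2: float
    y2: float
    conf: float


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_merge(a: Box, b: Box) -> Box:
    """Average coordinates weighted by confidence (assumed scheme)."""
    total = a.conf + b.conf
    wa = a.conf / total if total > 0 else 0.5
    wb = 1.0 - wa
    return Box(
        wa * a.x1 + wb * b.x1,
        wa * a.y1 + wb * b.y1,
        wa * a.x2 + wb * b.x2,
        wa * a.y2 + wb * b.y2,
        max(a.conf, b.conf),
    )


def fuse(det_boxes: List[Box], seg_boxes: List[Box],
         iou_thr: float = 0.5, conf_thr: float = 0.25) -> List[Box]:
    """Fuse per-frame outputs of the detection and segmentation branches."""
    # Confidence thresholding: drop weak candidates from both branches.
    det = [b for b in det_boxes if b.conf >= conf_thr]
    seg = [b for b in seg_boxes if b.conf >= conf_thr]

    fused: List[Box] = []
    used = set()  # indices of segmentation boxes already matched
    for d in det:
        # IoU-based matching: best unused segmentation box above iou_thr.
        best_j, best_iou = -1, iou_thr
        for j, s in enumerate(seg):
            if j in used:
                continue
            overlap = iou(d, s)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            used.add(best_j)
            fused.append(weighted_merge(d, seg[best_j]))
        else:
            fused.append(d)  # object seen only by the detection branch

    # Keep objects seen only by the segmentation branch (e.g. small ones).
    fused.extend(s for j, s in enumerate(seg) if j not in used)
    return fused
```

In this sketch, a box reported by both branches is merged into one output, while a box reported by only one branch is passed through unchanged, which is how the union of the two configurations could cover both large and small objects.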