Towards Estimation of 3D Poses and Shapes of Animals from Oblique Drone Imagery

Wildlife research in both terrestrial and aquatic ecosystems now deploys drone technology for tasks such as monitoring, census counts and habitat analysis. Unlike camera traps, drones offer real-time flexibility for adaptable flight paths and camera views, thus making them ideal for capturing multi-view data on wildlife like zebras or lions. With recent advancements in animals’ 3D shape & pose estimation, there is an increasing interest in bringing 3D analysis from ground to sky by means of drones. The paper reports some activities of the EU-funded WildDrone project and performs, for the first time, 3D analyses of animals exploiting oblique drone imagery. Using parametric model fitting, we estimate 3D shape and pose of animals from frames of a monocular RGB video. With the goal of appending metric information to parametric animal models using photogrammetric evidence, we propose a pipeline where we perform a point cloud reconstruction of the scene to scale and localize the animal within the 3D scene. Challenges, planned next steps and future directions are also reported.


INTRODUCTION
Drones, also known as Unmanned Aerial Vehicles (UAV) or Systems (UAS), have become an indispensable asset for wildlife conservation research (Wirsing et al., 2022; Tuia et al., 2022; Duffy et al., 2020). Their application is now widespread in ecological studies as they enable upscaled, replicable (Schroeder et al., 2020) and non-invasive acquisition of high-quality data (Fust and Loos, 2020; Jiménez López and Mulero-Pázmány, 2019). They are also more easily available, safer, and more cost-effective than traditional ground-based or aerial data collection methods (Duffy et al., 2020). Compared to remote sensing, drones allow flexible studies with more intricately designed flight paths, making room for more complex and nuanced surveys at varying altitudes (Fust and Loos, 2020). They can also reach difficult or otherwise inaccessible terrain and carry a variety of high-resolution payloads, ranging from visual to environmental sensors, depending on the objectives of the study (Krishnan et al., 2023; Mou et al., 2023; Corcoran et al., 2021).
The use of photogrammetry and computer vision on drone-acquired data can increase the quantitative products derived from the collected data while reducing processing time and human effort. For instance, Koger et al. (2023) demonstrate how high-altitude top-view RGB videos and photogrammetry can support habitat reconstruction and herd movement estimation, linking group action to the ecological context. Drones are also employed to approximate the visual field of each animal within its habitat by learning pose and orientation from overhead observations (Schad and Fischer, 2023; Walter and Couzin, 2021). Most drone-based animal studies, particularly those involving some level of automation, work with nadir data, whereas oblique imagery remains mostly unexplored (Chabot and Bird, 2015). This is due to the added ground distortion and viewing complexity of oblique videography, e.g. for morphometric analyses. Oblique views, however, can provide visual insights that nadir imagery cannot, especially when studying the characteristics, appearance, or behavior of terrestrial animals (Shero et al., 2021). Oblique views can capture data on animals that are occluded by surrounding vegetation (Tuia et al., 2022) and contain additional information about shape, coat patterns, gait, and activity such as grazing (Figure 1).
A relevant dataset of oblique videos of animals is the KABR dataset (Kholiavchenko et al., 2024), which comprises high-resolution footage of Grevy's zebras, plains zebras and giraffes. Its authors also presented a computer vision-based pipeline for simultaneous focal sampling of several terrestrial animals in the scene, using deep learning for behavior recognition. Focal sampling (Altmann, 1974) refers to the observation of one specific individual animal for a set time duration, with the purpose of gaining behavioral insights. Monitoring the health, movement, behavior, and responses of an animal not only sheds light on the individual but also helps form a more refined understanding of collective behavior (Koger et al., 2023).

Paper aims
In this paper, we propose a methodology (Figure 2) to perform a scaled 3D pose and shape extrapolation on individual zebras, building upon 3D knowledge of the surveyed scene and the skinned parametric model SMAL (Zuffi et al., 2017). Our work is a first step towards drone-based research on capturing scale in addition to the 3D pose and shape of animals using oblique views and photogrammetric knowledge of the scene. Since the parametric model can in principle characterize several species, the approach will be extendable to oblique drone data of different types of animals. The contributions of our paper are:
• to perform animals' 3D pose and shape estimation based on oblique drone imagery;
• to combine photogrammetrically extracted camera poses and flight logs to geolocate and scale the 3D animal models;
• to bring together parametric fitting and photogrammetry for comprehensive scene recovery.

RELATED WORK
2.1 Animal 3D pose and shape estimation
The 3D shape of animals can provide valuable knowledge on their health, reproductive status, and age (Postma et al., 2015). The 3D postural information allows for several types of kinematic analyses (Tuia et al., 2022). However, animal 3D shape and pose estimation remains an underexplored problem when compared to the analogous challenge of 3D human shape and pose estimation (Xu et al., 2024), for reasons such as inter-species diversity in shape and appearance and the shortage of datasets. 3D reconstruction through neural representations is progressing rapidly and can be generalized to several species. Some instances of research in this direction include 3D Fauna (Li et al., 2024), LASSIE (Li et al., 2024), MagicPony (Wu et al., 2023), 3D Style Birds (Wang et al., 2023), BANMo (Yang et al., 2022), TAVA (Li et al., 2022), DOVE (Wu et al., 2022) and LASR (Yang et al., 2021). However, the accuracy with which they represent animals, particularly for ecological use-cases, is not yet comparable to that of statistical fitting methods (Rüegg et al., 2023).
Since large-scale 3D digitization of animals is not practically achievable, 3D parameterized quadruped models, such as SMAL (Zuffi et al., 2017), were proposed by learning a low-dimensional shape space from scans of toys. The manual labelling of pose keypoints and segmentation masks required for SMAL fitting was supplemented by Biggs et al. (2018) with a deep learning-based front-end for 2D joint predictions. SMALR (Zuffi et al., 2018) improves SMAL fitting by incorporating multi-view imagery to refine the shape and extract texture information. SMALST (Zuffi et al., 2019) uses an end-to-end network to regress SMAL parameters and extract shape, pose and texture from in-the-wild images of Grevy's zebras. Li and Lee (2021) add a layer of refinement to the SMAL fitting by integrating per-vertex deformation prediction using graph convolutional networks.
Generalization over diverse species is the main challenge for parametric methods. SMAL-like models can be used with high realism for animals belonging to the Felidae, Canidae, Equidae, Bovidae and Hippopotamidae families with a highly specific shape prior; the generalization problem becomes more pronounced with animals such as elephants and giraffes. The specificity and precision of these models have, however, been exploited in studies that do not focus on generalization but instead rely on a breed prior, such as SMBLD in Biggs et al. (2018). This is an extension of SMAL for representing dogs with better shape accuracy using augmented shape parameters. These augmented shape parameters are picked up in BARC (Rüegg et al., 2022), a method incorporating breed-awareness, which has been followed by BITE (Rüegg et al., 2023), which additionally utilizes ground contact information to model dog shape and posture more realistically.
Kanazawa et al. (2018) deform spherical meshes to birds for extracting shape, texture and pose. The deformation is done, however, without parameterizing the posture information. Therefore, Badger et al. (2020) introduced a low-dimensional bird shape space, further developed by Wang et al. (2021). The most recent advancement is the Animal3D dataset (Xu et al., 2024), which consists of over 3000 images of 40 mammal species annotated not only with 2D pose keypoints but also with the accompanying pose and shape parameters from high-quality SMAL fitting. Note that none of these methods have been applied to drone-based imagery, which introduces a new domain of modelling and posture challenges.
From an application standpoint, 3D statistical models have been used by Stennett et al. (2022), who combined deep learning, 3D shape analysis based on the parametric modelling process SMALST, and metric learning for the re-identification of individual Grevy's zebras in camera-trap data. The authors highlight how 3D model fitting can improve re-identification results compared to widely used 2D bounding box methods. However, even though they lay the foundation for an animal identification system that could be applicable to open population settings, there is still no solution to the two-side problem. The two-side problem refers to the fact that one lateral view of an animal has no identifiable correspondence to the other lateral view, so one zebra can be assigned two separate IDs when viewed from different sides, an issue prevalent with camera-trap data. Flight paths designed to acquire a combination of oblique and nadir views, combined with real-time tracking, could offer a solution to the two-side problem. Our application of parametric 3D pose analysis to drone data lays the groundwork for testing such re-identification pipelines with drones.

Drones, Photogrammetry and Computer Vision in Wildlife Conservation
The contribution of drones to wildlife conservation efforts has multiplied with the assimilation of machine learning, computer vision and photogrammetry in both real-time analysis and post-processing. This can be seen across a variety of applications such as census surveys and animal counts (Rahman et al., 2023; Burke et al., 2019; Kellenberger et al., 2019), social and individual behavioral analysis (Jagielski et al., 2022; Hartman et al., 2020; Torney et al., 2018), morphometric analysis (Torney et al., 2018), sample collection (Álvarez-González et al., 2023; Aucone et al., 2023), individual re-identification (Andrew et al., 2019), environment analysis (Koger et al., 2023), large-scale ground truthing of remote-sensing data (Wirsing et al., 2022) and security-related applications such as poacher detection (Bhatia et al., 2024; Anbalagan et al., 2023; Doull et al., 2021). This combination has helped produce valuable and layered datasets comprising quantified metrics on movement and social interactions at high resolution (Koger et al., 2023; Haalck et al., 2023; Torney et al., 2018). Drones have helped capture the 3D structure of surrounding habitats using photogrammetry, in order to analyze patterns in animal grouping and decision making and link them to spatial knowledge (Maeda and Yamamoto, 2023; Koger et al., 2023). Data fusion of the visible and thermal spectra has been utilized for studying individuals and distinguishing them from their environment in drone data (Krishnan et al., 2023).
Most of these instances study ways of disentangling several strands of information, for instance gauging animal movement from drone videos or reconstructing the environmental context around a monitored animal that is continuously engaging with its environment and changing position. For videos acquired in the wild, photogrammetry can provide metric information about the scene but cannot recover the 3D shape and pose of the animals because of their continuous movement and change in posture. Through statistical fitting, the 3D shape and pose of the moving animals in the scene can be recovered, but without scale information. Our methodology thus brings together these two 3D analysis methods, photogrammetry and parametric modelling, for a more comprehensive analysis of terrestrial animals.
Figure 2: The proposed methodology which includes 3D environment reconstruction as well as 3D shape and pose estimation of animals seen in oblique drone imagery.

Data
The data used are standard videos collected for manual surveillance at the Ol Pejeta Conservancy, located in Laikipia County in Kenya, using a DJI Mavic 3E (Figure 3). The videos (3840 x 2160 pixel resolution) feature a herd of Plains Zebras (Equus quagga) and were captured from relative altitudes ranging from 15 to 35 meters above ground. Generally, the flight starts with a circular trajectory around the herd and subsequently proceeds to follow the herd. The zebras can be distinctly observed in their environment and engage in a variety of activities such as grazing and self-grooming. Working with these videos poses a different set of challenges compared to working with traditional nadir data acquired for wildlife studies, as they feature oblique viewpoints, varying distance from the herd, changing altitude above the ground, zooming effects, continuous movement of animals, etc.
¹ https://github.com/3DOM-FBK/deep-image-matching

Photogrammetric scene reconstruction
For our experiments, we selected a video that showcases all the above-mentioned challenges. Given the continuous movement of both herd and drone, we extracted frames at 12 fps. This allowed us to create a set of images suitable for photogrammetric purposes while also allowing the poses and shapes of the zebras to be determined correctly. During the drone flight, frequent camera zooms are required to better examine the herd or single animals. Therefore, using the DJI flight logs and frame metadata, the frames were automatically split into bins of similar focal lengths to assist the photogrammetric processing. For the image orientation, the SuperPoint (DeTone et al., 2018) feature extractor and the LightGlue (Lindenberger et al., 2023) feature matcher, both available in the Deep Image Matching toolbox (Morelli et al., 2024)¹, were used. Camera trajectories and sparse 3D reconstructions of the scenes were then recovered in COLMAP (Figure 4a-b). After individual scene recovery (for each bin), the scenes were co-registered in Agisoft Metashape exploiting the GNSS information stored in the flight logs (Figure 4c). This process allowed us to create a scaled 3D reconstruction of the scene.
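The splitting of frames into bins of similar focal length can be illustrated with a minimal sketch. The grouping rule is an assumption, since the text does not specify how the bins are formed: here a bin collects all frames whose focal length lies within a relative tolerance of the smallest focal length in that bin. The function name `bin_by_focal_length`, the tolerance `rel_tol` and the `(frame_id, focal_length)` input format are hypothetical and only serve the illustration.

```python
from typing import List, Tuple


def bin_by_focal_length(
    frames: List[Tuple[str, float]], rel_tol: float = 0.05
) -> List[List[str]]:
    """Group frame ids into bins of similar focal length.

    frames  : (frame_id, focal_length) pairs, e.g. read from frame metadata
              (hypothetical input format).
    rel_tol : assumed relative tolerance; a frame joins the current bin if its
              focal length is at most (1 + rel_tol) times the smallest focal
              length of that bin.
    """
    bins: List[List[str]] = []
    bin_start = None  # smallest focal length of the bin currently being filled
    # Sorting by focal length lets a single pass build the bins greedily.
    for frame_id, focal in sorted(frames, key=lambda item: item[1]):
        if bin_start is None or focal > bin_start * (1.0 + rel_tol):
            bins.append([])
            bin_start = focal
        bins[-1].append(frame_id)
    return bins
```

For example, `bin_by_focal_length([("f0", 12.0), ("f1", 12.1), ("f2", 24.0), ("f3", 12.3), ("f4", 24.5)])` returns `[["f0", "f1", "f3"], ["f2", "f4"]]`. Each bin could then be processed as a separate photogrammetric block before co-registration.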

3D pose and shape estimation of zebras
SMAL (Zuffi et al., 2017) is a statistical 3D shape space mesh model represented as M(β, θ, γ), where β is the shape, θ is the pose and γ is the translation. These parameters collectively describe the modulation in shape (or postural representation), making the model suitable for integration in graphics-based pipelines. Shape β contains the coefficients of the low-dimensional shape space of the animal, which is learned through Principal Component Analysis. The joints are organized in a kinematic tree analogous to the skeletal structure, the root of which undergoes the translation γ. Pose θ is described through joint rotations. Starting from the SMAL mesh model, the fitting-based SMALR method (Zuffi et al., 2018) is then applied for high-accuracy shape retrieval. Silhouette masks and 2D pose keypoints are extracted from the UAV frames as inputs for the parametric fitting:
• for extracting zebra silhouettes, a Mask R-CNN (He et al., 2018) architecture is trained using the PyTorch-based Detectron2. Once masks are extracted, frames are selected for SMALR fitting using two criteria: the ratio between mask and image size, for which frames with a ratio above the median value are considered, and the mask detection accuracy, which must be above 80%;
• for 2D pose estimation, the MMPose² toolkit is used, applying transfer learning from a pre-trained HRNet (Wang et al., 2020) backbone.
Pose estimation with HRNet calculates the maximum likelihood of body keypoints through a 'top-down approach', i.e. body detection followed by joint estimation. In its first stage, the HRNet architecture processes the input image through parallel network branches, each corresponding to a different resolution scale, with the goal of preserving both fine local details and global semantic context. The multi-scale features are aggregated in the pose estimation head of the network through a fusion mechanism: each representation iteratively takes input from its immediate neighboring scales and from other parallel branches. This aggregation ultimately generates keypoint heatmaps predicting the likelihood of joint presence from the multi-scale representations (Figure 5). Finally, using the shape prior specific to the Equidae (horse) family, we input the 2D keypoints and the silhouette-based image evidence of the zebra's shape in the frame as pose and shape targets for an iterative 3D model fitting optimization inspired by the SMALR fitting method (Figure 6). The 3D model is aligned by minimizing an objective function that represents both the pose keypoint errors and the silhouette errors.
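The frame selection rule applied after mask extraction (mask-to-image ratio above the median, detection accuracy above 80%) can be sketched as follows. This is only an illustration under stated assumptions: the median is assumed to be computed over all candidate frames, the detection accuracy is assumed to be the confidence score returned by the detector, and the dictionary keys `frame`, `mask_area`, `image_area` and `score` are hypothetical names, not part of any library output format.

```python
from statistics import median
from typing import Dict, List


def select_frames(detections: List[Dict], min_score: float = 0.8) -> List[str]:
    """Return the ids of frames that pass both selection criteria.

    Each detection is a dict with the (hypothetical) keys:
      frame      : frame identifier
      mask_area  : number of pixels in the zebra mask
      image_area : number of pixels in the frame
      score      : detector confidence in [0, 1], assumed to be the
                   'mask detection accuracy' mentioned in the text
    """
    if not detections:
        return []
    ratios = [d["mask_area"] / d["image_area"] for d in detections]
    # Assumption: the median is taken over all candidate frames.
    ratio_threshold = median(ratios)
    return [
        d["frame"]
        for d, ratio in zip(detections, ratios)
        if ratio > ratio_threshold and d["score"] > min_score
    ]
```

Keeping only frames with a large mask ratio favours views in which the animal occupies more pixels, which is where the silhouette provides the most evidence for the fitting.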

Metric parametric model
To maintain the accuracy of the scaling, we take advantage of the extracted 2D and 3D keypoints of the hooves of the focal zebra. We chose the hooves as animal reference because they are in contact with the ground, they are replicated well in the image when using SMALR, and they help gauge orientation. We locate the 2D hoof keypoints within the image using the already obtained 2D pose estimates and label them as markers across a set of five frames. From these markers, the image-space distance dist2DX between the front and back hooves is computed, where maxext represents the maximum extension in either front or back, so that the distance is measured from the most extended front hoof (left or right) to the most extended back hoof (left or right). This distance is chosen because the SMALR model fitting can, on rare occasions, show distortions when differentiating between the left and right limbs in both the front and back of the animal model but, irrespective of the selection of limbs, it mimics the extension along the head-to-tail length very accurately. Our goal here is to preserve this length, and we therefore use the maximally extended limb in both front and back. We calculate the mean of this distance, dist2DXmean, across the set of five frames to minimize the error. We then perform rigid scaling on the model mesh using the Python library Trimesh³. With the 3D pose estimates from the model hoof keypoints, labelled as HRF (right front hoof: yellow in Figure 7), HLF (left front hoof: green in Figure 7), HRB (right back hoof: red in Figure 7), and HLB (left back hoof: blue in Figure 7), a rigid mesh scaling is performed through the following steps:
• computation of the keypoint distance between Pi and the most extended back hoof, which in Figure 7 is HRB. Pi is the x-axis component of the furthest extended front hoof, in this case HLF, along HRB;
• computation of the scaling factor as dist2DXmean / | distance (Pi, maxext3D (HRB, HLB)) |;
• application of the rigid transformation and scaling of the mesh with the computed factor. In case artefacts are created, a mesh regularization is applied.
After this scaling process, we calculated the distance between the nose keypoint and the tail-start keypoint using the corresponding mesh vertices of the 3D pose coordinates. In the sample case shown in Figure 7, the zebra length is 232 cm. In the literature (Kingdon, 1988), the average head-body length of a Plains Zebra ranges between 217 and 244 cm. Figure 8 shows examples of the fitting results for 3D shape recovery.
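The scaling steps can be summarized in a short sketch. Several points are assumptions rather than details given in the text: the model x-axis is assumed to run from tail to head, so that the most extended front hoof has the largest x and the most extended back hoof the smallest x; the per-frame reference distances are assumed to be already expressed in the metric units of the reconstructed scene; and the helper names `hoof_extent_x`, `scale_factor` and `scale_vertices` are hypothetical. Plain Python tuples are used instead of a mesh library to keep the example self-contained.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def hoof_extent_x(hooves: Dict[str, Point3D]) -> float:
    """Distance along x between the most extended front and back hooves.

    Projecting the front hoof onto the x-axis line through the back hoof
    (the point Pi in the text) reduces the distance to a difference of
    x-coordinates. Assumes x runs from tail to head.
    """
    front_x = max(hooves["HRF"][0], hooves["HLF"][0])
    back_x = min(hooves["HRB"][0], hooves["HLB"][0])
    return abs(front_x - back_x)


def scale_factor(reference_distances: Sequence[float],
                 hooves: Dict[str, Point3D]) -> float:
    """Mean reference distance (dist2DXmean) divided by the model distance."""
    return mean(reference_distances) / hoof_extent_x(hooves)


def scale_vertices(vertices: Sequence[Point3D], factor: float) -> List[Point3D]:
    """Uniformly scale mesh vertices about the origin."""
    return [(x * factor, y * factor, z * factor) for x, y, z in vertices]
```

With a Trimesh mesh object, the same uniform scaling would typically be applied through the mesh's `apply_scale` method rather than by looping over vertices. A simple plausibility check after scaling is to compare the nose-to-tail-start distance of the scaled model with the published head-body length range for the species.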

CONCLUSIONS
In this paper, we perform parametric model fitting on zebras using monocular videos captured by drones in the wild. The results suggest that combining photogrammetric processing and parametric model fitting on oblique monocular drone footage is an effective technique for quantifying the posture and shape of zebras observed in the wild. At the moment, we do not have ground truth to verify the error of this metric estimation, but such ground truthing will be pursued in further research activities within the framework of the WildDrone project. We plan to use toy animals moved around while drones survey the area and acquire multi-view images. A standard issue with SMALR fitting is the estimation of shape from dorsal viewpoints and missed gaze direction, and these problems were observed with drone footage as well. To address this, we plan to collect aerial multi-view data with simultaneous multi-drone flights to understand how far multiple perspectives can improve the fitting for odd poses and dorsal views. This was an important reason for choosing SMALR as the parametric model fitting algorithm in this pipeline. Since the model can represent several other species of interest, we will also look at collecting data for different species. Aerial high-resolution oblique viewpoints open doors for 3D insights in wildlife conservation and introduce opportunities to increase the granularity and depth of pipelines such as pose estimation. The performance of most current methods worsens when they are used outside the domain they were developed for (Jiang et al., 2022); it is therefore crucial that drone data start to be considered an essential domain in which these 3D methods are tested and adapted to wildlife monitoring and conservation.
Figure 8: Fitting results and 3D shape recovery on zebras from multiple drone frames: automatically extracted joints and 3D shape without/with superimposed original image.

Figure 1: (a) Nadir vs oblique views for animal surveying. (b) Nadir imagery sees animals only from the top. (c) Oblique views offer better scope for studying individual animal characteristics.

Figure 3: Ol Pejeta Conservancy area as seen in Google Maps (left). UAV views from the WildDrone data acquisition in July 2023 (right).

Figure 4: Recovered camera poses and sparse 3D scene from bins of 194 (a) and 227 (b) frames, respectively. Fusion of the separate processing results exploiting GNSS information extracted from the flight logs (c).

Figure 6: Visualization of SMALR fitting results in Blender.
These points are labelled as Hrf (image right front hoof), Hlf (image left front hoof), Hrb (image right back hoof), and Hlb (image left back hoof) in x, y image coordinates. The next step involves initiating a ground plane that denotes the maximum length covered within the x range: dist2DX = | distance (maxext (Hrf, Hlf), maxext (Hrb, Hlb)) |

Figure 7: Scaling results visualized in Blender. Visualization of the rigid scaling factor deduction using the properties of similar triangles in the plane of △HLF Pi HRB.