A Transformer-LSTM Hybrid Improves Maize-Yield Estimation in Smallholder Fields of Malawi
Keywords: Remote Sensing, Machine Learning, Yield Estimation, GeoAI, Photogrammetry, LOFO
Abstract. Accurate in-season yield estimation is essential for Malawi’s food-security planning, yet conventional crop-cut surveys cover fewer than 1% of the nation’s approximately 1.8 million sub-hectare maize plots. In this study, we exploit Sentinel-2 time-series imagery to benchmark five modelling paradigms: spectral-index linear regression, XGBoost, a CNN-LSTM, a frozen Vision Transformer (ViT), and a ViT-LSTM hybrid. We apply these across eight rain-fed maize fields (0.2–0.9 ha) in Zomba District, Malawi, under a strict nested leave-one-field-out cross-validation design. Our results show that the recurrent architectures significantly outperform the tabular baselines (p ≤ 0.02, exact paired-permutation test). The ViT-LSTM hybrid achieved the lowest error (RMSE = 0.022 t ha−1; MAE = 0.019 t ha−1), an approximately 80% improvement over the best CNN-LSTM comparator (p = 0.031). Inference remains practical at ≈ 35 ms per 32 × 32-pixel patch, or roughly 3 ha s−1 on a low-end Quadro P1000 GPU, enabling national-scale yield mosaics within a week. These findings align with emerging evidence that transformer–recurrent hybrids represent the current state of the art for crop-yield prediction (see, e.g., ViT-based studies) and highlight the enduring trade-off between accuracy and throughput in operational contexts. Moreover, our open-source pipeline, the first validated on data-scarce, intercropped smallholder plots in sub-Saharan Africa, provides a reproducible blueprint for operational yield monitoring across similar agro-ecologies. The experiment scripts are available at https://github.com/jahnical/yield-pred-models-comp
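The exact paired-permutation test cited in the abstract can be sketched as follows. With eight leave-one-field-out folds there are only 2^8 = 256 sign-flip permutations, so full enumeration is feasible. The per-field MAE values below are illustrative placeholders, not the paper's actual results, and the function name is ours, not from the released scripts.

```python
from itertools import product

def exact_paired_permutation_p(errors_a, errors_b):
    """Two-sided exact paired-permutation test on the mean error difference.

    With n paired fields there are 2**n sign flips; n = 8 fields gives
    256 permutations, so the full null distribution is enumerated exactly.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    # Count permutations whose |mean difference| is at least the observed one.
    count = 0
    for signs in product((1, -1), repeat=n):
        perm_stat = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        count += perm_stat >= observed
    return count / 2**n

# Illustrative per-field MAEs (t/ha) for two models across 8 fields:
mae_hybrid  = [0.018, 0.020, 0.017, 0.021, 0.019, 0.022, 0.018, 0.020]
mae_cnnlstm = [0.095, 0.110, 0.088, 0.102, 0.097, 0.120, 0.091, 0.105]
print(f"p = {exact_paired_permutation_p(mae_hybrid, mae_cnnlstm):.4f}")
```

Because the test conditions only on the paired per-field errors, it makes no distributional assumptions, which suits the very small number of fields (eight) in this study.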
