The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Volume XLVIII-2/W9-2025
https://doi.org/10.5194/isprs-archives-XLVIII-2-W9-2025-241-2025
04 Sep 2025

G-MAE: Gesture-aware Masked Autoencoder for Human-Machine Interaction

Elena Ryumina, Dmitry Ryumin, and Denis Ivanko

Keywords: Masked Autoencoder, Multi-Scale Transformer, Multi-Head Self-Attention, Gesture Recognition, Human-Machine Interaction

Abstract. Gesture recognition remains a critical challenge in human-computer interaction due to lighting variations, background noise, and limited annotated datasets, particularly for underrepresented sign languages. To address these limitations, we propose G-MAE (Gesture-aware Masked Autoencoder), a self-supervised framework built on a Gesture-aware Multi-Scale Transformer (GMST) backbone that integrates multi-scale dilated convolutions (MSDC), multi-head self-attention (MHSA), and a multi-scale contextual feedforward network (MSC-FFN) to capture both local and long-range spatiotemporal dependencies. Pre-trained on the Slovo corpus with 50–70% masking and fine-tuned on TheRusLan, G-MAE achieves 94.48% accuracy, and ablation studies confirm the contribution of each component: removing MSDC, MSC-FFN, or MHSA reduces accuracy to 92.67%, 91.95%, and 90.54%, respectively. The optimal masking ratio (50–70%) balances information retention and learning efficiency, and the framework remains robust even with limited labeled data, advancing gesture recognition in resource-constrained scenarios.
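
The abstract does not specify layer sizes, dilation rates, or the exact wiring of the GMST block, so the PyTorch sketch below is only a rough illustration of the idea: an encoder block that chains MSDC, MHSA, and MSC-FFN with residual connections, plus the random frame masking (50–70%) used for masked-autoencoder pre-training. All module names, dimensions, and composition choices here are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: every dimension, dilation rate, and module wiring
# below is an assumption; the paper's exact GMST architecture is not given here.
import torch
import torch.nn as nn


class MSDC(nn.Module):
    """Multi-scale dilated convolutions over the temporal axis (assumed design)."""
    def __init__(self, dim, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d, groups=dim)
            for d in dilations
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, time, dim)
        y = x.transpose(1, 2)                  # -> (batch, dim, time) for Conv1d
        y = sum(branch(y) for branch in self.branches)
        return self.proj(y.transpose(1, 2))


class MSCFFN(nn.Module):
    """Multi-scale contextual FFN (assumed: expansion MLP + depthwise temporal conv)."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.context = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x):                      # x: (batch, time, dim)
        y = self.act(self.fc1(x))
        y = self.context(y.transpose(1, 2)).transpose(1, 2)
        return self.fc2(self.act(y))


class GMSTBlock(nn.Module):
    """One encoder block: MSDC -> MHSA -> MSC-FFN, each with a residual connection."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.msdc = MSDC(dim)
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = MSCFFN(dim)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x):
        x = x + self.msdc(self.norm1(x))
        a = self.norm2(x)
        x = x + self.mhsa(a, a, a, need_weights=False)[0]
        return x + self.ffn(self.norm3(x))


def random_mask(tokens, mask_ratio=0.6):
    """Drop a random subset of frame tokens (50-70% in the paper) for MAE pre-training."""
    b, t, _ = tokens.shape
    keep = max(1, int(t * (1 - mask_ratio)))
    idx = torch.rand(b, t).argsort(dim=1)[:, :keep]          # indices of kept frames
    visible = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return visible, idx


# Minimal usage: 8 clips of 32 frame tokens with 256-dim features.
x = torch.randn(8, 32, 256)
visible, kept_idx = random_mask(x, mask_ratio=0.6)
encoded = GMSTBlock()(visible)                 # encoder sees only the visible tokens
print(encoded.shape)                           # torch.Size([8, 12, 256])
```

As in standard masked-autoencoder training, a lightweight decoder (not shown) would reconstruct the masked frames from the encoded visible tokens; the ordering and normalization placement inside the block are design choices, not details taken from the paper.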
