G-MAE: Gesture-aware Masked Autoencoder for Human-Machine Interaction
Keywords: Masked Autoencoder, Multi-Scale Transformer, Multi-Head Self-Attention, Gesture Recognition, Human-Machine Interaction
Abstract. Gesture recognition remains a critical challenge in human-machine interaction due to lighting variations, background noise, and limited annotated datasets, particularly for underrepresented sign languages. To address these limitations, we propose G-MAE (Gesture-aware Masked Autoencoder), a self-supervised framework built on a Gesture-aware Multi-Scale Transformer (GMST) backbone that integrates multi-scale dilated convolutions (MSDC), multi-head self-attention (MHSA), and a multi-scale contextual feedforward network (MSC-FFN) to capture both local and long-range spatiotemporal dependencies. Pre-trained on the Slovo corpus with a 50–70% masking ratio and fine-tuned on TheRusLan, G-MAE achieves 94.48% accuracy. Ablation studies confirm the contribution of each component: removing MSDC, MSC-FFN, or MHSA reduces accuracy to 92.67%, 91.95%, and 90.54%, respectively, while the 50–70% masking range offers the best balance between information retention and learning efficiency. G-MAE thus demonstrates robust performance even with limited labeled data, advancing gesture recognition in resource-constrained scenarios.
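To make the backbone design described above concrete, the following is a minimal sketch of how a GMST-style encoder block could combine MSDC, MHSA, and MSC-FFN. It assumes PyTorch; the module names, dilation rates, expansion factor, and layer ordering are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of a GMST-style encoder block (illustrative, not the paper's code):
# multi-scale dilated convolutions (MSDC) for local context, multi-head
# self-attention (MHSA) for long-range dependencies, and an MSC-FFN that adds
# local context to the feedforward path. All hyperparameters are assumptions.
import torch
import torch.nn as nn


class MSDC(nn.Module):
    """Parallel depthwise 1D dilated convolutions over the token axis, fused by summation."""
    def __init__(self, dim: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d, groups=dim)
            for d in dilations
        ])

    def forward(self, x):                  # x: (batch, tokens, dim)
        x = x.transpose(1, 2)              # -> (batch, dim, tokens) for Conv1d
        out = sum(branch(x) for branch in self.branches)
        return out.transpose(1, 2)


class MSCFFN(nn.Module):
    """Feedforward network with a depthwise convolution injecting local context."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        h = self.fc1(x)
        h = self.dwconv(h.transpose(1, 2)).transpose(1, 2)
        return self.fc2(self.act(h))


class GMSTBlock(nn.Module):
    """One encoder block: MSDC + MHSA + MSC-FFN, each wrapped in a residual connection."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.msdc = MSDC(dim)
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = MSCFFN(dim)

    def forward(self, x):
        x = x + self.msdc(self.norm1(x))                       # local multi-scale context
        q = self.norm2(x)
        attn, _ = self.mhsa(q, q, q)
        x = x + attn                                           # long-range dependencies
        return x + self.ffn(self.norm3(x))                     # contextual feedforward


if __name__ == "__main__":
    # e.g. visible patch tokens remaining after 50-70% of patches are masked out
    tokens = torch.randn(2, 196, 256)
    print(GMSTBlock()(tokens).shape)       # torch.Size([2, 196, 256])
```

In a masked-autoencoder setup such as G-MAE, blocks like this would form the encoder that processes only the visible tokens during pre-training, with a lightweight decoder reconstructing the masked patches.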