Asian Institute of Technology (AIT)

Enhancing Sign Language Recognition with Video Swin Transformer and Keypoint-Based Frame Selection

Author: Nont Arayarungsarit
Call Number: AIT Thesis no. CS-25-02
Subject(s): Sign language--Data processing; Computer vision; Deep learning (Machine learning)

Note: A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science
Publisher: Asian Institute of Technology
Abstract: This thesis proposes a hybrid approach to sign language recognition that combines transformer-based modeling with keypoint-guided frame selection. In video-based sign language datasets, noise and redundancy are prevalent due to uninformative or repetitive frames, which can degrade model performance and increase computational cost. To address this issue, we propose DotRand, a keypoint-driven frame selection algorithm that judiciously selects the most informative frames based on motion dynamics, reducing irrelevant visual content while preserving essential gesture sequences. Our hybrid model integrates DotRand with the Video Swin Transformer, a hierarchical vision transformer designed for spatiotemporal feature learning. We evaluate this approach on two datasets: the Thai Sign Language (TSL) dataset and the widely used Word-Level American Sign Language 100 (WLASL100) benchmark. Both datasets are annotated at the word level, meaning each video corresponds to a single signed word. The TSL dataset, being low-resource and limited in size, presents specific challenges for deep learning. To overcome these limitations, controlled rotation and Gaussian noise were employed as augmentation techniques to strengthen the model's performance under real-world conditions. For comparison, we implement a CNN-BiLSTM baseline, a commonly used architecture for video-based sequence modeling. Our proposed model achieves 94.84% Top-1 accuracy and 97.65% Top-5 accuracy on the TSL dataset. On the WLASL100 dataset, our model yields 70.54% Top-1 accuracy and 88.76% Top-5 accuracy, significantly outperforming the CNN-BiLSTM baseline, which achieves only 46.90% Top-1 and 77.91% Top-5 accuracy. The inclusion of both a high-resource (WLASL100) and a low-resource (TSL) dataset enables a comprehensive evaluation of the model's generalization across benchmarks. These findings highlight the effectiveness of combining keypoint-informed frame selection, data augmentation, and transformer-based architectures for robust and efficient sign language recognition in noisy and resource-constrained settings. (Hedged sketches of the frame-selection and augmentation ideas described here appear after the record fields below.)
Year: 2025
Type: Thesis
School: School of Engineering and Technology
Department: Department of Information and Communications Technologies (DICT)
Academic Program/FoS: Computer Science (CS)
Chairperson(s): Chantri Polprasert
Examination Committee(s): Chaklam Silpasuwanchai; Mongkol Ekpanyapong
Scholarship Donor(s): AIT Scholarship
Degree: Thesis (M. Eng.) - Asian Institute of Technology, 2025
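
This record does not include the thesis body, so DotRand's exact selection rule is not described here beyond "keypoint-driven" and "based on motion dynamics." A minimal sketch of that general idea, assuming 2-D keypoints from an off-the-shelf pose estimator (e.g., MediaPipe-style hand landmarks); the function names, the top-K rule, and the clip length are illustrative assumptions, not the author's implementation:

```python
import numpy as np

def motion_scores(keypoints: np.ndarray) -> np.ndarray:
    """Score each frame by how much its keypoints move.

    keypoints: (T, K, 2) array -- T frames, K 2-D landmarks.
    Frame 0 is assigned the mean displacement so it is neither
    favored nor discarded by default.
    """
    # Mean L2 displacement of each frame's keypoints vs. the previous frame.
    disp = np.linalg.norm(keypoints[1:] - keypoints[:-1], axis=-1).mean(axis=-1)
    return np.concatenate([[disp.mean()], disp])

def select_frames(keypoints: np.ndarray, num_frames: int = 32) -> np.ndarray:
    """Return the indices of the `num_frames` highest-motion frames,
    sorted temporally so the gesture sequence is preserved."""
    scores = motion_scores(keypoints)
    keep = np.argsort(scores)[-num_frames:]
    return np.sort(keep)

# Example: 120 raw frames of 21 hand landmarks -> 32 kept frame indices,
# ready to index into the video tensor fed to the recognition model.
keypoints = np.random.rand(120, 21, 2)
print(select_frames(keypoints, num_frames=32))
```

Returning a fixed number of temporally ordered indices matches what a fixed-clip-length backbone such as the Video Swin Transformer expects as input.

The abstract also names controlled rotation and Gaussian noise as the augmentations used for the low-resource TSL data. A hedged sketch of that combination, assuming clips are float arrays in [0, 1]; the angle range and noise scale are placeholders, not the thesis's settings:

```python
import numpy as np
from scipy.ndimage import rotate

def augment_clip(frames: np.ndarray, max_angle: float = 10.0,
                 noise_std: float = 0.02) -> np.ndarray:
    """Rotate a whole clip by one small random angle, then add
    Gaussian pixel noise. frames: (T, H, W, C) floats in [0, 1]."""
    angle = np.random.uniform(-max_angle, max_angle)
    # One angle for all frames keeps the clip temporally coherent.
    rotated = rotate(frames, angle, axes=(1, 2), reshape=False, order=1)
    noisy = rotated + np.random.normal(0.0, noise_std, size=frames.shape)
    return np.clip(noisy, 0.0, 1.0)
```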

