Asian Institute of Technology (AIT)

Enhancing Sign Language Recognition with Video Swin Transformer and Keypoint-Based Frame Selection

Author: Nont Arayarungsarit
Call Number: AIT Thesis no. CS-25-02
Subject(s): Sign language--Data processing; Computer vision; Deep learning (Machine learning)

Note: A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science
Publisher: Asian Institute of Technology
Abstract: This thesis proposes a hybrid approach to sign language recognition that combines transformer-based modeling with keypoint-guided frame selection. In video-based sign language datasets, noise and redundancy are prevalent due to uninformative or repetitive frames, which can degrade model performance and increase computational cost. To address this issue, we propose DotRand, a keypoint-driven frame selection algorithm that judiciously selects the most informative frames based on motion dynamics, reducing irrelevant visual content while preserving essential gesture sequences. Our hybrid model integrates DotRand with the Video Swin Transformer, a hierarchical vision transformer designed for spatiotemporal feature learning. We evaluate this approach on two datasets: the Thai Sign Language (TSL) dataset and the widely used Word-Level American Sign Language 100 (WLASL100) benchmark. Both datasets are annotated at the word level, meaning each video corresponds to a single signed word. The TSL dataset, being low-resource and limited in size, presents specific challenges for deep learning. To overcome these limitations, controlled rotation and Gaussian noise were employed as augmentation techniques to strengthen the model's performance under real-world conditions. For comparison, we implement a CNN-BiLSTM baseline, a commonly used architecture for video-based sequence modeling. Our proposed model achieves 94.84% Top-1 accuracy and 97.65% Top-5 accuracy on the TSL dataset. On the WLASL100 dataset, our model yields 70.54% Top-1 accuracy and 88.76% Top-5 accuracy, significantly outperforming the CNN-BiLSTM baseline, which achieves only 46.90% Top-1 and 77.91% Top-5 accuracy. The inclusion of both a high-resource (WLASL100) and a low-resource (TSL) dataset enables a comprehensive evaluation of the model's generalization across benchmarks. These findings highlight the effectiveness of combining keypoint-informed frame selection, data augmentation, and transformer-based architectures for robust and efficient sign language recognition in noisy and resource-constrained settings. (Hedged sketches of the frame-selection and augmentation ideas described here appear after the record fields below.)
Year: 2025
Type: Thesis
School: School of Engineering and Technology
Department: Department of Information and Communications Technologies (DICT)
Academic Program/FoS: Computer Science (CS)
Chairperson(s): Chantri Polprasert
Examination Committee(s): Chaklam Silpasuwanchai; Mongkol Ekpanyapong
Scholarship Donor(s): AIT Scholarship
Degree: Thesis (M. Eng.) - Asian Institute of Technology, 2025
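
This record does not include the thesis body, so DotRand's exact selection rule is not described here beyond "keypoint-driven" and "based on motion dynamics." A minimal sketch of that general idea, assuming 2-D keypoints from an off-the-shelf pose estimator (e.g., MediaPipe-style hand landmarks); the function names, the top-K rule, and the clip length are illustrative assumptions, not the author's implementation:

```python
import numpy as np

def motion_scores(keypoints: np.ndarray) -> np.ndarray:
    """Score each frame by how much its keypoints move.

    keypoints: (T, K, 2) array -- T frames, K 2-D landmarks.
    Frame 0 is assigned the mean displacement so it is neither
    favored nor discarded by default.
    """
    # Mean L2 displacement of each frame's keypoints vs. the previous frame.
    disp = np.linalg.norm(keypoints[1:] - keypoints[:-1], axis=-1).mean(axis=-1)
    return np.concatenate([[disp.mean()], disp])

def select_frames(keypoints: np.ndarray, num_frames: int = 32) -> np.ndarray:
    """Return the indices of the `num_frames` highest-motion frames,
    sorted temporally so the gesture sequence is preserved."""
    scores = motion_scores(keypoints)
    keep = np.argsort(scores)[-num_frames:]
    return np.sort(keep)

# Example: 120 raw frames of 21 hand landmarks -> 32 kept frame indices,
# ready to index into the video tensor fed to the recognition model.
keypoints = np.random.rand(120, 21, 2)
print(select_frames(keypoints, num_frames=32))
```

Returning a fixed number of temporally ordered indices matches what a fixed-clip-length backbone such as the Video Swin Transformer expects as input.

The abstract also names controlled rotation and Gaussian noise as the augmentations used for the low-resource TSL data. A hedged sketch of that combination, assuming clips are float arrays in [0, 1]; the angle range and noise scale are placeholders, not the thesis's settings:

```python
import numpy as np
from scipy.ndimage import rotate

def augment_clip(frames: np.ndarray, max_angle: float = 10.0,
                 noise_std: float = 0.02) -> np.ndarray:
    """Rotate a whole clip by one small random angle, then add
    Gaussian pixel noise. frames: (T, H, W, C) floats in [0, 1]."""
    angle = np.random.uniform(-max_angle, max_angle)
    # One angle for all frames keeps the clip temporally coherent.
    rotated = rotate(frames, angle, axes=(1, 2), reshape=False, order=1)
    noisy = rotated + np.random.normal(0.0, noise_std, size=frames.shape)
    return np.clip(noisy, 0.0, 1.0)
```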

