AIT Asian Institute of Technology

Part of speech masking effect on vision-language representation learning
Author	Pasit Tiwawongrut
Call Number	AIT Thesis no.DSAI-25-07
Subject(s)	Artificial intelligence;Natural language processing (Computer science)
Note	A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Data Science and Artificial Intelligence
Publisher	Asian Institute of Technology
Abstract	Vision-language (VL) models have shown promising performance across multiple tasks in both zero-shot and fine-tuning setups. Most studies use masked language modeling (MLM) as a pre-training task by applying random masking to image caption tokens. However, random token masking is not an optimal strategy for training VL models, and effective masking strategies in VL remain underexplored. In this work, we investigate the effects of part-of-speech (POS) masking, as each POS category contributes differently to sentence meaning. We pre-train models with different POS masking strategies and evaluate each model on image-text retrieval, image-text matching, and visual question answering (VQA) tasks. Our findings contribute to a deeper understanding of how POS masking influences model performance, providing insights that can lead to more effective pre-training strategies for future VL models. Our experiments show that the choice of masked tokens matters. For retrieval tasks, masking simpler tokens such as determiners leads to higher accuracy than masking nouns, suggesting that freeing the model from predicting harder words can improve overall alignment. On VALSE, selective POS masking consistently performs better than random masking. The VQA results show that content-word masking helps most with fine-grained understanding. Even categories that perform less well in retrieval still add value in VQA, showing that different POS categories support different aspects of cross-modal learning. We also confirm that models trained with MLM consistently outperform those trained without it, especially on downstream tasks.
Year	2025
Type	Thesis
School	School of Engineering and Technology
Department	Department of Information and Communications Technologies (DICT)
Academic Program/FoS	Data Science and Artificial Intelligence (DSAI)
Chairperson(s)	Chaklam Silpasuwanchai
Examination Committee(s)	Chantri Polprasert;Attaphongse Taparugssanagorn
Scholarship Donor(s)	AIT Scholarship
Degree	Thesis (M. Sc.) - Asian Institute of Technology, 2025