Part of speech masking effect on vision-language representation learning | |
| Author | Pasit Tiwawongrut |
| Call Number | AIT Thesis no.DSAI-25-07 |
| Subject(s) | Artificial intelligence; Natural language processing (Computer science) |
| Note | A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Data Science and Artificial Intelligence |
| Publisher | Asian Institute of Technology |
| Abstract | Vision-language (VL) models have shown promising performance across multiple tasks in both zero-shot and fine-tuning setups. Most studies use masked language modeling (MLM) as a pre-training task by applying random masking to image caption tokens. However, random token masking is not an optimal strategy for training VL models, and effective masking strategies in VL remain underexplored. In this work, we investigate the effects of part-of-speech (POS) masking, as each POS category contributes differently to sentence meaning. We pre-train models with different POS masking strategies and evaluate each model on image-text retrieval, image-text matching, and visual question answering (VQA) tasks. Our findings contribute to a deeper understanding of how POS masking influences model performance, providing insights that can lead to more effective pre-training strategies for future VL models. Our experiments show that the choice of masked tokens matters. For retrieval tasks, masking simpler tokens like determiners leads to higher accuracy than masking nouns, suggesting that freeing the model from predicting harder words can improve overall alignment. For VALSE, selective POS masking consistently performs better than random masking. The VQA results show that content-word masking helps most with fine-grained understanding. Even categories that perform less well in retrieval still add value in VQA, showing that different POS categories support different aspects of cross-modal learning. We also confirm that models trained with MLM consistently outperform those trained without it, especially on downstream tasks. |
| Year | 2025 |
| Type | Thesis |
| School | School of Engineering and Technology |
| Department | Department of Information and Communications Technologies (DICT) |
| Academic Program/FoS | Data Science and Artificial Intelligence (DSAI) |
| Chairperson(s) | Chaklam Silpasuwanchai |
| Examination Committee(s) | Chantri Polprasert;Attaphongse Taparugssanagorn |
| Scholarship Donor(s) | AIT Scholarship |
| Degree | Thesis (M. Sc.) - Asian Institute of Technology, 2025 |
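
The abstract contrasts random token masking with masking restricted to a chosen POS category. The following is a minimal illustrative sketch of that idea, not the thesis's implementation: the function name `pos_mask`, the `[MASK]` string, the `mask_prob` parameter, and the hand-assigned tags in the example are all assumptions made for illustration. This record does not state the tagger, tokenizer, or masking ratio actually used.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumption, not from the thesis)


def pos_mask(tokens, pos_tags, target_pos, mask_prob=1.0, seed=0):
    """Mask tokens whose POS tag is in `target_pos`.

    Returns (masked_tokens, labels), where labels holds the original
    token at masked positions and None elsewhere, mirroring the usual
    MLM setup in which only masked positions contribute to the loss.
    """
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    rng = random.Random(seed)
    masked, labels = [], []
    for tok, tag in zip(tokens, pos_tags):
        # Only tokens of the selected POS category are candidates.
        if tag in target_pos and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical caption with hand-assigned coarse POS tags.
caption = ["a", "dog", "chases", "the", "red", "ball"]
tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]

noun_masked, noun_labels = pos_mask(caption, tags, {"NOUN"})
det_masked, det_labels = pos_mask(caption, tags, {"DET"})
```

With `mask_prob=1.0`, every token of the selected category is masked, so the noun variant hides "dog" and "ball" while the determiner variant hides "a" and "the". In practice the tags would come from an automatic POS tagger rather than being written by hand.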