Multi-medical document summarization using XX masking approach | |
Author | Koirala, Ayush |
Call Number | AIT Thesis no.DSAI-23-12 |
Subject(s) | Natural language processing (Computer science);Information retrieval |
Note | A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Data Science and Artificial Intelligence |
Publisher | Asian Institute of Technology |
Abstract | Multi-document summarization is essential for capturing key information from the vast medical literature. Summarization in the medical domain faces significant challenges due to the size and complexity of medical literature datasets, and the specific meanings of medical keywords and their clinical importance amplify this difficulty. However, existing summarization methods that employ token probability marginalization encounter critical challenges. This technique, by averaging token probabilities to determine their relevance, may fail to fully capture the intricate details of medical texts, potentially leading to inaccuracies or misrepresentations, particularly of less common medical terms. To address these limitations, we propose a masking approach specifically designed to effectively select candidate sentences from the masking background information. This approach commences with the selection of candidate documents utilizing Dense Passage Retrieval (DPR) and then selects candidate sentences based on background information. We conducted a comprehensive analysis of five distinct masking techniques, applied at three masking ratios, and assessed their effectiveness across four different BART model sizes. Our experiments demonstrated that employing a TF-IDF (Term Frequency-Inverse Document Frequency) based background masking strategy at a 15% masking ratio, particularly when fine-tuned on the BART-LARGE CNN model, yielded the highest ROUGE scores. This performance surpasses previous benchmarks established on the MS2 dataset, thereby underscoring the efficacy of our proposed approach in enhancing the quality of multi-document summarization in the medical domain. |
Year | 2023 |
Type | Thesis |
School | School of Engineering and Technology |
Department | Department of Information and Communications Technologies (DICT) |
Academic Program/FoS | Data Science and Artificial Intelligence (DSAI) |
Chairperson(s) | Chaklam Silpasuwanchai |
Examination Committee(s) | Chantri Polprasert;Attaphongse Taparugssanagorn |
Scholarship Donor(s) | AIT Fellowship |
Degree | Thesis (M. Eng.) - Asian Institute of Technology, 2023 |