Asian Institute of Technology (AIT)

Automatic bitext alignment for Southeast Asian languages

Author: Lwin Moe
Call Number: AIT Thesis no. CS-09-07
Subject(s): Thai language

Note: A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, School of Engineering and Technology
Publisher: Asian Institute of Technology
Series Statement: Thesis ; no. CS-09-07
Abstract: Bitext alignment is the task of aligning words, phrases, or sentences in one language with their equivalent translations in another. Aligned bitexts help lay the groundwork for statistical machine translation, are useful for language teaching, provide data for cross-language information retrieval, and have a variety of other applications. This thesis investigates the problem of bitext alignment for English and Southeast Asian languages. Although bitext alignment in general has been well studied, most algorithms, implementations, and even performance metrics depend on the assumption that both texts have been regularly divided into words and sentences. Bitext alignment of Southeast Asian languages has not benefited from previous work because texts in these languages are not normally divided this way, and there is no completely reliable machine method for dividing them into words and sentences. We will use Thai as our example and test language because experimental data are readily available. However, our goal is to develop insights into the best methods of automatically aligning "low resource" Southeast Asian languages like Burmese, Khmer, and Lao. This thesis will explore dictionary-based alignment methods to improve on the basic length-based method. We will begin by introducing existing European and Asian bitext corpora, and then discuss current approaches to bitext alignment problems. First, we discuss the basic length-based approach that we use as our baseline method. We then look at the use of lexical features and semantic analysis (for example, dictionary-based similarity and WordNet relatedness measures) to enhance the baseline method. Finally, we test different approaches to adapting a Southeast Asian language, Thai, to work with these methods. Before aligning with dictionary-based methods, we pre-segment the Thai input using various techniques and prepare the English and Thai input using stemming, stopword removal, or normalization of derived forms in English.
This thesis will make the following contributions:
1. It will establish the baseline performance of the naive basic method.
2. It will introduce metrics for evaluating the performance of bitext alignment, taking both sentence boundary detection and alignment of individual Thai segments into account.
3. It will test and measure different approaches to Southeast Asian word segmentation in the input text preparation before determining similarity between Thai sentence segments and English sentences.
4. It will compare the effectiveness of English-to-English comparison (that is, translating the Thai segments to English first) versus Thai-to-Thai comparison (that is, translating the English sentences to Thai first).
5. It will test and measure the effects of using different types of dictionaries for translation and alignment.
6. It will test and measure the effects of stopword removal, stemming, and simplification of derived forms on dictionary-based realignment.
7. It will test WordNet relatedness analysis to realign the output of the length-based method.
8. It will provide data that will be useful for ongoing research into such problems as detection and correction of misordered or missing alignment pairs.
9. It will make Southeast Asian language-specific recommendations on performance measurement, segmentation algorithm, segmentation dictionary, translation type, translation dictionary, and different approaches to improving the segment and sentence similarity test.
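As an illustration of the English-to-English comparison named in the contributions, the following is a minimal sketch and not the thesis's implementation. It assumes the Thai input is already segmented into words, and the toy glossary, the stopword list, and the function names (`gloss`, `english_bag`, `similarity`, `best_match`) are all hypothetical. Each Thai segment is glossed into English through the dictionary, then scored by how many English content words the glosses cover.

```python
# Toy Thai-to-English glossary; the entries are illustrative assumptions only.
TOY_DICT = {
    "แมว": {"cat"},
    "กิน": {"eat", "eats"},
    "ปลา": {"fish"},
    "หมา": {"dog"},
}

# Tiny stopword list, also an assumption made for this sketch.
STOPWORDS = {"the", "a", "an", "is", "of"}


def gloss(thai_tokens):
    """Collect the English glosses of pre-segmented Thai tokens."""
    words = set()
    for token in thai_tokens:
        words |= TOY_DICT.get(token, set())  # unknown tokens add nothing
    return words


def english_bag(sentence):
    """Lowercase, strip basic punctuation, and drop stopwords."""
    tokens = (w.strip(".,!?") for w in sentence.lower().split())
    return {w for w in tokens if w and w not in STOPWORDS}


def similarity(thai_tokens, english_sentence):
    """Fraction of English content words covered by the Thai glosses."""
    english = english_bag(english_sentence)
    if not english:
        return 0.0
    return len(gloss(thai_tokens) & english) / len(english)


def best_match(thai_tokens, english_sentences):
    """Index of the English sentence most similar to the Thai segment."""
    return max(
        range(len(english_sentences)),
        key=lambda i: similarity(thai_tokens, english_sentences[i]),
    )
```

For example, `best_match(["หมา"], ["The cat eats fish.", "The dog is asleep."])` selects the second sentence, because only that sentence shares a content word with the gloss of the Thai segment. A fuller system would replace the toy glossary with a real bilingual dictionary and could add stemming so that inflected forms such as "eats" need not be listed separately.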
Year: 2009
Corresponding Series Added Entry: Asian Institute of Technology. Thesis ; no. CS-09-07
Type: Thesis
School: School of Engineering and Technology (SET)
Department: Department of Information and Communications Technologies (DICT)
Academic Program/FoS: Computer Science (CS)
Chairperson(s): Janecek, Paul
Examination Committee(s): Dailey, Matthew; Cooper, Doug
Scholarship Donor(s): Asian Institute of Technology Fellowship
Degree: Thesis (M.Sc.) - Asian Institute of Technology, 2009

