AIT Asian Institute of Technology

Compression of large Vietnamese text documents

Author: Do Duc Hanh
Call Number: AIT Thesis no. CS-95-05
Subject(s): Data compression (Computer science)

Note: A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering
Publisher: Asian Institute of Technology
Abstract: Digital libraries require efficient methods of storing vast amounts of information in a way that still allows fast search and retrieval. These goals conflict: decompression increases access time, and the index needed for searching enlarges the stored data. This study investigated efficient methods for compressing large Vietnamese text documents to create databases for digital libraries. The characteristics of Vietnamese text were analyzed. A zero-order word-based model coupled with canonical Huffman coding was used to compress the documents. An in-place merging algorithm was then used to create inverted files. Finally, integer coding methods were used to reduce the space requirements of the temporary and inverted files. With the proposed approach, documents can be decoded quickly and full-text queries are supported on the compressed documents. The size of the compressed database (including an index of every word) is about 40% of the original text size.
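The compression step named in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the whitespace tokenizer, the function names (`tokenize`, `code_lengths`, `canonical_codes`, `encode`, `decode`), and the bit-string representation are all assumptions made for illustration, and punctuation and separators are ignored entirely.

```python
import heapq
from collections import Counter
from itertools import count


def tokenize(text):
    # Assumption: a "word" is a whitespace-separated token, lowercased.
    # The thesis may define words differently.
    return text.lower().split()


def code_lengths(freqs):
    """Return the Huffman code length of each symbol for the given counts."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths


def canonical_codes(lengths):
    """Assign canonical codes: sort by (length, symbol) and count upward."""
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len  # pad with zeros when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


def encode(words, codes):
    # Bit string for readability; a real coder would pack bits into bytes.
    return "".join(codes[w] for w in words)


def decode(bits, codes):
    inverse = {c: s for s, c in codes.items()}
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free codes make this unambiguous
            words.append(inverse[current])
            current = ""
    return words


if __name__ == "__main__":
    sample = "tôi yêu việt nam tôi yêu hà nội tôi"
    words = tokenize(sample)
    codes = canonical_codes(code_lengths(Counter(words)))
    bits = encode(words, codes)
    print(codes)
    print(len(bits), "bits for", len(words), "words")
```

Because canonical codes are fully determined by the code lengths and the symbol order, a decoder only needs the sorted vocabulary and the length of each code, not the Huffman tree itself. That property is a general feature of canonical Huffman coding; how the thesis stores its model is not stated in this record.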
Year: 1995
Type: Thesis
School: School of Engineering and Technology (SET)
Department: Department of Information and Communications Technologies (DICT)
Academic Program/FoS: Computer Science (CS)
Chairperson(s): Huynh, Ngoc Phien
Examination Committee(s): Phan, Minh Dung; Batanov, Dentcho N.
Scholarship Donor(s): The Swedish International Development Authority Agency (SIDA)
Degree: Thesis (M.Eng.) - Asian Institute of Technology, 1995

