Title | Compression of large Vietnamese text documents |
Author | Do Duc Hanh |
Call Number | AIT Thesis no. CS-95-05 |
Subject(s) | Data compression (Computer science) |
Note | A thesis submitted in the partial fulfillment of the requirement for the degree of Master of Engineering |
Publisher | Asian Institute of Technology |
Abstract | Digital libraries need to store vast amounts of information efficiently while still supporting fast search and retrieval. These goals conflict: decompression increases access time, and an index enlarges the stored data. This study investigated efficient methods for compressing large Vietnamese text documents to build databases for digital libraries. The characteristics of Vietnamese text were analyzed. A zero-order word-based model coupled with canonical Huffman coding was used to compress the documents. An in-place merging algorithm was then used to create the inverted files. Finally, integer coding methods were used to reduce the space required by the temporary and inverted files. With the proposed approach, documents can be decoded quickly and full-text queries are supported on the compressed documents. The compressed database, including an index to every word, is about 40% of the original text size. |
Year | 1995 |
Type | Thesis |
School | School of Engineering and Technology (SET) |
Department | Department of Information and Communications Technologies (DICT) |
Academic Program/FoS | Computer Science (CS) |
Chairperson(s) | Huynh, Ngoc Phien; |
Examination Committee(s) | Phan, Minh Dung;Batanov, Dentcho N.; |
Scholarship Donor(s) | The Swedish International Development Authority Agency (SIDA); |
Degree | Thesis (M.Eng.) - Asian Institute of Technology, 1995 |
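
The abstract names a zero-order word-based model with canonical Huffman coding. The following is a minimal Python sketch of that general technique, not the thesis implementation. The tokenizer regex, the single shared model for words and separators, the function names (`tokenize`, `code_lengths`, `canonical_codes`, `encode`, `decode`), and the use of bit strings instead of packed bytes are all assumptions made for illustration.

```python
import heapq
import re
from collections import Counter


def tokenize(text):
    """Split text into alternating word and non-word tokens (assumed tokenizer)."""
    return re.findall(r"\w+|\W+", text)


def code_lengths(freqs):
    """Compute a Huffman code length for each token from its frequency."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap entries: (weight, unique tie-breaker, tokens in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every token under the merged node gets one bit deeper
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return lengths


def canonical_codes(lengths):
    """Assign canonical codes: order by (length, token) and count upward."""
    codes, code, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len  # pad with zeros when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


def encode(text, codes):
    """Concatenate the code of every token (bit string, for readability)."""
    return "".join(codes[t] for t in tokenize(text))


def decode(bits, codes):
    """Greedy decoding works because Huffman codes are prefix-free."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

A decoder only needs the code lengths to rebuild a canonical code table, which is one reason canonical codes suit large word vocabularies. Typical use: build `Counter(tokenize(text))`, pass it to `code_lengths`, then to `canonical_codes`, and call `encode` and `decode` with the resulting table.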