AIT Asian Institute of Technology

Multilingual information extraction from semi-structured documents

Author: Watsamon Pongsupan
Call Number: AIT Thesis no. CS-22-03
Subject(s): Optical character recognition; Computational linguistics; Artificial intelligence
Note: A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science
Publisher: Asian Institute of Technology
Abstract: Discovering knowledge generally requires analyzing structured data; however, most data is only available in unstructured form, and transforming unstructured data into structured data is a tedious task. Converting a digital image to text requires an optical character recognition (OCR) engine. Although Latin-based OCR achieves nearly 100% accuracy in ideal scenarios, Thai OCR is considered a complicated problem because Thai letters are composed in various ways, and Thai symbols can be stacked in up to four levels. Only a few OCR engines are able to work efficiently on multi-language images. In this thesis, I aim to construct an OCR model using deep learning methods. I am also interested in Google OCR and Tesseract OCR, so I experiment with three OCR models in the proposed system. Several techniques are used to extract and organize information, including regular expressions and rule-based methods integrated with a custom dictionary. The results of my experiments are illustrated and analyzed, and the extracted information is displayed via a mobile application. The YOLO model achieves an average precision over all classes of 96%, but the resulting OCR system achieves only 63% accuracy on non-numeric datasets. By comparison, Tesseract OCR achieves 70% accuracy, and Google OCR gives the best accuracy, 86%.
Year: 2022
Type: Thesis
School: School of Engineering and Technology
Department: Department of Information and Communications Technologies (DICT)
Academic Program/FoS: Computer Science (CS)
Chairperson(s): Dailey, Matthew N.
Examination Committee(s): Mongkol Ekpanyapong; Chaklam Silpasuwanchai
Scholarship Donor(s): His Majesty the King’s Scholarship (Thailand)
Degree: Thesis (M. Eng.) - Asian Institute of Technology, 2022

