Multilingual information extraction from semi-structured documents | |
Author | Watsamon Pongsupan |
Call Number | AIT Thesis no.CS-22-03 |
Subject(s) | Optical character recognition; Computational linguistics; Artificial intelligence |
Note | A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science |
Publisher | Asian Institute of Technology |
Abstract | Discovering knowledge generally requires analyzing structured data; however, most data is only available in unstructured form. Transforming unstructured data into structured data is always a tedious task. Converting a digital image to text requires an optical character recognition (OCR) engine. Although Latin-based OCR achieves nearly 100% accuracy in ideal scenarios, Thai OCR is considered a complicated problem because Thai letters are composed in various ways. Moreover, Thai symbols can be stacked in up to four levels. Only a few OCR engines are able to work efficiently on multi-language images. In this thesis, I aim to construct an OCR model using deep learning methods. However, I am also interested in Google OCR and Tesseract OCR. Thus, I experiment with three OCR models in the proposed system. Several techniques are used to extract and organize information, including regular expressions and rule-based methods integrated with a custom dictionary. The results of my experiments are illustrated and analyzed. Finally, the results are displayed via a mobile application. The YOLO model achieves an average precision over all classes of 96%, but the resulting OCR system achieves only 63% accuracy on non-numeric datasets. In comparison, Tesseract OCR achieves 70% accuracy, and Google OCR gives the best accuracy, 86%. |
Year | 2022 |
Type | Thesis |
School | School of Engineering and Technology |
Department | Department of Information and Communications Technologies (DICT) |
Academic Program/FoS | Computer Science (CS) |
Chairperson(s) | Dailey, Matthew N. |
Examination Committee(s) | Mongkol Ekpanyapong; Chaklam Silpasuwanchai |
Scholarship Donor(s) | His Majesty the King’s Scholarship (Thailand) |
Degree | Thesis (M. Eng.) - Asian Institute of Technology, 2022 |