Multilingual information extraction from semi-structured documents | |
| Author | Watsamon Pongsupan |
| Call Number | AIT Thesis no.CS-22-03 |
| Subject(s) | Optical character recognition; Computational linguistics; Artificial intelligence |
| Note | A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science |
| Publisher | Asian Institute of Technology |
| Abstract | Discovering knowledge generally requires analyzing structured data; however, most data is available only in unstructured form, and transforming unstructured data into structured data is a tedious task. Converting a digital image to text requires an optical character recognition (OCR) engine. Although Latin-based OCR achieves nearly 100% accuracy in ideal scenarios, Thai OCR is considered a complicated problem because Thai letters are composed in various ways; moreover, Thai symbols can be stacked in up to four levels. Only a few OCR engines are able to work efficiently on multi-language images. In this thesis, I aim to construct an OCR model using deep learning methods. I am also interested in Google OCR and Tesseract OCR, so I experiment with three OCR models in the proposed system. Several techniques are used to extract and organize information, including regular expressions and rule-based methods integrated with a custom dictionary. The results of my experiments are illustrated and analyzed, and finally displayed via a mobile application. The YOLO model achieves an average precision over all classes of 96%, but the resulting OCR system achieves only 63% accuracy on non-numeric datasets. In comparison, Tesseract OCR achieves 70% accuracy, and Google OCR gives the best accuracy, 86%. |
| Year | 2022 |
| Type | Thesis |
| School | School of Engineering and Technology |
| Department | Department of Information and Communications Technologies (DICT) |
| Academic Program/FoS | Computer Science (CS) |
| Chairperson(s) | Dailey, Matthew N. |
| Examination Committee(s) | Mongkol Ekpanyapong;Chaklam Silpasuwanchai |
| Scholarship Donor(s) | His Majesty the King’s Scholarship (Thailand) |
| Degree | Thesis (M. Eng.) - Asian Institute of Technology, 2022 |