Text Detection

Fast Korean Text Detection in Traffic Guide Signs

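- Students Hyunjun Eun, Jonghee Kim, and Jinsu Kim won the top excellence award (최우수상) at the 2017 Samsung Fire & Marine Insurance Machine Learning Challenge (Nov. 9, 2017)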




In this paper, we propose a method to detect and recognize Korean text in traffic guide signs. As shown in Fig. 2, this task requires two steps: detecting Korean characters in traffic guide signs and recognizing them. We first detect Korean characters by using a region proposal network (RPN) proposed in [5]. We construct a new RPN by introducing residual blocks [6] and the Inception architecture [7] for better performance and inference speed. Based on the Korean character candidates detected by the RPN, we classify them into 710 classes: 709 Korean characters commonly used in practice and one non-Korean class. Like the RPN, our classification network (CLSN) contains residual blocks and the Inception architecture, but the CLSN is deeper than the RPN in order to distinguish 710 classes. The proposed method performs detection and recognition at the character level to reduce running time. Both detection and recognition could instead be performed at the alphabet level. However, that approach requires an additional step to compose each character from its alphabet letters, which increases running time. In experiments, we achieved a best accuracy of 98.14% for detection and recognition on 709 Korean character classes. The fastest configuration of our method runs at 5.9 fps with 97.69% accuracy for the whole process.



The main framework consists of two steps: 1) candidate detection and 2) candidate classification. In the candidate detection step, we obtain Korean character candidates by using an RPN. These candidates are then fed to a CLSN, which recognizes the Korean character class of each candidate. The overall framework is illustrated in Fig. 3.
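The two-step flow can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: `propose` and `classify` are hypothetical stand-ins for the RPN and the CLSN, and the index chosen for the non-Korean class (`NON_KOREAN = 709`) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels

# Assumption for this sketch: labels 0..708 are Korean characters and
# label 709 is the single non-Korean class (710 classes in total).
NON_KOREAN = 709


@dataclass
class Detection:
    box: Box
    label: int
    score: float


def detect_and_recognize(
    image,
    propose: Callable[[object], List[Box]],
    classify: Callable[[object, Box], Tuple[int, float]],
) -> List[Detection]:
    """Run candidate detection, then candidate classification."""
    results = []
    for box in propose(image):               # step 1: RPN stand-in
        label, score = classify(image, box)  # step 2: CLSN stand-in
        if label != NON_KOREAN:              # drop rejected candidates
            results.append(Detection(box, label, score))
    return results


# Toy stand-ins so the sketch runs without any trained network.
def toy_propose(image) -> List[Box]:
    return [(0, 0, 10, 10), (20, 20, 30, 30)]


def toy_classify(image, box: Box) -> Tuple[int, float]:
    # Pretend the first candidate is Korean character #5 and the second is not Korean.
    return (5, 0.9) if box[0] == 0 else (NON_KOREAN, 0.8)


detections = detect_and_recognize(None, toy_propose, toy_classify)
```

Passing the two stages in as callables keeps the sketch independent of any particular network library; real models would replace the toy functions.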







Our dataset consists of 51k images of traffic guide signs containing 709 classes of Korean characters. The images cover various orientations, sizes, and illumination conditions of traffic guide signs and the Korean characters in them. For training, we first manually labeled the locations and classes of Korean characters in 4k images to train the RPN and the CLSN. To train the CLSN, we used English and Chinese characters, numbers, and background regions in these 4k images as negative patches. To increase the amount of training data, we employed detection results obtained by the intermediately trained RPN and CLSN. In other words, we tested 50k images and iteratively added correctly classified results to the positive set and misclassified results to the negative set of the training data. As a result, our networks were trained with 200k positive patches and 130k negative patches.

To evaluate the proposed method, we used 1k images containing 10,852 Korean characters. Table 2 shows accuracy and computation time for different Intersection over Union (IoU) thresholds of non-maximum suppression. Our method achieved a best accuracy of 98.14% at 3.3 fps. For faster detection and recognition, a small amount of accuracy can be traded for speed, reaching 5.9 fps with 97.69% accuracy. In addition, we quantitatively analyzed the F-measure of each character class, as shown in Fig. 5. Note that we omitted class labels because of limited space and only show the results of classes with more than ten characters. As can be seen, 96.3% of classes have an F-measure of 0.8 or higher, and 91.7% of classes have an F-measure of 0.9 or higher. We also confirmed that 3.7% of classes have an F-measure lower than 0.8. These values may not be reliable because the number of characters in these classes is fewer than three. As shown in Fig. 6, the proposed method reliably detects and recognizes Korean characters. Note that Korean characters outside a traffic guide sign are treated as negative examples and should not be detected.
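To make the role of the IoU threshold concrete, here is a minimal sketch of greedy non-maximum suppression in plain Python. It is the textbook procedure, not necessarily the exact variant used in the experiments; the box coordinates, scores, and thresholds below are made up for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float) -> List[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
    by more than iou_threshold, and repeat. Returns the kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Illustrative values only: two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept_strict = nms(boxes, scores, iou_threshold=0.5)  # overlapping box is suppressed
kept_loose = nms(boxes, scores, iou_threshold=0.7)   # overlapping box survives
```

A lower threshold suppresses more candidates, so fewer patches reach the classifier; a higher threshold keeps more of them. This is one plausible reason why the threshold affects both accuracy and frame rate.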





English Text Detection in Natural Images

- Overall frameworks


- Candidate patch extraction


- Patch Classification Using Ensemble of ResNets

The proposed method reduces classification errors by incorporating three ResNets with different hyper-parameters. First, we select two of the three ResNets, which were trained with different hyper-parameters: the learning rate of each layer was set differently for each ResNet. Second, each character candidate patch is classified as either a character region or a non-character region using the following rules.


• If both ResNets classify a character candidate patch as a character region, that patch is defined as a character region.

• If both ResNets classify a character candidate patch as a non-character region, that patch is defined as a non-character region.

• If the two ResNets disagree, the class of the character candidate patch is determined according to the confidence scores.
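The three rules above can be sketched as a small decision function. The text does not spell out how the confidence score settles a disagreement, so this sketch assumes that the more confident ResNet wins; the function name `ensemble_decision` and the `(is_character, confidence)` tuple layout are hypothetical.

```python
from typing import Tuple

# (is_character, confidence) as produced by one ResNet for one candidate patch.
Prediction = Tuple[bool, float]


def ensemble_decision(pred_a: Prediction, pred_b: Prediction) -> bool:
    """Return True if the patch is treated as a character region."""
    is_char_a, conf_a = pred_a
    is_char_b, conf_b = pred_b
    if is_char_a == is_char_b:
        # Rules 1 and 2: both networks agree, so take the shared label.
        return is_char_a
    # Rule 3 (assumption): on disagreement, trust the more confident network.
    return is_char_a if conf_a >= conf_b else is_char_b


# Usage with made-up confidence values.
agree_char = ensemble_decision((True, 0.6), (True, 0.9))
agree_background = ensemble_decision((False, 0.7), (False, 0.8))
disagree = ensemble_decision((True, 0.55), (False, 0.95))
```

Other tie-breaking schemes, such as averaging the two confidence scores and thresholding the result, would fit the same description equally well.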



- Character Region Grouping via Self-tuning Spectral Clustering


- Text level recognition results


- Word level recognition results


- Quantitative results


Hyunjun Eun, Jonghee Kim, Jinsu Kim, and Changick Kim, "Fast Korean Text Detection and Recognition in Traffic Guide Signs," Accepted to IEEE International Conference on Visual Communications and Image Processing (VCIP), 2018.

Jinsu Kim, Yoonhyung Kim, and Changick Kim, "A Robust Ensemble of ResNets for Character Level End-to-end Text Detection in Natural Scene Images," in Proc. ACM Content-Based Multimedia Indexing (CBMI), Firenze, Italy, Jun. 19-21, 2017. https://drive.google.com/open?id=1ZGiABFVmFHM5DRPoW-QWZFKI3zuybVN9