Text in natural scene images usually carries substantial semantic value, and recognizing it is an important step toward understanding the scene. Unlike text in printed documents, text in a natural scene is difficult to recognize because of large variations in spatial placement, backgrounds, textures, fonts, and illumination conditions. In this work, we propose a method that first detects and recognizes characters with a Convolutional Neural Network (CNN), and then decodes the sequence of recognized characters into words with a Weighted Finite-State Transducer (WFST). WFSTs have been used successfully in speech recognition, where they have been shown to incorporate a lexicon or a high-order language model efficiently in word labelling tasks. Our experiments show that the proposed algorithm robustly recognizes words in scene images from public datasets such as ICDAR 2003 and SVT-WORD.
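
As a rough illustration of the decoding idea, the sketch below picks the lexicon word that best explains a sequence of per-character classifier scores. It is a simplified stand-in, not the method of this work: instead of building and composing transducers, it scores every lexicon word of matching length by brute force, which for a fixed-length character sequence and a flat word list gives the same answer as a shortest-path search over a lexicon-constrained graph. The function name `decode_with_lexicon`, the `floor` parameter, and the example scores are all hypothetical.

```python
import math


def decode_with_lexicon(char_scores, lexicon, floor=1e-6):
    """Return (best_word, cost) for the lexicon word with the lowest cost.

    char_scores: one dict per character position, mapping a character to
                 its classifier probability (assumed input format).
    lexicon:     iterable of candidate words.
    floor:       probability used for characters the classifier did not
                 propose, so that the cost stays finite.
    """
    best_word, best_cost = None, math.inf
    for word in lexicon:
        # This simplified sketch only considers words whose length
        # matches the number of detected character positions.
        if len(word) != len(char_scores):
            continue
        # Cost is the negative log-probability summed over positions,
        # the usual weight convention for shortest-path decoding.
        cost = 0.0
        for ch, scores in zip(word, char_scores):
            cost += -math.log(max(scores.get(ch, 0.0), floor))
        if cost < best_cost:
            best_word, best_cost = word, cost
    return best_word, best_cost


# Made-up classifier output for three character positions.
char_scores = [
    {"c": 0.6, "e": 0.3, "o": 0.1},
    {"a": 0.5, "o": 0.4, "u": 0.1},
    {"l": 0.55, "t": 0.45},
]
lexicon = ["cat", "cot", "eat", "dog"]

word, cost = decode_with_lexicon(char_scores, lexicon)
print(word)
```

In this made-up example, taking the top-scoring character at each position independently would give "cal", which is not a word in the lexicon; the lexicon constraint instead selects the best-scoring valid word. A real WFST decoder would additionally handle insertions, deletions, and language-model weights, which this sketch leaves out.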
