\begin{abstract}
This work presents a method for visual text recognition that uses
no paired supervisory data.
We formulate text recognition as the task of aligning the
conditional distribution of strings predicted from given text images
with the distribution of lexically valid strings sampled from target corpora.
This enables fully automated, unsupervised learning from only line-level text images and unpaired text-string samples, obviating the need for large aligned datasets.
We present a detailed analysis of several aspects of the proposed method:
(1) the impact of the length of training sequences on convergence,
(2) the relation between character frequencies and the order in which characters are learnt,
(3) the ability of our recognition network to generalise to inputs of arbitrary length,
and (4) the impact of varying the text corpus on recognition accuracy.
Finally, we demonstrate excellent text recognition accuracy on both
synthetically generated text images and scanned images of real printed books,
using no labelled training examples.
\end{abstract}
