| @@ -1,9 +1,8 @@ | | | @@ -1,9 +1,8 @@ |
1 | This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO | | 1 | Tesseract provides an OCR engine and a command line program. It |
2 | OUTPUT FORMATTING, and NO UI. It can only process an image of a | | 2 | includes a new neural net (LSTM) based OCR engine which is focused on |
3 | single column and create text from it. It can detect fixed pitch | | 3 | line recognition, but also still provides a legacy OCR engine which |
4 | vs proportional text. Having said that, in 1995, this engine was | | 4 | works by recognizing character patterns. Tesseract has Unicode (UTF-8) |
5 | in the top 3 in terms of character accuracy, and it compiles and | | 5 | support, and can recognize more than 100 languages "out of the box". |
6 | runs on both Linux and Windows. Another current limitation is that | | 6 | Tesseract can be trained to recognize other languages. It supports |
7 | it only recognizes English and its character set is only US-ASCII. | | 7 | various output formats: plain text, hOCR (HTML), PDF, |
8 | Training code IS included in the open source release however, and | | 8 | invisible-text-only PDF, and TSV. |
9 | will be included in a future release. | | | |
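
The output formats listed on the new side of this hunk can be requested from the command line program by naming a config after the output base. Below is a minimal sketch of how the invocations might be assembled from Python. The helper names `build_command` and `run_ocr` and the file names are hypothetical, and the sketch assumes a `tesseract` executable is on `PATH` with the `eng` language data installed.

```python
import shutil
import subprocess

# Trailing arguments that select each output format on the tesseract CLI.
# "txt" is the default; invisible-text-only PDF is the "pdf" config with
# the textonly_pdf variable set via -c.
OUTPUT_CONFIGS = {
    "text": ["txt"],
    "hocr": ["hocr"],
    "pdf": ["pdf"],
    "textonly_pdf": ["-c", "textonly_pdf=1", "pdf"],
    "tsv": ["tsv"],
}


def build_command(image, output_base, fmt="text", lang="eng"):
    """Return the argv list for one tesseract run (hypothetical helper)."""
    if fmt not in OUTPUT_CONFIGS:
        raise ValueError(f"unknown output format: {fmt}")
    # General shape: tesseract IMAGE OUTPUTBASE [options...] [configfile...]
    return ["tesseract", image, output_base, "-l", lang, *OUTPUT_CONFIGS[fmt]]


def run_ocr(image, output_base, fmt="text", lang="eng"):
    """Run tesseract once; raises if the executable is missing or fails."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_command(image, output_base, fmt, lang), check=True)
```

For example, `run_ocr("page.png", "page", fmt="hocr")` would be expected to write `page.hocr` next to the working directory. Tesseract appends the extension to the output base itself, so the base is passed without one.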