tesseract

Tesseract Open Source OCR Engine (main repository)



Statistics on tesseract

Number of watchers on Github 16844
Number of open issues 228
Average time to close an issue 1 day
Main language C++
Average time to merge a PR 2 days
Open pull requests 52+
Closed pull requests 45+
Last commit 6 months ago
Repo Created about 4 years ago
Repo Last Updated 6 months ago
Size 38.3 MB
Organization / Author tesseract-ocr
Latest Release 3.05.01
Contributors 24

Tesseract OCR


About

This package contains an OCR engine, libtesseract, and a command-line program, tesseract.

The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub's log of contributors.

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages out of the box.

Tesseract supports several output formats: plain text, hOCR (HTML), PDF, TSV, and invisible-text-only PDF.
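As an illustration of consuming one of these formats, the following Python sketch reads TSV output and keeps the recognized words. It is only a sketch under stated assumptions: the column names (`level` through `text`) and the convention that non-word rows carry a confidence of -1 are assumed here rather than taken from this README, the helper name `read_tsv_words` is hypothetical, and the sample data is made up for the example.

```python
import csv
import io


def read_tsv_words(tsv_text, min_conf=0.0):
    """Return (word, confidence) pairs from TSV text.

    Assumes a header row that includes 'conf' and 'text' columns.
    Rows with empty text or a confidence below min_conf are skipped.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row.get("conf") or -1)
        if text and conf >= min_conf:
            words.append((text, conf))
    return words


# Made-up sample in the assumed column layout, for illustration only.
SAMPLE = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t102\t92\t58\t24\t91\tworld\n"
)

if __name__ == "__main__":
    for word, conf in read_tsv_words(SAMPLE):
        print(word, conf)
```

Quoting is disabled in the reader so that quotation marks inside recognized text are treated as ordinary characters.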

In many cases, you will need to improve the quality of the image you give Tesseract in order to get better OCR results.

This project does not include a GUI application. If you need one, please see the 3rdParty wiki page.

Tesseract can be trained to recognize other languages. See Tesseract Training for more information.

Brief history

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co., Greeley, Colorado, between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.

The latest stable version is 3.05.01, released on June 1, 2017. The latest source code for 3.05 is available from the 3.05 branch on GitHub.

Source code for the new LSTM-based 4.00.00alpha version is available from the master branch on GitHub. Please note that this branch is under active development.

See Release Notes and Change Log for more details of the releases.

Installing Tesseract

You can either install Tesseract from a pre-built binary package or build it from source.

Supported compilers are:

  • GCC 4.8 and above
  • Clang 3.4 and above
  • MSVC 2015, 2017

Other compilers might work, but are not officially supported.

Running Tesseract

Basic command line usage:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

For more information about the command-line options, run tesseract --help or man tesseract.
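The usage line above can also be assembled programmatically. The following Python sketch builds the argument list in the same order as the synopsis and runs it with `subprocess`. The helper names `build_tesseract_command` and `run_tesseract` and the file names are hypothetical, and the sketch assumes a `tesseract` binary that accepts `--oem` and `--psm` is on the PATH; treating `hocr` as a config file name is likewise an assumption.

```python
import subprocess


def build_tesseract_command(image, output_base, lang=None, oem=None,
                            psm=None, configs=()):
    """Build an argument list that follows the usage line:

    tesseract imagename outputbase [-l lang] [--oem N] [--psm N] [configfiles...]
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if lang is not None:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    cmd += list(configs)  # config files always come last
    return cmd


def run_tesseract(image, output_base, **options):
    """Run tesseract; raises CalledProcessError on a non-zero exit code."""
    cmd = build_tesseract_command(image, output_base, **options)
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only; nothing is executed here.
    print(build_tesseract_command("page.png", "page", lang="eng", configs=["hocr"]))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.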

For developers

Developers can use the libtesseract C or C++ API to build their own applications. If you need bindings to libtesseract for other programming languages, please see the wrapper section on the AddOns wiki page.

Tesseract documentation generated from the source code by Doxygen can be found on tesseract-ocr.github.io.

Support

Before you submit an issue, please review the guidelines for this repository.

For support, first read the Wiki, particularly the FAQ to see if your problem is addressed there. If not, search the Tesseract user forum, the Tesseract developer forum and past issues, and if you still can't find what you need, ask for support in the mailing-lists.

Please open an issue only to report a bug, not to ask a question.

License

The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

NOTE: This software depends on other packages that may be licensed under different open source licenses.

Latest Version of README

For the latest online version of the README.md see:

https://github.com/tesseract-ocr/tesseract/blob/master/README.md

tesseract open issues
  • almost 2 years CMake hangs at "Performing 71 checks using 8 threads"
  • almost 2 years Memory leak of EDGEPT objects
  • almost 2 years Minor win32 sintax issues under Windows unicode
  • almost 2 years Replace all NULLs with nullptr
  • almost 2 years Resolution information in PNG files is ignored
  • almost 2 years Failure to box മ character in malayalam
  • almost 2 years Telugu - Simple characters are not being recognized
  • almost 2 years APPLY_BOXES: boxfile line FAILURE! Couldn't find a matching blob for മ in malayalam
  • almost 2 years Buffer overflow in proto_evidence_
  • almost 2 years Training issue : APPLY_BOXES failure : Couldn't find a matching blob
  • almost 2 years Are there more PSM modes than are listed in the help/wiki - 11 and 12?
  • almost 2 years Font detection broken for segmention_mode > PSM_SINGLE_WORD
  • almost 2 years Speckled Documents Create Psychological Case for Tesseract
  • about 2 years Error in boxClipToRectangle: box outside rectangle
  • about 2 years macOS regression due to "Fix Cygwin compatibility"
  • about 2 years C-API for OSResults (orientation and script results) is not a stable ABI
  • about 2 years Minimum Pango version
  • about 2 years Creating ALTO [enhancement]
  • about 2 years text2image: comma in font name
  • about 2 years User patterns using bazaar config do not work
  • about 2 years Text is garbled in pdf.js (Cygwin / UB Mannheim binaries)
  • about 2 years peculiarities when running text2image on windows
  • about 2 years Glyphless font in pdf leads to spaces between characters
  • about 2 years non-word recognition worsened/disimproved since tesseract v3.0.4 ?
  • over 2 years Dramatically different results with O1 and O2 optimizations in clang
  • over 2 years [For the record] Tesseract 3.01 crash on specific table like columns
  • over 2 years Add version information to training tools
  • over 2 years Confuse about dump "tessnoimages.png" when segmenting page and detect orientation.
  • over 2 years Compiled 3.03 and 3.04 with VS2013, Memory Leaks detected
  • over 2 years good accuracy but too slow, how to improve Tesseract speed
tesseract open pull requests
  • Add LTR & mixed direction test files
  • Remove const from STRING::string()
  • Introduce POSIX data types
  • free clusterer fixed a memory leak. TODO: free protolists referenced in normprotolist
  • add DAWG_TYPE_HFST, used by OCRicola
  • Merge google code branch https://code.google.com/r/email-hocr-tsv
  • Issue 1351: OpenCL build - kernel_ThresholdRectToPix() not accounting for padding bits in the output pix?!
  • Ocricola (H)FST support
  • Issue 1353: Patch for /training/tessopt.cpp
  • Gcode issue1199 p2
  • Issue 1199: Remove the vestigal traces of custom memory allocators, part 1, updated
  • Issue 1106: [PATCH] Fixed potential usage of uninitialized bool varia…
  • 1341: [PATCH] Fix potential null pointer dereference in ccmain/paragraphs.cpp
  • Issue 1139: Patch fixing text2image's --only_extract_font_properties mode to print variant kerning spacing
  • Issue 1316: The traineddata file must be closed after it was opened
  • Handle null raw_choice - fixes #235
  • Remove conditional definition of off_t
  • Enable all ligatures available in a font for text2image rendering
  • Dockerify using travis build script
  • Fix incompatibility with some C++11 implementations
  • Character boxes in hOCR output
  • Replace deliberate segv with raise(SIGABRT)
  • Remove redundant destructor
  • Avoid unnecessary memory allocations
  • File names in file header comments
  • training: Remove unnecessary const qualifiers
  • Fix a typo in tesseract(1) man page
  • Fix some typos (found by codespell)
  • ccutil/ambigs: Optimize tesseract::UnicharIdArrayUtils::compare
  • Fix 32 bit builds (missing _mm256_extract_epi64)
  • Simplify delete / free usage
  • Switch to semantic versioning
  • Use portable data types
  • Add experimental support for big endian hosts (read part only)
  • Add the packaging metadata to build the tesseract snap
  • Use POSIX data types and macros
  • opencl: Add 'static' attributes for local functions and variables
  • Warnings are people too!
  • Minor fixes
  • Optimize calculation of dot product for AVX
  • RFC: RAII (don't merge)
  • RFC: Add initial support for traineddata files in compressed archive formats (don't merge)
  • Fix #1222 Unable to detect text in few images when using multi language hints
  • Added JPEG quality option parameter (-c jpg_quality=n)
  • Don't drop words with low certainty
  • download and install eng and osd traineddata
  • RFC: #pragma omp simd (don't merge)
  • Change dummy test to OSD Tests for Tesseract.
  • Modify api example to use text fixtures and parameters
  • Update package version (Visual Studio)
  • doc: Add missing language to list
  • Use POSIX data types for external interfaces
tesseract questions on Stackoverflow
  • OpenCV Gaussian blur breaks Tesseract?
  • Tesseract-OCR: need to train all types of samples?
  • Android OCR App using Tesseract
  • How to convert green text (from console) with tesseract?
  • Progress/cancel callback in Tesseract using ETEXT_DESC
  • Where are the Tesseract API docs?
  • Tesseract - Recognizing symbols - hearts spades diamonds clubs
  • Can't get output from Tesseract command run through os.system
  • Can I read this captcha with imagemagick/tesseract?
  • Not able to initialize tess-two (Could not initialize Tesseract API error)
  • Building google tesseract with NDK for using in android studio
  • Android Tesseract taking too long and returning absurd data
  • How to detect only letters and numbers for a foreign language in tesseract
  • How to detect simple text with Tesseract ORC?
  • Tesseract getutf8text performance
  • What does the Tesseract OCR library require of an image to be able to accurately extract text?
  • C++ opencv / tesseract cross-compile to windows with MXE
  • Tesseract OCR on Windows Python
  • Tesseract gives no recognition results (Android studio; Java)
  • How to get Hocr output using python-tesseract
  • Initializing a Tesseract
  • OCR - How to train a new Tesseract model?
  • Tesseract 3.04.00 on mac, ERROR "can not open input file"
  • Tesseract OCR user patterns
  • Tesseract OCR not able to train image correctly
  • Android Tesseract App crashes on OCR Function
  • Tesseract OCR: Parameter for Font Size (Single Character)
  • can Tesseract OCR be extended or trainned?
  • Can I test tesseract ocr in windows command line?
  • Tesseract New Automated method training need explaination
tesseract latest release notes
3.05.01 Release

Bug fix release

3.05.00 Release
  • Made some fine tuning to the hOCR output.
  • Added TSV as another optional output format.
  • Fixed ABI break introduced in 3.04.00 with the AnalyseLayout() method.
  • text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer.
  • Training tools - Replaced asserts with tprintf() and exit(1).
  • Fixed Cygwin compatibility.
  • Improved multipage tiff processing.
  • Improved the embedded pdf font (pdf.ttf).
  • Enable selection of OCR engine mode from command line.
  • Changed tesseract command line parameter '-psm' to '--psm'.
  • Added new C API for orientation and script detection, removed the old one.
  • Increased minimum autoconf version to 2.59.
  • Removed dead code.
  • Fixed many compiler warnings.
  • Fixed memory and resource leaks.
  • Fixed some issues with the 'Cube' OCR engine.
  • Fixed some openCL issues.
  • Added option to build Tesseract with CMake build system.
  • Implemented CPPAN support for easy Windows building.
3.04.01 Release

bug-fix release of 3.04 version
