
Optical Character Recognition

February 27, 2020

What is OCR?

OCR (optical character recognition) is the process of converting a scanned document into fully editable and searchable digital files, transforming paper documents into formats such as Microsoft Word documents, Excel spreadsheets, CSV files and searchable PDFs.

In many office environments, a huge amount of time is spent on avoidable tasks such as entering data by hand or searching through piles of documents and files to retrieve the information needed to complete a task.

Use case: From image to text via a Flask API server and Tesseract

Many thanks to Richard Torzynski (https://github.com/ricktorzynski), and especially to his repo https://github.com/ricktorzynski/ocr-tesseract-docker :)
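
To give an idea of what such a server looks like, here is a minimal sketch of a Flask upload endpoint that passes an image to Tesseract through pytesseract. It is not the code from the repo above: the /upload route, the "file" form field, the create_app factory and the allowed_file helper are all assumptions made for illustration. It also assumes Flask, Pillow, pytesseract and the Tesseract binary are installed.

```python
import os
import tempfile

# Assumption: the image types this sketch chooses to accept.
ALLOWED_EXTENSIONS = {"png", "jpg", "jpeg", "tif", "tiff", "bmp"}


def allowed_file(filename):
    """Return True if the filename has one of the accepted image extensions."""
    _, ext = os.path.splitext(filename)
    return ext.lower().lstrip(".") in ALLOWED_EXTENSIONS


def create_app():
    # Third-party imports live inside the factory so the helper above
    # stays usable on machines without Flask or Tesseract installed.
    from flask import Flask, jsonify, request
    from PIL import Image
    import pytesseract

    app = Flask(__name__)

    @app.route("/upload", methods=["POST"])  # hypothetical route name
    def upload_file():
        uploaded = request.files.get("file")  # hypothetical form field name
        if uploaded is None or not uploaded.filename or not allowed_file(uploaded.filename):
            return jsonify(error="please upload an image file"), 400

        # Reserve a temporary path, save the upload there, run OCR, clean up.
        suffix = os.path.splitext(uploaded.filename)[1]
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            ofilename = tmp.name
        try:
            uploaded.save(ofilename)
            with Image.open(ofilename) as img:
                text = pytesseract.image_to_string(img)
        finally:
            os.remove(ofilename)
        return jsonify(text=text)

    return app
```

Calling create_app().run() starts Flask's development server, after which an image can be posted to the endpoint as multipart form data. Checking the extension before saving is only a cheap first filter; Pillow will still raise an error if the file is not a real image.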

Advantages

From faster searches and easier editing to saving digital and physical storage space, you’ll find many benefits to using OCR software to turn document images into searchable, editable text:

  • Au revoir retyping – Unless you’re a fan of extra time at the keyboard recreating documents that exist in printed or scanned format, you’ll love the time savings you get when converting those image files into searchable, editable text via OCR.
  • Speedy digital searches – By converting scanned text into a word processing file, OCR lets you search through documents using keywords or phrases. Got a few hundred invoices? Let your PC search for the client name you need faster than you can say “coffee break.”
  • Typing new text – If you need that image of a document to function like real text, where you can add new paragraphs, copy and paste, edit out an old reference, etc., OCR lets you do it. It’s ideal for everything from updating contracts to making changes to your archive of family recipes.
  • Saving space – If you’ve got reams of paper documents taking up space in your office, you can scan them into PDF files with the confidence that your OCR software will let you retrieve any of the text you need to work with, whenever you may need it. Goodbye big file cabinets, hello tidy little CDs of archived documents.
  • Accessibility – If you or someone you know is vision-impaired, OCR software can help turn books, magazines and other printed documents into accessible files that they can listen to with the help of a combination of word processing software and computer voice-over utilities.

Limits

The limits mainly come down to how well the software can distinguish individual characters. Accuracy drops in the following cases:
  • DPI: One of the biggest factors is DPI, or dots per inch. Setting the DPI lower than 200 will yield unintelligible results, whereas setting it higher than 600 will just increase the size of the stored file without yielding much better results. We tend to recommend scanning at 300 DPI.
  • Damage: the original document is wrinkled, torn, or otherwise damaged.
  • Aging: the text is faded or otherwise aged.
  • Fonts and handwriting: the text is rendered in nonstandard fonts or in human handwriting.
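
The DPI guideline above can be turned into a small pre-flight check before a scan is sent to OCR. This is only a sketch: the function name assess_dpi and the three labels it returns are made up for illustration, and the thresholds are simply the 200, 300 and 600 figures quoted in this post.

```python
MIN_DPI = 200          # below this, results tend to be unintelligible
RECOMMENDED_DPI = 300  # the value recommended in this post
MAX_USEFUL_DPI = 600   # above this, files grow without much better results


def assess_dpi(dpi):
    """Classify a scan resolution as 'too_low', 'ok' or 'oversized'."""
    if dpi <= 0:
        raise ValueError("dpi must be a positive number")
    if dpi < MIN_DPI:
        return "too_low"
    if dpi > MAX_USEFUL_DPI:
        return "oversized"
    return "ok"


# Example: check a few candidate scanner settings.
for setting in (150, RECOMMENDED_DPI, 1200):
    print(setting, assess_dpi(setting))
```

For files that store their resolution, Pillow usually exposes it as Image.open(path).info.get("dpi"), so the same check can be applied to existing scans when that metadata is present.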

Bonus:

OCR in Arabic

First, download the Arabic model (ara.traineddata), or the model for any other language, from https://github.com/tesseract-ocr/tessdata

Then, in the Flask server, edit the upload_file function in app.py and replace:

text = pytesseract.image_to_string(Image.open(ofilename))

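with the following, where <tessdata_repo> is the path to the directory that holds the downloaded .traineddata file:

tessdata_dir_config = r'--tessdata-dir "<tessdata_repo>"'
text = pytesseract.image_to_string(Image.open(ofilename), lang='ara', config=tessdata_dir_config)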
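
A common stumbling block here is pointing --tessdata-dir at a folder that does not actually contain the model. The sketch below wraps the call in a helper that checks for the .traineddata file first. The helper names (missing_models, build_ocr_kwargs, ocr_image) are hypothetical and not part of pytesseract; only image_to_string with its lang and config arguments comes from the library.

```python
import os


def missing_models(tessdata_dir, lang):
    """List the language codes in `lang` that have no .traineddata file."""
    # Tesseract accepts several languages joined with '+', e.g. 'ara+eng'.
    codes = [code for code in lang.split("+") if code]
    return [
        code for code in codes
        if not os.path.isfile(os.path.join(tessdata_dir, code + ".traineddata"))
    ]


def build_ocr_kwargs(tessdata_dir, lang="ara"):
    """Build the keyword arguments passed to pytesseract.image_to_string."""
    return {"lang": lang, "config": '--tessdata-dir "{}"'.format(tessdata_dir)}


def ocr_image(path, tessdata_dir, lang="ara"):
    """Run OCR on one image, failing early if a language model is missing."""
    missing = missing_models(tessdata_dir, lang)
    if missing:
        raise FileNotFoundError(
            "no .traineddata file in {} for: {}".format(tessdata_dir, ", ".join(missing))
        )
    # Imported here so the checks above work without Tesseract installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, **build_ocr_kwargs(tessdata_dir, lang))
```

Inside upload_file, the call would then look like text = ocr_image(ofilename, "<tessdata_repo>"), with the placeholder replaced by the real directory. Passing lang="ara+eng" handles documents that mix Arabic and English, provided both models have been downloaded.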