Open Source OCR using Tesseract and Google Colab

Feb 12, 2022

Optical Character Recognition (OCR) has long been a popular use case in Computer Vision, largely because of its wide range of applications. It can be used for business data entry, number plate recognition, and much more: basically, any application where we have to extract text from an image.

Tesseract is one of the most widely used open-source OCR engines. The engine itself is a command-line tool, available for Linux, Windows, and macOS. Because Python is so widely used nowadays, open-source Python wrappers for Tesseract are available too.

Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.

Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

Computer vision tasks are often resource intensive, and if you are running low on processing power, Google Colab is a convenient option.

Colab is a free cloud service from Google, based on Jupyter Notebooks, that includes access to GPUs at no cost. Not only is it a great tool for improving coding skills, it also lets anyone develop deep learning applications using popular libraries such as PyTorch, TensorFlow, Keras, and OpenCV.

Here are the steps to extract text from an image in a Google Colab notebook using Pytesseract:

Step 1: Install Pytesseract and Tesseract-OCR in Google Colab

!sudo apt install tesseract-ocr

!pip install pytesseract
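Pytesseract only wraps the `tesseract` executable, so before moving on it can help to confirm that the install actually put that binary on the PATH. The following is a minimal sketch using only the standard library; the helper name `tesseract_available` is hypothetical and not part of pytesseract:

```python
import shutil


def tesseract_available(binary_name: str = "tesseract") -> bool:
    """Return True if an executable with this name is found on the PATH."""
    return shutil.which(binary_name) is not None


if tesseract_available():
    print("tesseract found at:", shutil.which("tesseract"))
else:
    print("tesseract not found; re-run the install commands above")
```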

Step 2: Import libraries

import pytesseract
import shutil
import os
import random

try:
    from PIL import Image
except ImportError:
    import Image

Step 3: Upload the image to Colab

We can upload the image manually through the upload button in Colab's Files panel, but we can also use the following code to upload it.

from google.colab import files

uploaded = files.upload()
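In Colab, `files.upload()` returns a dictionary that maps each uploaded file name to its contents as bytes, so the path for the next step can be read from its keys instead of being typed by hand. Here is a small sketch of that idea; `first_uploaded_name` is a hypothetical helper, and the fake dictionary stands in for a real upload so the snippet also runs outside Colab:

```python
def first_uploaded_name(uploaded: dict) -> str:
    """Return the name of the first file in the dict returned by files.upload()."""
    if not uploaded:
        raise ValueError("no file was uploaded")
    return next(iter(uploaded))


# Stand-in for `uploaded = files.upload()`; the file name and bytes are made up.
fake_upload = {"image.jpg": b"\xff\xd8\xff"}
image_path_in_colab = first_uploaded_name(fake_upload)
print(image_path_in_colab)
```

In the notebook, the call would be `first_uploaded_name(uploaded)` with the real dictionary.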

Step 4: Text extraction

The image_to_string function takes an image as an argument and returns the text extracted from it. We can either print the result directly or store it in a variable.

image_path_in_colab = 'image.jpg'

extractedInformation = pytesseract.image_to_string(Image.open(image_path_in_colab))

print(extractedInformation)
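The raw string often contains blank lines and stray whitespace, and with some Tesseract versions it may end with a form-feed character. Below is a minimal sketch of tidying it up, assuming the goal is one trimmed line of text per line; `clean_ocr_text` is a hypothetical helper, and the sample string is made up rather than real OCR output:

```python
def clean_ocr_text(raw: str) -> str:
    """Strip whitespace from each line and drop the lines that end up empty."""
    # str.splitlines() also treats the form feed ("\x0c") as a line boundary.
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


sample = "Hello \n\n  World\n\x0c"  # made-up stand-in for extractedInformation
print(clean_ocr_text(sample))
```

In the notebook this would be called as `clean_ocr_text(extractedInformation)`.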

Say we want to use this sample text as our source to test our OCR:

After processing in Google Colab, we get the following result:

The GitHub repo for the project is available at: https://github.com/suyesha07/Optical-Character-Reader

Feel free to check out the Colab notebook at: https://colab.research.google.com/github/suyesha07/Optical-Character-Reader/blob/main/OCR.ipynb

Written by Suyesha Bhattacharjee
