Extract text from images using Python pytesseract

pytesseract (Python-tesseract) is a Python wrapper for Google's Tesseract-OCR engine. It is used to extract text from images.

Extracting text using pytesseract

pytesseract requires Tesseract to be installed on the system. Follow the steps below to install it.

Tesseract installation on Windows

1. Navigate to https://github.com/UB-Mannheim/tesseract/wiki and download the Tesseract installer for Windows.

2. Double-click the downloaded installer to begin the installation and select a language.

3. Click on "Next" to continue installation.

4. In the "License Agreement" window click on "I Agree".

5. In the "Choose Users" section select "Install for anyone using this computer".

6. In the "Choose Components" section click on "Next".

7. In the "Choose Install Location" section click on "Next".

8. In the "Choose Install Menu Folder" window click on "Install".

9. In the "Installation Complete" window click on "Next".

10. Click on "Finish" to complete the setup.
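
Once the installer finishes, it can be worth confirming that the executable actually runs before moving on. The following is a minimal sketch rather than part of the installation steps: `tesseract_version` is a hypothetical helper name, and it assumes the `tesseract` command is reachable on the PATH (otherwise pass the full path to the executable).

```python
import subprocess


def tesseract_version(cmd="tesseract"):
    """Return the first line printed by `<cmd> --version`, or None if cmd cannot be run."""
    try:
        result = subprocess.run(
            [cmd, "--version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # Raised when the executable is missing or cannot be launched
        return None
    # Depending on the Tesseract version, the banner goes to stdout or stderr
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None
```

If the command is not on the PATH, calling `tesseract_version(r'C:\Program Files\Tesseract-OCR\tesseract.exe')` checks the default install location used later in this article.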

pytesseract and Pillow installation

Now that Tesseract is installed, let's install the required Python modules.

1. Install Pillow using pip.

   
  pip install Pillow
   

2. Install pytesseract using pip.

   
  pip install pytesseract
   

Python code to extract text from images using pytesseract

Now that all the required modules are in place, let's write Python code to read the text from a sample image, sample.png.

1. Create a Python file and import the required modules.

   

  from PIL import Image
  import pytesseract

   

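2. Set the tesseract_cmd path so that pytesseract can find the Tesseract executable. This is needed when tesseract is not on the system PATH.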

   

  pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
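
A hard-coded path breaks as soon as Tesseract is installed somewhere else. As a hedged alternative, the sketch below looks for the executable on the PATH first and only then falls back to the default Windows location from the step above. `resolve_tesseract_cmd` and `DEFAULT_WINDOWS_PATH` are hypothetical names introduced for illustration.

```python
import os
import shutil

# Default location assumed from the installer steps above; adjust it if a
# different install location was chosen
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(fallback=DEFAULT_WINDOWS_PATH):
    """Return a path to the tesseract executable, or None if none is found."""
    # Prefer whatever is already on the PATH
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the given install location, if it exists
    if os.path.isfile(fallback):
        return fallback
    return None
```

When the result is not None, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` in place of the literal path.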
  
   

3. Define the image path and open the image.

   

  image_path = "sample.png"
  img = Image.open(image_path)
  
   

4. Extract the text from the image by running Tesseract OCR on it.

   

  text = pytesseract.image_to_string(img)
  print(text)
  
   

Output

   
  "If you don't read the
  newspapers,
  you are uninformed.
  If you do read them,
  
  you are misinformed."
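
The raw OCR result can contain blank lines and stray whitespace, like the empty line before the last sentence above. A small post-processing step, sketched here, can tidy it. `clean_ocr_text` is a hypothetical helper and not part of pytesseract.

```python
def clean_ocr_text(text):
    """Strip surrounding whitespace from each line and drop empty lines."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Sample input mimicking the output shown above, including the blank line
raw = (
    '"If you don\'t read the\n'
    "newspapers,\n"
    "you are uninformed.\n"
    "If you do read them,\n"
    "\n"
    'you are misinformed."\n'
)
print(clean_ocr_text(raw))
```

Dropping every empty line also removes paragraph breaks, so this is only suitable when the text is expected to be a single block.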
   

5. Complete code snippet to extract text from images using pytesseract.

   
  from PIL import Image
  import pytesseract
  
  # Point pytesseract to the Tesseract executable
  pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
  
  # List the languages available to Tesseract
  print(pytesseract.get_languages(config=''))
  
  # Open the image and run OCR on it
  image_path = "sample.png"
  img = Image.open(image_path)
  
  text = pytesseract.image_to_string(img)
  print(text)
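
For reuse across several images, the same calls could be wrapped in a function. The version below is a sketch under a few assumptions: `extract_text` is a hypothetical name, Tesseract is assumed to be on the PATH or already configured through `tesseract_cmd`, and the `lang` value is passed straight to `image_to_string`, so the matching language data must be installed.

```python
from pathlib import Path


def extract_text(image_path, lang="eng"):
    """Run Tesseract OCR on one image file and return the recognized text."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")

    # Imported inside the function so the path check above does not depend
    # on the OCR packages being importable
    from PIL import Image
    import pytesseract

    # The context manager closes the image file once OCR is done
    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

Calling `extract_text("sample.png")` would then return the same text as the snippet above, while a missing file fails early with a clear error.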
   
