pytesseract (Python-tesseract) is a Python wrapper for Google's Tesseract-OCR engine. It is used to extract text from images.
To use pytesseract, Tesseract itself must be installed on the system. Follow the steps below to install it on Windows.
1. Navigate to https://github.com/UB-Mannheim/tesseract/wiki and download the Tesseract installer for Windows.
2. Double-click the downloaded installer to begin the installation and select a language.
3. Click on "Next" to continue installation.
4. In the "License Agreement" widget click on "I Agree".
5. In the "Choose Users" section select "Install for anyone using this computer".
6. In the "Choose Components" section click on "Next".
7. In the "Choose Install Location" section click on "Next".
8. In the "Choose Install Menu Folder" window click on "Install".
9. In the "Installation Complete" window click on "Next".
10. Click on "Finish" to complete the setup.
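Once the installer finishes, one way to confirm that Tesseract can actually be run is to ask it for its version. The sketch below assumes the default install location (the same path used later in this post); `tesseract_version` and `DEFAULT_TESSERACT` are names made up for this example, and the path should be adjusted if a different install location was chosen.

```python
import subprocess

# Assumption: the installer's default location was kept.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def tesseract_version(executable=DEFAULT_TESSERACT):
    """Return the first line of `tesseract --version`, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [executable, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture stderr too, in case the version is printed there
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # The executable is missing, not runnable, or exited with an error.
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract was not found at the expected path")
```

If this prints the "not found" message, the install path is probably different from the one assumed here.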
Now that Tesseract is installed, let's proceed with installing the Python modules.
1. Install Pillow using pip.
pip install Pillow
2. Install pytesseract using pip.
pip install pytesseract
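To confirm that both packages were installed into the interpreter you are using, a small check like the following can help. It is only a sketch: `installed_version` is a helper name invented for this example, and it relies on `importlib.metadata` from the standard library, which requires Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a pip distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two packages installed in the steps above.
for name in ("Pillow", "pytesseract"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```

Note that Pillow is installed under the name `Pillow` but imported as `PIL`, which is why the code later in this post uses `from PIL import Image`.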
Now that all the required modules are in place, let's write Python code to read the text from the image below.
1. Create a Python file and import the required modules.
from PIL import Image
import pytesseract
2. Set tesseract_cmd to the path of the installed Tesseract executable.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
3. Define the image path and open the image.
image_path = "sample.png"
img = Image.open(image_path)
4. Get the text from the image by running Tesseract OCR on the opened image.
text = pytesseract.image_to_string(img)
print(text)
"If you don't read the
newspapers,
you are uninformed.
If you do read them,
you are misinformed."
5. Complete code snippet to extract text from images using pytesseract.
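from PIL import Image
import pytesseract

# Point pytesseract at the installed Tesseract executable.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Optional: list the languages available to this Tesseract installation.
print(pytesseract.get_languages(config=''))

# Open the image once and run OCR on it.
image_path = "sample.png"
img = Image.open(image_path)
text = pytesseract.image_to_string(img)
print(text)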
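The steps above can also be wrapped in a small reusable function. The following is a minimal sketch, not the only way to structure it: `extract_text`, `normalize_whitespace`, and the `tesseract_cmd` parameter are names chosen for this example, and the default `lang="eng"` assumes the English language data is installed. The third-party imports sit inside the function so that a missing image is reported clearly before anything else is loaded.

```python
from pathlib import Path


def normalize_whitespace(text):
    """Collapse line breaks and repeated whitespace in OCR output into single spaces."""
    return " ".join(text.split())


def extract_text(image_path, tesseract_cmd=None, lang="eng"):
    """Run Tesseract OCR on one image file and return the recognized text."""
    path = Path(image_path)
    if not path.is_file():
        # Fail early with a clear message if the image does not exist.
        raise FileNotFoundError(f"Image not found: {path}")

    # Imported here so the path check above runs before the
    # third-party packages are loaded.
    from PIL import Image
    import pytesseract

    if tesseract_cmd is not None:
        # Override the executable location only when one is given.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    # The context manager closes the image file once OCR has finished.
    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

For the sample image, calling `normalize_whitespace(extract_text("sample.png", tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe'))` would return the quote as a single line instead of preserving the line breaks from the image.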