This post explains how to extract text from PDF files using Python.
Extracting text from PDF files requires the two Python modules below, each of which depends on an external program.
The pytesseract module requires the tesseract executable. Let's set up tesseract for Windows.
1. Download the tesseract executable from this Link.
2. Install the downloaded tesseract executable.
The pdf2image module requires the Poppler PDF rendering library. Let's set up Poppler for Windows.
1. Download Poppler from this Link.
2. Extract the downloaded binary and place the extracted folder in C:\Program Files\.
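Before moving on, it can help to confirm that both external programs are where you expect them to be. The sketch below is a minimal check that uses only the standard library. The helper name find_tool is hypothetical, and it assumes that Tesseract's executable is named tesseract and that Poppler's bin folder contains a pdftoppm binary (the renderer pdf2image typically calls). The folder paths are the example locations used later in this post, so replace them with your own.

```python
import shutil


def find_tool(name, folder=None):
    """Return the full path of an executable, or None if it cannot be found.

    name   -- executable name, e.g. "tesseract" or "pdftoppm"
    folder -- optional directory to search instead of the system PATH
    """
    return shutil.which(name, path=folder)


# Example locations only; adjust them to your installation.
checks = [
    ("tesseract", r"C:\Program Files\Tesseract-OCR"),
    ("pdftoppm", r"C:\Program Files\poppler-0.68.0\bin"),
]
for tool, folder in checks:
    print(tool, "->", find_tool(tool, folder) or "not found")
```

If either line prints "not found", the path passed to pdf2image or pytesseract later on will most likely fail as well.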
Extracting text from PDF files is a two-step process: first the PDF pages are converted to images using pdf2image, and then the images are converted into strings using pytesseract.
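The two steps can be sketched as a single function before walking through them one by one. This is only an illustration, not the exact code used in this post: the name pdf_to_text and its render and ocr parameters are hypothetical, and the defaults assume pdf2image and pytesseract are installed and can find Poppler and tesseract. It relies on the fact that pytesseract.image_to_string accepts a PIL image directly, so the pages do not have to be written to disk first.

```python
def pdf_to_text(pdf_path, render=None, ocr=None):
    """Return a list with one extracted string per PDF page.

    render -- callable turning a PDF path into a list of page images
              (defaults to pdf2image.convert_from_path)
    ocr    -- callable turning one page image into a string
              (defaults to pytesseract.image_to_string)
    """
    if render is None:
        # Imported lazily so the function can be used with a custom renderer.
        from pdf2image import convert_from_path
        render = convert_from_path
    if ocr is None:
        import pytesseract
        ocr = pytesseract.image_to_string

    # Step 1: PDF -> images. Step 2: each image -> text.
    return [ocr(page) for page in render(pdf_path)]
```

On Windows you would likely pass render=functools.partial(convert_from_path, poppler_path=...) with the folder that holds Poppler's binaries, since the default call does not know where Poppler lives.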
1. Install required modules.
pip install Pillow
pip install pdf2image
pip install pytesseract
2. Import required modules and functions.
import os
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
3. Define the Poppler and tesseract_cmd executable paths.
poppler_path = r'C:\Program Files\poppler-0.68.0\bin' # Replace with your installation location
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract' # Replace with your installation location
4. Define the PDF file path.
pdf_path = "sample.pdf" # Replace with path of PDF File
5. Convert the PDF file to images using the convert_from_path function.
images = convert_from_path(pdf_path=pdf_path, poppler_path=poppler_path)
6. Iterate over the PDF pages and save each page as a PNG image.
for count, img in enumerate(images):
    img_name = f"page_{count}.png"
    img.save(img_name, "PNG")
7. After successful execution, you should see one image per PDF page in the current working directory.
8. Create a list of all PNG files generated in the previous step.
png_files = [f for f in os.listdir(".") if f.endswith(".png")]
9. Extract text from the images using the pytesseract.image_to_string function.
for png_file in png_files:
    extracted_text = pytesseract.image_to_string(Image.open(png_file))
    print(extracted_text)
10. Complete code snippet for extracting text from PDF files.
# Module Imports
import os
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
# Define Paths
poppler_path = r'C:\Program Files\poppler-0.68.0\bin'
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
pdf_path = "sample.pdf"
# Save PDF pages to images
images = convert_from_path(pdf_path=pdf_path, poppler_path=poppler_path)
for count, img in enumerate(images):
    img_name = f"page_{count}.png"
    img.save(img_name, "PNG")
# Extract Text
png_files = [f for f in os.listdir(".") if f.endswith(".png")]
for png_file in png_files:
    extracted_text = pytesseract.image_to_string(Image.open(png_file))
    print(extracted_text)
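One caveat with listing the directory: os.listdir returns names in no guaranteed order, it also picks up any unrelated PNG files in the folder, and a plain alphabetical sort places page_10.png before page_2.png. For PDFs with ten or more pages, a numeric sort key keeps the text in page order. The sketch below is one possible way to do that with the standard library; the helper names page_number and pages_in_order are hypothetical, and it assumes the page_N.png naming used above.

```python
import re


def page_number(file_name):
    """Return the integer N from a name like 'page_N.png', or None otherwise."""
    match = re.fullmatch(r"page_(\d+)\.png", file_name)
    return int(match.group(1)) if match else None


def pages_in_order(file_names):
    """Keep only page_N.png names and return them sorted by N."""
    numbered = [(page_number(name), name) for name in file_names]
    # Drop names that do not match, then sort by the numeric page index.
    return [name for _, name in sorted(p for p in numbered if p[0] is not None)]


print(pages_in_order(["page_10.png", "page_2.png", "logo.png", "page_0.png"]))
```

With this in place, the list in the snippet above could be built as png_files = pages_in_order(os.listdir(".")).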