How to extract text from PDF files in Python

This post explains how to extract text from PDF files using Python.

Extracting text from PDF files requires the following two Python modules:

  • pytesseract
  • pdf2image

Prerequisite for using pytesseract

The pytesseract module requires the Tesseract executable. Let's set up Tesseract for Windows.

1. Download the Tesseract executable from this Link.

2. Install the downloaded Tesseract executable.

Prerequisite for using pdf2image

The pdf2image module requires the Poppler PDF rendering library. Let's set up Poppler for Windows.

1. Download Poppler from this Link.

2. Extract the downloaded binary and place the extracted folder in C:\Program Files\.

Extract Text from PDF Files


Extracting text from PDF files is a two-step process. First, the PDF pages are converted to images using pdf2image. Then pytesseract runs OCR on each image and returns its text as a string.

1. Install required modules.

   

pip install Pillow

pip install pdf2image

pip install pytesseract

     

2. Import required modules and functions.

  

import os    
from PIL import Image
import pytesseract
from pdf2image import convert_from_path

   

3. Define the Poppler bin directory and the Tesseract executable path.

   

poppler_path = r'C:\Program Files\poppler-0.68.0\bin' # Replace with your installation location

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract' # Replace with your installation location
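
A wrong path here usually only shows up later as an error during conversion or OCR, so it can help to check both locations up front. The following is a minimal sketch using only the standard library; the helper name missing_paths is hypothetical and not part of pytesseract or pdf2image, and the paths are the example locations from this post.

```python
import os

def missing_paths(poppler_dir, tesseract_cmd):
    """Return a list of problems with the configured locations (empty if none)."""
    problems = []
    if not os.path.isdir(poppler_dir):
        problems.append(f"Poppler bin directory not found: {poppler_dir}")
    # The command is often written without the .exe suffix, so accept either form.
    if not (os.path.isfile(tesseract_cmd) or os.path.isfile(tesseract_cmd + ".exe")):
        problems.append(f"Tesseract executable not found: {tesseract_cmd}")
    return problems

# Example locations from this post; replace with your installation locations.
for problem in missing_paths(r'C:\Program Files\poppler-0.68.0\bin',
                             r'C:\Program Files\Tesseract-OCR\tesseract'):
    print(problem)
```

If nothing is printed, both locations were found.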
     

4. Define the PDF file path.

   

pdf_path = "sample.pdf" # Replace with path of PDF File

     

5. Convert the PDF file to images using the convert_from_path function.

  

images = convert_from_path(pdf_path=pdf_path, poppler_path=poppler_path)
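
The call above renders every page at once, which can use a lot of memory for long PDFs. One possible workaround is to convert a few pages at a time with the first_page and last_page arguments of convert_from_path (1-indexed, inclusive). This is a sketch under assumptions: the helper names page_ranges and convert_in_batches are hypothetical, the batch size of 10 is arbitrary, and the caller is assumed to know the total page count already.

```python
def page_ranges(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)

def convert_in_batches(pdf_path, total_pages, batch_size=10, poppler_path=None):
    """Yield page images one batch at a time instead of loading all pages at once."""
    # Imported here so page_ranges can be used without pdf2image installed.
    from pdf2image import convert_from_path
    for first, last in page_ranges(total_pages, batch_size):
        yield from convert_from_path(pdf_path, first_page=first, last_page=last,
                                     poppler_path=poppler_path)

print(list(page_ranges(25, 10)))  # [(1, 10), (11, 20), (21, 25)]
```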

     

6. Iterate over the PDF pages and save each page as a PNG image.

   

for count, img in enumerate(images):
    img_name = f"page_{count}.png"
    img.save(img_name, "PNG")
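
One thing to watch with names like page_10.png: in a plain alphabetical sort they come before page_2.png. If the images are later listed from disk, zero-padding the page number keeps alphabetical order and page order identical. A small sketch, where the helper name page_filename and the width of 3 digits are assumptions chosen for illustration:

```python
def page_filename(index, width=3):
    """Build a zero-padded file name such as page_007.png."""
    return f"page_{index:0{width}d}.png"

names = [page_filename(i) for i in range(12)]
print(names[:3])               # ['page_000.png', 'page_001.png', 'page_002.png']
print(sorted(names) == names)  # True
```

A width of 3 covers PDFs of up to 1000 pages; a larger width would be needed beyond that.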

     

7. After successful execution, you should see one image per PDF page in the current working directory.

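8. Create a list of the PNG files generated in the previous step, in page order.

   

pages = [f for f in os.listdir(".") if f.startswith("page_") and f.endswith(".png")]
png_files = sorted(pages, key=lambda f: (len(f), f))  # page_2.png before page_10.png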

   

9. Extract text from the images using the pytesseract.image_to_string method.

   

for png_file in png_files:
    extracted_text = pytesseract.image_to_string(Image.open(png_file))
    print(extracted_text)
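
Since convert_from_path already returns PIL images and image_to_string accepts them, the intermediate PNG files are optional. The sketch below runs OCR directly on a list of page images and joins the results. The helper name ocr_pages and its ocr parameter are hypothetical; the parameter exists only so the function can be tried with a stand-in when Tesseract is not installed.

```python
def ocr_pages(images, ocr=None):
    """Run OCR on each page image and return all page texts joined by newlines."""
    if ocr is None:
        # Imported lazily; by default this is the same call used in the loop above.
        import pytesseract
        ocr = pytesseract.image_to_string
    return "\n".join(ocr(img) for img in images)

# With the variables from the earlier steps, usage would look like:
# images = convert_from_path(pdf_path=pdf_path, poppler_path=poppler_path)
# print(ocr_pages(images))
```

Skipping the files also avoids picking up unrelated images from the working directory.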

 

10. Complete code snippet for extracting text from PDF files.

   

# Module Imports
import os
from PIL import Image
import pytesseract
from pdf2image import convert_from_path

# Define Paths
poppler_path = r'C:\Program Files\poppler-0.68.0\bin'
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

pdf_path = "sample.pdf"

# Save PDF pages to images
images = convert_from_path(pdf_path=pdf_path, poppler_path=poppler_path)
for count, img in enumerate(images):
    img_name = f"page_{count}.png"
    img.save(img_name, "PNG")

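# Extract Text
pages = [f for f in os.listdir(".") if f.startswith("page_") and f.endswith(".png")]
png_files = sorted(pages, key=lambda f: (len(f), f))  # page_2.png before page_10.png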

for png_file in png_files:
    extracted_text = pytesseract.image_to_string(Image.open(png_file))
    print(extracted_text)
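
Printing is fine for a quick look, but the text is often more useful in a file. As a possible extension, the sketch below writes the page texts to a .txt file next to the PDF. The helper name save_text, the blank-line page separator, and the UTF-8 encoding are assumptions made for illustration.

```python
from pathlib import Path

def save_text(pdf_path, page_texts):
    """Write the page texts to <pdf name>.txt and return the output path."""
    out_path = Path(pdf_path).with_suffix(".txt")
    # Separate pages with a blank line so page boundaries stay visible.
    out_path.write_text("\n\n".join(page_texts), encoding="utf-8")
    return out_path

# With the loop above, collect each extracted_text in a list and then call:
# save_text("sample.pdf", page_texts)  # writes sample.txt
```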

 
