January 28th 2022 • 2 minute read

How To Convert PDFs to Images for ML Projects Using Ghostscript and Multiprocessing

Want to use data from PDF files for your machine learning projects? Convert them to images using Ghostscript. Get it done even faster (7x speed up) with multiprocessing.

TLDR; Run the following Ghostscript command to convert a pdf file to a series of png files.

bash

gs -dNOPAUSE -sDEVICE=png16m -r600  -sOutputFile=document-%02d.png "document.pdf" -dBATCH

PDF files frequently contain data (text, images, tables, logs etc) valuable for training machine learning algorithms. However, most ML models are designed to process datasets (training or inference) represented as RGB images (JPEG, PNG), text, tabular data or combinations of these base modalities. A first step in utilizing pdf files for machine learning is to convert PDF files to images and then explore subsequent tasks (e.g. image classification, object detection, text extraction, etc).

I recently had to assemble a dataset of images from PDF files and this post documents the process I followed - Ghostscript!

Ghostscript

Ghostscript is an interpreter for PostScript and Portable Document Format (PDF) files. Ghostscript consists of a PostScript interpreter layer, and a graphics library. The graphics library is shared with all the other products in the Ghostscript family, so all of these technologies are sometimes referred to as Ghostscript, rather than the more correct GhostPDL.

If you are on a Mac, Ghostscript is installed by default (included in the Apple OS X package). On other platforms, follow the installation guide here. Verify that Ghostscript is installed and working (available on your path) by running the following command gs on your terminal.

Ghostscript from the command line

Once Ghostscript is installed, you can run the following script from the command line to convert a pdf file to a series of images:

PNG

bash

gs -dNOPAUSE -sDEVICE=png16m -r600  -sOutputFile=document-%02d.png "document.pdf" -dBATCH

JPG

bash

gs -dNOPAUSE -sDEVICE=jpg -r200 -dJPEGQ=60 -sOutputFile=document-%02d.jpg "document.pdf" -dBATCH

The scripts above will take the specified pdf document.pdf, convert to a series of images and save them to the current directory. -dNOPAUSE tells Ghostscript to not pause after each page is rendered. -r600 specifies the resolution (dpi) of the output images. -sOutputFile=document-%02d.png specifies the output file name and the %02d is a placeholder for the page number.-sDEVICE=png16m and -sDEVICE=jpg specifies the output format.

Ghostscript in Python

We take the above script and convert it to a Python script using subprocess.

python

pdf_to_image.py

import os
import subprocess

# make directory named output
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

file_path = "pdf/document.pdf"
file_name = os.path.basename(file_path)
file_name = file_name.split(".")[0]

# execute Ghostscript command, print results
results = subprocess.run(["gs", "-dNOPAUSE", "-sDEVICE=png16m", "-r600", f"-sOutputFile={output_dir}/{file_name}-%02d.png", f"{file_path}", "-dBATCH"], stdout=subprocess.PIPE)
print(results.stdout.decode("utf-8"))

Convert all PDFs in a directory

Similarly, you can walk through a directory of pdf files and convert them to images.

python

pdf_to_image.py

import os
import subprocess
from tqdm import tqdm # progress bar

def convert_pdf(file_path, output_dir, resolution=400):
    file_name = os.path.basename(file_path)
    file_name = file_name.split(".")[0]

    # execute Ghostscript command, print results
    results = subprocess.run(["gs", "-dNOPAUSE", "-sDEVICE=png16m", f"-r{resolution}", f"-sOutputFile={output_dir}/{file_name}-%02d.png", f"{file_path}", "-dBATCH"], stdout=subprocess.PIPE)


# make directory named output
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

# walk through folder named samp, convert all pdf files to png
for root, dirs, files in os.walk("samp"):
    for file in tqdm(files):
        if file.endswith(".pdf"):
            file_path = os.path.join(root, file)
            convert_pdf(file_path, output_dir)
    ##print number of images in output directory
print(f"{len(os.listdir(output_dir))} images in output directory")

and here are the results:

bash

terminal_output

(victors):pdftoimg#: python pdf_img.py
100%|███████████████████████████████████| 70/70 [01:45<00:00,  1.50s/it]
192 images in output directory

All well and good! We now have a series of images in the output directory. However, each pdf file takes about 1.5 seconds to convert each document. 1 minute and 45 seconds to convert 70 documents (note that conversion speed depends on the resolution parameter i.e., reducing your resolution will lead to faster conversion).

Multiprocessing Vs. Threading In Python: What You Need To Know.

TLDR; Use threading if your program is I/O or multiprocessing if it's CPU bound.

When the goal is to convert hundreds of thousands of documents, this can be slow. Hey multiprocessing!

Ghostscript with Python and multiprocessing

We can use the python multiprocessing library to speed up the conversion process. We can define a pool of processes and use the map function to run the conversion on each document in parallel. Luckily, converting files is an embarrassingly parallel task where all of the data and logic to complete the task can be encapsulated in a single function without any shared state.

python

pdf_to_image_multiproc.py

from multiprocessing import Pool
import os
import subprocess
import time

file_list = []
output_dir = "output_zip"
pdf_directory = "samp"
start_time = time.time()
os.makedirs(output_dir, exist_ok=True)

# get list of files in pdf_directory directory
for root, dirs, files in os.walk(pdf_directory):
    for file in (files):
        if file.endswith(".pdf"):
            file_path = os.path.join(root, file)
            file_list.append(file_path)

def convert_pdf(file_path, output_dir, resolution=400):
    file_name = os.path.basename(file_path)
    file_name = file_name.split(".")[0]
    # execute Ghostscript command, print results
    subprocess.run(["gs", "-dNOPAUSE", "-sDEVICE=png16m", f"-r{resolution}", f"-sOutputFile={output_dir}/{file_name}-%02d.png", f"{file_path}", "-dBATCH"], stdout=subprocess.PIPE)


def process_pdf(args):
    file_path, i = args
    convert_pdf(file_path, output_dir, resolution=400)


if __name__ == "__main__":
    # create pool of threads to process each pdf in parrallel
    num_processes = 30
    pool = Pool(processes=(num_processes))
    pool.map(process_pdf, zip(file_list, range(len(file_list))))
    pool.close()
    print(f"Last file completed in {time.time() - start_time} seconds")

and the output ..

bash

terminal_output

(victors):pdftoimg#: python pdf_img_multiproc.py
Last file completed in 15.732686758041382 seconds

We have shaved our time down from 1 minute 45 seconds to 15.7 seconds (6.7x speed up).

And that's it! Happy data generating!

Interested in more articles like this? Subscribe to get a monthly roundup of new posts and other interesting ideas at the intersection of Applied AI and HCI.

Read and Subscribe

← Previous