How To Convert PDFs to Images for ML Projects Using Ghostscript and Multiprocessing
TL;DR: Run the following Ghostscript command to convert a PDF file to a series of PNG files.
PDF files frequently contain data (text, images, tables, logs, etc.) that is valuable for training machine learning models. However, most ML models are designed to process inputs, at training or inference time, represented as RGB images (JPEG, PNG), text, tabular data, or combinations of these base modalities. A first step in using PDF files for machine learning is to convert them to images and then explore downstream tasks (e.g. image classification, object detection, text extraction).
I recently had to assemble a dataset of images from PDF files, and this post documents the tool and process I used: Ghostscript!
Ghostscript is an interpreter for PostScript and Portable Document Format (PDF) files. Ghostscript consists of a PostScript interpreter layer, and a graphics library. The graphics library is shared with all the other products in the Ghostscript family, so all of these technologies are sometimes referred to as Ghostscript, rather than the more correct GhostPDL.
If you are on a Mac, Ghostscript may already be installed; recent versions of macOS do not bundle it, so you may need to install it yourself (for example with Homebrew: brew install ghostscript). On other platforms, follow the installation guide here. Verify that Ghostscript is installed and working (available on your path) by running the gs command in your terminal.
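The same check can be scripted, which is handy before launching a long batch job. The sketch below is a minimal example, not part of the original workflow: the helper name ghostscript_version is hypothetical, and it assumes the executable is called gs (on Windows the console executable is typically named gswin64c instead).

```python
import shutil
import subprocess


def ghostscript_version(executable="gs"):
    """Return the Ghostscript version string, or None if it is not on the PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # `gs --version` prints just the version number and exits.
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    version = ghostscript_version()
    print(f"Ghostscript {version}" if version else "Ghostscript not found on PATH")
```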
Once Ghostscript is installed, you can run the following command from the command line to convert a PDF file to a series of images:
The command above takes the specified PDF, document.pdf, converts each page to an image, and saves the images to the current directory.
-dNOPAUSE tells Ghostscript to not pause after each page is rendered.
-r600 specifies the resolution (dpi) of the output images.
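-sOutputFile=document-%02d.png specifies the output file name; %02d is a placeholder for the zero-padded page number.
-sDEVICE=png16m selects the output device, which determines the output format (here 24-bit colour PNG, matching the .png file names). Note that jpg is not a valid device name; the JPEG device is called jpeg.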
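To make the flags concrete, here is one way the full command could be assembled, written as a Python argument list so it can be reused later. This is a sketch under assumptions: the png16m device and the -dBATCH flag are not listed above. -dBATCH is included because without it Ghostscript stays at its interactive prompt after the last page instead of exiting.

```python
import shlex

# Argument list for a single conversion; each entry maps to one flag described above.
gs_args = [
    "gs",
    "-dNOPAUSE",                       # do not pause after each page
    "-dBATCH",                         # assumption: exit when done instead of waiting at the prompt
    "-sDEVICE=png16m",                 # assumption: 24-bit colour PNG output device
    "-r600",                           # render at 600 dpi
    "-sOutputFile=document-%02d.png",  # document-01.png, document-02.png, ...
    "document.pdf",                    # input file
]

# Shell-equivalent form of the same command.
command_line = shlex.join(gs_args)
print(command_line)
# gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r600 -sOutputFile=document-%02d.png document.pdf
```

Keeping the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.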
We take the above script and convert it to a Python script using
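One way to write such a wrapper is with the standard-library subprocess module. The sketch below is an illustration under assumptions rather than the exact script used for this post: gs_command and convert_pdf are hypothetical helper names, and the png16m device and -dBATCH flag are assumed.

```python
import subprocess
from pathlib import Path


def gs_command(pdf_path, out_dir, dpi=600):
    """Build the Ghostscript argument list for one PDF (hypothetical helper)."""
    pdf_path, out_dir = Path(pdf_path), Path(out_dir)
    # One PNG per page, named after the PDF: report-01.png, report-02.png, ...
    pattern = out_dir / f"{pdf_path.stem}-%02d.png"
    return [
        "gs", "-dNOPAUSE", "-dBATCH",  # -dBATCH (assumed) makes gs exit when finished
        "-sDEVICE=png16m",             # assumed PNG device
        f"-r{dpi}",
        f"-sOutputFile={pattern}",
        str(pdf_path),
    ]


def convert_pdf(pdf_path, out_dir, dpi=600):
    """Render every page of pdf_path into out_dir; raises CalledProcessError on failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(gs_command(pdf_path, out_dir, dpi), check=True, capture_output=True)
```

Calling convert_pdf("document.pdf", "output") would then write output/document-01.png, output/document-02.png, and so on, provided gs is on the path.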
Similarly, you can walk through a directory of pdf files and convert them to images.
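A directory walk might look like the following sketch. The names find_pdfs and convert_directory are hypothetical, and the per-document conversion is passed in as a callable (any function taking a PDF path and an output directory), so the traversal and timing logic stay independent of how Ghostscript is invoked.

```python
import time
from pathlib import Path


def find_pdfs(root):
    """Return all .pdf files under root (recursively), in a stable sorted order."""
    # Note: the pattern is case-sensitive on most non-Windows systems.
    return sorted(Path(root).rglob("*.pdf"))


def convert_directory(root, out_root, convert):
    """Call convert(pdf, out_dir) for each PDF under root and return elapsed seconds."""
    start = time.perf_counter()
    for pdf in find_pdfs(root):
        # Give each document its own output folder, named after the PDF.
        convert(pdf, Path(out_root) / pdf.stem)
    return time.perf_counter() - start
```

The returned elapsed time makes it easy to compare this sequential loop against a parallel version.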
and here are the results:
All well and good! We now have a series of images in the output directory. However, conversion takes about 1.5 seconds per document: 1 minute and 45 seconds for 70 documents. (Conversion speed depends on the resolution parameter, so reducing the resolution will lead to faster conversion.)
When the goal is to convert hundreds of thousands of documents, this is slow: at 1.5 seconds each, 100,000 documents would take roughly 42 hours. Enter multiprocessing!
We can use the Python multiprocessing library to speed up the conversion process. We can define a pool of processes and use the map function to run the conversion on each document in parallel. Luckily, converting files is an embarrassingly parallel task: all of the data and logic needed to complete it can be encapsulated in a single function without any shared state.
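The pool-and-map pattern described above can be sketched as follows. The details are assumptions made for illustration: the function names, the output and pdfs directories, the png16m device, and the -dBATCH flag are all placeholders. The worker is a top-level function so that it can be sent to the worker processes, and it returns a success flag instead of raising, so one bad PDF does not abort the whole batch.

```python
import subprocess
from multiprocessing import Pool
from pathlib import Path

OUT_ROOT = Path("output")  # hypothetical output location


def convert_one(pdf_path):
    """Worker: convert a single PDF and report (path, succeeded)."""
    pdf_path = Path(pdf_path)
    out_dir = OUT_ROOT / pdf_path.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    pattern = out_dir / f"{pdf_path.stem}-%02d.png"
    cmd = ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r600",
           f"-sOutputFile={pattern}", str(pdf_path)]
    try:
        subprocess.run(cmd, check=True, capture_output=True)
        return str(pdf_path), True
    except (OSError, subprocess.CalledProcessError):
        # gs missing from the PATH, or gs exited with an error for this file.
        return str(pdf_path), False


def convert_parallel(pdf_paths, processes=None):
    """Convert PDFs in parallel; processes=None lets Pool pick the CPU count."""
    pdf_paths = [str(p) for p in pdf_paths]
    if not pdf_paths:
        return []
    with Pool(processes=processes) as pool:
        # map blocks until every document is done and preserves input order.
        return pool.map(convert_one, pdf_paths)


if __name__ == "__main__":
    # The guard is needed on platforms that start workers by re-importing this module.
    pdf_dir = Path("pdfs")  # placeholder input directory
    pdfs = sorted(pdf_dir.rglob("*.pdf")) if pdf_dir.is_dir() else []
    results = convert_parallel(pdfs)
    failed = [path for path, ok in results if not ok]
    print(f"converted {len(results) - len(failed)} of {len(results)} documents")
```

Since each worker mostly waits on an external gs process, the speedup is bounded by the number of CPU cores and by disk throughput, so it is worth experimenting with the processes argument.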
And the output:
We have shaved our time down from 1 minute 45 seconds to 15.7 seconds (a 6.7x speedup).
And that's it! Happy data generating!