Overview

Using Google’s Vision API cloud service, we can detect and extract information from an image or file. In this tutorial, we are going to learn how to extract text from a PDF (or TIFF) file using the DOCUMENT_TEXT_DETECTION feature.

Extracting text from a PDF/TIFF file with the Vision API is actually not as straightforward as I initially thought it would be. For instance, you cannot reference a file stored on your PC. Instead, you first have to store the PDF/TIFF file in Google Cloud Storage (a different product from Google Drive) and access the file through the Cloud Storage API.
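To make that flow a little more concrete, here is a minimal sketch of the upload-then-detect steps. It assumes the google-cloud-storage and google-cloud-vision packages are installed and that credentials are already configured. The helper names (mime_type_for, gcs_uri, upload_and_detect) and all bucket and file names are hypothetical, chosen only for illustration, and this is not necessarily how the script used in the tutorial is organised.

```python
from pathlib import PurePosixPath

# File types this sketch handles (assumption: only PDF and TIFF, as in the tutorial).
SUPPORTED_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def mime_type_for(filename: str) -> str:
    """Return the MIME type for a PDF/TIFF file name, or raise ValueError."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported file type: {filename!r}")
    return SUPPORTED_MIME_TYPES[suffix]


def gcs_uri(bucket_name: str, object_name: str) -> str:
    """Build a gs:// URI for an object (or prefix) in a Cloud Storage bucket."""
    return f"gs://{bucket_name}/{object_name}"


def upload_and_detect(local_path, bucket_name, object_name, output_prefix, timeout=420):
    """Upload a local PDF/TIFF to Cloud Storage, then run DOCUMENT_TEXT_DETECTION.

    The Vision API writes its results as JSON files under output_prefix
    in the same bucket.
    """
    # Imported here so the helpers above work without the Google packages.
    from google.cloud import storage, vision

    # Step 1: copy the local file into the bucket.
    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(object_name).upload_from_filename(local_path)

    # Step 2: ask the Vision API to read the file from Cloud Storage.
    request = vision.AsyncAnnotateFileRequest(
        features=[vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)],
        input_config=vision.InputConfig(
            gcs_source=vision.GcsSource(uri=gcs_uri(bucket_name, object_name)),
            mime_type=mime_type_for(object_name),
        ),
        output_config=vision.OutputConfig(
            gcs_destination=vision.GcsDestination(uri=gcs_uri(bucket_name, output_prefix)),
            batch_size=2,  # pages per output JSON file
        ),
    )
    client = vision.ImageAnnotatorClient()
    operation = client.async_batch_annotate_files(requests=[request])

    # Step 3: block until the asynchronous job finishes (or the timeout expires).
    operation.result(timeout=timeout)
```

A call might look like upload_and_detect("invoice.pdf", "my-bucket", "input/invoice.pdf", "output/invoice-"), where every name is a placeholder for your own file and bucket.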

Limitations

The Vision API only accepts PDF/TIFF files of up to 2,000 pages.
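If you want to catch an oversized file before uploading it, a quick local check can help. The sketch below assumes the third-party pypdf package is installed and only handles PDFs. The function names and the MAX_PAGES constant are my own, and the limit is treated as "2,000 pages or fewer", which is an assumption you should check against the documentation.

```python
MAX_PAGES = 2000  # assumed page limit; verify against the documentation


def within_page_limit(page_count: int, limit: int = MAX_PAGES) -> bool:
    """Return True if a file with page_count pages is within the limit."""
    return 0 < page_count <= limit


def pdf_within_page_limit(path: str, limit: int = MAX_PAGES) -> bool:
    """Count the pages of a local PDF and compare them against the limit."""
    # Imported here so within_page_limit works without pypdf installed.
    from pypdf import PdfReader

    return within_page_limit(len(PdfReader(path).pages), limit)
```

For example, pdf_within_page_limit("invoice.pdf") would return False for a 2,500-page file (the file name is a placeholder).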

Documentation

https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-python

Script used in the tutorial