Reading data from PDF using tabula-py

Do we really need PDFs in data science?

Yes. In real-world scenarios a dataset can arrive in any format, so we should know how to read it when that happens.

Today we are going to see how to read data from a PDF file.

To do this, we need to install a library that supports reading PDF files. tabula-py is one such library: with it we can read the tables in a PDF into data frames and then manipulate them further.

tabula-py Installation

Open the Anaconda command prompt and run: pip install tabula-py

Once the installation finishes, you will see a screen like the one below.

Installing Tabula-py
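After installing with pip install tabula-py, it can help to confirm that the setup is usable. tabula-py is a wrapper around tabula-java, so it also needs a Java runtime on the machine. The sketch below is only a quick sanity check written for this article; the function name check_tabula_environment is made up for illustration and is not part of tabula-py.

```python
import importlib.util
import shutil


def check_tabula_environment():
    """Report whether tabula-py and a Java runtime can be found."""
    return {
        # True when the 'tabula' package (installed by tabula-py) is importable.
        "tabula_py_installed": importlib.util.find_spec("tabula") is not None,
        # True when a 'java' executable is on PATH; tabula-py calls tabula-java.
        "java_on_path": shutil.which("java") is not None,
    }


for name, ok in check_tabula_environment().items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

If either line prints "no", install the missing piece before moving on.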

Reading all pages in PDF
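As a rough sketch of this step, tabula.read_pdf can be asked to scan every page by passing pages="all". In recent tabula-py versions it returns a list of pandas data frames, one per detected table. The wrapper name read_all_tables and the file name sample.pdf are placeholders chosen for this example, not names from the original code.

```python
def read_all_tables(pdf_path):
    """Return every table tabula-py detects in the PDF, as a list of data frames."""
    # Imported inside the function so this file loads even before tabula-py is installed.
    import tabula

    # pages="all" scans the whole document; the default is page 1 only.
    return tabula.read_pdf(pdf_path, pages="all")


# Usage ("sample.pdf" is a placeholder file name):
# tables = read_all_tables("sample.pdf")
```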

Determine how many data frames exist in the PDF.
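Since the result of reading a PDF is an ordinary Python list, counting the data frames is just len() on that list. The small helper below, a sketch with the made-up name describe_tables, also prints the shape of each one, which makes it easier to tell the tables apart.

```python
def describe_tables(tables):
    """Return one summary line per data frame: its index and its shape."""
    lines = []
    for index, table in enumerate(tables):
        rows, columns = table.shape  # pandas DataFrames expose (rows, columns)
        lines.append(f"Data frame {index}: {rows} rows x {columns} columns")
    return lines


# Usage, where `tables` is the list returned by tabula.read_pdf(..., pages="all"):
# print(len(tables))                 # number of data frames found
# for line in describe_tables(tables):
#     print(line)
```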


There are 4 data frames in the PDF in total. Let's see how to read an individual data frame.

In this case we read the 2nd data frame in the PDF. The syntax for reading a data frame is <<dataframe_reference>>[index]. Python indexes from 0, so the second data frame is at index 1.

dataframe_reference: the variable that stores the full list of data frames read from the PDF
index: the index position of the data frame in that list

Second data frame
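The indexing rule is easy to get wrong by one, so here is a minimal sketch that accepts a human-friendly position (1 for the first data frame, 2 for the second) and converts it to a list index. The helper nth_data_frame is hypothetical and exists only for this example; writing tables[1] directly does the same thing.

```python
def nth_data_frame(tables, position):
    """Return the data frame at a 1-based position (1 = first, 2 = second)."""
    if not 1 <= position <= len(tables):
        raise IndexError(f"position must be between 1 and {len(tables)}")
    # Python lists are 0-based, so the 2nd data frame lives at index 1.
    return tables[position - 1]


# Usage, where `tables` is the list returned by tabula.read_pdf:
# second = nth_data_frame(tables, 2)   # same as tables[1]
# print(second.head())
```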

Reading the individual pages

pages: the page (or pages) from which the data frames should be read

Third data frame
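A sketch of reading from specific pages, assuming the standard pages argument of tabula.read_pdf. It accepts a single page number, a list of page numbers, or a string such as "1-2,4". The wrapper name read_tables_from_pages and the file name are placeholders for this example.

```python
def read_tables_from_pages(pdf_path, pages):
    """Return the data frames found on the given page(s) only."""
    # Imported inside the function so this file loads even before tabula-py is installed.
    import tabula

    # `pages` may be an int (3), a list of ints ([1, 3]) or a string ("1-2,4").
    return tabula.read_pdf(pdf_path, pages=pages)


# Usage ("sample.pdf" is a placeholder file name):
# page_three = read_tables_from_pages("sample.pdf", 3)
# several = read_tables_from_pages("sample.pdf", "1-2,4")
```

The result is still a list, even when only one page is requested, so individual data frames are picked out by index as before.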

Read partial area of PDF

We can also read just a certain region of a PDF page.

To restrict extraction to a certain part of the page, use the area option.

area: portion of the page to analyze, given as (top, left, bottom, right). The default is the entire page.

partial area of PDF
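Here is a minimal sketch of the area option. The four values are passed in the order top, left, bottom, right; as far as I know tabula measures them in PDF points from the top-left corner of the page. The wrapper read_area, its argument checks, and the coordinates in the usage comment are illustrative assumptions, not values from the original example.

```python
def read_area(pdf_path, page, top, left, bottom, right):
    """Return the data frames found inside one rectangular region of a page."""
    if not (top < bottom and left < right):
        raise ValueError("expected top < bottom and left < right")

    # Imported inside the function so this file loads even before tabula-py is installed.
    import tabula

    # area is ordered (top, left, bottom, right); without it the whole page is analyzed.
    return tabula.read_pdf(pdf_path, pages=page, area=[top, left, bottom, right])


# Usage (file name and coordinates are placeholders):
# region = read_area("sample.pdf", page=1, top=80, left=30, bottom=400, right=560)
```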

Extract to JSON, TSV or CSV

Here we extract the data frames on the first page as JSON.

Extract to JSON format
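One way to get JSON, sketched here under the assumption that tabula.read_pdf is called with output_format="json": instead of data frames, it then returns plain Python lists and dictionaries that can be serialized with the json module. The function name and file name are placeholders for this example.

```python
def read_first_page_as_json(pdf_path):
    """Return the tables on page 1 as JSON-style Python objects (lists and dicts)."""
    # Imported inside the function so this file loads even before tabula-py is installed.
    import tabula

    # output_format="json" asks for JSON-style data instead of pandas data frames.
    return tabula.read_pdf(pdf_path, pages=1, output_format="json")


# Usage ("sample.pdf" is a placeholder file name):
# import json
# print(json.dumps(read_first_page_as_json("sample.pdf"), indent=2))
#
# For CSV or TSV, a data frame from a normal read can be written with pandas:
# tables[0].to_csv("first.csv", index=False)
# tables[0].to_csv("first.tsv", sep="\t", index=False)
```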

Convert PDF tables to JSON, CSV or TSV

You can convert a PDF directly to a file, rather than creating Python objects, with the convert_into() function.
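A short sketch of that direct conversion, assuming the usual convert_into(input_path, output_path, output_format=...) call with "csv", "tsv" or "json" as the format. The wrapper convert_pdf, the choice of pages="all", and the file names are assumptions made for this example.

```python
def convert_pdf(pdf_path, output_path, output_format="csv"):
    """Write the tables in a PDF straight to a CSV, TSV or JSON file."""
    if output_format not in {"csv", "tsv", "json"}:
        raise ValueError("output_format must be 'csv', 'tsv' or 'json'")

    # Imported inside the function so this file loads even before tabula-py is installed.
    import tabula

    # No data frames are created here; tabula writes the output file itself.
    tabula.convert_into(pdf_path, output_path, output_format=output_format, pages="all")
    return output_path


# Usage (file names are placeholders):
# convert_pdf("sample.pdf", "sample.csv")
# convert_pdf("sample.pdf", "sample.json", output_format="json")
```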


Data Science and Machine Learning enthusiast | Software Architect | Full stack developer
