Reading data from PDF using tabula-py

Do we really need PDFs in data science?

Yes. In real-world scenarios, a dataset can arrive in almost any format, so we should know how to read the data in such cases.

Today we are going to see how to read data from a PDF file.

To do this, we need to install a library that supports reading PDF files. With tabula-py, we can read the tables in a PDF into data frames and convert them to other formats such as JSON, CSV or TSV.

tabula-py Installation

Open the Anaconda command prompt and run the command below. tabula-py is a wrapper around tabula-java, so Java must also be installed on the machine.

pip install tabula-py

When the installation finishes, pip reports that tabula-py was installed successfully.

Installing Tabula-py
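
Since tabula-py needs both the Python package and a Java runtime, it can help to check for both before going further. The following is a minimal sketch using only the standard library; check_tabula_setup is a hypothetical helper name, not part of tabula-py.

```python
import importlib.util
import shutil


def check_tabula_setup():
    """Report whether the tabula package and a java executable can be found."""
    return {
        # True if "import tabula" would find an installed package
        "tabula-py": importlib.util.find_spec("tabula") is not None,
        # True if a "java" executable is on the PATH
        "java": shutil.which("java") is not None,
    }


for name, found in check_tabula_setup().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either line says missing, install that piece first and run the check again.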

Reading all pages in PDF

import tabula
pdf_path = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"
# read_pdf returns a list of data frames, one per table found
dfs = tabula.read_pdf(pdf_path, stream=True, pages="all")

How many data frames are in the PDF?

print(len(dfs))

4

There are 4 data frames in the PDF. Let's see how to read an individual data frame.

Here we read the second data frame in the PDF. The syntax for reading a data frame is <<dataframe_reference>>[index]

dataframe_reference: the variable that stores the list of data frames read from the PDF
index: the position of the data frame in that list, starting at 0 (so the second data frame is at index 1)

dfs[1]
Second data frame
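
Because the result is an ordinary Python list, it can be inspected before picking an index. A small sketch of that idea follows; describe_tables and print_table_summary are hypothetical helper names, and the code only assumes that each item has a shape attribute, as pandas data frames do.

```python
def describe_tables(dfs):
    """Return (index, (rows, columns)) for each extracted table."""
    return [(index, df.shape) for index, df in enumerate(dfs)]


def print_table_summary(dfs):
    """Print one line per table so the right index is easy to pick."""
    for index, (rows, cols) in describe_tables(dfs):
        print(f"dfs[{index}]: {rows} rows x {cols} columns")
```

Calling print_table_summary(dfs) on the list read above prints one line per data frame, which makes it easier to see which index holds the table you want.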

Reading the individual pages

dfs = tabula.read_pdf(pdf_path, pages=3, stream=True)

pages: the page number(s) to read tables from

dfs[0]
Third data frame
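
The pages option is not limited to a single number: the tabula-py documentation also lists a list of page numbers and strings such as "1-2,3" or "all". When it matters which page a table came from, one option is to read the pages one at a time. The sketch below does that; tables_by_page and its read_pdf parameter are hypothetical names introduced here, and the reader is passed in only so the grouping logic can be tried without a PDF.

```python
def tables_by_page(pdf_path, page_numbers, read_pdf=None):
    """Map each page number to the list of tables found on that page."""
    if read_pdf is None:
        # Imported lazily so the function can be defined without tabula-py.
        import tabula
        read_pdf = tabula.read_pdf
    return {
        page: read_pdf(pdf_path, pages=page, stream=True)
        for page in page_numbers
    }
```

For example, tables_by_page(pdf_path, [1, 2, 3])[3] would hold the tables from page 3. Each page is extracted in a separate call, so for large PDFs a single call with pages="all" is likely faster.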

Read partial area of PDF

To read tables from only part of a page, use the area option.

area: Portion of the page to analyze, given as (top, left, bottom, right) in PDF points measured from the top-left corner of the page. Default is the entire page.

dfs = tabula.read_pdf(pdf_path, area=[126,149,212,462], pages=2)
dfs[0]
Partial area of the PDF
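
The area values are expressed in PDF points, and one inch is 72 points. If the region was measured in inches, for example with a ruler tool in a PDF viewer, a helper like the one below can do the conversion. This is only a sketch; area_from_inches is a hypothetical name, not a tabula-py function.

```python
POINTS_PER_INCH = 72  # PDF points per inch


def area_from_inches(top, left, bottom, right):
    """Convert a region in inches to [top, left, bottom, right] in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return [edge * POINTS_PER_INCH for edge in (top, left, bottom, right)]


print(area_from_inches(1.75, 2.0, 3.0, 6.5))  # [126.0, 144.0, 216.0, 468.0]
```

The returned list can be passed as the area argument of tabula.read_pdf.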

Extract to JSON, TSV or CSV

Extract the tables on the first page as JSON.

tabula.read_pdf(pdf_path, output_format="json", pages="1")
Extract to JSON format
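
With output_format="json", read_pdf returns parsed JSON instead of data frames. The sketch below assumes a layout in which each table is a dict whose "data" key holds rows of cell dicts, each with a "text" key. That layout is an assumption worth checking against your own result. table_rows is a hypothetical helper, and sample is hand-written, not real tabula output.

```python
def table_rows(table):
    """Return the cell text of one JSON table as a list of rows."""
    return [
        [cell.get("text", "") for cell in row]
        for row in table.get("data", [])
    ]


# Hand-written sample in the assumed layout (not real tabula output).
sample = [
    {
        "extraction_method": "stream",
        "data": [
            [{"text": "name"}, {"text": "value"}],
            [{"text": "a"}, {"text": "1"}],
        ],
    }
]

for table in sample:
    print(table_rows(table))  # [['name', 'value'], ['a', '1']]
```

Under the same assumption, the list returned by read_pdf could be used in place of sample.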

Convert PDF tables to JSON, CSV or TSV

You can convert the tables in a PDF directly to a file, rather than creating Python objects, with the convert_into() function.

tabula.convert_into(pdf_path, "test.json", output_format="json", pages=1)
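
Since the output file name and output_format have to agree, one way to keep them in sync is to derive the format from the file extension. A minimal sketch follows; format_for and export_tables are hypothetical helper names, and only tabula.convert_into comes from tabula-py.

```python
from pathlib import Path

SUPPORTED_FORMATS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def format_for(output_path):
    """Pick the output_format value that matches the file extension."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return SUPPORTED_FORMATS[suffix]


def export_tables(pdf_path, output_path, pages="all"):
    """Write the tables from pdf_path to output_path in the matching format."""
    import tabula  # imported lazily; requires tabula-py and Java

    tabula.convert_into(
        pdf_path, output_path, output_format=format_for(output_path), pages=pages
    )
```

For example, export_tables(pdf_path, "tables.csv") would write every table in the PDF to a CSV file, and export_tables(pdf_path, "test.json", pages=1) mirrors the call above.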

References

https://pypi.org/project/tabula-py/

Written by

Data Science and Machine Learning enthusiast | Software Architect | Full stack developer
