Reading data from PDF using tabula-py

Antony Christopher
Oct 5, 2020

Do we really need to work with PDFs in data science?

Yes. In real-world scenarios, a dataset can arrive in any format, so we should know how to read it in such cases.

Today we are going to see how to read data from a PDF file.

To do this, we need to install a library that supports reading PDF files. tabula-py is one such library: with it, we can read the tables in a PDF into data frames and convert them to other formats.

tabula-py Installation

Open the Anaconda command prompt and run the command below.

pip install tabula-py

When the installation finishes, you will see a screen like the one below.

Installing tabula-py
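tabula-py is a Python wrapper around tabula-java, so a Java runtime needs to be available in addition to the Python package. A quick way to check both might look like the sketch below. The function name check_setup is hypothetical, and only the Python standard library is used.

```python
import importlib.util
import shutil


def check_setup():
    """Report whether the tabula module and a java executable can be found."""
    return {
        # True if the `tabula` module can be imported in this environment
        "tabula": importlib.util.find_spec("tabula") is not None,
        # True if a `java` executable is on the PATH
        "java": shutil.which("java") is not None,
    }


print(check_setup())
```

If either value is False, reading a PDF will most likely fail, so it is worth fixing the setup before moving on.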

Reading all pages in a PDF

import tabula

pdf_path = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"
dfs = tabula.read_pdf(pdf_path, stream=True, pages="all")  # returns a list of data frames

How many data frames were found in the PDF?

print(len(dfs))

4

There are 4 data frames in total. Let's see how to read an individual data frame.

Here we read the second data frame in the PDF. The syntax for reading a data frame is <<dataframe_reference>>[index]

dataframe_reference: the variable that stores the list of data frames read from the PDF
index: the position of the data frame in that list, counting from 0 (so the second data frame is at index 1)

dfs[1]

Second data frame
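Because Python lists are indexed from zero, it is easy to be off by one when picking a table. The sketch below shows two small helpers that make the indexing explicit. The names table_shapes and nth_table are hypothetical, and the helpers only assume that each table has a shape attribute, as pandas data frames do.

```python
def table_shapes(tables):
    """Return (index, rows, columns) for each extracted table."""
    return [(index, *table.shape) for index, table in enumerate(tables)]


def nth_table(tables, n):
    """Return the n-th table counting from 1, so n=2 gives tables[1]."""
    if not 1 <= n <= len(tables):
        raise IndexError(f"table {n} requested, but only {len(tables)} found")
    return tables[n - 1]
```

With the dfs list read earlier, table_shapes(dfs) gives a quick overview of what was extracted, and nth_table(dfs, 2) returns the same object as dfs[1].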

Reading individual pages

dfs = tabula.read_pdf(pdf_path, pages=3, stream=True)

pages: specifies the page (or pages) from which tables are read

dfs[0]

Data frame read from page 3
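Besides a single number or "all", pages can also be given as a list of page numbers or as a string such as "1-2,3". As a rough sketch, the helper below collapses a list of page numbers into that string form. The names pages_spec and read_selected_pages are hypothetical, and tabula is imported inside the function so that the helper can be tried without tabula-py or Java installed.

```python
def pages_spec(page_numbers):
    """Collapse 1-based page numbers into a string such as "1-3,5"."""
    pages = sorted(set(page_numbers))
    if not pages:
        raise ValueError("at least one page number is required")
    if pages[0] < 1:
        raise ValueError("page numbers start at 1")
    ranges = []
    start = prev = pages[0]
    for page in pages[1:]:
        if page == prev + 1:
            # Still inside a consecutive run, so extend it
            prev = page
            continue
        ranges.append((start, prev))
        start = prev = page
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def read_selected_pages(pdf_path, page_numbers):
    """Read the tables on the given pages in stream mode."""
    import tabula  # lazy import: pages_spec itself needs no extra packages

    return tabula.read_pdf(pdf_path, pages=pages_spec(page_numbers), stream=True)


print(pages_spec([1, 2, 3, 5]))  # prints 1-3,5
```

For example, read_selected_pages(pdf_path, [1, 2]) would read the tables on the first two pages.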

Read partial area of PDF

We can also read only part of a page. To restrict extraction to a certain region, use the area option.

area: portion of the page to analyze, given as (top, left, bottom, right). The default is the entire page.

dfs = tabula.read_pdf(pdf_path, area=[126,149,212,462], pages=2)
dfs[0]

Partial area of the PDF
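Many PDF tools describe a region as x, y, width and height rather than as four edges. Assuming those values are measured from the top-left corner of the page in the same units that area uses, a small conversion could look like the sketch below. The name area_from_box is hypothetical.

```python
def area_from_box(x, y, width, height):
    """Convert an (x, y, width, height) box to [top, left, bottom, right]."""
    top = y
    left = x
    bottom = y + height
    right = x + width
    return [top, left, bottom, right]


# The box below matches the area used in the example above
print(area_from_box(149, 126, 313, 86))  # prints [126, 149, 212, 462]
```

The result can be passed straight to read_pdf, for example tabula.read_pdf(pdf_path, area=area_from_box(149, 126, 313, 86), pages=2).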

Extract to JSON, TSV or CSV

Extract the tables on the first page as JSON.

tabula.read_pdf(pdf_path, output_format="json", pages="1")

Extract to JSON format
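With output_format="json", the result is plain Python data instead of data frames. The sketch below pulls the cell texts out of one table. It assumes each table is a dict with a "data" key holding rows of cells, and that each cell is a dict with a "text" key. That layout is an assumption here, so check it against your own output first. The name rows_from_json_table and the sample dict are hypothetical and hand-written, not real tabula output.

```python
def rows_from_json_table(table):
    """Return the cell texts of one JSON table as a list of rows."""
    return [
        [cell.get("text", "") for cell in row]
        for row in table.get("data", [])
    ]


# Hand-written sample in the assumed layout, not real tabula output
sample = {
    "data": [
        [{"text": "name"}, {"text": "value"}],
        [{"text": "a"}, {"text": "1"}],
    ]
}
print(rows_from_json_table(sample))  # prints [['name', 'value'], ['a', '1']]
```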

Convert PDF tables to JSON, CSV or TSV

You can convert a file directly, rather than creating Python objects, with the convert_into() function.

tabula.convert_into(pdf_path, "test.json", output_format="json", pages=1)
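The same call works for CSV and TSV by changing output_format. As a small sketch, the helper below picks the format from the output file's extension so the two cannot get out of sync. The names output_format_for and convert_tables are hypothetical, and tabula is imported inside the function so the format helper can be tried on its own.

```python
from pathlib import Path

SUPPORTED_FORMATS = {"csv", "tsv", "json"}


def output_format_for(output_path):
    """Pick the output format from the file extension of output_path."""
    suffix = Path(output_path).suffix.lower().lstrip(".")
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output extension: {output_path!r}")
    return suffix


def convert_tables(pdf_path, output_path, pages="all"):
    """Write the tables found in pdf_path to output_path."""
    import tabula  # lazy import: output_format_for needs no extra packages

    tabula.convert_into(
        pdf_path,
        output_path,
        output_format=output_format_for(output_path),
        pages=pages,
    )
```

For example, convert_tables(pdf_path, "test.csv", pages=1) would write the tables on the first page to a CSV file.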

References

https://pypi.org/project/tabula-py/

Antony Christopher

Data Science and Machine Learning enthusiast | Software Architect | Full stack developer