Reading data from PDF using tabula-py
Do we really need to read PDFs in data science?
Yes. In real-world projects a dataset can arrive in any format, so we should know how to read it whatever form it takes.
Today we are going to see how to read data from a PDF file.
To do that, we need a library that supports reading PDF files. tabula-py is one such library: it reads the tables in a PDF into pandas data frames and can also convert them to other formats.
Open the Anaconda command prompt and run the command below. tabula-py is a wrapper around tabula-java, so Java must also be installed on the machine.
pip install tabula-py
When the installation finishes, pip prints a message confirming that tabula-py was installed successfully.
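Before reading any PDF, it can help to confirm that the two prerequisites are in place. The sketch below is only a quick sanity check: the helper name check_tabula_environment is made up for this post, and it assumes that Java is reachable as a java executable on the PATH.

```python
import importlib.util
import shutil


def check_tabula_environment():
    """Report whether the pieces tabula-py depends on can be found."""
    return {
        # tabula-py runs tabula-java, so a Java runtime is needed;
        # here we only look for a `java` executable on the PATH.
        "java_on_path": shutil.which("java") is not None,
        # True once `pip install tabula-py` has made the `tabula` module importable.
        "tabula_installed": importlib.util.find_spec("tabula") is not None,
    }


status = check_tabula_environment()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

If either line prints "missing", fix that first; otherwise read_pdf is likely to fail with an error that is harder to interpret.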
Reading all pages in PDF
import tabula
pdf_path = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"
dfs = tabula.read_pdf(pdf_path, stream=True, pages="all")
How many data frames are there in the PDF?
read_pdf returns a list of data frames, so len(dfs) gives the count. This PDF contains 4 data frames in total. Let's see how to read an individual data frame.
Here we read the 2nd data frame in the PDF, which is dfs[1] because list indexes start at 0. The syntax for reading a data frame is <<dataframe_reference>>[index]
dataframe_reference: the variable that stores the list of data frames read from the PDF
index: the position of the data frame in that list, starting at 0
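Because the result is an ordinary Python list, counting and picking tables needs nothing special. The sketch below uses placeholder strings instead of real data frames so that it runs without a PDF; nth_table is a hypothetical helper, not part of tabula-py.

```python
def nth_table(tables, n):
    """Return the n-th table (counting from 1) from a list of tables."""
    if not 1 <= n <= len(tables):
        raise IndexError(f"asked for table {n}, but only {len(tables)} were found")
    # Python lists are zero-based, so the 2nd table lives at index 1.
    return tables[n - 1]


# Stand-in values; in practice `tables` would be the list returned by
# tabula.read_pdf(pdf_path, stream=True, pages="all").
tables = ["table 1", "table 2", "table 3", "table 4"]

print(len(tables))           # number of tables found
print(nth_table(tables, 2))  # same as tables[1]
```

With the real list, nth_table(dfs, 2) and dfs[1] return the same data frame; the helper only adds a clearer error when the requested table does not exist.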
Reading the individual pages
dfs = tabula.read_pdf(pdf_path, pages=3, stream=True)
pages: the page (or pages) from which to read data frames. It accepts a single page number, a list of page numbers, or a string such as "1-2,3" or "all".
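Since pages can be given in several forms, a small wrapper can check the value before handing it to tabula. This is a rough sketch: normalize_pages and read_pages are names invented here, and the validation (page numbers start at 1) is our own addition rather than anything tabula-py requires you to write.

```python
def normalize_pages(pages):
    """Validate a `pages` value and return it in string form."""
    if isinstance(pages, str):
        return pages  # e.g. "all" or "1-2,3", passed through unchanged
    if isinstance(pages, int):
        pages = [pages]
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return ",".join(str(p) for p in pages)


def read_pages(pdf_path, pages):
    """Read the tables on the given pages (needs tabula-py and Java)."""
    import tabula  # imported here so normalize_pages works without tabula-py

    return tabula.read_pdf(pdf_path, pages=normalize_pages(pages), stream=True)


print(normalize_pages(3))       # 3
print(normalize_pages([1, 3]))  # 1,3
print(normalize_pages("all"))   # all
```

Calling read_pages(pdf_path, 3) would then behave like the read_pdf call above, while read_pages(pdf_path, 0) fails early with a readable message.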
Read partial area of PDF
We can also read only a certain part of a page. To restrict extraction to a region, use the area option:
area: portion of the page to analyze, given as (top, left, bottom, right) in points measured from the top-left corner of the page. Default is the entire page.
dfs = tabula.read_pdf(pdf_path, area=[126,149,212,462], pages=2)
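The order top, left, bottom, right is easy to mix up, especially if the region was measured as a box with a width and a height. The sketch below shows the arithmetic; both helpers are hypothetical, and the box values are simply chosen to reproduce the area used above.

```python
POINTS_PER_INCH = 72  # PDF coordinates are measured in points


def area_from_box(left, top, width, height):
    """Convert a (left, top, width, height) box into tabula's area order."""
    return [top, left, top + height, left + width]


def area_from_inches(top, left, bottom, right):
    """Convert edges measured in inches into points."""
    return [value * POINTS_PER_INCH for value in (top, left, bottom, right)]


area = area_from_box(left=149, top=126, width=313, height=86)
print(area)  # [126, 149, 212, 462]
```

The resulting list can be passed straight to read_pdf as area=area.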
Extract to JSON, TSV or CSV
Extract the tables on the first page as JSON. With output_format="json", read_pdf returns the parsed JSON rather than data frames.
tabula.read_pdf(pdf_path, output_format="json", pages="1")
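The JSON result is more verbose than a data frame, so a small helper is handy for pulling out just the cell text. The sketch below assumes each table is a dictionary with a "data" key holding rows of cells, and that each cell has a "text" field; that shape is an assumption here, so check it against your own output. The sample is hand-written, not real output from data.pdf.

```python
def rows_from_json_table(table):
    """Turn one table from the JSON output into a list of rows of cell text."""
    return [[cell["text"] for cell in row] for row in table["data"]]


# Hand-written sample in the assumed shape, not real output from data.pdf.
sample = [
    {
        "extraction_method": "stream",
        "data": [
            [{"text": "name"}, {"text": "score"}],
            [{"text": "a"}, {"text": "1"}],
        ],
    }
]

for table in sample:
    print(rows_from_json_table(table))
```

With real output, the loop would run over the list returned by read_pdf instead of sample.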
Convert PDF tables to JSON, CSV or TSV
You can convert a file directly, rather than creating Python objects, with convert_into:
tabula.convert_into(pdf_path, "test.json", output_format="json", pages=1)
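To avoid the output file name and output_format drifting apart, the format can be derived from the file extension. This is just a sketch: format_for and convert_by_extension are hypothetical helpers, and only the three formats named above are handled.

```python
from pathlib import Path

# Output formats covered in this post, keyed by file extension.
SUPPORTED_FORMATS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def format_for(output_path):
    """Pick the output_format that matches the output file's extension."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return SUPPORTED_FORMATS[suffix]


def convert_by_extension(pdf_path, output_path, pages=1):
    """Write the PDF's tables to output_path (needs tabula-py and Java)."""
    import tabula  # imported here so format_for works without tabula-py

    tabula.convert_into(
        pdf_path, output_path, output_format=format_for(output_path), pages=pages
    )


print(format_for("test.json"))  # json
```

convert_by_extension(pdf_path, "test.json") would then be equivalent to the convert_into call above, and changing the file name to "test.csv" switches the format with it.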