Extract Tables from PDF to a CSV using Python Tabula

Financial documents often contain a lot of tabular data, which is hard to analyze while it is locked inside a PDF. With the tabula-py library in Python, these tables can be converted to CSV files easily. Here is how:

Code:

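import tabula

# Raw strings keep the backslashes in Windows paths from being read as escapes
pdf_path = r'D:\PDF to Excel\Credit Card Statement – Aug 2020.pdf'
csv_path = r'D:\PDF to Excel\Credit Card Statement – Aug 2020.csv'

# read_pdf returns the tables it finds as pandas DataFrames (all pages here)
dfs = tabula.read_pdf(pdf_path, pages='all')

# convert_into writes the tables on page 1 straight to a CSV file and returns nothing
tabula.convert_into(pdf_path, csv_path, output_format='csv', pages=1, stream=True)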

Explanation:

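The convert_into function takes the 'Credit Card Statement – Aug 2020.pdf' document and writes any tables found on page 1 to a CSV file. The read_pdf function, by contrast, loads the tables into pandas DataFrames (a list of them in current tabula-py versions) so they can be processed further in Python.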
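
If you would rather keep each table separate, one option is to write every DataFrame returned by read_pdf to its own CSV file. The sketch below is only an illustration: save_tables, the output folder, and the table_N.csv naming are hypothetical names, not part of tabula-py, and it assumes read_pdf returns a list of DataFrames, which is the default in tabula-py 2.x.

```python
from pathlib import Path


def save_tables(tables, out_dir, stem="table"):
    """Write each table to its own CSV file and return the list of paths.

    `tables` is any iterable of objects with a pandas-style to_csv method,
    for example the list returned by tabula.read_pdf (an assumption about
    the tabula-py version in use).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed

    paths = []
    for number, table in enumerate(tables, start=1):
        path = out_dir / f"{stem}_{number}.csv"
        table.to_csv(path, index=False)  # drop the DataFrame row index
        paths.append(path)
    return paths
```

It could be called as save_tables(tabula.read_pdf(pdf_path, pages='all'), 'tables'), where pdf_path points at your own PDF.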

The CSV file is saved in the same folder as the PDF.
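
To avoid typing the output path by hand, the CSV path can be derived from the PDF path. This is a minimal sketch using the standard pathlib module; csv_path_for and pdf_page_to_csv are hypothetical helper names, and the convert_into arguments simply mirror the ones used above.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Return a path next to the PDF with the same name and a .csv extension."""
    return Path(pdf_path).with_suffix(".csv")


def pdf_page_to_csv(pdf_path, page=1):
    """Write the tables on one page of the PDF to a CSV file beside it."""
    # Imported here so csv_path_for can be used without tabula-py installed
    import tabula

    out_path = csv_path_for(pdf_path)
    tabula.convert_into(str(pdf_path), str(out_path),
                        output_format="csv", pages=page, stream=True)
    return out_path
```

Only the final extension is swapped, so a file name that contains other dots keeps them.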

For more details, see the tabula-py documentation: https://tabula-py.readthedocs.io/en/latest/

Note: If you run into a Java error, add your Java bin path to your system's PATH environment variable and restart your system. tabula-py relies on a Java runtime to do the extraction. This resolved my error and saved a clean table extracted from the PDF document.
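
Because tabula-py runs the Java-based Tabula engine under the hood, it can help to confirm that a java executable is visible on PATH before calling it. The check below is a small sketch using only the standard library; find_java is a hypothetical helper name.

```python
import shutil


def find_java():
    """Return the full path of the java executable, or None if it is not on PATH."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("Java was not found on PATH; add your Java bin folder and try again.")
else:
    print(f"Java found at: {java_path}")
```

If the check reports that Java is missing right after you edited the environment variable, open a new terminal or restart so the updated PATH is picked up.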
