doeextractor

DOE Reports Extractor

Requirements

Tabula

Poppler via pdf2image (installation guide: https://github.com/Belval/pdf2image#how-to-install)

Amazon Textract

AWS Subscription (Access Key and Secret Key)

Features

  • Extract tables from DOE PDF reports using Amazon Textract (online, more accurate, may incur charges).
  • Extract tables from DOE PDF reports using Tabula (offline, less accurate, free and open source).

Usage

Available commands

$ doeextractor --help
Usage: doeextractor [OPTIONS] COMMAND [ARGS]...

Console script for doeextractor.

Options:
--help  Show this message and exit.

Commands:
extract          Extract tables from a PDF file using Amazon Textract
parse            Parse extracted tables from Amazon Textract
show-debug-info  Debug info for DOE Extractor
tabula-extract   Extract tables from a PDF file using Tabula
tabula-parse     Parse extracted tables from Tabula

Extracting a report

Amazon Textract

$ doeextractor extract 'reports/2022-05-18/petro_min_2022-may-10.pdf'
File is already analyzed
('55bd3e728ab9d40076262fc8af2abbb2', 'reports/2022-05-18/petro_min_2022-may-10.pdf', 'reports/2022-05-18/petro_min_2022-may-10.csv')
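The tuple printed with "File is already analyzed" starts with a 32-character hex string, which has the shape of an MD5 digest. The following is a minimal sketch of how such a skip-if-seen check might work, assuming the digest is the MD5 of the PDF contents and that results are tracked in a plain dict. `file_md5`, `lookup_or_register`, and the index layout are hypothetical names for illustration, not doeextractor's actual code.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lookup_or_register(index, pdf_path):
    """Return (record, already_analyzed) for pdf_path.

    `index` maps digest -> (digest, pdf path, csv path). This layout is an
    assumption modelled on the tuple printed in the example above.
    """
    digest = file_md5(pdf_path)
    if digest in index:
        return index[digest], True
    csv_path = str(Path(pdf_path).with_suffix(".csv"))
    record = (digest, str(pdf_path), csv_path)
    index[digest] = record
    return record, False
```

Keying the index on file contents rather than the file name means a renamed copy of the same report would also be recognized, which matters when each Textract call may incur charges.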

$ doeextractor extract 'reports/2022-05-18/petro_sluz_2022-may-10_mimaropa.pdf'
Saved 2 pages to output/petro_sluz_2022-may-10_mimaropa
Analyzing...
0 / 2
1 / 2
2 / 2
CSV results are written to reports/2022-05-18/petro_sluz_2022-may-10_mimaropa.csv
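Amazon Textract returns tables as a flat list of blocks, where `TABLE` blocks point to `CELL` blocks and cells point to `WORD` blocks through `CHILD` relationships. The sketch below shows one way to rebuild rows from such a response before writing CSV. It assumes a block list shaped like the `Blocks` field of a table-analysis response; `tables_from_blocks` is a hypothetical helper and may differ from how doeextractor does it.

```python
def tables_from_blocks(blocks):
    """Rebuild tables (lists of rows of strings) from Textract-style blocks."""
    by_id = {block["Id"]: block for block in blocks}

    def children(block, block_type):
        # Follow CHILD relationships and keep only blocks of the wanted type.
        found = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = by_id.get(child_id)
                if child is not None and child["BlockType"] == block_type:
                    found.append(child)
        return found

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = children(block, "CELL")
        if not cells:
            continue
        n_rows = max(cell["RowIndex"] for cell in cells)
        n_cols = max(cell["ColumnIndex"] for cell in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            words = [word["Text"] for word in children(cell, "WORD")]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
        tables.append(grid)
    return tables
```

Each returned grid can then be passed to `csv.writer(...).writerows(grid)` from the standard library.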

Tabula

$ doeextractor tabula-extract -i 'reports/2022-05-18/petro_min_2022-may-10.pdf'
Running tabula with this command:
java -jar tabula-1.0.5-jar-with-dependencies.jar --lattice -f CSV --pages 1 /Users/pro/retailprices/reports/2022-05-18/petro_min_2022-may-10.pdf
AREA,PRODUCT,PETRON,SHELL,CALTEX,PHOENIX,FLYING V,SEAOIL,JETTI,MY GAS,INDEPENDENT,OVERALL,COMMON,AVERAGE,
"",,Liquid Fuels Price Range,,,,,,,,,,,,
"",REGION IX,,,,,,,,,,,,,
OUTLET",N.A,82.41 - 82.61,NONE,82.51.A,82.41 - 82.61,,N.A,"NO BRANCH/
"",,RON 95,79.61 - 79.86,80.11 - 81.11,79.61 - 79.61,82.11 - 82.11,,79.61 - 79.61,79.61 - 81.66,79.61 - 82.11,79.61,80.22,,
"",,RON 91,78.86 - 79.11,79.36 - 79.36,78.86 - 78.86,81.36 - 81.36,NO BRANCH/,78.86 - 78.86,78.86 - 78.86,78.86 - 81.36,78.86,79.31,,
"",,DIESEL,81.02 - 81.23,81.52 - 83.92,81.02 - 81.03,83.53 - 83.53,OUTLET,80.02 - 80.02,81.03 - 83.95,80.02 - 83.95,81.03,81.59,,
"",,DIESEL PLUS,83.02 - 83.02,87.42 - 87.42,N.A,N.A,,N.A,N.A,83.02 - 87.42,87.42,85.95,,
"",,KEROSENE,83.70 - 83.70,-,83.13 - 83.13,N.A,,N.A,-,83.13 - 83.70,83.13,83.32,,
OUTLET",78.45 - 78.45,78.20 - 78.45,78.20,78.25.20 - 78.20,-,"NO BRANCH/
"",,RON 91,77.70 - 77.70,-,-,-,NO BRANCH/,77.70 - 77.70,-,77.95 - 77.95,77.70 - 77.95,77.70,77.76,
"",,DIESEL,80.30 - 80.30,-,-,-,OUTLET,80.30 - 80.30,-,80.55 - 80.55,80.30 - 80.55,80.30,80.35,
"",,KEROSENE,-,-,-,N.A,,N.A,N.A,80.20 - 80.20,80.20 - 80.20,NONE,80.20,
OUTLET",77.21 - 79.21,NONE,78.22 - 77.21,79.21 - 79.21,78.25 - 78.25,-,,-,"NO BRANCH/
"",,RON 91,76.71 - 76.71,-,78.05 - 78.05,-,,-,76.71 - 78.05,NONE,77.38,,,
"",,DIESEL,82.57 - 82.57,83.82 - 83.82,83.85 - 83.85,-,,-,82.57 - 83.85,NONE,83.41,,,
"",Dipolog City,RON 100,-,-,-,-,-,-,-,-,-,-,NONE,NONE
"",,RON 97,-,N.A,N.A,-,N.A,N.A,N.A,N.A,N.A,-,NONE,NONE
"",,RON 95,74.21 - 77.21,77.71 - 79.11,78.55 - 78.55,-,-,77.21 - 77.21,-,-,75.50 - 75.50,74.21 - 79.11,77.21,77.21
"",,RON 91,73.56 - 76.71,77.21 - 78.96,78.30 - 78.30,-,-,76.71 - 76.71,-,-,75.50 - 75.50,73.56 - 78.96,76.71,76.81
"",,DIESEL,78.22 - 82.57,83.07 - 84.31,83.80 - 83.80,-,-,82.57 - 82.57,-,-,76.65 - 76.65,76.65 - 84.31,82.57,81.99
"",,DIESEL PLUS,-,90.27 - 90.27,N.A,N.A,N.A,-,N.A,N.A,N.A,90.27 - 90.27,90.27,90.27
"",,KEROSENE,-,-,-,N.A,-,N.A,N.A,-,-,-,NONE,NONE
OUTLET",-,-,NONE,NONERON 95,"NO BRANCH/
RON 91,81.85 - 81.85,81.85 - 81.85,81.85,81.85,,,,,,,,,,
DIESEL,84.35 - 84.85,84.35 - 84.85,84.35,84.52,,,,,,,,,,
KEROSENE,80.60 - 80.60,80.60 - 80.60,NONE,80.60,,,,,,,,,,
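Most price cells in the output above are ranges such as `79.61 - 79.86`, single values such as `80.22`, or placeholders such as `N.A`, `-`, and `NONE`. The sketch below shows one way to turn a cell into a `(low, high)` pair. `parse_price_range` is a hypothetical helper, and the placeholder set is an assumption drawn only from the sample shown here.

```python
import re

# Placeholder values seen in the sample output (assumed to mean "no price").
PLACEHOLDERS = {"", "-", "N.A", "NONE"}
RANGE_PATTERN = re.compile(r"(\d+\.\d+)\s*-\s*(\d+\.\d+)")
SINGLE_PATTERN = re.compile(r"\d+\.\d+")


def parse_price_range(cell):
    """Return (low, high) as floats, or None for a placeholder cell.

    Raises ValueError for text that is neither a price nor a known
    placeholder, so damaged cells are not silently accepted.
    """
    text = cell.strip()
    if text in PLACEHOLDERS:
        return None
    match = RANGE_PATTERN.fullmatch(text)
    if match:
        return float(match.group(1)), float(match.group(2))
    if SINGLE_PATTERN.fullmatch(text):
        value = float(text)
        return value, value
    raise ValueError(f"unrecognized price cell: {cell!r}")
```

Raising on unknown text is a deliberate choice in this sketch: cells that Tabula merged or split incorrectly, such as `82.51.A`, surface as errors instead of producing wrong prices.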

Output in JSON format and write to file

Amazon Textract: CSV output only

Tabula

$ doeextractor tabula-extract --pages all -i '/Users/pro/retailprices/reports/2022-05-18/petro_min_2022-may-10.pdf' -f JSON -o samples/petro_min_2022-may-10.json
Running tabula with this command:
java -jar /Users/pro/tabula-1.0.5-jar-with-dependencies.jar --lattice -f JSON --pages all /Users/pro/retailprices/reports/2022-05-18/petro_min_2022-may-10.pdf -o /Users/pro/doeextractor/samples/petro_min_2022-may-10.json
$ file samples/petro_min_2022-may-10.json
samples/petro_min_2022-may-10.json: JSON data
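The logged command suggests that the CLI assembles a tabula-java invocation from its own options. The sketch below builds the same argument list using only the flags visible in the log above (`--lattice`, `-f`, `--pages`, `-o`). `build_tabula_command` and `run_tabula` are hypothetical names, and running the command assumes Java and the Tabula jar are available locally.

```python
import subprocess


def build_tabula_command(jar_path, pdf_path, fmt="CSV", pages="1", output_path=None):
    """Return the tabula-java command as a list of arguments."""
    command = [
        "java", "-jar", str(jar_path),
        "--lattice",
        "-f", fmt,
        "--pages", str(pages),
        str(pdf_path),
    ]
    if output_path is not None:
        command += ["-o", str(output_path)]
    return command


def run_tabula(command):
    """Run the command and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.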

Parsing the extracted report

Amazon Textract

$ doeextractor parse samples/petro_min_2022-may-10.csv -o output/petro_min_2022-may-10-output.json
Parse extracted tables
[.] Getting headers
[.] Reading data
[.] Correcting locations
[.] Breaking up merged lines
[.] Re-inserting merged 3 rows
Output file saved to: /Users/pro/doeextractor/output/petro_min_2022-may-10-output.json
[.] Done
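The "Correcting locations" step is not described further here. One plausible reading, given that the Tabula sample earlier in this README shows a location such as Dipolog City only on the first row of its group, is that blank location cells are filled from the row above. The sketch below illustrates that idea only. `fill_down` is a hypothetical helper, not the parser's actual implementation.

```python
def fill_down(rows, column):
    """Return copies of rows with blank cells in `column` filled from above.

    A cell counts as blank when it is empty or whitespace only. Rows before
    the first non-blank value are left unchanged.
    """
    filled = []
    last_value = ""
    for row in rows:
        new_row = list(row)
        if new_row[column].strip():
            last_value = new_row[column]
        else:
            new_row[column] = last_value
        filled.append(new_row)
    return filled
```

After this step every row carries its own location, so rows can be filtered or grouped without looking at their neighbors.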

Tabula

$ doeextractor tabula-parse samples/petro_min_2022-may-10.json -o samples/parsed_output.json
Parse extracted tables
Output file saved to: /Users/pro/doeextractor/samples/parsed_output.json
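Before any cleanup, the Tabula JSON has to be flattened into rows of text. The sketch below assumes the usual tabula-java JSON layout: a list of tables, each with a `data` list of rows, where every cell is an object with a `text` field. `rows_from_tabula_json` is a hypothetical helper and is not taken from doeextractor.

```python
import json


def rows_from_tabula_json(document):
    """Flatten Tabula JSON (a string or already-parsed list) into text rows."""
    tables = json.loads(document) if isinstance(document, str) else document
    rows = []
    for table in tables:
        for row in table.get("data", []):
            # Keep only the cell text; position fields are ignored here.
            rows.append([cell.get("text", "").strip() for cell in row])
    return rows
```

For a file on disk, the contents can be read first, for example `rows_from_tabula_json(open(path, encoding="utf-8").read())`.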
