doeextractor package

Submodules

doeextractor.analyser module

doeextractor.analyser.analyse(results: List[doeextractor.models.fuel_line_price.FuelLinePriceItem])

Get the mean, median, and mode of prices across all companies.

doeextractor.analyser.mean(items)
doeextractor.analyser.median(items)
doeextractor.analyser.mode(items)
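
The three statistics named above can be sketched with the standard library. This is a minimal illustration, not the package's implementation: it assumes the prices have already been pulled out of the FuelLinePriceItem objects as plain floats, and the `summarize` helper is a hypothetical name.

```python
from statistics import mean, median, multimode


def summarize(prices):
    """Return the mean, median, and mode of a non-empty list of prices.

    Hypothetical helper: `prices` is assumed to be a list of floats.
    """
    if not prices:
        raise ValueError("prices must not be empty")
    return {
        "mean": mean(prices),
        "median": median(prices),
        # multimode returns every most-common value; take the first one seen
        "mode": multimode(prices)[0],
    }
```

When several prices tie for most common, this sketch reports the one that appears first in the input.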

doeextractor.cli module

Console script for doeextractor.

doeextractor.constants module

class doeextractor.constants.ExtractMethod

Bases: enum.Enum

Enumeration of table extraction methods.

LATTICE = '--lattice'
STREAM = '--stream'

class doeextractor.constants.Formats

Bases: enum.Enum

Enumeration of supported file formats.

CSV = 'CSV'
JSON = 'JSON'
TSV = 'TSV'
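
As a rough illustration of how enumerations like these are typically used, the sketch below redeclares both enums with the members documented above and looks them up by value and by name. The redeclaration stands in for importing the package, and the two helper functions are hypothetical.

```python
from enum import Enum


class ExtractMethod(Enum):
    # Members copied from the reference above
    LATTICE = "--lattice"
    STREAM = "--stream"


class Formats(Enum):
    # Members copied from the reference above
    CSV = "CSV"
    JSON = "JSON"
    TSV = "TSV"


def method_from_flag(flag):
    """Look up an ExtractMethod by its value (hypothetical helper)."""
    return ExtractMethod(flag)


def format_from_name(name):
    """Look up a Formats member by name, ignoring case (hypothetical helper)."""
    return Formats[name.upper()]
```

Lookup by value raises ValueError for an unknown flag, and lookup by name raises KeyError for an unknown format, so invalid user input fails early.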

doeextractor.exceptions module

exception doeextractor.exceptions.NotFoundException

Bases: Exception

doeextractor.file_helpers module

doeextractor.file_helpers.add_file_to_local_cache(file_path: pathlib.Path, output_file_path: pathlib.Path) → str

Add file to local cache.

doeextractor.file_helpers.convert_pdf_to_png(file_path, merge_pages=False) → pathlib.Path

Convert a PDF file to PNG.

doeextractor.file_helpers.get_checksum(file_path)

Calculates the checksum of a file.
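
The reference does not say which hash algorithm get_checksum uses. The sketch below shows one common way to checksum a file, assuming SHA-256 and chunked reads; the function name `sha256_of_file` and the chunk size are illustrative choices, not taken from the package.

```python
import hashlib
from pathlib import Path


def sha256_of_file(file_path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file (assumed algorithm)."""
    digest = hashlib.sha256()
    with Path(file_path).open("rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A digest like this could serve as a cache key for deciding whether a file has been processed before, though whether the package does so is not stated here.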

doeextractor.file_helpers.is_file_already_analyzed(file_path)

Checks whether a file has already been analyzed.

doeextractor.parser module

class doeextractor.parser.Token(value, token_type: str)

Bases: object

is_overall_range_header() → bool

doeextractor.parser.classify(text)

Classify text, simplified for DOE reports.

doeextractor.parser.parse(input_file_path: str, output_file_path: str = None)

Simple parser for extracted tables.

doeextractor.parser.tokenize(text_data: list)

doeextractor.tabula module

doeextractor.tabula.debug_info(silent=False)
doeextractor.tabula.extract(*args, **kwargs)

Extract data from PDFs using tabula.

doeextractor.textract_parser module

doeextractor.textract_parser.classify(text)

Classify text, simplified for DOE reports.

doeextractor.textract_parser.clean(input_file_path: pathlib.Path, overwrite=False)

Clean extracted tables.

doeextractor.textract_parser.parse(input_file_path: str, output_file_path: str = None, clean_input=True)

Simple parser for extracted tables.

doeextractor.textract_parser.tokenize_input(input_file_path: str)

Tokenize input file.

doeextractor.textractor module

doeextractor.textractor.extract(input_file_path: Union[str, pathlib.Path])

Extract tables from a PDF file using Amazon Textract.

doeextractor.textractor.generate_table_csv(table_result, blocks_map, table_index)
doeextractor.textractor.get_rows_columns_map(table_result, blocks_map)
doeextractor.textractor.get_table_csv_results(input_file: pathlib.Path)
doeextractor.textractor.get_text(result, blocks_map)
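
The helpers above are undocumented. Their names suggest they walk the block structure of a Textract response, where a TABLE block lists CELL children and each CELL lists WORD children. The sketch below shows that traversal on plain dictionaries. It is an assumption about the general approach, not the package's code, and `cell_text` and `rows_columns_map` are hypothetical names.

```python
from typing import Any, Dict

Block = Dict[str, Any]


def child_ids(block: Block):
    """Yield the ids of a block's CHILD relationships."""
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            yield from relationship["Ids"]


def cell_text(cell: Block, blocks_map: Dict[str, Block]) -> str:
    """Join the WORD children of a CELL block with single spaces."""
    words = [
        blocks_map[child_id]["Text"]
        for child_id in child_ids(cell)
        if blocks_map[child_id]["BlockType"] == "WORD"
    ]
    return " ".join(words)


def rows_columns_map(table: Block, blocks_map: Dict[str, Block]):
    """Map row index -> column index -> cell text for one TABLE block."""
    rows: Dict[int, Dict[int, str]] = {}
    for child_id in child_ids(table):
        cell = blocks_map[child_id]
        if cell["BlockType"] == "CELL":
            row = rows.setdefault(cell["RowIndex"], {})
            row[cell["ColumnIndex"]] = cell_text(cell, blocks_map)
    return rows
```

Here `blocks_map` is assumed to be a dictionary keyed by block Id, built from the Blocks list of the response. Row and column indexes in Textract start at 1.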

doeextractor.token_types module

doeextractor.token_types.feature_is_city(text)
doeextractor.token_types.feature_is_price(text)
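
Neither feature function is documented. As a rough idea of what a price feature might check, the sketch below tests whether a string is a plain decimal number with optional thousands separators. The pattern and the name `looks_like_price` are assumptions made for illustration; the package's actual rules may differ.

```python
import re

# Digits, optional comma-separated thousands groups, optional decimal part
_PRICE_PATTERN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def looks_like_price(text: str) -> bool:
    """Return True if the whole string reads as a number such as 45.50."""
    return bool(_PRICE_PATTERN.fullmatch(text.strip()))
```

Using fullmatch rather than search means a range such as "45.50-47.00" is rejected instead of being counted as a single price.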

Module contents

Top-level package for doeextractor.