doeextractor package

Subpackages

doeextractor.models package

Submodules

doeextractor.analyser module

doeextractor.analyser.analyse(results: List[doeextractor.models.fuel_line_price.FuelLinePriceItem]): Get the mean, median, and mode of prices across all companies.

doeextractor.analyser.mean(items)

doeextractor.analyser.median(items)

doeextractor.analyser.mode(items)
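The three statistics named above can be illustrated with Python's standard statistics module. This is only a sketch: the reference does not document how analyse(), mean(), median(), or mode() are implemented, and the flat list of floats below is a hypothetical stand-in for the price values carried by FuelLinePriceItem objects.

```python
from statistics import mean, median, mode

# Hypothetical sample prices; the real functions receive FuelLinePriceItem
# objects whose fields are not documented in this reference.
prices = [52.10, 53.45, 52.10, 54.00, 55.25]

summary = {
    "mean": round(mean(prices), 2),   # arithmetic average
    "median": median(prices),         # middle value of the sorted prices
    "mode": mode(prices),             # most frequently occurring price
}
print(summary)
```

The median and mode are less sensitive than the mean to a single outlying price, which is one reason to report all three.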

doeextractor.cli module

Console script for doeextractor.

doeextractor.constants module

class doeextractor.constants.ExtractMethod

Bases: enum.Enum

An enumeration.

LATTICE = '--lattice'

STREAM = '--stream'

class doeextractor.constants.Formats

Bases: enum.Enum

An enumeration.

CSV = 'CSV'

JSON = 'JSON'

TSV = 'TSV'
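A minimal sketch of how enumerations like these can be used. The class bodies mirror the members and values listed above. The ExtractMethod values match the --lattice and --stream flags of the tabula command line, and the Formats values match the choices its --format option accepts, but assembling an argument list from them is an assumption here, not documented behaviour of this package.

```python
from enum import Enum


class ExtractMethod(Enum):
    LATTICE = "--lattice"
    STREAM = "--stream"


class Formats(Enum):
    CSV = "CSV"
    JSON = "JSON"
    TSV = "TSV"


# Look a member up by value (for example, from a command-line argument)...
method = ExtractMethod("--stream")
# ...or by name (for example, from a configuration file).
fmt = Formats["CSV"]

# Hypothetical: build a fragment of a tabula argument list from the members.
args = [method.value, "--format", fmt.value]
print(args)
```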

doeextractor.exceptions module

exception doeextractor.exceptions.NotFoundException

Bases: Exception

doeextractor.file_helpers module

doeextractor.file_helpers.add_file_to_local_cache(file_path: pathlib.Path, output_file_path: pathlib.Path) → str: Add a file to the local cache.

doeextractor.file_helpers.convert_pdf_to_png(file_path, merge_pages=False) → pathlib.Path: Convert a PDF file to PNG.

doeextractor.file_helpers.get_checksum(file_path): Calculate the checksum of a file.

doeextractor.file_helpers.is_file_already_analyzed(file_path): Check whether a file has already been analyzed.
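One common way to combine a checksum with a local cache is sketched below. Everything in it is an assumption made for illustration: the reference does not say which hash algorithm get_checksum uses, nor how the cache is laid out. The sketch_ functions, the SHA-256 choice, and the cache_dir parameter are hypothetical and are not the package's API.

```python
import hashlib
import tempfile
from pathlib import Path


def sketch_checksum(file_path, chunk_size=65536):
    """Hash a file in chunks so large PDFs are never loaded whole (SHA-256 assumed)."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sketch_is_already_analyzed(file_path, cache_dir):
    """Treat a file as analyzed if the cache holds an entry named after its checksum."""
    return (Path(cache_dir) / sketch_checksum(file_path)).exists()


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.pdf"
    report.write_bytes(b"%PDF-1.4 example")
    cache_dir = Path(tmp) / "cache"
    cache_dir.mkdir()

    before = sketch_is_already_analyzed(report, cache_dir)
    # Record the file in the cache, keyed by its checksum.
    (cache_dir / sketch_checksum(report)).touch()
    after = sketch_is_already_analyzed(report, cache_dir)
    checksum = sketch_checksum(report)
```

Keying the cache by content rather than by file name means a renamed copy of an already processed report is still recognised.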

doeextractor.parser module

class doeextractor.parser.Token(value, token_type: str)

Bases: object

is_overall_range_header() → bool

doeextractor.parser.classify(text): Simplified text classification for DOE reports.

doeextractor.parser.parse(input_file_path: str, output_file_path: str = None): Simple parser for extracted tables.

doeextractor.parser.tokenize(text_data: list)

doeextractor.tabula module

doeextractor.tabula.debug_info(silent=False)

doeextractor.tabula.extract(*args, **kwargs): Extract data from PDFs using tabula.

doeextractor.textract_parser module

doeextractor.textract_parser.classify(text): Simplified text classification for DOE reports.

doeextractor.textract_parser.clean(input_file_path: pathlib.Path, overwrite=False): Clean extracted tables.

doeextractor.textract_parser.parse(input_file_path: str, output_file_path: str = None, clean_input=True): Simple parser for extracted tables.

doeextractor.textract_parser.tokenize_input(input_file_path: str): Tokenize the input file.

doeextractor.textractor module

doeextractor.textractor.extract(input_file_path: Union[str, pathlib.Path]): Extract tables from a PDF file using Amazon Textract.

doeextractor.textractor.generate_table_csv(table_result, blocks_map, table_index)

doeextractor.textractor.get_rows_columns_map(table_result, blocks_map)

doeextractor.textractor.get_text(result, blocks_map)
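The helper signatures above suggest the usual way of walking an Amazon Textract response: index every block by its Id, then follow CHILD relationships from a TABLE block to its CELL blocks and from each cell to its WORD blocks. The sketch below assumes that approach. The sketch_ functions and the hand-written blocks are hypothetical, and the package's own helpers are not documented here and may differ.

```python
def sketch_get_text(cell, blocks_map):
    """Join the WORD children of a cell into a single string."""
    words = []
    for relationship in cell.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = blocks_map[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def sketch_rows_columns_map(table, blocks_map):
    """Map row index -> column index -> cell text for one TABLE block."""
    rows = {}
    for relationship in table.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                cell = blocks_map[child_id]
                if cell["BlockType"] == "CELL":
                    row = rows.setdefault(cell["RowIndex"], {})
                    row[cell["ColumnIndex"]] = sketch_get_text(cell, blocks_map)
    return rows


# Hand-written blocks shaped like a Textract response (hypothetical data).
blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Diesel"},
    {"Id": "w2", "BlockType": "WORD", "Text": "52.10"},
    {"Id": "w3", "BlockType": "WORD", "Text": "55.25"},
]
blocks_map = {block["Id"]: block for block in blocks}

rows = sketch_rows_columns_map(blocks_map["t1"], blocks_map)
print(rows)
```

A nested row-to-column mapping like this is straightforward to serialise as one CSV line per row.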

doeextractor.token_types module

doeextractor.token_types.feature_is_city(text)

doeextractor.token_types.feature_is_price(text)

Module contents

Top-level package for doeextractor.