doeextractor package
Subpackages
Submodules
doeextractor.analyser module
doeextractor.analyser.analyse(results: List[doeextractor.models.fuel_line_price.FuelLinePriceItem])
    Get the mean, median, and mode of prices from all companies.
doeextractor.analyser.mean(items)
doeextractor.analyser.median(items)
doeextractor.analyser.mode(items)
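The three statistics named above can be illustrated with a minimal sketch built on Python's standard `statistics` module. The helper name `summarise_prices` and the use of plain floats are assumptions made for illustration; the fields of `FuelLinePriceItem` and the actual implementation of `analyse` are not shown in this reference.

```python
import statistics
from typing import Dict, List


def summarise_prices(prices: List[float]) -> Dict[str, float]:
    """Hypothetical helper: mean, median, and mode of a list of prices."""
    if not prices:
        # The statistics functions raise StatisticsError on empty input;
        # failing early gives a clearer message.
        raise ValueError("prices must not be empty")
    return {
        "mean": statistics.mean(prices),
        "median": statistics.median(prices),
        "mode": statistics.mode(prices),
    }


# Sample prices are made up for the example.
summary = summarise_prices([50.0, 52.0, 52.0, 58.0])
print(summary)
```

The mode is only meaningful when prices repeat exactly, so a real analysis might round or bucket prices first.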
doeextractor.cli module
Console script for doeextractor.
doeextractor.constants module
doeextractor.exceptions module
exception doeextractor.exceptions.NotFoundException
    Bases: Exception
doeextractor.file_helpers module
doeextractor.file_helpers.add_file_to_local_cache(file_path: pathlib.Path, output_file_path: pathlib.Path) → str
    Add a file to the local cache.
doeextractor.file_helpers.convert_pdf_to_png(file_path, merge_pages=False) → pathlib.Path
    Convert a PDF file to PNG.
doeextractor.file_helpers.get_checksum(file_path)
    Calculate the checksum of a file.
doeextractor.file_helpers.is_file_already_analyzed(file_path)
    Check whether a file has already been analyzed.
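A checksum-based "already analyzed" check can be sketched as follows. This is an illustration only: the hash algorithm (SHA-256), the in-memory set of known checksums, and the helper names `sha256_checksum` and `seen_before` are all assumptions, since this reference does not say how `get_checksum` or `is_file_already_analyzed` are implemented.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_checksum(file_path: Path, chunk_size: int = 65536) -> str:
    """Hypothetical helper: hash a file in chunks so large PDFs are not read whole."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def seen_before(file_path: Path, known_checksums: set) -> bool:
    """Hypothetical helper: True if the file's checksum was recorded earlier."""
    return sha256_checksum(file_path) in known_checksums


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "report.pdf"
    sample.write_bytes(b"not a real PDF, just sample bytes")

    known = set()
    first_check = seen_before(sample, known)   # nothing recorded yet
    known.add(sha256_checksum(sample))         # record the file as analyzed
    second_check = seen_before(sample, known)  # same content is now recognised
    sample_checksum = sha256_checksum(sample)
```

Keying the check on file content rather than file name means a renamed copy of the same report is still recognised.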
doeextractor.parser module
class doeextractor.parser.Token(value, token_type: str)
    Bases: object
    is_overall_range_header() → bool
doeextractor.parser.classify(text)
    Classify text, simplified for DOE reports.
doeextractor.parser.parse(input_file_path: str, output_file_path: str = None)
    Simple parser for extracted tables.
doeextractor.parser.tokenize(text_data: list)
doeextractor.tabula module
doeextractor.tabula.debug_info(silent=False)
doeextractor.tabula.extract(*args, **kwargs)
    Extract data from PDFs using tabula.
doeextractor.textract_parser module
doeextractor.textract_parser.classify(text)
    Classify text, simplified for DOE reports.
doeextractor.textract_parser.clean(input_file_path: pathlib.Path, overwrite=False)
    Clean extracted tables.
doeextractor.textract_parser.parse(input_file_path: str, output_file_path: str = None, clean_input=True)
    Simple parser for extracted tables.
doeextractor.textract_parser.tokenize_input(input_file_path: str)
    Tokenize the input file.
doeextractor.textractor module
doeextractor.textractor.extract(input_file_path: Union[str, pathlib.Path])
    Extract tables from a PDF file using Amazon Textract.
doeextractor.textractor.generate_table_csv(table_result, blocks_map, table_index)
doeextractor.textractor.get_rows_columns_map(table_result, blocks_map)
doeextractor.textractor.get_table_csv_results(input_file: pathlib.Path)
doeextractor.textractor.get_text(result, blocks_map)
doeextractor.token_types module
doeextractor.token_types.feature_is_city(text)
doeextractor.token_types.feature_is_price(text)
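The feature functions above are listed without descriptions, so their behaviour is not documented here. As a rough sketch of what a price predicate over table cell text could look like, the example below uses a regular expression. The function name `looks_like_price` and the accepted format (digits with an optional one- or two-digit decimal part) are assumptions, not the behaviour of `feature_is_price`.

```python
import re

# Assumed price format: an integer part and an optional 1-2 digit decimal part.
PRICE_PATTERN = re.compile(r"^\d+(?:\.\d{1,2})?$")


def looks_like_price(text: str) -> bool:
    """Hypothetical predicate: True if the cell text resembles a price."""
    return bool(PRICE_PATTERN.match(text.strip()))


# Sample cell values are made up for the example.
cells = ["52.50", " 48 ", "Quezon City", "52.5.0"]
flags = [looks_like_price(cell) for cell in cells]
print(flags)
```

A predicate like this could feed a classifier that labels each extracted cell with a token type before parsing.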
Module contents
Top-level package for doeextractor.