bolster ======= .. py:module:: bolster .. autoapi-nested-parse:: Bolster - A personal collection of Python utilities and data sources. A grab bag of handy functions for working with Northern Ireland data, basic stats operations, and general data science tasks. Built for personal projects and exploration. What's in here: - data_sources: NI water quality, house prices, cinema listings, etc. - stats: Basic data frame operations and distribution fitting - utils: Web scraping helpers, decorators, AWS/Azure bits - cli: Command line tools for the data sources Quick examples: >>> from bolster.data_sources import ni_water >>> quality_data = ni_water.get_water_quality() >>> 'NI Hardness Classification' in quality_data.columns True >>> from bolster.stats import add_totals >>> import pandas as pd >>> df = pd.DataFrame([[1, 2], [3, 4]]) >>> add_totals(df, inplace=False) 0 1 total 0 1 2 3 1 3 4 7 total 4 6 10 Author: Andrew Bolster Submodules ---------- .. toctree:: :maxdepth: 1 /autoapi/bolster/cli/index /autoapi/bolster/data_sources/index /autoapi/bolster/exceptions/index /autoapi/bolster/stats/index /autoapi/bolster/utils/index Attributes ---------- .. autoapisummary:: bolster.__author__ bolster.__email__ bolster.__version__ bolster.logger Exceptions ---------- .. autoapisummary:: bolster.DataNotFoundError bolster.DataSourceError bolster.NetworkError bolster.ParseError bolster.ValidationError bolster.MultipleErrors Classes ------- .. autoapisummary:: bolster.memoize Functions --------- .. 
autoapisummary:: bolster.always bolster.poolmap bolster.batch bolster.chunks bolster.arg_exception_logger bolster.backoff bolster.tag_gen bolster.exceptional_executor bolster.working_directory bolster.compress_for_relay bolster.decompress_from_relay bolster.pretty_print_request bolster.get_recursively bolster.transform_ bolster.diff bolster.aggregate bolster.breadth bolster.depth bolster.set_keys bolster.keys_at bolster.items_at bolster.leaves bolster.leaf_paths bolster.flatten_dict bolster.uncollect_object bolster.dict_concat_safe bolster.build_default_mapping_dict_from_keys Package Contents ---------------- .. py:data:: __author__ :value: 'Andrew Bolster' .. py:data:: __email__ :value: 'andrew.bolster@gmail.com' .. py:data:: __version__ .. py:exception:: DataNotFoundError(message, url = None, source = None) Bases: :py:obj:`DataSourceError` Raised when expected data publications or URLs are not found. .. rubric:: Examples - Publication page returns 404 - Expected Excel file link missing from page - RSS feed returns no entries - API endpoint returns empty response :param message: Description of what data was not found :param url: Optional URL that was being accessed :param source: Optional data source identifier Initialize DataSourceError with message and optional context. .. py:attribute:: url :value: None .. py:attribute:: source :value: None .. py:method:: __str__() Return str(self). .. py:exception:: DataSourceError Bases: :py:obj:`Exception` Base class for all data source errors. This is the root exception for all domain-specific errors in Bolster. All other exceptions should inherit from this base class. Initialize self. See help(type(self)) for accurate signature. .. py:exception:: NetworkError(message, url = None, status_code = None, retry_count = None) Bases: :py:obj:`DataSourceError` Raised when network operations fail beyond retry limits. .. 
rubric:: Examples - Timeout errors after retries - Connection refused - DNS resolution failures - Server returning persistent errors (500, 503) :param message: Description of network failure :param url: Optional URL that failed :param status_code: Optional HTTP status code :param retry_count: Optional number of retries attempted Initialize NetworkError with message and optional network context. .. py:attribute:: url :value: None .. py:attribute:: status_code :value: None .. py:attribute:: retry_count :value: None .. py:method:: __str__() Return str(self). .. py:exception:: ParseError(message, file_path = None, parser_type = None) Bases: :py:obj:`DataSourceError` Raised when file or data parsing fails. .. rubric:: Examples - Malformed Excel file structure - Unexpected CSV format - HTML parsing issues - JSON decode errors :param message: Description of parsing failure :param file_path: Optional path to file that failed to parse :param parser_type: Optional type of parser (excel, csv, html, json) Initialize ParseError with message and optional parsing context. .. py:attribute:: file_path :value: None .. py:attribute:: parser_type :value: None .. py:method:: __str__() Return str(self). .. py:exception:: ValidationError(message, data_info = None, validation_type = None) Bases: :py:obj:`DataSourceError` Raised when data fails integrity validation checks. .. rubric:: Examples - Required columns missing from DataFrame - Data values outside expected ranges - Inconsistent data relationships - Empty datasets when data expected :param message: Description of validation failure :param data_info: Optional info about the problematic data :param validation_type: Optional type of validation that failed Initialize ValidationError with message and optional validation context. .. py:attribute:: data_info :value: None .. py:attribute:: validation_type :value: None .. py:method:: __str__() Return str(self). .. py:data:: logger .. 
py:function:: always(x, **kwargs) Pointless passthrough replacement for 'always true' filtering. >>> always('false') True >>> always(False) True >>> always(True) True .. py:function:: poolmap(f, iterable, max_workers = None, progress = None, **kwargs) Helper function to encapsulate a ThreadPoolExecutor mapped function workflow. Accepts a progress monitor callback (assumed to be `tqdm`-style). `kwargs` are passed identically to every `f(i)` call for each `i` in `iterable`. :param f: Function to map across the iterable :param iterable: Sequence of items to process :param max_workers: Maximum number of worker threads (Default value = None) :param progress: Progress callback function (Default value = None) :param \*\*kwargs: Passed as arguments to f :returns: Dictionary mapping from input items to their results :rtype: dict .. py:function:: batch(seq, n = 1) Split a sequence into n-length batches (returned lazily as a generator, not a list). :param seq: Sequence to split into batches :param n: Size of each batch (default: 1) :returns: Generator yielding batches of the sequence .. rubric:: Examples >>> next((b for b in batch(range(10), 2))) range(0, 2) >>> [b for b in batch(list(range(10)), 2)] [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]] .. py:function:: chunks(iterable, size=10) Yields chunks of length `size` from an iterable (generator). :param iterable: Iterable to split into chunks :param size: Number of items per chunk (Default value = 10) :returns: Generator yielding each chunk as a list >>> next((b for b in chunks(range(10), 2))) [0, 1] >>> [b for b in chunks(list(range(10)), 2)] [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]] .. py:function:: arg_exception_logger(func) Helper decorator that logs the arguments which caused a wrapped function to raise an exception. :param func: Function to wrap with exception logging :returns: Wrapped function with exception argument logging :rtype: Callable ..
py:function:: backoff(exception_to_check = BaseException, tries = 5, delay = 0.2, backoff = 2, logger = logger) Retry calling the decorated function using an exponential backoff. http://www.saltycrane.com/blog/2009/11/trying-out-retry-decorator-python/ original from: http://wiki.python.org/moin/PythonDecoratorLibrary#Retry Exceptions can't be type-annotated because it's verboten (see https://peps.python.org/pep-0484/#exceptions). :param exception_to_check: The exception to check; may be a tuple of exceptions to check (Default value = BaseException) :param tries: Number of times to try (not retry) before giving up (Default value = 5) :param delay: Initial delay between retries in seconds (Default value = 0.2) :param backoff: Backoff multiplier, e.g. a value of 2 will double the delay on each retry (Default value = 2) :param logger: Logger to use; if None, print (Default value = local utils logger) .. py:exception:: MultipleErrors(errors=None) Bases: :py:obj:`BaseException` Exception class that captures multiple exceptions without interrupting control flow, i.e. catch each exception, carry on, and report all the exceptions at the end. E.g. .. code-block:: python exceptions = MultipleErrors() try: do_risky_thing_with(this) # raises ValueError except: exceptions.capture_current_exception() try: do_other_thing_with(this) # raises AttributeError except: exceptions.capture_current_exception() exceptions.do_raise() .. code-block:: none Traceback (most recent call last): .... ValueError Traceback (most recent call last): ... AttributeError Initialize MultipleErrors with optional list of existing errors. .. py:attribute:: errors :value: [] .. py:method:: __str__() Return formatted string representation of all captured exceptions. .. py:method:: capture_current_exception() Gathers exception info from the current context and retains it. .. py:method:: do_raise() Raises itself if it contains any errors. .. py:function:: tag_gen(seq, **kwargs) Generator stream that adds kwargs to each entry yielded.
:param seq: Iterator of dictionaries to tag :param \*\*kwargs: Additional key-value pairs to add to each dictionary .. rubric:: Examples The example below creates a generator of empty dicts and uses tag_gen to insert a new key/value (k=1) into each item on the fly: >>> all([i['k'] == 1 for i in tag_gen(({} for _ in range(4)), k=1)]) True .. py:function:: exceptional_executor(futures, exception_handler=None, timeout=None) Generator for concurrent.futures handling. When an exception is raised in an executing Future, f.result() called on its own will raise that exception in the parent thread, killing execution and causing loss of 'future local' scope. Instead, query the future for its exception state first, and handle that separately, by default by logging it as an exception. :param futures: Sequence of Future objects to process :param exception_handler: Optional callable to handle exceptions from futures :param timeout: Optional timeout for waiting for futures to complete .. py:function:: working_directory(path) Context manager that changes the working directory and returns to the previous one on exit. :param path: Directory to change into :type path: Union[str, Path] .. py:function:: compress_for_relay(obj) Compress a JSON-serializable object to a gzipped base64 string. :param obj: JSON-serializable object to compress :type obj: Union[List, Dict] >>> decompress_from_relay(compress_for_relay(['test'])) ['test'] >>> decompress_from_relay(compress_for_relay({'test':'test'})) {'test': 'test'} .. py:function:: decompress_from_relay(msg) Uncompress a gzipped base64 string to a JSON-serializable object. :param msg: Gzipped base64 string to decompress :type msg: AnyStr .. py:class:: memoize(func) Cache the return value of a method. This class is meant to be used as a decorator of methods. The return value from a given method invocation will be cached on the instance whose method was invoked. All arguments passed to a method decorated with memoize must be hashable. If a memoized method is invoked directly on its class, the result will not be cached.
Instead, the method will be invoked like a static method: .. code-block:: python class Obj(object): @memoize def add_to(self, arg): return self + arg Obj.add_to(1) # not enough arguments Obj.add_to(1, 2) # returns 3, result is not cached Source: http://code.activestate.com/recipes/577452-a-memoize-decorator-for-instance-methods/ Augmented with cache hit/miss population counters. Initialize the LRU cache decorator with a function. .. py:attribute:: func .. py:method:: __get__(obj, objtype=None) .. py:method:: __call__(*args, **kw) Execute the cached function with LRU behavior and hit/miss tracking. .. py:function:: pretty_print_request(req, expose_auth=False, authentication_header_blacklist = None) Pretty-print an HTTP request. At this point the request is completely built and ready to be fired; it is "prepared". Note, however, that the output is formatted for pretty printing and may differ from the actual request. :param req: HTTP request object to pretty print :param expose_auth: Whether to expose authentication headers (Default value = False) :param authentication_header_blacklist: List of header names to redact when expose_auth is False .. py:function:: get_recursively(search_dict, field) Takes a dict with nested lists and dicts, and searches all dicts for a key matching the field provided. Originally taken from https://stackoverflow.com/a/20254842 :param search_dict: Dictionary to search, possibly containing nested lists and dicts :type search_dict: Dict :param field: Key to search for :type field: str :returns: List of the values found for `field` >>> get_recursively({'id' : 5,'children' : {'id' : 6,'children' : {'id' : 7,'children' : {}}}}, 'id') [5, 6, 7] .. py:function:: transform_(r, rule_keys) Generic item-wise transformation function. The values in `r` are updated based on key-matching in `rule_keys`, i.e. `out[k] = rule_keys[k](r[k])`.
HOWEVER, this can do more than a straight callable mapping; it can *also* update the key. For a given rule `R = rule_keys[k]`, an `R` of `None` simply selects that field for the output >>> r = {'a':'1','b':'2','c':'3'} >>> transform_(r, {'a':None}) {'a': '1'} Rename a key >>> transform_(r, {'a':('A',None)}) {'A': '1'} Apply a function to a key's value >>> transform_(r, {'a':('a',int)}) {'a': 1} Or a combination of these >>> transform_(r, {'a':('A',int), 'b':None}) {'A': 1, 'b': '2'} .. py:function:: diff(new, old, excluded_fields = None) Perform a one-depth diff of a pair of dictionaries. #TODO diff needs tests .. py:function:: aggregate(base, group_key, item_key, condition = None) Abstracted groupby-sum for lists of dicts. Operationally equivalent to: ``` df = pd.DataFrame(base) df.where(condition).groupby(group_key)[item_key].sum() ``` # TODO aggregate needs tests :param base: List of dictionaries to group and sum :param group_key: Key(s) to group by - can be string, tuple, or list of strings :param item_key: Key to sum values for within each group :param condition: Optional function to filter records before grouping .. py:function:: breadth(d) Get the total 'width' of a tree. > Why was this a thing? No idea .. py:function:: depth(d) Get the maximum depth of a tree. .. py:function:: set_keys(d) Extract the set of all keys of a nested dict/tree. .. py:function:: keys_at(d, n, i = 0) Extract the keys of a tree at a given depth. .. py:function:: items_at(d, n, i = 0) Extract the elements from a tree at a given depth. .. py:function:: leaves(d) Iterate on the leaves of a tree. .. py:function:: leaf_paths(d, path = None) Get all leaf paths in a nested dictionary structure. .. py:function:: flatten_dict(d, head = '', sep = ':') Flatten a nested dictionary using separator for key names. .. py:function:: uncollect_object(d) Convert flat dictionary back to nested structure using path tuples. ..
py:function:: dict_concat_safe(d, keys, default = None) Really Lazy Func because `dict.get('key',default)` is a pain in the ass for lists. .. py:function:: build_default_mapping_dict_from_keys(keys) Constructs a mapping dictionary from (presumably) snake_case keys to 'human-readable' title case. Intended for easy construction of presentable graphs/tables etc. >>> build_default_mapping_dict_from_keys(['a_b','b_c','c_d']) {'a_b': 'A B', 'b_c': 'B C', 'c_d': 'C D'}
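
The snake_case to title-case mapping described for ``build_default_mapping_dict_from_keys`` can be illustrated with a minimal, standalone sketch. This is not the library's implementation: ``title_case_mapping`` is a hypothetical name introduced here, and the sketch assumes that keys are split on underscores and each word is capitalised, which matches the doctest above but may differ from Bolster's behaviour on other inputs.

```python
from typing import Dict, Iterable


def title_case_mapping(keys: Iterable[str]) -> Dict[str, str]:
    """Hypothetical stand-in: map snake_case keys to title-case labels."""
    # Assumption: underscores separate words, and each word is capitalised.
    return {key: key.replace("_", " ").title() for key in keys}


# Build display labels for a few column names.
labels = title_case_mapping(["a_b", "b_c", "c_d"])
print(labels)
```

A mapping like this could then be handed to something like ``DataFrame.rename(columns=labels)`` to relabel columns before plotting or tabulating.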