bolster.utils.web
=================

.. py:module:: bolster.utils.web

.. autoapi-nested-parse::

   HTTP session utilities with retry and rate-limit handling.

   Provides a pre-configured :class:`requests.Session` (``session``) with:

   - Automatic retry on transient server errors (500/502/503/504)
   - Rate-limit awareness: exponential backoff on 429 responses, capped at 60 s
   - A consistent ``User-Agent`` header for polite scraping
   - Helpers for downloading Excel files and ZIP archives in memory

   All data-source modules should import ``session`` from here rather than
   calling :func:`requests.get` directly, so that retry logic is applied
   uniformly.

   .. rubric:: Example

   >>> from bolster.utils.web import session
   >>> type(session).__name__
   'CachingSession'


Attributes
----------

.. autoapisummary::

   bolster.utils.web.logger
   bolster.utils.web.ua
   bolster.utils.web.session


Classes
-------

.. autoapisummary::

   bolster.utils.web.RateLimitAwareRetry
   bolster.utils.web.CachingSession


Functions
---------

.. autoapisummary::

   bolster.utils.web.get_last_valid
   bolster.utils.web.resilient_get
   bolster.utils.web.get_excel_dataframe
   bolster.utils.web.download_extract_zip


Module Contents
---------------

.. py:data:: logger

.. py:data:: ua

.. py:class:: RateLimitAwareRetry(total = 10, connect = None, read = None, redirect = None, status = None, other = None, allowed_methods = DEFAULT_ALLOWED_METHODS, status_forcelist = None, backoff_factor = 0, backoff_max = DEFAULT_BACKOFF_MAX, raise_on_redirect = True, raise_on_status = True, history = None, respect_retry_after_header = True, remove_headers_on_redirect = DEFAULT_REMOVE_HEADERS_ON_REDIRECT, backoff_jitter = 0.0, retry_after_max = DEFAULT_RETRY_AFTER_MAX)

   Bases: :py:obj:`urllib3.util.retry.Retry`

   Retry strategy that logs HTTP errors and connection failures for diagnosis.

   .. py:method:: increment(method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None)

      Override ``increment`` to track the last response status.


.. py:class:: CachingSession

   Bases: :py:obj:`requests.Session`

   A ``requests.Session`` that caches GET responses to disk with a TTL.

   Only responses whose ``Content-Type`` starts with ``text/`` (HTML, plain
   text) are cached. Binary downloads (Excel, ZIP, etc.) bypass the cache and
   should go through ``CachedDownloader`` instead.

   The cache lives in ``~/.cache/bolster/_pages/``, keyed by URL hash. The TTL
   is controlled by ``_PAGE_CACHE_TTL_SECONDS`` (default: 1 hour).

   .. rubric:: Example

   >>> from bolster.utils.web import session
   >>> type(session).__name__
   'CachingSession'

   .. py:method:: get(url, **kwargs)

      Sends a GET request. Returns :class:`Response` object.

      :param url: URL for the new :class:`Request` object.
      :param \*\*kwargs: Optional arguments that ``request`` takes.
      :rtype: requests.Response


.. py:data:: session

.. py:function:: get_last_valid(url)

   Get the last valid URL from the Wayback Machine.


.. py:function:: resilient_get(url, **kwargs)

   Attempt a GET request. If it fails, look up the last valid snapshot in the
   Wayback Machine and fetch that instead. If that also fails, raise an
   ``HTTPError`` from the inner ``NoCDXRecordFound`` exception.


.. py:function:: get_excel_dataframe(file_url, requests_kwargs = None, read_kwargs = None)

   Download an Excel file and read it into a pandas DataFrame.


.. py:function:: download_extract_zip(url)

   Download a ZIP file and extract its contents in memory.

   Yields ``(filename, file-like object)`` pairs. Shows a tqdm progress bar
   during download, sized to ``Content-Length`` when the server provides it.
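
The in-memory extraction described for ``download_extract_zip`` can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: ``iter_zip_members`` and ``build_demo_zip`` are hypothetical names, the archive is built locally instead of being downloaded, and the tqdm progress bar is left out.

.. code-block:: python

   import io
   import zipfile
   from typing import IO, Iterator, Tuple


   def iter_zip_members(payload: bytes) -> Iterator[Tuple[str, IO[bytes]]]:
       """Yield (filename, file-like object) pairs from ZIP bytes held in memory.

       Hypothetical helper: assumes the whole archive is already in ``payload``.
       """
       with zipfile.ZipFile(io.BytesIO(payload)) as archive:
           for info in archive.infolist():
               if info.is_dir():
                   continue  # directory entries carry no data
               # Copy each member into its own buffer so it stays readable
               # after the archive has been closed.
               yield info.filename, io.BytesIO(archive.read(info))


   def build_demo_zip() -> bytes:
       """Build a small archive in memory so the sketch needs no network."""
       buffer = io.BytesIO()
       with zipfile.ZipFile(buffer, "w") as archive:
           archive.writestr("a.csv", "x,y\n1,2\n")
           archive.writestr("notes/readme.txt", "hello")
       return buffer.getvalue()


   # Collect every member's bytes, keyed by its path inside the archive.
   members = {
       name: handle.read()
       for name, handle in iter_zip_members(build_demo_zip())
   }

Copying each member into its own ``io.BytesIO`` trades memory for simplicity: callers can keep the yielded objects after the generator finishes, which would not hold for handles tied to the open archive.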