bolster.utils.web
HTTP session utilities with retry and rate-limit handling.
Provides a pre-configured requests.Session (session) with:
Automatic retry on transient server errors (500/502/503/504)
Rate-limit awareness: exponential backoff on 429 responses, capped at 60 s
A consistent User-Agent header for polite scraping
Helpers for downloading Excel files and ZIP archives in memory
All data-source modules should import session from here rather than
calling requests.get() directly, so that retry logic is applied
uniformly.
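The capped exponential backoff listed above can be sketched in a few lines. This is an illustration only, not the module's implementation: the function name backoff_delay, the 1-second base factor, and the doubling rule are assumptions, and only the 60 s cap comes from the description.

```python
def backoff_delay(attempt: int, factor: float = 1.0, cap: float = 60.0) -> float:
    """Return the wait in seconds before retry number `attempt` (1-based).

    Hypothetical helper: `factor` and the doubling rule are assumptions;
    only the 60-second cap is taken from the module description.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the delay on each successive attempt, never exceeding the cap.
    return min(cap, factor * 2 ** (attempt - 1))


# Delays for the first eight retries; the cap applies from the seventh onward.
delays = [backoff_delay(n) for n in range(1, 9)]
```

Capping the delay keeps a long run of 429 responses from stalling a job for minutes at a time while still easing off the server quickly.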
Example
>>> from bolster.utils.web import session
>>> type(session).__name__
'Session'
Attributes
Classes
RateLimitAwareRetry | Retry strategy that logs HTTP errors and connection failures for diagnosis.
CachingSession | requests.Session that caches GET responses to disk with a TTL.
Functions
Get the last valid URL from the Wayback Machine.
Attempt a GET; if it fails, use the Wayback Machine to find the last valid version and fetch that.
Download and read an Excel file into a pandas DataFrame.
Download a ZIP file and extract its contents in memory.
Module Contents
- class bolster.utils.web.RateLimitAwareRetry(total=10, connect=None, read=None, redirect=None, status=None, other=None, allowed_methods=DEFAULT_ALLOWED_METHODS, status_forcelist=None, backoff_factor=0, backoff_max=DEFAULT_BACKOFF_MAX, raise_on_redirect=True, raise_on_status=True, history=None, respect_retry_after_header=True, remove_headers_on_redirect=DEFAULT_REMOVE_HEADERS_ON_REDIRECT, backoff_jitter=0.0, retry_after_max=DEFAULT_RETRY_AFTER_MAX)
Bases: urllib3.util.retry.Retry
Retry strategy that logs HTTP errors and connection failures for diagnosis.
- class bolster.utils.web.CachingSession
Bases: requests.Session
requests.Session that caches GET responses to disk with a TTL.
Only responses whose Content-Type starts with “text/” (HTML, plain text) are cached; binary downloads (Excel, ZIP, etc.) bypass the cache and should go through CachedDownloader instead.
Cache lives in ~/.cache/bolster/_pages/ keyed by URL hash. TTL is controlled by _PAGE_CACHE_TTL_SECONDS (default: 1 hour).
Example
>>> from bolster.utils.web import session
>>> type(session).__name__
'CachingSession'
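The caching rules described for CachingSession (text-only responses, files keyed by URL hash, expiry after a TTL) can be sketched with the standard library alone. This is a rough illustration, not the class's actual code: the helper names is_cacheable, cache_path_for, and is_fresh are hypothetical, and the choice of SHA-256 and of file modification time as the freshness clock are assumptions.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional


def is_cacheable(content_type: str) -> bool:
    """Only text responses (HTML, plain text) are eligible for the page cache."""
    return content_type.lower().startswith("text/")


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map a URL to a file inside cache_dir, keyed by a hash of the URL.

    Assumption: SHA-256 is used here; the source only says "URL hash".
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / digest


def is_fresh(path: Path, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the cached file exists and is younger than the TTL.

    Assumption: age is measured from the file's modification time.
    """
    if not path.exists():
        return False
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) < ttl_seconds
```

Hashing the URL gives a fixed-length, filesystem-safe file name regardless of query strings or special characters in the original address.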
- bolster.utils.web.resilient_get(url, **kwargs)
Attempt a GET; if it fails, use the Wayback Machine to find the last valid version and fetch that.
If all else fails, raise an HTTPError from the inner “NoCDXRecordFound” exception.
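The fallback-and-chain behaviour described for resilient_get can be sketched without any network access by passing the fetch and lookup steps in as callables. Everything named here is a hypothetical stand-in: FetchError plays the role of HTTPError, NoSnapshotFound plays the role of “NoCDXRecordFound”, and get_with_fallback is not the library's function.

```python
from typing import Callable


class FetchError(Exception):
    """Hypothetical stand-in for requests' HTTPError."""


class NoSnapshotFound(Exception):
    """Hypothetical stand-in for the "NoCDXRecordFound" exception."""


def get_with_fallback(
    url: str,
    fetch: Callable[[str], str],
    find_archived_url: Callable[[str], str],
) -> str:
    """Try the live URL first, then fall back to an archived copy."""
    try:
        return fetch(url)
    except FetchError:
        pass  # The live fetch failed; fall through to the archive lookup.

    try:
        archived_url = find_archived_url(url)
    except NoSnapshotFound as exc:
        # Chain the errors so the traceback shows why the fallback failed.
        raise FetchError(f"no live or archived copy of {url}") from exc

    return fetch(archived_url)
```

Using `raise ... from exc` preserves the inner exception as `__cause__`, so a caller can catch the single outer error type while still seeing the archive failure in the traceback.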