bolster.utils.web

HTTP session utilities with retry and rate-limit handling.

Provides a pre-configured requests.Session (session) with:

  • Automatic retry on transient server errors (500/502/503/504)

  • Rate-limit awareness: exponential backoff on 429 responses, capped at 60 s

  • A consistent User-Agent header for polite scraping

  • Helpers for downloading Excel files and ZIP archives in memory

All data-source modules should import session from here rather than calling requests.get() directly, so that retry logic is applied uniformly.
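The capped backoff listed above can be illustrated with a small stand-alone sketch. The helper name `backoff_delays` and the 1-second base delay are assumptions made for illustration; only the 60 s cap comes from the description, and the schedule the module actually uses may differ.

```python
def backoff_delays(attempts, base=1.0, cap=60.0):
    """Return the wait in seconds before each retry attempt.

    Hypothetical helper: the delay doubles on every attempt and is
    clamped at `cap`, mirroring the 60 s ceiling described above.
    """
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]


delays = backoff_delays(8)
print(delays)
```

With these assumed values the delay stops growing once the doubled value would exceed the cap, so a long run of 429 responses never waits more than a minute between attempts.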

Example

>>> import requests
>>> from bolster.utils.web import session
>>> isinstance(session, requests.Session)
True

Attributes

logger

ua

session

Classes

RateLimitAwareRetry

Retry strategy that logs HTTP errors and connection failures for diagnosis.

CachingSession

requests.Session that caches GET responses to disk with a TTL.

Functions

get_last_valid(url)

Get the last valid URL from the Wayback Machine.

resilient_get(url, **kwargs)

Attempt a GET; if it fails, look up the last valid version in the Wayback Machine and fetch that instead.

get_excel_dataframe(file_url[, requests_kwargs, ...])

Download an Excel file and read it into a pandas DataFrame.

download_extract_zip(url)

Download a ZIP file and extract its contents in memory.

Module Contents

bolster.utils.web.logger[source]

bolster.utils.web.ua[source]

class bolster.utils.web.RateLimitAwareRetry(total=10, connect=None, read=None, redirect=None, status=None, other=None, allowed_methods=DEFAULT_ALLOWED_METHODS, status_forcelist=None, backoff_factor=0, backoff_max=DEFAULT_BACKOFF_MAX, raise_on_redirect=True, raise_on_status=True, history=None, respect_retry_after_header=True, remove_headers_on_redirect=DEFAULT_REMOVE_HEADERS_ON_REDIRECT, backoff_jitter=0.0, retry_after_max=DEFAULT_RETRY_AFTER_MAX)[source]

Bases: urllib3.util.retry.Retry

Retry strategy that logs HTTP errors and connection failures for diagnosis.

increment(method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None)[source]

Override increment to track the last response status.
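As a rough illustration of this pattern, the sketch below subclasses urllib3's `Retry` and overrides `increment` to log each failure and remember the most recent status. The class name `LoggingRetry`, the `last_status` attribute, and the log messages are assumptions for illustration and are not taken from the module's source.

```python
import logging

from urllib3.util.retry import Retry

logger = logging.getLogger("retry_sketch")  # hypothetical logger name


class LoggingRetry(Retry):
    """Hypothetical stand-in for a retry strategy that logs failures."""

    last_status = None  # class-level default until a response is seen

    def increment(self, method=None, url=None, response=None, error=None,
                  _pool=None, _stacktrace=None):
        status = getattr(response, "status", None)
        if status is not None:
            logger.warning("HTTP %s from %s %s", status, method, url)
        elif error is not None:
            logger.warning("Request error on %s %s: %r", method, url, error)
        # Retry objects are immutable in practice: the parent returns a
        # fresh instance (or raises MaxRetryError once retries run out).
        new_retry = super().increment(
            method=method, url=url, response=response, error=error,
            _pool=_pool, _stacktrace=_stacktrace,
        )
        # Carry the status forward, since the new instance starts from
        # the class default.
        new_retry.last_status = status if status is not None else self.last_status
        return new_retry
```

Because `increment` returns a new object each time, any state worth keeping has to be copied onto the returned instance, as the last two lines of the method do.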

class bolster.utils.web.CachingSession[source]

Bases: requests.Session

requests.Session that caches GET responses to disk with a TTL.

Only caches responses whose Content-Type starts with “text/” (HTML, plain text) — binary downloads (Excel, ZIP, etc.) bypass the cache and should go through CachedDownloader instead.

Cache lives in ~/.cache/bolster/_pages/ keyed by URL hash. TTL is controlled by _PAGE_CACHE_TTL_SECONDS (default: 1 hour).
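The caching rules above (text-only responses, files keyed by URL hash, expiry after a TTL) can be sketched with the standard library alone. Every name here (`is_cacheable`, `cache_path`, `write_cached`, `read_cached`, `TTL_SECONDS`) is hypothetical, and the choice of SHA-256 and of file modification time as the freshness clock are assumptions, not details confirmed by the module.

```python
import hashlib
import tempfile
import time
from pathlib import Path

TTL_SECONDS = 3600  # assumed to match the documented 1-hour default


def is_cacheable(content_type):
    """Only text responses (HTML, plain text) are eligible for caching."""
    return (content_type or "").lower().startswith("text/")


def cache_path(cache_dir, url):
    """Map a URL to a file named after its hash (SHA-256 is an assumption)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / digest


def write_cached(cache_dir, url, text):
    """Store the response text and return the path it was written to."""
    path = cache_path(cache_dir, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def read_cached(cache_dir, url, ttl=TTL_SECONDS, now=None):
    """Return cached text, or None if the entry is missing or older than ttl."""
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > ttl:
        return None
    return path.read_text(encoding="utf-8")


# Usage with a throwaway directory rather than ~/.cache/bolster/_pages/
demo_dir = tempfile.mkdtemp()
demo_url = "https://example.com/page"
stored = write_cached(demo_dir, demo_url, "<html>hello</html>")
fresh = read_cached(demo_dir, demo_url)
stale = read_cached(demo_dir, demo_url, now=stored.stat().st_mtime + 2 * TTL_SECONDS)
```

Passing `now` explicitly makes the expiry check easy to exercise without waiting for the TTL to elapse.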

Example

>>> from bolster.utils.web import session
>>> type(session).__name__
'CachingSession'

get(url, **kwargs)[source]

Sends a GET request. Returns Response object.

Parameters:
  • url – URL for the new Request object.

  • **kwargs – Optional arguments that request takes.

Return type:

requests.Response

bolster.utils.web.session[source]

bolster.utils.web.get_last_valid(url)[source]

Get the last valid URL from the Wayback Machine.

bolster.utils.web.resilient_get(url, **kwargs)[source]

Attempt a GET; if it fails, look up the last valid version in the Wayback Machine and fetch that instead.

If all else fails, raise an HTTPError from the inner “NoCDXRecordFound” exception.
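The fallback flow described here can be sketched independently of any HTTP library by passing the fetch and lookup steps in as callables. The names `fetch_with_fallback`, `SnapshotNotFound`, and `FetchError` are hypothetical stand-ins for the real function and exception types, and catching `OSError` for the live request is an assumption about which failures trigger the fallback.

```python
class SnapshotNotFound(LookupError):
    """Hypothetical stand-in for the archive's 'no record found' error."""


class FetchError(RuntimeError):
    """Hypothetical stand-in for the HTTP error raised to the caller."""


def fetch_with_fallback(url, fetch, find_snapshot):
    """Try the live URL first, then the archived snapshot of it."""
    try:
        return fetch(url)
    except OSError as live_error:
        try:
            snapshot_url = find_snapshot(url)
        except SnapshotNotFound as missing:
            # Chain the archive failure so the traceback shows both problems.
            raise FetchError(
                f"{url} failed ({live_error!r}) and no archived copy exists"
            ) from missing
        return fetch(snapshot_url)


# Usage with fake callables so the sketch runs without network access.
def fake_fetch(url):
    if url.startswith("https://live.example"):
        raise ConnectionError("site is down")
    return f"content of {url}"


def fake_find_snapshot(url):
    return "https://archive.example/snapshot"


result = fetch_with_fallback("https://live.example/data", fake_fetch, fake_find_snapshot)
```

Raising with `from` keeps the archive lookup failure attached as `__cause__`, which matches the chained-exception behaviour the description mentions.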

bolster.utils.web.get_excel_dataframe(file_url, requests_kwargs=None, read_kwargs=None)[source]

Download an Excel file and read it into a pandas DataFrame.

bolster.utils.web.download_extract_zip(url)[source]

Download a ZIP file and extract its contents in memory.

Yields (filename, file-like object) pairs. While downloading, shows a tqdm progress bar sized from Content-Length when the server provides it.
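The in-memory extraction step can be sketched with the standard library as follows. The generator name `iter_zip_members` is hypothetical, and the download and progress-bar parts are left out: the sketch starts from bytes that are assumed to have been fetched already.

```python
import io
import zipfile


def iter_zip_members(payload):
    """Yield (filename, file-like object) for each file in a ZIP held in memory."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            yield info.filename, io.BytesIO(archive.read(info))


# Build a small archive in memory to stand in for a downloaded response body.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.csv", "x,y\n1,2\n")
    archive.writestr("notes/readme.txt", "hello")

members = {name: handle.read() for name, handle in iter_zip_members(buffer.getvalue())}
print(sorted(members))
```

Nothing is written to disk at any point, so the caller can hand each file-like object straight to a parser.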