bolster.utils.web
=================

.. py:module:: bolster.utils.web

.. autoapi-nested-parse::

   HTTP session utilities with retry and rate-limit handling.

   Provides a pre-configured :class:`requests.Session` (``session``) with:

   - Automatic retry on transient server errors (500/502/503/504)
   - Rate-limit awareness: exponential backoff on 429 responses, capped at 60 s
   - A consistent ``User-Agent`` header for polite scraping
   - Helpers for downloading Excel files and ZIP archives in memory

   All data-source modules should import ``session`` from here rather than
   calling :func:`requests.get` directly, so that retry logic is applied
   uniformly.
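
   The capped exponential backoff mentioned above can be sketched in plain
   Python. This is only an illustration, not the module's own code:
   ``backoff_delays`` and the ``factor`` default are hypothetical, and only
   the 60 s cap is taken from the description above.

   .. code-block:: python

      def backoff_delays(attempts, factor=1.0, cap=60.0):
          """Return the wait in seconds before each retry attempt.

          The delay doubles on every attempt and is clamped to ``cap``.
          """
          return [min(cap, factor * 2 ** n) for n in range(attempts)]


      # With the assumed factor of 1.0 the schedule flattens out at the cap.
      delays = backoff_delays(8)

   Clamping the delay keeps a long run of 429 responses from stalling a
   caller for minutes at a time.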

   .. rubric:: Example

   >>> from bolster.utils.web import session
   >>> type(session).__name__
   'CachingSession'


Attributes
----------

.. autoapisummary::

   bolster.utils.web.logger
   bolster.utils.web.ua
   bolster.utils.web.session


Classes
-------

.. autoapisummary::

   bolster.utils.web.RateLimitAwareRetry
   bolster.utils.web.CachingSession


Functions
---------

.. autoapisummary::

   bolster.utils.web.get_last_valid
   bolster.utils.web.resilient_get
   bolster.utils.web.get_excel_dataframe
   bolster.utils.web.download_extract_zip


Module Contents
---------------

.. py:data:: logger

.. py:data:: ua

.. py:class:: RateLimitAwareRetry(total = 10, connect = None, read = None, redirect = None, status = None, other = None, allowed_methods = DEFAULT_ALLOWED_METHODS, status_forcelist = None, backoff_factor = 0, backoff_max = DEFAULT_BACKOFF_MAX, raise_on_redirect = True, raise_on_status = True, history = None, respect_retry_after_header = True, remove_headers_on_redirect = DEFAULT_REMOVE_HEADERS_ON_REDIRECT, backoff_jitter = 0.0, retry_after_max = DEFAULT_RETRY_AFTER_MAX)

   Bases: :py:obj:`urllib3.util.retry.Retry`


   Retry strategy that logs HTTP errors and connection failures for diagnosis.


   .. py:method:: increment(method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None)

      Override increment to track the last response status.
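
      A subclass along the following lines could log each failure before
      delegating to urllib3. This is a sketch under assumptions, not the
      actual implementation: ``LoggingRetry`` and its log messages are
      hypothetical, and it only logs rather than storing the last status.
      The ``increment`` signature is the one shown above.

      .. code-block:: python

         import logging

         from urllib3.util.retry import Retry

         logger = logging.getLogger(__name__)


         class LoggingRetry(Retry):
             """Hypothetical Retry subclass that logs why a retry happened."""

             def increment(self, method=None, url=None, response=None,
                           error=None, _pool=None, _stacktrace=None):
                 if response is not None:
                     # The server answered, but with a retryable status.
                     logger.warning("HTTP %s for %s %s", response.status, method, url)
                 elif error is not None:
                     # No usable response: connection or read failure.
                     logger.warning("Request error for %s %s: %r", method, url, error)
                 # Let urllib3 do the bookkeeping and return the next Retry object.
                 return super().increment(
                     method=method, url=url, response=response, error=error,
                     _pool=_pool, _stacktrace=_stacktrace,
                 )


         retry = LoggingRetry(total=3, backoff_factor=1,
                              status_forcelist=[429, 500, 502, 503, 504])

      Because ``Retry`` objects are immutable and ``increment`` returns a new
      instance, the override delegates to ``super()`` rather than mutating
      ``self``.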


.. py:class:: CachingSession

   Bases: :py:obj:`requests.Session`


   A :class:`requests.Session` subclass that caches GET responses to disk with a TTL.

   Only responses whose ``Content-Type`` starts with ``text/`` (HTML, plain
   text) are cached. Binary downloads (Excel, ZIP, etc.) bypass the cache and
   should go through ``CachedDownloader`` instead.

   The cache lives in ``~/.cache/bolster/_pages/``, keyed by URL hash.
   The TTL is controlled by ``_PAGE_CACHE_TTL_SECONDS`` (default: 1 hour).
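
   The caching rules above can be sketched with the standard library alone.
   Everything named here is an assumption for illustration: the helper
   functions are hypothetical, SHA-256 is just one plausible choice of URL
   hash, and ``TTL_SECONDS`` stands in for ``_PAGE_CACHE_TTL_SECONDS``.

   .. code-block:: python

      import hashlib
      import time
      from pathlib import Path

      # Illustrative values mirroring the description above.
      CACHE_DIR = Path.home() / ".cache" / "bolster" / "_pages"
      TTL_SECONDS = 3600  # 1 hour


      def cache_path(url, cache_dir=CACHE_DIR):
          """Map a URL to a cache file named after its SHA-256 hex digest."""
          digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
          return Path(cache_dir) / digest


      def is_cacheable(content_type):
          """Only text responses (HTML, plain text) are eligible for caching."""
          return (content_type or "").lower().startswith("text/")


      def read_fresh(path, ttl=TTL_SECONDS, now=None):
          """Return the cached text if it is younger than ``ttl``, else None."""
          now = time.time() if now is None else now
          if not path.exists() or now - path.stat().st_mtime > ttl:
              return None
          return path.read_text(encoding="utf-8")

   Using the file's modification time as the cache timestamp avoids a
   separate metadata store; an expired entry is simply treated as a miss.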

   .. rubric:: Example

   >>> from bolster.utils.web import session
   >>> type(session).__name__
   'CachingSession'


   .. py:method:: get(url, **kwargs)

      Send a GET request and return a :class:`requests.Response` object.

      :param url: URL for the new :class:`Request` object.
      :param \*\*kwargs: Optional arguments that ``request`` takes.
      :rtype: requests.Response


.. py:data:: session

.. py:function:: get_last_valid(url)

   Get the last valid URL for ``url`` from the Wayback Machine.


.. py:function:: resilient_get(url, **kwargs)

   Attempt a GET request; if it fails, look up the last valid version of the page in the Wayback Machine and fetch that instead.

   If all else fails, raise an ``HTTPError`` chained from the inner ``NoCDXRecordFound`` exception.
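
   The fallback pattern described above can be sketched with injected
   callables, so it runs without network access. All names here are
   hypothetical stand-ins: ``FetchError`` plays the role of ``HTTPError`` and
   ``ArchiveLookupError`` the role of ``NoCDXRecordFound``. The real function
   may differ in which errors it catches.

   .. code-block:: python

      class ArchiveLookupError(Exception):
          """Stand-in for the archive client's 'no record found' error."""


      class FetchError(Exception):
          """Stand-in for the HTTP error raised to the caller."""


      def get_with_fallback(url, fetch, find_snapshot):
          """Fetch ``url``; on failure, fetch its archived snapshot instead.

          ``fetch(url)`` returns a body or raises. ``find_snapshot(url)``
          returns an archive URL or raises ArchiveLookupError.
          """
          try:
              return fetch(url)
          except Exception as primary_error:
              try:
                  snapshot_url = find_snapshot(url)
              except ArchiveLookupError as lookup_error:
                  # Chain the archive failure so the traceback shows both causes.
                  raise FetchError(
                      f"{url} failed ({primary_error}) and no archived copy exists"
                  ) from lookup_error
              return fetch(snapshot_url)

   Raising with ``from`` preserves the inner exception as ``__cause__``,
   which is what lets a caller see why the archive lookup failed.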


.. py:function:: get_excel_dataframe(file_url, requests_kwargs = None, read_kwargs = None)

   Download an Excel file and read it into a pandas DataFrame.


.. py:function:: download_extract_zip(url)

   Download a ZIP file and extract its contents in memory.

   Yields (filename, file-like object) pairs. Shows a tqdm progress bar
   during download sized to Content-Length when the server provides it.
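
   The in-memory extraction step can be sketched with the standard library.
   This is a minimal illustration that leaves out the download and the
   progress bar; ``iter_zip_members`` is a hypothetical name, and the real
   function may hand back a different kind of file-like object.

   .. code-block:: python

      import io
      import zipfile


      def iter_zip_members(data):
          """Yield (filename, file-like object) for each file in a ZIP held in memory."""
          with zipfile.ZipFile(io.BytesIO(data)) as archive:
              for info in archive.infolist():
                  if info.is_dir():
                      continue  # skip directory entries
                  yield info.filename, io.BytesIO(archive.read(info))


      # Build a tiny archive in memory to exercise the generator.
      buffer = io.BytesIO()
      with zipfile.ZipFile(buffer, "w") as archive:
          archive.writestr("a.txt", "alpha")
          archive.writestr("b.txt", "beta")

      members = {name: handle.read() for name, handle in iter_zip_members(buffer.getvalue())}

   Copying each member into its own ``BytesIO`` keeps the yielded objects
   usable after the archive is closed, at the cost of holding the extracted
   bytes in memory.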