bolster.utils.datatables

Generic utility for extracting DataTables data from HTML pages.

Many Northern Ireland government statistics pages use R’s flexdashboard and DT packages to embed DataTables widgets. Each widget stores its data as column-transposed JSON inside a <script type="application/json"> block with a {"x": {"data": [...], ...}} structure, where x["data"] is a list of column arrays (not row arrays) and x["container"] holds the HTML table header with the column names.
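To make the column-transposed layout concrete, here is a minimal sketch using only built-ins. The payload values are made up for illustration, and this is not the module's own code; it only shows how a list of column arrays maps onto rows.

```python
# Illustrative payload in the shape described above; the values are made up.
payload = {
    "data": [["Belfast", "Derry"], [1200, 800]],  # one inner list per column
}

# zip(*columns) pairs up the i-th entry of every column, giving one tuple per row.
rows = [list(row) for row in zip(*payload["data"])]

print(rows)  # [['Belfast', 1200], ['Derry', 800]]
```

Note that zip stops at the shortest column, so ragged columns would be silently truncated in this sketch.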

Example

>>> from bolster.utils.datatables import datatables_to_dataframe
>>> payload = {
...     "data": [["A", "B"], [1, 2]],
...     "container": "<table><thead><tr><th>Name</th><th>Value</th></tr></thead></table>",
... }
>>> df = datatables_to_dataframe(payload)
>>> list(df.columns)
['Name', 'Value']

Attributes

logger

Exceptions

DataTablesError

Raised when DataTables extraction fails.

Functions

fetch_datatables_json(url[, timeout])

Fetch an HTML page and extract the embedded DT widget JSON payload.

datatables_to_dataframe(payload)

Convert a DT widget payload into a row-oriented DataFrame.

get_column_headers_from_url(url[, timeout])

Fetch a DataTables page and return its column header names.

Module Contents

bolster.utils.datatables.logger

exception bolster.utils.datatables.DataTablesError

Bases: Exception

Raised when DataTables extraction fails.

bolster.utils.datatables.fetch_datatables_json(url, timeout=30)

Fetch an HTML page and extract the embedded DT widget JSON payload.

The payload is the parsed content of the largest <script type="application/json"> block whose x.data key is a column-transposed list (i.e. a list of lists).

Parameters:
  • url (str) – URL of the HTML page containing a DataTables widget.

  • timeout (int) – HTTP request timeout in seconds.

Returns:

The x sub-dict from the DT widget payload, containing at minimum "data" (list of column arrays) and "container" (HTML header).

Raises:

DataTablesError – If the page cannot be fetched or no DT payload is found.

Return type:

dict

Example

>>> from bolster.utils.datatables import DataTablesError, fetch_datatables_json
>>> try:
...     fetch_datatables_json("https://example.com/data.html")
... except DataTablesError:
...     print("DataTablesError raised for invalid page")
DataTablesError raised for invalid page

A successful call returns a dict extracted from the page’s DT widget. The shape mirrors what _extract_datatables_payload returns:

>>> from bolster.utils.datatables import _extract_datatables_payload
>>> sample_html = (
...     '<script type="application/json">'
...     '{"x": {"data": [["Belfast", "Derry"], [1200, 800]],'
...     ' "container": "<thead><tr><th>City</th><th>Count</th></tr></thead>"}}'
...     "</script>"
... )
>>> payload = _extract_datatables_payload(sample_html)
>>> sorted(payload.keys())
['container', 'data']
>>> len(payload["data"])
2
>>> payload["data"][0]
['Belfast', 'Derry']
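
The selection rule described for this function (take the largest application/json script block whose x.data is a list of lists) can be sketched with the standard library alone. This is a hedged illustration, not the library's implementation: find_dt_payload and SCRIPT_RE are hypothetical names, "largest" is assumed to mean the longest raw script text, and a regular expression stands in for whatever HTML parsing the module actually uses.

```python
import json
import re

# Hypothetical pattern, not part of bolster: capture the body of each
# <script type="application/json"> block.
SCRIPT_RE = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def find_dt_payload(html):
    """Return the ``x`` sub-dict of the largest matching block, or None."""
    best, best_size = None, -1
    for raw in SCRIPT_RE.findall(html):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON, skip this block
        x = parsed.get("x") if isinstance(parsed, dict) else None
        data = x.get("data") if isinstance(x, dict) else None
        # Column-transposed data: a non-empty list whose items are all lists.
        is_columns = (
            isinstance(data, list)
            and bool(data)
            and all(isinstance(col, list) for col in data)
        )
        # Assumption: size is measured as the length of the raw script text.
        if is_columns and len(raw) > best_size:
            best, best_size = x, len(raw)
    return best


html = (
    '<script type="application/json">{"x": {"data": [["a"], [1]]}}</script>'
    '<script type="application/json">{"x": {"data": "not columns"}}</script>'
    '<script type="application/json">'
    '{"x": {"data": [["Belfast", "Derry"], [1200, 800]],'
    ' "container": "<thead></thead>"}}'
    "</script>"
)
payload = find_dt_payload(html)
print(payload["data"])  # [['Belfast', 'Derry'], [1200, 800]]
```

The sketch returns None when nothing matches; the documented function raises DataTablesError in that case instead.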

bolster.utils.datatables.datatables_to_dataframe(payload)

Convert a DT widget payload into a row-oriented DataFrame.

The payload["data"] field is a list of column arrays (column-transposed). This function transposes it into a normal row-oriented DataFrame and uses column names from payload["container"] if available.

Parameters:

payload (dict) – The x sub-dict from a DT widget JSON block, as returned by fetch_datatables_json().

Returns:

DataFrame with one row per record and columns named from the HTML header.

Raises:

DataTablesError – If payload["data"] is missing or malformed.

Return type:

pandas.DataFrame

Example

>>> from bolster.utils.datatables import datatables_to_dataframe
>>> payload = {
...     "data": [["a", "b"], [1, 2]],
...     "container": "<table><thead><tr><th>Name</th><th>Value</th></tr></thead></table>",
... }
>>> df = datatables_to_dataframe(payload)
>>> list(df.columns)
['Name', 'Value']
>>> len(df)
2
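
The two steps described for this function, reading column names from the container HTML and transposing the column arrays into rows, can be sketched with the standard library. This is an illustration under stated assumptions rather than the module's code: HeaderParser and payload_to_records are hypothetical names, and the fallback to generic col_N names when the header count does not match is an assumption about what "if available" might mean.

```python
from html.parser import HTMLParser


class HeaderParser(HTMLParser):
    """Collect the text of every <th> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.headers[-1] += data


def payload_to_records(payload):
    """Turn a column-transposed payload into a list of row dicts."""
    columns = payload["data"]
    parser = HeaderParser()
    parser.feed(payload.get("container") or "")
    names = [h.strip() for h in parser.headers]
    # Assumed fallback: generic names when the header count does not match.
    if len(names) != len(columns):
        names = [f"col_{i}" for i in range(len(columns))]
    return [dict(zip(names, row)) for row in zip(*columns)]


payload = {
    "data": [["a", "b"], [1, 2]],
    "container": "<table><thead><tr><th>Name</th><th>Value</th></tr></thead></table>",
}
records = payload_to_records(payload)
print(records)  # [{'Name': 'a', 'Value': 1}, {'Name': 'b', 'Value': 2}]
```

A list of row dicts like this can be passed straight to pandas.DataFrame to get a row-oriented frame with the same column names.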

bolster.utils.datatables.get_column_headers_from_url(url, timeout=30)

Fetch a DataTables page and return its column header names.

Convenience helper for discovering which columns a page provides.

Parameters:
  • url (str) – URL of the HTML page.

  • timeout (int) – HTTP request timeout in seconds.

Returns:

List of column header strings.

Return type:

list[str]