bolster.data_sources.nisra.disease_prevalence

NISRA Raw Disease Prevalence Module.

Provides access to Northern Ireland’s raw disease prevalence statistics published annually by the Department of Health. The data originate from General Practice clinical disease registers (Quality & Outcomes Framework, QOF) and are released once per year after National Prevalence Day.

Data Coverage:
  • Financial years 2004/05 to 2025/26 (22 years, extended annually)

  • NI-level summary: registered patients per disease register (Table 1) and prevalence per 1,000 patients (Table 2a)

  • GP practice-level: same metrics per practice (Table 5a–5q), 2009/10–2025/26

  • 26 named disease registers; 14 are active as of 2025/26

  • ~305–360 GP practices per year

Data Source:

Department of Health Northern Ireland publishes an Excel workbook via https://www.health-ni.gov.uk/articles/prevalence-statistics. The landing page links to a publications page which hosts the Excel file.

Update Frequency:

Annual, published around May after the financial year ends.

Example

>>> from bolster.data_sources.nisra import disease_prevalence as dp
>>> df = dp.get_latest_disease_prevalence()
>>> 'registered_patients' in df.columns
True
>>> 'prevalence_per_1000' in df.columns
True

Attributes

logger

DOH_LANDING_PAGE

DOH_BASE_URL

Functions

get_latest_publication_url()

Return the URL of the most recent disease prevalence Excel workbook.

parse_ni_summary(file_path)

Parse Table 1 and Table 2 from the disease prevalence Excel workbook.

get_latest_disease_prevalence([force_refresh])

Fetch and return the latest NI disease prevalence data.

validate_disease_prevalence(df[, level])

Validate the disease prevalence DataFrame for internal consistency.

parse_gp_practice_lookup(file_path[, sheet_name])

Parse Table 4 (GP practice details) into a lookup DataFrame.

parse_gp_practice(file_path, sheet_name)

Parse a single Table 5 sheet into a long-format GP practice DataFrame.

parse_all_gp_practices(file_path)

Parse all Table 5 sheets and return a concatenated long-format DataFrame.

get_latest_gp_prevalence([force_refresh])

Fetch and return the latest GP-practice-level disease prevalence data.

Module Contents

bolster.data_sources.nisra.disease_prevalence.logger
bolster.data_sources.nisra.disease_prevalence.DOH_LANDING_PAGE = 'https://www.health-ni.gov.uk/articles/prevalence-statistics'
bolster.data_sources.nisra.disease_prevalence.DOH_BASE_URL = 'https://www.health-ni.gov.uk'
bolster.data_sources.nisra.disease_prevalence.get_latest_publication_url()

Return the URL of the most recent disease prevalence Excel workbook.

Scrapes the Department of Health landing page, follows the first link to a publications page, and returns the .xlsx download URL found there.

Returns:

Absolute URL of the latest Excel workbook.

Raises:

NISRADataNotFoundError – If the Excel link cannot be located.

Return type:

str

Example

>>> url = get_latest_publication_url()
>>> url.endswith(".xlsx")
True
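
The link-extraction step described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the helper names `_LinkCollector` and `first_xlsx_link` are hypothetical, the sample markup is invented, and the real pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

DOH_BASE_URL = "https://www.health-ni.gov.uk"


class _LinkCollector(HTMLParser):
    """Collect every href found on <a> tags, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_xlsx_link(html, base_url=DOH_BASE_URL):
    """Return the first absolute .xlsx URL found in html, or None."""
    collector = _LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if href.lower().endswith(".xlsx"):
            # Relative links are resolved against the site root.
            return urljoin(base_url, href)
    return None


# Hypothetical page fragment; the real markup may differ.
sample = (
    '<a href="/about">About</a>'
    '<a href="/sites/default/files/example.xlsx">Tables</a>'
)
print(first_xlsx_link(sample))
```

Returning `None` keeps the sketch small; the documented function raises NISRADataNotFoundError instead when no link is found.
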
bolster.data_sources.nisra.disease_prevalence.parse_ni_summary(file_path)

Parse Table 1 and Table 2 from the disease prevalence Excel workbook.

Reads both NI-level summary sheets and returns a single merged long-format DataFrame with one row per (financial_year, register) combination.

Parameters:

file_path – Path-like object or string pointing to the downloaded .xlsx workbook.

Returns:

DataFrame with columns:

  • year (int): Start year of the financial year (e.g. 2004 for “2004/05”)

  • financial_year (str): Financial year label (e.g. “2004/05”)

  • register (str): Normalised disease register name

  • registered_patients (float): Number of patients on the register at National Prevalence Day (NaN if not available for that year)

  • prevalence_per_1000 (float): Prevalence per 1,000 registered patients (NaN if not available for that year)

Rows are sorted by register, then year.

Return type:

pandas.DataFrame

Raises:
  • NISRADataNotFoundError – If the expected sheets are not found.

  • NISRAValidationError – If the parsed data fails basic sanity checks.

Example

>>> df = parse_ni_summary("/tmp/rdptd-tables-2026.xlsx")
>>> set(df.columns) >= {"year", "financial_year", "register",
...                     "registered_patients", "prevalence_per_1000"}
True
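
The relationship between the `year` and `financial_year` columns can be illustrated with two small helpers. They are hypothetical and only mirror the documented convention that “2004/05” has start year 2004; the module may derive these values differently.

```python
import re

_FY_PATTERN = re.compile(r"^(\d{4})/(\d{2})$")


def start_year(financial_year):
    """Return the start year of a label such as "2004/05"."""
    match = _FY_PATTERN.match(financial_year.strip())
    if match is None:
        raise ValueError(f"Not a financial year label: {financial_year!r}")
    start, end_suffix = int(match.group(1)), int(match.group(2))
    # The two-digit suffix must be the year after the start year.
    if (start + 1) % 100 != end_suffix:
        raise ValueError(f"Inconsistent financial year: {financial_year!r}")
    return start


def financial_year_label(year):
    """Return the label for a start year, e.g. 2025 -> "2025/26"."""
    return f"{year}/{(year + 1) % 100:02d}"


print(start_year("2004/05"), financial_year_label(2025))
```
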
bolster.data_sources.nisra.disease_prevalence.get_latest_disease_prevalence(force_refresh=False)

Fetch and return the latest NI disease prevalence data.

Downloads the current Excel workbook from the Department of Health website (cached locally for 24 × 365 hours, i.e. one year), parses both NI-level summary tables, validates the result, and returns a clean long-format DataFrame.

Parameters:

force_refresh (bool) – If True, bypass the local file cache and re-download the workbook. Default: False.

Returns:

DataFrame with columns:

  • year (int): Start year of the financial year

  • financial_year (str): Label such as “2004/05”

  • register (str): Disease register name (normalised)

  • registered_patients (float): Patients on the register at National Prevalence Day

  • prevalence_per_1000 (float): Prevalence per 1,000 registered patients

Data spans 2004/05 to the latest published year.

Return type:

pandas.DataFrame

Raises:
  • NISRADataNotFoundError – If the workbook cannot be located or downloaded.

  • NISRAValidationError – If the parsed data fails validation.

Example

>>> df = get_latest_disease_prevalence()
>>> df["register"].nunique() >= 14
True
>>> df["year"].min() <= 2004
True
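
How a time-based file cache and the `force_refresh` flag might interact is sketched below. The helper `is_cache_fresh` and its parameters are hypothetical; the library's own caching code is not shown in this page and may work differently.

```python
import tempfile
import time
from pathlib import Path


def is_cache_fresh(path, ttl_hours, force_refresh=False, now=None):
    """Return True if a cached file exists and is younger than ttl_hours."""
    if force_refresh:
        # Caller explicitly asked for a re-download.
        return False
    path = Path(path)
    if not path.exists():
        return False
    now = time.time() if now is None else now
    age_hours = (now - path.stat().st_mtime) / 3600
    return age_hours < ttl_hours


# Demonstration with a throwaway file standing in for the workbook.
with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "workbook.xlsx"
    cached.write_bytes(b"placeholder")
    print(is_cache_fresh(cached, ttl_hours=24 * 365))
    print(is_cache_fresh(cached, ttl_hours=24 * 365, force_refresh=True))
```

A caller would download the workbook only when this check returns False, which is why `force_refresh=True` short-circuits before the file is inspected.
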
bolster.data_sources.nisra.disease_prevalence.validate_disease_prevalence(df, level='ni')

Validate the disease prevalence DataFrame for internal consistency.

Checks that required columns are present, the DataFrame is non-empty, has sufficient temporal coverage, and that key numeric fields are within plausible bounds.

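Parameters:
  • df – DataFrame to validate.

  • level (str) – Either "ni" (NI-level summary data) or "gp" (GP-practice-level data). Default: "ni".

Returns: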

True if all checks pass.

Raises:
  • NISRAValidationError – Describing the first failing check.

  • ValueError – If level is not "ni" or "gp".

Return type:

bool

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "year": [2004], "financial_year": ["2004/05"],
...     "register": ["Hypertension"],
...     "registered_patients": [184824.0],
...     "prevalence_per_1000": [102.9],
... })
>>> validate_disease_prevalence(df)
True
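
The kinds of checks described above can be sketched on plain dict records instead of a DataFrame, which keeps the example free of third-party dependencies. Everything here is an assumption made for illustration: the `ValidationError` class, the required-column sets, and the numeric bounds are not taken from the library, and the temporal-coverage check is omitted.

```python
import math

# Assumed required columns per level; the library's own lists may differ.
_BASE = {"year", "financial_year", "register",
         "registered_patients", "prevalence_per_1000"}
REQUIRED = {"ni": _BASE, "gp": _BASE | {"practice_code"}}


class ValidationError(Exception):
    """Hypothetical stand-in for NISRAValidationError."""


def _is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def validate_records(records, level="ni"):
    """Return True if records pass basic checks, else raise."""
    if level not in REQUIRED:
        raise ValueError(f"level must be 'ni' or 'gp', got {level!r}")
    if not records:
        raise ValidationError("no rows to validate")
    for i, row in enumerate(records):
        missing = REQUIRED[level] - set(row)
        if missing:
            raise ValidationError(f"row {i} lacks columns: {sorted(missing)}")
        # Assumed plausibility bounds: counts are non-negative and a rate
        # per 1,000 cannot exceed 1,000. Missing values are allowed.
        for field, upper in (("registered_patients", math.inf),
                             ("prevalence_per_1000", 1000.0)):
            value = row[field]
            if not _is_missing(value) and not 0 <= value <= upper:
                raise ValidationError(f"row {i}: {field}={value} out of range")
    return True


rows = [{"year": 2004, "financial_year": "2004/05",
         "register": "Hypertension",
         "registered_patients": 184824.0, "prevalence_per_1000": 102.9}]
print(validate_records(rows))
```
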
bolster.data_sources.nisra.disease_prevalence.parse_gp_practice_lookup(file_path, sheet_name=None)

Parse Table 4 (GP practice details) into a lookup DataFrame.

Table 4 is a single sheet in the workbook that provides the canonical practice name, address, and postcode for every GP practice code. It is used to populate the practice_name column that is absent from the Table 5 prevalence sheets.

The sheet header row is at row index 7 (0-based). Data rows start immediately below and continue until the last row; all Practice ID values start with "Z".

Parameters:
  • file_path – Path to the downloaded .xlsx workbook.

  • sheet_name (str | None) – Sheet name override. If None (default), uses the constant _SHEET_TABLE4 ("Table 4 GP practice details").

Returns:

DataFrame with columns:

  • practice_code (str): Practice identifier (e.g. "Z00001")

  • practice_name (str): Practice name (e.g. "Dr. IRWIN")

  • address (str or None): Full formatted address (Address1 / Address2 / Address3 joined, NaN parts dropped)

  • postcode (str or None): Postcode (e.g. "BT4 1NS")

Return type:

pandas.DataFrame

Raises:

NISRADataNotFoundError – If the sheet cannot be found in the workbook.

Example

>>> lkp = parse_gp_practice_lookup("/tmp/rdptd-tables-2026.xlsx")
>>> "practice_code" in lkp.columns
True
>>> lkp["practice_code"].str.startswith("Z").all()
True
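
The way the `address` column combines Address1 to Address3 while dropping missing parts can be sketched as follows. The helper `join_address` is hypothetical, the sample address is invented, and the ", " separator is an assumption because the documentation does not state which separator is used.

```python
import math


def join_address(*parts, sep=", "):
    """Join address parts, skipping None, NaN, and blank strings."""
    kept = []
    for part in parts:
        if part is None:
            continue
        if isinstance(part, float) and math.isnan(part):
            continue
        text = str(part).strip()
        if text:
            kept.append(text)
    # Mirror the documented "str or None" column type.
    return sep.join(kept) if kept else None


# Invented sample values, with a NaN standing in for an empty Excel cell.
print(join_address("1 Example Street", float("nan"), "Belfast"))
```
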
bolster.data_sources.nisra.disease_prevalence.parse_gp_practice(file_path, sheet_name)

Parse a single Table 5 sheet into a long-format GP practice DataFrame.

Reads the compound 2-row header (rows 4–5) to identify register columns, then extracts one row per (practice, register) pair. Practice codes are in column 1 and always start with “Z”; the “Northern Ireland” summary row at the foot of each sheet is excluded.

The sheet names in the workbook follow the pattern "Table 5X Prevalence YYYY" (e.g. "Table 5a Prevalence 2026"), from which the financial year is derived automatically.

Parameters:
  • file_path – Path to the downloaded .xlsx workbook (string or path-like).

  • sheet_name (str) – Exact name of the Table 5 sheet to parse, e.g. "Table 5a Prevalence 2026".

Returns:

Long-format DataFrame with columns:

  • practice_code (str): Practice identifier (e.g. "Z00001")

  • practice_name (object): None; practice names are sourced from Table 4 and are joined in parse_all_gp_practices() rather than at the individual sheet level

  • lcg (str or None): Local Commissioning Group name

  • federation (str or None): Federation name; None for years before 2017/18 when the column was not present

  • financial_year (str): Label such as "2025/26"

  • year (int): Start year of the financial year (e.g. 2025)

  • register (str): Normalised disease register name

  • registered_patients (float): Patients on the register at National Prevalence Day; NaN if not available for that register in that year

  • prevalence_per_1000 (float): Prevalence per 1,000 registered patients; NaN if not available

Return type:

pandas.DataFrame

Raises:

NISRADataNotFoundError – If the sheet is not found or the expected column blocks cannot be located.

Example

>>> df = parse_gp_practice("/tmp/rdptd-tables-2026.xlsx",
...                        "Table 5a Prevalence 2026")
>>> df["practice_code"].str.startswith("Z").all()
True
>>> "Hypertension" in df["register"].values
True
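
Deriving the financial year from a sheet name can be sketched with a regular expression. The helper below is hypothetical. It assumes the four-digit year in the sheet name is the calendar year in which the financial year ends, which matches the documented pairing of "Table 5a Prevalence 2026" with "2025/26" but is not confirmed for every sheet.

```python
import re

# Matches the documented pattern "Table 5X Prevalence YYYY".
_SHEET_PATTERN = re.compile(r"^Table 5([a-z]) Prevalence (\d{4})$")


def financial_year_from_sheet(sheet_name):
    """Return (start_year, label) for a Table 5 sheet name."""
    match = _SHEET_PATTERN.match(sheet_name.strip())
    if match is None:
        raise ValueError(f"Unexpected sheet name: {sheet_name!r}")
    end_year = int(match.group(2))
    start = end_year - 1  # assumption: sheet year is the end year
    return start, f"{start}/{end_year % 100:02d}"


print(financial_year_from_sheet("Table 5a Prevalence 2026"))
```
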
bolster.data_sources.nisra.disease_prevalence.parse_all_gp_practices(file_path)

Parse all Table 5 sheets and return a concatenated long-format DataFrame.

Iterates every sheet whose name starts with "Table 5" in the workbook, calls parse_gp_practice() for each, and concatenates the results. Financial year and start year are derived from each sheet’s name.

Sheets that cannot be parsed (e.g. due to unexpected structure) are skipped with a warning rather than raising an exception, so a partial result is always returned.

Parameters:

file_path – Path to the downloaded .xlsx workbook (string or path-like).

Returns:

Long-format DataFrame with columns practice_code, practice_name, lcg, federation, financial_year, year, register, registered_patients, and prevalence_per_1000.

Rows are sorted by financial_year, practice_code, then register.

Return type:

pandas.DataFrame

Raises:

NISRADataNotFoundError – If no Table 5 sheets can be found or parsed.

Example

>>> df = parse_all_gp_practices("/tmp/rdptd-tables-2026.xlsx")
>>> df["financial_year"].nunique() >= 17
True
>>> df["practice_code"].nunique() >= 300
True
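
The skip-with-a-warning behaviour described above follows a common pattern, sketched here with plain lists in place of DataFrames. The names `parse_all`, `parse_sheet`, and `DataNotFoundError` are hypothetical stand-ins, and the fake parser exists only to make the example runnable.

```python
import logging

logger = logging.getLogger(__name__)


class DataNotFoundError(Exception):
    """Hypothetical stand-in for NISRADataNotFoundError."""


def parse_all(sheet_names, parse_sheet):
    """Parse every "Table 5" sheet, skipping those that fail."""
    parsed = []
    for name in sheet_names:
        if not name.startswith("Table 5"):
            continue
        try:
            parsed.append(parse_sheet(name))
        except Exception as exc:
            # One bad sheet should not discard the others.
            logger.warning("Skipping sheet %r: %s", name, exc)
    if not parsed:
        raise DataNotFoundError("no Table 5 sheets could be parsed")
    # Flatten per-sheet row lists into a single list of rows.
    return [row for rows in parsed for row in rows]


def fake_parse(name):
    """Illustrative parser that fails on one sheet."""
    if "2011" in name:
        raise ValueError("unexpected structure")
    return [{"sheet": name}]


sheets = ["Table 4 GP practice details",
          "Table 5a Prevalence 2012",
          "Table 5b Prevalence 2011"]
print(parse_all(sheets, fake_parse))
```

Raising only when nothing at all could be parsed matches the documented contract: a partial result is returned whenever at least one sheet succeeds.
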
bolster.data_sources.nisra.disease_prevalence.get_latest_gp_prevalence(force_refresh=False)

Fetch and return the latest GP-practice-level disease prevalence data.

Downloads the current Excel workbook from the Department of Health website (cached locally for 24 × 365 hours, i.e. one year), parses all Table 5 sheets (one per financial year), validates the result, and returns a clean long-format DataFrame covering 2009/10 to the latest published year.

Parameters:

force_refresh (bool) – If True, bypass the local file cache and re-download the workbook. Default: False.

Returns:

Long-format DataFrame with columns:

  • practice_code (str): GP practice identifier (e.g. "Z00001")

  • practice_name (str or None): Practice name from Table 4 lookup (e.g. "Dr. IRWIN"); None if the practice code is not found in Table 4 (e.g. historical practices no longer listed)

  • lcg (str or None): Local Commissioning Group

  • federation (str or None): Federation name (None pre-2017/18)

  • financial_year (str): Label such as "2025/26"

  • year (int): Start year of the financial year

  • register (str): Disease register name (normalised)

  • registered_patients (float): Patients on the register at National Prevalence Day

  • prevalence_per_1000 (float): Prevalence per 1,000 registered patients

Data spans 2009/10 to the latest published year (~17 financial years) with ~305–360 GP practices per year.

Return type:

pandas.DataFrame

Raises:
  • NISRADataNotFoundError – If the workbook cannot be located or downloaded.

  • NISRAValidationError – If the parsed data fails validation.

Example

>>> df = get_latest_gp_prevalence()
>>> df["practice_code"].str.startswith("Z").all()
True
>>> df["financial_year"].nunique() >= 17
True