bolster.data_sources.nisra.disease_prevalence
=============================================

.. py:module:: bolster.data_sources.nisra.disease_prevalence

.. autoapi-nested-parse::

   NISRA Raw Disease Prevalence Module.

   Provides access to Northern Ireland's raw disease prevalence statistics, published
   annually by the Department of Health. The data originate from General Practice
   clinical disease registers (Quality & Outcomes Framework, QOF) and are released
   once per year after National Prevalence Day (NPD).

   Data Coverage:
       - Financial years 2004/05 to 2025/26 (22 years, extended annually)
       - NI-level summary: registered patients per disease register (Table 1) and
         prevalence per 1,000 patients (Table 2a)
       - GP practice-level: same metrics per practice (Table 5a–5q), 2009/10–2025/26
       - 26 named disease registers; 14 are active as of 2025/26
       - ~305–360 GP practices per year

   Data Source:
       Department of Health Northern Ireland publishes an Excel workbook via
       https://www.health-ni.gov.uk/articles/prevalence-statistics. The landing page
       links to a publications page which hosts the Excel file.

   Update Frequency:
       Annual, approximately May of the following calendar year.

   .. rubric:: Example

   >>> from bolster.data_sources.nisra import disease_prevalence as dp
   >>> df = dp.get_latest_disease_prevalence()
   >>> 'registered_patients' in df.columns
   True
   >>> 'prevalence_per_1000' in df.columns
   True

   Publication Details:
       - Frequency: Annual
       - Published by: Department of Health / NISRA
       - Source: https://www.health-ni.gov.uk/articles/prevalence-statistics


Attributes
----------

.. autoapisummary::

   bolster.data_sources.nisra.disease_prevalence.logger
   bolster.data_sources.nisra.disease_prevalence.DOH_LANDING_PAGE
   bolster.data_sources.nisra.disease_prevalence.DOH_BASE_URL


Functions
---------

.. autoapisummary::

   bolster.data_sources.nisra.disease_prevalence.get_latest_publication_url
   bolster.data_sources.nisra.disease_prevalence.parse_ni_summary
   bolster.data_sources.nisra.disease_prevalence.get_latest_disease_prevalence
   bolster.data_sources.nisra.disease_prevalence.validate_disease_prevalence
   bolster.data_sources.nisra.disease_prevalence.parse_gp_practice_lookup
   bolster.data_sources.nisra.disease_prevalence.parse_gp_practice
   bolster.data_sources.nisra.disease_prevalence.parse_all_gp_practices
   bolster.data_sources.nisra.disease_prevalence.get_latest_gp_prevalence


Module Contents
---------------

.. py:data:: logger

.. py:data:: DOH_LANDING_PAGE
   :value: 'https://www.health-ni.gov.uk/articles/prevalence-statistics'


.. py:data:: DOH_BASE_URL
   :value: 'https://www.health-ni.gov.uk'


.. py:function:: get_latest_publication_url()

   Return the URL of the most recent disease prevalence Excel workbook.

   Scrapes the Department of Health landing page, follows the first link to a
   publications page, and returns the .xlsx download URL found there.

   :returns: Absolute URL of the latest Excel workbook.

   :raises NISRADataNotFoundError: If the Excel link cannot be located.

   .. rubric:: Example

   >>> url = get_latest_publication_url()
   >>> url.endswith(".xlsx")
   True


.. py:function:: parse_ni_summary(file_path)

   Parse Table 1 and Table 2a from the disease prevalence Excel workbook.

   Reads both NI-level summary sheets and returns a single merged long-format
   DataFrame with one row per (financial_year, register) combination.

   :param file_path: Path-like object or string pointing to the downloaded .xlsx workbook.

   :returns: - year (int): Start year of the financial year (e.g. 2004 for "2004/05")
             - financial_year (str): Financial year label (e.g. "2004/05")
             - register (str): Normalised disease register name
             - registered_patients (float): Number of patients on the register at
               National Prevalence Day (NaN if not available for that year)
             - prevalence_per_1000 (float): Prevalence per 1,000 registered patients
               (NaN if not available for that year)

             Rows are sorted by register, then year.
   :rtype: DataFrame with columns

   :raises NISRADataNotFoundError: If the expected sheets are not found.
   :raises NISRAValidationError: If the parsed data fails basic sanity checks.

   .. rubric:: Example

   >>> df = parse_ni_summary("/tmp/rdptd-tables-2026.xlsx")
   >>> set(df.columns) >= {"year", "financial_year", "register",
   ...                     "registered_patients", "prevalence_per_1000"}
   True


.. py:function:: get_latest_disease_prevalence(force_refresh = False)

   Fetch and return the latest NI disease prevalence data.

   Downloads the current Excel workbook from the Department of Health website
   (cached locally for 24 × 365 hours, i.e. one year), parses both NI-level
   summary tables, validates the result, and returns a clean long-format DataFrame.

   :param force_refresh: If True, bypass the local file cache and re-download
                         the workbook. Default: False.

   :returns: - year (int): Start year of the financial year
             - financial_year (str): Label such as "2004/05"
             - register (str): Disease register name (normalised)
             - registered_patients (float): Patients on register at NPD
             - prevalence_per_1000 (float): Prevalence per 1,000 registered patients

             Data spans 2004/05 to the latest published year.
   :rtype: DataFrame with columns

   :raises NISRADataNotFoundError: If the workbook cannot be located or downloaded.
   :raises NISRAValidationError: If the parsed data fails validation.

   .. rubric:: Example

   >>> df = get_latest_disease_prevalence()
   >>> df["register"].nunique() >= 14
   True
   >>> df["year"].min() <= 2004
   True


.. py:function:: validate_disease_prevalence(df, level = 'ni')

   Validate the disease prevalence DataFrame for internal consistency.

   Checks that the required columns are present, that the DataFrame is non-empty
   and has sufficient temporal coverage, and that key numeric fields are within
   plausible bounds.

   :param df: DataFrame as returned by :func:`parse_ni_summary`,
              :func:`get_latest_disease_prevalence`,
              :func:`parse_all_gp_practices`, or
              :func:`get_latest_gp_prevalence`.
   :param level: Validation mode: ``"ni"`` for NI-level summary DataFrames or
                 ``"gp"`` for GP-practice-level DataFrames. ``"ni"`` is the
                 backward-compatible default.

   :returns: True if all checks pass.

   :raises NISRAValidationError: Describing the first failing check.
   :raises ValueError: If *level* is not ``"ni"`` or ``"gp"``.

   .. rubric:: Example

   >>> import pandas as pd
   >>> df = pd.DataFrame({
   ...     "year": [2004], "financial_year": ["2004/05"],
   ...     "register": ["Hypertension"],
   ...     "registered_patients": [184824.0],
   ...     "prevalence_per_1000": [102.9],
   ... })
   >>> validate_disease_prevalence(df)
   True


.. py:function:: parse_gp_practice_lookup(file_path, sheet_name = None)

   Parse Table 4 (GP practice details) into a lookup DataFrame.

   Table 4 is a single sheet in the workbook that provides the canonical
   practice name, address, and postcode for every GP practice code. It is
   used to populate the ``practice_name`` column that is absent from the
   Table 5 prevalence sheets.

   The sheet header row is at row index 7 (0-based). Data rows start
   immediately below and continue until the last row; all ``Practice ID``
   values start with ``"Z"``.

   :param file_path: Path to the downloaded ``.xlsx`` workbook.
   :param sheet_name: Sheet name override. If ``None`` (default), uses the
                      constant ``_SHEET_TABLE4`` (``"Table 4 GP practice details"``).

   :returns: - ``practice_code`` (str): Practice identifier (e.g. ``"Z00001"``)
             - ``practice_name`` (str): Practice name (e.g. ``"Dr. IRWIN"``)
             - ``address`` (str or None): Full formatted address
               (Address1 / Address2 / Address3 joined, NaN parts dropped)
             - ``postcode`` (str or None): Postcode (e.g. ``"BT4 1NS"``)
   :rtype: DataFrame with columns

   :raises NISRADataNotFoundError: If the sheet cannot be found in the workbook.

   .. rubric:: Example

   >>> lkp = parse_gp_practice_lookup("/tmp/rdptd-tables-2026.xlsx")
   >>> "practice_code" in lkp.columns
   True
   >>> lkp["practice_code"].str.startswith("Z").all()
   True


.. py:function:: parse_gp_practice(file_path, sheet_name)

   Parse a single Table 5 sheet into a long-format GP practice DataFrame.

   Reads the compound 2-row header (rows 4–5) to identify register columns,
   then extracts one row per (practice, register) pair. Practice codes are
   in column 1 and always start with "Z"; the "Northern Ireland" summary row
   at the foot of each sheet is excluded.

   The sheet names in the workbook follow the pattern
   ``"Table 5X Prevalence YYYY"`` (e.g. ``"Table 5a Prevalence 2026"``),
   from which the financial year is derived automatically.

   :param file_path: Path to the downloaded ``.xlsx`` workbook (string or
                     path-like).
   :param sheet_name: Exact name of the Table 5 sheet to parse, e.g.
                      ``"Table 5a Prevalence 2026"``.

   :returns: - ``practice_code`` (str): Practice identifier (e.g. ``"Z00001"``)
             - ``practice_name`` (object): ``None``; practice names are sourced
               from Table 4 and are joined in :func:`parse_all_gp_practices`
               rather than at the individual sheet level
             - ``lcg`` (str or None): Local Commissioning Group name
             - ``federation`` (str or None): Federation name; ``None`` for years
               before 2017/18 when the column was not present
             - ``financial_year`` (str): Label such as ``"2025/26"``
             - ``year`` (int): Start year of the financial year (e.g. ``2025``)
             - ``register`` (str): Normalised disease register name
             - ``registered_patients`` (float): Patients on register at NPD;
               ``NaN`` if not available for that register in that year
             - ``prevalence_per_1000`` (float): Prevalence per 1,000 registered
               patients; ``NaN`` if not available
   :rtype: Long-format DataFrame with columns

   :raises NISRADataNotFoundError: If the sheet is not found or the expected
       column blocks cannot be located.

   .. rubric:: Example

   >>> df = parse_gp_practice("/tmp/rdptd-tables-2026.xlsx",
   ...                        "Table 5a Prevalence 2026")
   >>> df["practice_code"].str.startswith("Z").all()
   True
   >>> "Hypertension" in df["register"].values
   True


.. py:function:: parse_all_gp_practices(file_path)

   Parse all Table 5 sheets and return a concatenated long-format DataFrame.

   Iterates over every sheet in the workbook whose name starts with
   ``"Table 5"``, calls :func:`parse_gp_practice` for each, and concatenates
   the results. Financial year and start year are derived from each sheet's
   name.

   Sheets that cannot be parsed (e.g. due to unexpected structure) are
   skipped with a warning rather than raising an exception, so a partial
   result is returned as long as at least one sheet parses.

   :param file_path: Path to the downloaded ``.xlsx`` workbook (string or
                     path-like).

   :returns: ``practice_code``, ``practice_name``, ``lcg``, ``federation``,
             ``financial_year``, ``year``, ``register``,
             ``registered_patients``, ``prevalence_per_1000``.

             Rows are sorted by ``financial_year``, ``practice_code``, then
             ``register``.
   :rtype: Long-format DataFrame with columns

   :raises NISRADataNotFoundError: If no Table 5 sheets can be found or parsed.

   .. rubric:: Example

   >>> df = parse_all_gp_practices("/tmp/rdptd-tables-2026.xlsx")
   >>> df["financial_year"].nunique() >= 17
   True
   >>> df["practice_code"].nunique() >= 300
   True


.. py:function:: get_latest_gp_prevalence(force_refresh = False)

   Fetch and return the latest GP-practice-level disease prevalence data.

   Downloads the current Excel workbook from the Department of Health website
   (cached locally for 24 × 365 hours, i.e. one year), parses all Table 5
   sheets (one per financial year), validates the result, and returns a clean
   long-format DataFrame covering 2009/10 to the latest published year.

   :param force_refresh: If True, bypass the local file cache and re-download
                         the workbook. Default: False.

   :returns: - ``practice_code`` (str): GP practice identifier (e.g. ``"Z00001"``)
             - ``practice_name`` (str or None): Practice name from Table 4 lookup
               (e.g. ``"Dr. IRWIN"``); ``None`` if the practice code is not found
               in Table 4 (e.g. historical practices no longer listed)
             - ``lcg`` (str or None): Local Commissioning Group
             - ``federation`` (str or None): Federation name (``None`` pre-2017/18)
             - ``financial_year`` (str): Label such as ``"2025/26"``
             - ``year`` (int): Start year of the financial year
             - ``register`` (str): Disease register name (normalised)
             - ``registered_patients`` (float): Patients on register at NPD
             - ``prevalence_per_1000`` (float): Prevalence per 1,000 registered
               patients

             Data spans 2009/10 to the latest published year (~17 financial years)
             with ~305–360 GP practices per year.
   :rtype: Long-format DataFrame with columns

   :raises NISRADataNotFoundError: If the workbook cannot be located or downloaded.
   :raises NISRAValidationError: If the parsed data fails validation.

   .. rubric:: Example

   >>> df = get_latest_gp_prevalence()
   >>> df["practice_code"].str.startswith("Z").all()
   True
   >>> df["financial_year"].nunique() >= 17
   True
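
The financial-year conventions used throughout this module (a ``financial_year`` label such as ``"2004/05"`` whose start ``year`` is 2004, and Table 5 sheet names of the form ``"Table 5X Prevalence YYYY"``) can be illustrated with a minimal standard-library sketch. The helper names below are hypothetical and are not part of ``bolster``. The sketch also assumes that the ``YYYY`` in a sheet name is the calendar year in which the financial year ends; the examples above suggest this (``"Table 5a Prevalence 2026"`` alongside ``"2025/26"``), but it is not stated outright.

```python
import re

# Hypothetical helpers for illustration only; not part of the bolster API.


def start_year_from_label(financial_year: str) -> int:
    """Return the start year of a label such as "2004/05"."""
    match = re.fullmatch(r"(\d{4})/(\d{2})", financial_year)
    if match is None:
        raise ValueError(f"Not a financial year label: {financial_year!r}")
    return int(match.group(1))


def label_from_end_year(end_year: int) -> str:
    """Build a label such as "2025/26" from the year the period ends in."""
    start_year = end_year - 1
    return f"{start_year}/{end_year % 100:02d}"


def financial_year_from_sheet_name(sheet_name: str) -> tuple[str, int]:
    """Derive (financial_year, year) from a Table 5 sheet name.

    Assumption: the trailing YYYY is the calendar year in which the
    financial year ends, so 2026 maps to "2025/26" with start year 2025.
    """
    match = re.fullmatch(r"Table 5[a-z] Prevalence (\d{4})", sheet_name.strip())
    if match is None:
        raise ValueError(f"Unexpected Table 5 sheet name: {sheet_name!r}")
    end_year = int(match.group(1))
    return label_from_end_year(end_year), end_year - 1
```

Under these assumptions, ``financial_year_from_sheet_name("Table 5a Prevalence 2026")`` yields ``("2025/26", 2025)``, matching the ``financial_year`` and ``year`` columns described for :func:`parse_gp_practice`. Raising ``ValueError`` on an unexpected name is a choice made for this sketch; the module itself documents ``NISRADataNotFoundError`` for sheets it cannot handle.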