bolster.data_sources.nisra.disease_prevalence
=============================================

.. py:module:: bolster.data_sources.nisra.disease_prevalence

.. autoapi-nested-parse::

   NISRA Raw Disease Prevalence Module.

   Provides access to Northern Ireland's raw disease prevalence statistics
   published annually by the Department of Health.  The data originate from
   General Practice clinical disease registers (Quality & Outcomes Framework,
   QOF) and are released once per year after National Prevalence Day.

   Data Coverage:
       - Financial years 2004/05 to 2025/26 (22 years, extended annually)
       - NI-level summary: registered patients per disease register (Table 1)
         and prevalence per 1,000 patients (Table 2a)
       - GP practice-level: the same metrics per practice (Tables 5a–5q,
         one sheet per financial year), 2009/10–2025/26
       - 26 named disease registers; 14 are active as of 2025/26
       - ~305–360 GP practices per year

   Data Source:
       Department of Health Northern Ireland publishes an Excel workbook via
       https://www.health-ni.gov.uk/articles/prevalence-statistics.  The
       landing page links to a publications page which hosts the Excel file.

   Update Frequency:
       Annual, approximately May of the following calendar year.

   .. rubric:: Example

   >>> from bolster.data_sources.nisra import disease_prevalence as dp
   >>> df = dp.get_latest_disease_prevalence()
   >>> 'registered_patients' in df.columns
   True
   >>> 'prevalence_per_1000' in df.columns
   True

   Publication Details:
       - Frequency: Annual
       - Published by: Department of Health / NISRA
       - Source: https://www.health-ni.gov.uk/articles/prevalence-statistics
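
The two NI-level metrics are related by simple arithmetic. The sketch below assumes that raw prevalence is the register count divided by the total registered list size, scaled to 1,000 patients. The function name and the figures are illustrative only; they are not taken from the workbook or from this module.

```python
def prevalence_per_1000(registered_patients: float, list_size: float) -> float:
    """Raw prevalence per 1,000 patients (assumed definition)."""
    if list_size <= 0:
        raise ValueError("list_size must be positive")
    # Multiply before dividing so round figures stay exact in floating point.
    return 1000 * registered_patients / list_size


# Hypothetical figures: 150 patients on a register, practice list of 6,000.
example = prevalence_per_1000(150, 6000)
print(example)  # 25.0
```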


Attributes
----------

.. autoapisummary::

   bolster.data_sources.nisra.disease_prevalence.logger
   bolster.data_sources.nisra.disease_prevalence.DOH_LANDING_PAGE
   bolster.data_sources.nisra.disease_prevalence.DOH_BASE_URL


Functions
---------

.. autoapisummary::

   bolster.data_sources.nisra.disease_prevalence.get_latest_publication_url
   bolster.data_sources.nisra.disease_prevalence.parse_ni_summary
   bolster.data_sources.nisra.disease_prevalence.get_latest_disease_prevalence
   bolster.data_sources.nisra.disease_prevalence.validate_disease_prevalence
   bolster.data_sources.nisra.disease_prevalence.parse_gp_practice_lookup
   bolster.data_sources.nisra.disease_prevalence.parse_gp_practice
   bolster.data_sources.nisra.disease_prevalence.parse_all_gp_practices
   bolster.data_sources.nisra.disease_prevalence.get_latest_gp_prevalence


Module Contents
---------------

.. py:data:: logger

.. py:data:: DOH_LANDING_PAGE
   :value: 'https://www.health-ni.gov.uk/articles/prevalence-statistics'


.. py:data:: DOH_BASE_URL
   :value: 'https://www.health-ni.gov.uk'


.. py:function:: get_latest_publication_url()

   Return the URL of the most recent disease prevalence Excel workbook.

   Scrapes the Department of Health landing page, follows the first link
   to a publications page, and returns the .xlsx download URL found there.

   :returns: Absolute URL of the latest Excel workbook.

   :raises NISRADataNotFoundError: If the Excel link cannot be located.

   .. rubric:: Example

   >>> url = get_latest_publication_url()
   >>> url.endswith(".xlsx")
   True
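
As an illustration of the two-hop lookup described for ``get_latest_publication_url``, the sketch below collects links from a landing page, picks the first publications link, then picks the ``.xlsx`` link on that page. It is not the module's implementation: the ``LinkCollector`` and ``first_link`` names, the HTML fragments and the link-matching rules are all assumptions, and no network request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_link(html, base_url, predicate):
    """Return the first absolute link whose href satisfies predicate, else None."""
    parser = LinkCollector()
    parser.feed(html)
    for href in parser.hrefs:
        if predicate(href):
            return urljoin(base_url, href)
    return None


BASE = "https://www.health-ni.gov.uk"
# Hypothetical page fragments; the real markup and paths may differ.
landing_html = '<a href="/about">About</a><a href="/publications/example">Tables</a>'
publication_html = '<a href="/sites/default/files/example.xlsx">Download</a>'

publication_url = first_link(landing_html, BASE, lambda h: "/publications/" in h)
xlsx_url = first_link(publication_html, BASE, lambda h: h.lower().endswith(".xlsx"))
```

A ``None`` result from the second call corresponds to the case where the documented function raises ``NISRADataNotFoundError``.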


.. py:function:: parse_ni_summary(file_path)

   Parse Table 1 (registered patients) and Table 2a (prevalence per 1,000)
   from the disease prevalence Excel workbook.

   Reads both NI-level summary sheets and returns a single merged long-format
   DataFrame with one row per (financial_year, register) combination.

   :param file_path: Path-like object or string pointing to the downloaded .xlsx
                     workbook.

   :returns: DataFrame with the following columns:

             - ``year`` (int): Start year of the financial year (e.g. ``2004``
               for ``"2004/05"``)
             - ``financial_year`` (str): Financial year label (e.g. ``"2004/05"``)
             - ``register`` (str): Normalised disease register name
             - ``registered_patients`` (float): Number of patients on the register
               at National Prevalence Day (``NaN`` if not available for that year)
             - ``prevalence_per_1000`` (float): Prevalence per 1,000 registered
               patients (``NaN`` if not available for that year)

             Rows are sorted by register, then year.
   :rtype: pandas.DataFrame

   :raises NISRADataNotFoundError: If the expected sheets are not found.
   :raises NISRAValidationError: If the parsed data fails basic sanity checks.

   .. rubric:: Example

   >>> df = parse_ni_summary("/tmp/rdptd-tables-2026.xlsx")
   >>> set(df.columns) >= {"year", "financial_year", "register",
   ...                     "registered_patients", "prevalence_per_1000"}
   True
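
The merge into one row per (financial_year, register) that ``parse_ni_summary`` describes can be pictured with plain dictionaries. This is a dependency-free sketch of the reshape only: the ``to_long`` helper, the input layout and the numbers are assumptions, and the real function reads Excel sheets and returns a DataFrame.

```python
# Hypothetical wide tables keyed by financial year; the values are made up.
table1 = {"2004/05": {"Hypertension": 1000.0, "Asthma": 500.0}}
table2 = {"2004/05": {"Hypertension": 100.0}}  # Asthma rate missing


def to_long(counts, rates):
    """Merge two wide tables into long rows, one per (financial_year, register)."""
    keys = set()
    for table in (counts, rates):
        for fy, row in table.items():
            keys.update((fy, register) for register in row)
    rows = []
    for fy, register in keys:
        rows.append({
            "year": int(fy[:4]),  # start year, e.g. 2004 for "2004/05"
            "financial_year": fy,
            "register": register,
            # A value absent from either table becomes NaN.
            "registered_patients": counts.get(fy, {}).get(register, float("nan")),
            "prevalence_per_1000": rates.get(fy, {}).get(register, float("nan")),
        })
    # Same ordering as documented: register first, then year.
    return sorted(rows, key=lambda r: (r["register"], r["year"]))


rows = to_long(table1, table2)
```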


.. py:function:: get_latest_disease_prevalence(force_refresh = False)

   Fetch and return the latest NI disease prevalence data.

   Downloads the current Excel workbook from the Department of Health
   website (cached locally for 24 × 365 hours, i.e. one year), parses both
   NI-level summary tables, validates the result, and returns a clean
   long-format DataFrame.

   :param force_refresh: If True, bypass the local file cache and re-download
                         the workbook.  Default: False.

   :returns: DataFrame with the following columns:

             - ``year`` (int): Start year of the financial year
             - ``financial_year`` (str): Label such as ``"2004/05"``
             - ``register`` (str): Disease register name (normalised)
             - ``registered_patients`` (float): Patients on the register at
               National Prevalence Day (NPD)
             - ``prevalence_per_1000`` (float): Prevalence per 1,000 registered
               patients

             Data spans 2004/05 to the latest published year.
   :rtype: pandas.DataFrame

   :raises NISRADataNotFoundError: If the workbook cannot be located or downloaded.
   :raises NISRAValidationError: If the parsed data fails validation.

   .. rubric:: Example

   >>> df = get_latest_disease_prevalence()
   >>> df["register"].nunique() >= 14
   True
   >>> df["year"].min() <= 2004
   True
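
The interaction between the cache and ``force_refresh`` can be sketched as a freshness check. The example reads the documented cache period as a time-to-live of 24 × 365 = 8,760 hours and assumes freshness is judged by file modification time; both helper names are hypothetical and the module's actual caching mechanism may differ.

```python
import os
import time

CACHE_TTL_HOURS = 24 * 365  # 8,760 hours, i.e. one year


def is_cache_fresh(path, ttl_hours=CACHE_TTL_HOURS, now=None):
    """Return True if path exists and was modified within ttl_hours."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(path)
    return age_seconds < ttl_hours * 3600


def should_download(path, force_refresh=False):
    """Download when forced or when no fresh cached copy is available."""
    return force_refresh or not is_cache_fresh(path)
```

With a cache this long-lived, ``force_refresh=True`` is the practical way to pick up a newly published workbook.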


.. py:function:: validate_disease_prevalence(df, level = 'ni')

   Validate the disease prevalence DataFrame for internal consistency.

   Checks that required columns are present, the DataFrame is non-empty,
   has sufficient temporal coverage, and that key numeric fields are within
   plausible bounds.

   :param df: DataFrame as returned by :func:`parse_ni_summary`,
              :func:`get_latest_disease_prevalence`, :func:`parse_all_gp_practices`,
              or :func:`get_latest_gp_prevalence`.
   :param level: Validation mode: ``"ni"`` (default) for NI-level summary
                 DataFrames or ``"gp"`` for GP-practice-level DataFrames.  The
                 ``"ni"`` default preserves backward compatibility.

   :returns: True if all checks pass.

   :raises NISRAValidationError: If any check fails; the message describes the
       first failing check.
   :raises ValueError: If *level* is not ``"ni"`` or ``"gp"``.

   .. rubric:: Example

   >>> import pandas as pd
   >>> df = pd.DataFrame({
   ...     "year": [2004], "financial_year": ["2004/05"],
   ...     "register": ["Hypertension"],
   ...     "registered_patients": [184824.0],
   ...     "prevalence_per_1000": [102.9],
   ... })
   >>> validate_disease_prevalence(df)
   True
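
The fail-fast behaviour of ``validate_disease_prevalence`` (raise on the first failing check, return ``True`` otherwise) can be sketched over a list of dict rows. Everything here is an assumption made for illustration: the ``ValidationError`` stand-in, the per-level required columns and the 0 to 1,000 bound are not taken from the module.

```python
REQUIRED = {
    "ni": {"year", "financial_year", "register",
           "registered_patients", "prevalence_per_1000"},
    "gp": {"practice_code", "year", "financial_year", "register",
           "registered_patients", "prevalence_per_1000"},
}


class ValidationError(Exception):
    """Stand-in for NISRAValidationError in this sketch."""


def validate_rows(rows, level="ni"):
    """Run ordered checks over a list of dict rows; raise on the first failure."""
    if level not in REQUIRED:
        raise ValueError(f"level must be 'ni' or 'gp', got {level!r}")
    if not rows:
        raise ValidationError("no rows")
    missing = REQUIRED[level] - set(rows[0])
    if missing:
        raise ValidationError(f"missing columns: {sorted(missing)}")
    for row in rows:
        rate = row["prevalence_per_1000"]
        # NaN compares False both ways, so missing values pass the bounds check.
        if rate < 0 or rate > 1000:
            raise ValidationError(f"implausible prevalence: {rate}")
    return True
```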


.. py:function:: parse_gp_practice_lookup(file_path, sheet_name = None)

   Parse Table 4 (GP practice details) into a lookup DataFrame.

   Table 4 is a single sheet in the workbook that provides the canonical
   practice name, address, and postcode for every GP practice code.  It
   is used to populate the ``practice_name`` column that is absent from
   the Table 5 prevalence sheets.

   The sheet header row is at row index 7 (0-based).  Data rows start
   immediately below and continue until the last row; all ``Practice ID``
   values start with ``"Z"``.

   :param file_path: Path to the downloaded ``.xlsx`` workbook.
   :param sheet_name: Sheet name override.  If ``None`` (default), uses the
                      constant ``_SHEET_TABLE4`` (``"Table 4 GP practice details"``).

   :returns: DataFrame with the following columns:

             - ``practice_code`` (str): Practice identifier (e.g. ``"Z00001"``)
             - ``practice_name`` (str): Practice name (e.g. ``"Dr. IRWIN"``)
             - ``address`` (str or None): Full formatted address (Address1 /
               Address2 / Address3 joined, NaN parts dropped)
             - ``postcode`` (str or None): Postcode (e.g. ``"BT4 1NS"``)
   :rtype: pandas.DataFrame

   :raises NISRADataNotFoundError: If the sheet cannot be found in the workbook.

   .. rubric:: Example

   >>> lkp = parse_gp_practice_lookup("/tmp/rdptd-tables-2026.xlsx")
   >>> "practice_code" in lkp.columns
   True
   >>> lkp["practice_code"].str.startswith("Z").all()
   True
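
The ``address`` column is described as the three address parts joined with missing parts dropped. A minimal sketch of that step follows; the ``join_address`` name, the ``", "`` separator and the sample address are assumptions, since the source does not state the separator used.

```python
import math


def join_address(*parts, sep=", "):
    """Join address parts, dropping None, NaN and blank strings."""
    kept = []
    for part in parts:
        if part is None:
            continue
        if isinstance(part, float) and math.isnan(part):
            continue
        text = str(part).strip()
        if text:
            kept.append(text)
    # An address with no usable parts is reported as None, not "".
    return sep.join(kept) or None


# Hypothetical address with a missing middle part.
address = join_address("1 Example Street", float("nan"), "Belfast")
```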


.. py:function:: parse_gp_practice(file_path, sheet_name)

   Parse a single Table 5 sheet into a long-format GP practice DataFrame.

   Reads the compound 2-row header (rows 4–5) to identify register columns,
   then extracts one row per (practice, register) pair.  Practice codes are
   in column 1 and always start with "Z"; the "Northern Ireland" summary row
   at the foot of each sheet is excluded.

   The sheet names in the workbook follow the pattern
   ``"Table 5X Prevalence YYYY"`` (e.g. ``"Table 5a Prevalence 2026"``),
   from which the financial year is derived automatically.

   :param file_path: Path to the downloaded ``.xlsx`` workbook (string or
                     path-like).
   :param sheet_name: Exact name of the Table 5 sheet to parse, e.g.
                      ``"Table 5a Prevalence 2026"``.

   :returns: Long-format DataFrame with the following columns:

             - ``practice_code`` (str): Practice identifier (e.g. ``"Z00001"``)
             - ``practice_name`` (object): ``None``; practice names are sourced
               from Table 4 and are joined in :func:`parse_all_gp_practices`
               rather than at the individual sheet level
             - ``lcg`` (str or None): Local Commissioning Group name
             - ``federation`` (str or None): Federation name; ``None`` for years
               before 2017/18 when the column was not present
             - ``financial_year`` (str): Label such as ``"2025/26"``
             - ``year`` (int): Start year of the financial year (e.g. ``2025``)
             - ``register`` (str): Normalised disease register name
             - ``registered_patients`` (float): Patients on the register at
               National Prevalence Day (NPD); ``NaN`` if not available for that
               register in that year
             - ``prevalence_per_1000`` (float): Prevalence per 1,000 registered
               patients; ``NaN`` if not available
   :rtype: pandas.DataFrame

   :raises NISRADataNotFoundError: If the sheet is not found or the expected
       column blocks cannot be located.

   .. rubric:: Example

   >>> df = parse_gp_practice("/tmp/rdptd-tables-2026.xlsx",
   ...                        "Table 5a Prevalence 2026")
   >>> df["practice_code"].str.startswith("Z").all()
   True
   >>> "Hypertension" in df["register"].values
   True
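
Deriving the financial year from a sheet name such as ``"Table 5a Prevalence 2026"`` can be sketched with a regular expression. The sketch assumes that the four-digit year in the sheet name is the calendar year in which the financial year ends, which matches the ``"2025/26"`` example but is not stated outright; the pattern and function name are hypothetical.

```python
import re

SHEET_PATTERN = re.compile(r"^Table 5([a-z]) Prevalence (\d{4})$")


def financial_year_from_sheet(sheet_name):
    """Return (financial_year, start_year) for a Table 5 sheet name.

    Assumes the four-digit year in the name is the calendar year in which
    the financial year ends.
    """
    match = SHEET_PATTERN.match(sheet_name.strip())
    if match is None:
        raise ValueError(f"not a Table 5 sheet name: {sheet_name!r}")
    end_year = int(match.group(2))
    start_year = end_year - 1
    # Two-digit suffix, zero-padded so that e.g. 2009/10 and 1999/00 format correctly.
    return f"{start_year}/{end_year % 100:02d}", start_year


label, year = financial_year_from_sheet("Table 5a Prevalence 2026")
```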


.. py:function:: parse_all_gp_practices(file_path)

   Parse all Table 5 sheets and return a concatenated long-format DataFrame.

   Iterates every sheet whose name starts with ``"Table 5"`` in the workbook,
   calls :func:`parse_gp_practice` for each, and concatenates the results.
   Financial year and start year are derived from each sheet's name.

   Sheets that cannot be parsed (e.g. due to unexpected structure) are
   skipped with a warning rather than raising an exception, so a partial
   result is returned as long as at least one sheet parses successfully.

   :param file_path: Path to the downloaded ``.xlsx`` workbook (string or
                     path-like).

   :returns: Long-format DataFrame with the columns ``practice_code``,
             ``practice_name``, ``lcg``, ``federation``, ``financial_year``,
             ``year``, ``register``, ``registered_patients`` and
             ``prevalence_per_1000``.

             Rows are sorted by ``financial_year``, ``practice_code``,
             then ``register``.
   :rtype: pandas.DataFrame

   :raises NISRADataNotFoundError: If no Table 5 sheets can be found or parsed.

   .. rubric:: Example

   >>> df = parse_all_gp_practices("/tmp/rdptd-tables-2026.xlsx")
   >>> df["financial_year"].nunique() >= 17
   True
   >>> df["practice_code"].nunique() >= 300
   True
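
The skip-with-warning behaviour of ``parse_all_gp_practices`` follows a common pattern: try each sheet, log and continue on failure, and raise only if nothing was parsed. The sketch below shows that pattern with a stand-in parser. ``parse_sheets``, ``fake_parse``, the ``LookupError`` and the second sheet name are assumptions for illustration, not the module's code.

```python
import logging

logger = logging.getLogger("disease_prevalence_sketch")


def parse_sheets(sheet_names, parse_one):
    """Parse every 'Table 5' sheet, skipping failures; raise if none succeed."""
    results = []
    for name in sheet_names:
        if not name.startswith("Table 5"):
            continue
        try:
            results.extend(parse_one(name))
        except Exception as exc:  # broad on purpose: one bad sheet must not abort
            logger.warning("Skipping %s: %s", name, exc)
    if not results:
        raise LookupError("no Table 5 sheets could be parsed")
    return results


def fake_parse(name):
    """Hypothetical stand-in for the per-sheet parser."""
    if "2025" in name:
        raise ValueError("unexpected structure")
    return [{"sheet": name}]


rows = parse_sheets(
    ["Table 1", "Table 5a Prevalence 2026", "Table 5b Prevalence 2025"],
    fake_parse,
)
```

Because failed sheets are dropped silently apart from the log message, callers who need complete coverage may want to compare the number of distinct ``financial_year`` values against the number they expect.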


.. py:function:: get_latest_gp_prevalence(force_refresh = False)

   Fetch and return the latest GP-practice-level disease prevalence data.

   Downloads the current Excel workbook from the Department of Health
   website (cached locally for 24 × 365 hours, i.e. one year), parses all
   Table 5 sheets (one per financial year), validates the result, and returns
   a clean long-format DataFrame covering 2009/10 to the latest published year.

   :param force_refresh: If True, bypass the local file cache and re-download
                         the workbook.  Default: False.

   :returns: Long-format DataFrame with the following columns:

             - ``practice_code`` (str): GP practice identifier (e.g. ``"Z00001"``)
             - ``practice_name`` (str or None): Practice name from Table 4 lookup
               (e.g. ``"Dr. IRWIN"``); ``None`` if the practice code is not found
               in Table 4 (e.g. historical practices no longer listed)
             - ``lcg`` (str or None): Local Commissioning Group
             - ``federation`` (str or None): Federation name (``None`` pre-2017/18)
             - ``financial_year`` (str): Label such as ``"2025/26"``
             - ``year`` (int): Start year of the financial year
             - ``register`` (str): Disease register name (normalised)
             - ``registered_patients`` (float): Patients on the register at
               National Prevalence Day (NPD)
             - ``prevalence_per_1000`` (float): Prevalence per 1,000 registered
               patients

             Data spans 2009/10 to the latest published year (~17 financial years)
             with ~305–360 GP practices per year.
   :rtype: pandas.DataFrame

   :raises NISRADataNotFoundError: If the workbook cannot be located or downloaded.
   :raises NISRAValidationError: If the parsed data fails validation.

   .. rubric:: Example

   >>> df = get_latest_gp_prevalence()
   >>> df["practice_code"].str.startswith("Z").all()
   True
   >>> df["financial_year"].nunique() >= 17
   True
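
The ``practice_name`` column behaves like a left join against the Table 4 lookup: codes missing from the lookup keep their rows but get ``None``. A dependency-free sketch of that join follows. The ``attach_practice_names`` helper and the ``Z00999`` code are hypothetical, and the module itself works on DataFrames.

```python
def attach_practice_names(rows, lookup):
    """Left-join practice names onto rows; unmatched codes get None."""
    names = {entry["practice_code"]: entry["practice_name"] for entry in lookup}
    return [
        {**row, "practice_name": names.get(row["practice_code"])}
        for row in rows
    ]


# Hypothetical data: Z00999 stands for a practice absent from Table 4.
lookup = [{"practice_code": "Z00001", "practice_name": "Dr. IRWIN"}]
rows = [
    {"practice_code": "Z00001", "register": "Hypertension"},
    {"practice_code": "Z00999", "register": "Hypertension"},
]
joined = attach_practice_names(rows, lookup)
```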