bolster.data_sources.nisra.disease_prevalence
NISRA Raw Disease Prevalence Module.
Provides access to Northern Ireland’s raw disease prevalence statistics, published by the Department of Health. The data originate from General Practice clinical disease registers maintained under the Quality & Outcomes Framework (QOF) and are released once per year, after National Prevalence Day.
- Data Coverage:
Financial years 2004/05 to 2025/26 (22 years, extended annually)
NI-level summary: registered patients per disease register (Table 1) and prevalence per 1,000 patients (Table 2a)
GP practice-level: same metrics per practice (Table 5a–5q), 2009/10–2025/26
26 named disease registers; 14 are active as of 2025/26
~305–360 GP practices per year
- Data Source:
Department of Health Northern Ireland publishes an Excel workbook via https://www.health-ni.gov.uk/articles/prevalence-statistics. The landing page links to a publications page which hosts the Excel file.
- Update Frequency:
Annual, approximately May of the following calendar year.
Example
>>> from bolster.data_sources.nisra import disease_prevalence as dp
>>> df = dp.get_latest_disease_prevalence()
>>> 'registered_patients' in df.columns
True
>>> 'prevalence_per_1000' in df.columns
True
- Publication Details:
Frequency: Annual
Published by: Department of Health / NISRA
Source: https://www.health-ni.gov.uk/articles/prevalence-statistics
Attributes
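Functions
get_latest_publication_url() | Return the URL of the most recent disease prevalence Excel workbook.
parse_ni_summary(file_path) | Parse Table 1 and Table 2 from the disease prevalence Excel workbook.
get_latest_disease_prevalence(force_refresh=False) | Fetch and return the latest NI disease prevalence data.
validate_disease_prevalence(df, level='ni') | Validate the disease prevalence DataFrame for internal consistency.
parse_gp_practice_lookup(file_path, sheet_name=None) | Parse Table 4 (GP practice details) into a lookup DataFrame.
parse_gp_practice(file_path, sheet_name) | Parse a single Table 5 sheet into a long-format GP practice DataFrame.
parse_all_gp_practices(file_path) | Parse all Table 5 sheets and return a concatenated long-format DataFrame.
get_latest_gp_prevalence(force_refresh=False) | Fetch and return the latest GP-practice-level disease prevalence data.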
Module Contents
- bolster.data_sources.nisra.disease_prevalence.DOH_LANDING_PAGE = 'https://www.health-ni.gov.uk/articles/prevalence-statistics'[source]
- bolster.data_sources.nisra.disease_prevalence.DOH_BASE_URL = 'https://www.health-ni.gov.uk'[source]
- bolster.data_sources.nisra.disease_prevalence.get_latest_publication_url()[source]
Return the URL of the most recent disease prevalence Excel workbook.
Scrapes the Department of Health landing page, follows the first link to a publications page, and returns the .xlsx download URL found there.
- Returns:
Absolute URL of the latest Excel workbook.
- Raises:
NISRADataNotFoundError – If the Excel link cannot be located.
- Return type:
Example
>>> url = get_latest_publication_url()
>>> url.endswith(".xlsx")
True
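The link-discovery step described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the helper names `XlsxLinkFinder` and `first_xlsx_url` and the sample HTML fragment are hypothetical, and the real function also follows the landing page to a publications page before searching for the workbook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

DOH_BASE_URL = "https://www.health-ni.gov.uk"


class XlsxLinkFinder(HTMLParser):
    """Collect href values of <a> tags that point at .xlsx files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".xlsx"):
            self.links.append(href)


def first_xlsx_url(html, base_url=DOH_BASE_URL):
    """Return the absolute URL of the first .xlsx link, or None if absent."""
    finder = XlsxLinkFinder()
    finder.feed(html)
    if not finder.links:
        return None
    # urljoin turns a site-relative href into an absolute URL.
    return urljoin(base_url, finder.links[0])


# Made-up page fragment, for illustration only.
sample_html = (
    '<p><a href="/about">About</a> '
    '<a href="/sites/default/files/example.xlsx">Tables</a></p>'
)
print(first_xlsx_url(sample_html))
```

Where this sketch returns None, the documented function raises NISRADataNotFoundError instead.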
- bolster.data_sources.nisra.disease_prevalence.parse_ni_summary(file_path)[source]
Parse Table 1 and Table 2 from the disease prevalence Excel workbook.
Reads both NI-level summary sheets and returns a single merged long-format DataFrame with one row per (financial_year, register) combination.
- Parameters:
file_path – Path-like object or string pointing to the downloaded .xlsx workbook.
- Returns:
DataFrame with columns:
year (int): Start year of the financial year (e.g. 2004 for “2004/05”)
financial_year (str): Financial year label (e.g. “2004/05”)
register (str): Normalised disease register name
registered_patients (float): Number of patients on the register at National Prevalence Day (NaN if not available for that year)
prevalence_per_1000 (float): Prevalence per 1,000 registered patients (NaN if not available for that year)
Rows are sorted by register, then year.
- Return type:
pandas.DataFrame
- Raises:
NISRADataNotFoundError – If the expected sheets are not found.
NISRAValidationError – If the parsed data fails basic sanity checks.
Example
>>> df = parse_ni_summary("/tmp/rdptd-tables-2026.xlsx")
>>> set(df.columns) >= {"year", "financial_year", "register",
...                     "registered_patients", "prevalence_per_1000"}
True
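The relationship between the `financial_year` label and the integer `year` column can be illustrated as follows. The helper `financial_year_to_start_year` is a hypothetical name used only for this sketch; the module's own parsing may differ.

```python
import re

# A label such as "2004/05": four-digit start year, two-digit end year.
_FY_PATTERN = re.compile(r"^(\d{4})/(\d{2})$")


def financial_year_to_start_year(label):
    """Return the start year of a financial-year label like "2004/05"."""
    match = _FY_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"not a financial year label: {label!r}")
    start_year = int(match.group(1))
    end_suffix = int(match.group(2))
    # The two-digit suffix must be the year after the start year.
    if (start_year + 1) % 100 != end_suffix:
        raise ValueError(f"inconsistent financial year label: {label!r}")
    return start_year


print(financial_year_to_start_year("2004/05"))
```

Checking the two-digit suffix against the start year catches labels such as "2004/06" that a plain split on "/" would accept silently.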
- bolster.data_sources.nisra.disease_prevalence.get_latest_disease_prevalence(force_refresh=False)[source]
Fetch and return the latest NI disease prevalence data.
Downloads the current Excel workbook from the Department of Health website (cached locally for 24 × 365 hours, i.e. 365 days), parses both NI-level summary tables, validates the result, and returns a clean long-format DataFrame.
- Parameters:
force_refresh (bool) – If True, bypass the local file cache and re-download the workbook. Default: False.
- Returns:
DataFrame with columns:
year (int): Start year of the financial year
financial_year (str): Label such as “2004/05”
register (str): Disease register name (normalised)
registered_patients (float): Patients on register at NPD
prevalence_per_1000 (float): Prevalence per 1,000 registered pts
Data spans 2004/05 to the latest published year.
- Return type:
pandas.DataFrame
- Raises:
NISRADataNotFoundError – If the workbook cannot be located or downloaded.
NISRAValidationError – If the parsed data fails validation.
Example
>>> df = get_latest_disease_prevalence()
>>> df["register"].nunique() >= 14
True
>>> df["year"].min() <= 2004
True
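How `force_refresh` interacts with a time-limited file cache can be sketched as below. This is an assumption-laden illustration: the names `CACHE_TTL_HOURS`, `is_cache_fresh`, and `should_download` are hypothetical, and judging freshness by file modification time is a guess at the mechanism rather than a description of the module's internals.

```python
import time
from pathlib import Path

# Cache lifetime taken from the description above: 24 x 365 hours.
CACHE_TTL_HOURS = 24 * 365


def is_cache_fresh(path, ttl_hours=CACHE_TTL_HOURS, now=None):
    """Return True if the cached file exists and is younger than the TTL."""
    path = Path(path)
    if not path.exists():
        return False
    now = time.time() if now is None else now
    age_seconds = now - path.stat().st_mtime
    return age_seconds < ttl_hours * 3600


def should_download(path, force_refresh=False):
    """Download when forced, or when no fresh cached copy is available."""
    return force_refresh or not is_cache_fresh(path)
```

With such a long lifetime, passing `force_refresh=True` is the practical way to pick up a new annual release soon after publication.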
- bolster.data_sources.nisra.disease_prevalence.validate_disease_prevalence(df, level='ni')[source]
Validate the disease prevalence DataFrame for internal consistency.
Checks that required columns are present, the DataFrame is non-empty, has sufficient temporal coverage, and that key numeric fields are within plausible bounds.
- Parameters:
df (pandas.DataFrame) – DataFrame as returned by parse_ni_summary(), get_latest_disease_prevalence(), parse_all_gp_practices(), or get_latest_gp_prevalence().
level (str) – Validation mode: "ni" (default) for NI-level summary DataFrames or "gp" for GP-practice-level DataFrames. "ni" is the backward-compatible default.
- Returns:
True if all checks pass.
- Raises:
NISRAValidationError – Describing the first failing check.
ValueError – If level is not "ni" or "gp".
- Return type:
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "year": [2004], "financial_year": ["2004/05"],
...     "register": ["Hypertension"],
...     "registered_patients": [184824.0],
...     "prevalence_per_1000": [102.9],
... })
>>> validate_disease_prevalence(df)
True
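The kinds of checks listed above (required columns, non-empty input, temporal coverage, plausible bounds) can be sketched on plain dictionaries. Everything here is a simplified stand-in: `check_rows`, `ValidationError`, the `min_years` threshold, and the exact column sets per level are assumptions for illustration, and the real function operates on a pandas DataFrame with its own thresholds.

```python
import math

NI_COLUMNS = {"year", "financial_year", "register",
              "registered_patients", "prevalence_per_1000"}
# Assumed column sets per level; the GP level adds a practice identifier.
REQUIRED_COLUMNS = {"ni": NI_COLUMNS, "gp": NI_COLUMNS | {"practice_code"}}


class ValidationError(Exception):
    """Local stand-in for NISRAValidationError (hypothetical)."""


def check_rows(rows, level="ni", min_years=1):
    """Return True if rows pass the checks; raise on the first failure."""
    if level not in REQUIRED_COLUMNS:
        raise ValueError(f'level must be "ni" or "gp", got {level!r}')
    if not rows:
        raise ValidationError("no rows to validate")
    missing = REQUIRED_COLUMNS[level] - set(rows[0])
    if missing:
        raise ValidationError(f"missing columns: {sorted(missing)}")
    if len({row["year"] for row in rows}) < min_years:
        raise ValidationError("insufficient temporal coverage")
    for row in rows:
        value = row["prevalence_per_1000"]
        if value is None or math.isnan(value):
            continue  # missing values are permitted
        # A rate per 1,000 patients cannot fall outside 0..1000.
        if not 0 <= value <= 1000:
            raise ValidationError(f"implausible prevalence: {value}")
    return True


row = {"year": 2004, "financial_year": "2004/05", "register": "Hypertension",
       "registered_patients": 184824.0, "prevalence_per_1000": 102.9}
print(check_rows([row]))
```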
- bolster.data_sources.nisra.disease_prevalence.parse_gp_practice_lookup(file_path, sheet_name=None)[source]
Parse Table 4 (GP practice details) into a lookup DataFrame.
Table 4 is a single sheet in the workbook that provides the canonical practice name, address, and postcode for every GP practice code. It is used to populate the practice_name column that is absent from the Table 5 prevalence sheets.
The sheet header row is at row index 7 (0-based). Data rows start immediately below and continue until the last row; all Practice ID values start with "Z".
- Parameters:
file_path – Path to the downloaded .xlsx workbook.
sheet_name (str | None) – Sheet name override. If None (default), uses the constant _SHEET_TABLE4 ("Table 4 GP practice details").
- Returns:
DataFrame with columns:
practice_code (str): Practice identifier (e.g. "Z00001")
practice_name (str): Practice name (e.g. "Dr. IRWIN")
address (str or None): Full formatted address (Address1 / Address2 / Address3 joined, NaN parts dropped)
postcode (str or None): Postcode (e.g. "BT4 1NS")
- Return type:
pandas.DataFrame
- Raises:
NISRADataNotFoundError – If the sheet cannot be found in the workbook.
Example
>>> lkp = parse_gp_practice_lookup("/tmp/rdptd-tables-2026.xlsx")
>>> "practice_code" in lkp.columns
True
>>> lkp["practice_code"].str.startswith("Z").all()
True
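The address column is described as the three address parts joined with missing parts dropped. A minimal sketch of that joining rule follows; the helper name `join_address` and the ", " separator are assumptions, since the source does not state which separator is used.

```python
import math


def join_address(*parts, sep=", "):
    """Join address parts, skipping None, NaN, and blank entries."""
    kept = []
    for part in parts:
        if part is None:
            continue
        if isinstance(part, float) and math.isnan(part):
            continue  # NaN is how an empty spreadsheet cell usually arrives
        text = str(part).strip()
        if text:
            kept.append(text)
    # Return None rather than an empty string when nothing survives.
    return sep.join(kept) if kept else None


# Made-up address parts, for illustration only.
print(join_address("1 Example Road", float("nan"), "Belfast"))
```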
- bolster.data_sources.nisra.disease_prevalence.parse_gp_practice(file_path, sheet_name)[source]
Parse a single Table 5 sheet into a long-format GP practice DataFrame.
Reads the compound 2-row header (rows 4–5) to identify register columns, then extracts one row per (practice, register) pair. Practice codes are in column 1 and always start with “Z”; the “Northern Ireland” summary row at the foot of each sheet is excluded.
The sheet names in the workbook follow the pattern "Table 5X Prevalence YYYY" (e.g. "Table 5a Prevalence 2026"), from which the financial year is derived automatically.
- Parameters:
file_path – Path to the downloaded .xlsx workbook (string or path-like).
sheet_name (str) – Exact name of the Table 5 sheet to parse, e.g. "Table 5a Prevalence 2026".
- Returns:
Long-format DataFrame with columns:
practice_code (str): Practice identifier (e.g. "Z00001")
practice_name (object): None; practice names are sourced from Table 4 and are joined in parse_all_gp_practices() rather than at the individual sheet level
lcg (str or None): Local Commissioning Group name
federation (str or None): Federation name; None for years before 2017/18 when the column was not present
financial_year (str): Label such as "2025/26"
year (int): Start year of the financial year (e.g. 2025)
register (str): Normalised disease register name
registered_patients (float): Patients on register at NPD; NaN if not available for that register in that year
prevalence_per_1000 (float): Prevalence per 1,000 registered patients; NaN if not available
- Return type:
pandas.DataFrame
- Raises:
NISRADataNotFoundError – If the sheet is not found or the expected column blocks cannot be located.
Example
>>> df = parse_gp_practice("/tmp/rdptd-tables-2026.xlsx",
...                        "Table 5a Prevalence 2026")
>>> df["practice_code"].str.startswith("Z").all()
True
>>> "Hypertension" in df["register"].values
True
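Deriving the financial year from a sheet name might look like the sketch below. It assumes that the trailing YYYY is the calendar year in which the financial year ends, which matches "Table 5a Prevalence 2026" sitting alongside a latest year of 2025/26 but is not stated outright in the source. The helper name `financial_year_from_sheet` is hypothetical.

```python
import re

# Matches the documented pattern "Table 5X Prevalence YYYY".
_SHEET_PATTERN = re.compile(r"^Table 5([a-z]) Prevalence (\d{4})$")


def financial_year_from_sheet(sheet_name):
    """Return (start_year, label) for a Table 5 sheet name."""
    match = _SHEET_PATTERN.match(sheet_name.strip())
    if match is None:
        raise ValueError(f"unexpected sheet name: {sheet_name!r}")
    # Assumption: the year in the sheet name is the END of the financial year.
    end_year = int(match.group(2))
    start_year = end_year - 1
    label = f"{start_year}/{end_year % 100:02d}"
    return start_year, label


print(financial_year_from_sheet("Table 5a Prevalence 2026"))
```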
- bolster.data_sources.nisra.disease_prevalence.parse_all_gp_practices(file_path)[source]
Parse all Table 5 sheets and return a concatenated long-format DataFrame.
Iterates every sheet whose name starts with "Table 5" in the workbook, calls parse_gp_practice() for each, and concatenates the results. Financial year and start year are derived from each sheet’s name.
Sheets that cannot be parsed (e.g. due to unexpected structure) are skipped with a warning rather than raising an exception, so a partial result is returned whenever at least one sheet parses.
- Parameters:
file_path – Path to the downloaded .xlsx workbook (string or path-like).
- Returns:
Long-format DataFrame with columns practice_code, practice_name, lcg, federation, financial_year, year, register, registered_patients, prevalence_per_1000.
Rows are sorted by financial_year, practice_code, then register.
- Return type:
pandas.DataFrame
- Raises:
NISRADataNotFoundError – If no Table 5 sheets can be found or parsed.
Example
>>> df = parse_all_gp_practices("/tmp/rdptd-tables-2026.xlsx")
>>> df["financial_year"].nunique() >= 17
True
>>> df["practice_code"].nunique() >= 300
True
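The skip-with-a-warning behaviour can be sketched independently of Excel parsing. In this illustration `parse_all_sheets` and `fake_parser` are hypothetical names, rows are plain dictionaries rather than DataFrames, and LookupError stands in for NISRADataNotFoundError.

```python
import warnings


def parse_all_sheets(sheet_names, parse_one):
    """Parse every "Table 5" sheet, warning about and skipping failures."""
    results = []
    for name in sheet_names:
        if not name.startswith("Table 5"):
            continue
        try:
            results.extend(parse_one(name))
        except Exception as exc:
            # One malformed sheet should not discard the other years.
            warnings.warn(f"Skipping sheet {name!r}: {exc}")
    if not results:
        # Stand-in for NISRADataNotFoundError.
        raise LookupError("no Table 5 sheets could be parsed")
    return results


def fake_parser(name):
    """Toy parser: fails on sheets whose name contains 'broken'."""
    if "broken" in name:
        raise ValueError("unexpected structure")
    return [{"sheet": name}]
```

Calling `parse_all_sheets(["Table 1", "Table 5a ok", "Table 5b broken"], fake_parser)` would return one row for the good sheet and emit a single warning for the broken one.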
- bolster.data_sources.nisra.disease_prevalence.get_latest_gp_prevalence(force_refresh=False)[source]
Fetch and return the latest GP-practice-level disease prevalence data.
Downloads the current Excel workbook from the Department of Health website (cached locally for 24 × 365 hours, i.e. 365 days), parses all Table 5 sheets (one per financial year), validates the result, and returns a clean long-format DataFrame covering 2009/10 to the latest published year.
- Parameters:
force_refresh (bool) – If True, bypass the local file cache and re-download the workbook. Default: False.
- Returns:
Long-format DataFrame with columns:
practice_code (str): GP practice identifier (e.g. "Z00001")
practice_name (str or None): Practice name from Table 4 lookup (e.g. "Dr. IRWIN"); None if the practice code is not found in Table 4 (e.g. historical practices no longer listed)
lcg (str or None): Local Commissioning Group
federation (str or None): Federation name (None pre-2017/18)
financial_year (str): Label such as "2025/26"
year (int): Start year of the financial year
register (str): Disease register name (normalised)
registered_patients (float): Patients on register at NPD
prevalence_per_1000 (float): Prevalence per 1,000 registered pts
Data spans 2009/10 to the latest published year (~17 financial years) with ~305–360 GP practices per year.
- Return type:
pandas.DataFrame
- Raises:
NISRADataNotFoundError – If the workbook cannot be located or downloaded.
NISRAValidationError – If the parsed data fails validation.
Example
>>> df = get_latest_gp_prevalence()
>>> df["practice_code"].str.startswith("Z").all()
True
>>> df["financial_year"].nunique() >= 17
True
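The way practice_name is filled from the Table 4 lookup, with None for codes that are no longer listed, behaves like a left join. A minimal sketch on plain dictionaries follows; `attach_practice_names` is a hypothetical helper, and the code "Z99999" is made up to show the missing case.

```python
def attach_practice_names(rows, lookup):
    """Add practice_name to each row; None when the code is not in lookup."""
    return [
        {**row, "practice_name": lookup.get(row["practice_code"])}
        for row in rows
    ]


# Lookup keyed by practice code, as Table 4 would provide.
lookup = {"Z00001": "Dr. IRWIN"}
rows = [
    {"practice_code": "Z00001", "register": "Hypertension"},
    {"practice_code": "Z99999", "register": "Hypertension"},  # not in lookup
]
for row in attach_practice_names(rows, lookup):
    print(row["practice_code"], row["practice_name"])
```

Keeping unmatched rows, rather than dropping them, preserves prevalence figures for historical practices even when their names are unavailable.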