bolster.data_sources.psni
=========================

.. py:module:: bolster.data_sources.psni

.. autoapi-nested-parse::

   PSNI (Police Service of Northern Ireland) Data Sources.

   This module provides access to PSNI open data including:

   - Crime Statistics: Police recorded crime data at monthly resolution
     (April 2001 to December 2021)
   - Road Traffic Collisions: Injury collision, casualty, and vehicle data
   - Police Ombudsman: Complaint statistics from 2000/01 to present
   - PACE Statistics: Annual stop & search and arrests under the Police and
     Criminal Evidence (PACE) Order

   Data is sourced from OpenDataNI and the Police Ombudsman's Office under the
   Open Government Licence v3.0.

   Geographic breakdowns use the 11 Policing Districts, which align with
   Northern Ireland's Local Government Districts (LGDs), enabling integration
   with other NISRA datasets.

   .. rubric:: Example

   >>> from bolster.data_sources.psni import crime_statistics, road_traffic_collisions
   >>> df = crime_statistics.get_historical_crime_statistics()
   >>> 'lgd_code' in df.columns
   True

   >>> lgd_code = crime_statistics.get_lgd_code("Belfast City")
   >>> lgd_code
   'N09000003'

   >>> casualties = road_traffic_collisions.get_casualties()
   >>> 'severity' in casualties.columns
   True

   See individual module docstrings for detailed documentation.


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/bolster/data_sources/psni/crime_statistics/index
   /autoapi/bolster/data_sources/psni/pace/index
   /autoapi/bolster/data_sources/psni/police_ombudsman/index
   /autoapi/bolster/data_sources/psni/road_traffic_collisions/index
   /autoapi/bolster/data_sources/psni/stop_and_search/index


Exceptions
----------

.. autoapisummary::

   bolster.data_sources.psni.PSNIDataError
   bolster.data_sources.psni.PSNIDataNotFoundError
   bolster.data_sources.psni.PSNIDataStaleError
   bolster.data_sources.psni.PSNIValidationError


Functions
---------

.. autoapisummary::

   bolster.data_sources.psni.clear_cache
   bolster.data_sources.psni.filter_by_crime_type
   bolster.data_sources.psni.filter_by_date_range
   bolster.data_sources.psni.filter_by_district
   bolster.data_sources.psni.get_available_crime_types
   bolster.data_sources.psni.get_available_districts
   bolster.data_sources.psni.get_crime_trends
   bolster.data_sources.psni.get_data_source_info
   bolster.data_sources.psni.get_historical_crime_statistics
   bolster.data_sources.psni.get_latest_crime_statistics
   bolster.data_sources.psni.get_lgd_code
   bolster.data_sources.psni.get_nuts3_code
   bolster.data_sources.psni.get_nuts_region_name
   bolster.data_sources.psni.get_outcome_rates_by_district
   bolster.data_sources.psni.get_total_crimes_by_district
   bolster.data_sources.psni.parse_crime_statistics_file
   bolster.data_sources.psni.validate_crime_statistics
   bolster.data_sources.psni.get_annual_publication_url
   bolster.data_sources.psni.get_latest_complaints
   bolster.data_sources.psni.get_quarterly_publication_url
   bolster.data_sources.psni.parse_annual
   bolster.data_sources.psni.parse_quarterly
   bolster.data_sources.psni.validate_complaints
   bolster.data_sources.psni.get_rtc_annual_summary
   bolster.data_sources.psni.get_rtc_available_years
   bolster.data_sources.psni.get_casualties
   bolster.data_sources.psni.get_casualties_by_district
   bolster.data_sources.psni.get_casualties_by_road_user
   bolster.data_sources.psni.get_casualties_with_collision_details
   bolster.data_sources.psni.get_collisions
   bolster.data_sources.psni.get_vehicles
   bolster.data_sources.psni.validate_rtc_data


Package Contents
----------------

.. py:exception:: PSNIDataError

   Bases: :py:obj:`Exception`

   Base exception for PSNI data errors.

   All PSNI-specific exceptions inherit from this class, allowing callers
   to catch all PSNI errors with a single except clause.


.. py:exception:: PSNIDataNotFoundError

   Bases: :py:obj:`PSNIDataError`

   Raised when a PSNI data file cannot be downloaded or accessed.

   This exception is raised when:

   - Network requests fail (timeout, connection errors)
   - HTTP errors occur (404, 500, etc.)
   - The requested resource is unavailable


.. py:exception:: PSNIDataStaleError

   Bases: :py:obj:`PSNIDataError`

   Raised when a PSNI data source is known to be stale with no accessible update.

   This exception is raised when the underlying data source has not been
   updated and no machine-readable replacement is accessible (e.g. due to
   Cloudflare protection on the official PSNI website blocking automated
   downloads).


.. py:exception:: PSNIValidationError

   Bases: :py:obj:`PSNIDataError`

   Raised when PSNI data fails validation checks.

   This exception is raised when:

   - CSV structure doesn't match expected columns
   - Data contains invalid or unexpected values
   - Required fields are missing or malformed


.. py:function:: clear_cache(pattern = None)

   Clear cached files from the PSNI cache directory.

   :param pattern: Optional glob pattern to match specific files (e.g., ``*.csv``).
                   If None, clears all cached files in the directory.

   :returns: Number of files deleted

   .. rubric:: Example

   >>> from bolster.data_sources.psni import clear_cache
   >>> deleted = clear_cache("*.csv")
   >>> isinstance(deleted, int)
   True


.. py:function:: filter_by_crime_type(df, crime_type)

   Filter crime statistics to specific crime type(s).

   :param df: DataFrame from get_historical_crime_statistics
   :param crime_type: Crime type(s) to filter (e.g., "Burglary" or
                      ["Violence with injury", "Robbery"])

   :returns: Filtered DataFrame

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> violence = filter_by_crime_type(df, "Violence with injury (including homicide & death/serious injury by unlawful driving)")
   >>> len(violence) > 0
   True


.. py:function:: filter_by_date_range(df, start_date = None, end_date = None)

   Filter crime statistics to a date range.

   :param df: DataFrame from get_historical_crime_statistics
   :param start_date: Start date (inclusive), e.g., "2020-01-01" or datetime
   :param end_date: End date (inclusive), e.g., "2021-12-31" or datetime

   :returns: Filtered DataFrame

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> # Get 2020 data
   >>> df_2020 = filter_by_date_range(df, "2020-01-01", "2020-12-31")
   >>> df_2020['calendar_year'].unique().tolist()
   [2020]
   >>>
   >>> # Get data from 2018 onwards
   >>> recent = filter_by_date_range(df, start_date="2018-01-01")
   >>> len(recent) > 0
   True


.. py:function:: filter_by_district(df, district)

   Filter crime statistics to specific policing district(s).

   :param df: DataFrame from get_historical_crime_statistics
   :param district: District name(s) to filter (e.g., "Belfast City" or
                    ["Belfast City", "Derry City & Strabane"])

   :returns: Filtered DataFrame

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> belfast = filter_by_district(df, "Belfast City")
   >>> belfast['policing_district'].unique().tolist()
   ['Belfast City']
   >>>
   >>> # Multiple districts
   >>> cities = filter_by_district(df, ["Belfast City", "Derry City & Strabane"])
   >>> len(cities['policing_district'].unique()) == 2
   True


.. py:function:: get_available_crime_types(df)

   Get list of all crime types in the dataset.

   :param df: DataFrame from get_historical_crime_statistics

   :returns: Sorted list of crime type names

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> crime_types = get_available_crime_types(df)
   >>> isinstance(crime_types, list)
   True
   >>> 'Total police recorded crime' in crime_types
   True


.. py:function:: get_available_districts(df)

   Get list of all policing districts in the dataset.

   :param df: DataFrame from get_historical_crime_statistics

   :returns: Sorted list of district names

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> districts = get_available_districts(df)
   >>> isinstance(districts, list)
   True
   >>> 'Northern Ireland' in districts
   True


.. py:function:: get_crime_trends(df, crime_type = 'Total police recorded crime', district = 'Northern Ireland', measure = 'Police Recorded Crime')

   Get monthly crime trends for a specific crime type and district.

   :param df: DataFrame from get_historical_crime_statistics
   :param crime_type: Crime type to analyze (default: total crimes)
   :param district: Policing district (default: Northern Ireland total)
   :param measure: Data measure to use (default: Police Recorded Crime)

   :returns: DataFrame with columns date, calendar_year, month, count

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> trends = get_crime_trends(df, district="Belfast City")
   >>> sorted(trends.columns.tolist())
   ['calendar_year', 'count', 'date', 'month']
   >>> len(trends) > 0
   True


.. py:function:: get_data_source_info()

   Get information about crime statistics data sources.

   Returns a dictionary with URLs and contact information for accessing
   PSNI crime statistics. Use this when you need data beyond December 2021.

   :returns: Dictionary with keys:

      - opendatani_url: OpenDataNI dataset URL (data through Dec 2021)
      - data_guide_url: PDF data guide URL
      - psni_official_url: PSNI official statistics page (current data)
      - contact_email: PSNI Statistics Branch email
      - data_limitation: Description of OpenDataNI data limitations
      - last_update: Last known update date for OpenDataNI

   .. rubric:: Example

   >>> info = get_data_source_info()
   >>> sorted(info.keys())
   ['contact_email', 'data_guide_url', 'data_limitation', 'last_update', 'opendatani_url', 'psni_official_url']


.. py:function:: get_historical_crime_statistics(force_refresh = False, add_geographic_codes = True)

   Get historical police recorded crime statistics (April 2001 – December 2021).

   Downloads the crime statistics CSV from OpenDataNI. This dataset covers
   April 2001 through December 2021 and has not been updated since January 2022.
   For data from 2022 onwards, consult PSNI directly.

   :param force_refresh: If True, bypass cache and download fresh data
   :param add_geographic_codes: If True, add LGD and NUTS3 code columns

   :returns: DataFrame with columns date, calendar_year, month,
             policing_district, crime_type, data_measure, count, lgd_code,
             nuts3_code, nuts3_name

   :raises PSNIDataNotFoundError: If download fails
   :raises PSNIValidationError: If file structure is unexpected

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> sorted(df.columns.tolist())
   ['calendar_year', 'count', 'crime_type', 'data_measure', 'date', 'lgd_code', 'month', 'nuts3_code', 'nuts3_name', 'policing_district']
   >>> df['date'].max().year
   2021


.. py:function:: get_latest_crime_statistics(force_refresh = False, add_geographic_codes = True)

   Always raises PSNIDataStaleError; use get_historical_crime_statistics() instead.

   The OpenDataNI source was last updated in January 2022. PSNI's official site
   publishes current data but is Cloudflare-protected and inaccessible to
   automated downloads. Use ``get_historical_crime_statistics()`` to access
   the data that is available (Apr 2001–Dec 2021).

   :raises PSNIDataStaleError: Always, because this data source has no accessible update.


.. py:function:: get_lgd_code(district_name)

   Get LGD code for a policing district.

   :param district_name: Policing district name (e.g., "Belfast City")

   :returns: LGD code (e.g., "N09000003") or None if not found

   .. rubric:: Example

   >>> get_lgd_code("Belfast City")
   'N09000003'


.. py:function:: get_nuts3_code(district_name)

   Get NUTS3 regional code for a policing district.

   Uses NUTS 2021 classification where each LGD maps 1:1 to a NUTS3 region.
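The 1:1 district-to-NUTS3 lookup can be pictured as a pair of dictionaries. The following is only a minimal sketch, not the library's implementation: ``lookup_nuts3`` and ``lookup_region_name`` are hypothetical names, and the tables hold just the two districts quoted in the examples on this page.

```python
# Illustrative lookup tables. Only the two districts quoted in this page's
# examples are included, so this is NOT the library's full mapping.
DISTRICT_TO_NUTS3 = {
    "Belfast City": "UKN06",
    "Derry City & Strabane": "UKN0A",
}
NUTS3_NAMES = {
    "UKN06": "Belfast",
    "UKN0A": "Derry City and Strabane",
}


def lookup_nuts3(district_name):
    """Return the NUTS3 code for a district, or None if it is unknown."""
    return DISTRICT_TO_NUTS3.get(district_name)


def lookup_region_name(nuts3_code):
    """Return the region name for a NUTS3 code, or None if it is unknown."""
    return NUTS3_NAMES.get(nuts3_code)


code = lookup_nuts3("Belfast City")
print(code, lookup_region_name(code))
print(lookup_nuts3("Not a district"))  # unknown names fall through to None
```

Returning ``None`` for unknown names, rather than raising, mirrors the "or None if not found" behaviour documented for the real functions.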
   :param district_name: Policing district name (e.g., "Belfast City")

   :returns: NUTS3 code (e.g., "UKN06") or None if not found

   .. rubric:: Example

   >>> get_nuts3_code("Belfast City")
   'UKN06'
   >>> get_nuts3_code("Derry City & Strabane")
   'UKN0A'


.. py:function:: get_nuts_region_name(nuts3_code)

   Get descriptive name for a NUTS3 region code.

   :param nuts3_code: NUTS3 code (e.g., "UKN06")

   :returns: Region name (e.g., "Belfast") or None if not found

   .. rubric:: Example

   >>> get_nuts_region_name("UKN06")
   'Belfast'
   >>> get_nuts_region_name("UKN0A")
   'Derry City and Strabane'


.. py:function:: get_outcome_rates_by_district(df, year = None, crime_type = 'Total police recorded crime')

   Calculate crime outcome rates by policing district.

   Outcome rate represents the percentage of crimes with an outcome
   (charge, caution, community resolution, etc.).

   :param df: DataFrame from get_historical_crime_statistics
   :param year: Optional year to filter (uses all years if None)
   :param crime_type: Crime type to analyze (default: total crimes)

   :returns: DataFrame with columns policing_district, lgd_code, average_outcome_rate

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> outcomes = get_outcome_rates_by_district(df, year=2021)
   >>> 'average_outcome_rate' in outcomes.columns
   True


.. py:function:: get_total_crimes_by_district(df, year = None)

   Calculate total recorded crimes by policing district.

   :param df: DataFrame from get_historical_crime_statistics
   :param year: Optional year to filter (uses all years if None)

   :returns: DataFrame with columns policing_district, lgd_code, nuts3_code, total_crimes

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> totals_2021 = get_total_crimes_by_district(df, year=2021)
   >>> sorted(totals_2021.columns.tolist())
   ['lgd_code', 'nuts3_code', 'policing_district', 'total_crimes']


.. py:function:: parse_crime_statistics_file(file_path, add_geographic_codes = True)

   Parse PSNI crime statistics CSV file.
   The file is in long format with columns for year, month, district, crime type,
   data measure, and count. This function reads the CSV, cleans column names,
   adds date parsing, and optionally adds LGD and NUTS3 geographic codes for
   cross-dataset integration.

   :param file_path: Path to the crime statistics CSV file
   :param add_geographic_codes: If True, add LGD and NUTS3 code columns

   :returns: DataFrame with columns:

      - calendar_year: int (year of crime)
      - month: str (month name: Apr, May, ..., Dec)
      - policing_district: str (district name or "Northern Ireland")
      - crime_type: str (Home Office crime classification)
      - data_measure: str (type of measure: crime count, outcome number, outcome rate)
      - count: float (value, either a count or a percentage)
      - date: datetime (first day of month)
      - lgd_code: str (ONS LGD code, if add_geographic_codes=True)
      - nuts3_code: str (NUTS3 region code, if add_geographic_codes=True)
      - nuts3_name: str (NUTS3 region name, if add_geographic_codes=True)

   :raises PSNIValidationError: If file structure is unexpected

   .. rubric:: Example

   >>> path = download_file(CRIME_STATISTICS_URL, cache_ttl_hours=24*7)
   >>> df = parse_crime_statistics_file(path)
   >>> 'crime_type' in df.columns
   True
   >>> len(df) > 0
   True


.. py:function:: validate_crime_statistics(df)

   Validate crime statistics data integrity.

   Performs sanity checks on the crime statistics data:

   - Non-negative crime counts
   - Reasonable date ranges
   - Expected policing districts present
   - No unexpected missing data

   :param df: DataFrame from parse_crime_statistics_file or get_historical_crime_statistics

   :returns: True if validation passes

   :raises PSNIValidationError: If validation fails

   .. rubric:: Example

   >>> df = get_historical_crime_statistics()
   >>> validate_crime_statistics(df)
   True


.. py:function:: get_annual_publication_url()

   Scrape the complaint-statistics page for the latest .xlsx download link.
   policeombudsman.org returns 403 to default User-Agents; this function uses a
   browser-like UA via ``bolster.utils.web.session``.

   :returns: Absolute URL of the latest annual Excel spreadsheet.

   :raises PSNIDataNotFoundError: If the page cannot be retrieved or no .xlsx link is found.

   .. rubric:: Example

   >>> url = get_annual_publication_url()
   >>> url.startswith("https://")
   True


.. py:function:: get_latest_complaints(breakdown = 'totals', force_refresh = False)

   Download and return the latest Police Ombudsman complaint data.

   For ``totals``, ``by_district``, ``by_allegation_type``, and ``by_outcome``
   the annual publication is used (richest historical coverage). For
   ``quarterly`` the latest quarterly bulletin is used.

   :param breakdown: One of:

      - ``"totals"``: total complaints 2000/01 to present (default)
      - ``"by_district"``: complaints by policing district, 2011/12+
      - ``"by_allegation_type"``: allegations by type, 2011/12+
      - ``"by_outcome"``: closures by outcome, 2011/12+
      - ``"quarterly"``: quarterly complaints, latest 5 financial years
   :param force_refresh: If ``True``, bypass cache and re-download the source file.

   :returns: Tidy DataFrame for the requested breakdown.

   :raises ValueError: If *breakdown* is not one of the recognised values.
   :raises PSNIDataNotFoundError: If the source cannot be downloaded.

   .. rubric:: Example

   >>> df = get_latest_complaints()
   >>> set(["year", "complaints"]).issubset(df.columns)
   True
   >>> df_d = get_latest_complaints("by_district")
   >>> "district" in df_d.columns
   True


.. py:function:: get_quarterly_publication_url()

   Scrape the quarterly-reports page for the latest .xlsx download link.

   policeombudsman.org returns 403 to default User-Agents; this function uses a
   browser-like UA via ``bolster.utils.web.session``.

   :returns: Absolute URL of the latest quarterly Excel spreadsheet.

   :raises PSNIDataNotFoundError: If the page cannot be retrieved or no .xlsx link is found.

   .. rubric:: Example

   >>> url = get_quarterly_publication_url()
   >>> url.startswith("https://")
   True


.. py:function:: parse_annual(file_path)

   Parse the annual Police Ombudsman statistics Excel workbook.

   Extracts four key tables from the workbook:

   - ``totals``: total complaints 2000/01 onwards (T1)
   - ``by_district``: complaints by policing district, 2011/12 onwards (T8)
   - ``by_allegation_type``: allegations by type & subtype, 2011/12+ (T10)
   - ``by_outcome``: complaint closures by outcome, 2011/12 onwards (T12)

   :param file_path: Local path (or file-like) to the downloaded ``.xlsx`` file.

   :returns: Dict mapping breakdown name to tidy DataFrame. All DataFrames include
             ``year`` (int, financial-year start) and ``year_label``
             (e.g. ``"2024/25"``) columns.

   :raises PSNIDataNotFoundError: If required sheets cannot be found.

   .. rubric:: Example

   >>> from bolster.data_sources.psni import parse_annual
   >>> # "annual_statistics.xlsx" is a hypothetical local path to a downloaded workbook
   >>> tables = parse_annual("annual_statistics.xlsx")  # doctest: +SKIP
   >>> sorted(tables)  # doctest: +SKIP
   ['by_allegation_type', 'by_district', 'by_outcome', 'totals']


.. py:function:: parse_quarterly(file_path)

   Parse a quarterly Police Ombudsman statistics Excel workbook.

   Extracts three tables:

   - ``complaints``: complaints received by quarter × year
   - ``allegations``: allegations received by quarter × year
   - ``by_district``: complaints by policing district × year

   The quarterly workbook covers the latest five financial years, with four
   quarters per year plus totals.

   :param file_path: Local path (or file-like) to the downloaded ``.xlsx`` file.

   :returns: Dict mapping key name to long-form DataFrame. Each DataFrame includes
             ``year_label`` (e.g. ``"2024/25"``) and ``year`` (int start year).

   :raises PSNIDataNotFoundError: If required sheets cannot be found.

   .. rubric:: Example

   >>> from bolster.data_sources.psni import parse_quarterly
   >>> # "quarterly_statistics.xlsx" is a hypothetical local path to a downloaded workbook
   >>> tables = parse_quarterly("quarterly_statistics.xlsx")  # doctest: +SKIP
   >>> sorted(tables)  # doctest: +SKIP
   ['allegations', 'by_district', 'complaints']


.. py:function:: validate_complaints(df, breakdown)

   Validate a Police Ombudsman complaints DataFrame.

   Checks that:

   - The DataFrame is non-empty.
   - Required columns for the given *breakdown* are present.
   - The ``year`` column contains plausible financial-year start years.
   - Complaint / allegation counts are non-negative.

   :param df: DataFrame to validate (as returned by :func:`get_latest_complaints`).
   :param breakdown: One of ``"totals"``, ``"by_district"``,
                     ``"by_allegation_type"``, ``"by_outcome"``, ``"quarterly"``.

   :returns: ``True`` if validation passes.

   :raises PSNIValidationError: If any check fails.

   .. rubric:: Example

   >>> import pandas as pd
   >>> df = pd.DataFrame({"year": [2020, 2021], "complaints": [3000, 3100]})
   >>> validate_complaints(df, "totals")
   True


.. py:function:: get_rtc_annual_summary(years = None, force_refresh = False)

   Get annual summary statistics across multiple years.

   Provides aggregated collision and casualty counts by year,
   useful for trend analysis.

   :param years: List of years to include (default: all available)
   :param force_refresh: If True, bypass cache and re-download

   :returns: DataFrame with columns:

      - year: int
      - collisions: int (total collisions)
      - casualties: int (total casualties)
      - fatal: int (fatal casualties)
      - serious: int (serious injuries)
      - slight: int (slight injuries)
      - fatalities_per_100_collisions: float

   .. rubric:: Example

   >>> summary = get_rtc_annual_summary()
   >>> 'fatal' in summary.columns
   True


.. py:function:: get_rtc_available_years()

   Get list of years with available RTC data.

   :returns: List of years (integers) in descending order

   .. rubric:: Example

   >>> years = get_rtc_available_years()
   >>> len(years) > 0
   True


.. py:function:: get_casualties(year = None, force_refresh = False, decode_values = True)

   Get casualty records for a specific year.

   Each row represents a single casualty involved in a road traffic collision.
   Casualties are linked to collisions via the 'ref' column.
   :param year: Year to fetch (default: latest available)
   :param force_refresh: If True, bypass cache and re-download
   :param decode_values: If True, decode coded values to human-readable strings

   :returns: DataFrame with columns including:

      - year: int
      - ref: int (collision reference number for linking)
      - vehicle_id: int
      - casualty_id: int
      - casualty_class: str (road user type if decoded)
      - sex_code: int
      - age_group: int
      - severity: str ('Fatal', 'Serious', 'Slight' if decoded)
      - severity_code: int (1=fatal, 2=serious, 3=slight)

   .. rubric:: Example

   >>> df = get_casualties(2024)
   >>> 'severity' in df.columns
   True


.. py:function:: get_casualties_by_district(year = None, force_refresh = False)

   Get casualty counts by policing district.

   :param year: Year to fetch (default: latest available)
   :param force_refresh: If True, bypass cache and re-download

   :returns: DataFrame with columns:

      - district: str (policing district name)
      - lgd_code: str (ONS LGD code)
      - collisions: int
      - casualties: int
      - fatal: int
      - serious: int
      - slight: int

   .. rubric:: Example

   >>> by_district = get_casualties_by_district(2024)
   >>> 'district' in by_district.columns
   True


.. py:function:: get_casualties_by_road_user(year = None, force_refresh = False)

   Get casualty counts by road user type.

   :param year: Year to fetch (default: latest available)
   :param force_refresh: If True, bypass cache and re-download

   :returns: DataFrame with columns:

      - casualty_class: str (road user type)
      - casualties: int
      - fatal: int
      - serious: int
      - slight: int
      - fatality_rate: float (fatal / total %)

   .. rubric:: Example

   >>> by_user = get_casualties_by_road_user(2024)
   >>> 'casualty_class' in by_user.columns
   True


.. py:function:: get_casualties_with_collision_details(year = None, force_refresh = False)

   Get casualty records merged with collision details.

   Combines casualty data with collision information including
   date, location, and road conditions.
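Conceptually, this kind of enrichment is a join on the shared ``ref`` column. The sketch below illustrates the idea with plain dictionaries rather than DataFrames; the rows are invented toy data, only a few of the documented columns appear, and nothing here is taken from the library's actual implementation.

```python
# Toy stand-ins for rows from get_collisions() and get_casualties().
# All values are invented for illustration.
collisions = [
    {"ref": 101, "district": "Belfast City", "month": 3, "weather": "Fine"},
    {"ref": 102, "district": "Derry City & Strabane", "month": 7, "weather": "Raining"},
]
casualties = [
    {"ref": 101, "casualty_id": 1, "severity": "Slight"},
    {"ref": 101, "casualty_id": 2, "severity": "Serious"},
    {"ref": 102, "casualty_id": 1, "severity": "Fatal"},
]

# Index collisions by their reference number for constant-time lookup.
collision_by_ref = {row["ref"]: row for row in collisions}

# Attach the matching collision's fields to every casualty row.
# A casualty without a matching collision keeps only its own fields.
enriched = []
for casualty in casualties:
    collision = collision_by_ref.get(casualty["ref"], {})
    enriched.append({**collision, **casualty})

for row in enriched:
    print(row["ref"], row["casualty_id"], row["severity"], row["district"])
```

One collision can involve several casualties, so the result has one row per casualty, with the collision fields repeated. With pandas DataFrames the same idea is usually expressed as a left merge on ``ref``.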
   :param year: Year to fetch (default: latest available)
   :param force_refresh: If True, bypass cache and re-download

   :returns: DataFrame with casualty records enriched with collision details

   .. rubric:: Example

   >>> df = get_casualties_with_collision_details(2024)
   >>> 'severity' in df.columns
   True


.. py:function:: get_collisions(year = None, force_refresh = False, decode_values = True)

   Get collision records for a specific year.

   Each row represents a single road traffic collision with details about
   date, time, location, road conditions, and severity.

   :param year: Year to fetch (default: latest available)
   :param force_refresh: If True, bypass cache and re-download
   :param decode_values: If True, decode coded values to human-readable strings

   :returns: DataFrame with columns including:

      - year: int
      - ref: int (collision reference number)
      - district: str (policing district name if decoded)
      - district_code: str (original code)
      - month: int
      - day: int
      - weekday: str (day name if decoded)
      - hour: int
      - vehicles: int (number of vehicles)
      - casualties: int (number of casualties)
      - light_conditions: str (if decoded)
      - weather: str (if decoded)
      - road_surface: str (if decoded)
      - lgd_code: str (ONS LGD code)
      - nuts3_code: str (NUTS3 region code)

   .. rubric:: Example

   >>> df = get_collisions(2024)
   >>> 'severity' in df.columns or 'district' in df.columns
   True


.. py:function:: get_vehicles(year = None, force_refresh = False, decode_values = True)

   Get vehicle records for a specific year.

   Each row represents a single vehicle involved in a road traffic collision.
   Vehicles are linked to collisions via the 'ref' column.
   :param year: Year to fetch (default: latest available)
   :param force_refresh: If True, bypass cache and re-download
   :param decode_values: If True, decode coded values to human-readable strings

   :returns: DataFrame with columns including:

      - year: int
      - ref: int (collision reference number for linking)
      - vehicle_id: int
      - vehicle_type: str (if decoded)
      - vehicle_type_code: int
      - driver_sex_code: int
      - driver_age_group: int

   .. rubric:: Example

   >>> df = get_vehicles(2024)
   >>> 'vehicle_id' in df.columns
   True


.. py:function:: validate_rtc_data(df, data_type)

   Validate RTC data integrity.

   :param df: DataFrame to validate
   :param data_type: Type of data ('collision', 'casualty', or 'vehicle')

   :returns: True if validation passes

   :raises PSNIValidationError: If validation fails
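
The checks performed by ``validate_rtc_data`` are not listed on this page. As a rough illustration of the general pattern (return ``True`` on success, raise on failure), here is a minimal sketch that assumes the validator looks for a set of required columns per data type. The function name ``check_rtc_rows``, the choice of required columns, and the local ``PSNIValidationError`` stand-in are all assumptions made for this example.

```python
class PSNIValidationError(Exception):
    """Local stand-in for bolster's exception so the sketch runs on its own."""


# Column names come from the return-value lists documented on this page.
# Which of them a real validator treats as mandatory is an assumption.
REQUIRED_COLUMNS = {
    "collision": {"year", "ref", "month", "day", "hour", "vehicles", "casualties"},
    "casualty": {"year", "ref", "vehicle_id", "casualty_id", "severity_code"},
    "vehicle": {"year", "ref", "vehicle_id", "vehicle_type_code"},
}


def check_rtc_rows(rows, data_type):
    """Return True if every row has the required columns, else raise."""
    if data_type not in REQUIRED_COLUMNS:
        raise PSNIValidationError(f"unknown data_type: {data_type!r}")
    if not rows:
        raise PSNIValidationError("no rows to validate")
    for index, row in enumerate(rows):
        missing = REQUIRED_COLUMNS[data_type] - set(row)
        if missing:
            raise PSNIValidationError(f"row {index} is missing {sorted(missing)}")
    return True


rows = [{"year": 2024, "ref": 1, "vehicle_id": 1, "vehicle_type_code": 9}]
print(check_rtc_rows(rows, "vehicle"))

try:
    check_rtc_rows(rows, "casualty")  # vehicle rows lack the casualty columns
except PSNIValidationError as exc:
    print("validation failed:", exc)
```

Because every PSNI exception derives from ``PSNIDataError``, callers of the real function can catch either the specific validation error or the base class.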