bolster.data_sources.gender_pay_gap
===================================

.. py:module:: bolster.data_sources.gender_pay_gap

.. autoapi-nested-parse::

   UK Gender Pay Gap Reporting Data Source.

   Provides access to UK Gender Pay Gap (GPG) reporting data published annually by the
   UK Government. All employers with 250+ employees are legally required to report
   their gender pay gap figures each year.

   Data Source:
       **Download page**: https://gender-pay-gap.service.gov.uk/viewing/download

       Annual CSV files are published for each reporting year, covering all UK employers
       with 250+ employees. Data is available from 2017 to present. Northern Ireland
       employers (identifiable by BT postcodes) are included in the UK-wide dataset.

       The reporting deadline is 4 April each year, for the snapshot date of 5 April the
       previous year. Data for year Y therefore covers the snapshot date of 5 April Y,
       with reports due by 4 April of year Y+1.

   Update Frequency: Annual (April each year)

   Geographic Coverage: UK-wide. NI employers identifiable via BT postcode prefix.

   Licence: Open Government Licence v3.0

   Metrics Provided:
       - Mean and median hourly pay gap (%) between male and female employees
       - Mean and median bonus pay gap (%)
       - Proportion of male/female employees receiving a bonus
       - Pay quartile gender composition (lower, lower-middle, upper-middle, upper)
       - Employer metadata (size band, SIC code, Companies House number)

   .. rubric:: Example

   >>> from bolster.data_sources import gender_pay_gap
   >>> df = gender_pay_gap.get_data(year=2024)
   >>> 'employer_name' in df.columns
   True

   >>> ni_df = gender_pay_gap.get_data(year=2024, postcode_prefix="BT")
   >>> len(ni_df) > 0
   True


Attributes
----------

.. autoapisummary::

   bolster.data_sources.gender_pay_gap.logger
   bolster.data_sources.gender_pay_gap.GPG_BASE_URL
   bolster.data_sources.gender_pay_gap.FIRST_YEAR
   bolster.data_sources.gender_pay_gap.COLUMN_MAPPING
   bolster.data_sources.gender_pay_gap.NUMERIC_COLUMNS


Exceptions
----------

.. autoapisummary::

   bolster.data_sources.gender_pay_gap.GenderPayGapError
   bolster.data_sources.gender_pay_gap.GenderPayGapDataNotFoundError


Functions
---------

.. autoapisummary::

   bolster.data_sources.gender_pay_gap.get_available_years
   bolster.data_sources.gender_pay_gap.get_data
   bolster.data_sources.gender_pay_gap.get_all_years
   bolster.data_sources.gender_pay_gap.validate_data


Module Contents
---------------

.. py:data:: logger

.. py:data:: GPG_BASE_URL
   :value: 'https://gender-pay-gap.service.gov.uk/viewing/download-data/{year}'


.. py:data:: FIRST_YEAR
   :value: 2017


.. py:data:: COLUMN_MAPPING

.. py:data:: NUMERIC_COLUMNS
   :value: ['diff_mean_hourly_percent', 'diff_median_hourly_percent', 'diff_mean_bonus_percent',...


.. py:exception:: GenderPayGapError

   Bases: :py:obj:`Exception`

   Base exception for gender pay gap data errors.


.. py:exception:: GenderPayGapDataNotFoundError

   Bases: :py:obj:`GenderPayGapError`

   Raised when data for the requested year is not available.


.. py:function:: get_available_years()

   Return the list of reporting years with published data.

   Data is published annually. The first year is 2017; data for the current year
   is available once the April reporting deadline has passed.

   :returns: List of years (integers) for which data is available, e.g. [2017, ..., 2024].

   .. rubric:: Example

   >>> years = get_available_years()
   >>> len(years) > 0
   True


.. py:function:: get_data(year, postcode_prefix = None, force_refresh = False)

   Download and parse gender pay gap data for a given reporting year.

   Returns UK-wide data by default. Use ``postcode_prefix`` to filter to a specific
   region, e.g. ``"BT"`` for Northern Ireland, ``"EH"`` for Edinburgh, ``"M"`` for
   Manchester. The full dataset is always downloaded first; filtering happens
   in-memory.

   :param year: The reporting year (e.g. 2024 for the snapshot date of 5 April 2024).
                Must be between 2017 and the most recent available year.
   :param postcode_prefix: If given, only return employers whose postcode starts
                           with this prefix (case-insensitive). ``None`` (default)
                           returns all UK employers. Common values: ``"BT"``
                           (Northern Ireland), ``"EH"`` (Edinburgh), ``"G"``
                           (Glasgow), ``"M"`` (Manchester).
   :param force_refresh: If True, bypass any cached response. Currently has no
                         effect because responses are streamed directly; reserved
                         for future caching.

   :returns: DataFrame with one row per employer and the following columns:

             - employer_name: str
             - employer_id: str
             - address: str
             - postcode: str
             - company_number: str
             - sic_codes: str
             - diff_mean_hourly_percent: float; mean hourly pay gap (positive = men paid more)
             - diff_median_hourly_percent: float
             - diff_mean_bonus_percent: float
             - diff_median_bonus_percent: float
             - male_bonus_percent: float; % of male employees receiving a bonus
             - female_bonus_percent: float
             - male_lower_quartile: float; % of lower pay quartile who are male
             - female_lower_quartile: float
             - male_lower_middle_quartile: float
             - female_lower_middle_quartile: float
             - male_upper_middle_quartile: float
             - female_upper_middle_quartile: float
             - male_top_quartile: float
             - female_top_quartile: float
             - company_link_to_gpg_info: str
             - responsible_person: str
             - employer_size: str; e.g. "250 to 499", "500 to 999", "5000 to 19,999", "20,000 or more"
             - current_name: str
             - submitted_after_deadline: bool
             - due_date: datetime
             - date_submitted: datetime
             - reporting_year: int; the reporting year (same as the ``year`` argument)
   :rtype: DataFrame

   :raises GenderPayGapDataNotFoundError: If data for the requested year is not available.
   :raises GenderPayGapError: If the download or parse fails.

   .. rubric:: Example

   >>> df = get_data(year=2024)
   >>> 'employer_name' in df.columns
   True

   >>> ni = get_data(year=2024, postcode_prefix="BT")
   >>> len(ni) > 0
   True


.. py:function:: get_all_years(postcode_prefix = None)

   Download and combine gender pay gap data for all available years.

   Useful for trend analysis across multiple reporting years.

   :param postcode_prefix: If given, filter to employers whose postcode starts
                           with this prefix before combining. See :func:`get_data`
                           for details.

   :returns: Combined DataFrame with all years, including a ``reporting_year`` column.
             See :func:`get_data` for full column documentation.

   .. rubric:: Example

   >>> df = get_all_years(postcode_prefix="BT")
   >>> 'reporting_year' in df.columns
   True


.. py:function:: validate_data(df)

   Validate a gender pay gap DataFrame for internal consistency.

   Checks:

   - Required columns are present
   - Pay quartile columns sum to ~100% (male + female = 100 per quartile)
   - Hourly pay gap values are within plausible range (-100% to +100%)

   :param df: DataFrame from :func:`get_data` or :func:`get_all_years`.

   :returns: True if validation passes.

   :raises GenderPayGapError: If any validation check fails.

   .. rubric:: Example

   >>> df = get_data(year=2024)
   >>> validate_data(df)
   True
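The three checks listed for ``validate_data`` can be sketched in plain Python over row dictionaries. This is an illustration only and not the module's implementation: ``check_rows``, ``make_row``, ``SketchValidationError``, the 1-point tolerance and the choice to skip missing values are all assumptions made for the example. Only the column names come from the documentation above.

```python
QUARTILES = ("lower", "lower_middle", "upper_middle", "top")
GAP_COLUMNS = ("diff_mean_hourly_percent", "diff_median_hourly_percent")
REQUIRED = ("employer_name",) + GAP_COLUMNS + tuple(
    f"{sex}_{q}_quartile" for q in QUARTILES for sex in ("male", "female")
)
TOLERANCE = 1.0  # assumed rounding allowance, in percentage points


class SketchValidationError(Exception):
    """Hypothetical stand-in for GenderPayGapError."""


def check_rows(rows):
    """Return True if every row passes the three checks, otherwise raise."""
    for i, row in enumerate(rows):
        # Check 1: required columns are present.
        missing = [c for c in REQUIRED if c not in row]
        if missing:
            raise SketchValidationError(f"row {i}: missing columns {missing}")

        # Check 2: male + female is roughly 100 in each pay quartile.
        for q in QUARTILES:
            male = row[f"male_{q}_quartile"]
            female = row[f"female_{q}_quartile"]
            if male is None or female is None:
                continue  # assumption: unreported quartiles are skipped
            if abs(male + female - 100.0) > TOLERANCE:
                raise SketchValidationError(f"row {i}: {q} quartile sums to {male + female}")

        # Check 3: hourly pay gaps lie between -100% and +100%.
        for col in GAP_COLUMNS:
            gap = row[col]
            if gap is not None and not -100.0 <= gap <= 100.0:
                raise SketchValidationError(f"row {i}: {col}={gap} out of range")
    return True


def make_row(**overrides):
    """Build a made-up, internally consistent row for demonstration."""
    row = {
        "employer_name": "Example Ltd",
        "diff_mean_hourly_percent": 12.5,
        "diff_median_hourly_percent": 9.0,
    }
    for q in QUARTILES:
        row[f"male_{q}_quartile"] = 45.0
        row[f"female_{q}_quartile"] = 55.0
    row.update(overrides)
    return row


ok = check_rows([make_row()])
```

Rows shaped like these could be obtained from a DataFrame with ``df.to_dict("records")``. Calling ``check_rows([make_row(male_top_quartile=70.0)])`` raises ``SketchValidationError`` because that quartile sums to 125.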
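The ``postcode_prefix`` filter of ``get_data`` is described as a case-insensitive "starts with" match. The sketch below shows what such a match does on a few made-up rows, assuming the filter is a plain string-prefix comparison; the module's actual behaviour is not confirmed here. ``filter_by_prefix`` and ``filter_by_area`` are hypothetical helpers, not part of ``bolster``.

```python
import re

# Made-up sample rows; real rows come from get_data().
employers = [
    {"employer_name": "A Ltd", "postcode": "BT1 5GS"},
    {"employer_name": "B Ltd", "postcode": "m1 1AE"},
    {"employer_name": "C Ltd", "postcode": "MK9 2FH"},
    {"employer_name": "D Ltd", "postcode": None},
]


def filter_by_prefix(rows, prefix):
    """Keep rows whose postcode starts with prefix, ignoring case."""
    wanted = prefix.upper()
    return [
        row for row in rows
        if (row["postcode"] or "").strip().upper().startswith(wanted)
    ]


def filter_by_area(rows, area):
    """Keep rows whose postcode area (the leading letters) equals area exactly."""
    wanted = area.upper()
    kept = []
    for row in rows:
        match = re.match(r"[A-Za-z]+", (row["postcode"] or "").strip())
        if match and match.group(0).upper() == wanted:
            kept.append(row)
    return kept


plain = [r["employer_name"] for r in filter_by_prefix(employers, "M")]
strict = [r["employer_name"] for r in filter_by_area(employers, "M")]
```

Under a plain prefix match, a single-letter prefix such as ``"M"`` also matches other postcode areas that begin with the same letter (``"MK"`` in the sample), so ``plain`` contains both B Ltd and C Ltd while ``strict`` contains only B Ltd. If that matters for an analysis, a stricter area comparison like ``filter_by_area`` can be applied to the returned data.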