bolster.data_sources.gender_pay_gap

UK Gender Pay Gap Reporting Data Source.

Provides access to UK Gender Pay Gap (GPG) reporting data published annually by the UK Government. All employers with 250+ employees are legally required to report their gender pay gap figures each year.

Data Source:

Download page: https://gender-pay-gap.service.gov.uk/viewing/download

Annual CSV files are published for each reporting year, covering all UK employers with 250+ employees. Data is available from 2017 to present. Northern Ireland employers (identifiable by BT postcodes) are included in the UK-wide dataset.

Each reporting year is identified by its snapshot date: data for year Y refers to the snapshot date of 5 April Y, and the reporting deadline is 4 April of the following year (Y+1).
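The relationship between a reporting year, its snapshot date and its deadline can be sketched as follows. This is a minimal illustration that assumes the 5 April snapshot and 4 April deadline stated here; the helper names are hypothetical and are not part of bolster.

```python
from datetime import date


def snapshot_date(reporting_year: int) -> date:
    """Snapshot date for a reporting year (assumed: 5 April of that year)."""
    return date(reporting_year, 4, 5)


def reporting_deadline(reporting_year: int) -> date:
    """Deadline for a reporting year (assumed: 4 April of the following year)."""
    return date(reporting_year + 1, 4, 4)


print(snapshot_date(2024))       # 2024-04-05
print(reporting_deadline(2024))  # 2025-04-04
```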

Update Frequency: Annual (April each year)

Geographic Coverage: UK-wide. NI employers identifiable via BT postcode prefix.

Licence: Open Government Licence v3.0

Metrics Provided:
  • Mean and median hourly pay gap (%) between male and female employees

  • Mean and median bonus pay gap (%)

  • Proportion of male/female employees receiving a bonus

  • Pay quartile gender composition (lower, lower-middle, upper-middle, upper)

  • Employer metadata (size band, SIC code, Companies House number)
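As a guide to reading the percentage figures, a pay gap is conventionally expressed as the difference between men's and women's pay as a percentage of men's pay, so a positive value means men are paid more. A minimal sketch of that arithmetic follows; the function name and the figures are illustrative only and do not come from bolster or the dataset.

```python
def pay_gap_percent(male_hourly: float, female_hourly: float) -> float:
    """Gap as a percentage of male pay: positive means men are paid more."""
    if male_hourly <= 0:
        raise ValueError("male_hourly must be positive")
    return (male_hourly - female_hourly) / male_hourly * 100


# Illustrative figures only: men average 20.00 per hour, women 18.00 per hour.
gap = pay_gap_percent(20.0, 18.0)
print(round(gap, 1))  # 10.0
```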

Example

>>> from bolster.data_sources import gender_pay_gap
>>> df = gender_pay_gap.get_data(year=2024)
>>> 'employer_name' in df.columns
True
>>> ni_df = gender_pay_gap.get_data(year=2024, postcode_prefix="BT")
>>> len(ni_df) > 0
True

Attributes

logger

GPG_BASE_URL

FIRST_YEAR

COLUMN_MAPPING

NUMERIC_COLUMNS

Exceptions

GenderPayGapError

Base exception for gender pay gap data errors.

GenderPayGapDataNotFoundError

Raised when data for the requested year is not available.

Functions

get_available_years()

Return the list of reporting years with published data.

get_data(year[, postcode_prefix, force_refresh])

Download and parse gender pay gap data for a given reporting year.

get_all_years([postcode_prefix])

Download and combine gender pay gap data for all available years.

validate_data(df)

Validate a gender pay gap DataFrame for internal consistency.

Module Contents

bolster.data_sources.gender_pay_gap.logger[source]
bolster.data_sources.gender_pay_gap.GPG_BASE_URL = 'https://gender-pay-gap.service.gov.uk/viewing/download-data/{year}'[source]
bolster.data_sources.gender_pay_gap.FIRST_YEAR = 2017[source]
bolster.data_sources.gender_pay_gap.COLUMN_MAPPING[source]
bolster.data_sources.gender_pay_gap.NUMERIC_COLUMNS = ['diff_mean_hourly_percent', 'diff_median_hourly_percent', 'diff_mean_bonus_percent',...[source]
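
GPG_BASE_URL contains a {year} placeholder, so a per-year download URL can presumably be produced with str.format. The sketch below copies the documented constant; the helper name is hypothetical, and how the module itself fills the template is an assumption.

```python
# Copied from the module constant documented above.
GPG_BASE_URL = "https://gender-pay-gap.service.gov.uk/viewing/download-data/{year}"


def build_download_url(year: int) -> str:
    """Fill the {year} placeholder (hypothetical helper, not part of bolster)."""
    return GPG_BASE_URL.format(year=year)


print(build_download_url(2024))
# https://gender-pay-gap.service.gov.uk/viewing/download-data/2024
```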
exception bolster.data_sources.gender_pay_gap.GenderPayGapError[source]

Bases: Exception

Base exception for gender pay gap data errors.


exception bolster.data_sources.gender_pay_gap.GenderPayGapDataNotFoundError[source]

Bases: GenderPayGapError

Raised when data for the requested year is not available.

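Because GenderPayGapDataNotFoundError derives from GenderPayGapError, a handler for the base class also catches the more specific error. The sketch below uses local stand-in classes that mirror the documented hierarchy so that it runs without bolster installed; load_year is a hypothetical function used only to trigger the error path.

```python
# Local stand-ins mirroring the documented hierarchy; in real code these
# would be imported from bolster.data_sources.gender_pay_gap instead.
class GenderPayGapError(Exception):
    """Base exception for gender pay gap data errors."""


class GenderPayGapDataNotFoundError(GenderPayGapError):
    """Raised when data for the requested year is not available."""


def load_year(year: int) -> str:
    """Hypothetical loader that always fails, to demonstrate handling."""
    raise GenderPayGapDataNotFoundError(f"No data published for {year}")


try:
    load_year(1999)
except GenderPayGapError as exc:  # also catches the not-found subclass
    caught = type(exc).__name__

print(caught)  # GenderPayGapDataNotFoundError
```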

bolster.data_sources.gender_pay_gap.get_available_years()[source]

Return the list of reporting years with published data.

Data is published annually. The first year is 2017; data for the current year is available once the April reporting deadline has passed.

Returns:

List of years (integers) for which data is available, e.g. [2017, …, 2024].

Return type:

list[int]

Example

>>> years = get_available_years()
>>> len(years) > 0
True
bolster.data_sources.gender_pay_gap.get_data(year, postcode_prefix=None, force_refresh=False)[source]

Download and parse gender pay gap data for a given reporting year.

Returns UK-wide data by default. Use postcode_prefix to filter to a specific region — e.g. "BT" for Northern Ireland, "EH" for Edinburgh, "M" for Manchester. The full dataset is always downloaded first; filtering happens in-memory.

Parameters:
  • year (int) – The reporting year (e.g. 2024 for the snapshot date of 5 April 2024). Must be between 2017 and the most recent available year.

  • postcode_prefix (str | None) – If given, only return employers whose postcode starts with this prefix (case-insensitive). None (default) returns all UK employers. Common values: "BT" (Northern Ireland), "EH" (Edinburgh), "G" (Glasgow), "M" (Manchester).

  • force_refresh (bool) – If True, bypass any cached response. Has no effect currently as responses are streamed directly; reserved for future caching.
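
Because the filter is described as a plain "starts with" match, a one-letter prefix such as "M" would also match postcodes from other areas that begin with the same letter (for example "MK" or "ME"). The sketch below is not part of bolster; it contrasts prefix matching with matching on the postcode area (the leading letters), using hypothetical helper names. Whether the library strips whitespace before matching is an assumption.

```python
import re


def starts_with_prefix(postcode: str, prefix: str) -> bool:
    """Case-insensitive 'starts with' match, as described for postcode_prefix."""
    return postcode.strip().upper().startswith(prefix.strip().upper())


def postcode_area(postcode: str) -> str:
    """Return the leading letters of a postcode, e.g. 'MK' for 'MK9 2FP'."""
    match = re.match(r"[A-Za-z]+", postcode.strip())
    return match.group(0).upper() if match else ""


postcodes = ["M1 1AE", "MK9 2FP", "bt1 5gs"]

print([p for p in postcodes if starts_with_prefix(p, "M")])
# ['M1 1AE', 'MK9 2FP']
print([p for p in postcodes if postcode_area(p) == "M"])
# ['M1 1AE']
```

If a strict area match is needed, one option is to request the broader prefix and then narrow the result on the postcode column afterwards.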

Returns:

DataFrame with one row per employer and the following columns:

  • employer_name: str

  • employer_id: str

  • address: str

  • postcode: str

  • company_number: str

  • sic_codes: str

  • diff_mean_hourly_percent: float — mean hourly pay gap (positive = men paid more)

  • diff_median_hourly_percent: float

  • diff_mean_bonus_percent: float

  • diff_median_bonus_percent: float

  • male_bonus_percent: float — % of male employees receiving a bonus

  • female_bonus_percent: float

  • male_lower_quartile: float — % of lower pay quartile who are male

  • female_lower_quartile: float

  • male_lower_middle_quartile: float

  • female_lower_middle_quartile: float

  • male_upper_middle_quartile: float

  • female_upper_middle_quartile: float

  • male_top_quartile: float

  • female_top_quartile: float

  • company_link_to_gpg_info: str

  • responsible_person: str

  • employer_size: str — e.g. “250 to 499”, “500 to 999”, “5000 to 19,999”, “20,000 or more”

  • current_name: str

  • submitted_after_deadline: bool

  • due_date: datetime

  • date_submitted: datetime

  • reporting_year: int — the reporting year (same as year arg)

Return type:

pandas.DataFrame

Raises:

GenderPayGapDataNotFoundError – If data for the requested year is not available.

Example

>>> df = get_data(year=2024)
>>> 'employer_name' in df.columns
True
>>> ni = get_data(year=2024, postcode_prefix="BT")
>>> len(ni) > 0
True
bolster.data_sources.gender_pay_gap.get_all_years(postcode_prefix=None)[source]

Download and combine gender pay gap data for all available years.

Useful for trend analysis across multiple reporting years.

Parameters:

postcode_prefix (str | None) – If given, filter to employers whose postcode starts with this prefix before combining. See get_data() for details.

Returns:

Combined DataFrame with all years, including a reporting_year column. See get_data() for full column documentation.

Return type:

pandas.DataFrame

Example

>>> df = get_all_years(postcode_prefix="BT")
>>> 'reporting_year' in df.columns
True
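
For trend analysis, the combined frame can be grouped by reporting_year. The sketch below does not call bolster: it builds two tiny frames by hand with invented figures, stacks them with pandas.concat in the way a combined result might be assembled (an assumption about the implementation), and computes the median gap per year.

```python
import pandas as pd

# Tiny hand-made frames standing in for per-year results from get_data();
# the figures are invented for illustration.
frames = {
    2022: pd.DataFrame(
        {"employer_name": ["A", "B"], "diff_median_hourly_percent": [12.0, 8.0]}
    ),
    2023: pd.DataFrame(
        {"employer_name": ["A", "B"], "diff_median_hourly_percent": [10.0, 6.0]}
    ),
}

# Tag each frame with its year, then stack them into one long table.
combined = pd.concat(
    [df.assign(reporting_year=year) for year, df in frames.items()],
    ignore_index=True,
)

# Median of the employer-level median gaps, per reporting year.
trend = combined.groupby("reporting_year")["diff_median_hourly_percent"].median()
result = {int(year): float(value) for year, value in trend.items()}
print(result)  # {2022: 10.0, 2023: 8.0}
```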
bolster.data_sources.gender_pay_gap.validate_data(df)[source]

Validate a gender pay gap DataFrame for internal consistency.

Checks:
  • Required columns are present

  • Pay quartile columns sum to ~100% (male + female = 100 per quartile)

  • Hourly pay gap values are within plausible range (-100% to +100%)

Parameters:

df (pandas.DataFrame) – DataFrame from get_data() or get_all_years().

Returns:

True if validation passes.

Raises:

GenderPayGapError – If any validation check fails.

Return type:

bool

Example

>>> df = get_data(year=2024)
>>> validate_data(df)
True
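
The checks listed for validate_data can be approximated as in the sketch below. It is an independent illustration rather than the library's implementation: the required-column subset, the 1-point tolerance and the choice to return a list of problems instead of raising GenderPayGapError are all assumptions, and only the lower quartile is checked for brevity.

```python
import pandas as pd


def check_frame(df: pd.DataFrame, tolerance: float = 1.0) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    required = [
        "employer_name",
        "diff_mean_hourly_percent",
        "diff_median_hourly_percent",
        "male_lower_quartile",
        "female_lower_quartile",
    ]
    missing = [col for col in required if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []

    # Male + female share of a quartile should be close to 100%.
    total = (df["male_lower_quartile"] + df["female_lower_quartile"]).dropna()
    if ((total - 100).abs() > tolerance).any():
        problems.append("lower quartile shares do not sum to ~100%")

    # Hourly gaps outside -100..100 are treated as implausible here.
    for col in ("diff_mean_hourly_percent", "diff_median_hourly_percent"):
        if not df[col].dropna().between(-100, 100).all():
            problems.append(f"{col} outside -100..100")
    return problems


good = pd.DataFrame(
    {
        "employer_name": ["A"],
        "diff_mean_hourly_percent": [10.0],
        "diff_median_hourly_percent": [8.0],
        "male_lower_quartile": [40.0],
        "female_lower_quartile": [60.0],
    }
)
bad = good.assign(female_lower_quartile=70.0)

print(check_frame(good))  # []
print(check_frame(bad))   # ['lower quartile shares do not sum to ~100%']
```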