bolster.data_sources.gender_pay_gap
UK Gender Pay Gap Reporting Data Source.
Provides access to UK Gender Pay Gap (GPG) reporting data published annually by the UK Government. All employers with 250+ employees are legally required to report their gender pay gap figures each year.
- Data Source:
Download page: https://gender-pay-gap.service.gov.uk/viewing/download
Annual CSV files are published for each reporting year, covering all UK employers with 250+ employees. Data is available from 2017 to present. Northern Ireland employers (identifiable by BT postcodes) are included in the UK-wide dataset.
The reporting deadline is 4 April each year and relates to the snapshot date of 5 April the previous year. Data for reporting year Y therefore covers the snapshot date of 5 April Y and is due by 4 April of year Y+1.
Update Frequency: Annual (April each year)
Geographic Coverage: UK-wide. NI employers identifiable via BT postcode prefix.
Licence: Open Government Licence v3.0
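The relationship between reporting year, snapshot date and deadline described above can be sketched as follows. This is a minimal illustration that takes the dates in this section at face value; `snapshot_date` and `reporting_deadline` are hypothetical helpers and are not part of `bolster`.

```python
from datetime import date


def snapshot_date(reporting_year: int) -> date:
    """Snapshot date for a reporting year: 5 April of that year."""
    return date(reporting_year, 4, 5)


def reporting_deadline(reporting_year: int) -> date:
    """Deadline for a reporting year: 4 April of the following year."""
    return date(reporting_year + 1, 4, 4)


print(snapshot_date(2024))       # 2024-04-05
print(reporting_deadline(2024))  # 2025-04-04
```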
- Metrics Provided:
Mean and median hourly pay gap (%) between male and female employees
Mean and median bonus pay gap (%)
Proportion of male/female employees receiving a bonus
Pay quartile gender composition (lower, lower-middle, upper-middle, upper)
Employer metadata (size band, SIC code, Companies House number)
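To make the sign convention of the percentage gap figures concrete, here is a small sketch. It assumes the usual definition, in which the gap is the difference between male and female pay expressed as a percentage of male pay; `pay_gap_percent` is a hypothetical helper, not a `bolster` function.

```python
def pay_gap_percent(male_pay: float, female_pay: float) -> float:
    """Gap as a percentage of male pay; positive means men are paid more."""
    if male_pay <= 0:
        raise ValueError("male_pay must be positive")
    return (male_pay - female_pay) * 100 / male_pay


print(pay_gap_percent(20.0, 18.0))  # 10.0  (men paid more)
print(pay_gap_percent(20.0, 25.0))  # -25.0 (women paid more)
```

Because the denominator is male pay, the figure cannot exceed 100 but can fall below -100 when female pay is more than double male pay.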
Example
>>> from bolster.data_sources import gender_pay_gap
>>> df = gender_pay_gap.get_data(year=2024)
>>> 'employer_name' in df.columns
True
>>> ni_df = gender_pay_gap.get_data(year=2024, postcode_prefix="BT")
>>> len(ni_df) > 0
True
Attributes
Exceptions
GenderPayGapError: Base exception for gender pay gap data errors.
GenderPayGapDataNotFoundError: Raised when data for the requested year is not available.
Functions
get_available_years(): Return the list of reporting years with published data.
get_data(year, postcode_prefix=None, force_refresh=False): Download and parse gender pay gap data for a given reporting year.
get_all_years(postcode_prefix=None): Download and combine gender pay gap data for all available years.
validate_data(df): Validate a gender pay gap DataFrame for internal consistency.
Module Contents
- bolster.data_sources.gender_pay_gap.GPG_BASE_URL = 'https://gender-pay-gap.service.gov.uk/viewing/download-data/{year}'
- bolster.data_sources.gender_pay_gap.NUMERIC_COLUMNS = ['diff_mean_hourly_percent', 'diff_median_hourly_percent', 'diff_mean_bonus_percent',...
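The `{year}` placeholder in `GPG_BASE_URL` suggests the template is filled with `str.format`. The sketch below shows that step only; how the module itself uses the template is an assumption, and `download_url` is a hypothetical helper.

```python
# Template copied from the module attribute listed above.
GPG_BASE_URL = "https://gender-pay-gap.service.gov.uk/viewing/download-data/{year}"


def download_url(year: int) -> str:
    """Fill the {year} placeholder in the template."""
    return GPG_BASE_URL.format(year=year)


print(download_url(2024))
# https://gender-pay-gap.service.gov.uk/viewing/download-data/2024
```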
- exception bolster.data_sources.gender_pay_gap.GenderPayGapError
Bases: Exception
Base exception for gender pay gap data errors.
- exception bolster.data_sources.gender_pay_gap.GenderPayGapDataNotFoundError
Bases: GenderPayGapError
Raised when data for the requested year is not available.
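Because the not-found error derives from the base error, callers can handle a missing year separately while letting other failures propagate. The sketch below uses stand-in classes that mirror the documented hierarchy so that it runs without `bolster` installed; `fetch_or_none` is a hypothetical helper.

```python
class GenderPayGapError(Exception):
    """Stand-in for the module's base exception."""


class GenderPayGapDataNotFoundError(GenderPayGapError):
    """Stand-in for the 'year not available' exception."""


def fetch_or_none(fetch, year):
    """Call fetch(year); return None if that year has no data.

    Other GenderPayGapError failures (download, parse) still propagate.
    """
    try:
        return fetch(year)
    except GenderPayGapDataNotFoundError:
        return None
```

In real use, the exception classes would be imported from `bolster.data_sources.gender_pay_gap` and `fetch` would be `gender_pay_gap.get_data`.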
- bolster.data_sources.gender_pay_gap.get_available_years()
Return the list of reporting years with published data.
Data is published annually. The first year is 2017; data for the current year is available once the April reporting deadline has passed.
- Returns:
List of years (integers) for which data is available, e.g. [2017, …, 2024].
- Return type:
list[int]
Example
>>> years = get_available_years()
>>> len(years) > 0
True
- bolster.data_sources.gender_pay_gap.get_data(year, postcode_prefix=None, force_refresh=False)
Download and parse gender pay gap data for a given reporting year.
Returns UK-wide data by default. Use postcode_prefix to filter to a specific region, e.g. "BT" for Northern Ireland, "EH" for Edinburgh, "M" for Manchester. The full dataset is always downloaded first; filtering happens in-memory.
- Parameters:
year (int) – The reporting year (e.g. 2024 for the snapshot date of 5 April 2024). Must be between 2017 and the most recent available year.
postcode_prefix (str | None) – If given, only return employers whose postcode starts with this prefix (case-insensitive). None (default) returns all UK employers. Common values: "BT" (Northern Ireland), "EH" (Edinburgh), "G" (Glasgow), "M" (Manchester).
force_refresh (bool) – If True, bypass any cached response. Currently has no effect because responses are streamed directly; reserved for future caching.
- Returns:
employer_name: str
employer_id: str
address: str
postcode: str
company_number: str
sic_codes: str
diff_mean_hourly_percent: float — mean hourly pay gap (positive = men paid more)
diff_median_hourly_percent: float
diff_mean_bonus_percent: float
diff_median_bonus_percent: float
male_bonus_percent: float — % of male employees receiving a bonus
female_bonus_percent: float
male_lower_quartile: float — % of lower pay quartile who are male
female_lower_quartile: float
male_lower_middle_quartile: float
female_lower_middle_quartile: float
male_upper_middle_quartile: float
female_upper_middle_quartile: float
male_top_quartile: float
female_top_quartile: float
company_link_to_gpg_info: str
responsible_person: str
employer_size: str — e.g. “250 to 499”, “500 to 999”, “5000 to 19,999”, “20,000 or more”
current_name: str
submitted_after_deadline: bool
due_date: datetime
date_submitted: datetime
reporting_year: int (the reporting year; same as the year argument)
- Return type:
pandas.DataFrame with one row per employer and the columns listed above
- Raises:
GenderPayGapDataNotFoundError – If data for the requested year is not available.
GenderPayGapError – If the download or parse fails.
Example
>>> df = get_data(year=2024)
>>> 'employer_name' in df.columns
True
>>> ni = get_data(year=2024, postcode_prefix="BT")
>>> len(ni) > 0
True
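The documented "starts with" rule for postcode_prefix has a consequence worth illustrating: taken literally, a one-letter prefix such as "M" also matches areas like "ME". The sketch below contrasts a literal prefix test with an exact match on the postcode area (the leading letters). Both helpers are hypothetical, the postcodes are illustrative strings, and whether `bolster` applies any stricter matching is not stated here.

```python
import re


def starts_with_prefix(postcode: str, prefix: str) -> bool:
    """Case-insensitive 'starts with' test, as described for postcode_prefix."""
    return postcode.strip().upper().startswith(prefix.strip().upper())


def postcode_area(postcode: str) -> str:
    """Return the leading letters of a postcode, e.g. 'M' for 'M1 1AE'."""
    match = re.match(r"[A-Za-z]+", postcode.strip())
    return match.group(0).upper() if match else ""


for pc in ["M1 1AE", "ME1 1AA", "bt1 5gs"]:
    # Columns: postcode, literal prefix match on "M", exact area match on "M"
    print(pc, starts_with_prefix(pc, "M"), postcode_area(pc) == "M")
```

If only a single postcode area is wanted, comparing the extracted area after loading the data is one way to avoid such over-matching.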
- bolster.data_sources.gender_pay_gap.get_all_years(postcode_prefix=None)
Download and combine gender pay gap data for all available years.
Useful for trend analysis across multiple reporting years.
- Parameters:
postcode_prefix (str | None) – If given, filter to employers whose postcode starts with this prefix before combining. See get_data() for details.
- Returns:
Combined DataFrame with all years, including a reporting_year column. See get_data() for full column documentation.
- Return type:
pandas.DataFrame
Example
>>> df = get_all_years(postcode_prefix="BT")
>>> 'reporting_year' in df.columns
True
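As a rough idea of the trend analysis this enables, the sketch below groups rows by reporting_year and takes the median of diff_median_hourly_percent. It uses plain dictionaries with made-up values so that it runs without any download; `median_gap_by_year` is a hypothetical helper, and the figures are not real data.

```python
from collections import defaultdict
from statistics import median

# Made-up rows for illustration only; real rows come from get_all_years().
rows = [
    {"reporting_year": 2022, "diff_median_hourly_percent": 12.0},
    {"reporting_year": 2022, "diff_median_hourly_percent": 8.0},
    {"reporting_year": 2023, "diff_median_hourly_percent": 10.0},
    {"reporting_year": 2023, "diff_median_hourly_percent": 6.0},
    {"reporting_year": 2023, "diff_median_hourly_percent": None},  # missing value
]


def median_gap_by_year(records):
    """Median of diff_median_hourly_percent per reporting year, skipping missing values."""
    by_year = defaultdict(list)
    for record in records:
        value = record["diff_median_hourly_percent"]
        if value is not None:
            by_year[record["reporting_year"]].append(value)
    return {year: median(values) for year, values in sorted(by_year.items())}


print(median_gap_by_year(rows))  # {2022: 10.0, 2023: 8.0}
```

On the combined DataFrame itself, the equivalent pandas expression would be `df.groupby("reporting_year")["diff_median_hourly_percent"].median()`.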
- bolster.data_sources.gender_pay_gap.validate_data(df)
Validate a gender pay gap DataFrame for internal consistency.
Checks:
- Required columns are present
- Pay quartile columns sum to ~100% (male + female = 100 per quartile)
- Hourly pay gap values are within plausible range (-100% to +100%)
- Parameters:
df (pandas.DataFrame) – DataFrame from get_data() or get_all_years().
- Returns:
True if validation passes.
- Raises:
GenderPayGapError – If any validation check fails.
- Return type:
bool
Example
>>> df = get_data(year=2024)
>>> validate_data(df)
True
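The quartile-sum and gap-range checks listed for validate_data() can be sketched for a single employer row as below. This is not the library's implementation: `check_row` is a hypothetical helper, the 1-point tolerance is an assumed value for "~100%", the choice of the two hourly columns is an assumption, and the sketch raises ValueError rather than GenderPayGapError.

```python
QUARTILES = ["lower", "lower_middle", "upper_middle", "top"]


def check_row(row: dict, tolerance: float = 1.0) -> bool:
    """Apply the quartile-sum and gap-range checks to one employer row."""
    for quartile in QUARTILES:
        # Male and female shares of each pay quartile should add up to ~100.
        total = row[f"male_{quartile}_quartile"] + row[f"female_{quartile}_quartile"]
        if abs(total - 100.0) > tolerance:
            raise ValueError(f"{quartile} quartile sums to {total}, expected ~100")
    for column in ("diff_mean_hourly_percent", "diff_median_hourly_percent"):
        # Hourly gaps are expected to lie within -100..100.
        if not -100.0 <= row[column] <= 100.0:
            raise ValueError(f"{column}={row[column]} is outside -100..100")
    return True
```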