bolster.data_sources.nisra.baby_names
=====================================

.. py:module:: bolster.data_sources.nisra.baby_names

.. autoapi-nested-parse::

   NISRA Baby Names Northern Ireland Data Source.

   Provides access to baby name statistics for Northern Ireland from the Northern Ireland
   Statistics and Research Agency (NISRA), including:

   - Full historical list of all first forenames given to babies registered in NI (1997–present)
   - Annual rank and count for every name, by sex (Boys/Girls)

   The module uses the Full Name List file, which contains all registered names with their
   rank and count for each year from 1997 to the most recent publication year.

   Data Source:
       **Statistics Page**: https://www.nisra.gov.uk/statistics/births/baby-names

       The statistics page lists all Baby Names publications in reverse chronological order
       (newest first). The module automatically scrapes this page to find the latest
       Baby Names publication, then downloads the Full Names List Excel file from that
       publication's detail page.

       Each Full Names List file contains the complete time series from 1997 to the most
       recent year. A new file is published annually in April.

   Update Frequency: Annual (published April each year)

   Geographic Coverage: Northern Ireland (births registered in NI)

   .. rubric:: Example

   >>> from bolster.data_sources.nisra import baby_names
   >>> df = baby_names.get_baby_names()
   >>> sorted(df.columns.tolist())
   ['count', 'name', 'rank', 'sex', 'year']
   >>> sorted(df['sex'].unique().tolist())
   ['Boys', 'Girls']
   >>> df['year'].min() >= 1997
   True


Attributes
----------

.. autoapisummary::

   bolster.data_sources.nisra.baby_names.logger
   bolster.data_sources.nisra.baby_names.BABY_NAMES_STATS_URL
   bolster.data_sources.nisra.baby_names.NISRA_BASE_URL


Functions
---------

.. autoapisummary::

   bolster.data_sources.nisra.baby_names.get_baby_names_publication_url
   bolster.data_sources.nisra.baby_names.parse_baby_names_file
   bolster.data_sources.nisra.baby_names.get_baby_names
   bolster.data_sources.nisra.baby_names.validate_baby_names


Module Contents
---------------

.. py:data:: logger

.. py:data:: BABY_NAMES_STATS_URL
   :value: 'https://www.nisra.gov.uk/statistics/births/baby-names'


.. py:data:: NISRA_BASE_URL
   :value: 'https://www.nisra.gov.uk'


.. py:function:: get_baby_names_publication_url()

   Scrape NISRA to find the latest Baby Names Full Name List Excel URL.

   Navigates the publication structure:

   1. Scrapes the baby names statistics page for the latest publication link
   2. Follows the link to the publication detail page
   3. Finds the Full Names List Excel file link

   :returns: URL of the latest Full Name List Excel file

   :raises NISRADataNotFoundError: If the publication or file is not found
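
   The navigation steps above can be sketched with the standard library alone. This is an
   illustration under stated assumptions, not the module's implementation: the helper names
   ``LinkCollector`` and ``find_link``, the sample markup, and the matching rule (first link
   whose text contains a keyword) are all hypothetical, and the sketch raises ``LookupError``
   where the real function raises ``NISRADataNotFoundError``.

   .. code-block:: python

      from html.parser import HTMLParser
      from urllib.parse import urljoin

      NISRA_BASE_URL = "https://www.nisra.gov.uk"


      class LinkCollector(HTMLParser):
          """Collect (href, link text) pairs in document order."""

          def __init__(self):
              super().__init__()
              self.links = []
              self._href = None
              self._text = []

          def handle_starttag(self, tag, attrs):
              if tag == "a":
                  self._href = dict(attrs).get("href")
                  self._text = []

          def handle_data(self, data):
              # Only keep text that sits inside an <a href=...> element.
              if self._href is not None:
                  self._text.append(data)

          def handle_endtag(self, tag):
              if tag == "a" and self._href is not None:
                  self.links.append((self._href, "".join(self._text).strip()))
                  self._href = None


      def find_link(html, keyword, base_url=NISRA_BASE_URL):
          """Return the absolute URL of the first link whose text contains keyword."""
          parser = LinkCollector()
          parser.feed(html)
          for href, text in parser.links:
              if keyword.lower() in text.lower():
                  return urljoin(base_url, href)
          raise LookupError(f"no link matching {keyword!r}")


      # Hypothetical markup standing in for a publication detail page.
      sample_page = """
      <ul>
        <li><a href="/files/summary.pdf">Baby Names summary</a></li>
        <li><a href="/files/full-names-list.xlsx">Full Names List</a></li>
      </ul>
      """
      excel_url = find_link(sample_page, "full names list")

   Resolving each ``href`` against ``NISRA_BASE_URL`` with ``urljoin`` means relative and
   absolute links are handled the same way.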


.. py:function:: parse_baby_names_file(file_path)

   Parse NISRA Full Name List Excel file into long-format DataFrame.

   The Full Name List Excel file contains two sheets:

   - Table 1: Boys' names (1997 to present), in wide format with 3 columns per year
     (Name, Number of Babies, Rank) and 29 or more year blocks across the row
   - Table 2: Girls' names, same structure as Table 1

   Names with suppressed counts (shown as '..') are excluded. Counts are suppressed for
   names with fewer than 3 occurrences, for disclosure control.

   :param file_path: Path to the Full Name List Excel file

   :returns: Long-format DataFrame with columns:

             - year (int): registration year
             - name (str): first forename (title case as registered)
             - sex (str): "Boys" or "Girls"
             - rank (int): rank within that sex and year (1 = most popular)
             - count (int): number of babies registered with that name
   :rtype: pandas.DataFrame

   :raises NISRAValidationError: If the file structure is unexpected or no data parsed
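
   The wide-to-long reshaping described above can be illustrated in plain Python. This is a
   sketch only: ``wide_to_long`` is a hypothetical helper, the rows are invented, and the
   real parser reads the sheets from Excel rather than from in-memory lists.

   .. code-block:: python

      def wide_to_long(header_years, rows, sex):
          """Flatten rows of repeated (name, count, rank) triples into records.

          header_years holds one year per 3-column block, left to right.
          Each row holds len(header_years) * 3 cells.
          """
          records = []
          for row in rows:
              for block, year in enumerate(header_years):
                  name, count, rank = row[block * 3: block * 3 + 3]
                  # Skip empty cells and suppressed counts marked '..'.
                  if name is None or count in (None, ".."):
                      continue
                  records.append(
                      {"year": year, "name": name, "sex": sex,
                       "rank": int(rank), "count": int(count)}
                  )
          return records


      # Hypothetical two-year extract; the values are invented for illustration.
      boys_rows = [
          ["Jack", 10, 1, "James", 12, 1],
          ["Adam", "..", "..", "Jack", 9, 2],
      ]
      long_records = wide_to_long([1997, 1998], boys_rows, "Boys")

   Each year block is read independently, so a name that is suppressed in one year still
   contributes records for the years in which its count is published.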


.. py:function:: get_baby_names(force_refresh = False)

   Get the full historical Baby Names series for Northern Ireland (1997–present).

   Automatically discovers and downloads the most recent Full Name List publication
   from the NISRA website, which contains the complete historical series from 1997.

   :param force_refresh: If True, bypass cache and download fresh data

   :returns: Long-format DataFrame with columns:

             - year (int): registration year (1997–present)
             - name (str): first forename as registered
             - sex (str): "Boys" or "Girls"
             - rank (int): rank within sex and year (1 = most popular)
             - count (int): number of babies with that name
   :rtype: pandas.DataFrame

   :raises NISRADataNotFoundError: If the latest publication cannot be found
   :raises NISRAValidationError: If the file structure is unexpected

   .. rubric:: Example

   >>> df = get_baby_names()
   >>> sorted(df.columns.tolist())
   ['count', 'name', 'rank', 'sex', 'year']
   >>> df['year'].min() >= 1997
   True
   >>> sorted(df['sex'].unique().tolist())
   ['Boys', 'Girls']
   >>> df[df['year'] == df['year'].max()].nsmallest(1, 'rank')['name'].iloc[0] is not None
   True
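
   Once loaded, the long format lends itself to ordinary pandas filtering. A minimal sketch,
   using a small hypothetical frame in place of the real download (the names and counts are
   invented for illustration):

   .. code-block:: python

      import pandas as pd

      # Hypothetical stand-in for the frame returned by get_baby_names().
      df = pd.DataFrame(
          {
              "year": [2022, 2022, 2023, 2023, 2023, 2023],
              "name": ["Jack", "Grace", "Noah", "Jack", "Grace", "Emily"],
              "sex": ["Boys", "Girls", "Boys", "Boys", "Girls", "Girls"],
              "rank": [1, 1, 1, 2, 1, 2],
              "count": [50, 45, 60, 55, 48, 40],
          }
      )

      latest = df[df["year"] == df["year"].max()]
      # Top name per sex for the latest year (assumes no ties at rank 1).
      top_by_sex = latest[latest["rank"] == 1].set_index("sex")["name"].to_dict()

      # Counts for one name over time, indexed by year.
      jack_trend = (
          df[(df["name"] == "Jack") & (df["sex"] == "Boys")]
          .sort_values("year")
          .set_index("year")["count"]
      )

   Filtering on both ``name`` and ``sex`` matters because the same forename can appear in
   both the boys' and girls' lists.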


.. py:function:: validate_baby_names(df)

   Validate a baby names DataFrame for structural and data integrity.

   Checks:

   - Required columns are present
   - No null values in any column
   - Both sexes present ("Boys" and "Girls")
   - The earliest year is 1999 or earlier (the series should go back to 1997)
   - Rank starts at 1 for at least one year/sex combination
   - All counts are positive (> 0)
   - No negative ranks

   :param df: DataFrame to validate (from parse_baby_names_file or get_baby_names)

   :returns: True if validation passes

   :raises NISRAValidationError: If any validation check fails, with descriptive message

   .. rubric:: Example

   >>> import pandas as pd
   >>> valid_df = pd.DataFrame({
   ...     'year': [1997, 1997], 'name': ['Jack', 'Chloe'],
   ...     'sex': ['Boys', 'Girls'], 'rank': [1, 1], 'count': [100, 90]
   ... })
   >>> validate_baby_names(valid_df)
   True
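
   The listed checks could be expressed roughly as follows. ``check_baby_names`` is a
   hypothetical helper, not the library's implementation: it collects messages in a list
   instead of raising ``NISRAValidationError``, and the sample frame is invented.

   .. code-block:: python

      import pandas as pd

      REQUIRED_COLUMNS = ["year", "name", "sex", "rank", "count"]


      def check_baby_names(df):
          """Return a list of problems; an empty list means every check passed."""
          missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
          if missing:
              # The remaining checks need these columns, so stop early.
              return [f"missing columns: {missing}"]
          problems = []
          if df[REQUIRED_COLUMNS].isnull().any().any():
              problems.append("null values present")
          if not {"Boys", "Girls"} <= set(df["sex"]):
              problems.append("both 'Boys' and 'Girls' are required")
          if df["year"].min() > 1999:
              problems.append("earliest year is after 1999")
          if not (df["rank"] == 1).any():
              problems.append("no rank-1 entry")
          if (df["count"] <= 0).any():
              problems.append("non-positive counts")
          if (df["rank"] < 0).any():
              problems.append("negative ranks")
          return problems


      # Hypothetical frame with two deliberate faults: no rank 1 and a zero count.
      faulty = pd.DataFrame(
          {"year": [1997, 1997], "name": ["Jack", "Chloe"],
           "sex": ["Boys", "Girls"], "rank": [2, 3], "count": [5, 0]}
      )
      problems = check_baby_names(faulty)

   Collecting every failure before reporting is a design choice of this sketch; it makes
   it easier to see all faults in a file at once.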