bolster.utils.rss
=================

.. py:module:: bolster.utils.rss

.. autoapi-nested-parse::

   RSS feed parsing utilities for bolster.

   This module provides utilities for parsing and working with RSS/Atom feeds,
   with a focus on government statistics and research publications.


Attributes
----------

.. autoapisummary::

   bolster.utils.rss.logger


Classes
-------

.. autoapisummary::

   bolster.utils.rss.FeedEntry
   bolster.utils.rss.Feed


Functions
---------

.. autoapisummary::

   bolster.utils.rss.parse_date
   bolster.utils.rss.parse_feed_entry
   bolster.utils.rss.parse_rss_feed
   bolster.utils.rss.filter_entries
   bolster.utils.rss.get_nisra_statistics_feed


Module Contents
---------------

.. py:data:: logger

.. py:class:: FeedEntry

   Represents a single entry from an RSS/Atom feed.


   .. py:attribute:: title
      :type:  str


   .. py:attribute:: link
      :type:  str


   .. py:attribute:: published
      :type:  datetime.datetime | None
      :value: None


   .. py:attribute:: updated
      :type:  datetime.datetime | None
      :value: None


   .. py:attribute:: summary
      :type:  str | None
      :value: None


   .. py:attribute:: author
      :type:  str | None
      :value: None


   .. py:attribute:: categories
      :type:  list[str]
      :value: None


   .. py:attribute:: content
      :type:  str | None
      :value: None


   .. py:attribute:: id
      :type:  str | None
      :value: None


   .. py:method:: __post_init__()

      Initialize list-valued fields that default to ``None`` (here ``categories``) as empty lists.


   .. py:method:: to_dict()

      Convert entry to dictionary representation.
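
The ``__post_init__`` and ``to_dict`` methods above follow a common dataclass pattern. The following is a minimal sketch of that pattern, not the library's implementation: ``EntrySketch`` is a hypothetical stand-in for ``FeedEntry``, and serializing ``published`` as an ISO 8601 string is an assumption, since the exact output of ``to_dict`` is not described here.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class EntrySketch:
    """Hypothetical stand-in for FeedEntry with a reduced set of fields."""

    title: str
    link: str
    published: Optional[datetime] = None
    categories: Optional[List[str]] = None

    def __post_init__(self):
        # A None default avoids sharing one list object between instances.
        if self.categories is None:
            self.categories = []

    def to_dict(self):
        data = asdict(self)
        # Assumption: datetimes are emitted as ISO 8601 strings.
        if self.published is not None:
            data["published"] = self.published.isoformat()
        return data
```

Because each instance receives its own list, appending a category to one entry leaves every other entry untouched.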


.. py:class:: Feed

   Represents a parsed RSS/Atom feed.


   .. py:attribute:: title
      :type:  str


   .. py:attribute:: link
      :type:  str


   .. py:attribute:: description
      :type:  str | None
      :value: None


   .. py:attribute:: entries
      :type:  list[FeedEntry]
      :value: None


   .. py:attribute:: language
      :type:  str | None
      :value: None


   .. py:attribute:: updated
      :type:  datetime.datetime | None
      :value: None


   .. py:method:: __post_init__()

      Initialize list-valued fields that default to ``None`` (here ``entries``) as empty lists.


   .. py:method:: to_dict()

      Convert feed to dictionary representation.


.. py:function:: parse_date(date_str)

   Parse a date string into a datetime object.

   :param date_str: Date string in various formats

   :returns: Parsed datetime object or None if parsing fails
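
The formats accepted by ``parse_date`` are not enumerated above. As a rough sketch only, the "parse or return ``None``" behaviour could be built from the standard library, assuming the two formats most feeds use: RFC 822 dates (RSS) and ISO 8601 timestamps (Atom). The name ``parse_date_sketch`` is hypothetical and this is not the module's actual implementation.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_date_sketch(date_str):
    """Hypothetical helper: try RFC 822, then ISO 8601; None on failure."""
    if not date_str:
        return None
    try:
        # RSS feeds typically use RFC 822 dates, e.g. "Mon, 01 Apr 2024 09:30:00 GMT".
        return parsedate_to_datetime(date_str)
    except (TypeError, ValueError):
        pass
    try:
        # Atom feeds use ISO 8601; a trailing "Z" is rewritten because
        # datetime.fromisoformat rejects it before Python 3.11.
        return datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    except ValueError:
        return None
```

Returning ``None`` instead of raising lets callers treat an unparseable date as a missing value.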


.. py:function:: parse_feed_entry(entry)

   Parse a feedparser entry into a FeedEntry object.

   :param entry: feedparser entry dictionary

   :returns: FeedEntry object


.. py:function:: parse_rss_feed(feed_url, timeout = 30)

   Parse an RSS or Atom feed from a URL.

   :param feed_url: URL of the RSS/Atom feed
   :param timeout: Request timeout in seconds (default: 30)

   :returns: Feed object containing parsed feed data

   :raises Exception: If the feed cannot be fetched
   :raises ValueError: If the feed cannot be parsed

   .. rubric:: Example

   >>> feed = parse_rss_feed(
   ...     "https://www.gov.uk/search/research-and-statistics.atom?"
   ...     "content_store_document_type=all_research_and_statistics&"
   ...     "organisations%5B%5D=northern-ireland-statistics-and-research-agency"
   ... )
   >>> feed.title
   'Research and statistics from Northern Ireland Statistics and Research Agency (NISRA)'
   >>> sorted(feed.__dataclass_fields__)
   ['description', 'entries', 'language', 'link', 'title', 'updated']
   >>> len(feed.entries) > 0
   True
   >>> entry = feed.entries[0]
   >>> sorted(entry.__dataclass_fields__)
   ['author', 'categories', 'content', 'id', 'link', 'published', 'summary', 'title', 'updated']
   >>> isinstance(entry.title, str) and isinstance(entry.link, str)
   True
   >>> entry.link.startswith("http")
   True
   >>> from datetime import datetime
   >>> isinstance(entry.published, datetime)
   True


.. py:function:: filter_entries(entries, title_contains = None, category = None, after_date = None, before_date = None)

   Filter feed entries based on various criteria.

   :param entries: List of FeedEntry objects to filter
   :param title_contains: Keep only entries whose title contains this string (case-insensitive)
   :param category: Keep only entries that have this category
   :param after_date: Keep only entries published after this date
   :param before_date: Keep only entries published before this date

   :returns: Filtered list of FeedEntry objects

   .. rubric:: Example

   >>> from bolster.utils.rss import FeedEntry, filter_entries
   >>> from datetime import datetime
   >>> entries = [
   ...     FeedEntry("Births Statistics April 2024", "http://example.com/1", published=datetime(2024, 4, 1)),
   ...     FeedEntry("Deaths Statistics April 2024", "http://example.com/2", published=datetime(2024, 4, 2)),
   ...     FeedEntry("Old Statistics 2023", "http://example.com/3", published=datetime(2023, 6, 1)),
   ... ]
   >>> recent = filter_entries(entries, title_contains="births", after_date="2024-01-01")
   >>> [e.title for e in recent]
   ['Births Statistics April 2024']
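
The way the criteria combine can be sketched without the library. The code below is an illustration under stated assumptions, not the implementation of ``filter_entries``: ``Item`` and ``filter_sketch`` are hypothetical names, date bounds are treated as exclusive, ISO date strings are accepted alongside ``datetime`` objects, and entries without a ``published`` date are dropped whenever a date bound is given.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Item:
    """Hypothetical stand-in for FeedEntry."""

    title: str
    published: Optional[datetime] = None
    categories: list = field(default_factory=list)


def filter_sketch(items, title_contains=None, category=None,
                  after_date=None, before_date=None):
    # Assumption: ISO date strings are converted to datetime objects.
    if isinstance(after_date, str):
        after_date = datetime.fromisoformat(after_date)
    if isinstance(before_date, str):
        before_date = datetime.fromisoformat(before_date)

    kept = []
    for item in items:
        # Every supplied criterion must match for an item to be kept.
        if title_contains is not None and title_contains.lower() not in item.title.lower():
            continue
        if category is not None and category not in item.categories:
            continue
        if after_date is not None and (item.published is None or item.published <= after_date):
            continue
        if before_date is not None and (item.published is None or item.published >= before_date):
            continue
        kept.append(item)
    return kept
```

Criteria left as ``None`` are skipped, so calling the function with no filters returns every item.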


.. py:function:: get_nisra_statistics_feed(order = 'recent', timeout = 30, limit = None)

   Get the NISRA statistics feed from GOV.UK.

   The GOV.UK Atom feed returns 20 entries per page. When limit exceeds 20,
   multiple pages are fetched automatically.

   :param order: Sort order: ``'recent'`` for newest first, ``'oldest'`` for oldest first
   :param timeout: Request timeout in seconds
   :param limit: Maximum number of entries to return (``None`` returns the first page only, i.e. up to 20 entries)

   :returns: Feed object with NISRA statistics

   .. rubric:: Example

   >>> feed = get_nisra_statistics_feed()
   >>> feed100 = get_nisra_statistics_feed(limit=100)
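
The paging behaviour described above (20 entries per page, with further pages fetched when ``limit`` exceeds 20) can be sketched as follows. How the library actually requests later pages is not shown in this reference, so ``fetch_page`` is a hypothetical callable that returns the entries for a given 1-based page number, and ``collect_entries`` is an illustrative name only.

```python
import math

PAGE_SIZE = 20  # entries per page, as stated for the GOV.UK Atom feed


def collect_entries(fetch_page, limit=None):
    """Gather up to `limit` entries using a hypothetical fetch_page(page_number)."""
    if limit is None:
        # No limit given: return the first page only.
        return fetch_page(1)

    entries = []
    pages_needed = math.ceil(limit / PAGE_SIZE)
    for page in range(1, pages_needed + 1):
        batch = fetch_page(page)
        entries.extend(batch)
        if len(batch) < PAGE_SIZE:
            break  # a short page suggests the feed has no more entries
    return entries[:limit]
```

Under these assumptions a ``limit`` of 100 needs at most five requests, and fewer if the feed runs out of entries first.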