bolster.utils.rss
=================

.. py:module:: bolster.utils.rss

.. autoapi-nested-parse::

   RSS Feed parsing utilities for bolster.

   This module provides utilities for parsing and working with RSS/Atom feeds,
   with a focus on government statistics and research publications.


Attributes
----------

.. autoapisummary::

   bolster.utils.rss.logger


Classes
-------

.. autoapisummary::

   bolster.utils.rss.FeedEntry
   bolster.utils.rss.Feed


Functions
---------

.. autoapisummary::

   bolster.utils.rss.parse_date
   bolster.utils.rss.parse_feed_entry
   bolster.utils.rss.parse_rss_feed
   bolster.utils.rss.filter_entries
   bolster.utils.rss.get_nisra_statistics_feed


Module Contents
---------------

.. py:data:: logger

.. py:class:: FeedEntry

   Represents a single entry from an RSS/Atom feed.


   .. py:attribute:: title
      :type: str


   .. py:attribute:: link
      :type: str


   .. py:attribute:: published
      :type: datetime.datetime | None
      :value: None


   .. py:attribute:: updated
      :type: datetime.datetime | None
      :value: None


   .. py:attribute:: summary
      :type: str | None
      :value: None


   .. py:attribute:: author
      :type: str | None
      :value: None


   .. py:attribute:: categories
      :type: list[str]
      :value: None


   .. py:attribute:: content
      :type: str | None
      :value: None


   .. py:attribute:: id
      :type: str | None
      :value: None


   .. py:method:: __post_init__()

      Initialize empty lists for mutable default arguments.


   .. py:method:: to_dict()

      Convert entry to dictionary representation.


.. py:class:: Feed

   Represents a parsed RSS/Atom feed.


   .. py:attribute:: title
      :type: str


   .. py:attribute:: link
      :type: str


   .. py:attribute:: description
      :type: str | None
      :value: None


   .. py:attribute:: entries
      :type: list[FeedEntry]
      :value: None


   .. py:attribute:: language
      :type: str | None
      :value: None


   .. py:attribute:: updated
      :type: datetime.datetime | None
      :value: None


   .. py:method:: __post_init__()

      Initialize empty lists for mutable default arguments.


   .. py:method:: to_dict()

      Convert feed to dictionary representation.


.. py:function:: parse_date(date_str)

   Parse a date string into a datetime object.

   :param date_str: Date string in various formats

   :returns: Parsed datetime object or None if parsing fails


.. py:function:: parse_feed_entry(entry)

   Parse a feedparser entry into a FeedEntry object.

   :param entry: feedparser entry dictionary

   :returns: FeedEntry object


.. py:function:: parse_rss_feed(feed_url, timeout = 30)

   Parse an RSS or Atom feed from a URL.

   :param feed_url: URL of the RSS/Atom feed
   :param timeout: Request timeout in seconds (default: 30)

   :returns: Feed object containing parsed feed data

   :raises Exception: If the feed cannot be fetched
   :raises ValueError: If the feed cannot be parsed

   .. rubric:: Example

   >>> feed = parse_rss_feed(
   ...     "https://www.gov.uk/search/research-and-statistics.atom?"
   ...     "content_store_document_type=all_research_and_statistics&"
   ...     "organisations%5B%5D=northern-ireland-statistics-and-research-agency"
   ... )
   >>> feed.title
   'Research and statistics from Northern Ireland Statistics and Research Agency (NISRA)'
   >>> sorted(feed.__dataclass_fields__)
   ['description', 'entries', 'language', 'link', 'title', 'updated']
   >>> len(feed.entries) > 0
   True
   >>> entry = feed.entries[0]
   >>> sorted(entry.__dataclass_fields__)
   ['author', 'categories', 'content', 'id', 'link', 'published', 'summary', 'title', 'updated']
   >>> isinstance(entry.title, str) and isinstance(entry.link, str)
   True
   >>> entry.link.startswith("http")
   True
   >>> from datetime import datetime
   >>> isinstance(entry.published, datetime)
   True


.. py:function:: filter_entries(entries, title_contains = None, category = None, after_date = None, before_date = None)

   Filter feed entries based on various criteria.

   :param entries: List of FeedEntry objects to filter
   :param title_contains: Filter entries whose title contains this string (case-insensitive)
   :param category: Filter entries that have this category
   :param after_date: Filter entries published after this date
   :param before_date: Filter entries published before this date

   :returns: Filtered list of FeedEntry objects

   .. rubric:: Example

   >>> from bolster.utils.rss import FeedEntry, filter_entries
   >>> from datetime import datetime
   >>> entries = [
   ...     FeedEntry("Births Statistics April 2024", "http://example.com/1", published=datetime(2024, 4, 1)),
   ...     FeedEntry("Deaths Statistics April 2024", "http://example.com/2", published=datetime(2024, 4, 2)),
   ...     FeedEntry("Old Statistics 2023", "http://example.com/3", published=datetime(2023, 6, 1)),
   ... ]
   >>> recent = filter_entries(entries, title_contains="births", after_date="2024-01-01")
   >>> [e.title for e in recent]
   ['Births Statistics April 2024']


.. py:function:: get_nisra_statistics_feed(order = 'recent', timeout = 30, limit = None)

   Get the NISRA statistics feed from GOV.UK.

   The GOV.UK Atom feed returns 20 entries per page. When limit exceeds 20,
   multiple pages are fetched automatically.

   :param order: Sort order - 'recent' for newest first, 'oldest' for oldest first
   :param timeout: Request timeout in seconds
   :param limit: Maximum number of entries to return (None = first page only, i.e. 20)

   :returns: Feed object with NISRA statistics

   .. rubric:: Example

   >>> feed = get_nisra_statistics_feed()
   >>> feed100 = get_nisra_statistics_feed(limit=100)
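
The following is a minimal, self-contained sketch of two behaviours described above: the ``__post_init__`` handling of a list attribute whose documented default is ``None``, and the page arithmetic implied by the 20-entries-per-page limit. It does not import or reproduce bolster's code. ``EntrySketch``, ``pages_needed`` and ``PAGE_SIZE`` are hypothetical names introduced only for illustration, and the rounding rule is an assumption about how a ``limit`` might map to page requests.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from datetime import datetime

PAGE_SIZE = 20  # entries per GOV.UK Atom page, as stated in the docs above


@dataclass
class EntrySketch:
    """Hypothetical stand-in for FeedEntry; not the bolster class itself."""

    title: str
    link: str
    published: datetime | None = None
    categories: list[str] = None  # real list is created in __post_init__

    def __post_init__(self):
        # Give every instance its own list instead of a shared default.
        if self.categories is None:
            self.categories = []


def pages_needed(limit, page_size=PAGE_SIZE):
    """Hypothetical helper: pages required to collect `limit` entries."""
    if limit is None:
        return 1  # documented behaviour: first page only
    # Assumption: round up, and always request at least one page.
    return max(1, math.ceil(limit / page_size))


first = EntrySketch("Births Statistics April 2024", "http://example.com/1")
second = EntrySketch("Deaths Statistics April 2024", "http://example.com/2")
first.categories.append("births")  # does not affect `second`

print(second.categories)   # []
print(pages_needed(100))   # 5
```

Under these assumptions, ``limit=100`` corresponds to five page requests, and appending a category to one entry leaves every other entry's list untouched.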