bolster.utils.rss

RSS Feed parsing utilities for bolster.

This module provides utilities for parsing and working with RSS/Atom feeds, with a focus on government statistics and research publications.

Attributes

logger

Classes

`FeedEntry`	Represents a single entry from an RSS/Atom feed.
`Feed`	Represents a parsed RSS/Atom feed.

Functions

`parse_date`(date_str)	Parse a date string into a datetime object.
`parse_feed_entry`(entry)	Parse a feedparser entry into a FeedEntry object.
`parse_rss_feed`(feed_url[, timeout])	Parse an RSS or Atom feed from a URL.
`filter_entries`(entries[, title_contains, category, ...])	Filter feed entries based on various criteria.
`get_nisra_statistics_feed`([order, timeout, limit])	Get the NISRA statistics feed from GOV.UK.

Module Contents

bolster.utils.rss.logger

class bolster.utils.rss.FeedEntry

Represents a single entry from an RSS/Atom feed.

title: str

link: str

published: datetime.datetime | None = None

updated: datetime.datetime | None = None

summary: str | None = None

author: str | None = None

categories: list[str] = None

content: str | None = None

id: str | None = None

__post_init__()

Initialize empty lists for mutable default arguments.

to_dict()

Convert entry to dictionary representation.
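The `None` default on `categories`, paired with `__post_init__`, is the usual dataclass idiom for giving each instance its own list. The following is a minimal sketch of that idiom, not the bolster source: `EntrySketch` is a hypothetical stand-in with only a few of the fields, and the plain `asdict` conversion in `to_dict` is an assumption (the real method may format dates differently).

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass
class EntrySketch:
    """Hypothetical stand-in for FeedEntry; not the bolster class itself."""

    title: str
    link: str
    published: datetime | None = None
    categories: list[str] = None  # None sentinel instead of a shared mutable default

    def __post_init__(self):
        # Give every instance its own list when none was supplied.
        if self.categories is None:
            self.categories = []

    def to_dict(self):
        # Assumption: a plain field-name -> value mapping.
        return asdict(self)


a = EntrySketch("Births", "http://example.com/1")
b = EntrySketch("Deaths", "http://example.com/2")
a.categories.append("population")
print(b.categories)  # [] : b does not share a's list
```

Writing `categories: list[str] = []` directly is rejected by `dataclasses` with a `ValueError`, which is why the sentinel (or `field(default_factory=list)`) is needed.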

class bolster.utils.rss.Feed

Represents a parsed RSS/Atom feed.

title: str

link: str

description: str | None = None

entries: list[FeedEntry] = None

language: str | None = None

updated: datetime.datetime | None = None

__post_init__()

Initialize empty lists for mutable default arguments.

to_dict()

Convert feed to dictionary representation.

bolster.utils.rss.parse_date(date_str)

Parse a date string into a datetime object.

Parameters:

date_str (str | None) – Date string in various formats

Returns:

Parsed datetime object or None if parsing fails

Return type:

datetime.datetime | None
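To make the "various formats, `None` on failure" contract concrete, here is a hedged standard-library sketch. It is not the bolster implementation: `parse_date_sketch` is a hypothetical name, and trying ISO 8601 first and RFC 2822 second is an assumption about which formats matter for Atom and RSS feeds.

```python
from __future__ import annotations

from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_date_sketch(date_str: str | None) -> datetime | None:
    """Hypothetical lenient parser: ISO 8601 first, then RFC 2822, else None."""
    if not date_str:
        return None
    text = date_str.strip()

    # Atom uses ISO 8601; rewrite a trailing "Z" so older Pythons accept it.
    iso_text = text[:-1] + "+00:00" if text.endswith("Z") else text
    try:
        return datetime.fromisoformat(iso_text)
    except ValueError:
        pass

    # RSS 2.0 uses RFC 2822 dates such as "Mon, 01 Apr 2024 09:30:00 GMT".
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Older Pythons raise TypeError here, newer ones ValueError.
        return None


print(parse_date_sketch("not a date"))  # None
```

Note that the two branches can return a mix of timezone-aware and naive datetimes, so callers comparing results may need to normalise them first.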

bolster.utils.rss.parse_feed_entry(entry)

Parse a feedparser entry into a FeedEntry object.

Parameters:

entry (feedparser.FeedParserDict) – feedparser entry dictionary

Returns:

FeedEntry object

Return type:

FeedEntry

bolster.utils.rss.parse_rss_feed(feed_url, timeout=30)

Parse an RSS or Atom feed from a URL.

Parameters:

feed_url (str) – URL of the RSS/Atom feed
timeout (int) – Request timeout in seconds (default: 30)

Returns:

Feed object containing parsed feed data

Raises:

Exception – If the feed cannot be fetched
ValueError – If the feed cannot be parsed

Return type:

Feed

Example

>>> from bolster.utils.rss import parse_rss_feed
>>> feed = parse_rss_feed(
...     "https://www.gov.uk/search/research-and-statistics.atom?"
...     "content_store_document_type=all_research_and_statistics&"
...     "organisations%5B%5D=northern-ireland-statistics-and-research-agency"
... )
>>> feed.title
'Research and statistics from Northern Ireland Statistics and Research Agency (NISRA)'
>>> sorted(feed.__dataclass_fields__)
['description', 'entries', 'language', 'link', 'title', 'updated']
>>> len(feed.entries) > 0
True
>>> entry = feed.entries[0]
>>> sorted(entry.__dataclass_fields__)
['author', 'categories', 'content', 'id', 'link', 'published', 'summary', 'title', 'updated']
>>> isinstance(entry.title, str) and isinstance(entry.link, str)
True
>>> entry.link.startswith("http")
True
>>> from datetime import datetime
>>> isinstance(entry.published, datetime)
True
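Because the function documents two failure modes (`ValueError` for unparseable feeds, a general `Exception` for fetch failures), a caller that wants to tell them apart has to catch `ValueError` first. The wrapper below is a sketch only: `load_feed_or_none` is a hypothetical helper, and the `parse` argument stands in for `parse_rss_feed` so the example runs without network access.

```python
import logging

logger = logging.getLogger(__name__)


def load_feed_or_none(parse, feed_url, timeout=30):
    """Call parse(feed_url, timeout=timeout) and return None on failure.

    `parse` is a stand-in for parse_rss_feed (assumption: same call shape).
    """
    try:
        return parse(feed_url, timeout=timeout)
    except ValueError as exc:
        # Must come first: ValueError is a subclass of Exception.
        logger.warning("Feed at %s could not be parsed: %s", feed_url, exc)
    except Exception as exc:
        logger.warning("Feed at %s could not be fetched: %s", feed_url, exc)
    return None


def broken_parser(feed_url, timeout=30):
    # Simulates a feed whose body is not valid RSS/Atom.
    raise ValueError("not a feed")


print(load_feed_or_none(broken_parser, "http://example.com/feed"))  # None
```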

bolster.utils.rss.filter_entries(entries, title_contains=None, category=None, after_date=None, before_date=None)

Filter feed entries based on various criteria.

Parameters:

entries (list[FeedEntry]) – List of FeedEntry objects to filter
title_contains (str | None) – Filter entries whose title contains this string (case-insensitive)
category (str | None) – Filter entries that have this category
after_date (datetime.datetime | str | None) – Filter entries published after this date
before_date (datetime.datetime | str | None) – Filter entries published before this date

Returns:

Filtered list of FeedEntry objects

Return type:

list[FeedEntry]

Example

>>> from bolster.utils.rss import FeedEntry, filter_entries
>>> from datetime import datetime
>>> entries = [
...     FeedEntry("Births Statistics April 2024", "http://example.com/1", published=datetime(2024, 4, 1)),
...     FeedEntry("Deaths Statistics April 2024", "http://example.com/2", published=datetime(2024, 4, 2)),
...     FeedEntry("Old Statistics 2023", "http://example.com/3", published=datetime(2023, 6, 1)),
... ]
>>> recent = filter_entries(entries, title_contains="births", after_date="2024-01-01")
>>> [e.title for e in recent]
['Births Statistics April 2024']
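The filtering rules described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library code: `Item` and `filter_sketch` are hypothetical names, the `category` filter is left out for brevity, date bounds are treated as exclusive, and entries with no `published` value are assumed to fail any date filter.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    """Hypothetical stand-in exposing only the fields the filter reads."""

    title: str
    published: datetime | None = None


def filter_sketch(items, title_contains=None, after_date=None, before_date=None):
    # Accept ISO date strings as well as datetime objects.
    if isinstance(after_date, str):
        after_date = datetime.fromisoformat(after_date)
    if isinstance(before_date, str):
        before_date = datetime.fromisoformat(before_date)

    kept = []
    for item in items:
        # Case-insensitive substring match on the title.
        if title_contains and title_contains.lower() not in item.title.lower():
            continue
        if after_date is not None or before_date is not None:
            # Assumption: undated entries cannot satisfy a date filter.
            if item.published is None:
                continue
            if after_date is not None and item.published <= after_date:
                continue
            if before_date is not None and item.published >= before_date:
                continue
        kept.append(item)
    return kept


items = [
    Item("Births Statistics April 2024", datetime(2024, 4, 1)),
    Item("Deaths Statistics April 2024", datetime(2024, 4, 2)),
    Item("Old Statistics 2023", datetime(2023, 6, 1)),
]
recent = filter_sketch(items, title_contains="births", after_date="2024-01-01")
print([i.title for i in recent])  # ['Births Statistics April 2024']
```

All criteria are combined with AND, so each extra argument can only narrow the result.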

bolster.utils.rss.get_nisra_statistics_feed(order='recent', timeout=30, limit=None)

Get the NISRA statistics feed from GOV.UK.

The GOV.UK Atom feed returns 20 entries per page. When limit exceeds 20, multiple pages are fetched automatically.

Parameters:

order (str) – Sort order: 'recent' for newest first, 'oldest' for oldest first
timeout (int) – Request timeout in seconds
limit (int | None) – Maximum number of entries to return (None = first page only, i.e. 20)

Returns:

Feed object with NISRA statistics

Return type:

Feed

Example

>>> from bolster.utils.rss import get_nisra_statistics_feed
>>> feed = get_nisra_statistics_feed()
>>> feed100 = get_nisra_statistics_feed(limit=100)
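The page arithmetic behind `limit` can be illustrated as follows. This is a sketch of one plausible approach, not the function's actual code: `pages_needed`, `collect`, and the `fetch_page` callable are hypothetical, and stopping early on a short page is an assumption about how the end of the feed is detected.

```python
import math

PAGE_SIZE = 20  # entries per page, as stated in the description above


def pages_needed(limit):
    """Number of pages to request; None means the first page only."""
    if limit is None:
        return 1
    return max(1, math.ceil(limit / PAGE_SIZE))


def collect(fetch_page, limit=None):
    """Gather entries page by page.

    fetch_page is a hypothetical callable: page number (1-based) -> list.
    """
    entries = []
    for page in range(1, pages_needed(limit) + 1):
        batch = fetch_page(page)
        entries.extend(batch)
        if len(batch) < PAGE_SIZE:
            break  # a short page means there is nothing further to fetch
    return entries if limit is None else entries[:limit]


def fake_fetch(page):
    # Simulated feed holding 45 entries in total.
    start = (page - 1) * PAGE_SIZE
    return list(range(start, min(start + PAGE_SIZE, 45)))


print(pages_needed(100))                  # 5
print(len(collect(fake_fetch, limit=30))) # 30
```

Under these assumptions, `limit=100` requests at most five pages, and the result is truncated to `limit` after the last page is merged.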