bolster.utils.rss

RSS Feed parsing utilities for bolster.

This module provides utilities for parsing and working with RSS/Atom feeds, with a focus on government statistics and research publications.

Attributes

logger

Classes

FeedEntry

Represents a single entry from an RSS/Atom feed.

Feed

Represents a parsed RSS/Atom feed.

Functions

parse_date(date_str)

Parse a date string into a datetime object.

parse_feed_entry(entry)

Parse a feedparser entry into a FeedEntry object.

parse_rss_feed(feed_url[, timeout])

Parse an RSS or Atom feed from a URL.

filter_entries(entries[, title_contains, category, ...])

Filter feed entries based on various criteria.

get_nisra_statistics_feed([order, timeout, limit])

Get the NISRA statistics feed from GOV.UK.

Module Contents

bolster.utils.rss.logger

class bolster.utils.rss.FeedEntry

Represents a single entry from an RSS/Atom feed.

title: str
link: str
published: datetime.datetime | None = None
updated: datetime.datetime | None = None
summary: str | None = None
author: str | None = None
categories: list[str] = None
content: str | None = None
id: str | None = None
__post_init__()

Initialize empty lists for mutable default arguments.

to_dict()

Convert entry to dictionary representation.
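
The `= None` default on the list field, combined with `__post_init__`, is a common dataclass pattern for avoiding a list shared between instances. A minimal sketch of that pattern follows; `EntrySketch` is a hypothetical stand-in rather than the module's actual source, and the `to_dict` body is an assumption about the output shape.

```python
from dataclasses import asdict, dataclass


@dataclass
class EntrySketch:
    """Hypothetical stand-in for FeedEntry; not the module's actual source."""

    title: str
    link: str
    categories: list[str] = None  # placeholder, replaced in __post_init__

    def __post_init__(self):
        # Give each instance its own list instead of sharing one default.
        if self.categories is None:
            self.categories = []

    def to_dict(self):
        # Assumption: a plain mapping of field names to values.
        return asdict(self)


a = EntrySketch("A", "http://example.com/a")
b = EntrySketch("B", "http://example.com/b")
a.categories.append("births")  # does not affect b.categories
```

`dataclasses.field(default_factory=list)` achieves the same isolation without a `__post_init__` hook.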

class bolster.utils.rss.Feed

Represents a parsed RSS/Atom feed.

title: str
link: str
description: str | None = None
entries: list[FeedEntry] = None
language: str | None = None
updated: datetime.datetime | None = None
__post_init__()

Initialize empty lists for mutable default arguments.

to_dict()

Convert feed to dictionary representation.

bolster.utils.rss.parse_date(date_str)

Parse a date string into a datetime object.

Parameters:

date_str (str | None) – Date string in various formats

Returns:

Parsed datetime object or None if parsing fails

Return type:

datetime.datetime | None
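
To illustrate what parsing "various formats" can involve, here is a minimal sketch that tries ISO 8601 (used by Atom) and then RFC 2822 (used by RSS 2.0), returning None when neither fits. The helper name `sketch_parse_date` and the choice of these two formats are assumptions for illustration; the formats that `parse_date` itself accepts are not listed here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def sketch_parse_date(date_str: Optional[str]) -> Optional[datetime]:
    """Hypothetical helper: try ISO 8601, then RFC 2822; None on failure."""
    if not date_str:
        return None
    text = date_str.strip()
    try:
        # Map a trailing "Z" to an explicit offset, because
        # datetime.fromisoformat rejects it before Python 3.11.
        iso = (text[:-1] + "+00:00") if text.endswith("Z") else text
        return datetime.fromisoformat(iso)
    except ValueError:
        pass
    try:
        # RFC 2822 dates look like "Mon, 01 Apr 2024 12:00:00 GMT".
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None


expected = datetime(2024, 4, 1, 12, 0, tzinfo=timezone.utc)
atom_date = sketch_parse_date("2024-04-01T12:00:00Z")
rss_date = sketch_parse_date("Mon, 01 Apr 2024 12:00:00 GMT")
bad_date = sketch_parse_date("not a date")  # falls through both parsers
```

Returning None instead of raising lets a caller keep an entry whose date is malformed, which matches the documented return type.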

bolster.utils.rss.parse_feed_entry(entry)

Parse a feedparser entry into a FeedEntry object.

Parameters:

entry (feedparser.FeedParserDict) – feedparser entry dictionary

Returns:

FeedEntry object

Return type:

FeedEntry

bolster.utils.rss.parse_rss_feed(feed_url, timeout=30)

Parse an RSS or Atom feed from a URL.

Parameters:
  • feed_url (str) – URL of the RSS/Atom feed

  • timeout (int) – Request timeout in seconds (default: 30)

Returns:

Feed object containing parsed feed data

Raises:
Return type:

Feed

Example

>>> from bolster.utils.rss import parse_rss_feed
>>> feed = parse_rss_feed(
...     "https://www.gov.uk/search/research-and-statistics.atom?"
...     "content_store_document_type=all_research_and_statistics&"
...     "organisations%5B%5D=northern-ireland-statistics-and-research-agency"
... )
>>> feed.title
'Research and statistics from Northern Ireland Statistics and Research Agency (NISRA)'
>>> sorted(feed.__dataclass_fields__)
['description', 'entries', 'language', 'link', 'title', 'updated']
>>> len(feed.entries) > 0
True
>>> entry = feed.entries[0]
>>> sorted(entry.__dataclass_fields__)
['author', 'categories', 'content', 'id', 'link', 'published', 'summary', 'title', 'updated']
>>> isinstance(entry.title, str) and isinstance(entry.link, str)
True
>>> entry.link.startswith("http")
True
>>> from datetime import datetime
>>> isinstance(entry.published, datetime)
True

bolster.utils.rss.filter_entries(entries, title_contains=None, category=None, after_date=None, before_date=None)

Filter feed entries based on various criteria.

Parameters:
  • entries (list[FeedEntry]) – List of FeedEntry objects to filter

  • title_contains (str | None) – Filter entries whose title contains this string (case-insensitive)

  • category (str | None) – Filter entries that have this category

  • after_date (datetime.datetime | str | None) – Filter entries published after this date

  • before_date (datetime.datetime | str | None) – Filter entries published before this date

Returns:

Filtered list of FeedEntry objects

Return type:

list[FeedEntry]

Example

>>> from bolster.utils.rss import FeedEntry, filter_entries
>>> from datetime import datetime
>>> entries = [
...     FeedEntry("Births Statistics April 2024", "http://example.com/1", published=datetime(2024, 4, 1)),
...     FeedEntry("Deaths Statistics April 2024", "http://example.com/2", published=datetime(2024, 4, 2)),
...     FeedEntry("Old Statistics 2023", "http://example.com/3", published=datetime(2023, 6, 1)),
... ]
>>> recent = filter_entries(entries, title_contains="births", after_date="2024-01-01")
>>> [e.title for e in recent]
['Births Statistics April 2024']
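
The criteria combine with AND: an entry is kept only if it passes every filter that was supplied. A rough standalone sketch of that logic follows. `Entry` and `sketch_filter` are hypothetical stand-ins, and both the strict date comparison and the handling of entries without a publication date are assumptions rather than documented behaviour.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Union


@dataclass
class Entry:
    """Hypothetical stand-in for FeedEntry with only the fields used here."""

    title: str
    published: Optional[datetime] = None
    categories: list = field(default_factory=list)


def sketch_filter(entries, title_contains=None, category=None,
                  after_date: Union[datetime, str, None] = None):
    """Keep entries that satisfy every criterion that was supplied."""
    if isinstance(after_date, str):
        # Assumption: string dates are ISO 8601, as in "2024-01-01".
        after_date = datetime.fromisoformat(after_date)
    kept = []
    for entry in entries:
        if title_contains and title_contains.lower() not in entry.title.lower():
            continue  # case-insensitive title match failed
        if category and category not in entry.categories:
            continue
        # Assumption: undated entries fail a date filter; comparison is strict.
        if after_date and not (entry.published and entry.published > after_date):
            continue
        kept.append(entry)
    return kept


entries = [
    Entry("Births Statistics April 2024", datetime(2024, 4, 1), ["births"]),
    Entry("Deaths Statistics April 2024", datetime(2024, 4, 2), ["deaths"]),
    Entry("Births Statistics 2023", datetime(2023, 6, 1), ["births"]),
]
matches = sketch_filter(entries, title_contains="BIRTHS", after_date="2024-01-01")
titles = [e.title for e in matches]
by_category = [e.title for e in sketch_filter(entries, category="births")]
```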

bolster.utils.rss.get_nisra_statistics_feed(order='recent', timeout=30, limit=None)

Get the NISRA statistics feed from GOV.UK.

The GOV.UK Atom feed returns 20 entries per page. When limit exceeds 20, multiple pages are fetched automatically.

Parameters:
  • order (str) – Sort order: 'recent' for newest first, 'oldest' for oldest first

  • timeout (int) – Request timeout in seconds

  • limit (int | None) – Maximum number of entries to return. None returns the first page only (up to 20 entries)

Returns:

Feed object with NISRA statistics

Return type:

Feed

Example

>>> from bolster.utils.rss import get_nisra_statistics_feed
>>> feed = get_nisra_statistics_feed()
>>> feed100 = get_nisra_statistics_feed(limit=100)
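
With 20 entries per page, the number of requests implied by a given limit is simple ceiling arithmetic. The sketch below works it out; `pages_needed` is a hypothetical helper, not part of the module, and it assumes every page before the last is full.

```python
import math
from typing import Optional

PAGE_SIZE = 20  # entries per GOV.UK Atom page, as stated above


def pages_needed(limit: Optional[int], page_size: int = PAGE_SIZE) -> int:
    """Hypothetical helper: how many pages are needed to cover `limit` entries."""
    if limit is None:
        return 1  # first page only
    return max(1, math.ceil(limit / page_size))


page_counts = {limit: pages_needed(limit) for limit in (None, 5, 20, 21, 100)}
```

Under these assumptions, `limit=100` corresponds to five page fetches, so a larger `timeout` budget may be worth considering for big limits.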