
Scraping Pages vs RSS

There are a lot of interesting websites out there. But it's too much work to visit each site regularly to see if there's anything new. RSS is one way to be notified when changes occur, without having to visit every site (not to mention trying to figure out what has changed since the prior visit). Web scraping is another way. What's the difference?

RSS

RSS feeds are easy to use and easy to set up, and there are plenty of readers available.

However, RSS feeds are configured by the site's webmaster. If the page(s) you are interested in are not considered important by the webmaster (and therefore not included in the feed), you're out of luck.

Also, RSS feeds typically aren't prioritized. Most feeds are simply a chronological list of everything that was published, with no indication of which items matter most. In contrast, on a news website's home page (for example), a warm-blooded human being has probably used his or her editorial judgement to decide which articles are the most important.

Web Scraping

When you want to scrape a web page, you decide what content you will scrape, not the site's webmaster. In theory, any page that is visible in your web browser can also be scraped.

With web scraping you can also benefit from the wisdom of the warm-blooded human being (mentioned above) who uses editorial judgement to determine which content is the most important.

While web scraping addresses the two disadvantages of RSS feeds, it comes at a considerable cost: Almost anyone can subscribe to RSS feeds, but scraping web pages requires technical knowledge. You have to study the HTML of the pages you want to scrape, and write a little program called a spider that will extract content from the HTML, based on the instructions you give it.

An Example

Here's an example of scraping three different types of web pages and listing the most recent content:

Under the Hood

What the example above is doing can be broken down into three steps. Along the way, I'll give credit to the handy tools and services that make each step easier:

Scraping
I use the (Python-based) Scrapy package to: 1) download a web page; 2) find desired HTML elements and extract their contents; and 3) store the contents in a JSON file.
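Here is a minimal sketch of such a spider. The URL and CSS selectors are hypothetical placeholders; a real spider needs selectors that match the target site's actual HTML.

    # headlines_spider.py -- a bare-bones Scrapy spider (illustrative only)
    import scrapy


    class HeadlinesSpider(scrapy.Spider):
        name = "headlines"
        # Hypothetical page to scrape; substitute the page you care about.
        start_urls = ["https://example.com/news"]

        def parse(self, response):
            # Walk the headline links in page order, so the site's own
            # editorial ordering is preserved in the scraped output.
            for link in response.css("h2.headline a"):
                yield {
                    "title": link.css("::text").get(),
                    "url": response.urljoin(link.attrib["href"]),
                }

Running it with "scrapy runspider headlines_spider.py -o headlines.json" downloads the page, applies the selectors, and writes the extracted items to a JSON file.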
Rendering
Once I have a JSON file for each site, I combine them and render them into a single HTML page to send via email, using the Jinja2 templating engine.
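As a sketch of the rendering step, assuming one JSON file per site (the file names and template below are made up for illustration):

    # render_digest.py -- combine per-site JSON files into one HTML page
    import json

    from jinja2 import Template

    TEMPLATE = Template("""\
    <html><body>
    {% for site in sites %}
      <h2>{{ site.name }}</h2>
      <ul>
      {% for item in site['items'] %}
        <li><a href="{{ item.url }}">{{ item.title }}</a></li>
      {% endfor %}
      </ul>
    {% endfor %}
    </body></html>
    """)


    def render_digest(json_paths):
        # Each JSON file holds the list of items scraped from one site.
        sites = []
        for path in json_paths:
            with open(path) as f:
                sites.append({"name": path, "items": json.load(f)})
        return TEMPLATE.render(sites=sites)


    if __name__ == "__main__":
        html = render_digest(["site_a.json", "site_b.json", "site_c.json"])
        with open("digest.html", "w") as f:
            f.write(html)

One small detail: site['items'] is written with subscript syntax in the template so it doesn't collide with the dictionary's built-in items() method, while site.name and item.title fall back to dictionary lookups automatically.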
Emailing
One handy way to stay informed is to have the latest changes to your favorite sites delivered to your email inbox. But depending on the computer that does the scraping and sends the email, other email servers might treat the email as spam. This is where a smart SMTP host like SendGrid comes in handy. (And for sending modest numbers of emails, it's free of charge.)
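As a sketch, here is how the rendered page could be handed to SendGrid's SMTP relay using Python's standard library. The addresses are placeholders, and the API key is assumed to live in a SENDGRID_API_KEY environment variable.

    # send_digest.py -- email the rendered HTML digest via SendGrid's SMTP relay
    import os
    import smtplib
    from email.mime.text import MIMEText


    def send_digest(html, sender, recipient):
        msg = MIMEText(html, "html")
        msg["Subject"] = "Latest updates from your favorite sites"
        msg["From"] = sender
        msg["To"] = recipient

        # SendGrid's SMTP relay authenticates with the literal username
        # "apikey" and an API key as the password.
        with smtplib.SMTP("smtp.sendgrid.net", 587) as server:
            server.starttls()
            server.login("apikey", os.environ["SENDGRID_API_KEY"])
            server.send_message(msg)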
Published: 24 May 2018