Recipe: Scrape sites without RSS to generate RSS for them

Suppose you'd like to generate an RSS feed for a site that doesn't have one. Here's a way to do it with Plagger.

Let's take England and Wales Recent Decisions as an example, since this is the site Simon Cozens wanted to scrape when he first played with Plagger :) The site is a database of England and Wales law cases, and its HTML contains links to the recent cases. You want to turn those recent cases into an RSS feed, with each case as an entry.

First create a stub config YAML file as follows:

plugins:
  - module: Subscription::Config
    config:
      feed:
        - url: http://www.bailii.org/recent-decisions-ew.html

Run plagger with this config and you'll get:

Plagger [error] http://www.bailii.org/recent-decisions-ew.html is not aggregated by any aggregator

because there's no RSS feed found at the URL and Plagger doesn't know how to fetch information from it.
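For reference, this is how you'd run Plagger against the config (assuming you saved it as config.yaml):

plagger -c config.yaml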

Use CustomFeed::Simple

So now you have to tell Plagger how to extract links from the page. Look at the HTML structure closely and you'll see that all the links to individual cases look like "/ew/cases/EWHC/Admin/2006/1892.html". You can use the CustomFeed::Simple plugin to extract those links from the page. Update the config as follows.

plugins:
  - module: Subscription::Config
    config:
      feed:
        - url: http://www.bailii.org/recent-decisions-ew.html
          meta:
            follow_link: /\d{4}/\d{4}\.html
  - module: CustomFeed::Simple

Notice the meta and follow_link keys defined in the config, which declare the links to follow. The value of follow_link is a Perl regular expression matched against each link URL. I made it a little loose so that it picks up all the case links from the index page.
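If you want to sanity-check the pattern before running Plagger, a quick Perl one-liner will do (the sample path is one of the links from the index page):

perl -e 'print "matched\n" if "/ew/cases/EWHC/Admin/2006/1892.html" =~ m{/\d{4}/\d{4}\.html}'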

Run plagger again and you'll get tons of:

Plagger::Plugin::CustomFeed::Simple [debug] Add http://www.bailii.org/ew/cases/EWHC/TCC/2006/1187.html ...

That means you have now successfully extracted the links!

Now, just add Publish::Feed to generate an RSS feed out of the scraped data.

plugins:
  - module: Subscription::Config
    config:
      feed:
        - url: http://www.bailii.org/recent-decisions-ew.html
          meta:
            follow_link: /\d{4}/\d{4}\.html
  - module: CustomFeed::Simple
  - module: Publish::Feed
    config:
      dir: /path/to/public_html
      format: RSS
      filename: bailii.xml

And you'll get /path/to/public_html/bailii.xml, which is ready to subscribe to using your favorite RSS reader.
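To keep the feed fresh, you can run Plagger from cron; a minimal crontab sketch (the paths here are assumptions, adjust them to your setup):

# regenerate bailii.xml at the top of every hour
0 * * * * plagger -c /home/you/plagger/config.yaml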

Upgrade the Feed with EntryFullText

Your feed now contains the title of and link to each case. You're probably not satisfied with that: you want the summary or full content!

Filter::EntryFullText is a plugin that scrapes individual entry pages to extract the full content (or a summary), upgrading content-less feeds. You can write a custom pattern file for any site to tell the plugin how to extract the content.

For the EW cases site, look at the HTML closely again and you'll notice that it goes like:

<P><B>Lord Justice Longmore :</B> </P>

<LI VALUE="1."><A NAME='para1'>On 2nd March 2005 in the Crown Court ...
</A></LI>

So you'd like to extract the para1 portion from the page. Save the following text as assets/plugins/Filter-EntryFullText/bailii.yaml (see InstallPlagger about the assets directory):

handle: http://www\.bailii\.org/ew/cases/
extract: <A NAME='para1'>(.*?)</A>

This means: if the entry link matches the value of handle, run the pattern match in extract and use the first captured value as the entry body. Alternatively, you can write pattern files like:

handle: http://www\.bailii\.org/ew/cases/
extract: <div class="date">(.*?)</div><div class="body">(.*?)</div>
extract_capture: date body

to say, "do this match and capture the matched values as date and body".
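Under the hood this is just an ordinary Perl pattern match with multiple capture groups, assigned to the extract_capture names in order. A minimal standalone sketch of the idea (the sample HTML is made up):

use strict;

my $html = '<div class="date">20 July 2006</div><div class="body">Case text here</div>';
my ($date, $body) =
    $html =~ m{<div class="date">(.*?)</div><div class="body">(.*?)</div>};
print "date: $date\nbody: $body\n";   # captures map to "date" and "body"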

Anyway, now that you've written a pattern file for the EW cases, add the following to config.yaml:

  - module: Filter::EntryFullText

Then run plagger again. Now the Filter::EntryFullText plugin fetches the individual entries and extracts the summary, which is set as the RSS feed's description field. Upgrade done. YAY!

Using XPath instead of regexps

You can use XPath expressions instead of regexps to extract data from the fetched pages. The previous example could also be written as:

handle: http://www\.bailii\.org/ew/cases/
extract_xpath:
  date: //div[@class="date"]
  body: //div[@class="body"]

This can be more robust; for example, the regexp version will break if the body contains a nested div element.
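If you're curious what such an XPath lookup amounts to in Perl, here's a rough standalone sketch using HTML::TreeBuilder::XPath (the local file name is hypothetical; in practice Plagger fetches and parses the page for you):

use strict;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file('case.html');   # hypothetical saved copy of a case page
my $date = $tree->findvalue('//div[@class="date"]');
my $body = $tree->findvalue('//div[@class="body"]');
print "date: $date\nbody: $body\n";
$tree->delete;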