Recipe: Dedupe Entries using Rules

By default, Plagger aggregates feeds, and all the entries in those feeds are consumed by Publish and Notify plugins. This isn't a big deal if you use the Subscription::Bloglines plugin (or an equivalent), because Bloglines does de-duplication of feed entries on its side.

But if you aggregate lots of feeds yourself and notify only on updated entries using an email notifier plugin (such as Publish::Gmail), it can be a little problematic.
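For reference, a self-aggregating setup like the one described might look roughly like this. The feed URL and mail address are placeholders, and this is only a sketch of the two ends of the pipeline:

  plugins:
    - module: Subscription::Config
      config:
        feed:
          - url: http://example.com/index.rdf

    - module: Publish::Gmail
      config:
        mailto: you@example.com

Without any dedupe step in between, every run mails you every entry, which is what the rest of this recipe addresses.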

Here are some hackish ways to dedupe entries using Rules.

Filter::Rule

Filter::Rule is a plugin that merges and strips entries from updated feeds using Rules. For example, the following config strips entries that don't match "Plagger" in their titles.

- module: Filter::Rule
  rule:
    expression: $args->{entry}->title =~ /Plagger/i

The idea is to use a rule to decide whether an entry should be kept for later plugins. This Filter::Rule directive walks through all the entries in the updated feeds, scans each title, and deletes any entry that doesn't match "Plagger".
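The expression is arbitrary Perl evaluated per entry, so you can match on fields other than the title. For instance, this config (the host pattern is made up for illustration) keeps only entries whose permalinks point at a particular site:

  - module: Filter::Rule
    rule:
      expression: $args->{entry}->link =~ m{^http://example\.com/}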

Use Rule::Fresh

Rule::Fresh is a rule module that checks an entry's modified time to see if it's a fresh entry.

- module: Filter::Rule
  rule:
    module: Fresh
    duration: 120

This config strips entries older than 120 minutes. So if you run your plagger job every 5 minutes, you probably want to set it to 5 rather than 120. You can also use a temporary file's timestamp to keep track of which entries are new.
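For example, if plagger runs from cron every 5 minutes, the matching config would simply be:

  - module: Filter::Rule
    rule:
      module: Fresh
      duration: 5

Note that if a run is skipped or delayed, a fixed duration can miss entries, which is why the mtime-based variant below is often more robust.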

- module: Filter::Rule
  rule:
    module: Fresh
    mtime:
      path: /path/to/mtime.tmp
      autoupdate: 1

This config checks the mtime of the file specified in the path config, then compares each entry's datetime with that mtime.

Optionally, the rule automatically updates the mtime with the current datetime during its initialization phase. This way you can run your plagger script whenever you want: the file at path (in this case, /path/to/mtime.tmp) remembers the last time the Plagger process was invoked, and entries' datetimes are checked against it to find fresh entries.
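With the mtime-based config above, the run schedule no longer has to match any duration setting, so you can drive Plagger from cron at whatever interval you like. The paths here are assumptions for illustration:

  */10 * * * * plagger -c /path/to/config.yaml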

Use Rule::Deduped

Rule::Fresh is a pretty neat and easy way to find fresh entries based on modified time, but it won't work for some edge cases, like entries without a datetime, or aggregated feeds like del.icio.us and Hatena Bookmark.

In that case, you'd want to deduplicate entries based on their permalinks (and possibly entry bodies) to see whether an entry is actually one you've never seen before. The Rule::Deduped plugin does that for you.

- module: Filter::Rule
  rule:
    module: Deduped

This configuration sets up a de-duplication database in ~/.plagger and keeps track of each entry's permalink and date, along with an MD5 digest of its title concatenated with its body. By default, it checks the combination of permalink (URL) + date to see if an entry is new.

If you'd also like to catch entries you've seen before but whose bodies have been updated, use compare_body:

- module: Filter::Rule
  rule:
    module: Deduped
    compare_body: 1

Adding compare_body makes the rule check the MD5 digest of the title and body as well; if the digest has changed (meaning the title or body was updated), the entry is treated as new.
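If memory serves, Plagger's rule dispatch also lets you combine multiple rules on one plugin, so you could AND a freshness check with de-duplication. Treat this as a sketch rather than a verified config:

  - module: Filter::Rule
    rule_op: AND
    rule:
      - module: Fresh
        mtime:
          path: /path/to/mtime.tmp
          autoupdate: 1
      - module: Deduped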

Right now it uses DB_File (Berkeley DB) as a backend, but eventually the functionality will be integrated with Plagger's native database backend once that's done.