Changeset 1872

Timestamp: 11/28/06 01:44:28
Author: nik
Message:

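Fix a bug in the HTML tag extraction. The old pattern let the tag-name
capture backtrack: for the HTML "<img ...> <i>...</i>", (\w*) could
shrink to the 'i' of "<img ...>", leave "mg ..." to [^>]*, and then pair
with the later "</i>". Rewrite the regexp so that the captured name must
be followed by whitespace or '>'.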

Add a test case for this bug.
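
To make the failure concrete, here is a minimal standalone sketch (not part of the changeset) that runs both patterns from this changeset against the input of the new test case. Only the two patterns and the input string come from the changeset; the helper try_strip and the variable names are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Input taken from the new test case in base.t.
my $input = '<img src="..."> <i><a href="...">more text</a></i> some more text';

# The two patterns from the changeset, compiled with /s as in Simple.pm.
my $old_re = qr{^\s*<(\w*)\s*[^>]*>(.*?)</\1>}s;
my $new_re = qr{^\s*<([^ >]+)(?:\s+[^>]+)?>(.*?)</\1>}s;

# try_strip() is a hypothetical helper for this sketch, not Plagger code.
# It applies the substitution once and returns the captured tag name
# (undef when nothing matched) together with the resulting string.
sub try_strip {
    my ($re, $html) = @_;
    my $tag;
    if ($html =~ s/$re/$2/g) {
        $tag = $1;
    }
    return ($tag, $html);
}

my ($old_tag, $old_out) = try_strip($old_re, $input);
my ($new_tag, $new_out) = try_strip($new_re, $input);

# Old pattern: (\w*) backtracks from "img" to "i", so the tag name is 'i'
# and the <img> tag plus the closing </i> are stripped from the string.
printf "old: tag=%s result=[%s]\n",
    (defined $old_tag ? $old_tag : '(no match)'), $old_out;

# New pattern: "i" is followed by "m", not whitespace or '>', so the
# shortened name is rejected, nothing matches, and the input is unchanged.
printf "new: tag=%s result=[%s]\n",
    (defined $new_tag ? $new_tag : '(no match)'), $new_out;
```

With the old pattern the capture in $1 is 'i' and the string loses both the <img> tag and the closing </i>. With the new pattern nothing matches and the string is left untouched, which is what the added test expects.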

Files:

  • trunk/plagger/lib/Plagger/Plugin/Summary/Simple.pm

    --- r1779
    +++ r1872
    @@ -21,5 +21,5 @@
             local $HTML::Tagset::isBodyElement{div} = 0;
             my $html = $text->data;
    -        while ($html =~ s|^\s*<(\w*)\s*[^>]*>(.*?)</\1>|$2|gs) {
    +        while ($html =~ s|^\s*<([^ >]+)(?:\s+[^>]+)?>(.*?)</\1>|$2|gs) {
                 if ($HTML::Tagset::isBodyElement{lc($1)}) {
                     return "<$1>$2</$1>";
  • trunk/plagger/t/plugins/Summary-Simple/base.t

    --- r1778
    +++ r1872
    @@ -82,4 +82,10 @@
     <p>First paragraph</p>
     
    +=== Make sure element names are extracted properly
    +--- input
    +<img src="..."> <i><a href="...">more text</a></i> some more text
    +--- expected
    +<img src="..."> <i><a href="...">more text</a></i> some more text
    +
     === I18N. Japanese plaintext
     --- input