updated site_config

Nicolas Lœuillet 2014-10-27 06:46:13 +01:00
parent 4a50075784
commit 90a1a78b1e
64 changed files with 684 additions and 118 deletions

View File

@ -1,2 +1,2 @@
title: substring-before(//title, '—')
test_url: http://512pixels.net/more-on-linked-lists/
title: //meta[@property='og:title']/@content
test_url: http://www.512pixels.net/blog/2014/10/the-move

View File

@ -1,12 +1,14 @@
Full-Text RSS site config files
================
[Full-Text RSS](http://fivefilters.org/content-only/), our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, it checks to see if there are extraction rules for the site being processed. If there are no site patterns, it tries to detect the content block automatically.
[Full-Text RSS](http://fivefilters.org/content-only/), our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, it checks to see if there are extraction rules for the site being processed. If no rules are found, it tries to detect the content block automatically.
This repository contains the site config files we use in Full-Text RSS.
This repository contains the site-specific extraction rules we rely on in Full-Text RSS.
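The lookup described above can be illustrated with a small sketch. This is not how Full-Text RSS itself is implemented; it is a simplified Python illustration that assumes each config file is plain text made of `directive: value` lines, `#` comments, and the inline `replace_string(find): replacement` form seen in this repository. The function names, the `configs` mapping, and the `www.` stripping are hypothetical choices made for the example.

```python
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, str]


def parse_site_config(text: str) -> List[Rule]:
    """Parse site config text into (directive, value) pairs, in file order."""
    rules: List[Rule] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("replace_string(") and "): " in line:
            # Inline form: replace_string(find): replacement
            find, replacement = line[len("replace_string("):].split("): ", 1)
            rules.append(("find_string", find))
            rules.append(("replace_string", replacement))
            continue
        # Split on the first colon only, so XPath values keep their own colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a directive line
        rules.append((key.strip(), value.strip()))
    return rules


def rules_for_host(host: str, configs: Dict[str, str]) -> Optional[List[Rule]]:
    """Return parsed rules for a host, or None to signal automatic detection.

    Assumption: `configs` maps a bare host name to the text of its config file,
    and a leading "www." is ignored when looking the host up.
    """
    name = host[4:] if host.startswith("www.") else host
    text = configs.get(name)
    return parse_site_config(text) if text is not None else None
```

A caller would treat a `None` result as the cue to fall back to automatic content-block detection.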
### Contributing changes
We run automated tests on these files to detect issues. If you'd like to help keep these up to date, please look at the [test results](http://siteconfig.fivefilters.org/test/) and see which files you'd like to contribute fixes for.
We chose GitHub for this set of files because they offer one feature which we hope will make contributing changes easier: [file editing](https://github.com/blog/844-forking-with-the-edit-button) through the web interface.
You can now make changes to any of our site config files and request that your changes be pulled into the main set we maintain. This is what GitHub calls the Fork and Pull model:
@ -31,7 +33,7 @@ Marco, Instapaper's creator, graciously opened up the database of contributions
> And, recognizing that your efforts could be useful to a wide range of other tools and services, I'll make the list of all of these site-specific configurations available to the public, free, with no strings attached.
Most of the extraction rules in our set are borrowed from Instapaper. You can see the list maintained by Instapaper at [instapaper.com/bodytext/](http://instapaper.com/bodytext/) (login required).
Most of the extraction rules in our set are borrowed from Instapaper. You can see the list maintained by Instapaper at [instapaper.com/bodytext/](http://instapaper.com/bodytext/) (no longer available since Instapaper was sold).
### Testing site config files

View File

@ -1,4 +1,4 @@
body: //section[@class='content']
date: //span[1]
author: //h1[@id='sitetitle']
test_url: https://alexduner.com/blog/2013/1/something-i-learned-today
test_url: http://alexduner.com/blog/something-i-learned-today

View File

@ -1,3 +1,5 @@
body: //section[@class='main_cont']/img | //div[@class='articleContent']
title: //div[@class='blog_top_left']//h2
author: //a[@class='b'][1]
date: substring-after(substring-before(//div, 'Posted in'), ' on ')
strip_image_src: /content/images/globals/
@ -8,4 +10,6 @@ prune: no
single_page_link: concat('http://www.anandtech.com/print/', substring-after(//meta[@property='og:url']/@content, '/show/'))
test_url: http://www.anandtech.com/show/5812/eurocom-monster-10-clevos-little-monster/
test_url: http://www.anandtech.com/show/8370/gigabyte-am1m-s2h-review
test_url: http://www.anandtech.com/show/8402/sandisk-releases-ultra-ii-ssd-the-second-tlc-nand-ssd-in-the-market
test_url: http://www.anandtech.com/show/8400/arms-cortex-m-even-smaller-and-lower-power-cpu-cores

View File

@ -0,0 +1,23 @@
# Author: zinnober
prune: no
title: substring-before(//div[@id='content']/h1, ',')
single_page_link: //a[@title='Seite drucken']
body: //div[@id='detail-body']
replace_string(<span class="description">): <em>
replace_string(<p class="leadtext"><small>): <p class="leadtext">
# Fix headlines
replace_string(Patrick Hollstein): &nbsp;
replace_string(APOTHEKE ADHOC): &nbsp;
replace_string(dpa): &nbsp;
replace_string(Katharina Lübke): &nbsp;
replace_string(Julia Pradel): &nbsp;
replace_string(Franziska Gerhardt): &nbsp;
test_url: http://www.apotheke-adhoc.de/nachrichten/politik/nachricht-detail-politik/deutscher-apothekertag-antraege-gegen-lieferengpaesse-2/

View File

@ -13,5 +13,7 @@ title: //div[@id='story']//h2[@class='title']
strip: //div[@class='pager']
next_page_link: //nav//a[span/@class='next']/@href
native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
test_url: http://arstechnica.com/tech-policy/news/2012/02/gigabit-internet-for-80-the-unlikely-success-of-californias-sonicnet.ars
test_url: http://arstechnica.com/apple/2005/04/macosx-10-4/

View File

@ -0,0 +1,13 @@
title: //div[@class='col-center']/h1
author: //div[@class='personality']/a
date: //div[@class='personality-date']
body: //div[@class='content-top ']//div[@class='content'][1] | //div[contains(@class,'article-body')] | //div[contains(@class,'main-article')]
next_page_link: //div[@id='review-link']/a
strip: //div[@class='author-block']
strip: //p//iframe[contains(@src,'signup')]/preceding::p[1]
test_url: http://www.autocar.co.uk/car-review/volkswagen/golf
test_url: http://www.autocar.co.uk/car-news/pebble-beach/saleen-unveils-performance-electric-vehicle-based-tesla-model-s
test_url: http://www.autocar.co.uk/car-review/rolls-royce/first-drives/rolls-royce-ghost-series-ii-first-drive-review

View File

@ -13,7 +13,7 @@ body: //div[contains(@class, 'hrecipe')]//div[@id='subcolumn-1']
#strip: //div[@class="story-feature narrow"]
#strip: //div[@class="story-feature wide"]
#strip: //div[@class="story-feature dslideshow-enclosure"]
strip: //div[contains(@class, "story-feature")]
strip: //div[contains(@class, "story-feature") and not(contains(@class, 'full-width'))]
strip: //span[@class="story-date"]
#strip: //div[@class="caption body-narrow-width"]
strip: //div[@class="warning"]//p
@ -30,13 +30,26 @@ strip: //div[contains(@class, 'comment-introduction')]
strip: //div[contains(@class, 'share-tools')]
strip: //div[@id='also-related-links']
strip_id_or_class: share-help
strip_id_or_class: comments_module
replace_string(<noscript>): <div>
replace_string(</noscript>): </div>
tidy: no
prune: no
dissolve: //h2
test_url: http://www.bbc.co.uk/sport/0/football/23224017
test_contains: Swansea City have completed the club-record signing
test_url: http://www.bbc.co.uk/news/business-15060862
test_contains: Europe's leaders are meeting again to try to solve
# news feed
test_url: http://feeds.bbci.co.uk/news/rss.xml
# sports feed
test_url: http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int
# video entry
test_url: http://www.bbc.co.uk/news/world-asia-22056933
test_url: http://www.bbc.co.uk/news/world-asia-22056933

60
inc/3rdparty/site_config/standard/bbc.com.txt vendored Executable file
View File

@ -0,0 +1,60 @@
body: //div[@class="story-body"]
# for video entries
body: //div[contains(@class, "videoInStory") or @id="meta-information"]
title: //h1[@class="story-header"]
date: //span[@class="story-date"]/span[@class='date']
# for sport site
date: //meta[@name='DCTERMS.created']/@content
author: //div[@id='headline']//span[@class='byline-name']
# recipes, e.g. http://www.bbc.co.uk/food/recipes/mymincepies_71055
body: //div[contains(@class, 'hrecipe')]//div[@id='subcolumn-1']
#strip: //div[@class="story-feature narrow"]
#strip: //div[@class="story-feature wide"]
#strip: //div[@class="story-feature dslideshow-enclosure"]
strip: //div[contains(@class, "story-feature") and not(contains(@class, 'full-width'))]
strip: //span[@class="story-date"]
#strip: //div[@class="caption body-narrow-width"]
strip: //div[@class="warning"]//p
strip: //div[@id='page-bookmark-links-head']
strip: //object
strip: //div[contains(@class, "bbccom_advert_placeholder")]
strip: //div[contains(@class, "embedded-hyper")]
strip: //div[contains(@class, 'market-data')]
strip: //a[contains(@class, 'hidden')]
strip: //div[contains(@class, 'hypertabs')]
strip: //div[contains(@class, 'related')]
strip: //form[@id='comment-form']
strip: //div[contains(@class, 'comment-introduction')]
strip: //div[contains(@class, 'share-tools')]
strip: //div[@id='also-related-links']
strip_id_or_class: share-help
strip_id_or_class: comments_module
replace_string(<noscript>): <div>
replace_string(</noscript>): </div>
native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
tidy: no
prune: no
dissolve: //h2
test_url: http://www.bbc.com/sport/0/football/28918021
test_contains: Cameroonian footballer Albert Ebosse has died
test_url: http://www.bbc.com/sport/0/football/23224017
test_url: http://www.bbc.com/news/business-15060862
test_contains: Europe's leaders are meeting again to try
# news feed
test_url: http://feeds.bbci.co.uk/news/rss.xml
# sports feed
test_url: http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int
# video entry
test_url: http://www.bbc.com/news/world-asia-22056933

View File

@ -0,0 +1,19 @@
body: //div[@id='column_1']
next_page_link: //div[@class='next']/a[not(contains(@href, '/comments') or contains(@href, '/news/'))]
prune: no
author: substring-after(//p[@class='byline'], 'by ')
date: substring-before(substring-after(//p[@class='byline'], 'on '), ' by')
strip: //h1
strip_id_or_class: socialLinks
strip_id_or_class: byline
strip_id_or_class: pageSelector
strip_id_or_class: articleTabs
strip_id_or_class: pageNav
strip_id_or_class: share
strip_id_or_class: commentsContainer
strip_id_or_class: below_article_related
test_url: http://www.bit-tech.net/hardware/storage/2014/08/13/ocz-arc-100-240gb-review/1
test_url: http://www.bit-tech.net/news/bits/2014/08/15/google-trojan/1

View File

@ -0,0 +1,16 @@
body: //div[contains(@class, 'article_pages')]
strip_id_or_class: article_page-header
strip_id_or_class: paginator
strip_id_or_class: article_info
find_string: src="data:image
replace_string: ignore-src="data:image
find_string: data-defer-src="
replace_string: src="
prune: no
test_url: http://bleacherreport.com/articles/feed
test_url: http://bleacherreport.com/articles/2137787-christian-ponders-newborn-daughter-was-named-after-fsu-legend-bobby-bowden
test_url: http://bleacherreport.com/articles/2137596-college-football-week-1-picks-unlv-runnin-rebels-vs-arizona-wildcats/

View File

@ -0,0 +1,45 @@
# Author: zinnober
tidy: no
prune: no
# Set author
author: //a[@rel='author']
# Set date
date: //span[@class='Datum']
# Content is here
body: //div[@class='Artikel']
# Tidy up before article
strip: //div[@id='FAZHeaderNeu']
strip: //h2[@itemprop='headline']
strip: //span[@class='Datum']
strip: //span[@class='Autor']
strip_id_or_class: ArticlePagerTop
strip: //div[@class='FAZArtikelEinleitung']/h2
# General cleanup
strip: //div[@class='clear']
strip: //span[@class='Bildnachweis']
strip: //iframe
strip_id_or_class: Community
strip: ' · '
# Remove tracking and ads
strip_image_src: /l.gif?
strip: //img[@width='1']
strip_id_or_class: invisible
strip_id_or_class: Anzeige
strip_id_or_class: billboard
# Remove clutter after article
strip_id_or_class: Tagline
strip_id_or_class: ArtikelAbbinder
strip_id_or_class: FAZArtikelKommentare
strip_id_or_class: ArtikelKommentieren
strip_id_or_class: FAZContentRight
# Try it yourself
test_url: http://blogs.faz.net/wost/2014/08/17/viel-fuck-und-wenig-guter-sex-1239/

View File

@ -19,5 +19,8 @@ strip: //p[@class='nota_pie']
strip: //div[starts-with(@id, 'sumario') and contains(., 'más información')]
strip: //div[@id='coment' or @id='foros_not']
test_url: http://elpais.com/elpais/2012/02/06/gente/1328526783_491687.html
test_url: http://www.elpais.com/articulo/cultura/mano/retrato/materia/elpepicul/20120207elpepicul_2/Tes
test_url: http://brasil.elpais.com/brasil/2014/10/15/politica/1413334841_878730.html
test_contains: O PT quer intensificar a presença do ex-presidente
test_url: http://brasil.elpais.com/brasil/2014/10/13/internacional/1413225730_450761.html
test_contains: Todos na localidade onde ele nasceu ainda falavam da façanha

View File

@ -1,30 +1,17 @@
# story has several pages, should be detected
body: //div[@id='storyBody']
body: //div[@id='article_body']
body: //div[@id='story_body']
# include the lead graphic in the body, if available
body: //div[contains(concat(' ', normalize-space(@id), ' '), ' lead_graphic ')] | //div[contains(concat(' ', normalize-space(@itemprop), ' '), ' articleBody ')]
title: //h1[contains(concat(' ', normalize-space(@itemprop), ' '), ' headline ')]
date: //time[contains(concat(' ', normalize-space(@itemprop), ' '), ' datePublished ')]
title://h1[@id='article_headline']
# article author
author: //p[@class='author']/a
# story author(s)
author: substring-after(//p[@class='byline'], 'By ')
# article date
date: //span[@class='published_date']
# story date
date: //span[@class='date']
date: substring-after(//div[contains(@class,'attributor')],'on')
strip_id_or_class: inset
strip: //p/span[@class='photoCredit']
strip: //h1
strip_id_or_class: page_count
strip_id_or_class: tools
strip_id_or_class: pagination
single_page_link: //li[@id='stPrint']/a
strip_id_or_class: photo_credit
strip_id_or_class: photo_caption
strip_id_or_class: inline_gallery
# pull quote, often inside a blockquote element
strip_id_or_class: pq
strip_id_or_class: credit
strip_id_or_class: figcaption
strip_id_or_class: related_item
test_url: http://www.businessweek.com/magazine/buyback-insurance-a-good-deal-for-retailers-07282011.html
test_url: http://www.businessweek.com/articles/2012-06-06/american-pain-the-largest-u-dot-s-dot-pill-mills-rise-and-fall
test_url: http://www.businessweek.com/articles/2012-06-06/american-pain-the-largest-u-dot-s-dot-pill-mills-rise-and-fall
test_url: http://www.businessweek.com/articles/2014-07-09/american-apparel-dov-charneys-sleazy-struggle-for-control

View File

@ -10,6 +10,15 @@ date: //time[@data-print='date']
body: //div[@data-print='body']
body: //section[@data-print='body']
find_string: rel:bf_image_src=
replace_string: src=
find_string: src="data:
replace_string: disabled_src="data:
native_ad_clue: //meta[@property="article:section" and @content="Advertiser"]
# For various things...
strip: *[@data-print="ignore"]
test_url: http://www.buzzfeed.com/hgrant/35-reasons-why-dogs-hate-the-holidays
test_url: http://www.buzzfeed.com/hgrant/35-reasons-why-dogs-hate-the-holidays
# Native ad
test_url: http://www.buzzfeed.com/bravo/ways-to-up-your-online-dating-game

View File

@ -0,0 +1,28 @@
# Author: zinnober
tidy: no
prune: no
# Set title
title: //h2
date: //li[@class='time']
# Set author
author: //a[contains(@rel, 'author')]
# Content is here
body: //div[@id='content']
# Tidy up before article
strip: //div[@class='meta']
# Tidy up after article
strip_id_or_class: nr_related_placeholder
strip_id_or_class: twitter-share-button
strip_id_or_class: afterpost
strip_id_or_class: tags
# Try it yourself
test_url: http://www.canonrumors.com/2014/09/chuck-westfall-talks-canon-eos-7d-mark-ii/
test_url: http://www.canonrumors.com/2014/09/canon-cinema-eos-captures-space-in-4k-for-new-imax-3d-film/

View File

@ -2,4 +2,5 @@ title: //div[@class='title']
author: //div[@class='author']
prune: no
test_url: http://www.chomsky.info/onchomsky/2002----.htm
test_url: http://www.chomsky.info/onchomsky/2002----.htm
test_contains: The propaganda model argues

View File

@ -1,5 +1,9 @@
title: //div[@id='maincontent']//h1
body: //div[@id='resizeableText']
single_page_link: concat(//link[@rel='canonical']/@href, '?sp=true')
test_url: http://cn.reuters.com/article/CNAnalysesNews/idCNKBS0FF0NM20140710
test_url: http://cn.reuters.feedsportal.com/CNAnalysesNews
test_url: http://cn.reuters.feedsportal.com/CNAnalysesNews
# multipage link
test_url: http://cn.reuters.com/article/idCNKBS0FF0UL20140710

View File

@ -1 +1,3 @@
body: //div[@id='content']
body: //div[@id='readme']
test_url: http://code.fivefilters.org/full-text-rss

View File

@ -15,4 +15,4 @@ strip_id_or_class: promotion-tag
tidy: no
prune: no
test_url: www.csmonitor.com/World/Middle-East/2011/1108/Imminent-Iran-nuclear-threat-A-timeline-of-warnings-since-1979/Earliest-warnings-1979-84
test_url: http://www.csmonitor.com/World/Middle-East/2011/1108/Imminent-Iran-nuclear-threat-A-timeline-of-warnings-since-1979/Earliest-warnings-1979-84

View File

@ -2,4 +2,4 @@ single_page_link: //a
tidy: no
prune: no
test_url: da.feedsportal.com/c/585/f/413794/s/17037b5a/l/0L0Stelegraaf0Bnl0Cbinnenland0C10A2757860C0I0IKlacht0Itegen0Idr0B0IFrank0Iniet0I0Eontvankelijk0I0I0Bhtml0Dcid0Frss/ia1.htm
test_url: http://da.feedsportal.com/c/585/f/413794/s/17037b5a/l/0L0Stelegraaf0Bnl0Cbinnenland0C10A2757860C0I0IKlacht0Itegen0Idr0B0IFrank0Iniet0I0Eontvankelijk0I0I0Bhtml0Dcid0Frss/ia1.htm

View File

@ -0,0 +1,31 @@
# Author: zinnober
tidy: no
prune: no
# Set title
title: //header/h1
# Set author
author: //a[@rel='author']
# Content is here
body: //article
# Tidy up before article
strip: //header
# Tidy up article
strip: //div[contains(@id, 'gallery-')]
replace_string(<a rel="attachment): <p rel="attachment
# Tidy up after article
strip: //div[@class='sm']
strip_id_or_class: related
strip_id_or_class: comments
strip: //footer
# Try it yourself
test_url: http://www.designsponge.com/2010/06/seattle-design-guide.html
test_url: http://www.designsponge.com/2012/04/sneak-peek-liz-cook.html

View File

@ -2,4 +2,6 @@ body: (//blockquote[contains(@class, 'postcontent')])[1]
body: (//div[starts-with(@id, 'post_message')])[1]
prune: no
tidy: no
tidy: no
test_url: http://www.desitvforum.net/forum/watch-online/431739-creature-3d-2014-watch-online-download-dvd-rip.html

View File

@ -0,0 +1,29 @@
# Author: zinnober
prune: yes
tidy: yes
title: //h1
date: //p[@class='news_datum']
author: //span[@class='author']
body: //div[@class='tagesnews-content']
# General cleanup
strip_id_or_class: dachzeile
strip: //h3
strip: //p[@class='bodytext']//a
strip_id_or_class: autor_datum
strip_id_or_class: comments
strip_id_or_class: banner-
strip: //p[contains(., 'Lesen Sie')]
strip: //p[contains(., ' in DAZ')]
# Fix image captions
replace_string(<p class="image_caption">): <p><small><em>
replace_string(</dd>): </em></small></dd>
test_url: http://www.deutsche-apotheker-zeitung.de/pharmazie/news/2014/09/03/weniger-nebenwirkungen-aber-kein-zusatznutzen/13715.html
test_url: http://www.deutsche-apotheker-zeitung.de/recht/news/2014/09/02/urteile-zum-cannabis-eigenanbau-bfarm-geht-in-berufung/13716.html

View File

@ -1,8 +1,6 @@
title: //h1[@id='query_h1']
body: //div[contains(@class, 'lunatext results_content')]
strip_id_or_class: spl_unshd
#replace_string(<div class="dicTl">): <div class="dicTl">------------------<br />
body: //div[contains(@class, 'source-data')]
strip: //button
prune: no
test_url: http://www.wired.com/cloudline/2011/10/meet-arms-cortex-a15-the-future-of-the-ipad-and-possibly-the-macbook-air/
test_url: http://dictionary.reference.com/browse/propaganda

View File

@ -1 +1,3 @@
single_page_link: //a[@id='download_button_link']
single_page_link: //a[@id='download_button_link']
test_url: https://www.dropbox.com/s/qmocfrco2t0d28o/Fluffbeast.docx

View File

@ -0,0 +1,24 @@
# Author: Marvin Dickhaus <github@marvindickhaus.de>
# 2014-10-08
#Tidy just messes up the DOM
tidy: no
title: //h1
body: //h2 | //div[@id='artikelteaser'] | //div[@id='artikeltext']
#Strip
strip_image_src: artikel_a_merken.gif
strip: //div[@class='zusatzinfo']
#Author: substring is used to remove the " Von " prefix.
author: substring(//li[@class='artikelautor'], 5)
date: //li[@class='artikeldatum']
#The first two URLs will at some point no longer show
#the full article. There is a time-based paywall
#installed. Using the feed should present valid output
test_url: http://www.echo-online.de/art1231,5503063
test_url: http://www.echo-online.de/art1168,5502598
test_url: http://www.echo-online.de/rss/darmstadt.xml

View File

@ -1,8 +1,13 @@
body: //div[@class='main-content']
body: //article[contains(@class, 'resp-node')]
date: //time[@class='date-created']
strip: //aside
prune: no
autodetect_next_page: no
test_url: http://www.economist.com/node/21528429
test_url: http://www.economist.com/node/21528429
test_url: http://www.economist.com/news/essays/21623373-which-something-old-and-powerful-encountered-vault
test_contains: the calfskin pages are smooth
test_contains: Books will evolve online and off

View File

@ -1,8 +1,9 @@
body: //div[ @class='content' ] | //div[ @class='blog-entry' ]
body: //p[@class='strapline'] | //div[@class='cover-image'] | //article[@class='hd']
strip: //div[@class='social top']
strip: //p[@class='byline']
strip: //h2/abbr | //div[ @class='lowleader' ] | //*[ @class='discussion' ] | //img[ @class='play-button' ] | //div[ @class='boxout' ] | //h2/a | //h2 | //h2/div | //p[ @class='timestamp' ] | //a[ @class='eurogamer-author' ] | //p[ @class='aPager' ] | //h1 | //div[ @id='lowleader' ] | //a[ @class='next' ] | //div[contains(concat(' ', normalize-space(@class), ' '), ' pullquote ')]
date: //span[@itemprop='datePublished']
author: //a[@itemprop='author']/text()
date://p[ @class='timestamp' ]
author://a[ @class='eurogamer-author' ]
test_url: http://www.eurogamer.net/articles/digitalfoundry-vs-unreal-engine-4
test_url: http://www.eurogamer.net/articles/2014-08-20-bungie-ordered-to-return-shares-to-composer-marty-odonnell
test_url: http://www.eurogamer.net/articles/2014-08-20-invisible-inc-does-espionage-justice

View File

@ -1,5 +1,12 @@
body: //div[@id='imagestage']
body: //div[contains(@class, 'userContentWrapper')]
strip_id_or_class: commentable
prune: no
tidy: no
test_url: https://www.facebook.com/feeds/page.php?id=338077742912613&format=rss20
# single_page_link: replace(substring-after(//noscript//meta[@http-equiv="refresh"]/@content, 'URL='), "&amp;", "&")
test_url: https://www.facebook.com/permalink.php?story_fbid=10154584776550183&id=294468630182
test_contains: holding an extraordinary session in Brussels this month

0
inc/3rdparty/site_config/standard/faz.net.txt vendored Normal file → Executable file
View File

View File

@ -5,8 +5,8 @@ strip: //div[contains(@class, 'related-companies')]
strip: //div[@id='y-article-related']
strip: //div[@id='ypf-article-related']
prune: no
tidy: no
single_page_link: //div[@class='ft']//a[contains(@href, 'page=all')]
test_url: http://sg.finance.yahoo.com/news/Motorola-takes-wraps-249-rsg-3508842732.html?x=0&.v=1
test_url: http://finance.yahoo.com/news/super-young-retirement-savers.html
test_url: http://finance.yahoo.com/news/canadian-orebodies-gives-notice-exercise-130000032.html

View File

@ -1,2 +1,2 @@
body: //div[@class='entry']
test_url: http://www.fivechapters.com/2010/paris-part-one/
test_url: http://www.fivechapters.com/2014/the-saddest-writer-in-america-part-two/

View File

@ -1 +1,4 @@
prune: no
body: //section[contains(@class, 'container')]
prune: no
test_url: http://fivefilters.org/kindle-it/

View File

@ -1,15 +1,19 @@
title: //div[@class='translateHead']//h1 | //div[@id='art-mast']//h1
author: substring-after(//span[@id='by-line'], 'BY ')
date: //span[@id='pub-date']
body: //div[@id='art-mast']/h2 | //div[@class='translateBody'] | //div[@id='art-body']
body: (//article//img[contains(@class, 'main_photo')])[1] | (//article//div[contains(@class, 'full_post_content')])[1]
#body: //div[@id='art-mast']/h2 | //div[@class='translateBody'] | //div[@id='art-body']
#Strip inside article content
strip: //div[@id='share-box']
strip: //div[@id='special-box']
strip: //div[@id='special-box
strip_id_or_class: side_panel
prune: no
single_page_link: //span[@id='controls']/a[contains(@href, 'print=yes')]
single_page_link: //a[text()='SINGLE PAGE']
test_url: http://www.foreignpolicy.com/articles/2014/07/22/the_end_game_in_gaza_netanyahu_hamas
test_url: http://www.foreignpolicy.com/articles/2011/08/01/a_murderers_manifesto_and_me
test_url: http://www.foreignpolicy.com/articles/2012/02/29/five_years_in_damascus

View File

@ -1,25 +1,34 @@
# Jens Kohl, jens.kohl@...
# - Added publication date
# - Stripped pagination block
# - Added single page link
# - Added XPath queries for the printer friendly version
# Author: zinnober
# Rewrite of original template which fetched the printer-version without pictures
title: //h1
body: //div[@class='formatted']
tidy: no
prune: no
date: substring-after(//li[2][@class="text1"], 'Datum:')
strip: //ol[@class="list-chapters"]
strip_comments: yes
# Set full title
title: //h1
# next: commands for printer friendly pages
single_page_link: //a[contains(@href, 'print.php?a=')]/@href
title: //body/h3
strip_image_src: staticrl/images/logo.jpg
strip_image_src: http://cpx.golem.de/cpx.php?class=7
strip: //body/h3
strip: //body/b[1]
strip: //body/b[2]
strip: //body/b[3]
strip: //div[1]
test_url: http://www.golem.de/1112/88696.html
date: //time
# Content is here
body: //article
# Fetch full multipage articles
next_page_link: //a[@id='atoc_next']
# Remove tracking and ads
strip_id_or_class: iqadtile4
# General Cleanup
strip_id_or_class: list-jtoc
strip_id_or_class: table-jtoc
strip_id_or_class: implied
strip_id_or_class: social-
strip_id_or_class: comments
strip_id_or_class: footer
# Tidy up galleries (could still be improved, though)
strip: //img[@src='']
# Try yourself
test_url: http://www.golem.de/news/intel-core-i7-5960x-im-test-die-pc-revolution-beginnt-mit-octacore-und-ddr4-1408-108893.html
test_url: http://www.golem.de/news/test-infamous-first-light-neonbunter-actionspass-1408-108914.html

View File

@ -1,9 +1,42 @@
#second part of single_page_link for telepolis-articles (desktop-version of site)
single_page_link: //p[@class='news_option']/a | //a[@id='tp-druckversion']
# Author: zinnober
# Template should work well with either desktop or mobile version (m.heise.de)
prune: no
title: //article/h1 | //h1
date: //p[@class='news_datum']
title: //h1
body: //div[@class='meldung_wrapper']
author: //h4[@class='author']
test_url: http://www.heise.de/newsticker/meldung/Europa-soll-Grundrechteschutz-im-Netz-staerken-1392664.html
test_url: http://www.heise.de/tp/artikel/42/42579/1.html
body: //article | //div[@class='meldung_wrapper']
# General cleanup
strip: //time
strip: //h4[@class='author']
strip: //p[@class='news_datum']
strip: //p[@class='artikel_datum']
strip: //a[contains(@href, 'mailto')]
strip_id_or_class: comments
strip_id_or_class: ISI_IGNORE
strip_id_or_class: clear
strip_id_or_class: linkurl_grossbild
strip_id_or_class: image-num
strip_id_or_class: heisebox_right
strip_id_or_class: dossier
# Strip Ads
strip_id_or_class: ad_
# Some optimizations
replace_string(<h5>): <h2>
replace_string(</h5>): </h2>
replace_string(<span class="bild_rechts"): <p
replace_string(<div class="heisebox">): <blockquote>
next_page_link: //a[@class='next']
next_page_link: //a[@title='vor']
test_url: http://www.heise.de/open/artikel/Die-Neuerungen-von-Linux-3-15-2196231.html
test_url: http://m.heise.de/open/artikel/Die-Neuerungen-von-Linux-3-15-2196231.html
test_url: http://www.heise.de/newsticker/meldung/Ueberwachungstechnik-Die-globale-Handy-Standortueberwachung-2301494.html

View File

@ -2,4 +2,4 @@ body: //table[@class='ap-smallphoto-table'] | //div[@class='body']//*[@class='en
tidy: no
strip_image_src: analytics.apnewsregistry
test_url: http://hosted.ap.org/dynamic/stories/U/US_SPENDING_SHOWDOWN?SITE=FLPET&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2011-04-06-07-46-50
test_url: http://hosted.ap.org/dynamic/stories/E/EU_TURKEY_KURDS?SITE=KSNEW&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2014-10-14-10-50-25

View File

@ -0,0 +1,14 @@
body: //div[@id='left-stack' or contains(@class, 'center-stack')]
find_string: class="artwork" src="
replace_string: class="artwork" src-disabled="
find_string: src-swap-high-dpi="
replace_string: src="
strip_id_or_class: rating
strip_id_or_class: listeners-also-bought
prune: no
test_url: https://itunes.apple.com/us/rss/topaudiobooks/limit=10/xml
test_url: https://itunes.apple.com/us/audiobook/the-giver-unabridged/id356345850

View File

@ -4,4 +4,4 @@ body: //div[@itemprop='articleBody']
tidy: no
test_url: http://www.kachiblog.com/2013/05/samsung-galaxy-s4-vs-samsung-galaxy.html
test_url: http://www.kachiblog.com/feeds/posts/default
test_url: http://www.kachiblog.com/feed

View File

@ -0,0 +1,7 @@
title: //div[@itemprop='headline']
body: //noscript/img | //div[@itemprop='text']
author: //div[@class='meta meta--post']//a[@class='is-author']
date: //div[@class='meta meta--post']//time/@datetime
test_url: http://www.lifehacker.co.uk/2014/08/22/dealhacker-10-google-chromecast-super-cheap-batteries-much
test_url: http://www.lifehacker.co.uk/2014/08/18/andrognito-hides-files-youd-like-keep-away-prying-eyes

View File

@ -25,4 +25,4 @@ strip_id_or_class: 'rightimage'
#Comments
strip: //table
strip: //p/following-sibling::*[0]
test_url: http://www.mainpost.de/ueberregional/meinung/Dioxin-Skandal-bringt-Agrarministerin-in-Bedraengnis;art9517,5920211
test_url: http://www.mainpost.de/regional/wuerzburg/Autobahnschuetze-Staatsanwalt-fordert-zwoelf-Jahre;art492151,8386332

View File

@ -1,4 +1,5 @@
strip_id_or_class: article-tools
strip_id_or_class: pagenav
prune: no
test_url: http://www.medialens.org/index.php/alerts/alert-archive/2012/713-the-illusion-of-democracy.html
test_url: http://www.medialens.org/index.php/alerts/alert-archive/2012/713-the-illusion-of-democracy.html
test_contains: In an era of permanent war, economic meltdown

View File

@ -1,7 +1,12 @@
body: //div[contains(@class, 'post-content-inner')]
strip_id_or_class: follow-ups
strip_id_or_class: footer
body: //div[contains(@class, 'postContent-inner')]
strip_id_or_class: supplementalPostContent
prune: no
test_url: https://medium.com/p/6844c0d7893b
test_url: https://medium.com/@savolai/kaytettavyyden-haasteet-keskustelukulttuurista-2-3-6844c0d7893b
test_contains: Jos käytettävyysongelmat ovat kerran niin tyypillisiä
test_contains: Keskustelukulttuuriongelmasta (subjective vs. objective bugs)
test_url: https://medium.com/health-the-future/thirty-things-ive-learned-482765ee3503
test_contains: Remember you will die
test_contains: You have to have some faith.

View File

@ -0,0 +1,12 @@
strip: //div[contains(@style, 'float:right') and contains(., 'advertisement')]
body: //div[@style="float:left;width:740px;"]
tidy: no
test_url: http://www.menshealth.com.sg/fitness/mh-picks-under-armour-clutchfit-nitro-mid-cleats
test_contains: These cleats are made for one thing
test_url: http://www.menshealth.com.sg/fitness/top-10-fat-burning-bodyweight-moves-you-can-do-10-minutes
test_contains: let this workout fool you
test_url: http://www.menshealth.com.sg/fitness/feed

View File

@ -8,4 +8,4 @@ strip_id_or_class: news_morearticlesincat
strip_id_or_class: ezc_comments
strip_comments: yes
test_url: http://www.northumberlandview.ca/index.php?module=news&func=display&sid=5972
test_url: http://www.northumberlandview.ca/index.php?module=news&type=user&func=display&sid=31127

View File

@ -42,8 +42,12 @@ strip://h6[@class = 'kicker']
author:substring-after(//h6[@class='byline'],'By ')
test_url: http://www.nytimes.com/2011/07/24/books/review/an-academic-authors-unintentional-masterpiece.html
test_contains: In this column I want to look at a not uncommon way of writing
test_url: http://www.nytimes.com/2012/06/10/arts/television/the-newsroom-aaron-sorkins-return-to-tv.html
test_contains: IF youve seen enough of Aaron Sorkins theater
test_url: http://www.nytimes.com/2013/03/25/world/middleeast/israeli-military-responds-after-patrols-come-under-fire-from-syria.html
test_url: http://www.nytimes.com/2013/08/15/nyregion/when-the-new-york-city-subway-ran-without-rails.html
test_url: http://www.nytimes.com/2004/02/29/weekinreview/correspondence-class-consciousness-china-s-wealthy-live-creed-hobbes-darwin-meet.html
test_url: http://www.nytimes.com/2014/06/19/opinion/gail-collins-romney-and-the-2016-contenders-huddle.html
test_url: http://www.nytimes.com/2014/06/19/opinion/gail-collins-romney-and-the-2016-contenders-huddle.html

@ -1,3 +1,5 @@
body: //div[@id='_ctl12__ctl0_Article']
body: //div[contains(@class, 'article-photo-wrapper')]
prune: no
autodetect_on_failure: no
test_url: http://www.real.gr/DefaultArthro.aspx?page=arthro&id=360962&catID=1
test_contains: Επισήμως το αποψινό υπουργικό

@ -7,7 +7,7 @@ author: //p[@class="tagline"]/a
# this doesn't work for some reason...?
date: //p[@class="tagline"]//@datetime
body: //div[@class="expando"]//div[@class="usertext-body"]
body: (//div[contains(@class, 'noncollapsed')]//div[contains(@class, 'usertext-body')])[1]
strip_id_or_class: tagline
strip_id_or_class: unvotable-message
@ -17,4 +17,5 @@ strip_id_or_class: buttons
single_page_link: //p[@class="title"]/a[contains(@href, 'http://')]
test_url: http://www.reddit.com/r/truegaming/comments/wfe7r/i_wrote_about_the_problems_i_honestly_feel_that/
test_url: http://www.reddit.com/r/worldnews/comments/1as37r/twelve_north_korean_soldiers_attempting_to_defect/
test_url: http://www.reddit.com/r/worldnews/comments/1as37r/twelve_north_korean_soldiers_attempting_to_defect/
test_url: http://www.reddit.com/r/WritingPrompts/comments/2786lw/wp_in_a_world_where_puns_are_illegal_one_man/chybk8e

@ -1,4 +1,4 @@
body: //div[@class="storyBox"]
body: //div[contains(concat(' ',normalize-space(@class),' '),' article ') and (contains(concat(' ',normalize-space(@class),' '),' clear '))]
title: //div[@class="storyBox"]/h1
author: //a[@rel="author"]
date: substring-before(//span[@class="dateline"], 'by')

@ -1,4 +1,4 @@
#grab the actual content div
body: //div[@class='rt-article']
test_url: http://www.sourcebooks.com/next/sourcebooks-next-our-blog/1601-another-piece-of-the-e-puzzle-or-when-good-ebook-promotions-go-bad.html
test_url: http://www.sourcebooks.com/blog/happy-27th-birthday-sourcebooks.html

@ -0,0 +1,5 @@
body: //div[contains(@class, 'story-text')]
strip_id_or_class: related
test_url: http://www.tabletmag.com/jewish-news-and-politics/181181/mossberg-parallel-states?all=1

@ -0,0 +1,60 @@
# Author: zinnober
# Should work with "normal" articles as well as with image galleries
prune: no
# Title
title: //h1/span[@class='hcf-headline']
# Set author
author: //a[@rel='author']
# Set date
date: //span[@class='date hcf-atlas']
# Fetch full multipage articles
next_page_link: //a[contains(@class, 'hcf-forward')]
# Content is here
body: //article
body: //div[contains(@class, 'hcf-screen')]
# Remove tracking and ads
strip_id_or_class: hcf-ad
strip_id_or_class: hcf-autoload-ad
strip_id_or_class: hcf-content-ad
# Tidy up before article
strip: //article/h1
strip_id_or_class: hcf-atlas
strip_id_or_class: hcf-author
strip_id_or_class: date hcf-atlas
strip_id_or_class: date hcf-atlas
# General cleanup
strip: //div[contains(@class, 'hcf-screen')]//h1
strip: //div[@class='hcf-subpage-titles']//ul
strip_id_or_class: hcf-doctype-media
strip_id_or_class: hcf-inline-gallery
strip_id_or_class: hcf-doctype-video
strip_id_or_class: hcf-links
strip_id_or_class: hcf-mini-navi
strip_id_or_class: hcf-media-control
strip_id_or_class: hcf-hidden
replace_string(<span class="hcf-update">Update</span>): <strong>Update: </strong>
# Fix pictures and captions
replace_string(<a class="hcf-doctype-gallery): <p class="hcf-doctype-gallery
replace_string(<a class="hcf-doctype-enlarge): <p class="hcf-doctype-enlarge
replace_string(<figcaption class="hcf-caption">): <br><small><em>
replace_string(</figcaption>): </em></small>
# Fix image galleries
replace_string(<a class=" ajaxify): <p class="ajaxify
replace_string(<div class="hcf-caption"><div><p>): <small><em>
# Try it yourself
test_url: http://www.tagesspiegel.de/berlin/bezirke/wedding/wedding-jetzt/auf-der-suche-nach-einem-stadtteil-wilder-weiter-wedding/8757156.html
test_url: http://www.tagesspiegel.de/berlin/olympia-in-berlin-der-flughafen-tegel-soll-das-olympische-dorf-werden/10645036.html
test_url: http://www.tagesspiegel.de/mediacenter/fotostrecken/berlin/bildergalerie-kreuzberger-der-woche/9305534.html

@ -1,3 +1,3 @@
single_page_link_in_feed: //b/a
test_url_feed: http://www.techmeme.com/feed.xml
test_url: http://www.techmeme.com/feed.xml

@ -15,6 +15,8 @@ strip: //div[@class='earthbox']
single_page_link: //article//a[contains(@class, 'print')]
native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
test_url: http://www.theatlantic.com/technology/archive/2011/04/want-to-see-how-crazy-a-bot-run-market-can-be/237773/
test_url: http://www.theatlantic.com/magazine/archive/2007/11/the-autumn-of-the-multitaskers/6342/
test_url: http://www.theatlantic.com/entertainment/archive/2012/04/30-rock-live-a-funny-reminder-of-why-sitcoms-arent-shot-live-anymore/256447/

@ -1,5 +1,10 @@
body: //div[contains(@class, 'entry-content')]//div[contains(@class, 'column-2')]
single_page_link: //div[contains(@class, 'pagination')]//a[contains(@title, 'ingle page')]
strip_id_or_class: entry-related
strip_id_or_class: entry-sidebar
strip_id_or_class: entry-pagination
tidy: no
prune: no
test_url: http://www.theglobeandmail.com/report-on-business/rob-magazine/how-a-novice-miner-survived-a-summer-in-the-klondike/article2345350/
test_url: http://www.theglobeandmail.com/report-on-business/rob-magazine/how-a-novice-miner-survived-a-summer-in-the-klondike/article2345350/
test_url: http://www.theglobeandmail.com/report-on-business/industry-news/energy-and-resources/cliffs-natural-resources-looking-to-exit-ontarios-ring-of-fire/article20651617/

@ -6,8 +6,19 @@ strip: //div[contains(@class, 'kindleWidget')]
#strip: //a[not(text())]
strip_id_or_class: pocket-btn
author: //li[@class='byline']
native_ad_clue: //meta[@property="article:tag" and contains(@content, "Partner zone")]
native_ad_clue: //meta[@property="video:tag" and contains(@content, "Partner zone")]
prune: no
tidy: no
test_url: http://www.theguardian.com/world/2013/oct/04/nsa-gchq-attack-tor-network-encryption
test_contains: The National Security Agency has made repeated attempts to develop
test_contains: The agency did not directly address those questions, instead providing a statement.
test_url: http://www.theguardian.com/world/2013/oct/03/edward-snowden-files-john-lanchester
test_url: http://www.theguardian.com/commentisfree/2014/jun/15/britishness-search-identity-my-part-in-camerons-odyssey
test_contains: In August, the editor of the Guardian rang me up and asked if I would spend a week in New York
test_contains: As the second most senior judge in the country, Lord Hoffmann, said in 2004 about a previous version of our anti-terrorism laws
test_url: http://www.theguardian.com/commentisfree/2014/jun/15/britishness-search-identity-my-part-in-camerons-odyssey
# Native ad
test_url: http://www.theguardian.com/sustainable-business/2014/jul/18/ben-jerry-turn-ice-cream-into-energy

@ -15,6 +15,11 @@ strip: //nav
strip: //img[contains(@class, 'vox-lazy-load')]
# deal with bad parsing
strip: //div[contains(@class, 'story-image')]//div[contains(., 'function(')]
strip: //div[contains(@class, 'm-linkset')]
strip: //div[contains(@class, 'm-entry__sidebar')]
strip: //ul[contains(@class, 'm-article__sources')]
strip: //div[contains(@class, 'chorus-emc__content')]
strip_id_or_class: gallery
strip_id_or_class: article-meta
@ -45,4 +50,4 @@ test_url: http://www.theverge.com/2012/2/29/2821763/lytro-review
test_url: http://www.theverge.com/2011/11/3/2534861/nokia-lumia-800-review
test_url: http://www.theverge.com/2013/2/24/4026114/barnes-noble-shifting-focus-away-from-nook-hardware
test_url: http://www.theverge.com/2014/6/19/5824072/top-shelf-living-the-dream
test_url: http://www.theverge.com/rss/frontpage
test_url: http://www.theverge.com/rss/frontpage

@ -0,0 +1,25 @@
# Author: zinnober
tidy: no
prune: no
# Set author
author: //a[contains(@rel, 'author')]
# Content is here
body: //article
# Tidy up before article
strip: //header
# Get rid of doubled images
strip: //img[contains(@class, '-hidden')]
# Tidy up after article
strip_id_or_class: social-list
strip_id_or_class: meta-info
strip: //footer
# Try it yourself
test_url: http://www.thisiscolossal.com/2014/09/chicago-in-the-fog-by-michael-salisbury/
test_url: http://www.thisiscolossal.com/2014/09/bird-portraits-ruffling-with-personality-by-leila-jeffreys/

@ -0,0 +1,10 @@
title: //div[@id='headline']
body: //div[@class='entry_text']
author: //div[text() = 'Author:']/following-sibling::div/a
date: //div[text() = 'Published:']/following-sibling::div
single_page_link: //a[@href='noscript.html']
prune: no
test_url: http://towerofthehand.com/blog/2014/08/08-pitch-this-got-spinoff/index.html
test_url: http://towerofthehand.com/blog/2014/07/31-definitions-and-embodiments/index.html
test_url: http://towerofthehand.com/blog/2014/07/03-hero-with-thousand-faces/index.html

@ -6,4 +6,5 @@ date: //span[contains(@class, 'js-short-timestamp')]/@data-time
prune: no
tidy: no
test_url: https://twitter.com/medialens/status/216883678582804480
test_url: https://twitter.com/medialens/status/216883678582804480
test_contains: is all but alone in challenging the tsunami of UK

@ -2,6 +2,7 @@ title: //meta[@property="og:title"]/@content
author: //div[contains(@class, 'byline')]//span[contains(@class, 'name')]
date: //div[contains(@class, 'cn_date_time')]
body: //div[contains(@class, 'pageContainers')]
body: //div[@id='main']
body: //article[@id='items-container']
#body: //h2[@class='sub-header'] | //div[contains(@class, 'contributor-type') or @class='display-date' or @class='content-container']
@ -26,5 +27,7 @@ strip: //li[@class='blogNavPrev']
single_page_link: //a[@title='Print this page']
test_url: http://www.vanityfair.com/politics/features/2011/05/egypt-revolutionaries-201105
test_contains: nothing can take away from the miracle of Tahrir Square
test_url: http://www.vanityfair.com/politics/features/2008/08/hitchens200808
test_url: http://www.vanityfair.com/style/2012/01/prisoners-of-style-201201
test_url: http://www.vanityfair.com/style/2012/01/prisoners-of-style-201201

inc/3rdparty/site_config/standard/wn.de.txt (vendored, executable file; 18 lines changed)
@ -0,0 +1,18 @@
author: //div[@id='main']//div[@class='col right']//div[contains(@class, 'attribute-author')]
body: //div[@id='main']//div[@class='col right']
strip_id_or_class: boxes
strip_id_or_class: lazy
strip_id_or_class: comment_box
strip_id_or_class: fb_comments
find_string: <noscript>
replace_string: <div>
find_string: </noscript>
replace_string: </div>
prune: no
tidy: no
test_url: http://www.wn.de/Muenster/Kultur/1742956-Wilm-Weppelmann-verlaesst-die-Einsiedelei-Und-dann-ab-unter-die-Dusche
# feed
test_url: http://www.wn.de/rss/feed/wn_muenster

@ -1,4 +1,3 @@
# 2014-10-21 [Marmo] added stripping of inline ads and appropriate test_url
# 2013.10.30 [rezor92] fixed single_page_link
# 2012-12-23 [carlo@...] fixed half-assed headlines in articles, removed inline author profiles, adjusted picture captions
# 2012-03-17 [dkless@...] Cut metadata parts in the beginning and the ends of the content block; copyright entries for pictures removed; Author fixed, not sure if old entries still valid (I left them); Weird problems with some pages addressed (see last section for removing hidden section)
@ -17,8 +16,6 @@ author: substring-after(//li[@class='source first '], 'Quelle: ')
strip_id_or_class: articleheader
strip: //div[@id="comments"] | //div[@class="pagination block"] | //p[@class="ressortbacklink"] | //div[@id="relatedArticles"] | // div[@class="inline portrait"]
#Remove inline ads
strip: //div[@class="innerad"]
#Removes author and date from the start
strip: //ul[@class="tools"]
@ -46,4 +43,3 @@ strip_id_or_class:"pagination"
footnotes: no
test_url: http://www.zeit.de/kultur/film/2012-12/Kurzfilmtag
test_url: http://www.zeit.de/wissen/2014-10/ebola-nigeria-who