mirror of
https://github.com/moparisthebest/wallabag
synced 2024-12-25 00:38:51 -05:00
Merge pull request #888 from wallabag/updated-site-config
updated site_config
commit 24479b479d
@ -1,2 +1,2 @@
|
||||
title: substring-before(//title, '—')
|
||||
test_url: http://512pixels.net/more-on-linked-lists/
|
||||
title: //meta[@property='og:title']/@content
|
||||
test_url: http://www.512pixels.net/blog/2014/10/the-move
|
||||
|
8
inc/3rdparty/site_config/standard/README.md
vendored
@ -1,12 +1,14 @@
|
||||
Full-Text RSS site config files
|
||||
================
|
||||
|
||||
[Full-Text RSS](http://fivefilters.org/content-only/), our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, it checks to see if there are extraction rules for the site being processed. If there are no site patterns, it tries to detect the content block automatically.
|
||||
[Full-Text RSS](http://fivefilters.org/content-only/), our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, it checks to see if there are extraction rules for the site being processed. If no rules are found, it tries to detect the content block automatically.
|
||||
|
||||
This repository contains the site config files we use in Full-Text RSS.
|
||||
This repository contains the site-specific extraction rules we rely on in Full-Text RSS.
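The lookup described above (use site-specific rules when they exist, otherwise fall back to automatic detection) can be sketched roughly as follows. This is a minimal illustration in Python, not the Full-Text RSS implementation: `parse_site_config` and `find_rules` are hypothetical helper names, the `directive: value` line format is inferred from the config files in this commit, and the host-matching order is an assumption.

```python
from urllib.parse import urlparse


def parse_site_config(text):
    """Parse 'directive: value' lines into a dict of lists.

    Lines starting with '#' are treated as comments. A directive may
    appear several times (e.g. several 'strip:' lines), so every value
    is collected in order. Splitting on the first ':' is a simplification
    and would mis-handle a directive whose name itself contains a colon.
    """
    rules = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a directive line
        rules.setdefault(key.strip(), []).append(value.strip())
    return rules


def find_rules(url, configs):
    """Return the rules for the URL's host, or None if there are none.

    `configs` maps a host name such as 'bbc.co.uk' to parsed rules.
    Returning None signals that the caller should fall back to
    automatic content detection (assumed behaviour).
    """
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    parts = host.split(".")
    # Try the full host first, then progressively shorter parent domains.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in configs:
            return configs[candidate]
    return None
```

A caller would load every `*.txt` file into `configs` keyed by file name without the extension, then call `find_rules(url, configs)` once per URL.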

### Contributing changes

We run automated tests on these files to detect issues. If you'd like to help keep these up to date, please look at the [test results](http://siteconfig.fivefilters.org/test/) and see which files you'd like to contribute fixes for.

We chose GitHub for this set of files because they offer one feature which we hope will make contributing changes easier: [file editing](https://github.com/blog/844-forking-with-the-edit-button) through the web interface.

You can now make changes to any of our site config files and request that your changes be pulled into the main set we maintain. This is what GitHub calls the Fork and Pull model:
@ -31,7 +33,7 @@ Marco, Instapaper's creator, graciously opened up the database of contributions
|
||||
|
||||
> And, recognizing that your efforts could be useful to a wide range of other tools and services, I'll make the list of all of these site-specific configurations available to the public, free, with no strings attached.
|
||||
|
||||
Most of the extraction rules in our set are borrowed from Instapaper. You can see the list maintained by Instapaper at [instapaper.com/bodytext/](http://instapaper.com/bodytext/) (login required).
|
||||
Most of the extraction rules in our set are borrowed from Instapaper. You can see the list maintained by Instapaper at [instapaper.com/bodytext/](http://instapaper.com/bodytext/) (no longer available since Instapaper was sold).
|
||||
|
||||
### Testing site config files
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
body: //section[@class='content']
|
||||
date: //span[1]
|
||||
author: //h1[@id='sitetitle']
|
||||
test_url: https://alexduner.com/blog/2013/1/something-i-learned-today
|
||||
test_url: http://alexduner.com/blog/something-i-learned-today
|
||||
|
@ -1,3 +1,5 @@
|
||||
body: //section[@class='main_cont']/img | //div[@class='articleContent']
|
||||
title: //div[@class='blog_top_left']//h2
|
||||
author: //a[@class='b'][1]
|
||||
date: substring-after(substring-before(//div, 'Posted in'), ' on ')
|
||||
strip_image_src: /content/images/globals/
|
||||
@ -8,4 +10,6 @@ prune: no
|
||||
|
||||
single_page_link: concat('http://www.anandtech.com/print/', substring-after(//meta[@property='og:url']/@content, '/show/'))
|
||||
|
||||
test_url: http://www.anandtech.com/show/5812/eurocom-monster-10-clevos-little-monster/
|
||||
test_url: http://www.anandtech.com/show/8370/gigabyte-am1m-s2h-review
|
||||
test_url: http://www.anandtech.com/show/8402/sandisk-releases-ultra-ii-ssd-the-second-tlc-nand-ssd-in-the-market
|
||||
test_url: http://www.anandtech.com/show/8400/arms-cortex-m-even-smaller-and-lower-power-cpu-cores
|
||||
|
23
inc/3rdparty/site_config/standard/apotheke-adhoc.de.txt
vendored
Executable file
@ -0,0 +1,23 @@
|
||||
# Author: zinnober
|
||||
|
||||
prune: no
|
||||
|
||||
title: substring-before(//div[@id='content']/h1, ',')
|
||||
|
||||
single_page_link: //a[@title='Seite drucken']
|
||||
|
||||
body: //div[@id='detail-body']
|
||||
|
||||
replace_string(<span class="description">): <em>
|
||||
replace_string(<p class="leadtext"><small>): <p class="leadtext">
|
||||
|
||||
# Fix headlines
|
||||
replace_string(Patrick Hollstein):
|
||||
replace_string(APOTHEKE ADHOC):
|
||||
replace_string(dpa):
|
||||
replace_string(Katharina Lübke):
|
||||
replace_string(Julia Pradel):
|
||||
replace_string(Franziska Gerhardt):
|
||||
|
||||
test_url: http://www.apotheke-adhoc.de/nachrichten/politik/nachricht-detail-politik/deutscher-apothekertag-antraege-gegen-lieferengpaesse-2/
|
||||
|
@ -13,5 +13,7 @@ title: //div[@id='story']//h2[@class='title']
|
||||
strip: //div[@class='pager']
|
||||
next_page_link: //nav//a[span/@class='next']/@href
|
||||
|
||||
native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
|
||||
|
||||
test_url: http://arstechnica.com/tech-policy/news/2012/02/gigabit-internet-for-80-the-unlikely-success-of-californias-sonicnet.ars
|
||||
test_url: http://arstechnica.com/apple/2005/04/macosx-10-4/
|
||||
|
13
inc/3rdparty/site_config/standard/autocar.co.uk.txt
vendored
Executable file
@ -0,0 +1,13 @@
|
||||
title: //div[@class='col-center']/h1
|
||||
author: //div[@class='personality']/a
|
||||
date: //div[@class='personality-date']
|
||||
body: //div[@class='content-top ']//div[@class='content'][1] | //div[contains(@class,'article-body')] | //div[contains(@class,'main-article')]
|
||||
|
||||
next_page_link: //div[@id='review-link']/a
|
||||
|
||||
strip: //div[@class='author-block']
|
||||
strip: //p//iframe[contains(@src,'signup')]/preceding::p[1]
|
||||
|
||||
test_url: http://www.autocar.co.uk/car-review/volkswagen/golf
|
||||
test_url: http://www.autocar.co.uk/car-news/pebble-beach/saleen-unveils-performance-electric-vehicle-based-tesla-model-s
|
||||
test_url: http://www.autocar.co.uk/car-review/rolls-royce/first-drives/rolls-royce-ghost-series-ii-first-drive-review
|
15
inc/3rdparty/site_config/standard/bbc.co.uk.txt
vendored
@ -13,7 +13,7 @@ body: //div[contains(@class, 'hrecipe')]//div[@id='subcolumn-1']
|
||||
#strip: //div[@class="story-feature narrow"]
|
||||
#strip: //div[@class="story-feature wide"]
|
||||
#strip: //div[@class="story-feature dslideshow-enclosure"]
|
||||
strip: //div[contains(@class, "story-feature")]
|
||||
strip: //div[contains(@class, "story-feature") and not(contains(@class, 'full-width'))]
|
||||
strip: //span[@class="story-date"]
|
||||
#strip: //div[@class="caption body-narrow-width"]
|
||||
strip: //div[@class="warning"]//p
|
||||
@ -30,13 +30,26 @@ strip: //div[contains(@class, 'comment-introduction')]
|
||||
strip: //div[contains(@class, 'share-tools')]
|
||||
strip: //div[@id='also-related-links']
|
||||
|
||||
strip_id_or_class: share-help
|
||||
strip_id_or_class: comments_module
|
||||
|
||||
replace_string(<noscript>): <div>
|
||||
replace_string(</noscript>): </div>
|
||||
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
dissolve: //h2
|
||||
|
||||
test_url: http://www.bbc.co.uk/sport/0/football/23224017
|
||||
test_contains: Swansea City have completed the club-record signing
|
||||
|
||||
test_url: http://www.bbc.co.uk/news/business-15060862
|
||||
test_contains: Europe's leaders are meeting again to try to solve
|
||||
|
||||
# news feed
|
||||
test_url: http://feeds.bbci.co.uk/news/rss.xml
|
||||
# sports feed
|
||||
test_url: http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int
|
||||
# video entry
|
||||
test_url: http://www.bbc.co.uk/news/world-asia-22056933
|
60
inc/3rdparty/site_config/standard/bbc.com.txt
vendored
Executable file
@ -0,0 +1,60 @@
|
||||
body: //div[@class="story-body"]
|
||||
# for video entries
|
||||
body: //div[contains(@class, "videoInStory") or @id="meta-information"]
|
||||
title: //h1[@class="story-header"]
|
||||
date: //span[@class="story-date"]/span[@class='date']
|
||||
# for sport site
|
||||
date: //meta[@name='DCTERMS.created']/@content
|
||||
author: //div[@id='headline']//span[@class='byline-name']
|
||||
|
||||
# recipes, e.g. http://www.bbc.co.uk/food/recipes/mymincepies_71055
|
||||
body: //div[contains(@class, 'hrecipe')]//div[@id='subcolumn-1']
|
||||
|
||||
#strip: //div[@class="story-feature narrow"]
|
||||
#strip: //div[@class="story-feature wide"]
|
||||
#strip: //div[@class="story-feature dslideshow-enclosure"]
|
||||
strip: //div[contains(@class, "story-feature") and not(contains(@class, 'full-width'))]
|
||||
strip: //span[@class="story-date"]
|
||||
#strip: //div[@class="caption body-narrow-width"]
|
||||
strip: //div[@class="warning"]//p
|
||||
strip: //div[@id='page-bookmark-links-head']
|
||||
strip: //object
|
||||
strip: //div[contains(@class, "bbccom_advert_placeholder")]
|
||||
strip: //div[contains(@class, "embedded-hyper")]
|
||||
strip: //div[contains(@class, 'market-data')]
|
||||
strip: //a[contains(@class, 'hidden')]
|
||||
strip: //div[contains(@class, 'hypertabs')]
|
||||
strip: //div[contains(@class, 'related')]
|
||||
strip: //form[@id='comment-form']
|
||||
strip: //div[contains(@class, 'comment-introduction')]
|
||||
strip: //div[contains(@class, 'share-tools')]
|
||||
strip: //div[@id='also-related-links']
|
||||
|
||||
strip_id_or_class: share-help
|
||||
strip_id_or_class: comments_module
|
||||
|
||||
replace_string(<noscript>): <div>
|
||||
replace_string(</noscript>): </div>
|
||||
|
||||
native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
|
||||
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
dissolve: //h2
|
||||
|
||||
test_url: http://www.bbc.com/sport/0/football/28918021
|
||||
test_contains: Cameroonian footballer Albert Ebosse has died
|
||||
|
||||
test_url: http://www.bbc.com/sport/0/football/23224017
|
||||
|
||||
test_url: http://www.bbc.com/news/business-15060862
|
||||
test_contains: Europe's leaders are meeting again to try
|
||||
|
||||
|
||||
# news feed
|
||||
test_url: http://feeds.bbci.co.uk/news/rss.xml
|
||||
# sports feed
|
||||
test_url: http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int
|
||||
# video entry
|
||||
test_url: http://www.bbc.com/news/world-asia-22056933
|
19
inc/3rdparty/site_config/standard/bit-tech.net.txt
vendored
Executable file
@ -0,0 +1,19 @@
|
||||
body: //div[@id='column_1']
|
||||
next_page_link: //div[@class='next']/a[not(contains(@href, '/comments') or contains(@href, '/news/'))]
|
||||
prune: no
|
||||
|
||||
author: substring-after(//p[@class='byline'], 'by ')
|
||||
date: substring-before(substring-after(//p[@class='byline'], 'on '), ' by')
|
||||
|
||||
strip: //h1
|
||||
strip_id_or_class: socialLinks
|
||||
strip_id_or_class: byline
|
||||
strip_id_or_class: pageSelector
|
||||
strip_id_or_class: articleTabs
|
||||
strip_id_or_class: pageNav
|
||||
strip_id_or_class: share
|
||||
strip_id_or_class: commentsContainer
|
||||
strip_id_or_class: below_article_related
|
||||
|
||||
test_url: http://www.bit-tech.net/hardware/storage/2014/08/13/ocz-arc-100-240gb-review/1
|
||||
test_url: http://www.bit-tech.net/news/bits/2014/08/15/google-trojan/1
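The `author` and `date` rules in this file pull both values out of a single byline using the XPath 1.0 string functions `substring-after` and `substring-before`. The following Python sketch mirrors those two functions so the nesting can be followed step by step. The byline text and the name in it are made up for illustration and are not taken from bit-tech.net.

```python
def substring_before(s, sep):
    """Mimic XPath 1.0 substring-before(): text before the first match,
    or '' when sep does not occur (or is empty)."""
    if sep == "":
        return ""
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s, sep):
    """Mimic XPath 1.0 substring-after(): text after the first match,
    or '' when sep does not occur. An empty sep returns s unchanged."""
    if sep == "":
        return s
    _, found, tail = s.partition(sep)
    return tail if found else ""


# Hypothetical byline, shaped like "Published on <date> by <author>".
byline = "Published on 13th August 2014 by Jane Doe"

# author: substring-after(//p[@class='byline'], 'by ')
author = substring_after(byline, "by ")

# date: substring-before(substring-after(//p[@class='byline'], 'on '), ' by')
date = substring_before(substring_after(byline, "on "), " by")
```

Both functions match the first occurrence only, so a byline whose earlier text happens to contain `on ` or `by ` would be cut at the wrong place. That is a limitation of this style of rule rather than of the sketch.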
|
16
inc/3rdparty/site_config/standard/bleacherreport.com.txt
vendored
Executable file
@ -0,0 +1,16 @@
|
||||
body: //div[contains(@class, 'article_pages')]
|
||||
|
||||
strip_id_or_class: article_page-header
|
||||
strip_id_or_class: paginator
|
||||
strip_id_or_class: article_info
|
||||
|
||||
find_string: src="data:image
|
||||
replace_string: ignore-src="data:image
|
||||
find_string: data-defer-src="
|
||||
replace_string: src="
|
||||
|
||||
prune: no
|
||||
|
||||
test_url: http://bleacherreport.com/articles/feed
|
||||
test_url: http://bleacherreport.com/articles/2137787-christian-ponders-newborn-daughter-was-named-after-fsu-legend-bobby-bowden
|
||||
test_url: http://bleacherreport.com/articles/2137596-college-football-week-1-picks-unlv-runnin-rebels-vs-arizona-wildcats/
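The two `find_string` / `replace_string` pairs in this file work together to undo lazy image loading: the first renames the placeholder `src` attribute, the second promotes the deferred attribute to the real `src`. A minimal Python sketch of that idea follows. It assumes the pairs are applied in file order as plain string replacements on the raw HTML; `apply_replacements` is a hypothetical helper, and the sample markup is invented.

```python
def apply_replacements(html, pairs):
    """Apply (find, replace) pairs in order as literal string replacements."""
    for find, replace in pairs:
        html = html.replace(find, replace)
    return html


# The two pairs from the config above, in the order they appear.
PAIRS = [
    ('src="data:image', 'ignore-src="data:image'),  # neutralise the placeholder
    ('data-defer-src="', 'src="'),                   # promote the real image URL
]

# Invented example of a lazily loaded image.
html = (
    '<img src="data:image/gif;base64,AAAA" '
    'data-defer-src="http://example.com/photo.jpg">'
)

result = apply_replacements(html, PAIRS)
```

After both steps the element carries a single usable `src` pointing at the real image, while the inline placeholder survives only under an attribute name that readers ignore.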
|
45
inc/3rdparty/site_config/standard/blogs.faz.net.txt
vendored
Executable file
@ -0,0 +1,45 @@
|
||||
# Author: zinnober
|
||||
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
# Set author
|
||||
author: //a[@rel='author']
|
||||
|
||||
# Set date
|
||||
date: //span[@class='Datum']
|
||||
|
||||
# Content is here
|
||||
body: //div[@class='Artikel']
|
||||
|
||||
# Tidy up before article
|
||||
strip: //div[@id='FAZHeaderNeu']
|
||||
strip: //h2[@itemprop='headline']
|
||||
strip: //span[@class='Datum']
|
||||
strip: //span[@class='Autor']
|
||||
strip_id_or_class: ArticlePagerTop
|
||||
strip: //div[@class='FAZArtikelEinleitung']/h2
|
||||
|
||||
# General cleanup
|
||||
strip: //div[@class='clear']
|
||||
strip: //span[@class='Bildnachweis']
|
||||
strip: //iframe
|
||||
strip_id_or_class: Community
|
||||
strip: ' · '
|
||||
|
||||
# Remove tracking and ads
|
||||
strip_image_src: /l.gif?
|
||||
strip: //img[@width='1']
|
||||
strip_id_or_class: invisible
|
||||
strip_id_or_class: Anzeige
|
||||
strip_id_or_class: billboard
|
||||
|
||||
# Remove clutter after article
|
||||
strip_id_or_class: Tagline
|
||||
strip_id_or_class: ArtikelAbbinder
|
||||
strip_id_or_class: FAZArtikelKommentare
|
||||
strip_id_or_class: ArtikelKommentieren
|
||||
strip_id_or_class: FAZContentRight
|
||||
|
||||
# Try it yourself
|
||||
test_url: http://blogs.faz.net/wost/2014/08/17/viel-fuck-und-wenig-guter-sex-1239/
|
@ -19,5 +19,8 @@ strip: //p[@class='nota_pie']
|
||||
strip: //div[starts-with(@id, 'sumario') and contains(., 'más información')]
|
||||
strip: //div[@id='coment' or @id='foros_not']
|
||||
|
||||
test_url: http://elpais.com/elpais/2012/02/06/gente/1328526783_491687.html
|
||||
test_url: http://www.elpais.com/articulo/cultura/mano/retrato/materia/elpepicul/20120207elpepicul_2/Tes
|
||||
test_url: http://brasil.elpais.com/brasil/2014/10/15/politica/1413334841_878730.html
|
||||
test_contains: O PT quer intensificar a presença do ex-presidente
|
||||
|
||||
test_url: http://brasil.elpais.com/brasil/2014/10/13/internacional/1413225730_450761.html
|
||||
test_contains: Todos na localidade onde ele nasceu ainda falavam da façanha
|
||||
|
@ -1,30 +1,17 @@
|
||||
# story has several pages, should be detected
|
||||
body: //div[@id='storyBody']
|
||||
body: //div[@id='article_body']
|
||||
body: //div[@id='story_body']
|
||||
# include the lead graphic in the body, if available
|
||||
body: //div[contains(concat(' ', normalize-space(@id), ' '), ' lead_graphic ')] | //div[contains(concat(' ', normalize-space(@itemprop), ' '), ' articleBody ')]
|
||||
title: //h1[contains(concat(' ', normalize-space(@itemprop), ' '), ' headline ')]
|
||||
date: //time[contains(concat(' ', normalize-space(@itemprop), ' '), ' datePublished ')]
|
||||
|
||||
title://h1[@id='article_headline']
|
||||
|
||||
# article author
|
||||
author: //p[@class='author']/a
|
||||
# story author(s)
|
||||
author: substring-after(//p[@class='byline'], 'By ')
|
||||
|
||||
# article date
|
||||
date: //span[@class='published_date']
|
||||
# story date
|
||||
date: //span[@class='date']
|
||||
|
||||
date: substring-after(//div[contains(@class,'attributor')],'on')
|
||||
strip_id_or_class: inset
|
||||
strip: //p/span[@class='photoCredit']
|
||||
strip: //h1
|
||||
|
||||
strip_id_or_class: page_count
|
||||
strip_id_or_class: tools
|
||||
strip_id_or_class: pagination
|
||||
|
||||
single_page_link: //li[@id='stPrint']/a
|
||||
strip_id_or_class: photo_credit
|
||||
strip_id_or_class: photo_caption
|
||||
strip_id_or_class: inline_gallery
|
||||
# pull quote, often inside a blockquote element
|
||||
strip_id_or_class: pq
|
||||
strip_id_or_class: credit
|
||||
strip_id_or_class: figcaption
|
||||
strip_id_or_class: related_item
|
||||
|
||||
test_url: http://www.businessweek.com/magazine/buyback-insurance-a-good-deal-for-retailers-07282011.html
|
||||
test_url: http://www.businessweek.com/articles/2012-06-06/american-pain-the-largest-u-dot-s-dot-pill-mills-rise-and-fall
|
||||
test_url: http://www.businessweek.com/articles/2014-07-09/american-apparel-dov-charneys-sleazy-struggle-for-control
|
||||
|
@ -10,6 +10,15 @@ date: //time[@data-print='date']
|
||||
body: //div[@data-print='body']
|
||||
body: //section[@data-print='body']
|
||||
|
||||
find_string: rel:bf_image_src=
|
||||
replace_string: src=
|
||||
find_string: src="data:
|
||||
replace_string: disabled_src="data:
|
||||
|
||||
native_ad_clue: //meta[@property="article:section" and @content="Advertiser"]
|
||||
|
||||
# For various things...
|
||||
strip: *[@data-print="ignore"]
|
||||
test_url: http://www.buzzfeed.com/hgrant/35-reasons-why-dogs-hate-the-holidays
|
||||
# Native ad
|
||||
test_url: http://www.buzzfeed.com/bravo/ways-to-up-your-online-dating-game
|
28
inc/3rdparty/site_config/standard/canonrumors.com.txt
vendored
Executable file
@ -0,0 +1,28 @@
|
||||
# Author: zinnober
|
||||
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
# Set title
|
||||
title: //h2
|
||||
|
||||
date: //li[@class='time']
|
||||
|
||||
# Set author
|
||||
author: //a[contains(@rel, 'author')]
|
||||
|
||||
# Content is here
|
||||
body: //div[@id='content']
|
||||
|
||||
# Tidy up before article
|
||||
strip: //div[@class='meta']
|
||||
|
||||
# Tidy up after article
|
||||
strip_id_or_class: nr_related_placeholder
|
||||
strip_id_or_class: twitter-share-button
|
||||
strip_id_or_class: afterpost
|
||||
strip_id_or_class: tags
|
||||
|
||||
# Try it yourself
|
||||
test_url: http://www.canonrumors.com/2014/09/chuck-westfall-talks-canon-eos-7d-mark-ii/
|
||||
test_url: http://www.canonrumors.com/2014/09/canon-cinema-eos-captures-space-in-4k-for-new-imax-3d-film/
|
@ -3,3 +3,4 @@ author: //div[@class='author']
|
||||
prune: no
|
||||
|
||||
test_url: http://www.chomsky.info/onchomsky/2002----.htm
|
||||
test_contains: The propaganda model argues
|
||||
|
@ -1,5 +1,9 @@
|
||||
title: //div[@id='maincontent']//h1
|
||||
body: //div[@id='resizeableText']
|
||||
|
||||
single_page_link: concat(//link[@rel='canonical']/@href, '?sp=true')
|
||||
|
||||
test_url: http://cn.reuters.com/article/CNAnalysesNews/idCNKBS0FF0NM20140710
|
||||
test_url: http://cn.reuters.feedsportal.com/CNAnalysesNews
|
||||
# multipage link
|
||||
test_url: http://cn.reuters.com/article/idCNKBS0FF0UL20140710
|
@ -1 +1,3 @@
|
||||
body: //div[@id='content']
|
||||
body: //div[@id='readme']
|
||||
|
||||
test_url: http://code.fivefilters.org/full-text-rss
|
||||
|
@ -15,4 +15,4 @@ strip_id_or_class: promotion-tag
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
test_url: www.csmonitor.com/World/Middle-East/2011/1108/Imminent-Iran-nuclear-threat-A-timeline-of-warnings-since-1979/Earliest-warnings-1979-84
|
||||
test_url: http://www.csmonitor.com/World/Middle-East/2011/1108/Imminent-Iran-nuclear-threat-A-timeline-of-warnings-since-1979/Earliest-warnings-1979-84
|
||||
|
@ -2,4 +2,4 @@ single_page_link: //a
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
test_url: da.feedsportal.com/c/585/f/413794/s/17037b5a/l/0L0Stelegraaf0Bnl0Cbinnenland0C10A2757860C0I0IKlacht0Itegen0Idr0B0IFrank0Iniet0I0Eontvankelijk0I0I0Bhtml0Dcid0Frss/ia1.htm
|
||||
test_url: http://da.feedsportal.com/c/585/f/413794/s/17037b5a/l/0L0Stelegraaf0Bnl0Cbinnenland0C10A2757860C0I0IKlacht0Itegen0Idr0B0IFrank0Iniet0I0Eontvankelijk0I0I0Bhtml0Dcid0Frss/ia1.htm
|
||||
|
31
inc/3rdparty/site_config/standard/designsponge.com.txt
vendored
Executable file
@ -0,0 +1,31 @@
|
||||
# Author: zinnober
|
||||
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
# Set title
|
||||
title: //header/h1
|
||||
|
||||
# Set author
|
||||
author: //a[@rel='author']
|
||||
|
||||
# Content is here
|
||||
body: //article
|
||||
|
||||
# Tidy up before article
|
||||
strip: //header
|
||||
|
||||
# Tidy up article
|
||||
strip: //div[contains(@id, 'gallery-')]
|
||||
replace_string(<a rel="attachment): <p rel="attachment
|
||||
|
||||
|
||||
# Tidy up after article
|
||||
strip: //div[@class='sm']
|
||||
strip_id_or_class: related
|
||||
strip_id_or_class: comments
|
||||
strip: //footer
|
||||
|
||||
# Try it yourself
|
||||
test_url: http://www.designsponge.com/2010/06/seattle-design-guide.html
|
||||
test_url: http://www.designsponge.com/2012/04/sneak-peek-liz-cook.html
|
@ -3,3 +3,5 @@ body: (//div[starts-with(@id, 'post_message')])[1]
|
||||
|
||||
prune: no
|
||||
tidy: no
|
||||
|
||||
test_url: http://www.desitvforum.net/forum/watch-online/431739-creature-3d-2014-watch-online-download-dvd-rip.html
|
||||
|
29
inc/3rdparty/site_config/standard/deutsche-apotheker-zeitung.de.txt
vendored
Executable file
@ -0,0 +1,29 @@
|
||||
# Author: zinnober
|
||||
|
||||
prune: yes
|
||||
tidy: yes
|
||||
|
||||
title: //h1
|
||||
date: //p[@class='news_datum']
|
||||
author: //span[@class='author']
|
||||
|
||||
body: //div[@class='tagesnews-content']
|
||||
|
||||
# General cleanup
|
||||
strip_id_or_class: dachzeile
|
||||
strip: //h3
|
||||
strip: //p[@class='bodytext']//a
|
||||
strip_id_or_class: autor_datum
|
||||
strip_id_or_class: comments
|
||||
strip_id_or_class: banner-
|
||||
|
||||
strip: //p[contains(., 'Lesen Sie')]
|
||||
strip: //p[contains(., '– in DAZ')]
|
||||
|
||||
# Fix image captions
|
||||
replace_string(<p class="image_caption">): <p><small><em>
|
||||
replace_string(</dd>): </em></small></dd>
|
||||
|
||||
test_url: http://www.deutsche-apotheker-zeitung.de/pharmazie/news/2014/09/03/weniger-nebenwirkungen-aber-kein-zusatznutzen/13715.html
|
||||
test_url: http://www.deutsche-apotheker-zeitung.de/recht/news/2014/09/02/urteile-zum-cannabis-eigenanbau-bfarm-geht-in-berufung/13716.html
|
||||
|
@ -1,8 +1,6 @@
|
||||
title: //h1[@id='query_h1']
|
||||
body: //div[contains(@class, 'lunatext results_content')]
|
||||
strip_id_or_class: spl_unshd
|
||||
#replace_string(<div class="dicTl">): <div class="dicTl">------------------<br />
|
||||
body: //div[contains(@class, 'source-data')]
|
||||
strip: //button
|
||||
|
||||
prune: no
|
||||
|
||||
test_url: http://www.wired.com/cloudline/2011/10/meet-arms-cortex-a15-the-future-of-the-ipad-and-possibly-the-macbook-air/
|
||||
test_url: http://dictionary.reference.com/browse/propaganda
|
||||
|
@ -1 +1,3 @@
|
||||
single_page_link: //a[@id='download_button_link']
|
||||
|
||||
test_url: https://www.dropbox.com/s/qmocfrco2t0d28o/Fluffbeast.docx
|
||||
|
24
inc/3rdparty/site_config/standard/echo-online.de.txt
vendored
Executable file
@ -0,0 +1,24 @@
|
||||
# Author: Marvin Dickhaus <github@marvindickhaus.de>
|
||||
# 2014-10-08
|
||||
|
||||
#Tidy just messes up the DOM
|
||||
tidy: no
|
||||
|
||||
title: //h1
|
||||
body: //h2 | //div[@id='artikelteaser'] | //div[@id='artikeltext']
|
||||
|
||||
#Strip
|
||||
strip_image_src: artikel_a_merken.gif
|
||||
strip: //div[@class='zusatzinfo']
|
||||
|
||||
#Author: substring is used to remove the " Von " prefix.
|
||||
author: substring(//li[@class='artikelautor'], 5)
|
||||
|
||||
date: //li[@class='artikeldatum']
|
||||
|
||||
#The first two URLs will at some point no longer show
|
||||
#the full article. There is a time-based paywall
|
||||
#installed. Using the feed should present valid output
|
||||
test_url: http://www.echo-online.de/art1231,5503063
|
||||
test_url: http://www.echo-online.de/art1168,5502598
|
||||
test_url: http://www.echo-online.de/rss/darmstadt.xml
|
@ -1,4 +1,5 @@
|
||||
body: //div[@class='main-content']
|
||||
body: //article[contains(@class, 'resp-node')]
|
||||
date: //time[@class='date-created']
|
||||
strip: //aside
|
||||
prune: no
|
||||
@ -6,3 +7,7 @@ prune: no
|
||||
autodetect_next_page: no
|
||||
|
||||
test_url: http://www.economist.com/node/21528429
|
||||
|
||||
test_url: http://www.economist.com/news/essays/21623373-which-something-old-and-powerful-encountered-vault
|
||||
test_contains: the calfskin pages are smooth
|
||||
test_contains: Books will evolve online and off
|
||||
|
@ -1,8 +1,9 @@
|
||||
body: //div[ @class='content' ] | //div[ @class='blog-entry' ]
|
||||
body: //p[@class='strapline'] | //div[@class='cover-image'] | //article[@class='hd']
|
||||
strip: //div[@class='social top']
|
||||
strip: //p[@class='byline']
|
||||
|
||||
strip: //h2/abbr | //div[ @class='lowleader' ] | //*[ @class='discussion' ] | //img[ @class='play-button' ] | //div[ @class='boxout' ] | //h2/a | //h2 | //h2/div | //p[ @class='timestamp' ] | //a[ @class='eurogamer-author' ] | //p[ @class='aPager' ] | //h1 | //div[ @id='lowleader' ] | //a[ @class='next' ] | //div[contains(concat(' ', normalize-space(@class), ' '), ' pullquote ')]
|
||||
date: //span[@itemprop='datePublished']
|
||||
author: //a[@itemprop='author']/text()
|
||||
|
||||
date://p[ @class='timestamp' ]
|
||||
|
||||
author://a[ @class='eurogamer-author' ]
|
||||
test_url: http://www.eurogamer.net/articles/digitalfoundry-vs-unreal-engine-4
|
||||
test_url: http://www.eurogamer.net/articles/2014-08-20-bungie-ordered-to-return-shares-to-composer-marty-odonnell
|
||||
test_url: http://www.eurogamer.net/articles/2014-08-20-invisible-inc-does-espionage-justice
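The long `strip:` rule above ends with `contains(concat(' ', normalize-space(@class), ' '), ' pullquote ')`, which is the usual XPath 1.0 idiom for matching one whole class name rather than any substring of the `class` attribute. The Python sketch below reproduces the same test on a plain string to show the difference from a bare `contains(@class, ...)`. `has_class_token` is a hypothetical name, and the class values are invented.

```python
def has_class_token(class_attr, token):
    """Return True if `token` is one of the whitespace-separated class names.

    Equivalent in spirit to:
    contains(concat(' ', normalize-space(@class), ' '), ' token ')
    """
    # normalize-space(): trim the ends and collapse runs of whitespace.
    normalized = " ".join(class_attr.split())
    # Padding both sides with a space lets the first and last names match too.
    return f" {token} " in f" {normalized} "


# A whole-token match succeeds regardless of position or extra spacing.
exact = has_class_token("quote   pullquote wide", "pullquote")

# A longer class name that merely contains the text is rejected ...
partial = has_class_token("pullquote-small", "pullquote")

# ... whereas a plain substring test, like contains(@class, 'pullquote'),
# would accept it.
naive = "pullquote" in "pullquote-small"
```

The simpler `contains(@class, '...')` form used elsewhere in these files is shorter and often good enough, but it also matches unrelated classes that share a prefix or suffix.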
|
||||
|
@ -1,5 +1,12 @@
|
||||
body: //div[@id='imagestage']
|
||||
body: //div[contains(@class, 'userContentWrapper')]
|
||||
|
||||
strip_id_or_class: commentable
|
||||
|
||||
prune: no
|
||||
tidy: no
|
||||
|
||||
test_url: https://www.facebook.com/feeds/page.php?id=338077742912613&format=rss20
|
||||
# single_page_link: replace(substring-after(//noscript//meta[@http-equiv="refresh"]/@content, 'URL='), "&amp;", "&")
|
||||
|
||||
test_url: https://www.facebook.com/permalink.php?story_fbid=10154584776550183&id=294468630182
|
||||
test_contains: holding an extraordinary session in Brussels this month
|
||||
|
0
inc/3rdparty/site_config/standard/faz.net.txt
vendored
Normal file → Executable file
@ -5,8 +5,8 @@ strip: //div[contains(@class, 'related-companies')]
|
||||
strip: //div[@id='y-article-related']
|
||||
strip: //div[@id='ypf-article-related']
|
||||
prune: no
|
||||
tidy: no
|
||||
|
||||
single_page_link: //div[@class='ft']//a[contains(@href, 'page=all')]
|
||||
|
||||
test_url: http://sg.finance.yahoo.com/news/Motorola-takes-wraps-249-rsg-3508842732.html?x=0&.v=1
|
||||
test_url: http://finance.yahoo.com/news/super-young-retirement-savers.html
|
||||
test_url: http://finance.yahoo.com/news/canadian-orebodies-gives-notice-exercise-130000032.html
|
@ -1,2 +1,2 @@
|
||||
body: //div[@class='entry']
|
||||
test_url: http://www.fivechapters.com/2010/paris-part-one/
|
||||
test_url: http://www.fivechapters.com/2014/the-saddest-writer-in-america-part-two/
|
||||
|
@ -1 +1,4 @@
|
||||
body: //section[contains(@class, 'container')]
|
||||
prune: no
|
||||
|
||||
test_url: http://fivefilters.org/kindle-it/
|
||||
|
@ -1,15 +1,19 @@
|
||||
title: //div[@class='translateHead']//h1 | //div[@id='art-mast']//h1
|
||||
author: substring-after(//span[@id='by-line'], 'BY ')
|
||||
date: //span[@id='pub-date']
|
||||
body: //div[@id='art-mast']/h2 | //div[@class='translateBody'] | //div[@id='art-body']
|
||||
body: (//article//img[contains(@class, 'main_photo')])[1] | (//article//div[contains(@class, 'full_post_content')])[1]
|
||||
#body: //div[@id='art-mast']/h2 | //div[@class='translateBody'] | //div[@id='art-body']
|
||||
#Strip inside article content
|
||||
strip: //div[@id='share-box']
|
||||
strip: //div[@id='special-box']
|
||||
strip: //div[@id='special-box']
|
||||
|
||||
strip_id_or_class: side_panel
|
||||
|
||||
prune: no
|
||||
|
||||
single_page_link: //span[@id='controls']/a[contains(@href, 'print=yes')]
|
||||
single_page_link: //a[text()='SINGLE PAGE']
|
||||
|
||||
test_url: http://www.foreignpolicy.com/articles/2014/07/22/the_end_game_in_gaza_netanyahu_hamas
|
||||
test_url: http://www.foreignpolicy.com/articles/2011/08/01/a_murderers_manifesto_and_me
|
||||
test_url: http://www.foreignpolicy.com/articles/2012/02/29/five_years_in_damascus
|
51
inc/3rdparty/site_config/standard/golem.de.txt
vendored
@ -1,25 +1,34 @@
|
||||
# Jens Kohl, jens.kohl@...
|
||||
# - Added publication date
|
||||
# - Stripped pagination block
|
||||
# - Added single page link
|
||||
# - Added XPath queries for the printer friendly version
|
||||
# Author: zinnober
|
||||
# Rewrite of original template which fetched the printer-version without pictures
|
||||
|
||||
title: //h1
|
||||
body: //div[@class='formatted']
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
date: substring-after(//li[2][@class="text1"], 'Datum:')
|
||||
strip: //ol[@class="list-chapters"]
|
||||
strip_comments: yes
|
||||
# Set full title
|
||||
title: //h1
|
||||
|
||||
# next: commands for printer friendly pages
|
||||
single_page_link: //a[contains(@href, 'print.php?a=')]/@href
|
||||
title: //body/h3
|
||||
strip_image_src: staticrl/images/logo.jpg
|
||||
strip_image_src: http://cpx.golem.de/cpx.php?class=7
|
||||
strip: //body/h3
|
||||
strip: //body/b[1]
|
||||
strip: //body/b[2]
|
||||
strip: //body/b[3]
|
||||
strip: //div[1]
|
||||
test_url: http://www.golem.de/1112/88696.html
|
||||
date: //time
|
||||
|
||||
# Content is here
|
||||
body: //article
|
||||
|
||||
# Fetch full multipage articles
|
||||
next_page_link: //a[@id='atoc_next']
|
||||
|
||||
# Remove tracking and ads
|
||||
strip_id_or_class: iqadtile4
|
||||
|
||||
# General Cleanup
|
||||
strip_id_or_class: list-jtoc
|
||||
strip_id_or_class: table-jtoc
|
||||
strip_id_or_class: implied
|
||||
strip_id_or_class: social-
|
||||
strip_id_or_class: comments
|
||||
strip_id_or_class: footer
|
||||
|
||||
# Tidy up galleries (could still be improved, though)
|
||||
strip: //img[@src='']
|
||||
|
||||
# Try yourself
|
||||
test_url: http://www.golem.de/news/intel-core-i7-5960x-im-test-die-pc-revolution-beginnt-mit-octacore-und-ddr4-1408-108893.html
|
||||
test_url: http://www.golem.de/news/test-infamous-first-light-neonbunter-actionspass-1408-108914.html
|
||||
|
45
inc/3rdparty/site_config/standard/heise.de.txt
vendored
@ -1,9 +1,42 @@
|
||||
#second part of single_page_link for telepolis-articles (desktop-version of site)
|
||||
single_page_link: //p[@class='news_option']/a | //a[@id='tp-druckversion']
|
||||
# Author: zinnober
|
||||
# Template should work well with either desktop or mobile version (m.heise.de)
|
||||
|
||||
prune: no
|
||||
|
||||
title: //article/h1 | //h1
|
||||
date: //p[@class='news_datum']
|
||||
title: //h1
|
||||
body: //div[@class='meldung_wrapper']
|
||||
author: //h4[@class='author']
|
||||
|
||||
test_url: http://www.heise.de/newsticker/meldung/Europa-soll-Grundrechteschutz-im-Netz-staerken-1392664.html
|
||||
test_url: http://www.heise.de/tp/artikel/42/42579/1.html
|
||||
body: //article | //div[@class='meldung_wrapper']
|
||||
|
||||
# General cleanup
|
||||
strip: //time
|
||||
strip: //h4[@class='author']
|
||||
strip: //p[@class='news_datum']
|
||||
strip: //p[@class='artikel_datum']
|
||||
strip: //a[contains(@href, 'mailto')]
|
||||
strip_id_or_class: comments
|
||||
strip_id_or_class: ISI_IGNORE
|
||||
strip_id_or_class: clear
|
||||
|
||||
strip_id_or_class: linkurl_grossbild
|
||||
strip_id_or_class: image-num
|
||||
strip_id_or_class: heisebox_right
|
||||
strip_id_or_class: dossier
|
||||
|
||||
# Strip Ads
|
||||
strip_id_or_class: ad_
|
||||
|
||||
# Some optimizations
|
||||
replace_string(<h5>): <h2>
|
||||
replace_string(</h5>): </h2>
|
||||
replace_string(<span class="bild_rechts"): <p
|
||||
replace_string(<div class="heisebox">): <blockquote>
|
||||
|
||||
|
||||
next_page_link: //a[@class='next']
|
||||
next_page_link: //a[@title='vor']
|
||||
|
||||
test_url: http://www.heise.de/open/artikel/Die-Neuerungen-von-Linux-3-15-2196231.html
|
||||
test_url: http://m.heise.de/open/artikel/Die-Neuerungen-von-Linux-3-15-2196231.html
|
||||
test_url: http://www.heise.de/newsticker/meldung/Ueberwachungstechnik-Die-globale-Handy-Standortueberwachung-2301494.html
|
||||
|
@ -2,4 +2,4 @@ body: //table[@class='ap-smallphoto-table'] | //div[@class='body']//*[@class='en
|
||||
tidy: no
|
||||
strip_image_src: analytics.apnewsregistry
|
||||
|
||||
test_url: http://hosted.ap.org/dynamic/stories/U/US_SPENDING_SHOWDOWN?SITE=FLPET&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2011-04-06-07-46-50
|
||||
test_url: http://hosted.ap.org/dynamic/stories/E/EU_TURKEY_KURDS?SITE=KSNEW&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2014-10-14-10-50-25
|
||||
|
14
inc/3rdparty/site_config/standard/itunes.apple.com.txt
vendored
Executable file
@ -0,0 +1,14 @@
|
||||
body: //div[@id='left-stack' or contains(@class, 'center-stack')]
|
||||
|
||||
find_string: class="artwork" src="
|
||||
replace_string: class="artwork" src-disabled="
|
||||
find_string: src-swap-high-dpi="
|
||||
replace_string: src="
|
||||
|
||||
strip_id_or_class: rating
|
||||
strip_id_or_class: listeners-also-bought
|
||||
|
||||
prune: no
|
||||
|
||||
test_url: https://itunes.apple.com/us/rss/topaudiobooks/limit=10/xml
|
||||
test_url: https://itunes.apple.com/us/audiobook/the-giver-unabridged/id356345850
|
@ -4,4 +4,4 @@ body: //div[@itemprop='articleBody']
|
||||
tidy: no
|
||||
|
||||
test_url: http://www.kachiblog.com/2013/05/samsung-galaxy-s4-vs-samsung-galaxy.html
|
||||
test_url: http://www.kachiblog.com/feeds/posts/default
|
||||
test_url: http://www.kachiblog.com/feed
|
||||
|
7
inc/3rdparty/site_config/standard/lifehacker.co.uk.txt
vendored
Executable file
@ -0,0 +1,7 @@
|
||||
title: //div[@itemprop='headline']
|
||||
body: //noscript/img | //div[@itemprop='text']
|
||||
author: //div[@class='meta meta--post']//a[@class='is-author']
|
||||
date: //div[@class='meta meta--post']//time/@datetime
|
||||
|
||||
test_url: http://www.lifehacker.co.uk/2014/08/22/dealhacker-10-google-chromecast-super-cheap-batteries-much
|
||||
test_url: http://www.lifehacker.co.uk/2014/08/18/andrognito-hides-files-youd-like-keep-away-prying-eyes
|
@ -25,4 +25,4 @@ strip_id_or_class: 'rightimage'
|
||||
#Comments
|
||||
strip: //table
|
||||
strip: //p/following-sibling::*[0]
|
||||
test_url: http://www.mainpost.de/ueberregional/meinung/Dioxin-Skandal-bringt-Agrarministerin-in-Bedraengnis;art9517,5920211
|
||||
test_url: http://www.mainpost.de/regional/wuerzburg/Autobahnschuetze-Staatsanwalt-fordert-zwoelf-Jahre;art492151,8386332
|
||||
|
@ -2,3 +2,4 @@ strip_id_or_class: article-tools
|
||||
strip_id_or_class: pagenav
|
||||
prune: no
|
||||
test_url: http://www.medialens.org/index.php/alerts/alert-archive/2012/713-the-illusion-of-democracy.html
|
||||
test_contains: In an era of permanent war, economic meltdown
|
||||
|
13
inc/3rdparty/site_config/standard/medium.com.txt
vendored
@ -1,7 +1,12 @@
|
||||
body: //div[contains(@class, 'post-content-inner')]
|
||||
strip_id_or_class: follow-ups
|
||||
strip_id_or_class: footer
|
||||
body: //div[contains(@class, 'postContent-inner')]
|
||||
strip_id_or_class: supplementalPostContent
|
||||
|
||||
prune: no
|
||||
|
||||
test_url: https://medium.com/p/6844c0d7893b
|
||||
test_url: https://medium.com/@savolai/kaytettavyyden-haasteet-keskustelukulttuurista-2-3-6844c0d7893b
|
||||
test_contains: Jos käytettävyysongelmat ovat kerran niin tyypillisiä
|
||||
test_contains: Keskustelukulttuuriongelmasta (subjective vs. objective bugs)
|
||||
|
||||
test_url: https://medium.com/health-the-future/thirty-things-ive-learned-482765ee3503
|
||||
test_contains: Remember you will die
|
||||
test_contains: You have to have some faith.
|
||||
|
12
inc/3rdparty/site_config/standard/menshealth.com.sg.txt
vendored
Executable file
@ -0,0 +1,12 @@
|
||||
strip: //div[contains(@style, 'float:right') and contains(., 'advertisement')]
|
||||
body: //div[@style="float:left;width:740px;"]
|
||||
|
||||
tidy: no
|
||||
|
||||
test_url: http://www.menshealth.com.sg/fitness/mh-picks-under-armour-clutchfit-nitro-mid-cleats
|
||||
test_contains: These cleats are made for one thing
|
||||
|
||||
test_url: http://www.menshealth.com.sg/fitness/top-10-fat-burning-bodyweight-moves-you-can-do-10-minutes
|
||||
test_contains: let this workout fool you
|
||||
|
||||
test_url: http://www.menshealth.com.sg/fitness/feed
|
@ -8,4 +8,4 @@ strip_id_or_class: news_morearticlesincat
|
||||
strip_id_or_class: ezc_comments
|
||||
strip_comments: yes
|
||||
|
||||
test_url: http://www.northumberlandview.ca/index.php?module=news&func=display&sid=5972
|
||||
test_url: http://www.northumberlandview.ca/index.php?module=news&type=user&func=display&sid=31127
|
||||
|
@ -42,7 +42,11 @@ strip://h6[@class = 'kicker']
|
||||
author:substring-after(//h6[@class='byline'],'By ')
|
||||
|
||||
test_url: http://www.nytimes.com/2011/07/24/books/review/an-academic-authors-unintentional-masterpiece.html
|
||||
test_contains: In this column I want to look at a not uncommon way of writing
|
||||
|
||||
test_url: http://www.nytimes.com/2012/06/10/arts/television/the-newsroom-aaron-sorkins-return-to-tv.html
|
||||
test_contains: IF you’ve seen enough of Aaron Sorkin’s theater
|
||||
|
||||
test_url: http://www.nytimes.com/2013/03/25/world/middleeast/israeli-military-responds-after-patrols-come-under-fire-from-syria.html
|
||||
test_url: http://www.nytimes.com/2013/08/15/nyregion/when-the-new-york-city-subway-ran-without-rails.html
|
||||
test_url: http://www.nytimes.com/2004/02/29/weekinreview/correspondence-class-consciousness-china-s-wealthy-live-creed-hobbes-darwin-meet.html
|
||||
|
@ -1,3 +1,5 @@
|
||||
body: //div[@id='_ctl12__ctl0_Article']
|
||||
body: //div[contains(@class, 'article-photo-wrapper')]
|
||||
prune: no
|
||||
autodetect_on_failure: no
|
||||
|
||||
test_url: http://www.real.gr/DefaultArthro.aspx?page=arthro&id=360962&catID=1
|
||||
test_contains: Επισήμως το αποψινό υπουργικό
|
||||
|
@ -7,7 +7,7 @@ author: //p[@class="tagline"]/a
|
||||
# this doesn't work for some reason...?
|
||||
date: //p[@class="tagline"]//@datetime
|
||||
|
||||
body: //div[@class="expando"]//div[@class="usertext-body"]
|
||||
body: (//div[contains(@class, 'noncollapsed')]//div[contains(@class, 'usertext-body')])[1]
|
||||
|
||||
strip_id_or_class: tagline
|
||||
strip_id_or_class: unvotable-message
|
||||
@ -18,3 +18,4 @@ single_page_link: //p[@class="title"]/a[contains(@href, 'http://')]
|
||||
|
||||
test_url: http://www.reddit.com/r/truegaming/comments/wfe7r/i_wrote_about_the_problems_i_honestly_feel_that/
|
||||
test_url: http://www.reddit.com/r/worldnews/comments/1as37r/twelve_north_korean_soldiers_attempting_to_defect/
|
||||
test_url: http://www.reddit.com/r/WritingPrompts/comments/2786lw/wp_in_a_world_where_puns_are_illegal_one_man/chybk8e
|
@ -1,4 +1,4 @@
|
||||
body: //div[@class="storyBox"]
|
||||
body: //div[contains(concat(' ',normalize-space(@class),' '),' article ') and (contains(concat(' ',normalize-space(@class),' '),' clear '))]
|
||||
title: //div[@class="storyBox"]/h1
|
||||
author: //a[@rel="author"]
|
||||
date: substring-before(//span[@class="dateline"], 'by')
|
||||
|
@ -1,4 +1,4 @@
|
||||
#grab the actual content div
|
||||
body: //div[@class='rt-article']
|
||||
|
||||
test_url: http://www.sourcebooks.com/next/sourcebooks-next-our-blog/1601-another-piece-of-the-e-puzzle-or-when-good-ebook-promotions-go-bad.html
|
||||
test_url: http://www.sourcebooks.com/blog/happy-27th-birthday-sourcebooks.html
|
||||
|
5
inc/3rdparty/site_config/standard/tabletmag.com.txt
vendored
Executable file
@ -0,0 +1,5 @@
|
||||
body: //div[contains(@class, 'story-text')]
|
||||
|
||||
strip_id_or_class: related
|
||||
|
||||
test_url: http://www.tabletmag.com/jewish-news-and-politics/181181/mossberg-parallel-states?all=1
|
60
inc/3rdparty/site_config/standard/tagesspiegel.de.txt
vendored
Executable file
@ -0,0 +1,60 @@
|
||||
# Author: zinnober
|
||||
# Should work with "normal" articles as well as with image galleries
|
||||
|
||||
prune: no
|
||||
|
||||
# Title
|
||||
title: //h1/span[@class='hcf-headline']
|
||||
|
||||
# Set author
|
||||
author: //a[@rel='author']
|
||||
|
||||
# Set date
|
||||
date: //span[@class='date hcf-atlas']
|
||||
|
||||
# Fetch full multipage articles
|
||||
next_page_link: //a[contains(@class, 'hcf-forward')]
|
||||
|
||||
# Content is here
|
||||
body: //article
|
||||
body: //div[contains(@class, 'hcf-screen')]
|
||||
|
||||
# Remove tracking and ads
|
||||
strip_id_or_class: hcf-ad
|
||||
strip_id_or_class: hcf-autoload-ad
|
||||
strip_id_or_class: hcf-content-ad
|
||||
|
||||
# Tidy up before article
|
||||
strip: //article/h1
|
||||
strip_id_or_class: hcf-atlas
|
||||
strip_id_or_class: hcf-author
|
||||
strip_id_or_class: date hcf-atlas
|
||||
strip_id_or_class: date hcf-atlas
|
||||
|
||||
# General cleanup
|
||||
strip: //div[contains(@class, 'hcf-screen')]//h1
|
||||
strip: //div[@class='hcf-subpage-titles']//ul
|
||||
strip_id_or_class: hcf-doctype-media
|
||||
strip_id_or_class: hcf-inline-gallery
|
||||
strip_id_or_class: hcf-doctype-video
|
||||
strip_id_or_class: hcf-links
|
||||
strip_id_or_class: hcf-mini-navi
|
||||
strip_id_or_class: hcf-media-control
|
||||
strip_id_or_class: hcf-hidden
|
||||
replace_string(<span class="hcf-update">Update</span>): <strong>Update: </strong>
|
||||
|
||||
# Fix pictures and captions
|
||||
replace_string(<a class="hcf-doctype-gallery): <p class="hcf-doctype-gallery
|
||||
replace_string(<a class="hcf-doctype-enlarge): <p class="hcf-doctype-enlarge
|
||||
replace_string(<figcaption class="hcf-caption">): <br><small><em>
|
||||
replace_string(</figcaption>): </em></small>
|
||||
|
||||
# Fix image galleries
|
||||
replace_string(<a class=" ajaxify): <p class="ajaxify
|
||||
replace_string(<div class="hcf-caption"><div><p>): <small><em>
|
||||
|
||||
# Try it yourself
|
||||
test_url: http://www.tagesspiegel.de/berlin/bezirke/wedding/wedding-jetzt/auf-der-suche-nach-einem-stadtteil-wilder-weiter-wedding/8757156.html
|
||||
test_url: http://www.tagesspiegel.de/berlin/olympia-in-berlin-der-flughafen-tegel-soll-das-olympische-dorf-werden/10645036.html
|
||||
test_url: http://www.tagesspiegel.de/mediacenter/fotostrecken/berlin/bildergalerie-kreuzberger-der-woche/9305534.html
|
||||
|
@ -1,3 +1,3 @@
|
||||
single_page_link_in_feed: //b/a
|
||||
|
||||
test_url_feed: http://www.techmeme.com/feed.xml
|
||||
test_url: http://www.techmeme.com/feed.xml
|
||||
|
@ -15,6 +15,8 @@ strip: //div[@class='earthbox']
|
||||
|
||||
single_page_link: //article//a[contains(@class, 'print')]
|
||||
|
||||
native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
|
||||
|
||||
test_url: http://www.theatlantic.com/technology/archive/2011/04/want-to-see-how-crazy-a-bot-run-market-can-be/237773/
|
||||
test_url: http://www.theatlantic.com/magazine/archive/2007/11/the-autumn-of-the-multitaskers/6342/
|
||||
test_url: http://www.theatlantic.com/entertainment/archive/2012/04/30-rock-live-a-funny-reminder-of-why-sitcoms-arent-shot-live-anymore/256447/
|
@ -1,5 +1,10 @@
|
||||
body: //div[contains(@class, 'entry-content')]//div[contains(@class, 'column-2')]
|
||||
single_page_link: //div[contains(@class, 'pagination')]//a[contains(@title, 'ingle page')]
|
||||
strip_id_or_class: entry-related
|
||||
strip_id_or_class: entry-sidebar
|
||||
strip_id_or_class: entry-pagination
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
test_url: http://www.theglobeandmail.com/report-on-business/rob-magazine/how-a-novice-miner-survived-a-summer-in-the-klondike/article2345350/
|
||||
test_url: http://www.theglobeandmail.com/report-on-business/industry-news/energy-and-resources/cliffs-natural-resources-looking-to-exit-ontarios-ring-of-fire/article20651617/
|
@ -6,8 +6,19 @@ strip: //div[contains(@class, 'kindleWidget')]
|
||||
#strip: //a[not(text())]
|
||||
strip_id_or_class: pocket-btn
|
||||
author: //li[@class='byline']
|
||||
native_ad_clue: //meta[@property="article:tag" and contains(@content, "Partner zone")]
|
||||
native_ad_clue: //meta[@property="video:tag" and contains(@content, "Partner zone")]
|
||||
prune: no
|
||||
tidy: no
|
||||
|
||||
test_url: http://www.theguardian.com/world/2013/oct/04/nsa-gchq-attack-tor-network-encryption
|
||||
test_contains: The National Security Agency has made repeated attempts to develop
|
||||
test_contains: The agency did not directly address those questions, instead providing a statement.
|
||||
|
||||
test_url: http://www.theguardian.com/world/2013/oct/03/edward-snowden-files-john-lanchester
|
||||
test_contains: In August, the editor of the Guardian rang me up and asked if I would spend a week in New York
|
||||
test_contains: As the second most senior judge in the country, Lord Hoffmann, said in 2004 about a previous version of our anti-terrorism laws
|
||||
|
||||
test_url: http://www.theguardian.com/commentisfree/2014/jun/15/britishness-search-identity-my-part-in-camerons-odyssey
|
||||
# Native ad
|
||||
test_url: http://www.theguardian.com/sustainable-business/2014/jul/18/ben-jerry-turn-ice-cream-into-energy
|
||||
|
@ -15,6 +15,11 @@ strip: //nav
|
||||
strip: //img[contains(@class, 'vox-lazy-load')]
|
||||
# deal with bad parsing
|
||||
strip: //div[contains(@class, 'story-image')]//div[contains(., 'function(')]
|
||||
strip: //div[contains(@class, 'm-linkset')]
|
||||
strip: //div[contains(@class, 'm-entry__sidebar')]
|
||||
strip: //ul[contains(@class, 'm-article__sources')]
|
||||
strip: //div[contains(@class, 'chorus-emc__content')]
|
||||
|
||||
|
||||
strip_id_or_class: gallery
|
||||
strip_id_or_class: article-meta
|
||||
|
25
inc/3rdparty/site_config/standard/thisiscolossal.com.txt
vendored
Executable file
@ -0,0 +1,25 @@
|
||||
# Author: zinnober
|
||||
|
||||
tidy: no
|
||||
prune: no
|
||||
|
||||
# Set author
|
||||
author: //a[contains(@rel, 'author')]
|
||||
|
||||
# Content is here
|
||||
body: //article
|
||||
|
||||
# Tidy up before article
|
||||
strip: //header
|
||||
|
||||
# Get rid of doubled images
|
||||
strip: //img[contains(@class, '-hidden')]
|
||||
|
||||
# Tidy up after article
|
||||
strip_id_or_class: social-list
|
||||
strip_id_or_class: meta-info
|
||||
strip: //footer
|
||||
|
||||
# Try it yourself
|
||||
test_url: http://www.thisiscolossal.com/2014/09/chicago-in-the-fog-by-michael-salisbury/
|
||||
test_url: http://www.thisiscolossal.com/2014/09/bird-portraits-ruffling-with-personality-by-leila-jeffreys/
|
10
inc/3rdparty/site_config/standard/towerofthehand.com.txt
vendored
Executable file
@ -0,0 +1,10 @@
|
||||
title: //div[@id='headline']
|
||||
body: //div[@class='entry_text']
|
||||
author: //div[text() = 'Author:']/following-sibling::div/a
|
||||
date: //div[text() = 'Published:']/following-sibling::div
|
||||
single_page_link: //a[@href='noscript.html']
|
||||
prune: no
|
||||
|
||||
test_url: http://towerofthehand.com/blog/2014/08/08-pitch-this-got-spinoff/index.html
|
||||
test_url: http://towerofthehand.com/blog/2014/07/31-definitions-and-embodiments/index.html
|
||||
test_url: http://towerofthehand.com/blog/2014/07/03-hero-with-thousand-faces/index.html
|
@ -7,3 +7,4 @@ prune: no
|
||||
tidy: no
|
||||
|
||||
test_url: https://twitter.com/medialens/status/216883678582804480
|
||||
test_contains: is all but alone in challenging the tsunami of UK
|
||||
|
@ -2,6 +2,7 @@ title: //meta[@property="og:title"]/@content
|
||||
author: //div[contains(@class, 'byline')]//span[contains(@class, 'name')]
|
||||
date: //div[contains(@class, 'cn_date_time')]
|
||||
body: //div[contains(@class, 'pageContainers')]
|
||||
body: //div[@id='main']
|
||||
body: //article[@id='items-container']
|
||||
#body: //h2[@class='sub-header'] | //div[contains(@class, 'contributor-type') or @class='display-date' or @class='content-container']
|
||||
|
||||
@ -26,5 +27,7 @@ strip: //li[@class='blogNavPrev']
|
||||
single_page_link: //a[@title='Print this page']
|
||||
|
||||
test_url: http://www.vanityfair.com/politics/features/2011/05/egypt-revolutionaries-201105
|
||||
test_contains: nothing can take away from the miracle of Tahrir Square
|
||||
|
||||
test_url: http://www.vanityfair.com/politics/features/2008/08/hitchens200808
|
||||
test_url: http://www.vanityfair.com/style/2012/01/prisoners-of-style-201201
|
18
inc/3rdparty/site_config/standard/wn.de.txt
vendored
Executable file
@ -0,0 +1,18 @@
|
||||
author: //div[@id='main']//div[@class='col right']//div[contains(@class, 'attribute-author')]
|
||||
body: //div[@id='main']//div[@class='col right']
|
||||
strip_id_or_class: boxes
|
||||
strip_id_or_class: lazy
|
||||
strip_id_or_class: comment_box
|
||||
strip_id_or_class: fb_comments
|
||||
|
||||
find_string: <noscript>
|
||||
replace_string: <div>
|
||||
find_string: </noscript>
|
||||
replace_string: </div>
|
||||
|
||||
prune: no
|
||||
tidy: no
|
||||
|
||||
test_url: http://www.wn.de/Muenster/Kultur/1742956-Wilm-Weppelmann-verlaesst-die-Einsiedelei-Und-dann-ab-unter-die-Dusche
|
||||
# feed
|
||||
test_url: http://www.wn.de/rss/feed/wn_muenster
|
@ -1,4 +1,3 @@
|
||||
# 2014-10-21 [Marmo] added stripping of inline ads and appropriate test_url
|
||||
# 2013.10.30 [rezor92] fixed single_page_link
|
||||
# 2012-12-23 [carlo@...] fixed half-assed headlines in articles, removed inline author profiles, adjusted picture captions
|
||||
# 2012-03-17 [dkless@...] Cut metadata parts in the beginning and the ends of the content block; copyright entries for pictures removed; Author fixed, not sure if old entries still valid (I left them); Weird problems with some pages addressed (see last section for removing hidden section)
|
||||
@ -17,8 +16,6 @@ author: substring-after(//li[@class='source first '], 'Quelle: ')
|
||||
|
||||
strip_id_or_class: articleheader
|
||||
strip: //div[@id="comments"] | //div[@class="pagination block"] | //p[@class="ressortbacklink"] | //div[@id="relatedArticles"] | // div[@class="inline portrait"]
|
||||
#Remove inline ads
|
||||
strip: //div[@class="innerad"]
|
||||
|
||||
#Removes author and date from the start
|
||||
strip: //ul[@class="tools"]
|
||||
@ -46,4 +43,3 @@ strip_id_or_class:"pagination"
|
||||
|
||||
footnotes: no
|
||||
test_url: http://www.zeit.de/kultur/film/2012-12/Kurzfilmtag
|
||||
test_url: http://www.zeit.de/wissen/2014-10/ebola-nigeria-who
|
||||
|