Full-Text RSS Feeds | fivefilters.org

What is Full-Text RSS?

News enthusiasts
Full-Text RSS can transform partial web feeds (often summary-only feeds that expect you to visit a cluttered, ad-ridden site to read the full story) to deliver the full content, stripped of clutter and ads. Read articles in full, in peace, in your favourite news reading application.

Developers
Full-Text RSS is a free software PHP application to help you extract article content from web pages. Extract from a standard HTML page or transform partial feeds to full text. Designed to be run as a web service, but one which you control.


Features

Speedy article extraction

Extraction rules ensure accurate results for popular sites and blog platforms.

Multi-page support

Articles split across a number of pages can be joined back together.

Autodetection

Where extraction rules do not exist, Full-Text RSS relies on heuristics to detect content automatically.

Customisable

Add custom extraction rules for fine-grained extraction.

Language detection

Full-Text RSS can figure out the language of the article being processed.

Multiple formats

Extract articles from HTML pages and partial web feeds, and get the result as RSS, JSON, or JSONP for easy parsing.

Easy hosting

Host on your own servers or deploy to the cloud. Pre-configured. No database required. See our hosting suggestions.

Freedom and transparency

Full-Text RSS is free software — no restrictive corporate APIs, no secret back doors.



Pricing

Basic

Free
  • We host it
  • Unlimited feeds
  • Language detection
  • 1-3 items per feed
  • Caching: 20 min
  • Links preserved
  • Link to FiveFilters.org
  • No JSON output
Start Now

Premium

From 4€ per month
  • We host it
  • Unlimited feeds
  • Language detection
  • 1-10 items per feed
  • Caching: 10 min
  • Links preserved or removed
  • No link to FiveFilters.org
  • No JSON output
Buy Now

Developer

Pay as you go
  • We host it
  • Unlimited feeds
  • Language detection
  • 1-10 items per feed
  • Caching: 10 min
  • Links preserved or removed
  • No link to FiveFilters.org
  • JSON output
Start Now

Download

Full-Text RSS 3.9.5

Released 28 March 2019 — What's new? — Changelog

We offer two purchase options. They come with the same license, but if you intend to use Full-Text RSS as part of a commercial project, or require more support, please purchase the one for business use.

Individual


Community support forum

Automatic update of extraction rules

Custom rules via builder

Free updates for a year

Buy Now — 35 €

Bundle Offer!

Everything above plus

Feed Creator
Term Extraction
PDF Newspaper

Free updates for 2 years!

Buy Bundle — 60 €

Business


Email support

Automatic update of extraction rules

1 request for custom rules + builder

Free updates for a year

Buy Now — 75 €

Bundle Offer!

Everything above plus

Feed Creator
Term Extraction
PDF Newspaper

Free updates for 2 years!

Buy Bundle — 150 €

What you get

Full-Text RSS 3.9.5 from FiveFilters.org includes:

  • Easy installation (no database setup required)
  • Technical support via our forum
  • Free updates for 1 year (half price after that)
  • Full source code
  • Business use customers: Email support + custom extraction rules for a site of your choice *

* If extraction does not work well on a particular site, contact us with details of what you're trying to extract and we'll send you a custom site config file.

After paying you will automatically receive an email with a download link to the zip package. The zip package contains a readme file with instructions for uploading the code to your web host via FTP.

Older versions

Older versions of Full-Text RSS can be downloaded free of charge from our code repository.

Note: we do not offer any support for these, and for the best extraction results we recommend buying the latest version.


More information

Documentation and support

Our help site covers most of what you'll need to know to get Full-Text RSS up and running and customised to work the way you want.

Our public forum is the place to ask questions and browse previous answers.

Hosted or self-hosted?

We want our users to be free to examine and run the code behind FiveFilters.org however they like. So rather than simply invite you to sign up for our premium hosted plan, we've gone to great effort to make the software easy to use and install on your own hosting account.

Using our hosted service (Free, Premium) is the easiest option as we manage everything. You do not have to worry about staying up to date because we maintain the code and any changes we make will automatically be made available to you.

If, however, you have your own hosting account or manage your own server, the self-hosted option gives you the freedom to run the code and manage things yourself — including writing custom extraction rules. We also have a help page on hosting options which should help you get started.

Note: We monitor our hosted service to prevent abuse. For developers needing to process very large amounts of data, we highly recommend downloading the self-hosted version.

API

The details here are mainly intended for developers using our self-hosted copy of Full-Text RSS for article extraction and feed conversion. News enthusiasts who simply want to subscribe to a full-text feed in their news reading application can safely ignore the details here and use the form above.

Full-Text RSS offers two endpoints: Article Extraction and Feed Conversion. If you've restricted access to Full-Text RSS, the final section on API keys will tell you how to pass your key along in the request.

1. Article Extraction

To extract article content from a web page and get a simple JSON response, use the following endpoint:

  • /extract.php?url=[url]

Request Parameters

When making HTTP requests, you can pass the following parameters to extract.php in a GET or POST request.

Note: for many of these parameters, the configuration file will ultimately determine if and how they can be used.

Parameter: url
Value: string (URL)
Description: This is the only required parameter. It should be the URL to a standard HTML page. You can omit the 'http://' prefix if you like.

Parameter: inputhtml
Value: string (HTML)
Description: If you already have the HTML, you can pass it here. We will not make any HTTP requests for the content if this parameter is used. Note: The input HTML should be UTF-8 encoded. You will still need to give us the URL associated with the content (the URL may determine how the content is extracted, if we have extraction rules associated with it).

Parameter: content
Value: 0, 1 (default)
Description: If set to 0, the extracted content will not be included in the output.

Parameter: links
Value: preserve (default), footnotes, remove
Description: Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.

Parameter: xss
Value: 0, 1 (default)
Description: Use this to enable/disable XSS filtering. It is enabled by default, but if your application/framework/CMS already filters HTML for XSS vulnerabilities, you can disable XSS filtering here.

If enabled, we'll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: when enabled this will remove certain elements you may want to preserve, such as iframes.

Parameter: lang
Value: 0, 1 (default), 2, 3
Description: Language detection. If you'd like Full-Text RSS to find the language of the articles it processes, you can use one of the following values:

  • 0: Ignore language
  • 1: Use article metadata (e.g. HTML lang attribute) (default value)
  • 2: As above, but guess the language if it's not specified.
  • 3: Always guess the language, whether it's specified or not.

Parameter: debug
Value: [no value], rawhtml, parsedhtml
Description: If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems.

If the parameter value is rawhtml, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects.

If the parameter value is parsedhtml, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (rawhtml) output. If your extraction rules are not picking out any elements, this will likely help identify the problem.

Note: Full-Text RSS will stop execution after HTML output if one of the last two parameter values is passed. Otherwise it will continue showing debug output until the end.

Parameter: parser
Value: html5php, libxml
Description: The default parser is libxml as it's the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It's slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter.

Parameter: siteconfig
Value: string
Description: Site-specific extraction rules are usually stored in text files in the site_config folder. You can also submit extraction rules directly in your request using this parameter.

Parameter: proxy
Value: 0, 1, string (proxy name)
Description: This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: 0 to disable proxy use (uses direct connection), 1 for default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (has to match the name given to the proxy in the config file).
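
As a rough illustration of how these parameters combine into a request, here is a minimal Python sketch that builds a GET URL for extract.php. The base URL and the helper name build_extract_url are assumptions made for this example, and only a handful of the parameters are covered. Your configuration file may still override what is accepted.

```python
from urllib.parse import urlencode

# Hypothetical location of a self-hosted install; replace with your own.
BASE_URL = "http://localhost/full-text-rss"

# Allowed values taken from the parameter descriptions above.
ALLOWED_LINKS = {"preserve", "footnotes", "remove"}


def build_extract_url(page_url, links="preserve", content=1, lang=1, xss=1):
    """Return a GET URL for extract.php (illustrative helper, not part of Full-Text RSS)."""
    if links not in ALLOWED_LINKS:
        raise ValueError(f"links must be one of {sorted(ALLOWED_LINKS)}")
    if content not in (0, 1) or xss not in (0, 1):
        raise ValueError("content and xss must be 0 or 1")
    if lang not in (0, 1, 2, 3):
        raise ValueError("lang must be 0, 1, 2 or 3")
    # urlencode percent-encodes the page URL so it survives inside the querystring.
    query = urlencode({
        "url": page_url,
        "links": links,
        "content": content,
        "lang": lang,
        "xss": xss,
    })
    return f"{BASE_URL}/extract.php?{query}"


example = build_extract_url(
    "http://chomsky.info/articles/20131105.htm", links="remove", lang=2
)
print(example)
```

Validating the values on the client side is optional. It simply catches typos before a request is sent.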

Response (example)

Simple JSON output containing extracted article title, content, and more. It was produced from the following input URL: http://chomsky.info/articles/20131105.htm

{
    "title": "De-Americanizing the World",
    "excerpt": "During the latest episode of the Washington farce that has astonish…",
    "date": null,
    "author": "Noam Chomsky",
    "language": "en",
    "url": "http://chomsky.info/articles/20131105.htm",
    "effective_url": "http://chomsky.info/articles/20131105.htm",
    "content": "<p>During the latest episode of the Washington farce that has aston…"
}

Note: For brevity the output above is truncated.
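
To show how a client might consume this response, the following Python sketch parses the truncated sample above and picks out a few fields. It assumes that effective_url holds the final URL after any redirects, which the sample suggests but does not state.

```python
import json

# The truncated sample response shown above, embedded as a string.
sample = """
{
    "title": "De-Americanizing the World",
    "excerpt": "During the latest episode of the Washington farce that has astonish…",
    "date": null,
    "author": "Noam Chomsky",
    "language": "en",
    "url": "http://chomsky.info/articles/20131105.htm",
    "effective_url": "http://chomsky.info/articles/20131105.htm",
    "content": "<p>During the latest episode of the Washington farce that has aston…"
}
"""

article = json.loads(sample)

# "date" is null in this sample, so it decodes to None in Python.
summary = {
    "title": article["title"],
    "author": article["author"],
    "language": article["language"],
    "has_date": article["date"] is not None,
    # Assumption: a differing effective_url means the request was redirected.
    "redirected": article["url"] != article["effective_url"],
}
print(summary)
```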


2. Feed Conversion

To transform a partial feed to a full-text feed, pass the URL (encoded) in the querystring to the following URL:

  • /makefulltextfeed.php?url=[url]

All the parameters in the form at the top of this page can be passed in this way. Examine the URL in the address bar after you click 'Create Feed' to see the values.
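
The encoding step matters when the feed URL has its own querystring: an unencoded & would be read as the start of a new Full-Text RSS parameter. A minimal Python sketch, assuming a hypothetical install location and an example feed URL:

```python
from urllib.parse import quote

# Hypothetical location of a self-hosted install; replace with your own.
BASE_URL = "http://localhost/full-text-rss"


def feed_conversion_url(feed_url):
    """Return a makefulltextfeed.php URL with the feed URL percent-encoded."""
    # safe="" encodes every reserved character, including ':', '/', '?' and '&'.
    return f"{BASE_URL}/makefulltextfeed.php?url={quote(feed_url, safe='')}"


# example.org stands in for a real partial feed.
converted = feed_conversion_url("http://example.org/feed?cat=news&page=2")
print(converted)
```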

Request Parameters

When making HTTP requests, you can pass the following parameters to makefulltextfeed.php in a GET request. Most of these parameters have default values suitable for news enthusiasts who simply want to subscribe to a full-text feed in their news reading application. If that's what you're doing, you can safely ignore the details here. For developers, or others who need more control over the output produced by Full-Text RSS, this section should give you an idea of what you can do.

We do not provide form fields for all of these parameters, but you can modify the URL in your browser after clicking 'Create Feed' to use them.

Note: for many of these parameters, the configuration file will ultimately determine if and how they can be used.

Parameter: url
Value: string (URL)
Description: This is the only required parameter. It should be the URL to a partial feed or a standard HTML page. You can omit the 'http://' prefix if you like.

Parameter: format
Value: rss (default), json
Description: The default Full-Text RSS output is RSS. The only other valid output format is JSON. To get JSON output, pass format=json in the querystring. Exclude it from the URL (or set it to ‘rss’) if you’d like RSS.

Parameter: summary
Value: 0 (default), 1
Description: If set to 1, an excerpt will be included for each item in the output.

Parameter: content
Value: 0, 1 (default)
Description: If set to 0, the extracted content will not be included in the output.

Parameter: links
Value: preserve (default), footnotes, remove
Description: Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.

Parameter: exc
Value: 0 (default), 1
Description: If Full-Text RSS fails to extract the article body, the generated feed item will include a message saying extraction failed, followed by the original item description (if present in the original feed). You can ask Full-Text RSS to remove such items from the generated feed completely by passing 1 in this parameter.

Parameter: accept
Value: auto (default), feed, html
Description: Tell Full-Text RSS what it should expect when fetching the input URL. By default Full-Text RSS tries to guess whether the response is a feed or a regular HTML page. It's a good idea to be explicit by passing the appropriate type in this parameter. This is useful if, for example, a feed stops working and begins to return HTML or redirects to an HTML page as a result of site changes. In such a scenario, if you've been explicit about the URL being a feed, Full-Text RSS will not parse HTML returned in response. If you pass accept=html (previously html=1), Full-Text RSS will not attempt to parse the response as a feed. This increases performance slightly and should be used if you know that the URL is not a feed.

Note: If excluded, or set to auto, Full-Text RSS first tries to parse the server's response as a feed, and only if it fails to parse as a feed will it revert to HTML parsing. In the default parse-as-feed-first mode, Full-Text RSS will identify itself as PHP first and only if a valid feed is returned will it identify itself as a browser in subsequent requests to fetch the feed items. In parse-as-html mode, Full-Text RSS will identify itself as a browser from the very first request.

Parameter: xss
Value: 0 (default), 1
Description: Use this to enable XSS filtering. We have not enabled this by default because we assume the majority of our users do not display the HTML retrieved by Full-Text RSS in a web page without further processing. If you subscribe to our generated feeds in your news reader application, it should, if it's good software, already filter the resulting HTML for XSS attacks, making it redundant for Full-Text RSS to do the same. Similarly with frameworks/CMSs which display feed content - the content should be treated like any other user-submitted content.

If you are writing an application yourself which is processing feeds generated by Full-Text RSS, you can either filter the HTML yourself to remove potential XSS attacks or enable this option. This might be useful if you are processing our generated feeds with JavaScript on the client side - although there's client-side XSS filtering available too.

If enabled, we'll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.

Parameter: callback
Value: string
Description: This is for JSONP use. If you're requesting JSON output, you can also specify a callback function (JavaScript client-side function) to receive the Full-Text RSS JSON output.

Parameter: lang
Value: 0, 1 (default), 2, 3
Description: Language detection. If you'd like Full-Text RSS to find the language of the articles it processes, you can use one of the following values:

  • 0: Ignore language
  • 1: Use article metadata (e.g. HTML lang attribute) or feed metadata. (default value)
  • 2: As above, but guess the language if it's not specified.
  • 3: Always guess the language, whether it's specified or not.

If language detection is enabled and a match is found, the language code will be returned in the <dc:language> element inside the <item> element.

Parameter: debug
Value: [no value], rawhtml, parsedhtml
Description: If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems.

If the parameter value is rawhtml, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects.

If the parameter value is parsedhtml, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (rawhtml) output. If your extraction rules are not picking out any elements, this will likely help identify the problem.

Note: Full-Text RSS will stop execution after HTML output if one of the last two parameter values is passed. Otherwise it will continue showing debug output until the end.

Parameter: parser
Value: html5php, libxml
Description: The default parser is libxml as it's the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It's slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter.

Parameter: siteconfig
Value: string
Description: Site-specific extraction rules are usually stored in text files in the site_config folder. You can also submit extraction rules directly in your request using this parameter.

Parameter: proxy
Value: 0, 1, string (proxy name)
Description: This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: 0 to disable proxy use (uses direct connection), 1 for default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (has to match the name given to the proxy in the config file).
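
The callback parameter is aimed at browser-side JavaScript, but a non-browser client that ends up with a JSONP response can still recover the JSON. The Python sketch below assumes the response has the usual JSONP shape, a function name followed by the JSON in parentheses and an optional semicolon. The exact wrapper Full-Text RSS emits is not documented here, so treat the pattern as an assumption.

```python
import json
import re

# Assumed wrapper shape: callbackName({...}) with an optional trailing semicolon.
_JSONP = re.compile(r"^\s*([A-Za-z_$][\w$.]*)\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Split a JSONP payload into (callback name, decoded JSON)."""
    match = _JSONP.match(text)
    if match is None:
        raise ValueError("not a JSONP payload")
    return match.group(1), json.loads(match.group(2))


# handleFeed is a made-up callback name; the payload is a cut-down stand-in.
name, data = unwrap_jsonp(
    'handleFeed({"rss": {"channel": {"title": "BBC News - Home"}}});'
)
print(name, data["rss"]["channel"]["title"])
```

If you control the request, it is simpler to leave callback out and parse the plain JSON directly.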

Feed-only parameters — These parameters only apply to web feeds. They have no effect when the input URL points to a web page.

Parameter: use_extracted_title
Value: [no value]
Description: By default, if the input URL points to a feed, item titles in the generated feed will not be changed - we assume item titles in feeds are not truncated. If you'd like them to be replaced with titles Full-Text RSS extracts, use this parameter in the request (the value does not matter). To enable/disable this for all feeds, see the config file - specifically $options->favour_feed_titles

Parameter: max
Value: number
Description: The maximum number of feed items to process. (The default and upper limit will be found in the configuration file.)
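
When requesting RSS output with language detection enabled, the detected language sits in a <dc:language> element inside each <item>, as described under the lang parameter. The Python sketch below reads it with the standard library. The feed text is a hand-written stand-in, not real Full-Text RSS output, and the second item deliberately has no language to show the fallback.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, which the dc: prefix conventionally maps to in RSS feeds.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hand-written stand-in for a generated feed (assumption, for illustration only).
sample_feed = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First item</title>
      <dc:language>en-gb</dc:language>
    </item>
    <item>
      <title>Second item</title>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(sample_feed)

# Map each item title to its detected language, or None when no match was found.
languages = {}
for item in root.iter("item"):
    title = item.findtext("title")
    languages[title] = item.findtext("dc:language", default=None, namespaces=NS)

print(languages)
```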

Response (example)

JSON output produced for the BBC feed http://feeds.bbci.co.uk/news/sitemap.xml. You can also request regular RSS.

{
    "rss": {
        "@attributes": {
            "version": "2.0"
        },
        "channel": {
            "title": "BBC News - Home",
            "link": "http://www.bbc.co.uk/news/#sa-ns_mchannel=rss&amp;ns_source=PublicR…",
            "description": "The latest stories from the Home section of the BBC News web site.",
            "ttl": 15,
            "image": {
                "title": "BBC News - Home",
                "link": "http://www.bbc.co.uk/news/#sa-ns_mchannel=rss&amp;ns_source=PublicR…",
                "url": "http://news.bbcimg.co.uk/nol/shared/img/bbc_news_120x60.gif"
            },
            "item": [
                {
                    "title": "Russia's Putin visits annexed Crimea",
                    "link": "http://www.bbc.co.uk/news/world-europe-27344029#sa-ns_mchannel=rss&…",
                    "guid": "http://www.bbc.co.uk/news/world-europe-27344029#sa-ns_mchannel=rss&…",
                    "description": "President Putin: \"[Crimeans have] proved their loyalty to a histor…",
                    "content_encoded": "<!-- Adding hypertab -->&#13;\n&#13;\n&#13;\n<!-- end of hypertab -…",
                    "pubDate": "Fri, 09 May 2014 15:02:04 +0000",
                    "dc_language": "en-gb",
                    "dc_format": "text/html",
                    "dc_identifier": "http://www.bbc.co.uk/news/world-europe-27344029",
                    "media_thumbnail": [
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74751000/jpg/_74751301_ycst2i…"
                            }
                        },
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74751000/jpg/_74751302_ycst2i…"
                            }
                        }
                    ]
                },
                {
                    "title": "Harris 'assaulted daughter's friend'",
                    "link": "http://www.bbc.co.uk/news/uk-27340134#sa-ns_mchannel=rss&ns_source=…",
                    "guid": "http://www.bbc.co.uk/news/uk-27340134#sa-ns_mchannel=rss&amp;ns_sou…",
                    "description": "Rolf Harris arrives at court flanked by his wife and daughter Rolf …",
                    "content_encoded": "<!-- Embedding the video player -->&#13;\n<!-- This is the embedd…",
                    "pubDate": "Fri, 09 May 2014 15:21:52 +0000",
                    "dc_language": "en-gb",
                    "dc_format": "text/html",
                    "dc_identifier": "http://www.bbc.co.uk/news/uk-27340134",
                    "media_thumbnail": [
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74740000/jpg/_74740642_hi0221…"
                            }
                        },
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74740000/jpg/_74740643_hi0221…"
                            }
                        }
                    ]
                },
                {
                    "title": "Nigeria 'ignored' school warning",
                    "link": "http://www.bbc.co.uk/news/world-africa-27344863#sa-ns_mchannel=rss&…",
                    "guid": "http://www.bbc.co.uk/news/world-africa-27344863#sa-ns_mchannel=rss&…",
                    "description": "Nigeria's military had advance warning of the attack on a school at…",
                    "content_encoded": "<div class=\"caption full-width\">&#13;\n <img src=\"http://news.b…",
                    "pubDate": "Fri, 09 May 2014 15:48:34 +0000",
                    "dc_language": "en-gb",
                    "dc_format": "text/html",
                    "dc_identifier": "http://www.bbc.co.uk/news/world-africa-27344863",
                    "media_thumbnail": [
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74749000/jpg/_74749855_747495…"
                            }
                        },
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74749000/jpg/_74749856_747495…"
                            }
                        }
                    ]
                }
            ]
        }
    }
}

Note: For brevity the output above is truncated.
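
The JSON mirrors the RSS structure, so items live under rss, then channel, then item. The Python sketch below walks a cut-down copy of the sample above. The second item's thumbnails are omitted on purpose to exercise the fallback. Treating a lone item as a possible single object rather than a list is a defensive assumption, not documented behaviour.

```python
import json

# Cut-down copy of the sample response above (fields and items trimmed).
sample = json.loads("""
{
  "rss": {
    "channel": {
      "title": "BBC News - Home",
      "item": [
        {
          "title": "Russia's Putin visits annexed Crimea",
          "dc_language": "en-gb",
          "dc_identifier": "http://www.bbc.co.uk/news/world-europe-27344029",
          "media_thumbnail": [
            {"@attributes": {"url": "http://news.bbcimg.co.uk/media/images/74751000/jpg/_74751301_ycst2i…"}}
          ]
        },
        {
          "title": "Nigeria 'ignored' school warning",
          "dc_language": "en-gb",
          "dc_identifier": "http://www.bbc.co.uk/news/world-africa-27344863"
        }
      ]
    }
  }
}
""")


def iter_items(feed):
    """Yield a flat dict per feed item (illustrative helper)."""
    items = feed["rss"]["channel"].get("item", [])
    # Defensive assumption: a feed with one item might yield an object, not a list.
    if isinstance(items, dict):
        items = [items]
    for item in items:
        thumbs = item.get("media_thumbnail", [])
        yield {
            "title": item["title"],
            "language": item.get("dc_language"),
            "source": item.get("dc_identifier"),
            "thumbnails": [t["@attributes"]["url"] for t in thumbs],
        }


rows = list(iter_items(sample))
for row in rows:
    print(row["title"], row["language"], len(row["thumbnails"]))
```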


API Keys

To restrict access to your copy of Full-Text RSS, you can specify API keys in the config file.

Note: Full-text feeds produced by Full-Text RSS are intended to be publicly accessible to work with feed readers. As such, the API key should not appear in the final URL for feeds.

Parameter: key
Value: string or number
Description: This parameter has two functions.

If you're calling Full-Text RSS programmatically, it's better to use this parameter to provide the API key index number together with the hash parameter (see below) so that the actual API key does not get sent in the HTTP request.

If you pass the actual API key in this parameter, the hash parameter is not required. If you pass the actual API key to makefulltextfeed.php, Full-Text RSS will find the index number, generate the hash value automatically, and redirect to a new URL to hide the API key. If you'd like to link to a generated feed publicly while protecting your API key, make sure you copy and paste the URL that results after the redirect.

If you've configured Full-Text RSS to require a key, an invalid key will result in an error message.

Parameter: hash
Value: string
Description: A SHA-1 hash value of the API key (actual key, not index number) and the URL supplied in the url parameter, concatenated. This parameter must be passed along with the API key's index number using the key parameter (see above). In PHP, for example: $hash = sha1($api_key.$url);
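
For callers not using PHP, the same hash can be produced in other languages. The Python sketch below mirrors the PHP example: it hashes the key followed by the URL and sends only the key's index number. The base URL, key, index and feed URL are placeholders, and it assumes the hash is taken over the URL exactly as supplied, before percent-encoding.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical location of a self-hosted install; replace with your own.
BASE_URL = "http://localhost/full-text-rss"


def signed_feed_url(feed_url, api_key, key_index):
    """Build a makefulltextfeed.php URL that carries the key index and hash only."""
    # Mirrors the PHP example sha1($api_key.$url): key first, then URL.
    digest = hashlib.sha1((api_key + feed_url).encode("utf-8")).hexdigest()
    query = urlencode({"url": feed_url, "key": key_index, "hash": digest})
    return f"{BASE_URL}/makefulltextfeed.php?{query}"


# Placeholder key and index; use the values from your own config file.
signed = signed_feed_url("http://example.org/feed", "my-secret-key", 1)
print(signed)
```

The resulting URL can be shared with a feed reader because the actual key never appears in it.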

System requirements

PHP 5.2 or above is required. The code has been tested on local, shared hosting and cloud environments. We recommend you download and run our simple compatibility test before purchasing. It's a single (zipped) PHP file you can upload to your server and access through your browser. It will tell you whether your server is capable of running Full-Text RSS.

On our help site, we have a list of recommended hosts.

Software Components

Full-Text RSS is written in PHP and relies on the following primary components:

  • PHP Readability
  • SimplePie
  • FeedWriter
  • Humble HTTP Agent

Depending on your configuration, these secondary components may also be used:

  • HTML5-PHP
  • htmLawed
  • Rolling Curl
  • Zend Cache
  • Text_LanguageDetect

License

This web application is licensed under the AGPL version 3. (More on why this is important.)

The software components in this application are licensed as follows...

  • PHP Readability: Apache License v2
  • SimplePie: BSD
  • FeedWriter: GPL v2
  • Humble HTTP Agent: AGPL v3
  • Zend: New BSD
  • Rolling Curl: Apache License v2
  • HTML5-PHP: MIT
  • htmLawed: LGPL v3
  • Text_LanguageDetect: BSD

Support

Frequently Asked Questions

What is this? How does it work? How can I use it? Why is my content appearing on other sites? See our Frequently Asked Questions page for answers.

Help

Our help site contains articles to get you started, and a forum to ask questions.

Email

Direct your questions to [email protected].

Twitter

Direct your questions to @fivefilters. Why not follow us too?


Posted by Reza Razavi (drjafarpour) on Monday, 2 Ordibehesht 1398 at 14:40.