
[svn] Update the documentation on RES.

Published in <sxssn5mvsnk.fsf@florida.munich.redhat.com>.
hniksic 2002-04-23 17:37:39 -07:00
parent 7a6bf2477d
commit 556b2cf861
2 changed files with 48 additions and 24 deletions

doc/ChangeLog

@@ -1,3 +1,10 @@
+2002-04-24  Hrvoje Niksic  <hniksic@arsdigita.com>
+
+	* wget.texi (Robot Exclusion): Explain how to turn off the robot
+	exclusion support from the command line.
+	(Wgetrc Commands): Explain that the `robots' variable also takes
+	effect on the "nofollow" matching.
+
 2002-04-15  Hrvoje Niksic  <hniksic@arsdigita.com>
 
 	* wget.texi (Download Options): Fix the documentation of

doc/wget.texi

@@ -2219,8 +2219,11 @@ When set to on, retrieve symbolic links as if they were plain files; the
 same as @samp{--retr-symlinks}.
 
 @item robots = on/off
-Use (or not) @file{/robots.txt} file (@pxref{Robots}).  Be sure to know
-what you are doing before changing the default (which is @samp{on}).
+Specify whether the norobots convention is respected by Wget, ``on'' by
+default.  This switch controls both the @file{/robots.txt} and the
+@samp{nofollow} aspect of the spec.  @xref{Robot Exclusion}, for more
+details about this.  Be sure you know what you are doing before turning
+this off.
 
 @item server_response = on/off
 Choose whether or not to print the @sc{http} and @sc{ftp} server
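
Illustration (not part of the patch): in a user's .wgetrc, the setting documented in the hunk above would be written as below; the comment line is added here for clarity.

    # Turn off the robot exclusion support (/robots.txt and "nofollow") -- use with care.
    robots = off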
@@ -2744,14 +2747,14 @@ Other than that, Wget will not try to interfere with signals in any way.
 This chapter contains some references I consider useful.
 
 @menu
-* Robots::                  Wget as a WWW robot.
+* Robot Exclusion::         Wget's support for RES.
 * Security Considerations:: Security with Wget.
 * Contributors::            People who helped.
 @end menu
 
-@node Robots, Security Considerations, Appendices, Appendices
-@section Robots
-@cindex robots
+@node Robot Exclusion, Security Considerations, Appendices, Appendices
+@section Robot Exclusion
+@cindex robot exclusion
 @cindex robots.txt
 @cindex server maintenance
@@ -2759,26 +2762,35 @@ It is extremely easy to make Wget wander aimlessly around a web site,
 sucking all the available data in progress.  @samp{wget -r @var{site}},
 and you're set.  Great?  Not for the server admin.
 
-While Wget is retrieving static pages, there's not much of a problem.
-But for Wget, there is no real difference between a static page and the
-most demanding CGI.  For instance, a site I know has a section handled
-by an, uh, @dfn{bitchin'} CGI script that converts all the Info files to
-HTML.  The script can and does bring the machine to its knees without
-providing anything useful to the downloader.
+As long as Wget is only retrieving static pages, and doing it at a
+reasonable rate (see the @samp{--wait} option), there's not much of a
+problem.  The trouble is that Wget can't tell the difference between the
+smallest static page and the most demanding CGI.  A site I know has a
+section handled by an, uh, @dfn{bitchin'} CGI Perl script that converts
+Info files to HTML on the fly.  The script is slow, but works well
+enough for human users viewing an occasional Info file.  However, when
+someone's recursive Wget download stumbles upon the index page that
+links to all the Info files through the script, the system is brought to
+its knees without providing anything useful to the downloader.
 
-For such and similar cases various robot exclusion schemes have been
-devised as a means for the server administrators and document authors to
-protect chosen portions of their sites from the wandering of robots.
+To avoid this kind of accident, as well as to preserve privacy for
+documents that need to be protected from well-behaved robots, the
+concept of @dfn{robot exclusion} has been invented.  The idea is that
+the server administrators and document authors can specify which
+portions of the site they wish to protect from the robots.
 
-The more popular mechanism is the @dfn{Robots Exclusion Standard}, or
-@sc{res}, written by Martijn Koster et al. in 1994.  It specifies the
-format of a text file containing directives that instruct the robots
-which URL paths to avoid.  To be found by the robots, the specifications
-must be placed in @file{/robots.txt} in the server root, which the
-robots are supposed to download and parse.
+The most popular mechanism, and the de facto standard supported by all
+the major robots, is the ``Robots Exclusion Standard'' (RES) written by
+Martijn Koster et al. in 1994.  It specifies the format of a text file
+containing directives that instruct the robots which URL paths to avoid.
+To be found by the robots, the specifications must be placed in
+@file{/robots.txt} in the server root, which the robots are supposed to
+download and parse.
 
-Wget supports @sc{res} when downloading recursively.  So, when you
-issue:
+Although Wget is not a web robot in the strictest sense of the word, it
+can download large parts of the site without the user's intervention to
+download an individual page.  Because of that, Wget honors RES when
+downloading recursively.  For instance, when you issue:
 
 @example
 wget -r http://www.server.com/
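
Illustration (not part of the patch): a minimal /robots.txt of the kind the rewritten paragraph describes; the disallowed path is made up.

    User-agent: *
    Disallow: /cgi-bin/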
@@ -2815,7 +2827,12 @@ This is explained in some detail at
 method of robot exclusion in addition to the usual @file{/robots.txt}
 exclusion.
 
-@node Security Considerations, Contributors, Robots, Appendices
+If you know what you are doing and really really wish to turn off the
+robot exclusion, set the @code{robots} variable to @samp{off} in your
+@file{.wgetrc}.  You can achieve the same effect from the command line
+using the @code{-e} switch, e.g. @samp{wget -e robots=off @var{url}...}.
+
+@node Security Considerations, Contributors, Robot Exclusion, Appendices
 @section Security Considerations
 @cindex security
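
Illustration (not part of the patch): the command-line form mentioned in the added paragraph, using the example host that appears earlier in the manual.

    wget -e robots=off -r http://www.server.com/

This has the same effect as setting `robots = off' in .wgetrc, as sketched after the first hunk above.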