mirror of https://github.com/moparisthebest/wget (synced 2024-07-03 16:38:41 -04:00)
[svn] Update the documentation on RES.
Published in <sxssn5mvsnk.fsf@florida.munich.redhat.com>.
parent 7a6bf2477d
commit 556b2cf861
ChangeLog

@@ -1,3 +1,10 @@
+2002-04-24  Hrvoje Niksic  <hniksic@arsdigita.com>
+
+        * wget.texi (Robot Exclusion): Explain how to turn off the robot
+        exclusion support from the command line.
+        (Wgetrc Commands): Explain that the `robots' variable also takes
+        effect on the "nofollow" matching.
+
 2002-04-15  Hrvoje Niksic  <hniksic@arsdigita.com>
 
         * wget.texi (Download Options): Fix the documentation of
wget.texi

@@ -2219,8 +2219,11 @@ When set to on, retrieve symbolic links as if they were plain files; the
 same as @samp{--retr-symlinks}.
 
 @item robots = on/off
-Use (or not) @file{/robots.txt} file (@pxref{Robots}).  Be sure to know
-what you are doing before changing the default (which is @samp{on}).
+Specify whether the norobots convention is respected by Wget, ``on'' by
+default.  This switch controls both the @file{/robots.txt} and the
+@samp{nofollow} aspect of the spec.  @xref{Robot Exclusion}, for more
+details about this.  Be sure you know what you are doing before turning
+this off.
 
 @item server_response = on/off
 Choose whether or not to print the @sc{http} and @sc{ftp} server
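For anyone who wants to try the setting that the rewritten "robots = on/off" entry describes, a minimal .wgetrc fragment would be the following (illustrative only, not part of this commit):

    # ~/.wgetrc -- disable robot exclusion for all of this user's downloads
    robots = off

Because the variable now covers both aspects of the spec, this single line makes Wget ignore /robots.txt as well as the HTML-level "nofollow" convention (a page declaring <meta name="robots" content="nofollow">) during recursive downloads.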
@@ -2744,14 +2747,14 @@ Other than that, Wget will not try to interfere with signals in any way.
 This chapter contains some references I consider useful.
 
 @menu
-* Robots::                  Wget as a WWW robot.
+* Robot Exclusion::         Wget's support for RES.
 * Security Considerations:: Security with Wget.
 * Contributors::            People who helped.
 @end menu
 
-@node Robots, Security Considerations, Appendices, Appendices
-@section Robots
-@cindex robots
+@node Robot Exclusion, Security Considerations, Appendices, Appendices
+@section Robot Exclusion
+@cindex robot exclusion
 @cindex robots.txt
 @cindex server maintenance
 
@@ -2759,26 +2762,35 @@ It is extremely easy to make Wget wander aimlessly around a web site,
 sucking all the available data in progress.  @samp{wget -r @var{site}},
 and you're set.  Great?  Not for the server admin.
 
-While Wget is retrieving static pages, there's not much of a problem.
-But for Wget, there is no real difference between a static page and the
-most demanding CGI.  For instance, a site I know has a section handled
-by an, uh, @dfn{bitchin'} CGI script that converts all the Info files to
-HTML.  The script can and does bring the machine to its knees without
-providing anything useful to the downloader.
+As long as Wget is only retrieving static pages, and doing it at a
+reasonable rate (see the @samp{--wait} option), there's not much of a
+problem.  The trouble is that Wget can't tell the difference between the
+smallest static page and the most demanding CGI.  A site I know has a
+section handled by an, uh, @dfn{bitchin'} CGI Perl script that converts
+Info files to HTML on the fly.  The script is slow, but works well
+enough for human users viewing an occasional Info file.  However, when
+someone's recursive Wget download stumbles upon the index page that
+links to all the Info files through the script, the system is brought to
+its knees without providing anything useful to the downloader.
 
-For such and similar cases various robot exclusion schemes have been
-devised as a means for the server administrators and document authors to
-protect chosen portions of their sites from the wandering of robots.
+To avoid this kind of accident, as well as to preserve privacy for
+documents that need to be protected from well-behaved robots, the
+concept of @dfn{robot exclusion} has been invented.  The idea is that
+the server administrators and document authors can specify which
+portions of the site they wish to protect from the robots.
 
-The more popular mechanism is the @dfn{Robots Exclusion Standard}, or
-@sc{res}, written by Martijn Koster et al. in 1994.  It specifies the
-format of a text file containing directives that instruct the robots
-which URL paths to avoid.  To be found by the robots, the specifications
-must be placed in @file{/robots.txt} in the server root, which the
-robots are supposed to download and parse.
+The most popular mechanism, and the de facto standard supported by all
+the major robots, is the ``Robots Exclusion Standard'' (RES) written by
+Martijn Koster et al. in 1994.  It specifies the format of a text file
+containing directives that instruct the robots which URL paths to avoid.
+To be found by the robots, the specifications must be placed in
+@file{/robots.txt} in the server root, which the robots are supposed to
+download and parse.
 
-Wget supports @sc{res} when downloading recursively.  So, when you
-issue:
+Although Wget is not a web robot in the strictest sense of the word, it
+can download large parts of the site without the user's intervention to
+download an individual page.  Because of that, Wget honors RES when
+downloading recursively.  For instance, when you issue:
 
 @example
 wget -r http://www.server.com/
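For context on the rewritten RES paragraphs, a server-side /robots.txt in the format the text describes might look like this; the paths are invented for illustration:

    # /robots.txt -- served from the server root
    # Ask all robots to stay out of the CGI area and the Info-to-HTML script
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /info2html/

A compliant robot, including Wget in recursive mode, fetches this file first and then skips any URL whose path begins with one of the Disallow prefixes.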
@@ -2815,7 +2827,12 @@ This is explained in some detail at
 method of robot exclusion in addition to the usual @file{/robots.txt}
 exclusion.
 
-@node Security Considerations, Contributors, Robots, Appendices
+If you know what you are doing and really really wish to turn off the
+robot exclusion, set the @code{robots} variable to @samp{off} in your
+@file{.wgetrc}.  You can achieve the same effect from the command line
+using the @code{-e} switch, e.g. @samp{wget -e robots=off @var{url}...}.
+
+@node Security Considerations, Contributors, Robot Exclusion, Appendices
 @section Security Considerations
 @cindex security
 
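Spelled out as a concrete command, the escape hatch added by this hunk looks like the following; the URL is the same placeholder server used in the diff, and --wait=1 is added only in the spirit of the ``reasonable rate'' advice above:

    # Recursive download with robot exclusion switched off for this run only
    wget -e robots=off -r --wait=1 http://www.server.com/

The -e switch executes a wgetrc-style command for this invocation only, so robots=off here has the same effect as the .wgetrc line shown earlier without permanently changing the user's configuration.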