mirror of https://github.com/moparisthebest/wget (synced 2024-07-03 16:38:41 -04:00)
[svn] Update the documentation on RES.
Published in <sxssn5mvsnk.fsf@florida.munich.redhat.com>.
parent 7a6bf2477d
commit 556b2cf861
ChangeLog

@@ -1,3 +1,10 @@
+2002-04-24  Hrvoje Niksic  <hniksic@arsdigita.com>
+
+        * wget.texi (Robot Exclusion): Explain how to turn off the robot
+        exclusion support from the command line.
+        (Wgetrc Commands): Explain that the `robots' variable also takes
+        effect on the "nofollow" matching.
+
 2002-04-15  Hrvoje Niksic  <hniksic@arsdigita.com>
 
         * wget.texi (Download Options): Fix the documentation of
wget.texi

@@ -2219,8 +2219,11 @@ When set to on, retrieve symbolic links as if they were plain files; the
 same as @samp{--retr-symlinks}.
 
 @item robots = on/off
-Use (or not) @file{/robots.txt} file (@pxref{Robots}).  Be sure to know
-what you are doing before changing the default (which is @samp{on}).
+Specify whether the norobots convention is respected by Wget, ``on'' by
+default.  This switch controls both the @file{/robots.txt} and the
+@samp{nofollow} aspect of the spec.  @xref{Robot Exclusion}, for more
+details about this.  Be sure you know what you are doing before turning
+this off.
 
 @item server_response = on/off
 Choose whether or not to print the @sc{http} and @sc{ftp} server
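For anyone who wants to try the setting that the rewritten "robots = on/off" entry describes, a minimal .wgetrc fragment would be the following (illustrative only, not part of this commit):

    # ~/.wgetrc -- disable robot exclusion for all of this user's downloads
    robots = off

Because the variable now covers both aspects of the spec, this single line makes Wget ignore /robots.txt as well as the HTML-level "nofollow" convention (a page declaring <meta name="robots" content="nofollow">) during recursive downloads.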
@@ -2744,14 +2747,14 @@ Other than that, Wget will not try to interfere with signals in any way.
 This chapter contains some references I consider useful.
 
 @menu
-* Robots::                  Wget as a WWW robot.
+* Robot Exclusion::         Wget's support for RES.
 * Security Considerations:: Security with Wget.
 * Contributors::            People who helped.
 @end menu
 
-@node Robots, Security Considerations, Appendices, Appendices
-@section Robots
-@cindex robots
+@node Robot Exclusion, Security Considerations, Appendices, Appendices
+@section Robot Exclusion
+@cindex robot exclusion
 @cindex robots.txt
 @cindex server maintenance
 
@@ -2759,26 +2762,35 @@ It is extremely easy to make Wget wander aimlessly around a web site,
 sucking all the available data in progress.  @samp{wget -r @var{site}},
 and you're set.  Great?  Not for the server admin.
 
-While Wget is retrieving static pages, there's not much of a problem.
-But for Wget, there is no real difference between a static page and the
-most demanding CGI.  For instance, a site I know has a section handled
-by an, uh, @dfn{bitchin'} CGI script that converts all the Info files to
-HTML.  The script can and does bring the machine to its knees without
-providing anything useful to the downloader.
+As long as Wget is only retrieving static pages, and doing it at a
+reasonable rate (see the @samp{--wait} option), there's not much of a
+problem.  The trouble is that Wget can't tell the difference between the
+smallest static page and the most demanding CGI.  A site I know has a
+section handled by an, uh, @dfn{bitchin'} CGI Perl script that converts
+Info files to HTML on the fly.  The script is slow, but works well
+enough for human users viewing an occasional Info file.  However, when
+someone's recursive Wget download stumbles upon the index page that
+links to all the Info files through the script, the system is brought to
+its knees without providing anything useful to the downloader.
 
-For such and similar cases various robot exclusion schemes have been
-devised as a means for the server administrators and document authors to
-protect chosen portions of their sites from the wandering of robots.
+To avoid this kind of accident, as well as to preserve privacy for
+documents that need to be protected from well-behaved robots, the
+concept of @dfn{robot exclusion} has been invented.  The idea is that
+the server administrators and document authors can specify which
+portions of the site they wish to protect from the robots.
 
-The more popular mechanism is the @dfn{Robots Exclusion Standard}, or
-@sc{res}, written by Martijn Koster et al. in 1994.  It specifies the
-format of a text file containing directives that instruct the robots
-which URL paths to avoid.  To be found by the robots, the specifications
-must be placed in @file{/robots.txt} in the server root, which the
-robots are supposed to download and parse.
+The most popular mechanism, and the de facto standard supported by all
+the major robots, is the ``Robots Exclusion Standard'' (RES) written by
+Martijn Koster et al. in 1994.  It specifies the format of a text file
+containing directives that instruct the robots which URL paths to avoid.
+To be found by the robots, the specifications must be placed in
+@file{/robots.txt} in the server root, which the robots are supposed to
+download and parse.
 
-Wget supports @sc{res} when downloading recursively.  So, when you
-issue:
+Although Wget is not a web robot in the strictest sense of the word, it
+can download large parts of the site without the user's intervention to
+download an individual page.  Because of that, Wget honors RES when
+downloading recursively.  For instance, when you issue:
 
 @example
 wget -r http://www.server.com/
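For context on the rewritten RES paragraphs, a server-side /robots.txt in the format the text describes might look like this; the paths are invented for illustration:

    # /robots.txt -- served from the server root
    # Ask all robots to stay out of the CGI area and the Info-to-HTML script
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /info2html/

A compliant robot, including Wget in recursive mode, fetches this file first and then skips any URL whose path begins with one of the Disallow prefixes.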
@@ -2815,7 +2827,12 @@ This is explained in some detail at
 method of robot exclusion in addition to the usual @file{/robots.txt}
 exclusion.
 
-@node Security Considerations, Contributors, Robots, Appendices
+If you know what you are doing and really really wish to turn off the
+robot exclusion, set the @code{robots} variable to @samp{off} in your
+@file{.wgetrc}.  You can achieve the same effect from the command line
+using the @code{-e} switch, e.g. @samp{wget -e robots=off @var{url}...}.
+
+@node Security Considerations, Contributors, Robot Exclusion, Appendices
 @section Security Considerations
 @cindex security
 
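Spelled out as a concrete command, the escape hatch added by this hunk looks like the following; the URL is the same placeholder server used in the diff, and --wait=1 is added only in the spirit of the ``reasonable rate'' advice above:

    # Recursive download with robot exclusion switched off for this run only
    wget -e robots=off -r --wait=1 http://www.server.com/

The -e switch executes a wgetrc-style command for this invocation only, so robots=off here has the same effect as the .wgetrc line shown earlier without permanently changing the user's configuration.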