
[svn] Doc update.

Published in <sxsy9kny8e1.fsf@florida.arsdigita.de>.
hniksic 2001-11-30 18:36:21 -08:00
parent 406fb8bbef
commit a244a67bc3
2 changed files with 156 additions and 172 deletions

ChangeLog

@@ -1,3 +1,8 @@
2001-12-01 Hrvoje Niksic <hniksic@arsdigita.com>
* wget.texi: Update the manual with the new recursive retrieval
stuff.
2001-11-30 Ingo T. Storm <tux-sparc@computerbild.de>
* sample.wgetrc: Document ftp_proxy, too.

wget.texi

@@ -1203,7 +1203,7 @@ websites), and make sure the lot displays properly locally, this author
likes to use a few options in addition to @samp{-p}:
@example
wget -E -H -k -K -nh -p http://@var{site}/@var{document}
wget -E -H -k -K -p http://@var{site}/@var{document}
@end example
In one case you'll need to add a couple more options. If @var{document}
@@ -1234,14 +1234,12 @@ accept or reject (@pxref{Types of Files} for more details).
@item -D @var{domain-list}
@itemx --domains=@var{domain-list}
Set domains to be accepted and @sc{dns} looked-up, where
@var{domain-list} is a comma-separated list. Note that it does
@emph{not} turn on @samp{-H}. This option speeds things up, even if
only one host is spanned (@pxref{Domain Acceptance}).
Set domains to be followed. @var{domain-list} is a comma-separated list
of domains. Note that it does @emph{not} turn on @samp{-H}.
@item --exclude-domains @var{domain-list}
Exclude the domains given in a comma-separated @var{domain-list} from
@sc{dns}-lookup (@pxref{Domain Acceptance}).
Specify the domains that are @emph{not} to be followed.
(@pxref{Spanning Hosts}).
@cindex follow FTP links
@item --follow-ftp
@@ -1266,7 +1264,7 @@ In the past, the @samp{-G} option was the best bet for downloading a
single page and its requisites, using a commandline like:
@example
wget -Ga,area -H -k -K -nh -r http://@var{site}/@var{document}
wget -Ga,area -H -k -K -r http://@var{site}/@var{document}
@end example
However, the author of this option came across a page with tags like
@@ -1278,8 +1276,8 @@ dedicated @samp{--page-requisites} option.
@item -H
@itemx --span-hosts
Enable spanning across hosts when doing recursive retrieving (@pxref{All
Hosts}).
Enable spanning across hosts when doing recursive retrieving
(@pxref{Spanning Hosts}).
@item -L
@itemx --relative
@@ -1299,11 +1297,6 @@ Specify a comma-separated list of directories you wish to exclude from
download (@pxref{Directory-Based Limits} for more details.) Elements of
@var{list} may contain wildcards.
@item -nh
@itemx --no-host-lookup
Disable the time-consuming @sc{dns} lookup of almost all hosts
(@pxref{Host Checking}).
@item -np
@item --no-parent
Do not ever ascend to the parent directory when retrieving recursively.
@@ -1321,9 +1314,8 @@ This is a useful option, since it guarantees that only the files
@cindex recursive retrieval
GNU Wget is capable of traversing parts of the Web (or a single
@sc{http} or @sc{ftp} server), depth-first following links and directory
structure. This is called @dfn{recursive} retrieving, or
@dfn{recursion}.
@sc{http} or @sc{ftp} server), following links and directory structure.
We refer to this as to @dfn{recursive retrieving}, or @dfn{recursion}.
With @sc{http} @sc{url}s, Wget retrieves and parses the @sc{html} from
the given @sc{url}, retrieving the files the @sc{html}
@@ -1331,15 +1323,22 @@ document was referring to, through markups like @code{href}, or
@code{src}. If the freshly downloaded file is also of type
@code{text/html}, it will be parsed and followed further.
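For instance, markup like the following (the file names are purely
illustrative) is the kind of thing the link extractor looks for:
@example
<a href="page.html">
<img src="logo.gif">
@end example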
Recursive retrieval of @sc{http} and @sc{html} content is
@dfn{breadth-first}. This means that Wget first downloads the requested
HTML document, then the documents linked from that document, then the
documents linked by them, and so on. In other words, Wget first
downloads the documents at depth 1, then those at depth 2, and so on
until the specified maximum depth.
The maximum @dfn{depth} to which the retrieval may descend is specified
with the @samp{-l} option (the default maximum depth is five layers).
@xref{Recursive Retrieval}.
with the @samp{-l} option. The default maximum depth is five layers.
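For example, to allow the retrieval to descend at most three levels from
the starting @sc{url} (the @sc{url} itself is illustrative), one might
use:
@example
wget -r -l 3 http://www.server.com/
@end example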
When retrieving an @sc{ftp} @sc{url} recursively, Wget will retrieve all
the data from the given directory tree (including the subdirectories up
to the specified depth) on the remote server, creating its mirror image
locally. @sc{ftp} retrieval is also limited by the @code{depth}
parameter.
parameter. Unlike @sc{http} recursion, @sc{ftp} recursion is performed
depth-first.
By default, Wget will create a local directory tree, corresponding to
the one found on the remote server.
@@ -1349,23 +1348,30 @@ important of which is mirroring. It is also useful for @sc{www}
presentations, and any other opportunities where slow network
connections should be bypassed by storing the files locally.
You should be warned that invoking recursion may cause grave overloading
on your system, because of the fast exchange of data through the
network; all of this may hamper other users' work. The same goes for
the foreign server you are mirroring---the more requests it gets in a
row, the greater its load.
You should be warned that recursive downloads can overload the remote
servers. Because of that, many administrators frown upon them and may
ban access from your site if they detect very fast downloads of large
amounts of content. When downloading from Internet servers, consider
using the @samp{-w} option to introduce a delay between accesses to the
server. The download will take a while longer, but the server
administrator will not be alarmed by your rudeness.
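For example, a sketch of such a polite recursive download, pausing two
seconds between requests (the @sc{url} is again illustrative):
@example
wget -r -w 2 http://www.server.com/
@end example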
Careless retrieving can also fill your file system uncontrollably, which
can grind the machine to a halt.
Of course, recursive download may cause problems on your machine. If
left to run unchecked, it can easily fill up the disk. If downloading
from the local network, it can also take up bandwidth on the system, as
well as
consume memory and CPU.
The load can be minimized by lowering the maximum recursion level
(@samp{-l}) and/or by lowering the number of retries (@samp{-t}). You
may also consider using the @samp{-w} option to slow down your requests
to the remote servers, as well as the numerous options to narrow the
number of followed links (@pxref{Following Links}).
Try to specify the criteria that match the kind of download you are
trying to achieve. If you want to download only one page, use
@samp{--page-requisites} without any additional recursion. If you want
to download things under one directory, use @samp{-np} to avoid
downloading things from other directories. If you want to download all
the files from one directory, use @samp{-l 1} to make sure the recursion
depth never exceeds one. @xref{Following Links}, for more information
about this.
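To sketch those three cases with made-up @sc{url}s:
@example
# Just one page, plus the images and such needed to display it:
wget --page-requisites http://www.server.com/dir/page.html

# Everything under /dir/, but nothing from the directories above it:
wget -r -np http://www.server.com/dir/

# All the files from one directory, recursion depth limited to one:
wget -r -l 1 http://www.server.com/dir/
@end example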
Recursive retrieval is a good thing when used properly. Please take all
precautions not to wreak havoc through carelessness.
Recursive retrieval should be used with care. Don't say you were not
warned.
@node Following Links, Time-Stamping, Recursive Retrieval, Top
@chapter Following Links
@@ -1384,98 +1390,55 @@ Wget possesses several mechanisms that allows you to fine-tune
links it will follow.
@menu
* Relative Links:: Follow relative links only.
* Host Checking:: Follow links on the same host.
* Domain Acceptance:: Check on a list of domains.
* All Hosts:: No host restrictions.
* Spanning Hosts:: (Un)limiting retrieval based on host name.
* Types of Files:: Getting only certain files.
* Directory-Based Limits:: Getting only certain directories.
* Relative Links:: Follow relative links only.
* FTP Links:: Following FTP links.
@end menu
@node Relative Links, Host Checking, Following Links, Following Links
@section Relative Links
@cindex relative links
@node Spanning Hosts, Types of Files, Following Links, Following Links
@section Spanning Hosts
@cindex spanning hosts
@cindex hosts, spanning
When only relative links are followed (option @samp{-L}), recursive
retrieving will never span hosts. No time-expensive @sc{dns}-lookups
will be performed, and the process will be very fast, with minimum
strain on the network. This will often suit your needs, especially when
mirroring the output of various @code{x2html} converters, since they
generally output relative links.
Wget's recursive retrieval normally refuses to visit hosts different
from the one you specified on the command line. This is a reasonable
default; without it, every retrieval would have the potential to turn
your Wget into a small version of Google.
@node Host Checking, Domain Acceptance, Relative Links, Following Links
@section Host Checking
@cindex DNS lookup
@cindex host lookup
@cindex host checking
However, visiting different hosts, or @dfn{host spanning,} is sometimes
a useful option. Maybe the images are served from a different server.
Maybe you're mirroring a site that consists of pages interlinked between
three servers. Maybe the server has two equivalent names, and the HTML
pages refer to both interchangeably.
The drawback of following only relative links is that humans often
tend to mix them with absolute links to the very same host, or even the
very same page. In this mode (which is the default mode for following links)
all @sc{url}s that refer to the same host will be retrieved.
@table @asis
@item Span to any host---@samp{-H}
The problem with this option is the aliases of hosts and domains.
Thus there is no way for Wget to know that @samp{regoc.srce.hr} and
@samp{www.srce.hr} are the same host, or that @samp{fly.srk.fer.hr} is
the same as @samp{fly.cc.fer.hr}. Whenever an absolute link is
encountered, the host is @sc{dns}-looked-up with @code{gethostbyname} to
check whether we are maybe dealing with the same hosts. Although the
results of @code{gethostbyname} are cached, it is still a great
slowdown, e.g. when dealing with large indices of home pages on different
hosts (because each of the hosts must be @sc{dns}-resolved to see
whether it just @emph{might} be an alias of the starting host).
The @samp{-H} option turns on host spanning, thus allowing Wget's
recursive run to visit any host referenced by a link. Unless sufficient
recursion-limiting criteria are applied, these foreign hosts will
typically link to yet more hosts, and so on until Wget ends up sucking
up much more data than you have intended.
To avoid the overhead you may use @samp{-nh}, which will turn off
@sc{dns}-resolving and make Wget compare hosts literally. This will
make things run much faster, but also much less reliable
(e.g. @samp{www.srce.hr} and @samp{regoc.srce.hr} will be flagged as
different hosts).
@item Limit spanning to certain domains---@samp{-D}
Note that modern @sc{http} servers allow one IP address to host several
@dfn{virtual servers}, each having its own directory hierarchy. Such
``servers'' are distinguished by their hostnames (all of which point to
the same IP address); for this to work, a client must send a @code{Host}
header, which is what Wget does. However, in that case Wget @emph{must
not} try to divine a host's ``real'' address, nor try to use the same
hostname for each access, i.e. @samp{-nh} must be turned on.
In other words, the @samp{-nh} option must be used to enable the
retrieval from virtual servers distinguished by their hostnames. As the
number of such server setups grows, the behavior of @samp{-nh} may become
the default in the future.
@node Domain Acceptance, All Hosts, Host Checking, Following Links
@section Domain Acceptance
With the @samp{-D} option you may specify the domains that will be
followed. The hosts the domain of which is not in this list will not be
@sc{dns}-resolved. Thus you can specify @samp{-Dmit.edu} just to make
sure that @strong{nothing outside of @sc{mit} gets looked up}. This is
very important and useful. It also means that @samp{-D} does @emph{not}
imply @samp{-H} (span all hosts), which must be specified explicitly.
Feel free to use this option since it will speed things up, with almost
all the reliability of checking for all hosts. Thus you could invoke
The @samp{-D} option allows you to specify the domains that will be
followed, thus limiting the recursion only to the hosts that belong to
these domains. Obviously, this makes sense only in conjunction with
@samp{-H}. A typical example would be downloading the contents of
@samp{www.server.com}, but allowing downloads from
@samp{images.server.com}, etc.:
@example
wget -r -D.hr http://fly.srk.fer.hr/
wget -rH -Dserver.com http://www.server.com/
@end example
to make sure that only the hosts in @samp{.hr} domain get
@sc{dns}-looked-up for being equal to @samp{fly.srk.fer.hr}. So
@samp{fly.cc.fer.hr} will be checked (only once!) and found equal, but
@samp{www.gnu.ai.mit.edu} will not even be checked.
You can specify more than one domain by separating them with a comma,
e.g. @samp{-Ddomain1.com,domain2.com}.
Of course, domain acceptance can be used to limit the retrieval to
particular domains with spanning of hosts in them, but then you must
specify @samp{-H} explicitly. E.g.:
@example
wget -r -H -Dmit.edu,stanford.edu http://www.mit.edu/
@end example
will start with @samp{http://www.mit.edu/}, following links across
@sc{mit} and Stanford.
@item Keep download off certain domains---@samp{--exclude-domains}
If there are domains you want to exclude specifically, you can do it
with @samp{--exclude-domains}, which accepts the same type of arguments
@@ -1485,21 +1448,13 @@ domain, with the exception of @samp{sunsite.foo.edu}, you can do it like
this:
@example
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu http://www.foo.edu/
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu \
http://www.foo.edu/
@end example
@node All Hosts, Types of Files, Domain Acceptance, Following Links
@section All Hosts
@cindex all hosts
@cindex span hosts
@end table
When @samp{-H} is specified without @samp{-D}, all hosts are freely
spanned. There are no restrictions whatsoever as to what part of the
net Wget will go to fetch documents, other than maximum retrieval depth.
If a page references @samp{www.yahoo.com}, so be it. Such an option is
rarely useful for itself.
@node Types of Files, Directory-Based Limits, All Hosts, Following Links
@node Types of Files, Directory-Based Limits, Spanning Hosts, Following Links
@section Types of Files
@cindex types of files
@@ -1563,7 +1518,7 @@ Note that these two options do not affect the downloading of @sc{html}
files; Wget must load all the @sc{html}s to know where to go at
all---recursive retrieval would make no sense otherwise.
@node Directory-Based Limits, FTP Links, Types of Files, Following Links
@node Directory-Based Limits, Relative Links, Types of Files, Following Links
@section Directory-Based Limits
@cindex directories
@cindex directory limits
@@ -1639,7 +1594,36 @@ Essentially, @samp{--no-parent} is similar to
intelligent fashion.
@end table
@node FTP Links, , Directory-Based Limits, Following Links
@node Relative Links, FTP Links, Directory-Based Limits, Following Links
@section Relative Links
@cindex relative links
When @samp{-L} is turned on, only the relative links are ever followed.
Relative links are here defined as those that do not refer to the web
server root. For example, these links are relative:
@example
<a href="foo.gif">
<a href="foo/bar.gif">
<a href="../foo/bar.gif">
@end example
These links are not relative:
@example
<a href="/foo.gif">
<a href="/foo/bar.gif">
<a href="http://www.server.com/foo/bar.gif">
@end example
Using this option guarantees that recursive retrieval will not span
hosts, even without @samp{-H}. In simple cases it also allows downloads
to ``just work'' without having to convert links.
This option is probably not very useful and might be removed in a future
release.
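For what it is worth, a minimal sketch of invoking it (the @sc{url} is
illustrative):
@example
wget -r -L http://www.server.com/dir/index.html
@end example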
@node FTP Links, , Relative Links, Following Links
@section Following FTP Links
@cindex following ftp links
@@ -1985,7 +1969,7 @@ Turning dirstruct on or off---the same as @samp{-x} or @samp{-nd},
respectively.
@item domains = @var{string}
Same as @samp{-D} (@pxref{Domain Acceptance}).
Same as @samp{-D} (@pxref{Spanning Hosts}).
@item dot_bytes = @var{n}
Specify the number of bytes ``contained'' in a dot, as seen throughout
@@ -2007,7 +1991,7 @@ Specify a comma-separated list of directories you wish to exclude from
download---the same as @samp{-X} (@pxref{Directory-Based Limits}).
@item exclude_domains = @var{string}
Same as @samp{--exclude-domains} (@pxref{Domain Acceptance}).
Same as @samp{--exclude-domains} (@pxref{Spanning Hosts}).
@item follow_ftp = on/off
Follow @sc{ftp} links from @sc{html} documents---the same as
@@ -2161,7 +2145,7 @@ Choose whether or not to print the @sc{http} and @sc{ftp} server
responses---the same as @samp{-S}.
@item simple_host_check = on/off
Same as @samp{-nh} (@pxref{Host Checking}).
Same as @samp{-nh} (@pxref{Spanning Hosts}).
@item span_hosts = on/off
Same as @samp{-H}.
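For instance, a @file{.wgetrc} fragment combining @code{span_hosts} with
the @code{domains} and @code{exclude_domains} commands shown earlier
might look like this (the domain names are made up):
@example
span_hosts = on
domains = server.com
exclude_domains = sunsite.server.com
@end example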
@@ -2441,19 +2425,6 @@ want to download all those images---you're only interested in @sc{html}.
wget --mirror -A.html http://www.w3.org/
@end example
@item
But what about mirroring the hosts networkologically close to you? It
seems so awfully slow because of all that @sc{dns} resolving. Just use
@samp{-D} (@pxref{Domain Acceptance}).
@example
wget -rN -Dsrce.hr http://www.srce.hr/
@end example
Now Wget will correctly find out that @samp{regoc.srce.hr} is the same
as @samp{www.srce.hr}, but will not even take into consideration the
link to @samp{www.mit.edu}.
@item
You have a presentation and would like the dumb absolute links to be
converted to relative? Use @samp{-k}:
@@ -2716,47 +2687,46 @@ sucking all the available data in progress. @samp{wget -r @var{site}},
and you're set. Great? Not for the server admin.
While Wget is retrieving static pages, there's not much of a problem.
But for Wget, there is no real difference between the smallest static
page and the hardest, most demanding CGI or dynamic page. For instance,
a site I know has a section handled by an, uh, bitchin' CGI script that
converts all the Info files to HTML. The script can and does bring the
machine to its knees without providing anything useful to the
downloader.
But for Wget, there is no real difference between a static page and the
most demanding CGI. For instance, a site I know has a section handled
by an, uh, @dfn{bitchin'} CGI script that converts all the Info files to
HTML. The script can and does bring the machine to its knees without
providing anything useful to the downloader.
For such and similar cases various robot exclusion schemes have been
devised as a means for the server administrators and document authors to
protect chosen portions of their sites from the wandering of robots.
The more popular mechanism is the @dfn{Robots Exclusion Standard}
written by Martijn Koster et al. in 1994. It is specified by placing a
file named @file{/robots.txt} in the server root, which the robots are
supposed to download and parse. Wget supports this specification.
The more popular mechanism is the @dfn{Robots Exclusion Standard}, or
@sc{res}, written by Martijn Koster et al. in 1994. It specifies the
format of a text file containing directives that instruct the robots
which URL paths to avoid. To be found by the robots, the specifications
must be placed in @file{/robots.txt} in the server root, which the
robots are supposed to download and parse.
Norobots support is turned on only when retrieving recursively, and
@emph{never} for the first page. Thus, you may issue:
Wget supports @sc{res} when downloading recursively. So, when you
issue:
@example
wget -r http://fly.srk.fer.hr/
wget -r http://www.server.com/
@end example
First the index of fly.srk.fer.hr will be downloaded. If Wget finds
anything worth downloading on the same host, only @emph{then} will it
load the robots, and decide whether or not to load the links after all.
@file{/robots.txt} is loaded only once per host.
First the index of @samp{www.server.com} will be downloaded. If Wget
finds that it wants to download more documents from that server, it will
request @samp{http://www.server.com/robots.txt} and, if found, use it
for further downloads. @file{robots.txt} is loaded only once per
server.
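A minimal @file{robots.txt}, sketched here with made-up paths, might
look like this:
@example
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
@end example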
Note that the exclusion standard discussed here has undergone some
revisions. However, Wget supports only the first version of
@sc{res}, the one written by Martijn Koster in 1994, available at
@url{http://info.webcrawler.com/mak/projects/robots/norobots.html}. A
later version exists in the form of an internet draft
<draft-koster-robots-00.txt> titled ``A Method for Web Robots Control'',
which expired on June 4, 1997. I am not aware whether it ever made it to an
@sc{rfc}. The text of the draft is available at
Until version 1.8, Wget supported the first version of the standard,
written by Martijn Koster in 1994 and available at
@url{http://info.webcrawler.com/mak/projects/robots/norobots.html}. As
of version 1.8, Wget has supported the additional directives specified
in the internet draft @samp{<draft-koster-robots-00.txt>} titled ``A
Method for Web Robots Control''. The draft, which as far as I know has
never made it to an @sc{rfc}, is available at
@url{http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html}.
Wget does not yet support the new directives specified by this draft,
but we plan to add them.
This manual no longer includes the text of the old standard.
This manual no longer includes the text of the Robot Exclusion Standard.
The second, less known mechanism enables the author of an individual
document to specify whether they want the links from the file to be
@@ -2875,20 +2845,24 @@ Junio Hamano---donated support for Opie and @sc{http} @code{Digest}
authentication.
@item
Brian Gough---a generous donation.
The people who provided donations for development, including Brian
Gough.
@end itemize
The following people have provided patches, bug/build reports, useful
suggestions, beta testing services, fan mail and all the other things
that make maintenance so much fun:
Ian Abbott,
Tim Adam,
Adrian Aichner,
Martin Baehr,
Dieter Baron,
Roger Beeman and the Gurus at Cisco,
Roger Beeman,
Dan Berger,
T. Bharath,
Paul Bludov,
Daniel Bodea,
Mark Boyns,
John Burden,
Wanderlei Cavassin,
@@ -2912,6 +2886,7 @@ Damir D@v{z}eko,
@ifinfo
Damir Dzeko,
@end ifinfo
Alan Eldridge,
@iftex
Aleksandar Erkalovi@'{c},
@end iftex
@@ -2923,10 +2898,12 @@ Christian Fraenkel,
Masashi Fujita,
Howard Gayle,
Marcel Gerrits,
Lemble Gregory,
Hans Grobler,
Mathieu Guillaume,
Dan Harkless,
Heiko Herold,
Herold Heiko,
Jochen Hein,
Karl Heuer,
HIROSE Masaaki,
Gregor Hoffleit,
@@ -3011,6 +2988,7 @@ Edward J. Sabol,
Heinz Salzmann,
Robert Schmidt,
Andreas Schwab,
Chris Seawood,
Toomas Soome,
Tage Stabell-Kulo,
Sven Sternberger,
@@ -3019,6 +2997,7 @@ John Summerfield,
Szakacsits Szabolcs,
Mike Thomas,
Philipp Thomas,
Dave Turner,
Russell Vincent,
Charles G Waldman,
Douglas E. Wegscheid,