
[svn] Examples section of the documentation revamped.

Include EXAMPLES in the man page.
hniksic 2001-12-07 22:47:48 -08:00
parent 171feaa3f2
commit 5379abeee0
2 changed files with 141 additions and 153 deletions


@@ -1,3 +1,11 @@
+2001-12-08  Hrvoje Niksic  <hniksic@arsdigita.com>
+
+	* texi2pod.pl: Include the EXAMPLES section.
+
+	* wget.texi (Overview): Shorten the man page DESCRIPTION.
+	(Examples): Redo the Examples chapter.  Include it in the man
+	page.
+
 2001-12-01  Hrvoje Niksic  <hniksic@arsdigita.com>
 
 	* wget.texi: Update the manual with the new recursive retrieval


@@ -112,14 +112,16 @@ Foundation, Inc.
 @cindex features
 
 @c man begin DESCRIPTION
-GNU Wget is a freely available network utility to retrieve files from
-the World Wide Web, using @sc{http} (Hyper Text Transfer Protocol) and
-@sc{ftp} (File Transfer Protocol), the two most widely used Internet
-protocols.  It has many useful features to make downloading easier, some
-of them being:
+GNU Wget is a free utility for non-interactive download of files from
+the Web.  It supports @sc{http}, @sc{https}, and @sc{ftp} protocols, as
+well as retrieval through @sc{http} proxies.
+@c man end
+
+This chapter is a partial overview of Wget's features.
 
 @itemize @bullet
 @item
+@c man begin DESCRIPTION
 Wget is non-interactive, meaning that it can work in the background,
 while the user is not logged on.  This allows you to start a retrieval
 and disconnect from the system, letting Wget finish the work.  By
@@ -128,18 +130,23 @@ which can be a great hindrance when transferring a lot of data.
 @c man end
 
 @sp 1
-@c man begin DESCRIPTION
 @item
-Wget is capable of descending recursively through the structure of
-@sc{html} documents and @sc{ftp} directory trees, making a local copy of
-the directory hierarchy similar to the one on the remote server.  This
-feature can be used to mirror archives and home pages, or traverse the
-web in search of data, like a @sc{www} robot (@pxref{Robots}).  In that
-spirit, Wget understands the @code{norobots} convention.
+@ignore
+@c man begin DESCRIPTION
+
+@c man end
+@end ignore
+@c man begin DESCRIPTION
+Wget can follow links in @sc{html} pages and create local versions of
+remote web sites, fully recreating the directory structure of the
+original site.  This is sometimes referred to as ``recursive
+downloading.''  While doing that, Wget respects the Robot Exclusion
+Standard (@file{/robots.txt}).  Wget can be instructed to convert the
+links in downloaded @sc{html} files to the local files for offline
+viewing.
 @c man end
 
 @sp 1
-@c man begin DESCRIPTION
 @item
 File name wildcard matching and recursive mirroring of directories are
 available when retrieving via @sc{ftp}.  Wget can read the time-stamp
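To illustrate the recursive download and link conversion described in the item above, an invocation along these lines would do it (a sketch only: @samp{www.example.com} is a placeholder; @samp{-r} turns on recursive retrieval and @samp{-k}, i.e. @samp{--convert-links}, rewrites the links for offline viewing):

@example
wget -r -k http://www.example.com/
@end example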
@@ -148,52 +155,47 @@ locally.  Thus Wget can see if the remote file has changed since last
 retrieval, and automatically retrieve the new version if it has.  This
 makes Wget suitable for mirroring of @sc{ftp} sites, as well as home
 pages.
-@c man end
 
 @sp 1
-@c man begin DESCRIPTION
 @item
-Wget works exceedingly well on slow or unstable connections,
-retrying the document until it is fully retrieved, or until a
-user-specified retry count is surpassed.  It will try to resume the
-download from the point of interruption, using @code{REST} with @sc{ftp}
-and @code{Range} with @sc{http} servers that support them.
+@ignore
+@c man begin DESCRIPTION
+
+@c man end
+@end ignore
+@c man begin DESCRIPTION
+Wget has been designed for robustness over slow or unstable network
+connections; if a download fails due to a network problem, it will
+keep retrying until the whole file has been retrieved.  If the server
+supports regetting, it will instruct the server to continue the
+download from where it left off.
 @c man end
 
 @sp 1
-@c man begin DESCRIPTION
 @item
-By default, Wget supports proxy servers, which can lighten the network
-load, speed up retrieval and provide access behind firewalls.  However,
-if you are behind a firewall that requires that you use a socks style
-gateway, you can get the socks library and build Wget with support for
-socks.  Wget also supports the passive @sc{ftp} downloading as an
-option.
-@c man end
+Wget supports proxy servers, which can lighten the network load, speed
+up retrieval and provide access behind firewalls.  However, if you are
+behind a firewall that requires that you use a socks style gateway, you
+can get the socks library and build Wget with support for socks.  Wget
+also supports the passive @sc{ftp} downloading as an option.
 
 @sp 1
-@c man begin DESCRIPTION
 @item
 Builtin features offer mechanisms to tune which links you wish to follow
 (@pxref{Following Links}).
-@c man end
 
 @sp 1
-@c man begin DESCRIPTION
 @item
 The retrieval is conveniently traced with printing dots, each dot
 representing a fixed amount of data received (1KB by default).  These
 representations can be customized to your preferences.
-@c man end
 
 @sp 1
-@c man begin DESCRIPTION
 @item
 Most of the features are fully configurable, either through command line
 options, or via the initialization file @file{.wgetrc} (@pxref{Startup
 File}).  Wget allows you to define @dfn{global} startup files
 (@file{/usr/local/etc/wgetrc} by default) for site settings.
-@c man end
 
 @ignore
 @c man begin FILES
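For the robustness item above, a sketch of the kind of invocation it implies (the URL is a placeholder; @samp{-c} tells Wget to continue a partially downloaded file, and @samp{-t 0} removes the retry limit):

@example
wget -c -t 0 http://www.example.com/big-archive.tar.gz
@end example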
@@ -208,14 +210,12 @@ User startup file.
 @end ignore
 
 @sp 1
-@c man begin DESCRIPTION
 @item
 Finally, GNU Wget is free software.  This means that everyone may use
 it, redistribute it and/or modify it under the terms of the GNU General
 Public License, as published by the Free Software Foundation
 (@pxref{Copying}).
 @end itemize
-@c man end
 
 @node Invoking, Recursive Retrieval, Overview, Top
 @chapter Invoking
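The startup-file item a little earlier mentions @file{.wgetrc}; as a rough sketch of what such a file can hold, using the @code{tries} and @code{wait} wgetrc commands with arbitrary example values:

@example
# Sample ~/.wgetrc
tries = 5
wait = 10
@end example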
@@ -1206,17 +1206,6 @@ likes to use a few options in addition to @samp{-p}:
 wget -E -H -k -K -p http://@var{site}/@var{document}
 @end example
 
-In one case you'll need to add a couple more options.  If @var{document}
-is a @code{<FRAMESET>} page, the "one more hop" that @samp{-p} gives you
-won't be enough---you'll get the @code{<FRAME>} pages that are
-referenced, but you won't get @emph{their} requisites.  Therefore, in
-this case you'll need to add @samp{-r -l1} to the commandline.  The
-@samp{-r -l1} will recurse from the @code{<FRAMESET>} page to the
-@code{<FRAME>} pages, and the @samp{-p} will get their requisites.  If
-you're already using a recursion level of 1 or more, you'll need to up
-it by one.  In the future, @samp{-p} may be made smarter so that it'll
-do "two more hops" in the case of a @code{<FRAMESET>} page.
-
 To finish off this topic, it's worth knowing that Wget's idea of an
 external document link is any URL specified in an @code{<A>} tag, an
 @code{<AREA>} tag, or a @code{<LINK>} tag other than @code{<LINK
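The paragraph removed above describes combining @samp{-p} with @samp{-r -l1} for @code{<FRAMESET>} pages; a sketch of such a command, with a placeholder URL, would be:

@example
wget -r -l1 -p http://www.example.com/frameset.html
@end example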
@@ -2199,16 +2188,14 @@ its line.
 @chapter Examples
 @cindex examples
 
-The examples are classified into three sections, because of clarity.
-The first section is a tutorial for beginners.  The second section
-explains some of the more complex program features.  The third section
-contains advice for mirror administrators, as well as even more complex
-features (that some would call perverted).
+@c man begin EXAMPLES
+The examples are divided into three sections loosely based on their
+complexity.
 
 @menu
 * Simple Usage::         Simple, basic usage of the program.
-* Advanced Usage::       Advanced techniques of usage.
-* Guru Usage::           Mirroring and the hairy stuff.
+* Advanced Usage::       Advanced tips.
+* Very Advanced Usage::  The hairy stuff.
 @end menu
 
 @node Simple Usage, Advanced Usage, Examples, Examples
@@ -2222,22 +2209,6 @@ Say you want to download a @sc{url}.  Just type:
 wget http://fly.srk.fer.hr/
 @end example
 
-The response will be something like:
-
-@example
-@group
---13:30:45--  http://fly.srk.fer.hr:80/en/
-           => `index.html'
-Connecting to fly.srk.fer.hr:80... connected!
-HTTP request sent, awaiting response... 200 OK
-Length: 4,694 [text/html]
-
-    0K -> ....                                            [100%]
-
-13:30:46 (23.75 KB/s) - `index.html' saved [4694/4694]
-@end group
-@end example
-
 @item
 But what will happen if the connection is slow, and the file is lengthy?
 The connection will probably fail before the whole file is retrieved,
@@ -2267,20 +2238,7 @@ The usage of @sc{ftp} is as simple.  Wget will take care of login and
 password.
 
 @example
-@group
-$ wget ftp://gnjilux.srk.fer.hr/welcome.msg
---10:08:47--  ftp://gnjilux.srk.fer.hr:21/welcome.msg
-           => `welcome.msg'
-Connecting to gnjilux.srk.fer.hr:21... connected!
-Logging in as anonymous ... Logged in!
-==> TYPE I ... done.  ==> CWD not needed.
-==> PORT ... done.    ==> RETR welcome.msg ... done.
-Length: 1,340 (unauthoritative)
-
-    0K -> .                                               [100%]
-
-10:08:48 (1.28 MB/s) - `welcome.msg' saved [1340]
-@end group
+wget ftp://gnjilux.srk.fer.hr/welcome.msg
 @end example
 
 @item
@@ -2289,39 +2247,65 @@ parse it and convert it to @sc{html}.  Try:
 @example
 wget ftp://prep.ai.mit.edu/pub/gnu/
-lynx index.html
+links index.html
 @end example
 @end itemize
 
-@node Advanced Usage, Guru Usage, Simple Usage, Examples
+@node Advanced Usage, Very Advanced Usage, Simple Usage, Examples
 @section Advanced Usage
 
 @itemize @bullet
 @item
-You would like to read the list of @sc{url}s from a file?  Not a problem
-with that:
+You have a file that contains the URLs you want to download?  Use the
+@samp{-i} switch:
 
 @example
-wget -i file
+wget -i @var{file}
 @end example
 
 If you specify @samp{-} as file name, the @sc{url}s will be read from
 standard input.
 
 @item
-Create a mirror image of GNU @sc{www} site (with the same directory structure
-the original has) with only one try per document, saving the log of the
-activities to @file{gnulog}:
+Create a five levels deep mirror image of the GNU web site, with the
+same directory structure the original has, with only one try per
+document, saving the log of the activities to @file{gnulog}:
 
 @example
-wget -r -t1 http://www.gnu.ai.mit.edu/ -o gnulog
+wget -r http://www.gnu.org/ -o gnulog
 @end example
 
 @item
-Retrieve the first layer of yahoo links:
+The same as the above, but convert the links in the @sc{html} files to
+point to local files, so you can view the documents off-line:
 
 @example
-wget -r -l1 http://www.yahoo.com/
+wget --convert-links -r http://www.gnu.org/ -o gnulog
+@end example
+
+@item
+Retrieve only one HTML page, but make sure that all the elements needed
+for the page to be displayed, such as inline images and external style
+sheets, are also downloaded.  Also make sure the downloaded page
+references the downloaded links.
+
+@example
+wget -p --convert-links http://www.server.com/dir/page.html
+@end example
+
+The HTML page will be saved to @file{www.server.com/dir/page.html}, and
+the images, stylesheets, etc., somewhere under @file{www.server.com/},
+depending on where they were on the remote server.
+
+@item
+The same as the above, but without the @file{www.server.com/} directory.
+In fact, I don't want to have all those random server directories
+anyway---just save @emph{all} those files under a @file{download/}
+subdirectory of the current directory.
+
+@example
+wget -p --convert-links -nH -nd -Pdownload \
+     http://www.server.com/dir/page.html
 @end example
 
 @item
@@ -2333,7 +2317,8 @@ wget -S http://www.lycos.com/
 @end example
 
 @item
-Save the server headers with the file:
+Save the server headers with the file, perhaps for post-processing.
+
 @example
 wget -s http://www.lycos.com/
 more index.html
@@ -2341,25 +2326,26 @@ more index.html
 
 @item
 Retrieve the first two levels of @samp{wuarchive.wustl.edu}, saving them
-to /tmp.
+to @file{/tmp}.
 
 @example
-wget -P/tmp -l2 ftp://wuarchive.wustl.edu/
+wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/
 @end example
 
 @item
-You want to download all the @sc{gif}s from an @sc{http} directory.
-@samp{wget http://host/dir/*.gif} doesn't work, since @sc{http}
-retrieval does not support globbing.  In that case, use:
+You want to download all the @sc{gif}s from a directory on an @sc{http}
+server.  @samp{wget http://www.server.com/dir/*.gif} doesn't work
+because @sc{http} retrieval does not support globbing.  In that case,
+use:
 
 @example
-wget -r -l1 --no-parent -A.gif http://host/dir/
+wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
 @end example
 
-It is a bit of a kludge, but it works.  @samp{-r -l1} means to retrieve
-recursively (@pxref{Recursive Retrieval}), with maximum depth of 1.
-@samp{--no-parent} means that references to the parent directory are
-ignored (@pxref{Directory-Based Limits}), and @samp{-A.gif} means to
+More verbose, but the effect is the same.  @samp{-r -l1} means to
+retrieve recursively (@pxref{Recursive Retrieval}), with maximum depth
+of 1.  @samp{--no-parent} means that references to the parent directory
+are ignored (@pxref{Directory-Based Limits}), and @samp{-A.gif} means to
 download only the @sc{gif} files.  @samp{-A "*.gif"} would have worked
 too.
@@ -2369,7 +2355,7 @@ interrupted.  Now you do not want to clobber the files already present.
 It would be:
 
 @example
-wget -nc -r http://www.gnu.ai.mit.edu/
+wget -nc -r http://www.gnu.org/
 @end example
 
 @item
@@ -2377,81 +2363,76 @@ If you want to encode your own username and password to @sc{http} or
 @sc{ftp}, use the appropriate @sc{url} syntax (@pxref{URL Format}).
 
 @example
-wget ftp://hniksic:mypassword@@jagor.srce.hr/.emacs
+wget ftp://hniksic:mypassword@@unix.server.com/.emacs
 @end example
 
+@cindex redirecting output
 @item
-If you do not like the default retrieval visualization (1K dots with 10
-dots per cluster and 50 dots per line), you can customize it through dot
-settings (@pxref{Wgetrc Commands}).  For example, many people like the
-``binary'' style of retrieval, with 8K dots and 512K lines:
+You would like the output documents to go to standard output instead of
+to files?
 
 @example
-wget --dot-style=binary ftp://prep.ai.mit.edu/pub/gnu/README
+wget -O - http://jagor.srce.hr/ http://www.srce.hr/
 @end example
 
-You can experiment with other styles, like:
+You can also combine the two options and make pipelines to retrieve the
+documents from remote hotlists:
 
 @example
-wget --dot-style=mega ftp://ftp.xemacs.org/pub/xemacs/xemacs-20.4/xemacs-20.4.tar.gz
-wget --dot-style=micro http://fly.srk.fer.hr/
+wget -O - http://cool.list.com/ | wget --force-html -i -
 @end example
-
-To make these settings permanent, put them in your @file{.wgetrc}, as
-described before (@pxref{Sample Wgetrc}).
 @end itemize
 
-@node Guru Usage, , Advanced Usage, Examples
-@section Guru Usage
+@node Very Advanced Usage, , Advanced Usage, Examples
+@section Very Advanced Usage
 @cindex mirroring
 
 @itemize @bullet
 @item
 If you wish Wget to keep a mirror of a page (or @sc{ftp}
 subdirectories), use @samp{--mirror} (@samp{-m}), which is the shorthand
-for @samp{-r -N}.  You can put Wget in the crontab file asking it to
-recheck a site each Sunday:
+for @samp{-r -l inf -N}.  You can put Wget in the crontab file asking it
+to recheck a site each Sunday:
 
 @example
 crontab
-0 0 * * 0 wget --mirror ftp://ftp.xemacs.org/pub/xemacs/ -o /home/me/weeklog
+0 0 * * 0 wget --mirror http://www.gnu.org/ -o /home/me/weeklog
 @end example
 
 @item
-You may wish to do the same with someone's home page.  But you do not
-want to download all those images---you're only interested in @sc{html}.
+In addition to the above, you want the links to be converted for local
+viewing.  But, after having read this manual, you know that link
+conversion doesn't play well with timestamping, so you also want Wget to
+back up the original HTML files before the conversion.  Wget invocation
+would look like this:
 
 @example
-wget --mirror -A.html http://www.w3.org/
+wget --mirror --convert-links --backup-converted \
+     http://www.gnu.org/ -o /home/me/weeklog
 @end example
 
 @item
-You have a presentation and would like the dumb absolute links to be
-converted to relative?  Use @samp{-k}:
+But you've also noticed that local viewing doesn't work all that well
+when HTML files are saved under extensions other than @samp{.html},
+perhaps because they were served as @file{index.cgi}.  So you'd like
+Wget to rename all the files served with content-type @samp{text/html}
+to @file{@var{name}.html}.
 
 @example
-wget -k -r @var{URL}
+wget --mirror --convert-links --backup-converted \
+     --html-extension -o /home/me/weeklog \
+     http://www.gnu.org/
 @end example
 
-@cindex redirecting output
-@item
-You would like the output documents to go to standard output instead of
-to files?  OK, but Wget will automatically shut up (turn on
-@samp{--quiet}) to prevent mixing of Wget output and the retrieved
-documents.
+Or, with less typing:
 
 @example
-wget -O - http://jagor.srce.hr/ http://www.srce.hr/
-@end example
-
-You can also combine the two options and make weird pipelines to
-retrieve the documents from remote hotlists:
-
-@example
-wget -O - http://cool.list.com/ | wget --force-html -i -
+wget -m -k -K -E http://www.gnu.org/ -o /home/me/weeklog
 @end example
 @end itemize
+@c man end
 
 @node Various, Appendices, Examples, Top
 @chapter Various
 @cindex various
@@ -2592,16 +2573,18 @@ they are supposed to work, it might well be a bug.
 
 @item
 Try to repeat the bug in as simple circumstances as possible.  E.g. if
-Wget crashes on @samp{wget -rLl0 -t5 -Y0 http://yoyodyne.com -o
-/tmp/log}, you should try to see if it will crash with a simpler set of
-options.
+Wget crashes while downloading @samp{wget -rl0 -kKE -t5 -Y0
+http://yoyodyne.com -o /tmp/log}, you should try to see if the crash is
+repeatable, and if it will occur with a simpler set of options.  You
+might even try to start the download at the page where the crash
+occurred to see if that page somehow triggered the crash.
 
 Also, while I will probably be interested to know the contents of your
 @file{.wgetrc} file, just dumping it into the debug message is probably
 a bad idea.  Instead, you should first try to see if the bug repeats
 with @file{.wgetrc} moved out of the way.  Only if it turns out that
-@file{.wgetrc} settings affect the bug, should you mail me the relevant
-parts of the file.
+@file{.wgetrc} settings affect the bug, mail me the relevant parts of
+the file.
 
 @item
 Please start Wget with @samp{-d} option and send the log (or the
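As a sketch of the isolation steps suggested above (file names and URL are placeholders): move @file{.wgetrc} out of the way, then rerun the failing command with debug output saved to a log via @samp{-d} and @samp{-o}:

@example
mv $HOME/.wgetrc $HOME/.wgetrc.saved
wget -d -o /tmp/wget-debug.log http://www.example.com/page.html
@end example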
@@ -2612,9 +2595,6 @@ on.
 
 @item
 If Wget has crashed, try to run it in a debugger, e.g. @code{gdb `which
 wget` core} and type @code{where} to get the backtrace.
-
-@item
-Find where the bug is, fix it and send me the patches. :-)
 @end enumerate
 @c man end