2000-02-29 20:03:39 -05:00
|
|
|
|
This is Info file wget.info, produced by Makeinfo version 1.68 from the
|
|
|
|
|
input file ./wget.texi.
|
|
|
|
|
|
|
|
|
|
INFO-DIR-SECTION Net Utilities
|
|
|
|
|
INFO-DIR-SECTION World Wide Web
|
|
|
|
|
START-INFO-DIR-ENTRY
|
|
|
|
|
* Wget: (wget). The non-interactive network downloader.
|
|
|
|
|
END-INFO-DIR-ENTRY
|
|
|
|
|
|
|
|
|
|
This file documents the GNU Wget utility for downloading network
|
|
|
|
|
data.
|
|
|
|
|
|
|
|
|
|
Copyright (C) 1996, 1997, 1998, 2000 Free Software Foundation, Inc.
|
|
|
|
|
|
|
|
|
|
Permission is granted to make and distribute verbatim copies of this
|
|
|
|
|
manual provided the copyright notice and this permission notice are
|
|
|
|
|
preserved on all copies.
|
|
|
|
|
|
|
|
|
|
Permission is granted to copy and distribute modified versions of
|
|
|
|
|
this manual under the conditions for verbatim copying, provided also
|
|
|
|
|
that the sections entitled "Copying" and "GNU General Public License"
|
|
|
|
|
are included exactly as in the original, and provided that the entire
|
|
|
|
|
resulting derived work is distributed under the terms of a permission
|
|
|
|
|
notice identical to this one.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Top, Next: Overview, Prev: (dir), Up: (dir)
|
|
|
|
|
|
|
|
|
|
Wget 1.5.3+dev
|
|
|
|
|
**************
|
|
|
|
|
|
|
|
|
|
This manual documents version 1.5.3+dev of GNU Wget, the freely
|
|
|
|
|
available utility for network download.
|
|
|
|
|
|
|
|
|
|
Copyright (C) 1996, 1997, 1998 Free Software Foundation, Inc.
|
|
|
|
|
|
|
|
|
|
* Menu:
|
|
|
|
|
|
|
|
|
|
* Overview:: Features of Wget.
|
|
|
|
|
* Invoking:: Wget command-line arguments.
|
|
|
|
|
* Recursive Retrieval:: Description of recursive retrieval.
|
|
|
|
|
* Following Links:: The available methods of chasing links.
|
|
|
|
|
* Time-Stamping:: Mirroring according to time-stamps.
|
|
|
|
|
* Startup File:: Wget's initialization file.
|
|
|
|
|
* Examples:: Examples of usage.
|
|
|
|
|
* Various:: The stuff that doesn't fit anywhere else.
|
|
|
|
|
* Appendices:: Some useful references.
|
|
|
|
|
* Copying:: You may give out copies of Wget.
|
|
|
|
|
* Concept Index:: Topics covered by this manual.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Overview, Next: Invoking, Prev: Top, Up: Top
|
|
|
|
|
|
|
|
|
|
Overview
|
|
|
|
|
********
|
|
|
|
|
|
|
|
|
|
GNU Wget is a freely available network utility to retrieve files from
|
|
|
|
|
the World Wide Web, using HTTP (Hyper Text Transfer Protocol) and FTP
|
|
|
|
|
(File Transfer Protocol), the two most widely used Internet protocols.
|
|
|
|
|
It has many useful features to make downloading easier, some of them
|
|
|
|
|
being:
|
|
|
|
|
|
|
|
|
|
* Wget is non-interactive, meaning that it can work in the
|
|
|
|
|
background, while the user is not logged on. This allows you to
|
|
|
|
|
start a retrieval and disconnect from the system, letting Wget
|
|
|
|
|
finish the work. By contrast, most Web browsers require
|
|
|
|
|
the user's constant presence, which can be a great hindrance when
|
|
|
|
|
transferring a lot of data.
|
|
|
|
|
|
|
|
|
|
* Wget is capable of descending recursively through the structure of
|
|
|
|
|
HTML documents and FTP directory trees, making a local copy of the
|
|
|
|
|
directory hierarchy similar to the one on the remote server. This
|
|
|
|
|
feature can be used to mirror archives and home pages, or traverse
|
|
|
|
|
the web in search of data, like a WWW robot (*Note Robots::). In
|
|
|
|
|
that spirit, Wget understands the `norobots' convention.
|
|
|
|
|
|
|
|
|
|
* File name wildcard matching and recursive mirroring of directories
|
|
|
|
|
are available when retrieving via FTP. Wget can read the
|
|
|
|
|
time-stamp information given by both HTTP and FTP servers, and
|
|
|
|
|
store it locally. Thus Wget can see if the remote file has
|
|
|
|
|
changed since last retrieval, and automatically retrieve the new
|
|
|
|
|
version if it has. This makes Wget suitable for mirroring of FTP
|
|
|
|
|
sites, as well as home pages.
|
|
|
|
|
|
|
|
|
|
* Wget works exceedingly well on slow or unstable connections,
|
|
|
|
|
retrying the document until it is fully retrieved, or until a
|
|
|
|
|
user-specified retry count is surpassed. It will try to resume the
|
|
|
|
|
download from the point of interruption, using `REST' with FTP and
|
|
|
|
|
`Range' with HTTP servers that support them.
|
|
|
|
|
|
|
|
|
|
* By default, Wget supports proxy servers, which can lighten the
|
|
|
|
|
network load, speed up retrieval and provide access behind
|
|
|
|
|
firewalls. However, if you are behind a firewall that requires
|
|
|
|
|
that you use a socks style gateway, you can get the socks library
|
|
|
|
|
and build wget with support for socks. Wget also supports
|
|
|
|
|
passive FTP downloading as an option.
|
|
|
|
|
|
|
|
|
|
* Built-in features offer mechanisms to tune which links you wish to
|
|
|
|
|
follow (*Note Following Links::).
|
|
|
|
|
|
|
|
|
|
* The retrieval is conveniently traced by printing dots, each dot
|
|
|
|
|
representing a fixed amount of data received (1KB by default).
|
|
|
|
|
These representations can be customized to your preferences.
|
|
|
|
|
|
|
|
|
|
* Most of the features are fully configurable, either through
|
|
|
|
|
command line options, or via the initialization file `.wgetrc'
|
|
|
|
|
(*Note Startup File::). Wget allows you to define "global"
|
|
|
|
|
startup files (`/usr/local/etc/wgetrc' by default) for site
|
|
|
|
|
settings.
|
|
|
|
|
|
|
|
|
|
* Finally, GNU Wget is free software. This means that everyone may
|
|
|
|
|
use it, redistribute it and/or modify it under the terms of the
|
|
|
|
|
GNU General Public License, as published by the Free Software
|
|
|
|
|
Foundation (*Note Copying::).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Invoking, Next: Recursive Retrieval, Prev: Overview, Up: Top
|
|
|
|
|
|
|
|
|
|
Invoking
|
|
|
|
|
********
|
|
|
|
|
|
|
|
|
|
By default, Wget is very simple to invoke. The basic syntax is:
|
|
|
|
|
|
|
|
|
|
wget [OPTION]... [URL]...
|
|
|
|
|
|
|
|
|
|
Wget will simply download all the URLs specified on the command
|
|
|
|
|
line. URL is a "Uniform Resource Locator", as defined below.
|
|
|
|
|
|
|
|
|
|
However, you may wish to change some of the default parameters of
|
|
|
|
|
Wget. You can do it in two ways: permanently, by adding the appropriate
|
|
|
|
|
command to `.wgetrc' (*Note Startup File::), or specifying it on the
|
|
|
|
|
command line.
|
|
|
|
|
|
|
|
|
|
* Menu:
|
|
|
|
|
|
|
|
|
|
* URL Format::
|
|
|
|
|
* Option Syntax::
|
|
|
|
|
* Basic Startup Options::
|
|
|
|
|
* Logging and Input File Options::
|
|
|
|
|
* Download Options::
|
|
|
|
|
* Directory Options::
|
|
|
|
|
* HTTP Options::
|
|
|
|
|
* FTP Options::
|
|
|
|
|
* Recursive Retrieval Options::
|
|
|
|
|
* Recursive Accept/Reject Options::
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: URL Format, Next: Option Syntax, Prev: Invoking, Up: Invoking
|
|
|
|
|
|
|
|
|
|
URL Format
|
|
|
|
|
==========
|
|
|
|
|
|
|
|
|
|
"URL" is an acronym for Uniform Resource Locator. A uniform
|
|
|
|
|
resource locator is a compact string representation for a resource
|
|
|
|
|
available via the Internet. Wget recognizes the URL syntax as per
|
|
|
|
|
RFC1738. This is the most widely used form (square brackets denote
|
|
|
|
|
optional parts):
|
|
|
|
|
|
|
|
|
|
http://host[:port]/directory/file
|
|
|
|
|
ftp://host[:port]/directory/file
|
|
|
|
|
|
|
|
|
|
You can also encode your username and password within a URL:
|
|
|
|
|
|
|
|
|
|
ftp://user:password@host/path
|
|
|
|
|
http://user:password@host/path
|
|
|
|
|
|
|
|
|
|
Either USER or PASSWORD, or both, may be left out. If you leave out
|
|
|
|
|
either the HTTP username or password, no authentication will be sent.
|
|
|
|
|
If you leave out the FTP username, `anonymous' will be used. If you
|
|
|
|
|
leave out the FTP password, your email address will be supplied as a
|
|
|
|
|
default password.(1)
|
|
|
|
|
|
|
|
|
|
You can encode unsafe characters in a URL as `%xy', `xy' being the
|
|
|
|
|
hexadecimal representation of the character's ASCII value. Some common
|
|
|
|
|
unsafe characters include `%' (quoted as `%25'), `:' (quoted as `%3A'),
|
|
|
|
|
and `@' (quoted as `%40'). Refer to RFC1738 for a comprehensive list
|
|
|
|
|
of unsafe characters.
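
The quoting rule above can be tried out with a short Python sketch. It is only an illustration of the `%xy' encoding, not code taken from Wget; the helper names, the host `ftp.example.com' and the sample password are made up for the example.

```python
from urllib.parse import quote


def encode_unsafe(text: str) -> str:
    """Percent-encode every reserved or unsafe character in TEXT."""
    # safe="" makes quote() escape '/', ':' and '@' as well as '%'.
    return quote(text, safe="")


def ftp_url(user: str, password: str, host: str, path: str) -> str:
    """Build an FTP URL with the user name and password safely quoted."""
    return f"ftp://{encode_unsafe(user)}:{encode_unsafe(password)}@{host}/{path}"


if __name__ == "__main__":
    # Hypothetical credentials; note how '@', ':' and '%' are quoted.
    print(ftp_url("joe", "p@ss:w%rd", "ftp.example.com", "pub/file.txt"))
```

Quoting the user name and password separately keeps the literal `:' and `@' that delimit them unambiguous.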
|
|
|
|
|
|
|
|
|
|
Wget also supports the `type' feature for FTP URLs. By default, FTP
|
|
|
|
|
documents are retrieved in the binary mode (type `i'), which means that
|
|
|
|
|
they are downloaded unchanged. Another useful mode is the `a'
|
|
|
|
|
("ASCII") mode, which converts the line delimiters between the
|
|
|
|
|
different operating systems, and is thus useful for text files. Here
|
|
|
|
|
is an example:
|
|
|
|
|
|
|
|
|
|
ftp://host/directory/file;type=a
|
|
|
|
|
|
|
|
|
|
Two alternative variants of URL specification are also supported,
|
|
|
|
|
because of historical (hysterical?) reasons and their widespread use.
|
|
|
|
|
|
|
|
|
|
FTP-only syntax (supported by `NcFTP'):
|
|
|
|
|
host:/dir/file
|
|
|
|
|
|
|
|
|
|
HTTP-only syntax (introduced by `Netscape'):
|
|
|
|
|
host[:port]/dir/file
|
|
|
|
|
|
|
|
|
|
These two alternative forms are deprecated, and may cease being
|
|
|
|
|
supported in the future.
|
|
|
|
|
|
|
|
|
|
If you do not understand the difference between these notations, or
|
|
|
|
|
do not know which one to use, just use the plain ordinary format you use
|
|
|
|
|
with your favorite browser, like `Lynx' or `Netscape'.
|
|
|
|
|
|
|
|
|
|
---------- Footnotes ----------
|
|
|
|
|
|
|
|
|
|
(1) If you have a `.netrc' file in your home directory, password
|
|
|
|
|
will also be searched for there.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Option Syntax, Next: Basic Startup Options, Prev: URL Format, Up: Invoking
|
|
|
|
|
|
|
|
|
|
Option Syntax
|
|
|
|
|
=============
|
|
|
|
|
|
|
|
|
|
Since Wget uses GNU getopt to process its arguments, every option
|
|
|
|
|
has a short form and a long form. Long options are more convenient to
|
|
|
|
|
remember, but take time to type. You may freely mix different option
|
|
|
|
|
styles, or specify options after the command-line arguments. Thus you
|
|
|
|
|
may write:
|
|
|
|
|
|
|
|
|
|
wget -r --tries=10 http://fly.cc.fer.hr/ -o log
|
|
|
|
|
|
|
|
|
|
The space between the option accepting an argument and the argument
|
|
|
|
|
may be omitted. Instead of `-o log' you can write `-olog'.
|
|
|
|
|
|
|
|
|
|
You may put several options that do not require arguments together,
|
|
|
|
|
like:
|
|
|
|
|
|
|
|
|
|
wget -drc URL
|
|
|
|
|
|
|
|
|
|
This is completely equivalent to:
|
|
|
|
|
|
|
|
|
|
wget -d -r -c URL
|
|
|
|
|
|
|
|
|
|
Since the options can be specified after the arguments, you may
|
|
|
|
|
terminate them with `--'. So the following will try to download URL
|
|
|
|
|
`-x', reporting failure to `log':
|
|
|
|
|
|
|
|
|
|
wget -o log -- -x
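
The conventions described in this node (bundled short options, an argument attached to its option, options after the arguments, and `--' as a terminator) are general GNU-style behavior, so they can be explored with Python's standard `getopt' module. The sketch below is not Wget's parser; it declares only the handful of options used in the examples above, and the function name `parse' is chosen for illustration.

```python
import getopt

# Only the options used in the examples are declared here.
SHORT = "drco:x"     # 'o:' means that -o takes an argument
LONG = ["tries="]    # the trailing '=' means that --tries takes an argument


def parse(argv):
    """Return (options, arguments), scanning the way GNU getopt does."""
    return getopt.gnu_getopt(argv, SHORT, LONG)


if __name__ == "__main__":
    print(parse(["-drc", "URL"]))             # bundled short options
    print(parse(["-olog"]))                   # argument attached to option
    print(parse(["-r", "--tries=10", "http://fly.cc.fer.hr/", "-o", "log"]))
    print(parse(["-o", "log", "--", "-x"]))   # '-x' is a plain argument here
```

Everything after `--' comes back in the arguments list, which is why `-x' is treated as a URL rather than as an option.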
|
|
|
|
|
|
|
|
|
|
The options that accept comma-separated lists all respect the
|
|
|
|
|
convention that specifying an empty list clears its value. This can be
|
|
|
|
|
useful to clear the `.wgetrc' settings. For instance, if your `.wgetrc'
|
|
|
|
|
sets `exclude_directories' to `/cgi-bin', the following example will
|
|
|
|
|
first reset it, and then set it to exclude `/~nobody' and `/~somebody'.
|
|
|
|
|
You can also clear the lists in `.wgetrc' (*Note Wgetrc Syntax::).
|
|
|
|
|
|
|
|
|
|
wget -X '' -X /~nobody,/~somebody
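
The "empty list clears the value" convention can be modeled in a few lines of Python. This is a sketch of the convention as described above, not Wget's own code; the function name is hypothetical, and it assumes that a non-empty value is appended to whatever the list already holds.

```python
def apply_list_option(current, value):
    """Apply one comma-separated option VALUE to the CURRENT list."""
    if value == "":
        return []                       # an empty list clears earlier settings
    return current + value.split(",")   # assumption: new entries are appended


# As if `.wgetrc' had set exclude_directories to /cgi-bin ...
excluded = ["/cgi-bin"]
# ... and the command line then said: -X '' -X /~nobody,/~somebody
for value in ["", "/~nobody,/~somebody"]:
    excluded = apply_list_option(excluded, value)

if __name__ == "__main__":
    print(excluded)
```

Without the leading empty value, `/cgi-bin' would still be in the list under the assumption made here.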
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Basic Startup Options, Next: Logging and Input File Options, Prev: Option Syntax, Up: Invoking
|
|
|
|
|
|
|
|
|
|
Basic Startup Options
|
|
|
|
|
=====================
|
|
|
|
|
|
|
|
|
|
`-V'
|
|
|
|
|
`--version'
|
|
|
|
|
Display the version of Wget.
|
|
|
|
|
|
|
|
|
|
`-h'
|
|
|
|
|
`--help'
|
|
|
|
|
Print a help message describing all of Wget's command-line options.
|
|
|
|
|
|
|
|
|
|
`-b'
|
|
|
|
|
`--background'
|
|
|
|
|
Go to background immediately after startup. If no output file is
|
|
|
|
|
specified via `-o', output is redirected to `wget-log'.
|
|
|
|
|
|
|
|
|
|
`-e COMMAND'
|
|
|
|
|
`--execute COMMAND'
|
|
|
|
|
Execute COMMAND as if it were a part of `.wgetrc' (*Note Startup
|
|
|
|
|
File::). A command thus invoked will be executed *after* the
|
|
|
|
|
commands in `.wgetrc', thus taking precedence over them.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Logging and Input File Options, Next: Download Options, Prev: Basic Startup Options, Up: Invoking
|
|
|
|
|
|
|
|
|
|
Logging and Input File Options
|
|
|
|
|
==============================
|
|
|
|
|
|
|
|
|
|
`-o LOGFILE'
|
|
|
|
|
`--output-file=LOGFILE'
|
|
|
|
|
Log all messages to LOGFILE. The messages are normally reported
|
|
|
|
|
to standard error.
|
|
|
|
|
|
|
|
|
|
`-a LOGFILE'
|
|
|
|
|
`--append-output=LOGFILE'
|
|
|
|
|
Append to LOGFILE. This is the same as `-o', only it appends to
|
|
|
|
|
LOGFILE instead of overwriting the old log file. If LOGFILE does
|
|
|
|
|
not exist, a new file is created.
|
|
|
|
|
|
|
|
|
|
`-d'
|
|
|
|
|
`--debug'
|
|
|
|
|
Turn on debug output, meaning various information important to the
|
|
|
|
|
developers of Wget if it does not work properly. Your system
|
|
|
|
|
administrator may have chosen to compile Wget without debug
|
|
|
|
|
support, in which case `-d' will not work. Please note that
|
|
|
|
|
compiling with debug support is always safe--Wget compiled with
|
|
|
|
|
the debug support will *not* print any debug info unless requested
|
|
|
|
|
with `-d'. *Note Reporting Bugs:: for more information on how to
|
|
|
|
|
use `-d' for sending bug reports.
|
|
|
|
|
|
|
|
|
|
`-q'
|
|
|
|
|
`--quiet'
|
|
|
|
|
Turn off Wget's output.
|
|
|
|
|
|
|
|
|
|
`-v'
|
|
|
|
|
`--verbose'
|
|
|
|
|
Turn on verbose output, with all the available data. The default
|
|
|
|
|
output is verbose.
|
|
|
|
|
|
|
|
|
|
`-nv'
|
|
|
|
|
`--non-verbose'
|
|
|
|
|
Non-verbose output--turn off verbose without being completely quiet
|
|
|
|
|
(use `-q' for that), which means that error messages and basic
|
|
|
|
|
information still get printed.
|
|
|
|
|
|
|
|
|
|
`-i FILE'
|
|
|
|
|
`--input-file=FILE'
|
|
|
|
|
Read URLs from FILE, in which case no URLs need to be on the
|
|
|
|
|
command line. If there are URLs both on the command line and in
|
|
|
|
|
an input file, those on the command line will be the first ones to
|
|
|
|
|
be retrieved. The FILE need not be an HTML document (but no harm
|
|
|
|
|
if it is)--it is enough if the URLs are just listed sequentially.
|
|
|
|
|
|
|
|
|
|
However, if you specify `--force-html', the document will be
|
|
|
|
|
regarded as `html'. In that case you may have problems with
|
|
|
|
|
relative links, which you can solve either by adding `<base
|
|
|
|
|
href="URL">' to the documents or by specifying `--base=URL' on the
|
|
|
|
|
command line.
|
|
|
|
|
|
|
|
|
|
`-F'
|
|
|
|
|
`--force-html'
|
|
|
|
|
When input is read from a file, force it to be treated as an HTML
|
|
|
|
|
file. This enables you to retrieve relative links from existing
|
|
|
|
|
HTML files on your local disk, by adding `<base href="URL">' to
|
|
|
|
|
HTML, or using the `--base' command-line option.
|
|
|
|
|
|
|
|
|
|
`-B URL'
|
|
|
|
|
`--base=URL'
|
|
|
|
|
When used in conjunction with `-F', prepends URL to relative links
|
|
|
|
|
in the file specified by `-i'.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Download Options, Next: Directory Options, Prev: Logging and Input File Options, Up: Invoking
|
|
|
|
|
|
|
|
|
|
Download Options
|
|
|
|
|
================
|
|
|
|
|
|
|
|
|
|
`-t NUMBER'
|
|
|
|
|
`--tries=NUMBER'
|
|
|
|
|
Set number of retries to NUMBER. Specify 0 or `inf' for infinite
|
|
|
|
|
retrying.
|
|
|
|
|
|
|
|
|
|
`-O FILE'
|
|
|
|
|
`--output-document=FILE'
|
|
|
|
|
The documents will not be written to the appropriate files, but
|
|
|
|
|
all will be concatenated together and written to FILE. If FILE
|
|
|
|
|
already exists, it will be overwritten. If the FILE is `-', the
|
|
|
|
|
documents will be written to standard output. Including this
|
|
|
|
|
option automatically sets the number of tries to 1.
|
|
|
|
|
|
|
|
|
|
`-nc'
|
|
|
|
|
`--no-clobber'
|
|
|
|
|
If a file is downloaded more than once in the same directory,
|
|
|
|
|
wget's behavior depends on a few options, including `-nc'. In
|
|
|
|
|
certain cases, the local file will be "clobbered", or overwritten,
|
|
|
|
|
upon repeated download. In other cases it will be preserved.
|
|
|
|
|
|
|
|
|
|
When running wget without `-N', `-nc', or `-r', downloading the
|
|
|
|
|
same file in the same directory will result in the original copy
|
|
|
|
|
of `FILE' being preserved and the second copy being named
|
|
|
|
|
`FILE.1'. If that file is downloaded yet again, the third copy
|
|
|
|
|
will be named `FILE.2', and so on. When `-nc' is specified, this
|
|
|
|
|
behavior is suppressed, and wget will refuse to download newer
|
|
|
|
|
copies of `FILE'. Therefore, "no-clobber" is actually a misnomer
|
|
|
|
|
in this mode - it's not clobbering that's prevented (as the
|
|
|
|
|
numeric suffixes were already preventing clobbering), but rather
|
|
|
|
|
the multiple version saving that's prevented.
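
The naming scheme for repeated non-recursive downloads can be sketched as follows. The function is a hypothetical model of the behavior described in the paragraph above (no `-N' and no `-r'), not Wget's implementation; `existing' stands for the set of names already present in the directory.

```python
def local_name(name, existing, no_clobber=False):
    """Pick the local file name for a repeated, non-recursive download.

    Returns None when the download would be skipped because of -nc.
    """
    if name not in existing:
        return name                  # first download: use the plain name
    if no_clobber:
        return None                  # -nc: keep the original, fetch nothing
    n = 1
    while f"{name}.{n}" in existing:
        n += 1                       # FILE.1, FILE.2, ... first free suffix
    return f"{name}.{n}"


if __name__ == "__main__":
    print(local_name("FILE", {"FILE", "FILE.1"}))
    print(local_name("FILE", {"FILE"}, no_clobber=True))
```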
|
|
|
|
|
|
|
|
|
|
When running wget with `-r', but without `-N' or `-nc',
|
|
|
|
|
re-downloading a file will result in the new copy simply
|
|
|
|
|
overwriting the old. Adding `-nc' will prevent this behavior,
|
|
|
|
|
instead causing the original version to be preserved and any newer
|
|
|
|
|
copies on the server to be ignored.
|
|
|
|
|
|
|
|
|
|
When running wget with `-N', with or without `-r', the decision as
|
|
|
|
|
to whether or not to download a newer copy of a file depends on
|
|
|
|
|
the local and remote timestamp and size of the file (*Note
|
|
|
|
|
Time-Stamping::). `-nc' may not be specified at the same time as
|
|
|
|
|
`-N'.
|
|
|
|
|
|
|
|
|
|
Note that when `-nc' is specified, files with the suffixes `.html'
|
|
|
|
|
or (yuck) `.htm' will be loaded from the local disk and parsed as
|
|
|
|
|
if they had been retrieved from the Web.
|
|
|
|
|
|
|
|
|
|
`-c'
|
|
|
|
|
`--continue'
|
|
|
|
|
Continue getting an existing file. This is useful when you want to
|
|
|
|
|
finish up the download started by another program, or a previous
|
|
|
|
|
instance of Wget. Thus you can write:
|
|
|
|
|
|
|
|
|
|
wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
|
|
|
|
|
|
|
|
|
|
If there is a file named `ls-lR.Z' in the current directory, Wget
|
|
|
|
|
will assume that it is the first portion of the remote file, and
|
|
|
|
|
will require the server to continue the retrieval from an offset
|
|
|
|
|
equal to the length of the local file.
|
|
|
|
|
|
|
|
|
|
Note that you need not specify this option if all you want is Wget
|
|
|
|
|
to continue retrieving where it left off when the connection is
|
|
|
|
|
lost--Wget does this by default. You need this option only when
|
|
|
|
|
you want to continue retrieval of a file already halfway
|
|
|
|
|
retrieved, saved by another FTP client, or left by Wget being
|
|
|
|
|
killed.
|
|
|
|
|
|
|
|
|
|
Without `-c', the previous example would just begin to download the
|
|
|
|
|
remote file to `ls-lR.Z.1'. The `-c' option is also applicable
|
|
|
|
|
for HTTP servers that support the `Range' header.
|
|
|
|
|
|
|
|
|
|
Note that if you use `-c' on a file that's already downloaded
|
|
|
|
|
completely, `FILE' will not be changed, nor will a second `FILE.1'
|
|
|
|
|
copy be created.
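
To make the "offset equal to the length of the local file" idea concrete, here is a small Python sketch that derives the resume point from a partial file. It only formats the two requests mentioned in this manual (`Range' for HTTP, `REST' for FTP) and sends nothing over the network; the function name is an assumption made for the example.

```python
import os


def resume_request(local_path):
    """Return (offset, HTTP Range value, FTP REST command) for resuming."""
    # The offset is simply the number of bytes already on disk.
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    return offset, f"bytes={offset}-", f"REST {offset}"


if __name__ == "__main__":
    # With no local `ls-lR.Z' the offset is 0, i.e. a full download.
    print(resume_request("ls-lR.Z"))
```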
|
|
|
|
|
|
|
|
|
|
`--dot-style=STYLE'
|
|
|
|
|
Set the retrieval style to STYLE. Wget traces the retrieval of
|
|
|
|
|
each document by printing dots on the screen, each dot
|
|
|
|
|
representing a fixed amount of retrieved data. Any number of dots
|
|
|
|
|
may be separated in a "cluster", to make counting easier. This
|
|
|
|
|
option allows you to choose one of the pre-defined styles,
|
|
|
|
|
determining the number of bytes represented by a dot, the number
|
|
|
|
|
of dots in a cluster, and the number of dots on the line.
|
|
|
|
|
|
|
|
|
|
With the `default' style each dot represents 1K, there are ten dots
|
|
|
|
|
in a cluster and 50 dots in a line. The `binary' style has a more
"computer"-like orientation--8K dots, 16-dot clusters and 48 dots
|
|
|
|
|
per line (which makes for 384K lines). The `mega' style is
|
|
|
|
|
suitable for downloading very large files--each dot represents 64K
|
|
|
|
|
retrieved, there are eight dots in a cluster, and 48 dots on each
|
|
|
|
|
line (so each line contains 3M). The `micro' style is exactly the
|
|
|
|
|
reverse; it is suitable for downloading small files, with 128-byte
|
|
|
|
|
dots, 8 dots per cluster, and 48 dots (6K) per line.
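
The per-line figures quoted above follow directly from the dot size and the number of dots per line. The Python sketch below merely redoes that arithmetic with the numbers from this description, assuming that 1K means 1024 bytes; the table and function names are made up for the example.

```python
DOT_STYLES = {
    # style: (bytes per dot, dots per cluster, dots per line)
    "default": (1024, 10, 50),
    "binary": (8 * 1024, 16, 48),
    "mega": (64 * 1024, 8, 48),
    "micro": (128, 8, 48),
}


def bytes_per_line(style):
    """Amount of data represented by one full line of dots."""
    dot_bytes, _dots_per_cluster, dots_per_line = DOT_STYLES[style]
    return dot_bytes * dots_per_line


if __name__ == "__main__":
    for style in DOT_STYLES:
        print(f"{style:8} {bytes_per_line(style) // 1024}K per line")
```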
|
|
|
|
|
|
|
|
|
|
`-N'
|
|
|
|
|
`--timestamping'
|
|
|
|
|
Turn on time-stamping. *Note Time-Stamping:: for details.
|
|
|
|
|
|
|
|
|
|
`-S'
|
|
|
|
|
`--server-response'
|
|
|
|
|
Print the headers sent by HTTP servers and responses sent by FTP
|
|
|
|
|
servers.
|
|
|
|
|
|
|
|
|
|
`--spider'
|
|
|
|
|
When invoked with this option, Wget will behave as a Web "spider",
|
|
|
|
|
which means that it will not download the pages, just check that
|
|
|
|
|
they are there. You can use it to check your bookmarks, e.g. with:
|
|
|
|
|
|
|
|
|
|
wget --spider --force-html -i bookmarks.html
|
|
|
|
|
|
|
|
|
|
This feature needs much more work for Wget to get close to the
|
|
|
|
|
functionality of real WWW spiders.
|
|
|
|
|
|
|
|
|
|
`-T SECONDS'
|
|
|
|
|
`--timeout=SECONDS'
|
|
|
|
|
Set the read timeout to SECONDS seconds. Whenever a network read
|
|
|
|
|
is issued, the file descriptor is checked for a timeout, which
|
|
|
|
|
could otherwise leave a pending connection (uninterrupted read).
|
|
|
|
|
The default timeout is 900 seconds (fifteen minutes). Setting
|
|
|
|
|
timeout to 0 will disable checking for timeouts.
|
|
|
|
|
|
|
|
|
|
Please do not lower the default timeout value with this option
|
|
|
|
|
unless you know what you are doing.
|
|
|
|
|
|
|
|
|
|
`-w SECONDS'
|
|
|
|
|
`--wait=SECONDS'
|
|
|
|
|
Wait the specified number of seconds between the retrievals. Use
|
|
|
|
|
of this option is recommended, as it lightens the server load by
|
|
|
|
|
making the requests less frequent. Instead of in seconds, the
|
|
|
|
|
time can be specified in minutes using the `m' suffix, in hours
|
|
|
|
|
using the `h' suffix, or in days using the `d' suffix.
|
|
|
|
|
|
|
|
|
|
Specifying a large value for this option is useful if the network
|
|
|
|
|
or the destination host is down, so that Wget can wait long enough
|
|
|
|
|
to reasonably expect the network error to be fixed before the
|
|
|
|
|
retry.
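
The time suffixes accepted by `--wait' amount to a simple unit conversion, sketched below in Python. This is an illustration only: the function name is hypothetical, and it assumes whole-number values, which is not something this manual states.

```python
UNIT_SECONDS = {"m": 60, "h": 60 * 60, "d": 24 * 60 * 60}


def wait_seconds(value):
    """Convert a --wait value such as '30', '5m', '2h' or '1d' to seconds."""
    if value and value[-1] in UNIT_SECONDS:
        return int(value[:-1]) * UNIT_SECONDS[value[-1]]
    return int(value)     # no suffix: the value is already in seconds


if __name__ == "__main__":
    for value in ["30", "5m", "2h", "1d"]:
        print(value, "->", wait_seconds(value), "seconds")
```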
|
|
|
|
|
|
|
|
|
|
`--waitretry=SECONDS'
|
|
|
|
|
If you don't want Wget to wait between *every* retrieval, but only
|
|
|
|
|
between retries of failed downloads, you can use this option.
|
|
|
|
|
Wget will use "linear backoff", waiting 1 second after the first
|
|
|
|
|
failure on a given file, then waiting 2 seconds after the second
|
|
|
|
|
failure on that file, up to the maximum number of SECONDS you
|
|
|
|
|
specify. Therefore, a value of 10 will actually make Wget wait up
|
|
|
|
|
to (1 + 2 + ... + 10) = 55 seconds per file.
|
|
|
|
|
|
|
|
|
|
Note that this option is turned on by default in the global
|
|
|
|
|
`wgetrc' file.
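
The linear backoff described for `--waitretry' can be written out as a short Python sketch. It models the description above (one more second after each failure, capped at SECONDS) and is not Wget's code; the function name and the `failures' parameter are introduced for the example.

```python
def retry_waits(max_wait, failures):
    """Seconds waited after each of FAILURES consecutive failures."""
    # 1 second after the first failure, 2 after the second, and so on,
    # never exceeding max_wait.
    return [min(n, max_wait) for n in range(1, failures + 1)]


waits = retry_waits(10, 10)
total = sum(waits)

if __name__ == "__main__":
    print(waits, "total:", total)
```

With `--waitretry=10' and ten consecutive failures, the waits add up to the 55 seconds mentioned above.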
|
|
|
|
|
|
|
|
|
|
`-Y on/off'
|
|
|
|
|
`--proxy=on/off'
|
|
|
|
|
Turn proxy support on or off. The proxy is on by default if the
|
|
|
|
|
appropriate environment variable is defined.
|
|
|
|
|
|
|
|
|
|
`-Q QUOTA'
|
|
|
|
|
`--quota=QUOTA'
|
|
|
|
|
Specify download quota for automatic retrievals. The value can be
|
|
|
|
|
specified in bytes (default), kilobytes (with `k' suffix), or
|
|
|
|
|
megabytes (with `m' suffix).
|
|
|
|
|
|
|
|
|
|
Note that quota will never affect downloading a single file. So
|
|
|
|
|
if you specify `wget -Q10k ftp://wuarchive.wustl.edu/ls-lR.gz',
|
|
|
|
|
all of the `ls-lR.gz' will be downloaded. The same goes even when
|
|
|
|
|
several URLs are specified on the command-line. However, quota is
|
|
|
|
|
respected when retrieving either recursively, or from an input
|
|
|
|
|
file. Thus you may safely type `wget -Q2m -i sites'--download
|
|
|
|
|
will be aborted when the quota is exceeded.
|
|
|
|
|
|
|
|
|
|
Setting quota to 0 or to `inf' removes the download quota limit.
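
The quota rules above can be summarized in a Python sketch. It is a model of the description, not Wget's implementation: the function names are hypothetical, and it assumes that the `k' and `m' suffixes are multiples of 1024, which this manual does not spell out.

```python
def quota_bytes(value):
    """Convert a --quota value to bytes; None means no limit."""
    if value in ("0", "inf"):
        return None
    multipliers = {"k": 1024, "m": 1024 * 1024}   # assumption: powers of 1024
    suffix = value[-1].lower()
    if suffix in multipliers:
        return int(value[:-1]) * multipliers[suffix]
    return int(value)                              # plain number: bytes


def quota_enforced(recursive, from_input_file):
    """Quota is respected only for recursive or input-file retrievals."""
    return recursive or from_input_file


if __name__ == "__main__":
    print(quota_bytes("10k"), quota_bytes("2m"), quota_bytes("inf"))
    # URLs given directly on the command line are not subject to the quota.
    print(quota_enforced(recursive=False, from_input_file=False))
```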
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Directory Options, Next: HTTP Options, Prev: Download Options, Up: Invoking
|
|
|
|
|
|
|
|
|
|
Directory Options
|
|
|
|
|
=================
|
|
|
|
|
|
|
|
|
|
`-nd'
|
|
|
|
|
`--no-directories'
|
|
|
|
|
Do not create a hierarchy of directories when retrieving
|
|
|
|
|
recursively. With this option turned on, all files will get saved
|
|
|
|
|
to the current directory, without clobbering (if a name shows up
|
|
|
|
|
more than once, the filenames will get extensions `.n').
|
|
|
|
|
|
|
|
|
|
`-x'
|
|
|
|
|
`--force-directories'
|
|
|
|
|
The opposite of `-nd'--create a hierarchy of directories, even if
|
|
|
|
|
one would not have been created otherwise. E.g. `wget -x
|
|
|
|
|
http://fly.cc.fer.hr/robots.txt' will save the downloaded file to
|
|
|
|
|
`fly.cc.fer.hr/robots.txt'.
|
|
|
|
|
|
|
|
|
|
`-nH'
|
|
|
|
|
`--no-host-directories'
|
|
|
|
|
Disable generation of host-prefixed directories. By default,
|
|
|
|
|
invoking Wget with `-r http://fly.cc.fer.hr/' will create a
|
|
|
|
|
structure of directories beginning with `fly.cc.fer.hr/'. This
|
|
|
|
|
option disables such behavior.
|
|
|
|
|
|
|
|
|
|
`--cut-dirs=NUMBER'
|
|
|
|
|
Ignore NUMBER directory components. This is useful for getting a
|
|
|
|
|
fine-grained control over the directory where recursive retrieval
|
|
|
|
|
will be saved.
|
|
|
|
|
|
|
|
|
|
Take, for example, the directory at
|
|
|
|
|
`ftp://ftp.xemacs.org/pub/xemacs/'. If you retrieve it with `-r',
|
|
|
|
|
it will be saved locally under `ftp.xemacs.org/pub/xemacs/'.
|
|
|
|
|
While the `-nH' option can remove the `ftp.xemacs.org/' part, you
|
|
|
|
|
are still stuck with `pub/xemacs'. This is where `--cut-dirs'
|
|
|
|
|
comes in handy; it makes Wget not "see" NUMBER remote directory
|
|
|
|
|
components. Here are several examples of how the `--cut-dirs' option
|
|
|
|
|
works.
|
|
|
|
|
|
|
|
|
|
No options -> ftp.xemacs.org/pub/xemacs/
|
|
|
|
|
-nH -> pub/xemacs/
|
|
|
|
|
-nH --cut-dirs=1 -> xemacs/
|
|
|
|
|
-nH --cut-dirs=2 -> .
|
|
|
|
|
|
|
|
|
|
--cut-dirs=1 -> ftp.xemacs.org/xemacs/
|
|
|
|
|
...
|
|
|
|
|
|
|
|
|
|
If you just want to get rid of the directory structure, this
|
|
|
|
|
option is similar to a combination of `-nd' and `-P'. However,
|
|
|
|
|
unlike `-nd', `--cut-dirs' does not lose subdirectories--for
|
|
|
|
|
instance, with `-nH --cut-dirs=1', a `beta/' subdirectory will be
|
|
|
|
|
placed in `xemacs/beta', as one would expect.
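
The interaction of `-nH' and `--cut-dirs' is easy to reproduce with a small Python function. The sketch below is a model of the examples listed above rather than Wget's own logic; the function and parameter names are made up for the illustration.

```python
def local_dir(host, remote_dir, no_host=False, cut_dirs=0):
    """Local directory for files found in REMOTE_DIR on HOST."""
    parts = [p for p in remote_dir.split("/") if p]
    parts = parts[cut_dirs:]         # --cut-dirs: drop leading components
    if not no_host:
        parts.insert(0, host)        # without -nH the host name comes first
    return ("/".join(parts) + "/") if parts else "."


if __name__ == "__main__":
    host, remote = "ftp.xemacs.org", "/pub/xemacs/"
    print(local_dir(host, remote))
    print(local_dir(host, remote, no_host=True))
    print(local_dir(host, remote, no_host=True, cut_dirs=1))
    print(local_dir(host, remote, no_host=True, cut_dirs=2))
    print(local_dir(host, remote, cut_dirs=1))
```

Because only leading components are dropped, a `beta/' subdirectory below `pub/xemacs/' survives as `xemacs/beta/' with `-nH --cut-dirs=1'.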
|
|
|
|
|
|
|
|
|
|
`-P PREFIX'
|
|
|
|
|
`--directory-prefix=PREFIX'
|
|
|
|
|
Set directory prefix to PREFIX. The "directory prefix" is the
|
|
|
|
|
directory where all other files and subdirectories will be saved
|
|
|
|
|
to, i.e. the top of the retrieval tree. The default is `.' (the
|
|
|
|
|
current directory).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: HTTP Options, Next: FTP Options, Prev: Directory Options, Up: Invoking
|
|
|
|
|
|
|
|
|
|
HTTP Options
|
|
|
|
|
============
|
|
|
|
|
|
|
|
|
|
`-E'
|
|
|
|
|
`--html-extension'
|
|
|
|
|
If a file of type `text/html' is downloaded and the URL does not
|
|
|
|
|
end with the regexp "\.[Hh][Tt][Mm][Ll]?", this option will cause
|
|
|
|
|
the suffix `.html' to be appended to the local filename. This is
|
|
|
|
|
useful, for instance, when you're mirroring a remote site
|
|
|
|
|
that uses `.asp' pages, but you want the mirrored pages to be
|
|
|
|
|
viewable on your stock Apache server. Another good use for this
|
|
|
|
|
is when you're downloading the output of CGIs. A URL like
|
|
|
|
|
`http://site.com/article.cgi?25' will be saved as
|
|
|
|
|
`article.cgi?25.html'.
|
|
|
|
|
|
|
|
|
|
Note that filenames changed in this way will be re-downloaded
|
|
|
|
|
every time you re-mirror a site, because wget can't tell that the
|
|
|
|
|
local `X.html' file corresponds to remote URL `X' (since it
|
|
|
|
|
doesn't yet know that the URL produces output of type `text/html').
|
|
|
|
|
To prevent this re-downloading, you must use `-k' and `-K' so
|
|
|
|
|
that the original version of the file will be saved as `X.orig'
|
|
|
|
|
(*Note Recursive Retrieval Options::).
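
The renaming rule of `-E' can be sketched in Python with the regexp quoted above. This is an illustration, not Wget's code: the function name is hypothetical, and a `$' anchor is added to the pattern to express "does not end with".

```python
import re

# The regexp from the description, anchored at the end of the name.
HTML_SUFFIX = re.compile(r"\.[Hh][Tt][Mm][Ll]?$")


def local_filename(name, content_type, html_extension=True):
    """Append `.html' to NAME when -E applies, otherwise return it as is."""
    if (html_extension and content_type == "text/html"
            and not HTML_SUFFIX.search(name)):
        return name + ".html"
    return name


if __name__ == "__main__":
    print(local_filename("article.cgi?25", "text/html"))
    print(local_filename("index.html", "text/html"))
    print(local_filename("logo.gif", "image/gif"))
```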
|
|
|
|
|
|
|
|
|
|
`--http-user=USER'
|
|
|
|
|
`--http-passwd=PASSWORD'
|
|
|
|
|
Specify the username USER and password PASSWORD on an HTTP server.
|
|
|
|
|
According to the type of the challenge, Wget will encode them
|
|
|
|
|
using either the `basic' (insecure) or the `digest' authentication
|
|
|
|
|
scheme.
|
|
|
|
|
|
|
|
|
|
Another way to specify username and password is in the URL itself
|
|
|
|
|
(*Note URL Format::). For more information about security issues
|
|
|
|
|
with Wget, *Note Security Considerations::.
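
To see why the `basic' scheme is called insecure above, it helps to look at what is actually sent. The Python sketch below builds the value of a standard HTTP `Authorization' header for basic authentication; it is a generic illustration of the scheme, not Wget's code, and the function name and sample credentials are made up.

```python
import base64


def basic_authorization(user, password):
    """Value of the Authorization header under the `basic' scheme."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


if __name__ == "__main__":
    header = basic_authorization("user", "password")
    print(header)
    # Base64 is an encoding, not encryption: anyone can reverse it.
    print(base64.b64decode(header.split()[1]).decode("utf-8"))
```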
|
|
|
|
|
|
|
|
|
|
`-C on/off'
|
|
|
|
|
`--cache=on/off'
|
|
|
|
|
When set to off, disable server-side cache. In this case, Wget
|
|
|
|
|
will send the remote server an appropriate directive (`Pragma:
|
|
|
|
|
no-cache') to get the file from the remote service, rather than
|
|
|
|
|
returning the cached version. This is especially useful for
|
|
|
|
|
retrieving and flushing out-of-date documents on proxy servers.
|
|
|
|
|
|
|
|
|
|
Caching is allowed by default.
|
|
|
|
|
|
|
|
|
|
`--ignore-length'
|
|
|
|
|
Unfortunately, some HTTP servers (CGI programs, to be more
|
|
|
|
|
precise) send out bogus `Content-Length' headers, which makes Wget
|
|
|
|
|
go wild, as it thinks that not all of the document was retrieved. You can
|
|
|
|
|
spot this syndrome if Wget retries getting the same document again
|
|
|
|
|
and again, each time claiming that the (otherwise normal)
|
|
|
|
|
connection has closed on the very same byte.
|
|
|
|
|
|
|
|
|
|
With this option, Wget will ignore the `Content-Length' header--as
|
|
|
|
|
if it never existed.
|
|
|
|
|
|
|
|
|
|
`--header=ADDITIONAL-HEADER'
|
|
|
|
|
Define an ADDITIONAL-HEADER to be passed to the HTTP servers.
|
|
|
|
|
Headers must contain a `:' preceded by one or more non-blank
|
|
|
|
|
characters, and must not contain newlines.
|
|
|
|
|
|
|
|
|
|
You may define more than one additional header by specifying
|
|
|
|
|
`--header' more than once.
|
|
|
|
|
|
|
|
|
|
wget --header='Accept-Charset: iso-8859-2' \
|
|
|
|
|
--header='Accept-Language: hr' \
|
|
|
|
|
http://fly.cc.fer.hr/
|
|
|
|
|
|
|
|
|
|
Specification of an empty string as the header value will clear all
|
|
|
|
|
previous user-defined headers.
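
The rules for `--header' values can be expressed as a small validity check. The Python sketch below follows the description above (a `:' preceded by one or more non-blank characters, no newlines, an empty string clears the list); it is not Wget's implementation, and the function names are hypothetical.

```python
def valid_header(header):
    """Check a --header value against the rules described above."""
    if "\n" in header:
        return False                      # headers must not contain newlines
    name, colon, _value = header.partition(":")
    # There must be a ':' and a non-empty name without blanks before it.
    return bool(colon) and name != "" and not any(c.isspace() for c in name)


def add_header(headers, header):
    """Return the user-defined header list after one more --header."""
    if header == "":
        return []                         # empty string clears earlier headers
    if not valid_header(header):
        raise ValueError(f"invalid header: {header!r}")
    return headers + [header]


if __name__ == "__main__":
    headers = []
    for value in ["Accept-Charset: iso-8859-2", "Accept-Language: hr"]:
        headers = add_header(headers, value)
    print(headers)
```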
|
|
|
|
|
|
|
|
|
|
`--proxy-user=USER'
|
|
|
|
|
`--proxy-passwd=PASSWORD'
|
|
|
|
|
Specify the username USER and password PASSWORD for authentication
|
|
|
|
|
on a proxy server. Wget will encode them using the `basic'
|
|
|
|
|
authentication scheme.
|
|
|
|
|
|
|
|
|
|
`--referer=URL'
|
|
|
|
|
Include `Referer: URL' header in HTTP request. Useful for
|
|
|
|
|
retrieving documents with server-side processing that assume they
|
|
|
|
|
are always being retrieved by interactive web browsers and only
|
|
|
|
|
come out properly when Referer is set to one of the pages that
|
|
|
|
|
point to them.
|
|
|
|
|
|
1999-12-02 02:42:23 -05:00
|
|
|
|
`-s'
|
|
|
|
|
`--save-headers'
|
|
|
|
|
Save the headers sent by the HTTP server to the file, preceding the
|
|
|
|
|
actual contents, with an empty line as the separator.
|
|
|
|
|
|
|
|
|
|
`-U AGENT-STRING'
|
|
|
|
|
`--user-agent=AGENT-STRING'
|
|
|
|
|
Identify as AGENT-STRING to the HTTP server.
|
|
|
|
|
|
|
|
|
|
The HTTP protocol allows the clients to identify themselves using a
|
|
|
|
|
`User-Agent' header field. This enables distinguishing the WWW
|
|
|
|
|
software, usually for statistical purposes or for tracing of
|
|
|
|
|
protocol violations. Wget normally identifies as `Wget/VERSION',
|
|
|
|
|
VERSION being the current version number of Wget.
|
|
|
|
|
|
|
|
|
|
However, some sites have been known to impose the policy of
|
|
|
|
|
tailoring the output according to the `User-Agent'-supplied
|
|
|
|
|
information. While conceptually this is not such a bad idea, it
|
|
|
|
|
has been abused by servers denying information to clients other
|
|
|
|
|
than `Mozilla' or Microsoft `Internet Explorer'. This option
|
|
|
|
|
allows you to change the `User-Agent' line issued by Wget. Use of
|
|
|
|
|
this option is discouraged, unless you really know what you are
|
|
|
|
|
doing.
|
|
|
|
|
|
|
|
|
|
*NOTE* that Netscape Communications Corp. has claimed that false
|
|
|
|
|
transmissions of `Mozilla' as the `User-Agent' are a copyright
|
|
|
|
|
infringement, which will be prosecuted. *DO NOT* misrepresent
|
|
|
|
|
Wget as Mozilla.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: FTP Options, Next: Recursive Retrieval Options, Prev: HTTP Options, Up: Invoking
|
|
|
|
|
|
|
|
|
|
FTP Options
|
|
|
|
|
===========
|
|
|
|
|
|
|
|
|
|
`--retr-symlinks'
|
2000-10-09 18:43:11 -04:00
|
|
|
|
Usually, when retrieving FTP directories recursively and a symbolic
|
|
|
|
|
link is encountered, the linked-to file is not downloaded.
|
|
|
|
|
Instead, a matching symbolic link is created on the local
|
|
|
|
|
filesystem. The pointed-to file will not be downloaded unless
|
|
|
|
|
this recursive retrieval would have encountered it separately and
|
|
|
|
|
downloaded it anyway.
|
|
|
|
|
|
|
|
|
|
When `--retr-symlinks' is specified, however, symbolic links are
|
|
|
|
|
traversed and the pointed-to files are retrieved. At this time,
|
|
|
|
|
this option does not cause wget to traverse symlinks to
|
|
|
|
|
directories and recurse through them, but in the future it should
|
|
|
|
|
be enhanced to do this.
|
|
|
|
|
|
|
|
|
|
Note that when retrieving a file (not a directory) because it was
|
|
|
|
|
specified on the commandline, rather than because it was recursed
|
|
|
|
|
to, this option has no effect. Symbolic links are always
|
|
|
|
|
traversed in this case.
|
1999-12-02 02:42:23 -05:00
|
|
|
|
|
|
|
|
|
`-g on/off'
|
|
|
|
|
`--glob=on/off'
|
|
|
|
|
Turn FTP globbing on or off. Globbing means you may use the
|
|
|
|
|
shell-like special characters ("wildcards"), like `*', `?', `['
|
|
|
|
|
and `]' to retrieve more than one file from the same directory at
|
|
|
|
|
once, like:
|
|
|
|
|
|
|
|
|
|
wget ftp://gnjilux.cc.fer.hr/*.msg
|
|
|
|
|
|
|
|
|
|
By default, globbing will be turned on if the URL contains a
|
|
|
|
|
globbing character. This option may be used to turn globbing on
|
|
|
|
|
or off permanently.
|
|
|
|
|
|
|
|
|
|
You may have to quote the URL to protect it from being expanded by
|
|
|
|
|
your shell. Globbing makes Wget look for a directory listing,
|
|
|
|
|
which is system-specific. This is why it currently works only
|
|
|
|
|
with Unix FTP servers (and the ones emulating Unix `ls' output).
|
|
|
|
|
|
|
|
|
|
`--passive-ftp'
|
|
|
|
|
Use the "passive" FTP retrieval scheme, in which the client
|
|
|
|
|
initiates the data connection. This is sometimes required for FTP
|
|
|
|
|
to work behind firewalls.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Recursive Retrieval Options, Next: Recursive Accept/Reject Options, Prev: FTP Options, Up: Invoking
|
|
|
|
|
|
|
|
|
|
Recursive Retrieval Options
|
|
|
|
|
===========================
|
|
|
|
|
|
|
|
|
|
`-r'
|
|
|
|
|
`--recursive'
|
|
|
|
|
Turn on recursive retrieving. *Note Recursive Retrieval:: for more
|
|
|
|
|
details.
|
|
|
|
|
|
|
|
|
|
`-l DEPTH'
|
|
|
|
|
`--level=DEPTH'
|
|
|
|
|
Specify recursion maximum depth level DEPTH (*Note Recursive
|
|
|
|
|
Retrieval::). The default maximum depth is 5.
|
|
|
|
|
|
|
|
|
|
`--delete-after'
|
|
|
|
|
This option tells Wget to delete every single file it downloads,
|
|
|
|
|
*after* having done so. It is useful for pre-fetching popular
|
|
|
|
|
pages through a proxy, e.g.:
|
|
|
|
|
|
|
|
|
|
wget -r -nd --delete-after http://whatever.com/~popular/page/
|
|
|
|
|
|
|
|
|
|
The `-r' option is to retrieve recursively, and `-nd' not to
|
|
|
|
|
create directories.
|
|
|
|
|
|
|
|
|
|
`-k'
|
|
|
|
|
`--convert-links'
|
|
|
|
|
Convert the non-relative links to relative ones locally. Only the
|
|
|
|
|
references to the documents actually downloaded will be converted;
|
|
|
|
|
the rest will be left unchanged.
|
|
|
|
|
|
|
|
|
|
Note that only at the end of the download can Wget know which
|
|
|
|
|
links have been downloaded. Because of that, much of the work
|
|
|
|
|
done by `-k' will be performed at the end of the downloads.
|
|
|
|
|
|
2000-02-29 20:03:39 -05:00
|
|
|
|
`-K'
|
|
|
|
|
`--backup-converted'
|
2000-03-11 01:48:06 -05:00
|
|
|
|
When converting a file, back up the original version with a `.orig'
|
|
|
|
|
suffix. Affects the behavior of `-N' (*Note HTTP Time-Stamping
|
|
|
|
|
Internals::).
|
2000-02-29 20:03:39 -05:00
|
|
|
|
|
1999-12-02 02:42:23 -05:00
|
|
|
|
`-m'
|
|
|
|
|
`--mirror'
|
|
|
|
|
Turn on options suitable for mirroring. This option turns on
|
|
|
|
|
recursion and time-stamping, sets infinite recursion depth and
|
|
|
|
|
keeps FTP directory listings. It is currently equivalent to `-r
|
|
|
|
|
-N -l inf -nr'.
|
|
|
|
|
|
|
|
|
|
`-nr'
|
|
|
|
|
`--dont-remove-listing'
|
|
|
|
|
Don't remove the temporary `.listing' files generated by FTP
|
|
|
|
|
retrievals. Normally, these files contain the raw directory
|
|
|
|
|
listings received from FTP servers. Not removing them can be
|
|
|
|
|
useful to access the full remote file list when running a mirror,
|
|
|
|
|
or for debugging purposes.
|
|
|
|
|
|
2000-08-30 07:26:21 -04:00
|
|
|
|
`-p'
|
|
|
|
|
`--page-requisites'
|
|
|
|
|
This option causes wget to download all the files that are
|
|
|
|
|
necessary to properly display a given HTML page. This includes
|
|
|
|
|
such things as inlined images, sounds, and referenced stylesheets.
|
|
|
|
|
|
|
|
|
|
Ordinarily, when downloading a single HTML page, any requisite
|
|
|
|
|
documents that may be needed to display it properly are not
|
|
|
|
|
downloaded. Using `-r' together with `-l' can help, but since
|
|
|
|
|
wget does not ordinarily distinguish between external and inlined
|
|
|
|
|
documents, one is generally left with "leaf documents" that are
|
|
|
|
|
missing their requisites.
|
|
|
|
|
|
|
|
|
|
For instance, say document `1.html' contains an `<IMG>' tag
|
|
|
|
|
referencing `1.gif' and an `<A>' tag pointing to external document
|
|
|
|
|
`2.html'. Say that `2.html' is the same but that its image is
|
|
|
|
|
`2.gif' and it links to `3.html'. Say this continues up to some
|
|
|
|
|
arbitrarily high number.
|
|
|
|
|
|
|
|
|
|
If one executes the command:
|
|
|
|
|
|
|
|
|
|
wget -r -l 2 http://SITE/1.html
|
|
|
|
|
|
|
|
|
|
then `1.html', `1.gif', `2.html', `2.gif', and `3.html' will be
|
|
|
|
|
downloaded. As you can see, `3.html' is without its requisite
|
|
|
|
|
`3.gif' because wget is simply counting the number of hops (up to
|
|
|
|
|
2) away from `1.html' in order to determine where to stop the
|
|
|
|
|
recursion. However, with this command:
|
|
|
|
|
|
|
|
|
|
wget -r -l 2 -p http://SITE/1.html
|
|
|
|
|
|
|
|
|
|
all the above files *and* `3.html''s requisite `3.gif' will be
|
|
|
|
|
downloaded. Similarly,
|
|
|
|
|
|
|
|
|
|
wget -r -l 1 -p http://SITE/1.html
|
|
|
|
|
|
|
|
|
|
will cause `1.html', `1.gif', `2.html', and `2.gif' to be
|
|
|
|
|
downloaded. One might think that:
|
|
|
|
|
|
|
|
|
|
wget -r -l 0 -p http://SITE/1.html
|
|
|
|
|
|
|
|
|
|
would download just `1.html' and `1.gif', but unfortunately this
|
|
|
|
|
is not the case, because `-l 0' is equivalent to `-l inf' - that
|
|
|
|
|
is, infinite recursion. To download a single HTML page (or a
|
|
|
|
|
handful of them, all specified on the commandline or in a `-i' URL
|
|
|
|
|
input file) and its requisites, simply leave off `-r' and `-l':
|
|
|
|
|
|
|
|
|
|
wget -p http://SITE/1.html
|
|
|
|
|
|
|
|
|
|
Note that wget will behave as if `-r' had been specified, but only
|
|
|
|
|
that single page and its requisites will be downloaded. Links
|
|
|
|
|
from that page to external documents will not be followed.
|
|
|
|
|
Actually, to download a single page and all its requisites (even
|
|
|
|
|
if they exist on separate websites), and make sure the lot
|
|
|
|
|
displays properly locally, this author likes to use a few options
|
|
|
|
|
in addition to `-p':
|
|
|
|
|
|
|
|
|
|
wget -H -k -K -nh -p http://SITE/DOCUMENT
|
|
|
|
|
|
|
|
|
|
To finish off this topic, it's worth knowing that wget's idea of an
|
|
|
|
|
external document link is any URL specified in an `<A>' tag, an
|
|
|
|
|
`<AREA>' tag, or a `<LINK>' tag other than `<LINK
|
|
|
|
|
REL="stylesheet">'.
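The hop counting described for `-r -l DEPTH' with and without `-p' can be modelled in a few lines of shell. This is a toy model of the `1.html', `2.html', ... chain from the example above, not a description of how Wget is implemented; the function name `list_downloads' and its arguments are hypothetical. It treats `N.html' as N-1 hops from `1.html' and `N.gif' as N hops away, and assumes `-p' fetches the image of every page that is downloaded.

```shell
# Hypothetical model: list_downloads DEPTH yes|no
# Prints the files fetched starting from 1.html, where each N.html
# inlines N.gif and links to (N+1).html.
list_downloads() {
    depth=$1 with_p=$2
    page=1
    # N.html is N-1 hops away from 1.html
    while [ $((page - 1)) -le "$depth" ]; do
        printf '%s.html\n' "$page"
        # N.gif is N hops away; with -p it is fetched regardless
        if [ "$page" -le "$depth" ] || [ "$with_p" = yes ]; then
            printf '%s.gif\n' "$page"
        fi
        page=$((page + 1))
    done
}

list_downloads 2 no     # models: wget -r -l 2 http://SITE/1.html
list_downloads 2 yes    # models: wget -r -l 2 -p http://SITE/1.html
```

The first call lists `3.html' without `3.gif'; the second adds `3.gif', matching the behavior described above.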
|
|
|
|
|
|
1999-12-02 02:42:23 -05:00
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Recursive Accept/Reject Options, Prev: Recursive Retrieval Options, Up: Invoking
|
|
|
|
|
|
|
|
|
|
Recursive Accept/Reject Options
|
|
|
|
|
===============================
|
|
|
|
|
|
|
|
|
|
`-A ACCLIST --accept ACCLIST'
|
|
|
|
|
`-R REJLIST --reject REJLIST'
|
|
|
|
|
Specify comma-separated lists of file name suffixes or patterns to
|
|
|
|
|
accept or reject (*Note Types of Files:: for more details).
|
|
|
|
|
|
|
|
|
|
`-D DOMAIN-LIST'
|
|
|
|
|
`--domains=DOMAIN-LIST'
|
|
|
|
|
Set domains to be accepted and DNS looked-up, where DOMAIN-LIST is
|
|
|
|
|
a comma-separated list. Note that it does *not* turn on `-H'.
|
|
|
|
|
This option speeds things up, even if only one host is spanned
|
|
|
|
|
(*Note Domain Acceptance::).
|
|
|
|
|
|
|
|
|
|
`--exclude-domains DOMAIN-LIST'
|
|
|
|
|
Exclude the domains given in a comma-separated DOMAIN-LIST from
|
|
|
|
|
DNS-lookup (*Note Domain Acceptance::).
|
|
|
|
|
|
|
|
|
|
`--follow-ftp'
|
|
|
|
|
Follow FTP links from HTML documents. Without this option, Wget
|
|
|
|
|
will ignore all the FTP links.
|
|
|
|
|
|
2000-03-11 01:48:06 -05:00
|
|
|
|
`--follow-tags=LIST'
|
|
|
|
|
Wget has an internal table of HTML tag / attribute pairs that it
|
|
|
|
|
considers when looking for linked documents during a recursive
|
|
|
|
|
retrieval. If a user wants only a subset of those tags to be
|
|
|
|
|
considered, however, he or she should specify such tags in a
|
|
|
|
|
comma-separated LIST with this option.
|
|
|
|
|
|
|
|
|
|
`-G LIST'
|
|
|
|
|
`--ignore-tags=LIST'
|
|
|
|
|
This is the opposite of the `--follow-tags' option. To skip
|
|
|
|
|
certain HTML tags when recursively looking for documents to
|
2000-08-30 07:26:21 -04:00
|
|
|
|
download, specify them in a comma-separated LIST.
|
|
|
|
|
|
|
|
|
|
In the past, the `-G' option was the best bet for downloading a
|
|
|
|
|
single page and its requisites, using a commandline like:
|
2000-03-11 01:48:06 -05:00
|
|
|
|
|
|
|
|
|
wget -Ga,area -H -k -K -nh -r http://SITE/DOCUMENT
|
|
|
|
|
|
2000-08-30 07:26:21 -04:00
|
|
|
|
However, the author of this option came across a page with tags
|
|
|
|
|
like `<LINK REL="home" HREF="/">' and came to the realization that
|
|
|
|
|
`-G' was not enough. One can't just tell wget to ignore `<LINK>',
|
|
|
|
|
because then stylesheets will not be downloaded. Now the best bet
|
|
|
|
|
for downloading a single page and its requisites is the dedicated
|
|
|
|
|
`--page-requisites' option.
|
|
|
|
|
|
1999-12-02 02:42:23 -05:00
|
|
|
|
`-H'
|
|
|
|
|
`--span-hosts'
|
|
|
|
|
Enable spanning across hosts when doing recursive retrieving
|
|
|
|
|
(*Note All Hosts::).
|
|
|
|
|
|
2000-03-11 01:48:06 -05:00
|
|
|
|
`-L'
|
|
|
|
|
`--relative'
|
|
|
|
|
Follow relative links only. Useful for retrieving a specific home
|
|
|
|
|
page without any distractions, not even those from the same hosts
|
|
|
|
|
(*Note Relative Links::).
|
|
|
|
|
|
1999-12-02 02:42:23 -05:00
|
|
|
|
`-I LIST'
|
|
|
|
|
`--include-directories=LIST'
|
|
|
|
|
Specify a comma-separated list of directories you wish to follow
|
|
|
|
|
when downloading (*Note Directory-Based Limits:: for more
|
|
|
|
|
details.) Elements of LIST may contain wildcards.
|
|
|
|
|
|
|
|
|
|
`-X LIST'
|
|
|
|
|
`--exclude-directories=LIST'
|
|
|
|
|
Specify a comma-separated list of directories you wish to exclude
|
|
|
|
|
from download (*Note Directory-Based Limits:: for more details.)
|
|
|
|
|
Elements of LIST may contain wildcards.
|
|
|
|
|
|
|
|
|
|
`-nh'
|
|
|
|
|
`--no-host-lookup'
|
|
|
|
|
Disable the time-consuming DNS lookup of almost all hosts (*Note
|
|
|
|
|
Host Checking::).
|
|
|
|
|
|
|
|
|
|
`-np'
|
2000-02-29 20:03:39 -05:00
|
|
|
|
|
1999-12-02 02:42:23 -05:00
|
|
|
|
`--no-parent'
|
|
|
|
|
Do not ever ascend to the parent directory when retrieving
|
|
|
|
|
recursively. This is a useful option, since it guarantees that
|
|
|
|
|
only the files *below* a certain hierarchy will be downloaded.
|
|
|
|
|
*Note Directory-Based Limits:: for more details.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Recursive Retrieval, Next: Following Links, Prev: Invoking, Up: Top
|
|
|
|
|
|
|
|
|
|
Recursive Retrieval
|
|
|
|
|
*******************
|
|
|
|
|
|
|
|
|
|
GNU Wget is capable of traversing parts of the Web (or a single HTTP
|
|
|
|
|
or FTP server), depth-first following links and directory structure.
|
|
|
|
|
This is called "recursive" retrieving, or "recursion".
|
|
|
|
|
|
|
|
|
|
With HTTP URLs, Wget retrieves and parses the HTML from the given
|
|
|
|
|
URL, then retrieves the files the HTML document was referring
|
|
|
|
|
to, through markups like `href', or `src'. If the freshly downloaded
|
|
|
|
|
file is also of type `text/html', it will be parsed and followed
|
|
|
|
|
further.
|
|
|
|
|
|
|
|
|
|
The maximum "depth" to which the retrieval may descend is specified
|
|
|
|
|
with the `-l' option (the default maximum depth is five layers). *Note
|
|
|
|
|
Recursive Retrieval::.
|
|
|
|
|
|
|
|
|
|
When retrieving an FTP URL recursively, Wget will retrieve all the
|
|
|
|
|
data from the given directory tree (including the subdirectories up to
|
|
|
|
|
the specified depth) on the remote server, creating its mirror image
|
|
|
|
|
locally. FTP retrieval is also limited by the `depth' parameter.
|
|
|
|
|
|
|
|
|
|
By default, Wget will create a local directory tree, corresponding to
|
|
|
|
|
the one found on the remote server.
|
|
|
|
|
|
|
|
|
|
Recursive retrieving can find a number of applications, the most
|
|
|
|
|
important of which is mirroring. It is also useful for WWW
|
|
|
|
|
presentations, and any other opportunities where slow network
|
|
|
|
|
connections should be bypassed by storing the files locally.
|
|
|
|
|
|
|
|
|
|
You should be warned that invoking recursion may cause grave
|
|
|
|
|
overloading on your system, because of the fast exchange of data
|
|
|
|
|
through the network; all of this may hamper other users' work. The
|
|
|
|
|
same stands for the foreign server you are mirroring--the more requests
|
|
|
|
|
it gets in a row, the greater is its load.
|
|
|
|
|
|
2000-03-02 16:17:47 -05:00
|
|
|
|
Careless retrieving can also fill your file system uncontrollably,
|
1999-12-02 02:42:23 -05:00
|
|
|
|
which can grind the machine to a halt.
|
|
|
|
|
|
|
|
|
|
The load can be minimized by lowering the maximum recursion level
|
|
|
|
|
(`-l') and/or by lowering the number of retries (`-t'). You may also
|
|
|
|
|
consider using the `-w' option to slow down your requests to the remote
|
|
|
|
|
servers, as well as the numerous options to narrow the number of
|
|
|
|
|
followed links (*Note Following Links::).
|
|
|
|
|
|
|
|
|
|
Recursive retrieval is a good thing when used properly. Please take
|
|
|
|
|
all precautions not to wreak havoc through carelessness.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Following Links, Next: Time-Stamping, Prev: Recursive Retrieval, Up: Top
|
|
|
|
|
|
|
|
|
|
Following Links
|
|
|
|
|
***************
|
|
|
|
|
|
2000-03-11 01:48:06 -05:00
|
|
|
|
When retrieving recursively, one does not wish to retrieve loads of
|
|
|
|
|
unnecessary data. Most of the time the users bear in mind exactly what
|
|
|
|
|
they want to download, and want Wget to follow only specific links.
|
1999-12-02 02:42:23 -05:00
|
|
|
|
|
|
|
|
|
For example, if you wish to download the music archive from
|
|
|
|
|
`fly.cc.fer.hr', you will not want to download all the home pages that
|
|
|
|
|
happen to be referenced by an obscure part of the archive.
|
|
|
|
|
|
|
|
|
|
Wget possesses several mechanisms that allow you to fine-tune which
|
|
|
|
|
links it will follow.
|
|
|
|
|
|
|
|
|
|
* Menu:
|
|
|
|
|
|
|
|
|
|
* Relative Links:: Follow relative links only.
|
|
|
|
|
* Host Checking:: Follow links on the same host.
|
|
|
|
|
* Domain Acceptance:: Check on a list of domains.
|
|
|
|
|
* All Hosts:: No host restrictions.
|
|
|
|
|
* Types of Files:: Getting only certain files.
|
|
|
|
|
* Directory-Based Limits:: Getting only certain directories.
|
|
|
|
|
* FTP Links:: Following FTP links.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Relative Links, Next: Host Checking, Prev: Following Links, Up: Following Links
|
|
|
|
|
|
|
|
|
|
Relative Links
|
|
|
|
|
==============
|
|
|
|
|
|
|
|
|
|
When only relative links are followed (option `-L'), recursive
|
|
|
|
|
retrieving will never span hosts. No time-expensive DNS-lookups will
|
|
|
|
|
be performed, and the process will be very fast, with the minimum
|
|
|
|
|
strain on the network. This will often suit your needs, especially when
|
|
|
|
|
mirroring the output of various `x2html' converters, since they
|
|
|
|
|
generally output relative links.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Host Checking, Next: Domain Acceptance, Prev: Relative Links, Up: Following Links
|
|
|
|
|
|
|
|
|
|
Host Checking
|
|
|
|
|
=============
|
|
|
|
|
|
|
|
|
|
The drawback of following the relative links solely is that humans
|
|
|
|
|
often tend to mix them with absolute links to the very same host, and
|
|
|
|
|
the very same page. In this mode (which is the default mode for
|
2000-03-02 16:17:47 -05:00
|
|
|
|
following links) all URLs that refer to the same host will be retrieved.
|
1999-12-02 02:42:23 -05:00
|
|
|
|
|
|
|
|
|
The problem with this option is the aliases of the hosts and
|
|
|
|
|
domains. Thus there is no way for Wget to know that `regoc.srce.hr' and
|
|
|
|
|
`www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the same as
|
|
|
|
|
`fly.cc.etf.hr'. Whenever an absolute link is encountered, the host is
|
|
|
|
|
DNS-looked-up with `gethostbyname' to check whether we are maybe
|
|
|
|
|
dealing with the same hosts. Although the results of `gethostbyname'
|
|
|
|
|
are cached, it is still a great slowdown, e.g. when dealing with large
|
|
|
|
|
indices of home pages on different hosts (because each of the hosts
|
2000-03-02 16:17:47 -05:00
|
|
|
|
must be DNS-resolved to see whether it just *might* be an alias of the
|
1999-12-02 02:42:23 -05:00
|
|
|
|
starting host).
|
|
|
|
|
|
|
|
|
|
To avoid the overhead you may use `-nh', which will turn off
|
|
|
|
|
DNS-resolving and make Wget compare hosts literally. This will make
|
|
|
|
|
things run much faster, but also much less reliably (e.g. `www.srce.hr'
|
|
|
|
|
and `regoc.srce.hr' will be flagged as different hosts).
|
|
|
|
|
|
2000-03-02 16:17:47 -05:00
|
|
|
|
Note that modern HTTP servers allow one IP address to host several
|
|
|
|
|
"virtual servers", each having its own directory hierarchy. Such
|
1999-12-02 02:42:23 -05:00
|
|
|
|
"servers" are distinguished by their hostnames (all of which point to
|
|
|
|
|
the same IP address); for this to work, a client must send a `Host'
|
|
|
|
|
header, which is what Wget does. However, in that case Wget *must not*
|
|
|
|
|
try to divine a host's "real" address, nor try to use the same hostname
|
|
|
|
|
for each access, i.e. `-nh' must be turned on.
|
|
|
|
|
|
2000-03-02 16:17:47 -05:00
|
|
|
|
In other words, the `-nh' option must be used to enable the
|
1999-12-02 02:42:23 -05:00
|
|
|
|
retrieval from virtual servers distinguished by their hostnames. As the
|
|
|
|
|
number of such server setups grows, the behavior of `-nh' may become the
|
|
|
|
|
default in the future.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Domain Acceptance, Next: All Hosts, Prev: Host Checking, Up: Following Links
|
|
|
|
|
|
|
|
|
|
Domain Acceptance
|
|
|
|
|
=================
|
|
|
|
|
|
|
|
|
|
With the `-D' option you may specify the domains that will be
|
|
|
|
|
followed. The hosts the domain of which is not in this list will not be
|
|
|
|
|
DNS-resolved. Thus you can specify `-Dmit.edu' just to make sure that
|
|
|
|
|
*nothing outside of MIT gets looked up*. This is very important and
|
|
|
|
|
useful. It also means that `-D' does *not* imply `-H' (span all
|
|
|
|
|
hosts), which must be specified explicitly. Feel free to use this
|
|
|
|
|
option since it will speed things up, with almost all the reliability
|
|
|
|
|
of checking for all hosts. Thus you could invoke
|
|
|
|
|
|
|
|
|
|
wget -r -D.hr http://fly.cc.fer.hr/
|
|
|
|
|
|
|
|
|
|
to make sure that only the hosts in `.hr' domain get DNS-looked-up
|
|
|
|
|
for being equal to `fly.cc.fer.hr'. So `fly.cc.etf.hr' will be checked
|
|
|
|
|
(only once!) and found equal, but `www.gnu.ai.mit.edu' will not even be
|
|
|
|
|
checked.
|
|
|
|
|
|
|
|
|
|
Of course, domain acceptance can be used to limit the retrieval to
|
|
|
|
|
particular domains with spanning of hosts in them, but then you must
|
|
|
|
|
specify `-H' explicitly. E.g.:
|
|
|
|
|
|
|
|
|
|
wget -r -H -Dmit.edu,stanford.edu http://www.mit.edu/
|
|
|
|
|
|
|
|
|
|
will start with `http://www.mit.edu/', following links across MIT
|
|
|
|
|
and Stanford.
|
|
|
|
|
|
|
|
|
|
If there are domains you want to exclude specifically, you can do it
|
|
|
|
|
with `--exclude-domains', which accepts the same type of arguments as
|
|
|
|
|
`-D', but will *exclude* all the listed domains. For example, if you
|
|
|
|
|
want to download all the hosts from `foo.edu' domain, with the
|
|
|
|
|
exception of `sunsite.foo.edu', you can do it like this:
|
|
|
|
|
|
|
|
|
|
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu http://www.foo.edu/
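How `-D' and `--exclude-domains' combine can be sketched in shell. This is a rough model of the behavior described above, assuming a host belongs to a domain when its name ends with that domain; it is not Wget's actual matching code, and the function names `in_domain_list' and `host_allowed' are hypothetical.

```shell
# Hypothetical helper: succeeds if HOST ends with any domain in the
# comma-separated LIST.  Usage: in_domain_list HOST LIST
in_domain_list() {
    host=$1
    found=1
    old_ifs=$IFS
    IFS=,
    set -f                      # no filename expansion while splitting
    for dom in $2; do
        case $host in
            *"$dom") found=0 ;; # host name ends with this domain
        esac
    done
    set +f
    IFS=$old_ifs
    return $found
}

# Hypothetical helper: HOST is followed if it is in the accepted list
# and not in the excluded list.  Usage: host_allowed HOST ACCEPT EXCLUDE
host_allowed() {
    in_domain_list "$1" "$2" && ! in_domain_list "$1" "$3"
}

host_allowed www.foo.edu foo.edu sunsite.foo.edu && echo "followed"
host_allowed sunsite.foo.edu foo.edu sunsite.foo.edu || echo "excluded"
```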
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: All Hosts, Next: Types of Files, Prev: Domain Acceptance, Up: Following Links
|
|
|
|
|
|
|
|
|
|
All Hosts
|
|
|
|
|
=========
|
|
|
|
|
|
|
|
|
|
When `-H' is specified without `-D', all hosts are freely spanned.
|
|
|
|
|
There are no restrictions whatsoever as to what part of the net Wget
|
|
|
|
|
will go to fetch documents, other than maximum retrieval depth. If a
|
|
|
|
|
page references `www.yahoo.com', so be it. Such an option is rarely
|
|
|
|
|
useful by itself.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
File: wget.info, Node: Types of Files, Next: Directory-Based Limits, Prev: All Hosts, Up: Following Links
|
|
|
|
|
|
|
|
|
|
Types of Files
|
|
|
|
|
==============
|
|
|
|
|
|
|
|
|
|
When downloading material from the web, you will often want to
|
|
|
|
|
restrict the retrieval to only certain file types. For example, if you
|
2000-03-02 16:17:47 -05:00
|
|
|
|
are interested in downloading GIFs, you will not be overjoyed to get
|
|
|
|
|
loads of PostScript documents, and vice versa.
|
1999-12-02 02:42:23 -05:00
|
|
|
|
|
|
|
|
|
Wget offers two options to deal with this problem. Each option
|
|
|
|
|
description lists a short name, a long name, and the equivalent command
|
|
|
|
|
in `.wgetrc'.
|
|
|
|
|
|
|
|
|
|
`-A ACCLIST'
|
|
|
|
|
`--accept ACCLIST'
|
|
|
|
|
`accept = ACCLIST'
|
|
|
|
|
The argument to `--accept' option is a list of file suffixes or
|
|
|
|
|
patterns that Wget will download during recursive retrieval. A
|
|
|
|
|
suffix is the ending part of a file name, and consists of "normal"
|
|
|
|
|
letters, e.g. `gif' or `.jpg'. A matching pattern contains
|
|
|
|
|
shell-like wildcards, e.g. `books*' or `zelazny*196[0-9]*'.
|
|
|
|
|
|
|
|
|
|
So, specifying `wget -A gif,jpg' will make Wget download only the
|
|
|
|
|
files ending with `gif' or `jpg', i.e. GIFs and JPEGs. On the
|
|
|
|
|
other hand, `wget -A "zelazny*196[0-9]*"' will download only files
|
|
|
|
|
beginning with `zelazny' and containing numbers from 1960 to 1969
|
|
|
|
|
anywhere within. Look up the manual of your shell for a
|
|
|
|
|
description of how pattern matching works.
|
|
|
|
|
|
|
|
|
|
Of course, any number of suffixes and patterns can be combined
|
|
|
|
|
into a comma-separated list, and given as an argument to `-A'.
|
|
|
|
|
|
|
|
|
|
`-R REJLIST'
|
|
|
|
|
`--reject REJLIST'
|
|
|
|
|
`reject = REJLIST'
|
|
|
|
|
The `--reject' option works the same way as `--accept', only its
|
|
|
|
|
logic is the reverse; Wget will download all files *except* the
|
|
|
|
|
ones matching the suffixes (or patterns) in the list.
|
|
|
|
|
|
|
|
|
|
So, if you want to download a whole page except for the cumbersome
|
|
|
|
|
MPEGs and .AU files, you can use `wget -R mpg,mpeg,au'.
|
|
|
|
|
Analogously, to download all files except the ones beginning with
|
|
|
|
|
`bjork', use `wget -R "bjork*"'. The quotes are to prevent
|
|
|
|
|
expansion by the shell.
|
|
|
|
|
|
|
|
|
|
The `-A' and `-R' options may be combined to achieve even better
|
|
|
|
|
fine-tuning of which files to retrieve. E.g. `wget -A "*zelazny*" -R
|
|
|
|
|
.ps' will download all the files having `zelazny' as a part of their
|
2000-03-02 16:17:47 -05:00
|
|
|
|
name, but *not* the PostScript files.
|
1999-12-02 02:42:23 -05:00
|
|
|
|
|
|
|
|
|
Note that these two options do not affect the downloading of HTML
|
|
|
|
|
files; Wget must load all the HTMLs to know where to go at
|
|
|
|
|
all--recursive retrieval would make no sense otherwise.
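The difference between a suffix and a pattern in an accept or reject list can be tried out locally with shell pattern matching. The sketch below is an illustration of the rule described above, not Wget's implementation; the function name `matches_list' is hypothetical, and it assumes an element counts as a pattern when it contains `*', `?' or `[', and as a suffix otherwise.

```shell
# Hypothetical helper: succeeds if NAME matches any element of the
# comma-separated LIST.  Usage: matches_list NAME LIST
matches_list() {
    name=$1
    matched=1
    old_ifs=$IFS
    IFS=,
    set -f                          # keep wildcards in LIST unexpanded
    for elem in $2; do
        case $elem in
            *[*?[]*)                # has a wildcard: match as a pattern
                case $name in
                    $elem) matched=0 ;;
                esac ;;
            *)                      # plain text: match as a suffix
                case $name in
                    *"$elem") matched=0 ;;
                esac ;;
        esac
    done
    set +f
    IFS=$old_ifs
    return $matched
}

matches_list photo.gif gif,jpg && echo "accepted by -A gif,jpg"
matches_list zelazny-1967.txt 'zelazny*196[0-9]*' && echo "accepted by pattern"
matches_list paper.ps gif,jpg || echo "not in list"
```

With a reject list the same test applies, only the outcome is reversed: a file that matches is skipped.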
|
|
|
|
|
|