This is Info file wget.info, produced by Makeinfo version 1.68 from the
|
||
input file ./wget.texi.
|
||
|
||
INFO-DIR-SECTION Net Utilities
|
||
INFO-DIR-SECTION World Wide Web
|
||
START-INFO-DIR-ENTRY
|
||
* Wget: (wget). The non-interactive network downloader.
|
||
END-INFO-DIR-ENTRY
|
||
|
||
This file documents the GNU Wget utility for downloading network
|
||
data.
|
||
|
||
Copyright (C) 1996, 1997, 1998, 2000 Free Software Foundation, Inc.
|
||
|
||
Permission is granted to make and distribute verbatim copies of this
|
||
manual provided the copyright notice and this permission notice are
|
||
preserved on all copies.
|
||
|
||
Permission is granted to copy and distribute modified versions of
|
||
this manual under the conditions for verbatim copying, provided also
|
||
that the sections entitled "Copying" and "GNU General Public License"
|
||
are included exactly as in the original, and provided that the entire
|
||
resulting derived work is distributed under the terms of a permission
|
||
notice identical to this one.
|
||
|
||
|
||
File: wget.info, Node: Top, Next: Overview, Prev: (dir), Up: (dir)
|
||
|
||
Wget 1.5.3+dev
|
||
**************
|
||
|
||
This manual documents version 1.5.3+dev of GNU Wget, the freely
|
||
available utility for network download.
|
||
|
||
Copyright (C) 1996, 1997, 1998 Free Software Foundation, Inc.
|
||
|
||
* Menu:
|
||
|
||
* Overview:: Features of Wget.
|
||
* Invoking:: Wget command-line arguments.
|
||
* Recursive Retrieval:: Description of recursive retrieval.
|
||
* Following Links:: The available methods of chasing links.
|
||
* Time-Stamping:: Mirroring according to time-stamps.
|
||
* Startup File:: Wget's initialization file.
|
||
* Examples:: Examples of usage.
|
||
* Various:: The stuff that doesn't fit anywhere else.
|
||
* Appendices:: Some useful references.
|
||
* Copying:: You may give out copies of Wget.
|
||
* Concept Index:: Topics covered by this manual.
|
||
|
||
|
||
File: wget.info, Node: Overview, Next: Invoking, Prev: Top, Up: Top
|
||
|
||
Overview
|
||
********
|
||
|
||
GNU Wget is a freely available network utility to retrieve files from
|
||
the World Wide Web, using HTTP (Hyper Text Transfer Protocol) and FTP
|
||
(File Transfer Protocol), the two most widely used Internet protocols.
|
||
It has many useful features to make downloading easier, some of them
|
||
being:
|
||
|
||
* Wget is non-interactive, meaning that it can work in the
|
||
background, while the user is not logged on. This allows you to
|
||
start a retrieval and disconnect from the system, letting Wget
|
||
finish the work. By contrast, most Web browsers require the user's
constant presence, which can be a great hindrance when
|
||
transferring a lot of data.
|
||
|
||
* Wget is capable of descending recursively through the structure of
|
||
HTML documents and FTP directory trees, making a local copy of the
|
||
directory hierarchy similar to the one on the remote server. This
|
||
feature can be used to mirror archives and home pages, or traverse
|
||
the web in search of data, like a WWW robot (*Note Robots::). In
|
||
that spirit, Wget understands the `norobots' convention.
|
||
|
||
* File name wildcard matching and recursive mirroring of directories
|
||
are available when retrieving via FTP. Wget can read the
|
||
time-stamp information given by both HTTP and FTP servers, and
|
||
store it locally. Thus Wget can see if the remote file has
|
||
changed since last retrieval, and automatically retrieve the new
|
||
version if it has. This makes Wget suitable for mirroring of FTP
|
||
sites, as well as home pages.
|
||
|
||
* Wget works exceedingly well on slow or unstable connections,
|
||
retrying the document until it is fully retrieved, or until a
|
||
user-specified retry count is surpassed. It will try to resume the
|
||
download from the point of interruption, using `REST' with FTP and
|
||
`Range' with HTTP servers that support them.
|
||
|
||
* By default, Wget supports proxy servers, which can lighten the
|
||
network load, speed up retrieval and provide access behind
|
||
firewalls. However, if you are behind a firewall that requires
|
||
that you use a SOCKS-style gateway, you can get the SOCKS library
and build Wget with support for SOCKS. Wget also supports passive
FTP downloading as an option.
|
||
|
||
* Built-in features offer mechanisms to tune which links you wish to
|
||
follow (*Note Following Links::).
|
||
|
||
* The retrieval is conveniently traced by printing dots, each dot
|
||
representing a fixed amount of data received (1KB by default).
|
||
These representations can be customized to your preferences.
|
||
|
||
* Most of the features are fully configurable, either through
|
||
command line options, or via the initialization file `.wgetrc'
|
||
(*Note Startup File::). Wget allows you to define "global"
|
||
startup files (`/usr/local/etc/wgetrc' by default) for site
|
||
settings.
|
||
|
||
* Finally, GNU Wget is free software. This means that everyone may
|
||
use it, redistribute it and/or modify it under the terms of the
|
||
GNU General Public License, as published by the Free Software
|
||
Foundation (*Note Copying::).
|
||
|
||
|
||
File: wget.info, Node: Invoking, Next: Recursive Retrieval, Prev: Overview, Up: Top
|
||
|
||
Invoking
|
||
********
|
||
|
||
By default, Wget is very simple to invoke. The basic syntax is:
|
||
|
||
wget [OPTION]... [URL]...
|
||
|
||
Wget will simply download all the URLs specified on the command
|
||
line. URL is a "Uniform Resource Locator", as defined below.
|
||
|
||
However, you may wish to change some of the default parameters of
|
||
Wget. You can do it in two ways: permanently, adding the appropriate
|
||
command to `.wgetrc' (*Note Startup File::), or specifying it on the
|
||
command line.
|
||
|
||
* Menu:
|
||
|
||
* URL Format::
|
||
* Option Syntax::
|
||
* Basic Startup Options::
|
||
* Logging and Input File Options::
|
||
* Download Options::
|
||
* Directory Options::
|
||
* HTTP Options::
|
||
* FTP Options::
|
||
* Recursive Retrieval Options::
|
||
* Recursive Accept/Reject Options::
|
||
|
||
|
||
File: wget.info, Node: URL Format, Next: Option Syntax, Prev: Invoking, Up: Invoking
|
||
|
||
URL Format
|
||
==========
|
||
|
||
"URL" is an acronym for Uniform Resource Locator. A uniform
|
||
resource locator is a compact string representation for a resource
|
||
available via the Internet. Wget recognizes the URL syntax as per
|
||
RFC1738. This is the most widely used form (square brackets denote
|
||
optional parts):
|
||
|
||
http://host[:port]/directory/file
|
||
ftp://host[:port]/directory/file
|
||
|
||
You can also encode your username and password within a URL:
|
||
|
||
ftp://user:password@host/path
|
||
http://user:password@host/path
|
||
|
||
Either USER or PASSWORD, or both, may be left out. If you leave out
|
||
either the HTTP username or password, no authentication will be sent.
|
||
If you leave out the FTP username, `anonymous' will be used. If you
|
||
leave out the FTP password, your email address will be supplied as a
|
||
default password.(1)
|
||
|
||
You can encode unsafe characters in a URL as `%xy', `xy' being the
|
||
hexadecimal representation of the character's ASCII value. Some common
|
||
unsafe characters include `%' (quoted as `%25'), `:' (quoted as `%3A'),
|
||
and `@' (quoted as `%40'). Refer to RFC1738 for a comprehensive list
|
||
of unsafe characters.
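As a rough illustration of the quoting rule above, the following shell sketch quotes the three characters named in the text. The function name `quote_unsafe' is made up for this example and is not part of Wget; it covers only `%', `:' and `@', not the full RFC1738 list.

```shell
# Hypothetical helper, not part of Wget: quote `%', `:' and `@' so that
# a username or password can be embedded in a URL.
quote_unsafe() {
    # `%' must be handled first; otherwise the `%' signs introduced by
    # the later substitutions would be quoted a second time.
    printf '%s\n' "$1" | sed -e 's/%/%25/g' -e 's/:/%3A/g' -e 's/@/%40/g'
}

quote_unsafe 'p@ss:w%rd'    # prints p%40ss%3Aw%25rd
```

The result can then be placed in a URL such as `ftp://user:p%40ss%3Aw%25rd@host/path'.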
|
||
|
||
Wget also supports the `type' feature for FTP URLs. By default, FTP
|
||
documents are retrieved in the binary mode (type `i'), which means that
|
||
they are downloaded unchanged. Another useful mode is the `a'
|
||
("ASCII") mode, which converts the line delimiters between the
|
||
different operating systems, and is thus useful for text files. Here
|
||
is an example:
|
||
|
||
ftp://host/directory/file;type=a
|
||
|
||
Two alternative variants of URL specification are also supported,
|
||
because of historical (hysterical?) reasons and their widespread use.
|
||
|
||
FTP-only syntax (supported by `NcFTP'):
|
||
host:/dir/file
|
||
|
||
HTTP-only syntax (introduced by `Netscape'):
|
||
host[:port]/dir/file
|
||
|
||
These two alternative forms are deprecated, and may cease being
|
||
supported in the future.
|
||
|
||
If you do not understand the difference between these notations, or
|
||
do not know which one to use, just use the plain ordinary format you use
|
||
with your favorite browser, like `Lynx' or `Netscape'.
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) If you have a `.netrc' file in your home directory, the password
|
||
will also be searched for there.
|
||
|
||
|
||
File: wget.info, Node: Option Syntax, Next: Basic Startup Options, Prev: URL Format, Up: Invoking
|
||
|
||
Option Syntax
|
||
=============
|
||
|
||
Since Wget uses GNU getopt to process its arguments, every option has
a long form, and many also have a short form. Long options are easier
to remember, but take longer to type. You may freely mix different option
|
||
styles, or specify options after the command-line arguments. Thus you
|
||
may write:
|
||
|
||
wget -r --tries=10 http://fly.cc.fer.hr/ -o log
|
||
|
||
The space between the option accepting an argument and the argument
|
||
may be omitted. Instead of `-o log' you can write `-olog'.
|
||
|
||
You may put several options that do not require arguments together,
|
||
like:
|
||
|
||
wget -drc URL
|
||
|
||
This is a complete equivalent of:
|
||
|
||
wget -d -r -c URL
|
||
|
||
Since the options can be specified after the arguments, you may
|
||
terminate them with `--'. So the following will try to download URL
|
||
`-x', reporting failure to `log':
|
||
|
||
wget -o log -- -x
|
||
|
||
The options that accept comma-separated lists all respect the
|
||
convention that specifying an empty list clears its value. This can be
|
||
useful to clear the `.wgetrc' settings. For instance, if your `.wgetrc'
|
||
sets `exclude_directories' to `/cgi-bin', the following example will
|
||
first reset it, and then set it to exclude `/~nobody' and `/~somebody'.
|
||
You can also clear the lists in `.wgetrc' (*Note Wgetrc Syntax::).
|
||
|
||
wget -X '' -X /~nobody,/~somebody
|
||
|
||
|
||
File: wget.info, Node: Basic Startup Options, Next: Logging and Input File Options, Prev: Option Syntax, Up: Invoking
|
||
|
||
Basic Startup Options
|
||
=====================
|
||
|
||
`-V'
|
||
`--version'
|
||
Display the version of Wget.
|
||
|
||
`-h'
|
||
`--help'
|
||
Print a help message describing all of Wget's command-line options.
|
||
|
||
`-b'
|
||
`--background'
|
||
Go to background immediately after startup. If no output file is
|
||
specified via the `-o', output is redirected to `wget-log'.
|
||
|
||
`-e COMMAND'
|
||
`--execute COMMAND'
|
||
Execute COMMAND as if it were a part of `.wgetrc' (*Note Startup
|
||
File::). A command thus invoked will be executed *after* the
|
||
commands in `.wgetrc', thus taking precedence over them.
|
||
|
||
|
||
File: wget.info, Node: Logging and Input File Options, Next: Download Options, Prev: Basic Startup Options, Up: Invoking
|
||
|
||
Logging and Input File Options
|
||
==============================
|
||
|
||
`-o LOGFILE'
|
||
`--output-file=LOGFILE'
|
||
Log all messages to LOGFILE. The messages are normally reported
|
||
to standard error.
|
||
|
||
`-a LOGFILE'
|
||
`--append-output=LOGFILE'
|
||
Append to LOGFILE. This is the same as `-o', only it appends to
|
||
LOGFILE instead of overwriting the old log file. If LOGFILE does
|
||
not exist, a new file is created.
|
||
|
||
`-d'
|
||
`--debug'
|
||
Turn on debug output, meaning various information important to the
|
||
developers of Wget if it does not work properly. Your system
|
||
administrator may have chosen to compile Wget without debug
|
||
support, in which case `-d' will not work. Please note that
|
||
compiling with debug support is always safe--Wget compiled with
|
||
the debug support will *not* print any debug info unless requested
|
||
with `-d'. *Note Reporting Bugs:: for more information on how to
|
||
use `-d' for sending bug reports.
|
||
|
||
`-q'
|
||
`--quiet'
|
||
Turn off Wget's output.
|
||
|
||
`-v'
|
||
`--verbose'
|
||
Turn on verbose output, with all the available data. The default
|
||
output is verbose.
|
||
|
||
`-nv'
|
||
`--non-verbose'
|
||
Non-verbose output--turn off verbose without being completely quiet
|
||
(use `-q' for that), which means that error messages and basic
|
||
information still get printed.
|
||
|
||
`-i FILE'
|
||
`--input-file=FILE'
|
||
Read URLs from FILE, in which case no URLs need to be on the
|
||
command line. If there are URLs both on the command line and in
|
||
an input file, those on the command line will be the first ones to
|
||
be retrieved. The FILE need not be an HTML document (but no harm
|
||
if it is)--it is enough if the URLs are just listed sequentially.
|
||
|
||
However, if you specify `--force-html', the document will be
|
||
regarded as `html'. In that case you may have problems with
|
||
relative links, which you can solve either by adding `<base
|
||
href="URL">' to the documents or by specifying `--base=URL' on the
|
||
command line.
|
||
|
||
`-F'
|
||
`--force-html'
|
||
When input is read from a file, force it to be treated as an HTML
|
||
file. This enables you to retrieve relative links from existing
|
||
HTML files on your local disk, by adding `<base href="URL">' to
|
||
HTML, or using the `--base' command-line option.
|
||
|
||
|
||
File: wget.info, Node: Download Options, Next: Directory Options, Prev: Logging and Input File Options, Up: Invoking
|
||
|
||
Download Options
|
||
================
|
||
|
||
`-t NUMBER'
|
||
`--tries=NUMBER'
|
||
Set number of retries to NUMBER. Specify 0 or `inf' for infinite
|
||
retrying.
|
||
|
||
`-O FILE'
|
||
`--output-document=FILE'
|
||
The documents will not be written to the appropriate files, but
|
||
all will be concatenated together and written to FILE. If FILE
|
||
already exists, it will be overwritten. If the FILE is `-', the
|
||
documents will be written to standard output. Including this
|
||
option automatically sets the number of tries to 1.
|
||
|
||
`-nc'
|
||
`--no-clobber'
|
||
Do not clobber existing files when saving to directory hierarchy
|
||
within recursive retrieval of several files. This option is
|
||
*extremely* useful when you wish to continue where you left off
|
||
with retrieval of many files. If the files have the `.html' or
|
||
(yuck) `.htm' suffix, they will be loaded from the local disk, and
|
||
parsed as if they have been retrieved from the Web.
|
||
|
||
`-c'
|
||
`--continue'
|
||
Continue getting an existing file. This is useful when you want to
|
||
finish up the download started by another program, or a previous
|
||
instance of Wget. Thus you can write:
|
||
|
||
wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
|
||
|
||
If there is a file named `ls-lR.Z' in the current directory, Wget
|
||
will assume that it is the first portion of the remote file, and
|
||
will ask the server to continue the retrieval from an offset
|
||
equal to the length of the local file.
|
||
|
||
Note that you need not specify this option if all you want is Wget
|
||
to continue retrieving where it left off when the connection is
|
||
lost--Wget does this by default. You need this option only when
|
||
you want to continue retrieval of a file already halfway
|
||
retrieved, saved by another FTP client, or left by Wget being
|
||
killed.
|
||
|
||
Without `-c', the previous example would just begin to download the
|
||
remote file to `ls-lR.Z.1'. The `-c' option is also applicable
|
||
for HTTP servers that support the `Range' header.
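The offset computation described for `-c' can be sketched in shell. This only illustrates the idea; `resume_offset' is a name invented for the example, and Wget's own implementation is not shown here.

```shell
# Illustrative only: print the byte offset from which a retrieval of
# FILE would be continued, i.e. the length of the local file, or 0 if
# there is no local file yet.
resume_offset() {
    if [ -f "$1" ]; then
        # Arithmetic expansion strips the padding some `wc' versions add.
        echo $(( $(wc -c < "$1") ))
    else
        echo 0
    fi
}

printf 'hello' > partial.tmp     # pretend 5 bytes were already saved
resume_offset partial.tmp        # prints 5
rm -f partial.tmp
```

With an offset of 5, an HTTP client would ask for the remainder with a header of the form `Range: bytes=5-', and an FTP client would send `REST 5' before retrieving the file.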
|
||
|
||
`--dot-style=STYLE'
|
||
Set the retrieval style to STYLE. Wget traces the retrieval of
|
||
each document by printing dots on the screen, each dot
|
||
representing a fixed amount of retrieved data. Any number of dots
|
||
may be separated in a "cluster", to make counting easier. This
|
||
option allows you to choose one of the pre-defined styles,
|
||
determining the number of bytes represented by a dot, the number
|
||
of dots in a cluster, and the number of dots on the line.
|
||
|
||
With the `default' style each dot represents 1K, there are ten dots
|
||
in a cluster and 50 dots in a line. The `binary' style has a more
"computer"-like orientation--8K dots, 16-dot clusters and 48 dots
|
||
per line (which makes for 384K lines). The `mega' style is
|
||
suitable for downloading very large files--each dot represents 64K
|
||
retrieved, there are eight dots in a cluster, and 48 dots on each
|
||
line (so each line contains 3M). The `micro' style is exactly the
|
||
reverse; it is suitable for downloading small files, with 128-byte
|
||
dots, 8 dots per cluster, and 48 dots (6K) per line.
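The per-line figures quoted above follow from multiplying the size of a dot by the number of dots on a line. A small shell sketch of that arithmetic, assuming that K means 1024 bytes (`line_bytes' is a name made up for the example):

```shell
# Bytes shown on one full line of dots: bytes per dot times dots per line.
line_bytes() {
    echo $(( $1 * $2 ))
}

line_bytes 1024  50    # default: prints 51200   (50K)
line_bytes 8192  48    # binary:  prints 393216  (384K)
line_bytes 65536 48    # mega:    prints 3145728 (3M)
line_bytes 128   48    # micro:   prints 6144    (6K)
```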
|
||
|
||
`-N'
|
||
`--timestamping'
|
||
Turn on time-stamping. *Note Time-Stamping:: for details.
|
||
|
||
`-S'
|
||
`--server-response'
|
||
Print the headers sent by HTTP servers and responses sent by FTP
|
||
servers.
|
||
|
||
`--spider'
|
||
When invoked with this option, Wget will behave as a Web "spider",
|
||
which means that it will not download the pages, just check that
|
||
they are there. You can use it to check your bookmarks, e.g. with:
|
||
|
||
wget --spider --force-html -i bookmarks.html
|
||
|
||
This feature needs much more work for Wget to get close to the
|
||
functionality of real WWW spiders.
|
||
|
||
`-T SECONDS'
|
||
`--timeout=SECONDS'
|
||
Set the read timeout to SECONDS seconds. Whenever a network read
|
||
is issued, the file descriptor is checked for a timeout, which
|
||
could otherwise leave a pending connection (uninterrupted read).
|
||
The default timeout is 900 seconds (fifteen minutes). Setting
|
||
timeout to 0 will disable checking for timeouts.
|
||
|
||
Please do not lower the default timeout value with this option
|
||
unless you know what you are doing.
|
||
|
||
`-w SECONDS'
|
||
`--wait=SECONDS'
|
||
Wait the specified number of seconds between the retrievals. Use
|
||
of this option is recommended, as it lightens the server load by
|
||
making the requests less frequent. Instead of in seconds, the
|
||
time can be specified in minutes using the `m' suffix, in hours
|
||
using `h' suffix, or in days using `d' suffix.
|
||
|
||
Specifying a large value for this option is useful if the network
|
||
or the destination host is down, so that Wget can wait long enough
|
||
to reasonably expect the network error to be fixed before the
|
||
retry.
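To make the suffixes concrete, here is a shell sketch that converts a `--wait' style value to seconds. It is not Wget's own parser, only a model of the rule stated above; the name `wait_seconds' is invented, and only whole numbers are handled.

```shell
# Illustrative only: convert a value such as 30, 2m, 1h or 1d to seconds.
wait_seconds() {
    case $1 in
        *m) echo $(( ${1%m} * 60 )) ;;      # minutes
        *h) echo $(( ${1%h} * 3600 )) ;;    # hours
        *d) echo $(( ${1%d} * 86400 )) ;;   # days
        *)  echo $(( $1 )) ;;               # plain seconds
    esac
}

wait_seconds 30    # prints 30
wait_seconds 2m    # prints 120
wait_seconds 1h    # prints 3600
wait_seconds 1d    # prints 86400
```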
|
||
|
||
`-Y on/off'
|
||
`--proxy=on/off'
|
||
Turn proxy support on or off. The proxy is on by default if the
|
||
appropriate environment variable is defined.
|
||
|
||
`-Q QUOTA'
|
||
`--quota=QUOTA'
|
||
Specify download quota for automatic retrievals. The value can be
|
||
specified in bytes (default), kilobytes (with `k' suffix), or
|
||
megabytes (with `m' suffix).
|
||
|
||
Note that quota will never affect downloading a single file. So
|
||
if you specify `wget -Q10k ftp://wuarchive.wustl.edu/ls-lR.gz',
|
||
all of the `ls-lR.gz' will be downloaded. The same goes even when
|
||
several URLs are specified on the command-line. However, quota is
|
||
respected when retrieving either recursively, or from an input
|
||
file. Thus you may safely type `wget -Q2m -i sites'--download
|
||
will be aborted when the quota is exceeded.
|
||
|
||
Setting the quota to 0 or to `inf' removes the limit on downloads.
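The rule that a quota never interrupts a single file can be modelled with a short shell sketch. This is a simplified model, not Wget code: `files_fetched' is an invented name, sizes are given in bytes, and "exceeded" is assumed to mean strictly greater than the quota.

```shell
# Illustrative model of the quota rule, not Wget code.
# Usage: files_fetched QUOTA SIZE...
# Prints how many files are retrieved before the quota stops the run.
# A QUOTA of 0 means no limit.
files_fetched() {
    quota=$1; shift
    total=0 count=0
    for size in "$@"; do
        # A file that has been started is always retrieved completely.
        total=$(( total + size ))
        count=$(( count + 1 ))
        # The quota is consulted only between files.
        if [ "$quota" -gt 0 ] && [ "$total" -gt "$quota" ]; then
            break
        fi
    done
    echo "$count"
}

files_fetched 10240 500000                      # prints 1
files_fetched 2097152 1048576 1048576 1048576 1048576   # prints 3
```

In the first call a 10k quota still lets the single large file through; in the second, a 2m quota stops the run after the third 1m file, once the total has gone past the quota.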
|
||
|
||
|
||
File: wget.info, Node: Directory Options, Next: HTTP Options, Prev: Download Options, Up: Invoking
|
||
|
||
Directory Options
|
||
=================
|
||
|
||
`-nd'
|
||
`--no-directories'
|
||
Do not create a hierarchy of directories when retrieving
|
||
recursively. With this option turned on, all files will get saved
|
||
to the current directory, without clobbering (if a name shows up
|
||
more than once, the filenames will get extensions `.n').
|
||
|
||
`-x'
|
||
`--force-directories'
|
||
The opposite of `-nd'--create a hierarchy of directories, even if
|
||
one would not have been created otherwise. E.g. `wget -x
|
||
http://fly.cc.fer.hr/robots.txt' will save the downloaded file to
|
||
`fly.cc.fer.hr/robots.txt'.
|
||
|
||
`-nH'
|
||
`--no-host-directories'
|
||
Disable generation of host-prefixed directories. By default,
|
||
invoking Wget with `-r http://fly.cc.fer.hr/' will create a
|
||
structure of directories beginning with `fly.cc.fer.hr/'. This
|
||
option disables such behavior.
|
||
|
||
`--cut-dirs=NUMBER'
|
||
Ignore NUMBER directory components. This is useful for getting a
|
||
fine-grained control over the directory where recursive retrieval
|
||
will be saved.
|
||
|
||
Take, for example, the directory at
|
||
`ftp://ftp.xemacs.org/pub/xemacs/'. If you retrieve it with `-r',
|
||
it will be saved locally under `ftp.xemacs.org/pub/xemacs/'.
|
||
While the `-nH' option can remove the `ftp.xemacs.org/' part, you
|
||
are still stuck with `pub/xemacs'. This is where `--cut-dirs'
|
||
comes in handy; it makes Wget not "see" NUMBER remote directory
|
||
components. Here are several examples of how `--cut-dirs' option
|
||
works.
|
||
|
||
No options -> ftp.xemacs.org/pub/xemacs/
|
||
-nH -> pub/xemacs/
|
||
-nH --cut-dirs=1 -> xemacs/
|
||
-nH --cut-dirs=2 -> .
|
||
|
||
--cut-dirs=1 -> ftp.xemacs.org/xemacs/
|
||
...
|
||
|
||
If you just want to get rid of the directory structure, this
|
||
option is similar to a combination of `-nd' and `-P'. However,
|
||
unlike `-nd', `--cut-dirs' does not discard subdirectories--for
instance, with `-nH --cut-dirs=1', a `beta/' subdirectory will be
placed in `xemacs/beta', as one would expect.
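The table of `--cut-dirs' examples can be reproduced with a small shell model. It is a sketch of the mapping shown above, not Wget's own code; the function name `local_dir' and its argument order are invented for the example.

```shell
# Illustrative only. Usage: local_dir HOST REMOTE_DIR NO_HOST CUT
# NO_HOST is "yes" to mimic `-nH'; CUT is the `--cut-dirs' number.
local_dir() {
    host=$1 dir=$2 no_host=$3 cut=$4
    # Strip one leading and one trailing slash from the remote directory.
    dir=${dir#/}
    dir=${dir%/}
    # Drop CUT leading directory components.
    while [ "$cut" -gt 0 ] && [ -n "$dir" ]; do
        case $dir in
            */*) dir=${dir#*/} ;;
            *)   dir= ;;
        esac
        cut=$(( cut - 1 ))
    done
    if [ "$no_host" = yes ]; then
        result=$dir
    else
        result=$host${dir:+/$dir}
    fi
    if [ -n "$result" ]; then
        printf '%s/\n' "$result"
    else
        printf '.\n'
    fi
}

local_dir ftp.xemacs.org /pub/xemacs/ no  0   # prints ftp.xemacs.org/pub/xemacs/
local_dir ftp.xemacs.org /pub/xemacs/ yes 1   # prints xemacs/
local_dir ftp.xemacs.org /pub/xemacs/ yes 2   # prints .
```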
|
||
|
||
`-P PREFIX'
|
||
`--directory-prefix=PREFIX'
|
||
Set directory prefix to PREFIX. The "directory prefix" is the
|
||
directory where all other files and subdirectories will be saved
|
||
to, i.e. the top of the retrieval tree. The default is `.' (the
|
||
current directory).
|
||
|
||
|
||
File: wget.info, Node: HTTP Options, Next: FTP Options, Prev: Directory Options, Up: Invoking
|
||
|
||
HTTP Options
|
||
============
|
||
|
||
`--http-user=USER'
|
||
`--http-passwd=PASSWORD'
|
||
Specify the username USER and password PASSWORD on an HTTP server.
|
||
According to the type of the challenge, Wget will encode them
|
||
using either the `basic' (insecure) or the `digest' authentication
|
||
scheme.
|
||
|
||
Another way to specify username and password is in the URL itself
|
||
(*Note URL Format::). For more information about security issues
|
||
with Wget, *Note Security Considerations::.
|
||
|
||
`-C on/off'
|
||
`--cache=on/off'
|
||
When set to off, disable server-side cache. In this case, Wget
|
||
will send the remote server an appropriate directive (`Pragma:
|
||
no-cache') to get the file from the remote service, rather than
|
||
returning the cached version. This is especially useful for
|
||
retrieving and flushing out-of-date documents on proxy servers.
|
||
|
||
Caching is allowed by default.
|
||
|
||
`--ignore-length'
|
||
Unfortunately, some HTTP servers (CGI programs, to be more
|
||
precise) send out bogus `Content-Length' headers, which makes Wget
|
||
go wild, as it thinks not all of the document was retrieved. You can
|
||
spot this syndrome if Wget retries getting the same document again
|
||
and again, each time claiming that the (otherwise normal)
|
||
connection has closed on the very same byte.
|
||
|
||
With this option, Wget will ignore the `Content-Length' header--as
|
||
if it never existed.
|
||
|
||
`--header=ADDITIONAL-HEADER'
|
||
Define an ADDITIONAL-HEADER to be passed to the HTTP servers.
|
||
Headers must contain a `:' preceded by one or more non-blank
|
||
characters, and must not contain newlines.
|
||
|
||
You may define more than one additional header by specifying
|
||
`--header' more than once.
|
||
|
||
wget --header='Accept-Charset: iso-8859-2' \
|
||
--header='Accept-Language: hr' \
|
||
http://fly.cc.fer.hr/
|
||
|
||
Specification of an empty string as the header value will clear all
|
||
previous user-defined headers.
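The well-formedness rule for `--header' values can be expressed as a shell check. This is one reading of the rule, written for illustration and not taken from Wget: the function name `valid_header' is invented, and the sketch assumes that the text before the first `:' must be non-empty and free of whitespace.

```shell
# Illustrative only: succeed if the argument looks like an acceptable
# additional header, fail otherwise.
valid_header() {
    newline='
'
    case $1 in
        *"$newline"*) return 1 ;;     # headers must not contain newlines
    esac
    name=${1%%:*}                     # text before the first `:'
    [ "$name" != "$1" ] || return 1   # no `:' at all
    [ -n "$name" ] || return 1        # nothing before the `:'
    case $name in
        *[[:space:]]*) return 1 ;;    # blanks before the `:'
    esac
    return 0
}

valid_header 'Accept-Language: hr' && echo accepted    # prints accepted
valid_header 'no colon here'       || echo rejected    # prints rejected
```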
|
||
|
||
`--proxy-user=USER'
|
||
`--proxy-passwd=PASSWORD'
|
||
Specify the username USER and password PASSWORD for authentication
|
||
on a proxy server. Wget will encode them using the `basic'
|
||
authentication scheme.
|
||
|
||
`-s'
|
||
`--save-headers'
|
||
Save the headers sent by the HTTP server to the file, preceding the
|
||
actual contents, with an empty line as the separator.
|
||
|
||
`-U AGENT-STRING'
|
||
`--user-agent=AGENT-STRING'
|
||
Identify as AGENT-STRING to the HTTP server.
|
||
|
||
The HTTP protocol allows the clients to identify themselves using a
|
||
`User-Agent' header field. This enables distinguishing the WWW
|
||
software, usually for statistical purposes or for tracing of
|
||
protocol violations. Wget normally identifies as `Wget/VERSION',
|
||
VERSION being the current version number of Wget.
|
||
|
||
However, some sites have been known to impose the policy of
|
||
tailoring the output according to the `User-Agent'-supplied
|
||
information. While conceptually this is not such a bad idea, it
|
||
has been abused by servers denying information to clients other
|
||
than `Mozilla' or Microsoft `Internet Explorer'. This option
|
||
allows you to change the `User-Agent' line issued by Wget. Use of
|
||
this option is discouraged, unless you really know what you are
|
||
doing.
|
||
|
||
*NOTE* that Netscape Communications Corp. has claimed that false
|
||
transmissions of `Mozilla' as the `User-Agent' are a copyright
|
||
infringement, which will be prosecuted. *DO NOT* misrepresent
|
||
Wget as Mozilla.
|
||
|
||
|
||
File: wget.info, Node: FTP Options, Next: Recursive Retrieval Options, Prev: HTTP Options, Up: Invoking
|
||
|
||
FTP Options
|
||
===========
|
||
|
||
`--retr-symlinks'
|
||
Retrieve symbolic links on FTP sites as if they were plain files,
|
||
i.e. don't just create links locally.
|
||
|
||
`-g on/off'
|
||
`--glob=on/off'
|
||
Turn FTP globbing on or off. Globbing means you may use the
|
||
shell-like special characters ("wildcards"), like `*', `?', `['
|
||
and `]' to retrieve more than one file from the same directory at
|
||
once, like:
|
||
|
||
wget ftp://gnjilux.cc.fer.hr/*.msg
|
||
|
||
By default, globbing will be turned on if the URL contains a
|
||
globbing character. This option may be used to turn globbing on
|
||
or off permanently.
|
||
|
||
You may have to quote the URL to protect it from being expanded by
|
||
your shell. Globbing makes Wget look for a directory listing,
|
||
which is system-specific. This is why it currently works only
|
||
with Unix FTP servers (and the ones emulating Unix `ls' output).
|
||
|
||
`--passive-ftp'
|
||
Use the "passive" FTP retrieval scheme, in which the client
|
||
initiates the data connection. This is sometimes required for FTP
|
||
to work behind firewalls.
|
||
|
||
|
||
File: wget.info, Node: Recursive Retrieval Options, Next: Recursive Accept/Reject Options, Prev: FTP Options, Up: Invoking
|
||
|
||
Recursive Retrieval Options
|
||
===========================
|
||
|
||
`-r'
|
||
`--recursive'
|
||
Turn on recursive retrieving. *Note Recursive Retrieval:: for more
|
||
details.
|
||
|
||
`-l DEPTH'
|
||
`--level=DEPTH'
|
||
Specify recursion maximum depth level DEPTH (*Note Recursive
|
||
Retrieval::). The default maximum depth is 5.
|
||
|
||
`--delete-after'
|
||
This option tells Wget to delete every single file it downloads,
|
||
*after* having done so. It is useful for pre-fetching popular
|
||
pages through proxy, e.g.:
|
||
|
||
wget -r -nd --delete-after http://whatever.com/~popular/page/
|
||
|
||
The `-r' option is to retrieve recursively, and `-nd' not to
|
||
create directories.
|
||
|
||
`-k'
|
||
`--convert-links'
|
||
Convert the non-relative links to relative ones locally. Only the
|
||
references to the documents actually downloaded will be converted;
|
||
the rest will be left unchanged.
|
||
|
||
Note that only at the end of the download can Wget know which
|
||
links have been downloaded. Because of that, much of the work
|
||
done by `-k' will be performed at the end of the downloads.
|
||
|
||
`-K'
|
||
`--backup-converted'
|
||
When converting a file, back up the original version with a `.orig'
|
||
suffix. Affects the behavior of `-N' (*Note HTTP Time-Stamping
|
||
Internals::).
|
||
|
||
`-m'
|
||
`--mirror'
|
||
Turn on options suitable for mirroring. This option turns on
|
||
recursion and time-stamping, sets infinite recursion depth and
|
||
keeps FTP directory listings. It is currently equivalent to `-r
|
||
-N -l inf -nr'.
|
||
|
||
`-nr'
|
||
`--dont-remove-listing'
|
||
Don't remove the temporary `.listing' files generated by FTP
|
||
retrievals. Normally, these files contain the raw directory
|
||
listings received from FTP servers. Not removing them can be
|
||
useful to access the full remote file list when running a mirror,
|
||
or for debugging purposes.
|
||
|
||
|
||
File: wget.info, Node: Recursive Accept/Reject Options, Prev: Recursive Retrieval Options, Up: Invoking
|
||
|
||
Recursive Accept/Reject Options
|
||
===============================
|
||
|
||
`-A ACCLIST --accept ACCLIST'
|
||
`-R REJLIST --reject REJLIST'
|
||
Specify comma-separated lists of file name suffixes or patterns to
|
||
accept or reject (*Note Types of Files:: for more details).
|
||
|
||
`-D DOMAIN-LIST'
|
||
`--domains=DOMAIN-LIST'
|
||
Set domains to be accepted and DNS looked-up, where DOMAIN-LIST is
|
||
a comma-separated list. Note that it does *not* turn on `-H'.
|
||
This option speeds things up, even if only one host is spanned
|
||
(*Note Domain Acceptance::).
|
||
|
||
`--exclude-domains DOMAIN-LIST'
|
||
Exclude the domains given in a comma-separated DOMAIN-LIST from
|
||
DNS-lookup (*Note Domain Acceptance::).
|
||
|
||
`--follow-ftp'
|
||
Follow FTP links from HTML documents. Without this option, Wget
|
||
will ignore all the FTP links.
|
||
|
||
`--follow-tags=LIST'
|
||
Wget has an internal table of HTML tag / attribute pairs that it
|
||
considers when looking for linked documents during a recursive
|
||
retrieval. If a user wants only a subset of those tags to be
|
||
considered, however, he or she should specify such tags in a
|
||
comma-separated LIST with this option.
|
||
|
||
`-G LIST'
|
||
`--ignore-tags=LIST'
|
||
This is the opposite of the `--follow-tags' option. To skip
|
||
certain HTML tags when recursively looking for documents to
|
||
download, specify them in a comma-separated LIST. The author of
|
||
this option likes to use the following command to download a
|
||
single HTML page and all documents necessary to display it
|
||
properly:
|
||
|
||
wget -Ga,area -H -k -K -nh -r http://SITE/DOCUMENT
|
||
|
||
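The interaction between `--follow-tags' and `--ignore-tags' can be sketched in Python as follows. This is only an illustration, not Wget's actual C code: the table `KNOWN_LINK_TAGS' is a hypothetical stand-in for Wget's internal tag / attribute table, and the precedence shown (ignored tags are dropped first, then the follow list is applied) is an assumption.

```python
# Hypothetical stand-in for Wget's internal table of tag/attribute pairs.
KNOWN_LINK_TAGS = {
    "a": ["href"],
    "area": ["href"],
    "img": ["src"],
    "link": ["href"],
    "body": ["background"],
}


def parse_tag_list(value):
    """Split a comma-separated LIST into a set of lowercase tag names."""
    return {t.strip().lower() for t in value.split(",") if t.strip()}


def tag_is_considered(tag, follow_tags=None, ignore_tags=None):
    """Return True if links in this tag would be looked at during recursion."""
    tag = tag.lower()
    if tag not in KNOWN_LINK_TAGS:
        return False          # not a link-bearing tag at all
    if ignore_tags and tag in ignore_tags:
        return False          # explicitly skipped (--ignore-tags)
    if follow_tags and tag not in follow_tags:
        return False          # a follow list exists and omits this tag
    return True
```

With `ignore_tags=parse_tag_list("a,area")', as in the `-Ga,area' command above, hyperlinks are skipped while `img' and the other listed tags are still considered.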
`-H'
|
||
`--span-hosts'
|
||
Enable spanning across hosts when doing recursive retrieving
|
||
(*Note All Hosts::).
|
||
|
||
`-L'
|
||
`--relative'
|
||
Follow relative links only. Useful for retrieving a specific home
|
||
page without any distractions, not even those from the same hosts
|
||
(*Note Relative Links::).
|
||
|
||
`-I LIST'
|
||
`--include-directories=LIST'
|
||
Specify a comma-separated list of directories you wish to follow
|
||
when downloading (*Note Directory-Based Limits:: for more
|
||
details.) Elements of LIST may contain wildcards.
|
||
|
||
`-X LIST'
|
||
`--exclude-directories=LIST'
|
||
Specify a comma-separated list of directories you wish to exclude
|
||
from download (*Note Directory-Based Limits:: for more details.)
|
||
Elements of LIST may contain wildcards.
|
||
|
||
`-nh'
|
||
`--no-host-lookup'
|
||
Disable the time-consuming DNS lookup of almost all hosts (*Note
|
||
Host Checking::).
|
||
|
||
`-np'
|
||
|
||
`--no-parent'
|
||
Do not ever ascend to the parent directory when retrieving
|
||
recursively. This is a useful option, since it guarantees that
|
||
only the files *below* a certain hierarchy will be downloaded.
|
||
*Note Directory-Based Limits:: for more details.
|
||
|
||
|
||
File: wget.info, Node: Recursive Retrieval, Next: Following Links, Prev: Invoking, Up: Top
|
||
|
||
Recursive Retrieval
|
||
*******************
|
||
|
||
GNU Wget is capable of traversing parts of the Web (or a single HTTP
|
||
or FTP server), depth-first following links and directory structure.
|
||
This is called "recursive" retrieving, or "recursion".
|
||
|
||
With HTTP URLs, Wget retrieves and parses the HTML from the given
|
||
URL, retrieving the files the HTML document was referring
|
||
to, through markup like `href' or `src'. If the freshly downloaded
|
||
file is also of type `text/html', it will be parsed and followed
|
||
further.
|
||
|
||
The maximum "depth" to which the retrieval may descend is specified
|
||
with the `-l' option (the default maximum depth is five layers). *Note
|
||
Recursive Retrieval::.
|
||
|
||
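The traversal described above can be sketched in Python. This is a simplified model, not Wget's implementation: `fetch' is a hypothetical callable that returns a `(content_type, body)' pair or `None', links are assumed to be already absolute, and only `href' and `src' attributes are examined.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the values of href and src attributes from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def retrieve(url, fetch, max_depth=5, depth=0, seen=None):
    """Depth-first retrieval limited to max_depth levels below the start URL.

    Returns the list of URLs fetched, in the order they were visited.
    """
    if seen is None:
        seen = []
    if url in seen or depth > max_depth:
        return seen
    page = fetch(url)
    if page is None:
        return seen
    seen.append(url)
    content_type, body = page
    # Only HTML documents are parsed and followed further.
    if content_type == "text/html":
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            retrieve(link, fetch, max_depth, depth + 1, seen)
    return seen
```

Passing a dictionary's `get' method as `fetch' is enough to try the function on a small in-memory "site"; lowering `max_depth' plays the role of the `-l' option.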
When retrieving an FTP URL recursively, Wget will retrieve all the
|
||
data from the given directory tree (including the subdirectories up to
|
||
the specified depth) on the remote server, creating its mirror image
|
||
locally. FTP retrieval is also limited by the `depth' parameter.
|
||
|
||
By default, Wget will create a local directory tree, corresponding to
|
||
the one found on the remote server.
|
||
|
||
Recursive retrieving can find a number of applications, the most
|
||
important of which is mirroring. It is also useful for WWW
|
||
presentations, and any other opportunities where slow network
|
||
connections should be bypassed by storing the files locally.
|
||
|
||
You should be warned that invoking recursion may cause grave
|
||
overloading on your system, because of the fast exchange of data
|
||
through the network; all of this may hamper other users' work. The
|
||
same stands for the foreign server you are mirroring--the more requests
|
||
it gets in a row, the greater its load.
|
||
|
||
Careless retrieving can also fill your file system uncontrollably,
|
||
which can grind the machine to a halt.
|
||
|
||
The load can be minimized by lowering the maximum recursion level
|
||
(`-l') and/or by lowering the number of retries (`-t'). You may also
|
||
consider using the `-w' option to slow down your requests to the remote
|
||
servers, as well as the numerous options to narrow the number of
|
||
followed links (*Note Following Links::).
|
||
|
||
Recursive retrieval is a good thing when used properly. Please take
|
||
all precautions not to wreak havoc through carelessness.
|
||
|
||
|
||
File: wget.info, Node: Following Links, Next: Time-Stamping, Prev: Recursive Retrieval, Up: Top
|
||
|
||
Following Links
|
||
***************
|
||
|
||
When retrieving recursively, one does not wish to retrieve loads of
|
||
unnecessary data. Most of the time the users bear in mind exactly what
|
||
they want to download, and want Wget to follow only specific links.
|
||
|
||
For example, if you wish to download the music archive from
|
||
`fly.cc.fer.hr', you will not want to download all the home pages that
|
||
happen to be referenced by an obscure part of the archive.
|
||
|
||
Wget possesses several mechanisms that allow you to fine-tune which
|
||
links it will follow.
|
||
|
||
* Menu:
|
||
|
||
* Relative Links:: Follow relative links only.
|
||
* Host Checking:: Follow links on the same host.
|
||
* Domain Acceptance:: Check on a list of domains.
|
||
* All Hosts:: No host restrictions.
|
||
* Types of Files:: Getting only certain files.
|
||
* Directory-Based Limits:: Getting only certain directories.
|
||
* FTP Links:: Following FTP links.
|
||
|
||
|
||
File: wget.info, Node: Relative Links, Next: Host Checking, Prev: Following Links, Up: Following Links
|
||
|
||
Relative Links
|
||
==============
|
||
|
||
When only relative links are followed (option `-L'), recursive
|
||
retrieving will never span hosts. No time-expensive DNS-lookups will
|
||
be performed, and the process will be very fast, with the minimum
|
||
strain on the network. This will suit your needs often, especially when
|
||
mirroring the output of various `x2html' converters, since they
|
||
generally output relative links.
|
||
|
||
|
||
File: wget.info, Node: Host Checking, Next: Domain Acceptance, Prev: Relative Links, Up: Following Links
|
||
|
||
Host Checking
|
||
=============
|
||
|
||
The drawback of following the relative links solely is that humans
|
||
often tend to mix them with absolute links to the very same host, and
|
||
the very same page. With host checking (which is the default mode for
|
||
following links) all URLs that refer to the same host will be retrieved.
|
||
|
||
The problem with this approach is the aliases of the hosts and
|
||
domains. Thus there is no way for Wget to know that `regoc.srce.hr' and
|
||
`www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the same as
|
||
`fly.cc.etf.hr'. Whenever an absolute link is encountered, the host is
|
||
DNS-looked-up with `gethostbyname' to check whether we are maybe
|
||
dealing with the same hosts. Although the results of `gethostbyname'
|
||
are cached, it is still a great slowdown, e.g. when dealing with large
|
||
indices of home pages on different hosts (because each of the hosts
|
||
must be DNS-resolved to see whether it just *might* be an alias of the
|
||
starting host).
|
||
|
||
To avoid the overhead you may use `-nh', which will turn off
|
||
DNS-resolving and make Wget compare hosts literally. This will make
|
||
things run much faster, but also much less reliably (e.g. `www.srce.hr'
|
||
and `regoc.srce.hr' will be flagged as different hosts).
|
||
|
||
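The difference between the default host check and `-nh' can be illustrated with a short Python sketch. It is an approximation made for this manual, not Wget's code: two names are treated as the same host when they resolve to the same address, the caching mentioned above is omitted, and the `resolve' parameter is a hypothetical hook so the function can be tried without network access.

```python
import socket


def same_host(host_a, host_b, no_host_lookup=False,
              resolve=socket.gethostbyname):
    """Decide whether two host names refer to the same host.

    With no_host_lookup=True (the -nh behavior) names are compared
    literally and no DNS lookup is made.
    """
    if host_a.lower() == host_b.lower():
        return True
    if no_host_lookup:
        return False
    try:
        # Assumption: equal resolved addresses mean "alias of the same host".
        return resolve(host_a) == resolve(host_b)
    except OSError:
        # Resolution failed; treat the hosts as different.
        return False
```

The literal comparison never touches the network, which is where the speed-up of `-nh' comes from, at the cost of missing aliases.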
Note that modern HTTP servers allow one IP address to host several
|
||
"virtual servers", each having its own directory hierarchy. Such
|
||
"servers" are distinguished by their hostnames (all of which point to
|
||
the same IP address); for this to work, a client must send a `Host'
|
||
header, which is what Wget does. However, in that case Wget *must not*
|
||
try to divine a host's "real" address, nor try to use the same hostname
|
||
for each access, i.e. `-nh' must be turned on.
|
||
|
||
In other words, the `-nh' option must be used to enable the
|
||
retrieval from virtual servers distinguished by their hostnames. As the
|
||
number of such server setups grows, the behavior of `-nh' may become the
|
||
default in the future.
|
||
|
||
|
||
File: wget.info, Node: Domain Acceptance, Next: All Hosts, Prev: Host Checking, Up: Following Links
|
||
|
||
Domain Acceptance
|
||
=================
|
||
|
||
With the `-D' option you may specify the domains that will be
|
||
followed. Hosts whose domain is not in this list will not be
|
||
DNS-resolved. Thus you can specify `-Dmit.edu' just to make sure that
|
||
*nothing outside of MIT gets looked up*. This is very important and
|
||
useful. It also means that `-D' does *not* imply `-H' (span all
|
||
hosts), which must be specified explicitly. Feel free to use this
|
||
option since it will speed things up, with almost all the reliability
|
||
of checking for all hosts. Thus you could invoke
|
||
|
||
wget -r -D.hr http://fly.cc.fer.hr/
|
||
|
||
to make sure that only the hosts in the `.hr' domain get DNS-looked-up
|
||
for being equal to `fly.cc.fer.hr'. So `fly.cc.etf.hr' will be checked
|
||
(only once!) and found equal, but `www.gnu.ai.mit.edu' will not even be
|
||
checked.
|
||
|
||
Of course, domain acceptance can be used to limit the retrieval to
|
||
particular domains with spanning of hosts in them, but then you must
|
||
specify `-H' explicitly. E.g.:
|
||
|
||
wget -r -H -Dmit.edu,stanford.edu http://www.mit.edu/
|
||
|
||
will start with `http://www.mit.edu/', following links across MIT
|
||
and Stanford.
|
||
|
||
If there are domains you want to exclude specifically, you can do it
|
||
with `--exclude-domains', which accepts the same type of arguments as
|
||
`-D', but will *exclude* all the listed domains. For example, if you
|
||
want to download all the hosts from the `foo.edu' domain, with the
|
||
exception of `sunsite.foo.edu', you can do it like this:
|
||
|
||
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu http://www.foo.edu/
|
||
|
||
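As a rough model of how `-D' and `--exclude-domains' combine, consider the following Python sketch. It is not taken from the Wget sources: matching on whole domain labels and letting exclusions win over acceptances are both assumptions made for the illustration.

```python
def in_domain(host, domain):
    """True if host equals domain or is a subdomain of it."""
    host = host.lower().rstrip(".")
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def domain_accepted(host, domains=(), exclude_domains=()):
    """Apply an accept list (-D) and an exclude list (--exclude-domains)."""
    if any(in_domain(host, d) for d in exclude_domains):
        return False
    if domains:
        return any(in_domain(host, d) for d in domains)
    # No accept list given: nothing is restricted by domain.
    return True
```

With `domains=["foo.edu"]' and `exclude_domains=["sunsite.foo.edu"]', the host `www.foo.edu' is accepted and `sunsite.foo.edu' is not, matching the last command above.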
|
||
File: wget.info, Node: All Hosts, Next: Types of Files, Prev: Domain Acceptance, Up: Following Links
|
||
|
||
All Hosts
|
||
=========
|
||
|
||
When `-H' is specified without `-D', all hosts are freely spanned.
|
||
There are no restrictions whatsoever as to what part of the net Wget
|
||
will go to fetch documents, other than maximum retrieval depth. If a
|
||
page references `www.yahoo.com', so be it. Such an option is rarely
|
||
useful by itself.
|
||
|
||
|
||
File: wget.info, Node: Types of Files, Next: Directory-Based Limits, Prev: All Hosts, Up: Following Links
|
||
|
||
Types of Files
|
||
==============
|
||
|
||
When downloading material from the web, you will often want to
|
||
restrict the retrieval to only certain file types. For example, if you
|
||
are interested in downloading GIFs, you will not be overjoyed to get
|
||
loads of PostScript documents, and vice versa.
|
||
|
||
Wget offers two options to deal with this problem. Each option
|
||
description lists a short name, a long name, and the equivalent command
|
||
in `.wgetrc'.
|
||
|
||
`-A ACCLIST'
|
||
`--accept ACCLIST'
|
||
`accept = ACCLIST'
|
||
The argument to `--accept' option is a list of file suffixes or
|
||
patterns that Wget will download during recursive retrieval. A
|
||
suffix is the ending part of a file name, and consists of "normal"
|
||
letters, e.g. `gif' or `.jpg'. A matching pattern contains
|
||
shell-like wildcards, e.g. `books*' or `zelazny*196[0-9]*'.
|
||
|
||
So, specifying `wget -A gif,jpg' will make Wget download only the
|
||
files ending with `gif' or `jpg', i.e. GIFs and JPEGs. On the
|
||
other hand, `wget -A "zelazny*196[0-9]*"' will download only files
|
||
beginning with `zelazny' and containing numbers from 1960 to 1969
|
||
anywhere within. Look up the manual of your shell for a
|
||
description of how pattern matching works.
|
||
|
||
Of course, any number of suffixes and patterns can be combined
|
||
into a comma-separated list, and given as an argument to `-A'.
|
||
|
||
`-R REJLIST'
|
||
`--reject REJLIST'
|
||
`reject = REJLIST'
|
||
The `--reject' option works the same way as `--accept', only its
|
||
logic is the reverse; Wget will download all files *except* the
|
||
ones matching the suffixes (or patterns) in the list.
|
||
|
||
So, if you want to download a whole page except for the cumbersome
|
||
MPEGs and .AU files, you can use `wget -R mpg,mpeg,au'.
|
||
Analogously, to download all files except the ones beginning with
|
||
`bjork', use `wget -R "bjork*"'. The quotes are to prevent
|
||
expansion by the shell.
|
||
|
||
The `-A' and `-R' options may be combined to achieve even better
|
||
fine-tuning of which files to retrieve. E.g. `wget -A "*zelazny*" -R
|
||
.ps' will download all the files having `zelazny' as a part of their
|
||
name, but *not* the PostScript files.
|
||
|
||
Note that these two options do not affect the downloading of HTML
|
||
files; Wget must load all the HTMLs to know where to go at
|
||
all--recursive retrieval would make no sense otherwise.
|
||
|
||
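The accept/reject rules described in this node can be modelled with Python's `fnmatch' module. The sketch below is an illustration under stated assumptions, not Wget's implementation: an item containing a wildcard character is treated as a pattern, anything else as a suffix, and HTML files are recognized by a hypothetical check on the file name ending.

```python
import fnmatch

WILDCARDS = "*?[]"


def matches(name, item):
    """Match a file name against one suffix or shell-like pattern."""
    if any(c in item for c in WILDCARDS):
        return fnmatch.fnmatchcase(name, item)
    return name.endswith(item)


def should_download(name, accept=(), reject=()):
    """Apply -A / -R style lists to a file name."""
    # HTML files are always loaded so that recursion can continue.
    if name.endswith((".html", ".htm")):
        return True
    if accept and not any(matches(name, a) for a in accept):
        return False
    if any(matches(name, r) for r in reject):
        return False
    return True
```

For instance, `should_download("zelazny.ps", accept=["*zelazny*"], reject=[".ps"])' is false, while the same call with `zelazny.txt' is true, mirroring the combined `-A' / `-R' example above.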
|
||
File: wget.info, Node: Directory-Based Limits, Next: FTP Links, Prev: Types of Files, Up: Following Links
|
||
|
||
Directory-Based Limits
|
||
======================
|
||
|
||
Regardless of other link-following facilities, it is often useful to
|
||
restrict which files to retrieve based on the directories
|
||
those files are placed in. There can be many reasons for this--the
|
||
home pages may be organized in a reasonable directory structure; or some
|
||
directories may contain useless information, e.g. `/cgi-bin' or `/dev'
|
||
directories.
|
||
|
||
Wget offers three different options to deal with this requirement.
|
||
Each option description lists a short name, a long name, and the
|
||
equivalent command in `.wgetrc'.
|
||
|
||
`-I LIST'
|
||
`--include-directories LIST'
|
||
`include_directories = LIST'
|
||
The `-I' option accepts a comma-separated list of directories included
|
||
in the retrieval. Any other directories will simply be ignored.
|
||
The directories are absolute paths.
|
||
|
||
So, if you wish to download from `http://host/people/bozo/'
|
||
following only links to bozo's colleagues in the `/people'
|
||
directory and the bogus scripts in `/cgi-bin', you can specify:
|
||
|
||
wget -I /people,/cgi-bin http://host/people/bozo/
|
||
|
||
`-X LIST'
|
||
`--exclude-directories LIST'
|
||
`exclude_directories = LIST'
|
||
The `-X' option is exactly the reverse of `-I'--this is a list of
|
||
directories *excluded* from the download. E.g. if you do not want
|
||
Wget to download things from the `/cgi-bin' directory, specify `-X
|
||
/cgi-bin' on the command line.
|
||
|
||
The same as with `-A'/`-R', these two options can be combined to
|
||
get a better fine-tuning of downloading subdirectories. E.g. if
|
||
you want to load all the files from the `/pub' hierarchy except for
|
||
`/pub/worthless', specify `-I/pub -X/pub/worthless'.
|
||
|
||
`-np'
|
||
`--no-parent'
|
||
`no_parent = on'
|
||
The simplest, and often very useful way of limiting directories is
|
||
disallowing retrieval of the links that refer to the hierarchy
"above" the beginning directory, i.e. disallowing ascent to
|
||
the parent directory/directories.
|
||
|
||
The `--no-parent' option (short `-np') is useful in this case.
|
||
Using it guarantees that you will never leave the existing
|
||
hierarchy. Supposing you issue Wget with:
|
||
|
||
wget -r --no-parent http://somehost/~luzer/my-archive/
|
||
|
||
You may rest assured that none of the references to
|
||
`/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be
|
||
followed. Only the archive you are interested in will be
|
||
downloaded. Essentially, `--no-parent' is similar to
|
||
`-I/~luzer/my-archive', only it handles redirections in a more
|
||
intelligent fashion.
|
||
|
||
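The three directory limits can be combined in a small Python sketch. This is a simplified model rather than Wget's code: wildcards in LIST are not handled, the way `-I' and `-X' are combined (a path must be included and must not be excluded) is an assumption, and `no_parent_root' is a hypothetical parameter holding the directory of the starting URL.

```python
import posixpath


def under(path, directory):
    """True if path is the directory itself or lies below it."""
    directory = directory.rstrip("/")
    return path == directory or path.startswith(directory + "/")


def directory_allowed(path, include=(), exclude=(), no_parent_root=None):
    """Apply -I, -X and -np style limits to the path part of a URL."""
    directory = posixpath.dirname(path) or "/"
    # -np: never leave the hierarchy of the starting directory.
    if no_parent_root is not None and not under(directory, no_parent_root):
        return False
    # -I: if a list is given, the directory must fall under one entry.
    if include and not any(under(directory, d) for d in include):
        return False
    # -X: the directory must not fall under any excluded entry.
    if any(under(directory, d) for d in exclude):
        return False
    return True
```

Calling it with `include=["/pub"]' and `exclude=["/pub/worthless"]' reproduces the `-I/pub -X/pub/worthless' example: files below `/pub' pass, except those below `/pub/worthless'.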
|
||
File: wget.info, Node: FTP Links, Prev: Directory-Based Limits, Up: Following Links
|
||
|
||
Following FTP Links
|
||
===================
|
||
|
||
The rules for FTP are somewhat specific, as it is necessary for them
|
||
to be. FTP links in HTML documents are often included for purposes of
|
||
reference, and it is often inconvenient to download them by default.
|
||
|
||
To have FTP links followed from HTML documents, you need to specify
|
||
the `--follow-ftp' option. Having done that, FTP links will span hosts
|
||
regardless of `-H' setting. This is logical, as FTP links rarely point
|
||
to the same host where the HTTP server resides. For similar reasons,
|
||
the `-L' option has no effect on such downloads. On the other hand,
|
||
domain acceptance (`-D') and suffix rules (`-A' and `-R') apply
|
||
normally.
|
||
|
||
Also note that followed links to FTP directories will not be
|
||
retrieved recursively further.
|
||
|
||
|
||
File: wget.info, Node: Time-Stamping, Next: Startup File, Prev: Following Links, Up: Top
|
||
|
||
Time-Stamping
|
||
*************
|
||
|
||
One of the most important aspects of mirroring information from the
|
||
Internet is updating your archives.
|
||
|
||
Downloading the whole archive again and again, just to replace a few
|
||
changed files is expensive, both in terms of wasted bandwidth and money,
|
||
and the time to do the update. This is why all the mirroring tools
|
||
offer the option of incremental updating.
|
||
|
||
Such an updating mechanism means that the remote server is scanned in
|
||
search of "new" files. Only those new files will be downloaded in the
|
||
place of the old ones.
|
||
|
||
A file is considered new if one of these two conditions is met:
|
||
|
||
1. A file of that name does not already exist locally.
|
||
|
||
2. A file of that name does exist, but the remote file was modified
|
||
more recently than the local file.
|
||
|
||
To implement this, the program needs to be aware of the time of last
|
||
modification of both remote and local files. These values are
|
||
called the "time-stamps".
|
||
|
||
Time-stamping in GNU Wget is turned on using the `--timestamping'
|
||
(`-N') option, or through the `timestamping = on' directive in `.wgetrc'.
|
||
With this option, for each file it intends to download, Wget will check
|
||
whether a local file of the same name exists. If it does, and the
|
||
remote file is older, Wget will not download it.
|
||
|
||
If the local file does not exist, or the sizes of the files do not
|
||
match, Wget will download the remote file no matter what the time-stamps
|
||
say.
|
||
|
||
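The decision rule spelled out above fits in a few lines of Python. The sketch is only a restatement of the text; the `FileInfo' record and the function name are hypothetical and do not correspond to anything in the Wget sources.

```python
from collections import namedtuple

# Hypothetical record: modification time (seconds) and size (bytes).
FileInfo = namedtuple("FileInfo", ["mtime", "size"])


def needs_download(local, remote):
    """Decide whether the remote file should be fetched under -N.

    local is None when no file of that name exists locally.
    """
    if local is None:
        return True                    # nothing to compare against
    if local.size != remote.size:
        return True                    # sizes differ: time-stamps are ignored
    return remote.mtime > local.mtime  # fetch only if the remote copy is newer
```

Note that a size mismatch forces a download even when the local copy has the later time-stamp.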
* Menu:
|
||
|
||
* Time-Stamping Usage::
|
||
* HTTP Time-Stamping Internals::
|
||
* FTP Time-Stamping Internals::
|
||
|
||
|
||
File: wget.info, Node: Time-Stamping Usage, Next: HTTP Time-Stamping Internals, Prev: Time-Stamping, Up: Time-Stamping
|
||
|
||
Time-Stamping Usage
|
||
===================
|
||
|
||
The usage of time-stamping is simple. Say you would like to
|
||
download a file so that it keeps its date of modification.
|
||
|
||
wget -S http://www.gnu.ai.mit.edu/
|
||
|
||
A simple `ls -l' shows that the time stamp on the local file equals
|
||
the state of the `Last-Modified' header, as returned by the server. As
|
||
you can see, the time-stamping info is preserved locally, even without
|
||
`-N'.
|
||
|
||
Several days later, you would like Wget to check if the remote file
|
||
has changed, and download it if it has.
|
||
|
||
wget -N http://www.gnu.ai.mit.edu/
|
||
|
||
Wget will ask the server for the last-modified date. If the local
|
||
file is newer, the remote file will not be re-fetched. However, if the
|
||
remote file is more recent, Wget will proceed fetching it normally.
|
||
|
||
The same goes for FTP. For example:
|
||
|
||
wget ftp://ftp.ifi.uio.no/pub/emacs/gnus/*
|
||
|
||
`ls' will show that the timestamps are set according to the state on
|
||
the remote server. Reissuing the command with `-N' will make Wget
|
||
re-fetch *only* the files that have been modified.
|
||
|
||
In both HTTP and FTP retrieval Wget will time-stamp the local file
|
||
correctly (with or without `-N') if it gets the stamps, i.e. gets the
|
||
directory listing for FTP or the `Last-Modified' header for HTTP.
|
||
|
||
If you wished to mirror the GNU archive every week, you would use the
|
||
following command every week:
|
||
|
||
wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/
|
||
|
||
|
||
File: wget.info, Node: HTTP Time-Stamping Internals, Next: FTP Time-Stamping Internals, Prev: Time-Stamping Usage, Up: Time-Stamping
|
||
|
||
HTTP Time-Stamping Internals
|
||
============================
|
||
|
||
Time-stamping in HTTP is implemented by checking the
|
||
`Last-Modified' header. If you wish to retrieve the file `foo.html'
|
||
through HTTP, Wget will check whether `foo.html' exists locally. If it
|
||
doesn't, `foo.html' will be retrieved unconditionally.
|
||
|
||
If the file does exist locally, Wget will first check its local
|
||
time-stamp (similar to the way `ls -l' checks it), and then send a
|
||
`HEAD' request to the remote server, demanding the information on the
|
||
remote file.
|
||
|
||
The `Last-Modified' header is examined to find which file was
|
||
modified more recently (which makes it "newer"). If the remote file is
|
||
newer, it will be downloaded; if it is older, Wget will give up.(1)
|
||
|
||
When `--backup-converted' (`-K') is specified in conjunction with
|
||
`-N', server file `X' is compared to local file `X.orig', if extant,
|
||
rather than being compared to local file `X', which will always differ
|
||
if it's been converted by `--convert-links' (`-k').
|
||
|
||
Arguably, HTTP time-stamping should be implemented using the
|
||
`If-Modified-Since' request.
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) As an additional check, Wget will look at the `Content-Length'
|
||
header, and compare the sizes; if they are not the same, the remote
|
||
file will be downloaded no matter what the time-stamp says.
|
||
|
||
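To make the comparison concrete, here is a Python sketch of the two steps involved: turning a `Last-Modified' value into a time-stamp, and choosing which local file to compare against when `-K' is in effect. The function names are hypothetical, and the code illustrates the behavior described in this node rather than reproducing Wget's implementation.

```python
import os
from email.utils import parsedate_to_datetime


def http_date_to_timestamp(last_modified):
    """Convert a Last-Modified header value to seconds since the epoch."""
    return parsedate_to_datetime(last_modified).timestamp()


def remote_is_newer(last_modified, local_mtime):
    """True if the server's copy was modified after the local file."""
    return http_date_to_timestamp(last_modified) > local_mtime


def comparison_target(path, backup_converted=False):
    """Pick the local file whose time-stamp should be compared.

    With -K, the unconverted `X.orig' is preferred when it exists,
    because `X' itself may have been rewritten by -k.
    """
    orig = path + ".orig"
    if backup_converted and os.path.exists(orig):
        return orig
    return path
```

In practice the local time-stamp would come from something like `os.path.getmtime(comparison_target("foo.html", backup_converted=True))'.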
|
||
File: wget.info, Node: FTP Time-Stamping Internals, Prev: HTTP Time-Stamping Internals, Up: Time-Stamping
|
||
|
||
FTP Time-Stamping Internals
|
||
===========================
|
||
|
||
In theory, FTP time-stamping works much the same as HTTP, only FTP
|
||
has no headers--time-stamps must be received from the directory
|
||
listings.
|
||
|
||
For each directory files must be retrieved from, Wget will use the
|
||
`LIST' command to get the listing. It will try to analyze the listing,
|
||
assuming that it is a Unix `ls -l' listing, and extract the
|
||
time-stamps. The rest is exactly the same as for HTTP.
|
||
|
||
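Extracting a time-stamp from one line of a Unix-style listing might look like the Python sketch below. It is an assumption-laden illustration, not Wget's parser: it expects the common nine-column `ls -l' layout, takes the year for recent files (which are listed with a time of day instead of a year) from a hypothetical `current_year' argument, and ignores symbolic links and other special cases.

```python
from datetime import datetime

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_ls_line(line, current_year):
    """Parse one `ls -l' style line; return None if it does not look like one."""
    fields = line.split(None, 8)
    if len(fields) < 9 or fields[5] not in MONTHS:
        return None
    perms, _links, _owner, _group, size, month, day, year_or_time, name = fields
    if ":" in year_or_time:
        # Recent file: the listing gives HH:MM and omits the year.
        hour, minute = map(int, year_or_time.split(":"))
        year = current_year
    else:
        # Older file: the listing gives the year and omits the time.
        hour = minute = 0
        year = int(year_or_time)
    return {
        "name": name,
        "size": int(size),
        "is_dir": perms.startswith("d"),
        "mtime": datetime(year, MONTHS[month], int(day), hour, minute),
    }
```

The listing format only offers minute resolution for recent files and day resolution for older ones, so time-stamps recovered this way are necessarily coarse.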
The assumption that every directory listing is a Unix-style listing may
|
||
sound extremely constraining, but in practice it is not, as many
|
||
non-Unix FTP servers use the Unixoid listing format because most (all?)
|
||
of the clients understand it. Bear in mind that RFC959 defines no
|
||
standard way to get a file list, let alone the time-stamps. We can
|
||
only hope that a future standard will define this.
|
||
|
||
Another non-standard solution includes the use of the `MDTM' command
|
||
that is supported by some FTP servers (including the popular
|
||
`wu-ftpd'), which returns the exact modification time of the specified file. Wget
|
||
may support this command in the future.
|
||
|