[svn] Implemented breadth-first retrieval.
Published in <sxsherjczw2.fsf@florida.arsdigita.de>.
This commit is contained in:
  parent b88223f99d
  commit 222e9465b7
ChangeLog
@@ -1,3 +1,9 @@
2001-11-25 Hrvoje Niksic <hniksic@arsdigita.com>

* TODO: Ditto.

* NEWS: Updated with the latest stuff.

2001-11-23 Hrvoje Niksic <hniksic@arsdigita.com>

* po/hr.po: A major overhaul.
NEWS (10 changed lines)
@@ -7,9 +7,19 @@ Please send GNU Wget bug reports to <bug-wget@gnu.org>.

* Changes in Wget 1.8.

** "Recursive retrieval" now uses a breadth-first algorithm.
Recursive downloads are faster and consume *significantly* less memory
than before.

** A new progress indicator is now available. Try it with
--progress=bar or using `progress = bar' in `.wgetrc'.

** Host directories now contain port information if the URL is at a
non-standard port.

** Wget now supports the robots.txt directives specified in
<http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html>.

** URL parser has been fixed, especially the infamous overzealous
quoting bug. Wget no longer dequotes reserved characters, e.g. `%3F'
is no longer translated to `?', nor `%2B' to `+'. Unsafe characters
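The breadth-first item above is the heart of this commit. Below is a minimal sketch of the idea, assuming nothing beyond standard C: a FIFO queue of (url, depth) pairs replaces recursion, so memory is bounded by the crawl frontier rather than the call stack. All names in the sketch (crawl, enqueue, dequeue, fetch_and_extract) are made up for illustration; the real implementation is retrieve_tree()/url_enqueue()/url_dequeue() in src/recur.c, shown later in this diff.

/* Hypothetical sketch of breadth-first retrieval -- not code from this
   commit.  fetch_and_extract() stands in for "download a URL and return
   a NULL-terminated list of links found in it". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct qel { char *url; int depth; struct qel *next; };
struct queue { struct qel *head, *tail; };

static void
enqueue (struct queue *q, const char *url, int depth)
{
  struct qel *e = malloc (sizeof *e);
  e->url = strdup (url);
  e->depth = depth;
  e->next = NULL;
  if (q->tail)
    q->tail->next = e;
  else
    q->head = e;
  q->tail = e;
}

static struct qel *
dequeue (struct queue *q)
{
  struct qel *e = q->head;
  if (!e)
    return NULL;
  q->head = e->next;
  if (!q->head)
    q->tail = NULL;
  return e;
}

static char **
fetch_and_extract (const char *url)    /* stand-in, not a Wget function */
{
  printf ("fetching %s\n", url);
  return NULL;                         /* pretend no links were found */
}

static void
crawl (const char *start_url, int maxdepth)
{
  struct queue q = { NULL, NULL };
  struct qel *e;

  enqueue (&q, start_url, 0);
  while ((e = dequeue (&q)) != NULL)
    {
      char **links = fetch_and_extract (e->url);
      if (e->depth < maxdepth)
        for (; links && *links; links++)
          enqueue (&q, *links, e->depth + 1);   /* FIFO => breadth first */
      free (e->url);
      free (e);
    }
}

int
main (void)
{
  crawl ("http://example.com/", 5);    /* example.com is only an example */
  return 0;
}

Because URLs are processed in the order they are discovered, a site is fetched level by level, which is also what gives the "much nicer ordering of downloads" mentioned in the retrieve_tree() comment. The real function additionally keeps a blacklist hash table so no URL is enqueued twice, runs each candidate through descend_url_p(), and drains the queue on early exit.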
TODO (36 changed lines)
@@ -20,15 +20,6 @@ changes.
file, though forcibly disconnecting from the server at the desired endpoint
might be workable).

* RFC 1738 says that if logging on to an FTP server puts you in a directory
other than '/', the way to specify a file relative to '/' in a URL (let's use
"/bin/ls" in this example) is "ftp://host/%2Fbin/ls". Wget needs to support
this (and ideally not consider "ftp://host//bin/ls" to be equivalent, as that
would equate to the command "CWD " rather than "CWD /"). To accommodate people
used to broken FTP clients like Internet Explorer and Netscape, if
"ftp://host/bin/ls" doesn't exist, Wget should try again (perhaps under
control of an option), acting as if the user had typed "ftp://host/%2Fbin/ls".

* If multiple FTP URLs are specified that are on the same host, Wget should
re-use the connection rather than opening a new one for each file.

@@ -37,16 +28,9 @@ changes.

* Limit the number of successive redirections to max. 20 or so.

* If -c used on a file that's already completely downloaded, don't re-download
it (unless normal --timestamping processing would cause you to do so).

* If -c used with -N, check to make sure a file hasn't changed on the server
before "continuing" to download it (preventing a bogus hybrid file).

* Take a look at
<http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html>
and support the new directives.

* Generalize --html-extension to something like --mime-extensions and have it
look at mime.types/mimecap file for preferred extension. Non-HTML files with
filenames changed this way would be re-downloaded each time despite -N unless
@@ -87,9 +71,6 @@ changes.
turning it off. Get rid of `--foo=no' stuff. Short options would
be handled as `-x' vs. `-nx'.

* Implement "thermometer" display (not all that hard; use an
alternative show_progress() if the output goes to a terminal.)

* Add option to only list wildcard matches without doing the download.

* Add case-insensitivity as an option.
@@ -102,19 +83,13 @@ changes.

* Allow time-stamping by arbitrary date.

* Fix Unix directory parser to allow for spaces in file names.

* Allow size limit to files (perhaps with an option to download oversize files
up through the limit or not at all, to get more functionality than [u]limit).

* Implement breadth-first retrieval.

* Download to .in* when mirroring.

* Add an option to delete or move no-longer-existent files when mirroring.

* Implement a switch to avoid downloading multiple files (e.g. x and x.gz).

* Implement uploading (--upload URL?) in FTP and HTTP.

* Rewrite FTP code to allow for easy addition of new commands. It
@@ -129,13 +104,10 @@ changes.

* Implement a concept of "packages" a la mirror.

* Implement correct RFC1808 URL parsing.

* Implement more HTTP/1.1 bells and whistles (ETag, Content-MD5 etc.)

* Add a "rollback" option to have --continue throw away a configurable number of
bytes at the end of a file before resuming download. Apparently, some stupid
proxies insert a "transfer interrupted" string we need to get rid of.
* Add a "rollback" option to have continued retrieval throw away a
configurable number of bytes at the end of a file before resuming
download. Apparently, some stupid proxies insert a "transfer
interrupted" string we need to get rid of.

* When using --accept and --reject, you can end up with empty directories. Have
Wget remove any such at the end.
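The RFC 1738 item above benefits from a concrete example. The sketch below is not Wget code (print_ftp_commands and decode_segment are invented for illustration): it splits the url-path on "/" *before* percent-decoding, so "%2Fbin" decodes to the absolute segment "/bin" (giving "CWD /bin"), while the doubled slash in "ftp://host//bin/ls" yields an empty segment and the dubious "CWD " the entry warns about.

/* Hypothetical sketch, not Wget code: turn the url-path of an FTP URL
   into the FTP commands RFC 1738 implies.  Segments are split on '/'
   before percent-decoding, so an encoded "%2F" stays inside its
   segment instead of creating a new one. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

static int
hexval (int c)
{
  return isdigit (c) ? c - '0' : tolower (c) - 'a' + 10;
}

/* Decode %XX escapes in [b, e) into OUT (fixed buffer is enough for a
   sketch). */
static void
decode_segment (const char *b, const char *e, char *out)
{
  while (b < e)
    {
      if (*b == '%' && b + 2 < e && isxdigit ((unsigned char) b[1])
          && isxdigit ((unsigned char) b[2]))
        {
          *out++ = hexval (b[1]) * 16 + hexval (b[2]);
          b += 3;
        }
      else
        *out++ = *b++;
    }
  *out = '\0';
}

static void
print_ftp_commands (const char *path)       /* e.g. "%2Fbin/ls" */
{
  char seg[256];
  const char *p = path;
  for (;;)
    {
      const char *slash = strchr (p, '/');
      const char *end = slash ? slash : p + strlen (p);
      decode_segment (p, end, seg);
      if (slash)
        printf ("CWD %s\n", seg);   /* "CWD /bin" for "%2Fbin",
                                       "CWD " for an empty segment */
      else
        {
          printf ("RETR %s\n", seg);
          return;
        }
      p = slash + 1;
    }
}

int
main (void)
{
  print_ftp_commands ("%2Fbin/ls");   /* CWD /bin;  RETR ls          */
  print_ftp_commands ("/bin/ls");     /* CWD ;  CWD bin;  RETR ls    */
  return 0;
}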
src/ChangeLog
@@ -1,3 +1,68 @@
2001-11-25 Hrvoje Niksic <hniksic@arsdigita.com>

* url.c (reencode_string): Use unsigned char, not char --
otherwise the hex digits come out wrong for 8-bit chars such as
nbsp.
(lowercase_str): New function.
(url_parse): Canonicalize u->url if needed.
(get_urls_file): Parse each URL, and return only the valid ones.
(free_urlpos): Call url_free.
(mkstruct): Add :port if the port is non-standard.
(mkstruct): Append the query string to the file name, if any.
(urlpath_length): Use strpbrk_or_eos.
(uri_merge_1): Handle the cases where LINK is an empty string,
where LINK consists only of query, and where LINK consists only of
fragment.
(convert_links): Count and report both kinds of conversion.
(downloaded_file): Use a hash table, not a list.
(downloaded_files_free): Free the hash table.

* retr.c (retrieve_from_file): Ditto.

* main.c (main): Call either retrieve_url or retrieve_tree
for each URL, not both.

* retr.c (register_all_redirections): New function.
(register_redirections_mapper): Ditto.
(retrieve_url): Register the redirections.
(retrieve_url): Make the string "Error parsing proxy ..."
translatable.

* res.c (add_path): Strip leading slash from robots.txt paths so
that the path representations are "compatible".
(free_specs): Free each individual path, too.
(res_cleanup): New function.
(cleanup_hash_table_mapper): Ditto.

* recur.c (url_queue_new): New function.
(url_queue_delete): Ditto.
(url_enqueue): Ditto.
(url_dequeue): Ditto.
(retrieve_tree): New function, replacement for recursive_retrieve.
(descend_url_p): New function.
(register_redirection): New function.

* progress.c (create_image): Cosmetic changes.

* init.c (cleanup): Do all those complex cleanups only if
DEBUG_MALLOC is defined.

* main.c: Removed --simple-check and the corresponding
simple_host_check in init.c.

* html-url.c (handle_link): Parse the URL here, and propagate the
parsed URL to the caller, who would otherwise have to parse it
again.

* host.c (xstrdup_lower): Moved to utils.c.
(realhost): Removed.
(same_host): Ditto.

2001-11-24 Hrvoje Niksic <hniksic@arsdigita.com>

* utils.c (path_simplify): Preserve the (non-)existence of
leading slash. Return non-zero if changes were made.

2001-11-24 Hrvoje Niksic <hniksic@arsdigita.com>

* progress.c (bar_update): Don't modify bp->total_length if it is
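The first url.c entry above (reencode_string: "use unsigned char, not char") is worth a concrete illustration, because the failure only appears for 8-bit bytes and only where plain char is signed. The snippet below is not Wget code; it merely demonstrates the sign-extension problem for a byte like nbsp (0xA0).

/* Illustrative only.  With a signed char, 0xA0 becomes -96, the shift
   sign-extends, and the table index goes out of range; with unsigned
   char the two nibbles are 0xA and 0x0 and the escape is "%A0". */
#include <stdio.h>

static void
escape_byte_wrong (char c)            /* plain char, signed on most platforms */
{
  static const char hex[] = "0123456789ABCDEF";
  printf ("%%%c%c\n", hex[c >> 4], hex[c & 15]);   /* c >> 4 == -6 for 0xA0 */
}

static void
escape_byte_right (unsigned char c)
{
  static const char hex[] = "0123456789ABCDEF";
  printf ("%%%c%c\n", hex[c >> 4], hex[c & 15]);   /* prints "%A0" for nbsp */
}

int
main (void)
{
  escape_byte_wrong ((char) 0xA0);    /* out-of-bounds index, garbage output */
  escape_byte_right (0xA0);           /* "%A0" */
  return 0;
}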
src/Makefile.in
@@ -162,8 +162,10 @@ main$o: wget.h utils.h init.h retr.h recur.h host.h cookies.h
gnu-md5$o: wget.h gnu-md5.h
mswindows$o: wget.h url.h
netrc$o: wget.h utils.h netrc.h init.h
progress$o: wget.h progress.h utils.h retr.h
rbuf$o: wget.h rbuf.h connect.h
recur$o: wget.h url.h recur.h utils.h retr.h ftp.h fnmatch.h host.h hash.h
res$o: wget.h utils.h hash.h url.h retr.h res.h
retr$o: wget.h utils.h retr.h url.h recur.h ftp.h host.h connect.h hash.h
snprintf$o:
safe-ctype$o: safe-ctype.h
src/host.c (128 changed lines)
@ -60,8 +60,14 @@ extern int errno;
|
||||
#endif
|
||||
|
||||
/* Mapping between all known hosts to their addresses (n.n.n.n). */
|
||||
|
||||
/* #### We should map to *lists* of IP addresses. */
|
||||
|
||||
struct hash_table *host_name_address_map;
|
||||
|
||||
/* The following two tables are obsolete, since we no longer do host
|
||||
canonicalization. */
|
||||
|
||||
/* Mapping between all known addresses (n.n.n.n) to their hosts. This
|
||||
is the inverse of host_name_address_map. These two tables share
|
||||
the strdup'ed strings. */
|
||||
@ -70,18 +76,6 @@ struct hash_table *host_address_name_map;
|
||||
/* Mapping between auxiliary (slave) and master host names. */
|
||||
struct hash_table *host_slave_master_map;
|
||||
|
||||
/* Utility function: like xstrdup(), but also lowercases S. */
|
||||
|
||||
static char *
|
||||
xstrdup_lower (const char *s)
|
||||
{
|
||||
char *copy = xstrdup (s);
|
||||
char *p = copy;
|
||||
for (; *p; p++)
|
||||
*p = TOLOWER (*p);
|
||||
return copy;
|
||||
}
|
||||
|
||||
/* The same as gethostbyname, but supports internet addresses of the
|
||||
form `N.N.N.N'. On some systems gethostbyname() knows how to do
|
||||
this automatically. */
|
||||
@ -216,114 +210,6 @@ store_hostaddress (unsigned char *where, const char *hostname)
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* Determine the "real" name of HOST, as perceived by Wget. If HOST
|
||||
is referenced by more than one name, "real" name is considered to
|
||||
be the first one encountered in the past. */
|
||||
char *
|
||||
realhost (const char *host)
|
||||
{
|
||||
struct in_addr in;
|
||||
struct hostent *hptr;
|
||||
char *master_name;
|
||||
|
||||
DEBUGP (("Checking for %s in host_name_address_map.\n", host));
|
||||
if (hash_table_contains (host_name_address_map, host))
|
||||
{
|
||||
DEBUGP (("Found; %s was already used, by that name.\n", host));
|
||||
return xstrdup_lower (host);
|
||||
}
|
||||
|
||||
DEBUGP (("Checking for %s in host_slave_master_map.\n", host));
|
||||
master_name = hash_table_get (host_slave_master_map, host);
|
||||
if (master_name)
|
||||
{
|
||||
has_master:
|
||||
DEBUGP (("Found; %s was already used, by the name %s.\n",
|
||||
host, master_name));
|
||||
return xstrdup (master_name);
|
||||
}
|
||||
|
||||
DEBUGP (("First time I hear about %s by that name; looking it up.\n",
|
||||
host));
|
||||
hptr = ngethostbyname (host);
|
||||
if (hptr)
|
||||
{
|
||||
char *inet_s;
|
||||
/* Originally, we copied to in.s_addr, but it appears to be
|
||||
missing on some systems. */
|
||||
memcpy (&in, *hptr->h_addr_list, sizeof (in));
|
||||
inet_s = inet_ntoa (in);
|
||||
|
||||
add_host_to_cache (host, inet_s);
|
||||
|
||||
/* add_host_to_cache() can establish a slave-master mapping. */
|
||||
DEBUGP (("Checking again for %s in host_slave_master_map.\n", host));
|
||||
master_name = hash_table_get (host_slave_master_map, host);
|
||||
if (master_name)
|
||||
goto has_master;
|
||||
}
|
||||
|
||||
return xstrdup_lower (host);
|
||||
}
|
||||
|
||||
/* Compare two hostnames (out of URL-s if the arguments are URL-s),
|
||||
taking care of aliases. It uses realhost() to determine a unique
|
||||
hostname for each of two hosts. If simple_check is non-zero, only
|
||||
strcmp() is used for comparison. */
|
||||
int
|
||||
same_host (const char *u1, const char *u2)
|
||||
{
|
||||
const char *s;
|
||||
char *p1, *p2;
|
||||
char *real1, *real2;
|
||||
|
||||
/* Skip protocol, if present. */
|
||||
u1 += url_skip_scheme (u1);
|
||||
u2 += url_skip_scheme (u2);
|
||||
|
||||
/* Skip username and password, if present. */
|
||||
u1 += url_skip_uname (u1);
|
||||
u2 += url_skip_uname (u2);
|
||||
|
||||
for (s = u1; *u1 && *u1 != '/' && *u1 != ':'; u1++);
|
||||
p1 = strdupdelim (s, u1);
|
||||
for (s = u2; *u2 && *u2 != '/' && *u2 != ':'; u2++);
|
||||
p2 = strdupdelim (s, u2);
|
||||
DEBUGP (("Comparing hosts %s and %s...\n", p1, p2));
|
||||
if (strcasecmp (p1, p2) == 0)
|
||||
{
|
||||
xfree (p1);
|
||||
xfree (p2);
|
||||
DEBUGP (("They are quite alike.\n"));
|
||||
return 1;
|
||||
}
|
||||
else if (opt.simple_check)
|
||||
{
|
||||
xfree (p1);
|
||||
xfree (p2);
|
||||
DEBUGP (("Since checking is simple, I'd say they are not the same.\n"));
|
||||
return 0;
|
||||
}
|
||||
real1 = realhost (p1);
|
||||
real2 = realhost (p2);
|
||||
xfree (p1);
|
||||
xfree (p2);
|
||||
if (strcasecmp (real1, real2) == 0)
|
||||
{
|
||||
DEBUGP (("They are alike, after realhost()->%s.\n", real1));
|
||||
xfree (real1);
|
||||
xfree (real2);
|
||||
return 1;
|
||||
}
|
||||
else
|
||||
{
|
||||
DEBUGP (("They are not the same (%s, %s).\n", real1, real2));
|
||||
xfree (real1);
|
||||
xfree (real2);
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
/* Determine whether a URL is acceptable to be followed, according to
|
||||
a list of domains to accept. */
|
||||
int
|
||||
@ -383,7 +269,7 @@ herrmsg (int error)
|
||||
}
|
||||
|
||||
void
|
||||
clean_hosts (void)
|
||||
host_cleanup (void)
|
||||
{
|
||||
/* host_name_address_map and host_address_name_map share the
|
||||
strings. Because of that, calling free_keys_and_values once
|
||||
|
@ -27,15 +27,11 @@ struct url;
|
||||
struct hostent *ngethostbyname PARAMS ((const char *));
|
||||
int store_hostaddress PARAMS ((unsigned char *, const char *));
|
||||
|
||||
void clean_hosts PARAMS ((void));
|
||||
void host_cleanup PARAMS ((void));
|
||||
|
||||
char *realhost PARAMS ((const char *));
|
||||
int same_host PARAMS ((const char *, const char *));
|
||||
int accept_domain PARAMS ((struct url *));
|
||||
int sufmatch PARAMS ((const char **, const char *));
|
||||
|
||||
char *ftp_getaddress PARAMS ((void));
|
||||
|
||||
char *herrmsg PARAMS ((int));
|
||||
|
||||
#endif /* HOST_H */
|
||||
|
@ -284,7 +284,7 @@ struct collect_urls_closure {
|
||||
char *text; /* HTML text. */
|
||||
char *base; /* Base URI of the document, possibly
|
||||
changed through <base href=...>. */
|
||||
urlpos *head, *tail; /* List of URLs */
|
||||
struct urlpos *head, *tail; /* List of URLs */
|
||||
const char *parent_base; /* Base of the current document. */
|
||||
const char *document_file; /* File name of this document. */
|
||||
int dash_p_leaf_HTML; /* Whether -p is specified, and this
|
||||
@ -301,59 +301,67 @@ static void
|
||||
handle_link (struct collect_urls_closure *closure, const char *link_uri,
|
||||
struct taginfo *tag, int attrid)
|
||||
{
|
||||
int no_scheme = !url_has_scheme (link_uri);
|
||||
urlpos *newel;
|
||||
|
||||
int link_has_scheme = url_has_scheme (link_uri);
|
||||
struct urlpos *newel;
|
||||
const char *base = closure->base ? closure->base : closure->parent_base;
|
||||
char *complete_uri;
|
||||
|
||||
char *fragment = strrchr (link_uri, '#');
|
||||
|
||||
if (fragment)
|
||||
{
|
||||
/* Nullify the fragment identifier, i.e. everything after the
|
||||
last occurrence of `#', inclusive. This copying is
|
||||
relatively inefficient, but it doesn't matter because
|
||||
fragment identifiers don't come up all that often. */
|
||||
int hashlen = fragment - link_uri;
|
||||
char *p = alloca (hashlen + 1);
|
||||
memcpy (p, link_uri, hashlen);
|
||||
p[hashlen] = '\0';
|
||||
link_uri = p;
|
||||
}
|
||||
struct url *url;
|
||||
|
||||
if (!base)
|
||||
{
|
||||
if (no_scheme)
|
||||
DEBUGP (("%s: no base, merge will use \"%s\".\n",
|
||||
closure->document_file, link_uri));
|
||||
|
||||
if (!link_has_scheme)
|
||||
{
|
||||
/* We have no base, and the link does not have a host
|
||||
attached to it. Nothing we can do. */
|
||||
/* #### Should we print a warning here? Wget 1.5.x used to. */
|
||||
return;
|
||||
}
|
||||
else
|
||||
complete_uri = xstrdup (link_uri);
|
||||
|
||||
url = url_parse (link_uri, NULL);
|
||||
if (!url)
|
||||
{
|
||||
DEBUGP (("%s: link \"%s\" doesn't parse.\n",
|
||||
closure->document_file, link_uri));
|
||||
return;
|
||||
}
|
||||
}
|
||||
else
|
||||
complete_uri = uri_merge (base, link_uri);
|
||||
{
|
||||
/* Merge BASE with LINK_URI, but also make sure the result is
|
||||
canonicalized, i.e. that "../" segments have been resolved.
|
||||
(parse_url will do that for us.) */
|
||||
|
||||
DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n",
|
||||
closure->document_file, base ? base : "(null)",
|
||||
link_uri, complete_uri));
|
||||
char *complete_uri = uri_merge (base, link_uri);
|
||||
|
||||
newel = (urlpos *)xmalloc (sizeof (urlpos));
|
||||
DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n",
|
||||
closure->document_file, base, link_uri, complete_uri));
|
||||
|
||||
url = url_parse (complete_uri, NULL);
|
||||
if (!url)
|
||||
{
|
||||
DEBUGP (("%s: merged link \"%s\" doesn't parse.\n",
|
||||
closure->document_file, complete_uri));
|
||||
xfree (complete_uri);
|
||||
return;
|
||||
}
|
||||
xfree (complete_uri);
|
||||
}
|
||||
|
||||
newel = (struct urlpos *)xmalloc (sizeof (struct urlpos));
|
||||
|
||||
memset (newel, 0, sizeof (*newel));
|
||||
newel->next = NULL;
|
||||
newel->url = complete_uri;
|
||||
newel->url = url;
|
||||
newel->pos = tag->attrs[attrid].value_raw_beginning - closure->text;
|
||||
newel->size = tag->attrs[attrid].value_raw_size;
|
||||
|
||||
/* A URL is relative if the host is not named, and the name does not
|
||||
start with `/'. */
|
||||
if (no_scheme && *link_uri != '/')
|
||||
if (!link_has_scheme && *link_uri != '/')
|
||||
newel->link_relative_p = 1;
|
||||
else if (!no_scheme)
|
||||
else if (link_has_scheme)
|
||||
newel->link_complete_p = 1;
|
||||
|
||||
if (closure->tail)
|
||||
@ -542,7 +550,7 @@ collect_tags_mapper (struct taginfo *tag, void *arg)
|
||||
|
||||
If dash_p_leaf_HTML is non-zero, only the elements needed to render
|
||||
FILE ("non-external" links) will be returned. */
|
||||
urlpos *
|
||||
struct urlpos *
|
||||
get_urls_html (const char *file, const char *this_url, int dash_p_leaf_HTML,
|
||||
int *meta_disallow_follow)
|
||||
{
|
||||
|
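The handle_link() comment above notes that uri_merge() leaves "../" in the merged URL and relies on url_parse() to canonicalize it. Here is a minimal sketch of that canonicalization step, assuming only the usual RFC resolution of "." and ".." segments in an absolute path; it is not Wget's path_simplify(), just an illustration of what "resolved" means.

/* Hypothetical sketch: resolve "." and ".." segments in place, so the
   merge result "http://host/a/b/../img/logo.png" canonicalizes to
   "http://host/a/img/logo.png".  Operates on the path part only and
   assumes it begins with '/'. */
#include <stdio.h>
#include <string.h>

static void
simplify_path (char *path)            /* e.g. "/a/b/../img/logo.png" */
{
  char *out = path;
  const char *in = path;

  while (*in)
    {
      if (strncmp (in, "/./", 3) == 0)
        in += 2;                                /* drop "/." */
      else if (strncmp (in, "/../", 4) == 0 || strcmp (in, "/..") == 0)
        {
          in += 3;                              /* drop "/.." ...        */
          while (out > path && *--out != '/')   /* ... and the segment   */
            ;                                   /* that preceded it      */
        }
      else
        *out++ = *in++;
    }
  if (out == path)
    *out++ = '/';
  *out = '\0';
}

int
main (void)
{
  char url_path[] = "/a/b/../img/logo.png";
  simplify_path (url_path);
  printf ("%s\n", url_path);          /* "/a/img/logo.png" */
  return 0;
}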
@ -1452,8 +1452,8 @@ File `%s' already there, will not retrieve.\n"), *hstat.local_file);
|
||||
if (((suf = suffix (*hstat.local_file)) != NULL)
|
||||
&& (!strcmp (suf, "html") || !strcmp (suf, "htm")))
|
||||
*dt |= TEXTHTML;
|
||||
xfree (suf);
|
||||
|
||||
FREE_MAYBE (suf);
|
||||
FREE_MAYBE (dummy);
|
||||
return RETROK;
|
||||
}
|
||||
|
src/init.c (26 changed lines)
@ -171,7 +171,6 @@ static struct {
|
||||
{ "savecookies", &opt.cookies_output, cmd_file },
|
||||
{ "saveheaders", &opt.save_headers, cmd_boolean },
|
||||
{ "serverresponse", &opt.server_response, cmd_boolean },
|
||||
{ "simplehostcheck", &opt.simple_check, cmd_boolean },
|
||||
{ "spanhosts", &opt.spanhost, cmd_boolean },
|
||||
{ "spider", &opt.spider, cmd_boolean },
|
||||
#ifdef HAVE_SSL
|
||||
@ -1009,6 +1008,7 @@ check_user_specified_header (const char *s)
|
||||
}
|
||||
|
||||
void cleanup_html_url PARAMS ((void));
|
||||
void res_cleanup PARAMS ((void));
|
||||
void downloaded_files_free PARAMS ((void));
|
||||
|
||||
|
||||
@ -1016,13 +1016,27 @@ void downloaded_files_free PARAMS ((void));
|
||||
void
|
||||
cleanup (void)
|
||||
{
|
||||
extern acc_t *netrc_list;
|
||||
/* Free external resources, close files, etc. */
|
||||
|
||||
recursive_cleanup ();
|
||||
clean_hosts ();
|
||||
free_netrc (netrc_list);
|
||||
if (opt.dfp)
|
||||
fclose (opt.dfp);
|
||||
|
||||
/* We're exiting anyway so there's no real need to call free()
|
||||
hundreds of times. Skipping the frees will make Wget exit
|
||||
faster.
|
||||
|
||||
However, when detecting leaks, it's crucial to free() everything
|
||||
because then you can find the real leaks, i.e. the allocated
|
||||
memory which grows with the size of the program. */
|
||||
|
||||
#ifdef DEBUG_MALLOC
|
||||
recursive_cleanup ();
|
||||
res_cleanup ();
|
||||
host_cleanup ();
|
||||
{
|
||||
extern acc_t *netrc_list;
|
||||
free_netrc (netrc_list);
|
||||
}
|
||||
cleanup_html_url ();
|
||||
downloaded_files_free ();
|
||||
cookies_cleanup ();
|
||||
@ -1037,6 +1051,7 @@ cleanup (void)
|
||||
free_vec (opt.domains);
|
||||
free_vec (opt.follow_tags);
|
||||
free_vec (opt.ignore_tags);
|
||||
FREE_MAYBE (opt.progress_type);
|
||||
xfree (opt.ftp_acc);
|
||||
FREE_MAYBE (opt.ftp_pass);
|
||||
FREE_MAYBE (opt.ftp_proxy);
|
||||
@ -1055,4 +1070,5 @@ cleanup (void)
|
||||
FREE_MAYBE (opt.bind_address);
|
||||
FREE_MAYBE (opt.cookies_input);
|
||||
FREE_MAYBE (opt.cookies_output);
|
||||
#endif
|
||||
}
|
||||
|
src/main.c (20 changed lines)
@ -402,9 +402,6 @@ hpVqvdkKsxmNWrHSLcFbEY:G:g:T:U:O:l:n:i:o:a:t:D:A:R:P:B:e:Q:X:I:w:C:",
|
||||
case 149:
|
||||
setval ("removelisting", "off");
|
||||
break;
|
||||
case 150:
|
||||
setval ("simplehostcheck", "on");
|
||||
break;
|
||||
case 155:
|
||||
setval ("bindaddress", optarg);
|
||||
break;
|
||||
@ -604,7 +601,7 @@ GNU General Public License for more details.\n"));
|
||||
break;
|
||||
case 'n':
|
||||
{
|
||||
/* #### The n? options are utter crock! */
|
||||
/* #### What we really want here is --no-foo. */
|
||||
char *p;
|
||||
|
||||
for (p = optarg; *p; p++)
|
||||
@ -613,9 +610,6 @@ GNU General Public License for more details.\n"));
|
||||
case 'v':
|
||||
setval ("verbose", "off");
|
||||
break;
|
||||
case 'h':
|
||||
setval ("simplehostcheck", "on");
|
||||
break;
|
||||
case 'H':
|
||||
setval ("addhostdir", "off");
|
||||
break;
|
||||
@ -806,17 +800,17 @@ Can't timestamp and not clobber old files at the same time.\n"));
|
||||
#endif /* HAVE_SIGNAL */
|
||||
|
||||
status = RETROK; /* initialize it, just-in-case */
|
||||
recursive_reset ();
|
||||
/*recursive_reset ();*/
|
||||
/* Retrieve the URLs from argument list. */
|
||||
for (t = url; *t; t++)
|
||||
{
|
||||
char *filename, *redirected_URL;
|
||||
char *filename = NULL, *redirected_URL = NULL;
|
||||
int dt;
|
||||
|
||||
status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);
|
||||
if (opt.recursive && status == RETROK && (dt & TEXTHTML))
|
||||
status = recursive_retrieve (filename,
|
||||
redirected_URL ? redirected_URL : *t);
|
||||
if (opt.recursive && url_scheme (*t) != SCHEME_FTP)
|
||||
status = retrieve_tree (*t);
|
||||
else
|
||||
status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);
|
||||
|
||||
if (opt.delete_after && file_exists_p(filename))
|
||||
{
|
||||
|
@ -36,9 +36,6 @@ struct options
|
||||
int relative_only; /* Follow only relative links. */
|
||||
int no_parent; /* Restrict access to the parent
|
||||
directory. */
|
||||
int simple_check; /* Should we use simple checking
|
||||
(strcmp) or do we create a host
|
||||
hash and call gethostbyname? */
|
||||
int reclevel; /* Maximum level of recursion */
|
||||
int dirstruct; /* Do we build the directory structure
|
||||
as we go along? */
|
||||
|
@ -27,6 +27,9 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
|
||||
# include <strings.h>
|
||||
#endif /* HAVE_STRING_H */
|
||||
#include <assert.h>
|
||||
#ifdef HAVE_UNISTD_H
|
||||
# include <unistd.h>
|
||||
#endif
|
||||
|
||||
#include "wget.h"
|
||||
#include "progress.h"
|
||||
@ -470,14 +473,14 @@ create_image (struct bar_progress *bp, long dltime)
|
||||
Calculate its geometry:
|
||||
|
||||
"xxx% " - percentage - 5 chars
|
||||
"| ... | " - progress bar decorations - 3 chars
|
||||
"| ... |" - progress bar decorations - 2 chars
|
||||
"1012.56 K/s " - dl rate - 12 chars
|
||||
"nnnn " - downloaded bytes - 11 chars
|
||||
"ETA: xx:xx:xx" - ETA - 13 chars
|
||||
|
||||
"=====>..." - progress bar content - the rest
|
||||
*/
|
||||
int progress_len = screen_width - (5 + 3 + 12 + 11 + 13);
|
||||
int progress_len = screen_width - (5 + 2 + 12 + 11 + 13);
|
||||
|
||||
if (progress_len < 7)
|
||||
progress_len = 0;
|
||||
@ -530,7 +533,7 @@ create_image (struct bar_progress *bp, long dltime)
|
||||
}
|
||||
else
|
||||
{
|
||||
strcpy (p, "----.-- K/s ");
|
||||
strcpy (p, " --.-- K/s ");
|
||||
p += 12;
|
||||
}
|
||||
|
||||
|
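A quick worked example of the geometry comment in create_image() above, assuming an 80-column terminal (the width itself is an assumption, not something this commit fixes): the fixed-width pieces take 5 + 2 + 12 + 11 + 13 = 43 columns, leaving 37 for the bar itself.

/* Illustrative arithmetic only; mirrors the updated comment above. */
#include <stdio.h>

int
main (void)
{
  int screen_width = 80;                    /* assumed terminal width */
  int fixed = 5 + 2 + 12 + 11 + 13;         /* %, "| |", rate, bytes, ETA */
  int progress_len = screen_width - fixed;  /* 80 - 43 = 37 columns */
  if (progress_len < 7)                     /* same guard as create_image */
    progress_len = 0;
  printf ("bar gets %d columns\n", progress_len);
  return 0;
}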
src/recur.c (883 changed lines)
@ -1,5 +1,5 @@
|
||||
/* Handling of recursive HTTP retrieving.
|
||||
Copyright (C) 1995, 1996, 1997, 2000 Free Software Foundation, Inc.
|
||||
Copyright (C) 1995, 1996, 1997, 2000, 2001 Free Software Foundation, Inc.
|
||||
|
||||
This file is part of GNU Wget.
|
||||
|
||||
@ -54,452 +54,480 @@ static struct hash_table *dl_file_url_map;
|
||||
static struct hash_table *dl_url_file_map;
|
||||
|
||||
/* List of HTML files downloaded in this Wget run. Used for link
|
||||
conversion after Wget is done. */
|
||||
conversion after Wget is done. This list should only be traversed
|
||||
in order. If you need to check whether a file has been downloaded,
|
||||
use a hash table, e.g. dl_file_url_map. */
|
||||
static slist *downloaded_html_files;
|
||||
|
||||
/* Functions for maintaining the URL queue. */
|
||||
|
||||
/* List of undesirable-to-load URLs. */
|
||||
static struct hash_table *undesirable_urls;
|
||||
struct queue_element {
|
||||
const char *url;
|
||||
const char *referer;
|
||||
int depth;
|
||||
struct queue_element *next;
|
||||
};
|
||||
|
||||
/* Current recursion depth. */
|
||||
static int depth;
|
||||
struct url_queue {
|
||||
struct queue_element *head;
|
||||
struct queue_element *tail;
|
||||
int count, maxcount;
|
||||
};
|
||||
|
||||
/* Base directory we're recursing from (used by no_parent). */
|
||||
static char *base_dir;
|
||||
/* Create a URL queue. */
|
||||
|
||||
static int first_time = 1;
|
||||
|
||||
|
||||
/* Cleanup the data structures associated with recursive retrieving
|
||||
(the variables above). */
|
||||
void
|
||||
recursive_cleanup (void)
|
||||
static struct url_queue *
|
||||
url_queue_new (void)
|
||||
{
|
||||
if (undesirable_urls)
|
||||
{
|
||||
string_set_free (undesirable_urls);
|
||||
undesirable_urls = NULL;
|
||||
}
|
||||
if (dl_file_url_map)
|
||||
{
|
||||
free_keys_and_values (dl_file_url_map);
|
||||
hash_table_destroy (dl_file_url_map);
|
||||
dl_file_url_map = NULL;
|
||||
}
|
||||
if (dl_url_file_map)
|
||||
{
|
||||
free_keys_and_values (dl_url_file_map);
|
||||
hash_table_destroy (dl_url_file_map);
|
||||
dl_url_file_map = NULL;
|
||||
}
|
||||
undesirable_urls = NULL;
|
||||
slist_free (downloaded_html_files);
|
||||
downloaded_html_files = NULL;
|
||||
FREE_MAYBE (base_dir);
|
||||
first_time = 1;
|
||||
struct url_queue *queue = xmalloc (sizeof (*queue));
|
||||
memset (queue, '\0', sizeof (*queue));
|
||||
return queue;
|
||||
}
|
||||
|
||||
/* Reset FIRST_TIME to 1, so that some action can be taken in
|
||||
recursive_retrieve(). */
|
||||
void
|
||||
recursive_reset (void)
|
||||
/* Delete a URL queue. */
|
||||
|
||||
static void
|
||||
url_queue_delete (struct url_queue *queue)
|
||||
{
|
||||
first_time = 1;
|
||||
xfree (queue);
|
||||
}
|
||||
|
||||
/* The core of recursive retrieving. Endless recursion is avoided by
|
||||
having all URLs stored to a linked list of URLs, which is checked
|
||||
before loading any URL. That way no URL can get loaded twice.
|
||||
/* Enqueue a URL in the queue. The queue is FIFO: the items will be
|
||||
retrieved ("dequeued") from the queue in the order they were placed
|
||||
into it. */
|
||||
|
||||
static void
|
||||
url_enqueue (struct url_queue *queue,
|
||||
const char *url, const char *referer, int depth)
|
||||
{
|
||||
struct queue_element *qel = xmalloc (sizeof (*qel));
|
||||
qel->url = url;
|
||||
qel->referer = referer;
|
||||
qel->depth = depth;
|
||||
qel->next = NULL;
|
||||
|
||||
++queue->count;
|
||||
if (queue->count > queue->maxcount)
|
||||
queue->maxcount = queue->count;
|
||||
|
||||
DEBUGP (("Enqueuing %s at depth %d\n", url, depth));
|
||||
DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount));
|
||||
|
||||
if (queue->tail)
|
||||
queue->tail->next = qel;
|
||||
queue->tail = qel;
|
||||
|
||||
if (!queue->head)
|
||||
queue->head = queue->tail;
|
||||
}
|
||||
|
||||
/* Take a URL out of the queue. Return 1 if this operation succeeded,
|
||||
or 0 if the queue is empty. */
|
||||
|
||||
static int
|
||||
url_dequeue (struct url_queue *queue,
|
||||
const char **url, const char **referer, int *depth)
|
||||
{
|
||||
struct queue_element *qel = queue->head;
|
||||
|
||||
if (!qel)
|
||||
return 0;
|
||||
|
||||
queue->head = queue->head->next;
|
||||
if (!queue->head)
|
||||
queue->tail = NULL;
|
||||
|
||||
*url = qel->url;
|
||||
*referer = qel->referer;
|
||||
*depth = qel->depth;
|
||||
|
||||
--queue->count;
|
||||
|
||||
DEBUGP (("Dequeuing %s at depth %d\n", qel->url, qel->depth));
|
||||
DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount));
|
||||
|
||||
xfree (qel);
|
||||
return 1;
|
||||
}
|
||||
|
||||
static int descend_url_p PARAMS ((const struct urlpos *, struct url *, int,
|
||||
struct url *, struct hash_table *));
|
||||
|
||||
/* Retrieve a part of the web beginning with START_URL. This used to
|
||||
be called "recursive retrieval", because the old function was
|
||||
recursive and implemented depth-first search. retrieve_tree on the
|
||||
other hand implements breadth-search traversal of the tree, which
|
||||
results in much nicer ordering of downloads.
|
||||
|
||||
The algorithm this function uses is simple:
|
||||
|
||||
1. put START_URL in the queue.
|
||||
2. while there are URLs in the queue:
|
||||
|
||||
3. get next URL from the queue.
|
||||
4. download it.
|
||||
5. if the URL is HTML and its depth does not exceed maximum depth,
|
||||
get the list of URLs embedded therein.
|
||||
6. for each of those URLs do the following:
|
||||
|
||||
7. if the URL is not one of those downloaded before, and if it
|
||||
satisfies the criteria specified by the various command-line
|
||||
options, add it to the queue. */
|
||||
|
||||
The function also supports specification of maximum recursion depth
|
||||
and a number of other goodies. */
|
||||
uerr_t
|
||||
recursive_retrieve (const char *file, const char *this_url)
|
||||
retrieve_tree (const char *start_url)
|
||||
{
|
||||
char *constr, *filename, *newloc;
|
||||
char *canon_this_url = NULL;
|
||||
int dt, inl, dash_p_leaf_HTML = FALSE;
|
||||
int meta_disallow_follow;
|
||||
int this_url_ftp; /* See below the explanation */
|
||||
urlpos *url_list, *cur_url;
|
||||
struct url *u;
|
||||
uerr_t status = RETROK;
|
||||
|
||||
assert (this_url != NULL);
|
||||
assert (file != NULL);
|
||||
/* If quota was exceeded earlier, bail out. */
|
||||
if (downloaded_exceeds_quota ())
|
||||
return QUOTEXC;
|
||||
/* Cache the current URL in the list. */
|
||||
if (first_time)
|
||||
/* The queue of URLs we need to load. */
|
||||
struct url_queue *queue = url_queue_new ();
|
||||
|
||||
/* The URLs we decided we don't want to load. */
|
||||
struct hash_table *blacklist = make_string_hash_table (0);
|
||||
|
||||
/* We'll need various components of this, so better get it over with
|
||||
now. */
|
||||
struct url *start_url_parsed = url_parse (start_url, NULL);
|
||||
|
||||
url_enqueue (queue, xstrdup (start_url), NULL, 0);
|
||||
string_set_add (blacklist, start_url);
|
||||
|
||||
while (1)
|
||||
{
|
||||
/* These three operations need to be done only once per Wget
|
||||
run. They should probably be at a different location. */
|
||||
if (!undesirable_urls)
|
||||
undesirable_urls = make_string_hash_table (0);
|
||||
int descend = 0;
|
||||
char *url, *referer, *file = NULL;
|
||||
int depth;
|
||||
boolean dash_p_leaf_HTML = FALSE;
|
||||
|
||||
hash_table_clear (undesirable_urls);
|
||||
string_set_add (undesirable_urls, this_url);
|
||||
/* Enter this_url to the hash table, in original and "enhanced" form. */
|
||||
u = url_parse (this_url, NULL);
|
||||
if (u)
|
||||
{
|
||||
string_set_add (undesirable_urls, u->url);
|
||||
if (opt.no_parent)
|
||||
base_dir = xstrdup (u->dir); /* Set the base dir. */
|
||||
/* Set the canonical this_url to be sent as referer. This
|
||||
problem exists only when running the first time. */
|
||||
canon_this_url = xstrdup (u->url);
|
||||
}
|
||||
else
|
||||
{
|
||||
DEBUGP (("Double yuck! The *base* URL is broken.\n"));
|
||||
base_dir = NULL;
|
||||
}
|
||||
url_free (u);
|
||||
depth = 1;
|
||||
first_time = 0;
|
||||
}
|
||||
else
|
||||
++depth;
|
||||
|
||||
if (opt.reclevel != INFINITE_RECURSION && depth > opt.reclevel)
|
||||
/* We've exceeded the maximum recursion depth specified by the user. */
|
||||
{
|
||||
if (opt.page_requisites && depth <= opt.reclevel + 1)
|
||||
/* When -p is specified, we can do one more partial recursion from the
|
||||
"leaf nodes" on the HTML document tree. The recursion is partial in
|
||||
that we won't traverse any <A> or <AREA> tags, nor any <LINK> tags
|
||||
except for <LINK REL="stylesheet">. */
|
||||
dash_p_leaf_HTML = TRUE;
|
||||
else
|
||||
/* Either -p wasn't specified or it was and we've already gone the one
|
||||
extra (pseudo-)level that it affords us, so we need to bail out. */
|
||||
{
|
||||
DEBUGP (("Recursion depth %d exceeded max. depth %d.\n",
|
||||
depth, opt.reclevel));
|
||||
--depth;
|
||||
return RECLEVELEXC;
|
||||
}
|
||||
}
|
||||
|
||||
/* Determine whether this_url is an FTP URL. If it is, it means
|
||||
that the retrieval is done through proxy. In that case, FTP
|
||||
links will be followed by default and recursion will not be
|
||||
turned off when following them. */
|
||||
this_url_ftp = (url_scheme (this_url) == SCHEME_FTP);
|
||||
|
||||
/* Get the URL-s from an HTML file: */
|
||||
url_list = get_urls_html (file, canon_this_url ? canon_this_url : this_url,
|
||||
dash_p_leaf_HTML, &meta_disallow_follow);
|
||||
|
||||
if (opt.use_robots && meta_disallow_follow)
|
||||
{
|
||||
/* The META tag says we are not to follow this file. Respect
|
||||
that. */
|
||||
free_urlpos (url_list);
|
||||
url_list = NULL;
|
||||
}
|
||||
|
||||
/* Decide what to do with each of the URLs. A URL will be loaded if
|
||||
it meets several requirements, discussed later. */
|
||||
for (cur_url = url_list; cur_url; cur_url = cur_url->next)
|
||||
{
|
||||
/* If quota was exceeded earlier, bail out. */
|
||||
if (downloaded_exceeds_quota ())
|
||||
break;
|
||||
/* Parse the URL for convenient use in other functions, as well
|
||||
as to get the optimized form. It also checks URL integrity. */
|
||||
u = url_parse (cur_url->url, NULL);
|
||||
if (!u)
|
||||
{
|
||||
DEBUGP (("Yuck! A bad URL.\n"));
|
||||
continue;
|
||||
}
|
||||
assert (u->url != NULL);
|
||||
constr = xstrdup (u->url);
|
||||
|
||||
/* Several checkings whether a file is acceptable to load:
|
||||
1. check if URL is ftp, and we don't load it
|
||||
2. check for relative links (if relative_only is set)
|
||||
3. check for domain
|
||||
4. check for no-parent
|
||||
5. check for excludes && includes
|
||||
6. check for suffix
|
||||
7. check for same host (if spanhost is unset), with possible
|
||||
gethostbyname baggage
|
||||
8. check for robots.txt
|
||||
if (status == FWRITEERR)
|
||||
break;
|
||||
|
||||
Addendum: If the URL is FTP, and it is to be loaded, only the
|
||||
domain and suffix settings are "stronger".
|
||||
/* Get the next URL from the queue. */
|
||||
|
||||
Note that .html and (yuck) .htm will get loaded regardless of
|
||||
suffix rules (but that is remedied later with unlink) unless
|
||||
the depth equals the maximum depth.
|
||||
if (!url_dequeue (queue,
|
||||
(const char **)&url, (const char **)&referer,
|
||||
&depth))
|
||||
break;
|
||||
|
||||
More time- and memory- consuming tests should be put later on
|
||||
the list. */
|
||||
/* And download it. */
|
||||
|
||||
/* inl is set if the URL we are working on (constr) is stored in
|
||||
undesirable_urls. Using it is crucial to avoid unnecessary
|
||||
repeated continuous hits to the hash table. */
|
||||
inl = string_set_contains (undesirable_urls, constr);
|
||||
{
|
||||
int dt = 0;
|
||||
char *redirected = NULL;
|
||||
int oldrec = opt.recursive;
|
||||
|
||||
/* If it is FTP, and FTP is not followed, chuck it out. */
|
||||
if (!inl)
|
||||
if (u->scheme == SCHEME_FTP && !opt.follow_ftp && !this_url_ftp)
|
||||
opt.recursive = 0;
|
||||
status = retrieve_url (url, &file, &redirected, NULL, &dt);
|
||||
opt.recursive = oldrec;
|
||||
|
||||
if (redirected)
|
||||
{
|
||||
DEBUGP (("Uh, it is FTP but i'm not in the mood to follow FTP.\n"));
|
||||
string_set_add (undesirable_urls, constr);
|
||||
inl = 1;
|
||||
xfree (url);
|
||||
url = redirected;
|
||||
}
|
||||
/* If it is absolute link and they are not followed, chuck it
|
||||
out. */
|
||||
if (!inl && u->scheme != SCHEME_FTP)
|
||||
if (opt.relative_only && !cur_url->link_relative_p)
|
||||
{
|
||||
DEBUGP (("It doesn't really look like a relative link.\n"));
|
||||
string_set_add (undesirable_urls, constr);
|
||||
inl = 1;
|
||||
}
|
||||
/* If its domain is not to be accepted/looked-up, chuck it out. */
|
||||
if (!inl)
|
||||
if (!accept_domain (u))
|
||||
{
|
||||
DEBUGP (("I don't like the smell of that domain.\n"));
|
||||
string_set_add (undesirable_urls, constr);
|
||||
inl = 1;
|
||||
}
|
||||
/* Check for parent directory. */
|
||||
if (!inl && opt.no_parent
|
||||
/* If the new URL is FTP and the old was not, ignore
|
||||
opt.no_parent. */
|
||||
&& !(!this_url_ftp && u->scheme == SCHEME_FTP))
|
||||
{
|
||||
/* Check for base_dir first. */
|
||||
if (!(base_dir && frontcmp (base_dir, u->dir)))
|
||||
{
|
||||
/* Failing that, check for parent dir. */
|
||||
struct url *ut = url_parse (this_url, NULL);
|
||||
if (!ut)
|
||||
DEBUGP (("Double yuck! The *base* URL is broken.\n"));
|
||||
else if (!frontcmp (ut->dir, u->dir))
|
||||
{
|
||||
/* Failing that too, kill the URL. */
|
||||
DEBUGP (("Trying to escape parental guidance with no_parent on.\n"));
|
||||
string_set_add (undesirable_urls, constr);
|
||||
inl = 1;
|
||||
}
|
||||
url_free (ut);
|
||||
}
|
||||
}
|
||||
/* If the file does not match the acceptance list, or is on the
|
||||
rejection list, chuck it out. The same goes for the
|
||||
directory exclude- and include- lists. */
|
||||
if (!inl && (opt.includes || opt.excludes))
|
||||
{
|
||||
if (!accdir (u->dir, ALLABS))
|
||||
{
|
||||
DEBUGP (("%s (%s) is excluded/not-included.\n", constr, u->dir));
|
||||
string_set_add (undesirable_urls, constr);
|
||||
inl = 1;
|
||||
}
|
||||
}
|
||||
if (!inl)
|
||||
{
|
||||
char *suf = NULL;
|
||||
/* We check for acceptance/rejection rules only for non-HTML
|
||||
documents. Since we don't know whether they really are
|
||||
HTML, it will be deduced from (an OR-ed list):
|
||||
if (file && status == RETROK
|
||||
&& (dt & RETROKF) && (dt & TEXTHTML))
|
||||
descend = 1;
|
||||
}
|
||||
|
||||
1) u->file is "" (meaning it is a directory)
|
||||
2) suffix exists, AND:
|
||||
a) it is "html", OR
|
||||
b) it is "htm"
|
||||
|
||||
If the file *is* supposed to be HTML, it will *not* be
|
||||
subject to acc/rej rules, unless a finite maximum depth has
|
||||
been specified and the current depth is the maximum depth. */
|
||||
if (!
|
||||
(!*u->file
|
||||
|| (((suf = suffix (constr)) != NULL)
|
||||
&& ((!strcmp (suf, "html") || !strcmp (suf, "htm"))
|
||||
&& ((opt.reclevel != INFINITE_RECURSION) &&
|
||||
(depth != opt.reclevel))))))
|
||||
{
|
||||
if (!acceptable (u->file))
|
||||
{
|
||||
DEBUGP (("%s (%s) does not match acc/rej rules.\n",
|
||||
constr, u->file));
|
||||
string_set_add (undesirable_urls, constr);
|
||||
inl = 1;
|
||||
}
|
||||
}
|
||||
FREE_MAYBE (suf);
|
||||
}
|
||||
/* Optimize the URL (which includes possible DNS lookup) only
|
||||
after all other possibilities have been exhausted. */
|
||||
if (!inl)
|
||||
if (descend
|
||||
&& depth >= opt.reclevel && opt.reclevel != INFINITE_RECURSION)
|
||||
{
|
||||
if (!opt.simple_check)
|
||||
{
|
||||
/* Find the "true" host. */
|
||||
char *host = realhost (u->host);
|
||||
xfree (u->host);
|
||||
u->host = host;
|
||||
|
||||
/* Refresh the printed representation of the URL. */
|
||||
xfree (u->url);
|
||||
u->url = url_string (u, 0);
|
||||
}
|
||||
if (opt.page_requisites && depth == opt.reclevel)
|
||||
/* When -p is specified, we can do one more partial
|
||||
recursion from the "leaf nodes" on the HTML document
|
||||
tree. The recursion is partial in that we won't
|
||||
traverse any <A> or <AREA> tags, nor any <LINK> tags
|
||||
except for <LINK REL="stylesheet">. */
|
||||
/* #### This would be the place to implement the TODO
|
||||
entry saying that -p should do two more hops on
|
||||
framesets. */
|
||||
dash_p_leaf_HTML = TRUE;
|
||||
else
|
||||
{
|
||||
char *p;
|
||||
/* Just lowercase the hostname. */
|
||||
for (p = u->host; *p; p++)
|
||||
*p = TOLOWER (*p);
|
||||
xfree (u->url);
|
||||
u->url = url_string (u, 0);
|
||||
/* Either -p wasn't specified or it was and we've
|
||||
already gone the one extra (pseudo-)level that it
|
||||
affords us, so we need to bail out. */
|
||||
DEBUGP (("Not descending further; at depth %d, max. %d.\n",
|
||||
depth, opt.reclevel));
|
||||
descend = 0;
|
||||
}
|
||||
xfree (constr);
|
||||
constr = xstrdup (u->url);
|
||||
/* After we have canonicalized the URL, check if we have it
|
||||
on the black list. */
|
||||
if (string_set_contains (undesirable_urls, constr))
|
||||
inl = 1;
|
||||
/* This line is bogus. */
|
||||
/*string_set_add (undesirable_urls, constr);*/
|
||||
|
||||
if (!inl && !((u->scheme == SCHEME_FTP) && !this_url_ftp))
|
||||
if (!opt.spanhost && this_url && !same_host (this_url, constr))
|
||||
{
|
||||
DEBUGP (("This is not the same hostname as the parent's.\n"));
|
||||
string_set_add (undesirable_urls, constr);
|
||||
inl = 1;
|
||||
}
|
||||
}
|
||||
/* What about robots.txt? */
|
||||
if (!inl && opt.use_robots && u->scheme == SCHEME_HTTP)
|
||||
|
||||
/* If the downloaded document was HTML, parse it and enqueue the
|
||||
links it contains. */
|
||||
|
||||
if (descend)
|
||||
{
|
||||
struct robot_specs *specs = res_get_specs (u->host, u->port);
|
||||
if (!specs)
|
||||
int meta_disallow_follow = 0;
|
||||
struct urlpos *children = get_urls_html (file, url, dash_p_leaf_HTML,
|
||||
&meta_disallow_follow);
|
||||
|
||||
if (opt.use_robots && meta_disallow_follow)
|
||||
{
|
||||
char *rfile;
|
||||
if (res_retrieve_file (constr, &rfile))
|
||||
{
|
||||
specs = res_parse_from_file (rfile);
|
||||
xfree (rfile);
|
||||
}
|
||||
else
|
||||
{
|
||||
/* If we cannot get real specs, at least produce
|
||||
dummy ones so that we can register them and stop
|
||||
trying to retrieve them. */
|
||||
specs = res_parse ("", 0);
|
||||
}
|
||||
res_register_specs (u->host, u->port, specs);
|
||||
free_urlpos (children);
|
||||
children = NULL;
|
||||
}
|
||||
|
||||
/* Now that we have (or don't have) robots.txt specs, we can
|
||||
check what they say. */
|
||||
if (!res_match_path (specs, u->path))
|
||||
if (children)
|
||||
{
|
||||
DEBUGP (("Not following %s because robots.txt forbids it.\n",
|
||||
constr));
|
||||
string_set_add (undesirable_urls, constr);
|
||||
inl = 1;
|
||||
struct urlpos *child = children;
|
||||
struct url *url_parsed = url_parse (url, NULL);
|
||||
assert (url_parsed != NULL);
|
||||
|
||||
for (; child; child = child->next)
|
||||
{
|
||||
if (descend_url_p (child, url_parsed, depth, start_url_parsed,
|
||||
blacklist))
|
||||
{
|
||||
url_enqueue (queue, xstrdup (child->url->url),
|
||||
xstrdup (url), depth + 1);
|
||||
/* We blacklist the URL we have enqueued, because we
|
||||
don't want to enqueue (and hence download) the
|
||||
same URL twice. */
|
||||
string_set_add (blacklist, child->url->url);
|
||||
}
|
||||
}
|
||||
|
||||
url_free (url_parsed);
|
||||
free_urlpos (children);
|
||||
}
|
||||
}
|
||||
|
||||
filename = NULL;
|
||||
/* If it wasn't chucked out, do something with it. */
|
||||
if (!inl)
|
||||
if (opt.delete_after || (file && !acceptable (file)))
|
||||
{
|
||||
DEBUGP (("I've decided to load it -> "));
|
||||
/* Add it to the list of already-loaded URL-s. */
|
||||
string_set_add (undesirable_urls, constr);
|
||||
/* Automatically followed FTPs will *not* be downloaded
|
||||
recursively. */
|
||||
if (u->scheme == SCHEME_FTP)
|
||||
{
|
||||
/* Don't you adore side-effects? */
|
||||
opt.recursive = 0;
|
||||
}
|
||||
/* Reset its type. */
|
||||
dt = 0;
|
||||
/* Retrieve it. */
|
||||
retrieve_url (constr, &filename, &newloc,
|
||||
canon_this_url ? canon_this_url : this_url, &dt);
|
||||
if (u->scheme == SCHEME_FTP)
|
||||
{
|
||||
/* Restore... */
|
||||
opt.recursive = 1;
|
||||
}
|
||||
if (newloc)
|
||||
{
|
||||
xfree (constr);
|
||||
constr = newloc;
|
||||
}
|
||||
/* If there was no error, and the type is text/html, parse
|
||||
it recursively. */
|
||||
if (dt & TEXTHTML)
|
||||
{
|
||||
if (dt & RETROKF)
|
||||
recursive_retrieve (filename, constr);
|
||||
}
|
||||
else
|
||||
DEBUGP (("%s is not text/html so we don't chase.\n",
|
||||
filename ? filename: "(null)"));
|
||||
|
||||
if (opt.delete_after || (filename && !acceptable (filename)))
|
||||
/* Either --delete-after was specified, or we loaded this otherwise
|
||||
rejected (e.g. by -R) HTML file just so we could harvest its
|
||||
hyperlinks -- in either case, delete the local file. */
|
||||
{
|
||||
DEBUGP (("Removing file due to %s in recursive_retrieve():\n",
|
||||
opt.delete_after ? "--delete-after" :
|
||||
"recursive rejection criteria"));
|
||||
logprintf (LOG_VERBOSE,
|
||||
(opt.delete_after ? _("Removing %s.\n")
|
||||
: _("Removing %s since it should be rejected.\n")),
|
||||
filename);
|
||||
if (unlink (filename))
|
||||
logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno));
|
||||
dt &= ~RETROKF;
|
||||
}
|
||||
|
||||
/* If everything was OK, and links are to be converted, let's
|
||||
store the local filename. */
|
||||
if (opt.convert_links && (dt & RETROKF) && (filename != NULL))
|
||||
{
|
||||
cur_url->convert = CO_CONVERT_TO_RELATIVE;
|
||||
cur_url->local_name = xstrdup (filename);
|
||||
}
|
||||
/* Either --delete-after was specified, or we loaded this
|
||||
otherwise rejected (e.g. by -R) HTML file just so we
|
||||
could harvest its hyperlinks -- in either case, delete
|
||||
the local file. */
|
||||
DEBUGP (("Removing file due to %s in recursive_retrieve():\n",
|
||||
opt.delete_after ? "--delete-after" :
|
||||
"recursive rejection criteria"));
|
||||
logprintf (LOG_VERBOSE,
|
||||
(opt.delete_after ? _("Removing %s.\n")
|
||||
: _("Removing %s since it should be rejected.\n")),
|
||||
file);
|
||||
if (unlink (file))
|
||||
logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno));
|
||||
}
|
||||
else
|
||||
DEBUGP (("%s already in list, so we don't load.\n", constr));
|
||||
/* Free filename and constr. */
|
||||
FREE_MAYBE (filename);
|
||||
FREE_MAYBE (constr);
|
||||
url_free (u);
|
||||
/* Increment the pbuf for the appropriate size. */
|
||||
|
||||
xfree (url);
|
||||
FREE_MAYBE (referer);
|
||||
FREE_MAYBE (file);
|
||||
}
|
||||
if (opt.convert_links && !opt.delete_after)
|
||||
/* This is merely the first pass: the links that have been
|
||||
successfully downloaded are converted. In the second pass,
|
||||
convert_all_links() will also convert those links that have NOT
|
||||
been downloaded to their canonical form. */
|
||||
convert_links (file, url_list);
|
||||
/* Free the linked list of URL-s. */
|
||||
free_urlpos (url_list);
|
||||
/* Free the canonical this_url. */
|
||||
FREE_MAYBE (canon_this_url);
|
||||
/* Decrement the recursion depth. */
|
||||
--depth;
|
||||
|
||||
/* If anything is left of the queue due to a premature exit, free it
|
||||
now. */
|
||||
{
|
||||
char *d1, *d2;
|
||||
int d3;
|
||||
while (url_dequeue (queue, (const char **)&d1, (const char **)&d2, &d3))
|
||||
{
|
||||
xfree (d1);
|
||||
FREE_MAYBE (d2);
|
||||
}
|
||||
}
|
||||
url_queue_delete (queue);
|
||||
|
||||
if (start_url_parsed)
|
||||
url_free (start_url_parsed);
|
||||
string_set_free (blacklist);
|
||||
|
||||
if (downloaded_exceeds_quota ())
|
||||
return QUOTEXC;
|
||||
else if (status == FWRITEERR)
|
||||
return FWRITEERR;
|
||||
else
|
||||
return RETROK;
|
||||
}
|
||||
|
||||
/* Based on the context provided by retrieve_tree, decide whether a
|
||||
URL is to be descended to. This is only ever called from
|
||||
retrieve_tree, but is in a separate function for clarity. */
|
||||
|
||||
static int
|
||||
descend_url_p (const struct urlpos *upos, struct url *parent, int depth,
|
||||
struct url *start_url_parsed, struct hash_table *blacklist)
|
||||
{
|
||||
struct url *u = upos->url;
|
||||
const char *url = u->url;
|
||||
|
||||
DEBUGP (("Deciding whether to enqueue \"%s\".\n", url));
|
||||
|
||||
if (string_set_contains (blacklist, url))
|
||||
{
|
||||
DEBUGP (("Already on the black list.\n"));
|
||||
goto out;
|
||||
}
|
||||
|
||||
/* Several things to check for:
|
||||
1. if scheme is not http, and we don't load it
|
||||
2. check for relative links (if relative_only is set)
|
||||
3. check for domain
|
||||
4. check for no-parent
|
||||
5. check for excludes && includes
|
||||
6. check for suffix
|
||||
7. check for same host (if spanhost is unset), with possible
|
||||
gethostbyname baggage
|
||||
8. check for robots.txt
|
||||
|
||||
Addendum: If the URL is FTP, and it is to be loaded, only the
|
||||
domain and suffix settings are "stronger".
|
||||
|
||||
Note that .html files will get loaded regardless of suffix rules
|
||||
(but that is remedied later with unlink) unless the depth equals
|
||||
the maximum depth.
|
||||
|
||||
More time- and memory- consuming tests should be put later on
|
||||
the list. */
|
||||
|
||||
/* 1. Schemes other than HTTP are normally not recursed into. */
|
||||
if (u->scheme != SCHEME_HTTP
|
||||
&& !(u->scheme == SCHEME_FTP && opt.follow_ftp))
|
||||
{
|
||||
DEBUGP (("Not following non-HTTP schemes.\n"));
|
||||
goto blacklist;
|
||||
}
|
||||
|
||||
/* 2. If it is an absolute link and they are not followed, throw it
|
||||
out. */
|
||||
if (u->scheme == SCHEME_HTTP)
|
||||
if (opt.relative_only && !upos->link_relative_p)
|
||||
{
|
||||
DEBUGP (("It doesn't really look like a relative link.\n"));
|
||||
goto blacklist;
|
||||
}
|
||||
|
||||
/* 3. If its domain is not to be accepted/looked-up, chuck it
|
||||
out. */
|
||||
if (!accept_domain (u))
|
||||
{
|
||||
DEBUGP (("The domain was not accepted.\n"));
|
||||
goto blacklist;
|
||||
}
|
||||
|
||||
/* 4. Check for parent directory.
|
||||
|
||||
If we descended to a different host or changed the scheme, ignore
|
||||
opt.no_parent. Also ignore it for -p leaf retrievals. */
|
||||
if (opt.no_parent
|
||||
&& u->scheme == parent->scheme
|
||||
&& 0 == strcasecmp (u->host, parent->host)
|
||||
&& u->port == parent->port)
|
||||
{
|
||||
if (!frontcmp (parent->dir, u->dir))
|
||||
{
|
||||
DEBUGP (("Trying to escape the root directory with no_parent in effect.\n"));
|
||||
goto blacklist;
|
||||
}
|
||||
}
|
||||
|
||||
/* 5. If the file does not match the acceptance list, or is on the
|
||||
rejection list, chuck it out. The same goes for the directory
|
||||
exclusion and inclusion lists. */
|
||||
if (opt.includes || opt.excludes)
|
||||
{
|
||||
if (!accdir (u->dir, ALLABS))
|
||||
{
|
||||
DEBUGP (("%s (%s) is excluded/not-included.\n", url, u->dir));
|
||||
goto blacklist;
|
||||
}
|
||||
}
|
||||
|
||||
/* 6. */
|
||||
{
|
||||
char *suf = NULL;
|
||||
/* Check for acceptance/rejection rules. We ignore these rules
|
||||
for HTML documents because they might lead to other files which
|
||||
need to be downloaded. Of course, we don't know which
|
||||
documents are HTML before downloading them, so we guess.
|
||||
|
||||
A file is subject to acceptance/rejection rules if:
|
||||
|
||||
* u->file is not "" (i.e. it is not a directory)
|
||||
and either:
|
||||
+ there is no file suffix,
|
||||
+ or there is a suffix, but is not "html" or "htm",
|
||||
+ both:
|
||||
- recursion is not infinite,
|
||||
- and we are at its very end. */
|
||||
|
||||
if (u->file[0] != '\0'
|
||||
&& ((suf = suffix (url)) == NULL
|
||||
|| (0 != strcmp (suf, "html") && 0 != strcmp (suf, "htm"))
|
||||
|| (opt.reclevel == INFINITE_RECURSION && depth >= opt.reclevel)))
|
||||
{
|
||||
if (!acceptable (u->file))
|
||||
{
|
||||
DEBUGP (("%s (%s) does not match acc/rej rules.\n",
|
||||
url, u->file));
|
||||
FREE_MAYBE (suf);
|
||||
goto blacklist;
|
||||
}
|
||||
}
|
||||
FREE_MAYBE (suf);
|
||||
}
|
||||
|
||||
/* 7. */
|
||||
if (u->scheme == parent->scheme)
|
||||
if (!opt.spanhost && 0 != strcasecmp (parent->host, u->host))
|
||||
{
|
||||
DEBUGP (("This is not the same hostname as the parent's (%s and %s).\n",
|
||||
u->host, parent->host));
|
||||
goto blacklist;
|
||||
}
|
||||
|
||||
/* 8. */
|
||||
if (opt.use_robots && u->scheme == SCHEME_HTTP)
|
||||
{
|
||||
struct robot_specs *specs = res_get_specs (u->host, u->port);
|
||||
if (!specs)
|
||||
{
|
||||
char *rfile;
|
||||
if (res_retrieve_file (url, &rfile))
|
||||
{
|
||||
specs = res_parse_from_file (rfile);
|
||||
xfree (rfile);
|
||||
}
|
||||
else
|
||||
{
|
||||
/* If we cannot get real specs, at least produce
|
||||
dummy ones so that we can register them and stop
|
||||
trying to retrieve them. */
|
||||
specs = res_parse ("", 0);
|
||||
}
|
||||
res_register_specs (u->host, u->port, specs);
|
||||
}
|
||||
|
||||
/* Now that we have (or don't have) robots.txt specs, we can
|
||||
check what they say. */
|
||||
if (!res_match_path (specs, u->path))
|
||||
{
|
||||
DEBUGP (("Not following %s because robots.txt forbids it.\n", url));
|
||||
goto blacklist;
|
||||
}
|
||||
}
|
||||
|
||||
/* The URL has passed all the tests. It can be placed in the
|
||||
download queue. */
|
||||
DEBUGP (("Decided to load it.\n"));
|
||||
|
||||
return 1;
|
||||
|
||||
blacklist:
|
||||
string_set_add (blacklist, url);
|
||||
|
||||
out:
|
||||
DEBUGP (("Decided NOT to load it.\n"));
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* Register that URL has been successfully downloaded to FILE. */
|
||||
|
||||
void
|
||||
register_download (const char *url, const char *file)
|
||||
{
|
||||
@ -507,12 +535,35 @@ register_download (const char *url, const char *file)
|
||||
return;
|
||||
if (!dl_file_url_map)
|
||||
dl_file_url_map = make_string_hash_table (0);
|
||||
hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url));
|
||||
if (!dl_url_file_map)
|
||||
dl_url_file_map = make_string_hash_table (0);
|
||||
hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file));
|
||||
|
||||
if (!hash_table_contains (dl_file_url_map, file))
|
||||
hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url));
|
||||
if (!hash_table_contains (dl_url_file_map, url))
|
||||
hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file));
|
||||
}
|
||||
|
||||
/* Register that FROM has been redirected to TO. This assumes that TO
|
||||
is successfully downloaded and already registered using
|
||||
register_download() above. */
|
||||
|
||||
void
|
||||
register_redirection (const char *from, const char *to)
|
||||
{
|
||||
char *file;
|
||||
|
||||
if (!opt.convert_links)
|
||||
return;
|
||||
|
||||
file = hash_table_get (dl_url_file_map, to);
|
||||
assert (file != NULL);
|
||||
if (!hash_table_contains (dl_url_file_map, from))
|
||||
hash_table_put (dl_url_file_map, xstrdup (from), xstrdup (file));
|
||||
}
|
||||
|
||||
/* Register that URL corresponds to the HTML file FILE. */
|
||||
|
||||
void
|
||||
register_html (const char *url, const char *file)
|
||||
{
|
||||
@ -558,10 +609,11 @@ convert_all_links (void)
|
||||
|
||||
for (html = downloaded_html_files; html; html = html->next)
|
||||
{
|
||||
urlpos *urls, *cur_url;
|
||||
struct urlpos *urls, *cur_url;
|
||||
char *url;
|
||||
|
||||
DEBUGP (("Rescanning %s\n", html->string));
|
||||
|
||||
/* Determine the URL of the HTML file. get_urls_html will need
|
||||
it. */
|
||||
url = hash_table_get (dl_file_url_map, html->string);
|
||||
@ -569,19 +621,19 @@ convert_all_links (void)
|
||||
DEBUGP (("It should correspond to %s.\n", url));
|
||||
else
|
||||
DEBUGP (("I cannot find the corresponding URL.\n"));
|
||||
|
||||
/* Parse the HTML file... */
|
||||
urls = get_urls_html (html->string, url, FALSE, NULL);
|
||||
|
||||
/* We don't respect meta_disallow_follow here because, even if
|
||||
the file is not followed, we might still want to convert the
|
||||
links that have been followed from other files. */
|
||||
|
||||
for (cur_url = urls; cur_url; cur_url = cur_url->next)
|
||||
{
|
||||
char *local_name;
|
||||
struct url *u = cur_url->url;
|
||||
|
||||
/* The URL must be in canonical form to be compared. */
|
||||
struct url *u = url_parse (cur_url->url, NULL);
|
||||
if (!u)
|
||||
continue;
|
||||
/* We decide the direction of conversion according to whether
|
||||
a URL was downloaded. Downloaded URLs will be converted
|
||||
ABS2REL, whereas non-downloaded will be converted REL2ABS. */
|
||||
@ -589,6 +641,7 @@ convert_all_links (void)
|
||||
if (local_name)
|
||||
DEBUGP (("%s marked for conversion, local %s\n",
|
||||
u->url, local_name));
|
||||
|
||||
/* Decide on the conversion direction. */
|
||||
if (local_name)
|
||||
{
|
||||
@ -610,7 +663,6 @@ convert_all_links (void)
|
||||
cur_url->convert = CO_CONVERT_TO_COMPLETE;
|
||||
cur_url->local_name = NULL;
|
||||
}
|
||||
url_free (u);
|
||||
}
|
||||
/* Convert the links in the file. */
|
||||
convert_links (html->string, urls);
|
||||
@ -618,3 +670,24 @@ convert_all_links (void)
|
||||
free_urlpos (urls);
|
||||
}
|
||||
}
|
||||
|
||||
/* Cleanup the data structures associated with recursive retrieving
|
||||
(the variables above). */
|
||||
void
|
||||
recursive_cleanup (void)
|
||||
{
|
||||
if (dl_file_url_map)
|
||||
{
|
||||
free_keys_and_values (dl_file_url_map);
|
||||
hash_table_destroy (dl_file_url_map);
|
||||
dl_file_url_map = NULL;
|
||||
}
|
||||
if (dl_url_file_map)
|
||||
{
|
||||
free_keys_and_values (dl_url_file_map);
|
||||
hash_table_destroy (dl_url_file_map);
|
||||
dl_url_file_map = NULL;
|
||||
}
|
||||
slist_free (downloaded_html_files);
|
||||
downloaded_html_files = NULL;
|
||||
}
|
||||
|
@ -21,10 +21,10 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
|
||||
#define RECUR_H
|
||||
|
||||
void recursive_cleanup PARAMS ((void));
|
||||
void recursive_reset PARAMS ((void));
|
||||
uerr_t recursive_retrieve PARAMS ((const char *, const char *));
|
||||
uerr_t retrieve_tree PARAMS ((const char *));
|
||||
|
||||
void register_download PARAMS ((const char *, const char *));
|
||||
void register_redirection PARAMS ((const char *, const char *));
|
||||
void register_html PARAMS ((const char *, const char *));
|
||||
void convert_all_links PARAMS ((void));
|
||||
|
||||
|
src/res.c
@ -125,6 +125,10 @@ add_path (struct robot_specs *specs, const char *path_b, const char *path_e,
|
||||
int allowedp, int exactp)
|
||||
{
|
||||
struct path_info pp;
|
||||
if (path_b < path_e && *path_b == '/')
|
||||
/* Our path representation doesn't use a leading slash, so remove
|
||||
one from theirs. */
|
||||
++path_b;
|
||||
pp.path = strdupdelim (path_b, path_e);
|
||||
pp.allowedp = allowedp;
|
||||
pp.user_agent_exact_p = exactp;
|
||||
@ -390,6 +394,9 @@ res_parse_from_file (const char *filename)
|
||||
static void
|
||||
free_specs (struct robot_specs *specs)
|
||||
{
|
||||
int i;
|
||||
for (i = 0; i < specs->count; i++)
|
||||
xfree (specs->paths[i].path);
|
||||
FREE_MAYBE (specs->paths);
|
||||
xfree (specs);
|
||||
}
|
||||
@ -546,3 +553,22 @@ res_retrieve_file (const char *url, char **file)
    }
  return err == RETROK;
}

static int
cleanup_hash_table_mapper (void *key, void *value, void *arg_ignored)
{
  xfree (key);
  free_specs (value);
  return 0;
}

void
res_cleanup (void)
{
  if (registered_specs)
    {
      hash_table_map (registered_specs, cleanup_hash_table_mapper, NULL);
      hash_table_destroy (registered_specs);
      registered_specs = NULL;
    }
}
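For symmetry with recursive_cleanup() in recur.c, res_cleanup() is presumably meant to run once at shutdown. A sketch of such a teardown sequence; cleanup_and_exit is a hypothetical caller, not code from this patch, but all three cleanup functions it calls are introduced or kept by this commit:

/* Sketch: releasing the per-run caches before exiting. */
static void
cleanup_and_exit (int status)
{
  recursive_cleanup ();       /* dl_file_url_map, dl_url_file_map, HTML list */
  res_cleanup ();             /* per-host robots.txt specs */
  downloaded_files_free ();   /* the downloaded-files hash in url.c */
  exit (status);
}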
|
||||
|
@ -29,3 +29,4 @@ struct robot_specs *res_get_specs PARAMS ((const char *, int));
|
||||
|
||||
int res_retrieve_file PARAMS ((const char *, char **));
|
||||
|
||||
void res_cleanup PARAMS ((void));
|
||||
|
src/retr.c
@ -184,6 +184,26 @@ rate (long bytes, long msecs, int pad)
  return res;
}

static int
register_redirections_mapper (void *key, void *value, void *arg)
{
  const char *redirected_from = (const char *)key;
  const char *redirected_to = (const char *)arg;
  if (0 != strcmp (redirected_from, redirected_to))
    register_redirection (redirected_from, redirected_to);
  return 0;
}

/* Register the redirections that lead to the successful download of
   this URL.  This is necessary so that the link converter can convert
   redirected URLs to the local file. */

static void
register_all_redirections (struct hash_table *redirections, const char *final)
{
  hash_table_map (redirections, register_redirections_mapper, (void *)final);
}
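The only caller is further down in retrieve_url(); the success path from the hunk at -382,6 +401,8 is reproduced here for context (a fragment, not a complete function):

/* Sketch: how retrieve_url () registers the result of a successful fetch. */
if (*dt & RETROKF)
  {
    register_download (url, local_file);
    if (redirections)
      register_all_redirections (redirections, url);
    if (*dt & TEXTHTML)
      register_html (url, local_file);
  }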
|
||||
|
||||
#define USE_PROXY_P(u) (opt.use_proxy && getproxy((u)->scheme) \
|
||||
&& no_proxy_match((u)->host, \
|
||||
(const char **)opt.no_proxy))
|
||||
@ -254,7 +274,7 @@ retrieve_url (const char *origurl, char **file, char **newloc,
|
||||
proxy_url = url_parse (proxy, &up_error_code);
|
||||
if (!proxy_url)
|
||||
{
|
||||
logprintf (LOG_NOTQUIET, "Error parsing proxy URL %s: %s.\n",
|
||||
logprintf (LOG_NOTQUIET, _("Error parsing proxy URL %s: %s.\n"),
|
||||
proxy, url_error (up_error_code));
|
||||
if (redirections)
|
||||
string_set_free (redirections);
|
||||
@ -310,7 +330,7 @@ retrieve_url (const char *origurl, char **file, char **newloc,
|
||||
if (location_changed)
|
||||
{
|
||||
char *construced_newloc;
|
||||
struct url *newloc_struct;
|
||||
struct url *newloc_parsed;
|
||||
|
||||
assert (mynewloc != NULL);
|
||||
|
||||
@ -326,12 +346,11 @@ retrieve_url (const char *origurl, char **file, char **newloc,
|
||||
mynewloc = construced_newloc;
|
||||
|
||||
/* Now, see if this new location makes sense. */
|
||||
newloc_struct = url_parse (mynewloc, &up_error_code);
|
||||
if (!newloc_struct)
|
||||
newloc_parsed = url_parse (mynewloc, &up_error_code);
|
||||
if (!newloc_parsed)
|
||||
{
|
||||
logprintf (LOG_NOTQUIET, "%s: %s.\n", mynewloc,
|
||||
url_error (up_error_code));
|
||||
url_free (newloc_struct);
|
||||
url_free (u);
|
||||
if (redirections)
|
||||
string_set_free (redirections);
|
||||
@ -340,11 +359,11 @@ retrieve_url (const char *origurl, char **file, char **newloc,
|
||||
return result;
|
||||
}
|
||||
|
||||
/* Now mynewloc will become newloc_struct->url, because if the
|
||||
/* Now mynewloc will become newloc_parsed->url, because if the
|
||||
Location contained relative paths like .././something, we
|
||||
don't want that propagating as url. */
|
||||
xfree (mynewloc);
|
||||
mynewloc = xstrdup (newloc_struct->url);
|
||||
mynewloc = xstrdup (newloc_parsed->url);
|
||||
|
||||
if (!redirections)
|
||||
{
|
||||
@ -356,11 +375,11 @@ retrieve_url (const char *origurl, char **file, char **newloc,
|
||||
|
||||
/* The new location is OK. Check for redirection cycle by
|
||||
peeking through the history of redirections. */
|
||||
if (string_set_contains (redirections, newloc_struct->url))
|
||||
if (string_set_contains (redirections, newloc_parsed->url))
|
||||
{
|
||||
logprintf (LOG_NOTQUIET, _("%s: Redirection cycle detected.\n"),
|
||||
mynewloc);
|
||||
url_free (newloc_struct);
|
||||
url_free (newloc_parsed);
|
||||
url_free (u);
|
||||
if (redirections)
|
||||
string_set_free (redirections);
|
||||
@ -368,12 +387,12 @@ retrieve_url (const char *origurl, char **file, char **newloc,
|
||||
xfree (mynewloc);
|
||||
return WRONGCODE;
|
||||
}
|
||||
string_set_add (redirections, newloc_struct->url);
|
||||
string_set_add (redirections, newloc_parsed->url);
|
||||
|
||||
xfree (url);
|
||||
url = mynewloc;
|
||||
url_free (u);
|
||||
u = newloc_struct;
|
||||
u = newloc_parsed;
|
||||
goto redirected;
|
||||
}
|
||||
|
||||
@ -382,6 +401,8 @@ retrieve_url (const char *origurl, char **file, char **newloc,
|
||||
if (*dt & RETROKF)
|
||||
{
|
||||
register_download (url, local_file);
|
||||
if (redirections)
|
||||
register_all_redirections (redirections, url);
|
||||
if (*dt & TEXTHTML)
|
||||
register_html (url, local_file);
|
||||
}
|
||||
@ -415,16 +436,16 @@ uerr_t
|
||||
retrieve_from_file (const char *file, int html, int *count)
|
||||
{
|
||||
uerr_t status;
|
||||
urlpos *url_list, *cur_url;
|
||||
struct urlpos *url_list, *cur_url;
|
||||
|
||||
url_list = (html ? get_urls_html (file, NULL, FALSE, NULL)
|
||||
: get_urls_file (file));
|
||||
status = RETROK; /* Suppose everything is OK. */
|
||||
*count = 0; /* Reset the URL count. */
|
||||
recursive_reset ();
|
||||
|
||||
for (cur_url = url_list; cur_url; cur_url = cur_url->next, ++*count)
|
||||
{
|
||||
char *filename, *new_file;
|
||||
char *filename = NULL, *new_file;
|
||||
int dt;
|
||||
|
||||
if (downloaded_exceeds_quota ())
|
||||
@ -432,10 +453,10 @@ retrieve_from_file (const char *file, int html, int *count)
|
||||
status = QUOTEXC;
|
||||
break;
|
||||
}
|
||||
status = retrieve_url (cur_url->url, &filename, &new_file, NULL, &dt);
|
||||
if (opt.recursive && status == RETROK && (dt & TEXTHTML))
|
||||
status = recursive_retrieve (filename, new_file ? new_file
|
||||
: cur_url->url);
|
||||
if (opt.recursive && cur_url->url->scheme != SCHEME_FTP)
|
||||
status = retrieve_tree (cur_url->url->url);
|
||||
else
|
||||
status = retrieve_url (cur_url->url->url, &filename, &new_file, NULL, &dt);
|
||||
|
||||
if (filename && opt.delete_after && file_exists_p (filename))
|
||||
{
|
||||
|
src/url.c
@ -37,6 +37,7 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
|
||||
#include "utils.h"
|
||||
#include "url.h"
|
||||
#include "host.h"
|
||||
#include "hash.h"
|
||||
|
||||
#ifndef errno
|
||||
extern int errno;
|
||||
@ -182,7 +183,7 @@ encode_string_maybe (const char *s)
|
||||
{
|
||||
if (UNSAFE_CHAR (*p1))
|
||||
{
|
||||
const unsigned char c = *p1++;
|
||||
unsigned char c = *p1++;
|
||||
*p2++ = '%';
|
||||
*p2++ = XDIGIT_TO_XCHAR (c >> 4);
|
||||
*p2++ = XDIGIT_TO_XCHAR (c & 0xf);
|
||||
@ -378,7 +379,7 @@ reencode_string (const char *s)
|
||||
{
|
||||
case CM_ENCODE:
|
||||
{
|
||||
char c = *p1++;
|
||||
unsigned char c = *p1++;
|
||||
*p2++ = '%';
|
||||
*p2++ = XDIGIT_TO_XCHAR (c >> 4);
|
||||
*p2++ = XDIGIT_TO_XCHAR (c & 0xf);
|
||||
@ -586,6 +587,22 @@ strpbrk_or_eos (const char *s, const char *accept)
|
||||
return p;
|
||||
}
|
||||
|
||||
/* Turn STR into lowercase; return non-zero if a character was
|
||||
actually changed. */
|
||||
|
||||
static int
|
||||
lowercase_str (char *str)
|
||||
{
|
||||
int change = 0;
|
||||
for (; *str; str++)
|
||||
if (!ISLOWER (*str))
|
||||
{
|
||||
change = 1;
|
||||
*str = TOLOWER (*str);
|
||||
}
|
||||
return change;
|
||||
}
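The return value is what matters: url_parse() uses it to decide whether the canonical URL string has to be regenerated. The two lines below are reproduced from the url_parse() hunks further down in this file:

/* Sketch: from url_parse ().  If the host had uppercase letters, u->url
   must be rebuilt so it matches what url_string () would print. */
host_modified = lowercase_str (u->host);
if (path_modified || u->fragment || host_modified)
  u->url = url_string (u, 0);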
|
||||
|
||||
static char *parse_errors[] = {
|
||||
#define PE_NO_ERROR 0
|
||||
"No error",
|
||||
@ -614,6 +631,7 @@ url_parse (const char *url, int *error)
|
||||
{
|
||||
struct url *u;
|
||||
const char *p;
|
||||
int path_modified, host_modified;
|
||||
|
||||
enum url_scheme scheme;
|
||||
|
||||
@ -627,9 +645,7 @@ url_parse (const char *url, int *error)
|
||||
int port;
|
||||
char *user = NULL, *passwd = NULL;
|
||||
|
||||
const char *url_orig = url;
|
||||
|
||||
p = url = reencode_string (url);
|
||||
char *url_encoded;
|
||||
|
||||
scheme = url_scheme (url);
|
||||
if (scheme == SCHEME_INVALID)
|
||||
@ -638,6 +654,9 @@ url_parse (const char *url, int *error)
|
||||
return NULL;
|
||||
}
|
||||
|
||||
url_encoded = reencode_string (url);
|
||||
p = url_encoded;
|
||||
|
||||
p += strlen (supported_schemes[scheme].leading_string);
|
||||
uname_b = p;
|
||||
p += url_skip_uname (p);
|
||||
@ -749,11 +768,6 @@ url_parse (const char *url, int *error)
|
||||
u = (struct url *)xmalloc (sizeof (struct url));
|
||||
memset (u, 0, sizeof (*u));
|
||||
|
||||
if (url == url_orig)
|
||||
u->url = xstrdup (url);
|
||||
else
|
||||
u->url = (char *)url;
|
||||
|
||||
u->scheme = scheme;
|
||||
u->host = strdupdelim (host_b, host_e);
|
||||
u->port = port;
|
||||
@ -761,7 +775,10 @@ url_parse (const char *url, int *error)
|
||||
u->passwd = passwd;
|
||||
|
||||
u->path = strdupdelim (path_b, path_e);
|
||||
path_simplify (u->path);
|
||||
path_modified = path_simplify (u->path);
|
||||
parse_path (u->path, &u->dir, &u->file);
|
||||
|
||||
host_modified = lowercase_str (u->host);
|
||||
|
||||
if (params_b)
|
||||
u->params = strdupdelim (params_b, params_e);
|
||||
@ -770,7 +787,26 @@ url_parse (const char *url, int *error)
|
||||
if (fragment_b)
|
||||
u->fragment = strdupdelim (fragment_b, fragment_e);
|
||||
|
||||
parse_path (u->path, &u->dir, &u->file);
|
||||
|
||||
if (path_modified || u->fragment || host_modified)
|
||||
{
|
||||
/* If path_simplify modified the path, or if a fragment is
|
||||
present, or if the original host name had caps in it, make
|
||||
sure that u->url is equivalent to what would be printed by
|
||||
url_string. */
|
||||
u->url = url_string (u, 0);
|
||||
|
||||
if (url_encoded != url)
|
||||
xfree ((char *) url_encoded);
|
||||
}
|
||||
else
|
||||
{
|
||||
if (url_encoded == url)
|
||||
u->url = xstrdup (url);
|
||||
else
|
||||
u->url = url_encoded;
|
||||
}
|
||||
url_encoded = NULL;
|
||||
|
||||
return u;
|
||||
}
|
||||
@ -927,17 +963,18 @@ url_free (struct url *url)
|
||||
FREE_MAYBE (url->fragment);
|
||||
FREE_MAYBE (url->user);
|
||||
FREE_MAYBE (url->passwd);
|
||||
FREE_MAYBE (url->dir);
|
||||
FREE_MAYBE (url->file);
|
||||
|
||||
xfree (url->dir);
|
||||
xfree (url->file);
|
||||
|
||||
xfree (url);
|
||||
}
|
||||
|
||||
urlpos *
|
||||
struct urlpos *
|
||||
get_urls_file (const char *file)
|
||||
{
|
||||
struct file_memory *fm;
|
||||
urlpos *head, *tail;
|
||||
struct urlpos *head, *tail;
|
||||
const char *text, *text_end;
|
||||
|
||||
/* Load the file. */
|
||||
@ -968,10 +1005,28 @@ get_urls_file (const char *file)
|
||||
--line_end;
|
||||
if (line_end > line_beg)
|
||||
{
|
||||
urlpos *entry = (urlpos *)xmalloc (sizeof (urlpos));
|
||||
int up_error_code;
|
||||
char *url_text;
|
||||
struct urlpos *entry;
|
||||
struct url *url;
|
||||
|
||||
/* We must copy the URL to a zero-terminated string. *sigh*. */
|
||||
url_text = strdupdelim (line_beg, line_end);
|
||||
url = url_parse (url_text, &up_error_code);
|
||||
if (!url)
|
||||
{
|
||||
logprintf (LOG_NOTQUIET, "%s: Invalid URL %s: %s\n",
|
||||
file, url_text, url_error (up_error_code));
|
||||
xfree (url_text);
|
||||
continue;
|
||||
}
|
||||
xfree (url_text);
|
||||
|
||||
entry = (struct urlpos *)xmalloc (sizeof (struct urlpos));
|
||||
memset (entry, 0, sizeof (*entry));
|
||||
entry->next = NULL;
|
||||
entry->url = strdupdelim (line_beg, line_end);
|
||||
entry->url = url;
|
||||
|
||||
if (!head)
|
||||
head = entry;
|
||||
else
|
||||
@ -985,12 +1040,13 @@ get_urls_file (const char *file)
|
||||
|
||||
/* Free the linked list of urlpos. */
|
||||
void
|
||||
free_urlpos (urlpos *l)
|
||||
free_urlpos (struct urlpos *l)
|
||||
{
|
||||
while (l)
|
||||
{
|
||||
urlpos *next = l->next;
|
||||
xfree (l->url);
|
||||
struct urlpos *next = l->next;
|
||||
if (l->url)
|
||||
url_free (l->url);
|
||||
FREE_MAYBE (l->local_name);
|
||||
xfree (l);
|
||||
l = next;
|
||||
@ -1088,7 +1144,9 @@ count_slashes (const char *s)
|
||||
static char *
|
||||
mkstruct (const struct url *u)
|
||||
{
|
||||
char *host, *dir, *file, *res, *dirpref;
|
||||
char *dir, *dir_preencoding;
|
||||
char *file, *res, *dirpref;
|
||||
char *query = u->query && *u->query ? u->query : NULL;
|
||||
int l;
|
||||
|
||||
if (opt.cut_dirs)
|
||||
@ -1104,36 +1162,35 @@ mkstruct (const struct url *u)
|
||||
else
|
||||
dir = u->dir + (*u->dir == '/');
|
||||
|
||||
host = xstrdup (u->host);
|
||||
/* Check for the true name (or at least a consistent name for saving
|
||||
to directory) of HOST, reusing the hlist if possible. */
|
||||
if (opt.add_hostdir && !opt.simple_check)
|
||||
{
|
||||
char *nhost = realhost (host);
|
||||
xfree (host);
|
||||
host = nhost;
|
||||
}
|
||||
/* Add dir_prefix and hostname (if required) to the beginning of
|
||||
dir. */
|
||||
if (opt.add_hostdir)
|
||||
{
|
||||
/* Add dir_prefix and hostname (if required) to the beginning of
|
||||
dir. */
|
||||
dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1
|
||||
+ strlen (u->host)
|
||||
+ 1 + numdigit (u->port)
|
||||
+ 1);
|
||||
if (!DOTP (opt.dir_prefix))
|
||||
{
|
||||
dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1
|
||||
+ strlen (host) + 1);
|
||||
sprintf (dirpref, "%s/%s", opt.dir_prefix, host);
|
||||
}
|
||||
sprintf (dirpref, "%s/%s", opt.dir_prefix, u->host);
|
||||
else
|
||||
STRDUP_ALLOCA (dirpref, host);
|
||||
strcpy (dirpref, u->host);
|
||||
|
||||
if (u->port != scheme_default_port (u->scheme))
|
||||
{
|
||||
int len = strlen (dirpref);
|
||||
dirpref[len] = ':';
|
||||
long_to_string (dirpref + len + 1, u->port);
|
||||
}
|
||||
}
|
||||
else /* not add_hostdir */
|
||||
else /* not add_hostdir */
|
||||
{
|
||||
if (!DOTP (opt.dir_prefix))
|
||||
dirpref = opt.dir_prefix;
|
||||
else
|
||||
dirpref = "";
|
||||
}
|
||||
xfree (host);
|
||||
|
||||
/* If there is a prefix, prepend it. */
|
||||
if (*dirpref)
|
||||
@ -1142,7 +1199,10 @@ mkstruct (const struct url *u)
|
||||
sprintf (newdir, "%s%s%s", dirpref, *dir == '/' ? "" : "/", dir);
|
||||
dir = newdir;
|
||||
}
|
||||
dir = encode_string (dir);
|
||||
|
||||
dir_preencoding = dir;
|
||||
dir = reencode_string (dir_preencoding);
|
||||
|
||||
l = strlen (dir);
|
||||
if (l && dir[l - 1] == '/')
|
||||
dir[l - 1] = '\0';
|
||||
@ -1153,9 +1213,17 @@ mkstruct (const struct url *u)
|
||||
file = u->file;
|
||||
|
||||
/* Finally, construct the full name. */
|
||||
res = (char *)xmalloc (strlen (dir) + 1 + strlen (file) + 1);
|
||||
res = (char *)xmalloc (strlen (dir) + 1 + strlen (file)
|
||||
+ (query ? (1 + strlen (query)) : 0)
|
||||
+ 1);
|
||||
sprintf (res, "%s%s%s", dir, *dir ? "/" : "", file);
|
||||
xfree (dir);
|
||||
if (query)
|
||||
{
|
||||
strcat (res, "?");
|
||||
strcat (res, query);
|
||||
}
|
||||
if (dir != dir_preencoding)
|
||||
xfree (dir);
|
||||
return res;
|
||||
}
|
||||
|
||||
@ -1177,7 +1245,7 @@ compose_file_name (char *base, char *query)
|
||||
{
|
||||
if (UNSAFE_CHAR (*from))
|
||||
{
|
||||
const unsigned char c = *from++;
|
||||
unsigned char c = *from++;
|
||||
*to++ = '%';
|
||||
*to++ = XDIGIT_TO_XCHAR (c >> 4);
|
||||
*to++ = XDIGIT_TO_XCHAR (c & 0xf);
|
||||
@ -1282,10 +1350,8 @@ url_filename (const struct url *u)
|
||||
static int
|
||||
urlpath_length (const char *url)
|
||||
{
|
||||
const char *q = strchr (url, '?');
|
||||
if (q)
|
||||
return q - url;
|
||||
return strlen (url);
|
||||
const char *q = strpbrk_or_eos (url, "?;#");
|
||||
return q - url;
|
||||
}
|
||||
|
||||
/* Find the last occurrence of character C in the range [b, e), or
|
||||
@ -1323,63 +1389,42 @@ uri_merge_1 (const char *base, const char *link, int linklength, int no_scheme)
|
||||
{
|
||||
const char *end = base + urlpath_length (base);
|
||||
|
||||
if (*link != '/')
|
||||
if (!*link)
|
||||
{
|
||||
/* LINK is a relative URL: we need to replace everything
|
||||
after last slash (possibly empty) with LINK.
|
||||
|
||||
So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy",
|
||||
our result should be "whatever/foo/qux/xyzzy". */
|
||||
int need_explicit_slash = 0;
|
||||
int span;
|
||||
const char *start_insert;
|
||||
const char *last_slash = find_last_char (base, end, '/');
|
||||
if (!last_slash)
|
||||
{
|
||||
/* No slash found at all. Append LINK to what we have,
|
||||
but we'll need a slash as a separator.
|
||||
|
||||
Example: if base == "foo" and link == "qux/xyzzy", then
|
||||
we cannot just append link to base, because we'd get
|
||||
"fooqux/xyzzy", whereas what we want is
|
||||
"foo/qux/xyzzy".
|
||||
|
||||
To make sure the / gets inserted, we set
|
||||
need_explicit_slash to 1. We also set start_insert
|
||||
to end + 1, so that the length calculations work out
|
||||
correctly for one more (slash) character. Accessing
|
||||
that character is fine, since it will be the
|
||||
delimiter, '\0' or '?'. */
|
||||
/* example: "foo?..." */
|
||||
/* ^ ('?' gets changed to '/') */
|
||||
start_insert = end + 1;
|
||||
need_explicit_slash = 1;
|
||||
}
|
||||
else if (last_slash && last_slash != base && *(last_slash - 1) == '/')
|
||||
{
|
||||
/* example: http://host" */
|
||||
/* ^ */
|
||||
start_insert = end + 1;
|
||||
need_explicit_slash = 1;
|
||||
}
|
||||
else
|
||||
{
|
||||
/* example: "whatever/foo/bar" */
|
||||
/* ^ */
|
||||
start_insert = last_slash + 1;
|
||||
}
|
||||
|
||||
span = start_insert - base;
|
||||
constr = (char *)xmalloc (span + linklength + 1);
|
||||
if (span)
|
||||
memcpy (constr, base, span);
|
||||
if (need_explicit_slash)
|
||||
constr[span - 1] = '/';
|
||||
if (linklength)
|
||||
memcpy (constr + span, link, linklength);
|
||||
constr[span + linklength] = '\0';
|
||||
/* Empty LINK points back to BASE, query string and all. */
|
||||
constr = xstrdup (base);
|
||||
}
|
||||
else /* *link == `/' */
|
||||
else if (*link == '?')
|
||||
{
|
||||
/* LINK points to the same location, but changes the query
|
||||
string. Examples: */
|
||||
/* uri_merge("path", "?new") -> "path?new" */
|
||||
/* uri_merge("path?foo", "?new") -> "path?new" */
|
||||
/* uri_merge("path?foo#bar", "?new") -> "path?new" */
|
||||
/* uri_merge("path#foo", "?new") -> "path?new" */
|
||||
int baselength = end - base;
|
||||
constr = xmalloc (baselength + linklength + 1);
|
||||
memcpy (constr, base, baselength);
|
||||
memcpy (constr + baselength, link, linklength);
|
||||
constr[baselength + linklength] = '\0';
|
||||
}
|
||||
else if (*link == '#')
|
||||
{
|
||||
/* uri_merge("path", "#new") -> "path#new" */
|
||||
/* uri_merge("path#foo", "#new") -> "path#new" */
|
||||
/* uri_merge("path?foo", "#new") -> "path?foo#new" */
|
||||
/* uri_merge("path?foo#bar", "#new") -> "path?foo#new" */
|
||||
int baselength;
|
||||
const char *end1 = strchr (base, '#');
|
||||
if (!end1)
|
||||
end1 = base + strlen (base);
|
||||
baselength = end1 - base;
|
||||
constr = xmalloc (baselength + linklength + 1);
|
||||
memcpy (constr, base, baselength);
|
||||
memcpy (constr + baselength, link, linklength);
|
||||
constr[baselength + linklength] = '\0';
|
||||
}
|
||||
else if (*link == '/')
|
||||
{
|
||||
/* LINK is an absolute path: we need to replace everything
|
||||
after (and including) the FIRST slash with LINK.
|
||||
@ -1435,6 +1480,62 @@ uri_merge_1 (const char *base, const char *link, int linklength, int no_scheme)
|
||||
memcpy (constr + span, link, linklength);
|
||||
constr[span + linklength] = '\0';
|
||||
}
|
||||
else
|
||||
{
|
||||
/* LINK is a relative URL: we need to replace everything
|
||||
after last slash (possibly empty) with LINK.
|
||||
|
||||
So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy",
|
||||
our result should be "whatever/foo/qux/xyzzy". */
|
||||
int need_explicit_slash = 0;
|
||||
int span;
|
||||
const char *start_insert;
|
||||
const char *last_slash = find_last_char (base, end, '/');
|
||||
if (!last_slash)
|
||||
{
|
||||
/* No slash found at all. Append LINK to what we have,
|
||||
but we'll need a slash as a separator.
|
||||
|
||||
Example: if base == "foo" and link == "qux/xyzzy", then
|
||||
we cannot just append link to base, because we'd get
|
||||
"fooqux/xyzzy", whereas what we want is
|
||||
"foo/qux/xyzzy".
|
||||
|
||||
To make sure the / gets inserted, we set
|
||||
need_explicit_slash to 1. We also set start_insert
|
||||
to end + 1, so that the length calculations work out
|
||||
correctly for one more (slash) character. Accessing
|
||||
that character is fine, since it will be the
|
||||
delimiter, '\0' or '?'. */
|
||||
/* example: "foo?..." */
|
||||
/* ^ ('?' gets changed to '/') */
|
||||
start_insert = end + 1;
|
||||
need_explicit_slash = 1;
|
||||
}
|
||||
else if (last_slash && last_slash != base && *(last_slash - 1) == '/')
|
||||
{
|
||||
/* example: http://host" */
|
||||
/* ^ */
|
||||
start_insert = end + 1;
|
||||
need_explicit_slash = 1;
|
||||
}
|
||||
else
|
||||
{
|
||||
/* example: "whatever/foo/bar" */
|
||||
/* ^ */
|
||||
start_insert = last_slash + 1;
|
||||
}
|
||||
|
||||
span = start_insert - base;
|
||||
constr = (char *)xmalloc (span + linklength + 1);
|
||||
if (span)
|
||||
memcpy (constr, base, span);
|
||||
if (need_explicit_slash)
|
||||
constr[span - 1] = '/';
|
||||
if (linklength)
|
||||
memcpy (constr + span, link, linklength);
|
||||
constr[span + linklength] = '\0';
|
||||
}
|
||||
}
|
||||
else /* !no_scheme */
|
||||
{
|
||||
@ -1602,12 +1703,13 @@ static void replace_attr PARAMS ((const char **, int, FILE *, const char *));
|
||||
/* Change the links in an HTML document. Accepts a structure that
|
||||
defines the positions of all the links. */
|
||||
void
|
||||
convert_links (const char *file, urlpos *l)
|
||||
convert_links (const char *file, struct urlpos *l)
|
||||
{
|
||||
struct file_memory *fm;
|
||||
FILE *fp;
|
||||
const char *p;
|
||||
downloaded_file_t downloaded_file_return;
|
||||
int to_url_count = 0, to_file_count = 0;
|
||||
|
||||
logprintf (LOG_VERBOSE, _("Converting %s... "), file);
|
||||
|
||||
@ -1615,12 +1717,12 @@ convert_links (const char *file, urlpos *l)
|
||||
/* First we do a "dry run": go through the list L and see whether
|
||||
any URL needs to be converted in the first place. If not, just
|
||||
leave the file alone. */
|
||||
int count = 0;
|
||||
urlpos *dry = l;
|
||||
int dry_count = 0;
|
||||
struct urlpos *dry = l;
|
||||
for (dry = l; dry; dry = dry->next)
|
||||
if (dry->convert != CO_NOCONVERT)
|
||||
++count;
|
||||
if (!count)
|
||||
++dry_count;
|
||||
if (!dry_count)
|
||||
{
|
||||
logputs (LOG_VERBOSE, _("nothing to do.\n"));
|
||||
return;
|
||||
@ -1674,7 +1776,7 @@ convert_links (const char *file, urlpos *l)
|
||||
/* If the URL is not to be converted, skip it. */
|
||||
if (l->convert == CO_NOCONVERT)
|
||||
{
|
||||
DEBUGP (("Skipping %s at position %d.\n", l->url, l->pos));
|
||||
DEBUGP (("Skipping %s at position %d.\n", l->url->url, l->pos));
|
||||
continue;
|
||||
}
|
||||
|
||||
@ -1689,19 +1791,21 @@ convert_links (const char *file, urlpos *l)
|
||||
char *quoted_newname = html_quote_string (newname);
|
||||
replace_attr (&p, l->size, fp, quoted_newname);
|
||||
DEBUGP (("TO_RELATIVE: %s to %s at position %d in %s.\n",
|
||||
l->url, newname, l->pos, file));
|
||||
l->url->url, newname, l->pos, file));
|
||||
xfree (newname);
|
||||
xfree (quoted_newname);
|
||||
++to_file_count;
|
||||
}
|
||||
else if (l->convert == CO_CONVERT_TO_COMPLETE)
|
||||
{
|
||||
/* Convert the link to absolute URL. */
|
||||
char *newlink = l->url;
|
||||
char *newlink = l->url->url;
|
||||
char *quoted_newlink = html_quote_string (newlink);
|
||||
replace_attr (&p, l->size, fp, quoted_newlink);
|
||||
DEBUGP (("TO_COMPLETE: <something> to %s at position %d in %s.\n",
|
||||
newlink, l->pos, file));
|
||||
xfree (quoted_newlink);
|
||||
++to_url_count;
|
||||
}
|
||||
}
|
||||
/* Output the rest of the file. */
|
||||
@ -1709,7 +1813,8 @@ convert_links (const char *file, urlpos *l)
|
||||
fwrite (p, 1, fm->length - (p - fm->content), fp);
|
||||
fclose (fp);
|
||||
read_file_free (fm);
|
||||
logputs (LOG_VERBOSE, _("done.\n"));
|
||||
logprintf (LOG_VERBOSE,
|
||||
_("%d-%d\n"), to_file_count, to_url_count);
|
||||
}
|
||||
|
||||
/* Construct and return a malloced copy of the relative link from two
|
||||
@ -1766,20 +1871,6 @@ construct_relative (const char *s1, const char *s2)
|
||||
return res;
|
||||
}
|
||||
|
||||
/* Add URL to the head of the list L. */
|
||||
urlpos *
|
||||
add_url (urlpos *l, const char *url, const char *file)
|
||||
{
|
||||
urlpos *t;
|
||||
|
||||
t = (urlpos *)xmalloc (sizeof (urlpos));
|
||||
memset (t, 0, sizeof (*t));
|
||||
t->url = xstrdup (url);
|
||||
t->local_name = xstrdup (file);
|
||||
t->next = l;
|
||||
return t;
|
||||
}
|
||||
|
||||
static void
|
||||
write_backup_file (const char *file, downloaded_file_t downloaded_file_return)
|
||||
{
|
||||
@ -1850,15 +1941,9 @@ write_backup_file (const char *file, downloaded_file_t downloaded_file_return)
|
||||
-- Dan Harkless <wget@harkless.org>
|
||||
|
||||
This [adding a field to the urlpos structure] didn't work
|
||||
because convert_file() is called twice: once after all its
|
||||
sublinks have been retrieved in recursive_retrieve(), and
|
||||
once at the end of the day in convert_all_links(). The
|
||||
original linked list collected in recursive_retrieve() is
|
||||
lost after the first invocation of convert_links(), and
|
||||
convert_all_links() makes a new one (it calls get_urls_html()
|
||||
for each file it covers.) That's why your first approach didn't
|
||||
work. The way to make it work is perhaps to make this flag a
|
||||
field in the `urls_html' list.
|
||||
because convert_file() is called from convert_all_links at
|
||||
the end of the retrieval with a freshly built new urlpos
|
||||
list.
|
||||
-- Hrvoje Niksic <hniksic@arsdigita.com>
|
||||
*/
|
||||
converted_file_ptr = xmalloc(sizeof(*converted_file_ptr));
|
||||
@ -1941,13 +2026,40 @@ find_fragment (const char *beg, int size, const char **bp, const char **ep)
|
||||
return 0;
|
||||
}
|
||||
|
||||
typedef struct _downloaded_file_list {
|
||||
char* file;
|
||||
downloaded_file_t download_type;
|
||||
struct _downloaded_file_list* next;
|
||||
} downloaded_file_list;
|
||||
/* We're storing "modes" of type downloaded_file_t in the hash table.
|
||||
However, our hash tables only accept pointers for keys and values.
|
||||
So when we need a pointer, we use the address of a
|
||||
downloaded_file_t variable of static storage. */
|
||||
|
||||
static downloaded_file_t *
|
||||
downloaded_mode_to_ptr (downloaded_file_t mode)
|
||||
{
|
||||
static downloaded_file_t
|
||||
v1 = FILE_NOT_ALREADY_DOWNLOADED,
|
||||
v2 = FILE_DOWNLOADED_NORMALLY,
|
||||
v3 = FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED,
|
||||
v4 = CHECK_FOR_FILE;
|
||||
|
||||
static downloaded_file_list *downloaded_files;
|
||||
switch (mode)
|
||||
{
|
||||
case FILE_NOT_ALREADY_DOWNLOADED:
|
||||
return &v1;
|
||||
case FILE_DOWNLOADED_NORMALLY:
|
||||
return &v2;
|
||||
case FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED:
|
||||
return &v3;
|
||||
case CHECK_FOR_FILE:
|
||||
return &v4;
|
||||
}
|
||||
return NULL;
|
||||
}
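A short usage sketch of the hash-backed downloaded_file() that follows; `filename` is a hypothetical local file name, and the modes are the enum values handled by downloaded_mode_to_ptr() above:

/* Sketch: query without registering, then record a fresh download. */
downloaded_file_t prior = downloaded_file (CHECK_FOR_FILE, filename);
if (prior != FILE_NOT_ALREADY_DOWNLOADED)
  DEBUGP (("%s was already retrieved in this run.\n", filename));

/* Returns FILE_NOT_ALREADY_DOWNLOADED the first time, and the recorded
   mode on later calls for the same file. */
downloaded_file (FILE_DOWNLOADED_NORMALLY, filename);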
|
||||
|
||||
/* This should really be merged with dl_file_url_map and
|
||||
downloaded_html_files in recur.c. This was originally a list, but
|
||||
I changed it to a hash table beause it was actually taking a lot of
|
||||
time to find things in it. */
|
||||
|
||||
static struct hash_table *downloaded_files_hash;
|
||||
|
||||
/* Remembers which files have been downloaded. In the standard case, should be
|
||||
called with mode == FILE_DOWNLOADED_NORMALLY for each file we actually
|
||||
@ -1962,46 +2074,47 @@ static downloaded_file_list *downloaded_files;
|
||||
it, call with mode == CHECK_FOR_FILE. Please be sure to call this function
|
||||
with local filenames, not remote URLs. */
|
||||
downloaded_file_t
|
||||
downloaded_file (downloaded_file_t mode, const char* file)
|
||||
downloaded_file (downloaded_file_t mode, const char *file)
|
||||
{
|
||||
boolean found_file = FALSE;
|
||||
downloaded_file_list* rover = downloaded_files;
|
||||
downloaded_file_t *ptr;
|
||||
|
||||
while (rover != NULL)
|
||||
if (strcmp(rover->file, file) == 0)
|
||||
{
|
||||
found_file = TRUE;
|
||||
break;
|
||||
}
|
||||
else
|
||||
rover = rover->next;
|
||||
|
||||
if (found_file)
|
||||
return rover->download_type; /* file had already been downloaded */
|
||||
else
|
||||
if (mode == CHECK_FOR_FILE)
|
||||
{
|
||||
if (mode != CHECK_FOR_FILE)
|
||||
{
|
||||
rover = xmalloc(sizeof(*rover));
|
||||
rover->file = xstrdup(file); /* use xstrdup() so die on out-of-mem. */
|
||||
rover->download_type = mode;
|
||||
rover->next = downloaded_files;
|
||||
downloaded_files = rover;
|
||||
}
|
||||
|
||||
return FILE_NOT_ALREADY_DOWNLOADED;
|
||||
if (!downloaded_files_hash)
|
||||
return FILE_NOT_ALREADY_DOWNLOADED;
|
||||
ptr = hash_table_get (downloaded_files_hash, file);
|
||||
if (!ptr)
|
||||
return FILE_NOT_ALREADY_DOWNLOADED;
|
||||
return *ptr;
|
||||
}
|
||||
|
||||
if (!downloaded_files_hash)
|
||||
downloaded_files_hash = make_string_hash_table (0);
|
||||
|
||||
ptr = hash_table_get (downloaded_files_hash, file);
|
||||
if (ptr)
|
||||
return *ptr;
|
||||
|
||||
ptr = downloaded_mode_to_ptr (mode);
|
||||
hash_table_put (downloaded_files_hash, xstrdup (file), &ptr);
|
||||
|
||||
return FILE_NOT_ALREADY_DOWNLOADED;
|
||||
}
|
||||
|
||||
static int
|
||||
df_free_mapper (void *key, void *value, void *ignored)
|
||||
{
|
||||
xfree (key);
|
||||
return 0;
|
||||
}
|
||||
|
||||
void
|
||||
downloaded_files_free (void)
|
||||
{
|
||||
downloaded_file_list* rover = downloaded_files;
|
||||
while (rover)
|
||||
if (downloaded_files_hash)
|
||||
{
|
||||
downloaded_file_list *next = rover->next;
|
||||
xfree (rover->file);
|
||||
xfree (rover);
|
||||
rover = next;
|
||||
hash_table_map (downloaded_files_hash, df_free_mapper, NULL);
|
||||
hash_table_destroy (downloaded_files_hash);
|
||||
downloaded_files_hash = NULL;
|
||||
}
|
||||
}
|
||||
|
src/url.h
@ -72,11 +72,11 @@ enum convert_options {
/* A structure that defines the whereabouts of a URL, i.e. its
   position in an HTML document, etc. */

typedef struct _urlpos
{
  char *url;                    /* linked URL, after it has been
                                   merged with the base */
  char *local_name;             /* Local file to which it was saved */
struct urlpos {
  struct url *url;              /* the URL of the link, after it has
                                   been merged with the base */
  char *local_name;             /* local file to which it was saved
                                   (used by convert_links) */

  /* Information about the original link: */
  int link_relative_p;          /* was the link relative? */
@ -89,8 +89,8 @@ typedef struct _urlpos
  /* URL's position in the buffer. */
  int pos, size;

  struct _urlpos *next;         /* Next struct in list */
} urlpos;
  struct urlpos *next;          /* next list element */
};
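Under the new layout the `url` member is a parsed `struct url` rather than a string. A sketch of building and releasing one node, mirroring the get_urls_file() hunk in url.c (the URL literal is hypothetical):

/* Sketch: allocate, fill and free a struct urlpos node. */
struct urlpos *entry = (struct urlpos *)xmalloc (sizeof (struct urlpos));
memset (entry, 0, sizeof (*entry));
entry->url = url_parse ("http://host/path", NULL);
free_urlpos (entry);   /* url_free()s entry->url and frees local_name */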
|
||||
|
||||
/* downloaded_file() takes a parameter of this type and returns this type. */
|
||||
typedef enum
|
||||
@ -126,9 +126,9 @@ int url_skip_uname PARAMS ((const char *));
|
||||
|
||||
char *url_string PARAMS ((const struct url *, int));
|
||||
|
||||
urlpos *get_urls_file PARAMS ((const char *));
|
||||
urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *));
|
||||
void free_urlpos PARAMS ((urlpos *));
|
||||
struct urlpos *get_urls_file PARAMS ((const char *));
|
||||
struct urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *));
|
||||
void free_urlpos PARAMS ((struct urlpos *));
|
||||
|
||||
char *uri_merge PARAMS ((const char *, const char *));
|
||||
|
||||
@ -136,11 +136,10 @@ void rotate_backups PARAMS ((const char *));
|
||||
int mkalldirs PARAMS ((const char *));
|
||||
char *url_filename PARAMS ((const struct url *));
|
||||
|
||||
char *getproxy PARAMS ((uerr_t));
|
||||
char *getproxy PARAMS ((enum url_scheme));
|
||||
int no_proxy_match PARAMS ((const char *, const char **));
|
||||
|
||||
void convert_links PARAMS ((const char *, urlpos *));
|
||||
urlpos *add_url PARAMS ((urlpos *, const char *, const char *));
|
||||
void convert_links PARAMS ((const char *, struct urlpos *));
|
||||
|
||||
downloaded_file_t downloaded_file PARAMS ((downloaded_file_t, const char *));
|
||||
|
||||
|
src/utils.c
@ -307,6 +307,18 @@ xstrdup_debug (const char *s, const char *source_file, int source_line)

#endif /* DEBUG_MALLOC */

/* Utility function: like xstrdup(), but also lowercases S. */

char *
xstrdup_lower (const char *s)
{
  char *copy = xstrdup (s);
  char *p = copy;
  for (; *p; p++)
    *p = TOLOWER (*p);
  return copy;
}
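A hypothetical usage example (not from this patch): the helper is handy wherever a case-insensitive copy of a string such as a host name is needed for comparison or hashing.

/* Sketch: make a lowercase copy, use it, release it. */
char *host = xstrdup_lower ("WWW.Example.COM");   /* yields "www.example.com" */
xfree (host);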
|
||||
|
||||
/* Copy the string formed by two pointers (one on the beginning, other
|
||||
on the char after the last char) to a new, malloc-ed location.
|
||||
0-terminate it. */
|
||||
@ -443,6 +455,8 @@ fork_to_background (void)
|
||||
}
|
||||
#endif /* not WINDOWS */
|
||||
|
||||
#if 0
|
||||
/* debug */
|
||||
char *
|
||||
ps (char *orig)
|
||||
{
|
||||
@ -450,6 +464,7 @@ ps (char *orig)
|
||||
path_simplify (r);
|
||||
return r;
|
||||
}
|
||||
#endif
|
||||
|
||||
/* Canonicalize PATH, and return a new path. The new path differs from PATH
|
||||
in that:
|
||||
@ -468,45 +483,31 @@ ps (char *orig)
|
||||
Change the original string instead of strdup-ing.
|
||||
React correctly when beginning with `./' and `../'.
|
||||
Don't zip out trailing slashes. */
|
||||
void
|
||||
int
|
||||
path_simplify (char *path)
|
||||
{
|
||||
register int i, start, ddot;
|
||||
register int i, start;
|
||||
int changes = 0;
|
||||
char stub_char;
|
||||
|
||||
if (!*path)
|
||||
return;
|
||||
return 0;
|
||||
|
||||
/*stub_char = (*path == '/') ? '/' : '.';*/
|
||||
stub_char = '/';
|
||||
|
||||
/* Addition: Remove all `./'-s preceding the string. If `../'-s
|
||||
precede, put `/' in front and remove them too. */
|
||||
i = 0;
|
||||
ddot = 0;
|
||||
while (1)
|
||||
{
|
||||
if (path[i] == '.' && path[i + 1] == '/')
|
||||
i += 2;
|
||||
else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/')
|
||||
{
|
||||
i += 3;
|
||||
ddot = 1;
|
||||
}
|
||||
else
|
||||
break;
|
||||
}
|
||||
if (i)
|
||||
strcpy (path, path + i - ddot);
|
||||
if (path[0] == '/')
|
||||
/* Preserve initial '/'. */
|
||||
++path;
|
||||
|
||||
/* Replace single `.' or `..' with `/'. */
|
||||
/* Nix out leading `.' or `..' with. */
|
||||
if ((path[0] == '.' && path[1] == '\0')
|
||||
|| (path[0] == '.' && path[1] == '.' && path[2] == '\0'))
|
||||
{
|
||||
path[0] = stub_char;
|
||||
path[1] = '\0';
|
||||
return;
|
||||
path[0] = '\0';
|
||||
changes = 1;
|
||||
return changes;
|
||||
}
|
||||
|
||||
/* Walk along PATH looking for things to compact. */
|
||||
i = 0;
|
||||
while (1)
|
||||
@ -531,6 +532,7 @@ path_simplify (char *path)
|
||||
{
|
||||
strcpy (path + start + 1, path + i);
|
||||
i = start + 1;
|
||||
changes = 1;
|
||||
}
|
||||
|
||||
/* Check for `../', `./' or trailing `.' by itself. */
|
||||
@ -540,6 +542,7 @@ path_simplify (char *path)
|
||||
if (!path[i + 1])
|
||||
{
|
||||
path[--i] = '\0';
|
||||
changes = 1;
|
||||
break;
|
||||
}
|
||||
|
||||
@ -548,6 +551,7 @@ path_simplify (char *path)
|
||||
{
|
||||
strcpy (path + i, path + i + 1);
|
||||
i = (start < 0) ? 0 : start;
|
||||
changes = 1;
|
||||
continue;
|
||||
}
|
||||
|
||||
@ -556,12 +560,32 @@ path_simplify (char *path)
|
||||
(path[i + 2] == '/' || !path[i + 2]))
|
||||
{
|
||||
while (--start > -1 && path[start] != '/');
|
||||
strcpy (path + start + 1, path + i + 2);
|
||||
strcpy (path + start + 1, path + i + 2 + (start == -1 && path[i + 2]));
|
||||
i = (start < 0) ? 0 : start;
|
||||
changes = 1;
|
||||
continue;
|
||||
}
|
||||
} /* path == '.' */
|
||||
} /* while */
|
||||
|
||||
/* Addition: Remove all `./'-s and `../'-s preceding the string. */
|
||||
i = 0;
|
||||
while (1)
|
||||
{
|
||||
if (path[i] == '.' && path[i + 1] == '/')
|
||||
i += 2;
|
||||
else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/')
|
||||
i += 3;
|
||||
else
|
||||
break;
|
||||
}
|
||||
if (i)
|
||||
{
|
||||
strcpy (path, path + i - 0);
|
||||
changes = 1;
|
||||
}
|
||||
|
||||
return changes;
|
||||
}
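Expected behaviour of the reworked path_simplify(), illustrated with hypothetical inputs rather than anything from the patch itself; the return value reports whether the buffer was modified in place:

/* Sketch: inputs are simplified in place. */
char p1[] = "a/./b";      /* becomes "a/b", path_simplify () returns 1 */
char p2[] = "a/b/../c";   /* becomes "a/c", returns 1 */
char p3[] = "a/b";        /* unchanged,     returns 0 */
int changed = path_simplify (p1);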
|
||||
|
||||
/* "Touch" FILE, i.e. make its atime and mtime equal to the time
|
||||
|
@ -48,12 +48,13 @@ char *datetime_str PARAMS ((time_t *));
|
||||
void print_malloc_debug_stats ();
|
||||
#endif
|
||||
|
||||
char *xstrdup_lower PARAMS ((const char *));
|
||||
char *strdupdelim PARAMS ((const char *, const char *));
|
||||
char **sepstring PARAMS ((const char *));
|
||||
int frontcmp PARAMS ((const char *, const char *));
|
||||
char *pwd_cuserid PARAMS ((char *));
|
||||
void fork_to_background PARAMS ((void));
|
||||
void path_simplify PARAMS ((char *));
|
||||
int path_simplify PARAMS ((char *));
|
||||
|
||||
void touch PARAMS ((const char *, time_t));
|
||||
int remove_link PARAMS ((const char *));
|
||||
@ -98,4 +99,6 @@ long wtimer_granularity PARAMS ((void));
|
||||
|
||||
char *html_quote_string PARAMS ((const char *));
|
||||
|
||||
int determine_screen_width PARAMS ((void));
|
||||
|
||||
#endif /* UTILS_H */
|
||||
|
@ -28,6 +28,11 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
|
||||
# define NDEBUG /* To kill off assertions */
|
||||
#endif /* not DEBUG */
|
||||
|
||||
/* Define this if you want primitive but extensive malloc debugging.
|
||||
It will make Wget extremely slow, so only do it in development
|
||||
builds. */
|
||||
#undef DEBUG_MALLOC
|
||||
|
||||
#ifndef PARAMS
|
||||
# if PROTOTYPES
|
||||
# define PARAMS(args) args
|
||||
@ -60,7 +65,7 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
|
||||
|
||||
3) Finally, the debug messages are meant to be a clue for me to
|
||||
debug problems with Wget. If I get them in a language I don't
|
||||
understand, debugging will become a new challenge of its own! :-) */
|
||||
understand, debugging will become a new challenge of its own! */
|
||||
|
||||
|
||||
/* Include these, so random files need not include them. */
|
||||
|