
[svn] Implemented breadth-first retrieval.

Published in <sxsherjczw2.fsf@florida.arsdigita.de>.
hniksic 2001-11-24 19:10:34 -08:00
parent b88223f99d
commit 222e9465b7
23 changed files with 1073 additions and 853 deletions


@ -1,3 +1,9 @@
2001-11-25 Hrvoje Niksic <hniksic@arsdigita.com>
* TODO: Ditto.
* NEWS: Updated with the latest stuff.
2001-11-23 Hrvoje Niksic <hniksic@arsdigita.com>
* po/hr.po: A major overhaul.

NEWS

@ -7,9 +7,19 @@ Please send GNU Wget bug reports to <bug-wget@gnu.org>.
* Changes in Wget 1.8.
** "Recursive retrieval" now uses a breadth-first algorithm.
Recursive downloads are faster and consume *significantly* less memory
than before.
** A new progress indicator is now available. Try it with
--progress=bar or using `progress = bar' in `.wgetrc'.
** Host directories now contain port information if the URL is at a
non-standard port.
** Wget now supports the robots.txt directives specified in
<http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html>.
** URL parser has been fixed, especially the infamous overzealous
quoting bug. Wget no longer dequotes reserved characters, e.g. `%3F'
is no longer translated to `?', nor `%2B' to `+'. Unsafe characters

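The breadth-first change announced in the first NEWS item above replaces the old depth-first recursion with an explicit FIFO queue plus a set of already-seen URLs, so pages are fetched level by level and memory no longer grows with recursion depth. Below is a minimal, self-contained sketch of that idea on a toy in-memory link graph (no networking; all names are made up); the real implementation is retrieve_tree() in src/recur.c, later in this commit.

#include <stdio.h>
#include <string.h>

/* Toy link graph: page -> up to two outgoing links. */
static const char *pages[][3] = {
  { "index.html", "a.html", "b.html" },
  { "a.html",     "c.html", NULL },
  { "b.html",     "c.html", NULL },
  { "c.html",     NULL,     NULL },
};
#define NPAGES (sizeof pages / sizeof pages[0])

int
main (void)
{
  const char *queue[16];        /* FIFO queue of URLs to fetch */
  int head = 0, tail = 0;
  int i, j, k;

  queue[tail++] = "index.html";
  while (head < tail)
    {
      const char *url = queue[head++];   /* dequeue the oldest URL */
      printf ("downloading %s\n", url);
      for (i = 0; i < (int) NPAGES; i++)
        {
          if (strcmp (pages[i][0], url) != 0)
            continue;
          for (j = 1; j <= 2 && pages[i][j]; j++)
            {
              int known = 0;             /* skip URLs already queued */
              for (k = 0; k < tail; k++)
                if (strcmp (queue[k], pages[i][j]) == 0)
                  known = 1;
              if (!known)
                queue[tail++] = pages[i][j];
            }
        }
    }
  return 0;   /* prints index.html, a.html, b.html, c.html -- level by level */
}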
TODO

@ -20,15 +20,6 @@ changes.
file, though forcibly disconnecting from the server at the desired endpoint
might be workable).
* RFC 1738 says that if logging on to an FTP server puts you in a directory
other than '/', the way to specify a file relative to '/' in a URL (let's use
"/bin/ls" in this example) is "ftp://host/%2Fbin/ls". Wget needs to support
this (and ideally not consider "ftp://host//bin/ls" to be equivalent, as that
would equate to the command "CWD " rather than "CWD /"). To accommodate people
used to broken FTP clients like Internet Explorer and Netscape, if
"ftp://host/bin/ls" doesn't exist, Wget should try again (perhaps under
control of an option), acting as if the user had typed "ftp://host/%2Fbin/ls".
* If multiple FTP URLs are specified that are on the same host, Wget should
re-use the connection rather than opening a new one for each file.
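To make the RFC 1738 item above concrete, here is a small standalone sketch (illustration only, not part of the commit) of how a url-path maps to FTP commands: the path is split on '/' and each segment is percent-decoded before being used, so "%2Fbin" yields "CWD /bin" while the doubled slash in "ftp://host//bin/ls" yields the bare "CWD " mentioned in that item.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Decode %XX escapes in-place (just enough for this illustration). */
static void
percent_decode (char *s)
{
  char *d = s;
  while (*s)
    {
      if (*s == '%' && s[1] && s[2])
        {
          char hex[3] = { s[1], s[2], '\0' };
          *d++ = (char) strtol (hex, NULL, 16);
          s += 3;
        }
      else
        *d++ = *s++;
    }
  *d = '\0';
}

/* Print the FTP commands implied by the RFC 1738 url-path PATH,
   i.e. the part of the URL after "ftp://host/". */
static void
show_commands (const char *path)
{
  char buf[256], *seg, *next;
  strcpy (buf, path);
  printf ("url-path \"%s\":\n", path);
  for (seg = buf; seg; seg = next)
    {
      next = strchr (seg, '/');
      if (next)
        *next++ = '\0';
      percent_decode (seg);
      printf ("  %s %s\n", next ? "CWD" : "RETR", seg);
    }
}

int
main (void)
{
  show_commands ("%2Fbin/ls");  /* ftp://host/%2Fbin/ls -> CWD /bin; RETR ls */
  show_commands ("/bin/ls");    /* ftp://host//bin/ls   -> CWD ""; CWD bin; RETR ls */
  return 0;
}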
@ -37,16 +28,9 @@ changes.
* Limit the number of successive redirections to max. 20 or so.
* If -c used on a file that's already completely downloaded, don't re-download
it (unless normal --timestamping processing would cause you to do so).
* If -c used with -N, check to make sure a file hasn't changed on the server
before "continuing" to download it (preventing a bogus hybrid file).
* Take a look at
<http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html>
and support the new directives.
* Generalize --html-extension to something like --mime-extensions and have it
look at mime.types/mimecap file for preferred extension. Non-HTML files with
filenames changed this way would be re-downloaded each time despite -N unless
@ -87,9 +71,6 @@ changes.
turning it off. Get rid of `--foo=no' stuff. Short options would
be handled as `-x' vs. `-nx'.
* Implement "thermometer" display (not all that hard; use an
alternative show_progress() if the output goes to a terminal.)
* Add option to only list wildcard matches without doing the download.
* Add case-insensitivity as an option.
@ -102,19 +83,13 @@ changes.
* Allow time-stamping by arbitrary date.
* Fix Unix directory parser to allow for spaces in file names.
* Allow size limit to files (perhaps with an option to download oversize files
up through the limit or not at all, to get more functionality than [u]limit).
* Implement breadth-first retrieval.
* Download to .in* when mirroring.
* Add an option to delete or move no-longer-existent files when mirroring.
* Implement a switch to avoid downloading multiple files (e.g. x and x.gz).
* Implement uploading (--upload URL?) in FTP and HTTP.
* Rewrite FTP code to allow for easy addition of new commands. It
@ -129,13 +104,10 @@ changes.
* Implement a concept of "packages" a la mirror.
* Implement correct RFC1808 URL parsing.
* Implement more HTTP/1.1 bells and whistles (ETag, Content-MD5 etc.)
* Add a "rollback" option to have --continue throw away a configurable number of
bytes at the end of a file before resuming download. Apparently, some stupid
proxies insert a "transfer interrupted" string we need to get rid of.
* Add a "rollback" option to have continued retrieval throw away a
configurable number of bytes at the end of a file before resuming
download. Apparently, some stupid proxies insert a "transfer
interrupted" string we need to get rid of.
* When using --accept and --reject, you can end up with empty directories. Have
Wget delete any such at the end.


@ -1,3 +1,68 @@
2001-11-25 Hrvoje Niksic <hniksic@arsdigita.com>
* url.c (reencode_string): Use unsigned char, not char --
otherwise the hex digits come out wrong for 8-bit chars such as
nbsp.
(lowercase_str): New function.
(url_parse): Canonicalize u->url if needed.
(get_urls_file): Parse each URL, and return only the valid ones.
(free_urlpos): Call url_free.
(mkstruct): Add :port if the port is non-standard.
(mkstruct): Append the query string to the file name, if any.
(urlpath_length): Use strpbrk_or_eos.
(uri_merge_1): Handle the cases where LINK is an empty string,
where LINK consists only of query, and where LINK consists only of
fragment.
(convert_links): Count and report both kinds of conversion.
(downloaded_file): Use a hash table, not a list.
(downloaded_files_free): Free the hash table.
* retr.c (retrieve_from_file): Ditto.
* main.c (main): Call either retrieve_url or retrieve_tree
for each URL, not both.
* retr.c (register_all_redirections): New function.
(register_redirections_mapper): Ditto.
(retrieve_url): Register the redirections.
(retrieve_url): Make the string "Error parsing proxy ..."
translatable.
* res.c (add_path): Strip leading slash from robots.txt paths so
that the path representations are "compatible".
(free_specs): Free each individual path, too.
(res_cleanup): New function.
(cleanup_hash_table_mapper): Ditto.
* recur.c (url_queue_new): New function.
(url_queue_delete): Ditto.
(url_enqueue): Ditto.
(url_dequeue): Ditto.
(retrieve_tree): New function, replacement for recursive_retrieve.
(descend_url_p): New function.
(register_redirection): New function.
* progress.c (create_image): Cosmetic changes.
* init.c (cleanup): Do all those complex cleanups only if
DEBUG_MALLOC is defined.
* main.c: Removed --simple-check and the corresponding
simple_host_check in init.c.
* html-url.c (handle_link): Parse the URL here, and propagate the
parsed URL to the caller, who would otherwise have to parse it
again.
* host.c (xstrdup_lower): Moved to utils.c.
(realhost): Removed.
(same_host): Ditto.
2001-11-24 Hrvoje Niksic <hniksic@arsdigita.com>
* utils.c (path_simplify): Preserve the (non-)existence of
leading slash. Return non-zero if changes were made.
2001-11-24 Hrvoje Niksic <hniksic@arsdigita.com>
* progress.c (bar_update): Don't modify bp->total_length if it is


@ -162,8 +162,10 @@ main$o: wget.h utils.h init.h retr.h recur.h host.h cookies.h
gnu-md5$o: wget.h gnu-md5.h
mswindows$o: wget.h url.h
netrc$o: wget.h utils.h netrc.h init.h
progress$o: wget.h progress.h utils.h retr.h
rbuf$o: wget.h rbuf.h connect.h
recur$o: wget.h url.h recur.h utils.h retr.h ftp.h fnmatch.h host.h hash.h
res$o: wget.h utils.h hash.h url.h retr.h res.h
retr$o: wget.h utils.h retr.h url.h recur.h ftp.h host.h connect.h hash.h
snprintf$o:
safe-ctype$o: safe-ctype.h


@ -60,8 +60,14 @@ extern int errno;
#endif
/* Mapping of all known hosts to their addresses (n.n.n.n). */
/* #### We should map to *lists* of IP addresses. */
struct hash_table *host_name_address_map;
/* The following two tables are obsolete, since we no longer do host
canonicalization. */
/* Mapping of all known addresses (n.n.n.n) to their hosts. This
is the inverse of host_name_address_map. These two tables share
the strdup'ed strings. */
@ -70,18 +76,6 @@ struct hash_table *host_address_name_map;
/* Mapping between auxiliary (slave) and master host names. */
struct hash_table *host_slave_master_map;
/* Utility function: like xstrdup(), but also lowercases S. */
static char *
xstrdup_lower (const char *s)
{
char *copy = xstrdup (s);
char *p = copy;
for (; *p; p++)
*p = TOLOWER (*p);
return copy;
}
/* The same as gethostbyname, but supports internet addresses of the
form `N.N.N.N'. On some systems gethostbyname() knows how to do
this automatically. */
@ -216,114 +210,6 @@ store_hostaddress (unsigned char *where, const char *hostname)
return 1;
}
/* Determine the "real" name of HOST, as perceived by Wget. If HOST
is referenced by more than one name, "real" name is considered to
be the first one encountered in the past. */
char *
realhost (const char *host)
{
struct in_addr in;
struct hostent *hptr;
char *master_name;
DEBUGP (("Checking for %s in host_name_address_map.\n", host));
if (hash_table_contains (host_name_address_map, host))
{
DEBUGP (("Found; %s was already used, by that name.\n", host));
return xstrdup_lower (host);
}
DEBUGP (("Checking for %s in host_slave_master_map.\n", host));
master_name = hash_table_get (host_slave_master_map, host);
if (master_name)
{
has_master:
DEBUGP (("Found; %s was already used, by the name %s.\n",
host, master_name));
return xstrdup (master_name);
}
DEBUGP (("First time I hear about %s by that name; looking it up.\n",
host));
hptr = ngethostbyname (host);
if (hptr)
{
char *inet_s;
/* Originally, we copied to in.s_addr, but it appears to be
missing on some systems. */
memcpy (&in, *hptr->h_addr_list, sizeof (in));
inet_s = inet_ntoa (in);
add_host_to_cache (host, inet_s);
/* add_host_to_cache() can establish a slave-master mapping. */
DEBUGP (("Checking again for %s in host_slave_master_map.\n", host));
master_name = hash_table_get (host_slave_master_map, host);
if (master_name)
goto has_master;
}
return xstrdup_lower (host);
}
/* Compare two hostnames (out of URL-s if the arguments are URL-s),
taking care of aliases. It uses realhost() to determine a unique
hostname for each of two hosts. If simple_check is non-zero, only
strcmp() is used for comparison. */
int
same_host (const char *u1, const char *u2)
{
const char *s;
char *p1, *p2;
char *real1, *real2;
/* Skip protocol, if present. */
u1 += url_skip_scheme (u1);
u2 += url_skip_scheme (u2);
/* Skip username and password, if present. */
u1 += url_skip_uname (u1);
u2 += url_skip_uname (u2);
for (s = u1; *u1 && *u1 != '/' && *u1 != ':'; u1++);
p1 = strdupdelim (s, u1);
for (s = u2; *u2 && *u2 != '/' && *u2 != ':'; u2++);
p2 = strdupdelim (s, u2);
DEBUGP (("Comparing hosts %s and %s...\n", p1, p2));
if (strcasecmp (p1, p2) == 0)
{
xfree (p1);
xfree (p2);
DEBUGP (("They are quite alike.\n"));
return 1;
}
else if (opt.simple_check)
{
xfree (p1);
xfree (p2);
DEBUGP (("Since checking is simple, I'd say they are not the same.\n"));
return 0;
}
real1 = realhost (p1);
real2 = realhost (p2);
xfree (p1);
xfree (p2);
if (strcasecmp (real1, real2) == 0)
{
DEBUGP (("They are alike, after realhost()->%s.\n", real1));
xfree (real1);
xfree (real2);
return 1;
}
else
{
DEBUGP (("They are not the same (%s, %s).\n", real1, real2));
xfree (real1);
xfree (real2);
return 0;
}
}
/* Determine whether a URL is acceptable to be followed, according to
a list of domains to accept. */
int
@ -383,7 +269,7 @@ herrmsg (int error)
}
void
clean_hosts (void)
host_cleanup (void)
{
/* host_name_address_map and host_address_name_map share the
strings. Because of that, calling free_keys_and_values once


@ -27,15 +27,11 @@ struct url;
struct hostent *ngethostbyname PARAMS ((const char *));
int store_hostaddress PARAMS ((unsigned char *, const char *));
void clean_hosts PARAMS ((void));
void host_cleanup PARAMS ((void));
char *realhost PARAMS ((const char *));
int same_host PARAMS ((const char *, const char *));
int accept_domain PARAMS ((struct url *));
int sufmatch PARAMS ((const char **, const char *));
char *ftp_getaddress PARAMS ((void));
char *herrmsg PARAMS ((int));
#endif /* HOST_H */


@ -284,7 +284,7 @@ struct collect_urls_closure {
char *text; /* HTML text. */
char *base; /* Base URI of the document, possibly
changed through <base href=...>. */
urlpos *head, *tail; /* List of URLs */
struct urlpos *head, *tail; /* List of URLs */
const char *parent_base; /* Base of the current document. */
const char *document_file; /* File name of this document. */
int dash_p_leaf_HTML; /* Whether -p is specified, and this
@ -301,59 +301,67 @@ static void
handle_link (struct collect_urls_closure *closure, const char *link_uri,
struct taginfo *tag, int attrid)
{
int no_scheme = !url_has_scheme (link_uri);
urlpos *newel;
int link_has_scheme = url_has_scheme (link_uri);
struct urlpos *newel;
const char *base = closure->base ? closure->base : closure->parent_base;
char *complete_uri;
char *fragment = strrchr (link_uri, '#');
if (fragment)
{
/* Nullify the fragment identifier, i.e. everything after the
last occurrence of `#', inclusive. This copying is
relatively inefficient, but it doesn't matter because
fragment identifiers don't come up all that often. */
int hashlen = fragment - link_uri;
char *p = alloca (hashlen + 1);
memcpy (p, link_uri, hashlen);
p[hashlen] = '\0';
link_uri = p;
}
struct url *url;
if (!base)
{
if (no_scheme)
DEBUGP (("%s: no base, merge will use \"%s\".\n",
closure->document_file, link_uri));
if (!link_has_scheme)
{
/* We have no base, and the link does not have a host
attached to it. Nothing we can do. */
/* #### Should we print a warning here? Wget 1.5.x used to. */
return;
}
else
complete_uri = xstrdup (link_uri);
url = url_parse (link_uri, NULL);
if (!url)
{
DEBUGP (("%s: link \"%s\" doesn't parse.\n",
closure->document_file, link_uri));
return;
}
}
else
complete_uri = uri_merge (base, link_uri);
{
/* Merge BASE with LINK_URI, but also make sure the result is
canonicalized, i.e. that "../" have been resolved.
(url_parse will do that for us.) */
DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n",
closure->document_file, base ? base : "(null)",
link_uri, complete_uri));
char *complete_uri = uri_merge (base, link_uri);
newel = (urlpos *)xmalloc (sizeof (urlpos));
DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n",
closure->document_file, base, link_uri, complete_uri));
url = url_parse (complete_uri, NULL);
if (!url)
{
DEBUGP (("%s: merged link \"%s\" doesn't parse.\n",
closure->document_file, complete_uri));
xfree (complete_uri);
return;
}
xfree (complete_uri);
}
newel = (struct urlpos *)xmalloc (sizeof (struct urlpos));
memset (newel, 0, sizeof (*newel));
newel->next = NULL;
newel->url = complete_uri;
newel->url = url;
newel->pos = tag->attrs[attrid].value_raw_beginning - closure->text;
newel->size = tag->attrs[attrid].value_raw_size;
/* A URL is relative if the host is not named, and the name does not
start with `/'. */
if (no_scheme && *link_uri != '/')
if (!link_has_scheme && *link_uri != '/')
newel->link_relative_p = 1;
else if (!no_scheme)
else if (link_has_scheme)
newel->link_complete_p = 1;
if (closure->tail)
@ -542,7 +550,7 @@ collect_tags_mapper (struct taginfo *tag, void *arg)
If dash_p_leaf_HTML is non-zero, only the elements needed to render
FILE ("non-external" links) will be returned. */
urlpos *
struct urlpos *
get_urls_html (const char *file, const char *this_url, int dash_p_leaf_HTML,
int *meta_disallow_follow)
{


@ -1452,8 +1452,8 @@ File `%s' already there, will not retrieve.\n"), *hstat.local_file);
if (((suf = suffix (*hstat.local_file)) != NULL)
&& (!strcmp (suf, "html") || !strcmp (suf, "htm")))
*dt |= TEXTHTML;
xfree (suf);
FREE_MAYBE (suf);
FREE_MAYBE (dummy);
return RETROK;
}


@ -171,7 +171,6 @@ static struct {
{ "savecookies", &opt.cookies_output, cmd_file },
{ "saveheaders", &opt.save_headers, cmd_boolean },
{ "serverresponse", &opt.server_response, cmd_boolean },
{ "simplehostcheck", &opt.simple_check, cmd_boolean },
{ "spanhosts", &opt.spanhost, cmd_boolean },
{ "spider", &opt.spider, cmd_boolean },
#ifdef HAVE_SSL
@ -1009,6 +1008,7 @@ check_user_specified_header (const char *s)
}
void cleanup_html_url PARAMS ((void));
void res_cleanup PARAMS ((void));
void downloaded_files_free PARAMS ((void));
@ -1016,13 +1016,27 @@ void downloaded_files_free PARAMS ((void));
void
cleanup (void)
{
extern acc_t *netrc_list;
/* Free external resources, close files, etc. */
recursive_cleanup ();
clean_hosts ();
free_netrc (netrc_list);
if (opt.dfp)
fclose (opt.dfp);
/* We're exiting anyway so there's no real need to call free()
hundreds of times. Skipping the frees will make Wget exit
faster.
However, when detecting leaks, it's crucial to free() everything
because then you can find the real leaks, i.e. the allocated
memory which grows with the size of the program. */
#ifdef DEBUG_MALLOC
recursive_cleanup ();
res_cleanup ();
host_cleanup ();
{
extern acc_t *netrc_list;
free_netrc (netrc_list);
}
cleanup_html_url ();
downloaded_files_free ();
cookies_cleanup ();
@ -1037,6 +1051,7 @@ cleanup (void)
free_vec (opt.domains);
free_vec (opt.follow_tags);
free_vec (opt.ignore_tags);
FREE_MAYBE (opt.progress_type);
xfree (opt.ftp_acc);
FREE_MAYBE (opt.ftp_pass);
FREE_MAYBE (opt.ftp_proxy);
@ -1055,4 +1070,5 @@ cleanup (void)
FREE_MAYBE (opt.bind_address);
FREE_MAYBE (opt.cookies_input);
FREE_MAYBE (opt.cookies_output);
#endif
}


@ -402,9 +402,6 @@ hpVqvdkKsxmNWrHSLcFbEY:G:g:T:U:O:l:n:i:o:a:t:D:A:R:P:B:e:Q:X:I:w:C:",
case 149:
setval ("removelisting", "off");
break;
case 150:
setval ("simplehostcheck", "on");
break;
case 155:
setval ("bindaddress", optarg);
break;
@ -604,7 +601,7 @@ GNU General Public License for more details.\n"));
break;
case 'n':
{
/* #### The n? options are utter crock! */
/* #### What we really want here is --no-foo. */
char *p;
for (p = optarg; *p; p++)
@ -613,9 +610,6 @@ GNU General Public License for more details.\n"));
case 'v':
setval ("verbose", "off");
break;
case 'h':
setval ("simplehostcheck", "on");
break;
case 'H':
setval ("addhostdir", "off");
break;
@ -806,17 +800,17 @@ Can't timestamp and not clobber old files at the same time.\n"));
#endif /* HAVE_SIGNAL */
status = RETROK; /* initialize it, just-in-case */
recursive_reset ();
/*recursive_reset ();*/
/* Retrieve the URLs from argument list. */
for (t = url; *t; t++)
{
char *filename, *redirected_URL;
char *filename = NULL, *redirected_URL = NULL;
int dt;
status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);
if (opt.recursive && status == RETROK && (dt & TEXTHTML))
status = recursive_retrieve (filename,
redirected_URL ? redirected_URL : *t);
if (opt.recursive && url_scheme (*t) != SCHEME_FTP)
status = retrieve_tree (*t);
else
status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);
if (opt.delete_after && file_exists_p(filename))
{


@ -36,9 +36,6 @@ struct options
int relative_only; /* Follow only relative links. */
int no_parent; /* Restrict access to the parent
directory. */
int simple_check; /* Should we use simple checking
(strcmp) or do we create a host
hash and call gethostbyname? */
int reclevel; /* Maximum level of recursion */
int dirstruct; /* Do we build the directory structure
as we go along? */


@ -27,6 +27,9 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
# include <strings.h>
#endif /* HAVE_STRING_H */
#include <assert.h>
#ifdef HAVE_UNISTD_H
# include <unistd.h>
#endif
#include "wget.h"
#include "progress.h"
@ -470,14 +473,14 @@ create_image (struct bar_progress *bp, long dltime)
Calculate its geometry:
"xxx% " - percentage - 5 chars
"| ... | " - progress bar decorations - 3 chars
"| ... |" - progress bar decorations - 2 chars
"1012.56 K/s " - dl rate - 12 chars
"nnnn " - downloaded bytes - 11 chars
"ETA: xx:xx:xx" - ETA - 13 chars
"=====>..." - progress bar content - the rest
*/
int progress_len = screen_width - (5 + 3 + 12 + 11 + 13);
int progress_len = screen_width - (5 + 2 + 12 + 11 + 13);
if (progress_len < 7)
progress_len = 0;
@ -530,7 +533,7 @@ create_image (struct bar_progress *bp, long dltime)
}
else
{
strcpy (p, "----.-- K/s ");
strcpy (p, " --.-- K/s ");
p += 12;
}


@ -1,5 +1,5 @@
/* Handling of recursive HTTP retrieving.
Copyright (C) 1995, 1996, 1997, 2000 Free Software Foundation, Inc.
Copyright (C) 1995, 1996, 1997, 2000, 2001 Free Software Foundation, Inc.
This file is part of GNU Wget.
@ -54,452 +54,480 @@ static struct hash_table *dl_file_url_map;
static struct hash_table *dl_url_file_map;
/* List of HTML files downloaded in this Wget run. Used for link
conversion after Wget is done. */
conversion after Wget is done. This list should only be traversed
in order. If you need to check whether a file has been downloaded,
use a hash table, e.g. dl_file_url_map. */
static slist *downloaded_html_files;
/* Functions for maintaining the URL queue. */
/* List of undesirable-to-load URLs. */
static struct hash_table *undesirable_urls;
struct queue_element {
const char *url;
const char *referer;
int depth;
struct queue_element *next;
};
/* Current recursion depth. */
static int depth;
struct url_queue {
struct queue_element *head;
struct queue_element *tail;
int count, maxcount;
};
/* Base directory we're recursing from (used by no_parent). */
static char *base_dir;
/* Create a URL queue. */
static int first_time = 1;
/* Cleanup the data structures associated with recursive retrieving
(the variables above). */
void
recursive_cleanup (void)
static struct url_queue *
url_queue_new (void)
{
if (undesirable_urls)
{
string_set_free (undesirable_urls);
undesirable_urls = NULL;
}
if (dl_file_url_map)
{
free_keys_and_values (dl_file_url_map);
hash_table_destroy (dl_file_url_map);
dl_file_url_map = NULL;
}
if (dl_url_file_map)
{
free_keys_and_values (dl_url_file_map);
hash_table_destroy (dl_url_file_map);
dl_url_file_map = NULL;
}
undesirable_urls = NULL;
slist_free (downloaded_html_files);
downloaded_html_files = NULL;
FREE_MAYBE (base_dir);
first_time = 1;
struct url_queue *queue = xmalloc (sizeof (*queue));
memset (queue, '\0', sizeof (*queue));
return queue;
}
/* Reset FIRST_TIME to 1, so that some action can be taken in
recursive_retrieve(). */
void
recursive_reset (void)
/* Delete a URL queue. */
static void
url_queue_delete (struct url_queue *queue)
{
first_time = 1;
xfree (queue);
}
/* The core of recursive retrieving. Endless recursion is avoided by
having all URLs stored to a linked list of URLs, which is checked
before loading any URL. That way no URL can get loaded twice.
/* Enqueue a URL in the queue. The queue is FIFO: the items will be
retrieved ("dequeued") from the queue in the order they were placed
into it. */
static void
url_enqueue (struct url_queue *queue,
const char *url, const char *referer, int depth)
{
struct queue_element *qel = xmalloc (sizeof (*qel));
qel->url = url;
qel->referer = referer;
qel->depth = depth;
qel->next = NULL;
++queue->count;
if (queue->count > queue->maxcount)
queue->maxcount = queue->count;
DEBUGP (("Enqueuing %s at depth %d\n", url, depth));
DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount));
if (queue->tail)
queue->tail->next = qel;
queue->tail = qel;
if (!queue->head)
queue->head = queue->tail;
}
/* Take a URL out of the queue. Return 1 if this operation succeeded,
or 0 if the queue is empty. */
static int
url_dequeue (struct url_queue *queue,
const char **url, const char **referer, int *depth)
{
struct queue_element *qel = queue->head;
if (!qel)
return 0;
queue->head = queue->head->next;
if (!queue->head)
queue->tail = NULL;
*url = qel->url;
*referer = qel->referer;
*depth = qel->depth;
--queue->count;
DEBUGP (("Dequeuing %s at depth %d\n", qel->url, qel->depth));
DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount));
xfree (qel);
return 1;
}
static int descend_url_p PARAMS ((const struct urlpos *, struct url *, int,
struct url *, struct hash_table *));
/* Retrieve a part of the web beginning with START_URL. This used to
be called "recursive retrieval", because the old function was
recursive and implemented depth-first search. retrieve_tree on the
other hand implements breadth-first traversal of the tree, which
results in much nicer ordering of downloads.
The algorithm this function uses is simple:
1. put START_URL in the queue.
2. while there are URLs in the queue:
3. get next URL from the queue.
4. download it.
5. if the URL is HTML and its depth does not exceed maximum depth,
get the list of URLs embedded therein.
6. for each of those URLs do the following:
7. if the URL is not one of those downloaded before, and if it
satisfies the criteria specified by the various command-line
options, add it to the queue. */
The function also supports specification of maximum recursion depth
and a number of other goodies. */
uerr_t
recursive_retrieve (const char *file, const char *this_url)
retrieve_tree (const char *start_url)
{
char *constr, *filename, *newloc;
char *canon_this_url = NULL;
int dt, inl, dash_p_leaf_HTML = FALSE;
int meta_disallow_follow;
int this_url_ftp; /* See below the explanation */
urlpos *url_list, *cur_url;
struct url *u;
uerr_t status = RETROK;
assert (this_url != NULL);
assert (file != NULL);
/* If quota was exceeded earlier, bail out. */
if (downloaded_exceeds_quota ())
return QUOTEXC;
/* Cache the current URL in the list. */
if (first_time)
/* The queue of URLs we need to load. */
struct url_queue *queue = url_queue_new ();
/* The URLs we decided we don't want to load. */
struct hash_table *blacklist = make_string_hash_table (0);
/* We'll need various components of this, so better get it over with
now. */
struct url *start_url_parsed = url_parse (start_url, NULL);
url_enqueue (queue, xstrdup (start_url), NULL, 0);
string_set_add (blacklist, start_url);
while (1)
{
/* These three operations need to be done only once per Wget
run. They should probably be at a different location. */
if (!undesirable_urls)
undesirable_urls = make_string_hash_table (0);
int descend = 0;
char *url, *referer, *file = NULL;
int depth;
boolean dash_p_leaf_HTML = FALSE;
hash_table_clear (undesirable_urls);
string_set_add (undesirable_urls, this_url);
/* Enter this_url to the hash table, in original and "enhanced" form. */
u = url_parse (this_url, NULL);
if (u)
{
string_set_add (undesirable_urls, u->url);
if (opt.no_parent)
base_dir = xstrdup (u->dir); /* Set the base dir. */
/* Set the canonical this_url to be sent as referer. This
problem exists only when running the first time. */
canon_this_url = xstrdup (u->url);
}
else
{
DEBUGP (("Double yuck! The *base* URL is broken.\n"));
base_dir = NULL;
}
url_free (u);
depth = 1;
first_time = 0;
}
else
++depth;
if (opt.reclevel != INFINITE_RECURSION && depth > opt.reclevel)
/* We've exceeded the maximum recursion depth specified by the user. */
{
if (opt.page_requisites && depth <= opt.reclevel + 1)
/* When -p is specified, we can do one more partial recursion from the
"leaf nodes" on the HTML document tree. The recursion is partial in
that we won't traverse any <A> or <AREA> tags, nor any <LINK> tags
except for <LINK REL="stylesheet">. */
dash_p_leaf_HTML = TRUE;
else
/* Either -p wasn't specified or it was and we've already gone the one
extra (pseudo-)level that it affords us, so we need to bail out. */
{
DEBUGP (("Recursion depth %d exceeded max. depth %d.\n",
depth, opt.reclevel));
--depth;
return RECLEVELEXC;
}
}
/* Determine whether this_url is an FTP URL. If it is, it means
that the retrieval is done through proxy. In that case, FTP
links will be followed by default and recursion will not be
turned off when following them. */
this_url_ftp = (url_scheme (this_url) == SCHEME_FTP);
/* Get the URL-s from an HTML file: */
url_list = get_urls_html (file, canon_this_url ? canon_this_url : this_url,
dash_p_leaf_HTML, &meta_disallow_follow);
if (opt.use_robots && meta_disallow_follow)
{
/* The META tag says we are not to follow this file. Respect
that. */
free_urlpos (url_list);
url_list = NULL;
}
/* Decide what to do with each of the URLs. A URL will be loaded if
it meets several requirements, discussed later. */
for (cur_url = url_list; cur_url; cur_url = cur_url->next)
{
/* If quota was exceeded earlier, bail out. */
if (downloaded_exceeds_quota ())
break;
/* Parse the URL for convenient use in other functions, as well
as to get the optimized form. It also checks URL integrity. */
u = url_parse (cur_url->url, NULL);
if (!u)
{
DEBUGP (("Yuck! A bad URL.\n"));
continue;
}
assert (u->url != NULL);
constr = xstrdup (u->url);
/* Several checkings whether a file is acceptable to load:
1. check if URL is ftp, and we don't load it
2. check for relative links (if relative_only is set)
3. check for domain
4. check for no-parent
5. check for excludes && includes
6. check for suffix
7. check for same host (if spanhost is unset), with possible
gethostbyname baggage
8. check for robots.txt
if (status == FWRITEERR)
break;
Addendum: If the URL is FTP, and it is to be loaded, only the
domain and suffix settings are "stronger".
/* Get the next URL from the queue. */
Note that .html and (yuck) .htm will get loaded regardless of
suffix rules (but that is remedied later with unlink) unless
the depth equals the maximum depth.
if (!url_dequeue (queue,
(const char **)&url, (const char **)&referer,
&depth))
break;
More time- and memory- consuming tests should be put later on
the list. */
/* And download it. */
/* inl is set if the URL we are working on (constr) is stored in
undesirable_urls. Using it is crucial to avoid unnecessary
repeated continuous hits to the hash table. */
inl = string_set_contains (undesirable_urls, constr);
{
int dt = 0;
char *redirected = NULL;
int oldrec = opt.recursive;
/* If it is FTP, and FTP is not followed, chuck it out. */
if (!inl)
if (u->scheme == SCHEME_FTP && !opt.follow_ftp && !this_url_ftp)
opt.recursive = 0;
status = retrieve_url (url, &file, &redirected, NULL, &dt);
opt.recursive = oldrec;
if (redirected)
{
DEBUGP (("Uh, it is FTP but i'm not in the mood to follow FTP.\n"));
string_set_add (undesirable_urls, constr);
inl = 1;
xfree (url);
url = redirected;
}
/* If it is absolute link and they are not followed, chuck it
out. */
if (!inl && u->scheme != SCHEME_FTP)
if (opt.relative_only && !cur_url->link_relative_p)
{
DEBUGP (("It doesn't really look like a relative link.\n"));
string_set_add (undesirable_urls, constr);
inl = 1;
}
/* If its domain is not to be accepted/looked-up, chuck it out. */
if (!inl)
if (!accept_domain (u))
{
DEBUGP (("I don't like the smell of that domain.\n"));
string_set_add (undesirable_urls, constr);
inl = 1;
}
/* Check for parent directory. */
if (!inl && opt.no_parent
/* If the new URL is FTP and the old was not, ignore
opt.no_parent. */
&& !(!this_url_ftp && u->scheme == SCHEME_FTP))
{
/* Check for base_dir first. */
if (!(base_dir && frontcmp (base_dir, u->dir)))
{
/* Failing that, check for parent dir. */
struct url *ut = url_parse (this_url, NULL);
if (!ut)
DEBUGP (("Double yuck! The *base* URL is broken.\n"));
else if (!frontcmp (ut->dir, u->dir))
{
/* Failing that too, kill the URL. */
DEBUGP (("Trying to escape parental guidance with no_parent on.\n"));
string_set_add (undesirable_urls, constr);
inl = 1;
}
url_free (ut);
}
}
/* If the file does not match the acceptance list, or is on the
rejection list, chuck it out. The same goes for the
directory exclude- and include- lists. */
if (!inl && (opt.includes || opt.excludes))
{
if (!accdir (u->dir, ALLABS))
{
DEBUGP (("%s (%s) is excluded/not-included.\n", constr, u->dir));
string_set_add (undesirable_urls, constr);
inl = 1;
}
}
if (!inl)
{
char *suf = NULL;
/* We check for acceptance/rejection rules only for non-HTML
documents. Since we don't know whether they really are
HTML, it will be deduced from (an OR-ed list):
if (file && status == RETROK
&& (dt & RETROKF) && (dt & TEXTHTML))
descend = 1;
}
1) u->file is "" (meaning it is a directory)
2) suffix exists, AND:
a) it is "html", OR
b) it is "htm"
If the file *is* supposed to be HTML, it will *not* be
subject to acc/rej rules, unless a finite maximum depth has
been specified and the current depth is the maximum depth. */
if (!
(!*u->file
|| (((suf = suffix (constr)) != NULL)
&& ((!strcmp (suf, "html") || !strcmp (suf, "htm"))
&& ((opt.reclevel != INFINITE_RECURSION) &&
(depth != opt.reclevel))))))
{
if (!acceptable (u->file))
{
DEBUGP (("%s (%s) does not match acc/rej rules.\n",
constr, u->file));
string_set_add (undesirable_urls, constr);
inl = 1;
}
}
FREE_MAYBE (suf);
}
/* Optimize the URL (which includes possible DNS lookup) only
after all other possibilities have been exhausted. */
if (!inl)
if (descend
&& depth >= opt.reclevel && opt.reclevel != INFINITE_RECURSION)
{
if (!opt.simple_check)
{
/* Find the "true" host. */
char *host = realhost (u->host);
xfree (u->host);
u->host = host;
/* Refresh the printed representation of the URL. */
xfree (u->url);
u->url = url_string (u, 0);
}
if (opt.page_requisites && depth == opt.reclevel)
/* When -p is specified, we can do one more partial
recursion from the "leaf nodes" on the HTML document
tree. The recursion is partial in that we won't
traverse any <A> or <AREA> tags, nor any <LINK> tags
except for <LINK REL="stylesheet">. */
/* #### This would be the place to implement the TODO
entry saying that -p should do two more hops on
framesets. */
dash_p_leaf_HTML = TRUE;
else
{
char *p;
/* Just lowercase the hostname. */
for (p = u->host; *p; p++)
*p = TOLOWER (*p);
xfree (u->url);
u->url = url_string (u, 0);
/* Either -p wasn't specified or it was and we've
already gone the one extra (pseudo-)level that it
affords us, so we need to bail out. */
DEBUGP (("Not descending further; at depth %d, max. %d.\n",
depth, opt.reclevel));
descend = 0;
}
xfree (constr);
constr = xstrdup (u->url);
/* After we have canonicalized the URL, check if we have it
on the black list. */
if (string_set_contains (undesirable_urls, constr))
inl = 1;
/* This line is bogus. */
/*string_set_add (undesirable_urls, constr);*/
if (!inl && !((u->scheme == SCHEME_FTP) && !this_url_ftp))
if (!opt.spanhost && this_url && !same_host (this_url, constr))
{
DEBUGP (("This is not the same hostname as the parent's.\n"));
string_set_add (undesirable_urls, constr);
inl = 1;
}
}
/* What about robots.txt? */
if (!inl && opt.use_robots && u->scheme == SCHEME_HTTP)
/* If the downloaded document was HTML, parse it and enqueue the
links it contains. */
if (descend)
{
struct robot_specs *specs = res_get_specs (u->host, u->port);
if (!specs)
int meta_disallow_follow = 0;
struct urlpos *children = get_urls_html (file, url, dash_p_leaf_HTML,
&meta_disallow_follow);
if (opt.use_robots && meta_disallow_follow)
{
char *rfile;
if (res_retrieve_file (constr, &rfile))
{
specs = res_parse_from_file (rfile);
xfree (rfile);
}
else
{
/* If we cannot get real specs, at least produce
dummy ones so that we can register them and stop
trying to retrieve them. */
specs = res_parse ("", 0);
}
res_register_specs (u->host, u->port, specs);
free_urlpos (children);
children = NULL;
}
/* Now that we have (or don't have) robots.txt specs, we can
check what they say. */
if (!res_match_path (specs, u->path))
if (children)
{
DEBUGP (("Not following %s because robots.txt forbids it.\n",
constr));
string_set_add (undesirable_urls, constr);
inl = 1;
struct urlpos *child = children;
struct url *url_parsed = url_parse (url, NULL);
assert (url_parsed != NULL);
for (; child; child = child->next)
{
if (descend_url_p (child, url_parsed, depth, start_url_parsed,
blacklist))
{
url_enqueue (queue, xstrdup (child->url->url),
xstrdup (url), depth + 1);
/* We blacklist the URL we have enqueued, because we
don't want to enqueue (and hence download) the
same URL twice. */
string_set_add (blacklist, child->url->url);
}
}
url_free (url_parsed);
free_urlpos (children);
}
}
filename = NULL;
/* If it wasn't chucked out, do something with it. */
if (!inl)
if (opt.delete_after || (file && !acceptable (file)))
{
DEBUGP (("I've decided to load it -> "));
/* Add it to the list of already-loaded URL-s. */
string_set_add (undesirable_urls, constr);
/* Automatically followed FTPs will *not* be downloaded
recursively. */
if (u->scheme == SCHEME_FTP)
{
/* Don't you adore side-effects? */
opt.recursive = 0;
}
/* Reset its type. */
dt = 0;
/* Retrieve it. */
retrieve_url (constr, &filename, &newloc,
canon_this_url ? canon_this_url : this_url, &dt);
if (u->scheme == SCHEME_FTP)
{
/* Restore... */
opt.recursive = 1;
}
if (newloc)
{
xfree (constr);
constr = newloc;
}
/* If there was no error, and the type is text/html, parse
it recursively. */
if (dt & TEXTHTML)
{
if (dt & RETROKF)
recursive_retrieve (filename, constr);
}
else
DEBUGP (("%s is not text/html so we don't chase.\n",
filename ? filename: "(null)"));
if (opt.delete_after || (filename && !acceptable (filename)))
/* Either --delete-after was specified, or we loaded this otherwise
rejected (e.g. by -R) HTML file just so we could harvest its
hyperlinks -- in either case, delete the local file. */
{
DEBUGP (("Removing file due to %s in recursive_retrieve():\n",
opt.delete_after ? "--delete-after" :
"recursive rejection criteria"));
logprintf (LOG_VERBOSE,
(opt.delete_after ? _("Removing %s.\n")
: _("Removing %s since it should be rejected.\n")),
filename);
if (unlink (filename))
logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno));
dt &= ~RETROKF;
}
/* If everything was OK, and links are to be converted, let's
store the local filename. */
if (opt.convert_links && (dt & RETROKF) && (filename != NULL))
{
cur_url->convert = CO_CONVERT_TO_RELATIVE;
cur_url->local_name = xstrdup (filename);
}
/* Either --delete-after was specified, or we loaded this
otherwise rejected (e.g. by -R) HTML file just so we
could harvest its hyperlinks -- in either case, delete
the local file. */
DEBUGP (("Removing file due to %s in recursive_retrieve():\n",
opt.delete_after ? "--delete-after" :
"recursive rejection criteria"));
logprintf (LOG_VERBOSE,
(opt.delete_after ? _("Removing %s.\n")
: _("Removing %s since it should be rejected.\n")),
file);
if (unlink (file))
logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno));
}
else
DEBUGP (("%s already in list, so we don't load.\n", constr));
/* Free filename and constr. */
FREE_MAYBE (filename);
FREE_MAYBE (constr);
url_free (u);
/* Increment the pbuf for the appropriate size. */
xfree (url);
FREE_MAYBE (referer);
FREE_MAYBE (file);
}
if (opt.convert_links && !opt.delete_after)
/* This is merely the first pass: the links that have been
successfully downloaded are converted. In the second pass,
convert_all_links() will also convert those links that have NOT
been downloaded to their canonical form. */
convert_links (file, url_list);
/* Free the linked list of URL-s. */
free_urlpos (url_list);
/* Free the canonical this_url. */
FREE_MAYBE (canon_this_url);
/* Decrement the recursion depth. */
--depth;
/* If anything is left of the queue due to a premature exit, free it
now. */
{
char *d1, *d2;
int d3;
while (url_dequeue (queue, (const char **)&d1, (const char **)&d2, &d3))
{
xfree (d1);
FREE_MAYBE (d2);
}
}
url_queue_delete (queue);
if (start_url_parsed)
url_free (start_url_parsed);
string_set_free (blacklist);
if (downloaded_exceeds_quota ())
return QUOTEXC;
else if (status == FWRITEERR)
return FWRITEERR;
else
return RETROK;
}
/* Based on the context provided by retrieve_tree, decide whether a
URL is to be descended to. This is only ever called from
retrieve_tree, but is in a separate function for clarity. */
static int
descend_url_p (const struct urlpos *upos, struct url *parent, int depth,
struct url *start_url_parsed, struct hash_table *blacklist)
{
struct url *u = upos->url;
const char *url = u->url;
DEBUGP (("Deciding whether to enqueue \"%s\".\n", url));
if (string_set_contains (blacklist, url))
{
DEBUGP (("Already on the black list.\n"));
goto out;
}
/* Several things to check for:
1. if scheme is not http, and we don't load it
2. check for relative links (if relative_only is set)
3. check for domain
4. check for no-parent
5. check for excludes && includes
6. check for suffix
7. check for same host (if spanhost is unset), with possible
gethostbyname baggage
8. check for robots.txt
Addendum: If the URL is FTP, and it is to be loaded, only the
domain and suffix settings are "stronger".
Note that .html files will get loaded regardless of suffix rules
(but that is remedied later with unlink) unless the depth equals
the maximum depth.
More time- and memory- consuming tests should be put later on
the list. */
/* 1. Schemes other than HTTP are normally not recursed into. */
if (u->scheme != SCHEME_HTTP
&& !(u->scheme == SCHEME_FTP && opt.follow_ftp))
{
DEBUGP (("Not following non-HTTP schemes.\n"));
goto blacklist;
}
/* 2. If it is an absolute link and they are not followed, throw it
out. */
if (u->scheme == SCHEME_HTTP)
if (opt.relative_only && !upos->link_relative_p)
{
DEBUGP (("It doesn't really look like a relative link.\n"));
goto blacklist;
}
/* 3. If its domain is not to be accepted/looked-up, chuck it
out. */
if (!accept_domain (u))
{
DEBUGP (("The domain was not accepted.\n"));
goto blacklist;
}
/* 4. Check for parent directory.
If we descended to a different host or changed the scheme, ignore
opt.no_parent. Also ignore it for -p leaf retrievals. */
if (opt.no_parent
&& u->scheme == parent->scheme
&& 0 == strcasecmp (u->host, parent->host)
&& u->port == parent->port)
{
if (!frontcmp (parent->dir, u->dir))
{
DEBUGP (("Trying to escape the root directory with no_parent in effect.\n"));
goto blacklist;
}
}
/* 5. If the file does not match the acceptance list, or is on the
rejection list, chuck it out. The same goes for the directory
exclusion and inclusion lists. */
if (opt.includes || opt.excludes)
{
if (!accdir (u->dir, ALLABS))
{
DEBUGP (("%s (%s) is excluded/not-included.\n", url, u->dir));
goto blacklist;
}
}
/* 6. */
{
char *suf = NULL;
/* Check for acceptance/rejection rules. We ignore these rules
for HTML documents because they might lead to other files which
need to be downloaded. Of course, we don't know which
documents are HTML before downloading them, so we guess.
A file is subject to acceptance/rejection rules if:
* u->file is not "" (i.e. it is not a directory)
and either:
+ there is no file suffix,
+ or there is a suffix, but is not "html" or "htm",
+ both:
- recursion is not infinite,
- and we are at its very end. */
if (u->file[0] != '\0'
&& ((suf = suffix (url)) == NULL
|| (0 != strcmp (suf, "html") && 0 != strcmp (suf, "htm"))
|| (opt.reclevel == INFINITE_RECURSION && depth >= opt.reclevel)))
{
if (!acceptable (u->file))
{
DEBUGP (("%s (%s) does not match acc/rej rules.\n",
url, u->file));
FREE_MAYBE (suf);
goto blacklist;
}
}
FREE_MAYBE (suf);
}
/* 7. */
if (u->scheme == parent->scheme)
if (!opt.spanhost && 0 != strcasecmp (parent->host, u->host))
{
DEBUGP (("This is not the same hostname as the parent's (%s and %s).\n",
u->host, parent->host));
goto blacklist;
}
/* 8. */
if (opt.use_robots && u->scheme == SCHEME_HTTP)
{
struct robot_specs *specs = res_get_specs (u->host, u->port);
if (!specs)
{
char *rfile;
if (res_retrieve_file (url, &rfile))
{
specs = res_parse_from_file (rfile);
xfree (rfile);
}
else
{
/* If we cannot get real specs, at least produce
dummy ones so that we can register them and stop
trying to retrieve them. */
specs = res_parse ("", 0);
}
res_register_specs (u->host, u->port, specs);
}
/* Now that we have (or don't have) robots.txt specs, we can
check what they say. */
if (!res_match_path (specs, u->path))
{
DEBUGP (("Not following %s because robots.txt forbids it.\n", url));
goto blacklist;
}
}
/* The URL has passed all the tests. It can be placed in the
download queue. */
DEBUGP (("Decided to load it.\n"));
return 1;
blacklist:
string_set_add (blacklist, url);
out:
DEBUGP (("Decided NOT to load it.\n"));
return 0;
}
/* Register that URL has been successfully downloaded to FILE. */
void
register_download (const char *url, const char *file)
{
@ -507,12 +535,35 @@ register_download (const char *url, const char *file)
return;
if (!dl_file_url_map)
dl_file_url_map = make_string_hash_table (0);
hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url));
if (!dl_url_file_map)
dl_url_file_map = make_string_hash_table (0);
hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file));
if (!hash_table_contains (dl_file_url_map, file))
hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url));
if (!hash_table_contains (dl_url_file_map, url))
hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file));
}
/* Register that FROM has been redirected to TO. This assumes that TO
is successfully downloaded and already registered using
register_download() above. */
void
register_redirection (const char *from, const char *to)
{
char *file;
if (!opt.convert_links)
return;
file = hash_table_get (dl_url_file_map, to);
assert (file != NULL);
if (!hash_table_contains (dl_url_file_map, from))
hash_table_put (dl_url_file_map, xstrdup (from), xstrdup (file));
}
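A hypothetical standalone sketch of the bookkeeping above, using the same string hash-table helpers that recur.c already uses (make_string_hash_table, hash_table_put, hash_table_get, hash_table_contains); the URLs and file name are made up, and the local table stands in for the module-private dl_url_file_map.

#include <stdio.h>
#include "wget.h"
#include "hash.h"

int
main (void)
{
  /* Stand-in for dl_url_file_map: URL -> local file name. */
  struct hash_table *url_file_map = make_string_hash_table (0);

  /* register_download(): the URL that was finally fetched maps to its file. */
  hash_table_put (url_file_map, "http://www.example.com/new/",
                  "www.example.com/new/index.html");

  /* register_redirection(): a URL that redirected to it maps to the same
     local file, so convert_links can also rewrite links that still point
     at the old location. */
  if (!hash_table_contains (url_file_map, "http://www.example.com/old/"))
    hash_table_put (url_file_map, "http://www.example.com/old/",
                    hash_table_get (url_file_map, "http://www.example.com/new/"));

  printf ("%s\n", (char *) hash_table_get (url_file_map,
                                           "http://www.example.com/old/"));
  return 0;
}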
/* Register that URL corresponds to the HTML file FILE. */
void
register_html (const char *url, const char *file)
{
@ -558,10 +609,11 @@ convert_all_links (void)
for (html = downloaded_html_files; html; html = html->next)
{
urlpos *urls, *cur_url;
struct urlpos *urls, *cur_url;
char *url;
DEBUGP (("Rescanning %s\n", html->string));
/* Determine the URL of the HTML file. get_urls_html will need
it. */
url = hash_table_get (dl_file_url_map, html->string);
@ -569,19 +621,19 @@ convert_all_links (void)
DEBUGP (("It should correspond to %s.\n", url));
else
DEBUGP (("I cannot find the corresponding URL.\n"));
/* Parse the HTML file... */
urls = get_urls_html (html->string, url, FALSE, NULL);
/* We don't respect meta_disallow_follow here because, even if
the file is not followed, we might still want to convert the
links that have been followed from other files. */
for (cur_url = urls; cur_url; cur_url = cur_url->next)
{
char *local_name;
struct url *u = cur_url->url;
/* The URL must be in canonical form to be compared. */
struct url *u = url_parse (cur_url->url, NULL);
if (!u)
continue;
/* We decide the direction of conversion according to whether
a URL was downloaded. Downloaded URLs will be converted
ABS2REL, whereas non-downloaded will be converted REL2ABS. */
@ -589,6 +641,7 @@ convert_all_links (void)
if (local_name)
DEBUGP (("%s marked for conversion, local %s\n",
u->url, local_name));
/* Decide on the conversion direction. */
if (local_name)
{
@ -610,7 +663,6 @@ convert_all_links (void)
cur_url->convert = CO_CONVERT_TO_COMPLETE;
cur_url->local_name = NULL;
}
url_free (u);
}
/* Convert the links in the file. */
convert_links (html->string, urls);
@ -618,3 +670,24 @@ convert_all_links (void)
free_urlpos (urls);
}
}
/* Cleanup the data structures associated with recursive retrieving
(the variables above). */
void
recursive_cleanup (void)
{
if (dl_file_url_map)
{
free_keys_and_values (dl_file_url_map);
hash_table_destroy (dl_file_url_map);
dl_file_url_map = NULL;
}
if (dl_url_file_map)
{
free_keys_and_values (dl_url_file_map);
hash_table_destroy (dl_url_file_map);
dl_url_file_map = NULL;
}
slist_free (downloaded_html_files);
downloaded_html_files = NULL;
}


@ -21,10 +21,10 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
#define RECUR_H
void recursive_cleanup PARAMS ((void));
void recursive_reset PARAMS ((void));
uerr_t recursive_retrieve PARAMS ((const char *, const char *));
uerr_t retrieve_tree PARAMS ((const char *));
void register_download PARAMS ((const char *, const char *));
void register_redirection PARAMS ((const char *, const char *));
void register_html PARAMS ((const char *, const char *));
void convert_all_links PARAMS ((void));


@ -125,6 +125,10 @@ add_path (struct robot_specs *specs, const char *path_b, const char *path_e,
int allowedp, int exactp)
{
struct path_info pp;
if (path_b < path_e && *path_b == '/')
/* Our path representation doesn't use a leading slash, so remove
one from theirs. */
++path_b;
pp.path = strdupdelim (path_b, path_e);
pp.allowedp = allowedp;
pp.user_agent_exact_p = exactp;
@ -390,6 +394,9 @@ res_parse_from_file (const char *filename)
static void
free_specs (struct robot_specs *specs)
{
int i;
for (i = 0; i < specs->count; i++)
xfree (specs->paths[i].path);
FREE_MAYBE (specs->paths);
xfree (specs);
}
@ -546,3 +553,22 @@ res_retrieve_file (const char *url, char **file)
}
return err == RETROK;
}
static int
cleanup_hash_table_mapper (void *key, void *value, void *arg_ignored)
{
xfree (key);
free_specs (value);
return 0;
}
void
res_cleanup (void)
{
if (registered_specs)
{
hash_table_map (registered_specs, cleanup_hash_table_mapper, NULL);
hash_table_destroy (registered_specs);
registered_specs = NULL;
}
}
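A hypothetical standalone driver for the robots interface touched above, assuming the res.h declarations that recur.c relies on in this commit (res_parse taking a buffer and its length, res_register_specs, res_get_specs, res_match_path returning non-zero when a path is allowed, and res_cleanup); the host name and paths are made up. Note that paths are passed without a leading slash, matching the representation add_path now stores.

#include <stdio.h>
#include <string.h>
#include "wget.h"
#include "res.h"

int
main (void)
{
  const char *txt =
    "User-agent: *\n"
    "Disallow: /cgi-bin/\n";
  struct robot_specs *specs = res_parse (txt, (int) strlen (txt));

  /* Register and look up the specs per host:port, the way retrieve_tree does. */
  res_register_specs ("www.example.com", 80, specs);
  specs = res_get_specs ("www.example.com", 80);

  printf ("cgi-bin/query -> %s\n",
          res_match_path (specs, "cgi-bin/query") ? "allowed" : "forbidden");
  printf ("index.html    -> %s\n",
          res_match_path (specs, "index.html") ? "allowed" : "forbidden");

  res_cleanup ();
  return 0;
}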


@ -29,3 +29,4 @@ struct robot_specs *res_get_specs PARAMS ((const char *, int));
int res_retrieve_file PARAMS ((const char *, char **));
void res_cleanup PARAMS ((void));


@ -184,6 +184,26 @@ rate (long bytes, long msecs, int pad)
return res;
}
static int
register_redirections_mapper (void *key, void *value, void *arg)
{
const char *redirected_from = (const char *)key;
const char *redirected_to = (const char *)arg;
if (0 != strcmp (redirected_from, redirected_to))
register_redirection (redirected_from, redirected_to);
return 0;
}
/* Register the redirections that lead to the successful download of
this URL. This is necessary so that the link converter can convert
redirected URLs to the local file. */
static void
register_all_redirections (struct hash_table *redirections, const char *final)
{
hash_table_map (redirections, register_redirections_mapper, (void *)final);
}
#define USE_PROXY_P(u) (opt.use_proxy && getproxy((u)->scheme) \
&& no_proxy_match((u)->host, \
(const char **)opt.no_proxy))
@ -254,7 +274,7 @@ retrieve_url (const char *origurl, char **file, char **newloc,
proxy_url = url_parse (proxy, &up_error_code);
if (!proxy_url)
{
logprintf (LOG_NOTQUIET, "Error parsing proxy URL %s: %s.\n",
logprintf (LOG_NOTQUIET, _("Error parsing proxy URL %s: %s.\n"),
proxy, url_error (up_error_code));
if (redirections)
string_set_free (redirections);
@ -310,7 +330,7 @@ retrieve_url (const char *origurl, char **file, char **newloc,
if (location_changed)
{
char *construced_newloc;
struct url *newloc_struct;
struct url *newloc_parsed;
assert (mynewloc != NULL);
@ -326,12 +346,11 @@ retrieve_url (const char *origurl, char **file, char **newloc,
mynewloc = construced_newloc;
/* Now, see if this new location makes sense. */
newloc_struct = url_parse (mynewloc, &up_error_code);
if (!newloc_struct)
newloc_parsed = url_parse (mynewloc, &up_error_code);
if (!newloc_parsed)
{
logprintf (LOG_NOTQUIET, "%s: %s.\n", mynewloc,
url_error (up_error_code));
url_free (newloc_struct);
url_free (u);
if (redirections)
string_set_free (redirections);
@ -340,11 +359,11 @@ retrieve_url (const char *origurl, char **file, char **newloc,
return result;
}
/* Now mynewloc will become newloc_struct->url, because if the
/* Now mynewloc will become newloc_parsed->url, because if the
Location contained relative paths like .././something, we
don't want that propagating as url. */
xfree (mynewloc);
mynewloc = xstrdup (newloc_struct->url);
mynewloc = xstrdup (newloc_parsed->url);
if (!redirections)
{
@ -356,11 +375,11 @@ retrieve_url (const char *origurl, char **file, char **newloc,
/* The new location is OK. Check for redirection cycle by
peeking through the history of redirections. */
if (string_set_contains (redirections, newloc_struct->url))
if (string_set_contains (redirections, newloc_parsed->url))
{
logprintf (LOG_NOTQUIET, _("%s: Redirection cycle detected.\n"),
mynewloc);
url_free (newloc_struct);
url_free (newloc_parsed);
url_free (u);
if (redirections)
string_set_free (redirections);
@ -368,12 +387,12 @@ retrieve_url (const char *origurl, char **file, char **newloc,
xfree (mynewloc);
return WRONGCODE;
}
string_set_add (redirections, newloc_struct->url);
string_set_add (redirections, newloc_parsed->url);
xfree (url);
url = mynewloc;
url_free (u);
u = newloc_struct;
u = newloc_parsed;
goto redirected;
}
@ -382,6 +401,8 @@ retrieve_url (const char *origurl, char **file, char **newloc,
if (*dt & RETROKF)
{
register_download (url, local_file);
if (redirections)
register_all_redirections (redirections, url);
if (*dt & TEXTHTML)
register_html (url, local_file);
}
@ -415,16 +436,16 @@ uerr_t
retrieve_from_file (const char *file, int html, int *count)
{
uerr_t status;
urlpos *url_list, *cur_url;
struct urlpos *url_list, *cur_url;
url_list = (html ? get_urls_html (file, NULL, FALSE, NULL)
: get_urls_file (file));
status = RETROK; /* Suppose everything is OK. */
*count = 0; /* Reset the URL count. */
recursive_reset ();
for (cur_url = url_list; cur_url; cur_url = cur_url->next, ++*count)
{
char *filename, *new_file;
char *filename = NULL, *new_file;
int dt;
if (downloaded_exceeds_quota ())
@ -432,10 +453,10 @@ retrieve_from_file (const char *file, int html, int *count)
status = QUOTEXC;
break;
}
status = retrieve_url (cur_url->url, &filename, &new_file, NULL, &dt);
if (opt.recursive && status == RETROK && (dt & TEXTHTML))
status = recursive_retrieve (filename, new_file ? new_file
: cur_url->url);
if (opt.recursive && cur_url->url->scheme != SCHEME_FTP)
status = retrieve_tree (cur_url->url->url);
else
status = retrieve_url (cur_url->url->url, &filename, &new_file, NULL, &dt);
if (filename && opt.delete_after && file_exists_p (filename))
{

src/url.c

@ -37,6 +37,7 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
#include "utils.h"
#include "url.h"
#include "host.h"
#include "hash.h"
#ifndef errno
extern int errno;
@ -182,7 +183,7 @@ encode_string_maybe (const char *s)
{
if (UNSAFE_CHAR (*p1))
{
const unsigned char c = *p1++;
unsigned char c = *p1++;
*p2++ = '%';
*p2++ = XDIGIT_TO_XCHAR (c >> 4);
*p2++ = XDIGIT_TO_XCHAR (c & 0xf);
@ -378,7 +379,7 @@ reencode_string (const char *s)
{
case CM_ENCODE:
{
char c = *p1++;
unsigned char c = *p1++;
*p2++ = '%';
*p2++ = XDIGIT_TO_XCHAR (c >> 4);
*p2++ = XDIGIT_TO_XCHAR (c & 0xf);
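This hunk is the reencode_string fix described in the ChangeLog: with plain char, which is signed on most platforms, a byte such as 0xA0 (nbsp in Latin-1) sign-extends, and the shift/mask no longer produce the digits 10 and 0. A tiny standalone demonstration follows (TO_XCHAR below is a local stand-in, not wget's XDIGIT_TO_XCHAR).

#include <stdio.h>

/* Local stand-in for a digit-to-hex-character conversion. */
#define TO_XCHAR(d) ((d) < 10 ? (d) + '0' : (d) - 10 + 'A')

int
main (void)
{
  char          sc = (char) 0xA0;           /* nbsp in Latin-1 */
  unsigned char uc = (unsigned char) 0xA0;

  /* Where plain char is signed, sc is -96, so sc >> 4 is not 10. */
  printf ("plain char:    %%%c%c\n", TO_XCHAR (sc >> 4), TO_XCHAR (sc & 0xf));
  printf ("unsigned char: %%%c%c\n", TO_XCHAR (uc >> 4), TO_XCHAR (uc & 0xf));
  return 0;
}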
@ -586,6 +587,22 @@ strpbrk_or_eos (const char *s, const char *accept)
return p;
}
/* Turn STR into lowercase; return non-zero if a character was
actually changed. */
static int
lowercase_str (char *str)
{
int change = 0;
for (; *str; str++)
if (!ISLOWER (*str))
{
change = 1;
*str = TOLOWER (*str);
}
return change;
}
static char *parse_errors[] = {
#define PE_NO_ERROR 0
"No error",
@ -614,6 +631,7 @@ url_parse (const char *url, int *error)
{
struct url *u;
const char *p;
int path_modified, host_modified;
enum url_scheme scheme;
@ -627,9 +645,7 @@ url_parse (const char *url, int *error)
int port;
char *user = NULL, *passwd = NULL;
const char *url_orig = url;
p = url = reencode_string (url);
char *url_encoded;
scheme = url_scheme (url);
if (scheme == SCHEME_INVALID)
@ -638,6 +654,9 @@ url_parse (const char *url, int *error)
return NULL;
}
url_encoded = reencode_string (url);
p = url_encoded;
p += strlen (supported_schemes[scheme].leading_string);
uname_b = p;
p += url_skip_uname (p);
@ -749,11 +768,6 @@ url_parse (const char *url, int *error)
u = (struct url *)xmalloc (sizeof (struct url));
memset (u, 0, sizeof (*u));
if (url == url_orig)
u->url = xstrdup (url);
else
u->url = (char *)url;
u->scheme = scheme;
u->host = strdupdelim (host_b, host_e);
u->port = port;
@ -761,7 +775,10 @@ url_parse (const char *url, int *error)
u->passwd = passwd;
u->path = strdupdelim (path_b, path_e);
path_simplify (u->path);
path_modified = path_simplify (u->path);
parse_path (u->path, &u->dir, &u->file);
host_modified = lowercase_str (u->host);
if (params_b)
u->params = strdupdelim (params_b, params_e);
@ -770,7 +787,26 @@ url_parse (const char *url, int *error)
if (fragment_b)
u->fragment = strdupdelim (fragment_b, fragment_e);
parse_path (u->path, &u->dir, &u->file);
if (path_modified || u->fragment || host_modified)
{
/* If path_simplify modified the path, or if a fragment is
present, or if the original host name had caps in it, make
sure that u->url is equivalent to what would be printed by
url_string. */
u->url = url_string (u, 0);
if (url_encoded != url)
xfree ((char *) url_encoded);
}
else
{
if (url_encoded == url)
u->url = xstrdup (url);
else
u->url = url_encoded;
}
url_encoded = NULL;
return u;
}
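A hypothetical standalone driver (URL made up) showing the effect of the canonicalization above, using only the url_parse/url_free interface this commit already exercises elsewhere: because the host contains uppercase letters, the path needs simplifying and a fragment is present, u->url is re-printed via url_string and comes out in canonical form.

#include <stdio.h>
#include "wget.h"
#include "url.h"

int
main (void)
{
  struct url *u = url_parse ("http://WWW.Example.COM/a/../b?q=1#frag", NULL);
  if (!u)
    return 1;
  /* Expected (roughly): http://www.example.com/b?q=1 -- lowercased host,
     simplified path, fragment dropped from the printed form. */
  printf ("%s\n", u->url);
  url_free (u);
  return 0;
}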
@ -927,17 +963,18 @@ url_free (struct url *url)
FREE_MAYBE (url->fragment);
FREE_MAYBE (url->user);
FREE_MAYBE (url->passwd);
FREE_MAYBE (url->dir);
FREE_MAYBE (url->file);
xfree (url->dir);
xfree (url->file);
xfree (url);
}
urlpos *
struct urlpos *
get_urls_file (const char *file)
{
struct file_memory *fm;
urlpos *head, *tail;
struct urlpos *head, *tail;
const char *text, *text_end;
/* Load the file. */
@ -968,10 +1005,28 @@ get_urls_file (const char *file)
--line_end;
if (line_end > line_beg)
{
urlpos *entry = (urlpos *)xmalloc (sizeof (urlpos));
int up_error_code;
char *url_text;
struct urlpos *entry;
struct url *url;
/* We must copy the URL to a zero-terminated string. *sigh*. */
url_text = strdupdelim (line_beg, line_end);
url = url_parse (url_text, &up_error_code);
if (!url)
{
logprintf (LOG_NOTQUIET, "%s: Invalid URL %s: %s\n",
file, url_text, url_error (up_error_code));
xfree (url_text);
continue;
}
xfree (url_text);
entry = (struct urlpos *)xmalloc (sizeof (struct urlpos));
memset (entry, 0, sizeof (*entry));
entry->next = NULL;
entry->url = strdupdelim (line_beg, line_end);
entry->url = url;
if (!head)
head = entry;
else
@ -985,12 +1040,13 @@ get_urls_file (const char *file)
/* Free the linked list of urlpos. */
void
free_urlpos (urlpos *l)
free_urlpos (struct urlpos *l)
{
while (l)
{
urlpos *next = l->next;
xfree (l->url);
struct urlpos *next = l->next;
if (l->url)
url_free (l->url);
FREE_MAYBE (l->local_name);
xfree (l);
l = next;
@ -1088,7 +1144,9 @@ count_slashes (const char *s)
static char *
mkstruct (const struct url *u)
{
char *host, *dir, *file, *res, *dirpref;
char *dir, *dir_preencoding;
char *file, *res, *dirpref;
char *query = u->query && *u->query ? u->query : NULL;
int l;
if (opt.cut_dirs)
@ -1104,36 +1162,35 @@ mkstruct (const struct url *u)
else
dir = u->dir + (*u->dir == '/');
host = xstrdup (u->host);
/* Check for the true name (or at least a consistent name for saving
to directory) of HOST, reusing the hlist if possible. */
if (opt.add_hostdir && !opt.simple_check)
{
char *nhost = realhost (host);
xfree (host);
host = nhost;
}
/* Add dir_prefix and hostname (if required) to the beginning of
dir. */
if (opt.add_hostdir)
{
/* Add dir_prefix and hostname (if required) to the beginning of
dir. */
dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1
+ strlen (u->host)
+ 1 + numdigit (u->port)
+ 1);
if (!DOTP (opt.dir_prefix))
{
dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1
+ strlen (host) + 1);
sprintf (dirpref, "%s/%s", opt.dir_prefix, host);
}
sprintf (dirpref, "%s/%s", opt.dir_prefix, u->host);
else
STRDUP_ALLOCA (dirpref, host);
strcpy (dirpref, u->host);
if (u->port != scheme_default_port (u->scheme))
{
int len = strlen (dirpref);
dirpref[len] = ':';
long_to_string (dirpref + len + 1, u->port);
}
}
else /* not add_hostdir */
else /* not add_hostdir */
{
if (!DOTP (opt.dir_prefix))
dirpref = opt.dir_prefix;
else
dirpref = "";
}
xfree (host);
/* If there is a prefix, prepend it. */
if (*dirpref)
@ -1142,7 +1199,10 @@ mkstruct (const struct url *u)
sprintf (newdir, "%s%s%s", dirpref, *dir == '/' ? "" : "/", dir);
dir = newdir;
}
dir = encode_string (dir);
dir_preencoding = dir;
dir = reencode_string (dir_preencoding);
l = strlen (dir);
if (l && dir[l - 1] == '/')
dir[l - 1] = '\0';
@ -1153,9 +1213,17 @@ mkstruct (const struct url *u)
file = u->file;
/* Finally, construct the full name. */
res = (char *)xmalloc (strlen (dir) + 1 + strlen (file) + 1);
res = (char *)xmalloc (strlen (dir) + 1 + strlen (file)
+ (query ? (1 + strlen (query)) : 0)
+ 1);
sprintf (res, "%s%s%s", dir, *dir ? "/" : "", file);
xfree (dir);
if (query)
{
strcat (res, "?");
strcat (res, query);
}
if (dir != dir_preencoding)
xfree (dir);
return res;
}
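
Editor's note: the reworked mkstruct() above now gives the host directory a ":port" suffix when the port is non-standard and keeps a non-empty query string as part of the local file name. A standalone sketch of that naming scheme (a hypothetical helper, not the function from the patch):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Compose "prefix/host[:port]/dir/file[?query]" in the spirit of the
   new mkstruct(): the port appears only when it differs from the
   scheme's default, the query only when it is non-empty. */
static char *
local_name (const char *prefix, const char *host, int port, int default_port,
            const char *dir, const char *file, const char *query)
{
  char portbuf[16] = "";
  char *res = (char *) malloc (strlen (prefix) + strlen (host) + sizeof portbuf
                               + strlen (dir) + strlen (file)
                               + (query && *query ? strlen (query) + 1 : 0) + 4);

  if (port != default_port)
    sprintf (portbuf, ":%d", port);

  sprintf (res, "%s/%s%s/%s/%s", prefix, host, portbuf, dir, file);
  if (query && *query)
    {
      strcat (res, "?");
      strcat (res, query);
    }
  return res;
}

int
main (void)
{
  char *name = local_name (".", "www.gnu.org", 8080, 80,
                           "software/wget", "index.html", "lang=en");
  puts (name);   /* ./www.gnu.org:8080/software/wget/index.html?lang=en */
  free (name);
  return 0;
}
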
@ -1177,7 +1245,7 @@ compose_file_name (char *base, char *query)
{
if (UNSAFE_CHAR (*from))
{
const unsigned char c = *from++;
unsigned char c = *from++;
*to++ = '%';
*to++ = XDIGIT_TO_XCHAR (c >> 4);
*to++ = XDIGIT_TO_XCHAR (c & 0xf);
@ -1282,10 +1350,8 @@ url_filename (const struct url *u)
static int
urlpath_length (const char *url)
{
const char *q = strchr (url, '?');
if (q)
return q - url;
return strlen (url);
const char *q = strpbrk_or_eos (url, "?;#");
return q - url;
}
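
Editor's note: urlpath_length() used to stop only at '?'; it now treats ';' (params) and '#' (fragment) as ending the path as well, via strpbrk_or_eos. A standalone sketch of both helpers (strpbrk_or_eos is re-implemented here for illustration; the real one lives earlier in url.c):

#include <stdio.h>
#include <string.h>

/* Like strpbrk, but return a pointer to the terminating '\0' instead of
   NULL when none of the characters in ACCEPT occurs in S. */
static const char *
strpbrk_or_eos (const char *s, const char *accept)
{
  const char *p = strpbrk (s, accept);
  if (!p)
    p = s + strlen (s);
  return p;
}

/* Length of the path component: everything before '?', ';' or '#'. */
static int
urlpath_length (const char *url)
{
  const char *q = strpbrk_or_eos (url, "?;#");
  return q - url;
}

int
main (void)
{
  printf ("%d\n", urlpath_length ("dir/page.html?a=1"));   /* 13 */
  printf ("%d\n", urlpath_length ("dir/page.html#frag"));  /* 13 */
  printf ("%d\n", urlpath_length ("dir/page.html"));       /* 13 */
  return 0;
}
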
/* Find the last occurrence of character C in the range [b, e), or
@ -1323,63 +1389,42 @@ uri_merge_1 (const char *base, const char *link, int linklength, int no_scheme)
{
const char *end = base + urlpath_length (base);
if (*link != '/')
if (!*link)
{
/* LINK is a relative URL: we need to replace everything
after last slash (possibly empty) with LINK.
So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy",
our result should be "whatever/foo/qux/xyzzy". */
int need_explicit_slash = 0;
int span;
const char *start_insert;
const char *last_slash = find_last_char (base, end, '/');
if (!last_slash)
{
/* No slash found at all. Append LINK to what we have,
but we'll need a slash as a separator.
Example: if base == "foo" and link == "qux/xyzzy", then
we cannot just append link to base, because we'd get
"fooqux/xyzzy", whereas what we want is
"foo/qux/xyzzy".
To make sure the / gets inserted, we set
need_explicit_slash to 1. We also set start_insert
to end + 1, so that the length calculations work out
correctly for one more (slash) character. Accessing
that character is fine, since it will be the
delimiter, '\0' or '?'. */
/* example: "foo?..." */
/* ^ ('?' gets changed to '/') */
start_insert = end + 1;
need_explicit_slash = 1;
}
else if (last_slash && last_slash != base && *(last_slash - 1) == '/')
{
/* example: http://host" */
/* ^ */
start_insert = end + 1;
need_explicit_slash = 1;
}
else
{
/* example: "whatever/foo/bar" */
/* ^ */
start_insert = last_slash + 1;
}
span = start_insert - base;
constr = (char *)xmalloc (span + linklength + 1);
if (span)
memcpy (constr, base, span);
if (need_explicit_slash)
constr[span - 1] = '/';
if (linklength)
memcpy (constr + span, link, linklength);
constr[span + linklength] = '\0';
/* Empty LINK points back to BASE, query string and all. */
constr = xstrdup (base);
}
else /* *link == `/' */
else if (*link == '?')
{
/* LINK points to the same location, but changes the query
string. Examples: */
/* uri_merge("path", "?new") -> "path?new" */
/* uri_merge("path?foo", "?new") -> "path?new" */
/* uri_merge("path?foo#bar", "?new") -> "path?new" */
/* uri_merge("path#foo", "?new") -> "path?new" */
int baselength = end - base;
constr = xmalloc (baselength + linklength + 1);
memcpy (constr, base, baselength);
memcpy (constr + baselength, link, linklength);
constr[baselength + linklength] = '\0';
}
else if (*link == '#')
{
/* uri_merge("path", "#new") -> "path#new" */
/* uri_merge("path#foo", "#new") -> "path#new" */
/* uri_merge("path?foo", "#new") -> "path?foo#new" */
/* uri_merge("path?foo#bar", "#new") -> "path?foo#new" */
int baselength;
const char *end1 = strchr (base, '#');
if (!end1)
end1 = base + strlen (base);
baselength = end1 - base;
constr = xmalloc (baselength + linklength + 1);
memcpy (constr, base, baselength);
memcpy (constr + baselength, link, linklength);
constr[baselength + linklength] = '\0';
}
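
Editor's note: the comment examples in the two new branches above can be reproduced with a small standalone model of just those cases; this is a sketch of the query/fragment handling only, not the full uri_merge_1:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A link that is only a query ("?new") replaces everything from '?' on;
   a link that is only a fragment ("#new") replaces everything from '#'
   on, keeping any existing query. */
static char *
merge_query_or_fragment (const char *base, const char *link)
{
  int baselength;
  char *constr;

  if (*link == '?')
    baselength = strcspn (base, "?;#");        /* path ends at '?', ';' or '#' */
  else /* *link == '#' */
    {
      const char *end1 = strchr (base, '#');
      baselength = end1 ? (int) (end1 - base) : (int) strlen (base);
    }

  constr = (char *) malloc (baselength + strlen (link) + 1);
  memcpy (constr, base, baselength);
  strcpy (constr + baselength, link);
  return constr;
}

int
main (void)
{
  char *a = merge_query_or_fragment ("path?foo#bar", "?new");
  char *b = merge_query_or_fragment ("path?foo#bar", "#new");
  printf ("%s\n%s\n", a, b);   /* path?new  and  path?foo#new */
  free (a);
  free (b);
  return 0;
}
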
else if (*link == '/')
{
/* LINK is an absolute path: we need to replace everything
after (and including) the FIRST slash with LINK.
@ -1435,6 +1480,62 @@ uri_merge_1 (const char *base, const char *link, int linklength, int no_scheme)
memcpy (constr + span, link, linklength);
constr[span + linklength] = '\0';
}
else
{
/* LINK is a relative URL: we need to replace everything
after last slash (possibly empty) with LINK.
So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy",
our result should be "whatever/foo/qux/xyzzy". */
int need_explicit_slash = 0;
int span;
const char *start_insert;
const char *last_slash = find_last_char (base, end, '/');
if (!last_slash)
{
/* No slash found at all. Append LINK to what we have,
but we'll need a slash as a separator.
Example: if base == "foo" and link == "qux/xyzzy", then
we cannot just append link to base, because we'd get
"fooqux/xyzzy", whereas what we want is
"foo/qux/xyzzy".
To make sure the / gets inserted, we set
need_explicit_slash to 1. We also set start_insert
to end + 1, so that the length calculations work out
correctly for one more (slash) character. Accessing
that character is fine, since it will be the
delimiter, '\0' or '?'. */
/* example: "foo?..." */
/* ^ ('?' gets changed to '/') */
start_insert = end + 1;
need_explicit_slash = 1;
}
else if (last_slash && last_slash != base && *(last_slash - 1) == '/')
{
/* example: http://host" */
/* ^ */
start_insert = end + 1;
need_explicit_slash = 1;
}
else
{
/* example: "whatever/foo/bar" */
/* ^ */
start_insert = last_slash + 1;
}
span = start_insert - base;
constr = (char *)xmalloc (span + linklength + 1);
if (span)
memcpy (constr, base, span);
if (need_explicit_slash)
constr[span - 1] = '/';
if (linklength)
memcpy (constr + span, link, linklength);
constr[span + linklength] = '\0';
}
}
else /* !no_scheme */
{
@ -1602,12 +1703,13 @@ static void replace_attr PARAMS ((const char **, int, FILE *, const char *));
/* Change the links in an HTML document. Accepts a structure that
defines the positions of all the links. */
void
convert_links (const char *file, urlpos *l)
convert_links (const char *file, struct urlpos *l)
{
struct file_memory *fm;
FILE *fp;
const char *p;
downloaded_file_t downloaded_file_return;
int to_url_count = 0, to_file_count = 0;
logprintf (LOG_VERBOSE, _("Converting %s... "), file);
@ -1615,12 +1717,12 @@ convert_links (const char *file, urlpos *l)
/* First we do a "dry run": go through the list L and see whether
any URL needs to be converted in the first place. If not, just
leave the file alone. */
int count = 0;
urlpos *dry = l;
int dry_count = 0;
struct urlpos *dry = l;
for (dry = l; dry; dry = dry->next)
if (dry->convert != CO_NOCONVERT)
++count;
if (!count)
++dry_count;
if (!dry_count)
{
logputs (LOG_VERBOSE, _("nothing to do.\n"));
return;
@ -1674,7 +1776,7 @@ convert_links (const char *file, urlpos *l)
/* If the URL is not to be converted, skip it. */
if (l->convert == CO_NOCONVERT)
{
DEBUGP (("Skipping %s at position %d.\n", l->url, l->pos));
DEBUGP (("Skipping %s at position %d.\n", l->url->url, l->pos));
continue;
}
@ -1689,19 +1791,21 @@ convert_links (const char *file, urlpos *l)
char *quoted_newname = html_quote_string (newname);
replace_attr (&p, l->size, fp, quoted_newname);
DEBUGP (("TO_RELATIVE: %s to %s at position %d in %s.\n",
l->url, newname, l->pos, file));
l->url->url, newname, l->pos, file));
xfree (newname);
xfree (quoted_newname);
++to_file_count;
}
else if (l->convert == CO_CONVERT_TO_COMPLETE)
{
/* Convert the link to absolute URL. */
char *newlink = l->url;
char *newlink = l->url->url;
char *quoted_newlink = html_quote_string (newlink);
replace_attr (&p, l->size, fp, quoted_newlink);
DEBUGP (("TO_COMPLETE: <something> to %s at position %d in %s.\n",
newlink, l->pos, file));
xfree (quoted_newlink);
++to_url_count;
}
}
/* Output the rest of the file. */
@ -1709,7 +1813,8 @@ convert_links (const char *file, urlpos *l)
fwrite (p, 1, fm->length - (p - fm->content), fp);
fclose (fp);
read_file_free (fm);
logputs (LOG_VERBOSE, _("done.\n"));
logprintf (LOG_VERBOSE,
_("%d-%d\n"), to_file_count, to_url_count);
}
/* Construct and return a malloced copy of the relative link from two
@ -1766,20 +1871,6 @@ construct_relative (const char *s1, const char *s2)
return res;
}
/* Add URL to the head of the list L. */
urlpos *
add_url (urlpos *l, const char *url, const char *file)
{
urlpos *t;
t = (urlpos *)xmalloc (sizeof (urlpos));
memset (t, 0, sizeof (*t));
t->url = xstrdup (url);
t->local_name = xstrdup (file);
t->next = l;
return t;
}
static void
write_backup_file (const char *file, downloaded_file_t downloaded_file_return)
{
@ -1850,15 +1941,9 @@ write_backup_file (const char *file, downloaded_file_t downloaded_file_return)
-- Dan Harkless <wget@harkless.org>
This [adding a field to the urlpos structure] didn't work
because convert_file() is called twice: once after all its
sublinks have been retrieved in recursive_retrieve(), and
once at the end of the day in convert_all_links(). The
original linked list collected in recursive_retrieve() is
lost after the first invocation of convert_links(), and
convert_all_links() makes a new one (it calls get_urls_html()
for each file it covers.) That's why your first approach didn't
work. The way to make it work is perhaps to make this flag a
field in the `urls_html' list.
because convert_file() is called from convert_all_links at
the end of the retrieval with a freshly built new urlpos
list.
-- Hrvoje Niksic <hniksic@arsdigita.com>
*/
converted_file_ptr = xmalloc(sizeof(*converted_file_ptr));
@ -1941,13 +2026,40 @@ find_fragment (const char *beg, int size, const char **bp, const char **ep)
return 0;
}
typedef struct _downloaded_file_list {
char* file;
downloaded_file_t download_type;
struct _downloaded_file_list* next;
} downloaded_file_list;
/* We're storing "modes" of type downloaded_file_t in the hash table.
However, our hash tables only accept pointers for keys and values.
So when we need a pointer, we use the address of a
downloaded_file_t variable of static storage. */
static downloaded_file_t *
downloaded_mode_to_ptr (downloaded_file_t mode)
{
static downloaded_file_t
v1 = FILE_NOT_ALREADY_DOWNLOADED,
v2 = FILE_DOWNLOADED_NORMALLY,
v3 = FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED,
v4 = CHECK_FOR_FILE;
static downloaded_file_list *downloaded_files;
switch (mode)
{
case FILE_NOT_ALREADY_DOWNLOADED:
return &v1;
case FILE_DOWNLOADED_NORMALLY:
return &v2;
case FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED:
return &v3;
case CHECK_FOR_FILE:
return &v4;
}
return NULL;
}
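
Editor's note: downloaded_mode_to_ptr() above exists because the hash table stores only void pointers; each enum value is therefore given a permanent address in static storage, and dereferencing the stored pointer recovers the enum. A minimal standalone illustration of the trick (demo names, not the patch's):

#include <stdio.h>

typedef enum { NOT_DOWNLOADED, DOWNLOADED_NORMALLY, HTML_EXTENSION_ADDED } dl_mode;

/* Hand out the address of a static variable that permanently holds each
   possible value; that pointer is what goes into the table. */
static dl_mode *
mode_to_ptr (dl_mode mode)
{
  static dl_mode v1 = NOT_DOWNLOADED,
                 v2 = DOWNLOADED_NORMALLY,
                 v3 = HTML_EXTENSION_ADDED;
  switch (mode)
    {
    case NOT_DOWNLOADED:       return &v1;
    case DOWNLOADED_NORMALLY:  return &v2;
    case HTML_EXTENSION_ADDED: return &v3;
    }
  return NULL;
}

int
main (void)
{
  void *stored = mode_to_ptr (DOWNLOADED_NORMALLY);  /* value as stored */
  dl_mode got = *(dl_mode *) stored;                 /* value as read back */
  printf ("%d\n", got == DOWNLOADED_NORMALLY);       /* 1 */
  return 0;
}
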
/* This should really be merged with dl_file_url_map and
downloaded_html_files in recur.c. This was originally a list, but
I changed it to a hash table because it was actually taking a lot of
time to find things in it. */
static struct hash_table *downloaded_files_hash;
/* Remembers which files have been downloaded. In the standard case, should be
called with mode == FILE_DOWNLOADED_NORMALLY for each file we actually
@ -1962,46 +2074,47 @@ static downloaded_file_list *downloaded_files;
it, call with mode == CHECK_FOR_FILE. Please be sure to call this function
with local filenames, not remote URLs. */
downloaded_file_t
downloaded_file (downloaded_file_t mode, const char* file)
downloaded_file (downloaded_file_t mode, const char *file)
{
boolean found_file = FALSE;
downloaded_file_list* rover = downloaded_files;
downloaded_file_t *ptr;
while (rover != NULL)
if (strcmp(rover->file, file) == 0)
{
found_file = TRUE;
break;
}
else
rover = rover->next;
if (found_file)
return rover->download_type; /* file had already been downloaded */
else
if (mode == CHECK_FOR_FILE)
{
if (mode != CHECK_FOR_FILE)
{
rover = xmalloc(sizeof(*rover));
rover->file = xstrdup(file); /* use xstrdup() so die on out-of-mem. */
rover->download_type = mode;
rover->next = downloaded_files;
downloaded_files = rover;
}
return FILE_NOT_ALREADY_DOWNLOADED;
if (!downloaded_files_hash)
return FILE_NOT_ALREADY_DOWNLOADED;
ptr = hash_table_get (downloaded_files_hash, file);
if (!ptr)
return FILE_NOT_ALREADY_DOWNLOADED;
return *ptr;
}
if (!downloaded_files_hash)
downloaded_files_hash = make_string_hash_table (0);
ptr = hash_table_get (downloaded_files_hash, file);
if (ptr)
return *ptr;
ptr = downloaded_mode_to_ptr (mode);
hash_table_put (downloaded_files_hash, xstrdup (file), &ptr);
return FILE_NOT_ALREADY_DOWNLOADED;
}
static int
df_free_mapper (void *key, void *value, void *ignored)
{
xfree (key);
return 0;
}
void
downloaded_files_free (void)
{
downloaded_file_list* rover = downloaded_files;
while (rover)
if (downloaded_files_hash)
{
downloaded_file_list *next = rover->next;
xfree (rover->file);
xfree (rover);
rover = next;
hash_table_map (downloaded_files_hash, df_free_mapper, NULL);
hash_table_destroy (downloaded_files_hash);
downloaded_files_hash = NULL;
}
}
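
Editor's note: the rewritten downloaded_file() follows a two-mode protocol: called with CHECK_FOR_FILE it only queries and returns the recorded mode (or FILE_NOT_ALREADY_DOWNLOADED); called with any other mode it records that mode unless one already exists, in which case the earlier record wins. A standalone miniature of the same protocol (a fixed-size array stands in for Wget's string hash table; all names here are illustrative):

#include <stdio.h>
#include <string.h>

typedef enum { NOT_ALREADY_DOWNLOADED, DOWNLOADED_NORMALLY,
               HTML_EXT_ADDED, CHECK_ONLY } dlmode;

static struct { char name[128]; dlmode mode; } table[64];
static int table_len;

static dlmode *
lookup (const char *file)
{
  int i;
  for (i = 0; i < table_len; i++)
    if (!strcmp (table[i].name, file))
      return &table[i].mode;
  return NULL;
}

/* Query with CHECK_ONLY, record with anything else; a file already
   recorded keeps its first mode. */
static dlmode
downloaded_file_demo (dlmode mode, const char *file)
{
  dlmode *ptr = lookup (file);

  if (mode == CHECK_ONLY)
    return ptr ? *ptr : NOT_ALREADY_DOWNLOADED;

  if (ptr)
    return *ptr;                        /* already recorded: first mode wins */

  strcpy (table[table_len].name, file);
  table[table_len].mode = mode;
  table_len++;
  return NOT_ALREADY_DOWNLOADED;
}

int
main (void)
{
  downloaded_file_demo (DOWNLOADED_NORMALLY, "index.html");
  printf ("%d\n", downloaded_file_demo (CHECK_ONLY, "index.html"));  /* 1 */
  printf ("%d\n", downloaded_file_demo (CHECK_ONLY, "other.html"));  /* 0 */
  return 0;
}
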

src/url.h View File

@ -72,11 +72,11 @@ enum convert_options {
/* A structure that defines the whereabouts of a URL, i.e. its
position in an HTML document, etc. */
typedef struct _urlpos
{
char *url; /* linked URL, after it has been
merged with the base */
char *local_name; /* Local file to which it was saved */
struct urlpos {
struct url *url; /* the URL of the link, after it has
been merged with the base */
char *local_name; /* local file to which it was saved
(used by convert_links) */
/* Information about the original link: */
int link_relative_p; /* was the link relative? */
@ -89,8 +89,8 @@ typedef struct _urlpos
/* URL's position in the buffer. */
int pos, size;
struct _urlpos *next; /* Next struct in list */
} urlpos;
struct urlpos *next; /* next list element */
};
/* downloaded_file() takes a parameter of this type and returns this type. */
typedef enum
@ -126,9 +126,9 @@ int url_skip_uname PARAMS ((const char *));
char *url_string PARAMS ((const struct url *, int));
urlpos *get_urls_file PARAMS ((const char *));
urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *));
void free_urlpos PARAMS ((urlpos *));
struct urlpos *get_urls_file PARAMS ((const char *));
struct urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *));
void free_urlpos PARAMS ((struct urlpos *));
char *uri_merge PARAMS ((const char *, const char *));
@ -136,11 +136,10 @@ void rotate_backups PARAMS ((const char *));
int mkalldirs PARAMS ((const char *));
char *url_filename PARAMS ((const struct url *));
char *getproxy PARAMS ((uerr_t));
char *getproxy PARAMS ((enum url_scheme));
int no_proxy_match PARAMS ((const char *, const char **));
void convert_links PARAMS ((const char *, urlpos *));
urlpos *add_url PARAMS ((urlpos *, const char *, const char *));
void convert_links PARAMS ((const char *, struct urlpos *));
downloaded_file_t downloaded_file PARAMS ((downloaded_file_t, const char *));

src/utils.c View File

@ -307,6 +307,18 @@ xstrdup_debug (const char *s, const char *source_file, int source_line)
#endif /* DEBUG_MALLOC */
/* Utility function: like xstrdup(), but also lowercases S. */
char *
xstrdup_lower (const char *s)
{
char *copy = xstrdup (s);
char *p = copy;
for (; *p; p++)
*p = TOLOWER (*p);
return copy;
}
/* Copy the string formed by two pointers (one on the beginning, other
on the char after the last char) to a new, malloc-ed location.
0-terminate it. */
@ -443,6 +455,8 @@ fork_to_background (void)
}
#endif /* not WINDOWS */
#if 0
/* debug */
char *
ps (char *orig)
{
@ -450,6 +464,7 @@ ps (char *orig)
path_simplify (r);
return r;
}
#endif
/* Canonicalize PATH, and return a new path. The new path differs from PATH
in that:
@ -468,45 +483,31 @@ ps (char *orig)
Change the original string instead of strdup-ing.
React correctly when beginning with `./' and `../'.
Don't zip out trailing slashes. */
void
int
path_simplify (char *path)
{
register int i, start, ddot;
register int i, start;
int changes = 0;
char stub_char;
if (!*path)
return;
return 0;
/*stub_char = (*path == '/') ? '/' : '.';*/
stub_char = '/';
/* Addition: Remove all `./'-s preceding the string. If `../'-s
precede, put `/' in front and remove them too. */
i = 0;
ddot = 0;
while (1)
{
if (path[i] == '.' && path[i + 1] == '/')
i += 2;
else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/')
{
i += 3;
ddot = 1;
}
else
break;
}
if (i)
strcpy (path, path + i - ddot);
if (path[0] == '/')
/* Preserve initial '/'. */
++path;
/* Replace single `.' or `..' with `/'. */
/* Nix out a leading `.' or `..'. */
if ((path[0] == '.' && path[1] == '\0')
|| (path[0] == '.' && path[1] == '.' && path[2] == '\0'))
{
path[0] = stub_char;
path[1] = '\0';
return;
path[0] = '\0';
changes = 1;
return changes;
}
/* Walk along PATH looking for things to compact. */
i = 0;
while (1)
@ -531,6 +532,7 @@ path_simplify (char *path)
{
strcpy (path + start + 1, path + i);
i = start + 1;
changes = 1;
}
/* Check for `../', `./' or trailing `.' by itself. */
@ -540,6 +542,7 @@ path_simplify (char *path)
if (!path[i + 1])
{
path[--i] = '\0';
changes = 1;
break;
}
@ -548,6 +551,7 @@ path_simplify (char *path)
{
strcpy (path + i, path + i + 1);
i = (start < 0) ? 0 : start;
changes = 1;
continue;
}
@ -556,12 +560,32 @@ path_simplify (char *path)
(path[i + 2] == '/' || !path[i + 2]))
{
while (--start > -1 && path[start] != '/');
strcpy (path + start + 1, path + i + 2);
strcpy (path + start + 1, path + i + 2 + (start == -1 && path[i + 2]));
i = (start < 0) ? 0 : start;
changes = 1;
continue;
}
} /* path == '.' */
} /* while */
/* Addition: Remove all `./'-s and `../'-s preceding the string. */
i = 0;
while (1)
{
if (path[i] == '.' && path[i + 1] == '/')
i += 2;
else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/')
i += 3;
else
break;
}
if (i)
{
strcpy (path, path + i - 0);
changes = 1;
}
return changes;
}
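
Editor's note: path_simplify() now returns whether it changed anything (which url_parse uses to decide whether to rebuild u->url) and removes leading "./" and "../" components at the end rather than up front. A conceptual standalone sketch of that contract, built on a segment stack rather than the patch's in-place scanning loop, so corner cases such as leading or trailing slashes may differ:

#include <stdio.h>
#include <string.h>

/* Collapse "./" and "dir/../" segments and report, via the return
   value, whether the path changed at all.  Demo only: fixed buffers,
   relative paths. */
static int
simplify_sketch (char *path)
{
  char work[1024], out[1024];
  char *segs[64];
  int nsegs = 0, i;
  char *p;

  strcpy (work, path);
  for (p = strtok (work, "/"); p; p = strtok (NULL, "/"))
    {
      if (!strcmp (p, "."))
        continue;                       /* "./" contributes nothing */
      if (!strcmp (p, ".."))
        {
          if (nsegs > 0)
            nsegs--;                    /* "dir/../" cancels out */
          continue;                     /* a leading "../" is dropped, as above */
        }
      segs[nsegs++] = p;
    }

  out[0] = '\0';
  for (i = 0; i < nsegs; i++)
    {
      if (i)
        strcat (out, "/");
      strcat (out, segs[i]);
    }

  if (!strcmp (out, path))
    return 0;                           /* nothing to do: report no change */
  strcpy (path, out);
  return 1;
}

int
main (void)
{
  char a[] = "a/./b/../c";
  char b[] = "already/simple";
  int ca = simplify_sketch (a);
  int cb = simplify_sketch (b);
  printf ("%s (changed: %d)\n", a, ca);   /* a/c (changed: 1) */
  printf ("%s (changed: %d)\n", b, cb);   /* already/simple (changed: 0) */
  return 0;
}
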
/* "Touch" FILE, i.e. make its atime and mtime equal to the time

src/utils.h View File

@ -48,12 +48,13 @@ char *datetime_str PARAMS ((time_t *));
void print_malloc_debug_stats ();
#endif
char *xstrdup_lower PARAMS ((const char *));
char *strdupdelim PARAMS ((const char *, const char *));
char **sepstring PARAMS ((const char *));
int frontcmp PARAMS ((const char *, const char *));
char *pwd_cuserid PARAMS ((char *));
void fork_to_background PARAMS ((void));
void path_simplify PARAMS ((char *));
int path_simplify PARAMS ((char *));
void touch PARAMS ((const char *, time_t));
int remove_link PARAMS ((const char *));
@ -98,4 +99,6 @@ long wtimer_granularity PARAMS ((void));
char *html_quote_string PARAMS ((const char *));
int determine_screen_width PARAMS ((void));
#endif /* UTILS_H */

src/wget.h View File

@ -28,6 +28,11 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
# define NDEBUG /* To kill off assertions */
#endif /* not DEBUG */
/* Define this if you want primitive but extensive malloc debugging.
It will make Wget extremely slow, so only do it in development
builds. */
#undef DEBUG_MALLOC
#ifndef PARAMS
# if PROTOTYPES
# define PARAMS(args) args
@ -60,7 +65,7 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
3) Finally, the debug messages are meant to be a clue for me to
debug problems with Wget. If I get them in a language I don't
understand, debugging will become a new challenge of its own! :-) */
understand, debugging will become a new challenge of its own! */
/* Include these, so random files need not include them. */