similarity (SCHEME_HTTP and SCHEME_HTTPS are similar). Use it in recur.c
(download_child_p). Fixes a bug that caused -H option to be ignored when
child scheme different to parent scheme.
Published in <agn4eu8apduek7magfu9bfe63gto8i7cdh@farscape.privy.mev.co.uk>.
Published in <sxsvgo8nmub.fsf@florida.arsdigita.de>.
* retr.c (retrieve_url): Call uri_merge, not url_concat.
* html-url.c (collect_tags_mapper): Call uri_merge, not
url_concat.
* url.c (mkstruct): Use encode_string instead of xstrdup followed
by URL_CLEANSE.
(path_simplify_with_kludge): Deleted.
(contains_unsafe): Deleted.
(construct): Renamed to uri_merge_1.
(url_concat): Renamed to uri_merge.
* url.c (str_url): Use encode_string instead of the unnecessary
CLEANDUP.
(encode_string_maybe): New function, returns input string if no
encoding is needed.
(encode_string): Call encode_string_maybe to do the dirty work,
xstrdup if no work needed.
* wget.h (XDIGIT_TO_xchar): Define here.
* url.c (decode_string): Use new name.
(encode_string): Ditto.
* http.c (XDIGIT_TO_xchar): Rename HEXD2asc to XDIGIT_TO_xchar.
(dump_hash): Use new name.
* wget.h: Rename ASC2HEXD and HEXD2ASC to XCHAR_TO_XDIGIT and
XDIGIT_TO_XCHAR respectively.
Published in <sxsitkc709p.fsf@florida.arsdigita.de>.
* url.c (parseurl): Don't strip trailing slash when u->dir is "/"
because that strips the *leading* slash, thus forcing relative
FTP retrieval.
* ftp.c (getftp): Convert initial FTP directory from VMS to UNIX
notation for VMS servers.
(ftp_retrieve_dirs): Do not prepend '/' to f->name when
odir is an empty string.
- use mmap() to read whole files in core instead of allocating memory
and read'ing it.
- use a new, more general, HTML parser (html-parse.c) and interface to
it from Wget (html-url.c).
- respect <meta name=robots content=nofollow> (easy with the new HTML
parser).
- use hash tables instead of linked lists in places where the lists
were used to facilitate mappings.
- rewrite the code in host.c to be more readable and faster (hash
tables instead of home-grown lists.)
- make convert_links properly convert partial URLs to complete ones
for those URLs that have *not* been downloaded.
- use HTTP persistent connections where available. very
simple-minded, caches the last connection to the server.
Published in <sxshf533d5r.fsf@florida.arsdigita.de>.
* ftp.c (ftp_retrieve_list): Use new INFINITE_RECURSION #define.
* html.c: htmlfindurl() now takes final `dash_p_leaf_HTML' parameter.
Wrapped some > 80-column lines. When -p is specified and we're at a
leaf node, do not traverse <A>, <AREA>, or <LINK> tags other than
<LINK REL="stylesheet">.
* html.h (htmlfindurl): Now takes final `dash_p_leaf_HTML' parameter.
* init.c: Added new -p / --page-requisites / page_requisites option.
* main.c (print_help): Clarified that -l inf and -l 0 both allow
infinite recursion. Changed the unhelpful --mirrior description
to simply give the options it's equivalent to. Added new -p option.
(main): Added some comments; handle new -p / --page-requisites.
* options.h (struct options): Added new page_requisites field.
* recur.c: Changed "URL-s" to "URLs" and "HTML-s" to "HTMLs".
Calculate and pass down new `dash_p_leaf_HTML' parameter to
get_urls_html(). Use new INFINITE_RECURSION #define.
* retr.c: Changed "URL-s" to "URLs". get_urls_html() now takes
final `dash_p_leaf_HTML' parameter.
* url.c: get_urls_html() and htmlfindurl() now take final
`dash_p_leaf_HTML' parameter.
* url.h (get_urls_html): Now takes final `dash_p_leaf_HTML' parameter.
* wget.h: Added some comments and new INFINITE_RECURSION #define.
* wget.texi (Recursive Retrieval Options): Documented new -p option.
>= width of type" warning on 32-bit architectures. Got rid of it by tricking
the compiler w/ a variable.
* url.c (UNSAFE_CHAR): The macro didn't include all the illegal characters per
RFC1738, namely everything above '~'. It also generated a warning on OSes
where char =~ unsigned char. Fixed.
URLs, gen_page.cgi?page1 and get_page.cgi?page2, they'll both be saved as
get_page.cgi and the second will overwrite the first. Also, parameters to
implicit CGIs, like "http://www.host.com/db/?2000-03-02" cause the URLs to be
printed with trailing garbage characters, and could seg fault. I'm not sure
what Dan had in mind with this patch (no explanatory comments), but I'm removing
it for now. If he can rewrite it so it doesn't break stuff, okay.
together, we compare local file X.orig (if extant) against server file X.
Previously -k and -N were worthless in combination because the local converted
files always differed from the server versions.