curl internals
==============

- [Intro](#intro)
- [git](#git)
- [Portability](#Portability)
- [Windows vs Unix](#winvsunix)
- [Library](#Library)
  - [`Curl_connect`](#Curl_connect)
  - [`multi_do`](#multi_do)
  - [`Curl_readwrite`](#Curl_readwrite)
  - [`multi_done`](#multi_done)
  - [`Curl_disconnect`](#Curl_disconnect)
- [HTTP(S)](#http)
- [FTP](#ftp)
- [Kerberos](#kerberos)
- [TELNET](#telnet)
- [FILE](#file)
- [SMB](#smb)
- [LDAP](#ldap)
- [E-mail](#email)
- [General](#general)
- [Persistent Connections](#persistent)
- [multi interface/non-blocking](#multi)
- [SSL libraries](#ssl)
- [Library Symbols](#symbols)
- [Return Codes and Informationals](#returncodes)
- [API/ABI](#abi)
- [Client](#client)
- [Memory Debugging](#memorydebug)
- [Test Suite](#test)
- [Asynchronous name resolves](#asyncdns)
  - [c-ares](#cares)
- [`curl_off_t`](#curl_off_t)
- [curlx](#curlx)
- [Content Encoding](#contentencoding)
- [`hostip.c` explained](#hostip)
- [Track Down Memory Leaks](#memoryleak)
- [`multi_socket`](#multi_socket)
- [Structs in libcurl](#structs)
  - [Curl_easy](#Curl_easy)
  - [connectdata](#connectdata)
  - [Curl_multi](#Curl_multi)
  - [Curl_handler](#Curl_handler)
  - [conncache](#conncache)
  - [Curl_share](#Curl_share)
  - [CookieInfo](#CookieInfo)

<a name="intro"></a>
Intro
=====

This project is split in two: the library and the client. The client part
uses the library, but the library is designed to allow other applications to
use it.

The largest amount of code and complexity is in the library part.

<a name="git"></a>
git
===

All changes to the sources are committed to the git repository as soon as
they're somewhat verified to work. Changes shall be committed as independently
as possible so that individual changes can be easily spotted and tracked
afterwards.

Tagging shall be used extensively, and by the time we release new archives we
should tag the sources with a name similar to the released version number.

<a name="Portability"></a>
Portability
===========

We write curl and libcurl to compile with C89 compilers, on 32-bit and larger
machines. Most of libcurl assumes more or less POSIX compliance but that's
not a requirement.

We write libcurl to build and work with lots of third party tools, and we
want it to remain functional and buildable with these and later versions
(older versions may still work, but that is not something we work hard to
maintain):

Dependencies
------------

- OpenSSL 0.9.7
- GnuTLS 3.1.10
- zlib 1.1.4
- libssh2 1.0
- c-ares 1.6.0
- libidn2 2.0.0
- wolfSSL 2.0.0
- openldap 2.0
- MIT Kerberos 1.2.4
- GSKit V5R3M0
- NSS 3.14.x
- Heimdal ?
- nghttp2 1.12.0
- WinSock 2.2 (on Windows 95+ and Windows CE .NET 4.1+)

Operating Systems
-----------------

On systems where configure runs, we aim at working on them all, provided they
have a suitable C compiler. On systems that don't run configure, we strive to
keep curl running correctly on:

- Windows 98
- AS/400 V5R3M0
- Symbian 9.1
- Windows CE ?
- TPF ?

Build tools
-----------

When writing code (mostly for generating stuff included in release tarballs)
we use a few "build tools" and we make sure that we remain functional with
these versions:

- GNU Libtool 1.4.2
- GNU Autoconf 2.57
- GNU Automake 1.7
- GNU M4 1.4
- perl 5.004
- roffit 0.5
- groff ? (any version that supports `groff -Tps -man [in] [out]`)
- ps2pdf (gs) ?

<a name="winvsunix"></a>
Windows vs Unix
===============

There are a few differences in how to program curl the Unix way compared to
the Windows way. Perhaps the four most notable details are:

1. Different function names for socket operations.

   In curl, this is solved with defines and macros, so that the source looks
   the same in all places except for the header file that defines them. The
   macros in use are `sclose()`, `sread()` and `swrite()`.

2. Windows requires a couple of init calls for the socket stuff.

   That's taken care of by the `curl_global_init()` call, but if other
   libraries also do it, there might be reasons for applications to alter
   that behavior.

   We require WinSock version 2.2 and load this version during global init.

3. The file descriptors for network communication and file operations are
   not as easily interchangeable as in Unix.

   We avoid this by not trying any funny tricks on file descriptors.

4. When writing data to stdout, Windows makes end-of-lines the DOS way, thus
   destroying binary data, although you do want that conversion if it is
   text coming through... (sigh)

   We set stdout to binary under Windows.

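A minimal sketch of point 1, assuming a build on either WinSock or a POSIX
system. The macro names come from the text above, but the definitions shown
here are an approximation and may differ from curl's actual header, and
`sketch_send_all()` is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <winsock2.h>
/* WinSock has its own close call and takes int lengths. */
#define sclose(s)          closesocket(s)
#define sread(s, buf, n)   recv((s), (char *)(buf), (int)(n), 0)
#define swrite(s, buf, n)  send((s), (const char *)(buf), (int)(n), 0)
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
/* On Unix a socket is an ordinary file descriptor. */
#define sclose(s)          close(s)
#define sread(s, buf, n)   recv((s), (buf), (n), 0)
#define swrite(s, buf, n)  send((s), (buf), (n), 0)
#endif

/* Hypothetical helper: keeps writing until the whole buffer is sent.
   Uses int for the socket to stay short; WinSock really uses SOCKET.
   Returns 0 on success, -1 on error. */
static int sketch_send_all(int sock, const char *buf, size_t len)
{
  assert(buf != NULL);
  while(len > 0) {
    long n = (long)swrite(sock, buf, len);
    if(n <= 0)
      return -1;
    buf += n;
    len -= (size_t)n;
  }
  return 0;
}
```

Code written against the macros, such as `sketch_send_all()`, needs no
platform `#ifdef` of its own.
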
Inside the source code, we make an effort to avoid `#ifdef [Your OS]`. All
conditionals that deal with features *should* instead be in the format
`#ifdef HAVE_THAT_WEIRD_FUNCTION`. Since Windows can't run configure scripts,
we maintain a `config-win32.h` file in the lib directory that is supposed to
look exactly like a `curl_config.h` file would have looked on a Windows
machine!

Generally speaking: always remember that this will be compiled on dozens of
operating systems. Don't walk on the edge!

<a name="Library"></a>
Library
=======

(See [Structs in libcurl](#structs) for the separate section describing all
major internal structs and their purposes.)

There are plenty of entry points to the library, namely each publicly defined
function that libcurl offers to applications. All of those functions are
rather small and easy-to-follow. All the ones prefixed with `curl_easy` are
put in the `lib/easy.c` file.

`curl_global_init()` and `curl_global_cleanup()` should be called by the
application to initialize and clean up global stuff in the library. As of
today, it can handle the global SSL initialization if SSL is enabled and it
can initialize the socket layer on Windows machines. libcurl itself has no
"global" scope.

All printf()-style functions use the supplied clones in `lib/mprintf.c`. This
makes sure we stay absolutely platform independent.

[`curl_easy_init()`][2] allocates an internal struct and makes some
initializations. The returned handle does not reveal internals. This is the
`Curl_easy` struct which works as an "anchor" struct for all `curl_easy`
functions. All connections performed will get connect-specific data allocated
that should be used for things related to particular connections/requests.

[`curl_easy_setopt()`][1] takes three arguments, where the option stuff must
be passed in pairs: the parameter-ID and the parameter-value. The list of
options is documented in the man page. This function mainly sets things in
the `Curl_easy` struct.

`curl_easy_perform()` is just a wrapper function that makes use of the multi
API. It basically calls `curl_multi_init()`, `curl_multi_add_handle()`,
`curl_multi_wait()`, and `curl_multi_perform()` until the transfer is done
and then returns.

Some of the most important key functions in `url.c` are called from
`multi.c` when certain key steps are to be made in the transfer operation.

<a name="Curl_connect"></a>
Curl_connect()
--------------

Analyzes the URL, separates its different components and connects to the
remote host. This may involve using a proxy and/or using SSL. The
`Curl_resolv()` function in `lib/hostip.c` is used for looking up host
names (it then uses the proper underlying method, which may vary
between platforms and builds).

When `Curl_connect` is done, we are connected to the remote site. Then it
is time to tell the server to get a document/file. `multi_do()` arranges
this.

This function makes sure there's an allocated and initiated `connectdata`
struct that is used for this particular connection only (although there may
be several requests performed on the same connection). A bunch of things are
initialized/inherited from the `Curl_easy` struct.

<a name="multi_do"></a>
multi_do()
----------

`multi_do()` makes sure the proper protocol-specific function is called.
The functions are named after the protocols they handle.

The protocol-specific functions of course deal with protocol-specific
negotiations and setup. When they're ready to start the actual file
transfer they call the `Curl_setup_transfer()` function (in
`lib/transfer.c`) to set up the transfer and then return.

If this DO function fails and the connection is being re-used, libcurl will
then close this connection, set up a new connection and re-issue the DO
request on that. This is because there is no way to be perfectly sure that
we have discovered a dead connection before the DO function and thus we
might wrongly be re-using a connection that was closed by the remote peer.

<a name="Curl_readwrite"></a>
Curl_readwrite()
----------------

Called during the transfer of the actual protocol payload.

During transfer, the progress functions in `lib/progress.c` are called at
frequent intervals (or at the user's choice, a specified callback might get
called). The speedcheck functions in `lib/speedcheck.c` are also used to
verify that the transfer is as fast as required.

<a name="multi_done"></a>
multi_done()
------------

Called after a transfer is done. This function takes care of everything
that has to be done after a transfer. This function attempts to leave
matters in a state so that `multi_do()` should be possible to call again on
the same connection (in a persistent connection case). The connection might
also soon be closed with `Curl_disconnect()`.

<a name="Curl_disconnect"></a>
Curl_disconnect()
-----------------

When doing normal connections and transfers, no one ever tries to close any
connections so this is not normally called when `curl_easy_perform()` is
used. This function is only used when we are certain that no more transfers
are going to be made on the connection. It can also be called to close a
connection by force, or to make sure that libcurl doesn't keep too many
connections alive at the same time.

This function cleans up all resources that are associated with a single
connection.

<a name="http"></a>
HTTP(S)
=======

HTTP offers a lot and is the protocol in curl that uses the most lines of
code. There is a special file `lib/formdata.c` that offers all the
multipart post functions.

Base64 functions for user+password stuff (and more) are in `lib/base64.c`
and all functions for parsing and sending cookies are found in
`lib/cookie.c`.

HTTPS uses in almost every case the same procedure as HTTP, with only two
exceptions: the connect procedure is different and the function used to read
or write from the socket is different, although the latter fact is hidden in
the source by the use of `Curl_read()` for reading and `Curl_write()` for
writing data to the remote server.

`http_chunks.c` contains functions that understand HTTP 1.1 chunked transfer
encoding.

An interesting detail with the HTTP(S) request is the `Curl_add_buffer()`
series of functions we use. They append data to one single buffer, and when
the building is finished the entire request is sent off in one single write.
This is done this way to overcome problems with flawed firewalls and lame
servers.

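The "build everything in one buffer, then write once" idea can be sketched
as follows. This is not libcurl's implementation: `struct sketch_buf`,
`sketch_buf_addf()` and `sketch_build_get()` are hypothetical names, and the
sketch leans on C99 `vsnprintf()` to measure the output, which the C89-minded
library code would not do.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable buffer; not libcurl's actual type. */
struct sketch_buf {
  char *data;
  size_t len;   /* bytes used, excluding the trailing NUL */
  size_t cap;   /* bytes allocated */
};

/* Append printf-style text. Returns 0 on success, -1 on failure. */
static int sketch_buf_addf(struct sketch_buf *b, const char *fmt, ...)
{
  va_list ap;
  int need;
  assert(b != NULL && fmt != NULL);

  va_start(ap, fmt);
  need = vsnprintf(NULL, 0, fmt, ap);   /* measure first */
  va_end(ap);
  if(need < 0)
    return -1;

  if(b->len + (size_t)need + 1 > b->cap) {
    size_t newcap = (b->len + (size_t)need + 1) * 2;
    char *p = realloc(b->data, newcap);
    if(!p)
      return -1;
    b->data = p;
    b->cap = newcap;
  }
  va_start(ap, fmt);
  vsnprintf(b->data + b->len, (size_t)need + 1, fmt, ap);
  va_end(ap);
  b->len += (size_t)need;
  return 0;
}

/* Build a complete request. The caller then passes b->data and b->len
   to a single write call instead of writing header by header. */
static int sketch_build_get(struct sketch_buf *b, const char *host,
                            const char *path)
{
  if(sketch_buf_addf(b, "GET %s HTTP/1.1\r\n", path) ||
     sketch_buf_addf(b, "Host: %s\r\n", host) ||
     sketch_buf_addf(b, "\r\n"))
    return -1;
  return 0;
}
```

Start from a zeroed `struct sketch_buf`, call `sketch_build_get()`, send the
result in one write and `free()` the `data` member afterwards.
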
<a name="ftp"></a>
FTP
===

The `Curl_if2ip()` function can be used for getting the IP number of a
specified network interface, and it resides in `lib/if2ip.c`.

`Curl_ftpsendf()` is used for sending FTP commands to the remote server. It
was made a separate function to prevent us programmers from forgetting that
they must be CRLF terminated. They must also be sent in one single `write()`
to make firewalls and similar happy.

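To illustrate why a dedicated helper makes the CRLF rule hard to forget, here
is a small sketch. `sketch_ftp_format()` is a hypothetical stand-in, not the
real `Curl_ftpsendf()`: it only formats the command into a caller-supplied
buffer so that the caller can send it with one write.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Formats an FTP command and appends CRLF.
   Returns the number of bytes to send in one write, or -1 if the
   command plus CRLF does not fit in the buffer. */
static int sketch_ftp_format(char *out, size_t outlen, const char *fmt, ...)
{
  va_list ap;
  int n;
  assert(out != NULL && outlen > 0 && fmt != NULL);

  va_start(ap, fmt);
  n = vsnprintf(out, outlen, fmt, ap);
  va_end(ap);
  if(n < 0 || (size_t)n + 3 > outlen)   /* room for "\r\n" and NUL */
    return -1;
  out[n] = '\r';
  out[n + 1] = '\n';
  out[n + 2] = '\0';
  return n + 2;
}
```

A call such as `sketch_ftp_format(cmd, sizeof(cmd), "USER %s", name)` yields
the command with its terminator already in place.
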
<a name="kerberos"></a>
Kerberos
========

Kerberos support is mainly in `lib/krb5.c` but also `curl_sasl_sspi.c` and
`curl_sasl_gssapi.c` for the email protocols and `socks_gssapi.c` and
`socks_sspi.c` for SOCKS5 proxy specifics.

<a name="telnet"></a>
TELNET
======

Telnet is implemented in `lib/telnet.c`.

<a name="file"></a>
FILE
====

The `file://` protocol is dealt with in `lib/file.c`.

<a name="smb"></a>
SMB
===

The `smb://` protocol is dealt with in `lib/smb.c`.

<a name="ldap"></a>
LDAP
====

Everything LDAP is in `lib/ldap.c` and `lib/openldap.c`.

<a name="email"></a>
E-mail
======

The e-mail related source code is in `lib/imap.c`, `lib/pop3.c` and
`lib/smtp.c`.

<a name="general"></a>
General
=======

URL encoding and decoding, called escaping and unescaping in the source code,
is found in `lib/escape.c`.

While transferring data in `Transfer()` a few functions might get used.
`curl_getdate()` in `lib/parsedate.c` is for HTTP date comparisons (and
more).

`lib/getenv.c` offers `curl_getenv()` which is for reading environment
variables in a neat platform independent way. That's used in the client, but
also in `lib/url.c` when checking the proxy environment variables. Note that
contrary to the normal Unix `getenv()`, this returns an allocated buffer that
must be `free()`ed after use.

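The ownership difference from plain `getenv()` can be sketched like this.
`sketch_getenv_dup()` is a hypothetical helper, not the code in
`lib/getenv.c`; treating an empty value as unset is an assumption of this
sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of the variable's value that the caller must
   free(), or NULL if the variable is unset, empty, or memory ran out. */
static char *sketch_getenv_dup(const char *name)
{
  const char *val;
  char *copy;
  size_t len;
  assert(name != NULL);

  val = getenv(name);
  if(!val || !*val)
    return NULL;
  len = strlen(val) + 1;
  copy = malloc(len);
  if(copy)
    memcpy(copy, val, len);
  return copy;
}
```

Because the caller owns the copy, it stays valid even if the environment is
modified later, at the cost of one `free()` per lookup.
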
`lib/netrc.c` holds the `.netrc` parser.

`lib/timeval.c` features replacement functions for systems that don't have
`gettimeofday()` and a few support functions for timeval conversions.

A function named `curl_version()` that returns the full curl version string
is found in `lib/version.c`.

<a name="persistent"></a>
Persistent Connections
======================

The persistent connection support in libcurl requires some considerations on
how to do things inside of the library.

- The `Curl_easy` struct returned in the [`curl_easy_init()`][2] call
  must never hold connection-oriented data. It is meant to hold the root data
  as well as all the options etc that the library-user may choose.

- The `Curl_easy` struct holds the "connection cache" (an array of
  pointers to `connectdata` structs).

- This enables the 'curl handle' to be reused on subsequent transfers.

- When libcurl is told to perform a transfer, it first checks for an already
  existing connection in the cache that we can use. Otherwise it creates a
  new one and adds that to the cache. If the cache is full already when a new
  connection is added, it will first close the oldest unused one.

- When the transfer operation is complete, the connection is left
  open. Particular options may tell libcurl not to, and protocols may signal
  closure on connections and then they won't be kept open, of course.

- When `curl_easy_cleanup()` is called, we close all still opened connections,
  unless of course the multi interface "owns" the connections.

The curl handle must be re-used in order for the persistent connections to
work.

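The lookup and eviction rules in the list above can be sketched with a toy
cache. Everything here is hypothetical (`struct sketch_conn`,
`sketch_get_conn()`, the two-slot size, matching on host name only); the real
connection cache matches on far more than the host.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CACHE_SIZE 2   /* tiny on purpose */

struct sketch_conn {
  const char *host;           /* NULL marks a free slot */
  int in_use;                 /* non-zero while a transfer owns it */
  unsigned long last_used;
};

struct sketch_cache {
  struct sketch_conn slot[SKETCH_CACHE_SIZE];
  unsigned long clock;        /* increases on every use */
  int closed;                 /* evictions so far, for illustration */
};

static struct sketch_conn *sketch_get_conn(struct sketch_cache *c,
                                           const char *host)
{
  struct sketch_conn *pick = NULL;
  size_t i;
  assert(c != NULL && host != NULL);

  /* 1. Prefer an existing idle connection to the same host. */
  for(i = 0; i < SKETCH_CACHE_SIZE; i++) {
    struct sketch_conn *s = &c->slot[i];
    if(s->host && !s->in_use && strcmp(s->host, host) == 0) {
      pick = s;
      break;
    }
  }
  /* 2. Otherwise take a free slot, or evict the oldest unused entry. */
  if(!pick) {
    for(i = 0; i < SKETCH_CACHE_SIZE; i++) {
      struct sketch_conn *s = &c->slot[i];
      if(s->in_use)
        continue;
      if(!s->host) {
        pick = s;
        break;
      }
      if(!pick || s->last_used < pick->last_used)
        pick = s;
    }
    if(!pick)
      return NULL;            /* every connection is busy */
    if(pick->host)
      c->closed++;            /* stands in for closing the old one */
    pick->host = host;
  }
  pick->in_use = 1;
  pick->last_used = ++c->clock;
  return pick;
}

/* After a transfer the connection stays open for later reuse. */
static void sketch_release(struct sketch_conn *conn)
{
  conn->in_use = 0;
}
```
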
<a name="multi"></a>
multi interface/non-blocking
============================

The multi interface is a non-blocking interface to the library. To make that
interface work as well as possible, no low-level functions within libcurl
must be written to work in a blocking manner. (There are still a few spots
violating this rule.)

One of the primary reasons we introduced c-ares support was to allow the name
resolve phase to be perfectly non-blocking as well.

The FTP and the SFTP/SCP protocols are examples of how we adapt and adjust
the code to allow non-blocking operations even on multi-stage
command-response protocols. They are built around state machines that return
when they would otherwise block waiting for data. The DICT, LDAP and TELNET
protocols are crappy examples and they are subject to rewrite in the future
to better fit the libcurl protocol family.

<a name="ssl"></a>
SSL libraries
=============

Originally libcurl supported SSLeay for SSL/TLS transports. That was then
extended to its successor OpenSSL, and has since been extended to several
other SSL/TLS libraries. We expect and hope to further extend the support
in future libcurl versions.

To deal with this internally in the best way possible, we have a generic SSL
function API as provided by the `vtls/vtls.[ch]` system, and they are the only
SSL functions we must use from within libcurl. vtls is then crafted to use
the appropriate lower-level function calls of whatever SSL library is in
use. For example `vtls/openssl.[ch]` for the OpenSSL library.

<a name="symbols"></a>
Library Symbols
===============

All symbols used internally in libcurl must use a `Curl_` prefix if they're
used in more than a single file. Single-file symbols must be made static.
Public ("exported") symbols must use a `curl_` prefix. (There are exceptions,
but they are to be changed to follow this pattern in future versions.) Public
API functions are marked with `CURL_EXTERN` in the public header files so
that all others can be hidden on platforms where this is possible.

<a name="returncodes"></a>
Return Codes and Informationals
===============================

I've made things simple. Almost every function in libcurl returns a CURLcode,
that must be `CURLE_OK` if everything is OK or otherwise a suitable error
code as the `curl/curl.h` include file defines. The very spot that detects an
error must use the `Curl_failf()` function to set the human-readable error
description.

In aiding the user to understand what's happening and to debug curl usage, we
must supply a fair number of informational messages by using the
`Curl_infof()` function. Those messages are only displayed when the user
explicitly asks for them. They are best used when revealing information that
isn't otherwise obvious.

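The convention can be sketched in miniature as below. None of these names
exist in libcurl: `sketch_code` stands in for `CURLcode`, and
`sketch_failf()` / `sketch_infof()` only imitate the roles that the text
assigns to `Curl_failf()` and `Curl_infof()`.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

typedef enum { SKETCH_OK = 0, SKETCH_COULDNT_CONNECT } sketch_code;

struct sketch_handle {
  char errbuf[128];   /* human-readable description of the last error */
  int verbose;        /* informationals are shown only on request */
  int infos_shown;
};

/* Stores the error description; called at the spot that detects it. */
static void sketch_failf(struct sketch_handle *h, const char *fmt, ...)
{
  va_list ap;
  assert(h != NULL);
  va_start(ap, fmt);
  vsnprintf(h->errbuf, sizeof(h->errbuf), fmt, ap);
  va_end(ap);
}

/* Prints an informational message, but only if the user asked. */
static void sketch_infof(struct sketch_handle *h, const char *fmt, ...)
{
  va_list ap;
  assert(h != NULL);
  if(!h->verbose)
    return;
  va_start(ap, fmt);
  vfprintf(stderr, fmt, ap);
  va_end(ap);
  h->infos_shown++;
}

/* `reachable` fakes the outcome so the sketch needs no network. */
static sketch_code sketch_connect(struct sketch_handle *h, const char *host,
                                  int reachable)
{
  sketch_infof(h, "* Trying %s...\n", host);
  if(!reachable) {
    sketch_failf(h, "Failed to connect to %s", host);
    return SKETCH_COULDNT_CONNECT;
  }
  return SKETCH_OK;
}
```

Callers further up the stack only propagate the code; they do not overwrite
the message that was set where the error was detected.
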
<a name="abi"></a>
API/ABI
=======

We make an effort to not export or show internals or how internals work, as
that makes it easier to keep a solid API/ABI over time. See docs/libcurl/ABI
for our promise to users.

<a name="client"></a>
Client
======

`main()` resides in `src/tool_main.c`.

`src/tool_hugehelp.c` is automatically generated by the `mkhelp.pl` perl
script to display the complete "manual" and the `src/tool_urlglob.c` file
holds the functions used for the URL-"globbing" support. Globbing in the
sense that the `{}` and `[]` expansion stuff is there.

The client mostly sets up its `config` struct properly, then
it calls the `curl_easy_*()` functions of the library and when it gets back
control after the `curl_easy_perform()` it cleans up the library, checks
status and exits.

When the operation is done, the `ourWriteOut()` function in `src/writeout.c`
may be called to report about the operation. That function mostly uses the
`curl_easy_getinfo()` function to extract useful information from the curl
session.

It may loop and do all this several times if many URLs were specified on the
command line or config file.

<a name="memorydebug"></a>
Memory Debugging
================

The file `lib/memdebug.c` contains debug-versions of a few functions.
Functions such as `malloc()`, `free()`, `fopen()`, `fclose()`, etc that
somehow deal with resources that might give us problems if we "leak" them.
The functions in the memdebug system do nothing fancy, they do their normal
function and then log information about what they just did. The logged data
can then be analyzed after a complete session.

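A wrapper that "does its normal function and then logs" could be sketched
like this. The names, the log line layout and the `sketch_outstanding`
counter are all hypothetical; the real functions live in `lib/memdebug.c` and
write a format understood by `memanalyze.pl`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static FILE *sketch_log;          /* trace destination; NULL disables it */
static long sketch_outstanding;   /* allocations not yet freed */

static void *sketch_dbg_malloc(size_t size, int line, const char *source)
{
  void *p;
  assert(source != NULL);
  p = malloc(size);               /* the normal work comes first */
  if(p)
    sketch_outstanding++;
  if(sketch_log)
    fprintf(sketch_log, "MEM %s:%d malloc(%lu) = %p\n",
            source, line, (unsigned long)size, p);
  return p;
}

static void sketch_dbg_free(void *p, int line, const char *source)
{
  assert(source != NULL);
  if(p)
    sketch_outstanding--;
  if(sketch_log)
    fprintf(sketch_log, "MEM %s:%d free(%p)\n", source, line, p);
  free(p);
}

/* Call sites keep a familiar spelling; the macros add file and line. */
#define sketch_malloc(size) sketch_dbg_malloc((size), __LINE__, __FILE__)
#define sketch_free(ptr)    sketch_dbg_free((ptr), __LINE__, __FILE__)
```

Set `sketch_log` to an open `FILE *` to get the trace; any allocation whose
`malloc` line has no matching `free` line is a leak candidate.
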
`memanalyze.pl` is the perl script present in `tests/` that analyzes a log
file generated by the memory tracking system. It detects if resources are
allocated but never freed and other kinds of errors related to resource
management.

Internally, the preprocessor symbol `DEBUGBUILD` guards code that is only
compiled for debug-enabled builds, and the symbol `CURLDEBUG` marks code
that is _only_ used for memory tracking/debugging.

Use `-DCURLDEBUG` when compiling to enable memory debugging; this is also
switched on by running configure with `--enable-curldebug`. Use
`-DDEBUGBUILD` when compiling to enable a debug build or run configure with
`--enable-debug`.

`curl --version` will list 'Debug' feature for debug enabled builds, and
will list 'TrackMemory' feature for curl debug memory tracking capable
builds. These features are independent and can be controlled when running
the configure script. When `--enable-debug` is given both features will be
enabled, unless some restriction prevents memory tracking from being used.

<a name="test"></a>
Test Suite
==========

The test suite is placed in its own subdirectory directly off the root in the
curl archive tree, and it contains a bunch of scripts and a lot of test case
data.

The main test script is `runtests.pl` that will invoke test servers like
`httpserver.pl` and `ftpserver.pl` before all the test cases are performed.
The test suite currently only runs on Unix-like platforms.

You'll find a description of the test suite in the `tests/README` file, and
of the test case data file format in the `tests/FILEFORMAT` file.

The test suite automatically detects if curl was built with the memory
debugging enabled, and if it was, it will detect memory leaks, too.

<a name="asyncdns"></a>
Asynchronous name resolves
==========================

libcurl can be built to do name resolves asynchronously, using either the
normal resolver in a threaded manner or by using c-ares.

<a name="cares"></a>
[c-ares][3]
------

### Build libcurl to use c-ares

1. `./configure --enable-ares=/path/to/ares/install`
2. `make`

### c-ares on win32

First I compiled c-ares. I changed the default C runtime library to be the
single-threaded rather than the multi-threaded (this seems to be required to
prevent linking errors later on). Then I simply built the areslib project
(the other projects adig/ahost seem to fail under MSVC).

Next was libcurl. I opened `lib/config-win32.h` and I added a:
`#define USE_ARES 1`

Next thing I did was I added the path for the ares includes to the include
path, and the libares.lib to the libraries.

Lastly, I also changed libcurl to be single-threaded rather than
multi-threaded, again this was to prevent some duplicate symbol errors. I'm
not sure why I needed to change everything to single-threaded, but when I
didn't I got redefinition errors for several CRT functions (`malloc()`,
`stricmp()`, etc.)

<a name="curl_off_t"></a>
`curl_off_t`
==========

`curl_off_t` is a data type provided by the external libcurl include
headers. It is the type meant to be used for the [`curl_easy_setopt()`][1]
options that end with LARGE. The type is 64 bits wide on most modern
platforms.

<a name="curlx"></a>
curlx
=====

The libcurl source code offers a few functions by source only. They are not
part of the official libcurl API, but the source files might be useful for
others so apps can optionally compile/build with these sources to gain
additional functions.

We provide them through a single header file for easy access for apps:
`curlx.h`

`curlx_strtoofft()`
-------------------
A macro that converts a string containing a number to a `curl_off_t` number.
This might use the `curlx_strtoll()` function which is provided as source
code in `strtoofft.c`. Note that the function is only provided if no
`strtoll()` (or equivalent) function exists on your platform. If `curl_off_t`
is only a 32-bit number on your platform, this macro uses `strtol()`.

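The selection logic described for the macro can be sketched as follows.
`SKETCH_SIZEOF_OFF_T`, `sketch_off_t`, `sketch_strtoofft()` and
`sketch_parse_offset()` are hypothetical names; the sketch assumes a platform
that has `strtoll()` and does not show the fallback implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Assumption: the build system would normally define this. It defaults
   to 8 here so the example is self-contained. */
#ifndef SKETCH_SIZEOF_OFF_T
#define SKETCH_SIZEOF_OFF_T 8
#endif

#if SKETCH_SIZEOF_OFF_T > 4
typedef long long sketch_off_t;
#define sketch_strtoofft(str, end, base) strtoll((str), (end), (base))
#else
typedef long sketch_off_t;
#define sketch_strtoofft(str, end, base) strtol((str), (end), (base))
#endif

/* Parses a decimal offset. Returns 0 on success, -1 on malformed
   input or overflow. */
static int sketch_parse_offset(const char *str, sketch_off_t *out)
{
  char *end;
  assert(str != NULL && out != NULL);
  errno = 0;
  *out = sketch_strtoofft(str, &end, 10);
  if(errno == ERANGE || end == str || *end != '\0')
    return -1;
  return 0;
}
```

With the 64-bit branch selected, values beyond 4 GB such as `"5000000000"`
parse without truncation.
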
Future
------

Several functions will be removed from the public `curl_` name space in a
future libcurl release. They will then only become available as `curlx_`
functions instead. To make the transition easier, we already today provide
these functions with the `curlx_` prefix to allow sources to be built
properly with the new function names. The concerned functions are:

- `curlx_getenv`
- `curlx_strequal`
- `curlx_strnequal`
- `curlx_mvsnprintf`
- `curlx_msnprintf`
- `curlx_maprintf`
- `curlx_mvaprintf`
- `curlx_msprintf`
- `curlx_mprintf`
- `curlx_mfprintf`
- `curlx_mvsprintf`
- `curlx_mvprintf`
- `curlx_mvfprintf`

<a name="contentencoding"></a>
Content Encoding
================

## About content encodings

[HTTP/1.1][4] specifies that a client may request that a server encode its
response. This is usually used to compress a response using one (or more)
encodings from a set of commonly available compression techniques. These
schemes include `deflate` (the zlib algorithm), `gzip`, `br` (brotli) and
`compress`. A client requests that the server perform an encoding by including
an `Accept-Encoding` header in the request document. The value of the header
should be one of the recognized tokens `deflate`, ... (there's a way to
register new schemes/tokens, see sec 3.5 of the spec). A server MAY honor
the client's encoding request. When a response is encoded, the server
includes a `Content-Encoding` header in the response. The value of the
`Content-Encoding` header indicates which encodings were used to encode the
data, in the order in which they were applied.

It's also possible for a client to attach priorities to different schemes so
that the server knows which it prefers. See sec 14.3 of RFC 2616 for more
information on the `Accept-Encoding` header. See sec
[3.1.2.2 of RFC 7231][15] for more information on the `Content-Encoding`
header.

## Supported content encodings

The `deflate`, `gzip` and `br` content encodings are supported by libcurl.
Both regular and chunked transfers work fine. The zlib library is required
for the `deflate` and `gzip` encodings, while the brotli decoding library is
required for the `br` encoding.

## The libcurl interface

To cause libcurl to request a content encoding use:

[`curl_easy_setopt`][1](curl, [`CURLOPT_ACCEPT_ENCODING`][5], string)

where string is the intended value of the `Accept-Encoding` header.

Currently, libcurl does support multiple encodings but only
understands how to process responses that use the `deflate`, `gzip` and/or
`br` content encodings, so the only values for [`CURLOPT_ACCEPT_ENCODING`][5]
that will work (besides `identity`, which does nothing) are `deflate`,
`gzip` and `br`. If a response is encoded using `compress` or another
unsupported method, libcurl will return an error indicating that the
response could not be decoded. If `<string>` is NULL no `Accept-Encoding`
header is generated. If `<string>` is a zero-length string, then an
`Accept-Encoding` header containing all supported encodings will be
generated.

The [`CURLOPT_ACCEPT_ENCODING`][5] must be set to any non-NULL value for
content to be automatically decoded. If it is not set and the server still
sends encoded content (despite not having been asked), the data is returned
in its raw form and the `Content-Encoding` type is not checked.

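The three cases for the option value (NULL, zero-length string, explicit
list) can be modeled with a short sketch. This is not libcurl code:
`sketch_accept_encoding()` and `SKETCH_ALL_ENCODINGS` are hypothetical, and
the "all supported encodings" list assumes a build with both zlib and brotli;
the actual list and its order depend on the build.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: this build supports the three encodings named above. */
#define SKETCH_ALL_ENCODINGS "deflate, gzip, br"

/* Writes the header line into out and returns its length, 0 when no
   header should be sent, or -1 if out is too small. */
static int sketch_accept_encoding(const char *opt, char *out, size_t outlen)
{
  int n;
  assert(out != NULL && outlen > 0);
  out[0] = '\0';
  if(!opt)
    return 0;                        /* NULL: no header at all */
  if(!*opt)
    opt = SKETCH_ALL_ENCODINGS;      /* "": advertise everything */
  n = snprintf(out, outlen, "Accept-Encoding: %s\r\n", opt);
  if(n < 0 || (size_t)n >= outlen)
    return -1;
  return n;
}
```

In the real library the same choice is made through the string passed to
`CURLOPT_ACCEPT_ENCODING`, as described above.
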
## The curl interface

Use the [`--compressed`][6] option with curl to cause it to ask servers to
compress responses using any format supported by curl.

<a name="hostip"></a>
`hostip.c` explained
====================

The main compile-time defines to keep in mind when reading the `host*.c`
source files are these:

## `CURLRES_IPV6`

this host has `getaddrinfo()` and family, and thus we use that. The host may
not be able to resolve IPv6, but we don't really have to take that into
account. Hosts that aren't IPv6-enabled have `CURLRES_IPV4` defined.

## `CURLRES_ARES`

is defined if libcurl is built to use c-ares for asynchronous name
resolves. This can be Windows or \*nix.

## `CURLRES_THREADED`

is defined if libcurl is built to use threading for asynchronous name
resolves. The name resolve will be done in a new thread, and the supported
asynch API will be the same as for ares-builds. This is the default under
(native) Windows.

If either of the two previous is defined, `CURLRES_ASYNCH` is defined too. If
libcurl is not built to use an asynchronous resolver, `CURLRES_SYNCH` is
defined.

## `host*.c` sources

The `host*.c` source files are split up like this:

- `hostip.c` - method-independent resolver functions and utility functions
- `hostasyn.c` - functions for asynchronous name resolves
- `hostsyn.c` - functions for synchronous name resolves
- `asyn-ares.c` - functions for asynchronous name resolves using c-ares
- `asyn-thread.c` - functions for asynchronous name resolves using threads
- `hostip4.c` - IPv4 specific functions
- `hostip6.c` - IPv6 specific functions

The `hostip.h` is the single united header file for all this. It defines the
`CURLRES_*` defines based on the `config*.h` and `curl_setup.h` defines.

<a name="memoryleak"></a>
|
|
Track Down Memory Leaks
|
|
=======================
|
|
|
|
## Single-threaded
|
|
|
|
Please note that this memory leak system is not adjusted to work in more
|
|
than one thread. If you want/need to use it in a multi-threaded app. Please
|
|
adjust accordingly.
|
|
|
|
## Build
|
|
|
|
Rebuild libcurl with `-DCURLDEBUG` (usually, rerunning configure with
|
|
`--enable-debug` fixes this). `make clean` first, then `make` so that all
|
|
files are actually rebuilt properly. It will also make sense to build
|
|
libcurl with the debug option (usually `-g` to the compiler) so that
|
|
debugging it will be easier if you actually do find a leak in the library.
|
|
|
|
This will create a library that has memory debugging enabled.
|
|
|
|
## Modify Your Application
|
|
|
|
Add a line in your application code:
|
|
|
|
```c
|
|
curl_dbg_memdebug("dump");
|
|
```
|
|
|
|
This will make the malloc debug system output a full trace of all resource
|
|
using functions to the given file name. Make sure you rebuild your program
|
|
and that you link with the same libcurl you built for this purpose as
|
|
described above.
|
|
|
|
## Run Your Application
|
|
|
|
Run your program as usual. Watch the specified memory trace file grow.
|
|
|
|
Make your program exit and use the proper libcurl cleanup functions etc., so
that all non-leaks are returned/freed properly.
|
|
|
|
## Analyze the Flow
|
|
|
|
Use the `tests/memanalyze.pl` perl script to analyze the dump file:
|
|
|
|
tests/memanalyze.pl dump
|
|
|
|
This outputs a report on which resources were allocated but never freed,
etc. This report is well suited for posting to the list!
|
|
|
|
If this doesn't produce any output, no leak was detected in libcurl. The
leak is then most likely in your code.
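
The core idea behind such an analysis can be sketched in a few lines of C:
walk the trace and report every allocation that has no later matching free.
This is only a toy model under stated assumptions. The real tool is the
Perl script named above and it parses the dump file, while the
`mem_event` struct and `count_leaks()` function below are hypothetical
names that work on an already parsed, in-memory trace.

```c
#include <stddef.h>

enum ev_kind { EV_ALLOC, EV_FREE };

/* One (hypothetical) trace entry: an allocation or a free. */
struct mem_event {
  enum ev_kind kind;
  unsigned long addr;  /* address returned by or passed to the call */
  size_t size;         /* only meaningful for EV_ALLOC */
};

/* Returns the number of allocations that have no later matching free and
   stores the total number of leaked bytes in *leaked_bytes. */
static size_t count_leaks(const struct mem_event *ev, size_t n,
                          size_t *leaked_bytes)
{
  size_t leaks = 0;
  size_t i, j;

  *leaked_bytes = 0;
  for(i = 0; i < n; i++) {
    int freed = 0;
    if(ev[i].kind != EV_ALLOC)
      continue;
    /* the first free of the same address after the allocation matches it */
    for(j = i + 1; j < n && !freed; j++) {
      if(ev[j].kind == EV_FREE && ev[j].addr == ev[i].addr)
        freed = 1;
    }
    if(!freed) {
      leaks++;
      *leaked_bytes += ev[i].size;
    }
  }
  return leaks;
}
```

A trace in which every allocation is followed by a free of the same address
yields a count of zero, which corresponds to the "no output" case above.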
|
|
|
|
<a name="multi_socket"></a>
|
|
`multi_socket`
==============
|
|
|
|
Implementation of the `curl_multi_socket` API
|
|
|
|
The main ideas of this API are simply:
|
|
|
|
1. The application can use whatever event system it likes, as it gets info
   from libcurl about which file descriptors libcurl waits on and for what
   action. (The previous API returns `fd_sets`, which is very
   `select()`-centric.)

2. When the application discovers action on a single socket, it calls
   libcurl and informs it that there was action on this particular socket.
   libcurl can then act on that socket/transfer only and not care about
   any other transfers. (The previous API always had to scan through all
   the existing transfers.)
|
|
|
|
The idea is that [`curl_multi_socket_action()`][7] calls a given callback
with information about which socket to wait on and for what action, and the
callback only gets called if the status of that socket has changed.
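
The "only when the status changed" behavior can be illustrated with a
small self-contained model. Everything below (`toy_notifier`,
`toy_update()`, the `TOY_*` actions) is hypothetical and only mimics the
idea. It is not how libcurl stores this state internally.

```c
#include <stddef.h>

#define TOY_MAX_SOCKS 8

enum toy_action { TOY_NONE = 0, TOY_IN = 1, TOY_OUT = 2, TOY_INOUT = 3 };

typedef void (*toy_socket_cb)(int sock, enum toy_action what, void *userp);

struct toy_notifier {
  enum toy_action last[TOY_MAX_SOCKS]; /* last action reported per socket */
  toy_socket_cb cb;                    /* application callback */
  void *userp;                         /* passed through to the callback */
};

/* Records the action wanted on 'sock'. Returns 1 if the callback was
   invoked, 0 if nothing changed and -1 if the socket is out of range. */
static int toy_update(struct toy_notifier *n, int sock, enum toy_action want)
{
  if(sock < 0 || sock >= TOY_MAX_SOCKS)
    return -1;
  if(n->last[sock] == want)
    return 0; /* same status as last time: stay silent */
  n->last[sock] = want;
  if(n->cb)
    n->cb(sock, want, n->userp);
  return 1;
}

/* Example callback that only counts how often it gets called. */
static void toy_count_cb(int sock, enum toy_action what, void *userp)
{
  (void)sock;
  (void)what;
  ++*(int *)userp;
}
```

Asking for `TOY_IN` on the same socket twice in a row triggers the callback
once. The application therefore only has to update its event system when
something actually changes.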
|
|
|
|
We also added a timer callback that makes libcurl call the application when
the timeout value changes. You set it with [`curl_multi_setopt()`][9] and
the [`CURLMOPT_TIMERFUNCTION`][10] option. To get this to work, there is
internally an added struct in each easy handle in which we store an "expire
time" (if any). The structs are then "splay sorted" so that we can add and
remove times from the tree and yet somewhat swiftly figure out both how
long there is until the next nearest timer expires and which timer (handle)
we should take care of now. Of course, the upside of all this is that we get
a [`curl_multi_timeout()`][8] that should also work with old-style
applications that use [`curl_multi_perform()`][11].
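
The two questions the sorted structure answers (how long until the nearest
expiry, and which handle it belongs to) can be shown with a deliberately
simplified model. Note that this sketch uses a plain linear scan over an
array instead of a splay tree, and that `toy_timer` and `next_timeout()`
are hypothetical names. The return values follow the convention of
[`curl_multi_timeout()`][8], where -1 means that no timeout is set.

```c
#include <stddef.h>

struct toy_timer {
  int id;          /* identifies the transfer this timer belongs to */
  long expire_ms;  /* absolute expire time in ms, -1 if no timer is set */
};

/* Returns the number of milliseconds until the nearest timer expires
   (0 if it already has) and stores its id in *which. Returns -1 and
   leaves *which untouched if no timer is set at all. */
static long next_timeout(const struct toy_timer *t, size_t n, long now_ms,
                         int *which)
{
  const struct toy_timer *best = NULL;
  size_t i;

  for(i = 0; i < n; i++) {
    if(t[i].expire_ms < 0)
      continue; /* this handle has no expire time */
    if(!best || t[i].expire_ms < best->expire_ms)
      best = &t[i];
  }
  if(!best)
    return -1;
  *which = best->id;
  return (best->expire_ms > now_ms) ? best->expire_ms - now_ms : 0;
}
```

A sorted tree gives the same answer without visiting every handle, which
matters when there are many transfers.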
|
|
|
|
We created an internal "socket to easy handles" hash table that given
|
|
a socket (file descriptor) returns the easy handle that waits for action on
|
|
that socket. This hash is made using the already existing hash code
|
|
(previously only used for the DNS cache).
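
To illustrate what such a lookup does, here is a minimal stand-alone sketch
of a socket-to-handle table. It is a fixed-size table with linear probing
and no removal, which is an assumption made to keep the example short.
libcurl uses its own generic hash code instead, and all `toy_*` and `sh_*`
names are hypothetical.

```c
#include <stddef.h>

#define TOY_SLOTS 16     /* fixed table size, enough for a demo */
#define TOY_EMPTY (-1)   /* marks an unused slot */

struct toy_sockhash {
  int sock[TOY_SLOTS];     /* key: socket descriptor */
  int easy_id[TOY_SLOTS];  /* value: stand-in for the easy handle */
};

static void sh_init(struct toy_sockhash *h)
{
  size_t i;
  for(i = 0; i < TOY_SLOTS; i++)
    h->sock[i] = TOY_EMPTY;
}

/* Adds or updates the entry for 'sock'. Returns 0 on success, -1 if the
   socket is invalid or the table is full. */
static int sh_set(struct toy_sockhash *h, int sock, int easy_id)
{
  size_t start, i;
  if(sock < 0)
    return -1;
  start = (size_t)sock % TOY_SLOTS;
  for(i = 0; i < TOY_SLOTS; i++) {
    size_t slot = (start + i) % TOY_SLOTS;
    if(h->sock[slot] == TOY_EMPTY || h->sock[slot] == sock) {
      h->sock[slot] = sock;
      h->easy_id[slot] = easy_id;
      return 0;
    }
  }
  return -1;
}

/* Returns the id stored for 'sock', or -1 if the socket is unknown. */
static int sh_lookup(const struct toy_sockhash *h, int sock)
{
  size_t start, i;
  if(sock < 0)
    return -1;
  start = (size_t)sock % TOY_SLOTS;
  for(i = 0; i < TOY_SLOTS; i++) {
    size_t slot = (start + i) % TOY_SLOTS;
    if(h->sock[slot] == sock)
      return h->easy_id[slot];
    if(h->sock[slot] == TOY_EMPTY)
      return -1; /* an empty slot ends the probe sequence */
  }
  return -1;
}
```

When the application reports activity on a socket, a lookup like
`sh_lookup()` finds the transfer to act on without scanning all transfers.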
|
|
|
|
To make libcurl able to report plain sockets in the socket callback, we had
|
|
to re-organize the internals of [`curl_multi_fdset()`][12] etc. so that
|
|
the conversion from sockets to `fd_sets` for that function is only done in
|
|
the last step before the data is returned. I also had to extend c-ares to
|
|
get a function that can return plain sockets, as that library too returned
|
|
only `fd_sets` and that is no longer good enough. The changes done to c-ares
|
|
are available in c-ares 1.3.1 and later.
|
|
|
|
<a name="structs"></a>
|
|
Structs in libcurl
==================
|
|
|
|
This section should cover 7.32.0 pretty accurately, but will make sense even
|
|
for older and later versions as things don't change drastically that often.
|
|
|
|
<a name="Curl_easy"></a>
|
|
## Curl_easy
|
|
|
|
The `Curl_easy` struct is the one returned to the outside in the external API
as a `CURL *`. This is usually known as an easy handle in API documentation
and examples.
|
|
|
|
Information and state that is related to the actual connection is in the
`connectdata` struct. When a transfer is about to be made, libcurl will
either create a new connection or re-use an existing one. The particular
`connectdata` that is used by this handle is pointed to by
`Curl_easy->easy_conn`.
|
|
|
|
Data and information regarding this particular single transfer are put in
the `SingleRequest` sub-struct.
|
|
|
|
When the `Curl_easy` struct is added to a multi handle, as it must be in
|
|
order to do any transfer, the `->multi` member will point to the `Curl_multi`
|
|
struct it belongs to. The `->prev` and `->next` members will then be used by
|
|
the multi code to keep a linked list of `Curl_easy` structs that are added to
|
|
that same multi handle. libcurl always uses multi so `->multi` *will* point
|
|
to a `Curl_multi` when a transfer is in progress.
|
|
|
|
`->mstate` is the multi state of this particular `Curl_easy`. When
|
|
`multi_runsingle()` is called, it will act on this handle according to which
|
|
state it is in. The mstate is also what tells which sockets to return for a
|
|
specific `Curl_easy` when [`curl_multi_fdset()`][12] is called etc.
|
|
|
|
The libcurl source code generally uses the name `data` for the variable that
points to the `Curl_easy`.
|
|
|
|
When doing multiplexed HTTP/2 transfers, each `Curl_easy` is associated with
|
|
an individual stream, sharing the same connectdata struct. Multiplexing
|
|
makes it even more important to keep things associated with the right thing!
|
|
|
|
<a name="connectdata"></a>
|
|
## connectdata
|
|
|
|
A general idea in libcurl is to keep connections around in a connection
"cache" after they have been used, in case they will be used again, and then
re-use an existing one instead of creating a new one, as that gives a
significant performance boost.
|
|
|
|
Each `connectdata` identifies a single physical connection to a server. If
|
|
the connection can't be kept alive, the connection will be closed after use
|
|
and then this struct can be removed from the cache and freed.
|
|
|
|
Thus, the same `Curl_easy` can be used multiple times and each time select
|
|
another `connectdata` struct to use for the connection. Keep this in mind,
|
|
as it is then important to consider if options or choices are based on the
|
|
connection or the `Curl_easy`.
|
|
|
|
Functions in libcurl will assume that `connectdata->data` points to the
|
|
`Curl_easy` that uses this connection (for the moment).
|
|
|
|
As a special complexity, some protocols supported by libcurl require a
|
|
special disconnect procedure that is more than just shutting down the
|
|
socket. It can involve sending one or more commands to the server before
|
|
doing so. Since connections are kept in the connection cache after use, the
|
|
original `Curl_easy` may no longer be around when the time comes to shut down
|
|
a particular connection. For this purpose, libcurl holds a special dummy
|
|
`closure_handle` `Curl_easy` in the `Curl_multi` struct to use when needed.
|
|
|
|
FTP uses two TCP connections for a typical transfer but it keeps both in
|
|
this single struct and thus can be considered a single connection for most
|
|
internal concerns.
|
|
|
|
The libcurl source code generally uses the name `conn` for the variable that
points to the `connectdata`.
|
|
|
|
<a name="Curl_multi"></a>
|
|
## Curl_multi
|
|
|
|
Internally, the easy interface is implemented as a wrapper around multi
|
|
interface functions. This makes everything multi interface.
|
|
|
|
`Curl_multi` is the multi handle struct exposed as `CURLM *` in external
|
|
APIs.
|
|
|
|
This struct holds a list of `Curl_easy` structs that have been added to this
|
|
handle with [`curl_multi_add_handle()`][13]. The start of the list is
|
|
`->easyp` and `->num_easy` is a counter of added `Curl_easy`s.
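
The bookkeeping described here can be modeled with two small structs. The
sketch below is a simplified stand-in, not the real definitions: the
`toy_easy` and `toy_multi` structs only carry the members discussed in this
document, and `toy_add()` and `toy_remove()` are hypothetical helpers.

```c
#include <stddef.h>

struct toy_multi;

struct toy_easy {
  struct toy_easy *prev;    /* previous handle in the multi's list */
  struct toy_easy *next;    /* next handle in the multi's list */
  struct toy_multi *multi;  /* owning multi handle, NULL if not added */
};

struct toy_multi {
  struct toy_easy *easyp;   /* start of the list of added handles */
  int num_easy;             /* number of added handles */
};

/* Appends 'e' to the list in 'm'. Returns -1 if it is already added. */
static int toy_add(struct toy_multi *m, struct toy_easy *e)
{
  struct toy_easy *tail = m->easyp;
  if(e->multi)
    return -1;
  e->next = NULL;
  if(!tail) {
    e->prev = NULL;
    m->easyp = e;
  }
  else {
    while(tail->next)
      tail = tail->next;
    tail->next = e;
    e->prev = tail;
  }
  e->multi = m;
  m->num_easy++;
  return 0;
}

/* Unlinks 'e' from 'm'. Returns -1 if 'e' does not belong to 'm'. */
static int toy_remove(struct toy_multi *m, struct toy_easy *e)
{
  if(e->multi != m)
    return -1;
  if(e->prev)
    e->prev->next = e->next;
  else
    m->easyp = e->next;
  if(e->next)
    e->next->prev = e->prev;
  e->prev = e->next = NULL;
  e->multi = NULL;
  m->num_easy--;
  return 0;
}
```

Keeping the back pointer in each handle is what lets code that only has a
handle find the multi handle it belongs to.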
|
|
|
|
`->msglist` is a linked list of messages to send back when
|
|
[`curl_multi_info_read()`][14] is called. Basically a node is added to that
|
|
list when an individual `Curl_easy`'s transfer has completed.
|
|
|
|
`->hostcache` points to the name cache. It is a hash table for looking up
|
|
name to IP. The nodes have a limited lifetime in there and this cache is
|
|
meant to reduce the time for when the same name is wanted within a short
|
|
period of time.
|
|
|
|
`->timetree` points to a tree of `Curl_easy`s, sorted by the remaining time
|
|
until it should be checked - normally some sort of timeout. Each `Curl_easy`
|
|
has one node in the tree.
|
|
|
|
`->sockhash` is a hash table that allows fast lookup from a socket descriptor
to the `Curl_easy` that uses that descriptor. This is necessary for the
`multi_socket` API.
|
|
|
|
`->conn_cache` points to the connection cache. It keeps track of all
|
|
connections that are kept after use. The cache has a maximum size.
|
|
|
|
`->closure_handle` is described in the `connectdata` section.
|
|
|
|
The libcurl source code generally uses the name `multi` for the variable that
points to the `Curl_multi` struct.
|
|
|
|
<a name="Curl_handler"></a>
|
|
## Curl_handler
|
|
|
|
Each unique protocol that is supported by libcurl needs to provide at least
|
|
one `Curl_handler` struct. It defines what the protocol is called and what
|
|
functions the main code should call to deal with protocol specific issues.
|
|
In general, there's a source file named `[protocol].c` in which there's a
|
|
`struct Curl_handler Curl_handler_[protocol]` declared. In `url.c` there's
|
|
then the main array with all individual `Curl_handler` structs pointed to
|
|
from a single array which is scanned through when a URL is given to libcurl
|
|
to work with.
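
A stripped-down model of that arrangement could look like the sketch below.
The `toy_handler` struct only has two of the members, and all names
(`toy_protocols`, `find_handler()`, `scheme_equal()`) are hypothetical. It
merely shows the pattern of one struct per protocol plus a NULL-terminated
array that is scanned for a matching scheme.

```c
#include <stddef.h>
#include <ctype.h>

struct toy_handler {
  const char *scheme;  /* URL scheme name, spelled in uppercase */
  int defport;         /* default port for the protocol */
};

/* In this model each protocol provides one handler struct. */
static const struct toy_handler toy_http = { "HTTP", 80 };
static const struct toy_handler toy_https = { "HTTPS", 443 };
static const struct toy_handler toy_ftp = { "FTP", 21 };

/* The single array that points to all individual handlers. */
static const struct toy_handler * const toy_protocols[] = {
  &toy_http,
  &toy_https,
  &toy_ftp,
  NULL
};

/* Case-insensitive comparison of two scheme names. */
static int scheme_equal(const char *a, const char *b)
{
  while(*a && *b) {
    if(toupper((unsigned char)*a) != toupper((unsigned char)*b))
      return 0;
    a++;
    b++;
  }
  return *a == *b; /* equal only if both strings ended together */
}

/* Scans the array for the handler of the given scheme, NULL if none. */
static const struct toy_handler *find_handler(const char *scheme)
{
  const struct toy_handler * const *p;
  for(p = toy_protocols; *p; p++) {
    if(scheme_equal((*p)->scheme, scheme))
      return *p;
  }
  return NULL;
}
```

With this layout, adding a protocol means adding one struct and one array
entry. The scanning code itself does not change.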
|
|
|
|
`->scheme` is the URL scheme name, usually spelled out in uppercase. That's
"HTTP" or "FTP" etc. SSL versions of the protocol need their own
`Curl_handler` setup, so HTTPS is separate from HTTP.
|
|
|
|
`->setup_connection` is called to allow the protocol code to allocate
|
|
protocol specific data that then gets associated with that `Curl_easy` for
|
|
the rest of this transfer. It gets freed again at the end of the transfer.
|
|
It will be called before the `connectdata` for the transfer has been
selected/created. Most protocols will allocate their private
`struct [PROTOCOL]` here and assign `Curl_easy->req.p.[protocol]` to it.
|
|
|
|
`->connect_it` allows a protocol to do some specific actions after the TCP
|
|
connect is done, that can still be considered part of the connection phase.
|
|
|
|
Some protocols will alter the `connectdata->recv[]` and
|
|
`connectdata->send[]` function pointers in this function.
|
|
|
|
`->connecting` is similarly a function that keeps getting called as long as
|
|
the protocol considers itself still in the connecting phase.
|
|
|
|
`->do_it` is the function called to issue the transfer request, which we call
the DO action internally. If the DO is not enough and more needs to be done
for the entire DO sequence to complete, `->doing` is then usually also
provided. Each protocol that needs to do multiple commands or similar for
do/doing needs to implement its own state machine (see SCP, SFTP, FTP). Some
protocols (only FTP and only due to historical reasons) have a separate piece
of the DO state called `DO_MORE`.
|
|
|
|
`->doing` keeps getting called while issuing the transfer request command(s).
|
|
|
|
`->done` gets called when the transfer is complete and DONE. That's after the
|
|
main data has been transferred.
|
|
|
|
`->do_more` gets called during the `DO_MORE` state. The FTP protocol uses
|
|
this state when setting up the second connection.
|
|
|
|
`->proto_getsock`, `->doing_getsock`, `->domore_getsock` and
`->perform_getsock` are functions that return socket information: which
socket(s) to wait for which action(s) during the particular multi state.
|
|
|
|
`->disconnect` is called immediately before the TCP connection is shut down.
|
|
|
|
`->readwrite` gets called during transfer to allow the protocol to do extra
reads/writes.
|
|
|
|
`->defport` is the default TCP or UDP port this protocol uses.
|
|
|
|
`->protocol` is one or more bits in the `CURLPROTO_*` set. The SSL versions
have their "base" protocol set and then the SSL variation, like
"HTTP|HTTPS".
|
|
|
|
`->flags` is a bitmask with additional information about the protocol that will
|
|
make it get treated differently by the generic engine:
|
|
|
|
- `PROTOPT_SSL` - will make it connect and negotiate SSL

- `PROTOPT_DUAL` - this protocol uses two connections

- `PROTOPT_CLOSEACTION` - this protocol has actions to do before closing the
  connection. This flag is no longer used by code, yet still set for a bunch
  of protocol handlers.

- `PROTOPT_DIRLOCK` - "direction lock". The SSH protocols set this bit to
  limit which "direction" of socket actions that the main engine will
  concern itself with.

- `PROTOPT_NONETWORK` - a protocol that doesn't use network (read `file:`)

- `PROTOPT_NEEDSPWD` - this protocol needs a password and will use a default
  one unless one is provided

- `PROTOPT_NOURLQUERY` - this protocol can't handle a query part on the URL
  (?foo=bar)
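
How generic code can branch on such a bitmask is sketched below. The flag
names are taken from the list above, but the bit positions chosen here are
illustrative assumptions, and the `toy_proto` struct and the two helper
functions are hypothetical.

```c
/* Flag names from the list above; the bit values are assumptions. */
#define PROTOPT_SSL         (1u << 0)
#define PROTOPT_DUAL        (1u << 1)
#define PROTOPT_CLOSEACTION (1u << 2)
#define PROTOPT_DIRLOCK     (1u << 3)
#define PROTOPT_NONETWORK   (1u << 4)
#define PROTOPT_NEEDSPWD    (1u << 5)
#define PROTOPT_NOURLQUERY  (1u << 6)

/* Hypothetical, reduced stand-in for a protocol handler. */
struct toy_proto {
  const char *scheme;
  unsigned int flags;  /* bitwise OR of PROTOPT_* values */
};

/* Non-zero if the generic engine needs to set up a network connection. */
static int uses_network(const struct toy_proto *p)
{
  return !(p->flags & PROTOPT_NONETWORK);
}

/* Non-zero if the connection should negotiate SSL after connecting. */
static int wants_tls(const struct toy_proto *p)
{
  return (p->flags & PROTOPT_SSL) != 0;
}
```

A handler combines flags with bitwise OR, for example
`PROTOPT_SSL | PROTOPT_DUAL`, and the generic engine tests single bits
without knowing which protocol it is dealing with.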
|
|
|
|
<a name="conncache"></a>
|
|
## conncache
|
|
|
|
This is a hash table with connections for later re-use. Each `Curl_easy` has a
pointer to its connection cache. Each multi handle sets up a connection
cache that all added `Curl_easy`s share by default.
|
|
|
|
<a name="Curl_share"></a>
|
|
## Curl_share
|
|
|
|
The libcurl share API allocates a `Curl_share` struct, exposed to the
|
|
external API as `CURLSH *`.
|
|
|
|
The idea is that the struct can have a set of its own versions of caches and
|
|
pools and then by providing this struct in the `CURLOPT_SHARE` option, those
|
|
specific `Curl_easy`s will use the caches/pools that this share handle
|
|
holds.
|
|
|
|
Then individual `Curl_easy` structs can be made to share specific things
|
|
that they otherwise wouldn't, such as cookies.
|
|
|
|
The `Curl_share` struct can currently hold cookies, DNS cache and the SSL
|
|
session cache.
|
|
|
|
<a name="CookieInfo"></a>
|
|
## CookieInfo
|
|
|
|
This is the main cookie struct. It holds all known cookies and related
|
|
information. Each `Curl_easy` has its own private `CookieInfo` even when
|
|
they are added to a multi handle. They can be made to share cookies by using
|
|
the share API.
|
|
|
|
|
|
[1]: https://curl.se/libcurl/c/curl_easy_setopt.html
|
|
[2]: https://curl.se/libcurl/c/curl_easy_init.html
|
|
[3]: https://c-ares.haxx.se/
|
|
[4]: https://tools.ietf.org/html/rfc7230 "RFC 7230"
|
|
[5]: https://curl.se/libcurl/c/CURLOPT_ACCEPT_ENCODING.html
|
|
[6]: https://curl.se/docs/manpage.html#--compressed
|
|
[7]: https://curl.se/libcurl/c/curl_multi_socket_action.html
|
|
[8]: https://curl.se/libcurl/c/curl_multi_timeout.html
|
|
[9]: https://curl.se/libcurl/c/curl_multi_setopt.html
|
|
[10]: https://curl.se/libcurl/c/CURLMOPT_TIMERFUNCTION.html
|
|
[11]: https://curl.se/libcurl/c/curl_multi_perform.html
|
|
[12]: https://curl.se/libcurl/c/curl_multi_fdset.html
|
|
[13]: https://curl.se/libcurl/c/curl_multi_add_handle.html
|
|
[14]: https://curl.se/libcurl/c/curl_multi_info_read.html
|
|
[15]: https://tools.ietf.org/html/rfc7231#section-3.1.2.2
|