mirror of
https://github.com/moparisthebest/curl
synced 2025-01-10 21:48:10 -05:00
docs/URL-SYNTAX: the URL syntax curl accepts and works with
Closes #6285
parent 5253444090
commit ea0916d41b
@@ -86,6 +86,7 @@ EXTRA_DIST = \
  THANKS \
  TODO \
  TheArtOfHttpScripting.md \
+ URL-SYNTAX.md \
  VERSIONS.md

 MAN2HTML= roffit $< >$@
316
docs/URL-SYNTAX.md
Normal file
@@ -0,0 +1,316 @@
# URL syntax and its use in curl

## Specifications

The official "URL syntax" is primarily defined in these two different
specifications:

- [RFC 3986](https://tools.ietf.org/html/rfc3986) (although URL is called "URI" in there)
- [The WHATWG URL Specification](https://url.spec.whatwg.org/)

RFC 3986 is the earlier one, and curl has always tried to adhere to that one
(since it shipped in January 2005).

The WHATWG URL spec was written later, is incompatible with RFC 3986 and
changes over time.

## Variations

URL parsers as implemented in browsers, libraries and tools usually opt to
support one of the mentioned specifications. Bugs, differences in
interpretation and the moving nature of the WHATWG spec do, however, make it
very unlikely that multiple parsers treat URLs the exact same way!

## Security

Due to the inherent differences between URL parser implementations, it is
considered a security risk to mix different implementations and assume the
same behavior!

For example, if you use one parser to check if a URL uses a good host name or
the correct auth field, and then pass on that same URL to a *second* parser,
there will always be a risk it treats the same URL differently. There is no
right and wrong in URL land, only differences of opinion.

libcurl offers a separate API to its URL parser for this reason, among others.

Applications may at times find it convenient to allow users to specify URLs
for various purposes and that string would then end up fed to curl. Getting a
URL from an external untrusted party and using it with curl brings several
security concerns:

1. If you have an application that runs as or in a server application, getting
   an unfiltered URL can trick your application into accessing a local
   resource instead of a remote one. Protecting yourself against localhost
   accesses is very hard when accepting user-provided URLs.

2. Such custom URLs can access other ports than you planned as port numbers
   are part of the regular URL format. The combination of a local host and a
   custom port number can allow external users to play tricks with your local
   services.

3. Such a URL might use other schemes than you thought of or planned for.
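
The three concerns above can be turned into a policy check. The following is a minimal sketch, not a complete defense: it assumes the scheme, host and port were already extracted by the *same* parser that will perform the transfer (for example libcurl's own URL parser API), and the allowed schemes and ports are hypothetical values chosen for illustration. It does not resolve DNS names, so a name that points at a local address still passes, which is part of why this protection is hard.

```python
import ipaddress

ALLOWED_SCHEMES = {"http", "https"}  # hypothetical policy
ALLOWED_PORTS = {80, 443}            # hypothetical policy


def is_acceptable(scheme, host, port):
    """Return True if already-parsed URL parts pass the example policy."""
    if scheme.lower() not in ALLOWED_SCHEMES:
        return False                 # concern 3: unexpected scheme
    if port not in ALLOWED_PORTS:
        return False                 # concern 2: unexpected port
    name = host.strip("[]").rstrip(".").lower()
    if name == "localhost" or name.endswith(".localhost"):
        return False                 # concern 1: local resource by name
    try:
        addr = ipaddress.ip_address(name)
    except ValueError:
        return True                  # a DNS name; its address is NOT checked
    return not (addr.is_loopback or addr.is_private or addr.is_link_local)
```

For example, `is_acceptable("http", "www.example.com", 80)` passes, while `is_acceptable("http", "[::1]", 80)` and `is_acceptable("ftp", "www.example.com", 80)` are rejected.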

## "RFC 3986 plus"

curl recognizes a URL syntax that we call "RFC 3986 plus". It is grounded on
the well-established RFC 3986 to make sure previously written command lines
and scripts that use curl keep working.

curl's URL parser allows a few deviations from the spec in order to
inter-operate better with URLs that appear in the wild.

### spaces

`Location:` headers in particular, which tell the client where a resource has
been redirected to, sometimes contain spaces. This is a violation of RFC 3986
but is fine in the WHATWG spec. curl handles these by re-encoding them to
`%20`.

### non-ASCII

Byte values in a provided URL that are outside of the printable ASCII range
are percent-encoded by curl.
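
As a rough model of the two rules above (spaces and bytes outside printable ASCII), here is a short sketch. It is not curl's implementation; the function name and the choice of lowercase hex digits are assumptions made for the example.

```python
def encode_unprintable(url: bytes) -> str:
    """Percent-encode spaces and every byte outside 0x21-0x7e."""
    out = []
    for byte in url:
        if 0x20 < byte < 0x7f:
            out.append(chr(byte))             # printable, kept as-is
        else:
            out.append("%{:02x}".format(byte))  # space, control or 8-bit byte
    return "".join(out)
```

Fed the UTF-8 bytes of a URL containing a space and an accented letter, it leaves the ASCII part untouched and emits `%20` plus one `%xx` triplet per remaining byte.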

### multiple slashes

An absolute URL always starts with a "scheme" followed by a colon. For all the
schemes curl supports, the colon must be followed by two slashes according to
RFC 3986 but not according to the WHATWG spec, which allows any number of
slashes from one upwards.

curl allows one, two or three slashes after the colon to still be considered a
valid URL.
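
The rule can be expressed as a small check. This is only a sketch of the stated rule under the assumption that the scheme follows the RFC 3986 character set; the function name is hypothetical and the code says nothing about how curl interprets what follows the slashes.

```python
import re

# scheme, colon, then one to three slashes that are not followed by a fourth
_SCHEME_SLASHES = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:/{1,3}(?!/)")


def has_accepted_slash_count(url):
    """Return True if the scheme colon is followed by 1, 2 or 3 slashes."""
    return _SCHEME_SLASHES.match(url) is not None
```

With this check, `http:/example.com/` and `http:///example.com/` are accepted while `http:////example.com/` is not.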

### "scheme-less"

curl supports "URLs" that do not start with a scheme. This is not supported by
any of the specifications. This is a shortcut to entering URLs that was
supported by browsers early on and has been mimicked by curl.

Based on what the host name starts with, curl will "guess" what protocol to
use:

- `ftp.` means FTP
- `dict.` means DICT
- `ldap.` means LDAP
- `imap.` means IMAP
- `smtp.` means SMTP
- `pop3.` means POP3
- anything else means HTTP
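
The guessing table above maps directly onto a prefix lookup. A minimal sketch follows; the function name is hypothetical and the case-insensitive comparison is an assumption, not something stated above.

```python
_PREFIXES = (
    ("ftp.", "ftp"),
    ("dict.", "dict"),
    ("ldap.", "ldap"),
    ("imap.", "imap"),
    ("smtp.", "smtp"),
    ("pop3.", "pop3"),
)


def guess_scheme(host):
    """Guess a scheme from the start of a host name; default to HTTP."""
    lowered = host.lower()  # assumption: prefixes match case-insensitively
    for prefix, scheme in _PREFIXES:
        if lowered.startswith(prefix):
            return scheme
    return "http"
```

Note that the trailing dot is part of the prefix, so `ftpserver.example.com` falls through to HTTP.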

### globbing letters

The curl command line tool supports "globbing" of URLs. It means that you can
create ranges and lists using `[N-M]` and `{one,two,three}` sequences. The
characters used for this (`[]{}`) are reserved in RFC 3986 and can therefore
not legitimately be part of such a URL.

They are however not reserved or special in the WHATWG specification, so
globbing can mess up such URLs. Globbing can be turned off for such occasions
(using `--globoff`).
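
To make the idea concrete, here is a simplified sketch of what expanding such sequences means. It only handles plain numeric `[N-M]` ranges and `{a,b,c}` lists; it is an illustration with a hypothetical function name, not the curl tool's globbing code, and it makes no claim about the order in which curl visits the resulting URLs.

```python
import itertools
import re

_TOKEN = re.compile(r"\[(\d+)-(\d+)\]|\{([^{}]*)\}")


def expand_glob(url):
    """Expand numeric [N-M] ranges and {a,b,c} lists into a list of URLs."""
    parts = []  # each entry is a list of alternatives for one URL piece
    pos = 0
    for match in _TOKEN.finditer(url):
        parts.append([url[pos:match.start()]])      # literal text before token
        if match.group(1) is not None:
            first, last = int(match.group(1)), int(match.group(2))
            parts.append([str(n) for n in range(first, last + 1)])
        else:
            parts.append(match.group(3).split(","))
        pos = match.end()
    parts.append([url[pos:]])                       # literal text after last token
    return ["".join(combo) for combo in itertools.product(*parts)]
```

A bracketed IPv6 address such as `[2001:1890:1112:1::20]` does not match the numeric range pattern in this sketch and is left alone.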

# URL syntax details

A URL may consist of the following components - many of them are optional:

    [scheme][divider][userinfo][hostname][port number][path][query][fragment]

Each component is separated from the following component with a divider
character or string.

Which in an example could look like

    http://user:password@www.example.com:80/index.html?foo=bar#top
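
To see the components pulled apart, the sketch below splits a URL of this shape with Python's standard `urllib.parse`. That is a different parser from curl's, so, as the Security section warns, it may disagree with curl on unusual input; it is used here purely to illustrate the component names.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://user:password@www.example.com:80/index.html?foo=bar#top")

# Collect the pieces under the names used in the component list.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "hostname": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}
```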

## Scheme

The scheme specifies the protocol to use. A curl build can support a few or
many different schemes. You can limit what schemes curl should accept.

## Userinfo

The userinfo field can be used to set the user name and password for
authentication purposes in this transfer. The use of this field is discouraged
since it often means passing around the password in plain text and is thus a
security risk.

URLs for IMAP, POP3 and SMTP also support *login options* as part of the
userinfo field. They are provided after the password, separated from it by a
semicolon.

## Hostname

The hostname part of the URL contains the address of the server that you want
to connect to. This can be the fully qualified domain name of the server, the
local network name of the machine on your network or the IP address of the
server or machine represented by either an IPv4 or IPv6 address (within
brackets). For example:

    http://www.example.com/

    http://hostname/

    http://192.168.0.1/

    http://[2001:1890:1112:1::20]/

If curl was built with International Domain Name (IDN) support, it can also
handle host names using non-ASCII characters.

## Port number

If there's a colon after the hostname, that should be followed by the port
number to use, in the range 1 - 65535. curl also supports a blank port number
field - but only if the URL starts with a scheme.
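
The port rule can be written down as a small validator. This is a sketch of the rule as stated above, with a hypothetical function name and error messages; it assumes the text after the colon has already been isolated.

```python
def parse_port(text, has_scheme):
    """Return the port as int, or None for an accepted blank port field.

    Raises ValueError when the field breaks the rule described above.
    """
    if text == "":
        if has_scheme:
            return None  # blank port is accepted only with a scheme
        raise ValueError("blank port requires a URL that starts with a scheme")
    if not (text.isascii() and text.isdigit()):
        raise ValueError("port must consist of decimal digits")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError("port out of range")
    return port
```

`parse_port("80", True)` returns `80`, `parse_port("", True)` returns `None`, and `parse_port("", False)` or `parse_port("65536", True)` raise `ValueError`.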

# Scheme specific behaviors

## FTP

The path part of an FTP request specifies the file to retrieve and from what
directory. If the file part is omitted then libcurl downloads the directory
listing for the directory specified. If the directory is omitted then the
directory listing for the root / home directory will be returned.

FTP servers typically put the user in their "home directory" after login,
which then differs between users. To explicitly specify the root directory of
an FTP server, start the path with double slash `//` or `/%2f` (2F is the
hexadecimal value of the ASCII code for the slash).
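
The difference between a home-relative and a root-anchored FTP path can be shown with a small URL builder. It is a sketch with a hypothetical function name and placeholder host; it uses the `/%2f` form for the root directory and leaves other slashes in the path unencoded.

```python
from urllib.parse import quote


def ftp_url(host, path, from_root=False):
    """Build an FTP URL relative to the login directory or to the root."""
    encoded = quote(path.lstrip("/"))          # keeps "/" between directories
    prefix = "/%2f" if from_root else "/"      # "/%2f" anchors at the root
    return "ftp://" + host + prefix + encoded
```

`ftp_url("ftp.example.com", "pub/file.txt")` addresses `pub/file.txt` below the login directory, while `ftp_url("ftp.example.com", "/etc/motd", from_root=True)` yields `ftp://ftp.example.com/%2fetc/motd`.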

## FILE

When a `FILE://` URL is accessed on Windows systems, it can be crafted in a
way so that Windows attempts to connect to a (remote) machine when curl wants
to read or write such a path.

curl only allows the hostname part of a FILE URL to be one out of these three
alternatives: `localhost`, `127.0.0.1` or blank ("", zero characters).
Anything else will make curl fail to parse the URL.

On Windows, curl accepts that the FILE URL's path starts with a "drive
letter". That's a single letter `a` to `z` followed by a colon or a pipe
character (`|`).
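
Both FILE rules are simple enough to sketch as checks. The function names are hypothetical, and two details are assumptions rather than statements from the text above: that host names and drive letters compare case-insensitively, and that the drive letter may follow the path's leading slash.

```python
import re

_ALLOWED_FILE_HOSTS = {"localhost", "127.0.0.1", ""}

# optional leading slash, one letter, then ":" or "|"
_DRIVE = re.compile(r"^/?([A-Za-z])[:|]")


def file_host_ok(host):
    """Return True if the host part is one of the three accepted values."""
    return host.lower() in _ALLOWED_FILE_HOSTS


def drive_letter(path):
    """Return the drive letter a FILE path starts with, or None."""
    match = _DRIVE.match(path)
    return match.group(1) if match else None
```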

## IMAP

The path part of an IMAP request not only specifies the mailbox to list or
select, but can also be used to check the `UIDVALIDITY` of the mailbox, to
specify the `UID`, `SECTION` and `PARTIAL` octets of the message to fetch and
to specify what messages to search for.

A top level folder list:

    imap://user:password@mail.example.com

A folder list on the user's inbox:

    imap://user:password@mail.example.com/INBOX

Select the user's inbox and fetch message with uid = 1:

    imap://user:password@mail.example.com/INBOX/;UID=1

Select the user's inbox and fetch the first message in the mailbox:

    imap://user:password@mail.example.com/INBOX/;MAILINDEX=1

Select the user's inbox, check the `UIDVALIDITY` of the mailbox is 50 and
fetch message 2 if it is:

    imap://user:password@mail.example.com/INBOX;UIDVALIDITY=50/;UID=2

Select the user's inbox and fetch the text portion of message 3:

    imap://user:password@mail.example.com/INBOX/;UID=3/;SECTION=TEXT

Select the user's inbox and fetch the first 1024 octets of message 4:

    imap://user:password@mail.example.com/INBOX/;UID=4/;PARTIAL=0.1024

Select the user's inbox and check for NEW messages:

    imap://user:password@mail.example.com/INBOX?NEW

Select the user's inbox and search for messages containing "shadows" in the
subject line:

    imap://user:password@mail.example.com/INBOX?SUBJECT%20shadows
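
Search URLs like the last two can be assembled programmatically. The sketch below percent-encodes the mailbox and the search criteria with Python's `urllib.parse.quote`; the function name is hypothetical, and credentials are deliberately left out of the URL since the userinfo field is discouraged.

```python
from urllib.parse import quote


def imap_search_url(host, mailbox, criteria):
    """Build an IMAP URL that selects a mailbox and runs a search in it."""
    return "imap://{}/{}?{}".format(host, quote(mailbox), quote(criteria))
```

For instance, `imap_search_url("mail.example.com", "INBOX", "SUBJECT shadows")` produces the path and query `INBOX?SUBJECT%20shadows`.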

For more information about the individual components of an IMAP URL please see
RFC 5092.

## LDAP

The path part of a LDAP request can be used to specify the Distinguished
Name, Attributes, Scope, Filter and Extension for a LDAP search. Each field is
separated by a question mark and when that field is not required an empty
string with the question mark separator should be included.

Search for the DN as `My Organisation`:

    ldap://ldap.example.com/o=My%20Organisation

The same search, but returning only postalAddress attributes:

    ldap://ldap.example.com/o=My%20Organisation?postalAddress

Search for an empty DN and request information about the
`rootDomainNamingContext` attribute for an Active Directory server:

    ldap://ldap.example.com/?rootDomainNamingContext
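
The question-mark separated layout can be sketched as a small builder. The function name and parameters are hypothetical, the Extension field is left out for brevity, and the set of characters left unencoded is an assumption made for readability. Empty fields in the middle are kept so later fields stay in position; trailing empty fields are dropped.

```python
from urllib.parse import quote


def ldap_url(host, dn="", attributes=(), scope="", search_filter=""):
    """Build an LDAP URL from DN, attributes, scope and filter."""
    fields = [
        quote(dn, safe="=,"),
        ",".join(quote(name) for name in attributes),
        scope,
        quote(search_filter, safe="=()&|!*"),
    ]
    while len(fields) > 1 and fields[-1] == "":
        fields.pop()                      # drop unused trailing fields
    return "ldap://" + host + "/" + "?".join(fields)
```

Called with only a filter, for example `ldap_url("ldap.example.com", "o=X", search_filter="(cn=a)")`, it emits the two empty fields in between: `ldap://ldap.example.com/o=X???(cn=a)`.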

For more information about the individual components of a LDAP URL please
see RFC 4516.

## POP3

The path part of a POP3 request specifies the message ID to retrieve. If the
ID is not specified then a list of waiting messages is returned instead.

## SCP

The path part of an SCP URL specifies the path and file to retrieve or
upload. The file is taken as an absolute path from the root directory on the
server.

To specify a path relative to the user's home directory on the server, prepend
`~/` to the path portion.

## SFTP

The path part of an SFTP URL specifies the file to retrieve or upload. If the
path ends with a slash (`/`) then a directory listing is returned instead of a
file. If the path is omitted entirely then the directory listing for the root
/ home directory will be returned.

## SMB

The path part of a SMB request specifies the file to retrieve and from what
share and directory or the share to upload to and as such, may not be omitted.
If the user name is embedded in the URL then it must contain the domain name,
and the separator between the domain and the user name must be URL encoded:
`%2f` for a slash (a backslash would be `%5c`).

curl supports SMB version 1 (only).

## SMTP

The path part of a SMTP request specifies the host name to present during
communication with the mail server. If the path is omitted then libcurl will
attempt to resolve the local computer's host name. However, this may not
return the fully qualified domain name that is required by some mail servers
and specifying this path allows you to set an alternative name, such as your
machine's fully qualified domain name, which you might have obtained from an
external function such as `gethostname` or `getaddrinfo`.

## RTMP

There's no official URL spec for RTMP so libcurl uses the URL syntax supported
by the underlying librtmp library. It has a syntax where it wants a
traditional URL, followed by a space and a series of space-separated
`name=value` pairs.

While space is not typically a "legal" character, libcurl accepts them. When a
user wants to pass in a `#` (hash) character it will be treated as a fragment
and get cut off by libcurl if provided literally. You will instead have to
escape it by providing it as backslash and its ASCII value in hexadecimal:
`\23`.
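
A sketch of assembling such a string, escaping only the `#` character as described above. The function name is hypothetical, the option names in the usage note are placeholders for illustration, and other characters that may need escaping (such as spaces inside a value) are not handled.

```python
def rtmp_url(base, options):
    """Append space-separated name=value pairs, escaping '#' as \\23."""
    pairs = []
    for name, value in options.items():
        escaped = str(value).replace("#", "\\23")  # backslash + hex of '#'
        pairs.append("{}={}".format(name, escaped))
    return " ".join([base] + pairs)
```

With `{"playpath": "a#b", "live": 1}` as options, the text after the base URL becomes `playpath=a\23b live=1`.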