URLs

curl is called curl because a substring in its name is URL (Uniform Resource Locator). It operates on URLs. URL is the name we casually use for the web address strings, like the ones we usually see prefixed with http:// or starting with www.

URL is, strictly speaking, the former name for these. URI (Uniform Resource Identifier) is the more modern and correct name for them. Their syntax is defined in RFC 3986.

Where curl accepts a “URL” as input, it is then really a “URI”. Most of the protocols curl understands also have a corresponding URI syntax document that describes how that particular URI format works.

curl assumes that you give it a valid URL and it only does limited checks of the format in order to extract the information it deems necessary to perform its operation. You can, for example, most probably pass in illegal characters in the URL without curl noticing or caring and it will just pass them on.

Scheme

URLs start with the "scheme", which is the official name for the http:// part. That tells which protocol the URL uses. The scheme must be a known one that this version of curl supports or it will show an error message and stop. Additionally, the scheme must neither start with nor contain any whitespace.

The scheme separator

The scheme identifier is separated from the rest of the URL by the :// sequence. That is a colon and two forward slashes. There exist URL formats with only one slash, but curl does not support any of them. There are two additional notes to be aware of, about the number of slashes:

curl allows some illegal syntax and tries to correct it internally, so it also understands and accepts URLs with one or three slashes, even though they are not properly formed URLs. curl does this because browsers started this practice, which has led to such URLs being used in the wild every now and then.

file:// URLs are written as file://<hostname>/<path> but the only hostnames that are okay to use are localhost, 127.0.0.1 or a blank (nothing at all):

file://localhost/path/to/file
file://127.0.0.1/path/to/file
file:///path/to/file

Inserting any other host name in there makes recent versions of curl return an error.

Pay special attention to the third example above (file:///path/to/file). That is three slashes before the path. That is again an area with common mistakes and where browsers allow users to use the wrong syntax so as a special exception, curl on Windows also allows this incorrect format:

file://X:/path/to/file

… where X is a Windows-style drive letter.
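The host rule for file:// URLs can be illustrated with a small sketch. It uses Python's urllib.parse, which is a different parser from curl's, and the helper name file_url_host_ok is made up for this example. It does not model the Windows drive-letter exception described above.

```python
from urllib.parse import urlsplit

# Hosts that the text above lists as acceptable in a file:// URL.
ALLOWED_FILE_HOSTS = {"", "localhost", "127.0.0.1"}


def file_url_host_ok(url: str) -> bool:
    """Return True if a file:// URL names one of the accepted hosts.

    Hypothetical helper for illustration only, not curl's own logic.
    """
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file:// URL")
    # netloc is the text between "//" and the next "/"; it is empty
    # for the three-slash form file:///path/to/file.
    return parts.netloc.lower() in ALLOWED_FILE_HOSTS


print(file_url_host_ok("file://localhost/path/to/file"))    # True
print(file_url_host_ok("file:///path/to/file"))             # True
print(file_url_host_ok("file://example.com/path/to/file"))  # False
```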

Without scheme

As a convenience, curl also allows users to leave out the scheme part from URLs. Then it guesses which protocol to use based on the first part of the host name. That guessing is basic, as it just checks if the first part of the host name matches one of a set of protocols, and assumes you meant to use that protocol. This heuristic is based on the fact that servers traditionally used to be named like that. The protocols that are detected this way are FTP, DICT, LDAP, IMAP, SMTP and POP3. Any other host name in a scheme-less URL will make curl default to HTTP.

For example, this gets a file from an FTP site:

curl ftp.funet.fi/README

While this gets data from an HTTP server:

curl example.com

You can modify the default protocol to something other than HTTP with the --proto-default option.
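The guessing described in this section can be sketched roughly as follows. This is an illustration of the heuristic as the text describes it, not curl's actual code. The function name guess_scheme and its default parameter (standing in for what --proto-default changes) are assumptions made for the example.

```python
# Protocols the text above says are detected from the host name.
GUESSABLE = ("ftp", "dict", "ldap", "imap", "smtp", "pop3")


def guess_scheme(url: str, default: str = "http") -> str:
    """Guess a scheme for a scheme-less URL from its first host label.

    Hypothetical helper: a sketch of the described heuristic only.
    """
    # Everything before the first slash is the authority part.
    authority = url.split("/", 1)[0]
    # Drop an optional "user:password@" prefix.
    host = authority.rsplit("@", 1)[-1]
    # The first label only counts when it is followed by a dot.
    label, dot, _ = host.partition(".")
    if dot and label.lower() in GUESSABLE:
        return label.lower()
    return default


print(guess_scheme("ftp.funet.fi/README"))  # ftp
print(guess_scheme("example.com"))          # http
```

Passing default="https" in this sketch mirrors the effect of changing the fallback protocol.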

Name and password

After the scheme, there can be an optional user name and password embedded. The use of this syntax is usually frowned upon these days, since it makes it easy to leak this information in scripts or elsewhere. For example, listing the directory of an FTP server using a given name and password:

curl ftp://user:password@example.com/

The presence of a user name and password in the URL is completely optional. curl also allows that information to be provided with normal command-line options, outside of the URL.

Host name or address

The host name part of the URL is simply a name that can be resolved to a numerical IP address, or the numerical address itself. When specifying a numerical address, use the dotted version for IPv4 addresses:

curl http://127.0.0.1/

…and for IPv6 addresses the numerical version needs to be within square brackets:

curl http://[::1]/

When a host name is used, the conversion of the name to an IP address is typically done using the system's resolver functions. That normally lets a sysadmin provide local name lookups in the /etc/hosts file (or equivalent).

Port number

Each protocol has a "default port" that curl will use for it, unless a specified port number is given. The optional port number can be provided within the URL after the host name part, as a colon and the port number written in decimal. For example, asking for an HTTP document on port 8080:

curl http://example.com:8080/

With the name specified as an IPv4 address:

curl http://127.0.0.1:8080/

With the name given as an IPv6 address:

curl http://[fdea::1]:8080/
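To see how the user name, password, host and port sit next to each other in a URL, it can help to split a few of the examples above programmatically. The sketch below uses Python's urllib.parse rather than curl's own parser, so edge cases may differ. The helper name split_authority is made up for this illustration.

```python
from urllib.parse import urlsplit


def split_authority(url: str) -> dict:
    """Return the scheme, credentials, host and port found in a URL.

    Hypothetical helper built on Python's parser, not on curl's.
    """
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        # hostname has the IPv6 square brackets removed.
        "host": parts.hostname,
        # port is None when the URL relies on the protocol's default.
        "port": parts.port,
    }


for url in (
    "ftp://user:password@example.com/",
    "http://127.0.0.1:8080/",
    "http://[fdea::1]:8080/",
):
    print(url, split_authority(url))
```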

Path

Every URL contains a path. If none is given, / is implied. The path is sent to the specified server to identify exactly which resource is requested or provided.

The exact use of the path is protocol dependent. For example, getting a file README from the default anonymous user from an FTP server:

curl ftp://ftp.example.com/README

For the protocols that have a directory concept, ending the URL with a trailing slash means that it is a directory and not a file. Thus asking for a directory list from an FTP server is implied with such a slash:

curl ftp://ftp.example.com/tmp/

FTP type

This is not a feature that is widely used.

URLs that identify files on FTP servers have a special feature that allows you to also tell the client (curl in this case) which file type the resource is. This is because FTP is a little special: it can change mode for a transfer and thus handle the file differently than it would in another mode.

You tell curl that the FTP resource is an ASCII type by appending ;type=A to the URL. Getting the foo file from example.com's root directory using ASCII could then be made with:

curl "ftp://example.com/foo;type=A"

And while curl defaults to binary transfers for FTP, the URL format allows you to also specify the binary type with type=I:

curl "ftp://example.com/foo;type=I"

Finally, you can tell curl that the identified resource is a directory if the type you pass is D:

curl "ftp://example.com/foo;type=D"

…this can then work as an alternative format, instead of ending the path with a trailing slash as mentioned above.
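The three type suffixes can be summarized in a short sketch that splits the suffix off a path. The function name split_ftp_type and the descriptions in FTP_TYPES are illustrative assumptions. This is not how curl implements it internally.

```python
# Type codes described above and what they ask for.
FTP_TYPES = {"A": "ASCII", "I": "binary", "D": "directory listing"}


def split_ftp_type(path: str):
    """Split a trailing ';type=X' suffix off an FTP URL path.

    Returns (path_without_suffix, type_code), with type_code None when
    no recognized suffix is present. Hypothetical helper.
    """
    base, sep, suffix = path.rpartition(";type=")
    if sep and suffix.upper() in FTP_TYPES:
        return base, suffix.upper()
    return path, None


print(split_ftp_type("/foo;type=A"))  # ('/foo', 'A')
print(split_ftp_type("/foo"))         # ('/foo', None)
```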

Fragment

URLs offer a fragment part. That's usually seen as a hash symbol (#) followed by a name for a specific position within a web page in browsers. curl supports fragments fine when a URL is passed to it, but the fragment part is never actually sent over the wire, so it makes no difference to curl's operations whether it is present or not.
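As a small illustration of the fragment being a client-side part of the URL, Python's urllib.parse.urldefrag separates it from the rest. The URL here is a made-up example, and this shows Python's parser, not curl's.

```python
from urllib.parse import urldefrag

# Split the URL into the part a client would request and the fragment.
url, fragment = urldefrag("http://example.com/page.html#section-2")

print(url)       # http://example.com/page.html
print(fragment)  # section-2
```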

Browsers' address bar

It is important to realize that when you use a modern web browser, the address bar it tends to feature at the top of its main window is not using URLs or even URIs. Address bars in fact mostly use IRIs, which are a superset of URIs that allows internationalization such as non-Latin symbols. Browsers usually go beyond that, too: they tend to, for example, handle spaces and do magic things with percent encoding in ways that none of the mentioned specifications say a client should.

The address bar is quite simply an interface for humans to enter and see URI-like strings.

Sometimes the differences between what you see in a browser's address bar and what you can pass in to curl are significant.

Many options and URLs

As mentioned above, curl supports hundreds of command-line options and it also supports an unlimited number of URLs. If your shell or command-line system supports it, there's really no limit to how long a command line you can pass to curl.

curl will parse the entire command line first, apply the wishes from the command-line options used, and then go over the URLs one by one (in a left to right order) to perform the operations.

For some options (for example -o or -O that tell curl where to store the transfer), you may want to specify one option for each URL on the command line.

curl will return an exit code for its operation on the last URL used. If you instead want curl to exit with an error on the first URL in the set that fails, use the --fail-early option.

Separate options per URL

In previous sections we described how curl always parses all options in the whole command line and applies those to all the URLs that it transfers.

That was a simplification: curl also offers an option (-:, --next) that inserts a boundary between a set of options and URLs for which it will apply the options. When the command-line parser finds a --next option, it applies the following options to the next set of URLs. The --next option thus works as a divider between a set of options and URLs. You can use as many --next options as you please.

As an example, we do an HTTP GET to a URL and follow redirects, then make an HTTP POST to a second URL, and round it off with a HEAD request to a third URL. All in a single command line:

curl --location http://example.com/1 --next \
  --data sendthis http://example.com/2 --next \
  --head http://example.com/3

Trying something like that without the --next options on the command line would generate an illegal command line since curl would attempt to combine both a POST and a HEAD:

Warning: You can only select one HTTP request method! You asked for both POST
Warning: (-d, --data) and HEAD (-I, --head).
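One way to picture what --next does is to split the argument list into groups at each divider, so that each group's options only meet that group's URLs. The sketch below is a conceptual model, not curl's parser, and the function name split_on_next is made up. As a simplification, it would also split on an option value that happens to be the literal text --next.

```python
def split_on_next(args):
    """Split a curl-style argument list into groups divided by --next or -:.

    Hypothetical helper for illustration; it does not interpret options.
    """
    groups, current = [], []
    for arg in args:
        if arg in ("--next", "-:"):
            # A divider closes the current group and starts a new one.
            groups.append(current)
            current = []
        else:
            current.append(arg)
    groups.append(current)
    return groups


args = [
    "--location", "http://example.com/1", "--next",
    "--data", "sendthis", "http://example.com/2", "--next",
    "--head", "http://example.com/3",
]
for group in split_on_next(args):
    print(group)
```

In this model, --data and --head end up in different groups, which is why the combined command line no longer asks for two request methods at once.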

Connection reuse

Setting up a TCP connection and especially a TLS connection can be a slow process, even on high bandwidth networks.

It can be useful to remember that curl has a connection pool internally which keeps previously used connections alive and around for a while after they were used so that subsequent requests to the same hosts can reuse an already established connection.

Of course, they can only be kept alive for as long as the curl tool is running. It is a good reason for trying to get several transfers done within the same command line instead of running several independent curl command line invocations.

Do the transfers in parallel

The default behavior of getting the specified URLs one by one in a serial fashion makes it easy to understand exactly when each URL is fetched but it can be slow.

Since version 7.66.0, curl offers the -Z (or --parallel) option that instead instructs curl to attempt to do the specified transfers in a parallel fashion. When this is enabled, curl will do a lot of transfers simultaneously instead of serially. It will do up to 50 transfers at the same time and as soon as one of them has completed, the next one will be kicked off.

For cases where you want to download many files from different sources and a few of them might be slow, a few fast, this can speed things up tremendously.

If 50 parallel transfers is not the right number for you, the --parallel-max option lets you change it.
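The scheduling behavior described above (a capped number of transfers at once, with the next one starting as soon as a slot frees up) resembles a bounded worker pool. The following sketch models that with Python threads and sleeping stand-ins for transfers. It says nothing about how curl implements parallelism internally, and run_capped and fake_transfer are names invented for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_capped(jobs, max_parallel=50):
    """Run callables with at most max_parallel active at once.

    Returns (results, peak), where peak is the highest number of jobs
    observed running at the same time. Hypothetical helper.
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def tracked(job):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        try:
            return job()
        finally:
            with lock:
                state["active"] -= 1

    # The pool starts a queued job as soon as a worker becomes free.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(tracked, jobs))
    return results, state["peak"]


def fake_transfer(n):
    # Stand-in for a real transfer: sleep briefly, then return the index.
    def job():
        time.sleep(0.01)
        return n
    return job


results, peak = run_capped([fake_transfer(i) for i in range(10)], max_parallel=3)
print(results)
print("peak concurrency:", peak)
```

Lowering max_parallel in this sketch plays the role that --parallel-max plays on the curl command line.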

Parallel transfer progress meter

Naturally, the ordinary progress meter display that shows file transfer progress for a single transfer is not that useful for parallel transfers so when curl performs parallel transfers, it will show a different progress meter that displays information about all the current ongoing transfers in a single line.