Download File only when Changed using wget/curl

2024-01-13

I am working on a ChatGPT-based product that needs to download the cl100k_base.tiktoken file over HTTP in order to count the tokens consumed by the streaming API. Although the tiktoken file is not very big (1.7M), its content rarely changes, so there is no need to waste network traffic downloading the entire file every time. In this article, I am going to share how to download a file only if its content has changed, using wget or curl.

HTTP conditional requests

First, let’s cover some basics of the HTTP protocol. HTTP has a concept of conditional requests [1], which let us fetch a remote resource only when certain conditions are met.

When we download a file from an HTTP server, it responds with the file content plus some additional metadata in the headers. We can also fetch this metadata without downloading the whole file. For example, the following curl command sends a HEAD request, and the server responds with only the metadata in the headers.

curl -s -I https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
HTTP/1.1 200 OK
Content-Length: 1681126
Content-Type: application/octet-stream
Content-MD5: sc9JBYqKfuVJ7//VcUymgw==
Last-Modified: Wed, 14 Dec 2022 23:22:53 GMT
ETag: 0x8DADE2A203B60B6
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: f25d6dec-401e-004d-4ad1-45c663000000
x-ms-version: 2009-09-19
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Date: Sat, 13 Jan 2024 03:35:05 GMT

There are many headers in the response, of which Last-Modified and ETag are the ones used for conditional requests.
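To pick those two headers out of a response like the one above, a small helper can be enough. This is only a sketch: the function name header_value is made up for illustration, and it assumes the raw header lines arrive on standard input.

```shell
# Print the value of one response header (case-insensitive name match).
# Reads raw header lines on stdin, e.g. the output of `curl -s -I URL`.
# header_value is a hypothetical helper name, not part of curl.
header_value() {
    name=$1
    # HTTP header lines end with CRLF, so strip the carriage returns first.
    tr -d '\r' | awk -v want="$name" '
        BEGIN { want = tolower(want) }
        {
            pos = index($0, ":")
            if (pos == 0) next              # status line or blank line
            key = tolower(substr($0, 1, pos - 1))
            if (key == want) {
                value = substr($0, pos + 1)
                sub(/^[ \t]+/, "", value)   # drop leading whitespace
                print value
                exit
            }
        }'
}

# Usage (needs network access):
#   curl -s -I "$url" | header_value ETag
#   curl -s -I "$url" | header_value Last-Modified
```

Header names are case-insensitive in HTTP, which is why the helper lowercases both sides before comparing.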

As the name implies, the Last-Modified header tells the client the date and time when the file was last changed. If the client has already downloaded the file at some earlier time, it can check the modification time of its local copy and only needs to download the new version when the remote file has changed since then.

While Last-Modified works very well for static files, it is not suitable for dynamically generated resources, because they get a new modification time every time you download them.

For this scenario, we can use another mechanism: ETag. You can think of the ETag value as the version of the resource’s content. No matter when the resource was created, its ETag will not change as long as the content remains unchanged. In other words, if the server responds with a new ETag, the client should download the resource and replace the local file.

In theory, we could write a bash script that uses a HEAD request to get the metadata and checks whether we need to download the latest resource. In practice, the HTTP protocol has a built-in mechanism for this.

If the client knows the last modified time of a file and wants the latest version, it can send the request with an If-Modified-Since header, which lets the server check whether the requested resource has been modified after the given date and time. If there is no change, the server responds with 304 Not Modified to tell the client that it is safe to use its local copy. If there are changes, the server responds with the new content along with new Last-Modified and ETag values.

If the client knows the last ETag value, it can request the resource with this value in the If-None-Match header. The server returns a 200 response with the whole content only if the remote ETag differs from the client’s, which means there is new content to fetch. Otherwise, the server responds with 304 Not Modified.
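The If-None-Match exchange can also be tried by hand before reaching for any tool-specific flags. The sketch below sets the header with curl's generic -H option and reads the status code through -w '%{http_code}'. The function name check_etag is hypothetical, and the body is thrown away because the point here is only the 200 versus 304 distinction.

```shell
# Ask the server whether the resource still matches a known ETag.
# Prints "unchanged" for 304, "changed" for 200, "error <code>" otherwise.
# check_etag is a hypothetical helper; pass the ETag exactly as the
# server sent it (including quotes, if it used any).
check_etag() {
    url=$1
    etag=$2
    # -o /dev/null discards the body, -w prints only the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' \
        -H "If-None-Match: $etag" "$url")
    case $code in
        304) echo "unchanged" ;;
        200) echo "changed" ;;
        *)   echo "error $code" ;;
    esac
}

# Usage (needs network access):
#   check_etag "$url" 0x8DADE2A203B60B6
```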

With the theory covered, let us put it into practice.

wget

wget is a simple CLI utility for downloading files over HTTP. All you need to do is run wget with the file URL, for example:

wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken

wget will download the remote file and save it to the current directory under the base name of the URL. In the above example, the local file name will be cl100k_base.tiktoken.

If you repeat the above command, wget will download the file again and save it as cl100k_base.tiktoken.1.

We can use the -N flag (timestamping) to make wget send a conditional request:

wget --debug -N https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
...
---request begin---
GET /encodings/cl100k_base.tiktoken HTTP/1.1
Host: openaipublic.blob.core.windows.net
If-Modified-Since: Wed, 14 Dec 2022 23:22:53 GMT
User-Agent: Wget/1.21.2
Accept: */*
Accept-Encoding: identity
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 304 The condition specified using HTTP conditional header(s) is not met.
Content-Length: 0
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 290c8897-601e-0017-2bd6-45a084000000
x-ms-version: 2009-09-19
Date: Sat, 13 Jan 2024 04:12:11 GMT

---response end---
304 The condition specified using HTTP conditional header(s) is not met.
Registered socket 3 for persistent reuse.
File ‘cl100k_base.tiktoken’ not modified on server. Omitting download.

I also set the --debug flag so that wget prints more debug information. We can see that wget sent the If-Modified-Since header and the server responded with the 304 status code, which means there is no need to download the remote file.

It works like a charm. However, the -N flag is not compatible with the -O option. If you specify the name of the output file, wget does not use its modification time and will always download the entire file. So if wget is your only choice, you should not use the -O option.

If you can use curl, it does a better job.
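If a custom file name is still needed with wget, one possible workaround is to let wget keep the server's file name inside a separate directory and copy the result afterwards. This is a sketch, not something wget offers out of the box: wget_as and the default .wget-cache directory are made-up names, and it assumes the URL has no query string, so that basename yields the name wget saves under.

```shell
# Download with timestamp checking into a cache directory, then expose the
# file under a custom name. wget_as and the cache directory name are
# hypothetical; adjust them to taste.
wget_as() {
    url=$1
    dest=$2
    cache_dir=${3:-.wget-cache}
    mkdir -p "$cache_dir" || return 1
    # -N sends the conditional request; -P picks the directory to save into,
    # so wget still sees its own previously downloaded copy.
    wget -q -N -P "$cache_dir" "$url" || return 1
    # cp -p keeps the modification time of the cached copy.
    cp -p "$cache_dir/$(basename "$url")" "$dest"
}

# Usage (needs network access):
#   wget_as https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken cl100k.tiktoken
```

The copy is made on every run, but that is local disk work; the network transfer only happens when the server reports a newer file.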

curl

In curl, you can use the -z flag to specify the date and time used for If-Modified-Since:

curl -s -z cl100k.tiktoken https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -o cl100k.tiktoken

curl takes the modification time from the file specified by the -z flag. If the file does not exist, curl prints some warnings but continues the download:

Warning: Failed to get filetime: No such file or directory
Warning: Illegal date format for -z, --time-cond (and not a file name).
Warning: Disabling time condition. See curl_getdate(3) for valid date syntax.

In the previous example, I intentionally used the -o flag to customize the name of the local file.

When we repeat the command, curl sends a conditional request:

curl --verbose -s -z cl100k.tiktoken https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -o cl100k.tiktoken
...
> GET /encodings/cl100k_base.tiktoken HTTP/1.1
> Host: openaipublic.blob.core.windows.net
> User-Agent: curl/8.4.0
> Accept: */*
> If-Modified-Since: Sat, 13 Jan 2024 04:22:53 GMT
>
< HTTP/1.1 304 The condition specified using HTTP conditional header(s) is not met.
< Content-Length: 0
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: 0d8fcb6b-e01e-0009-7ed8-454c5c000000
< x-ms-version: 2009-09-19
< Date: Sat, 13 Jan 2024 04:26:59 GMT
<
...

You will see the If-Modified-Since header in the request message.
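The warnings shown earlier for a missing -z file can be avoided by only passing -z once a local copy exists. A minimal sketch, with fetch_if_newer as a hypothetical function name:

```shell
# Download url to dest, sending If-Modified-Since only when dest exists.
# fetch_if_newer is a hypothetical helper name.
fetch_if_newer() {
    url=$1
    dest=$2
    if [ -f "$dest" ]; then
        # Local copy exists: its modification time becomes the condition.
        curl -fsS -z "$dest" -o "$dest" "$url"
    else
        # First download: no time condition, so no warnings.
        curl -fsS -o "$dest" "$url"
    fi
}

# Usage (needs network access):
#   fetch_if_newer https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken cl100k.tiktoken
```

The -f flag makes curl exit with an error on HTTP failures, and -sS keeps it quiet while still printing error messages. Adding -R (--remote-time) would additionally stamp the local file with the server's modification time, which may help if the local and server clocks disagree.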

curl is more powerful than wget because it supports conditional requests with ETag. However, to use ETag, we need to store the previous ETag value that the server returned.

For Last-Modified, we can use the local modification time from the file system. But the ETag cannot be saved without additional storage. In curl, you can use the --etag-save option to specify a file in which to store the ETag of the downloaded file.

For example:

curl --etag-save etags.txt https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -O

curl will download the file and save the value of ETag into etags.txt:

cat etags.txt
0x8DADE2A203B60B6

To send the conditional request, pass the file that holds the previous ETag value to --etag-compare:

curl -O -s --verbose --etag-compare etags.txt https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
...
> GET /encodings/cl100k_base.tiktoken HTTP/1.1
> Host: openaipublic.blob.core.windows.net
> User-Agent: curl/8.4.0
> Accept: */*
> If-None-Match: 0x8DADE2A203B60B6
>
< HTTP/1.1 304 The condition specified using HTTP conditional header(s) is not met.
< Content-Length: 0
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: f39976be-f01e-002a-52d9-45d69f000000
< x-ms-version: 2009-09-19
< Date: Sat, 13 Jan 2024 04:34:26 GMT
<
...

You can see that curl sends the request with the If-None-Match header and the server responds with the 304 status.

Even though curl can store the ETag in an additional file, be aware that this file holds only one value, for the current URL. If you need to download multiple files, you need to provide a different file to store each one’s ETag.
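One way to keep a separate ETag file per download is to derive its name from the output file. The sketch below assumes a naming convention of my own (the output name plus a .etag suffix) and a hypothetical function name fetch_with_etag. It only sends the condition when both the downloaded file and a non-empty ETag file are present, so a deleted download is fetched again instead of getting a 304.

```shell
# Download url to dest, keeping its ETag next to it in "$dest.etag".
# fetch_with_etag and the .etag suffix are hypothetical conventions.
fetch_with_etag() {
    url=$1
    dest=$2
    etag_file="$dest.etag"
    if [ -f "$dest" ] && [ -s "$etag_file" ]; then
        # Both the file and its stored ETag exist: make the request
        # conditional and refresh the stored ETag if new content arrives.
        curl -fsS --etag-compare "$etag_file" --etag-save "$etag_file" \
            -o "$dest" "$url"
    else
        # Nothing to compare against yet: plain download, remember the ETag.
        curl -fsS --etag-save "$etag_file" -o "$dest" "$url"
    fi
}

# Usage (needs network access):
#   fetch_with_etag https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken cl100k.tiktoken
```

How curl treats the ETag file on a 304 response has changed between releases, so it is worth checking the behavior of the installed version before relying on this in automation.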

Conclusion

We can use both wget and curl to make HTTP conditional requests, downloading a file only after it has changed and thereby saving network traffic. curl is the more powerful of the two because it also supports ETag. Besides, conditional requests are widely used by CDNs to revalidate stale cached responses instead of transferring them again.


  1. https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests