- Type: Improvement
- Resolution: Won't Fix
- Priority: Minor
- Affects Version/s: master
- Component/s: CLI: pegasus-s3
Now that we have the -recursive option for put and get, it is important to be able to recover if a failure occurs partway through transferring a large set of keys. For example, if you are in the middle of downloading 100 keys of 100 MB each from S3, you don't want to download the first 50 again when you retry.
Really, what we would like is something similar to rsync for S3.
One solution could be to add a --sync option for put and get. The logic would be: if the destination already exists and has the same size as the source, don't download the file again. This check is not perfectly accurate, and it gets a little complicated for ranged downloads, which create a sparse file, but it would probably work for most cases.
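A minimal sketch of the size-comparison idea, assuming boto3 as the client library; the helper names (should_skip_download, sync_get) are hypothetical and not part of pegasus-s3:

    import os
    import boto3

    s3 = boto3.client("s3")

    def should_skip_download(bucket, key, local_path):
        """Return True if local_path already exists and matches the object's size."""
        if not os.path.isfile(local_path):
            return False
        head = s3.head_object(Bucket=bucket, Key=key)
        # Same size is treated as "already downloaded"; this is the approximation
        # described above and can be fooled by a partially written sparse file.
        return os.path.getsize(local_path) == head["ContentLength"]

    def sync_get(bucket, key, local_path):
        if should_skip_download(bucket, key, local_path):
            return  # skip: destination exists with the same size as the source
        s3.download_file(bucket, key, local_path)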
Another option would be to use the content hashes (S3 stores the MD5 and returns it as the ETag), but that means reading the files off disk to recompute the hash. It also gets complicated for multipart uploads.
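A sketch of the ETag comparison, again assuming boto3 and a hypothetical etag_matches helper. For a single-part upload the ETag is the hex MD5 of the object; for a multipart upload it has the form "md5-of-part-md5s"-"part count", so the simple comparison below does not apply and would need the part layout to reproduce:

    import hashlib
    import boto3

    s3 = boto3.client("s3")

    def local_md5(path, chunk_size=1024 * 1024):
        """MD5 of a local file, read in chunks to avoid loading it all into memory."""
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
        return md5.hexdigest()

    def etag_matches(bucket, key, local_path):
        etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
        if "-" in etag:
            # Multipart upload: ETag is not a plain MD5, so this check cannot decide.
            return False
        return etag == local_md5(local_path)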