Imagine you have a huge zip archive stored somewhere in the cloud, say in an S3 bucket, and you need just a few specific files from it. What do you do? Like everyone else, you download the entire 32 GB and unzip the whole thing, all to retrieve 3 miserable files…
Well, guess what? I’ve found a nifty tool that will make your life easier: Cloudzip! It allows you to mount your remote zip archive directly on your machine, like an external hard drive, so you can access the files you need, copy them, use them, all without having to download the entire archive.
For example, to list the contents of a remote archive without downloading it:
cz ls s3://example-bucket/path/to/archive.zip
Pretty cool, right?
The way Cloudzip works is quite ingenious. It is based on two simple but incredibly effective principles:
- Zip files allow random read access. They have a “central directory” stored at the end of the archive that describes all the contained files, with their offsets. No need to read the entire archive to find a file.
- Most HTTP servers and cloud storage services (S3, Google Cloud Storage, Azure Blob Storage, etc.) support HTTP requests with “range” headers. Basically, this allows you to fetch only a part of a remote file.
By combining these two principles, Cloudzip can retrieve just the central directory of your zip archive (usually tiny compared with the archive itself, often only a few KB) to get the list of files, and then download only the byte ranges of the files you actually access!
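To make the idea concrete, here is a minimal Python sketch of the same trick. To be clear, this is not Cloudzip's actual code (the tool is written in Go), and the names `make_archive`, `RangeReader`, `list_entries` and `read_file` are made up for this illustration. The "remote" archive is simulated in memory so the example runs offline; with a real server, `RangeReader.read` would send an HTTP request with a `Range: bytes=start-end` header instead. The sketch assumes a plain (non-Zip64, unencrypted) archive with stored or deflated entries and UTF-8 file names.

```python
import io
import struct
import zipfile
import zlib


def make_archive() -> bytes:
    """Build a zip in memory to stand in for the remote archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # A large uncompressed entry so the archive is much bigger than its tail.
        zf.writestr("big.bin", bytes(200_000), compress_type=zipfile.ZIP_STORED)
        zf.writestr("data/a.txt", "alpha " * 100)
        zf.writestr("readme.md", "hello from inside the zip")
    return buf.getvalue()


class RangeReader:
    """Hypothetical stand-in for HTTP range requests: returns only the bytes asked for."""

    def __init__(self, blob: bytes):
        self._blob = blob
        self.size = len(blob)
        self.bytes_fetched = 0  # how much we "downloaded" so far

    def read(self, offset: int, length: int) -> bytes:
        chunk = self._blob[offset:offset + length]
        self.bytes_fetched += len(chunk)
        return chunk


def list_entries(remote: RangeReader) -> dict:
    """Fetch the tail of the archive, locate the central directory, and parse it."""
    # The end-of-central-directory record is 22 bytes plus a comment of at most 65535 bytes.
    tail_len = min(remote.size, 22 + 65535)
    tail = remote.read(remote.size - tail_len, tail_len)
    eocd = tail.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("end-of-central-directory record not found")
    cd_size, cd_offset = struct.unpack_from("<II", tail, eocd + 12)

    cd = remote.read(cd_offset, cd_size)
    entries = {}
    pos = 0
    while pos + 46 <= len(cd) and cd[pos:pos + 4] == b"PK\x01\x02":
        fields = struct.unpack_from("<4s6H3L5H2L", cd, pos)
        method, compressed_size = fields[4], fields[8]
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        local_header_offset = fields[16]
        name = cd[pos + 46:pos + 46 + name_len].decode("utf-8")
        entries[name] = (local_header_offset, compressed_size, method)
        pos += 46 + name_len + extra_len + comment_len
    return entries


def read_file(remote: RangeReader, entry: tuple) -> bytes:
    """Fetch and decompress a single entry using only its byte range."""
    local_header_offset, compressed_size, method = entry
    # The local header is 30 bytes; its name/extra lengths can differ from the central ones.
    header = remote.read(local_header_offset, 30)
    name_len, extra_len = struct.unpack_from("<HH", header, 26)
    raw = remote.read(local_header_offset + 30 + name_len + extra_len, compressed_size)
    if method == 0:  # stored
        return raw
    if method == 8:  # deflate (raw stream, hence negative wbits)
        return zlib.decompress(raw, -15)
    raise NotImplementedError(f"unsupported compression method {method}")


if __name__ == "__main__":
    remote = RangeReader(make_archive())
    entries = list_entries(remote)
    print(sorted(entries))
    print(read_file(remote, entries["readme.md"]))
    print(f"fetched {remote.bytes_fetched} of {remote.size} bytes")
```

The sketch fetches the largest possible tail in one go to keep things simple. A smarter client could start with a much smaller read and only grow it if the end-of-central-directory record is not in there.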
To install:
git clone https://github.com/ozkatz/cloudzip.git
cd cloudzip
go build -o cz main.go
Then copy the cz binary to a location accessible via your $PATH:
cp cz /usr/local/bin/
Where it gets even crazier (oops, I meant “interesting”) is the mount command: Cloudzip can actually mount your remote zip archive as a local directory. It starts a small local NFS server and mounts that NFS share on the folder of your choice.
Another example:
cz mount s3://example-bucket/path/to/archive.zip some_dir/
This way, you have access to all your files as if they were local: you can open them directly in your applications and process them, all without ever downloading the entire archive.
And the best part of all this is that Cloudzip works with almost all remote storages you can imagine. Of course, there’s S3, but also HTTP, HTTPS, GCS, Azure, and even… drumroll… Kaggle!
Ah, Kaggle, that haven of data scientists where datasets are larger than a Bitcoin miner’s electricity bill… Cloudzip can use Kaggle’s API to read a dataset’s zip directly, without having to download it in full. You can literally mount a Kaggle dataset locally and start working on it in seconds. And if you ever need a particular file to test something, no problem, it will be downloaded on demand.
Of course, it has its limits. The NFS mount, for example, is only available on Linux and macOS for now. And don’t expect crazy performance either, since every read still means downloading file segments over the network. But for all those cases where you need to access a few files in a huge zip archive, it’s ideal!
And besides, it’s open source (you didn’t think I would recommend a proprietary tool, did you?). You can find the project on GitHub.