Following on from my earlier post, I wanted to jot down a few notes and ideas about binary caching before I forget!
Roadmap
Long-term, some sort of p2p-based system seems ideal to me, but before then I think a smaller, simpler system for use by a more trusted group of devs and tinkerers might be helpful and more achievable.
(Ultimately there is a trust/security issue with a larger-scale p2p system: one needs to be sure that a downloaded binary is a correct build of the requested package. That would mean working on fully reproducible builds, and then perhaps some sort of database where multiple builders could report their hash of the build, with the client choosing to place trust once a configured quorum had been reached.)
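To make the quorum idea concrete, here's a tiny sketch of what the client-side check could look like. None of this corresponds to any existing service - the report structure, names, and threshold are all made up for illustration:

```python
from collections import Counter

# Hypothetical data: each independent builder reports the hash it got
# when building the same derivation. Nothing here maps to a real
# service - it's just the quorum idea written down.
REQUIRED_QUORUM = 3

def trusted_hash(reports: dict[str, str], quorum: int = REQUIRED_QUORUM) -> str | None:
    """Return a build hash once at least `quorum` independent builders agree.

    `reports` maps a builder's identity to the hash it observed.
    """
    if not reports:
        return None
    nar_hash, votes = Counter(reports.values()).most_common(1)[0]
    return nar_hash if votes >= quorum else None

# Example: three of four builders agree, so that hash would be trusted.
reports = {
    "builder-a": "sha256:1111...",
    "builder-b": "sha256:1111...",
    "builder-c": "sha256:2222...",  # a non-reproducible (or malicious) build
    "builder-d": "sha256:1111...",
}
assert trusted_hash(reports) == "sha256:1111..."
```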
Frontend - attic?
For the front-end, so far I've looked at and experimented a bit with attic - it seems to be aimed at smaller groups, but that might be OK for a while. It can also filter out objects that are present in a designated upstream cache, which might allow us to piggy-back on the NixOS cache and only store packages where they have changed in Aux.
My mini server uses an SMR drive with access cached through an SSD with bcache, so while performance will not be great it shouldn't be totally terrible either. Unfortunately, with the default chunk sizes and sqlite backend, attic's usage patterns seem to cause a huge number of fsync calls, which my system hates. A quick test with the postgres backend suggests that will be quite a lot better even on the same storage stack.
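For reference, these knobs live in atticd's server.toml. A sketch below - the database URL and storage path are placeholders, and the chunking values are what I believe are the documented defaults:

```toml
[database]
# sqlite is the default; postgres was much kinder to my storage stack
url = "postgres://atticd@localhost/attic"

[storage]
type = "local"
path = "/var/lib/atticd/storage"

[chunking]
# minimum NAR size to trigger chunking; setting this to 0 apparently
# disables chunking entirely (relevant to the Tahoe-LAFS note below)
nar-size-threshold = 65536  # 64 KiB
min-size = 16384            # 16 KiB
avg-size = 65536            # 64 KiB
max-size = 262144           # 256 KiB
```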
Backend / storage
For smaller-scale use by a group of devs I was thinking we could probably get away with a single front-end machine initially (though that would be a SPOF), but we would probably need distributed storage given how quickly Nix caches tend to explode in size. I'm not sure how much anyone hosting a storage node would need to trust the other nodes or any central controller; I guess a solution with a simple API would be a benefit here.
Storj
Storj is all a bit blockchain-y, but it seems it is actually self-hostable, though there don't seem to be huge amounts of docs around that. Given the likely complexity, I've not looked at it in much detail - this might be something for further down the line when we want to move to a wider-access public cache?
Tahoe-LAFS
@corbin on fedi warned me off this based on significant past experience, but low required trust between clients and storage nodes seems to be part of its design, so I was curious whether it would be of any use for a small-scale cache, even a personal one.
I have got a PoC up and running locally using a rather wobbly stack of attic → rclone serve s3 → sftp → tahoe-lafs. Unfortunately rclone's S3 server dumps all the blobs in a single directory, which Tahoe seems to hate: a small upload eventually grinds to a halt with Tahoe using 100% CPU.
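For anyone wanting to reproduce the wobble, the moving parts were roughly as follows - remote names, ports, and keys are placeholders, and note that rclone's S3 server is still flagged as experimental:

```sh
# The Tahoe node exposes its grid over SFTP (enabled via the [sftpd]
# section of tahoe.cfg), and "tahoe:" below is an rclone sftp remote
# pointing at that. rclone then re-exports it as a local S3 endpoint:
rclone serve s3 tahoe:attic-store \
    --auth-key ACCESS_KEY,SECRET_KEY \
    --addr 127.0.0.1:9000

# atticd is then pointed at that endpoint via its server.toml:
#   [storage]
#   type = "s3"
#   region = "us-east-1"   # placeholder value
#   bucket = "attic-store"
#   endpoint = "http://127.0.0.1:9000"
```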
It does have a simple REST API, though, so if there were good reasons to use it for anything, it might be possible to write something that skips some layers of the above stack and stores objects more sanely.
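As a taste of how simple that API is, here's a minimal sketch of storing and fetching a blob via Tahoe's web API, assuming a local node on the default web port (the payload is obviously a placeholder):

```python
# Minimal sketch of talking to Tahoe's web API directly, skipping the
# s3/sftp layers of the stack above.
import requests

TAHOE = "http://127.0.0.1:3456"

def put_blob(data: bytes) -> str:
    """Upload an immutable blob; Tahoe returns a capability string,
    which is the only handle needed to read it back."""
    resp = requests.put(f"{TAHOE}/uri", data=data)
    resp.raise_for_status()
    return resp.text.strip()

def get_blob(cap: str) -> bytes:
    resp = requests.get(f"{TAHOE}/uri/{cap}")
    resp.raise_for_status()
    return resp.content

cap = put_blob(b"hello, grid")
assert get_blob(cap) == b"hello, grid"
```

A real cache layer would of course still need its own index mapping object names to the capabilities Tahoe hands back, since caps are content-addressed handles rather than names.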
Edit: sadly the performance problems seem to be inherent; disabling attic's chunking helps a bit, but almost certainly not enough to make Tahoe useful, IMO.
Other options - distributed S3 stores
These might require a bit more trust between operators, but that might be OK for small-scale cache sharing between devs etc. I've not looked at any of them in any detail yet.
Other options - "sharded" generic S3 stores
Not sure how viable this is, but it looks like there might be one or two options for presenting a unified view of multiple generic S3 stores (obviously one would also need to be able to store data, not just read it). This would allow using any S3 implementation whether it supports distributed operation or not; there's a rough sketch of the idea below.
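To make that concrete, here's a very rough sketch of deterministic sharding by key hash across several independent S3 endpoints - boto3, the endpoints, and bucket names are all just placeholders for whatever a real unifying layer would use:

```python
# Rough sketch of "sharding" objects across multiple independent S3
# stores by hashing the object key. A real unifying layer would also
# need replication, health checks, rebalancing, etc.
import hashlib
import boto3

SHARDS = [
    {"endpoint": "https://s3.node-a.example", "bucket": "cache"},
    {"endpoint": "https://s3.node-b.example", "bucket": "cache"},
    {"endpoint": "https://s3.node-c.example", "bucket": "cache"},
]

def shard_for(key: str) -> dict:
    """Deterministically pick a shard so every client agrees where a key lives."""
    digest = hashlib.sha256(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

def put_object(key: str, body: bytes) -> None:
    shard = shard_for(key)
    s3 = boto3.client("s3", endpoint_url=shard["endpoint"])
    s3.put_object(Bucket=shard["bucket"], Key=key, Body=body)
```

(A real implementation would presumably want consistent hashing rather than simple modulo, so that adding or removing a node doesn't reshuffle every key.)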
Do feel free to mention other ideas and options in the comments - I can edit this post, or it can be turned into a wiki or something. BTW, this is very much not something I am experienced in - I fell out of commercial IT well before the advent of cloud computing! So anyone with more experience is welcome to correct me on anything, or to take over these ideas and run with them if they are interested. I will continue my own small-scale experiments as time/energy/health permit, and I'm happy to spin up small containers, VMs, etc. here if it helps to test someone else's proof-of-concept or whatever.