Related to the discussion about a binary cache, I went off for a few days and tried to find out how to use IPFS as a binary cache. It was nearly successful, but ultimately failed due to a bug in IPFS itself.
nix-subsubd (nss): a proxy for nix binary cache substituters
At the moment, the only target for the binary cache is IPFS, but that can be expanded later.
Status
This idea is currently dead in the water due to a bug in IPFS hindering the retrieval of content from IPFS using just the content's SHA256 hash.
How it works
Substituters allow nix to retrieve prebuilt packages
and insert them into the store. The most used one is https://cache.nixos.org, which is a
centralized cache of build artifacts.
Build artifacts are stored as compressed (xz) NAR (Nix ARchive) files.
nss acts as a proxy between nix and a binary HTTP cache. The flow is simple (a code sketch follows the list):
1. nix requests information about the NARs (narinfo) from the binary cache
2. nss forwards that request to the binary cache
3. the binary cache responds
4. nss stores the information in an SQLite db, with the pertinent information being the SHA256 hashsum of the compressed NAR file (.nar.xz)
5. nix requests the .nar.xz
6. nss tries to retrieve the file from IPFS (this is where it currently fails)
7. if the file isn't on IPFS, nss forwards the request to the binary cache (to be implemented)
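To make steps 4 and 6 concrete, here is a minimal sketch in Python (the project itself is Rust; Python is just shorter here) of recording the FileHash from a narinfo response and later asking a local IPFS gateway for the .nar.xz by turning that SHA256 into a raw-leaf CIDv1. The gateway address and table layout are assumptions, and the raw-leaf CID only matches what IPFS actually stored under particular add settings, which is roughly where the bug mentioned above gets in the way.

import base64
import sqlite3
import urllib.request

DB = sqlite3.connect("nss.sqlite")
DB.execute("CREATE TABLE IF NOT EXISTS nars (url TEXT PRIMARY KEY, sha256 TEXT)")

def record_narinfo(narinfo_text: str) -> None:
    """Store the URL -> FileHash mapping from a narinfo response (step 4)."""
    fields = dict(
        line.split(": ", 1) for line in narinfo_text.splitlines() if ": " in line
    )
    # FileHash looks like "sha256:<hash>"; cache.nixos.org encodes it in nix base32,
    # so a real implementation would convert it to raw bytes before using it.
    DB.execute(
        "INSERT OR REPLACE INTO nars VALUES (?, ?)",
        (fields["URL"], fields["FileHash"].removeprefix("sha256:")),
    )
    DB.commit()

def cid_from_sha256(digest: bytes) -> str:
    """Build a CIDv1 (raw codec, sha2-256) from a raw 32-byte digest."""
    binary = bytes([0x01, 0x55, 0x12, 0x20]) + digest  # version, raw codec, sha2-256, length
    return "b" + base64.b32encode(binary).decode().lower().rstrip("=")

def try_ipfs(digest: bytes) -> bytes:
    """Step 6: ask a local gateway for the content; raises if IPFS can't serve it."""
    url = f"http://127.0.0.1:8080/ipfs/{cid_from_sha256(digest)}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()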
If anybody knows Go and would like to debug IPFS to fix the bug, or can point me to an alternative that:
is distributed
is self-hosted
is preferably non-crypto, but that's not a hard requirement
MOST IMPORTANTLY, is content addressable with SHA256 (it's what cache.nixos.org uses for their .nar.xz)
What do download speeds feel like with IPFS? I have pretty bad internet down under. For many small downloads, can it really match the speeds a torrent might?
The WebUI unfortunately doesn't indicate what is being downloaded, from whom, or how fast, the way torrent clients do.
But, unless I'm mistaken, torrents aren't a viable tech for this proxy, as I don't think it's possible to take the SHA256 hashsum of a file and find a corresponding torrent. I don't know whether it's possible to find a corresponding host using the DHT, though; I'm not sure what it stores. It's either a file-hash-to-host mapping or a file-chunk-to-host mapping, the latter being more likely, since from what I remember the torrent stores the hashes of the chunks, but I might be wrong.
I'd be very glad to be wrong, actually, because that could mean somebody could write a program that queries the DHT for hosts and uses the BitTorrent swarm to download said file, thus repurposing torrents as a content-addressable filesystem.
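For what it's worth, a quick sketch of why the lookup is awkward with classic torrents: the v1 infohash that the DHT is keyed on is the SHA1 of the bencoded info dictionary (name, piece length, per-piece SHA1s), not any hash of the file's bytes, so knowing a .nar.xz's SHA256 gives you nothing to query with. The tiny bencoder below is only enough for this demonstration.

import hashlib

def bencode(value) -> bytes:
    """Minimal bencoding, enough for a single-file info dict."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode())
    if isinstance(value, dict):
        return b"d" + b"".join(
            bencode(k) + bencode(v) for k, v in sorted(value.items())
        ) + b"e"
    raise TypeError(value)

data = open("example.nar.xz", "rb").read()  # any file will do
piece_len = 262144
pieces = b"".join(
    hashlib.sha1(data[i : i + piece_len]).digest()
    for i in range(0, len(data), piece_len)
)
info = {"name": "example.nar.xz", "length": len(data),
        "piece length": piece_len, "pieces": pieces}

infohash = hashlib.sha1(bencode(info)).hexdigest()   # what the DHT is keyed on
file_sha256 = hashlib.sha256(data).hexdigest()       # what the narinfo gives us
print(infohash, file_sha256)                         # unrelated values

(BitTorrent v2 does move to SHA-256, but as a Merkle root over fixed-size blocks of the file, so the plain file hash still isn't a usable lookup key on its own.)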
Curse you @marshmallow, I was supposed to be working on another project, but now this has piqued my interest.
Hah, I was trying hard not to get distracted by "proper" p2p stuff yet, but thanks very much for looking at this.
As someone pointed out on Matrix the other night, we don't necessarily need to aim for fully decentralized yet; a centralized tracker and/or KV store mapping nix-friendly hashes to whatever the underlying tech uses might not be the end of the world. You might even be able to do something mad with a shared git repo for the mappings, at least initially...
Slow download speeds are a reasonable problem, but for devs and "power users" a local hydra instance or similar could probably be abused to effectively prefetch things one might want to use...?
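As a strawman for the shared-git-repo mapping idea above, the store could start out as nothing fancier than a flat file everyone pulls, say one "<nar sha256> <IPFS CID or magnet link>" pair per line; the repo location and file name below are entirely made up:

import subprocess
from pathlib import Path

REPO = Path.home() / "nar-mappings"          # clone of the shared mapping repo (hypothetical)
MAPPING_FILE = REPO / "mappings.txt"          # "<nar sha256> <cid or magnet>" per line

def lookup(nar_sha256: str) -> str | None:
    """Pull the latest mappings and return the P2P address for a .nar.xz hash."""
    subprocess.run(["git", "-C", str(REPO), "pull", "--ff-only"], check=True)
    for line in MAPPING_FILE.read_text().splitlines():
        sha, address = line.split(maxsplit=1)
        if sha == nar_sha256:
            return address
    return None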
That does seem like a good intermediate step. At the moment, I have to focus on another project with friends that I promised I'd do. So if somebody's comfortable with Rust and would like to create a merge request (or fork the project, it's GPLv3) to map the SHA256 of .nar.xz files to content addresses or torrents, it'd be very much appreciated!
That's very similar to what I was thinking. The goal was for users to run a service on the side that would populate IPFS (or whatever other service there is), configured with something similar to this:
{ config, ... }:
{
  services.p2pcache = {
    enabled = true;
    # stuff from the nix store one would like to "donate" to the network
    # naming things is hard
    donations = config.environment.systemPackages;
  };
}
Then, once the service starts, it does whatever's necessary to add stuff to the P2P service, e.g. ipfs add everything in donations.
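A sketch of what that sidecar might do per donated store path, assuming the IPFS route: dump the NAR, compress it, and hand it to the local daemon. One big caveat (and the reason this is only a sketch): for the hash to match what cache.nixos.org advertises in its narinfo, the locally produced .nar.xz would have to be byte-identical to the cache's, which an arbitrary xz invocation doesn't guarantee, so a real service would more likely download and re-add the cache's own files.

import hashlib
import subprocess
import tempfile

def donate(store_path: str) -> tuple[str, str]:
    """Compress a store path's NAR, add it to IPFS, return (sha256, cid)."""
    with tempfile.NamedTemporaryFile(suffix=".nar.xz") as tmp:
        nar = subprocess.run(
            ["nix-store", "--dump", store_path], check=True, capture_output=True
        ).stdout
        tmp.write(subprocess.run(["xz", "-c"], input=nar,
                                 check=True, capture_output=True).stdout)
        tmp.flush()
        sha256 = hashlib.sha256(open(tmp.name, "rb").read()).hexdigest()
        cid = subprocess.run(["ipfs", "add", "--quieter", tmp.name],
                             check=True, capture_output=True, text=True).stdout.strip()
    return sha256, cid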
I had a poke around with BitTorrent today, and successfully created a manky shell script to generate torrent files for all the nars in a given closure from my S3 storage. Unfortunately I failed at the next hurdle: I couldn't get transmission to reliably create a simple swarm between two containers on the same host!
Will pick it up another day and try a different combination of tracker and torrent client. Can't believe this was the step I got stuck at!
Mad, late night, drug-addled thought: if we're contemplating storing store path → NAR file mappings in git, could we use Radicle? It doesn't seem to cope with large repos yet, but that shouldn't be a problem at least initially.
This would give us more decentralisation, and more interestingly, Radicle seems to support a quorum system for accepting changes: Radicle Protocol Guide
This way updates could be checked by multiple builders and only accepted once a certain number agree?
I like the idea of using Radicle. My only concern is packaging it right now. It's flake-only from what I can tell, and for non-flakers like me it wasn't possible to add it to my config (probably a lack of knowledge on my part).
The quorum system does look very interesting for reproducible builds. Nice find.
The advantage here is that nothing is stored on disk by the proxy and memory consumption stays low. The downside is that libtorrent doesn't have up-to-date Rust bindings (it does have Python, Java, Node.js, and Go bindings, though). It could be worked around if a torrent client exists that can output a torrent's content to stdout.
Additionally, I'm not sure how fast sequential downloads are. Maybe chunks from different nodes are buffered in memory and served sequentially?
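For reference, with the Python bindings the sequential part looks roughly like the sketch below (set_sequential_download is the older API spelling, and the torrent file name is made up). Whether the proxy could then stream completed pieces out of memory instead of reading them back from save_path is exactly the open question above.

import time
import libtorrent as lt

ses = lt.session()
handle = ses.add_torrent({
    "ti": lt.torrent_info("example.nar.xz.torrent"),  # hypothetical torrent file
    "save_path": "/tmp/nss",                          # still touches the filesystem
})
handle.set_sequential_download(True)  # fetch pieces in order, front to back

while not handle.status().is_seeding:
    time.sleep(1)                     # a real proxy would stream completed pieces here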
Download then stream using proxy
1. nix calls the proxy
2. the proxy downloads the torrent to the filesystem (can be tmpfs or elsewhere)
3. the proxy streams the downloaded file
The advantage is probably simplicity? Call an external binary to download the torrent, wait until the file exists, serve it, and maybe clean up afterwards (a sketch follows below).
The disadvantage would be (to me) an external dependency and (depending on how it's done) waiting for the download to complete before serving it. The total time to service a request would thus be download_time + serve_time.
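A minimal sketch of that option, using aria2c as the external binary purely as an example; any client that can take a magnet link or .torrent file and exit once the download finishes would do, and the flags here are just one plausible choice:

import subprocess
import tempfile
from pathlib import Path

def fetch_via_torrent(magnet_or_torrent: str) -> bytes:
    """Download the torrent's single file to a temp dir and return its bytes."""
    with tempfile.TemporaryDirectory() as tmpdir:
        subprocess.run(
            ["aria2c", "--seed-time=0", "-d", tmpdir, magnet_or_torrent],
            check=True,
        )
        # Assumes the torrent contains exactly one file (the .nar.xz).
        downloaded = next(p for p in Path(tmpdir).iterdir() if p.is_file())
        return downloaded.read_bytes()
# The proxy would then write these bytes straight into its HTTP response to nix.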
Pre-eval config
By evaluating nix ahead of time and collecting the store paths, one could download all the content, then use @srtcd424's findings to copy them into the store. Once it's time for nix to "build", it should detect what is already in the store and go on its merry way to generate the rest of the system configuration (or whatever it is one evaluated).
I don't know how to get nix to spit out all the packages it will have to download to the store. Maybe there's some kind of dry-run mode? But if that's solved, maybe this is the simplest option for somebody who can figure out the first step of nix eval.
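There is in fact a --dry-run flag on nix-store that prints, without building anything, which paths would be built or substituted, so the first step could look something like the sketch below. The output is meant for humans, so scraping store paths out of stderr with a regex is a hack, and the nixpkgs attribute is just an example.

import re
import subprocess

def build_plan_paths(attr: str = "hello") -> list[str]:
    """Instantiate a nixpkgs attribute and list every store path in the dry-run plan."""
    drv = subprocess.run(
        ["nix-instantiate", "<nixpkgs>", "-A", attr],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    dry_run = subprocess.run(
        ["nix-store", "--realise", "--dry-run", drv],
        capture_output=True, text=True,
    )
    # --dry-run reports its plan on stderr; pull out anything that looks like a store path.
    return re.findall(r"/nix/store/\S+", dry_run.stderr)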
Thanks for looking into this @srtcd424! Gives me hope that there won't be any need to fumble about with nix's source code and try to get a PR merged into the repo, or maintain a patched version thereof. It's all mostly (completely?) out of band.
This is next on my list to look at - a simple "subscription" shell script / daemon / cron job that can take a list of closures we're always interested in, e.g. stdenv, minimal, <your favourite language toolchain here> etc., and grab them in the middle of the night should be straightforward but also enough to be useful. And then that proof of concept might inspire people with more brain than me to make something more realtime work.
Which is possibly a little non-trivial. If the whole closure we want is already available, we can chase through the narinfo References: lines; if it's not, I guess we would have to settle for the build inputs, which are obviously going to be a superset of what we need. Suboptimal, but probably good enough for the moment.
Anyway, I have some disgusting Python to do the narinfo chasing. Next step is to wire it up to aria2 or something...
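For anyone curious what the narinfo chasing looks like, a bare-bones version is only a few lines: start from a store path's hash part, fetch its .narinfo from the cache, queue up everything in References:, and collect the nar URLs to feed to aria2 or whatever does the actual downloading. (The sketch below is its own little toy, not @srtcd424's script.)

import urllib.request

CACHE = "https://cache.nixos.org"

def closure_nar_urls(root_hash: str) -> set[str]:
    """BFS over narinfo References, returning the nar URLs of the whole closure."""
    seen, queue, urls = set(), [root_hash], set()
    while queue:
        h = queue.pop()
        if h in seen:
            continue
        seen.add(h)
        with urllib.request.urlopen(f"{CACHE}/{h}.narinfo") as resp:
            fields = dict(
                line.split(": ", 1)
                for line in resp.read().decode().splitlines()
                if ": " in line
            )
        urls.add(f"{CACHE}/{fields['URL']}")
        # References are store path basenames like "<hash>-<name>"; keep the hash part.
        for ref in fields.get("References", "").split():
            queue.append(ref.split("-", 1)[0])
    return urls

Feeding the resulting URLs to a downloader in the middle of the night is then the easy part.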
Needs lots more bits and pieces, and there are probably other priorities, but it shows something really basic and simple (albeit not "on demand") can be done.
I've been following along with the cache discussions, and one aspect I feel we need to answer for any p2p-based cache is: how can we guarantee data availability?
E.g. when using IPFS, a node providing a hash can just silently drop offline, possibly making that hash unavailable if it hasn't also been stored by another node, which IIRC only happens if the hash was requested via a second node. From my understanding, that's why ipfs-cluster exists.
Yeah, that would be a big issue if something like IPFS was being used as the primary binary cache solution. I think I've stumbled across paid IPFS "pinning" solutions, and presumably one could also self-host something similar.
Using something P2P to take some of the bandwidth load off a conventional binary cache might still be useful though.
I think I've said before that my current focus is thinking about helping small groups of trusted devs, and maybe some early-adopter type hobbyist users, in the early stages of a project. In that case, a cache miss isn't a disaster, as those sorts of users are likely to be more comfortable building from source.
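On the self-hosting side, the simplest counterweight to nodes silently dropping offline is probably one or two always-on machines that periodically re-pin everything the group cares about, e.g. driven from the same hypothetical mapping file as earlier:

import subprocess
from pathlib import Path

# Re-pin every CID listed in the shared mapping file; run this from cron on
# one or two always-on machines so content survives other nodes going away.
for line in Path("mappings.txt").read_text().splitlines():
    _sha256, cid = line.split(maxsplit=1)
    if not cid.startswith("magnet:"):            # only the IPFS entries can be pinned
        subprocess.run(["ipfs", "pin", "add", cid], check=True)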
I love IPFS, so I'm glad to see this topic, and glad that someone is spearheading this early on.
I haven't read everything here, I just want to briefly say: I'd like IPFS as a *fallback*. Central servers can be really fast.
So long as the IPFS hash is stored somewhere for derivations, we can do a lot with external tools. Fetching from the normal cache URL and the IPFS URL, then just using whichever resolves first, would let the system be fast and reliable.
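That race is easy to sketch with whatever HTTP client is handy; below, both the central cache and a local IPFS gateway are asked for the same .nar.xz and the first successful response wins (both URLs are illustrative):

from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
import urllib.request

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()

def race(cache_url: str, ipfs_url: str) -> bytes:
    """Return whichever source answers first; fall back to the other on error."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(fetch, u) for u in (cache_url, ipfs_url)}
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        winner = done.pop()
        if winner.exception() is None:
            return winner.result()
        # First responder failed; wait for the remaining source instead.
        return next(iter(pending | done)).result()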