In recent years, packages whose builds download things on the fly have become an increasing problem. Such downloads violate both pkgsrc principles/rules and standard best practices: at best, they bypass integrity checks on the downloaded material, and at worst they may check out arbitrary recent changes from GitHub or other hosting sites, which, in addition to being a supply-chain security risk, also makes reliable, repeatable builds impossible.

It has simultaneously grown more difficult to find and work around these issues by hand: such techniques have become increasingly popular among those who should perhaps know better, and meanwhile upstream build automation has grown steadily more elaborate and more opaque. Consequently, we would like a way to detect violations of the rules automatically.

Currently, pbulk runs in two steps: first it scans the tree to find out what it needs to build, and once this is done it starts building. Builds progress in the usual pkgsrc way: the fetch phase comes first and downloads anything needed, then the rest of the build continues. This interleaves (expected) downloads and build operations, which on the one hand tends to give the best build throughput but on the other makes it difficult to apply network access restrictions.
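For reference, the phase split within a single package's build can be exercised by hand with ordinary pkgsrc targets; the package path below is a placeholder:

    # Per-package phase split, using standard pkgsrc targets.
    cd /usr/pkgsrc/devel/libfoo   # placeholder package path
    make fetch                    # download distfiles only
    make checksum                 # verify them against distinfo
    make package                  # extract/patch/configure/build/
                                  # install/package; should need no
                                  # further network access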

The goal of this project is to set up infrastructure such that bulk builds can run without network access during the build phase (more precisely, in pkgsrc terms, during everything other than the fetch phase). Then attempts to download things on the fly will fail.

There are two ways to accomplish this, and we will probably want both of them. One is to add a separate fetch step to pbulk, so that the fetch and build steps can run with different global network configurations. The other is to provide mechanisms so that expected downloads can proceed while unwanted ones cannot.

Separate fetch step

Since bulk builds are generally done on dedicated or mostly-dedicated systems (whether real or virtual), system-wide changes to the network configuration to prevent downloads during the build step will be acceptable in most setups, and also sufficient to accomplish the desired goals.

There are also two possible ways to implement a separate fetch step: as part of pbulk's own operational flow, or as a separate external invocation of the pbulk components. The advantage of the former is that it requires less operator intervention; the advantage of the latter is that it doesn't require teaching pbulk to manipulate the network state on its own. (In general pbulk would need to turn external access off before the build step, and then restore it afterward in order to upload the build results.) There are many possible ways one might manipulate the network state, the details vary between operating systems, and in some cases the operations might require privileges the pbulk process doesn't have, so making pbulk do it adds considerable complication. On the other hand, setting up pbulk is already notoriously complicated, and requiring additional user-written scripts to manipulate partial build states and adjust the network is a significant drawback.
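As an illustration of the external-invocation variant, an operator wrapper might look roughly like the sketch below. The separate fetch/build entry points are hypothetical (providing something like them is part of this project), and dropping the default route is just one way to cut external access; a firewall rule would do as well.

    #!/bin/sh
    # Hypothetical operator script for a bulk build with a separate
    # fetch step. pbulk-fetch-step and pbulk-build-step are made-up
    # names for entry points this project would provide.
    set -e
    gw=192.0.2.1                 # placeholder default gateway

    pbulk-fetch-step             # scan, then download all distfiles,
                                 # recording any failures in the
                                 # pbulk state (see below)
    route delete default         # cut off external access
    pbulk-build-step             # build with no outside network
    route add default "$gw"      # restore access so the results
                                 # can be uploaded as usual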

Another consideration is that, to avoid corrupting the failure reports, any download failures need to be recorded with pbulk during the download step, and the packages affected marked as failed, so that those failures end up in the output results. Otherwise the build step will retry the fetch in an environment without a network; that will fail, but nobody will be able to see why. For this reason it isn't sufficient to just, for example, run "make -k fetch" from the top level of pkgsrc before building.
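A minimal sketch of what the fetch step's core loop needs to do, assuming a list of package locations produced by the scan step; the failure file here is invented, and merging it into pbulk's own results is exactly the integration work just described:

    #!/bin/sh
    # Fetch every package's distfiles up front, recording failures
    # so they show up in the build report instead of surfacing as
    # mysterious fetch errors on a network-less build host.
    # pkglist holds one PKGPATH (e.g. devel/libfoo) per line.
    : > fetch-failures
    while read -r pkgpath; do
        if ! (cd "/usr/pkgsrc/$pkgpath" && make fetch); then
            echo "$pkgpath" >> fetch-failures    # mark as failed
        fi
    done < pkglist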

Also note that rather than just iterating over the package list and running "make fetch" from inside pbulk, it might be desirable to use "make fetch-list" or similar, then combine the results and feed them to a separate download tool (see the sketch below). The simple approach can't readily adapt to the available network bandwidth: depending on the build host's effective bandwidth to the various download sites, it might either flood the network or be horrendously slow, and no single fixed parallelism setting can avoid this. Trying to teach pbulk itself to do download load balancing is clearly the wrong idea. However, the fetch-list approach is considerably more involved, if only because integrating failure results back into the pbulk state becomes more complicated.
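A sketch of the fetch-list variant, assuming (this is not currently guaranteed) that the per-package "make fetch-list" output can be post-processed into one URL per line; the post-processing helper is hypothetical and the parallelism cap of 4 is arbitrary:

    #!/bin/sh
    # Collect the full download list, deduplicate distfiles shared
    # between packages, then drive a single downloader with bounded
    # parallelism instead of fetching from inside each package.
    while read -r pkgpath; do
        (cd "/usr/pkgsrc/$pkgpath" && make fetch-list)
    done < pkglist |
        ./fetch-list-to-urls |    # hypothetical post-processing step
        sort -u > all-urls
    xargs -P 4 -n 1 curl -L -O < all-urls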

Blocking unauthorized downloads

The other approach is to arrange things so that unwanted downloads will fail. There are a number of possible ways to arrange this, on the basic assumption that the system has no network access to the outside world by default. (For example, it might have no default route, or it might have a firewall config that blocks outgoing connections.) Then some additional mechanism is introduced into the pkgsrc fetch stage so that authorized downloads can proceed. One mechanism is to set up a local HTTP proxy and make the proxy config available only to the fetch stage. Another, possibly preferable if the build is happening on a cluster of VMs or chroots, is to ssh to another local machine and download there; that allows mounting the distfiles directory read-only on the build hosts (see the sketch below). There are probably others. Part of the goal of this project should be to select one or a small number of reasonable mechanisms and provide the necessary support in pbulk so each can be enabled in a relatively turn-key fashion. We want this restriction to be easy to configure, and ideally, in the long term, we'd like it to be able to default to "on".
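For the ssh variant, a fetch wrapper on the build host might look like the sketch below; the host name and paths are placeholders, and wiring the wrapper into the fetch stage (for instance via FETCH_USING) is part of the infrastructure work discussed next.

    #!/bin/sh
    # ssh-fetch: hypothetical wrapper used in place of the normal
    # fetch tool. "fetchhost" is the one machine with outside
    # network access; it downloads into the shared distfiles
    # directory, which this build host mounts read-only.
    url="$1"
    ssh fetchhost "cd /export/distfiles && curl -L -O '$url'"
    # On success the distfile now appears on the build host via the
    # read-only mount of /export/distfiles.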

Note that it will almost certainly be necessary to strengthen the pkgsrc infrastructure to support this. For example, conditionally passing around HTTP proxy config depending on the pkgsrc phase will require changes to pkgsrc itself (sketched below). Also, while one can redirect fetch to a different tool via the FETCH_USING variable, as things stand this runs the risk of breaking packages that need to set it themselves. There was at one point some talk about improving this, but it apparently never went anywhere.
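To make the proxy example concrete, the needed change might look something like the following mk.conf-style fragment. Neither PKGSRC_PHASE nor a phase-scoped FETCH_ENV exists in pkgsrc today; inventing and plumbing through something of this shape is the infrastructure change in question.

    # Hypothetical mk.conf fragment (bmake syntax): expose the local
    # proxy to the fetch phase only, so nothing later in the build
    # can reach it. Both variables here are invented for this sketch.
    .if ${PKGSRC_PHASE:Unone} == "fetch"
    FETCH_ENV+=    http_proxy=http://127.0.0.1:3128/
    FETCH_ENV+=    https_proxy=http://127.0.0.1:3128/
    .endif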

An additional possibility along these lines is to leverage OS-specific security frameworks to prohibit unwanted downloading. This has the advantage of not needing to disable the network in general, so it can be engaged even for ordinary builds and on non-dedicated hosts. However, because security frameworks and their usability vary, it isn't a general solution. In particular, on NetBSD we do not currently have anything that can do this. (It might be possible to use kauth for it, but if so, the necessary policy modules do not currently exist.) Consequently this option is in some ways a separate project. Note, if attempting it, that simplistic solutions (e.g. blocking all attempts to create sockets) will probably not work adequately.

Other considerations

Note the following when considering this project:

Overall, one does hope that any package that attempts to download things and finds that it can't will then fail outright, rather than silently doing something different that we may or may not be able to detect. It is possible that in the long run we'll want to use a security framework that can log download attempts and provide an audit trail; however, first steps first.