Why are packages maintained as git submodules?

SWI-Prolog has a lot of git submodules. Most of them live under packages/ and are, predictably, the standard packages that come with most SWI-Prolog installations. In addition, there’s ‘bench’, which has benchmark code, and ‘debian’ (distro-debian), which contains debian packaging.

There are some downsides to git submodules. The separation introduces some extra maintenance burden, since every package change requires a submodule update. It also requires more involved build and contribution instructions, and it makes the tarballs that github automatically generates from tags unbuildable.

Despite the separation of the codebase into all these submodules, most work on this set of repositories is done by Jan, meaning development of all these submodules is pretty much just integrated with development of core SWI-Prolog itself. So why isn’t this development just taking place in a single repository?

Jan has been successfully developing and maintaining SWI-Prolog with this setup for years. @jan, since this is mostly your burden, I assume this setup is actually making your work easier, and you’re willingly paying the price of slightly more git fiddling because you are getting some benefit out of it.
What is that benefit? What am I missing here?

With my question out of the way, let me spend some time ranting about Nix builds, for some additional background to my question. There’s no reason to read this, I just had to write it.

In Nix, builds are considered functions. The idea is that given the same inputs and enough sandboxing, a build process should produce the same outputs. This assumption allows Nix to efficiently cache builds. It also allows Nix to generate so-called closures, minimal sets of packages that are needed to run a program, and which can then be deployed on any system (provided it has the same architecture) without having to worry about missing dependencies (or worse, having it load incompatible dependencies that then spew out seemingly unreproducible errors).

For all these assumptions to work, it is very, very important that Nix is able to validate that source inputs do not change. Therefore, Nix packages, when they need to download some source code, also need to specify a hash. This way, Nix can verify that the download corresponds with what the original packager thought the files were, and inductively, that anything built from such sources produces deterministic outputs.

For source tarballs this is super easy. Nix can just download them and run sha256sum or something similar on it. For git repositories, it’s slightly more difficult, as Nix really just wants to care about the source code, not the commit history or any other git-specific thing. So before calculating any checksum, Nix first has to remove the .git subdirectory. When there are submodules, Nix has to do a little dance of recursively checking them out in the right places, then removing all those .git subdirectories, to finally get a complete source tree which can be hashed.

So far so good. SWI-Prolog packaging works just fine today in nixpkgs using a git checkout with submodules.

There is just one little annoyance. Since building a Nix package requires us to know up front what all inputs look like, you can’t just build a Nix package with an updated source and have it work first try. Instead, the general procedure most people take is to

  1. Update the package definition to point at the updated version and clear the expected hash.
  2. Trigger a build, which will fail, because the hash is not set. It will tell us what the hash actually should have been.
  3. Update the package definition with the hash copied out of the failed build.
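As a concrete (and entirely mocked) sketch of that dance: the hash-mismatch error printed in step 2 looks roughly like the log below, and step 3 is just copying the "got:" value back into the package definition. The build.log file and both hashes here are made up for illustration.

```shell
# Mocked excerpt of the hash-mismatch error Nix prints in step 2;
# both hashes are placeholders, not real values.
cat > build.log <<'EOF'
error: hash mismatch in fixed-output derivation:
         specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
            got:    sha256-0lqM9UiIqRkU+xnfpNcLtxK9VYOnWkfFkLpcEcpINMQ=
EOF

# Step 3: extract the last sha256- token (the "got:" value) so it can
# be pasted back into the package definition.
got=$(grep -o 'sha256-[A-Za-z0-9+/=]*' build.log | tail -n 1)
echo "$got"
```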

Not exactly user-friendly! While this works fine for the occasional source import into nixpkgs, if you need to regularly build different versions of the source code, this gets annoying.

This is why I wrote swipl-nix: to automate that hash calculation for as many versions of SWI-Prolog as I can. But actually, there is a much better solution, though submodules make it less ideal than it could be.

While most Nix packages are built by third parties, there’s nothing stopping a package from providing its own packaging and checking it into the same repository as the code it packages. In that case, there’s no need for a source import. Or more precisely, if you already have some mechanism in place that fetches the nix code, you’d get the source code with it for free.

In the wonderful world of Nix, there are various programs that help out with this, but the most popular workflow at the moment is flakes. A flake is a bit of nix code, usually in a git repository, which acts as a big wrapper around nix dependencies. You give it a bunch of imprecise inputs, such as other git repositories you want to get the latest version from, and you define a function which turns this into outputs, usually a package. The flake subsystem does all the required work of fetching those inputs, calculating the hashes, and writing a lockfile. This lockfile is then used as a deterministic input.

Long story short, if we could just make SWI-Prolog a flake, it could effectively package itself. People could get the latest development version of swipl just by having their flake point at github:SWI-Prolog/swipl-devel, and the flake subsystem would do all the required boilerplate hash-calculating work. As an extra bonus, this would also let you do nix run github:SWI-Prolog/swipl-devel and immediately get the latest dev build, or nix run github:SWI-Prolog/swipl-devel/feature-branch to check out a potential bugfix on feature-branch.
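For illustration, a flake checked into swipl-devel could look roughly like the sketch below. Everything here is hypothetical (the build recipe especially); the key point is src = self, which uses the flake’s own source tree and therefore needs no hash dance at all.

```nix
{
  description = "SWI-Prolog (hypothetical self-packaging sketch)";
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  outputs = { self, nixpkgs }:
    let pkgs = nixpkgs.legacyPackages.x86_64-linux; in {
      packages.x86_64-linux.default = pkgs.stdenv.mkDerivation {
        pname = "swipl";
        version = "dev";
        src = self;  # the flake's own source tree: no hash to maintain
        nativeBuildInputs = [ pkgs.cmake pkgs.ninja ];
      };
    };
}
```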

But flakes do not work very well with submodules. It is not impossible to use them, but support for them was clearly an afterthought in the whole flake design process, and it requires the user to know submodules are in use and modify their commands and code accordingly, thereby breaking the abstraction a little bit. Basically, to pull in a dependency that needs its submodules, the importing flake (or user, when doing a nix run or similar) has to provide an extra flag to also fetch all submodules. For example, we’d have to do nix run 'github:SWI-Prolog/swipl-devel?submodules=1' to run the latest swipl. [edit: it turns out this actually does not work for remote git repositories, only for local ones that are already checked out and have all their submodules initialized. There appears to be no way to just nix run a remote git repo that needs submodules.]

It’s not the end of the world. But it is mildly annoying, and that is almost as bad.

For me, things would be much nicer if everything lived directly in the swipl-devel repository. But since probably only a handful of people worldwide care about the niche intersection of Nix and SWI-Prolog, I understand if my concerns aren’t very important here :slight_smile: .

But if the submodule situation is more of a historical accident with no clear benefits today, I’d be very willing to help out with a migration to a source tree without submodules, as I think it would be beneficial regardless.

Hi @maren, thanks for the observations. I need to think about this a little. I know GIT submodules are considered a nuisance. Most of that is, in my view, a failure to understand their design; most tooling could quite easily take care of submodules. The script scripts/make-src-tape creates the release tar balls from git and can do so without checking out the sources and/or modules. GNU tar allows you to extend TAR archives, so it simply creates a TAR for the main module and then extends this with the content of all submodules.
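To illustrate the extension trick with made-up directory names (GNU tar’s -r/--append extends an existing uncompressed archive; the real make-src-tape script presumably does more work to keep paths and exclusions right):

```shell
# Create a stand-in "main" tree and a stand-in "submodule" tree.
mkdir -p main packages/pcre
echo core  > main/core.c
echo addon > packages/pcre/pcre4pl.c

tar -cf release.tar -C main .        # archive the main tree
tar -rf release.tar packages/pcre    # extend it with the submodule tree
tar -tf release.tar                  # now lists entries from both trees
```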

I do like the modularity. Submodules also make it possible to combine these modules with other Prolog systems, arrange a different team of maintainers, easily combine a specific package version with a specific Prolog version, get minimal sources (e.g., only the core), etc. I do agree that little of this is used though, so it may just be in my dreams :frowning:

Note that computing the hash of a tar-ball is not that reliable, as the archive contains time stamps. For example, the tar archives you get from github for the same repository at the same version may change. I faced this problem in the SWI-Prolog pack system :frowning:

What I do not get from your story is whether or not the nix hash for a set of sources is well defined. Your story seems to suggest it is up to the packager to provide a function (?) that computes this hash. If that is the case, dealing with GIT can be done much better than what you outline. Note that files in git are hashed by content (no time stamp, etc.: only the content). A directory is hashed as a tree: a document that describes the file hierarchy and, for each file (name), its mode and content hash. I.e., the tree hash uniquely describes the directory layout and, for each entry, the mode and content. Again, no time stamps, owner, etc. as you find in TAR archives.

So, I think a git commit hash is a much better indication of the content than the hash of a TAR. It is not affected by times, but yes, in theory you may have multiple commits that point at the same tree. To get some idea, use

  git cat-file -p HEAD

to get the HEAD commit and subsequently repeat this with the tree hash, etc. If you want a pure content hash for a git repo, the tree hash of the commit is what you want. If you want one for a repo with submodules, you repeat the process in each module repo for the commit hash recorded in the main repo, create a nice canonical document from this, and hash it. That is probably less than half a page of shell script and way faster than what you describe.
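A small throwaway-repo sketch of that property: amending a commit (here, just its message) changes the commit hash, but the tree hash, which covers content only, stays the same.

```shell
# Throwaway repo demonstrating that the tree hash tracks content only.
git init -q demo && cd demo
git config user.email you@example.com
git config user.name "You"

echo hello > f.txt
git add f.txt
git commit -qm 'first message'
c1=$(git rev-parse HEAD)
t1=$(git rev-parse 'HEAD^{tree}')

git commit -q --amend -m 'a different message'   # same content, new commit
c2=$(git rev-parse HEAD)
t2=$(git rev-parse 'HEAD^{tree}')

echo "commit hash changed: $c1 -> $c2"
echo "tree hash unchanged: $t1"
```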

My initial answer (may change) is that reproducible packaging can be done much better based on git than on TAR archives, even when using GIT submodules. Some of the git tooling could be improved (e.g., to generate a recursive tar-ball). Possibly this exists already, as more and more commands get the --recurse-submodules or --recursive option. Github’s handling of submodules is poor :frowning: I looked at Gitlab’s. It resolves some problems. I didn’t check whether it can create recursive tar balls. I have considered switching, but there are more github users and github has stuff such as its sponsoring program.

So, my first question is whether one can use the above tricks with nix? If the answer is no, is the nix community open to fixing this?

Hi Jan, thanks for answering so quickly!

I’ll answer some stuff about how Nix does hashing at the end of my message. It is interesting, but I think not super important for the source structure of SWI-Prolog.

I don’t think git submodules are very difficult, although I don’t use them often enough to build up the familiarity to just know what to do. I also don’t think lack of understanding among the wider community is really a big point against a particular workflow. What really matters is whether the submodules provide a tangible benefit to those working with the code normally. Which, for the most part, is just you.

Modularity is good. I just don’t think you really lose that when it is in the same repository. The build system doesn’t have to be different. It can still include/exclude features based on compiler flags and detected environment.
It’s true that you would have to download the entire source rather than just a subset, even if you wanted to do a limited build. Looking at the numbers, that is actually surprisingly significant, as three quarters of the source size is those packages (especially xpce, woah! lots of eps files in there). But surely, this is not actually helping you? Or anyone else regularly working on SWI-Prolog? All contributors have all these sources checked out regardless.

I do see how it could be cool if more prolog implementations were to use a commonly maintained package. And I agree, this would be an excellent use case for a submodule, as there would be more than one project depending on it.
But if that’s not the case now, aren’t you just making extra work for yourself by maintaining it this way?

One final thing before moving on to hashing,

I’m not sure if I understand this. Surely to create the tarballs, some sources have to be checked out at some point?
That’s a fun concatenation trick though, I didn’t know about that!

Alright, hashing time.

This is luckily not the case! I imagine loads of people would get it wrong.

Hashing is a core feature of Nix. A particular source input (which can be a compressed tarball, a decompressed archive, an imported git repository, and probably various other options) is always hashed in a very predictable way by Nix itself. We do not have any influence on this, and it’s one of those things that was stabilized years ago and will most likely never change, because it is so fundamental to how Nix works.
The package definition just says how it wants the source to be obtained, and what the expected outcome of the hash computation is.

I don’t know if it is efficiently using git metadata to come up with the hash. I suspect it is not, but there’s little I can do about that. It’s just a few wasted CPU cycles though, nothing too major.

Regarding how tarballs are actually hashed in nix, depending on the fetch method this is either a hash of the tarball itself, or (more commonly) a hash of the decompressed contents. In the latter case, ‘weird’ differences between equivalent tarballs should mostly not matter.

Anyway, as I said, there’s actually nothing I can do to change how this hash is computed. But frankly, the problem isn’t really with hash computation. Git submodules do work just fine in many circumstances (it is how SWI-Prolog is packaged right now after all). The only annoying edge case is that when using flakes to provide packaging directly in the SWI-Prolog repository (rather than externally, such as with nixpkgs or swipl-nix), submodules aren’t quite as well supported as they could be. Judging from what I can find on Google, this is a known problem that’s been talked about for several years now, so I don’t really expect things to move quickly here. In my opinion, flakes should just auto-include their submodules, but it doesn’t look like that is going to happen.

Not that silliness around flakes should really determine whether SWI-Prolog uses git submodules or not. It’s just what caused me to consider why we’re using them in the first place.

As someone who has worked on the core, core submodules and packs, let me add my 2-cents (TL;DR: I somewhat prefer submodules to a large single repo) …

At first, I didn’t like submodules, and I still don’t fully understand them (and I have only a basic understanding of git; both its command-line and graphic interfaces confuse me). So, I just use the incantation of [git clone, make a fork on GitHub, git remote add myfork ...] and it Just Works (for both submodules and packs). This lets me make changes in a PR, and Jan (or the pack maintainer) can merge my changes whenever they wish. In particular, the submodule might take a while before it’s integrated into the latest release – each component (submodule) can be tested independently and added to a new release of swipl separately. Without submodules, I think that maintaining these “add-on” components would be more error-prone, especially in situations where some major changes have been done – for example, when I updated PCRE from using PCRE1 to PCRE2, or when I added the protobuf compiler functionality.

@jan is this true? Is having the PRs separated into their own packages, and then doing an update in the main repository after merging easier from a maintenance perspective compared with direct pull requests on swipl-devel?

Without submodules, the workflow for larger changes that can’t just be merged right away is to have feature branches. On github, you can create such a branch and immediately create a PR out of it, marking it as ‘draft’. This way, you can have a single overview of everything that is being worked on (well, everything publicly worked on anyway) under the PRs.

Note that nixpkgs, the main package set for Nix/NixOS, is maintaining more than 100k packages (if the count on the package search page is to be believed) using a PR workflow on a single repository. This despite the fact that each package has its own set of maintainers, and most maintainers don’t even have direct commit access.
You don’t need submodules for collaboration.

It is the current practice. It allows me to transfer the maintenance burden of packages more easily to more knowledgeable developers. During development, my involvement is only at a rather global level. When done, I do final integration tests before updating the module. That works well. But yes, something similar can be done using feature branches. If we consider several packages as well as (major) changes to the core, configuring certain combinations of submodule versions is certainly easier than combining feature branches.

The rough story is that the current workflow works well for me. Surely it is also possible to create a workflow that would be based on a monolithic repository. It comes with advantages and disadvantages. I’m more inclined to hope for better/wider support for submodules rather than replacing it by something that is IMO a worse organization.

As for this “flake” repo, can’t we keep that up-to-date using a github hook? That would be a work-around, but might be good enough until nix manages to support submodules properly?


If submodules and the current process work well for you there is no need to change anything.

We absolutely can, and that was going to be my next suggestion :slight_smile: .
swipl-nix could just be made to refresh whenever something happens. I should probably make it work more incrementally than it is right now though. Probably it can all be made to run inside github ci.

Good. Go ahead and let me know if something needs to be done on SWI-Prolog’s side or you want some discussion on how to set things up.