#Bisecting GHC is easier than I thought

#Introduction

At work, I’m busy upgrading our compiler from GHC 9.6 to GHC 9.10. A couple days ago I found some failing tests which eventually turned out to be a compiler bug! Simon Peyton Jones took the time to reply with a very detailed comment explaining how GHC’s constraint solving logic led to the bug, so be sure to check that out if you’re curious about how the bug works.

To find the commit that introduced the bug I used git bisect to perform a binary search over the commits. The principle behind git bisect is simple, but building a compiler is a thorny problem in practice.

It turns out that building GHC only takes about 12 minutes. A bisection over N commits needs roughly log2(N) builds, so you can search through 1000 commits (about 10 builds) in about two hours, or one million commits (about 20 builds) in four hours, which is much faster feedback than I assumed before I started! It happens that there are just about 1000 commits between GHC 9.6 and 9.8, so I was able to get this done in about a day, including the time it took me to write and iron out the bisection script. (Note that the 12 minute figure is for a stage 1 build; it’s probably not optimized but it was good enough to find my bug! I also tried building GHC on my laptop and it took 19 minutes, so your mileage may vary.)
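The arithmetic behind those estimates is just repeated halving. Here is a rough sketch; the function names are hypothetical, and the real number of steps `git bisect` takes can differ by one or so depending on the shape of the history.

```shell
# Hypothetical helper: how many builds does a bisection over N commits need?
# Each step halves the candidate range, so this is ceil(log2(N)).
bisect_steps() {
  local commits=$1 steps=0
  while [ "$commits" -gt 1 ]; do
    commits=$(( (commits + 1) / 2 ))
    steps=$(( steps + 1 ))
  done
  echo "$steps"
}

# Hypothetical helper: total build time in minutes for a given range size
# and a given number of minutes per build.
bisect_minutes() {
  local commits=$1 minutes_per_build=$2
  echo $(( $(bisect_steps "$commits") * minutes_per_build ))
}

bisect_minutes 1000 12      # 10 builds -> prints 120
bisect_minutes 1000000 12   # 20 builds -> prints 240
```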

Here’s how I performed the bisection, from start to finish. This is a wandering account featuring a lot of errors that are more-or-less irrelevant to the final product, so feel free to skip to the end if you want to see the finished bisection test script. However, if you’re planning on performing a similar bisection yourself, you may find the errors and the methods I used to work around them helpful.

#Key takeaways

Build systems and dependencies for compilers are constantly changing. Bisecting a large range of commits means you have to be able to build a large range of commits. Choosing a set of tools that can build every commit in the range is not always possible.

Relatedly, the GHC build system is very sensitive to the particular set of tools in use. Attempting to build and run GHC with a boot GHC from Nixpkgs, I found a wide variety of different errors depending on which revision of Nixpkgs I used.

Building old revisions of software with old versions of tools can be quite frustrating. You will encounter bugs that have been fixed in newer revisions of the software, or revisions of the software which only work with old versions of the tools. You may have to cherry-pick or conditionally apply patches to work around these issues.

Tools like Nixpkgs are indispensable for large-scale integration work like this. Being able to easily build, patch, and cache a wide range of versions of software makes it so much easier to find a set of tools that can build the relevant versions of GHC.

#What to bisect?

First, we want to determine the range of commits to bisect on. I used ghcup to download different versions of GHC and run my reproducer with them, which allowed me to determine that the error was introduced somewhere between GHC 9.6 and GHC 9.8. I checked out the GHC repository and listed the tags, eventually determining that I wanted the range of commits from the ghc-9.7-start tag to the ghc-9.8.1-release tag. (Note that there is no publicly released GHC 9.7, so this is rather unintuitive unless you already follow the GHC development process!)
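To get a feel for how big a candidate range is before committing to it, `git rev-list --count` can report the number of commits between two revisions. A minimal sketch, with a hypothetical wrapper name:

```shell
# Count the commits reachable from the "bad" revision ($2) but not from
# the "good" revision ($1). Run this inside the repository being bisected.
commits_in_range() {
  git rev-list --count "$1..$2"
}

# Example (inside the GHC checkout):
#   git tag --list 'ghc-9.*'
#   commits_in_range ghc-9.7-start ghc-9.8.1-release
```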

#Submodules to whatever the opposite of a rescue is

If you’ve touched the GHC repository at all, you might have modified files, untracked files which exist in the release you want to check out, or — worst of all — modified submodules. To clean out your working tree, you need to run the following commands:

# Remove untracked files:
# -d is for "recurse into directories", -x is for "remove ignored files".
git clean --force -dx
# Undo changes to tracked files:
git reset --hard
# Unregister submodules and remove their worktrees:
git submodule deinit --all --force

The submodules make working in the GHC repo quite frustrating, but I recently read “Demystifying git submodules” by Dmitry Mazin, which helped me build out my mental model. Except… git submodule update, which the manual page promises will “update the registered submodules to match what the superproject expects by […] updating the working tree of the submodules”, doesn’t seem to work like I expect it to.

Here’s what happens. After running a build on ghc-9.7-start, Git tells me that content has been modified in the libraries/unix submodule:

$ git status
HEAD detached at ghc-9.7-start
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
        modified:   libraries/unix (modified content)

I update the submodules:

$ git submodule update --init --recursive --force
...
Submodule path 'utils/haddock': checked out '261a7c8ac5b5ff29e6e0380690cbb6ee9730f985'
...
$ git status
HEAD detached at ghc-9.7-start
nothing to commit, working tree clean

But then, when I go to check out another branch, Git tells me that untracked files in utils/haddock are getting in the way:

$ git switch master
error: The following untracked working tree files would be overwritten by checkout:
        utils/haddock/.github/mergify.yml
        ...
Aborting

UPDATE: I sent this article to my friend who works as a firmware engineer — and, therefore, uses submodules gleefully — and she immediately suggested that a submodule might have been removed. I checked and that’s what’s going on here. It had never occurred to me because I was checking out a later commit starting at an earlier one, and I had assumed that a repository with 31 submodules didn’t get there by removing them. What follows are my flailing attempts to fix this without removing all the submodules. Skip ahead to “Getting ready” if you don’t want to read that.

I try to remove the untracked files, but it doesn’t help:

$ git clean -fdx
$ git submodule foreach git clean -fdx
...
Entering 'utils/haddock'
...
$ git switch master
error: The following untracked working tree files would be overwritten by checkout:
        utils/haddock/.github/mergify.yml
        ...
Aborting

In fact, Git thinks that the working tree for utils/haddock is clean and that all of those files are tracked:

$ pushd utils/haddock
$ git status
HEAD detached at 261a7c8a
nothing to commit, working tree clean
$ git ls-files .github/mergify.yml
.github/mergify.yml
$ popd

Removing the checkouts for the submodules entirely seems to work, but I can’t figure out why:

$ git submodule deinit --all --force
...
Cleared directory 'utils/haddock'
Submodule 'utils/haddock' (https://gitlab.haskell.org/ghc/haddock.git) unregistered for path 'utils/haddock'
...
$ git switch master
Previous HEAD position was 261a7c8a Bump GHC version to 9.7
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
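One way to check for the removed-submodule situation described in the UPDATE above is to compare the submodule paths recorded in the two revisions. Git stores submodules in the tree as "gitlink" entries with mode 160000, so `git ls-tree` can list them without checking anything out. This is only a sketch, and the helper names are hypothetical:

```shell
# Read `git ls-tree -r` output on stdin and print the paths of gitlink
# entries (mode 160000), i.e. submodules. Paths containing whitespace
# are not handled in this sketch.
gitlink_paths() {
  awk '$1 == "160000" { print $4 }' | sort
}

# Print submodule paths present in revision $1 but absent from revision $2.
submodules_removed_between() {
  local old_list new_list
  old_list=$(mktemp) || return 1
  new_list=$(mktemp) || return 1
  git ls-tree -r "$1" | gitlink_paths > "$old_list"
  git ls-tree -r "$2" | gitlink_paths > "$new_list"
  comm -23 "$old_list" "$new_list"
  rm -f "$old_list" "$new_list"
}

# Example (inside the GHC checkout):
#   submodules_removed_between ghc-9.7-start master
```

Any path this prints is a directory that Git will no longer treat as a submodule after the switch, which is when its leftover files start to look like untracked files in the superproject.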

#Getting ready

Now that we know the range of commits we want to bisect, let’s get GHC building. If you haven’t built GHC before (I hadn’t!), the GHC Wiki has a good article on setting up your system for building GHC.

After getting everything set up, the build process looks roughly like this:

# Clone and update the submodules.
git submodule update --init --recursive

# Run `autoreconf`, copy `config.sub` into submodules.
./boot

# Configure for building; `$CONFIGURE_ARGS` is set by `ghc.nix` and contains
# paths to the `gmp` and `ncurses` libraries.
./configure $CONFIGURE_ARGS

# Build GHC with the bespoke Hadrian build system:
./hadrian/build -j --flavour=Quick

I chose to use ghc.nix to install the dependencies. This worked really well, but when I attempted to reproduce my builds for this writeup everything fell apart! It turned out that I had been using an old version of ghc.nix from a GitHub repository that hadn’t been updated since October 2023. By pure coincidence, this was perfectly suited for building the revisions of GHC I was interested in, which were authored between December 2022 and October 2023.

I also made a couple tweaks to the userSettings attribute-set in ghc.nix’s flake.nix:

I set withIde = false; to disable building the huge and expensive haskell-language-server, which I wouldn’t be using.
I set bootghc = "ghc96"; because I’m building older versions of GHC, and building them with a newer boot GHC isn’t necessarily supported. For bisecting larger ranges or historical versions, it might be nice to build some tooling to dynamically choose a boot GHC based on the commit being built. (Also, I would learn later that GHC 9.4 would have been a better choice.)
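Such tooling could start as small as a lookup from the version being built to a bootghc value. The sketch below is purely illustrative: `pick_boot_ghc` is a hypothetical name, the only mapping grounded in this article is that GHC 9.4 suits the 9.7 to 9.8 range, and the fallback is just the ghc96 setting described above. How the result gets fed back into ghc.nix is left open.

```shell
# Hypothetical mapping from the GHC version being built to a ghc.nix
# `bootghc` value. Only the 9.7/9.8 entry comes from this article; the
# fallback is an assumption and would need checking for other ranges.
pick_boot_ghc() {
  case "$1" in
    9.7*|9.8*) echo ghc94 ;;
    *)         echo ghc96 ;;
  esac
}

pick_boot_ghc 9.7
```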

#Fixing up ghc.nix & a potpourri of build errors

I suspect that the smarter move here would have been to just point at an old commit of ghc.nix and call it a day, but instead I’ve spent my downtime for the last couple days fixing up the latest revisions of ghc.nix to make it easier to perform these builds in the future.

With the old revision of ghc.nix, I got a hash mismatch error:

$ nix develop
error: NAR hash mismatch in input
  'github:commercialhaskell/all-cabal-hashes/bd2d976d126b7730d82c772a207cf34e927aa69d'
    (/nix/store/i19cc0hifz3f7iayz477v6s2aajyl1ii-source),
  expected 'sha256-c6R3PkzqDCeAIqB+aygnjIMOmnkAmepyakOqtb8oQrg=',
       got 'sha256-7wwhcWxDLTKHghemMh30Le7D5pi7+eSUCg4jj7yS+jM='

I fixed this by updating the relevant input with nix flake update all-cabal-hashes.

./boot and ./configure worked fine, but I ran into trouble when I attempted to run the actual build process:

$ ./hadrian/build -j --flavour=Quick
Resolving dependencies...
Error: cabal: Could not resolve dependencies:
[__0] trying: hadrian-0.1.0.0 (user goal)
[__1] trying: base-4.18.0.0/installed-4.18.0.0 (dependency of hadrian)
[__2] next goal: Cabal (dependency of hadrian)
[__2] rejecting: Cabal-3.10.1.0/installed-3.10.1.0 (conflict: hadrian =>
Cabal>=3.2 && <3.9)
...

In retrospect, this should have been a signal to find a better version of GHC and Cabal to build with. According to the (very well-hidden) GHC Boot Library Version History page on the GHC wiki, GHC 9.4 was distributed with Cabal 3.8, so that’s what we likely want.

I decided to charge on, though. Here’s what I needed to do to get GHC to build with “close enough” versions of the build tools!

First, you need to tell the Cabal that’s building Hadrian to ignore the dependency version bounds with --allow-newer:

CABFLAGS=--allow-newer ./hadrian/build -j --flavour=Quick

This will build Hadrian and then GHC. However, it won’t get very far:

# cabal-configure (for _build/stageBoot/linters/lint-whitespace/setup-config)
Error: hadrian: Encountered missing or private dependencies:
Error: hadrian: Encountered missing or private dependencies:
mtl >=2.1 && <2.3
mtl >=2.1 && <2.3

For most command-line options, we can use the Hadrian user settings to add Cabal options (e.g., echo ".*.cabal.configure.opts += ..." >> _build/hadrian.settings), but the Hadrian build system works with Cabal’s lower-level Setup.hs interface, which doesn’t support the --allow-newer flag, so we have to patch the .cabal files manually.

I used a tool called jailbreak-cabal to remove version bounds from dependency specifications in .cabal files. Here’s the command to run, which skips the hundreds of .cabal files in integration tests:

find . \
    ! '(' \
           -path ./testsuite -prune \
        -o -path ./libraries/Cabal/Cabal-tests/tests -prune \
        -o -path ./libraries/Cabal/cabal-install/tests -prune \
        -o -path ./libraries/Cabal/cabal-testsuite/PackageTests -prune \
    ')' \
    -path '*.cabal' \
    -exec jailbreak-cabal '{}' ';'

With that in place, my builds completed successfully, but I still needed to make a few other tweaks to get the build working with the new canonical ghc.nix repository.

First, I needed to remove gdb from the included tools because gdb isn’t available on macOS.

Then, I needed to export CXX with an absolute path so that the ./configure script wouldn’t fail due to a quoting issue. This bug is fixed on recent versions of GHC, but it’s present on ghc-9.7-start, which we want to build!
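Resolving a tool name to an absolute path can be done with the shell’s own `command -v`. A minimal sketch, assuming CXX holds a bare command name with no extra arguments and falling back to `c++` when it is unset (both assumptions, and `resolve_tool` is a hypothetical name):

```shell
# Resolve a tool name to an absolute path using the current PATH.
# For shell builtins or functions `command -v` prints just the name,
# so this is only meaningful for external programs.
resolve_tool() {
  command -v "$1"
}

# Export CXX as an absolute path, but only if the compiler was found.
if cxx_path=$(resolve_tool "${CXX:-c++}"); then
  export CXX="$cxx_path"
fi
```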

After fixing the ./configure step, I could attempt the build, but it didn’t get very far:

| Run GhcPkg Update (Stage0 InTreeLibs): _build/stage0/libraries/ghc-boot-th/inplace-pkg-config => none
dieVerbatim: user error (Error: hadrian:
'/nix/store/3dahn98q1m46ickndsg797zp8gv801b2-ghc-9.6.6-with-packages/bin/ghc'
exited with an error:
ghc-9.6.6: can't find a package database at
_build/stage0/libraries/ghc-boot-th/build/package.conf.inplace
)
Build failed.

At this point, I suspected my GHC / Cabal versions were to blame, so I started to look for more historically-accurate versions of the tools. First I found the date of the commit I was trying to build:

$ git show
commit fc3a2232da89ed4442b52a99ba1826d04362a7e8 (HEAD, tag: ghc-9.7-start)
Author: Ben Gamari <bgamari.foss@gmail.com>
Date:   Thu Dec 22 13:45:06 2022 -0500

    Bump GHC version to 9.7

Then I asked for a commit around that time in Nixpkgs:

$ git log --before='Thu Dec 22 13:45:06 2022' nixos-unstable
commit 012700eae502f6054a056cf7b94f78ff549e278d
Merge: 0550dfc0228a 5f1760cb902c
Author: Fabian Affolter <mail@fabian-affolter.ch>
Date:   Thu Dec 22 22:38:20 2022 +0100

    Merge pull request #207299 from r-ryantm/auto-update/python3.10-python-crontab

    python310Packages.python-crontab: 2.6.0 -> 2.7.1

If you’ve read the git rev-parse documentation, you might be tempted to try git log 'nixos-unstable@{Thu Dec 22 13:45:06 2022}', but that syntax consults your local reflog: it finds what your own checkout had nixos-unstable pointing to on that date, not which upstream commit was current at the time.
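The same lookup can be scripted so that it prints a single commit hash. This is a sketch with hypothetical helper names; note that `--before` filters on the committer date, so the helper reads `%ci` rather than the author date that `git show` displays by default.

```shell
# Print the committer date of revision $1 (ISO-8601-like format).
commit_date() {
  git show --no-patch --format=%ci "$1"
}

# Print the newest commit reachable from $2 that was committed before date $1.
commit_before() {
  git rev-list -n 1 --before="$1" "$2"
}

# Example, assuming GHC and Nixpkgs are checked out side by side
# (the ../ghc path is an assumption) and run from the Nixpkgs checkout:
#   commit_before "$(git -C ../ghc show --no-patch --format=%ci ghc-9.7-start)" nixos-unstable
```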

Attempting to build GHC from that revision gave me a compiler error:

ghc: panic! (the 'impossible' happened)
  GHC version 9.4.2:
        Template variable unbound in rewrite rule
  Variable: sg_shgu :: WasmTypeTag 'I32 ~R# WasmTypeTag w_sgB4
  Rule "SC:$j0"
  Rule bndrs: [sg_shgu]
  LHS args: [ty_word_X16]
  Actual args: [ty_word_X16]
  Call stack:
      CallStack (from HasCallStack):
        callStackDoc, called at compiler/GHC/Utils/Panic.hs:182:37 in ghc:GHC.Utils.Panic
        pprPanic, called at compiler/GHC/Core/Rules.hs:619:10 in ghc:GHC.Core.Rules

I jumped ahead and tried a commit from a year later. That built successfully, but running any of the executables it produced failed:

$ _build/stage1/bin/ghc
dyld[12621]: symbol not found in flat namespace '_unixzm2zi8zi3zi0zminplace_SystemziPosixziFiles_getFileStatus_closure'
fish: Job 1, '_build/stage1/bin/ghc' terminated by signal SIGABRT (Abort)

It was around this time that I finally looked at the old ghc.nix repository again and noticed it was pointing to a commit on the nixos-23.05 branch of Nixpkgs. A stable release branch! Why hadn’t I thought of that earlier? I only needed one more tweak, to make ghc.nix work with old versions of Nixpkgs that don’t include a top-level pkgs.happy attribute, and my builds started completing successfully.

Now we’re ready to start bisecting!

#Starting the bisection

First, mark a commit you know is bad and a commit you know is good:

git bisect start ghc-9.8.1-release ghc-9.7-start

Then run the bisection using a script to determine which commits are good:

git bisect run ../test.sh

Note that the test.sh script is in the parent directory so it doesn’t get deleted when we run git clean.

#The bisection script

We’ll need to build GHC repeatedly across a variety of different versions, so we’ll need some tweaks to get things working smoothly.

First, we’ll want to remove any previous build artifacts and get the submodules set up correctly:

# Log commands as we run them.
set -x

# Reset the submodules in case of changes (e.g., to generated files).
# This takes about 5 seconds, or a minute for a fresh clone.
git submodule update --init --force --recursive || exit 128

# Remove existing build products.
# This takes roughly no time.
time rm -rf _build || exit 128

Why exit with code 128 if these commands fail? According to the git bisect man page:

Note that the script […] should exit with code 0 if the current source code is good/old, and exit with a code between 1 and 127 (inclusive), except 125, if the current source code is bad/new.

Any other exit code will abort the bisect process. It should be noted that a program that terminates via exit(-1) leaves $? = 255, (see the exit(3) manual page), as the value is chopped with & 0377.

The special exit code 125 should be used when the current source code cannot be tested. If the script exits with this code, the current revision will be skipped (see git bisect skip above). 125 was chosen as the highest sensible value to use for this purpose, because 126 and 127 are used by POSIX shells to signal specific error status (127 is for command not found, 126 is for command found but not executable—these details do not matter, as they are normal errors in the script, as far as bisect run is concerned).
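The rules in that excerpt can be restated as a small function. This is only an illustration of how `git bisect run` interprets a script’s exit status, not anything Git ships; `bisect_verdict` is a hypothetical name.

```shell
# Classify an exit status the way the git-bisect man page describes:
# 0 is good, 125 is skip, anything else in 1..127 is bad, and every
# other status aborts the bisection.
bisect_verdict() {
  local status=$1
  if [ "$status" -eq 0 ]; then
    echo good
  elif [ "$status" -eq 125 ]; then
    echo skip
  elif [ "$status" -ge 1 ] && [ "$status" -le 127 ]; then
    echo bad
  else
    echo abort
  fi
}

bisect_verdict 128   # prints "abort"
```

Exiting with 128 on a build failure therefore stops the bisection outright, rather than letting a broken build be recorded as a bad commit.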

The first script I tried looked like this:

function build {
  git submodule update --init --force --recursive
  rm -rf _build
  ./boot
  # ...
}

set -ex

build || exit 128

I hoped that this would run the commands in build, stop at the first failure so that build returned a non-zero status, and then exit with 128 as a result. It turns out this is not the case! According to the Bash manual, using set -e will cause the script to:

Exit immediately if a pipeline, which may consist of a single simple command, a list, or a compound command returns a non-zero status. The shell does not exit if the command that fails is […] part of any command executed in a && or || list except the command following the final && or ||

If a compound command or shell function executes in a context where -e is being ignored, none of the commands executed within the compound command or function body will be affected by the -e setting, even if -e is set and a command returns a failure status.

Therefore, because we run build || exit 128, build will keep going if one of the commands in it fails! I googled “bash errexit site:gnu.org” to find the documentation I’ve linked above, but the first result was a message sent to the Bash mailing list titled “why does errexit exist in its current utterly useless form?”, complaining about this exact problem.

To work around this, we have to append || exit 128 to each command in the build process. To state the obvious, this is tedious and error-prone.
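A small wrapper function can take some of the tedium out of this. The sketch below first reproduces the pitfall and then shows one possible helper; `try` and `build_demo` are hypothetical names, and the finished script in this article simply spells out `|| exit 128` on every line instead.

```shell
# The pitfall: `false` fails, but because build_demo runs on the left of
# `||`, errexit is ignored inside it and the `echo` still runs.
build_demo() {
  false
  echo "still running"
}
( set -e; build_demo || exit 128 )   # prints "still running"

# One possible workaround: run each command through a helper that aborts
# the whole script with status 128 as soon as the command fails.
try() {
  "$@" || exit 128
}

# Usage inside a bisection script. `env` passes the variable to the build,
# because `try VAR=value command` would try to run VAR=value as a command.
#   try ./boot
#   try ./configure $CONFIGURE_ARGS
#   try env CABFLAGS=--allow-newer ./hadrian/build -j --flavour=Quick
```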

Anyways, now we’re ready to run the build! Booting the repository takes about two seconds:

./boot || exit 128

Configuring it takes about 30 seconds:

./configure $CONFIGURE_ARGS || exit 128

And building Hadrian and then GHC takes about 12 minutes on my desktop (M1 Ultra with 20 cores and 64GB RAM) or about 19 minutes on my laptop (M2 Max with 12 cores and 32GB RAM):

CABFLAGS=--allow-newer ./hadrian/build -j --flavour=Quick || exit 128

When we’re done, make sure to reset any changes we made to files in the working tree so that Git can check out the next revision:

git reset --hard || exit 128

Double-check that we’ve actually produced a runghc executable, and then we can use the built compiler to run our tests:

# Make sure we actually built a compiler:
if [[ ! -e _build/stage1/bin/runghc ]]
then
  exit 128
fi

tmp=$(mktemp)
_build/stage1/bin/runghc ../Main.hs > "$tmp" || exit 128

# If the output doesn't contain 'action', the bug is present in this commit.
grep --quiet action "$tmp"

And that’s it! The finished product isn’t too complex, but the road to get there involved a lot of subtle pitfalls I wanted to write down.

#Putting it all together

Here’s what the bisection test script looks like when it’s all assembled together! The script is also available as a GitHub Gist.

#!/usr/bin/env bash

# Log commands as we run them. (We don't use `set -e`; each command below
# handles failure explicitly with `|| exit 128`.)
set -x

# Attempt to build GHC.
#
# If we can't build GHC, exit with 128. This will abort the entire `git bisect`
# instead of erroneously marking the commit as 'bad'.
#
# From `man git-bisect`:
# > Note that the script [...] should exit with code
# > 0 if the current source code is good/old, and exit with a code between 1
# > and 127 (inclusive), except 125, if the current source code is bad/new.
# >
# > Any other exit code will abort the bisect process. It should be noted that
# > a program that terminates via exit(-1) leaves $? = 255, (see the exit(3)
# > manual page), as the value is chopped with & 0377.

# 5 seconds:
# Reset the submodules in case of changes (e.g., to generated files).
time git submodule update --init --force --recursive || exit 128

# Remove existing build products.
time rm -rf _build || exit 128

# Now we run the actual build process.
# See: https://gitlab.haskell.org/ghc/ghc/-/wikis/building/preparation

# 2 seconds:
time ./boot || exit 128

# 30 seconds:
time ./configure $CONFIGURE_ARGS || exit 128

# ~12 minutes (M1 Ultra, 20 cores), ~19 minutes (M2 Max, 12 cores):
time CABFLAGS=--allow-newer ./hadrian/build -j --flavour=Quick || exit 128

# Reset any modified files or checking out the next commit will fail:
time git reset --hard || exit 128

# Make sure we actually built a compiler:
if [[ ! -e _build/stage1/bin/runghc ]]
then
  exit 128
fi

tmp=$(mktemp)
_build/stage1/bin/runghc "../Main.hs" > "$tmp" || exit 128

# If the output contains 'action', we're all OK.
# If it just says 'run', we have a problem!
grep --quiet action "$tmp"