Bisecting GHC is easier than I thought
Introduction
At work, I’m busy upgrading our compiler from GHC 9.6 to GHC 9.10. A couple days ago I found some failing tests which eventually turned out to be a compiler bug! Simon Peyton Jones took the time to reply with a very detailed comment explaining how GHC’s constraint solving logic led to the bug, so be sure to check that out if you’re curious about how the bug works.
To find the commit that introduced the bug I used git bisect to perform a binary search over the commits. The principle behind git bisect is simple, but building a compiler is a thorny problem in practice.
It turns out that building GHC only takes about 12 minutes. Because a binary search over N commits needs about log2(N) builds, this means that you can search through 1000 commits (about 10 builds) in about two hours or one million commits (about 20 builds) in four hours, which is much faster feedback than I assumed before I started! It happens that there are just about 1000 commits between GHC 9.6 and 9.8, so I was able to get this done in about a day, including the time it took me to write and iron out the bisection script. (Note that the 12 minute figure is for a stage 1 build; it’s probably not optimized but it was good enough to find my bug! I also tried building GHC on my laptop and it took 19 minutes, so your mileage may vary.)
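The estimate above is just a base-2 logarithm times the build time. Here is a minimal sketch of that arithmetic; bisect_steps is a made-up helper name, and the count is approximate (git bisect may need one build more or fewer depending on the shape of the history):

```shell
# Roughly how many builds a binary search over N commits needs:
# halve the range (rounding up) until a single commit is left.
bisect_steps() (
  commits=$1
  steps=0
  while [ "$commits" -gt 1 ]; do
    commits=$(( (commits + 1) / 2 ))
    steps=$(( steps + 1 ))
  done
  echo "$steps"
)

# 12 minutes per build is the figure measured on my desktop.
for n in 1000 1000000; do
  steps=$(bisect_steps "$n")
  echo "$n commits: about $steps builds, $(( steps * 12 )) minutes"
done
# prints:
# 1000 commits: about 10 builds, 120 minutes
# 1000000 commits: about 20 builds, 240 minutes
```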
Here’s how I performed the bisection, from start to finish. This is a wandering account featuring a lot of errors that are more-or-less irrelevant to the final product, so feel free to skip to the end if you want to see the finished bisection test script. However, if you’re planning on performing a similar bisection yourself, you may find the errors and the methods I used to work around them helpful.
Key takeaways
Build systems and dependencies for compilers are constantly changing. Bisecting a large range of commits means you have to be able to build a large range of commits. Choosing a set of tools that can build every commit in the range is not always possible.
Relatedly, the GHC build system is very sensitive to the particular set of tools in use. Attempting to build and run GHC with a boot GHC from Nixpkgs, I found a wide variety of different errors depending on which revision of Nixpkgs I used.
Building old revisions of software with old versions of tools can be quite frustrating. You will encounter bugs that have been fixed in newer revisions of the software, or revisions of the software which only work with old versions of the tools. You may have to cherry-pick or conditionally apply patches to work around these issues.
Tools like Nixpkgs are indispensable for large-scale integration work like this. Being able to easily build, patch, and cache a wide range of versions of software makes it so much easier to find a set of tools that can build the relevant versions of GHC.
What to bisect?
First, we want to determine the range of commits to bisect on. I used ghcup to download different versions of GHC and run my reproducer with them, which allowed me to determine that the error was introduced somewhere between GHC 9.6 and GHC 9.8. I checked out the GHC repository and listed the tags, eventually determining that I wanted the range of commits from the ghc-9.7-start tag to the ghc-9.8.1-release tag. (Note that there is no publicly released GHC 9.7, so this is rather unintuitive unless you already follow the GHC development process!)
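To check how big a range is before committing to a bisection, git rev-list --count can count the commits between two tags. A minimal sketch; commits_between is a hypothetical helper name, not something from the GHC repository:

```shell
# Count the commits reachable from the second revision but not the first.
# Run this inside the repository you want to bisect.
commits_between() {
  git rev-list --count "$1..$2"
}

# In a GHC checkout this would be:
#   commits_between ghc-9.7-start ghc-9.8.1-release
```

The number it prints is the N that decides how many builds the bisection needs.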
Submodules to whatever the opposite of a rescue is
If you’ve touched the GHC repository at all, you might have modified files, untracked files which exist in the release you want to check out, or — worst of all — modified submodules. To clean out your working tree, you need to run the following commands:
# Remove untracked files:
# -d is for "recurse into directories", -x is for "remove ignored files".
git clean --force -dx
# Undo changes to tracked files:
git reset --hard
# Unregister submodules and remove their worktrees:
git submodule deinit --all --force
The submodules make working in the GHC repo quite frustrating, but I recently read “Demystifying git submodules” by Dmitry Mazin, which helped me build out my mental model. Except… git submodule update, which the manual page promises will “update the registered submodules to match what the superproject expects by […] updating the working tree of the submodules”, doesn’t seem to work like I expect it to.
Here’s what happens. After running a build on ghc-9.7-start, Git tells me that content has been modified in the libraries/unix submodule:
$ git status
HEAD detached at ghc-9.7-start
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
(commit or discard the untracked or modified content in submodules)
modified: libraries/unix (modified content)
I update the submodules:
$ git submodule update --init --recursive --force
...
Submodule path 'utils/haddock': checked out '261a7c8ac5b5ff29e6e0380690cbb6ee9730f985'
...
$ git status
HEAD detached at ghc-9.7-start
nothing to commit, working tree clean
But then, when I go to check out another branch, Git tells me that untracked files in utils/haddock are getting in the way:
$ git switch master
error: The following untracked working tree files would be overwritten by checkout:
utils/haddock/.github/mergify.yml
...
Aborting
UPDATE: I sent this article to my friend who works as a firmware engineer — and, therefore, uses submodules gleefully — and she immediately suggested that a submodule might have been removed. I checked and that’s what’s going on here. It had never occurred to me because I was checking out a later commit starting at an earlier one, and I had assumed that a repository with 31 submodules didn’t get there by removing them. What follows are my flailing attempts to fix this without removing all the submodules. Skip ahead to “Getting ready” if you don’t want to read that.
I try to remove the untracked files, but it doesn’t help:
$ git clean -fdx
$ git submodule foreach git clean -fdx
...
Entering 'utils/haddock'
...
$ git switch master
error: The following untracked working tree files would be overwritten by checkout:
utils/haddock/.github/mergify.yml
...
Aborting
In fact, Git thinks that the working tree for utils/haddock is clean and that all of those files are tracked:
$ pushd utils/haddock
$ git status
HEAD detached at 261a7c8a
nothing to commit, working tree clean
$ git ls-files .github/mergify.yml
.github/mergify.yml
$ popd
Removing the checkouts for the submodules entirely seems to work, but I can’t figure out why:
$ git submodule deinit --all --force
...
Cleared directory 'utils/haddock'
Submodule 'utils/haddock' (https://gitlab.haskell.org/ghc/haddock.git) unregistered for path 'utils/haddock'
...
$ git switch master
Previous HEAD position was 261a7c8a Bump GHC version to 9.7
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
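In hindsight, the removed-submodule explanation from the update above is something I could have checked directly. Submodules are stored in a commit's tree as entries with mode 160000, so comparing those entries between two revisions shows which paths stopped being submodules. A minimal sketch; both function names are made up for illustration, and it assumes paths without whitespace:

```shell
# List the paths that are submodules (mode 160000 entries) at a revision.
submodule_paths() {
  git ls-tree -r "$1" | awk '$1 == "160000" { print $4 }' | sort
}

# Print the paths that are submodules at revision $1 but not at revision $2.
removed_submodules() (
  new_paths=$(submodule_paths "$2")
  submodule_paths "$1" | while IFS= read -r path; do
    printf '%s\n' "$new_paths" | grep -qxF -- "$path" || printf '%s\n' "$path"
  done
)

# In a GHC checkout this would be:
#   removed_submodules ghc-9.7-start master
```

Any path this prints is a candidate for the "untracked files would be overwritten" error, because the old submodule checkout is still sitting where the new commit wants to put ordinary files.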
Getting ready
Now that we know the range of commits we want to bisect, let’s get GHC building. If you haven’t built GHC before (I hadn’t!), the GHC Wiki has a good article on setting up your system for building GHC.
After getting everything set up, the build process looks roughly like this:
# Clone and update the submodules.
git submodule update --init --recursive
# Run `autoreconf`, copy `config.sub` into submodules.
./boot
# Configure for building; `$CONFIGURE_ARGS` is set by `ghc.nix` and contains
# paths to the `gmp` and `ncurses` libraries.
./configure $CONFIGURE_ARGS
# Build GHC with the bespoke Hadrian build system:
./hadrian/build -j --flavour=Quick
I chose to use ghc.nix to install the dependencies. This worked really well, but when I attempted to reproduce my builds for this writeup everything fell apart! It turned out that I had been using an old version of ghc.nix from a GitHub repository that hadn’t been updated since October 2023. By pure coincidence, this was perfectly suited for building the revisions of GHC I was interested in, which were authored between December 2022 and October 2023.
I also made a couple tweaks to the userSettings attribute-set in ghc.nix’s flake.nix:
- I set withIde = false; to disable building the huge and expensive haskell-language-server, which I wouldn’t be using.
- I set bootghc = "ghc96"; because I’m building older versions of GHC, which isn’t necessarily supported with a newer GHC. For bisecting larger ranges or historical versions, it might be nice to build some tooling to dynamically choose a boot GHC based on the commit being built. (Also, I would learn later that GHC 9.4 would have been a better choice.)
Fixing up ghc.nix & a potpourri of build errors
I suspect that the smarter move here would have been to just point at an old commit of ghc.nix and call it a day, but instead I’ve spent my downtime for the last couple days fixing up the latest revisions of ghc.nix to make it easier to perform these builds in the future.
With the old revision of ghc.nix, I got a hash mismatch error:
$ nix develop
error: NAR hash mismatch in input
'github:commercialhaskell/all-cabal-hashes/bd2d976d126b7730d82c772a207cf34e927aa69d'
(/nix/store/i19cc0hifz3f7iayz477v6s2aajyl1ii-source),
expected 'sha256-c6R3PkzqDCeAIqB+aygnjIMOmnkAmepyakOqtb8oQrg=',
got 'sha256-7wwhcWxDLTKHghemMh30Le7D5pi7+eSUCg4jj7yS+jM='
I fixed this by updating the relevant input with nix flake update all-cabal-hashes.
./boot and ./configure worked fine, but I ran into trouble when I attempted to run the actual build process:
$ ./hadrian/build -j --flavour=Quick
Resolving dependencies...
Error: cabal: Could not resolve dependencies:
[__0] trying: hadrian-0.1.0.0 (user goal)
[__1] trying: base-4.18.0.0/installed-4.18.0.0 (dependency of hadrian)
[__2] next goal: Cabal (dependency of hadrian)
[__2] rejecting: Cabal-3.10.1.0/installed-3.10.1.0 (conflict: hadrian =>
Cabal>=3.2 && <3.9)
...
In retrospect, this should have been a signal to find a better version of GHC and Cabal to build with. According to the (very well-hidden) GHC Boot Library Version History page on the GHC wiki, GHC 9.4 was distributed with Cabal 3.8, so that’s what we likely want.
I decided to charge on, though. Here’s what I needed to do to get GHC to build with “close enough” versions of the build tools!
First, you need to tell the Cabal that’s building Hadrian to ignore the dependency version bounds with --allow-newer:
CABFLAGS=--allow-newer ./hadrian/build -j --flavour=Quick
This will build Hadrian and then GHC. However, it won’t get very far:
# cabal-configure (for _build/stageBoot/linters/lint-whitespace/setup-config)
Error: hadrian: Encountered missing or private dependencies:
Error: hadrian: Encountered missing or private dependencies:
mtl >=2.1 && <2.3
mtl >=2.1 && <2.3
For most command-line options, we can use the Hadrian user settings to add Cabal options (e.g., echo ".*.cabal.configure.opts += ..." >> _build/hadrian.settings), but the Hadrian build system works with Cabal’s lower-level Setup.hs interface, which doesn’t support the --allow-newer flag, so we have to patch the .cabal files manually.
I used a tool called jailbreak-cabal to remove version bounds from dependency specifications in .cabal files. Here’s the command to run, which skips the hundreds of .cabal files in integration tests:
find . \
! '(' \
-path ./testsuite -prune \
-o -path ./libraries/Cabal/Cabal-tests/tests -prune \
-o -path ./libraries/Cabal/cabal-install/tests -prune \
-o -path ./libraries/Cabal/cabal-testsuite/PackageTests -prune \
')' \
-path '*.cabal' \
-exec jailbreak-cabal '{}' ';'
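Since that command rewrites files in place, it can be worth previewing which files it would touch first. Here is a sketch of a dry run using the more conventional "-prune -o ... -print" form of find; list_cabal_files is a hypothetical name, and this is a variation on the command above rather than the exact one I ran:

```shell
# Print the .cabal files under the directory "$1", skipping the directories
# full of test fixtures. Nothing is modified; this only lists candidates.
list_cabal_files() {
  find "$1" \
    '(' \
      -path "$1/testsuite" \
      -o -path "$1/libraries/Cabal/Cabal-tests/tests" \
      -o -path "$1/libraries/Cabal/cabal-install/tests" \
      -o -path "$1/libraries/Cabal/cabal-testsuite/PackageTests" \
    ')' -prune \
    -o -type f -name '*.cabal' -print
}

# In a GHC checkout this would be:
#   list_cabal_files .
```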
With that in place, my builds completed successfully, but I still needed to make a few other tweaks to get the build working with the new canonical ghc.nix repository.
First, I needed to remove gdb from the included tools because gdb isn’t available on macOS.
Then, I needed to export CXX with an absolute path so that the ./configure script wouldn’t fail due to a quoting issue. This bug is fixed on recent versions of GHC, but it’s present on ghc-9.7-start, which we want to build!
After fixing the ./configure step, I could attempt the build, but it didn’t get very far:
| Run GhcPkg Update (Stage0 InTreeLibs): _build/stage0/libraries/ghc-boot-th/inplace-pkg-config => none
dieVerbatim: user error (Error: hadrian:
'/nix/store/3dahn98q1m46ickndsg797zp8gv801b2-ghc-9.6.6-with-packages/bin/ghc'
exited with an error:
ghc-9.6.6: can't find a package database at
_build/stage0/libraries/ghc-boot-th/build/package.conf.inplace
)
Build failed.
At this point, I suspected my GHC / Cabal versions were to blame, so I started to look for more historically-accurate versions of the tools. First I found the date of the commit I was trying to build:
$ git show
commit fc3a2232da89ed4442b52a99ba1826d04362a7e8 (HEAD, tag: ghc-9.7-start)
Author: Ben Gamari <bgamari.foss@gmail.com>
Date: Thu Dec 22 13:45:06 2022 -0500
Bump GHC version to 9.7
Then I asked for a commit around that time in Nixpkgs:
$ git log --before='Thu Dec 22 13:45:06 2022' nixos-unstable
commit 012700eae502f6054a056cf7b94f78ff549e278d
Merge: 0550dfc0228a 5f1760cb902c
Author: Fabian Affolter <mail@fabian-affolter.ch>
Date: Thu Dec 22 22:38:20 2022 +0100
Merge pull request #207299 from r-ryantm/auto-update/python3.10-python-crontab
python310Packages.python-crontab: 2.6.0 -> 2.7.1
If you’ve read the git rev-parse documentation, you might be tempted to try git log 'nixos-unstable@{Thu Dec 22 13:45:06 2022}', but that syntax consults your local reflog, so it will attempt to find what your checkout had nixos-unstable pointing to on that date.
Attempting to build GHC from that revision gave me a compiler error:
ghc: panic! (the 'impossible' happened)
GHC version 9.4.2:
Template variable unbound in rewrite rule
Variable: sg_shgu :: WasmTypeTag 'I32 ~R# WasmTypeTag w_sgB4
Rule "SC:$j0"
Rule bndrs: [sg_shgu]
LHS args: [ty_word_X16]
Actual args: [ty_word_X16]
Call stack:
CallStack (from HasCallStack):
callStackDoc, called at compiler/GHC/Utils/Panic.hs:182:37 in ghc:GHC.Utils.Panic
pprPanic, called at compiler/GHC/Core/Rules.hs:619:10 in ghc:GHC.Core.Rules
I jumped ahead and tried a commit from a year later. That built successfully, but running any of the executables it produced failed:
$ _build/stage1/bin/ghc
dyld[12621]: symbol not found in flat namespace '_unixzm2zi8zi3zi0zminplace_SystemziPosixziFiles_getFileStatus_closure'
fish: Job 1, '_build/stage1/bin/ghc' terminated by signal SIGABRT (Abort)
It was around this time that I finally looked at the old ghc.nix repository again and noticed it was pointing to a commit on the nixos-23.05 branch of Nixpkgs. A stable release branch! Why hadn’t I thought of that earlier? I only needed one more tweak to fix ghc.nix with old versions of Nixpkgs that don’t include a top-level pkgs.happy attribute and my builds started completing successfully.
Now we’re ready to start bisecting!
Starting the bisection
First, mark a commit you know is bad and a commit you know is good:
git bisect start ghc-9.8.1-release ghc-9.7-start
Then run the bisection using a script to determine which commits are good:
git bisect run ../test.sh
Note that the test.sh script is in the parent directory so it doesn’t get deleted when we run git clean.
The bisection script
We’ll need to build GHC repeatedly across a variety of different versions, so we’ll need some tweaks to get things working smoothly.
First, we’ll want to remove any previous build artifacts and get the submodules set up correctly:
# Log commands as we run them.
set -x
# Reset the submodules in case of changes (e.g., to generated files).
# This takes about 5 seconds, or a minute for a fresh clone.
git submodule update --init --force --recursive || exit 128
# Remove existing build products.
# This takes roughly no time.
time rm -rf _build || exit 128
Why exit with code 128 if these commands fail? According to the git bisect man page:
Note that the script […] should exit with code 0 if the current source code is good/old, and exit with a code between 1 and 127 (inclusive), except 125, if the current source code is bad/new.
Any other exit code will abort the bisect process. It should be noted that a program that terminates via exit(-1) leaves $? = 255, (see the exit(3) manual page), as the value is chopped with & 0377.
The special exit code 125 should be used when the current source code cannot be tested. If the script exits with this code, the current revision will be skipped (see git bisect skip above). 125 was chosen as the highest sensible value to use for this purpose, because 126 and 127 are used by POSIX shells to signal specific error status (127 is for command not found, 126 is for command found but not executable—these details do not matter, as they are normal errors in the script, as far as bisect run is concerned).
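To make those rules concrete, here is a small sketch that maps an exit code to what git bisect run does with it. bisect_verdict is a made-up name for illustration; git itself has no such command:

```shell
# Translate a test script's exit code into git bisect run's reaction,
# following the rules quoted from the man page.
bisect_verdict() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo good    # current commit is good/old
  elif [ "$code" -eq 125 ]; then
    echo skip    # commit cannot be tested
  elif [ "$code" -ge 1 ] && [ "$code" -le 127 ]; then
    echo bad     # current commit is bad/new
  else
    echo abort   # 128 and up stop the whole bisection
  fi
}

bisect_verdict 128   # prints: abort
```

This is why the script uses 128 for build failures: a broken build stops the bisection instead of being recorded as a bad commit.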
The first script I tried looked like this:
function build {
git submodule update --init --force --recursive
rm -rf _build
./boot
# ...
}
set -ex
build || exit 128
I hoped that this would run the commands in build, returning 1 from build if any of the commands failed, and then exiting with 128 as a result. It turns out this is not the case! According to the Bash manual, using set -e will cause the script to:
Exit immediately if a pipeline, which may consist of a single simple command, a list, or a compound command returns a non-zero status. The shell does not exit if the command that fails is […] part of any command executed in a && or || list except the command following the final && or ||
If a compound command or shell function executes in a context where -e is being ignored, none of the commands executed within the compound command or function body will be affected by the -e setting, even if -e is set and a command returns a failure status.
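The behaviour is easy to reproduce in isolation. The following sketch is a stripped-down stand-in for my first script, with false playing the part of a failing build step; errexit_demo is just a name I picked for the demonstration:

```shell
# Runs in a subshell so that set -e and exit don't affect the caller.
errexit_demo() (
  set -e
  build() {
    false                           # a failing step
    echo "still running after the failure"
  }
  # Because build is on the left of ||, -e is ignored for its whole body.
  build || exit 128
)
```

Calling errexit_demo prints "still running after the failure" and returns 0: the failure of false is ignored, echo succeeds, and so build itself reports success and the exit 128 never fires.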
Therefore, because we run build || exit 128, build will keep going if one of the commands in it fails! I googled “bash errexit site:gnu.org” to find the documentation I’ve linked above, but the first result was a message sent to the Bash mailing list titled “why does errexit exist in its current utterly useless form?”, complaining about this exact problem.
To work around this, we have to append || exit 128 to each command in the build process. To state the obvious, this is tedious and error-prone.
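One possible way to cut down the repetition is a tiny wrapper function. This is a sketch of an alternative, not what the script in this post does, and must is a name I made up:

```shell
# Run a command; if it fails, abort the whole script with 128
# so that git bisect run stops instead of marking the commit bad.
must() {
  "$@" || exit 128
}

# Hypothetical usage inside a bisection script:
#   must git submodule update --init --force --recursive
#   must ./boot
#   must env CABFLAGS=--allow-newer ./hadrian/build -j --flavour=Quick
```

The env prefix is there because a variable assignment has to be part of the command that must runs. The wrapper only helps for simple commands; pipelines and command substitutions run in subshells, where exit only leaves the subshell.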
Anyways, now we’re ready to run the build! Booting the repository takes about two seconds:
./boot || exit 128
Configuring it takes about 30 seconds:
./configure $CONFIGURE_ARGS || exit 128
And building Hadrian and then GHC takes about 12 minutes on my desktop (M1 Ultra with 20 cores and 64GB RAM) or about 19 minutes on my laptop (M2 Max with 12 cores and 32GB RAM):
CABFLAGS=--allow-newer ./hadrian/build -j --flavour=Quick || exit 128
When we’re done, make sure to reset any changes we made to files in the working tree so that Git can check out the next revision:
git reset --hard || exit 128
Double-check that we’ve actually produced a runghc executable, and then we can use the built compiler to run our tests:
# Make sure we actually built a compiler:
if [[ ! -e _build/stage1/bin/runghc ]]
then
exit 128
fi
tmp=$(mktemp)
_build/stage1/bin/runghc ../Main.hs > "$tmp" || exit 128
# If the output doesn't contain 'action', the bug is present in this commit.
grep --quiet action "$tmp"
And that’s it! The finished product isn’t too complex, but the road to get there involved a lot of subtle pitfalls I wanted to write down.
Putting it all together
Here’s what the bisection test script looks like when it’s all assembled together! The script is also available as a GitHub Gist.
#!/usr/bin/env bash
# Log commands as we run them. (We don't use `set -e`; see the `|| exit 128` on each command.)
set -x
# Attempt to build GHC.
#
# If we can't build GHC, exit with 128. This will abort the entire `git bisect`
# instead of erroneously marking the commit as 'bad'.
#
# From `man git-bisect`:
# > Note that the script [...] should exit with code
# > 0 if the current source code is good/old, and exit with a code between 1
# > and 127 (inclusive), except 125, if the current source code is bad/new.
# >
# > Any other exit code will abort the bisect process. It should be noted that
# > a program that terminates via exit(-1) leaves $? = 255, (see the exit(3)
# > manual page), as the value is chopped with & 0377.
# 5 seconds:
# Reset the submodules if there's changes like generated files.
time git submodule update --init --force --recursive || exit 128
# Remove existing build products.
time rm -rf _build || exit 128
# Now we run the actual build process.
# See: https://gitlab.haskell.org/ghc/ghc/-/wikis/building/preparation
# 2 seconds:
time ./boot || exit 128
# 30 seconds:
time ./configure $CONFIGURE_ARGS || exit 128
# ~12 minutes (M1 Ultra, 20 cores), ~19 minutes (M2 Max, 12 cores):
time CABFLAGS=--allow-newer ./hadrian/build -j --flavour=Quick || exit 128
# Reset any modified files or checking out the next commit will fail:
time git reset --hard || exit 128
# Make sure we actually built a compiler:
if [[ ! -e _build/stage1/bin/runghc ]]
then
exit 128
fi
tmp=$(mktemp)
_build/stage1/bin/runghc "../Main.hs" > "$tmp" || exit 128
# If the output contains 'action', we're all OK.
# If it just says 'run', we have a problem!
grep --quiet action "$tmp"