Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

Originally, LLDB was ported to NetBSD by Kamil Rytarowski. However, multiple upstream changes and lack of continuous testing have resulted in decline of support. So far we haven't been able to restore the previous state.

In February, I have started working on LLDB, as contracted by the NetBSD Foundation. My four first goals as detailed in the previous report were:

  1. Restore tracing in LLDB for NetBSD (i386/amd64/aarch64) for single-threaded applications.

  2. Restore execution of LLDB regression tests, unless there is need for a significant LLDB or kernel work, mark detected bugs as failing or unsupported ones.

  3. Enable execution of LLDB regression tests on the buildbot in order to catch regressions.

  4. Upstream NetBSD (i386/amd64) core(5) support. Develop LLDB regression tests (and the testing framework enhancement) as requested by upstream.

Of those tasks, I consider running regression tests on the buildbot the highest priority. Bisecting regressions post-factum is hard due to long build times, and having continuous integration working is going to be very helpful to maintaining the code long-term.

In this report, I'd like to summarize what I achieved and what technical difficulties I met.

The kqueue interoperability issues

Given no specific clue as to why LLDB was no longer able to start processes on NetBSD, I've decided to start by establishing the status of the test suites. More specifically, I've started with a small subset of LLDB test suite — unittests. In this section, I'd like to focus on two important issues I had with them.

Firstly, one of the tests was hanging indefinitely. As I established, the purpose of the test was to check whether the main loop implementation correctly detects and reports when all the slaves of a pty are disconnected (and therefore the reads on master would fail). Through debugging, I've came to the conclusion that kevent() is not reporting this particular scenario.

I have built a simple test case (which is now part of kqueue ATF tests) and confirmed it. Afterwards, I have attempted to establish whether this behavior is correct. While kqueue(2) does not mention ptys specifically, it states the following for pipes:

Fifos, Pipes

Returns when there is data to read; data contains the number of bytes available.

When the last writer disconnects, the filter will set EV_EOF in flags. This may be cleared by passing in EV_CLEAR, at which point the filter will resume waiting for data to become available before returning.

Furthermore, my test program indicated that FreeBSD exhibits the described EV_EOF behavior. Therefore, I have decided to write a kernel patch adding this functionality, submitted it to review and eventually committed it after applying helpful suggestions from Robert Elz ([PATCH v3] kern/tty_pty: Fix reporting EOF via kevent and add a test case). I have also disabled the test case temporarily since the functionality is non-critical to LLDB (r353545).

Secondly, a few gdbserver-based tests were flaky — i.e. unpredictably passed and failed every iteration. I've started debugging this with a test whose purpose was to check verbose error messages support in the protocol. To my surprise, it seemed as if gdbserver worked fine as far as error message exchange was concerned. This packet was followed by a termination request from client — and it seemed that the server sometimes replies to it correctly, and sometimes terminates just before receiving it.

While working on this particular issue, I've noticed a few deficiencies in LLDB's error handling. In this case, this involved two major issues:

  1. gdbserver ignored errors from main loop. As a result, if kevent() failed, it silently exited with a successful status. I've fixed it to catch and report the error verbosely instead: r354030.

  2. Main loop reported meaningless return value (-1) from kevent(). I've established that most likely all kevent() implementation use errno instead, and made the function return it: r354029.

After applying those two fixes, gdbserver clearly indicated the problem: kevent() returned due to EINTR (i.e. the process receiving a signal). Lacking correct handling for this value, the main loop implementation wrongly treated it as fatal error and terminated the program. I've fixed this via implementing EINTR support for kevent() in r354122.

This trivial fix not only resolved most of the flaky tests but also turned out to be the root cause for LLDB being unable to start processes. Therefore, at this point tracing for single-threaded processes was restored on amd64. Testing on other platforms is pending.

Now, for the moral: working error reporting can save a lot of time.

Socket issues

The next issue I hit while working on the unittests is rather curious, and I have to admit I haven't managed to neither find the root cause or build a good reproducer for it. Nevertheless, I seem to have caught the gist of it and found a good workaround.

The test in question focuses on the high-level socket API in LLDB. It is rather trivial — it binds a server in one thread, and tries to connect to it from a second thread. So far, so good. Most of the time the test works just fine. However, sometimes — especially early after booting — it hangs forever.

I've debugged this thoroughly and came to the following conclusion: the test binds to (i.e. purely IPv4) but tries to connect to localhost. The latter results in the client trying IPv6 first, failing and then succeeding with IPv4. The connection is accepted, the test case moves forward and terminates successfully.

Now, in the failing case, the IPv6 connection attempt succeeds, even though there is no server bound to that port. As a result, the client part is happily connected to a non-existing service, and the server part hangs forever waiting for the connection to come.

I have attempted to reproduce this with an isolated test case, reproducing the use of threads, binding to port zero, the IPv4/IPv6 mixup and I simply haven't been able to reproduce this. However, curiously enough my test case actually fixes the problem. I mean, if I start my test case before LLDB unit tests, they work fine afterwards (until next reboot).

Being unable to make any further progress on this weird behavior, I've decided to fix the test design instead — and make it connect to the same address it binds to: r353868.

Getting the right toolchain for live testing

The largest problem so far was getting LLDB tests to interoperate with NetBSD's clang driver correctly. On other systems, clang either defaults to libstdc++, or has libc++ installed as part of the system (FreeBSD, Darwin). The NetBSD driver wants to use libc++ but we do not have it installed by default.

While this could be solved via installing libc++ on the buildbot host, I thought it would be better to establish a solution that would allow LLDB to use just-built clang — similarly to how other LLVM projects (such as OpenMP) do. This way, we would be testing the matching libc++ revision and users would be able to run the tests in a single checkout out of the box.

Sadly, this is non-trivial. While it could be all hacked into the driver itself, it does not really belong there. While it is reasonable to link tests into the build tree, we wouldn't want regular executables built by user to bind to it. This is why normally this is handled via the test system. However, the tests in LLDB are an accumulation of at least three different test systems, each one calling the compiler separately.

In order to establish a baseline for this, I have created wrappers for clang that added the necessary command-line options. The state-of-art wrapper for clang looked like the following:

#!/usr/bin/env bash

cxxinc="-cxx-isystem $topdir/include/c++/v1"
lpath="-L $topdir/lib"

# needed to handle 'clang -v' correctly
[ $# -eq 1 ] && [ "$1" = -v ] && exec $topdir/bin/clang-9-real "$@"
exec $topdir/bin/clang-9-real $cxxinc $lpath $rpath "$@" $pthread $libs

The actual executable I renamed to clang-9-real, and this wrapper replaced clang and a similar one replaced clang++. clang-cl was linked to the real executable (as it wasn't called in wrapper-relevant contexts), while clang-9 was linked to the wrapper.

After establishing a baseline of working tests, I've looked into migrating the necessary bits one by one to the driver and/or LLDB test system, removing the migrated parts and verifying whether tests pass the same.

My proposal so far involves, appropriately:

  1. Replacing -cxx-isystem with libc++ header search using path relative the compiler executable: D58592.

  2. Integrating -L and -Wl,-rpath with the LLDB test system: D58630.

  3. Adding NetBSD to list of platforms needing -pthread: r355274.

  4. The need for -lunwind is solved via switching the test failing due to the lack of it to use libc++ instead of libstdc++: r355273.

The reason for adjusting libc++ header search in the driver rather than in LLDB tests is that the path is specific to building against libc++, and the driver makes it convenient to adjust the path conditionally to standard C++ library being used. In other words, it saves us from hard-relying on the assumption that tests will be run against libc++ only.

I've went for integrating -L in the test system since we do not want to link arbitrary programs to the libraries in LLVM's build directory. Appending this path unconditionally should be otherwise harmless to LLDB's tests, so that is the easier way to go.

Originally I wanted to avoid appending RPATHs. However, it seems that the LD_LIBRARY_PATH solution that works for Linux does not reliably work on NetBSD with LLDB. Therefore, passing -Wl,-rpath along with -L allowed me to solve the problem simpler.

Furthermore, those design solutions match other LLVM projects. I've mentioned OpenMP before — so far we had to pass -cxx-isystem to its tests explicitly but it passed -L for us. Those patches render passing -cxx-isystem unnecessary, and therefore make LLDB follow the suit of OpenMP.

Finishing touches

Having a reasonably working compiler and major regressions fixed, I have focused on establishing a baseline for running tests. The goal is to mark broken tests XFAIL or skip them. With all tests marked appropriately, we would be able to start running tests on the buildbot and catch regressions compared to this baseline. The current progress on this can be see in D58527.

Sadly, besides failing tests there is still a small number of flaky or hanging tests which are non-trivial to detect. The upstream maintainer, Pavel Labath is very helpful and I hope to be able to finally get all the flaky tests either fixed or covered with his help.

Other fixes not worth a separate section include:

  • fixing compiler warnings about empty format strings: r354922,

  • fixing two dlopen() based test cases not to link -ldl on NetBSD: r354617,

  • finishing Kamil's patch for core file support: r354466, followup fix in r354483,

  • removing dead code in main loop: r354050,

  • fixing stand-alone builds after they've been switched to LLVMConfig.cmake: r353925,

  • skipping lldb-mi tests when Python support (needed by lldb-mi) is disabled: r353700,

  • fixing incorrect initialization of sigset_t (not actually used right now): r353675.

Buildbot updates

The last part worth mentioning is that the NetBSD LLVM buildbot has seen some changes. Notably, zorg r354820 included:

  • fixing the bot commit filtering to include all projects built,

  • renaming the bot to shorter netbsd-amd64,

  • and moving it to toolchain category.

One of the most useful functions of buildbot is that it associated every successive build with new commits. If the build fails, it blames the authors of those commits and reports the failure to them. However, for this to work buildbot needs to be aware which projects are being tested.

Our buildbot configuration has been initially based on one used for LLDB, and it assumed LLVM, Clang and LLDB are the only projects built and tested. Over time, we've added additional projects but we failed to update the buildbot configs appropriately. Finally, with the help of Jonas Hahnfeld, Pavel Labath and Galina Kistanova we've managed to update the list and make the bot blame all projects correctly.

While at it, we were suggested to rename the bot. The previous name was lldb-amd64-ninja-netbsd8, and others suggested that the developers may ignore failures in other projects seeing lldb there. Kamil Rytarowski also pointed out that the version number confuses users to believe that we're running separate bots for different versions. The new name and category mean to clearly indicate that we're running a single bot instance for multiple projects.

Quick summary and future plans

At this point, the most important regressions in LLDB have been fixed and it is able to debug simple programs on amd64 once again. The test suite patches are still waiting for review, and once they're approved I still need to work on flaky tests before we can reliably enable that on the buildbot. This is the first priority.

The next item on the TODO list is to take over and finish Kamil's patch for core files with thread. Most notably, the patch requires writing tests, and verifying whether there are no new bugs affecting it.

On a semi-related note, LLVM 8.0.0 will be released in a few days and I will be probably working on updating src to the new version. I will also try to convince Joerg to switch from unmaintained libcxxrt to upstream libc++abi. Kamil also wanted to change libc++ include path to match upstream (NetBSD is dropping /v1 suffix at the moment).

Once this is done, the next big step is to fix threading support. Testing on non-amd64 arches is deferred until I gain access to some hardware.

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

Posted Saturday evening, March 2nd, 2019 Tags:
Kernel signal code is a complex maze, it's very difficult to introduce non-trivial changes without regressions. Over the past month I worked on covering missing elementary scenarios involving the ptrace(2) API. Part of the new tests were marked as expected to success, however a number of them are expected to fail.

The NetBSD distribution changes

I've also introduced non-ptrace(2) related changes namely from the domain of kernel sanitizers, kernel fixes and corresponding ATF tests. I won't discuss them further as they were beyond the ptrace(2) scope. These changes were largely stimulated by students preparing for summer work as a part of Google Summer of Code.

The ptrace(2) ATF commits landed into the repository:

  • Define PTRACE_ILLEGAL_ASM for NetBSD/amd64 in ptrace.h
  • Enable 3 new ptrace(2) tests for SIGILL
  • Refactor GPR and FPR tests in t_ptrace_wait* tests
  • Refactor definition of PT_STEP tests into single macro
  • Correct a style in description of PT_STEP tests in t_ptrace_wait*
  • Refactor kill* test in t_ptrace_wait*
  • Add infinite_thread() for ptrace(2) ATF tests
  • Add initial pthread(3) tests in ATF t_prace_wait* tests
  • Link t_ptrace_wait* tests with -pthread
  • Initial refactoring of siginfo* tests in t_ptrace_wait*
  • Drop siginfo5 from ATF tests in t_ptrace_wait*
  • Merge siginfo6 into other PT_STEP tests in t_ptrace_wait*
  • Rename the siginfo4 test in ATF t_ptrace_wait*
  • Refactor lwp_create1 and lwp_exit1 into trace_thread* in ptrace(2) tests
  • Rename signal1 to signal_mask_unrelated in t_ptrace_wait*
  • Add new regression scenarios for crash signals in t_ptrace_wait*
  • Replace signal2 in t_ptrace_wait* with new tests
  • Add new ATF tests traceme_raisesignal_ignored in t_ptrace_wait*
  • Add new ATF tests traceme_signal{ignored,masked}_crash* in t_ptrace_wait*
  • Add additional assert in traceme_signalmasked_crash t_ptrace_wait* tests
  • Add additional assert in traceme_signalignored_crash t_ptrace_wait* tests
  • Remove redundant test from ATF t_ptrace_wait*
  • Add new ATF t_ptrace_wait* vfork(2) tests
  • Add minor improvements in unrelated_tracer_sees_crash in t_ptrace_wait*
  • Add more tests for variations of unrelated_tracer_sees_crash in ATF
  • Replace signal4 (PT_STEP) test with refactored ones with extra asserts
  • Add signal masked and ignored variations of traceme_vfork_exec in ATF tests
  • Add signal masked and ignored variations of traceme_exec in ATF tests
  • Drop signal5 test-case from ATF t_ptrace_wait*
  • Refactor signal6-8 tests in t_ptrace_wait*

Trap signals processing without signal context reset

The current NetBSD kernel approach of processing crash signals (SEGV, FPE, BUS, ILL, TRAP) is to reset the context of signals. This behavior was introduced as an intermediate and partially legitimate fix for cases of masking a crash signal that was causing infinite loop in a dying process.

The expected behavior is to never reset signal context of a trap signal (or any other signal) when executed under a debugger. In order to achieve these semantics I've introduced a fix for this for the first time last year, but I had to revert quickly, as it caused side effect breakage, not covered by existing at that time ATF ptrace(2) regression tests. This time I made sure to cover upfront almost all interesting scenarios that are requested to function properly. Surprisingly after grabbing old faulty fix and improving it locally, the current signal maze code caused various side effects in corner cases, such as translating SIGKILL in certain tests to previous trap signal (like SIGSEGV).. In other cases side effect behavior seems to be probably even stranger, as one tests hangs only against a certain type of wait(2)-like function (waitid(2)), and executes without hangs against other wait(2)-like function types.

For the reference such surprises can be achieved with the following patch:

Index: sys/kern/kern_sig.c
RCS file: /cvsroot/src/sys/kern/kern_sig.c,v
retrieving revision 1.350
diff -u -r1.350 kern_sig.c
--- sys/kern/kern_sig.c	29 Nov 2018 10:27:36 -0000	1.350
+++ sys/kern/kern_sig.c	3 Mar 2019 19:26:54 -0000
@@ -911,13 +911,25 @@
+	if (ISSET(p->p_slflag, PSL_TRACED) &&
+	    !(p->p_pptr == p->p_opptr && ISSET(p->p_lflag, PL_PPWAIT))) {
+		p->p_xsig = signo;
+		p->p_sigctx.ps_faked = true; // XXX
+		p->p_sigctx.ps_info._signo = signo;
+		p->p_sigctx.ps_info._code = ksi->ksi_code;
+		sigswitch(0, signo, false);
+		// XXX ktrpoint(KTR_PSIG)
+		mutex_exit(p->p_lock);
+		return;
+	}
 	mask = &l->l_sigmask;
 	ps = p->p_sigacts;
-	const bool traced = (p->p_slflag & PSL_TRACED) != 0;
 	const bool caught = sigismember(&p->p_sigctx.ps_sigcatch, signo);
 	const bool masked = sigismember(mask, signo);
-	if (!traced && caught && !masked) {
+	if (caught && !masked) {
 		kpsendsig(l, ksi, mask);

Such changes need proper investigation and addressing bugs that are now detectable easier with the extended test-suite.

Plan for the next milestone

Keep preparing kernel fixes and after thorough verification applying them to the mainline.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

Posted early Monday morning, March 4th, 2019 Tags: