Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

In February, I have started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on reenabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types and fix compat32 issues and fixing watchpoint support. Then, I've started working on improving thread support. You can read more about that in my July 2019 report.

I've been on vacation in August, and in September I've resumed the work on LLDB. I've started by fixing new regressions in LLVM suite, then improved my previous patches and continued debugging test failures and timeouts resulting from my patches.

LLVM 8 and 9 in NetBSD

Updates to LLVM 8 src branch

I have been asked to rebase my llvm8 branch of NetBSD src tree. I've done that, and updated it to LLVM 8.0.1 while at it.

LLVM 9 release

The LLVM 9.0.0 final has been tagged in September. I have been doing the pre-release testing for it, and discovered that the following tests were hanging:

LLVM :: ExecutionEngine/MCJIT/eh-lg-pic.ll
LLVM :: ExecutionEngine/MCJIT/eh.ll
LLVM :: ExecutionEngine/MCJIT/multi-module-eh-a.ll
LLVM :: ExecutionEngine/OrcMCJIT/eh-lg-pic.ll
LLVM :: ExecutionEngine/OrcMCJIT/eh.ll
LLVM :: ExecutionEngine/OrcMCJIT/multi-module-eh-a.ll

I couldn't reproduce the problem with LLVM trunk, so I've instead focused on looking for a fix. I've came to the conclusion that the problem was fixed through adding missing linked library. I've requested backport in bug 43196 and it has been merged in r371042.

I didn't put more effort into figuring out why the lack of this linkage caused issues for us. However, as Lang Hames said on the bug, ‘adding the dependency was the right thing to do’.

LLVM 9 for NetBSD src

Afterwards, I have started working on updating my NetBSD src branch for LLVM 9. However, in middle of that I've been informed that Joerg has already finished doing that independently, so I've stopped.

Furthermore, I was informed that LLVM 9.0.0 will not make it to src, since it still lacks some fixes (most notably, adding a pass to lower is.constant and objectsize intrinsics). Joerg plans to import some revision of the trunk instead.

Buildbot regressions

Initial regressions

The first problem that needed solving was LLDB build failure caused by replacing std::once_flag with llvm::once_flag. I've came to the conclusion that the build fails because the call site in LLDB combined std::call_once with llvm::once_flag. The solution was to replace the former with llvm::call_once.

After fixing the build failure, we had a bunch of test failures on buildbot to address. Kamil helped me and tracked one of them down to a new test for stack exhaustion handling. The test author decided that it ‘is only a best-effort mitigation for the case where things have already gone wrong’, and marked it unsupported on NetBSD.

On the plus side, two of the tests previously failing on NetBSD have been fixed upstream. I've un-XFAIL-ed them appropriately. Five new test failures in LLDB were related to those tests being unconditionally skipped before — I've marked them XFAIL pending further investigation in the future.

Another set of issues was caused by enabling -fvisibility=hidden for libc++ which caused problems when building with GCC. After being pinged, the author decided to enable it only for builds done using clang.

New issues through September

During September, two new issues arose. The first one was my fault, so I'm going to cover it in appropriate section below. The second one was new thread_local test failing. Since it was a newly added test that failed on most of the supported platforms, I've just added NetBSD to the list of failing platforms.

Current buildbot status

After fixing the immediate issues, the buildbot returned to previous status. The majority of tests pass, with one flaky test repeatedly timing out. Normally, I would skip this specific test in order to have buildbot report only fresh failures. However, since it is threading-related I'm waiting to finish my threading update and reassert afterwards.

Furthermore, I have added --shuffle to lit arguments in order to randomize the order in which the tests are run. According to upstream, this reduces the chance of load-intensive tests being run simultaneously and therefore causing timeouts.

The buildbot host seems to have started crashing recently. OpenMP tests were causing similar issues in the past, and I'm currently trying to figure out whether they are the culprit again.

__has_feature(leak_sanitizer)

Kamil asked me to implement a feature check for leak sanitizer being used. The __has_feature(leak_sanitizer) preprocessor macro is complementary to __SANITIZE_LEAK__ used in NetBSD gcc and is used to avoid reports when leaks are known but the cost of fixing them exceeds the gain.

Progress in threading support

Fixing LLDB bugs

In the course of previous work, I had a patch for threading support in LLDB partially ready. However, the improvements have also resulted in some of the tests starting to hang. The main focus of my late work as investigating those problems.

The first issue that I've discovered was inconsistency in expressing no signal sent. In some places, LLDB used LLDB_INVALID_SIGNAL (-1) to express that, in others it used 0. So far this went unnoticed since the end result in ptrace calls was the same. However, the reworked NetBSD threading support used explicit PT_SET_SIGINFO which — combined with wrong signal parameter — wiped previously queued signal.

I've fixed C packet handler, then fixed c, vCont and s handlers to use LLDB_INVALID_SIGNAL correctly. However, I've only tested the fixes with my updated thread support, causing regression in the old code. Therefore, I've also had to fix LLDB_INVALID_SIGNAL handling in NetBSD plugin for the time being.

Thread suspend/resume kernel problem

Sadly, further investigation of hanging tests led me to the conclusion that they are caused by kernel bugs. The first bug I've noticed is that PT_SUSPEND/PT_RESUME do not cause the thread to be resumed correctly. I have written the following reproducer for it:

#include <assert.h>
#include <lwp.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

void* thread_func(void* foo) {
    int i;
    printf("in thread_func, lwp = %d\n", _lwp_self());
    for (i = 0; i < 100; ++i) {
        printf("t2 %d\n", i);
        sleep(2);
    }
    printf("out thread_func\n");
    return NULL;
}

int main() {
    int ret;
    int pid = fork();
    assert(pid != -1);
    if (pid == 0) {
        int i;
        pthread_t t2;

        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);
        printf("in main, lwp = %d\n", _lwp_self());
        ret = pthread_create(&t2, NULL, thread_func, NULL);
        assert(ret == 0);
        printf("thread started\n");

        for (i = 0; i < 100; ++i) {
            printf("t1 %d\n", i);
            sleep(2);
        }

        ret = pthread_join(t2, NULL);
        assert(ret == 0);
        printf("thread joined\n");
    }

    sleep(1);
    ret = kill(pid, SIGSTOP);
    assert(ret == 0);
    printf("stopped\n");

    pid_t waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    printf("t2 suspend\n");
    ret = ptrace(PT_SUSPEND, pid, NULL, 2);
    assert(ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    sleep(3);
    ret = kill(pid, SIGSTOP);
    assert(ret == 0);
    printf("stopped\n");

    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    printf("t2 resume\n");
    ret = ptrace(PT_RESUME, pid, NULL, 2);
    assert(ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    sleep(5);
    ret = kill(pid, SIGTERM);
    assert(ret == 0);

    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    return 0;
}

The program should run a two-threaded subprocess, with both threads outputting successive numbers. The second thread should be suspended shortly, then resumed. However, currently it does not resume.

I believe that this caused by ptrace_startstop() altering process flags without reimplementing the complete logic as used by lwp_suspend() and lwp_continue(). I've been able to move forward by calling the two latter functions from ptrace_startstop(). However, Kamil has indicated that he'd like to make those routines use separate bits (to distinguish LWPs stopped by process from LWPs stopped by debugger), so I haven't pushed my patch forward.

Multiple thread reporting kernel problem

The second and more important problem is related to how new LWPs are reported to the debugger. Or rather, that they are not reported reliably. When many threads are started by the process in a short time (e.g. in a loop), the debugger receives reports only for some of them.

This can be reproduced using the following program:

#include <assert.h>
#include <lwp.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

void* thread_func(void* foo) {
    printf("in thread, lwp = %d\n", _lwp_self());
    sleep(10);
    return NULL;
}

int main() {
    int ret;
    int pid = fork();
    assert(pid != -1);
    if (pid == 0) {
        int i;
        pthread_t t[10];

        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);
        printf("in main, lwp = %d\n", _lwp_self());
        raise(SIGSTOP);
        printf("main resumed\n");

        for (i = 0; i < 10; i++) {
            ret = pthread_create(&t[i], NULL, thread_func, NULL);
            assert(ret == 0);
            printf("thread %d started\n", i);
        }

        for (i = 0; i < 10; i++) {
            ret = pthread_join(t[i], NULL);
            assert(ret == 0);
            printf("thread %d joined\n", i);
        }

        return 0;
    }

    pid_t waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);
    assert(WSTOPSIG(ret) == SIGSTOP);

    struct ptrace_event ev;
    ev.pe_set_event = PTRACE_LWP_CREATE | PTRACE_LWP_EXIT;

    ret = ptrace(PT_SET_EVENT_MASK, pid, &ev, sizeof(ev));
    assert(ret == 0);

    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    while (1) {
        waited = waitpid(pid, &ret, 0);
        assert(waited == pid);
        printf("wait: %d\n", ret);
        if (WIFSTOPPED(ret)) {
            assert(WSTOPSIG(ret) == SIGTRAP);

            ptrace_siginfo_t info;
            ret = ptrace(PT_GET_SIGINFO, pid, &info, sizeof(info));
            assert(ret == 0);

            struct ptrace_state pst;
            ret = ptrace(PT_GET_PROCESS_STATE, pid, &pst, sizeof(pst));
            assert(ret == 0);
            printf("SIGTRAP: si_code = %d, ev = %d, lwp = %d\n",
                    info.psi_siginfo.si_code, pst.pe_report_event, pst.pe_lwp);

            ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
            assert(ret == 0);
        } else
            break;
    }

    return 0;
}

The program starts 10 threads, and the debugger should report 10 SIGTRAP events for LWPs being started (ev = 8) and the same number for LWPs exiting (ev = 16). However, initially I've been getting as many as 4 SIGTRAPs, and the remaining 6 threads went unnoticed.

The issue is that do_lwp_create() does not raise SIGTRAP directly but defers that to mi_startlwp() that is called asynchronously as the LWP starts. This means that the former function can return before SIGTRAP is emitted, and the program can start another LWP. Since signals are not properly queued, multiple SIGTRAPs can end up being issued simultaneously and lost.

Kamil has already worked on making simultaneous signal deliver more reliable. However, he reverted his commit as it caused regressions. Nevertheless, applying it made it possible for the test program to get all SIGTRAPs at least most of the time.

The ‘repeated’ SIGTRAPs did not include correct LWP information, though. Kamil has recently fixed that by moving the relevant data from process information to signal information struct. Combined with his earlier patch, this makes my test program pass most of the time (sadly, there seem to be some more race conditions involved).

Summary of threading work

My current work-in-progress patch can be found on Differential as D64647. However, it is currently unsuitable for merging as some tests start failing or hanging as a side effect of the changes. I'd like to try to get as many of them fixed as possible before pushing the changes to trunk, in order to avoid causing harm to the build bot.

The status with the current set of Kamil's work-in-progress patches applied to the kernel includes approximately 4 failing tests and 10 hanging tests.

Other LLVM news

Manikishan Ghantasala has been working on NetBSD-specific clang-format improvements in this year's Google Summer of Code. He is continuing to work on clang-format, and has recently been given commit access to the LLVM project!

Besides NetBSD-specific work, I've been trying to improve a few other areas of LLVM. I've been working on fixing regressions in stand-alone build support and regressions in support for BUILD_SHARED_LIBS=ON builds. I have to admit that while a year ago I was the only person fixing those issues, nowadays I see more contributions submitting patches for breakages specific to those builds.

I have recently worked on fixing bad assumptions in LLDB's Python support. However, it seems that Haibo Huang has taken it from me and is doing a great job.

My most recent endeavor was fixing LLVM_DISTRIBUTION_COMPONENTS support in LLVM projects. This is going to make it possible to precisely fine-tune which components are installed, both in combined tree and stand-alone builds.

Future plans

My first goal right now is to assert what is causing the test host to crash, and restore buildbot stability. Afterwards, I'd like to continue investigating threading problems and provide more reproducers for any kernel issues we may be having. Once this is done, I'd like to finally push my LLDB patch.

Since threading is not the only goal left in the TODO, I may switch between working on it and on the remaining TODO items. Those are:

  1. Add support to backtrace through signal trampoline and extend the support to libexecinfo, unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).

  2. Add support for i386 and aarch64 targets.

  3. Stabilize LLDB and address breaking tests from the test suite.

  4. Merge LLDB with the base system (under LLVM-style distribution).

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

https://netbsd.org/donations/#how-to-donate

Posted Saturday afternoon, October 5th, 2019 Tags:

Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

In February, I have started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on reenabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types and fix compat32 issues and fixing watchpoint support. Then, I've started working on improving thread support. You can read more about that in my July 2019 report.

I've been on vacation in August, and in September I've resumed the work on LLDB. I've started by fixing new regressions in LLVM suite, then improved my previous patches and continued debugging test failures and timeouts resulting from my patches.

LLVM 8 and 9 in NetBSD

Updates to LLVM 8 src branch

I have been asked to rebase my llvm8 branch of NetBSD src tree. I've done that, and updated it to LLVM 8.0.1 while at it.

LLVM 9 release

The LLVM 9.0.0 final has been tagged in September. I have been doing the pre-release testing for it, and discovered that the following tests were hanging:

LLVM :: ExecutionEngine/MCJIT/eh-lg-pic.ll
LLVM :: ExecutionEngine/MCJIT/eh.ll
LLVM :: ExecutionEngine/MCJIT/multi-module-eh-a.ll
LLVM :: ExecutionEngine/OrcMCJIT/eh-lg-pic.ll
LLVM :: ExecutionEngine/OrcMCJIT/eh.ll
LLVM :: ExecutionEngine/OrcMCJIT/multi-module-eh-a.ll

I couldn't reproduce the problem with LLVM trunk, so I've instead focused on looking for a fix. I've came to the conclusion that the problem was fixed through adding missing linked library. I've requested backport in bug 43196 and it has been merged in r371042.

I didn't put more effort into figuring out why the lack of this linkage caused issues for us. However, as Lang Hames said on the bug, ‘adding the dependency was the right thing to do’.

LLVM 9 for NetBSD src

Afterwards, I have started working on updating my NetBSD src branch for LLVM 9. However, in middle of that I've been informed that Joerg has already finished doing that independently, so I've stopped.

Furthermore, I was informed that LLVM 9.0.0 will not make it to src, since it still lacks some fixes (most notably, adding a pass to lower is.constant and objectsize intrinsics). Joerg plans to import some revision of the trunk instead.

Buildbot regressions

Initial regressions

The first problem that needed solving was LLDB build failure caused by replacing std::once_flag with llvm::once_flag. I've came to the conclusion that the build fails because the call site in LLDB combined std::call_once with llvm::once_flag. The solution was to replace the former with llvm::call_once.

After fixing the build failure, we had a bunch of test failures on buildbot to address. Kamil helped me and tracked one of them down to a new test for stack exhaustion handling. The test author decided that it ‘is only a best-effort mitigation for the case where things have already gone wrong’, and marked it unsupported on NetBSD.

On the plus side, two of the tests previously failing on NetBSD have been fixed upstream. I've un-XFAIL-ed them appropriately. Five new test failures in LLDB were related to those tests being unconditionally skipped before — I've marked them XFAIL pending further investigation in the future.

Another set of issues was caused by enabling -fvisibility=hidden for libc++ which caused problems when building with GCC. After being pinged, the author decided to enable it only for builds done using clang.

New issues through September

During September, two new issues arose. The first one was my fault, so I'm going to cover it in appropriate section below. The second one was new thread_local test failing. Since it was a newly added test that failed on most of the supported platforms, I've just added NetBSD to the list of failing platforms.

Current buildbot status

After fixing the immediate issues, the buildbot returned to previous status. The majority of tests pass, with one flaky test repeatedly timing out. Normally, I would skip this specific test in order to have buildbot report only fresh failures. However, since it is threading-related I'm waiting to finish my threading update and reassert afterwards.

Furthermore, I have added --shuffle to lit arguments in order to randomize the order in which the tests are run. According to upstream, this reduces the chance of load-intensive tests being run simultaneously and therefore causing timeouts.

The buildbot host seems to have started crashing recently. OpenMP tests were causing similar issues in the past, and I'm currently trying to figure out whether they are the culprit again.

__has_feature(leak_sanitizer)

Kamil asked me to implement a feature check for leak sanitizer being used. The __has_feature(leak_sanitizer) preprocessor macro is complementary to __SANITIZE_LEAK__ used in NetBSD gcc and is used to avoid reports when leaks are known but the cost of fixing them exceeds the gain.

Progress in threading support

Fixing LLDB bugs

In the course of previous work, I had a patch for threading support in LLDB partially ready. However, the improvements have also resulted in some of the tests starting to hang. The main focus of my late work as investigating those problems.

The first issue that I've discovered was inconsistency in expressing no signal sent. In some places, LLDB used LLDB_INVALID_SIGNAL (-1) to express that, in others it used 0. So far this went unnoticed since the end result in ptrace calls was the same. However, the reworked NetBSD threading support used explicit PT_SET_SIGINFO which — combined with wrong signal parameter — wiped previously queued signal.

I've fixed C packet handler, then fixed c, vCont and s handlers to use LLDB_INVALID_SIGNAL correctly. However, I've only tested the fixes with my updated thread support, causing regression in the old code. Therefore, I've also had to fix LLDB_INVALID_SIGNAL handling in NetBSD plugin for the time being.

Thread suspend/resume kernel problem

Sadly, further investigation of hanging tests led me to the conclusion that they are caused by kernel bugs. The first bug I've noticed is that PT_SUSPEND/PT_RESUME do not cause the thread to be resumed correctly. I have written the following reproducer for it:

#include <assert.h>
#include <lwp.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

void* thread_func(void* foo) {
    int i;
    printf("in thread_func, lwp = %d\n", _lwp_self());
    for (i = 0; i < 100; ++i) {
        printf("t2 %d\n", i);
        sleep(2);
    }
    printf("out thread_func\n");
    return NULL;
}

int main() {
    int ret;
    int pid = fork();
    assert(pid != -1);
    if (pid == 0) {
        int i;
        pthread_t t2;

        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);
        printf("in main, lwp = %d\n", _lwp_self());
        ret = pthread_create(&t2, NULL, thread_func, NULL);
        assert(ret == 0);
        printf("thread started\n");

        for (i = 0; i < 100; ++i) {
            printf("t1 %d\n", i);
            sleep(2);
        }

        ret = pthread_join(t2, NULL);
        assert(ret == 0);
        printf("thread joined\n");
    }

    sleep(1);
    ret = kill(pid, SIGSTOP);
    assert(ret == 0);
    printf("stopped\n");

    pid_t waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    printf("t2 suspend\n");
    ret = ptrace(PT_SUSPEND, pid, NULL, 2);
    assert(ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    sleep(3);
    ret = kill(pid, SIGSTOP);
    assert(ret == 0);
    printf("stopped\n");

    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    printf("t2 resume\n");
    ret = ptrace(PT_RESUME, pid, NULL, 2);
    assert(ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    sleep(5);
    ret = kill(pid, SIGTERM);
    assert(ret == 0);

    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    return 0;
}

The program should run a two-threaded subprocess, with both threads outputting successive numbers. The second thread should be suspended shortly, then resumed. However, currently it does not resume.

I believe that this caused by ptrace_startstop() altering process flags without reimplementing the complete logic as used by lwp_suspend() and lwp_continue(). I've been able to move forward by calling the two latter functions from ptrace_startstop(). However, Kamil has indicated that he'd like to make those routines use separate bits (to distinguish LWPs stopped by process from LWPs stopped by debugger), so I haven't pushed my patch forward.

Multiple thread reporting kernel problem

The second and more important problem is related to how new LWPs are reported to the debugger. Or rather, that they are not reported reliably. When many threads are started by the process in a short time (e.g. in a loop), the debugger receives reports only for some of them.

This can be reproduced using the following program:

#include <assert.h>
#include <lwp.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

void* thread_func(void* foo) {
    printf("in thread, lwp = %d\n", _lwp_self());
    sleep(10);
    return NULL;
}

int main() {
    int ret;
    int pid = fork();
    assert(pid != -1);
    if (pid == 0) {
        int i;
        pthread_t t[10];

        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);
        printf("in main, lwp = %d\n", _lwp_self());
        raise(SIGSTOP);
        printf("main resumed\n");

        for (i = 0; i < 10; i++) {
            ret = pthread_create(&t[i], NULL, thread_func, NULL);
            assert(ret == 0);
            printf("thread %d started\n", i);
        }

        for (i = 0; i < 10; i++) {
            ret = pthread_join(t[i], NULL);
            assert(ret == 0);
            printf("thread %d joined\n", i);
        }

        return 0;
    }

    pid_t waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);
    assert(WSTOPSIG(ret) == SIGSTOP);

    struct ptrace_event ev;
    ev.pe_set_event = PTRACE_LWP_CREATE | PTRACE_LWP_EXIT;

    ret = ptrace(PT_SET_EVENT_MASK, pid, &ev, sizeof(ev));
    assert(ret == 0);

    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    while (1) {
        waited = waitpid(pid, &ret, 0);
        assert(waited == pid);
        printf("wait: %d\n", ret);
        if (WIFSTOPPED(ret)) {
            assert(WSTOPSIG(ret) == SIGTRAP);

            ptrace_siginfo_t info;
            ret = ptrace(PT_GET_SIGINFO, pid, &info, sizeof(info));
            assert(ret == 0);

            struct ptrace_state pst;
            ret = ptrace(PT_GET_PROCESS_STATE, pid, &pst, sizeof(pst));
            assert(ret == 0);
            printf("SIGTRAP: si_code = %d, ev = %d, lwp = %d\n",
                    info.psi_siginfo.si_code, pst.pe_report_event, pst.pe_lwp);

            ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
            assert(ret == 0);
        } else
            break;
    }

    return 0;
}

The program starts 10 threads, and the debugger should report 10 SIGTRAP events for LWPs being started (ev = 8) and the same number for LWPs exiting (ev = 16). However, initially I've been getting as many as 4 SIGTRAPs, and the remaining 6 threads went unnoticed.

The issue is that do_lwp_create() does not raise SIGTRAP directly but defers that to mi_startlwp() that is called asynchronously as the LWP starts. This means that the former function can return before SIGTRAP is emitted, and the program can start another LWP. Since signals are not properly queued, multiple SIGTRAPs can end up being issued simultaneously and lost.

Kamil has already worked on making simultaneous signal deliver more reliable. However, he reverted his commit as it caused regressions. Nevertheless, applying it made it possible for the test program to get all SIGTRAPs at least most of the time.

The ‘repeated’ SIGTRAPs did not include correct LWP information, though. Kamil has recently fixed that by moving the relevant data from process information to signal information struct. Combined with his earlier patch, this makes my test program pass most of the time (sadly, there seem to be some more race conditions involved).

Summary of threading work

My current work-in-progress patch can be found on Differential as D64647. However, it is currently unsuitable for merging as some tests start failing or hanging as a side effect of the changes. I'd like to try to get as many of them fixed as possible before pushing the changes to trunk, in order to avoid causing harm to the build bot.

The status with the current set of Kamil's work-in-progress patches applied to the kernel includes approximately 4 failing tests and 10 hanging tests.

Other LLVM news

Manikishan Ghantasala has been working on NetBSD-specific clang-format improvements in this year's Google Summer of Code. He is continuing to work on clang-format, and has recently been given commit access to the LLVM project!

Besides NetBSD-specific work, I've been trying to improve a few other areas of LLVM. I've been working on fixing regressions in stand-alone build support and regressions in support for BUILD_SHARED_LIBS=ON builds. I have to admit that while a year ago I was the only person fixing those issues, nowadays I see more contributions submitting patches for breakages specific to those builds.

I have recently worked on fixing bad assumptions in LLDB's Python support. However, it seems that Haibo Huang has taken it from me and is doing a great job.

My most recent endeavor was fixing LLVM_DISTRIBUTION_COMPONENTS support in LLVM projects. This is going to make it possible to precisely fine-tune which components are installed, both in combined tree and stand-alone builds.

Future plans

My first goal right now is to assert what is causing the test host to crash, and restore buildbot stability. Afterwards, I'd like to continue investigating threading problems and provide more reproducers for any kernel issues we may be having. Once this is done, I'd like to finally push my LLDB patch.

Since threading is not the only goal left in the TODO, I may switch between working on it and on the remaining TODO items. Those are:

  1. Add support to backtrace through signal trampoline and extend the support to libexecinfo, unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).

  2. Add support for i386 and aarch64 targets.

  3. Stabilize LLDB and address breaking tests from the test suite.

  4. Merge LLDB with the base system (under LLVM-style distribution).

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

https://netbsd.org/donations/#how-to-donate

Posted Saturday afternoon, October 5th, 2019 Tags:
I have introduced changes that make debuggers more reliable in threaded scenarios. Additionally, I have enhanced Leak Sanitizer support and introduced various improvements in the basesystem.

Threading support

Threads and synchronization in the kernel, in general, is an evergreen task of the kernel developers. The process of enhancing support for tracing multiple threads has been documented by Michal Gorny in his LLDB entry Threading support in LLDB continued.

Overall I have introduced these changes:

  • Separate suspend from userland (_lwp_suspend(2)) flag from suspend by a debugger (PT_SUSPEND). This removes one of the underlying problems of threading stability as a debuggee was able to accidentally unstop suspended thread. This property is needed whenever we want to trace a selection (typically single entity) of threads.
  • Store SIGTRAP event information inside siginfo_t, rather than in struct proc. A single signal can only be reported at the time to the debugger, and its context is no longer prone to be overwritten by concurrent threads.
  • Change that introduces restarts in functions notifying events for debuggers. There was a time window between registering an event by a thread, stopping the process and unlocking mutexes of the process; as another process could take the mutexes before being stopped and overwrite the event with its own data. Now each event routine for debugger checks whether a process is already stopping (or demising or no longer being tracked) and preserves the signal to be emitted locally in the context of the lwp local variable on the stack and continues stopping self as requested by the other LWP. Once the thread is awaken, it retries to emit the signal and deliver the event signal to the debugger.
  • Introduce PT_STOP, that combines kill(SIGSTOP) and ptrace(PT_CONTINUE,SIGSTOP) semantics in a single call. It works like:
    • kill(SIGSTOP) for unstopped tracee
    • ptrace(PT_CONTINUE,SIGSTOP) for stopped tracee
    The child will be stopped and always possible to be waited (with wait(2) like calls).

    For stopped tracee kill(SIGSTOP) has no effect. PT_CONTINUE+SIGSTOP cannot be used on an unstopped process (EBUSY).

    This operation is modeled after PT_KILL that is similar for the SIGKILL call. While there, allow PT_KILL on unstopped traced child.

    This operation is useful in an abnormal exit of a debugger from a signal handler, usually followed by waitpid(2) and ptrace(PT_DETACH).

For the sake of tracking the missed in action signals emitted by tracee, I have introduced the feature in NetBSD truss (as part of the picotrace repository) to register syscall entry (SCE) and syscall exit (SCX) calls and track missing SCE/SCX events that were never delivered. Unfortunately, the number of missing events was huge, even for simple 2-threaded applications.

    truss[2585] running for 22.205305922 seconds
    truss[2585] attached to child=759 ('firefox') for 22.204289369 seconds
    syscall                     seconds      calls     errors missed-sce missed-scx
    read                    0.048522952        609          0         54         76
    write                   0.044693735        487          0         35         66
    open                    0.002516815         18          0          5          5
    close                   0.001015263         17          0          9          6
    unlink                  0.001375463         13          0          3          0
    getpid                  0.093458089       1993          0         16         56
    geteuid                 0.000049301          1          0          0          1
    recvmsg                 0.343353019       4828       3685         90        112
    access                  0.001450653         12          3          5          4
    dup                     0.000570904         10          0          0          1
    munmap                  0.010375949         88          0          6          3
    mprotect                0.196781932       2251          0         11         62
    madvise                 0.049820002        430          0         11         18
    writev                  0.237488362       1507          0         76         67
    rename                  0.000379918          2          0          1          0
    mkdir                   0.000283846          2          2          1          2
    mmap                    0.033342935        481          0         15         40
    lseek                   0.003341775         62          0         25         24
    ftruncate               0.000507707          9          0          1          0
    __sysctl                0.000144506          2          0          0          0
    poll                   18.694195617       4531          0        106        191
    __sigprocmask14         0.001585329         20          0          0          2
    getcontext              0.000083238          1          0          0          0
    _lwp_create             0.000104646          1          0          0          0
    _lwp_self               0.001456718         22          0         24         79
    _lwp_unpark             0.035319633        607          0         14         39
    _lwp_unpark_all         0.020660377        250          0         38         50
    _lwp_setname            0.000118418          2          0          0          0
    __select50             15.125525493        637          0         82        125
    __gettimeofday50        3.279021049       2930          0         40        135
    __clock_gettime50      10.673311747      33132          0       1418       3003
    __stat50                0.006375356         52          3         12          5
    __fstat50               0.001490944         17          0          3          2
    __lstat50               0.000110906          1          0          1          0
    __getrusage50           0.008863815        109          0          7          1
    ___lwp_park60          62.720893458        964        251        454        453
                          -------------    -------    -------    -------    -------
                          111.638589870      56098       3944       2563       4628

With my kernel changes landed, the number of missed sce/scx events is down to zero (with exceptions to signals that e.g. never return such as the exit(2) call).

Once these changes settle in HEAD, I plan to backport them to NetBSD-9. I have already received feedback that GDB works much better now.

The kernel also has now more runtime asserts that validate correctness of the code paths.

Sanitizers

I've introduced a special preprocessor macro to detect LSan (__SANITIZE_LEAK__) and UBSan (__SANITIZE_UNDEFINED__) in GCC. The patches were submitted upstream to the GCC mailing list, in two patches (LSan + UBSan). Unfortunately, GCC does not see value in feature parity with LLVM and for the time being it will be a local NetBSD specific GCC extension. These macros are now integrated into the NetBSD public system headers, for use by the basesystem software.

The LSan macro is now used inside the LLVM codebase and the ps(1) program is the first user of it. The UBSan macro is now used to disable relaxed alignment on x86. While such code is still functional, it is not clean from undefined behavior as specified by C. This is especially needed in the kernel fuzzing process, as we can reduce noise from less interesting reports.

During the previous month a number of reports from kernel fuzzing were fixed. There is still more to go.

Almost all local patches needed for LSan were merged upstream. The last remaining local patch is scheduled for later as it is very invasive for all platforms and sanitizers. In the worst case we just have more false negatives in detection of leaks in specific scenarios.

Miscellaneous changes

I have fixed a regression in upstream GDB with SIGTTOU handling. This was an upstream bug fixed by Alan Hayward and cherry-picked by me. As a side effect, a certain environment setup would cause the tracer to sleep.

I have reverted the regression in changed in6_addr change. It appeased UBSan, but broke at least qemu networking. The regression was tracked down by Andreas Gustafsson and reported in the NetBSD's bug tracking system.

I have landed a patch that returns ELF loader dl_phdr_info information for dl_iterate_phdr(3). This synchronized the behavior with Linux, FreeBSD and OpenBSD and is used by sanitizers.

I have passed through core@ the patch to change the kevent::udata type from intptr_t to void*. The former is slightly more pedantic, but the latter is what is in all other kevent users and this mismatch of types affected specifically C++ users that needed special NetBSD-only workarounds.

I have marked git and hg meta files as ones to be ignored by cvs import. This was causing problems among people repackaging the NetBSD source code with other VCS software than CVS.

I keep working on getting GDB test-suite to run on NetBSD, I spent some time on getting fluent in the TCL programming language (as GDB uses dejagnu and TCL scripting). I have already fixed two bugs that affected NetBSD users in the TCL runtime: getaddrbyname_r and gethostbyaddr_r were falsely reported as available and picked on NetBSD, causing damage in operation. Fluency in TCL will allow me to be more efficient in addressing and debugging failing tests in GDB and likely reuse this knowledge in other fields useful for the project.

I made __CTASSERT a static assert again. Previously, this homegrown check for compile-time checks silently stopped working for C99 compilers supporting VLA (variable length array). It was caught by kUBSan that detected VLA of dynamic size of -1, that is still compatible but has unspecified runtime semantics. The new form is inspired by the Perl ctassert code and uses bit-field constant that enforces the assert to be effective again. Few misuses __CTASSERT, mostly in the Linux DRMKMS code, were fixed.

I have submitted a proposal to the C Working Group a proposal to add new methods for setting and getting the thread name.

Plan for the next milestone

Keep stabilizing the reliability debugging interfaces and get ATF and LLDB threading code reliably pass tests. Cover more scenarios with ptrace(2) in the ATF regression test-suite.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

http://netbsd.org/donations/#how-to-donate

Posted mid-morning Thursday, October 10th, 2019 Tags:
I have introduced changes that make debuggers more reliable in threaded scenarios. Additionally, I have enhanced Leak Sanitizer support and introduced various improvements in the basesystem.

Threading support

Threads and synchronization in the kernel, in general, is an evergreen task of the kernel developers. The process of enhancing support for tracing multiple threads has been documented by Michal Gorny in his LLDB entry Threading support in LLDB continued.

Overall I have introduced these changes:

  • Separate suspend from userland (_lwp_suspend(2)) flag from suspend by a debugger (PT_SUSPEND). This removes one of the underlying problems of threading stability as a debuggee was able to accidentally unstop suspended thread. This property is needed whenever we want to trace a selection (typically single entity) of threads.
  • Store SIGTRAP event information inside siginfo_t, rather than in struct proc. A single signal can only be reported at the time to the debugger, and its context is no longer prone to be overwritten by concurrent threads.
  • Change that introduces restarts in functions notifying events for debuggers. There was a time window between registering an event by a thread, stopping the process and unlocking mutexes of the process; as another process could take the mutexes before being stopped and overwrite the event with its own data. Now each event routine for debugger checks whether a process is already stopping (or demising or no longer being tracked) and preserves the signal to be emitted locally in the context of the lwp local variable on the stack and continues stopping self as requested by the other LWP. Once the thread is awaken, it retries to emit the signal and deliver the event signal to the debugger.
  • Introduce PT_STOP, that combines kill(SIGSTOP) and ptrace(PT_CONTINUE,SIGSTOP) semantics in a single call. It works like:
    • kill(SIGSTOP) for unstopped tracee
    • ptrace(PT_CONTINUE,SIGSTOP) for stopped tracee
    The child will be stopped and always possible to be waited (with wait(2) like calls).

    For stopped tracee kill(SIGSTOP) has no effect. PT_CONTINUE+SIGSTOP cannot be used on an unstopped process (EBUSY).

    This operation is modeled after PT_KILL that is similar for the SIGKILL call. While there, allow PT_KILL on unstopped traced child.

    This operation is useful in an abnormal exit of a debugger from a signal handler, usually followed by waitpid(2) and ptrace(PT_DETACH).

For the sake of tracking the missed in action signals emitted by tracee, I have introduced the feature in NetBSD truss (as part of the picotrace repository) to register syscall entry (SCE) and syscall exit (SCX) calls and track missing SCE/SCX events that were never delivered. Unfortunately, the number of missing events was huge, even for simple 2-threaded applications.

    truss[2585] running for 22.205305922 seconds
    truss[2585] attached to child=759 ('firefox') for 22.204289369 seconds
    syscall                     seconds      calls     errors missed-sce missed-scx
    read                    0.048522952        609          0         54         76
    write                   0.044693735        487          0         35         66
    open                    0.002516815         18          0          5          5
    close                   0.001015263         17          0          9          6
    unlink                  0.001375463         13          0          3          0
    getpid                  0.093458089       1993          0         16         56
    geteuid                 0.000049301          1          0          0          1
    recvmsg                 0.343353019       4828       3685         90        112
    access                  0.001450653         12          3          5          4
    dup                     0.000570904         10          0          0          1
    munmap                  0.010375949         88          0          6          3
    mprotect                0.196781932       2251          0         11         62
    madvise                 0.049820002        430          0         11         18
    writev                  0.237488362       1507          0         76         67
    rename                  0.000379918          2          0          1          0
    mkdir                   0.000283846          2          2          1          2
    mmap                    0.033342935        481          0         15         40
    lseek                   0.003341775         62          0         25         24
    ftruncate               0.000507707          9          0          1          0
    __sysctl                0.000144506          2          0          0          0
    poll                   18.694195617       4531          0        106        191
    __sigprocmask14         0.001585329         20          0          0          2
    getcontext              0.000083238          1          0          0          0
    _lwp_create             0.000104646          1          0          0          0
    _lwp_self               0.001456718         22          0         24         79
    _lwp_unpark             0.035319633        607          0         14         39
    _lwp_unpark_all         0.020660377        250          0         38         50
    _lwp_setname            0.000118418          2          0          0          0
    __select50             15.125525493        637          0         82        125
    __gettimeofday50        3.279021049       2930          0         40        135
    __clock_gettime50      10.673311747      33132          0       1418       3003
    __stat50                0.006375356         52          3         12          5
    __fstat50               0.001490944         17          0          3          2
    __lstat50               0.000110906          1          0          1          0
    __getrusage50           0.008863815        109          0          7          1
    ___lwp_park60          62.720893458        964        251        454        453
                          -------------    -------    -------    -------    -------
                          111.638589870      56098       3944       2563       4628

With my kernel changes landed, the number of missed sce/scx events is down to zero (with exceptions to signals that e.g. never return such as the exit(2) call).

Once these changes settle in HEAD, I plan to backport them to NetBSD-9. I have already received feedback that GDB works much better now.

The kernel also has now more runtime asserts that validate correctness of the code paths.

Sanitizers

I've introduced a special preprocessor macro to detect LSan (__SANITIZE_LEAK__) and UBSan (__SANITIZE_UNDEFINED__) in GCC. The patches were submitted upstream to the GCC mailing list, in two patches (LSan + UBSan). Unfortunately, GCC does not see value in feature parity with LLVM and for the time being it will be a local NetBSD specific GCC extension. These macros are now integrated into the NetBSD public system headers, for use by the basesystem software.

The LSan macro is now used inside the LLVM codebase and the ps(1) program is the first user of it. The UBSan macro is now used to disable relaxed alignment on x86. While such code is still functional, it is not clean from undefined behavior as specified by C. This is especially needed in the kernel fuzzing process, as we can reduce noise from less interesting reports.

During the previous month a number of reports from kernel fuzzing were fixed. There is still more to go.

Almost all local patches needed for LSan were merged upstream. The last remaining local patch is scheduled for later as it is very invasive for all platforms and sanitizers. In the worst case we just have more false negatives in detection of leaks in specific scenarios.

Miscellaneous changes

I have fixed a regression in upstream GDB with SIGTTOU handling. This was an upstream bug fixed by Alan Hayward and cherry-picked by me. As a side effect, a certain environment setup would cause the tracer to sleep.

I have reverted the regression in changed in6_addr change. It appeased UBSan, but broke at least qemu networking. The regression was tracked down by Andreas Gustafsson and reported in the NetBSD's bug tracking system.

I have landed a patch that returns ELF loader dl_phdr_info information for dl_iterate_phdr(3). This synchronized the behavior with Linux, FreeBSD and OpenBSD and is used by sanitizers.

I have passed through core@ the patch to change the kevent::udata type from intptr_t to void*. The former is slightly more pedantic, but the latter is what is in all other kevent users and this mismatch of types affected specifically C++ users that needed special NetBSD-only workarounds.

I have marked git and hg meta files as ones to be ignored by cvs import. This was causing problems among people repackaging the NetBSD source code with other VCS software than CVS.

I keep working on getting GDB test-suite to run on NetBSD, I spent some time on getting fluent in the TCL programming language (as GDB uses dejagnu and TCL scripting). I have already fixed two bugs that affected NetBSD users in the TCL runtime: getaddrbyname_r and gethostbyaddr_r were falsely reported as available and picked on NetBSD, causing damage in operation. Fluency in TCL will allow me to be more efficient in addressing and debugging failing tests in GDB and likely reuse this knowledge in other fields useful for the project.

I made __CTASSERT a static assert again. Previously, this homegrown check for compile-time checks silently stopped working for C99 compilers supporting VLA (variable length array). It was caught by kUBSan that detected VLA of dynamic size of -1, that is still compatible but has unspecified runtime semantics. The new form is inspired by the Perl ctassert code and uses bit-field constant that enforces the assert to be effective again. Few misuses __CTASSERT, mostly in the Linux DRMKMS code, were fixed.

I have submitted a proposal to the C Working Group a proposal to add new methods for setting and getting the thread name.

Plan for the next milestone

Keep stabilizing the reliability debugging interfaces and get ATF and LLDB threading code reliably pass tests. Cover more scenarios with ptrace(2) in the ATF regression test-suite.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

http://netbsd.org/donations/#how-to-donate

Posted mid-morning Thursday, October 10th, 2019 Tags:
Add a comment