
This page is a blog mirror of sorts. It pulls in articles from the blog's feed and publishes them here (with a feed, too).

This month I improved the NetBSD ptrace(2) API by removing a legacy interface with several flaws, replacing it with two new calls that offer new features, and reducing technical debt.

As LLVM 10.0 is branching soon (January 15th, 2020), I worked on proper support for LLVM features on NetBSD 9.0 (RC1 as of today) and NetBSD HEAD (the future 10.0).

ptrace(2) API changes

There are around 20 machine-independent ptrace(2) calls. Some of them trace their origin back to 4.3BSD. The PT_LWPINFO call was introduced in 2003 and was loosely inspired by a similar interface in HP-UX ttrace(2). As that was early in the history of POSIX threads and SMP support, not every part of the interface has remained suitable for current computing needs.

The PT_LWPINFO call was originally intended to retrieve the thread (LWP) information inside a traced process.

This call was designed to work as an iterator over threads to retrieve the LWP id + event information. The event information is received in a raw format (PL_EVENT_NONE, PL_EVENT_SIGNAL, PL_EVENT_SUSPENDED).

Problems:

1. PT_LWPINFO shares the operation name with PT_LWPINFO from FreeBSD that works differently and is used for different purposes:

  • On FreeBSD PT_LWPINFO returns pieces of information for the suspended thread, not the next thread in the iteration.
  • FreeBSD uses a custom interface for iterating over threads (actually retrieving the threads is done with PT_GETNUMLWPS + PT_GETLWPLIST).
  • There is almost no overlapping correct usage of PT_LWPINFO on NetBSD and PT_LWPINFO on FreeBSD, and this causes confusion and misuse of the interfaces (recently I fixed such misuse in the DTrace code).

2. pl_event can only return whether a signal was emitted to all threads or a single one. There is no information whether this is a per-LWP signal or per-PROC signal, no siginfo_t information is attached etc.

3. Syncing our behavior with FreeBSD would completely break our existing PT_LWPINFO users, and it is actually unnecessary: we already receive the full siginfo_t through the Linux-like PT_GET_SIGINFO, instead of reimplementing siginfo_t inside ptrace_lwpinfo FreeBSD-style. (FreeBSD wanted to follow NetBSD and adopt some of our APIs in ptrace(2) and signals.)

4. Our PT_LWPINFO is unable to list LWP ids in a traced process.

5. The PT_LWPINFO semantics cannot be used in core files as-is (as our PT_LWPINFO returns the next LWP, not the indicated one), and pl_event is redundant with netbsd_elfcore_procinfo.cpi_siglwp while still being less powerful (it cannot distinguish between a per-LWP and a per-PROC signal in a single-threaded application).

6. PT_LWPINFO is already documented in the BUGS section of ptrace(2), as it contains additional flaws.

Solution:

1. Remove PT_LWPINFO from the public ptrace(2) API, keeping it only as a hidden namespaced symbol for legacy compatibility.

2. Introduce PT_LWPSTATUS, which queries the kernel about a specific thread and retrieves useful information about that LWP.

3. Introduce PT_LWPNEXT with the iteration semantics from PT_LWPINFO, namely return the next LWP.

4. Include per-LWP information in core(5) files as "PT_LWPSTATUS@nnn".

5. Fix flattening the signal context in netbsd_elfcore_procinfo in core(5) files, and move per-LWP signal information to the per-LWP structure "PT_LWPSTATUS@nnn".

6. Do not bother with FreeBSD-like PT_GETNUMLWPS + PT_GETLWPLIST calls, as this would be a micro-optimization. We intend to retrieve the list of threads once on attach/exec and later trace them through the LWP events (PTRACE_LWP_CREATE, PTRACE_LWP_EXIT). It's more important to keep compatibility with the current usage of PT_LWPINFO.

7. Keep the existing ATF tests for PT_LWPINFO to avoid rot.
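To make the difference between the two new requests concrete, the sketch below simulates both semantics in plain C over an in-memory list of LWP ids. It is a model only, not NetBSD code: mock_lwpstatus(), mock_lwpnext() and struct mock_proc are hypothetical names, ESRCH as the "no such LWP" error is an assumption, and the convention that an id of 0 starts the walk and is reported back at the end is assumed to carry over from PT_LWPINFO. A real tracer would instead call ptrace(2) with PT_LWPSTATUS or PT_LWPNEXT and a struct ptrace_lwpstatus.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for lwpid_t (assumed to be a 32-bit integer). */
typedef int32_t mock_lwpid_t;

/* A simulated traced process: LWP ids kept in ascending order. */
struct mock_proc {
    const mock_lwpid_t *lwps;
    size_t nlwps;
};

/* PT_LWPSTATUS-like: succeed only if exactly this LWP exists. */
int mock_lwpstatus(const struct mock_proc *p, mock_lwpid_t lid)
{
    for (size_t i = 0; i < p->nlwps; i++)
        if (p->lwps[i] == lid)
            return 0;
    errno = ESRCH; /* assumed error for a nonexistent LWP */
    return -1;
}

/* PT_LWPNEXT-like: replace *lid with the first LWP id greater than *lid.
 * Passing 0 starts the walk; 0 is written back once no LWP is left. */
int mock_lwpnext(const struct mock_proc *p, mock_lwpid_t *lid)
{
    for (size_t i = 0; i < p->nlwps; i++) {
        if (p->lwps[i] > *lid) {
            *lid = p->lwps[i];
            return 0;
        }
    }
    *lid = 0;
    return 0;
}

/* Enumerate all LWPs once, as a tracer would do on attach. */
size_t mock_list_lwps(const struct mock_proc *p, mock_lwpid_t *out, size_t max)
{
    size_t n = 0;
    mock_lwpid_t lid = 0;
    while (n < max && mock_lwpnext(p, &lid) == 0 && lid != 0)
        out[n++] = lid;
    return n;
}
```

After this one-time enumeration, the plan in item 6 is to follow thread creation and exit through the PTRACE_LWP_CREATE and PTRACE_LWP_EXIT events rather than polling the list again.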

PT_LWPSTATUS and PT_LWPNEXT operate on the newly introduced "struct ptrace_lwpstatus". This structure is inspired by the following, and by their usage in real, existing open-source software:

  • SmartOS lwpstatus_t,
  • struct ptrace_lwpinfo from NetBSD,
  • struct ptrace_lwpinfo from FreeBSD.

#define PL_LNAMELEN 20 /* extra 4 for alignment */

struct ptrace_lwpstatus {
 lwpid_t  pl_lwpid;  /* LWP described */
 sigset_t pl_sigpend;  /* LWP signals pending */
 sigset_t pl_sigmask;  /* LWP signal mask */
 char  pl_name[PL_LNAMELEN]; /* LWP name, may be empty */
 void  *pl_private;  /* LWP private data */
 /* Add fields at the end */
};

  • pl_lwpid is taken from PT_LWPINFO.
  • pl_event is removed entirely as useless, misleading and harmful.
  • pl_sigpend and pl_sigmask are mainly intended to untangle the per-LWP signal state from the cpi_sig* fields of "struct netbsd_elfcore_procinfo" (fixing an "XXX" in the kernel code).
  • pl_name is an easy-to-use API to retrieve the LWP name, replacing retrieval via sysctl(). (Previous algorithm: retrieve the number of LWPs; retrieve all LWPs; iterate over them; find the matching ID; copy the LWP name.) pl_name will also supply the LWP name information that is currently missing from core(5) files.
  • pl_private implements a previously missing interface to read the TLS base value.
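As an illustration of how a tracer might consume the new fields, the sketch below uses a local mirror of the structure quoted above (suffixed _demo, with an assumed int32_t stand-in for lwpid_t, so that it compiles on any POSIX system). It checks whether a signal is pending and unblocked for an LWP, and reads pl_name without assuming it is NUL-terminated. The helper names are hypothetical; real code would include NetBSD's <sys/ptrace.h> and fill the structure via PT_LWPSTATUS.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>

/* Illustrative mirror of struct ptrace_lwpstatus; not the real header. */
typedef int32_t lwpid_t_demo; /* assumption: stands in for NetBSD lwpid_t */
#define PL_LNAMELEN 20

struct ptrace_lwpstatus_demo {
    lwpid_t_demo pl_lwpid;
    sigset_t pl_sigpend;
    sigset_t pl_sigmask;
    char pl_name[PL_LNAMELEN];
    void *pl_private;
};

/* True if sig is pending for this LWP and not blocked by its mask. */
int lwp_signal_deliverable(const struct ptrace_lwpstatus_demo *st, int sig)
{
    return sigismember(&st->pl_sigpend, sig) == 1 &&
           sigismember(&st->pl_sigmask, sig) == 0;
}

/* Copy the (possibly empty) name into buf, never reading past the field. */
const char *lwp_name(const struct ptrace_lwpstatus_demo *st,
                     char buf[PL_LNAMELEN + 1])
{
    size_t len = strnlen(st->pl_name, PL_LNAMELEN);
    if (len == 0)
        return "(unnamed)";
    memcpy(buf, st->pl_name, len);
    buf[len] = '\0';
    return buf;
}
```

Compared with the previous sysctl()-based algorithm, the name is available directly from a single per-LWP query.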

I have decided to avoid a writable version of PT_LWPSTATUS that rewrites signals, name, or private pointer. These options are practically unused in existing open-source software. There are two exceptions that I am familiar with, but both are specific to kludges overusing ptrace(2). If these operations are needed, they can be implemented without a writable version of PT_LWPSTATUS by patching the tracee's code.

I have switched GDB (in base), LLDB, picotrace and the sanitizers to the new API. As NetBSD 9.0 is nearing release, this API change will land in NetBSD 10.0, and existing ptrace(2) software will keep using PT_LWPINFO for now.

The new interfaces are intended to remain stable and are continuously verified by the ATF infrastructure.

pthreadtracer

Early in the history of libpthread, the NetBSD developers designed and implemented a libpthread_dbg library. Its use case was initially to handle user-space scheduling of threads in the M:N threading model inspired by Solaris.

After Andrew Doran switched the internals to the new SMP design (1:1 model), this library lost its purpose and was no longer used (except for being linked for some time into a local version of GDB in the base system). I removed libpthread_dbg when I modernized the ptrace(2) API, as it no longer had any use (and it had been broken in several ways for years without anyone noticing).

As I introduced the PT_LWPSTATUS call, I decided to verify this interface in a fancy way. I mapped ptrace_lwpstatus::pl_private onto struct tls_tcb as defined in the sys/tls.h header:

struct tls_tcb {   
#ifdef __HAVE_TLS_VARIANT_I
        void    **tcb_dtv;
        void    *tcb_pthread;
#else
        void    *tcb_self;
        void    **tcb_dtv;
        void    *tcb_pthread;
#endif
};

The pl_private pointer is in fact a pointer into the debuggee's address space, pointing to a tls_tcb structure. This is not universally true in every environment, but it holds for regular programs using the ELF loader and the libpthread library. With the tcb_pthread field we can then reference a regular C-style pthread_t object. Wrapping this into a real tracer, I implemented a program that can either start a debuggee or attach to a process and, on demand (as a SIGINFO handler, usually triggered in the BSD environment with ctrl-t), dump the full state of the pthread_t objects within the process. Part of an example session is below:

$ ./pthreadtracer -p `pgrep nslookup` 
[ 21088.9252645] load: 2.83  cmd: pthreadtracer 6404 [wait parked] 0.00u 0.00s 0% 1600k
DTV=0x7f7ff7ee70c8 TCB_PTHREAD=0x7f7ff7e94000
LID=4 NAME='sock-0' TLS_TSD=0x7f7ff7eed890
pt_self = 0x7f7ff7e94000
pt_tls = 0x7f7ff7eed890
pt_magic = 0x11110001 (= PT_MAGIC=0x11110001)
pt_state = 1
pt_lock = 0x0
pt_flags = 0
pt_cancel = 0
pt_errno = 35
pt_stack = {.ss_sp = 0x7f7fef9e0000, ss_size = 4194304, ss_flags = 0}
pt_stack_allocated = YES
pt_guardsize = 65536

The full log is stored here. The source code of this program, built on top of picotrace, is here.
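The pointer chase described above can be sketched as follows. Given pl_private (the tracee's TCB address), the tracer reads the word stored at the offset of tcb_pthread, and that offset depends on the TLS variant. The sketch computes it from local copies of the two struct tls_tcb layouts shown earlier; tcb_pthread_addr() is a hypothetical helper, and the actual read from the tracee's memory (for example via ptrace(2) PT_IO on NetBSD) is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copies of both struct tls_tcb layouts from sys/tls.h. */
struct tls_tcb_variant_i {
    void **tcb_dtv;
    void *tcb_pthread;
};

struct tls_tcb_variant_ii {
    void *tcb_self;
    void **tcb_dtv;
    void *tcb_pthread;
};

/* Address (in the tracee) of the word holding the pthread_t pointer,
 * assuming pl_private is the TCB address as described above. */
uintptr_t tcb_pthread_addr(uintptr_t pl_private, int tls_variant_i)
{
    size_t off = tls_variant_i
        ? offsetof(struct tls_tcb_variant_i, tcb_pthread)
        : offsetof(struct tls_tcb_variant_ii, tcb_pthread);
    return pl_private + off;
}
```

On amd64, which uses TLS variant II, this yields pl_private + 16.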

The problem with this utility is that it requires the libpthread sources to be available and reachable by the build rules, because pthreadtracer accesses each field of pthread_t knowing its exact internal layout. This is enough for validating PT_LWPSTATUS, but is it enough for shipping it to users and finding a real-world use case? Debuggers (GDB, LLDB) that use debug information can reach the same data via DWARF, but supporting DWARF in pthreadtracer is currently harder than it ought to be for interface tests. Another option would be to revive libpthread_dbg(3) at some point, revamping it for the modern libpthread(3). This would avoid the need for DWARF introspection and could find some use in self-introspection programs, but are there any such programs?

LLD

I keep searching for a solution to properly support lld (LLVM linker).

NetBSD's major issue with LLVM lld is the lack of standalone linker support, which is what would make it a real GNU ld replacement. I was forced to publish a standalone wrapper for lld, called lld-standalone, and host it on GitHub for the time being, at least until we settle the discussion with the LLVM developers.

LLVM sanitizers

As the NetBSD code is evolving, there is a need to support multiple kernel versions starting from 9.0 with the LLVM sanitizers. I have introduced the following changes:

  • [compiler-rt] [netbsd] Switch to syscall for ThreadSelfTlsTcb()
  • [compiler-rt] [netbsd] Add support for versioned statvfs interceptors
  • [compiler-rt] Sync NetBSD ioctl definitions with 9.99.26
  • [compiler-rt] [fuzzer] Include stdarg.h for va_list
  • [compiler-rt] [fuzzer] Enable LSan in libFuzzer tests on NetBSD
  • [compiler-rt] Enable SANITIZER_CAN_USE_PREINIT_ARRAY on NetBSD
  • [compiler-rt] Adapt stop-the-world for ptrace changes in NetBSD-9.99.30
  • [compiler-rt] Adapt for ptrace(2) changes in NetBSD-9.99.30

The purpose of these changes is as follows:

  • Stop using the internal interface to retrieve the tls_tcb struct (TLS base) and switch to the public API with the _lwp_getprivate(2) syscall. While there, I harmonized the namespacing of __lwp_getprivate_fast() and __lwp_gettcb_fast() in the NetBSD distribution. Now every port needs to use the same define (-D_RTLD_SOURCE, -D_LIBC_SOURCE or -D__LIBPTHREAD_SOURCE__). Previously these interfaces conflicted with the public namespaces (affecting kernel builds) and wrongly suggested that they might be available to public third-party code. Initially I used it in the LLVM sanitizers, but then switched to the full syscall _lwp_getprivate().
  • Nowadays almost every mainstream OS implements support for .preinit_array/.init_array/.fini_array on all ports, regardless of ABI requirements. NetBSD originally supported these features only where they were mandated by an ABI specification. In 2018, Christos Zoulas enabled these features for all CPUs, which eventually allowed us to enable this feature unconditionally for use in the sanitizer code. This allows using the same interface as on Linux or Solaris, rather than relying on C++-style constructors that have their own issues (the need to abuse constructor priorities, and the lack of a guarantee that our code will be called before other constructors, which can be fatal).
  • Support for kernels between 9.0 and 9.99.30 (and later, unless there are breaking changes).
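For readers unfamiliar with .preinit_array: an ELF executable can register early initialization functions by placing function pointers in that section, and the runtime runs them before the ordinary constructors from .init_array. The sketch below is a generic ELF illustration using GCC/Clang attributes, not the sanitizer code itself (compiler-rt registers its init function in a similar way); early_hook, ordinary_ctor and the flag variables are illustrative names. Note that .preinit_array is only honoured in the main executable, not in shared libraries.

```c
/* Set by the hooks below so that the ordering can be observed. */
int preinit_ran = 0;
int ctor_saw_preinit = 0;

/* Runs from .preinit_array, before any constructor of the program. */
static void early_hook(void)
{
    preinit_ran = 1;
}

/* Store a pointer to early_hook in the executable's .preinit_array.
 * "used" keeps the otherwise unreferenced variable from being dropped. */
__attribute__((section(".preinit_array"), used))
static void (*early_hook_entry)(void) = early_hook;

/* An ordinary constructor, run later from .init_array. */
__attribute__((constructor))
static void ordinary_ctor(void)
{
    ctor_saw_preinit = preinit_ran;
}
```

This ordering guarantee is exactly what C++-style constructors with priorities cannot provide across all objects in a program.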

There is still one portability issue in the sanitizers: we hard-code the offset of the link_map field within the internal dlopen handle pointer. The dlopen handle is internal to the ELF loader and is an object of type Obj_Entry. This type is not available to third-party code and it is not stable. It also has a different layout depending on the CPU architecture. The same problem exists for at least FreeBSD, and to some extent Linux. I have prepared a patch that utilizes the dlinfo(3) call with the RTLD_DI_LINKMAP option. Unfortunately, there is a regression with MSan on NetBSD HEAD (it works on 9.0rc1) that makes it harder for me to finalize the patch. I suspect that after the switch to GCC 8 there is now incompatible behavior that causes a recursive call sequence: _Unwind_Backtrace() calls _Unwind_Find_FDE(), which calls search_object, which triggers the __interceptor_malloc interceptor again, which calls _Unwind_Backtrace(), resulting in a deadlock. The offending code is located in src/external/gpl3/gcc/dist/libgcc/unwind-dw2-fde.c and needs proper investigation. Unfortunately, a quick workaround to stop the recursive stack unwinding did not work, as there is another (related?) problem:

==4629==MemorySanitizer CHECK failed:
/public/llvm-project/llvm/projects/compiler-rt/lib/msan/msan_origin.h:104 "((stack_id)) != (0)" (0x0, 0x0)

This shows that this low-level code is very sensitive to slight changes and needs ongoing maintenance. We keep improving the coverage of tested scenarios on the LLVM buildbot, and we enabled sanitizer tests on NetBSD 9.0/amd64; however, we could use more manpower in order to reach full Linux parity in the toolchain.
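The dlinfo(3) approach mentioned above replaces the hard-coded Obj_Entry offset with a public call. A minimal sketch, assuming a loader that supports RTLD_DI_LINKMAP (NetBSD, FreeBSD and glibc do); handle_to_link_map() is a hypothetical helper name, the _GNU_SOURCE define is only needed on glibc, and older glibc versions also need -ldl at link time.

```c
#define _GNU_SOURCE /* glibc only: exposes dlinfo() and RTLD_DI_LINKMAP */
#include <dlfcn.h>
#include <link.h>
#include <stddef.h>

/* Return the link_map for a dlopen() handle through the public dlinfo(3)
 * interface, instead of reading it at a hard-coded offset inside the
 * loader's private handle object. Returns NULL on failure. */
struct link_map *handle_to_link_map(void *handle)
{
    struct link_map *map = NULL;
    if (dlinfo(handle, RTLD_DI_LINKMAP, &map) != 0)
        return NULL;
    return map;
}
```

Because the layout knowledge stays inside the loader, the same code works regardless of how Obj_Entry is laid out on a given CPU architecture.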

Other changes

As my project in LLVM and ptrace(2) is slowly concluding, I'm trying to finalize the related tasks that were left behind.

I've finished researching why we couldn't use syscall restart on kevent(2) call in LLDB and improved the system documentation on it. I have also fixed small nits in the NetBSD wiki page on kevent(2).

I have updated the list of ELF defines for CPUs and OS ABIs in sys/exec_elf.h.

Plan for the next milestone

Port remaining ptrace(2) test scenarios from Linux, FreeBSD and OpenBSD to ATF and ensure that they are properly operational.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

http://netbsd.org/donations/#how-to-donate

Posted Monday evening, January 13th, 2020 Tags: blog

Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of the LLVM/Clang toolchain and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

In February 2019, I started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on reenabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types and fix compat32 issues, and fixing watchpoint and threading support.

Throughout December I've continued working on our build bot maintenance, in particular enabling compiler-rt tests. I've revived and finished my old patch for extended register state (XState) in core dumps. I've started working on bringing proper i386 support to LLDB.

Generic LLVM updates

Enabling and fixing more test suites

In my last report, I've indicated that I've started fixing test suite regressions and enabling additional test suites belonging to compiler-rt. So far I've been able to enable the following suites:

  • builtins library (alternative to libgcc)
  • profiling library
  • ASAN (address sanitizer, static and dynamic)
  • CFI (control flow integrity)
  • LSAN (leak sanitizer)
  • MSAN (memory sanitizer)
  • SafeStack (stack buffer overflow protection)
  • TSAN (thread sanitizer)
  • UBSAN (undefined behavior sanitizer)
  • UBSAN minimal
  • XRay (function call tracing)

In case someone's wondering how the different memory-related sanitizers differ, here's a short answer: ASAN covers major errors that can be detected with an approximately 2x slowdown (out-of-bounds accesses, use-after-free, double-free...), LSAN focuses on memory leaks and has almost no overhead, while MSAN detects uninitialized reads with a 3x slowdown.

The following test suites were skipped because of major known breakage, pending investigation:

  • generic interception tests (test runner can't find tests?)
  • fuzzer tests (many test failures, most likely as a side effect of LSAN integration)
  • clangd tests (many test failures)

The changes done to improve test suite status are:

Repeating the rationale for disabling ASLR/MPROTECT from my previous report: the sanitizers or tools in question do not work with the listed hardening features by design, and we explicitly make them fail. We are using paxctl to disable the relevant feature per-executable, and this makes it possible to run the relevant tests on systems where ASLR and MPROTECT are enabled globally.

This also included two builtins tests. In the case of clear_cache_test.c, this is a problem with the test itself, and I have already submitted better MPROTECT support for clear_cache_test. In the case of enable_execute_stack_test.c, it's a problem with the API itself, and I don't think it can be fixed without replacing it with something more suitable for NetBSD. However, it does not seem to be actually used by programs created by clang, so I do not think it's worth looking into at the moment.

Demise and return of LLD

In my last report, I mentioned that we had switched to using LLD as the linker for the second stage builds. Sadly, we then discovered that some of the new test failures were caused precisely by that.

As I've reported back in January 2019, NetBSD's dynamic loader does not support executables with more than two segments. The problem has not been fixed yet, and we were so far relying on explicitly disabling the additional read-only segment in LLD. However, upstream started splitting the RW segment on GNU RELRO, effectively restoring three segments (or up to four, without our previous hack).

This forced me to initially disable LLD and return to GNU ld. However, upstream recently suggested using -znorelro, which allowed us to go back down to two segments and reenable LLD.
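Since the loader limitation is about the number of segments, it helps to be able to check what a given link produced. The sketch below counts the PT_LOAD program headers of the running executable with dl_iterate_phdr(3), which NetBSD, FreeBSD and glibc provide; it assumes the first object reported is the main program, as is the case on common implementations. The function names are illustrative. For a binary on disk, readelf -l shows the same program headers without any code.

```c
#define _GNU_SOURCE /* glibc only: exposes dl_iterate_phdr() */
#include <link.h>
#include <stddef.h>

/* Callback: count the PT_LOAD headers of the first object reported
 * (assumed to be the main executable), then stop the iteration. */
static int count_load_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    int *count = data;
    for (int i = 0; i < info->dlpi_phnum; i++)
        if (info->dlpi_phdr[i].p_type == PT_LOAD)
            (*count)++;
    return 1; /* a non-zero return ends dl_iterate_phdr() early */
}

/* Number of loadable segments in the running executable. */
int main_exe_load_segments(void)
{
    int count = 0;
    dl_iterate_phdr(count_load_cb, &count);
    return count;
}
```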

libc++ system feature list update

Kamil has noticed that our feature list for libc++ is outdated. We have missed indicating that NetBSD supports aligned_alloc(), timespec_get() and C11 features. I have updated the feature list.

Current build bot status

The stage 1 build currently fails as upstream has broken libc++ builds with GCC. Hopefully, this will be fixed after the weekend.

Before that, we had a bunch of failing tests: 7 related to profiling and 4 related to XRay. Plus, the flaky LLDB tests mentioned earlier.

Core dump XState support finally in

I was working on including the full register set ('XState') in core dumps before my summer vacation. I've implemented the requested changes and finally pushed them. The patch set included four patches:

  1. Include XSTATE note in x86 core dump, including preliminary support for machine-dependent core dump notes.

  2. Fix alignment when reading core notes, fixing a bug in my tests for core dumps that were added earlier.

  3. Combine x86 register tests into unified test function simplifying the test suite a lot (by almost a half of the original lines).

  4. Add tests for reading registers from x86 core dumps covering both old and new notes.

NetBSD/i386 support for LLDB

As the next step in my LLDB work, I've started working on providing i386 support. This covers both native i386 systems, and 32-bit executable support on amd64. In total, the combined amd64/i386 support covers four scenarios:

  1. 64-bit kernel, 64-bit debugger, 64-bit executable (native 64-bit).

  2. 64-bit kernel, 64-bit debugger, 32-bit executable.

  3. 64-bit kernel, 32-bit debugger, 32-bit executable.

  4. 32-bit kernel, 32-bit debugger, 32-bit executable (native 32-bit).

Those cases are really different only from kernel's point-of-view. For scenarios 1. and 2. the debugger is using 64-bit ptrace API, while in cases 3. and 4. it is using 32-bit ptrace API. In case 2., the application runs via compat32 and the kernel fits its data into 64-bit ptrace API. In case 3., the debugger runs via compat32.

Technically, cases 1. and 2. are already covered by the amd64 code in LLDB. However, for user's convenience LLDB needs to be extended to recognize 32-bit processes on NetBSD and adjust the data obtained from ptrace to 32-bit executables. Cases 3. and 4. need to be covered via making the code build on i386.

Other LLDB plugins implement this via creating separate i386 and amd64 modules, then including 32-bit branch in amd64 that reuses parts of i386 code. I am following suit with that. My plan is to implement 32-bit process support for case 2. first, then port everything to i386.

So far I have implemented the code to recognize 32-bit processes and I have started implementing i386 register interface that is meant to map data from 64-bit ptrace register dumps. However, it does not seem to map registers correctly at the moment and I am still debugging the problem.
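Recognizing a 32-bit process largely comes down to the ELF class of its executable. The sketch below classifies an ELF header buffer by e_ident[EI_CLASS] using the standard <elf.h> constants (assumed to be available; on NetBSD they ultimately come from <sys/exec_elf.h>). elf_is_32bit() is a hypothetical helper and not LLDB's actual code.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 for ELFCLASS32, 0 for ELFCLASS64, -1 if not a valid ELF header. */
int elf_is_32bit(const unsigned char *hdr, size_t len)
{
    if (len < EI_NIDENT || memcmp(hdr, ELFMAG, SELFMAG) != 0)
        return -1;
    switch (hdr[EI_CLASS]) {
    case ELFCLASS32:
        return 1;
    case ELFCLASS64:
        return 0;
    default:
        return -1;
    }
}
```

In scenario 2 above, a 64-bit debugger would use such a check to decide whether the 64-bit ptrace register dumps have to be mapped onto the i386 register layout.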

Future plans

As mentioned above, I am currently working on providing support for debugging 32-bit executables on amd64. Afterwards, I am going to work on porting LLDB to run on i386.

I am also tentatively addressing compiler-rt test suite problems in order to reduce the number of build bot failures. I also need to look into remaining kernel problems regarding simultaneous delivery of signals and breakpoints or watchpoints.

Furthermore, I am planning to continue with the items from the original LLDB plan. Those are:

  1. Add support to backtrace through signal trampoline and extend the support to libexecinfo, unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).

  2. Add support for aarch64 target.

  3. Stabilize LLDB and address breaking tests from the test suite.

  4. Merge LLDB with the base system (under LLVM-style distribution).

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

https://netbsd.org/donations/#how-to-donate

Posted early Monday morning, January 13th, 2020 Tags: blog

Introduction

We successfully incorporated the Argon2 reference implementation into NetBSD/amd64 for our 2019 Google Summer of Code project. We introduced our project here and provided some hints on how to select parameters here. For our final report, we will provide an overview of what changes were made to complete the project.

Incorporating the Argon2 Reference Implementation

The Argon2 reference implementation, available here, is dual-licensed under the Creative Commons CC0 1.0 and the Apache License 2.0. To import the reference implementation into src/external, we chose the Apache 2.0 license for this project.

During phase 1, we focused on building the libargon2 library and integrating the functionality into the existing password management framework via libcrypt. To this end, we imported the reference implementation and created the "glue" to incorporate the changes into /usr/src/external/apache2. The reference implementation is found in:

m2$ ls /usr/src/external/apache2/argon2                                                                                    
Makefile dist     lib      usr.bin
The Argon2 reference implementation provides both a library and a binary. We build the libargon2 library to support libcrypt integration, and the argon2(1) binary to provide a userland command-line tool for evaluation. To build the code, we add MKARGON2 to bsd.own.mk
_MKVARS.yes= \
	...
        MKARGON2 \
	...
and add the following conditional build to /usr/src/external/apache2/Makefile
.if (defined(MKARGON2) && ${MKARGON2} != "no")
SUBDIR+= argon2
.endif
After a successful build and installation, we have the following new files and symlinks:
/usr/bin/argon2
/usr/lib/libargon2.a
/usr/lib/libargon2.so
/usr/lib/libargon2.so.1
/usr/lib/libargon2.so.1.0
To incorporate Argon2 into the password management framework of NetBSD, we focused on libcrypt. In /usr/src/lib/libcrypt/Makefile, we first check for MKARGON2
.if (defined(MKARGON2) && ${MKARGON2} != "no")
HAVE_ARGON2=1
.endif
If HAVE_ARGON2 is defined and enabled, we append the following to the build flags
.if defined(HAVE_ARGON2)
SRCS+=          crypt-argon2.c
CFLAGS+=        -DHAVE_ARGON2 -I../../external/apache2/argon2/dist/phc-winner-argon2/include/
LDADD+=         -largon2 
.endif
As hinted above, our most significant addition to libcrypt is the file crypt-argon2.c. This file pulls in the functionality of libargon2 into libcrypt. Changes were also made to pw_gensalt.c to allow for parameter parsing and salt generation.

Having completed the backend support, we pull Argon2 into userland tools, such as pwhash(1), in the same way as above

.if ( defined(MKARGON2) && ${MKARGON2} != "no" )
CPPFLAGS+=      -DHAVE_ARGON2
.endif
Once built, we can specify Argon2 using the '-A' command-line argument to pwhash(1), followed by the Argon2 variant name, and any of the parameterized values specified in argon2(1). See our first blog post for more details. As an example, to generate an argon2id encoding of the password password using default parameters, we can use the following
m2# pwhash -A argon2id password
$argon2id$v=19$m=4096,t=3,p=1$.SJJCiU575MDnA8s$+pjT4JsF2eLNQuLPEyhRA5LCFGQWAKsksIPl5ewTWNY
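The encoded string above has the layout $<variant>$v=<version>$m=<memory in KiB>,t=<iterations>,p=<parallelism>$<salt>$<hash>, where v=19 is 0x13, i.e. Argon2 version 1.3. A minimal parser sketch, for example for checking test vectors; struct argon2_fields and parse_argon2_encoding() are hypothetical names, not part of libargon2 or libcrypt, and the fixed buffer sizes are arbitrary.

```c
#include <stdio.h>
#include <string.h>

struct argon2_fields {
    char variant[16];      /* e.g. "argon2id" */
    unsigned version;      /* 19 == 0x13, Argon2 v1.3 */
    unsigned m_kib;        /* memory cost in KiB */
    unsigned t_cost;       /* iterations */
    unsigned parallelism;  /* lanes */
    char salt[64];
    char hash[128];
};

/* Split "$variant$v=..$m=..,t=..,p=..$salt$hash" into its fields.
 * Returns 0 on success, -1 if the string does not have this shape. */
int parse_argon2_encoding(const char *enc, struct argon2_fields *f)
{
    int n = sscanf(enc, "$%15[^$]$v=%u$m=%u,t=%u,p=%u$%63[^$]$%127s",
                   f->variant, &f->version, &f->m_kib, &f->t_cost,
                   &f->parallelism, f->salt, f->hash);
    return n == 7 ? 0 : -1;
}
```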
To simplify Argon2 password management, we can utilize passwd.conf(5) to apply Argon2 to a specified user or all users. The same parameters are accepted as for argon2(1). For example, to specify argon2i with non-default parameters for user 'testuser', you can use the following in your passwd.conf
m1# grep -A1 testuser /etc/passwd.conf 
testuser:
        localcipher = argon2i,t=6,m=4096,p=1
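The localcipher value is a variant name followed by optional comma-separated key=value parameters. The sketch below shows one way such a string could be parsed; it is not the pw_gensalt.c code, struct argon2_spec and parse_localcipher() are hypothetical, and a value of 0 is used to mean "not specified, use the default".

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

struct argon2_spec {
    char variant[16];
    unsigned long t, m, p; /* 0 means "not given, use the default" */
};

/* Parse e.g. "argon2i,t=6,m=4096,p=1". Returns 0 on success, -1 on error. */
int parse_localcipher(const char *spec, struct argon2_spec *out)
{
    char buf[128];
    if (strlen(spec) >= sizeof buf)
        return -1;
    strcpy(buf, spec);
    memset(out, 0, sizeof *out);

    char *save = NULL;
    char *tok = strtok_r(buf, ",", &save);
    if (tok == NULL || strncmp(tok, "argon2", 6) != 0 ||
        strlen(tok) >= sizeof out->variant)
        return -1;
    strcpy(out->variant, tok);

    /* Each remaining token must look like "<letter>=<decimal>". */
    while ((tok = strtok_r(NULL, ",", &save)) != NULL) {
        if (tok[1] != '=')
            return -1;
        char *end;
        unsigned long v = strtoul(tok + 2, &end, 10);
        if (end == tok + 2 || *end != '\0')
            return -1;
        switch (tok[0]) {
        case 't': out->t = v; break;
        case 'm': out->m = v; break;
        case 'p': out->p = v; break;
        default: return -1;
        }
    }
    return 0;
}
```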
With the above configuration in place, we are able to support standard password management. For example
m1# passwd testuser
Changing password for testuser.
New Password:
Retype New Password:

m1# grep testuser /etc/master.passwd  
testuser:$argon2i$v=19$m=4096,t=6,p=1$PDd65qr6JU0Pfnpr$8YOMYcwINuKHoxIV8Q0FJHG+RP82xtmAuGep26brilU:1001:100::0:0::/home/testuser:/sbin/nologin

Testing

The argon2(1) binary allows us to easily validate parameters and encoding. This is most useful during performance testing, see here. With argon2(1), we can specify our parameterized values and evaluate both the resulting encoding and timing.
m2# echo -n password|argon2 somesalt -id -p 3 -m 8
Type:           Argon2id
Iterations:     3
Memory:         256 KiB
Parallelism:    3
Hash:           97f773f68715d27272490d3d2e74a2a9b06a5bca759b71eab7c02be8a453bfb9
Encoded:        $argon2id$v=19$m=256,t=3,p=3$c29tZXNhbHQ$l/dz9ocV0nJySQ09LnSiqbBqW8p1m3Hqt8Ar6KRTv7k
0.000 seconds
Verification ok
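Note the units: argon2(1) takes -m as a base-2 exponent, so -m 8 above requests 2^8 = 256 KiB, whereas the m= field in an encoded hash is the plain KiB value (the m=4096 produced by pwhash with default parameters corresponds to -m 12). A tiny conversion sketch; both helper names are hypothetical.

```c
#include <stdint.h>

/* argon2(1) -m N means 2^N KiB of memory; 0 signals an out-of-range N. */
uint32_t argon2_cli_m_to_kib(unsigned log2_kib)
{
    return log2_kib < 32 ? (uint32_t)1 << log2_kib : 0;
}

/* Inverse: the -m exponent for a power-of-two KiB value, or -1 otherwise. */
int argon2_kib_to_cli_m(uint32_t kib)
{
    if (kib == 0 || (kib & (kib - 1)) != 0)
        return -1;
    int n = 0;
    while (kib >>= 1)
        n++;
    return n;
}
```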
We provide one approach to evaluating Argon2 parameter tuning in our second post. In addition to manual testing, we also provide some ATF tests for pwhash, covering both hashing and verification. These tests focus on encoding correctness, matching known encodings against test results during execution.
/usr/src/tests/usr.bin/argon2

tp: t_argon2_v10_hash
tp: t_argon2_v10_verify
tp: t_argon2_v13_hash
tp: t_argon2_v13_verify


cd /usr/src/tests/usr.bin/argon2
atf-run

info: atf.version, Automated Testing Framework 0.20 (atf-0.20)
info: tests.root, /usr/src/tests/usr.bin/argon2

..

tc-so:Executing command [ /bin/sh -c echo -n password | \
argon2 somesalt -v 13 -t 2 -m 8 -p 1 -r ]
tc-end: 1567497383.571791, argon2_v13_t2_m8_p1, passed

...

Conclusion

We have successfully integrated Argon2 into NetBSD using the native build framework. We have extended existing functionality to support local password management using Argon2 encoding. We are able to tune Argon2 so that we can achieve reasonable performance on NetBSD. In this final post, we summarize the work done to incorporate the reference implementation into NetBSD and how to use it. We hope you can use the work completed during this project. Thank you for the opportunity to participate in the Google Summer of Code 2019 and the NetBSD project!

Posted Sunday afternoon, January 12th, 2020 Tags: blog

Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of the LLVM/Clang toolchain and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

In February, I started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on reenabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types and fix compat32 issues, and fixing watchpoint support. In October 2019, I finished my work on threading support (pending pushes) and fought issues related to the upgrade to NetBSD 9.

November was focused on finally pushing the aforementioned patches and major buildbot changes. Notably, I was working on extending the test runs to compiler-rt which required revisiting past driver issues, as well as resolving new ones. More details on this below.

LLDB changes

Test updates, minor fixes

The previous month left us with a few regressions caused by the kernel upgrade. I fixed those I could figure out reasonably fast; for the remaining ones, Kamil suggested that I mark them XFAIL for now and revisit them later while addressing broken tests. This is what I did.

While implementing additional tests in the threading patches, I've discovered that the subset of LLDB tests dedicated to testing lldb-server behavior was disabled on NetBSD. I've reenabled lldb-server tests and marked failing tests appropriately.

After enabling and fixing those tests, I've implemented missing support in the NetBSD plugin for getting thread name.

I've also switched our process plugin to use the newer PT_STOP request over calling kill(). The main advantage of PT_STOP is that it reliably notifies about SIGSTOP via wait() even if the process is stopped already.

I've been able to reenable the EOF detection test that was previously disabled due to bugs in old versions of the NetBSD 8 kernel.

Threading support pushed

After satisfying the last upstream requests, I was able to merge the three threading support patches:

  1. basic threading support,

  2. watchpoint support in threaded programs,

  3. concurrent watchpoint fixes.

This fixed 43 tests. It also triggered some flaky tests and a known regression, which I'm planning to address as part of the final bug squashing.

Build bot redesign

Recap of the problems

The tests of clang runtime components (compiler-rt, openmp) are performed using freshly built clang. This version of clang attempts to build and link C++ programs with libc++. However, our clang driver naturally requires system installation of libc++ — after all, we normally don't want the driver to include temporary build paths for regular executables! For this reason, building against fresh libc++ in build tree requires appropriate -cxx-isystem, -L and -Wl,-rpath flags.

So far, we managed to resolve this via using existing mechanisms to add additional flags to the test compiler calls. However, the existing solutions do not seem to suffice for compiler-rt. While technically I could work on adding more support code for that, I've decided it's better to look for a more general and permanent solution.

Two-stage builds

As part of the solution, I've proposed to switch our build bot to a two-stage build model. That is, firstly we're using the system GCC version to build a minimal functioning clang. Then, we're using this newly-built clang to build the whole LLVM suite, including another copy of clang.

The main advantage of this model is that we're verifying whether clang is capable of building a working copy of itself. Additionally, it insulates us against problems with host GCC. For example, we've experienced issues with GCC 8 and the default -O3. On the negative side, it increases build time significantly, especially since the second stage needs to be rebuilt from scratch every time.

A common practice in the compiler world is to actually do three stages. In this case, it would mean building a minimal clang with the host compiler, then the second stage with the first-stage clang, then the third stage with the second-stage clang. This would have the additional benefit of verifying that clang is capable of building a compiler that's fully capable of building itself. However, this seems to offer little actual gain for us while increasing the build time even more.

Compiler wrappers

Another interesting side effect of using the two-stage build model is that it provides an opportunity to inject wrappers over the clang and clang++ built in the first stage. Those wrappers allow us to add the necessary -I, -L and -Wl,-rpath arguments without having to patch the driver for this special case.
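The wrappers themselves are not shown here. As a rough illustration, here is a minimal C sketch of the argument-rewriting step such a wrapper might perform: the real first-stage compiler goes first, then the injected flags, then the user's own arguments. All paths (REAL_CLANGXX and the /path/to/... directories) and the wrap_argv() helper are hypothetical placeholders, not the build bot's actual layout.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical paths: placeholders for the stage 1 compiler and build tree. */
static const char *const REAL_CLANGXX = "/path/to/stage1/bin/clang++.real";
static const char *const EXTRA_FLAGS[] = {
    "-cxx-isystem", "/path/to/build/include/c++/v1",
    "-L/path/to/build/lib",
    "-Wl,-rpath,/path/to/build/lib",
};
#define N_EXTRA (sizeof(EXTRA_FLAGS) / sizeof(EXTRA_FLAGS[0]))

/* Build a NULL-terminated argv consisting of the real compiler, the
 * injected flags, and then the caller's original arguments argv[1..].
 * Returns NULL on allocation failure; the caller frees the array. */
char **wrap_argv(int argc, char *argv[]) {
    assert(argc >= 1 && argv != NULL);
    size_t n = 1 + N_EXTRA + (size_t)(argc - 1) + 1;
    char **out = malloc(n * sizeof *out);
    if (out == NULL)
        return NULL;

    size_t i = 0;
    out[i++] = (char *)REAL_CLANGXX;
    for (size_t j = 0; j < N_EXTRA; j++)
        out[i++] = (char *)EXTRA_FLAGS[j];
    for (int j = 1; j < argc; j++)
        out[i++] = argv[j];
    out[i] = NULL;
    return out;
}
```

A wrapper's main() would then hand the result to execv(3), e.g. execv(out[0], out), so that the real compiler replaces the wrapper process. Placing the injected flags before the user's arguments is simply the choice made for this sketch.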

Furthermore, I've used this opportunity to experimentally build LLD in the first stage, and to use it instead of GNU ld for the second stage. The LLVM linker has a significantly smaller memory footprint and therefore allows us to improve build efficiency. Sadly, proper LLD support for NetBSD still depends on patches that are waiting for upstream review.

Compiler-rt status and tests

The builds of compiler-rt have been reenabled for the build bot. I am planning to start enabling individual test groups (e.g. builtins, ASAN, MSAN, etc.) as I get them to work. However, there are still other problems to be resolved before that happens.

Firstly, there are new test regressions. Some of them seem to be specifically related to build layout changes, or to the use of LLD as the linker. I am currently investigating them.

Secondly, compiler-rt tests aim to test all supported multilib targets by default. We are currently preparing to enable compat32 in the kernel on the host running the build bot, and therefore achieve proper multilib support for running them.

Thirdly, ASAN, MSAN and TSAN are incompatible with ASLR (address space layout randomization), which is enabled by default on NetBSD. Furthermore, XRay is incompatible with the W^X restriction.

Making tests work with PaX features

Previously, we've addressed the ASLR incompatibility by adding an explicit check for it and bailing out if it's enabled. However, while this more or less resolves the problem for regular users, it means that the relevant tests can't be run on hosts with ASLR enabled.

Kamil suggested that we should use paxctl to disable ASLR per-executable here. This has the obvious advantage that it enables the tests to work on all hosts. However, it requires injecting the paxctl invocation between the build and run steps of the relevant tests.

The ‘obvious’ solution to this problem would be to add a kind of %paxctl_aslr substitution that evaluates to a paxctl call on NetBSD, and to : (no-op) on other systems. However, this would require updating all the relevant tests and making sure that new tests keep including the invocation.

Instead, I've noticed that the %run substitution is already using various kinds of wrappers for other targets, e.g. to run tests via an emulator. I went for a more agreeable solution of substituting %run in appropriate test suites with a tiny wrapper calling paxctl before executing the test.

Clang/LLD dependent libraries feature

Introduction to the feature

Enabling the two-stage builds had another side effect. Since the stage 2 build is done via clang+LLD, the newly added dependent libraries feature got enabled and broke our build.

Dependent libraries are a feature permitting source files to specify additional libraries that are afterwards injected into linker's invocation. This is done via a #pragma originally used by MSVC. Consider the following example:

#include <stdio.h>
#include <math.h>
#pragma comment(lib, "m")

int main() {
    printf("%f\n", pow(2, 4.3));
    return 0;
}

When the source file is compiled using Clang on an ELF target, the lib comments are converted into a .deplibs section in the object file:

$ llvm-readobj -a --section-data test.o
[...]
  Section {
    Index: 6
    Name: .deplibs (25)
    Type: SHT_LLVM_DEPENDENT_LIBRARIES (0x6FFF4C04)
    Flags [ (0x30)
      SHF_MERGE (0x10)
      SHF_STRINGS (0x20)
    ]
    Address: 0x0
    Offset: 0x94
    Size: 2
    Link: 0
    Info: 0
    AddressAlignment: 1
    EntrySize: 1
    SectionData (
      0000: 6D00                                 |m.|
    )
  }
[...]

When the objects are linked into a final executable using LLD, it collects the library names from all .deplibs sections and links against the specified libraries.
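To make the section format concrete, here is a small C sketch that walks raw .deplibs contents like the SectionData in the dump above (6D00, i.e. "m" followed by a NUL byte) and prints the -l option each name roughly corresponds to. It assumes the section bytes are already in memory; list_deplibs() is a hypothetical helper, not how LLD itself is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Walk the raw contents of a .deplibs section: a sequence of
 * NUL-terminated library names (the section is SHF_STRINGS). Print each
 * name as an -l option and return how many names were found. */
size_t list_deplibs(const char *data, size_t size) {
    size_t count = 0;
    size_t off = 0;

    assert(data != NULL || size == 0);
    while (off < size) {
        const char *name = data + off;
        const char *nul = memchr(name, '\0', size - off);
        if (nul == NULL)
            break;              /* unterminated trailing bytes: ignore them */
        size_t len = (size_t)(nul - name);
        if (len > 0) {          /* skip empty strings */
            printf("-l%s\n", name);
            count++;
        }
        off += len + 1;         /* step past the terminating NUL */
    }
    return count;
}
```

For the test.o above, the two bytes 0x6D 0x00 yield a single name, printed as -lm.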

The example program pasted above would have to be built on systems requiring explicit -lm (e.g. Linux) via:

$(CC) ... test.c -lm

However, when using Clang+LLD, it is sufficient to call:

clang -fuse-ld=lld ... test.c

and the library is included automatically. Of course, this normally makes little sense because you have to maintain compatibility with other compilers and linkers, as well as old versions of Clang and LLD.

Use of LLVM to approach static library dependency problem

LLVM started using the deplibs feature internally in D62090 in order to specify linkage between runtimes and their dependent libraries. Apparently, the goal was to provide an in-house solution to the static library dependency problem.

The problem discussed is that static libraries on Unix-derived platforms are primitive archives containing object files. Unlike shared libraries, they do not contain lists of other libraries they depend on. As a result, when linking against a static library, the user needs to explicitly pass all the dependent libraries to the linker invocation.

Over the years, a number of workarounds were proposed to relieve the user (or build system) from having to know the exact dependencies of the static libraries used. A few worth noting include:

  • libtool archives (.la) used by libtool as generic wrappers over shared and static libraries,

  • library-specific *-config programs and pkg-config files, providing options for build systems to utilize,

  • GNU ld scripts that can be used in place of libraries to alter linker's behavior.

The first two solutions work at the build system level, and therefore are portable to different compilers and linkers. The third one requires linker support, but it has been used successfully to some degree thanks to the wide deployment of GNU binutils, as well as support in other linkers (e.g. LLD).

Dependent libraries provide yet another attempt to solve the same problem. Unlike the listed approaches, it is practically transparent to the static library format, at the cost of requiring both compiler and linker support. However, since the runtimes are normally supposed to be used by Clang itself, at least the compiler requirement can normally be assumed to be satisfied.

Why did it break NetBSD?

After all the lengthy introduction, let's get to the point. As a result of my changes, the second stage is now built using Clang/LLD. However, it seems that the original change making use of deplibs in runtimes was tested only on Linux — and it caused failures for us since it implicitly appended libraries not present on NetBSD.

Over time, users of a few other systems have added various #ifdefs in order to exclude Linux-specific libraries from their systems. However, this solution is hardly optimal. It requires us to maintain two disjoint sets of rules for adding each library — one in CMake for linking of shared libraries, and another one in the source files for emitting dependent libraries.

Since dependent library pragmas are present only in source files and not in headers, I went for a different approach. Instead of using a second set of rules to decide which libraries to link, I've exported the results of CMake checks as -D flags, and made the dependent library pragmas conditional on those check results.

Firstly, I've fixed deplibs in libunwind in order to fix builds on NetBSD. Afterwards, per upstream's request I've extended the deplibs fix to libc++ and libc++abi.
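A minimal sketch of the conditional pragma idea, assuming hypothetical macro names (EXAMPLE_HAS_COMMENT_LIB_PRAGMA, EXAMPLE_LINK_PTHREAD_LIB) rather than the ones actually used in libunwind, libc++ or libc++abi. The library is requested only when the build passes the matching -D flags, so the same CMake check result drives both the shared library link line and the dependent library pragma.

```c
#include <assert.h>

/* Hypothetical macro names: the build system would pass each as a -D flag
 * only when the corresponding CMake check succeeded. */
#if defined(__clang__) && defined(__ELF__) && \
    defined(EXAMPLE_HAS_COMMENT_LIB_PRAGMA) && \
    defined(EXAMPLE_LINK_PTHREAD_LIB)
#  pragma comment(lib, "pthread")
#  define EXAMPLE_REQUESTS_PTHREAD 1
#else
#  define EXAMPLE_REQUESTS_PTHREAD 0
#endif

/* Report whether this object asks the linker (via .deplibs) for libpthread. */
int requests_pthread_deplib(void) {
    return EXAMPLE_REQUESTS_PTHREAD;
}
```

Compiled without those -D flags, no .deplibs entry is emitted. With clang on an ELF target and both flags defined, "pthread" should end up in the object's .deplibs section.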

Future plans

I am currently still working on fixing regressions after the switch to two-stage build. As things develop, I am also planning to enable further test suites there.

Furthermore, I am planning to continue with the items from the original LLDB plan. Those are:

  1. Add support for backtracing through the signal trampoline and extend the support to libexecinfo and unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).

  2. Add support for i386 and aarch64 targets.

  3. Stabilize LLDB and address breaking tests from the test suite.

  4. Merge LLDB with the base system (under LLVM-style distribution).

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

https://netbsd.org/donations/#how-to-donate

Posted at lunch time on Thursday, December 12th, 2019 Tags: blog

Since the start of the release process four months ago a lot of improvements went into the branch - more than 500 pullups were processed!

This includes usbnet (a common framework for usb ethernet drivers), aarch64 stability enhancements and lots of new hardware support, installer/sysinst fixes and changes to the NVMM (hardware virtualization) interface.

We hope this will lead to the best NetBSD release ever (only to be topped by NetBSD 10 next year).

Here are a few highlights of the new release:

You can download binaries of NetBSD 9.0_RC1 from our Fastly-provided CDN.

For more details refer to the official release announcement.

Please help us out by testing 9.0_RC1. We love any and all feedback. Report problems through the usual channels (submit a PR or write to the appropriate list). More general feedback is welcome, please mail releng. Your input will help us put the finishing touches on what promises to be a great release!

Enjoy!

Martin

Posted Monday afternoon, December 2nd, 2019 Tags: blog
This report was written by Maciej Grochowski as a part of developing the AFL+KCOV project.

This report is a continuation of my previous work on Fuzzing Filesystems via AFL. You can find my previous posts describing the fuzzing (part1, part2), as well as my EuroBSDcon presentation.
In this part, I won't talk much about fuzzing itself; instead, I want to describe the process of finding the root causes of filesystem issues and my recent work on improving this process.
This story begins with a mount issue that I found during my very first run of AFL, which I presented during my talk at EuroBSDcon in Lillehammer.

Invisible Mount point

The first error that I saw on my setup, after a couple of seconds of running AFL, was: afl-fuzz: /dev/vnd0: opendisk: Device busy
I was not sure what exactly the problem was, and thought that the mount wrapper might be causing it.
However, after a long troubleshooting session, I realized that this might be the first issue I had found.
To give the reader a better understanding of the problem without digging too deeply into the fuzzer setup or the mount process, let's assume that we have a broken filesystem image exposed as a block device, visible as /dev/wd1a.

The device can easily be mounted on the mount point mnt1; however, when we try to unmount it, we get an error: error: ls: /mnt1: No such file or directory. Using the raw unmount(2) system call ends up with a similar error.

However, the mount command clearly shows that the mount point exists:

# mount
/dev/wd0a on / type ffs (local)
...
tmpfs on /var/shm type tmpfs (local)
/dev/vnd0 on /mnt1 type ffs (local)

Yet every lstat(2)-based command tries to convince us that no such directory exists, even though it shows up in the listing of /:

# ls / | grep mnt
mnt
mnt1

# ls -alh /mnt1
ls: /mnt1: No such file or directory
# stat /mnt1
stat: /mnt1: lstat: No such file or directory

To understand what is happening, we need to dig a little deeper than the standard shell tools allow.
First of all, mnt1 is a directory created on the root partition of a local filesystem, so getdents(2) or dirent(3) should show it as an entry in the on-disk directory structure.
The raw getdents syscall is a great tool for checking directory contents because it reads the data from the directory structure on disk.

# ./getdents  /
|inode_nr|rec_len|file_type|name_len(name)|
#:   2,      16,    IFDIR,       1 (.)
#:   2,      16,    IFDIR,       2 (..)
#:   5,      24,    IFREG,       6 (.cshrc)
#:   6,      24,    IFREG,       8 (.profile)
#:   7,      24,    IFREG,       8 (boot.cfg)
#: 3574272,  24,    IFDIR,       3 (etc)
...
#: 3872128,  24,    IFDIR,       3 (mnt)
#: 5315584,  24,    IFDIR,       4 (mnt1)

getdents confirms that we have mnt1 as a directory in the root of our system filesystem.
But we cannot execute lstat, unmount or any other system call that requires a path to this file.
A quick look at the prototypes of these system calls shows what they have in common:

int unmount(const char *dir, int flags);
int stat(const char *path, struct stat *sb);
int lstat(const char *path, struct stat *sb);
int open(const char *path, int flags, ...);

All of these functions take a path to the file as an argument, which, as we know, ends up in a VFS lookup.
What about something that uses a file descriptor? Can we even obtain one?
Running open(2) on the path also fails, returning EACCES.
It looks like we will not be able to understand the issue without digging into the VFS lookup.
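The ./getdents tool used above is not included in the post. A rough, portable approximation of the check can be written with readdir(3), which is typically built on top of getdents(2): list a directory and report every entry that a path-based lstat(2) then fails to resolve. The helper names lstat_errno() and report_unresolvable() are hypothetical, and the 1024-byte path buffer is an arbitrary limit chosen for the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Return 0 if lstat(2) succeeds on path, otherwise the errno it set. */
int lstat_errno(const char *path) {
    struct stat sb;

    assert(path != NULL);
    if (lstat(path, &sb) == 0)
        return 0;
    return errno;
}

/* List dir with readdir(3) and report entries that the directory contains
 * but that a path lookup via lstat(2) cannot resolve. Returns the number of
 * such entries, or -1 if dir cannot be opened. */
int report_unresolvable(const char *dir) {
    char path[1024];
    struct dirent *de;
    int bad = 0;
    DIR *d = opendir(dir);

    if (d == NULL)
        return -1;
    while ((de = readdir(d)) != NULL) {
        int n = snprintf(path, sizeof path, "%s/%s", dir, de->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;           /* name too long for the buffer: skip it */
        int err = lstat_errno(path);
        if (err != 0) {
            printf("%s: listed, but lstat fails: %s\n", path, strerror(err));
            bad++;
        }
    }
    closedir(d);
    return bad;
}
```

On a system in the state described above, one would expect report_unresolvable("/") to flag mnt1 with "No such file or directory"; on a healthy system it should report nothing.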

Get Filesystem Root

After some debugging and a code walkthrough, I found the place that caused the error.
During name resolution, VFS needs to detect embedded mount points and switch to the mounted filesystem.
After the new filesystem is found, VFS_ROOT is issued on that particular mount point.
For FFS, VFS_ROOT translates to ufs_root(), which asks the vcache for a fixed inode number: that of the root inode, which is 2 for UFS.

#define UFS_ROOTINO     ((ino_t)2)  

Below is the relevant part of ufs_root() from ufs/ufs/ufs_vfsops.c:

int
ufs_root(struct mount *mp, struct vnode **vpp)
{
...
        if ((error = VFS_VGET(mp, (ino_t)UFS_ROOTINO, &nvp)) != 0)
               return (error);

Using the debugger, I confirmed that no entry for inode number 2 exists in the vcache.
As a next step, I wanted to check the root inode on the given filesystem image.
Filesystem debuggers are good tools for such checks. NetBSD comes with fsdb(8), a general-purpose filesystem debugger.
However, fsdb links against fsck_ffs by default, which ties it to FFS.

Filesystem Debugger to the rescue!

A filesystem debugger is a tool designed to browse on-disk structures and the values of particular entries. It helps in understanding filesystem issues by showing the exact values that the system reads from the disk. Unfortunately, the current fsdb_ffs is a bit limited in the amount of information that it exposes.
Here is example output from trying to browse the damaged root inode on the corrupted FS:

# fsdb -dnF -f ./filesystem.out

** ./filesystem.out (NO WRITE)
superblock mismatches
...
BAD SUPER BLOCK: VALUES IN SUPER BLOCK DISAGREE WITH THOSE IN FIRST ALTERNATE                                     
clean = 0
isappleufs = 0, dirblksiz = 512
Editing file system `./filesystem.out'
Last Mounted on /mnt
current inode 2: unallocated inode

fsdb (inum: 2)> print
command `print
'
current inode 2: unallocated inode

FSDB Plugin: Print Formatted

Fortunately, fsdb_ffs exposes all the interfaces needed to access this data with little effort.
I implemented a simple plugin that allows browsing all values inside inodes, the superblock and cylinder groups on FFS. There are still a couple of TODOs to finish, but the current version already allows us to review inodes.

fsdb (inum: 2)> pf inode number=2 format=ufs1
command `pf inode number=2 format=ufs1
'
Disk format ufs1 inode: 2 block: 512
 ---------------------------- 
di_mode: 0x0                    di_nlink: 0x0
di_size: 0x0                    di_atime: 0x0
di_atimensec: 0x0               di_mtime: 0x0
di_mtimensec: 0x0               di_ctime: 0x0
di_ctimensec: 0x0               di_flags: 0x0
di_blocks: 0x0                  di_gen: 0x6c3122e2
di_uid: 0x0                     di_gid: 0x0
di_modrev: 0x0
 --- inode.di_oldids ---

We can see that most of the root inode's fields in the filesystem image have been wiped out.
For comparison, the root inode of a freshly created FS, shown below, has the proper structure.
Comparing the two, we can quickly see that the fields di_mode, di_nlink, di_size and di_blocks differ, and any of them could be the root cause.

Disk format ufs1 inode: 2 block: 512
 ---------------------------- 
di_mode: 0x41ed                 di_nlink: 0x2
di_size: 0x200                  di_atime: 0x0
di_atimensec: 0x0               di_mtime: 0x0
di_mtimensec: 0x0               di_ctime: 0x0
di_ctimensec: 0x0               di_flags: 0x0
di_blocks: 0x1                  di_gen: 0x68881d2c
di_uid: 0x0                     di_gid: 0x0
di_modrev: 0x0
 --- inode.di_oldids ---

From FSDB and incore to source code

First, let's summarize what we already know:

  1. unmount fails because the namei operation fails on the corrupted FS
  2. Filesystem has corrupted root inode
  3. Corrupted root inode has fields: di_mode, di_nlink, di_size, di_blocks set to zero

Now we can find the place where inodes are loaded from disk; for FFS this is ffs_init_vnode(ump, vp, ino).
This function is called from ffs_loadvnode() when the VFS layer loads a vnode.
A quick walkthrough of ffs_loadvnode() reveals how the i_mode field is used:

        error = ffs_init_vnode(ump, vp, ino);
        if (error)
                return error;

        ip = VTOI(vp);
        if (ip->i_mode == 0) {
                ffs_deinit_vnode(ump, vp);

                return ENOENT;
        }

This seems to be the source of our problem. Whenever an inode is loaded from disk to obtain its vnode, the code checks that i_mode is non-zero.
In our case the root inode is wiped out, so the vnode is dropped and an error is returned.
Simply put, no inode with i_mode set to zero can be loaded, and the root inode (number 2) is no exception. As a result, the VFS_LOADVNODE operation always fails, the lookup fails with it, and name resolution returns ENOENT. To fix this issue, we need to validate the root inode at mount time. I implemented such a validation and tested it against the corrupted filesystem image.
With the validation in place, mount returns an error, which confirms that such a check helps.

Conclusions

This post is a continuation of the project "Fuzzing Filesystems with kcov and AFL".
I presented how fuzzed bugs, which do not always show up as system panics, can be analyzed, and what tools a programmer can use.
The investigation above described the very first bug that I found by fuzzing mount(2) with AFL+kcov.
During that root cause analysis, I realized the need for better tools for debugging filesystem-related issues.
For that reason, I added a small pf (print-formatted) command to fsdb(8), which allows walking through the on-disk structures. I reported the described bug, together with a proposed fix based on validating the root inode, on the tech-kern mailing list.

Future work

  1. Tools: I am still progressing with fuzzing the mount process; however, I am focusing not only on finding bugs but also on tools that can be used for debugging and regression testing. I am planning to add better support for browsing an inode's blocks to fsdb-pf, as well as write functionality that would make further testing and potential recovery easier.
  2. Fuzzing: In the next post, I will show a remote setup of AFL with an example of its usage.
  3. I got a suggestion to take a look at the FreeBSD UFS security checks on mount(2) done by McKusick. I think it is worth seeing what else is validated there that we could port to NetBSD FFS.
Posted Wednesday evening, November 27th, 2019 Tags: blog

Per the membership voting, we have seated the new Board of Directors of the NetBSD Foundation:

  • Taylor R. Campbell <riastadh@>
  • William J. Coldwell <billc@>
  • Michael van Elst <mlelstv@>
  • Thomas Klausner <wiz@>
  • Cherry G. Mathew <cherry@>
  • Pierre Pronchery <khorben@>
  • Leonardo Taccari <leot@>

We would like to thank Makoto Fujiwara <mef@> and Jeremy C. Reed <reed@> for their service on the Board of Directors during their term(s).

The new Board of Directors have voted in the executive officers for The NetBSD Foundation:

President: William J. Coldwell
Vice President: Pierre Pronchery
Secretary: Christos Zoulas
Assistant Secretary: Thomas Klausner
Treasurer: Christos Zoulas
Assistant Treasurer: Taylor R. Campbell

Thanks to everyone that voted and we look forward to a great 2020.

Posted late Wednesday evening, November 20th, 2019 Tags: blog

Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

In February, I started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on re-enabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types, fixing compat32 issues, and fixing watchpoint support. Then I started working on improving thread support, which is taking longer than expected. You can read more about that in my September 2019 report.

So far the number of issues uncovered while enabling proper threading support has stopped me from merging the work-in-progress patches. However, I've finally reached the point where I believe that the current work can be merged and the remaining problems can be resolved afterwards. More on that and other LLVM-related events happening during the last month in this report.

LLVM news and buildbot status update

LLVM switched to git

Probably the most important event to note is that the LLVM project has switched from Subversion to git, and moved their repositories to GitHub. While the original plan provided for maintaining the old repositories as read-only mirrors, as of today this still hasn't been implemented. For this reason, we were forced to quickly switch buildbot to the git monorepo.

The buildbot is operational now, and seems to be handling git correctly. However, it is connected to the staging server for the time being. Its URL changed to http://lab.llvm.org:8014/builders/netbsd-amd64 (i.e. the port changed from 8011 to 8014).

Monthly regression report

Now for the usual list of 'what they broke this time'.

LLDB has been given a new API for handling files, in particular for passing them to Python scripts. The change of API has caused some 'bad file descriptor' errors, e.g.:

ERROR: test_SBDebugger (TestDefaultConstructorForAPIObjects.APIDefaultConstructorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/decorators.py", line 343, in wrapper
    return func(self, *args, **kwargs)
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/python_api/default-constructor/TestDefaultConstructorForAPIObjects.py", line 133, in test_SBDebugger
    sb_debugger.fuzz_obj(obj)
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/python_api/default-constructor/sb_debugger.py", line 13, in fuzz_obj
    obj.SetInputFileHandle(None, True)
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 3890, in SetInputFileHandle
    self.SetInputFile(SBFile.Create(file, borrow=True))
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 5418, in Create
    return cls.MakeBorrowed(file)
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 5379, in MakeBorrowed
    return _lldb.SBFile_MakeBorrowed(BORROWED)
IOError: [Errno 9] Bad file descriptor
Config=x86_64-/data/motus/netbsd8/netbsd8/build/bin/clang-10
----------------------------------------------------------------------

I've been able to determine that the error was produced by a flush() method call invoked on a file descriptor referring to stdin. Accordingly, I've fixed the type conversion method not to flush read-only fds.

Afterwards, Lawrence D'Anna was able to find and fix another fflush() issue.

A newly added test revealed that platform process list -v command on NetBSD missed listing the process name. I've fixed it to provide Arg0 in process info.

Another new test failed due to our target not implementing ShellExpandArguments() API. Apparently the only target actually implementing it is Darwin, so I've just marked TestCustomShell XFAIL on all BSD targets.

LLDB upstream was forced to reintroduce readline module override that aims to prevent readline and libedit from being loaded into a single program simultaneously. This module failed to build on NetBSD. I've discovered that the original was meant to be built on Linux only, and since the problem still doesn't affect other platforms, I've made it Linux-only again.

libunwind build has been changed to link using the C compiler rather than C++. This caused some libc++ failures on NetBSD. The author has reverted the change for now, and is looking for a better way of resolving the problem.

Finally, I have disabled another OpenMP test that caused NetBSD to hang. While ideally I'd like to have the underlying kernel problem fixed, this is non-trivial and I prefer to focus on LLDB right now.

New LLD work

I've been asked to rebase my LLD patches for the new code. While doing it, I've finally committed the -z nognustack option patch from January.

In the meantime, Kamil's been working on finally resolving the long-standing impasse on LLD design. He is working on a new NetBSD-specific frontend to LLD that would satisfy our system-wide linker requirements without modifying the standard driver used by other platforms.

Upgrade to NetBSD 9 beta

Our recent work, especially the work on threading support has required a number of fixes in the NetBSD kernel. Those fixes were backported to NetBSD 9 branch but not to 8. The 8 kernel used by the buildbot was therefore suboptimal for testing new features. Furthermore, with the 9.0 release coming soon-ish, it became necessary to start actively testing it for regressions.

The buildbot has been upgraded to NetBSD 9 beta on 2019-11-06. Initially, the upgrade caused LLDB to crash on startup. I have not been able to pinpoint the exact issue yet. However, I've established that it happens with the -O3 optimization level only, and I've worked around it by switching the build to -O2. I am planning to look into the problem more once the buildbot is fully restored.

The upgrade to nb9 has caused 4 LLDB tests to start succeeding, and 6 to start failing. Namely:

********************
Unexpected Passing Tests (4):
    lldb-api :: commands/watchpoints/watchpoint_commands/condition/TestWatchpointConditionCmd.py
    lldb-api :: commands/watchpoints/watchpoint_commands/command/TestWatchpointCommandPython.py
    lldb-api :: lang/c/bitfields/TestBitfields.py
    lldb-api :: commands/watchpoints/watchpoint_commands/command/TestWatchpointCommandLLDB.py

********************
Failing Tests (6):
    lldb-shell :: Reproducer/Functionalities/TestExpressionEvaluation.test
    lldb-api :: commands/expression/call-restarts/TestCallThatRestarts.py
    lldb-api :: functionalities/signal/handle-segv/TestHandleSegv.py
    lldb-unit :: tools/lldb-server/tests/./LLDBServerTests/StandardStartupTest.TestStopReplyContainsThreadPcs
    lldb-api :: functionalities/inferior-crashing/TestInferiorCrashingStep.py
    lldb-api :: functionalities/signal/TestSendSignal.py

I am going to start investigating the new failures shortly.

Further LLDB threading work

Fixes to register support

Enabling thread support revealed a problem in register API introspection specific to NetBSD. The API responsible for passing registers in groups to Python was unable to name some of the groups on NetBSD, and the null names have caused the TestRegistersIterator to fail. Threading support made this particularly visible by replacing a regular test failure with a Python code error.

In order to resolve the problem, I had to describe all supported register sets in the NetBSD register context. The code was roughly based on the Linux equivalent, modified to match the register sets used by our ptrace() API. Interestingly, I also had to include MPX registers that are currently unimplemented, as otherwise LLDB implicitly puts them in an anonymous group.

While at it, I've also changed the register set numbering to match the more common ordering, in order to avoid issues in the future.

Finished basic thread support patch

I've finally completed and submitted the patch for NetBSD thread support. Besides fixing a few mistakes, I've implemented thread affinity support for all relevant SIGTRAP events (breakpoints, traces, hardware watchpoints) and removed incomplete hardware breakpoint stub that caused LLDB to crash.

In its current form, this patch combines three changes essential to correct support of threaded programs:

  1. It enables reporting of new and exited threads, and maintains the debugged thread list based on these reports.

  2. It modifies the signal (generic and SIGTRAP) handling functions to read the thread identifier and associate the event with the correct thread(s). Previously, all events were assigned to all threads.

  3. It updates the process resuming function to support controlling the state (running, single-stepping, stopped) of individual threads, and raising a signal either to the whole process or to a single thread. Previously, the code used only the requested action for the first thread and propagated it to all threads in the process.
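Item 2 boils down to routing each stop event by thread identifier. The C sketch below is a simplified model under stated assumptions: `struct thread_state`, `route_event()` and the stop-reason enum are hypothetical, and it assumes that an LWP id of 0 denotes a process-wide event while any other value names the single LWP that received it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread stop bookkeeping in a debugger. */
enum stop_reason { STOP_NONE, STOP_SIGNAL, STOP_BREAKPOINT, STOP_TRACE, STOP_WATCHPOINT };

struct thread_state {
    int lwpid;
    enum stop_reason reason;
    int signo;
};

/* Attribute an event to threads. Assumption: lwpid == 0 means the event is
 * process-wide; otherwise only the thread with that LWP id gets the reason,
 * and the others are marked as merely stopped along with the process.
 * Returns the number of threads the event was assigned to. */
static size_t route_event(struct thread_state *threads, size_t n, int lwpid,
                          enum stop_reason reason, int signo)
{
    assert(threads != NULL || n == 0);
    size_t updated = 0;
    for (size_t i = 0; i < n; i++) {
        if (lwpid == 0 || threads[i].lwpid == lwpid) {
            threads[i].reason = reason;
            threads[i].signo = signo;
            updated++;
        } else {
            threads[i].reason = STOP_NONE;
            threads[i].signo = 0;
        }
    }
    return updated;
}
```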

Proper watchpoint support in multi-threaded programs

I've submitted a separate patch to copy watchpoints to newly-created threads. This is necessary due to the design of Debug Register support in NetBSD. Quoting the ptrace(2) manpage:

  • debug registers are only per-LWP, not per-process globally
  • debug registers must not be inherited after (v)forking a process
  • debug registers must not be inherited after forking a thread
  • a debugger is responsible to set global watchpoints/breakpoints with the debug registers, to achieve this PTRACE_LWP_CREATE / PTRACE_LWP_EXIT event monitoring function is designed to be used

LLDB currently supports only per-process watchpoints. To fit this into the NetBSD model, we need to monitor new threads and copy watchpoints to them. Since LLDB does not currently keep explicit watchpoint information (it relies on querying debug registers), the proposed implementation copies dbregs verbatim from the currently selected thread (all existing threads should have the same dbregs).

Fixed support for concurrent watchpoint triggers

The final problem I've been investigating was a server crash with the new code when multiple watchpoints were triggered concurrently. My final patch aims to fix the handling of concurrent watchpoint events.

When a watchpoint is triggered, the kernel delivers SIGTRAP with TRAP_DBREG to the debugger. The debugger inspects the DR6 register of the specified thread in order to determine which watchpoint was triggered, and reports it. When multiple watchpoints are triggered simultaneously, the kernel reports them as a series of successive SIGTRAPs. Normally, that works just fine.
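Determining the triggered watchpoint amounts to testing the B0..B3 status bits (bits 0-3) of DR6. A minimal C sketch with a hypothetical helper name; since several reserved bits of DR6 read as 1 on real hardware, only the low four bits are examined. A return value of -1 is the situation in which the handler cannot attribute the SIGTRAP to any watchpoint.

```c
#include <assert.h>
#include <stdint.h>

/* Return the index (0-3) of the lowest debug register slot whose status
 * bit B0..B3 (bits 0-3 of DR6) is set, or -1 if none is set. */
static int triggered_slot(uint64_t dr6)
{
    for (int i = 0; i < 4; i++)
        if (dr6 & (UINT64_C(1) << i))
            return i;
    return -1;
}
```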

However, on x86 watchpoint triggers are reported before the instruction is executed. For this reason, LLDB temporarily disables the watchpoint, single-steps, and reenables it. The problem is that the GDB protocol doesn't control watchpoints per thread, so the operation disables and reenables the watchpoint on all threads. As a side effect, DR6 is cleared everywhere.

Now, if multiple watchpoints were triggered concurrently, DR6 is set on all relevant threads. However, after the SIGTRAP on the first one is handled, the disable/reenable (or more specifically, remove/re-add) wipes DR6 on all threads. The handler for the next SIGTRAP can't establish the correct watchpoint number, and starts looking for breakpoints. Since hardware breakpoints are not implemented, the relevant method returns an error and lldb-server eventually exits.
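The failure mode can be reproduced with a toy model. In this C sketch (all names hypothetical, address programming omitted), the old-style clear drops both the enable bit in DR7 and the status bit in DR6, and it is applied to every thread because the protocol has no per-thread watchpoint control. After the step-over for the first thread, the second thread's pending hit is gone.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread model: only DR6 (status) and DR7 (control). */
struct thread_dbregs {
    uint64_t dr6; /* B0..B3 hit bits in bits 0-3 */
    uint64_t dr7; /* local enable bit Ln at bit 2n */
};

/* Old-style clear, applied to all threads: disables the slot AND wipes
 * its hit status bit. */
static void old_clear_all(struct thread_dbregs *t, int n, int slot)
{
    assert(slot >= 0 && slot < 4);
    for (int k = 0; k < n; k++) {
        t[k].dr7 &= ~(UINT64_C(1) << (2 * slot));
        t[k].dr6 &= ~(UINT64_C(1) << slot);
    }
}

/* Old-style set, applied to all threads: re-enables the slot. */
static void old_set_all(struct thread_dbregs *t, int n, int slot)
{
    assert(slot >= 0 && slot < 4);
    for (int k = 0; k < n; k++)
        t[k].dr7 |= UINT64_C(1) << (2 * slot);
}
```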

There are two problems to be solved here. Firstly, lldb-server should not exit in these circumstances; this is already solved by the first patch mentioned above. Secondly, we need to be able to handle concurrent watchpoint hits independently of the clear/set packets. This is solved by this patch.

There are multiple approaches to this problem. I've chosen to remodel the clear/set watchpoint methods in order to prevent them from resetting DR6 if the same watchpoint is being restored, as the alternatives (such as pre-storing DR6 on the first SIGTRAP) have more corner cases to be concerned about.

The current design of these two methods assumes that the 'clear' method clears both the triggered state in DR6 and the control bits in DR7, while the 'set' method sets the address in DR0..3 and the control bits in DR7.

The new design limits the 'clear' method to disabling the watchpoint by clearing its enable bit in DR7. The remaining bits, as well as the trigger status and address, are preserved. The 'set' method uses them to determine whether a new watchpoint is being set or the previous one merely reenabled. In the latter case, it just updates DR7, preserving the previous trigger. In the former, it updates all registers and clears the trigger from DR6.

This solution effectively prevents the disable/reenable logic of LLDB from clearing concurrent watchpoint hits, and therefore makes it possible for the SIGTRAP handler to report them correctly. If the user manually replaces the watchpoint with another one, DR6 is cleared and LLDB does not associate the concurrent trigger with the watchpoint that no longer exists.
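The reworked clear/set logic can be sketched as follows. This is a simplified model, not the actual patch: `struct dbregs`, `wp_clear()` and `wp_set()` are hypothetical, the sketch uses the local-enable bit (Ln, bit 2n of DR7), and it places the R/W and LEN fields at bits 16+4n..19+4n as the x86 architecture defines them. A re-enable is detected by comparing the stored address and R/W/LEN against the requested ones.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread x86 debug register model. */
struct dbregs {
    uint64_t addr[4]; /* DR0..DR3 */
    uint64_t dr6;     /* status: B0..B3 in bits 0-3 */
    uint64_t dr7;     /* control: Ln at bit 2n, R/Wn and LENn at bits 16+4n.. */
};

static uint64_t enable_bit(int slot) { return UINT64_C(1) << (2 * slot); }

/* Mask of the 4-bit R/W + LEN field belonging to `slot`. */
static uint64_t rwlen_mask(int slot) { return UINT64_C(0xF) << (16 + 4 * slot); }

static uint64_t rwlen_bits(int slot, unsigned rw, unsigned len)
{
    return (uint64_t)((rw & 3u) | ((len & 3u) << 2)) << (16 + 4 * slot);
}

/* New-style clear: drop only the enable bit; address, R/W, LEN and the
 * DR6 hit status stay in place. */
static void wp_clear(struct dbregs *r, int slot)
{
    assert(slot >= 0 && slot < 4);
    r->dr7 &= ~enable_bit(slot);
}

/* New-style set: an identical address and R/W/LEN means a re-enable, so
 * only the enable bit is set and DR6 is kept. Otherwise a fresh watchpoint
 * is programmed and any stale hit for the slot is cleared. */
static void wp_set(struct dbregs *r, int slot, uint64_t addr, unsigned rw, unsigned len)
{
    assert(slot >= 0 && slot < 4);
    uint64_t want = rwlen_bits(slot, rw, len);
    int same = r->addr[slot] == addr && (r->dr7 & rwlen_mask(slot)) == want;
    if (!same) {
        r->addr[slot] = addr;
        r->dr7 = (r->dr7 & ~rwlen_mask(slot)) | want;
        r->dr6 &= ~(UINT64_C(1) << slot);
    }
    r->dr7 |= enable_bit(slot);
}
```

One consequence of this design choice: re-requesting an identical watchpoint is indistinguishable from a re-enable, so a pending hit survives it, while replacing the watchpoint with a different address or type clears the hit.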

Thread status summary

The current version of the patches fixes approximately 47 test failures, and causes approximately 4 new test failures and 2 hanging tests. There are around 7 new flaky tests, related to signals concurrent with breakpoints or watchpoints.

Future plans

The first immediate goal is to investigate and resolve the test suite regressions related to the NetBSD 9 upgrade. The second goal is to get the threading patches merged, and simultaneously work on resolving the remaining test failures and hangs.

When that's done, I'd like to finally move on with the remaining TODO items. Those are:

  1. Add support for backtracing through the signal trampoline and extend the support to libexecinfo and unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).

  2. Add support for i386 and aarch64 targets.

  3. Stabilize LLDB and address breaking tests from the test suite.

  4. Merge LLDB with the base system (under LLVM-style distribution).

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

https://netbsd.org/donations/#how-to-donate

Posted Saturday night, November 9th, 2019 Tags: blog
