Dec 2019
S M T W T F S
10 11
       

Archives

This page is a blog mirror of sorts. It pulls in articles from blog's feed and publishes them here (with a feed, too).

Since the start of the release process four months ago a lot of improvements went into the branch - more than 500 pullups were processed!

This includes usbnet (a common framework for usb ethernet drivers), aarch64 stability enhancements and lots of new hardware support, installer/sysinst fixes and changes to the NVMM (hardware virtualization) interface.

We hope this will lead to the best NetBSD release ever (only to be topped by NetBSD 10 next year).

Here are a few highlights of the new release:

You can download binaries of NetBSD 9.0_RC1 from our Fastly-provided CDN.

For more details refer to the official release announcement.

Please help us out by testing 9.0_RC1. We love any and all feedback. Report problems through the usual channels (submit a PR or write to the appropriate list). More general feedback is welcome, please mail releng. Your input will help us put the finishing touches on what promises to be a great release!

Enjoy!

Martin

Posted Monday afternoon, December 2nd, 2019 Tags: blog
This report was written by Maciej Grochowski as a part of developing the AFL+KCOV project.

This report is a continuation of my previous work on Fuzzing Filesystems via AFL. You can find previous posts where I described the fuzzing (part1, part2) or my EuroBSDcon presentation.
In this part, we won't talk too much about fuzzing itself but I want to describe the process of finding root causes of File system issues and my recent work trying to improve this process.
This story begins with a mount issue that I found during my very first run of the AFL, and I presented it during my talk on EuroBSDcon in Lillehammer.

Invisible Mount point

afl-fuzz: /dev/vnd0: opendisk: Device busy That was the first error that I saw on my setup after couple of seconds of AFL run.
I was not sure what exactly was the problem and thought that mount wrapper might cause a problem.
Although after a long troubleshooting session I realized that this might be my first found issue.
To give the reader a better understanding of the problem without digging too deeply into fuzzer setup or mount process.
Let's assume that we have some broken file system image exposed as a block device visible as a /dev/wd1a.

The device can be easily mounted on mount point mnt1, however when we try to unmount it we get an error: error: ls: /mnt1: No such file or directory, and if we try to use raw system call unmount(2) it also end up with the similar error.

However, we can see clearly that the mount point exists with the mount command:

# mount
/dev/wd0a on / type ffs(local)
...
tmpfson /var/shmtype tmpfs(local)
/dev/vnd0 on /mnt1 type ffs(local)

Thust any lstat(2) based command is trying to convince us that no such directory exists.

# ls / | grep mnt
mnt
mnt1

# ls -alh /mnt1
ls: /mnt1: No such file or directory
# stat /mnt1
stat: /mnt1: lstat: No such file or directory

To understand what is happening we need to dig a little bit deeper than with standard bash tools.
First of all mnt1 is a folder created on the root partition at a local filesystem so getdents(2) or dirent(3) should show it as a entry inside dentry structure on the disk.
Raw getdents syscall is great tool for checking directory content because it reads the data from the directory structure on disk.

# ./getdents  /
|inode_nr|rec_len|file_type|name_len(name)|
#:   2,      16,    IFDIR,       1 (.)
#:   2,      16,    IFDIR,       2 (..)
#:   5,      24,    IFREG,       6 (.cshrc)
#:   6,      24,    IFREG,       8 (.profile)
#:   7,      24,    IFREG,       8 (boot.cfg)
#: 3574272,  24,    IFDIR,       3 (etc)
...
#: 3872128,  24,    IFDIR,       3 (mnt)
#: 5315584,  24,    IFDIR,       4 (mnt1)

Getdentries confirms that we have mnt1 as a directory inside the root of our system fs.
But, we cannot execute lstat, unmount or any other system-call that require a path to this file.
A quick look on definitions of these system calls show their structure:

unmount(const char *dir, int flags);
stat(const char *path, struct stat *sb);
lstat(const char *path, struct stat *sb);
open(const char *path, int flags, ...);

All of these function take as an argument path to the file, which as we know will endup in vfs lookup.
How about something that uses filedescryptor? Can we even obtain it?
As we saw earlier running open(2) on path also returns EACCES.
Looks like without digging inside VFS lookup we will not be able to understand the issue.

Get Filesystem Root

After some debugging and code walk I found the place that caused error.
VFS during the name resolution needs to check and switch FS in case of embedded mount points.
After the new filesystem is found VFS_ROOT is issued on that particular mount point.
VFS_ROOT is translated in case of FFS to the ufs_root which calls vcache with fixed value equal to the inode number of root inode which is 2 for UFS.

#define UFS_ROOTINO     ((ino_t)2)  

Below listning with the code of ufs_root from ufs/ufs/ufs_vfsops.c.

int
ufs_root(struct mount *mp, struct vnode **vpp)
{
...
        if ((error = VFS_VGET(mp, (ino_t)UFS_ROOTINO, &nvp)) != 0)
               return (error);

By using the debugger, I was able to make sure that the entry with number 2 after hashing does not exist in the vcache.
As a next step, I wanted to check the Root inode on the given filesystem image.
Filesystem debuggers are good tools to do such checks. NetBSD comes with FSDB which is general-purpose filesystem debugger.
Nonetheless, by default FSDB links against fsck_ffs which makes it tied to the FFS.

Filesystem Debugger for the help!

Filesystem debugger is a tool designed to browse on-disk structure and values of particular entries. It helps in understanding the Filesystems issues by giving particular values that the system reads from the disk. Unfortunately, current fsdb_ffs is a bit limited in the amount of information that it exposes.
Example output of trying to browse damaged root inode on corrupted FS.

# fsdb -dnF -f ./filesystem.out

** ./filesystem.out (NO WRITE)
superblock mismatches
...
BAD SUPER BLOCK: VALUES IN SUPER BLOCK DISAGREE WITH THOSE IN FIRST ALTERNATE                                     
clean = 0
isappleufs = 0, dirblksiz = 512
Editing file system `./filesystem.out'
Last Mounted on /mnt
current inode 2: unallocated inode

fsdb (inum: 2)> print
command `print
'
current inode 2: unallocated inode

FSDB Plugin: Print Formatted

Fortunately, fsdb_ffs leaves all necessary interfaces to allows accessing this data with small effort.
I implemented a simple plugin that allows browsing all values inside: inodes, superblock and cylinder groups on FFS. There are still a couple of todos that have to be finished, but the current version allows us to review inodes.

fsdb (inum: 2)> pf inode number=2 format=ufs1
command `pf inode number=2 format=ufs1
'
Disk format ufs1inode 2 block: 512
 ---------------------------- 
di_mode: 0x0                    di_nlink: 0x0
di_size: 0x0                    di_atime: 0x0
di_atimensec: 0x0               di_mtime: 0x0
di_mtimensec: 0x0               di_ctime: 0x0
di_ctimensec: 0x0               di_flags: 0x0
di_blocks: 0x0                  di_gen: 0x6c3122e2
di_uid: 0x0                     di_gid: 0x0
di_modrev: 0x0
 --- inode.di_oldids ---

We can see that the Filesystem image got wiped out most of the root inode fields.
For comparison, if we will take a look at root inode from freshly created FS we will see the proper structure.
Based on that we can quickly realize that fields: di_mode, di_nlink, di_size, di_blocks are different and can be the root cause.

Disk format ufs1 inode: 2 block: 512
 ---------------------------- 
di_mode: 0x41ed                 di_nlink: 0x2
di_size: 0x200                  di_atime: 0x0
di_atimensec: 0x0               di_mtime: 0x0
di_mtimensec: 0x0               di_ctime: 0x0
di_ctimensec: 0x0               di_flags: 0x0
di_blocks: 0x1                  di_gen: 0x68881d2c
di_uid: 0x0                     di_gid: 0x0
di_modrev: 0x0
 --- inode.di_oldids ---

From FSDB and incore to source code

First we will summarize what we already know:

  1. unmount fails in namei operation failure due to the corrupted FS
  2. Filesystem has corrupted root inode
  3. Corrupted root inode has fields: di_mode, di_nlink, di_size, di_blocks set to zero

Now we can find a place where inodes are loaded from the disk, this function for FFS is ffs_init_vnode(ump, vp, ino);.
This function is called during the loading vnode in vfs layer inside ffs_loadvnode.
Quick walkthrough through ffs_loadvnode expose the usage of the field i_mode:

         error = ffs_init_vnode(ump, vp, ino);                                                                                                                                                                                     
         if (error)                                                                                                                                                                                                                
                return error;                                                                                                                                                                                                     
                                                                                                                                                                                                                                   
         ip = VTOI(vp);                                                                                                                                                                                                            
         if (ip->i_mode == 0) {                                                                                                                                                                                                    
                 ffs_deinit_vnode(ump, vp);                                                                                                                                                                                        
                                                                                                                                                                                                                                   
                 return ENOENT;                                                                                                                                                                                                    
         }   

This seems to be a source of our problem. Whenever we are loading inode from disk to obtain the vnode, we validate if i_mode is non zero.
In our case root inode is wiped out, what results that vnode is dropped and an error returned.
So simply we cannot load any inode with i_mode set to the zero, inode number 2 called root is no different here. Due to that the VFS_LOADVNODE operation always fails, so lookup does and name resolution will return ENOENT error. To fix this issue we need a root inode validation on mount step, I created such validation and tested against corrupted filesystem image.
The mount return error, which proved the observation that such validation would help.

Conclusions

The following post is a continuation of the project: "Fuzzing Filesystems with kcov and AFL".
I presented how fuzzed bugs, which do not always show up as system panics, can be analyzed, and what tools a programmer can use.
Above the investigation described the very first bug that I found by fuzzing mount(2) with Afl+kcov.
During that root cause analysis, I realized the need for better tools for debugging Filesystem related issues.
Because of that reason, I added small functionality pf (print-formatted) into the fsdb(8), to allow walking through the on-disk structures. The described bug was reported with proposed fix based on validation of the root inode on kern-tech mailing list.

Future work

  1. Tools: I am still progressing with the fuzzing of mount process, however, I do not only focus on the finding bugs but also on tools that can be used for debugging and also doing regression tests. I am planning to add better support for browsing blocks on inode into the fsdb-pf, as well as write functionality that would allow more testing and potential recovery easier.
  2. Fuzzing: In next post, I will show a remote setup of AFL with an example of usage.
  3. I got a suggestion to take a look at FreeBSD UFS security checks on mount(2) done by McKusick. I think is worth it to see what else is validated and we can port to NetBSD FFS.
Posted Wednesday evening, November 27th, 2019 Tags: blog

Per the membership voting, we have seated the new Board of Directors of the NetBSD Foundation:

  • Taylor R. Campbell <riastadh@>
  • William J. Coldwell <billc@>
  • Michael van Elst <mlelstv@>
  • Thomas Klausner <wiz@>
  • Cherry G. Mathew <cherry@>
  • Pierre Pronchery <khorben@>
  • Leonardo Taccari <leot@>

We would like to thank Makoto Fujiwara <mef@> and Jeremy C. Reed <reed@> for their service on the Board of Directors during their term(s).

The new Board of Directors have voted in the executive officers for The NetBSD Foundation:

President:William J. Coldwell
Vice President: Pierre Pronchery
Secretary: Christos Zoulas
Assistant Secretary: Thomas Klausner
Treasurer: Christos Zoulas
Assistant Treasurer: Taylor R. Campbell

Thanks to everyone that voted and we look forward to a great 2020.

Posted late Wednesday evening, November 20th, 2019 Tags: blog

Per the membership voting, we have seated the new Board of Directors of the NetBSD Foundation:

  • Taylor R. Campbell <riastadh@>
  • William J. Coldwell <billc@>
  • Michael van Elst <mlelstv@>
  • Thomas Klausner <wiz@>
  • Cherry G. Mathew <cherry@>
  • Pierre Pronchery <khorben@>
  • Leonardo Taccari <leot@>

We would like to thank Makoto Fujiwara <mef@> and Jeremy C. Reed <reed@> for their service on the Board of Directors during their term(s).

The new Board of Directors have voted in the executive officers for The NetBSD Foundation:

President:William J. Coldwell
Vice President: Pierre Pronchery
Secretary: Christos Zoulas
Assistant Secretary: Thomas Klausner
Treasurer: Christos Zoulas
Assistant Treasurer: Taylor R. Campbell

Thanks to everyone that voted and we look forward to a great 2020.

Posted late Wednesday evening, November 20th, 2019 Tags: blog

Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

In February, I have started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on reenabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types and fix compat32 issues and fixing watchpoint support. Then, I've started working on improving thread support which is taking longer than expected. You can read more about that in my September 2019 report.

So far the number of issues uncovered while enabling proper threading support has stopped me from merging the work-in-progress patches. However, I've finally reached the point where I believe that the current work can be merged and the remaining problems can be resolved afterwards. More on that and other LLVM-related events happening during the last month in this report.

LLVM news and buildbot status update

LLVM switched to git

Probably the most important event to note is that the LLVM project has switched from Subversion to git, and moved their repositories to GitHub. While the original plan provided for maintaining the old repositories as read-only mirrors, as of today this still hasn't been implemented. For this reason, we were forced to quickly switch buildbot to the git monorepo.

The buildbot is operational now, and seems to be handling git correctly. However, it is connected to the staging server for the time being. Its URL changed to http://lab.llvm.org:8014/builders/netbsd-amd64 (i.e. the port from 8011 to 8014).

Monthly regression report

Now for the usual list of 'what they broke this time'.

LLDB has been given a new API for handling files, in particular for passing them to Python scripts. The change of API has caused some 'bad file descriptor' errors, e.g.:

ERROR: test_SBDebugger (TestDefaultConstructorForAPIObjects.APIDefaultConstructorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/decorators.py", line 343, in wrapper
    return func(self, *args, **kwargs)
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/python_api/default-constructor/TestDefaultConstructorForAPIObjects.py", line 133, in test_SBDebugger
    sb_debugger.fuzz_obj(obj)
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/python_api/default-constructor/sb_debugger.py", line 13, in fuzz_obj
    obj.SetInputFileHandle(None, True)
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 3890, in SetInputFileHandle
    self.SetInputFile(SBFile.Create(file, borrow=True))
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 5418, in Create
    return cls.MakeBorrowed(file)
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 5379, in MakeBorrowed
    return _lldb.SBFile_MakeBorrowed(BORROWED)
IOError: [Errno 9] Bad file descriptor
Config=x86_64-/data/motus/netbsd8/netbsd8/build/bin/clang-10
----------------------------------------------------------------------

I've been able to determine that the error was produced by flush() method call invoked on a file descriptor referring to stdin. Appropriately, I've fixed the type conversion method not to flush read-only fds.

Afterwards, Lawrence D'Anna was able to find and fix another fflush() issue.

A newly added test revealed that platform process list -v command on NetBSD missed listing the process name. I've fixed it to provide Arg0 in process info.

Another new test failed due to our target not implementing ShellExpandArguments() API. Apparently the only target actually implementing it is Darwin, so I've just marked TestCustomShell XFAIL on all BSD targets.

LLDB upstream was forced to reintroduce readline module override that aims to prevent readline and libedit from being loaded into a single program simultaneously. This module failed to build on NetBSD. I've discovered that the original was meant to be built on Linux only, and since the problem still doesn't affect other platforms, I've made it Linux-only again.

libunwind build has been changed to link using the C compiler rather than C++. This caused some libc++ failures on NetBSD. The author has reverted the change for now, and is looking for a better way of resolving the problem.

Finally, I have disabled another OpenMP test that caused NetBSD to hang. While ideally I'd like to have the underlying kernel problem fixed, this is non-trivial and I prefer to focus on LLDB right now.

New LLD work

I've been asked to rebase my LLD patches for the new code. While doing it, I've finally committed the -z nognustack option patch from January.

In the meantime, Kamil's been working on finally resolving the long-standing impasse on LLD design. He is working on a new NetBSD-specific frontend to LLD that would satisfy our system-wide linker requirements without modifying the standard driver used by other platforms.

Upgrade to NetBSD 9 beta

Our recent work, especially the work on threading support has required a number of fixes in the NetBSD kernel. Those fixes were backported to NetBSD 9 branch but not to 8. The 8 kernel used by the buildbot was therefore suboptimal for testing new features. Furthermore, with the 9.0 release coming soon-ish, it became necessary to start actively testing it for regressions.

The buildbot has been upgraded to NetBSD 9 beta on 2019-11-06. Initially, the upgrade has caused LLDB to start crashing on startup. I have not been able to pinpoint the exact issue yet. However, I've established that it happens with -O3 optimization level only, and I've worked it around by switching the build to -O2. I am planning to look into the problem more once the buildbot is restored fully.

The upgrade to nb9 has caused 4 LLDB tests to start succeeding, and 6 to start failing. Namely:

********************
Unexpected Passing Tests (4):
    lldb-api :: commands/watchpoints/watchpoint_commands/condition/TestWatchpointConditionCmd.py
    lldb-api :: commands/watchpoints/watchpoint_commands/command/TestWatchpointCommandPython.py
    lldb-api :: lang/c/bitfields/TestBitfields.py
    lldb-api :: commands/watchpoints/watchpoint_commands/command/TestWatchpointCommandLLDB.py

********************
Failing Tests (6):
    lldb-shell :: Reproducer/Functionalities/TestExpressionEvaluation.test
    lldb-api :: commands/expression/call-restarts/TestCallThatRestarts.py
    lldb-api :: functionalities/signal/handle-segv/TestHandleSegv.py
    lldb-unit :: tools/lldb-server/tests/./LLDBServerTests/StandardStartupTest.TestStopReplyContainsThreadPcs
    lldb-api :: functionalities/inferior-crashing/TestInferiorCrashingStep.py
    lldb-api :: functionalities/signal/TestSendSignal.py

I am going to start investigating the new failures shortly.

Further LLDB threading work

Fixes to register support

Enabling thread support revealed a problem in register API introspection specific to NetBSD. The API responsible for passing registers in groups to Python was unable to name some of the groups on NetBSD, and the null names have caused the TestRegistersIterator to fail. Threading support made this specifically visible by replacing a regular test failure with Python code error.

In order to resolve the problem, I had to describe all supported register sets in NetBSD register context. The code was roughly based on the Linux equivalent, modified to match register sets used by our ptrace() API. Interestingly, I had to also include MPX registers that are currently unimplemented, as otherwise LLDB implicitly put them in an anonymous group.

While at it, I've also changed the register set numbering to match the more common ordering, in order to avoid issues in the future.

Finished basic thread support patch

I've finally completed and submitted the patch for NetBSD thread support. Besides fixing a few mistakes, I've implemented thread affinity support for all relevant SIGTRAP events (breakpoints, traces, hardware watchpoints) and removed incomplete hardware breakpoint stub that caused LLDB to crash.

In its current form, this patch combines three changes essential to correct support of threaded programs:

  1. It enables reporting of new and exited threads, and maintains debugged thread list based on that.

  2. It modifies the signal (generic and SIGTRAP) handling functions to read the thread identifier and associate the event with correct thread(s). Previously, all events were assigned to all threads.

  3. It updates the process resuming function to support controlling the state (running, single-stepping, stopped) of individual threads, and raising a signal either to the whole process or to a single thread. Previously, the code used only the requested action for the first thread and populated it to all threads in the process.

Proper watchpoint support in multi-threaded programs

I've submitted a separate patch to copy watchpoints to newly-created threads. This is necessary due to the design of Debug Register support in NetBSD. Quoting the ptrace(2) manpage:

  • debug registers are only per-LWP, not per-process globally
  • debug registers must not be inherited after (v)forking a process
  • debug registers must not be inherited after forking a thread
  • a debugger is responsible to set global watchpoints/breakpoints with the debug registers, to achieve this PTRACE_LWP_CREATE / PTRACE_LWP_EXIT event monitoring function is designed to be used

LLDB supports per-process watchpoints only at the moment. To fit this into NetBSD model, we need to monitor new threads and copy watchpoints to them. Since LLDB does not keep explicit watchpoint information at the moment (it relies on querying debug registers), the proposed implementation verbosely copies dbregs from the currently selected thread (all existing threads should have the same dbregs).

Fixed support for concurrent watchpoint triggers

The final problem I've been investigating was a server crash with the new code when multiple watchpoints were triggered concurrently. My final patch aims to fix handling concurrent watchpoint events.

When a watchpoint is triggered, the kernel delivers SIGTRAP with TRAP_DBREG to the debugger. The debugger investigates DR6 register of the specified thread in order to determine which watchpoint was triggered, and reports it. When multiple watchpoints are triggered simultaneously, the kernel reports that as series of successive SIGTRAPs. Normally, that works just fine.

However, on x86 watchpoint triggers are reported before the instruction is executed. For this reason, LLDB temporarily disables the breakpoint, single-steps and reenables it. The problem with that is that the GDB protocol doesn't control watchpoints per thread, so the operation disables and reenables the watchpoint on all threads. As a side effect of this, DR6 is cleared everywhere.

Now, if multiple watchpoints were triggered concurrently, DR6 is set on all relevant threads. However, after handling SIGTRAP on the first one, the disable/reenable (or more specifically, remove/readd) wipes DR6 on all threads. The handler for next SIGTRAP can't establish the correct watchpoint number, and starts looking for breakpoints. Since hardware breakpoints are not implemented, the relevant method returns an error and lldb-server eventually exits.

There are two problems to be solved there. Firstly, lldb-server should not exit in this circumstances. This is already solved in the first patch as mentioned above. Secondly, we need to be able to handle concurrent watchpoint hits independently of the clear/set packets. This is solved by this patch.

There are multiple different approaches to this problem. I've chosen to remodel clear/set watchpoint method in order to prevent it from resetting DR6 if the same watchpoint is being restored, as the alternatives (such as pre-storing DR6 on the first SIGTRAP) have more corner conditions to be concerned about.

The current design of these two methods assumes that the 'clear' method clears both the triggered state in DR6 and control bits in DR7, while the 'set' method sets the address in DR0..3, and the control bits in DR7.

The new design limits the 'clear' method to disabling the watchpoint by clearing the enable bit in DR7. The remaining bits, as well as trigger status and address are preserved. The 'set' method uses them to determine whether a new watchpoint is being set, or the previous one merely reenabled. In the latter case, it just updates DR7, while preserving the previous trigger. In the former, it updates all registers and clears the trigger from DR6.

This solution effectively prevents the disable/reenable logic of LLDB from clearing concurrent watchpoint hits, and therefore makes it possible for the SIGTRAP handler to report them correctly. If the user manually replaces the watchpoint with another one, DR6 is cleared and LLDB does not associate the concurrent trigger to the watchpoint that no longer exists.

Thread status summary

The current version of the patches fixes approximately 47 test failures, and causes approximately 4 new test failures and 2 hanging tests. There is around 7 new flaky tests, related to signals concurrent with breakpoints or watchpoints.

Future plans

The first immediate goal is to investigate and resolve test suite regressions related to NetBSD 9 upgrade. The second goal is to get the threading patches merged, and simultaneously work on resolving the remaining test failures and hangs.

When that's done, I'd like to finally move on with the remaining TODO items. Those are:

  1. Add support to backtrace through signal trampoline and extend the support to libexecinfo, unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).

  2. Add support for i386 and aarch64 targets.

  3. Stabilize LLDB and address breaking tests from the test suite.

  4. Merge LLDB with the base system (under LLVM-style distribution).

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

https://netbsd.org/donations/#how-to-donate

Posted Saturday night, November 9th, 2019 Tags: blog

Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

In February, I have started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on reenabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types and fix compat32 issues and fixing watchpoint support. Then, I've started working on improving thread support which is taking longer than expected. You can read more about that in my September 2019 report.

So far the number of issues uncovered while enabling proper threading support has stopped me from merging the work-in-progress patches. However, I've finally reached the point where I believe that the current work can be merged and the remaining problems can be resolved afterwards. More on that and other LLVM-related events happening during the last month in this report.

LLVM news and buildbot status update

LLVM switched to git

Probably the most important event to note is that the LLVM project has switched from Subversion to git, and moved their repositories to GitHub. While the original plan provided for maintaining the old repositories as read-only mirrors, as of today this still hasn't been implemented. For this reason, we were forced to quickly switch buildbot to the git monorepo.

The buildbot is operational now, and seems to be handling git correctly. However, it is connected to the staging server for the time being. Its URL changed to http://lab.llvm.org:8014/builders/netbsd-amd64 (i.e. the port from 8011 to 8014).

Monthly regression report

Now for the usual list of 'what they broke this time'.

LLDB has been given a new API for handling files, in particular for passing them to Python scripts. The change of API has caused some 'bad file descriptor' errors, e.g.:

ERROR: test_SBDebugger (TestDefaultConstructorForAPIObjects.APIDefaultConstructorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/decorators.py", line 343, in wrapper
    return func(self, *args, **kwargs)
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/python_api/default-constructor/TestDefaultConstructorForAPIObjects.py", line 133, in test_SBDebugger
    sb_debugger.fuzz_obj(obj)
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/python_api/default-constructor/sb_debugger.py", line 13, in fuzz_obj
    obj.SetInputFileHandle(None, True)
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 3890, in SetInputFileHandle
    self.SetInputFile(SBFile.Create(file, borrow=True))
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 5418, in Create
    return cls.MakeBorrowed(file)
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 5379, in MakeBorrowed
    return _lldb.SBFile_MakeBorrowed(BORROWED)
IOError: [Errno 9] Bad file descriptor
Config=x86_64-/data/motus/netbsd8/netbsd8/build/bin/clang-10
----------------------------------------------------------------------

I've been able to determine that the error was produced by flush() method call invoked on a file descriptor referring to stdin. Appropriately, I've fixed the type conversion method not to flush read-only fds.

Afterwards, Lawrence D'Anna was able to find and fix another fflush() issue.

A newly added test revealed that platform process list -v command on NetBSD missed listing the process name. I've fixed it to provide Arg0 in process info.

Another new test failed due to our target not implementing ShellExpandArguments() API. Apparently the only target actually implementing it is Darwin, so I've just marked TestCustomShell XFAIL on all BSD targets.

LLDB upstream was forced to reintroduce readline module override that aims to prevent readline and libedit from being loaded into a single program simultaneously. This module failed to build on NetBSD. I've discovered that the original was meant to be built on Linux only, and since the problem still doesn't affect other platforms, I've made it Linux-only again.

libunwind build has been changed to link using the C compiler rather than C++. This caused some libc++ failures on NetBSD. The author has reverted the change for now, and is looking for a better way of resolving the problem.

Finally, I have disabled another OpenMP test that caused NetBSD to hang. While ideally I'd like to have the underlying kernel problem fixed, this is non-trivial and I prefer to focus on LLDB right now.

New LLD work

I've been asked to rebase my LLD patches for the new code. While doing it, I've finally committed the -z nognustack option patch from January.

In the meantime, Kamil's been working on finally resolving the long-standing impasse on LLD design. He is working on a new NetBSD-specific frontend to LLD that would satisfy our system-wide linker requirements without modifying the standard driver used by other platforms.

Upgrade to NetBSD 9 beta

Our recent work, especially the work on threading support has required a number of fixes in the NetBSD kernel. Those fixes were backported to NetBSD 9 branch but not to 8. The 8 kernel used by the buildbot was therefore suboptimal for testing new features. Furthermore, with the 9.0 release coming soon-ish, it became necessary to start actively testing it for regressions.

The buildbot has been upgraded to NetBSD 9 beta on 2019-11-06. Initially, the upgrade has caused LLDB to start crashing on startup. I have not been able to pinpoint the exact issue yet. However, I've established that it happens with -O3 optimization level only, and I've worked it around by switching the build to -O2. I am planning to look into the problem more once the buildbot is restored fully.

The upgrade to nb9 has caused 4 LLDB tests to start succeeding, and 6 to start failing. Namely:

********************
Unexpected Passing Tests (4):
    lldb-api :: commands/watchpoints/watchpoint_commands/condition/TestWatchpointConditionCmd.py
    lldb-api :: commands/watchpoints/watchpoint_commands/command/TestWatchpointCommandPython.py
    lldb-api :: lang/c/bitfields/TestBitfields.py
    lldb-api :: commands/watchpoints/watchpoint_commands/command/TestWatchpointCommandLLDB.py

********************
Failing Tests (6):
    lldb-shell :: Reproducer/Functionalities/TestExpressionEvaluation.test
    lldb-api :: commands/expression/call-restarts/TestCallThatRestarts.py
    lldb-api :: functionalities/signal/handle-segv/TestHandleSegv.py
    lldb-unit :: tools/lldb-server/tests/./LLDBServerTests/StandardStartupTest.TestStopReplyContainsThreadPcs
    lldb-api :: functionalities/inferior-crashing/TestInferiorCrashingStep.py
    lldb-api :: functionalities/signal/TestSendSignal.py

I am going to start investigating the new failures shortly.

Further LLDB threading work

Fixes to register support

Enabling thread support revealed a problem in register API introspection specific to NetBSD. The API responsible for passing registers in groups to Python was unable to name some of the groups on NetBSD, and the null names have caused the TestRegistersIterator to fail. Threading support made this specifically visible by replacing a regular test failure with Python code error.

In order to resolve the problem, I had to describe all supported register sets in NetBSD register context. The code was roughly based on the Linux equivalent, modified to match register sets used by our ptrace() API. Interestingly, I had to also include MPX registers that are currently unimplemented, as otherwise LLDB implicitly put them in an anonymous group.

While at it, I've also changed the register set numbering to match the more common ordering, in order to avoid issues in the future.

Finished basic thread support patch

I've finally completed and submitted the patch for NetBSD thread support. Besides fixing a few mistakes, I've implemented thread affinity support for all relevant SIGTRAP events (breakpoints, traces, hardware watchpoints) and removed incomplete hardware breakpoint stub that caused LLDB to crash.

In its current form, this patch combines three changes essential to correct support of threaded programs:

  1. It enables reporting of new and exited threads, and maintains debugged thread list based on that.

  2. It modifies the signal (generic and SIGTRAP) handling functions to read the thread identifier and associate the event with correct thread(s). Previously, all events were assigned to all threads.

  3. It updates the process resuming function to support controlling the state (running, single-stepping, stopped) of individual threads, and raising a signal either to the whole process or to a single thread. Previously, the code used only the requested action for the first thread and populated it to all threads in the process.

Proper watchpoint support in multi-threaded programs

I've submitted a separate patch to copy watchpoints to newly-created threads. This is necessary due to the design of Debug Register support in NetBSD. Quoting the ptrace(2) manpage:

  • debug registers are only per-LWP, not per-process globally
  • debug registers must not be inherited after (v)forking a process
  • debug registers must not be inherited after forking a thread
  • a debugger is responsible to set global watchpoints/breakpoints with the debug registers, to achieve this PTRACE_LWP_CREATE / PTRACE_LWP_EXIT event monitoring function is designed to be used

LLDB supports per-process watchpoints only at the moment. To fit this into NetBSD model, we need to monitor new threads and copy watchpoints to them. Since LLDB does not keep explicit watchpoint information at the moment (it relies on querying debug registers), the proposed implementation verbosely copies dbregs from the currently selected thread (all existing threads should have the same dbregs).

Fixed support for concurrent watchpoint triggers

The final problem I've been investigating was a server crash with the new code when multiple watchpoints were triggered concurrently. My final patch aims to fix handling concurrent watchpoint events.

When a watchpoint is triggered, the kernel delivers SIGTRAP with TRAP_DBREG to the debugger. The debugger investigates DR6 register of the specified thread in order to determine which watchpoint was triggered, and reports it. When multiple watchpoints are triggered simultaneously, the kernel reports that as series of successive SIGTRAPs. Normally, that works just fine.

However, on x86 watchpoint triggers are reported before the instruction is executed. For this reason, LLDB temporarily disables the breakpoint, single-steps and reenables it. The problem with that is that the GDB protocol doesn't control watchpoints per thread, so the operation disables and reenables the watchpoint on all threads. As a side effect of this, DR6 is cleared everywhere.

Now, if multiple watchpoints were triggered concurrently, DR6 is set on all relevant threads. However, after handling SIGTRAP on the first one, the disable/reenable (or more specifically, remove/readd) wipes DR6 on all threads. The handler for next SIGTRAP can't establish the correct watchpoint number, and starts looking for breakpoints. Since hardware breakpoints are not implemented, the relevant method returns an error and lldb-server eventually exits.

There are two problems to be solved there. Firstly, lldb-server should not exit in this circumstances. This is already solved in the first patch as mentioned above. Secondly, we need to be able to handle concurrent watchpoint hits independently of the clear/set packets. This is solved by this patch.

There are multiple different approaches to this problem. I've chosen to remodel clear/set watchpoint method in order to prevent it from resetting DR6 if the same watchpoint is being restored, as the alternatives (such as pre-storing DR6 on the first SIGTRAP) have more corner conditions to be concerned about.

The current design of these two methods assumes that the 'clear' method clears both the triggered state in DR6 and control bits in DR7, while the 'set' method sets the address in DR0..3, and the control bits in DR7.

The new design limits the 'clear' method to disabling the watchpoint by clearing the enable bit in DR7. The remaining bits, as well as trigger status and address are preserved. The 'set' method uses them to determine whether a new watchpoint is being set, or the previous one merely reenabled. In the latter case, it just updates DR7, while preserving the previous trigger. In the former, it updates all registers and clears the trigger from DR6.

This solution effectively prevents the disable/reenable logic of LLDB from clearing concurrent watchpoint hits, and therefore makes it possible for the SIGTRAP handler to report them correctly. If the user manually replaces the watchpoint with another one, DR6 is cleared and LLDB does not associate the concurrent trigger to the watchpoint that no longer exists.

Thread status summary

The current version of the patches fixes approximately 47 test failures, and causes approximately 4 new test failures and 2 hanging tests. There is around 7 new flaky tests, related to signals concurrent with breakpoints or watchpoints.

Future plans

The first immediate goal is to investigate and resolve test suite regressions related to NetBSD 9 upgrade. The second goal is to get the threading patches merged, and simultaneously work on resolving the remaining test failures and hangs.

When that's done, I'd like to finally move on with the remaining TODO items. Those are:

  1. Add support to backtrace through signal trampoline and extend the support to libexecinfo, unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).

  2. Add support for i386 and aarch64 targets.

  3. Stabilize LLDB and address breaking tests from the test suite.

  4. Merge LLDB with the base system (under LLVM-style distribution).

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

https://netbsd.org/donations/#how-to-donate

Posted Saturday night, November 9th, 2019 Tags: blog
I have introduced changes to make debuggers more reliable in threaded scenarios. Additionally I have revamped micro-UBSan runtime for newer Clang (version 10git). I have received the OK from core@ to switch our iconv(3) to POSIX conformant iconv(3) and I have adapted where possible and readily known in pkgsrc to the newer API. This month I continued to find a solution to the impasse in LLD that blocks adding NetBSD support.

Threading support

I have simplified the struct proc and removed a p_oppid field that stored the numeric process id of the original parent (forker). This field is not needed as it duplicates p_opptr (current real parent pointer) that is already safe to use. So far this has not proven to be unsafe.

I have refactored the signal code making it more verbose to reflect the actual needs of the kernel signal code.

I have fixed a nasty bug in the function that is called when a thread returns from the kernel to userland. There was a tiny time window when in certain scenarios a thread was never stopped on process suspension but was instead resumed causing waitpid(2) polling to never return success as the process can be never stopped with a running thread.

There was a race bug that could cause a nested thread termination call, triggering a panic.

With the above changes I was able to reliably run all ATF tests for LWP events (threading events). I have also bumped the threading tests to atually execute 100 concurrent threads, as the higher number can more easily trigger anomalies. In my observations all tests are now rock solid.

There are now no longer any ptrace(2) tests in ATF marked as flaky or disabled. The two main offenders, vfork(2) events and threading events, are now solid.

Michal Gorny detected another source of instability of threads with a LLDB regression test. It was related to emitting a massive number of concurrent threads. I have helped Michal to address this problem and squash the bug.

All of the above changes are now pulled to NetBSD-9 for future 9.0 release.

There are at the time of writing, 4 failing LLDB threading tests and few more related to debug registers. Both failure types are under investigation. They could be bugs in the NetBSD support in some extent, but maybe there is need to fixup something on the kernel level.

The project is still not 100% accomplished but we are now very close to finishing everything in the domain of threads. I could torture the NetBSD kernel for few hours with a massive number of threads and events without a single crash or failure. On the other hand there are still likely some suspicious corner cases that need proper investigation. There are also some suspicious reports for crashes from syzkaller, the kernel fuzzer. Those still need to be promptly checked.

LLVM projects

I have attempted to change our original plan with LLD and instead of mutating the LLD behavior on target basis, write a dedicated LLD wrapper that tunes LLD for NetBSD. My patch is still in review. As an improvement over the previous ones, it wasn't immediately rejected... https://reviews.llvm.org/D69755.

I have upstreamed chunks of code with the following commits:

  • [compiler-rt] [msan] Correct the __libc_thr_keycreate prototype
  • [compiler-rt] [msan] Support POSIX iconv(3) on NetBSD 9.99.17+
  • [compiler-rt] Harmonize __sanitizer_addrinfo with the NetBSD headers
  • [compiler-rt] Sync NetBSD syscall hooks with 9.99.17

NetBSD distribution changes

I have switched the iconv(3) function prototype to POSIX-conformant form. The history of this function is documented in iconv(3) as follows:

STANDARDS
     iconv_open(), iconv_close(), and iconv() conform to IEEE Std 1003.1-2001
     ("POSIX.1").

     Historically, the definition of iconv has not been consistent across
     operating systems.  This is due to an unfortunate historical mistake,
     documented in this e-mail:
     https://www5.opengroup.org/sophocles2/show_mail.tpl?&source=L&listname=austin-group-l&id=7404.
     The standards page for the header file  defined the second
     argument of iconv() as char **, but the standards page for the iconv()
     implementation defined it as const char **.  The standards committee
     later chose to change the function definition to follow the header file
     definition (without const), even though the version with const is
     arguably more correct.  NetBSD used initially the const form.  It was
     decided to reject the committee's regression and become (technically)
     incompatible.

     This decision was changed in NetBSD 10 and the iconv() prototype was
     synchronized with the standard.

Meanwhile I fixed what was known to be effected in pkgsrc. Unfortunately Qt4/KDE4 had several build issues and this motivated me to fix its users for the new function through upgrades to the Qt5/KDE5 stack. Many dead packages without upgrade path were dropped from pkgsrc.

As there is a new Clang upgrade coming, I have implemented handlers for new UBSan reports: function_type_mismatch_v1() and implicit_conversion(). The first one is a new ABI for function_type_mismatch() and the second one is completely new.

GSoC Mentor Summit

I took part in the GSoC Mentor Summit in Munich and presented a talk titled "NetBSD version 9. What's new in store?".

Plan for the next milestone

Support Michal Gorny in reaching the milestone of passing all threading and debug register tests in LLDB.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

http://netbsd.org/donations/#how-to-donate

Posted mid-morning Monday, November 4th, 2019 Tags: blog
I have introduced changes to make debuggers more reliable in threaded scenarios. Additionally I have revamped micro-UBSan runtime for newer Clang (version 10git). I have received the OK from core@ to switch our iconv(3) to POSIX conformant iconv(3) and I have adapted where possible and readily known in pkgsrc to the newer API. This month I continued to find a solution to the impasse in LLD that blocks adding NetBSD support.

Threading support

I have simplified the struct proc and removed a p_oppid field that stored the numeric process id of the original parent (forker). This field is not needed as it duplicates p_opptr (current real parent pointer) that is already safe to use. So far this has not proven to be unsafe.

I have refactored the signal code making it more verbose to reflect the actual needs of the kernel signal code.

I have fixed a nasty bug in the function that is called when a thread returns from the kernel to userland. There was a tiny time window when in certain scenarios a thread was never stopped on process suspension but was instead resumed causing waitpid(2) polling to never return success as the process can be never stopped with a running thread.

There was a race bug that could cause a nested thread termination call, triggering a panic.

With the above changes I was able to reliably run all ATF tests for LWP events (threading events). I have also bumped the threading tests to atually execute 100 concurrent threads, as the higher number can more easily trigger anomalies. In my observations all tests are now rock solid.

There are now no longer any ptrace(2) tests in ATF marked as flaky or disabled. The two main offenders, vfork(2) events and threading events, are now solid.

Michal Gorny detected another source of instability of threads with a LLDB regression test. It was related to emitting a massive number of concurrent threads. I have helped Michal to address this problem and squash the bug.

All of the above changes are now pulled to NetBSD-9 for future 9.0 release.

There are at the time of writing, 4 failing LLDB threading tests and few more related to debug registers. Both failure types are under investigation. They could be bugs in the NetBSD support in some extent, but maybe there is need to fixup something on the kernel level.

The project is still not 100% accomplished but we are now very close to finishing everything in the domain of threads. I could torture the NetBSD kernel for few hours with a massive number of threads and events without a single crash or failure. On the other hand there are still likely some suspicious corner cases that need proper investigation. There are also some suspicious reports for crashes from syzkaller, the kernel fuzzer. Those still need to be promptly checked.

LLVM projects

I have attempted to change our original plan with LLD and instead of mutating the LLD behavior on target basis, write a dedicated LLD wrapper that tunes LLD for NetBSD. My patch is still in review. As an improvement over the previous ones, it wasn't immediately rejected... https://reviews.llvm.org/D69755.

I have upstreamed chunks of code with the following commits:

  • [compiler-rt] [msan] Correct the __libc_thr_keycreate prototype
  • [compiler-rt] [msan] Support POSIX iconv(3) on NetBSD 9.99.17+
  • [compiler-rt] Harmonize __sanitizer_addrinfo with the NetBSD headers
  • [compiler-rt] Sync NetBSD syscall hooks with 9.99.17

NetBSD distribution changes

I have switched the iconv(3) function prototype to POSIX-conformant form. The history of this function is documented in iconv(3) as follows:

STANDARDS
     iconv_open(), iconv_close(), and iconv() conform to IEEE Std 1003.1-2001
     ("POSIX.1").

     Historically, the definition of iconv has not been consistent across
     operating systems.  This is due to an unfortunate historical mistake,
     documented in this e-mail:
     https://www5.opengroup.org/sophocles2/show_mail.tpl?&source=L&listname=austin-group-l&id=7404.
     The standards page for the header file  defined the second
     argument of iconv() as char **, but the standards page for the iconv()
     implementation defined it as const char **.  The standards committee
     later chose to change the function definition to follow the header file
     definition (without const), even though the version with const is
     arguably more correct.  NetBSD used initially the const form.  It was
     decided to reject the committee's regression and become (technically)
     incompatible.

     This decision was changed in NetBSD 10 and the iconv() prototype was
     synchronized with the standard.

Meanwhile I fixed what was known to be effected in pkgsrc. Unfortunately Qt4/KDE4 had several build issues and this motivated me to fix its users for the new function through upgrades to the Qt5/KDE5 stack. Many dead packages without upgrade path were dropped from pkgsrc.

As there is a new Clang upgrade coming, I have implemented handlers for new UBSan reports: function_type_mismatch_v1() and implicit_conversion(). The first one is a new ABI for function_type_mismatch() and the second one is completely new.

GSoC Mentor Summit

I took part in the GSoC Mentor Summit in Munich and presented a talk titled "NetBSD version 9. What's new in store?".

Plan for the next milestone

Support Michal Gorny in reaching the milestone of passing all threading and debug register tests in LLDB.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

http://netbsd.org/donations/#how-to-donate

Posted mid-morning Monday, November 4th, 2019 Tags: blog
I have introduced changes that make debuggers more reliable in threaded scenarios. Additionally, I have enhanced Leak Sanitizer support and introduced various improvements in the basesystem.

Threading support

Threads and synchronization in the kernel, in general, is an evergreen task of the kernel developers. The process of enhancing support for tracing multiple threads has been documented by Michal Gorny in his LLDB entry Threading support in LLDB continued.

Overall I have introduced these changes:

  • Separate suspend from userland (_lwp_suspend(2)) flag from suspend by a debugger (PT_SUSPEND). This removes one of the underlying problems of threading stability as a debuggee was able to accidentally unstop suspended thread. This property is needed whenever we want to trace a selection (typically single entity) of threads.
  • Store SIGTRAP event information inside siginfo_t, rather than in struct proc. A single signal can only be reported at the time to the debugger, and its context is no longer prone to be overwritten by concurrent threads.
  • Change that introduces restarts in functions notifying events for debuggers. There was a time window between registering an event by a thread, stopping the process and unlocking mutexes of the process; as another process could take the mutexes before being stopped and overwrite the event with its own data. Now each event routine for debugger checks whether a process is already stopping (or demising or no longer being tracked) and preserves the signal to be emitted locally in the context of the lwp local variable on the stack and continues stopping self as requested by the other LWP. Once the thread is awaken, it retries to emit the signal and deliver the event signal to the debugger.
  • Introduce PT_STOP, that combines kill(SIGSTOP) and ptrace(PT_CONTINUE,SIGSTOP) semantics in a single call. It works like:
    • kill(SIGSTOP) for unstopped tracee
    • ptrace(PT_CONTINUE,SIGSTOP) for stopped tracee
    The child will be stopped and always possible to be waited (with wait(2) like calls).

    For stopped tracee kill(SIGSTOP) has no effect. PT_CONTINUE+SIGSTOP cannot be used on an unstopped process (EBUSY).

    This operation is modeled after PT_KILL that is similar for the SIGKILL call. While there, allow PT_KILL on unstopped traced child.

    This operation is useful in an abnormal exit of a debugger from a signal handler, usually followed by waitpid(2) and ptrace(PT_DETACH).

For the sake of tracking the missed in action signals emitted by tracee, I have introduced the feature in NetBSD truss (as part of the picotrace repository) to register syscall entry (SCE) and syscall exit (SCX) calls and track missing SCE/SCX events that were never delivered. Unfortunately, the number of missing events was huge, even for simple 2-threaded applications.

    truss[2585] running for 22.205305922 seconds
    truss[2585] attached to child=759 ('firefox') for 22.204289369 seconds
    syscall                     seconds      calls     errors missed-sce missed-scx
    read                    0.048522952        609          0         54         76
    write                   0.044693735        487          0         35         66
    open                    0.002516815         18          0          5          5
    close                   0.001015263         17          0          9          6
    unlink                  0.001375463         13          0          3          0
    getpid                  0.093458089       1993          0         16         56
    geteuid                 0.000049301          1          0          0          1
    recvmsg                 0.343353019       4828       3685         90        112
    access                  0.001450653         12          3          5          4
    dup                     0.000570904         10          0          0          1
    munmap                  0.010375949         88          0          6          3
    mprotect                0.196781932       2251          0         11         62
    madvise                 0.049820002        430          0         11         18
    writev                  0.237488362       1507          0         76         67
    rename                  0.000379918          2          0          1          0
    mkdir                   0.000283846          2          2          1          2
    mmap                    0.033342935        481          0         15         40
    lseek                   0.003341775         62          0         25         24
    ftruncate               0.000507707          9          0          1          0
    __sysctl                0.000144506          2          0          0          0
    poll                   18.694195617       4531          0        106        191
    __sigprocmask14         0.001585329         20          0          0          2
    getcontext              0.000083238          1          0          0          0
    _lwp_create             0.000104646          1          0          0          0
    _lwp_self               0.001456718         22          0         24         79
    _lwp_unpark             0.035319633        607          0         14         39
    _lwp_unpark_all         0.020660377        250          0         38         50
    _lwp_setname            0.000118418          2          0          0          0
    __select50             15.125525493        637          0         82        125
    __gettimeofday50        3.279021049       2930          0         40        135
    __clock_gettime50      10.673311747      33132          0       1418       3003
    __stat50                0.006375356         52          3         12          5
    __fstat50               0.001490944         17          0          3          2
    __lstat50               0.000110906          1          0          1          0
    __getrusage50           0.008863815        109          0          7          1
    ___lwp_park60          62.720893458        964        251        454        453
                          -------------    -------    -------    -------    -------
                          111.638589870      56098       3944       2563       4628

With my kernel changes landed, the number of missed sce/scx events is down to zero (with exceptions to signals that e.g. never return such as the exit(2) call).

Once these changes settle in HEAD, I plan to backport them to NetBSD-9. I have already received feedback that GDB works much better now.

The kernel also has now more runtime asserts that validate correctness of the code paths.

Sanitizers

I've introduced a special preprocessor macro to detect LSan (__SANITIZE_LEAK__) and UBSan (__SANITIZE_UNDEFINED__) in GCC. The patches were submitted upstream to the GCC mailing list, in two patches (LSan + UBSan). Unfortunately, GCC does not see value in feature parity with LLVM and for the time being it will be a local NetBSD specific GCC extension. These macros are now integrated into the NetBSD public system headers, for use by the basesystem software.

The LSan macro is now used inside the LLVM codebase and the ps(1) program is the first user of it. The UBSan macro is now used to disable relaxed alignment on x86. While such code is still functional, it is not clean from undefined behavior as specified by C. This is especially needed in the kernel fuzzing process, as we can reduce noise from less interesting reports.

During the previous month a number of reports from kernel fuzzing were fixed. There is still more to go.

Almost all local patches needed for LSan were merged upstream. The last remaining local patch is scheduled for later as it is very invasive for all platforms and sanitizers. In the worst case we just have more false negatives in detection of leaks in specific scenarios.

Miscellaneous changes

I have fixed a regression in upstream GDB with SIGTTOU handling. This was an upstream bug fixed by Alan Hayward and cherry-picked by me. As a side effect, a certain environment setup would cause the tracer to sleep.

I have reverted the regression in changed in6_addr change. It appeased UBSan, but broke at least qemu networking. The regression was tracked down by Andreas Gustafsson and reported in the NetBSD's bug tracking system.

I have landed a patch that returns ELF loader dl_phdr_info information for dl_iterate_phdr(3). This synchronized the behavior with Linux, FreeBSD and OpenBSD and is used by sanitizers.

I have passed through core@ the patch to change the kevent::udata type from intptr_t to void*. The former is slightly more pedantic, but the latter is what is in all other kevent users and this mismatch of types affected specifically C++ users that needed special NetBSD-only workarounds.

I have marked git and hg meta files as ones to be ignored by cvs import. This was causing problems among people repackaging the NetBSD source code with other VCS software than CVS.

I keep working on getting GDB test-suite to run on NetBSD, I spent some time on getting fluent in the TCL programming language (as GDB uses dejagnu and TCL scripting). I have already fixed two bugs that affected NetBSD users in the TCL runtime: getaddrbyname_r and gethostbyaddr_r were falsely reported as available and picked on NetBSD, causing damage in operation. Fluency in TCL will allow me to be more efficient in addressing and debugging failing tests in GDB and likely reuse this knowledge in other fields useful for the project.

I made __CTASSERT a static assert again. Previously, this homegrown check for compile-time checks silently stopped working for C99 compilers supporting VLA (variable length array). It was caught by kUBSan that detected VLA of dynamic size of -1, that is still compatible but has unspecified runtime semantics. The new form is inspired by the Perl ctassert code and uses bit-field constant that enforces the assert to be effective again. Few misuses __CTASSERT, mostly in the Linux DRMKMS code, were fixed.

I have submitted a proposal to the C Working Group a proposal to add new methods for setting and getting the thread name.

Plan for the next milestone

Keep stabilizing the reliability debugging interfaces and get ATF and LLDB threading code reliably pass tests. Cover more scenarios with ptrace(2) in the ATF regression test-suite.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

http://netbsd.org/donations/#how-to-donate

Posted mid-morning Thursday, October 10th, 2019 Tags: blog
I have introduced changes that make debuggers more reliable in threaded scenarios. Additionally, I have enhanced Leak Sanitizer support and introduced various improvements in the basesystem.

Threading support

Threads and synchronization in the kernel, in general, is an evergreen task of the kernel developers. The process of enhancing support for tracing multiple threads has been documented by Michal Gorny in his LLDB entry Threading support in LLDB continued.

Overall I have introduced these changes:

  • Separate suspend from userland (_lwp_suspend(2)) flag from suspend by a debugger (PT_SUSPEND). This removes one of the underlying problems of threading stability as a debuggee was able to accidentally unstop suspended thread. This property is needed whenever we want to trace a selection (typically single entity) of threads.
  • Store SIGTRAP event information inside siginfo_t, rather than in struct proc. A single signal can only be reported at the time to the debugger, and its context is no longer prone to be overwritten by concurrent threads.
  • Change that introduces restarts in functions notifying events for debuggers. There was a time window between registering an event by a thread, stopping the process and unlocking mutexes of the process; as another process could take the mutexes before being stopped and overwrite the event with its own data. Now each event routine for debugger checks whether a process is already stopping (or demising or no longer being tracked) and preserves the signal to be emitted locally in the context of the lwp local variable on the stack and continues stopping self as requested by the other LWP. Once the thread is awaken, it retries to emit the signal and deliver the event signal to the debugger.
  • Introduce PT_STOP, that combines kill(SIGSTOP) and ptrace(PT_CONTINUE,SIGSTOP) semantics in a single call. It works like:
    • kill(SIGSTOP) for unstopped tracee
    • ptrace(PT_CONTINUE,SIGSTOP) for stopped tracee
    The child will be stopped and always possible to be waited (with wait(2) like calls).

    For stopped tracee kill(SIGSTOP) has no effect. PT_CONTINUE+SIGSTOP cannot be used on an unstopped process (EBUSY).

    This operation is modeled after PT_KILL that is similar for the SIGKILL call. While there, allow PT_KILL on unstopped traced child.

    This operation is useful in an abnormal exit of a debugger from a signal handler, usually followed by waitpid(2) and ptrace(PT_DETACH).

For the sake of tracking the missed in action signals emitted by tracee, I have introduced the feature in NetBSD truss (as part of the picotrace repository) to register syscall entry (SCE) and syscall exit (SCX) calls and track missing SCE/SCX events that were never delivered. Unfortunately, the number of missing events was huge, even for simple 2-threaded applications.

    truss[2585] running for 22.205305922 seconds
    truss[2585] attached to child=759 ('firefox') for 22.204289369 seconds
    syscall                     seconds      calls     errors missed-sce missed-scx
    read                    0.048522952        609          0         54         76
    write                   0.044693735        487          0         35         66
    open                    0.002516815         18          0          5          5
    close                   0.001015263         17          0          9          6
    unlink                  0.001375463         13          0          3          0
    getpid                  0.093458089       1993          0         16         56
    geteuid                 0.000049301          1          0          0          1
    recvmsg                 0.343353019       4828       3685         90        112
    access                  0.001450653         12          3          5          4
    dup                     0.000570904         10          0          0          1
    munmap                  0.010375949         88          0          6          3
    mprotect                0.196781932       2251          0         11         62
    madvise                 0.049820002        430          0         11         18
    writev                  0.237488362       1507          0         76         67
    rename                  0.000379918          2          0          1          0
    mkdir                   0.000283846          2          2          1          2
    mmap                    0.033342935        481          0         15         40
    lseek                   0.003341775         62          0         25         24
    ftruncate               0.000507707          9          0          1          0
    __sysctl                0.000144506          2          0          0          0
    poll                   18.694195617       4531          0        106        191
    __sigprocmask14         0.001585329         20          0          0          2
    getcontext              0.000083238          1          0          0          0
    _lwp_create             0.000104646          1          0          0          0
    _lwp_self               0.001456718         22          0         24         79
    _lwp_unpark             0.035319633        607          0         14         39
    _lwp_unpark_all         0.020660377        250          0         38         50
    _lwp_setname            0.000118418          2          0          0          0
    __select50             15.125525493        637          0         82        125
    __gettimeofday50        3.279021049       2930          0         40        135
    __clock_gettime50      10.673311747      33132          0       1418       3003
    __stat50                0.006375356         52          3         12          5
    __fstat50               0.001490944         17          0          3          2
    __lstat50               0.000110906          1          0          1          0
    __getrusage50           0.008863815        109          0          7          1
    ___lwp_park60          62.720893458        964        251        454        453
                          -------------    -------    -------    -------    -------
                          111.638589870      56098       3944       2563       4628

With my kernel changes landed, the number of missed sce/scx events is down to zero (with exceptions to signals that e.g. never return such as the exit(2) call).

Once these changes settle in HEAD, I plan to backport them to NetBSD-9. I have already received feedback that GDB works much better now.

The kernel also has now more runtime asserts that validate correctness of the code paths.

Sanitizers

I've introduced a special preprocessor macro to detect LSan (__SANITIZE_LEAK__) and UBSan (__SANITIZE_UNDEFINED__) in GCC. The patches were submitted upstream to the GCC mailing list, in two patches (LSan + UBSan). Unfortunately, GCC does not see value in feature parity with LLVM and for the time being it will be a local NetBSD specific GCC extension. These macros are now integrated into the NetBSD public system headers, for use by the basesystem software.

The LSan macro is now used inside the LLVM codebase and the ps(1) program is the first user of it. The UBSan macro is now used to disable relaxed alignment on x86. While such code is still functional, it is not clean from undefined behavior as specified by C. This is especially needed in the kernel fuzzing process, as we can reduce noise from less interesting reports.

During the previous month a number of reports from kernel fuzzing were fixed. There is still more to go.

Almost all local patches needed for LSan were merged upstream. The last remaining local patch is scheduled for later as it is very invasive for all platforms and sanitizers. In the worst case we just have more false negatives in detection of leaks in specific scenarios.

Miscellaneous changes

I have fixed a regression in upstream GDB with SIGTTOU handling. This was an upstream bug fixed by Alan Hayward and cherry-picked by me. As a side effect, a certain environment setup would cause the tracer to sleep.

I have reverted the regression in changed in6_addr change. It appeased UBSan, but broke at least qemu networking. The regression was tracked down by Andreas Gustafsson and reported in the NetBSD's bug tracking system.

I have landed a patch that returns ELF loader dl_phdr_info information for dl_iterate_phdr(3). This synchronized the behavior with Linux, FreeBSD and OpenBSD and is used by sanitizers.

I have passed through core@ the patch to change the kevent::udata type from intptr_t to void*. The former is slightly more pedantic, but the latter is what is in all other kevent users and this mismatch of types affected specifically C++ users that needed special NetBSD-only workarounds.

I have marked git and hg meta files as ones to be ignored by cvs import. This was causing problems among people repackaging the NetBSD source code with other VCS software than CVS.

I keep working on getting GDB test-suite to run on NetBSD, I spent some time on getting fluent in the TCL programming language (as GDB uses dejagnu and TCL scripting). I have already fixed two bugs that affected NetBSD users in the TCL runtime: getaddrbyname_r and gethostbyaddr_r were falsely reported as available and picked on NetBSD, causing damage in operation. Fluency in TCL will allow me to be more efficient in addressing and debugging failing tests in GDB and likely reuse this knowledge in other fields useful for the project.

I made __CTASSERT a static assert again. Previously, this homegrown check for compile-time checks silently stopped working for C99 compilers supporting VLA (variable length array). It was caught by kUBSan that detected VLA of dynamic size of -1, that is still compatible but has unspecified runtime semantics. The new form is inspired by the Perl ctassert code and uses bit-field constant that enforces the assert to be effective again. Few misuses __CTASSERT, mostly in the Linux DRMKMS code, were fixed.

I have submitted a proposal to the C Working Group a proposal to add new methods for setting and getting the thread name.

Plan for the next milestone

Keep stabilizing the reliability debugging interfaces and get ATF and LLDB threading code reliably pass tests. Cover more scenarios with ptrace(2) in the ATF regression test-suite.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

http://netbsd.org/donations/#how-to-donate

Posted mid-morning Thursday, October 10th, 2019 Tags: blog
Add a comment