Traditionally, the NetBSD kernel code was protected by a single, global lock. This lock ensured that, on a multiprocessor system, two different threads of execution did not access the kernel concurrently, and thus simplified the internal design of the kernel. However, such a design does not scale to multiprocessor machines because, effectively, the kernel is restricted to running on a single processor at any given time.
The NetBSD kernel has been modified to use fine-grained locks in many of its subsystems, achieving good performance on today's multiprocessor machines. Unfortunately, these changes have not yet been applied to the networking code, which remains protected by the single lock. In other words: NetBSD networking has evolved to work in a uniprocessor environment; switching it to use fine-grained locks is a hard and complex problem.
This project is currently claimed
Funding
At this time, The NetBSD Foundation is accepting project specifications to remove the single networking lock. If you want to apply for this project, please send your proposal to the contact addresses listed above.
Due to the size of this project, your proposal does not need to cover everything to qualify for funding. We have attempted to split the work into smaller units, and you can submit funding applications for these smaller subtasks independently as long as the work you deliver fits in the grand order of this project. For example, you could send an application to make the network interfaces alone MP-friendly (see the work plan below).
What follows is a particular design proposal, extracted from an original text written by Matt Thomas. You may choose to work on this particular proposal or come up with your own.
Tentative specification
The future of NetBSD network infrastructure has to efficiently embrace two major design criteria: Symmetric Multi-Processing (SMP) and modularity. Other design considerations include not only supporting but taking advantage of the capability of newer network devices to do packet classification, payload splitting, and even full connection offload.
You can divide the network infrastructure into 5 major components:
- Interfaces (both real devices and pseudo-devices)
- Socket code
- Protocols
- Routing code
- mbuf code.
Part of the complexity is that, due to the monolithic nature of the kernel, each layer currently feels free to call any other layer. This makes designing a lock hierarchy difficult and likely to fail.
Part of the problem is the asynchronous upcalls, which include:
- ifa->ifa_rtrequest for route changes.
- pr_ctlinput for interface events.
Another source of complexity is the large number of global variables scattered throughout the source files. This makes putting locks around them difficult.
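As a rough illustration of why grouping helps, the sketch below gathers a few related variables into one structure that carries its own lock. Every name here (ip_state, ip_state0, ip_state_count_rx) is hypothetical and not taken from the NetBSD sources, and a pthread mutex stands in for a kernel mutex so that the example compiles in user space.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical grouping of state that would otherwise live in separate
 * file-scope globals. One lock now has an obvious set of fields to protect.
 */
struct ip_state {
    pthread_mutex_t lock;       /* user-space stand-in for a kernel mutex */
    uint64_t packets_received;
    uint64_t packets_dropped;
    int default_ttl;
};

/* A single instance; other instances could be created the same way. */
struct ip_state ip_state0 = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .default_ttl = 64,
};

/* Count one received packet and return the new total. */
uint64_t
ip_state_count_rx(struct ip_state *st)
{
    uint64_t n;

    assert(st != NULL);
    pthread_mutex_lock(&st->lock);
    n = ++st->packets_received;
    pthread_mutex_unlock(&st->lock);
    return n;
}
```

Because callers pass a pointer to the structure instead of touching globals directly, the same code can later serve more than one instance of the state.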
Subtasks
The proposed solution presented here includes the following tasks (in no particular order) to achieve the desired goals of SMP support and modularity:
- Lockless, atomic FIFO/LIFO queues
- Lockless, atomic and generic Radix/Patricia trees
- Fast protocol and port demultiplexing
- Implement per-interface interrupt handling
- Kernel continuations
- Lazy receive processing
- Separate nexthop cache from the routing table
- Make TCP syncache optional
- Virtual network stacks
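To make the first subtask more concrete, here is a minimal user-space sketch of a lockless LIFO built on C11 atomics. It is not the design from the proposal: the names lifo, lifo_node, lifo_push and lifo_take_all are assumptions made for illustration, and kernel code would use the kernel's own atomic primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical intrusive list node; real users would embed it in a packet. */
struct lifo_node {
    struct lifo_node *next;
};

struct lifo {
    _Atomic(struct lifo_node *) head;
};

/* Push one node using a compare-and-swap loop; no lock is taken. */
void
lifo_push(struct lifo *q, struct lifo_node *n)
{
    struct lifo_node *old;

    assert(q != NULL && n != NULL);
    old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        /* On CAS failure, 'old' is refreshed with the current head. */
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&q->head, &old, n,
        memory_order_release, memory_order_relaxed));
}

/*
 * Detach the whole list with one atomic exchange and return its head.
 * Taking everything at once sidesteps the ABA problem that a
 * single-node pop would have to deal with.
 */
struct lifo_node *
lifo_take_all(struct lifo *q)
{
    assert(q != NULL);
    return atomic_exchange_explicit(&q->head, (struct lifo_node *)NULL,
        memory_order_acquire);
}
```

A consumer that needs FIFO order can reverse the detached list after lifo_take_all, since it owns that list privately from that point on.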
Work plan
Aside from the list of tasks above, the work to be done for this project can be achieved by following these steps:
Move ARP out of the routing table. See the nexthop cache project.
Make the network interfaces MP-friendly; they are one of the few remaining users of the big kernel lock. This requires support for multiple receive and transmit queues to help reduce lock contention. It also includes changing more of the common interfaces to do what the tsec driver does (basically, do everything with softints). The *_input routines also need to use a table for dispatch instead of the current switch code so that domains can be dynamically loaded.
Collect global variables in the IP/UDP/TCP protocols into structures. This helps the following items.
Make IPV4/ICMP/IGMP/REASS MP-friendly.
Make IPV6/ICMP/IGMP/ND MP-friendly.
Make TCP MP-friendly.
Make UDP MP-friendly.
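The step about replacing the switch code in the *_input routines with a table could look roughly like the sketch below. It is only an illustration under stated assumptions: proto_input_fn, proto_register, proto_dispatch and demo_udp_input are hypothetical names, not existing NetBSD interfaces, and the sketch leaves out the synchronization that registration would need on a multiprocessor system.

```c
#include <assert.h>
#include <stddef.h>

#define PROTO_MAX 256   /* one slot per possible 8-bit protocol number */

/* Hypothetical handler signature: returns a handler-defined status. */
typedef int (*proto_input_fn)(const void *pkt, size_t len);

static proto_input_fn proto_input_table[PROTO_MAX];

/*
 * Install a handler at run time, for example when a module loads.
 * Returns 0 on success, -1 if the number is out of range or taken.
 */
int
proto_register(unsigned proto, proto_input_fn fn)
{
    assert(fn != NULL);
    if (proto >= PROTO_MAX || proto_input_table[proto] != NULL)
        return -1;
    proto_input_table[proto] = fn;
    return 0;
}

/* Table lookup replaces a hard-coded switch on the protocol number. */
int
proto_dispatch(unsigned proto, const void *pkt, size_t len)
{
    if (proto >= PROTO_MAX || proto_input_table[proto] == NULL)
        return -1;      /* no handler registered */
    return proto_input_table[proto](pkt, len);
}

/* Demonstration handler: simply reports the length it was given. */
int
demo_udp_input(const void *pkt, size_t len)
{
    (void)pkt;
    return (int)len;
}
```

With this shape, adding a protocol means calling proto_register(17, demo_udp_input) from the code that implements it, and the dispatcher itself never needs to be edited.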
Radical thoughts
You should also consider the following ideas:
LWPs in user space do not need a kernel stack
Those pages are only used in case an exception happens. Interrupts are probably going to their own dedicated stack. One could just keep a set of kernel stacks around. Each CPU has one; when a user exception happens, that stack is assigned to the current LWP and removed as the active CPU one. When that CPU next returns to user space, the kernel stack it was using is saved to be used for the next user exception. The idle LWP would just use the current kernel stack.
LWPs waiting for a kernel condition shouldn't need a kernel stack
If an LWP is waiting on a kernel condition variable, it is expecting to be inactive for some time, possibly a long time. During this inactivity, it does not really need a kernel stack.
When the exception handler gets a usermode exception, it sets the LWP restartable flag, which indicates that the exception is restartable, and then services the exception as normal. As routines are called, they can clear the LWP restartable flag as needed. When an LWP needs to block for a long time, instead of calling cv_wait, it could call cv_restart. If cv_restart returned false, the LWP's restartable flag was clear, so cv_restart acted just like cv_wait. Otherwise, the LWP and CV would have been tied together (big hand wave), the lock would have been released, and the routine should return ERESTART. cv_restart could also wait for a small amount of time, such as 0.5 seconds, and take the restart path only if that timeout expires.
As the stack unwinds, it would eventually return to the exception handler. The exception handler would see that the LWP has a bound CV, save the LWP's user state into the PCB, set the LWP to sleeping, mark the LWP's stack as idle, and call the scheduler to find more work. When called, cpu_switchto would notice that the stack is marked idle and detach it from the LWP.
When the condition times out or is signalled, the first LWP attached to the condition variable is marked runnable and detached from the CV. When the cpu_switchto routine is called, it would notice the lack of a stack, so it would grab one, restore the trapframe, and reinvoke the exception handler.
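The control flow described above can be modelled in a few lines of user-space C. This is a toy, single-threaded model of the decision logic only. cv_restart does not exist in NetBSD, every model_* name and the MODEL_ERESTART value are invented for this sketch, and the "stack" is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ERESTART (-3)     /* placeholder value, for this model only */

struct model_lwp;

struct model_cv {
    struct model_lwp *waiter;   /* at most one bound LWP in this model */
};

struct model_lwp {
    bool restartable;           /* set by the exception handler on entry */
    bool has_stack;             /* stands in for an attached kernel stack */
    struct model_cv *bound_cv;  /* CV this LWP is tied to, if any */
};

/*
 * Returns true if the LWP was bound to the CV and the caller must unwind
 * with ERESTART. Returns false if the flag was clear; real code would
 * then block exactly as cv_wait does.
 */
bool
model_cv_restart(struct model_lwp *l, struct model_cv *cv)
{
    assert(l != NULL && cv != NULL);
    if (!l->restartable)
        return false;
    l->bound_cv = cv;
    cv->waiter = l;
    return true;
}

/* A routine that needs to wait for a long time. */
int
model_wait_for_data(struct model_lwp *l, struct model_cv *cv)
{
    if (model_cv_restart(l, cv))
        return MODEL_ERESTART;
    return 0;
}

/* Tail of the exception handler, reached once the stack has unwound. */
void
model_exception_return(struct model_lwp *l, int error)
{
    if (error == MODEL_ERESTART && l->bound_cv != NULL)
        l->has_stack = false;   /* state saved, stack marked idle, detached */
}

/* Signal: detach the waiter, which regains a stack when next switched to. */
struct model_lwp *
model_cv_signal(struct model_cv *cv)
{
    struct model_lwp *l = cv->waiter;

    if (l == NULL)
        return NULL;
    cv->waiter = NULL;
    l->bound_cv = NULL;
    l->has_stack = true;        /* stands in for cpu_switchto grabbing one */
    return l;
}
```

The model makes one property of the scheme visible: every routine between the exception handler and the wait must be prepared to return ERESTART and be called again from the top, which is why routines that cannot be restarted clear the flag.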