File:  [NetBSD Developer Wiki] / wikisrc / projects / project / smp_networking.mdwn
Revision 1.3
Thu Nov 10 21:21:58 2011 UTC (2 years, 5 months ago) by jmmv
Branches: MAIN
CVS tags: HEAD
Add a 'work plan'; from matt@.

[[!template id=project

title="SMP Networking (aka remove the big network lock)"

contact="""
[tech-kern](mailto:tech-kern@NetBSD.org),
[tech-net](mailto:tech-net@NetBSD.org),
[board](mailto:board@NetBSD.org),
[core](mailto:core@NetBSD.org)
"""

category="networking"
difficulty="hard"
funded="The NetBSD Foundation"

description="""
**WARNING: THIS IS A DRAFT; THE INFORMATION CONTAINED IN THIS PROJECT AND
ANY OF THE SUBPROJECTS LINKED BELOW IS SUBJECT TO CHANGE.**

Traditionally, the NetBSD kernel code was protected by a single,
global lock.  This lock ensured that, on a multiprocessor system, two
different threads of execution did not access the kernel concurrently and
thus simplified the internal design of the kernel.  However, such a design
does not scale to multiprocessor machines because, effectively, the kernel
is restricted to running on a single processor at any given time.

The NetBSD kernel has been modified to use fine-grained locks in many of
its subsystems, achieving good performance on today's
multiprocessor machines.  Unfortunately, these changes have not yet been
applied to the networking code, which remains protected by the single lock.
In other words: NetBSD networking evolved to work in a uniprocessor
environment; switching it to fine-grained locking is a hard and complex
problem.

# Funding

At this time, The NetBSD Foundation is accepting project specifications to
remove the single networking lock.  If you want to apply for this project,
please send your proposal to the contact addresses listed above.

Due to the size of this project, your proposal does not need to cover
everything to qualify for funding.  We have attempted to split the work
into smaller units, and **you can submit funding applications for these
smaller subtasks independently** as long as the work you deliver fits in
the grand order of this project.  For example, you could send an
application to make the network interfaces alone MP-friendly (see the *work
plan* below).

What follows is a particular design proposal, extracted from an
[original text](http://www.NetBSD.org/~matt/smpnet.html) written by
[Matt Thomas](mailto:matt@NetBSD.org).  You may choose to work on this
particular proposal or come up with your own.

# Tentative specification

The future of NetBSD network infrastructure has to efficiently embrace two
major design criteria: Symmetric Multi-Processing (SMP) and modularity.
Other design considerations include not only supporting but taking
advantage of the capability of newer network devices to do packet
classification, payload splitting, and even full connection offload.

You can divide the network infrastructure into five major components:

* Interfaces (both real devices and pseudo-devices)
* Socket code
* Protocols
* Routing code
* mbuf code

Part of the complexity is that, due to the monolithic nature of the kernel,
each layer currently feels free to call any other layer.  This makes
designing a lock hierarchy difficult and likely to fail.

Part of the problem is the asynchronous upcalls, which include:

* `ifa->ifa_rtrequest` for route changes.
* `pr_ctlinput` for interface events.

Another source of complexity is the large number of global variables
scattered throughout the source files.  This makes putting locks around
them difficult.
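
One way to make such globals lockable, in the spirit of the work plan's step
that collects protocol globals into structures, is to group related variables
together with the lock that guards them.  The following is only a user-space
sketch using POSIX threads, not kernel code; `struct udp_state`, its fields,
and `udp_state_count_input` are hypothetical names chosen for illustration.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical: formerly separate globals, now grouped with their lock. */
struct udp_state {
	pthread_mutex_t us_lock;     /* protects every field below */
	uint64_t        us_ipackets; /* datagrams received */
	uint64_t        us_badsum;   /* datagrams dropped for a bad checksum */
	int             us_cksum_on; /* tunable: verify checksums? */
};

static struct udp_state udp_state = {
	.us_lock = PTHREAD_MUTEX_INITIALIZER,
	.us_cksum_on = 1,
};

/* Account for one received datagram; returns 1 if accepted, 0 if dropped. */
static int
udp_state_count_input(struct udp_state *us, int cksum_ok)
{
	int accepted = 1;

	pthread_mutex_lock(&us->us_lock);
	us->us_ipackets++;
	if (us->us_cksum_on && !cksum_ok) {
		us->us_badsum++;
		accepted = 0;
	}
	pthread_mutex_unlock(&us->us_lock);
	return accepted;
}
```

With the state in one structure, the scope of each lock is visible in the
type definition, and a later change could split the structure per CPU or per
queue without hunting for stray globals.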

## Subtasks

The solution proposed here includes the following tasks (in no
particular order) to achieve the desired goals of SMP support and
modularity:

[[!map show="title" pages="projects/project/* and tagged(project) and tagged(smp_networking)"]]

## Work plan

Aside from the list of tasks above, the work to be done for this project
can be achieved by following these steps:

1. Move ARP out of the routing table.  See the [[nexthop_cache]] project.

1. Make the network interfaces MP-friendly; they are among the few
   remaining users of the big kernel lock.  This requires supporting
   multiple receive and transmit queues to help reduce lock contention.
   It also includes changing more of the common interfaces to do what the
   `tsec` driver does (basically, do everything with softints) and changing
   the `*_input` routines to dispatch through a table instead of the current
   switch code, so that domains can be dynamically loaded.

1. Collect global variables in the IP/UDP/TCP protocols into structures.
   This helps with the items that follow.

1. Make IPv4/ICMP/IGMP/REASS MP-friendly.

1. Make IPv6/ICMP/IGMP/ND MP-friendly.

1. Make TCP MP-friendly.

1. Make UDP MP-friendly.
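
The table-driven dispatch mentioned in the network interface step could look
roughly like the sketch below.  This is a user-space illustration, not the
NetBSD implementation: `input_table`, `input_register`, `input_unregister`,
`input_dispatch`, and `demo_input` are hypothetical names, and the table is
indexed by a small family number for simplicity.  A kernel version would also
have to synchronize registration against concurrent dispatch, which is
omitted here.

```c
#include <stddef.h>

#define INPUT_TABLE_SIZE 64	/* assumed upper bound on family numbers */

/* An input handler consumes one packet and returns 0 on success. */
typedef int (*input_fn)(const void *pkt, size_t len);

/* One slot per family; NULL means no domain is loaded for it. */
static input_fn input_table[INPUT_TABLE_SIZE];

/* Called when a domain is loaded.  Returns 0, or -1 if the slot is taken. */
static int
input_register(unsigned family, input_fn fn)
{
	if (family >= INPUT_TABLE_SIZE || fn == NULL)
		return -1;
	if (input_table[family] != NULL)
		return -1;
	input_table[family] = fn;
	return 0;
}

/* Called when a domain is unloaded. */
static void
input_unregister(unsigned family)
{
	if (family < INPUT_TABLE_SIZE)
		input_table[family] = NULL;
}

/* Replaces a hard-coded switch: look up the handler and call it. */
static int
input_dispatch(unsigned family, const void *pkt, size_t len)
{
	input_fn fn;

	if (family >= INPUT_TABLE_SIZE)
		return -1;
	fn = input_table[family];
	if (fn == NULL)
		return -1;	/* unknown or unloaded domain: drop */
	return fn(pkt, len);
}

/* Example handler a loadable domain might register (illustrative only). */
static int
demo_input(const void *pkt, size_t len)
{
	(void)pkt;
	return len > 0 ? 0 : -1;
}
```

A domain would call `input_register` from its load hook and
`input_unregister` from its unload hook; the common input path never needs to
know which domains exist at compile time.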

# Radical thoughts

You should also consider the following ideas:

## LWPs in user space do not need a kernel stack

Those pages are only used when an exception happens.
Interrupts are probably going to their own dedicated stack.  One could just
keep a set of kernel stacks around.  Each CPU has one; when a user
exception happens, that stack is assigned to the current LWP and removed as
the CPU's active one.  When that CPU next returns to user space, the kernel
stack it was using is saved for the next user exception.  The
idle LWP would just use the current kernel stack.
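
The hand-off of stacks between a CPU and its current LWP can be modelled in a
few lines.  The sketch below is a single-threaded toy model, not kernel code:
`struct kstack`, `struct cpu`, `struct lwp`, the pool size, and every function
name are assumptions made for illustration, and all locking is left out.

```c
#include <stddef.h>

#define NSTACKS 4	/* assumed pool size for the illustration */

struct kstack { struct kstack *ks_next; };	/* stand-in for stack pages */
struct lwp    { struct kstack *l_kstack; };	/* NULL while in user space */
struct cpu    { struct kstack *ci_kstack; };	/* ready for the next exception */

static struct kstack stack_pool[NSTACKS];
static struct kstack *stack_free;		/* singly linked free list */

static void
stack_put(struct kstack *ks)
{
	ks->ks_next = stack_free;
	stack_free = ks;
}

static struct kstack *
stack_get(void)
{
	struct kstack *ks = stack_free;

	if (ks != NULL)
		stack_free = ks->ks_next;
	return ks;
}

static void
stack_pool_init(void)
{
	stack_free = NULL;
	for (size_t i = 0; i < NSTACKS; i++)
		stack_put(&stack_pool[i]);
}

/* User exception: the CPU's ready stack becomes the LWP's kernel stack. */
static int
exception_enter(struct cpu *ci, struct lwp *l)
{
	if (ci->ci_kstack == NULL)
		ci->ci_kstack = stack_get();
	if (ci->ci_kstack == NULL)
		return -1;	/* pool exhausted */
	l->l_kstack = ci->ci_kstack;
	ci->ci_kstack = NULL;
	return 0;
}

/* Return to user space: keep the stack on the CPU for the next exception. */
static void
exception_leave(struct cpu *ci, struct lwp *l)
{
	if (ci->ci_kstack == NULL)
		ci->ci_kstack = l->l_kstack;
	else
		stack_put(l->l_kstack);	/* CPU already has one; return to pool */
	l->l_kstack = NULL;
}
```

In this model an LWP that never blocks in the kernel keeps reusing the same
stack on its CPU, so the pool only needs to be as large as the number of CPUs
plus the number of LWPs currently blocked inside the kernel.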

## LWPs waiting for a kernel condition shouldn't need a kernel stack

If an LWP is waiting on a kernel condition variable, it is expecting to be
inactive for some time, possibly a long time.  During this inactivity, it
does not really need a kernel stack.

When the exception handler gets a user-mode exception, it sets the LWP's
restartable flag, which indicates that the exception is restartable, and then
services the exception as normal.  As routines are called, they can clear
the LWP's restartable flag as needed.  When an LWP needs to block for a long
time, instead of calling `cv_wait`, it could call `cv_restart`.  If
`cv_restart` returns false, the LWP's restartable flag was clear, so
`cv_restart` acted just like `cv_wait`.  Otherwise, the LWP and CV have
been tied together (big hand wave), the lock has been released, and the
routine should return `ERESTART`.  `cv_restart` could also first wait for
a small amount of time, like 0.5 seconds, and take the restart path only if
that timeout expires.
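
The calling convention described above can be sketched as a toy model.
Nothing here is an existing NetBSD interface: `cv_restart` is only proposed
in this text, so its signature, the `struct lwp` and `struct condvar` fields,
`cv_wait_stub`, `long_sleep`, and the `SKETCH_ERESTART` placeholder are all
assumptions.  The lock that a real `cv_wait` takes, and the optional short
timeout, are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_ERESTART (-3)	/* placeholder standing in for ERESTART */

struct lwp;

struct condvar {
	struct lwp *cv_bound;		/* LWP tied to this CV, if any */
};

struct lwp {
	bool            l_restartable;	/* set on user-mode exception entry */
	struct condvar *l_boundcv;	/* CV this LWP is parked on, if any */
};

/* Blocking wait, stubbed out: the sketch only counts the calls. */
static int cv_wait_calls;

static void
cv_wait_stub(struct condvar *cv)
{
	(void)cv;
	cv_wait_calls++;
}

/*
 * Returns false if it behaved like cv_wait (restartable flag was clear).
 * Returns true if the LWP was tied to the CV; the caller must then
 * unwind and return the restart error.
 */
static bool
cv_restart(struct lwp *l, struct condvar *cv)
{
	if (!l->l_restartable) {
		cv_wait_stub(cv);
		return false;
	}
	l->l_boundcv = cv;
	cv->cv_bound = l;
	return true;
}

/* Example caller: a routine that needs to block for a long time. */
static int
long_sleep(struct lwp *l, struct condvar *cv)
{
	if (cv_restart(l, cv))
		return SKETCH_ERESTART;	/* unwind; the stack can be given up */
	return 0;			/* woke up normally, as after cv_wait */
}
```

Any routine on the call path that holds state which cannot survive a restart
would clear `l_restartable` first, which forces the ordinary blocking
behaviour.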

As the stack unwinds, it would eventually return to the exception
handler.  The exception handler would see that the LWP has a bound CV, save
the LWP's user state into the PCB, set the LWP to sleeping, mark the LWP's
stack as idle, and call the scheduler to find more work.  When called,
`cpu_switchto` would notice the stack is marked idle and detach it from
the LWP.

When the condition times out or is signalled, the first LWP attached to the
condition variable is marked runnable and detached from the CV.  When the
`cpu_switchto` routine is called, it would notice the lack of a stack,
grab one, restore the trapframe, and reinvoke the exception
handler.
"""
]]
