9.3. Interrupt Handlers
A good deal of the frame handling we discuss in this chapter takes place in response to interrupts from network hardware. The scheduling of functions triggered by interrupts is a complicated topic and deserves some study, even though it doesn't concern networking in particular. Therefore, in this section, we discuss the various ways that interrupts are handled by different network drivers and introduce the concepts of bottom halves and softirqs.
In Chapter 5, we saw how device drivers register their handlers with an IRQ number, but we did not see how hardware interrupts delegate frame processing to software interrupt handlers. This section will describe how an interrupt request associated with the reception of a frame is handled all the way to the point where protocol handlers discussed in Chapter 13 receive their packets. We will see the relationship between hardware IRQs and software IRQs and why the latter category is needed. We will briefly see how interrupts were handled with the old kernels and then compare the old approach to the new one introduced with kernel version 2.4. This discussion will show the advantages of the new model over the old one, especially in the area of performance.
Before launching into softirqs, we need a small introduction to the concept of bottom half handlers.
9.3.1. Reasons for Bottom Half Handlers
Whenever a CPU receives an interrupt notification, it invokes the handler associated with that interrupt, which is identified by a number. During the handler's execution, the kernel code is said to be in interrupt context.
In the simplest situation, these are the main events touched off by an interrupt:
In short, interrupt handlers are nonpreemptible and non-reentrant. (A function is defined as non-reentrant when it cannot be interrupted by another invocation of itself. In the case of interrupt handlers, it simply means that they are executed with interrupts disabled.) This design choice helps reduce the likelihood of race conditions. However, because the CPU is so limited in what it can do, the nonpreemptible design has potentially serious effects on performance by the kernel as well as the processes waiting to be served by the CPU. Therefore, the work done by interrupt handlers should be as quick as possible. The amount of processing needed by the interrupt handlers during interrupt context depends on the type of event. A keyboard, for instance, may simply send an interrupt every time a key is pressed, which requires very little effort to be handled: the handler simply needs to store the code of the key somewhere, and run a few times per second at most. At other times, the actions required to handle an interrupt are not trivial and their executions could require much CPU time. Network devices, for instance, have a relatively complex job: they need to allocate a buffer (sk_buff), copy the received data into it, initialize a few parameters within the buffer structure (protocol) to tell the higher-layer protocol handlers what kind of data is coming from the driver, and so on. Here is where the concept of a bottom half handler comes into play. Even if the action triggered by an interrupt needs a lot of CPU time, most of this action can usually wait. Interrupts are allowed to preempt the CPU in the first place because if the operating system makes the hardware wait too long, it may lose data. This is obviously true of real-time streaming data, but also is true of any hardware that has to store incoming data in fixed-size buffers. And if the hardware loses data, there is usually no way to get it back. 
On the other hand, if the kernel or a user-space process has to be delayed or preempted, no data will be lost (with the exception of real-time systems, which entail a completely different way of handling processes as well as interrupts). In light of these considerations, modern interrupt handlers are divided into a top half and a bottom half. The top half consists of everything that has to be executed before releasing the CPU, to preserve data. The bottom half contains everything that can be done at relative leisure.
One can define a bottom half as an asynchronous request to execute a particular function. Normally, when you want to execute a function, you do not have to request anything; you simply invoke it. When an interrupt arrives, you have a lot to do and don't want to do it right away. Thus, you package most of the work into a function that you submit as a bottom half. The following model allows the kernel to keep interrupts disabled for much less time than the simple model shown previously:
Over time, Linux developers have tried different types of bottom halves, which obey different rules. Networking has played a large role in the development of new implementations, because of networking's need for low latency, that is, a minimal amount of time between the reception of a frame and its delivery. Low latency is more important for network device drivers than for other types of devices because of the high number of tasks involved in reception and transmission. As described earlier in the section "Interrupts," it can be disastrous to let a large number of frames build up while waiting to be handled. Sound cards are another example of devices requiring fast response.
9.3.2. Bottom Halves Solutions
The kernel provides different mechanisms for implementing bottom halves and for deferring work in general. These mechanisms differ mainly with regard to the following points:
In this chapter, we will look only at those mechanisms that do not need a process context, namely softirqs and tasklets. In the next section, we will briefly see their implications for concurrency and locking. When you need to defer the execution of a function that may sleep, you need to use a dedicated kernel thread or work queues.
9.3.3. Concurrency and Locking
Before launching into the code that network drivers use to handle bottom halves, we need some background on concurrency, which refers to functions that can interfere with each other either because they are scheduled on different CPUs or because one is suspended by the kernel to run another. Related topics are locks and the disabling of interrupts. (Concurrency is discussed in detail in both Understanding the Linux Kernel and Linux Device Drivers.)
Three different types of functions will be introduced in this chapter to handle interrupts: old-style bottom halves, softirqs, and tasklets. All of them can be used to schedule the execution of a function, but they come with some big differences. As far as concurrency is concerned, we can summarize the differences as follows:
Therefore, these three features require different kinds of locking mechanisms. The higher the concurrency allowed, the more carefully the programmer has to design the code executed, for the sake of both accuracy and performance. Whether a softirq or a tasklet represents the best choice for any given context depends on both locking and concurrency requirements. In most cases, tasklets are the way to go. But given the tight response requirements of the receive and transmit networking tasks, softirqs are preferred in those two specific cases. We will see later in this chapter how the networking code uses softirqs. In some cases, the programmer has to disable hardware interrupts, software interrupts
9.3.4. Preemption
In time-sharing systems, the kernel has always been able to preempt user processes at will, but the kernel itself is often nonpreemptive, which means that once it starts running it will not be interrupted until it is ready to give up control. A nonpreemptive kernel sometimes holds up high-priority processes when they are ready to run because the kernel is executing a system call for a lower-priority process. To support real-time extensions and for other reasons, the Linux kernel was made fully preemptible during the 2.5 kernel development cycle. With this new kernel feature, system calls and other kernel tasks can be preempted by other kernel tasks with higher priorities.
Because much work had already been done to eliminate critical sections (nonpreemptible code) from the kernel to support SMP locking mechanisms, adding full preemption was not a major change to the kernel. Once preemption was added, developers just had to define explicitly where to disable it (in hardware and software interrupt code, in the scheduler itself, in the code protected by spin locks and read/write locks, etc.).
However, there are times when preemption, just like interrupts, must be disabled. In this section, I'll cover just a few functions related to preemption that you may bump into while browsing the code, and then briefly show how some of the locking macros have been updated to deal with preemption. The following functions control preemption:
The networking code does not deal with these routines directly. However, preempt_enable and preempt_disable are indirectly called, for instance, by locking primitives, like rcu_read_lock and rcu_read_unlock, spin_lock and spin_unlock, etc. Routines used to access per-CPU data structures, like get_cpu and get_cpu_var, also disable preemption before reading the data.
A counter for each process, named preempt_count and embedded in the thread_info structure, indicates whether a given process allows preemption. The field can be read with preempt_count( ) and is manipulated indirectly through the inc_preempt_count and dec_preempt_count functions defined in include/linux/preempt.h. There are situations in which the kernel should not be preempted. These include when it is servicing hardware, as well as when it uses one of the calls just shown to disable preemption. Therefore, preempt_count is split into three components. Each byte is a counter for a different condition that requires nonpreemption: hardware interrupts, software interrupts, and general nonpreemption. The layout of preempt_count is shown in Figure 9-3.
Figure 9-3. Structure of preempt_count
The figure shows, in addition to the purpose of each byte, the main functions that manipulate it. The high-order byte is not fully used at the moment, but its second least significant bit is set before calling the schedule function and tells that function that it has been called to preempt the current task. In include/asm-xxx/hardirq.h you can find several macros that make it easier to read and write preempt_count; some of these include the XXX_OFFSET variables shown in Figure 9-3 and used by the functions listed in the figure to increment or decrement the right byte.
Despite all this complexity, whenever a check has to be done on the current process to see if it can be preempted, all the kernel needs to know is whether preempt_count is zero (it does not really matter why preemption is disabled).
9.3.5. Bottom-Half Handlers
The infrastructure for bottom halves must address the following needs:
Let's first see how kernels up to version 2.2 handled bottom half handlers.
9.3.5.1. Bottom-half handlers in kernel 2.2
The 2.2 model for bottom-half handlers divides them into a large number of types, which are differentiated by when and how often the kernel checks for them and runs them. The 2.2 list is as follows, taken from include/linux/interrupt.h. In this book, we are most interested in NET_BH.
Each bottom-half type is associated with a function handler by means of init_bh. The networking code, for instance, initializes the NET_BH bottom-half type to the net_bh handler in net_dev_init, which is covered in Chapter 5.
The main function used to unregister a BH handler is remove_bh. (There are other related functions too, such as enable_bh/disable_bh, but we do not need to see all of them.) Whenever an interrupt handler wants to trigger the execution of a bottom half handler, it has to set the corresponding flag with mark_bh. This function is very simple: it sets a bit into a global bitmap bh_active, which, as we will see in a moment, is tested in several places.
For instance, you will see later in the chapter that every time a network device driver has successfully received a frame, it signals the kernel about it with a call to netif_rx. The latter queues the newly received frame into the ingress queue backlog (shared by all the CPUs) and marks the NET_BH bottom-half handler flag.
During several routine operations, the kernel checks whether any bottom halves are scheduled for execution. If any are waiting, the kernel runs the function do_bottom_half (currently in kernel/softirq.c), to execute them. The checks are performed during:
run_bottom_half, the function used by do_bottom_half to execute the pending interrupt handlers, looks like this:
The order in which the pending handlers are invoked depends on the positions of the associated flags inside the bitmap and the direction used to scan those flags (returned by get_active_bhs). In other words, bottom halves are not run on a first-come-first-served basis. And since networking bottom halves can take a long time, those that have the misfortune to be dequeued last can experience high latency.
Bottom halves in 2.2 and earlier kernels suffer from a ban on concurrency. Only one bottom half can run at any time, regardless of the number of CPUs.
9.3.5.2. Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq
The biggest improvement between kernels 2.2 and 2.4, as far as interrupt handling is concerned, was the introduction of software interrupts. The new softirq model has only six types (from include/linux/interrupt.h):
All the XXX_BH bottom-half types in the old model are still available to old drivers, but have been reimplemented to run as softirqs of the HI_SOFTIRQ type (which means they take priority over the other softirq types). The two types used by networking code, NET_TX_SOFTIRQ and NET_RX_SOFTIRQ, are introduced in the later section "How the Networking Code Uses softirqs." The next section will introduce tasklets. Softirqs, like the old bottom halves, run with interrupts enabled and therefore can be suspended at any time to handle a new, incoming interrupt. However, the kernel does not allow a new request for a softirq to run on a CPU if another instance of that softirq has been suspended on that CPU; this drastically reduces the amount of locking needed. Each softirq type can maintain an array of data structures of type softnet_data, one per CPU, to hold state information about the current softirq; we'll see the contents of this structure in the section "softnet_data Structure." Since different instances of the same type of softirq can run simultaneously on different CPUs, the functions run by softirqs still need to lock other data structures that are shared, to avoid race conditions. The functions used to register and schedule a softirq handler, and the logic behind them, are very similar to the ones used with 2.2 bottom halves. softirq handlers are registered with the open_softirq function, which, unlike init_bh, accepts an extra parameter so that the function handler can be passed some input data if needed. None of the softirqs, however, currently uses that extra parameter, and a proposal has been floated to remove it. open_softirq simply copies the input parameters into the global array softirq_vec, declared in kernel/softirq.c, which holds the associations between types and handlers.
A softirq can be scheduled for execution on the local CPU by the following functions:
The following code, taken from kernel 2.4.5, shows the model used at an early stage of softirq development. It is very similar to the 2.2 model, and invokes the function do_softirq, which is a counterpart to the 2.2 function do_bottom_half discussed in the previous section. do_softirq is called if at least one softirq has been scheduled for execution:
The only difference between this early stage of softirqs and the 2.2 bottom-half model is that the softirq version has to check the flags on a per-CPU basis, since each CPU has its own bitmap of pending softirqs. The implementation of do_softirq is also very similar to its counterpart do_bottom_half in 2.2. The kernel also calls the function at some of the same points, but not entirely the same. The main change is the introduction of a new per-CPU kernel thread, ksoftirqd. Here are the main points where do_softirq may be invoked:
I have described i386 behavior in general; other architectures may use different naming conventions or have additional timers that also invoke do_softirq. Another interesting place where do_softirq is called is from within netif_rx_ni, which is briefly described in the section "Old Interface Between Device Drivers and Kernel: First Part of netif_rx" in Chapter 10. The traffic generator built into the kernel (net/core/pktgen.c) also calls do_softirq.
9.3.6. Tasklets
Most of the bottom halves of the 2.2 kernel variety have been converted to either softirqs or tasklets.
In the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq," we saw the list of softirqs. HI_SOFTIRQ is used to implement high-priority tasklets, and TASKLET_SOFTIRQ is used for lower-priority ones. Each time a request for a deferred execution is issued, an instance of a tasklet_struct structure is queued onto either a list processed by HI_SOFTIRQ or another one that is instead processed by TASKLET_SOFTIRQ. Since softirqs are handled independently by each CPU, it should not be a surprise that there are two lists of pending tasklet_structs for each CPU, one associated with HI_SOFTIRQ and one with TASKLET_SOFTIRQ. These are their definitions from kernel/softirq.c:
At first sight, tasklets may seem to be just like the old bottom halves, but there actually are substantial differences:
The tasklet_struct data structure is defined in include/linux/interrupt.h as follows:
The following is the field-by-field description:
The following are some important kernel functions that handle tasklets, from kernel/softirq.c and include/linux/interrupt.h:
9.3.7. Softirq Initialization
During kernel initialization, softirq_init initializes the software IRQ layer with the two general-purpose softirqs: tasklet_action and tasklet_hi_action, which are associated with TASKLET_SOFTIRQ and HI_SOFTIRQ, respectively.
The two softirqs used by the networking code, NET_RX_SOFTIRQ and NET_TX_SOFTIRQ, are initialized in net_dev_init, one of the networking initialization functions (see the section "How the Networking Code Uses softirqs"). The other softirqs listed in the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq" are registered in the associated subsystems (SCSI_SOFTIRQ in drivers/scsi/scsi.c, TIMER_SOFTIRQ in kernel/timer.c, etc.). HI_SOFTIRQ is mainly used by sound card device drivers.
Users of TASKLET_SOFTIRQ include:
9.3.8. Pending softirq Handling
We explained in the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq" when do_softirq is invoked to take care of the pending softirqs. Here we will see the internals of the function. You will notice how much it resembles the one used by kernel 2.2 described in the section "Bottom-half handlers in kernel 2.2."
do_softirq stops and does nothing if the CPU is currently serving a hardware or software interrupt. The function checks for this by calling in_interrupt, which is equivalent to if (in_irq( ) || in_softirq( )). If do_softirq decides to proceed, it saves pending softirqs in pending with local_softirq_pending.
From the preceding snapshot, it could seem that do_softirq runs with IRQs disabled, but that's not true. IRQs are kept disabled only when manipulating the bitmap of pending softirqs (i.e., accessing the softnet_data structure). You will see in a moment that __do_softirq internally re-enables IRQs when running the softirq handlers.
9.3.8.1. __do_softirq function
It is possible for the same softirq type to be scheduled multiple times while do_softirq is running. Since IRQs are enabled when running the softirq handlers, the bitmap of pending softirqs can be manipulated while serving an interrupt, and therefore any of the softirq handlers that has been executed by __do_softirq could be re-scheduled during the execution of __do_softirq itself. For this reason, before __do_softirq re-enables IRQs, it saves the current bitmap of the pending softirqs in the local variable pending and clears it from the softnet_data instance associated with the local CPU using local_softirq_pending( )=0. Then, based on pending, it calls all the necessary handlers.
Once all the handlers have been called, __do_softirq checks whether in the meantime any softirqs were scheduled again (this request disables IRQs). If there was at least one pending softirq, it will repeat the whole process. However, __do_softirq repeats it only up to MAX_SOFTIRQ_RESTART times (experimentation has found that 10 times works well).
The use of MAX_SOFTIRQ_RESTART is a design decision made to keep a single type of interrupt, particularly a stream of networking interrupts, from starving other interrupts out of one of the CPUs. Without the limit in __do_softirq, starvation could easily happen when a server is highly loaded by network traffic and the number of NET_RX_SOFTIRQ interrupts goes through the roof.
Let's see how starvation could take place. do_IRQ would raise a NET_RX_SOFTIRQ interrupt that would cause do_softirq to be executed. __do_softirq would clear the NET_RX_SOFTIRQ flag, but before it ended it would be interrupted by an interrupt that would set NET_RX_SOFTIRQ again, and so on, indefinitely.
Let's see now how the central part of __do_softirq manages to invoke the softirq handler. Every time one softirq type is served, its bit is cleared from the local copy of the active softirqs, pending. h is initialized to point to the global data structure softirq_vec that holds the associations between softirq types and their function handlers (for instance, NET_RX_SOFTIRQ is handled by net_rx_action). The loop ends when the bitmap is cleared.
Finally, if there are pending softirqs that cannot be handled because do_softirq must return, having repeated its job MAX_SOFTIRQ_RESTART times already, the ksoftirqd thread is awakened and given the responsibility of handling them later. Because do_softirq is invoked at so many points within the kernel, it is actually likely that a later invocation of do_softirq will handle these interrupts before the ksoftirqd thread is scheduled.
9.3.9. Per-Architecture Processing of softirq
The do_softirq function provided in kernel/softirq.c can be overridden by another function provided by the architecture code (which ends up calling __do_softirq anyway). This explains why the definition of do_softirq in kernel/softirq.c is wrapped with the check on __ARCH_HAS_DO_SOFTIRQ (see the previous section). A few architectures, including i386 (see arch/i386/kernel/irq.c), define their own version of do_softirq. Such architecture versions are used when the architectures use 4 KB stacks (instead of 8 KB) and use the remaining 4 KB to implement stacked handling of both hard IRQs and softirqs. Please refer to Understanding the Linux Kernel for more detail.
9.3.10. ksoftirqd Kernel Threads
Background kernel threads are assigned the job of checking for softirqs that have been left unexecuted by the functions previously described, and executing as many of those softirqs as they can before needing to give that CPU back to other activities. There is one kernel thread for each CPU, named ksoftirqd_CPU0, ksoftirqd_CPU1, and so on. The section "Starting the threads" describes how these threads are started at CPU boot time.
The function ksoftirqd associated with these threads is pretty simple and is defined in the same file, softirq.c:
There are a couple of small details I want to emphasize. The priority of a process, also called the nice priority, is a number ranging from -20 (maximum) to +19 (minimum). The ksoftirqd threads are given a low priority of 19. This is done so that frequently running softirqs such as NET_RX_SOFTIRQ cannot completely kidnap the CPU, which would leave almost no resources to other processes. We already saw that do_softirq can be invoked from different places in the code, so this low priority doesn't represent a handicap. Once started, the loop simply keeps calling do_softirq (always with preemption disabled) until one of the following conditions becomes true:
9.3.10.1. Starting the threads
There is one ksoftirqd thread for each CPU. When the system's first CPU comes online, the first thread is started at kernel boot time inside do_pre_smp_initcalls. The ksoftirqd threads for the other CPUs that come up at boot time, and for any other CPU that may be enabled later on a system that can handle hot-pluggable CPUs, are taken care of through the cpu_chain notification chain.
Notification chains were introduced in Chapter 4. The cpu_chain chain lets various subsystems know when a CPU is up and running or when one dies. The softirq subsystem registers to the cpu_chain with spawn_ksoftirqd, called from the function do_pre_smp_initcalls mentioned previously. The callback routine cpu_callback that processes notifications from cpu_chain is used to initialize the necessary per-CPU data structures and start the ksoftirqd thread on the CPU. The complete list of CPU_XXX notifications is in include/linux/notifier.h, but we need only four of them in the context of this chapter:
CPU_UP_PREPARE creates the thread and binds it to the associated CPU, but does not wake up the thread. CPU_ONLINE wakes up the thread. When a CPU dies, its associated ksoftirqd instance is killed:
Note that spawn_ksoftirqd places two direct calls to cpu_callback before registering with cpu_chain via register_cpu_notifier. This is necessary because CPU notifications are not generated for the first CPU that comes online.
9.3.11. Tasklet Processing
The two handlers for low-priority tasklets (TASKLET_SOFTIRQ) and high-priority tasklets (HI_SOFTIRQ) are identical; they simply work on two different lists. For this reason, we will describe only one of them: tasklet_action, the one associated with TASKLET_SOFTIRQ.
Only one instance of each tasklet can be waiting for execution at any time. When tasklet_schedule or tasklet_hi_schedule schedules a tasklet, the function sets the TASKLET_STATE_SCHED bit described earlier in the section "Tasklets." Attempts to reschedule the same tasklet will be ignored because TASKLET_STATE_SCHED is already set. The bit is cleared only when the tasklet starts its execution; thus, during or after its execution another instance can be scheduled.
The tasklet_action function starts by copying the list of tasklets waiting to be processed into a local variable first; it then clears the global list. This is the only part of the function that is executed with interrupts disabled. Disabling them is necessary to avoid race conditions with interrupt handlers that could add new elements to the list while tasklet_action accesses it.
At this point, the function goes through the list tasklet by tasklet. For each element it invokes the handler if both of the following are true:
The part of the function implementing these activities follows:
At this stage, since the tasklet was not already being executed and it was extracted from the list of pending tasklets, it must have the TASKLET_STATE_SCHED flag set:
If the handler cannot be executed, the tasklet is put back into the list and TASKLET_SOFTIRQ is rescheduled to take care of all of those tasklets that for one of the two reasons listed earlier cannot be handled now:
9.3.12. How the Networking Code Uses softirqs
The networking subsystem has been assigned two different softirqs. NET_RX_SOFTIRQ handles incoming traffic and NET_TX_SOFTIRQ handles outgoing traffic. Both are registered in net_dev_init (described in Chapter 5) through the following lines:
Because different instances of the same softirq handler can run concurrently on different CPUs (unlike tasklets), networking code is both low latency and scalable. Both networking softirqs are higher in priority than normal tasklets (TASKLET_SOFTIRQ) but are lower in priority than high-priority tasklets (HI_SOFTIRQ). This prioritization guarantees that other high-priority tasks can proceed in a responsive and timely manner even when a system is under a high network load. The internals of the two handlers are covered in the sections "Processing the NET_RX_SOFTIRQ: net_rx_action" in Chapter 10 and "Processing the NET_TX_SOFTIRQ: net_tx_action" in Chapter 11.
Sunday, October 18, 2009