Sunday, October 18, 2009

9.3. Interrupt Handlers


A good deal of the frame handling we discuss in this chapter takes place in response to interrupts from network hardware. The scheduling of functions triggered by interrupts is a complicated topic and deserves some study, even though it doesn't concern networking in particular. Therefore, in this section, we discuss the various ways that interrupts are handled by different network drivers and introduce the concepts of bottom halves and softirqs.


In Chapter 5, we saw how device drivers register their handlers with an IRQ number, but we did not see how hardware interrupts delegate frame processing to software interrupt handlers. This section will describe how an interrupt request associated with the reception of a frame is handled all the way to the point where protocol handlers discussed in Chapter 13 receive their packets. We will see the relationship between hardware IRQs and software IRQs and why the latter category is needed. We will briefly see how interrupts were handled with the old kernels and then compare the old approach to the new one introduced with kernel version 2.4. This discussion will show the advantages of the new model over the old one, especially in the area of performance.


Before launching into softirqs, we need a small introduction to the concept of bottom half handlers. However, I will not go into much detail about them because they are documented in other resources, notably Understanding the Linux Kernel and Linux Device Drivers.



9.3.1. Reasons for Bottom Half Handlers




Whenever a CPU receives an interrupt notification, it invokes the handler associated with that interrupt, which is identified by a number. During the handler's execution, in which the kernel code is said to be in interrupt context, interrupts are disabled for the CPU serving the interrupt. This means that if a CPU is busy serving one interrupt, it cannot receive other interrupts, whether of the same type or of different types.[*] Nor can the CPU execute any other process: it belongs totally to the interrupt handler and cannot be preempted.

[*] We saw in Chapter 5 that an interrupt handler that is declared as a slow handler is executed with interrupts enabled on the local CPU.


In the simplest situation, these are the main events touched off by an interrupt:


  1. The device generates an interrupt and the hardware notifies the kernel.

  2. If the kernel is not serving another interrupt (and if interrupts are not disabled for other reasons), it will see the notification.

  3. The kernel disables interrupts for the local CPU and executes the handler associated with the interrupt type received.

  4. The kernel exits the interrupt handler and re-enables interrupts for the local CPU.


In short, interrupt handlers are nonpreemptible and non-reentrant. (A function is defined as non-reentrant when it cannot be interrupted by another invocation of itself. In the case of interrupt handlers, it simply means that they are executed with interrupts disabled.) This design choice helps reduce the likelihood of race conditions. However, because the CPU is so limited in what it can do, the nonpreemptible design has potentially serious effects on the performance of the kernel as well as of the processes waiting to be served by the CPU.


Therefore, the work done by interrupt handlers should be as quick as possible. The amount of processing needed by the interrupt handlers during interrupt context depends on the type of event. A keyboard, for instance, may simply send an interrupt every time a key is pressed, which requires very little effort to handle: the handler simply needs to store the code of the key somewhere, and it runs a few times per second at most. At other times, the actions required to handle an interrupt are not trivial and their execution could require much CPU time. Network devices, for instance, have a relatively complex job: they need to allocate a buffer (sk_buff), copy the received data into it, initialize a few parameters within the buffer structure (protocol) to tell the higher-layer protocol handlers what kind of data is coming from the driver, and so on.
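
To make that cost concrete, here is a minimal sketch of the receive path of such a handler. The driver name, the mydev_* helper routines that talk to the hardware, and the exact handler prototype (which has changed across kernel versions) are all assumptions made for illustration:

/* Hypothetical RX interrupt handler; the mydev_* helpers are invented. */
static irqreturn_t mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct net_device *dev = dev_id;
    unsigned int len = mydev_frame_len(dev);      /* read frame length from the NIC */
    struct sk_buff *skb = dev_alloc_skb(len + 2); /* allocate the buffer */

    if (skb == NULL)
        return IRQ_HANDLED;                       /* no memory: the frame is lost */
    skb_reserve(skb, 2);                          /* align the IP header */
    mydev_copy_frame(dev, skb_put(skb, len));     /* copy the data out of the NIC */
    skb->protocol = eth_type_trans(skb, dev);     /* tell L3 what kind of data this is */
    netif_rx(skb);                                /* defer the rest of the work */
    return IRQ_HANDLED;
}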


Here is where the concept of a bottom half handler comes into play. Even if the action triggered by an interrupt needs a lot of CPU time, most of this action can usually wait. Interrupts are allowed to preempt the CPU in the first place because if the operating system makes the hardware wait too long, it may lose data. This is obviously true of real-time streaming data, but also is true of any hardware that has to store incoming data in fixed-size buffers. And if the hardware loses data, there is usually no way to get it back.


On the other hand, if the kernel or a user-space process has to be delayed or preempted, no data will be lost (with the exception of real-time systems, which entail a completely different way of handling processes as well as interrupts). In light of these considerations, modern interrupt handlers are divided into a top half and a bottom half. The top half consists of everything that has to be executed before releasing the CPU, to preserve data. The bottom half contains everything that can be done at relative leisure.


One can define a bottom half as an asynchronous request to execute a particular function. Normally, when you want to execute a function, you do not have to request anything; you simply invoke it. When an interrupt arrives, you have a lot to do and don't want to do it right away. Thus, you package most of the work into a function that you submit as a bottom half.


The following model allows the kernel to keep interrupts disabled for much less time than the simple model shown previously:


  1. The device signals the CPU to notify it of the interrupt.

  2. The CPU executes the associated top half, disabling further interrupt notifications until this handler has finished its job.

  3. Typically, a top half performs the following:

    1. It saves somewhere in RAM all the information that the kernel will need later to process the interrupt event.

    2. It marks a flag somewhere (or triggers something using another kernel mechanism) to make sure the kernel will know about the interrupt and will use the data saved by the handler to complete the event processing.

    3. Before terminating, it re-enables the interrupt notifications for the local CPU.

  4. At some later point, when the kernel is free of more pressing matters, it checks the flag set by the interrupt handler (signaling the presence of data to be processed) and calls the associated bottom half handler. It also clears the flag so that it can later recognize when the interrupt handler sets the flag again.
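
A minimal sketch of this split follows; the device and its helper routine are hypothetical, and it uses a tasklet (a deferral mechanism introduced later in this chapter) as the bottom half:

static void mydev_do_work(unsigned long data);           /* the bottom half */
static DECLARE_TASKLET(mydev_tasklet, mydev_do_work, 0);

static irqreturn_t mydev_isr(int irq, void *dev_id, struct pt_regs *regs)
{
    mydev_save_event(dev_id);         /* 3a: save the needed information in RAM */
    tasklet_schedule(&mydev_tasklet); /* 3b: tell the kernel there is deferred work */
    return IRQ_HANDLED;               /* 3c: interrupts are re-enabled on return */
}

static void mydev_do_work(unsigned long data)
{
    /* step 4: runs later, in softirq context, with interrupts enabled */
}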


Over time, Linux developers have tried different types of bottom halves, which obey different rules. Networking has played a large role in the development of new implementations, because of networking's need for low latency, that is, a minimal amount of time between the reception of a frame and its delivery. Low latency is more important for network device drivers than for other types of devices because of the high number of tasks involved in reception and transmission. As described earlier in the section "Interrupts," it can be disastrous to let a large number of frames build up while waiting to be handled. Sound cards are another example of devices requiring fast response.




9.3.2. Bottom Halves Solutions







The kernel provides different mechanisms for implementing bottom halves and for deferring work in general. These mechanisms differ mainly with regard to the following points:



Running context


Interrupts are seen by the kernel as having a different running context from user-space processes or other kernel code. When the function executed by a bottom half is capable of going to sleep, it is restricted to mechanisms allowed in process context, as opposed to interrupt context.


Concurrency and locking


When a mechanism can take advantage of SMP, this has implications for how serialization is enforced (if necessary) and how locking influences scalability.


In this chapter, we will look only at those mechanisms that do not need a process context, namely softirqs and tasklets. In the next section, we will briefly see their implications for concurrency and locking.


When you need to defer the execution of a function that may sleep, you need to use a dedicated kernel thread or a work queue. A work queue is simply a queue where you can submit a request to execute a function, and a kernel thread will take care of it. In this case, the function would be executed in the context of a kernel thread, and therefore sleeping is allowed. Since the networking code mainly uses softirqs and tasklets, we will not look at work queues in detail.
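
For completeness, this is roughly what such a deferral looks like; the names are invented, and note that before kernel 2.6.20 DECLARE_WORK and the work function also took a data argument:

/* Sketch of deferring a function that may sleep (needs linux/workqueue.h,
   linux/delay.h, linux/interrupt.h). */
static void my_slow_fn(struct work_struct *work)
{
    msleep(100);             /* sleeping is allowed: we run in a kernel thread */
}
static DECLARE_WORK(my_work, my_slow_fn);

static irqreturn_t my_isr(int irq, void *dev_id)
{
    schedule_work(&my_work); /* defer the part of the job that may sleep */
    return IRQ_HANDLED;
}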




9.3.3. Concurrency and Locking









Before launching into the code that network drivers use to handle bottom halves, we need some background on concurrency, which refers to functions that can interfere with each other either because they are scheduled on different CPUs or because one is suspended by the kernel to run another. Related topics are locks and the disabling of interrupts. (Concurrency is discussed in detail in both Understanding the Linux Kernel and Linux Device Drivers.)


Three different types of functions are introduced in this chapter to handle interrupts: old-style bottom halves, softirqs, and tasklets. All of them can be used to schedule the execution of a function, but they come with some big differences. As far as concurrency is concerned, we can summarize the differences as follows:


  • Only one old-style bottom half can run at any time, regardless of the number of CPUs (kernel 2.2).

  • Only one instance of each tasklet can run at any time. Different tasklets can run concurrently on different CPUs. This means that for any given tasklet, there is no need to enforce serialization because it is already enforced by the kernel: you cannot have multiple instances of the same tasklet running concurrently.

  • Only one instance of each softirq can run at the same time on a CPU. However, the same softirq can run on different CPUs concurrently. This means that for any given softirq, you need to make sure that accesses to shared data by different CPUs use proper locking. To increase parallelization, softirqs should be designed to access only per-CPU data as much as possible, reducing the need for locking considerably (see the sketch following this list).
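
The following minimal sketch illustrates the per-CPU approach, with invented names; because each CPU's softirq handler touches only its own entry, no spinlock is needed for this data (this is essentially what the networking code does with softnet_data):

struct my_percpu_state {
    struct sk_buff_head queue;   /* per-CPU work, never shared across CPUs */
    unsigned long dropped;
};
static DEFINE_PER_CPU(struct my_percpu_state, my_state);

static void my_softirq_action(struct softirq_action *h)
{
    struct my_percpu_state *s = &__get_cpu_var(my_state); /* this CPU's entry */

    /* drain s->queue here without taking any lock */
}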


Therefore, these three features require different kinds of locking mechanisms. The higher the concurrency allowed, the more carefully the programmer has to design the code executed, for the sake of both accuracy and performance. Whether a softirq or a tasklet represents the best choice for any given context depends on both locking and concurrency requirements. In most cases, tasklets are the way to go. But given the tight response requirements of the receive and transmit networking tasks, softirqs are preferred in those two specific cases. We will see later in this chapter how the networking code uses softirqs.


In some cases, the programmer has to disable hardware interrupts, software interrupts, or both. A detailed discussion of the contexts requires a background in SMP, preemption in the Linux kernel, and other matters outside the scope of this book. However, to understand the networking code you need to know the meaning of the main functions used to enable and disable interrupts. Table 9-1 summarizes the ones we need in this chapter (you can find many more in kernel/softirq.c, include/asm-XXX/hardirq.h, include/asm-XXX/spinlock.h, and include/linux/spinlock.h). Some of them may be defined globally and others per architecture.


Table 9-1. A few APIs related to software and hardware interrupts

in_interrupt

Returns TRUE if the CPU is currently serving a hardware or software interrupt, or if preemption is disabled.

in_softirq

Returns TRUE if the CPU is currently serving a software interrupt.

in_irq

Returns TRUE if the CPU is currently serving a hardware interrupt.

In the section "Preemption," and with the help of Figure 9-3, you can see how these three routines are implemented.

softirq_pending

Returns TRUE if there is at least one softirq pending (i.e., scheduled for execution) for the CPU whose ID was passed as the input argument.

local_softirq_pending

Returns TRUE if there is at least one softirq pending for the local CPU.

__raise_softirq_irqoff

Sets the flag associated with the input softirq type to mark it pending.

raise_softirq_irqoff

A wrapper around __raise_softirq_irqoff that also wakes up ksoftirqd when in_interrupt( ) returns FALSE.

raise_softirq

A wrapper around raise_softirq_irqoff that disables hardware interrupts before calling it and restores them to their original status afterward.

__local_bh_enable
local_bh_enable
local_bh_disable

__local_bh_enable enables bottom halves (and thus softirqs/tasklets) on the local CPU; local_bh_enable does the same and also invokes invoke_softirq if any softirq is pending and in_interrupt( ) returns FALSE. local_bh_disable disables bottom halves on the local CPU.

local_irq_disable
local_irq_enable

Disable and enable interrupts on the local CPU.

local_irq_save
local_irq_restore

local_irq_save first saves the current state of interrupts on the local CPU and then disables them. local_irq_restore restores the state of interrupts on the local CPU using the information previously saved by local_irq_save.

spin_lock_bh
spin_unlock_bh

Acquire and release a spinlock, respectively. Both functions disable and then re-enable bottom halves and preemption during the operation.
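
As an example of how the last pair in the table is typically used, here is a hedged sketch (with invented names) of process-context code protecting a list that a softirq or tasklet also traverses:

static DEFINE_SPINLOCK(my_lock);
static LIST_HEAD(my_list);

struct my_item {
    struct list_head node;
    /* ... payload ... */
};

void my_add(struct my_item *item)
{
    spin_lock_bh(&my_lock);   /* disables bottom halves, then takes the lock */
    list_add_tail(&item->node, &my_list);
    spin_unlock_bh(&my_lock); /* drops the lock, re-enables bottom halves */
}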





9.3.4. Preemption


In time-sharing systems, the kernel has always been able to preempt user processes at will, but the kernel itself is often nonpreemptive, which means that once it starts running it will not be interrupted until it is ready to give up control. A nonpreemptive kernel sometimes holds up high-priority processes when they are ready to run because the kernel is executing a system call for a lower-priority process. To support real-time extensions and for other reasons, the Linux kernel was made fully preemptible during the 2.5 kernel development cycle. With this new kernel feature, system calls and other kernel tasks can be preempted by other kernel tasks with higher priorities.


Because much work had already been done to eliminate critical sections (nonpreemptible code) from the kernel to support SMP locking mechanisms, adding full preemption was not a major change to the kernel. Once preemption was added, developers just had to define explicitly where to disable it (in hardware and software interrupt code, in the scheduler itself, in the code protected by spin locks and read/write locks, etc.).


However, there are times when preemption, just like interrupts, must be disabled. In this section, I'll cover just a few functions related to preemption that you may bump into while browsing the code, and then briefly show how some of the locking macros have been updated to deal with preemption.


The following functions control preemption:



preempt_disable


Disables preemption for the current task. Can be called repeatedly, incrementing a reference counter.


preempt_enable



preempt_enable_no_resched


The reverse of preempt_disable, allowing preemption to be enabled again. preempt_enable_no_resched simply decrements a reference counter, which allows preemption to be re-enabled when it reaches zero. preempt_enable, in addition, checks whether the counter is zero and forces a call to schedule( ) to allow any higher-priority task to run.


preempt_check_resched


This function is called by preempt_enable and differentiates it from preempt_enable_no_resched.


The networking code does not deal with these routines directly. However, preempt_enable and preempt_disable are indirectly called, for instance, by locking primitives, like rcu_read_lock and rcu_read_unlock, spin_lock and spin_unlock, etc. Routines used to access per-CPU data structures, like get_cpu and get_cpu_var, also disable preemption before reading the data.
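
For instance, a minimal sketch of that per-CPU pattern (the counter and function names are invented) looks like this:

static DEFINE_PER_CPU(unsigned long, my_counter);

static void count_event(void)
{
    int cpu = get_cpu();        /* disables preemption and returns the CPU id */

    per_cpu(my_counter, cpu)++; /* safe: we cannot migrate or be preempted */
    put_cpu();                  /* re-enables preemption */
}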


A counter for each process, named preempt_count and embedded in the thread_info structure, indicates whether a given process allows preemption. The field can be read with preempt_count( ) and is manipulated indirectly through the inc_preempt_count and dec_preempt_count functions defined in include/linux/preempt.h. There are situations in which the kernel should not be preempted. These include when it is servicing hardware, as well as when it uses one of the calls just shown to disable preemption. Therefore, preempt_count is split into three components. Each byte is a counter for a different condition that requires nonpreemption: hardware interrupts, software interrupts, and general nonpreemption. The layout of preempt_count is shown in Figure 9-3.



Figure 9-3. Structure of preempt_count



The figure shows, in addition to the purpose of each byte, the main functions that manipulate it. The high-order byte is not fully used at the moment, but its second least significant bit is set before calling the schedule function and tells that function that it has been called to preempt the current task.[*] In include/asm-XXX/hardirq.h you can find several macros that make it easier to read and write preempt_count; some of these include the XXX_OFFSET variables shown in Figure 9-3 and used by the functions listed in the figure to increment or decrement the right byte.

[*] The PREEMPT_ACTIVE flag is defined on a per-architecture basis. The figure shows the most common definition.


Despite all this complexity, whenever a check has to be done on the current process to see if it can be preempted, all the kernel needs to know is whether preempt_count is zero (it does not really matter why preemption is disabled).




9.3.5. Bottom-Half Handlers




The infrastructure for bottom halves must address the following needs:


  • Classifying the bottom half as the proper type

  • Registering the association between a bottom half type and its handler

  • Scheduling a bottom half for execution

  • Notifying the kernel about the presence of scheduled BHs


Let's first see how kernels up to version 2.2 handled bottom half handlers, and then how they are handled with the softirqs used by kernels 2.4 and 2.6.



9.3.5.1. Bottom-half handlers in kernel 2.2









The 2.2 model for bottom-half handlers divides them into a large number of types, which are differentiated by when and how often the kernel checks for them and runs them. The 2.2 list is as follows, taken from include/linux/interrupt.h. In this book, we are most interested in NET_BH.



enum {
    TIMER_BH = 0,
    CONSOLE_BH,
    TQUEUE_BH,
    DIGI_BH,
    SERIAL_BH,
    RISCOM8_BH,
    SPECIALIX_BH,
    AURORA_BH,
    ESP_BH,
    NET_BH,
    SCSI_BH,
    IMMEDIATE_BH,
    KEYBOARD_BH,
    CYCLADES_BH,
    CM206_BH,
    JS_BH,
    MACSERIAL_BH,
    ISICOM_BH
};



Each bottom-half type is associated with a function handler by means of init_bh. The networking code, for instance, initializes the NET_BH bottom-half type to the net_bh handler in net_dev_init, which is covered in Chapter 5.



__initfunc(int net_dev_init(void))
{
    ... ... ...
    init_bh(NET_BH, net_bh);
    ... ... ...
}



The main function used to unregister a BH handler is remove_bh. (There are other related functions too, such as enable_bh/disable_bh, but we do not need to see all of them.)


Whenever an interrupt handler wants to trigger the execution of a bottom half handler, it has to set the corresponding flag with mark_bh. This function is very simple: it sets a bit in the global bitmap bh_active, which, as we will see in a moment, is tested in several places.



extern inline void mark_bh(int nr)
{
    set_bit(nr, &bh_active);
}



For instance, you will see later in the chapter that every time a network device driver has successfully received a frame, it signals the kernel with a call to netif_rx. The latter queues the newly received frame into the ingress queue backlog (shared by all the CPUs) and sets the NET_BH bottom-half flag.



skb_queue_tail(&backlog, skb);
mark_bh(NET_BH);
return;



During several routine operations, the kernel checks whether any bottom halves are scheduled for execution. If any are waiting, the kernel runs the function do_bottom_half (currently in kernel/softirq.c), to execute them. The checks are performed during:



do_IRQ


Whenever the kernel is notified by an IRQ about a hardware interrupt, it calls do_IRQ to execute the associated handler. Since a good number of bottom halves are scheduled for execution by interrupt handlers, what could give them less latency than an invocation right at the end of do_IRQ? For this reason, the regular timer interrupt, which expires with frequency HZ, puts an upper bound on the time between two consecutive executions of do_bottom_half.


Returns from interrupts and exceptions (which includes system calls)


See arch/XXX/kernel/entry.S for the assembly language code that takes care of this case.


schedule


This function, which decides what to execute next on the CPU, checks whether any bottom-half handlers are pending and gives them priority over other tasks.



asmlinkage void schedule(void)
{
    /* Do "administrative" work here while we don't hold any locks */
    if (bh_mask & bh_active)
        goto handle_bh;
handle_bh_back:
    ... ... ...
handle_bh:
    do_bottom_half( );
    goto handle_bh_back;
    ... ... ...
}



run_bottom_half, the function used by do_bottom_half to execute the pending interrupt handlers, looks like this:



active = get_active_bhs( );
clear_active_bhs(active);
bh = bh_base;
do {
    if (active & 1)
        (*bh)( );
    bh++;
    active >>= 1;
} while (active);



The order in which the pending handlers are invoked depends on the positions of the associated flags inside the bitmap and the direction used to scan those flags (returned by get_active_bhs). In other words, bottom halves are not run on a first-come-first-served basis. And since networking bottom halves can take a long time, those that have the misfortune to be dequeued last can experience high latency.


Bottom halves in 2.2 and earlier kernels suffer from a ban on concurrency. Only one bottom half can run at any time, regardless of the number of CPUs.




9.3.5.2. Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq









The biggest improvement between kernels 2.2 and 2.4, as far as interrupt handling is concerned, was the introduction of software interrupts (softirqs), which can be seen as the multithreaded version of bottom half handlers. Not only can many softirqs run concurrently, but also the same softirq can run on different CPUs concurrently. The only restriction on concurrency is that only one instance of each softirq can run at the same time on a CPU.


The new softirq model has only six types (from include/linux/interrupt.h):



enum
{
    HI_SOFTIRQ=0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    SCSI_SOFTIRQ,
    TASKLET_SOFTIRQ
};



All the XXX_BH bottom-half types in the old model are still available to old drivers, but have been reimplemented to run as softirqs of the HI_SOFTIRQ type (which means they take priority over the other softirq types). The two types used by networking code, NET_TX_SOFTIRQ and NET_RX_SOFTIRQ, are introduced in the later section "How the Networking Code Uses softirqs." The next section will introduce tasklets.


Softirqs, like the old bottom halves, run with interrupts enabled and therefore can be suspended at any time to handle a new, incoming interrupt. However, the kernel does not allow a new request for a softirq to run on a CPU if another instance of that softirq has been suspended on that CPU; this drastically reduces the amount of locking needed. Each softirq type can maintain an array of data structures of type softnet_data, one per CPU, to hold state information about the current softirq; we'll see the contents of this structure in the section "softnet_data Structure." Since different instances of the same type of softirq can run simultaneously on different CPUs, the functions run by softirqs still need to lock other data structures that are shared, to avoid race conditions.


The functions used to register and schedule a softirq handler, and the logic behind them, are very similar to the ones used with 2.2 bottom halves.


softirq handlers are registered with the open_softirq function, which, unlike init_bh, accepts an extra parameter so that the function handler can be passed some input data if needed. None of the softirqs, however, currently uses that extra parameter, and a proposal has been floated to remove it. open_softirq simply copies the input parameters into the global array softirq_vec, declared in kernel/softirq.c, which holds the associations between types and handlers.



static struct softirq_action softirq_vec[32] __cacheline_aligned_in_smp;

void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
{
    softirq_vec[nr].data = data;
    softirq_vec[nr].action = action;
}



A softirq can be scheduled for execution on the local CPU by the following functions:



__raise_softirq_irqoff


This function, the counterpart of mark_bh in 2.2, simply sets the bit flag associated with the softirq to be run. Later on, when the flag is checked, the associated handler will be invoked.


raise_softirq_irqoff


This is a wrapper around __raise_softirq_irqoff that additionally schedules the ksoftirqd thread (discussed later in this chapter) if the function is not called from a hardware or software interrupt context and preemption has not been disabled. If the function is called from interrupt context, waking up the thread is not necessary because, as we will see, do_softirq will be called anyway.


raise_softirq


This is a wrapper around raise_softirq_irqoff that executes the latter with hardware interrupts disabled.
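
Putting these together, a subsystem that defined its own softirq type would register and raise it roughly as follows; MY_SOFTIRQ and my_softirq_action are hypothetical names used only for illustration:

open_softirq(MY_SOFTIRQ, my_softirq_action, NULL); /* done once, at init time */

/* from hardware interrupt context, where IRQs are already disabled: */
raise_softirq_irqoff(MY_SOFTIRQ);

/* from a context where IRQs may be enabled: */
raise_softirq(MY_SOFTIRQ);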


The following code, taken from kernel 2.4.5,[*] shows the model used at an early stage of softirq development. It is very similar to the 2.2 model, and invokes the function do_softirq, which is a counterpart to the 2.2 function do_bottom_half discussed in the previous section. do_softirq is called if at least one softirq has been scheduled for execution:

[*] It has been removed in 2.4.6.



asmlinkage void schedule(void)
{
    /* Do "administrative" work here while we don't hold any locks */
    if (softirq_active(this_cpu) & softirq_mask(this_cpu))
        goto handle_softirq;
handle_softirq_back:
    ... ... ...
handle_softirq:
    do_softirq( );
    goto handle_softirq_back;
    ... ... ...
}



The only difference between this early stage of softirqs and the 2.2 bottom-half model is that the softirq version has to check the flags on a per-CPU basis, since each CPU has its own bitmap of pending softirqs.


The implementation of do_softirq is also very similar to its counterpart do_bottom_half in 2.2. The kernel calls the function at some of the same points, though not exactly the same ones. The main change is the introduction of a new per-CPU kernel thread, ksoftirqd.


Here are the main points where do_softirq may be invoked:[*]

[*] It is also possible to call invoke_softirq instead of do_softirq directly. The former could be an alias to do_softirq or to its helper routine, __do_softirq, depending on whether the __ARCH_IRQ_EXIT_IRQS_DISABLED symbol is defined.



do_IRQ


The skeleton for do_IRQ, which is defined in the per-architecture files arch/arch-name/kernel/irq.c, is:



fastcall unsigned int do_IRQ(struct pt_regs *regs)
{
    irq_enter( );
    ... ... ...
    /* handle the IRQ number "irq" with the registered handler */
    ... ... ...
    irq_exit( );
    return 1;
}



In kernel 2.4, the function also called do_softirq. For most architectures in 2.6, a call to do_softirq is made inside irq_exit instead. A minority still have it inside do_IRQ.


Since nested calls to irq_enter are allowed, irq_exit calls invoke_softirq only when all the usual conditions are met (there are softirqs pending, the kernel is not in interrupt context, etc.) and the reference count associated with the interrupt context has reached zero, indicating that the kernel is leaving the interrupt context.


Here is the generic definition of irq_exit from kernel/softirq.c, but there are architectures that define their own versions:



void irq_exit(void)
{
    ...
    sub_preempt_count(IRQ_EXIT_OFFSET);
    if (!in_interrupt( ) && local_softirq_pending( ))
        invoke_softirq( );
    preempt_enable_no_resched( );
}



smp_apic_timer_interrupt, which handles SMP timers in arch/XXX/kernel/apic.c, also uses irq_enter/irq_exit.


Returns from interrupts and exceptions (which include system calls)


This is the same as kernel 2.2.


local_bh_enable


When softirqs are re-enabled on a CPU, pending requests are processed (if any) with a call to do_softirq.


The kernel threads, ksoftirqd_CPUn


To keep softirqs from monopolizing all the CPUs (which could happen easily on a heavily loaded network because the NET_TX_SOFTIRQ and NET_RX_SOFTIRQ interrupts have a higher priority than user processes), developers introduced a new set of per-CPU threads. These have the names ksoftirqd_CPU0, ksoftirqd_CPU1, and so on, and can be seen by a ps command. More details on these threads appear in the section "ksoftirqd Kernel Threads."


I have described i386 behavior in general; other architectures may use different naming conventions or have additional timers that also invoke do_softirq.


Another interesting place where do_softirq is called is from within netif_rx_ni, which is briefly described in the section "Old Interface Between Device Drivers and Kernel: First Part of netif_rx" in Chapter 10. The traffic generator built into the kernel (net/core/pktgen.c) also calls do_softirq.





9.3.6. Tasklets








Most of the bottom halves of the 2.2 kernel variety have been converted to either softirqs or tasklets. A tasklet is a function that some interrupt or other task has deferred to execute later. Tasklets are built on top of softirqs and are usually kicked off by interrupt handlers. (But other parts of the kernel, such as the neighboring subsystem discussed in Part VI, also use tasklets.)[*]

[*] The kernel provides work queues as well. We will not cover them because they are not used much by the networking code. Refer to Understanding the Linux Kernel for a discussion of work queues.


In the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq," we saw the list of softirqs. HI_SOFTIRQ is used to implement high-priority tasklets, and TASKLET_SOFTIRQ is used for lower-priority ones. Each time a request for a deferred execution is issued, an instance of a tasklet_struct structure is queued onto either a list processed by HI_SOFTIRQ or another one that is instead processed by TASKLET_SOFTIRQ.


Since softirqs are handled independently by each CPU, it should not be a surprise that there are two lists of pending tasklet_structs for each CPU, one associated with HI_SOFTIRQ and one with TASKLET_SOFTIRQ. These are their definitions from kernel/softirq.c:



static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec) = { NULL };
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec) = { NULL };



At first sight, tasklets may seem to be just like the old bottom halves, but there actually are substantial differences:


  • There is no limit on the number of different tasklets, whereas the old bottom halves were limited to one type for each bit flag of bh_base.

  • Tasklets provide two levels of priority.

  • Different tasklets can run concurrently on different CPUs.

  • Tasklets, unlike old bottom halves and softirqs, are dynamic and do not need to be statically declared in an XXX_BH or XXX_SOFTIRQ enumeration list.


The tasklet_struct data structure is defined in include/linux/interrupt.h as follows:



struct tasklet_struct
{
    struct tasklet_struct *next;
    unsigned long state;
    atomic_t count;
    void (*func)(unsigned long);
    unsigned long data;
};



The following is the field-by-field description:



struct tasklet_struct *next


A pointer used to link together the pending structures associated with the same CPU. New elements are added at the head by the functions tasklet_hi_schedule and tasklet_schedule.


unsigned long state


A bitmap flag whose possible values are represented by the TASKLET_STATE_XXX enums listed in include/linux/interrupt.h:



TASKLET_STATE_SCHED


The tasklet has been scheduled for execution, and the data structure is already in the list associated with HI_SOFTIRQ or TASKLET_SOFTIRQ, based on the priority assigned. The same tasklet cannot be scheduled concurrently on different CPUs. If other requests to execute the tasklet arrive when the first one has not started its execution yet, they will be dropped. Since for any given tasklet, there can be only one instance in execution, there is no reason to schedule it for execution more than once.


TASKLET_STATE_RUN


The tasklet is being executed. This flag is used to prevent multiple instances of the same tasklet from being executed concurrently. It is meaningful only for SMP systems. The flag is manipulated with the three locking functions tasklet_trylock, tasklet_unlock, and tasklet_unlock_wait.


atomic_t count


There are cases where you may need to temporarily disable and later re-enable a tasklet. This is accomplished by this counter: a value of zero means that the tasklet is enabled, and a nonzero value means it is disabled (and thus not executable). Its value is incremented by the tasklet[_hi]_disable functions and decremented by the tasklet[_hi]_enable functions described later in this section.


void (*func)(unsigned long)



unsigned long data


func is the function to execute and data is an optional input that can be passed to func.


The following are some important kernel functions that handle tasklets, from kernel/softirq.c and include/linux/interrupt.h:



tasklet_init


Fills in the fields of a tasklet_struct structure with the func and data values provided as arguments.


tasklet_action, tasklet_hi_action


Execute low-priority and high-priority tasklets, respectively.


tasklet_schedule, tasklet_hi_schedule


Schedule a low-priority and a high-priority tasklet, respectively, for execution. They add the tasklet_struct structure to the list of pending tasklets associated with the local CPU and then schedule the associated softirq (TASKLET_SOFTIRQ or HI_SOFTIRQ). If the tasklet is already scheduled (but not running), these APIs do nothing (see the TASKLET_STATE_SCHED flag).


tasklet_enable, tasklet_hi_enable


These two functions are identical and are used to enable a tasklet.


tasklet_disable, tasklet_disable_nosync


Both of these functions disable a tasklet and can be used with low- and high-priority tasklets. tasklet_disable is a wrapper around tasklet_disable_nosync. While the latter returns right away (it is asynchronous), the former returns only once the tasklet has terminated its execution, in case it was running (it is synchronous).


tasklet_enable, tasklet_hi_enable, and tasklet_disable_nosync manipulate the value of the count field to declare the tasklet enabled or disabled. Nested calls are allowed.
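
To tie these APIs together, here is a hedged sketch of a typical tasklet life cycle; all names are invented:

static void my_tasklet_fn(unsigned long data)
{
    /* deferred work runs here, in softirq context: it must not sleep */
}
static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0); /* count starts at 0: enabled */

static irqreturn_t my_isr(int irq, void *dev_id)
{
    tasklet_schedule(&my_tasklet); /* sets TASKLET_STATE_SCHED and queues it */
    return IRQ_HANDLED;
}

static void my_config_change(void)
{
    tasklet_disable(&my_tasklet);  /* count++; waits if the tasklet is running */
    /* ... the tasklet cannot run while we reconfigure ... */
    tasklet_enable(&my_tasklet);   /* count--; executable again when it reaches 0 */
}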




9.3.7. Softirq Initialization


During kernel initialization, softirq_init initializes the software IRQ layer with the two general-purpose softirqs: tasklet_action and tasklet_hi_action, which are associated with TASKLET_SOFTIRQ and HI_SOFTIRQ, respectively.



void __init softirq_init(void)
{
    open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
    open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
}



The two softirqs used by the networking code, NET_RX_SOFTIRQ and NET_TX_SOFTIRQ, are initialized in net_dev_init, one of the networking initialization functions (see the section "How the Networking Code Uses softirqs").


The other softirqs listed in the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq" are registered in the associated subsystems (SCSI_SOFTIRQ in drivers/scsi/scsi.c, TIMER_SOFTIRQ in kernel/timer.c, etc.).


HI_SOFTIRQ is mainly used by sound card device drivers.[*]

[*] In 2.4 kernels, all the bottom-half handlers of kernel version 2.2 were converted to high-priority tasklets by defining the mark_bh function as a wrapper around tasklet_hi_schedule.


Users of TASKLET_SOFTIRQ include:


  • Drivers for network interface cards (not only Ethernets)

  • Numerous other device drivers

  • Media layers (USB, IEEE 1394, etc.)

  • Networking subsystems (Neighboring, ATM qdisc, etc.)




9.3.8. Pending softirq Handling







We explained in the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq" when do_softirq is invoked to take care of the pending softirqs. Here we will see the internals of the function. You will notice how much it resembles the one used by kernel 2.2 described in the section "Bottom-half handlers in kernel 2.2."


do_softirq stops and does nothing if the CPU is currently serving a hardware or software interrupt. The function checks for this by calling in_interrupt, which is equivalent to if (in_irq( ) || in_softirq( )).


If do_softirq decides to proceed, it saves the bitmap of pending softirqs in the local variable pending with local_softirq_pending.



#ifndef __ARCH_HAS_DO_SOFTIRQ

asmlinkage void do_softirq(void)
{
    unsigned long flags;
    __u32 pending;

    if (in_interrupt( ))
        return;

    local_irq_save(flags);
    pending = local_softirq_pending( );
    if (pending)
        __do_softirq( );
    local_irq_restore(flags);
}

EXPORT_SYMBOL(do_softirq);
#endif



From the preceding snapshot, it could seem that do_softirq runs with IRQs disabled, but that's not true. IRQs are kept disabled only when manipulating the per-CPU bitmap of pending softirqs. You will see in a moment that __do_softirq internally re-enables IRQs when running the softirq handlers.



9.3.8.1. __do_softirq function


It is possible for the same softirq type to be scheduled multiple times while do_softirq is running. Since IRQs are enabled when running the softirq handlers, the bitmap of pending softirqs can be manipulated while serving an interrupt, and therefore any of the softirq handlers that has been executed by __do_softirq could be re-scheduled during the execution of __do_softirq itself.


For this reason, before __do_softirq re-enables IRQs, it saves the current bitmap of pending softirqs in the local variable pending and clears the per-CPU bitmap with local_softirq_pending( ) = 0. Then, based on pending, it calls all the necessary handlers.


Once all the handlers have been called, __do_softirq checks whether in the meantime any softirqs were scheduled again (this check is made with IRQs disabled). If at least one softirq is pending, it will repeat the whole process. However, __do_softirq repeats it only up to MAX_SOFTIRQ_RESTART times (experimentation has found that 10 times works well).


The use of MAX_SOFTIRQ_RESTART is a design decision made to keep a single type of interrupt, particularly a stream of networking interrupts, from starving other interrupts out of one of the CPUs. Without the limit in __do_softirq, starvation could easily happen when a server is highly loaded by network traffic and the number of NET_RX_SOFTIRQ interrupts goes through the roof.


Let's see how starvation could take place. do_IRQ would raise a NET_RX_SOFTIRQ interrupt that would cause do_softirq to be executed. __do_softirq would clear the NET_RX_SOFTIRQ flag, but before it ended it would be interrupted by an interrupt that would set NET_RX_SOFTIRQ again, and so on, indefinitely.


Let's see now how the central part of __do_softirq manages to invoke the softirq handlers. Every time one softirq type is served, its bit is cleared from the local copy of the active softirqs, pending. h is initialized to point to the global data structure softirq_vec that holds the associations between softirq types and their function handlers (for instance, NET_RX_SOFTIRQ is handled by net_rx_action). The loop ends when the bitmap is cleared.


Finally, if there are pending softirqs that cannot be handled because do_softirq must return, having repeated its job MAX_SOFTIRQ_RESTART times already, the ksoftirqd thread is awakened and given the responsibility of handling them later. Because do_softirq is invoked at so many points within the kernel, it is actually likely that a later invocation of do_softirq will handle these interrupts before the ksoftirqd thread is scheduled.



#define MAX_SOFTIRQ_RESTART 10

asmlinkage void __do_softirq(void)
{
    struct softirq_action *h;
    __u32 pending;
    int max_restart = MAX_SOFTIRQ_RESTART;
    int cpu;

    pending = local_softirq_pending( );

    local_bh_disable( );
    cpu = smp_processor_id( );
restart:
    /* Reset the pending bitmask before enabling irqs */
    local_softirq_pending( ) = 0;

    local_irq_enable( );

    h = softirq_vec;

    do {
        if (pending & 1) {
            h->action(h);
            rcu_bh_qsctr_inc(cpu);
        }
        h++;
        pending >>= 1;
    } while (pending);

    local_irq_disable( );

    pending = local_softirq_pending( );
    if (pending && --max_restart)
        goto restart;

    if (pending)
        wakeup_softirqd( );

    __local_bh_enable( );
}






9.3.9. Per-Architecture Processing of softirq


The do_softirq function provided in kernel/softirq.c can be overridden by another function provided by the architecture code (which ends up calling __do_softirq anyway). This explains why the definition of do_softirq in kernel/softirq.c is wrapped with the check on __ARCH_HAS_DO_SOFTIRQ (see the previous section).


A few architectures, including i386 (see arch/i386/kernel/irq.c), define their own version of do_softirq. Such architecture versions are used when the architecture uses 4 KB stacks (instead of 8 KB) and handles hard IRQs and softirqs on separate, dedicated stacks. Please refer to Understanding the Linux Kernel for more detail.




9.3.10. ksoftirqd Kernel Threads









Background kernel threads are assigned the job of checking for softirqs that have been left unexecuted by the functions previously described, and executing as many of those softirqs as they can before needing to give that CPU back to other activities. There is one kernel thread for each CPU, named ksoftirqd_CPU0, ksoftirqd_CPU1, and so on. The section "Starting the threads" describes how these threads are started at CPU boot time.


The function ksoftirqd associated with these threads is pretty simple and is defined in the same file, softirq.c:



static int ksoftirqd(void *__bind_cpu)
{
    set_user_nice(current, 19);
    ...
    while (!kthread_should_stop( )) {
        if (!local_softirq_pending( ))
            schedule( );

        __set_current_state(TASK_RUNNING);

        while (local_softirq_pending( )) {
            /* Preempt disable stops cpu going offline.
               If already offline, we'll be on wrong CPU:
               don't process */
            preempt_disable( );
            if (cpu_is_offline((long)__bind_cpu))
                goto wait_to_die;
            do_softirq( );
            preempt_enable( );
            cond_resched( );
        }
        set_current_state(TASK_INTERRUPTIBLE);
    }
    __set_current_state(TASK_RUNNING);
    return 0;
    ...
}



There are a couple of small details I want to emphasize. The priority of a process, also called the nice priority, is a number ranging from -20 (maximum) to +19 (minimum). The ksoftirqd threads are given a low priority of 19. This is done so that frequently running softirqs such as NET_RX_SOFTIRQ cannot completely kidnap the CPU, which would leave almost no resources to other processes. We already saw that do_softirq can be invoked from different places in the code, so this low priority doesn't represent a handicap. Once started, the loop simply keeps calling do_softirq (always with preemption disabled) until one of the following conditions becomes true:


  • There are no more pending softirqs to handle (local_softirq_pending( ) returns FALSE).

    In this case, the function sets the thread's state to TASK_INTERRUPTIBLE and calls schedule( ) to release the CPU. The thread can be awakened by means of wakeup_softirqd, which can be called from both __do_softirq itself and raise_softirq_irqoff.

  • The thread has run for too long and is asked to release the CPU.

    The handler associated with the timer interrupt, among other things, sets the need_resched flag to signal that the current process/thread has used its time slot. In this case, ksoftirqd releases the CPU, keeping its state as TASK_RUNNING, and will soon be resumed.



9.3.10.1. Starting the threads



There is one ksoftirqd thread for each CPU. When the system's first CPU comes online, the first thread is started at kernel boot time inside do_pre_smp_initcalls.[*] The ksoftirqd threads for the other CPUs that come up at boot time, and for any other CPU that may be enabled later on a system that can handle hot-pluggable CPUs, are taken care of through the cpu_chain notification chain.

[*] See Chapter 7 for details about how the kernel takes care of basic initializations at boot time.


Notification chains were introduced in Chapter 4. The cpu_chain chain lets various subsystems know when a CPU is up and running or when one dies. The softirq subsystem registers to the cpu_chain with spawn_ksoftirqd, called from the function do_pre_smp_initcalls mentioned previously. The callback routine cpu_callback that processes notifications from cpu_chain is used to initialize the necessary per-CPU data structures and start the ksoftirqd thread on the CPU.


The complete list of CPU_XXX notifications is in include/linux/notifier.h, but we need only four of them in the context of this chapter:



CPU_UP_PREPARE


Generated when the CPU starts coming up, but is not ready yet.


CPU_ONLINE


Generated when the CPU is ready.


CPU_UP_CANCELED



CPU_DEAD


These two messages are generated only when the kernel is compiled with support for hot-pluggable CPUs. The first is used when one of the tasks triggered by a previous CPU_UP_PREPARE notification failed and therefore does not allow the CPU to go online. The second one is used when a CPU dies.


CPU_UP_PREPARE creates the thread and binds it to the associated CPU, but does not wake up the thread. CPU_ONLINE wakes up the thread. When a CPU dies, its associated ksoftirqd instance is killed:



static int __devinit cpu_callback(struct notifier_block *nfb, unsigned long action,
                                  void *hcpu)
{
    ...
    switch (action) {
    ...
    }
    return NOTIFY_OK;
}

static struct notifier_block __devinitdata cpu_nfb = {
    .notifier_call = cpu_callback
};

__init int spawn_ksoftirqd(void)
{
    void *cpu = (void *)(long)smp_processor_id( );
    cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
    cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
    register_cpu_notifier(&cpu_nfb);
    return 0;
}



Note that spawn_ksoftirqd places two direct calls to cpu_callback before registering with cpu_chain via register_cpu_notifier. This is necessary because CPU notifications are not generated for the first CPU that comes online.





9.3.11. Tasklet Processing







The two handlers for low-priority tasklets (TASKLET_SOFTIRQ) and high-priority tasklets (HI_SOFTIRQ) are identical; they simply work on two different lists. For this reason, we will describe only one of them: tasklet_action, the one associated with TASKLET_SOFTIRQ.


Only one instance of each tasklet can be waiting for execution at any time. When tasklet_schedule or tasklet_hi_schedule schedules a tasklet, the function sets the TASKLET_STATE_SCHED bit described earlier in the section "Tasklets." Attempts to reschedule the same tasklet will be ignored because TASKLET_STATE_SCHED is already set. The bit is cleared only when the tasklet starts its execution; thus, during or after its execution another instance can be scheduled.
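
The scheduling functions implement this rule with a single atomic test-and-set; the 2.6-era definition in include/linux/interrupt.h is essentially the following:

static inline void tasklet_schedule(struct tasklet_struct *t)
{
    /* only the first caller queues the tasklet; as long as
       TASKLET_STATE_SCHED stays set, later calls do nothing */
    if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
        __tasklet_schedule(t);
}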


The tasklet_action function starts by copying the list of tasklets waiting to be processed into a local variable and then clears the per-CPU list.[*] This is the only part of the function that is executed with interrupts disabled. Disabling them is necessary to avoid race conditions with interrupt handlers that could add new elements to the list while tasklet_action accesses it.

[*] We will see that one of the networking softirq handlers (net_tx_action) does something similar.


At this point, the function goes through the list tasklet by tasklet. For each element it invokes the handler if both of the following are true:


  • The tasklet is not already running; in other words, TASKLET_STATE_RUN is clear. (The function runs tasklet_trylock to see whether TASKLET_STATE_RUN is already set; if not, tasklet_trylock sets the bit.)

  • The tasklet is enabled (count is zero).


The part of the function implementing these activities follows:



struct tasklet_struct *list;

local_irq_disable( );
list = __get_cpu_var(tasklet_vec).list;
__get_cpu_var(tasklet_vec).list = NULL;
local_irq_enable( );

while (list) {
    struct tasklet_struct *t = list;

    list = list->next;

    if (tasklet_trylock(t)) {
        if (!atomic_read(&t->count)) {



At this stage, since the tasklet was not already being executed and it was extracted from the list of pending tasklets, it must have the TASKLET_STATE_SCHED flag set:



            if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
                BUG( );
            t->func(t->data);
            tasklet_unlock(t);
            continue;
        }
        tasklet_unlock(t);
    }



If the handler cannot be executed, the tasklet is put back into the list and TASKLET_SOFTIRQ is rescheduled to take care of all of those tasklets that for one of the two reasons listed earlier cannot be handled now:



    local_irq_disable( );
    t->next = __get_cpu_var(tasklet_vec).list;
    __get_cpu_var(tasklet_vec).list = t;
    __raise_softirq_irqoff(TASKLET_SOFTIRQ);
    local_irq_enable( );
}





9.3.12. How the Networking Code Uses softirqs







The networking subsystem has been assigned two different softirqs. NET_RX_SOFTIRQ handles incoming traffic and NET_TX_SOFTIRQ handles outgoing traffic. Both are registered in net_dev_init (described in Chapter 5) through the following lines:



open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);



Because different instances of the same softirq handler can run concurrently on different CPUs (unlike tasklets), networking code is both low latency and scalable.


Both networking softirqs are higher in priority than normal tasklets (TASKLET_SOFTIRQ) but are lower in priority than high-priority tasklets (HI_SOFTIRQ). This prioritization guarantees that other high-priority tasks can proceed in a responsive and timely manner even when a system is under a high network load.


The internals of the two handlers are covered in the sections "Processing the NET_RX_SOFTIRQ: net_rx_action" in Chapter 10 and "Processing the NET_TX_SOFTIRQ: net_tx_action" in Chapter 11.












