Saturday, November 7, 2009

OS Hangs










OS Hangs


OS hangs come in two types: interruptible and non-interruptible. The first step to remedying a hang is to identify the type of hang. We know we have an interruptible hang when it responds to an external interrupt. Conversely, we know we have a non-interruptible hang when it does not.


To determine whether the hang responds to an external interrupt, attempt a ping test, checking for a response. If a keyboard is attached, perform a test by simply pressing the Caps Lock key to see whether the Caps Lock light cycles. If you have console access, determine whether the console gives you line returns when you press the Enter key. If one or more of these yields the sought response, you know you have an interruptible hang.


Note



Any time an OS hangs, one or more offending processes are usually responsible for the hang. This is true whether software or hardware is ultimately to blame for the OS hang. Even when a hardware problem exists, a process has made a request of the hardware that the hardware could not fulfill, so processes stack up as a result.






Troubleshooting Interruptible Hangs


The first step in troubleshooting an interruptible hang is obtaining a stack trace of the offending process or processes by using the Magic SysRq keystroke. Some Linux distributions have this functionality enabled by default, whereas others do not. We recommend always having this functionality enabled. The following example shows how to enable it.


Check whether the Magic SysRq is enabled:


# cat /proc/sys/kernel/sysrq
0
( 0 = disabled 1 = enabled)


Because it is not enabled, enable it:


# echo 1 > /proc/sys/kernel/sysrq
# cat /proc/sys/kernel/sysrq
1


Alternatively, we can use the sysctl command:


# sysctl -n kernel.sysrq
0
# sysctl -w kernel.sysrq=1
# sysctl -n kernel.sysrq
1


To make this setting persistent, just put an entry into the configuration file:


# /etc/sysctl.conf
kernel.sysrq=1


When the functionality is enabled, a stack trace can be obtained by sending a Break+t to the console. Unfortunately, this can be more difficult than it first appears. With a standard VGA console, this is accomplished with the Alt+sysrq+t keystroke combination; however, the keystroke combination is different for other console emulators, in which case you would need to determine the required key sequence by contacting the particular manufacturer. For example, if a Windows user utilizes emulation software, such as Reflections, key mapping can be an issue. Linux distributions sometimes provide tools such as cu and minicom, which do not affect the key mapping by default.


With the latest 2.4 kernel releases, a new file is introduced in /proc called sysrq-trigger. By simply echoing a predefined character to this file, the administrator avoids the need to send a break to the console. However, if the terminal window is hung, the break sequence is the only way. The following example shows how to use the functionality to obtain a stack trace.


Using a serial console or virtual console (/dev/ttyS0 or /dev/vc/1), press Alt+sysrq+h.


After this key combination is entered and sysrq is enabled, the output on the screen is as follows:


kernel: SysRq : HELP : loglevel0-8 reBoot tErm kIll saK showMem Off
showPc unRaw Sync showTasks Unmount


The following is a description of each parameter:


  • loglevel Sets the console logging level.

  • reBoot Resets the processor so that the system starts booting.

  • tErm Sends SIGTERM to all processes except init.

  • kIll Sends SIGKILL to all processes except init.

  • saK Kills all process in virtual console: Secure Access Keys. This ensures that the login prompt is init's and not some third-party application.

  • showMem Shows system memory report.

  • Off Shuts machine down.

  • showPc Dumps registers: Includes PID of running process, instruction pointers, and control registers.

  • unRaw Turns off keyboard RAW mode and sets it to ASCII (XLATE).

  • Sync Syncs filesystems.

  • showTasks Dumps all tasks and their info.

  • Unmount Attempts to remount all mounted filesystems as read-only.

  • shoWcpus Shows stacks for each CPU by "walking" each CPU (smp kernel).


With the latest kernels, it is possible to test the "Magic SysRq" functionality by writing a "key" character from the previous list to the /proc/sysrq-trigger file. The following example causes the CPU stacks to be written to the kernel ring buffer:


# echo w > /proc/sysrq-trigger


To view the processor stacks, simply execute the dmesg command or view the syslog.


# dmesg
...
SysRq : Show CPUs
CPU1:

Call Trace: [<e000000004415860>] sp=0xe0000040fc96fca0
bsp=0xe0000040fc969498 show_stack [kernel] 0x80
[<e000000004653d30>] sp=0xe0000040fc96fe60 bsp=0xe0000040fc969480
showacpu [kernel] 0x90
[<e000000004653d80>] sp=0xe0000040fc96fe60 bsp=0xe0000040fc969470
sysrq_handle_showcpus [kernel] 0x20
[<e000000004654380>] sp=0xe0000040fc96fe60 bsp=0xe0000040fc969428
__handle_sysrq_nolock [kernel] 0x120
[<e000000004654230>] sp=0xe0000040fc96fe60 bsp=0xe0000040fc9693f0
handle_sysrq [kernel] 0x70
[<e00000000458a930>] sp=0xe0000040fc96fe60 bsp=0xe0000040fc9693c8
write_sysrq_trigger [kernel] 0xf0
[<e0000000045167a0>] sp=0xe0000040fc96fe60 bsp=0xe0000040fc969348
sys_write [kernel] 0x1c0
[<e00000000440e900>] sp=0xe0000040fc96fe60 bsp=0xe0000040fc969348
ia64_ret_from_syscall [kernel] 0x0
CPU0:

Call Trace: [<e000000004415860>] sp=0xe0000040fbe07ac0
bsp=0xe0000040fbe01518 show_stack [kernel] 0x80
[<e000000004653d30>] sp=0xe0000040fbe07c80 bsp=0xe0000040fbe01500
showacpu [kernel] 0x90
[<e000000004446980>] sp=0xe0000040fbe07c80 bsp=0xe0000040fbe014a0
handle_IPI [kernel] 0x200
[<e000000004412500>] sp=0xe0000040fbe07c80 bsp=0xe0000040fbe01460
handle_IRQ_event [kernel] 0x100
[<e000000004412be0>] sp=0xe0000040fbe07c80 bsp=0xe0000040fbe01418
do_IRQ [kernel] 0x160
[<e000000004414e20>] sp=0xe0000040fbe07c80 bsp=0xe0000040fbe013d0
ia64_handle_irq [kernel] 0xc0
[<e00000000440e920>] sp=0xe0000040fbe07c80 bsp=0xe0000040fbe013d0
ia64_leave_kernel [kernel] 0x0
[<e00000000447b9b0>] sp=0xe0000040fbe07e20 bsp=0xe0000040fbe012c8
schedule [kernel] 0xa70
[<e000000004486ce0>] sp=0xe0000040fbe07e30 bsp=0xe0000040fbe01288
do_syslog [kernel] 0x460
[<e0000000045876f0>] sp=0xe0000040fbe07e60 bsp=0xe0000040fbe01260
kmsg_read [kernel] 0x30
[<e000000004516400>] sp=0xe0000040fbe07e60 bsp=0xe0000040fbe011d8
sys_read [kernel] 0x1c0
[<e00000000440e900>] sp=0xe0000040fbe07e60 bsp=0xe0000040fbe011d8
ia64_ret_from_syscall [kernel] 0x0
CPU3:

Call Trace: [<e000000004415860>] sp=0xe000004083e87b00
bsp=0xe000004083e81318 show_stack [kernel] 0x80
[<e000000004653d30>] sp=0xe000004083e87cc0 bsp=0xe000004083e81300
showacpu [kernel] 0x90
[<e000000004446980>] sp=0xe000004083e87cc0 bsp=0xe000004083e812a0
handle_IPI [kernel] 0x200
[<e000000004412500>] sp=0xe000004083e87cc0 bsp=0xe000004083e81260
handle_IRQ_event [kernel] 0x100
[<e000000004412be0>] sp=0xe000004083e87cc0 bsp=0xe000004083e81218
do_IRQ [kernel] 0x160
[<e000000004414e20>] sp=0xe000004083e87cc0 bsp=0xe000004083e811d0
ia64_handle_irq [kernel] 0xc0
[<e00000000440e920>] sp=0xe000004083e87cc0 bsp=0xe000004083e811d0
ia64_leave_kernel [kernel] 0x0
[<e0000000044160c0>] sp=0xe000004083e87e60 bsp=0xe000004083e811d0
default_idle [kernel] 0x0
[<e0000000044161e0>] sp=0xe000004083e87e60 bsp=0xe000004083e81160
cpu_idle [kernel] 0x100
[<e00000000499c380>] sp=0xe000004083e87e60 bsp=0xe000004083e81150
start_secondary [kernel] 0x80
[<e0000000044080c0>] sp=0xe000004083e87e60 bsp=0xe000004083e81150
start_ap [kernel] 0x1a0
CPU2:

Call Trace: [<e000000004415860>] sp=0xe0000040fdc77b00
bsp=0xe0000040fdc71318 show_stack [kernel] 0x80
[<e000000004653d30>] sp=0xe0000040fdc77cc0 bsp=0xe0000040fdc71300
showacpu [kernel] 0x90
[<e000000004446980>] sp=0xe0000040fdc77cc0 bsp=0xe0000040fdc712a0
handle_IPI [kernel] 0x200
[<e000000004412500>] sp=0xe0000040fdc77cc0 bsp=0xe0000040fdc71260
handle_IRQ_event [kernel] 0x100
[<e000000004412be0>] sp=0xe0000040fdc77cc0 bsp=0xe0000040fdc71218
do_IRQ [kernel] 0x160
[<e000000004414e20>] sp=0xe0000040fdc77cc0 bsp=0xe0000040fdc711d0
ia64_handle_irq [kernel] 0xc0
[<e00000000440e920>] sp=0xe0000040fdc77cc0 bsp=0xe0000040fdc711d0
ia64_leave_kernel [kernel] 0x0
[<e0000000044161d0>] sp=0xe0000040fdc77e60 bsp=0xe0000040fdc71160
cpu_idle [kernel] 0xf0
[<e00000000499c380>] sp=0xe0000040fdc77e60 bsp=0xe0000040fdc71150
start_secondary [kernel] 0x80
[<e0000000044080c0>] sp=0xe0000040fdc77e60 bsp=0xe0000040fdc71150
start_ap [kernel] 0x1a0



Scenario 2-1: Hanging OS

In this scenario, the OS hangs, but the user is unable to determine why. The way to start troubleshooting is to gather stacks and logs.


Because the OS is not responding to telnet, ssh, or any attempt to log in, we must resort to another log collection method. In this case, we test the keyboard to see whether the system still responds to interrupts. As mentioned at the start of this section, an easy test is to press the Caps Lock key and see whether the Caps Lock light toggles on and off. If not, the hang is considered non-interruptible, which we discuss later. If the light does toggle and the Magic SysRq keys are enabled, gather the system registers by pressing the Alt+sysrq+p key combination.


The following is output from the register dump:


SysRq: Show Regs (showPc)

Process:0,{ swapper}
kernel 2.4.9-e.3smp
EIP: 0010:[<c010542e>] CPU: 0EIP is at default_idle [kernel] 0x2e
EFLAGS: 00000246 Not Tainted
EAX 00000000 EBX: c030a000 ECX: c030a000 EDX: 00000000
ESI: c0105400 EDI: c030a000 EBP: ffffe000 DS: 0018 ES:0018
CR0: 8005003b CR2: 0819d038 CR3: 1544c000 CR4: 000006d0

Call Trace: [<c0105492>] cpu_idle [kernel] 0x32
[<c0105000>] stext [kernel] 0x0
[<c02405e0>] .rodata.str1.32 [kernel] 0x560


../drivers/char/sysrq.c
...
static struct sysrq_key_op sysrq_showregs_op = {
.handler = sysrq_handle_showregs,
.help_msg = "showPc",
.action_msg = "Show Regs",
};


Referring to the register dump output, we can assume the machine is in an idle loop because the kernel is in the default_idle function and the machine is no longer responding. This message also informs us that the kernel is not "tainted." The latest source code provides us with the various tainted kernel states, as shown in the following code snippet.


linux/kernel/panic.c
...
/**
* print_tainted - return a string to represent the kernel taint state.
*
* 'P' - Proprietary module has been loaded.
* 'F' - Module has been forcibly loaded.
* 'S' - SMP with CPUs not designed for SMP.
* 'U' - Unsupported modules loaded.
* 'X' - Modules with external support loaded.
*
* The string is overwritten by the next call to print_taint().
*/

const char *print_tainted(void)
{
static char buf[20];
if (tainted) {
snprintf(buf, sizeof(buf), "Tainted: %c%c%c%c%c",
tainted & TAINT_MACHINE_CHECK ? 'M' : ' ',
tainted & TAINT_PROPRIETARY_MODULE ? 'P' : 'G',
tainted & TAINT_FORCED_MODULE ? 'F' : ' ',
tainted & TAINT_UNSAFE_SMP ? 'S' : ' ',
tainted & TAINT_NO_SUPPORT ? 'U' :
(tainted & TAINT_EXTERNAL_SUPPORT ? 'X' :
' '));
}
else
snprintf(buf, sizeof(buf), "Not tainted");
return(buf);
}


In most cases, if the kernel were in a tainted state, a tech support organization would suggest that you remove the "unsupported" kernel module that is tainting the kernel before proceeding to troubleshoot the issue. In this case, the kernel is not tainted, so we proceed along our original path.


Reviewing the register dump tells us the offset location for the instruction pointer. In this case, the offset is at default_idle+46 (0x2e hex = 46 dec). With this new information, we can use GDB to obtain the instruction details.


gdb vmlinux-2.4.9-e.3smp
(gdb) disassemble default_idle
Dump of assembler code for function default_idle:
0xc0105400 <default_idle>: mov $0xffffe000,%ecx
0xc0105405 <default_idle+5>: and %esp,%ecx
0xc0105407 <default_idle+7>: mov 0x20(%ecx),%edx
0xc010540a <default_idle+10>: mov %edx,%eax
0xc010540c <default_idle+12>: shl $0x5,%eax
0xc010540f <default_idle+15>: add %edx,%eax
0xc0105411 <default_idle+17>: cmpb $0x0,0xc0366985(,%eax,4)
0xc0105419 <default_idle+25>: je 0xc0105431 <default_idle+49>
0xc010541b <default_idle+27>: mov 0xc0365208,%eax
0xc0105420 <default_idle+32>: test %eax,%eax
0xc0105422 <default_idle+34>: jne 0xc0105431 <default_idle+49>
0xc0105424 <default_idle+36>: cli
0xc0105425 <default_idle+37>: mov 0x14(%ecx),%eax
0xc0105428 <default_idle+40>: test %eax,%eax
0xc010542a <default_idle+42>: jne 0xc0105430 <default_idle+48>
0xc010542c <default_idle+44>: sti
0xc010542d <default_idle+45>: hlt
0xc010542e <default_idle+46>: ret Our offset!
0xc010542f <default_idle+47>: nop
0xc0105430 <default_idle+48>: sti
0xc0105431 <default_idle+49>: ret
0xc0105432 <default_idle+50>: lea 0x0(%esi,1),%esi


Now we know that the OS is hung on a return to caller. At this point, we are stuck because this return could have been caused by some other instruction that had already taken place.




Solution 2-1: Update the Kernel

The problem was solved when we updated the kernel to the latest supported patch release. Before spending considerable time and resources tracking down what appears to be a bug, or "feature," as we sometimes say, confirm that the kernel and all the relevant applications have been patched or updated to their latest revisions. In this case, after the kernel was patched, the hang was no longer reproducible.


The Magic SysRq is logged in three places: the kernel message ring buffer (read by dmesg), the syslog, and the console. The package responsible for this is sysklogd, which provides klogd and syslogd. Of course, not all events are logged to the console. Event levels control whether something is logged to the console. To enable all messages to be printed on the console, set the log level to 8 through dmesg -n 8 or klogd -c 8. If you are already on the console, you can use the SysRq keys to indicate the log level by pressing Alt+sysrq+level, where level is a number from 0 to 8. More details on these commands can be found in the dmesg and klogd man pages and of course in the source code.


Reviewing the source, we can see that not all the keyboard characters are used.


See: /drivers/char/sysrq.c
...
static struct sysrq_key_op *sysrq_key_table[SYSRQ_KEY_TABLE_LENGTH] = {
/* 0 */ &sysrq_loglevel_op,
/* 1 */ &sysrq_loglevel_op,
/* 2 */ &sysrq_loglevel_op,
/* 3 */ &sysrq_loglevel_op,
/* 4 */ &sysrq_loglevel_op,
/* 5 */ &sysrq_loglevel_op,
/* 6 */ &sysrq_loglevel_op,
/* 7 */ &sysrq_loglevel_op,
/* 8 */ &sysrq_loglevel_op,
/* 9 */ &sysrq_loglevel_op,
/* a */ NULL, /* Don't use for system provided sysrqs,
it is handled specially on the spark
and will never arrive */
/* b */ &sysrq_reboot_op,
/* c */ &sysrq_crash_op,
/* d */ NULL,
/* e */ &sysrq_term_op,
/* f */ NULL,
/* g */ NULL,
/* h */ NULL,
/* i */ &sysrq_kill_op,
/* j */ NULL,
#ifdef CONFIG_VT
/* k */ &sysrq_SAK_op,
#else
/* k */ NULL,
#endif
/* l */ NULL,
/* m */ &sysrq_showmem_op,
/* n */ NULL,
/* o */ NULL, /* This will often be registered
as 'Off' at init time */
/* p */ &sysrq_showregs_op,
/* q */ NULL,
/* r */ &sysrq_unraw_op,
/* s */ &sysrq_sync_op,
/* t */ &sysrq_showstate_op,
/* u */ &sysrq_mountro_op,
/* v */ NULL,
/* w */ &sysrq_showcpus_op,
/* x */ NULL,
/* w */ NULL,
/* z */ NULL
};
...


Collecting the dump is more difficult if the machine is not set up properly. In the case of an interruptible hang, the syslog daemon might not be able to write to its message file. In this case, we have to rely on the console to collect the dump messages. If the only console on the machine is a Graphics console, you must write the dump out by hand. Note that the dump messages are written only to the virtual console, not to X Windows. The Linux kernel addresses this panic scenario by making the LEDs on the keyboard blink, notifying the administrator that this is not an OS hang but rather an OS panic.


The following 2.4 series source code illustrates this LED-blinking feature that is used when the Linux kernel pulls a panic. Notice that we start with the kernel/panic.c source to determine which functions are called and to see whether anything relating to blinking is referenced.


# linux/kernel/panic.c
...
for(;;) {
#if defined(CONFIG_X86) && defined(CONFIG_VT)
extern void panic_blink(void);
panic_blink();
#endif
CHECK_EMERGENCY_SYNC
}
...


We tracked down the panic_blink() function in the following source:


# linux/drivers/char/pc_keyb.c
...
static int blink_frequency = HZ/2;

/* Tell the user who may be running in X and not see the console that
we have panicked. This is to distinguish panics from "real" lockups.
Could in theory send the panic message as Morse, but that is left as
an exercise for the reader. */
void panic_blink(void)
{
static unsigned long last_jiffie;
static char led;
/* Roughly 1/2s frequency. KDB uses about 1s. Make sure it is
different. */
if (!blink_frequency)
return;
if (jiffies - last_jiffie > blink_frequency) {
led ^= 0x01 | 0x04;
while (kbd_read_status() & KBD_STAT_IBF) mdelay(1);
kbd_write_output(KBD_CMD_SET_LEDS);
mdelay(1);
while (kbd_read_status() & KBD_STAT_IBF) mdelay(1);
mdelay(1);
kbd_write_output(led);
last_jiffie = jiffies;
}
}

static int __init panicblink_setup(char *str)
{
int par;
if (get_option(&str,&par))
blink_frequency = par*(1000/HZ);
return 1;
}
/* panicblink=0 disables the blinking as it caused problems with some
console switches. Otherwise argument is ms of a blink period. */
__setup("panicblink=", panicblink_setup);


By default, the 2.6 kernel release does not include the panic_blink() function. It was later added through a patch.


Even though this source informs the user that the machine has pulled a panic, it does not give us the stack or even the kernel state. Maybe, if we were lucky, klogd was able to write it to the syslog, but if we were lucky, the machine would not have panicked. For this reason, we recommend configuring a serial console so that you can collect the panic message if a panic takes place. Refer to Scenario 2-3 for an illustration.





Troubleshooting Non-Interruptible Hangs


The non-interruptible hang is the worst kind of hang because the standard troubleshooting techniques mentioned previously do not work, so correcting the problem is substantially more difficult. Again, the first step is to confirm that the hardware is supported and that all the drivers have been tested with this configuration. Keep in mind that hardware vendors and OS distributions spend vast resources confirming the supported configurations. Therefore, if the machine you are troubleshooting is outside of the supported configuration, it would be considered a best effort by those in the Linux community. It would be best to remove the unsupported hardware or software and see whether the problem persists.


Try to determine what the OS was doing before the hang. Ask these questions: Does the hang occur frequently? What application(s) are spawned at the time of the hang? What, if any, hardware is being accessed (for example, tape, disk, CD, DVD, and so on)? What software or hardware changes have been made on the system since the hangs began?


The answers to these questions provide "reference points" and establish an overall direction for troubleshooting the hang. For example, an application using a third-party driver module might have caused a crash that left CPU spinlocks (hardware locks) in place. In this case, it is probably not a coincidence that the machine hung every time when the user loaded his or her new driver module.


You should attempt to isolate the application or hardware interfaces being used. This might be futile, though, because the needed information might not be contained within the logs. Chances are, the CPU is looping in some type of spinlock, or the kernel attempts to crash, but the bus is in a compromised state, preventing the kernel from proceeding with the crash handler.


When the kernel has gotten into a non-interruptible hang, the goal is to get the kernel to pull a panic, save state, and create a dump. Linux achieves this goal on some platforms with a boot option to the kernel, enabling non-maskable interrupts (nmi_watchdog). More detailed information can be found in linux/Documentation/nmi_watchdog.txt. In short, the kernel sends an interrupt to the CPU every five seconds. As long as the CPU responds, the kernel stays up. When the interrupt does not return, the NMI handler generates an oops, which can be used to debug the hang. However, as with interruptible hangs, we must be able to collect the kernel dump, and this is where the serial console plays a role. If panic_on_oops is enabled, the kernel pulls a panic(), enabling other dump collection mechanisms, which are discussed later in this chapter.


We recommend getting the hardware and distribution vendors involved. As stated earlier, they have already put vast resources into confirming the hardware and software operability.


Thus, the most effective ways to troubleshoot a non-interruptible hang can be obvious. For example, sometimes it is important to take out all unnecessary hardware and drivers, leaving the machine in a "bare bones" state. It might also be important to confirm that the OS kernel is fully up-to-date on any patches. In addition, stop all unnecessary software.



Scenario 2-2: Linux 2.4.9-e.27 Kernel

In this scenario, the user added new fiber cards to an existing database system, and now the machine hangs intermittently, always at night when peak loads are down. Nothing of note seems to trigger it, although the user is running an outdated kernel. The user tried gathering a sysrq+p for each processor on the system, which requires two keystrokes for each processor. With newer kernels, sysrq+w performs a "walk" of all the CPUs with one keystroke. The user then tries gathering sysrq+m and sysrq+t. Unfortunately, the console is hung and does not accept break sequences.


The next step is to disable all unnecessary drivers and to enable a forced kernel panic. The user set up nmi-watchdog so that he could get a forced oops panic. Additionally, all hardware monitors were disabled, and unnecessary driver modules were unloaded. After disabling the hardware monitors, the user noticed that the hangs stopped occurring.




Solution 2-2: Update the Hardware Monitor Driver

We provided the configuration and troubleshooting methodology (action plan) to the hardware vendor so that its staff could offer a solution. The fact that this symptom only took place when running their monitor was enough ammunition. The hardware event lab was aware of such events and already had released a newer monitor. While the administrator was waiting on the newer monitor module to be available, he removed the old one, preventing the kernel from experiencing hangs.


If the kernel hang persists and the combination of these tools does not lead to an answer, enabling kernel debugging might be the way to go. Directions on how to use KDB can be found at http://oss.sgi.com/projects/kdb/.














No comments:

Post a Comment