OS Hangs

OS hangs come in two types: interruptible and non-interruptible. The first step in remedying a hang is to identify its type. A hang is interruptible when the system still responds to an external interrupt and non-interruptible when it does not.

To determine whether the system responds to an external interrupt, try a ping test and check for a response. If a keyboard is attached, press the Caps Lock key to see whether the Caps Lock light toggles. If you have console access, check whether the console returns a new line when you press the Enter key. If one or more of these tests yields the expected response, you have an interruptible hang.

Note: Any time an OS hangs, one or more offending processes are usually responsible for the hang. This is true whether software or hardware is ultimately to blame. Even when a hardware problem exists, a process has made a request that the hardware could not fulfill, so processes stack up as a result.

Troubleshooting Interruptible Hangs

The first step in troubleshooting an interruptible hang is obtaining a stack trace of the offending process or processes by using the Magic SysRq keystroke. Some Linux distributions have this functionality enabled by default, whereas others do not. We recommend always having this functionality enabled. The following example shows how to enable it.

Check whether the Magic SysRq is enabled (a value of 0 means it is disabled):

    # cat /proc/sys/kernel/sysrq

Because it is not enabled, enable it:

    # echo 1 > /proc/sys/kernel/sysrq

Alternatively, we can use the sysctl command, which reads the current value with -n and sets it with -w:

    # sysctl -n kernel.sysrq
    # sysctl -w kernel.sysrq=1

To make this setting persistent, just put an entry into the configuration file:

    # /etc/sysctl.conf
    kernel.sysrq = 1

When the functionality is enabled, a stack trace can be obtained by sending a Break+t to the console. Unfortunately, this can be more difficult than it first appears.
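The check described above can be wrapped in a small helper. The following is only a minimal sketch: the function name sysrq_status and its optional path argument are hypothetical, and it assumes the usual meaning of the value in /proc/sys/kernel/sysrq (0 is disabled, 1 is fully enabled, and on newer kernels a larger number is a bitmask of allowed functions).

```shell
#!/bin/sh
# Sketch: report whether Magic SysRq is enabled.
# sysrq_status and its optional path argument are hypothetical names
# used for illustration; the default path is the real /proc entry.

sysrq_status() {
    sysrq_file="${1:-/proc/sys/kernel/sysrq}"

    # Without a readable file we cannot tell (for example, SysRq support
    # not compiled into the kernel).
    if [ ! -r "$sysrq_file" ]; then
        echo "unknown"
        return 1
    fi

    value=$(cat "$sysrq_file")
    case "$value" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        # Assumption: values above 1 are a bitmask (newer kernels only).
        *) echo "partially enabled (bitmask $value)" ;;
    esac
}
```

Calling sysrq_status with no argument reads the live setting. If it prints "disabled", the echo or sysctl commands shown earlier (run as root) turn the feature on.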
With a standard VGA console, the stack trace is obtained with the Alt+SysRq+t keystroke combination; however, the keystroke combination is different for other console emulators, in which case you need to determine the required key sequence by contacting the particular manufacturer. For example, if a Windows user utilizes emulation software, such as Reflections, key mapping can be an issue. Linux distributions sometimes provide tools such as cu and minicom, which do not affect the key mapping by default.

With the latest 2.4 kernel releases, a new file called sysrq-trigger is introduced in /proc. By simply echoing a predefined character to this file, the administrator avoids the need to send a break to the console. However, if the terminal window is hung, the break sequence is the only way.

The following example shows how to use the functionality to obtain a stack trace. Using a serial console or virtual console (/dev/ttyS0 or /dev/vc/1), press Alt+SysRq+h. If sysrq is enabled, entering this key combination produces the following output on the screen:

    kernel: SysRq : HELP : loglevel0-8 reBoot tErm kIll saK showMem Off

In this output, the capital letter in each command name is the key to press. The following is a description of each parameter:
With the latest kernels, it is possible to test the "Magic SysRq" functionality by writing a "key" character from the previous list to the /proc/sysrq-trigger file. The following example causes the CPU stacks to be written to the kernel ring buffer:

    # echo w > /proc/sysrq-trigger

To view the processor stacks, simply execute the dmesg command or view the syslog:

    # dmesg

Scenario 2-1: Hanging OS

In this scenario, the OS hangs, but the user is unable to determine why. The way to start troubleshooting is to gather stacks and logs. Because the OS is not responding to telnet, ssh, or any attempt to log in, we must resort to another log collection method. In this case, we test the keyboard to see whether the system still responds to interrupts. As mentioned at the start of this section, an easy test is to press the Caps Lock key and see whether the Caps Lock light toggles on and off. If not, the hang is considered non-interruptible, which we discuss later. If the light does toggle and the Magic SysRq keys are enabled, gather the system registers by pressing the Alt+SysRq+p key combination.

The following is output from the register dump:

    SysRq: Show Regs (showPc)

Referring to the register dump output, we can assume the machine is in an idle loop because the kernel is in the default_idle function and the machine is no longer responding. This message also informs us that the kernel is not "tainted." The latest source code provides us with the various tainted kernel states, as shown in the following code snippet.

    linux/kernel/panic.c

In most cases, if the kernel were in a tainted state, a tech support organization would suggest that you remove the "unsupported" kernel module that is tainting the kernel before proceeding to troubleshoot the issue. In this case, the kernel is not tainted, so we proceed along our original path.

Reviewing the register dump tells us the offset location for the instruction pointer. In this case, the offset is at default_idle+46 (0x2e hex = 46 dec).
With this new information, we can use GDB to obtain the instruction details:

    gdb vmlinux-2.4.9-e.3smp

Now we know that the OS is hung on a return to caller. At this point, we are stuck because this return could have been caused by some other instruction that had already taken place.

Solution 2-1: Update the Kernel

The problem was solved when we updated the kernel to the latest supported patch release. Before spending considerable time and resources tracking down what appears to be a bug, or "feature," as we sometimes say, confirm that the kernel and all the relevant applications have been patched or updated to their latest revisions. In this case, after the kernel was patched, the hang was no longer reproducible.

Magic SysRq output is logged in three places: the kernel message ring buffer (read by dmesg), the syslog, and the console. The package responsible for this is sysklogd, which provides klogd and syslogd. Of course, not all events are logged to the console. Event levels control whether something is logged to the console. To enable all messages to be printed on the console, set the log level to 8 through dmesg -n 8 or klogd -c 8. If you are already on the console, you can use the SysRq keys to set the log level by pressing Alt+SysRq+level, where level is a number from 0 to 8. More details on these commands can be found in the dmesg and klogd man pages and of course in the source code.

Reviewing the source, we can see that not all the keyboard characters are used. See: /drivers/char/sysrq.c

Collecting the dump is more difficult if the machine is not set up properly. In the case of an interruptible hang, the syslog daemon might not be able to write to its message file. In this case, we have to rely on the console to collect the dump messages. If the only console on the machine is a graphics console, you must write the dump out by hand. Note that the dump messages are written only to the virtual console, not to the X Window System.
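Before relying on the console to capture SysRq output, it can help to confirm the console log level discussed above. The following is a rough sketch rather than a definitive tool: the function names and the optional path argument are hypothetical, and it assumes that the first field of /proc/sys/kernel/printk holds the current console log level.

```shell
#!/bin/sh
# Sketch: inspect the current console log level.
# console_loglevel and console_shows_all are hypothetical helper names.
# Assumption: the first field of /proc/sys/kernel/printk is the
# current console log level.

console_loglevel() {
    printk_file="${1:-/proc/sys/kernel/printk}"
    [ -r "$printk_file" ] || return 1
    # Print only the first whitespace-separated field.
    awk '{ print $1; exit }' "$printk_file"
}

# Succeeds when the console log level is 8 or higher, which is the
# setting that "dmesg -n 8" establishes so all messages reach the console.
console_shows_all() {
    level=$(console_loglevel "$1") || return 1
    [ "$level" -ge 8 ]
}
```

For example, `console_shows_all || dmesg -n 8` (as root) raises the level only when it is not already high enough.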
The Linux kernel addresses this panic scenario by making the LEDs on the keyboard blink, notifying the administrator that this is not an OS hang but rather an OS panic. The following 2.4 series source code illustrates this LED-blinking feature that is used when the Linux kernel pulls a panic. Notice that we start with the kernel/panic.c source to determine which functions are called and to see whether anything relating to blinking is referenced.

    # linux/kernel/panic.c

We tracked down the panic_blink() function in the following source:

    # linux/drivers/char/pc_keyb.c

By default, the 2.6 kernel release does not include the panic_blink() function. It was later added through a patch.

Even though this source informs the user that the machine has pulled a panic, it does not give us the stack or even the kernel state. Maybe, if we were lucky, klogd was able to write it to the syslog, but if we were lucky, the machine would not have panicked. For this reason, we recommend configuring a serial console so that you can collect the panic message if a panic takes place. Refer to Scenario 2-3 for an illustration.

Troubleshooting Non-Interruptible Hangs

The non-interruptible hang is the worst kind of hang because the standard troubleshooting techniques mentioned previously do not work, so correcting the problem is substantially more difficult. Again, the first step is to confirm that the hardware is supported and that all the drivers have been tested with this configuration. Keep in mind that hardware vendors and OS distributions spend vast resources confirming the supported configurations. Therefore, if the machine you are troubleshooting is outside of the supported configuration, any help from those in the Linux community would be on a best-effort basis. It would be best to remove the unsupported hardware or software and see whether the problem persists.

Try to determine what the OS was doing before the hang. Ask these questions:

- Does the hang occur frequently?
- What application(s) are spawned at the time of the hang?
- What, if any, hardware is being accessed (for example, tape, disk, CD, DVD, and so on)?
- What software or hardware changes have been made on the system since the hangs began?

The answers to these questions provide "reference points" and establish an overall direction for troubleshooting the hang. For example, an application using a third-party driver module might have caused a crash that left CPU spinlocks (hardware locks) in place. In this case, it is probably not a coincidence that the machine hung every time the user loaded his or her new driver module.

You should attempt to isolate the application or hardware interfaces being used. This might be futile, though, because the needed information might not be contained within the logs. Chances are, the CPU is looping in some type of spinlock, or the kernel attempts to crash, but the bus is in a compromised state, preventing the kernel from proceeding with the crash handler.

When the kernel has gotten into a non-interruptible hang, the goal is to get the kernel to pull a panic, save state, and create a dump. Linux achieves this goal on some platforms with a boot option to the kernel, enabling the non-maskable interrupt watchdog (nmi_watchdog). More detailed information can be found in linux/Documentation/nmi_watchdog.txt. In short, the watchdog raises periodic non-maskable interrupts, and the NMI handler checks whether each CPU is still making progress. As long as the CPU responds, the kernel stays up. When a CPU has been locked up for about five seconds, the NMI handler generates an oops, which can be used to debug the hang. However, as with interruptible hangs, we must be able to collect the kernel dump, and this is where the serial console plays a role. If panic_on_oops is enabled, the kernel pulls a panic(), enabling other dump collection mechanisms, which are discussed later in this chapter.

We recommend getting the hardware and distribution vendors involved.
As stated earlier, they have already put vast resources into confirming the hardware and software operability. Thus, the most effective ways to troubleshoot a non-interruptible hang are often the most obvious ones. For example, sometimes it is important to take out all unnecessary hardware and drivers, leaving the machine in a "bare bones" state. It might also be important to confirm that the OS kernel is fully up-to-date on any patches. In addition, stop all unnecessary software.

Scenario 2-2: Linux 2.4.9-e.27 Kernel

In this scenario, the user added new fiber cards to an existing database system, and now the machine hangs intermittently, always at night when peak loads are down. Nothing of note seems to trigger it, although the user is running an outdated kernel.

The user tried gathering a sysrq+p for each processor on the system, which requires two keystrokes for each processor. With newer kernels, sysrq+w performs a "walk" of all the CPUs with one keystroke. The user then tried gathering sysrq+m and sysrq+t. Unfortunately, the console was hung and did not accept break sequences.

The next step was to disable all unnecessary drivers and to enable a forced kernel panic. The user set up nmi_watchdog so that he could get a forced oops panic. Additionally, all hardware monitors were disabled, and unnecessary driver modules were unloaded. After disabling the hardware monitors, the user noticed that the hangs stopped occurring.

Solution 2-2: Update the Hardware Monitor Driver

We provided the configuration and troubleshooting methodology (action plan) to the hardware vendor so that its staff could offer a solution. The fact that this symptom took place only when running their monitor was enough evidence. The hardware event lab was aware of such events and had already released a newer monitor. While the administrator was waiting for the newer monitor module to become available, he removed the old one, preventing the kernel from experiencing hangs.
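The scenario above depends on the NMI watchdog and panic_on_oops being set up before the hang occurs. The following is a minimal sketch for checking both; the function names and optional path arguments are hypothetical, and it assumes that the boot option appears in /proc/cmdline as nmi_watchdog=<value> and that the running kernel exposes /proc/sys/kernel/panic_on_oops (not every kernel version does).

```shell
#!/bin/sh
# Sketch: check the settings that turn a hard hang into a collectable oops.
# nmi_watchdog_bootarg and panic_on_oops_enabled are hypothetical names.

# Print the value of the nmi_watchdog= boot option, or nothing if the
# option was not passed to the kernel.
nmi_watchdog_bootarg() {
    cmdline_file="${1:-/proc/cmdline}"
    [ -r "$cmdline_file" ] || return 1
    # Put each boot argument on its own line, then strip the option name.
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^nmi_watchdog=//p' | head -n 1
}

# Succeed only when panic_on_oops is set to 1.
# Assumption: the kernel provides this /proc entry.
panic_on_oops_enabled() {
    oops_file="${1:-/proc/sys/kernel/panic_on_oops}"
    [ -r "$oops_file" ] || return 1
    [ "$(cat "$oops_file")" = "1" ]
}
```

An empty result from nmi_watchdog_bootarg suggests the option still needs to be added to the boot loader configuration, followed by a reboot, before the next hang can be captured this way.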
If the kernel hang persists and the combination of these tools does not lead to an answer, enabling kernel debugging might be the way to go. Directions on how to use KDB can be found at http://oss.sgi.com/projects/kdb/.
Saturday, November 7, 2009