2.11BSD#431 - crash.0



CRASH(8V)	    UNIX Programmer's Manual		CRASH(8V)


NAME
     crash - what happens when the system crashes

DESCRIPTION
     This section explains what happens when the system crashes
     and (very briefly) how to analyze crash dumps.

     When the system crashes voluntarily it prints a message of
     the form

	  panic: why i gave up the ghost

     on the console, takes a dump on a mass storage peripheral,
     and then invokes an automatic reboot procedure as described
     in reboot(8).  Unless some unexpected inconsistency is
     encountered in the state of the file systems due to hardware
     or software failure, the system will then resume multi-user
     operations.  If the automatic file system check fails, the
     file systems should be checked and repaired with fsck(8)
     before continuing.

     The system has a large number of internal consistency
     checks; if one of these fails, then it will panic with a
     very short message indicating which one failed.  In many
     instances, this will be the name of the routine which
     detected the error, or a two-word description of the
     inconsistency.  A full understanding of most panic messages
     requires perusal of the source code for the system.

     The most common cause of system failures is hardware
     failure, which can reflect itself in different ways.  Here
     are the messages which are most likely, with some hints as
     to causes.  Left unstated in all cases is the possibility
     that hardware or software error produced the message in some
     unexpected way.

     iinit
	  This cryptic panic message results from a failure to
	  mount the root filesystem during the bootstrap process.
	  Either the root filesystem has been corrupted, or the
	  system is attempting to use the wrong device as root
	  filesystem.  Usually, an alternate copy of the system
	  binary or an alternate root filesystem can be used to
	  bring up the system to investigate.

     Can't exec /etc/init
	  This is not a panic message, as reboots are likely to
	  be futile.  Late in the bootstrap procedure, the system
	  was unable to locate and execute the initialization
	  process, init(8).  The root filesystem is incorrect or
	  has been corrupted, or the mode or type of /etc/init
	  forbids execution.


Printed 11/26/99	  July 11, 1987                         1


     hard IO err in swap
	  The system encountered an error trying to write to the
	  swap device or an error in reading critical information
	  from a disk drive.  The offending disk should be fixed
	  if it is broken or unreliable.

     timeout table overflow
	  This really shouldn't be a panic, but until the data
	  structure involved is made to be extensible, running
	  out of entries causes a crash.  If this happens, make
	  the timeout table bigger.  (NCALL in param.c)

     trap type %o
	  An unexpected trap has occurred within the system; the
	  trap types (in octal) are:

	  0    bus error
	  1    illegal instruction trap
	  2    BPT/trace trap
	  3    IOT
	  4    power fail trap (if autoreboot fails)
	  5    EMT
	  6    recursive system call (TRAP instruction)
	  7    programmed interrupt request
	  11   protection fault (segmentation violation)
	  12   parity trap

     In some of these cases it is possible for octal 020 to be
     added into the trap type; this indicates that the processor
     was in user mode when the trap occurred.
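
     The decoding described above can be sketched as follows.
     This is only an illustration, not part of the system:
     decode_trap and TRAP_NAMES are hypothetical names, and the
     sketch assumes the table's numbers are octal, as the %o in
     the message suggests.

```python
# Trap names copied from the table above; the numbers are assumed octal.
TRAP_NAMES = {
    0o0: "bus error",
    0o1: "illegal instruction trap",
    0o2: "BPT/trace trap",
    0o3: "IOT",
    0o4: "power fail trap (if autoreboot fails)",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "programmed interrupt request",
    0o11: "protection fault (segmentation violation)",
    0o12: "parity trap",
}

USER_MODE_BIT = 0o20  # added into the trap type for traps from user mode


def decode_trap(printed):
    """Decode the octal digits printed after 'trap type' (hypothetical helper).

    Returns (description, from_user_mode).
    """
    value = int(printed, 8)
    from_user = bool(value & USER_MODE_BIT)
    base = value & ~USER_MODE_BIT
    return TRAP_NAMES.get(base, "not listed in this manual page"), from_user


if __name__ == "__main__":
    for digits in ("0", "11", "31"):
        print(digits, decode_trap(digits))
```

     A printed type of 31, for example, would be read as trap
     type 11 with the user mode bit added in.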

     In addition to the trap type, the system will have printed
     out three (or four) other numbers: ka6, which is the
     contents of the segmentation register for the area in which
     the system's stack is kept; aps, which is the location where
     the hardware stored the program status word during the trap;
     pc, which was the system's program counter when it faulted
     (already incremented to the next word); and __ovno, the
     overlay number of the currently loaded kernel overlay when
     the trap occurred.

     The favorite trap types in system crashes are trap types 0
     and 11, indicating a wild reference.  The code is the refer-
     enced address, and the pc at the time of the fault is
     printed.  These problems tend to be easy to track down if
     they are kernel bugs since the processor stops cold, but
     random flakiness seems to cause this sometimes.  The
     debugger can be used to locate the instruction and subrou-
     tine corresponding to the PC value.  If that is insufficient
     to suggest the nature of the problem, more detailed examina-
     tion of the system status at the time of the trap usually
     can produce an explanation.


     init died
	  The system initialization process has exited.  This is
	  bad news, as no new users will then be able to log in.
	  Rebooting is the only fix, so the system just does it
	  right away.

     out of mbufs: map full
	  The network has exhausted its private page map for net-
	  work buffers.  This usually indicates that buffers are
	  being lost, and rather than allow the system to slowly
	  degrade, it reboots immediately.  The map may be made
	  larger if necessary.

     out of swap space
	  This really shouldn't be a panic, but there's no other
	  satisfactory solution.  The size of the swap area must
	  be increased.  The system attempts to avoid running out
	  of swap by refusing to start new processes when short
	  of swap space (resulting in ``No more processes''
	  messages from the shell).

     &remap_area > SEG5
     _end > SEG5
	  The kernel detected at boot time that an unacceptable
	  portion of its data space extended into the region con-
	  trolled by KDSA5.  In the case of the first message,
	  the size of the kernel's data segment (excluding the
	  file, proc, and text tables) must be decreased.  In the
	  latter case, there are two possibilities: if
	  &remap_area is not greater than 0120000, the kernel
	  must be recompiled without defining the option NOKA5.
	  Otherwise, as above, the size of the kernel's data seg-
	  ment must be decreased.
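
     The choice of remedy for these two messages can be
     summarized in a short sketch.  The function name
     seg5_panic_advice is hypothetical, and the sketch assumes
     that SEG5 is the address 0120000 against which the text
     compares &remap_area.

```python
SEG5 = 0o120000  # assumed: the address the text compares &remap_area against


def seg5_panic_advice(message, remap_area):
    """Return the remedy the text gives for each boot-time SEG5 panic.

    message is the panic text; remap_area is the address of remap_area.
    """
    shrink = "decrease the size of the kernel's data segment"
    if message == "&remap_area > SEG5":
        return shrink
    if message == "_end > SEG5":
        if remap_area <= SEG5:
            return "recompile the kernel without defining NOKA5"
        return shrink
    raise ValueError("not one of the two SEG5 panic messages")


if __name__ == "__main__":
    print(seg5_panic_advice("_end > SEG5", 0o117000))
```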

     That completes the list of panic types you are likely to
     see.  There are many other panic messages which are less
     likely to occur; most of them detect logical inconsistencies
     within the kernel and thus ``cannot happen'' unless some
     part of the kernel has been modified.

     If the system stops or hangs without a panic, it is possible
     to stop it and take a dump of memory before rebooting.  A
     panic can be forced from the console, which will allow a
     dump, automatic reboot and file system check.  This is
     accomplished by halting the CPU, putting the processor in
     kernel mode, loading the PC with 40, and continuing without
     a reset (use continue, not start).  To put the processor in
     kernel mode, make sure the two high bits in the processor
     status word are zero.  (You'll need to consult the processor
     handbook describing your processor to determine how to
     access the PC and PS.)  The message ``panic:  forced from
     console'' should print, and the automatic reboot will start.
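
     The values to deposit can be worked out ahead of time.  The
     sketch below only illustrates the bit arithmetic; the
     function names are hypothetical, and it assumes a 16-bit
     processor status word whose two high bits select the
     current mode, as described above.

```python
PS_MODE_MASK = 0o140000   # the two high bits of the 16-bit status word
FORCED_PANIC_PC = 0o40    # PC value to load before continuing


def is_kernel_mode(ps):
    """True when the two high bits of the status word are zero."""
    return (ps & PS_MODE_MASK) == 0


def force_kernel_mode(ps):
    """Return the status word with its two high bits cleared."""
    return ps & 0o177777 & ~PS_MODE_MASK


if __name__ == "__main__":
    ps = 0o170000  # example value with both high bits set
    print("deposit PS = %06o, PC = %o" % (force_kernel_mode(ps), FORCED_PANIC_PC))
```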


     If this fails, a dump of memory can be made on magtape: mount
     a tape (with write ring!), halt the CPU, load address 044,
     and perform a start (which does a reset).  This should write
     a copy of all of core on the tape with an EOF mark.
     Caution: Any error is taken to mean the end of core has been
     reached.  This means that you must be sure the ring is in,
     the tape is ready, and the tape is clean and new.  If the
     dump fails, you can try again, but some of the registers
     will be lost.  After this completes, halt again and reboot.

     After rebooting, or after an automatic file system check
     fails, check and fix the file systems with fsck.  If the
     system will not reboot, a runnable system must be obtained
     from a backup medium after verifying that the hardware is
     functioning normally.  A damaged root file system should be
     patched while running with an alternate root if possible.

     When the system crashes, if crash dumping was enabled, it
     writes (or at least attempts to write) an image of memory
     into the back end of the dump device, usually the same as
     the primary swap area.  After the system is rebooted, the
     program savecore(8) runs and preserves a copy of this core
     image and the current system in a specified directory for
     later perusal.  See savecore(8) for details.  A magtape dump
     can be read onto disk with dd(1).

     To analyze a dump you should begin by running adb(1) with
     the -k flag on the system load image and core dump.  If the
     core image is the result of a panic, the panic message is
     printed.  Normally the command ``$c'' or ``$C'' will provide
     a stack trace from the point of the crash and this will
     provide a clue as to what went wrong.  ps(1) and pstat(8)
     can also be used to print the process table at the time of
     the crash via: ps -alxk and pstat -p.  If the mapping or
     the stack frame are incorrect, the following magic locations
     may be examined in an attempt to find out what went wrong.
     The registers R0, R1, R2, R3, R4, R5, SP, and KDSA6 (or
     KISA6 for machines without separate instruction and data)
     are saved at location 04.  If the core dump was taken on
     disk, these values also appear at 0300.  The value of KDSA6
     (KISA6) multiplied by 0100 (octal) gives the address of the user
     structure and kernel stack for the running process.  Relabel
     these addresses 0140000 through 0142000.  R5 is C's frame or
     display pointer.  Stored at (R5) is the old R5 pointing to
     the previous stack frame.  At (R5)+2 is the saved PC of the
     calling procedure.  Trace this calling chain to an R5 value
     of 0141756 (0141754 for overlaid kernels), which is where
     the user's R5 is stored.  If the chain is broken, look for a
     plausible R5, PC pair and continue from there.  In most
     cases this procedure will give an idea of what is wrong.
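
     The manual procedure above can be sketched in a few lines.
     This is an illustration under stated assumptions, not a
     replacement for adb(1): the helper names are hypothetical,
     the dump is assumed to be a raw image of memory in which a
     byte's offset equals its address, and words are assumed to
     be stored low byte first.

```python
CLICK = 0o100            # KDSA6/KISA6 counts units of 0100 (octal) bytes
U_BASE = 0o140000        # label given to the start of the user structure
U_END = 0o142000         # label given to the end of the relabeled region
USER_R5_SLOT = 0o141756  # use 0o141754 for overlaid kernels


def u_area_address(kdsa6):
    """Address of the user structure and kernel stack in the dump."""
    return kdsa6 * CLICK


def read_word(core, addr):
    """Fetch one 16-bit word (low byte first) from the dump image."""
    return core[addr] | (core[addr + 1] << 8)


def walk_frames(core, kdsa6, r5, user_r5_slot=USER_R5_SLOT, limit=64):
    """Follow the chain of saved R5 values.

    Returns a list of (r5, saved_pc) pairs, stopping at the slot that
    holds the user's R5 or at the first implausible frame pointer.
    """
    base = u_area_address(kdsa6)
    frames = []
    while r5 != user_r5_slot and len(frames) < limit:
        offset = r5 - U_BASE
        in_range = 0 <= offset <= U_END - U_BASE - 4
        if r5 % 2 or not in_range or base + offset + 4 > len(core):
            break  # chain is broken; look for a plausible R5, PC pair by hand
        frames.append((r5, read_word(core, base + offset + 2)))  # PC at (R5)+2
        r5 = read_word(core, base + offset)                      # old R5 at (R5)
    return frames
```

     Each saved PC returned can then be looked up with the
     debugger to name the calling routine.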


     A more complete discussion of system debugging is impossible
     here.  See, however, ``Using ADB to Debug the UNIX Kernel''.

SEE ALSO
     adb(1), ps(1), pstat(8), boot(8), fsck(8), reboot(8),
     savecore(8)
     PDP-11 Processor Handbook for various processors for more
     information about PDP-11 memory management and general
     architecture.
     Using ADB to Debug the UNIX Kernel


2.11BSD#431 - /usr/man/cat8/crash.0