# lib/lockup_detector -- library for detecting kernel lockups
This library provides instrumentation for detecting different kinds of
lockups.
The library has two tools, a critical section detector and a CPU
heartbeat checker.
## Critical Section Detector
The critical section detector detects when the kernel has remained in
a critical section for "too long". Critical sections are marked using
`LOCKUP_TIMED_BEGIN()` and `LOCKUP_TIMED_END()`. The code executing
during an SMC call is an example of a critical section. A function
that temporarily disables interrupts would be another example.
### Self Checking and Cross Checking
Currently, detection is performed in two ways.
First, each time a CPU leaves a critical section (calls
`LOCKUP_TIMED_END()`), it measures the time spent in the critical
section and updates the kcounters that track the number of
long-running critical sections seen so far. It also tracks the
"worst case" critical section time for that CPU.
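
The self check described above can be sketched as a small simulation. This is not the kernel's implementation: the class name, its fields, and the nanosecond threshold are hypothetical, and timestamps are passed in explicitly rather than read from a hardware clock.

```python
class CriticalSectionTracker:
    """Per-CPU bookkeeping for timed critical sections (illustrative only)."""

    def __init__(self, threshold_ns):
        self.threshold_ns = threshold_ns  # hypothetical "too long" threshold
        self.begin_ns = None              # None means "not in a critical section"
        self.long_section_count = 0       # stands in for a kcounter
        self.worst_case_ns = 0            # longest section seen on this CPU

    def begin(self, now_ns):
        # Analogous to LOCKUP_TIMED_BEGIN(): remember when the section started.
        self.begin_ns = now_ns

    def end(self, now_ns):
        # Analogous to LOCKUP_TIMED_END(): measure, count, and track worst case.
        elapsed_ns = now_ns - self.begin_ns
        self.begin_ns = None
        if elapsed_ns > self.threshold_ns:
            self.long_section_count += 1
        self.worst_case_ns = max(self.worst_case_ns, elapsed_ns)
        return elapsed_ns


tracker = CriticalSectionTracker(threshold_ns=3_000)
tracker.begin(now_ns=1_000)
tracker.end(now_ns=2_500)    # 1,500 ns: under the threshold
tracker.begin(now_ns=10_000)
tracker.end(now_ns=15_000)   # 5,000 ns: over the threshold, new worst case
```

After the two sections above, the tracker has counted one long-running section and recorded a worst case of 5,000 ns.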
Second, if a CPU calls `LOCKUP_TIMED_BEGIN()` but never calls
`LOCKUP_TIMED_END()`, and it has spent more than the configured
threshold amount of time in the critical section, another CPU will
report the lockup as a `KERNEL_OOPS` when it performs its heartbeat
check. If the amount of time spent by the locked-up CPU in the
critical section exceeds the fatal threshold, the kernel will generate
a crashlog and reboot, recording a reboot reason of
`SOFTWARE_WATCHDOG` in the crashlog.
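
The cross check can be illustrated with a pure function that a checking CPU might apply to another CPU's recorded begin timestamp. The function name, the return values, and the example numbers are hypothetical. Treating a threshold of 0 as "this check is disabled" is an assumption, chosen to agree with the Threshold section's statement that setting both values to 0 disables detection.

```python
def cross_check(begin_ns, now_ns, threshold_ns, fatal_threshold_ns):
    """Classify another CPU's critical section as seen by the checking CPU."""
    if begin_ns is None:
        return "ok"  # the observed CPU is not in a critical section
    elapsed_ns = now_ns - begin_ns
    # Assumption: a threshold of 0 disables the corresponding check.
    if fatal_threshold_ns > 0 and elapsed_ns > fatal_threshold_ns:
        return "fatal"  # would produce a crashlog and a SOFTWARE_WATCHDOG reboot
    if threshold_ns > 0 and elapsed_ns > threshold_ns:
        return "oops"   # would be reported as a KERNEL_OOPS
    return "ok"


# Illustrative values only; these are not the kernel's defaults.
idle = cross_check(None, now_ns=500, threshold_ns=30, fatal_threshold_ns=90)
short = cross_check(100, now_ns=120, threshold_ns=30, fatal_threshold_ns=90)
stuck = cross_check(100, now_ns=150, threshold_ns=30, fatal_threshold_ns=90)
wedged = cross_check(100, now_ns=200, threshold_ns=30, fatal_threshold_ns=90)
disabled = cross_check(100, now_ns=200, threshold_ns=0, fatal_threshold_ns=0)
```

The fatal check runs first so that a CPU past both thresholds is classified as fatal rather than as an ordinary oops.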
The `k lockup status` command can be used to check if a CPU is
currently in a critical section, and to print the current worst case
critical section times for each CPU.
### Threshold
Choosing appropriate thresholds for the critical section detector is
important. The values should be small enough to detect performance
impacting lockups, but large enough to avoid false alarms, especially
when running on virtualized hardware. Setting both thresholds to 0
completely disables critical section lockup detection.
See also `kernel.lockup-detector.critical-section-threshold-ms` and
`kernel.lockup-detector.critical-section-fatal-threshold-ms`.
## Heartbeat Checker
The heartbeat checker is used to detect when a CPU has stopped
responding to interrupts.
When enabled, each CPU runs a periodic timer callback that emits a
"heartbeat" by updating a timestamp in its per-CPU structure.
Afterwards, it checks the last heartbeat of each of the other CPUs. If
a heartbeat is older than the configured threshold, a (rate-limited)
`KERNEL_OOPS` is emitted. If the time since the last heartbeat exceeds
the fatal threshold, a crashlog is generated and the kernel reboots,
recording a reboot reason of `SOFTWARE_WATCHDOG` in the crashlog.
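
The heartbeat scheme can be sketched as a simulation with two CPUs, one of which stops responding. Everything here is hypothetical: the class, the event tuples, the threshold values, and in particular the rate-limiting policy (a single global minimum interval between oops reports), which the text above does not specify.

```python
class HeartbeatChecker:
    """Simulated heartbeat state shared by all CPUs (illustrative only)."""

    def __init__(self, num_cpus, age_threshold_ms, fatal_threshold_ms,
                 oops_interval_ms):
        self.last_heartbeat_ms = [0] * num_cpus
        self.age_threshold_ms = age_threshold_ms
        self.fatal_threshold_ms = fatal_threshold_ms
        self.oops_interval_ms = oops_interval_ms  # assumed rate-limit policy
        self.last_oops_ms = None

    def on_timer(self, cpu, now_ms):
        """Timer callback for `cpu`: emit a heartbeat, then check the others."""
        self.last_heartbeat_ms[cpu] = now_ms
        events = []
        for other, last_ms in enumerate(self.last_heartbeat_ms):
            if other == cpu:
                continue
            age_ms = now_ms - last_ms
            if age_ms > self.fatal_threshold_ms:
                events.append(("fatal", other, age_ms))
            elif age_ms > self.age_threshold_ms and self._oops_allowed(now_ms):
                events.append(("oops", other, age_ms))
        return events

    def _oops_allowed(self, now_ms):
        # Suppress an oops that follows the previous one too closely.
        if (self.last_oops_ms is not None
                and now_ms - self.last_oops_ms < self.oops_interval_ms):
            return False
        self.last_oops_ms = now_ms
        return True


checker = HeartbeatChecker(num_cpus=2, age_threshold_ms=3000,
                           fatal_threshold_ms=10000, oops_interval_ms=5000)
events = []
for now_ms in range(1000, 13000, 1000):
    events += checker.on_timer(0, now_ms)
    if now_ms == 1000:
        events += checker.on_timer(1, now_ms)  # CPU 1 then stops responding
```

In this run CPU 0 reports CPU 1 twice as an oops (heartbeat ages of 4,000 ms and 9,000 ms, with the reports in between suppressed by the rate limit) and then once as fatal at an age of 11,000 ms.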
`kernel.lockup-detector.heartbeat-period-ms` controls how frequently
the CPUs emit heartbeats and perform checks.
`kernel.lockup-detector.heartbeat-age-threshold-ms` controls how long
a CPU can go without emitting a heartbeat before it is considered to
be locked up and a `KERNEL_OOPS` is generated.
`kernel.lockup-detector.heartbeat-age-fatal-threshold-ms` controls how
long a CPU can go without emitting a heartbeat before a
`SOFTWARE_WATCHDOG` reboot is triggered.
## Future Directions
A future version of this library may add instrumentation for detecting
when a CPU has been "executing kernel code for too long", or has been
in a hypervisor or Secure Monitor call for too long.