| # lib/lockup_detector -- library for detecting kernel lockups |
| |
| This library provides instrumentation for detecting different kinds of |
| lockups. |
| |
| The library has two tools, a critical section detector and a CPU |
| heartbeat checker. |
| |
| ## Critical Section Detector |
| |
| The critical section detector detects when the kernel has remained in |
| a critical section for "too long". Critical sections are marked using |
| `LOCKUP_TIMED_BEGIN()` and `LOCKUP_TIMED_END()`. The code executing |
| during an SMC call is an example of a critical section. A function |
| that temporarily disables interrupts would be another example. |
| |
| ### Self Checking and Cross Checking |
| |
| Currently, detection is performed in two ways. |
| |
| First, each time a CPU leaves a critical section (calls |
| `LOCKUP_TIMED_END()`), it will observe the time spent in the critical |
| section, and then update kcounters which track the number of long |
| running critical sections we have seen. Additionally, it will track |
| the "worst case" critical section time for that CPU. |
| |
| Second, If a CPU calls `LOCKUP_TIMED_BEGIN()`, but never calls |
| `LOCKUP_TIMED_END()`, and the CPU has spent more than the configured |
| threshold amount of time in the critical section, a lockup will be |
| reported as a `KERNEL_OOPS` by another one of the CPUs when it is |
| performing a heartbeat check. If the amount of time spent by the |
| locked-up CPU in the critical section exceeds the fatal threshold, the |
| kernel will generate a crashlog and reboot, indicating a reboot reason |
| of `SOFTWARE_WATCHDOG` in the crashlog as it does. |
| |
| The `k lockup status` command can be used to check if a CPU is |
| currently in a critical section, and to print the current worst case |
| critical section times for each CPU. |
| |
| ### Threshold |
| |
| Choosing appropriate thresholds for the critical section detector is |
| important. The values should be small enough to detect performance |
| impacting lockups, but large enough to avoid false alarms, especially |
| when running on virtualized hardware. Setting both of these values to |
| 0 will completely disable critical section lockup detection. |
| |
| See also `kernel.lockup-detector.critical-section-threshold-ms` and |
| `kernel.lockup-detector.critical-section-fatal-threshold-ms`. |
| |
| ## Heartbeak Checker |
| |
| The heartbeat checker is used to detect when a CPU has stopped |
| responding to interrupts. |
| |
| When enabled, all CPUs will run a periodic timer callback that emits a |
| "heartbeat" by updating a timestamp in its per CPU structure. |
| Afterwards, they will check each of the other CPUs' last heartbeat and |
| if the last heartbeat is older than the configured threshold, a (rate |
| limited) `KERNEL_OOPS` will be emitted. If the time since last |
| heartbeat exceeds the fatal threshold, a crashlog will be generated |
| and the kernel will reboot, indicating a reboot reason of |
| `SOFTWARE_WATCHDOG` in the crashlog as it does. |
| |
| `kernel.lockup-detector.heartbeat-period-ms` controls how frequently |
| the CPUs emit heartbeats and perform checks. |
| |
| `kernel.lockup-detector.heartbeat-age-threshold-ms` controls how long |
| a CPU can go without emitting a heartbeat before it is considered to |
| be locked up and a `KERNEL_OOPS` is generated. |
| |
| `kernel.lockup-detector.heartbeat-age-fatal-threshold-ms` controls how |
| long a CPU can go without emitting a heartbeat before a |
| `SOFTWARE_WATCHDOG` reboot is triggered. |
| |
| ## Future Directions |
| |
| A future version of this library may add instrumentation for detecting |
| when a CPU has been "executing kernel code for too long", or has been |
| in a hypervisor or Secure Monitor call for too long. |