| commit | e2a59c531fbacd7ea16acdabb548fc2dcdca23f0 | [log] [tgz] |
|---|---|---|
| author | Filip Filmar <fmil@fuchsia.infra.roller.google.com> | Mon May 19 14:04:59 2025 -0700 |
| committer | Copybara-Service <copybara-worker@google.com> | Mon May 19 14:06:49 2025 -0700 |
| tree | c8701f6018b64ab40a39866508a3945f10e50612 | |
| parent | 9a940a0830c09a40fd5521f66ed6999a6f24e2c4 | [diff] |
[roll] Roll fuchsia [starnix][hrtimer] Attempt to resolve flakiness in interval timer handling.

There is a long-standing flakiness in the test infra of the test sysfs_power_tests.cm, for which I only recently found a plausible reason. The symptom is that interval timers stop firing, non-deterministically, while they are expected to be firing.

It is hard to verify that the particular sequence of events shown below is what causes the observed flakiness. We'll verify by watching the flaky test behavior over time. This change does fix three real identified issues, a claim which is confirmed by the included regression tests. Whether the fixes will also remove the observed test flakiness in test infra remains to be seen.

One problem sequence of events is as follows:

1. Initial: the timer heap is empty. Sleeps are allowed.
2. Starnix schedules an interval timer T. Sleeps are allowed.
3. T fires. The container wakes. Sleeps are not allowed while T's wake proxy message is being processed.
4. HR Timer Manager removes T from the timer heap. Since T is an interval timer, sleeps are NOT allowed, as we want to wait until T is rescheduled. T is not rescheduled yet. The heap is empty.
5. Starnix schedules another timer T2. Since T2 is not an interval timer, sleeps are now allowed. T is still not rescheduled.
6. T2 expires. HR Timer Manager removes T2 from the timer heap. The heap is now empty. Since T2 is not an interval timer, mark_all_proxy_messages_handled is run, and sleeps are now allowed. T is still not rescheduled.
7. The container is suspended without any scheduled alarms. T is never rescheduled.

The fix consists of actively tracking interval timers which are in the state "have just fired, but not rescheduled yet". Having such timers should prevent suspend until a new timer is scheduled.

More generally, we should not allow a Starnix container to be suspended if there are any timers we know of which should be scheduled but are not. Instead, we should keep the container running until those timers get scheduled. The previous code handled this only for the last interval timer that fired, and only on expiry, which does not cover all the possible event interleavings. This issue was then eventually causing observable infra test flakes.

This change seems like it could remove the infra test flakes (hr_timer_manager.rs), since it fixes behaviors that should be very adjacent to the problem behavior we observed. However, I remain cautious about claiming that it's a fix, since I had a few unsuccessful attempts before.

Multiply: starnix-tests
Tested: locally
Original-Bug: 373731551
Original-Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/1277836
Original-Revision: 50b2eb2365744bfcff64e142b114beefb6c63f24
GitOrigin-RevId: f3d47786124dae52c6d8581af56b7794eef40907
Change-Id: Ic81753bce5966cb53a7319cc41382c5172ac48ee
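The idea behind the fix can be illustrated with a small model. This is only a sketch under stated assumptions, not the code in hr_timer_manager.rs: the names `SuspendTracker`, `TimerId`, `on_fired`, `on_wake_message_handled`, `on_scheduled`, and `suspend_allowed` are all hypothetical, and the real HR Timer Manager is asynchronous and considerably more involved. The sketch keeps a set of interval timers that have fired but have not been rescheduled yet, and refuses suspend while that set is non-empty.

```rust
use std::collections::HashSet;

/// Hypothetical timer identifier (an assumption, not a Fuchsia type).
type TimerId = u64;

/// Illustrative model of the suspend decision described in the commit message.
#[derive(Default)]
struct SuspendTracker {
    /// Interval timers that have fired but have not been rescheduled yet.
    pending_reschedule: HashSet<TimerId>,
    /// Number of wake messages that are still being processed.
    unhandled_wake_messages: usize,
}

impl SuspendTracker {
    /// A timer fired and woke the container.
    fn on_fired(&mut self, id: TimerId, is_interval: bool) {
        self.unhandled_wake_messages += 1;
        if is_interval {
            // Remember every interval timer that still needs rescheduling,
            // not only the most recent one.
            self.pending_reschedule.insert(id);
        }
    }

    /// The wake message of one fired timer has been fully processed.
    fn on_wake_message_handled(&mut self) {
        self.unhandled_wake_messages = self.unhandled_wake_messages.saturating_sub(1);
    }

    /// A timer was scheduled or rescheduled.
    fn on_scheduled(&mut self, id: TimerId) {
        self.pending_reschedule.remove(&id);
    }

    /// Suspend is allowed only if no wake message is in flight and no
    /// interval timer is waiting to be rescheduled.
    fn suspend_allowed(&self) -> bool {
        self.unhandled_wake_messages == 0 && self.pending_reschedule.is_empty()
    }
}

/// Replays steps 2 to 6 of the problem sequence and reports whether the
/// model would allow suspend at step 7.
fn replay_problem_sequence() -> bool {
    let mut tracker = SuspendTracker::default();
    tracker.on_scheduled(1); // Step 2: interval timer T is scheduled.
    tracker.on_fired(1, true); // Step 3: T fires.
    tracker.on_wake_message_handled(); // Step 4: T is not rescheduled yet.
    tracker.on_scheduled(2); // Step 5: one-shot timer T2 is scheduled.
    tracker.on_fired(2, false); // Step 6: T2 expires.
    tracker.on_wake_message_handled();
    tracker.suspend_allowed() // Step 7: is the container allowed to suspend?
}
```

In this model `replay_problem_sequence()` returns `false`, because T is still in the pending set when T2 has been handled. Suspend becomes possible again only after `on_scheduled(1)` reports that T was rescheduled. Using a set rather than a single flag is the design choice that covers interleavings with more than one interval timer.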
This repository contains Fuchsia's Global Integration manifest files.
All changes should be made to the internal version of this repository. Our infrastructure automatically updates this version when the internal one changes.
Currently all changes must be made by a Google employee. Non-Google employees wishing to make a change can ask for assistance in one of the communication channels documented at get involved.
First install Jiri.
Next run:
    $ jiri init
    $ jiri import minimal https://fuchsia.googlesource.com/integration
    $ jiri update
Third party projects should have their own subdirectory in ./third_party.