commit	e2a59c531fbacd7ea16acdabb548fc2dcdca23f0
author	Filip Filmar <fmil@fuchsia.infra.roller.google.com>	Mon May 19 14:04:59 2025 -0700
committer	Copybara-Service <copybara-worker@google.com>	Mon May 19 14:06:49 2025 -0700
tree	c8701f6018b64ab40a39866508a3945f10e50612
parent	9a940a0830c09a40fd5521f66ed6999a6f24e2c4

[roll] Roll fuchsia [starnix][hrtimer] Attempt to resolve flakiness in interval timer handling.

The test sysfs_power_tests.cm has long been flaky in test infra, and I
only recently found a plausible reason. The symptom is that interval
timers nondeterministically stop firing while they are expected to keep
firing.

Now, it is hard to verify that the particular sequence of events shown
below is causing the observed flakiness. We'll verify by watching the
flaky test behavior over time. This change does fix three real
identified issues, a claim which is confirmed by the included regression
tests. Whether the fixes will also remove the observed test flakiness in
test infra remains to be seen.

One problem sequence of events is as follows:

1. Initial: timer heap is empty. Sleeps are allowed.
2. Starnix schedules an interval timer T. Sleeps are allowed.
3. T fires. The container wakes. Sleeps are not allowed while T's wake
proxy message is being processed.
4. HR Timer Manager removes T from the timer heap. Since T is an
interval timer, sleeps are NOT allowed as we want to wait until T is
rescheduled. T is not rescheduled yet. Heap is empty.
5. Starnix schedules another timer T2. Since T2 is not an interval
timer, sleeps are now allowed. T is still not rescheduled.
6. T2 expires. HR Timer Manager removes T2 from the timer heap. Heap is
now empty. Since T2 is not an interval timer,
mark_all_proxy_messages_handled is run, and sleeps are now allowed. T
is still not rescheduled.
7. Container is suspended, without any scheduled alarms. T is never
rescheduled.

The fix is to actively track interval timers that are in the state
"have just fired, but not rescheduled yet". While any such timer
exists, suspend should be prevented until a new timer is scheduled.

More generally, we should not allow a Starnix container to be suspended
if there are any timers we know of which should be scheduled but are
not. Instead, we should keep the container running until those timers
get scheduled. The previous code handled this only for the last interval
timer that fired, and only on expiry, which does not cover all the
possible event interleavings. This issue was then eventually causing
observable infra test flakes.
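
As an illustration only, the problem sequence above can be replayed
against a toy model. This is a sketch under stated assumptions, not the
code in this change: TimerId, LastEventOnly, PendingRescheduleTracker
and all method names are hypothetical, LastEventOnly is a simplified
reading of steps 2 through 6, and wake proxy message handling is left
out entirely.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical timer identifier, used only in this sketch.
type TimerId = u32;

/// Simplified reading of the previous behavior: a single flag that every
/// schedule or expiry event overwrites.
struct LastEventOnly {
    sleeps_allowed: bool,
}

impl LastEventOnly {
    fn new() -> Self {
        Self { sleeps_allowed: true }
    }

    /// Scheduling any timer re-allows sleeps (steps 2 and 5).
    fn on_schedule(&mut self) {
        self.sleeps_allowed = true;
    }

    /// Expiry blocks sleeps only if the expired timer is an interval
    /// timer (steps 4 and 6).
    fn on_expire(&mut self, is_interval: bool) {
        self.sleeps_allowed = !is_interval;
    }
}

/// Sketch of the idea behind the fix: remember every interval timer that
/// has fired but has not been rescheduled, and block suspend while any
/// such timer exists.
#[derive(Default)]
struct PendingRescheduleTracker {
    /// Currently scheduled timers: id -> is_interval.
    scheduled: HashMap<TimerId, bool>,
    /// Interval timers that fired and are awaiting reschedule.
    pending_reschedule: HashSet<TimerId>,
}

impl PendingRescheduleTracker {
    fn on_schedule(&mut self, id: TimerId, is_interval: bool) {
        // Scheduling (or rescheduling) a timer clears its pending state.
        self.pending_reschedule.remove(&id);
        self.scheduled.insert(id, is_interval);
    }

    fn on_expire(&mut self, id: TimerId) {
        // Only a scheduled interval timer becomes "pending reschedule".
        if self.scheduled.remove(&id) == Some(true) {
            self.pending_reschedule.insert(id);
        }
    }

    fn suspend_allowed(&self) -> bool {
        self.pending_reschedule.is_empty()
    }
}

/// Replays steps 2 through 6 and returns whether each model would allow
/// the suspend in step 7: (previous behavior, tracked behavior).
fn replay_problem_sequence() -> (bool, bool) {
    const T: TimerId = 1;
    const T2: TimerId = 2;
    let mut old = LastEventOnly::new();
    let mut new = PendingRescheduleTracker::default();

    // Step 2: interval timer T is scheduled.
    old.on_schedule();
    new.on_schedule(T, true);
    // Steps 3-4: T fires and is removed from the heap.
    old.on_expire(true);
    new.on_expire(T);
    // Step 5: non-interval timer T2 is scheduled.
    old.on_schedule();
    new.on_schedule(T2, false);
    // Step 6: T2 expires and is removed from the heap.
    old.on_expire(false);
    new.on_expire(T2);

    (old.sleeps_allowed, new.suspend_allowed())
}
```

In this model the single-flag variant allows the suspend in step 7,
while the tracked variant keeps blocking it until T is rescheduled,
which is the behavior the paragraphs above argue for.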

This change seems like it could remove the infra test flakes related to
hr_timer_manager.rs, since it fixes behaviors that should be very
adjacent to the problem behavior we observed. However, I remain cautious
about claiming that it's a fix, since I had a few unsuccessful attempts
before.

Multiply: starnix-tests
Tested: locally
Original-Bug: 373731551
Original-Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/1277836
Original-Revision: 50b2eb2365744bfcff64e142b114beefb6c63f24
GitOrigin-RevId: f3d47786124dae52c6d8581af56b7794eef40907
Change-Id: Ic81753bce5966cb53a7319cc41382c5172ac48ee

stem

1 file changed

README.md

Integration

This repository contains Fuchsia's Global Integration manifest files.

Making changes

All changes should be made to the internal version of this repository. Our infrastructure automatically updates this version when the internal one changes.

Currently all changes must be made by a Google employee. Non-Google employees wishing to make a change can ask for assistance in one of the communication channels documented at "get involved".

Obtaining the source

First install Jiri.

Next run:

$ jiri init
$ jiri import minimal https://fuchsia.googlesource.com/integration
$ jiri update

Third party

Third party projects should have their own subdirectory in ./third_party.