| # Defining a Stable Driver Runtime |
| |
| * Project lead: surajmalhotra@google.com |
| * Status: Approved |
| * Area(s): Devices |
| |
| ## Problem statement |
| |
| Banjo is an interface definition language (IDL) used to define the interfaces |
| between drivers. It is a derivative of FIDL, with syntax forked from FIDL in 2018. |
| While the syntax is similar, unlike FIDL, banjo was designed for synchronous |
| in-process communication, and the resulting codegen amounts to a very barebones |
| struct of function pointers, associated with a context pointer. |
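
To make the "struct of function pointers, associated with a context pointer" shape concrete, here is a minimal sketch in C. Every name in it (`example_block_protocol_t`, `example_block_read`, `ramdisk_t`, and so on) is hypothetical and chosen for illustration; it is not actual banjoc output, only the general pattern the paragraph above describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical protocol; names are illustrative, not real generated code. */
typedef struct example_block_protocol_ops {
    /* Every method receives the provider's context as its first argument. */
    int32_t (*read)(void* ctx, uint64_t offset, uint8_t* out, size_t len);
} example_block_protocol_ops_t;

typedef struct example_block_protocol {
    const example_block_protocol_ops_t* ops; /* table of function pointers */
    void* ctx;                               /* opaque provider state */
} example_block_protocol_t;

/* Client-side helper: a direct, synchronous, in-process call. */
static inline int32_t example_block_read(const example_block_protocol_t* proto,
                                         uint64_t offset, uint8_t* out,
                                         size_t len) {
    assert(proto != NULL && proto->ops != NULL);
    return proto->ops->read(proto->ctx, offset, out, len);
}

/* Provider side: a hypothetical RAM-backed device. */
typedef struct ramdisk {
    uint8_t data[16];
} ramdisk_t;

static int32_t ramdisk_read(void* ctx, uint64_t offset, uint8_t* out,
                            size_t len) {
    ramdisk_t* disk = (ramdisk_t*)ctx;
    /* Reject reads that would run past the end of the backing store. */
    if (offset > sizeof(disk->data) || len > sizeof(disk->data) - offset) {
        return -1;
    }
    memcpy(out, disk->data + offset, len);
    return 0;
}

static const example_block_protocol_ops_t ramdisk_ops = {
    .read = ramdisk_read,
};
```

A client holding an `example_block_protocol_t` calls straight into the provider on the caller's own thread, with nothing in between. Both sides are also compiled against the same struct layout, which is one way to see why the evolution and threading concerns listed below arise.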
| |
| A non-exhaustive list of problems with banjo includes: |
| |
| * The generated code for banjo lacks a strategy for interface and type |
| evolution. This is a critical requirement for interface stability. |
| * Since early 2019, banjo has been largely in maintenance mode, and has fallen |
| behind FIDL in terms of ergonomics and features. Understanding how to write |
| banjo syntax has become confusing because the Fuchsia project has relied on |
| FIDL documentation to address the current gaps in banjo's features and |
| ergonomics. |
| * Banjo is optimized to be low overhead, placing a great deal of burden on |
| driver authors to figure out how to move state onto the heap |
| or handle an operation asynchronously. There is a great deal of boilerplate |
| involved with manual serialization logic required to do so. |
| * There are no strict requirements on how driver authors may invoke banjo |
| protocol methods, nor any guarantees about the context in which their own |
| protocol methods may be invoked, leading to unnecessary spawning of threads |
| in order to achieve safety (avoiding deadlocks). |
| * Banjo types are incompatible with FIDL types, which often leads to |
| substantial boilerplate when shifting to out-of-process communication. |
| |
| ## Solution statement |
| |
| We aim to solve these problems by evolving banjo into something better. The |
| three key features of the new transport will be: |
| |
| 1. A forced layer of indirection between drivers, to allow a runtime to |
| mediate driver-to-driver communications within the same process. |
| 2. Migration away from C structs towards types built with evolution in mind. |
| 3. Enforcement of a well-defined threading model. |
| |
| We are expecting to find a solution with the following characteristics: |
| |
| * Shift all communication between drivers to be message oriented, utilizing |
| the FIDL wire format between drivers. |
| * Allow drivers to make synchronous calls into other drivers. |
| - With the caveat that this is only allowed on threads owned by the driver. |
| * Share threads between drivers. |
| - With the caveat that all communication on shared threads must be |
| asynchronous. |
| * Allow drivers to never deal with re-entrance or synchronization if they |
| don't opt-in (allowing them to avoid locks altogether). |
| * Allow for zero copy and zero serialization / deserialization between |
| drivers. |
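
The combination of a mediating runtime, message-oriented communication, and freedom from re-entrance can be sketched as a small queue-based dispatcher. This is a single-threaded toy under stated assumptions, not the design this roadmap will arrive at: `runtime_t`, `runtime_send`, `runtime_run_until_idle`, and `bounce` are all hypothetical names, the payload is a plain `int` rather than a FIDL-encoded message, and no real threading is modelled.

```c
#include <stddef.h>

#define QUEUE_CAPACITY 8

/* Hypothetical message: a destination driver index plus a toy payload. */
typedef struct message {
    int dest;
    int payload;
} message_t;

typedef struct runtime runtime_t;
typedef void (*handler_fn)(runtime_t* rt, void* driver_state, int payload);

typedef struct driver_slot {
    handler_fn handler;
    void* state;
} driver_slot_t;

struct runtime {
    driver_slot_t drivers[2];
    message_t queue[QUEUE_CAPACITY]; /* ring buffer of pending messages */
    size_t head;
    size_t count;
    int dispatch_depth; /* number of handlers currently on the stack */
    int max_depth;      /* stays at 1 if delivery is never re-entrant */
};

/* Drivers never call each other directly; they only enqueue messages. */
static int runtime_send(runtime_t* rt, int dest, int payload) {
    if (rt->count == QUEUE_CAPACITY) {
        return -1; /* queue full */
    }
    size_t tail = (rt->head + rt->count) % QUEUE_CAPACITY;
    rt->queue[tail].dest = dest;
    rt->queue[tail].payload = payload;
    rt->count++;
    return 0;
}

/* The runtime delivers one message at a time, so a handler that sends a
 * message is never re-entered before it returns. */
static void runtime_run_until_idle(runtime_t* rt) {
    while (rt->count > 0) {
        message_t msg = rt->queue[rt->head];
        rt->head = (rt->head + 1) % QUEUE_CAPACITY;
        rt->count--;
        driver_slot_t* d = &rt->drivers[msg.dest];
        rt->dispatch_depth++;
        if (rt->dispatch_depth > rt->max_depth) {
            rt->max_depth = rt->dispatch_depth;
        }
        d->handler(rt, d->state, msg.payload);
        rt->dispatch_depth--;
    }
}

/* Toy driver: counts messages and bounces a decremented payload to a peer. */
typedef struct counter {
    int peer;
    int handled;
} counter_t;

static void bounce(runtime_t* rt, void* state, int payload) {
    counter_t* self = (counter_t*)state;
    self->handled++;
    if (payload > 0) {
        runtime_send(rt, self->peer, payload - 1);
    }
}
```

Registering two `bounce` drivers that name each other as peers and sending one initial message makes them exchange messages until the payload reaches zero, while `max_depth` never exceeds 1. With direct function-pointer calls, the same exchange would nest one handler inside the other.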
| |
| We reserve the right to change our minds depending on the benchmark results of |
| early prototypes. If we cannot outperform mechanisms provided by the kernel, we |
| will need to try alternative designs. We also need to prove out our assumptions |
| that the mechanisms provided by the kernel are insufficient for our needs. |
| |
| We will try to track progress towards a new banjo with the following |
| milestones: |
| |
| 1. Update banjo syntax to match FIDL syntax, use fidlc as the frontend, and |
| implement a custom backend which generates output equivalent to what banjoc |
| generates today. |
| 1. This allows us to avoid maintenance burden and future syntax drift. |
| 2. Architect a threading model for drivers that we want to design around. |
| 3. Decide on metrics/benchmarks to judge any forthcoming designs. |
| 4. Run experiments to see if we can meet required benchmarks with the new |
| transport. |
| 5. Implement the new FIDL backend and driver runtime. |
| 1. We will likely start by creating a variant of the LLCPP FIDL bindings |
| that targets the new transport. |
| 6. Repeat the following steps for each driver stack: |
| 1. Migrate drivers which are co-resident in the same driver host over to |
| the new threading model, utilizing existing banjo transport. |
| 2. Migrate drivers which are co-resident in the same driver host over to |
| the new in-process FIDL transport. |
| |
| ## Dependencies |
| |
| We will likely need to work with the FIDL team to allow LLCPP bindings to be |
| abstracted away from Zircon channels and ports to allow us to repurpose the |
| bindings mostly as-is on a new transport with minimal user-visible differences. |
| We don't anticipate any changes necessary to the frontend IDL, but changes to |
| FIDL IR may be necessary. |
| |
| Additionally, migrating 300+ drivers will take a lot of effort and time, and |
| will require various teams throughout the organization to be involved to ensure |
| nothing breaks. |
| |
| ## Risks and mitigations |
| |
| A major change like this has long-term implications for the performance |
| characteristics of our system because it introduces additional overhead. |
| Luckily, we have |
| built in some evolutionary support directly into our framework's architecture |
| to enable us to move towards another technology if the solution we build is |
| unable to meet future needs. We can do this by implementing new component |
| runners and having drivers target the new runner, which may have a different |
| driver runtime. Switching every driver over to the new driver runner will |
| likely be impractical, however, so we will end up needing to maintain both in |
| parallel, which has costs of its own. As such, we really want to get this |
| approach mostly right to avoid needing to take this course. |
| |
| Switching drivers to a new threading model is also a large cost to pay, and may |
| introduce new bugs along the way. Many drivers lack tests. Additionally, for |
| drivers that do have tests, unit tests may also lose their validity after the |
| switch and may have to be rewritten alongside the transition. We have written a |
| great deal of our driver tests as integration tests which should continue to be |
| valid even after migration without any changes. We will continue to try to |
| invest in more integration tests and e2e tests prior to migration to prevent |
| introduction of new bugs. |
| |
| Estimating the migration timeline is another large risk. It |
| is hard to accurately estimate the cost here without having built a replacement |
| and trialed migration on at least one driver. We will need to continually be |
| cognizant of the cost as we implement our design, and automate as much of the |
| migration as possible. |