<!--
(C) Copyright 2020 The Fuchsia Authors. All rights reserved.
Use of this source code is governed by a BSD-style license that can be
found in the LICENSE file.
-->
# Driver stack performance
This document provides an overview of good and bad performance practices for authoring new drivers
or interacting with existing ones in Fuchsia.
This document covers the following topics:
- [API definition](#api-definition)
- [Implementation](#implementation)
## API definition {#api-definition}
When authoring the API for a new driver stack, take some key performance insights into
consideration. Driver stack APIs fall into one of two categories: application APIs or driver APIs.
Application APIs are used to give regular non-driver components access to hardware resources, while
driver APIs are used to allow drivers to communicate among themselves. For example,
[`fuchsia.hardware.network`][netdevice-fidl] is an application API to interact with network drivers,
and [`ddk.protocol.network.device`][netdevice-banjo] defines the lower level driver API for network
devices.
Device driver APIs are typically `ddk.protocol.*` banjo and `fuchsia.hardware.*` FIDL APIs.
#### Avoid synchronous operations
Units of work that are part of the fast path should avoid the expectation of synchronous completion,
especially (but not exclusively) if the operation requires crossing process boundaries. In the case
of FIDL APIs, unnecessary synchronicity may arise from a design where a new unit of work can't be
started until the last one is completed, meaning the caller has to wait idly until it is safe to
request the next unit of work.
Synchronous operations are acceptable from a performance standpoint in slow-path operations such
as setup, teardown, and configuration.
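
The difference between the two styles can be sketched in plain C++ with an in-memory stand-in for
a device. This is only an illustration under stated assumptions: `PipelinedDevice`, `Submit`,
`Process`, and `TakeCompletions` are hypothetical names, not part of any Fuchsia API. The point it
shows is that the caller may submit several units of work before any of them completes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical in-memory stand-in for a device; not a Fuchsia API.
class PipelinedDevice {
 public:
  // Queues a unit of work and returns immediately. The caller does not have
  // to wait for completion before submitting the next unit.
  void Submit(uint32_t id) { in_flight_.push_back(id); }

  // Simulates the device finishing up to `n` queued units of work.
  void Process(size_t n) {
    while (n-- > 0 && !in_flight_.empty()) {
      completed_.push_back(in_flight_.front());
      in_flight_.pop_front();
    }
  }

  // Hands back every completion accumulated so far in a single call.
  std::vector<uint32_t> TakeCompletions() {
    std::vector<uint32_t> out;
    out.swap(completed_);
    return out;
  }

  size_t in_flight() const { return in_flight_.size(); }

 private:
  std::deque<uint32_t> in_flight_;
  std::vector<uint32_t> completed_;
};
```

A synchronous design would instead force the caller to alternate strictly between submitting one
unit and waiting for its completion, leaving the device idle between requests.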
#### Encourage batching
When an API definition explicitly defines a batch of work, as opposed to always transmitting single
units, users of the API are encouraged to plumb the batching through their applications or drivers.
An important feature of batching is reducing the number of times the API needs to be exercised for a
set amount of work. For an API that crosses process boundaries, that translates directly into
reduced syscall and scheduling overhead, which helps performance.
Furthermore, if the API definition itself is providing units of work in batches, device drivers can
more easily coalesce a batch of work items into a single unit through hardware acceleration. Many
device drivers use DMA, for example, to enqueue work with the specific hardware they drive. If a
batch of work items is received at once, it may be possible to reduce the number of transactions
with the hardware. Without a well-determined batch boundary, device drivers are forced to either
interact with hardware more often than necessary or resort to heuristics (such as a polling
interval) to reduce the hardware communication burden.
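
As a rough illustration of why an explicit batch reduces overhead, the sketch below counts how
often an API boundary is crossed for the same amount of work. `WorkItem` and `CountingBoundary`
are hypothetical names invented for this example; a real boundary would be a FIDL or Banjo call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical unit of work; not part of any Fuchsia API.
struct WorkItem {
  uint32_t id;
};

// Hypothetical stand-in for an API boundary (for example, a channel between
// processes). It only counts how many times the boundary is crossed.
class CountingBoundary {
 public:
  // Single-unit API: one crossing per work item.
  void SendOne(const WorkItem&) {
    ++crossings_;
    ++items_;
  }

  // Batched API: one crossing for the whole batch, and the receiver sees an
  // explicit batch boundary.
  void SendBatch(const std::vector<WorkItem>& batch) {
    if (batch.empty()) return;
    ++crossings_;
    items_ += batch.size();
  }

  size_t crossings() const { return crossings_; }
  size_t items() const { return items_; }

 private:
  size_t crossings_ = 0;
  size_t items_ = 0;
};
```

Sending 64 items through `SendOne` crosses the boundary 64 times, while a single `SendBatch` call
with the same items crosses it once.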
#### Avoid data copies
Avoiding data copies is especially important in high-bandwidth or low-latency applications such as
networking, audio, or video. Large payloads should cross API boundaries in a VMO whenever possible.
A common strategy is to negotiate a limited number of VMOs on setup and exchange references to
regions in those VMOs during operation.
Note: When defining a `banjo` API, it's technically feasible to share virtual memory pointers over
the API boundary. API authors should be aware that doing so requires users of the API to
**always** be in the same process, which is undesirable from a system architecture standpoint.
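
The strategy of negotiating shared regions up front and then exchanging references can be sketched
as follows. This is a minimal sketch that assumes a `std::vector<uint8_t>` as a stand-in for a
mapped VMO; `SharedRegion`, `BufferDescriptor`, and `Resolve` are hypothetical names, and the
descriptor layout is an assumption rather than any real wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for a VMO that both sides mapped during setup (assumption: a real
// implementation would map actual VMOs instead of owning a vector).
struct SharedRegion {
  std::vector<uint8_t> bytes;
};

// What crosses the API boundary during operation: a few integers that refer
// to a region of shared memory, not the payload itself.
struct BufferDescriptor {
  uint8_t region_id;
  uint64_t offset;
  uint32_t length;
};

// Returns a pointer into the shared region named by `d`, or nullptr if the
// descriptor refers to an unknown region or runs past the end of it.
inline uint8_t* Resolve(std::vector<SharedRegion>& regions, const BufferDescriptor& d) {
  if (d.region_id >= regions.size()) return nullptr;
  std::vector<uint8_t>& bytes = regions[d.region_id].bytes;
  if (d.offset > bytes.size() || d.length > bytes.size() - d.offset) return nullptr;
  return bytes.data() + d.offset;
}
```

The producer writes the payload in place and sends only the descriptor; the consumer resolves the
descriptor against its own mapping, so the payload is never copied across the boundary. Validating
the descriptor matters because the other side of the API may not be trusted.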
#### Clarify flow control
If the API batches work and does not set strict synchronous operation expectations as suggested
above, controlling the flow of information between the API's server and client is important both for
correctness and performance.
In regular operation, the API's flow control definition must allow all parties to perform work
without blocking on their counterpart. It does so by maintaining some invariant about the total
amount of work that the system is capable of performing (either fixed by definition or
pre-negotiated on setup).
For example, `fuchsia.hardware.network` enforces flow control by defining a finite number of "units
of work" (that is, network packets) during setup. At any point in time, both parties involved can
tell how many packets can still be pushed across the API boundary using only locally maintained
state.
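
One way to keep such an invariant with only local state is a credit counter, sketched below. This
is an illustration of the general idea, not how `fuchsia.hardware.network` is implemented;
`CreditTracker` and its methods are hypothetical names, and `depth` stands for the number of units
negotiated on setup.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical credit counter kept locally by the sending side.
class CreditTracker {
 public:
  // `depth` is the total number of units of work agreed upon during setup.
  explicit CreditTracker(uint32_t depth) : depth_(depth), available_(depth) {}

  // Consumes up to `wanted` credits and returns how many units may be sent
  // right now. Never blocks and never asks the counterpart.
  uint32_t Acquire(uint32_t wanted) {
    const uint32_t granted = std::min(wanted, available_);
    available_ -= granted;
    return granted;
  }

  // Called when the counterpart hands back `returned` completed units.
  void Release(uint32_t returned) {
    assert(returned <= depth_ - available_);  // Cannot return more than is outstanding.
    available_ += returned;
  }

  uint32_t available() const { return available_; }

 private:
  const uint32_t depth_;
  uint32_t available_;
};
```

Because the counter only changes when the local side sends work or receives completions, the
sender can decide how much to push without a round trip to its counterpart.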
#### Account for ordering
In some applications, all units of work must be performed in a set order for correctness. In other
applications, however, some ordering constraints may be weakened or lifted altogether.
When the stream of work items doesn't have to be executed in an exact order, the API can reflect
that to allow drivers and applications to easily identify work that can be parallelized.
For example, networking packets usually need to maintain ordering to not break application
protocols, but that is generally only true for each application stream instance (called *flows*). A
common strategy in network adapters is to define a set number of packet queues and assign each
application *flow* deterministically to one of the queues. The networking stack can then safely
operate the queues in parallel without having to look into the packet's contents. Observing such
common hardware facilities and enabling the best use of them in the API can translate into
meaningful performance improvements.
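
The deterministic assignment of flows to queues can be sketched as a hash over the fields that
identify a flow. The sketch below is an assumption-laden illustration: `FlowKey`, `HashFlow`, and
`QueueForFlow` are hypothetical names, the key layout is invented, and a simple FNV-1a hash stands
in for whatever hash function real hardware uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical identifier for an application flow.
struct FlowKey {
  uint32_t src_addr;
  uint32_t dst_addr;
  uint16_t src_port;
  uint16_t dst_port;
};

// Folds the low `num_bytes` bytes of `value` into an FNV-1a hash.
inline uint32_t MixBytes(uint32_t hash, uint32_t value, int num_bytes) {
  for (int i = 0; i < num_bytes; ++i) {
    hash ^= (value >> (8 * i)) & 0xffu;
    hash *= 16777619u;  // 32-bit FNV prime.
  }
  return hash;
}

inline uint32_t HashFlow(const FlowKey& key) {
  uint32_t hash = 2166136261u;  // 32-bit FNV offset basis.
  hash = MixBytes(hash, key.src_addr, 4);
  hash = MixBytes(hash, key.dst_addr, 4);
  hash = MixBytes(hash, key.src_port, 2);
  hash = MixBytes(hash, key.dst_port, 2);
  return hash;
}

// Every packet of the same flow maps to the same queue, so ordering within a
// flow is preserved while different queues can be serviced in parallel.
inline size_t QueueForFlow(const FlowKey& key, size_t num_queues) {
  assert(num_queues > 0);
  return HashFlow(key) % num_queues;
}
```

Different flows may still land on the same queue; the scheme only guarantees that a single flow
is never split across queues.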
## Implementation {#implementation}
This section lists performance patterns and anti-patterns to observe when implementing code that
either serves or consumes a device driver API. The implementation can be a driver itself, or an
application that interacts with services that are provided by device drivers.
The following should always be observed when writing code that is on the fast path. Note that for
device driver implementations, part of the fast path is often in interrupt threads.
- **Avoid allocating memory or creating Zircon objects**. Resources used on the fast path should
always be pre-allocated. Be especially cautious when using libraries such as FIDL bindings (notably
HLCPP) or `fit::function` that may cause implicit allocations.
- **Be mindful of syscalls**. Syscalls can be costly, so take great care not to call into the
kernel needlessly in performance-sensitive operations. Completely eliminating syscalls may not be
possible, but reducing the number of calls per unit of work (with batching, for example) can
greatly reduce system load.
- **Return shared resources quickly**. If the API defines a finite pool of shared resources, such
as shared memory regions, those should be reused or returned to the "available" pool as quickly as
possible, preferably in batches.
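
The first and last items can be combined in a fixed-size pool whose storage is reserved once, at
setup, and whose buffers are returned in batches. The following is a minimal sketch under stated
assumptions: `BufferPool` is a hypothetical name, buffers are represented only by their index, and
the pool is not thread-safe.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical pool of N buffer indices. All storage lives inside the object,
// so Acquire and ReleaseBatch never allocate memory.
template <size_t N>
class BufferPool {
 public:
  BufferPool() : free_count_(N) {
    for (size_t i = 0; i < N; ++i) free_[i] = static_cast<uint16_t>(i);
  }

  // Takes one buffer from the pool. Returns false if the pool is empty
  // instead of allocating a new buffer.
  bool Acquire(uint16_t* out) {
    if (free_count_ == 0) return false;
    *out = free_[--free_count_];
    return true;
  }

  // Returns `count` buffers to the pool in a single call.
  void ReleaseBatch(const uint16_t* ids, size_t count) {
    for (size_t i = 0; i < count; ++i) {
      assert(free_count_ < N);  // Returning more than was taken is a bug.
      free_[free_count_++] = ids[i];
    }
  }

  size_t available() const { return free_count_; }

 private:
  std::array<uint16_t, N> free_;
  size_t free_count_;
};
```

When the pool runs dry, the caller has to apply backpressure rather than allocate, which keeps
the fast path free of implicit allocations.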
[netdevice-fidl]: /sdk/fidl/fuchsia.hardware.network/device.fidl
[netdevice-banjo]: /sdk/banjo/ddk.protocol.network.device/network-device.banjo