docs/concepts/drivers/driver_stack_performance.md - fuchsia - Git at Google

 <!--
     (C) Copyright 2020 The Fuchsia Authors. All rights reserved.
     Use of this source code is governed by a BSD-style license that can be
     found in the LICENSE file.
 -->

 # Driver stack performance

 The purpose of this document is to provide an overview into good and bad practices in regards to
 performance for authoring new drivers or interacting with existing ones in Fuchsia.

 This document covers the following topics:

  - [API Definition](#api-definition)
  - [Implementation](#implementation)

 ## API definition {#api-definition}

 When authoring the API for a new driver stack, some key performance insights should be taken into
 consideration. Driver stack APIs fall into the application or driver category.

 Application APIs are used to give regular non-driver components access to hardware resources, while
 driver APIs are used to allow drivers to communicate among themselves. For example,
 [`fuchsia.hardware.network`][netdevice-fidl] is an application API to interact with network drivers,
 and [`ddk.protocol.network.device`][netdevice-banjo] defines the lower level driver API for network
 devices.

 Device driver APIs are typically `ddk.protocol.*` banjo and `fuchsia.hardware.*` FIDL APIs.

 #### Avoid synchronous operations

 Units of work that are part of the fast path should avoid the expectation of synchronous completion,
 especially (but not exclusively) if the operation requires crossing process boundaries. In the case
 of FIDL APIs, unnecessary synchronicity may arise from a design where a new unit of work can't be
 started until the last one is completed, meaning the caller has to wait idly until it is safe to
 request the next unit of work.

 Synchronous operations are acceptable from the performance standpoint in slow path operations such
 as setup, teardown, and configuration.

 #### Encourage batching

 When an API definition explicitly defines a batch of work, as opposed to always transmitting single
 units, users of the API are encouraged to plumb the batching through their applications or drivers.

 An important feature of batching is reducing the number of times the API needs to be exercised for a
 set amount of work. For an API that crosses process boundaries, that translates directly into
 reduced syscall and scheduling overhead, which helps performance.

 Furthermore, if the API definition itself is providing units of work in batches, device drivers can
 more easily coalesce a batch of work items into a single unit through hardware acceleration. Many
 device drivers use DMA, for example, to enqueue work with the specific hardware they drive. If a
 batch of work items is received at once, it may be possible to reduce the number of transactions
 with the hardware. Without a well-determined batch boundary, device drivers are forced to either
 interact with hardware more often than necessary or resort to heuristics (such as a polling
 interval) to reduce the hardware communication burden.

 #### Avoid data copies

 Avoiding data copies is especially important in high-bandwidth or low-latency applications such as
 networking, audio, or video. Large payloads should cross API boundaries in a VMO whenever possible.
 A common strategy is to negotiate a limited number of VMOs on setup and exchange references to
 regions in those VMOs during operation.

 Note: When defining a `banjo` API, it's technically feasible to share virtual memory pointers over
 the API boundary. API authors should always be aware that doing so restricts users of the API to
 **always** be in the same process, which is undesirable from a system architecture standpoint.

 #### Clarify flow control

 If the API batches work and does not set strict synchronous operation expectations as suggested
 above, controlling the flow of information between the API's server and client is important both for
 correctness and performance.

 In regular operation, the API's flow control definition must allow for all parts to be able to
 perform work without being blocked waiting for their counterpart by maintaining some invariant about
 the total amount of work that the system is capable of performing (either fixed by definition or
 pre-negotiated on setup).

 For example, `fuchsia.hardware.network` enforces flow control by defining a finite amount of "units
 of work", i.e. network packets, during set up. At any point in time, both parties involved can be
 aware of how many packets can still be pushed across the API boundary using only locally-maintained
 state.

 #### Account for ordering

 In some applications, all units of work must be performed in a set order for correctness. In other
 applications, however, some ordering constraints may be weakened or lifted altogether.

 When the stream of work items doesn't have to be executed in an exact order, the API can reflect
 that to allow drivers and applications to easily identify work that can be parallelized.

 For example, networking packets usually need to maintain ordering to not break application
 protocols, but that is generally only true for each application stream instance (called *flows*). A
 common strategy in network adapters is to define a set number of packet queues and assign each
 application *flow* deterministically to one of the queues. The networking stack can, then, safely
 parallelize operating the queues without having to look into the packet's contents. Observing such
 common hardware facilities and enabling the best use of them in the API can translate into
 meaningful performance improvements.


 ## Implementation {#implementation}

 This section lists performance patterns and anti-patterns to observe when implementing code that
 either serves or consumes a device driver API. The implementation can be a driver itself, or an
 application that interacts with services that are provided by device drivers.

 The following should always be observed when writing code that is on the fast path. Note that for
 device driver implementations, part of the fast path is often in interrupt threads.

  - **Avoid allocating memory or creating zircon objects**. Resources used on the fast path should
  always be pre-allocated. Be especially cautious when using libraries such as FIDL bindings (notably
  hlcpp) or `fit::function` that may cause implicit allocations.
  - **Be mindful of syscalls**. Syscalls can be costly to operate, so great care must be taken to not
  call into the kernel recklessly in performance-sensitive operations. Completely eliminating
  syscalls may not be possible, but reducing the number of calls per unit of work (with batching, for
  example) can greatly reduce system load.
  - **Return shared resources quickly**. If the API defines a finite pool of shared resources, such
  as shared memory regions, those should be reused or returned to the  "available" pool as quickly as
  possible, preferably in batches.


 [netdevice-fidl]: /sdk/fidl/fuchsia.hardware.network/device.fidl
 [netdevice-banjo]: /sdk/banjo/ddk.protocol.network.device/network-device.banjo
	<!--
	(C) Copyright 2020 The Fuchsia Authors. All rights reserved.
	Use of this source code is governed by a BSD-style license that can be
	found in the LICENSE file.
	-->

	# Driver stack performance

	The purpose of this document is to provide an overview into good and bad practices in regards to
	performance for authoring new drivers or interacting with existing ones in Fuchsia.

	This document covers the following topics:

	- [API Definition](#api-definition)
	- [Implementation](#implementation)

	## API definition {#api-definition}

	When authoring the API for a new driver stack, some key performance insights should be taken into
	consideration. Driver stack APIs fall into the application or driver category.

	Application APIs are used to give regular non-driver components access to hardware resources, while
	driver APIs are used to allow drivers to communicate among themselves. For example,
	[`fuchsia.hardware.network`][netdevice-fidl] is an application API to interact with network drivers,
	and [`ddk.protocol.network.device`][netdevice-banjo] defines the lower level driver API for network
	devices.

	Device driver APIs are typically `ddk.protocol.` banjo and `fuchsia.hardware.` FIDL APIs.

	#### Avoid synchronous operations

	Units of work that are part of the fast path should avoid the expectation of synchronous completion,
	especially (but not exclusively) if the operation requires crossing process boundaries. In the case
	of FIDL APIs, unnecessary synchronicity may arise from a design where a new unit of work can't be
	started until the last one is completed, meaning the caller has to wait idly until it is safe to
	request the next unit of work.

	Synchronous operations are acceptable from the performance standpoint in slow path operations such
	as setup, teardown, and configuration.

	#### Encourage batching

	When an API definition explicitly defines a batch of work, as opposed to always transmitting single
	units, users of the API are encouraged to plumb the batching through their applications or drivers.

	An important feature of batching is reducing the number of times the API needs to be exercised for a
	set amount of work. For an API that crosses process boundaries, that translates directly into
	reduced syscall and scheduling overhead, which helps performance.

	Furthermore, if the API definition itself is providing units of work in batches, device drivers can
	more easily coalesce a batch of work items into a single unit through hardware acceleration. Many
	device drivers use DMA, for example, to enqueue work with the specific hardware they drive. If a
	batch of work items is received at once, it may be possible to reduce the number of transactions
	with the hardware. Without a well-determined batch boundary, device drivers are forced to either
	interact with hardware more often than necessary or resort to heuristics (such as a polling
	interval) to reduce the hardware communication burden.

	#### Avoid data copies

	Avoiding data copies is especially important in high-bandwidth or low-latency applications such as
	networking, audio, or video. Large payloads should cross API boundaries in a VMO whenever possible.
	A common strategy is to negotiate a limited number of VMOs on setup and exchange references to
	regions in those VMOs during operation.

	Note: When defining a `banjo` API, it's technically feasible to share virtual memory pointers over
	the API boundary. API authors should always be aware that doing so restricts users of the API to
	always be in the same process, which is undesirable from a system architecture standpoint.

	#### Clarify flow control

	If the API batches work and does not set strict synchronous operation expectations as suggested
	above, controlling the flow of information between the API's server and client is important both for
	correctness and performance.

	In regular operation, the API's flow control definition must allow for all parts to be able to
	perform work without being blocked waiting for their counterpart by maintaining some invariant about
	the total amount of work that the system is capable of performing (either fixed by definition or
	pre-negotiated on setup).

	For example, `fuchsia.hardware.network` enforces flow control by defining a finite amount of "units
	of work", i.e. network packets, during set up. At any point in time, both parties involved can be
	aware of how many packets can still be pushed across the API boundary using only locally-maintained
	state.

	#### Account for ordering

	In some applications, all units of work must be performed in a set order for correctness. In other
	applications, however, some ordering constraints may be weakened or lifted altogether.

	When the stream of work items doesn't have to be executed in an exact order, the API can reflect
	that to allow drivers and applications to easily identify work that can be parallelized.

	For example, networking packets usually need to maintain ordering to not break application
	protocols, but that is generally only true for each application stream instance (called flows). A
	common strategy in network adapters is to define a set number of packet queues and assign each
	application flow deterministically to one of the queues. The networking stack can, then, safely
	parallelize operating the queues without having to look into the packet's contents. Observing such
	common hardware facilities and enabling the best use of them in the API can translate into
	meaningful performance improvements.


	## Implementation {#implementation}

	This section lists performance patterns and anti-patterns to observe when implementing code that
	either serves or consumes a device driver API. The implementation can be a driver itself, or an
	application that interacts with services that are provided by device drivers.

	The following should always be observed when writing code that is on the fast path. Note that for
	device driver implementations, part of the fast path is often in interrupt threads.

	- Avoid allocating memory or creating zircon objects. Resources used on the fast path should
	always be pre-allocated. Be especially cautious when using libraries such as FIDL bindings (notably
	hlcpp) or `fit::function` that may cause implicit allocations.
	- Be mindful of syscalls. Syscalls can be costly to operate, so great care must be taken to not
	call into the kernel recklessly in performance-sensitive operations. Completely eliminating
	syscalls may not be possible, but reducing the number of calls per unit of work (with batching, for
	example) can greatly reduce system load.
	- Return shared resources quickly. If the API defines a finite pool of shared resources, such
	as shared memory regions, those should be reused or returned to the "available" pool as quickly as
	possible, preferably in batches.


	[netdevice-fidl]: /sdk/fidl/fuchsia.hardware.network/device.fidl
	[netdevice-banjo]: /sdk/banjo/ddk.protocol.network.device/network-device.banjo