# Audio Core
This is the core of the audio system. It implements the FIDL APIs
`fuchsia.media.AudioCapturer` and `fuchsia.media.AudioRenderer`. At a high level,
audio core is organized as follows:
```plain
+---------------------+      +---------------------+
|    AudioRenderers   |      |    AudioCapturers   |
|  r1   r2   ...  rN  |      |  c1   c2   ...  cN  |
+---\---|----------|--+      +--Λ---ΛΛ-------------+
     \  |          |      +-----+  / |
      \ |          |   loopback   /  |
+------VV----------V--+   |  +---/---|-------------+
|  o1   o2   ...  oN--+---+  |  i1   i2   ...  iN  |
|    AudioOutputs     |      |     AudioInputs     |
+---------------------+      +---------------------+
```
The relevant types are:
* AudioRenderers represent channels from applications that want to play audio.
* AudioCapturers represent channels to applications that want to record audio.
* AudioOutputs represent hardware outputs (speakers).
* AudioInputs represent hardware inputs (microphones).
To control output routing, we use an enum called AudioRenderUsage, which has
values like BACKGROUND, MEDIA, COMMUNICATION, etc. We maintain a many-to-one
mapping from AudioRenderUsage to AudioOutput, then map AudioRenderers to
AudioOutputs based on this type. For example, if two AudioRenderers `r1` and
`r2` are created with AudioRenderUsage MEDIA, they are both routed to the
AudioOutput assigned to MEDIA (`o2` in the above graph).
Input routing works similarly, using a type called AudioCaptureUsage.
Additionally, special "loopback" inputs are routed from AudioOutputs.
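As a sketch of this routing policy (the enum values and the `RoutingTable` type below are hypothetical simplifications, not the actual audio_core classes), the many-to-one mapping from usage to output might look like:
```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-ins for the real audio_core types.
enum class RenderUsage { BACKGROUND, MEDIA, INTERRUPTION, SYSTEM_AGENT, COMMUNICATION };

struct AudioOutput {
  std::string name;  // e.g. "built-in speakers"
};

// Many-to-one: every RenderUsage maps to exactly one AudioOutput, but one
// AudioOutput may serve several usages.
class RoutingTable {
 public:
  void SetOutputForUsage(RenderUsage usage, AudioOutput* output) { routes_[usage] = output; }

  // A new AudioRenderer created with `usage` is routed to this output.
  AudioOutput* OutputForUsage(RenderUsage usage) const {
    auto it = routes_.find(usage);
    return it == routes_.end() ? nullptr : it->second;
  }

 private:
  std::map<RenderUsage, AudioOutput*> routes_;
};

int main() {
  AudioOutput o1{"o1"}, o2{"o2"};
  RoutingTable table;
  table.SetOutputForUsage(RenderUsage::BACKGROUND, &o1);
  table.SetOutputForUsage(RenderUsage::MEDIA, &o2);

  // Two renderers created with usage MEDIA are both routed to o2.
  AudioOutput* r1_dest = table.OutputForUsage(RenderUsage::MEDIA);
  AudioOutput* r2_dest = table.OutputForUsage(RenderUsage::MEDIA);
  return (r1_dest == &o2 && r2_dest == &o2) ? 0 : 1;
}
```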
## Output Pipelines
At each AudioOutput, an OutputPipeline controls the mixing of input streams into
a single output stream. This mixing happens in a graph structure that combines
MixStage nodes (which mix multiple input streams into a single output) and
EffectsStage nodes (which apply a sequence of transformations to an input
stream). MixStage nodes also perform basic transformations: source format
conversion, rechannelization, sample rate conversion, gain scaling, and
accumulation. Gain scaling can be configured at both a per-stream level
(per-AudioRenderer) and a per-AudioRenderUsage level (usually corresponding to
volume controls on the device).
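For a rough illustration of the gain-scaling and accumulation steps only (format conversion, rechannelization, and rate conversion are omitted, and the types are hypothetical rather than the real MixStage code), a mix step might look like:
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical source stream, already converted to the mix format.
struct Source {
  std::vector<float> samples;
  float stream_gain;  // per-AudioRenderer gain scale
  float usage_gain;   // per-AudioRenderUsage gain scale (volume control)
};

// Scale each source by its combined gain and accumulate into the destination.
void MixInto(const std::vector<Source>& sources, std::vector<float>& dest) {
  for (const Source& src : sources) {
    const float scale = src.stream_gain * src.usage_gain;
    const size_t n = std::min(src.samples.size(), dest.size());
    for (size_t i = 0; i < n; ++i) {
      dest[i] += scale * src.samples[i];
    }
  }
}

int main() {
  std::vector<Source> sources = {
      {{0.5f, 0.5f}, /*stream_gain=*/1.0f, /*usage_gain=*/0.5f},
      {{0.2f, -0.2f}, /*stream_gain=*/0.5f, /*usage_gain=*/1.0f},
  };
  std::vector<float> dest(2, 0.0f);  // starts as silence
  MixInto(sources, dest);            // dest == {0.35f, 0.15f}
  return 0;
}
```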
For example, the following graph illustrates an OutputPipeline created for four
renderers, three of which (r1, r2, r3) are mixed first because they share the
same AudioRenderUsage.
```plain
   r1  r2  r3
   |   |   |
+--V---V---V---+
|  m1  m2  m3  |
|   MixStage   |
+------|-------+
       |
+------V-------+
| EffectsStage |
| 1. high pass |
| 2. compress  |
+---------\----+   r4
           \       |
         +--V------V---+
         |  m1     m2  |
         |   MixStage  |
         +------|------+
                V
             device
```
If loopback capability is enabled for the given device, then a specific pipeline
stage can be designated as the loopback stage. For example, if certain
hardware-specific effects should not be included in the loopback stream, the
loopback stream can be injected at an early stage before the final output is
sent to the device.
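To make the stage graph and the loopback tap concrete, here is a hypothetical sketch of the example pipeline above; the `Stage` type and the way the loopback stage is designated are illustrative only, not the actual OutputPipeline API:
```cpp
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-ins; not the real audio_core classes.
struct Stage {
  std::string name;
  std::vector<std::shared_ptr<Stage>> sources;  // upstream stages (or renderers)
};

struct OutputPipelineSketch {
  std::shared_ptr<Stage> root;            // the stage whose output goes to the device
  std::shared_ptr<Stage> loopback_stage;  // where loopback capturers tap the stream
};

int main() {
  // MixStage for r1/r2/r3 (shared usage), followed by an EffectsStage.
  auto usage_mix = std::make_shared<Stage>(Stage{"MixStage(r1, r2, r3)", {}});
  auto effects = std::make_shared<Stage>(
      Stage{"EffectsStage(high pass, compress)", {usage_mix}});

  // The final MixStage combines the effects output with r4 and feeds the device.
  auto final_mix = std::make_shared<Stage>(Stage{"MixStage(effects, r4)", {effects}});

  OutputPipelineSketch pipeline;
  pipeline.root = final_mix;
  // Tap loopback before the final stage so that later, hardware-specific
  // processing is not included in the loopback stream.
  pipeline.loopback_stage = effects;
  return 0;
}
```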
## Input Pipelines
Input pipelines perform no complex processing: each AudioCapturer simply mixes
its source streams.
## Streams, Frames, Clocks, and Timestamps
We represent audio as PCM-encoded streams. Each stream is a sequence of frames
with N samples per frame, where the audio has N channels. Within a stream,
individual frames are identified by a _frame number_, which is a simple index
into the stream. Each pipeline stage has zero or more source streams and
destination streams; in a pipeline graph, nodes are transformations and edges
are streams.
Each stream has a _reference clock_ which controls how the frame timeline
advances relative to real time. An important question is how we translate
between frame numbers and timestamps. At any given point in time, two kinds of
frame numbers are important:
* The _presentation frame_ identifies the frame that is currently being played
at a speaker or captured at a microphone. If it were possible to hear an
individual frame, this would be the frame heard from the speaker at the
current time.
* In output pipelines, the _safe write frame_ identifies the frame that the
pipeline is about to push into the next stage of the pipeline (for example,
at the driver level, this is the next frame to be pushed into the hardware).
This is the stream's last chance to produce this frame; all frames before
this point have been pushed to the next pipeline stage. Once the frame
passes this point, there is a delay, known as the _presentation delay_,
before the frame is played at the speaker (see also the sketch after this
list). To illustrate:
```
                |<--- delay -->|<-- writable --> ...
frame timeline: +++++++++++++++++++++++++++++++++++
                ^              ^
           presentation    safe write
               frame          frame
```
In the above diagram, each `+` is a frame; frame numbers increase from left
to right; the arrows (`^`) are pointers at a specific time; and the arrows
advance from left to right as time increases. Viewed another way, frames
move on a conveyor belt from right to left: a frame is first writable, then
moves to the delay phase after it passes the safe write pointer, then is
finally presented.
* In input pipelines, the _safe read frame_ serves a similar purpose: it
identifies the frame that the stream has just obtained from the prior stage
of the pipeline. As above, there may be a delay between when the frame is
captured at the microphone and when it becomes safe to read. To illustrate:
```
               ... <-- readable -->|<--- delay -->|
frame timeline: +++++++++++++++++++++++++++++++++++
                                   ^              ^
                               safe read      presentation
                                 frame            frame
```
The above diagram uses the same format as the prior diagram: frame numbers
increase from left to right and the arrows advance left to right as time
increases; or, viewed another way, frames move on a conveyor belt from right
to left as time advances. Frames start "in the air", then are presented
(captured) at the microphone, then enter the delay phase, and finally become
readable. Note that frames are presented (captured) before they are safe to
read, while in output pipelines, frames must be written before they can be
presented (played at a speaker).
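Numerically, converting between these positions is just an offset by the
presentation delay expressed in frames. The sketch below uses hypothetical
helper names and assumes a constant delay and a fixed frame rate:
```cpp
#include <cstdint>

// Hypothetical helper: convert a presentation delay in nanoseconds to a whole
// number of frames at a fixed frame rate.
constexpr int64_t DelayFrames(int64_t delay_ns, int64_t frames_per_second) {
  return delay_ns * frames_per_second / 1'000'000'000;
}

// Output pipeline: a frame written at the safe-write pointer is heard
// `delay_frames` later, so the frame being presented now trails it.
constexpr int64_t PresentationFrameForOutput(int64_t safe_write_frame, int64_t delay_frames) {
  return safe_write_frame - delay_frames;
}

// Input pipeline: a frame becomes safe to read `delay_frames` after it was
// captured, so the safe-read frame trails the presentation frame.
constexpr int64_t SafeReadFrameForInput(int64_t presentation_frame, int64_t delay_frames) {
  return presentation_frame - delay_frames;
}

int main() {
  constexpr int64_t kFramesPerSecond = 48000;
  constexpr int64_t kDelay = DelayFrames(5'000'000, kFramesPerSecond);  // 5 ms -> 240 frames
  static_assert(PresentationFrameForOutput(10'000, kDelay) == 9'760);
  static_assert(SafeReadFrameForInput(10'000, kDelay) == 9'760);
  return 0;
}
```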
We anchor frames to real time using presentation frames. For each stream, we
define a translation function _ReferencePtsToFrame_ which translates between
_presentation timestamps_ (PTS) and frame numbers:
```
ReferencePtsToFrame(pts) = presentation frame number at pts
```
PTS is relative to the stream's reference clock. Different streams may have
different clocks, and clocks can drift, but between any two clocks there always
exists a linear translation.
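For a stream with a fixed frame rate, ReferencePtsToFrame and its inverse are
simple linear functions. Below is a minimal, self-contained sketch with
hypothetical parameter names (not audio_core's actual timeline code), for a
stream whose frame 0 is presented at a known reference-clock PTS:
```cpp
#include <cstdint>

// Frame presented at reference-clock time `pts_ns`, for a stream whose frame 0
// is presented at `first_pts_ns` and which runs at `frames_per_second`.
constexpr int64_t ReferencePtsToFrame(int64_t pts_ns, int64_t first_pts_ns,
                                      int64_t frames_per_second) {
  return (pts_ns - first_pts_ns) * frames_per_second / 1'000'000'000;
}

// Inverse: the reference-clock PTS at which `frame` is presented.
constexpr int64_t FrameToReferencePts(int64_t frame, int64_t first_pts_ns,
                                      int64_t frames_per_second) {
  return first_pts_ns + frame * 1'000'000'000 / frames_per_second;
}

int main() {
  // A 48kHz stream whose first frame is presented at PTS 2 ms.
  static_assert(ReferencePtsToFrame(1'002'000'000, 2'000'000, 48'000) == 48'000);
  static_assert(FrameToReferencePts(48'000, 2'000'000, 48'000) == 1'002'000'000);
  return 0;
}
```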
Within Audio Core, pipeline stages communicate using frame numbers. When an
output stream uses a different frame timeline than the stage's input streams, we
can convert between the two frame timelines by hopping from input stream frame
number, to input stream PTS (using an inverted ReferencePtsToFrame), to output
stream PTS (using a clock-to-clock transformation), to output stream frame
number (using ReferencePtsToFrame). When safe read/write frames are important,
such as in drivers, we translate to and from presentation frames by adding the
presentation delay, as illustrated in the timelines above.
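Putting the pieces together, the hop described above is a composition of three
linear maps. The sketch below uses a hypothetical `Linear` helper (again, not
the real audio_core types) to translate a source-stream frame number into a
destination-stream frame number:
```cpp
#include <cstdint>

// Hypothetical linear map: out = out_offset + (in - in_offset) * num / den.
struct Linear {
  int64_t in_offset, out_offset, num, den;
  constexpr int64_t Apply(int64_t in) const { return out_offset + (in - in_offset) * num / den; }
  constexpr int64_t Inverse(int64_t out) const { return in_offset + (out - out_offset) * den / num; }
};

// Hop: source frame -> source PTS -> destination PTS -> destination frame.
constexpr int64_t SourceFrameToDestFrame(const Linear& source_pts_to_frame,
                                         const Linear& source_clock_to_dest_clock,
                                         const Linear& dest_pts_to_frame,
                                         int64_t source_frame) {
  const int64_t source_pts = source_pts_to_frame.Inverse(source_frame);
  const int64_t dest_pts = source_clock_to_dest_clock.Apply(source_pts);
  return dest_pts_to_frame.Apply(dest_pts);
}

int main() {
  // Both streams run at 48kHz; the destination clock reads 1 ms ahead.
  constexpr Linear source_pts_to_frame{0, 0, 48'000, 1'000'000'000};
  constexpr Linear dest_pts_to_frame{0, 0, 48'000, 1'000'000'000};
  constexpr Linear clock_to_clock{0, 1'000'000, 1, 1};  // dest = source + 1 ms

  // Source frame 48000 (1 s) lands at destination frame 48048.
  static_assert(SourceFrameToDestFrame(source_pts_to_frame, clock_to_clock,
                                       dest_pts_to_frame, 48'000) == 48'048);
  return 0;
}
```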
Outside of Audio Core, AudioRenderer clients may define both a "reference clock"
and a "media timeline". Clients play audio by sending a stream of audio packets,
where the "StreamPacket.pts" field gives the packet's PTS relative to the
AudioRenderer's media timeline, and where the media timeline can be translated
to the reference clock timeline via the rate passed to SetPtsUnits and the
offsets passed to Play.
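For reference, here is a rough HLCPP client sketch that establishes those
translations: it declares nanosecond PTS units via SetPtsUnits and fixes the
media-to-reference offset via Play. The connection and buffer setup are
simplified and may vary across Fuchsia revisions; treat this as an illustration
of the calls named above rather than a production-ready client:
```cpp
#include <fuchsia/media/cpp/fidl.h>
#include <lib/async-loop/cpp/loop.h>
#include <lib/async-loop/default.h>
#include <lib/sys/cpp/component_context.h>
#include <lib/zx/vmo.h>

#include <utility>

int main() {
  async::Loop loop(&kAsyncLoopConfigAttachToCurrentThread);
  auto context = sys::ComponentContext::Create();

  // Connect to audio_core and create a renderer.
  fuchsia::media::AudioPtr audio;
  context->svc()->Connect(audio.NewRequest());
  fuchsia::media::AudioRendererPtr renderer;
  audio->CreateAudioRenderer(renderer.NewRequest());

  // 48kHz stereo float samples.
  fuchsia::media::AudioStreamType format;
  format.sample_format = fuchsia::media::AudioSampleFormat::FLOAT;
  format.channels = 2;
  format.frames_per_second = 48000;
  renderer->SetPcmStreamType(format);

  // Media-timeline ticks are nanoseconds: 10^9 ticks per second.
  renderer->SetPtsUnits(1'000'000'000, 1);

  // Payload buffer that packets refer to (contents omitted in this sketch).
  zx::vmo payload;
  zx::vmo::create(4096, 0, &payload);
  renderer->AddPayloadBuffer(0, std::move(payload));

  fuchsia::media::StreamPacket packet;
  packet.pts = 0;  // relative to the media timeline
  packet.payload_buffer_id = 0;
  packet.payload_offset = 0;
  packet.payload_size = 4096;
  renderer->SendPacketNoReply(std::move(packet));

  // Start playback as soon as possible, mapping media time 0 to that
  // reference time; this fixes the media-to-reference-clock translation.
  renderer->PlayNoReply(fuchsia::media::NO_TIMESTAMP, 0);

  loop.Run();
  return 0;
}
```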
## Debugging Tips
### Inspecting final output from the system mixer
In development builds of the Fuchsia OS, the WavWriter can be used to examine
the audio streams emitted by the system mixer. (This functionality is removed
from official production builds, where it could pose a media security or
resource consumption risk.)
To enable WavWriter support in the system audio mixer, set the bool
`kWavWriterEnabled`, found near the top of `driver_output.cc`, to `true`. The
Fuchsia system mixer produces a final mix output stream (potentially
multi-channel) for every audio output device found. Thus, enabling the WavWriter
will cause an audio file to be created for each output device.
These files are created on the target (Fuchsia) device at location
`/tmp/r/sys/<pkg>/wav_writer_N.wav`, where N is a unique integer for each output
and `<pkg>` is the name of the `audio_core` package (such as
`fuchsia.com:audio_core:0#meta:audio_core.cmx`). One can copy these files back
to the host with: `fx scp <ip of fuchsia device>:/tmp/.../wav_writer_*.wav
~/Desktop/`. At this time, once audio playback begins on any device, the system
audio mixer produces audio for ALL audio output devices (even if no client is
playing audio to that device). The WAV files for these devices will, naturally,
contain silence.
Additionally, at this time Fuchsia continues mixing (once it has started) to
output devices indefinitely, even after all clients have closed. This will
change in the future. Until then, however, the most effective way to use this
tracing feature is to `killall audio_core` on the target, once playback is
complete. (The _audio_core_ process restarts automatically when needed, so this
is benign.) The mixer calls `UpdateHeader` to update the 'length' fields in
both the RIFF chunk and the WAV header after every file write, so the files
should be complete and usable even if you kill audio_core during audio playback
(in which case `Close` is never called).