# Audio Core
This is the core of the audio system. It implements the FIDL APIs
`fuchsia.media.AudioCapturer` and `fuchsia.media.AudioRenderer`. At a high level,
audio core is organized as follows:
```plain
+---------------------+      +---------------------+
|    AudioRenderers   |      |    AudioCapturers   |
|  r1   r2   ...  rN  |      |  c1   c2   ...  cN  |
+---\---|----------|--+      +--Λ---ΛΛ-------------+
     \  |          |      +-----+  / |
      \ |          |   loopback   /  |
+------VV----------V--+   |  +---/---|-------------+
|  o1   o2   ...  oN--+---+  |  i1   i2   ...  iN  |
|    AudioOutputs     |      |     AudioInputs     |
+---------------------+      +---------------------+
```
The relevant types are:
* AudioRenderers represent channels from applications that want to play audio.
* AudioCapturers represent channels to applications that want to record audio.
* AudioOutputs represent hardware outputs (speakers).
* AudioInputs represent hardware inputs (microphones).
To control output routing, we use an enum called AudioRenderUsage, which has
values like BACKGROUND, MEDIA, COMMUNICATION, etc. We maintain a many-to-one
mapping from AudioRenderUsage to AudioOutput, then map AudioRenderers to
AudioOutputs based on this type. For example, if two AudioRenderers `r1` and
`r2` are created with AudioRenderUsage MEDIA, they are both routed to the
AudioOutput assigned to MEDIA (`o2` in the above graph).
Input routing works similarly, using a type called AudioCaptureUsage.
Additionally, special "loopback" inputs are routed from AudioOutputs.
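As a sketch of this routing policy (the enum values and the `RoutingTable` type below are hypothetical simplifications, not the actual audio_core classes), the many-to-one mapping from usage to output might look like:
```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-ins for the real audio_core types.
enum class RenderUsage { BACKGROUND, MEDIA, INTERRUPTION, SYSTEM_AGENT, COMMUNICATION };

struct AudioOutput {
  std::string name;  // e.g. "built-in speakers"
};

// Many-to-one: every RenderUsage maps to exactly one AudioOutput, but one
// AudioOutput may serve several usages.
class RoutingTable {
 public:
  void SetOutputForUsage(RenderUsage usage, AudioOutput* output) { routes_[usage] = output; }

  // A new AudioRenderer created with `usage` is routed to this output.
  AudioOutput* OutputForUsage(RenderUsage usage) const {
    auto it = routes_.find(usage);
    return it == routes_.end() ? nullptr : it->second;
  }

 private:
  std::map<RenderUsage, AudioOutput*> routes_;
};

int main() {
  AudioOutput o1{"o1"}, o2{"o2"};
  RoutingTable table;
  table.SetOutputForUsage(RenderUsage::BACKGROUND, &o1);
  table.SetOutputForUsage(RenderUsage::MEDIA, &o2);

  // Two renderers created with usage MEDIA are both routed to o2.
  AudioOutput* r1_dest = table.OutputForUsage(RenderUsage::MEDIA);
  AudioOutput* r2_dest = table.OutputForUsage(RenderUsage::MEDIA);
  return (r1_dest == &o2 && r2_dest == &o2) ? 0 : 1;
}
```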
## Output Pipelines
At each AudioOutput, an OutputPipeline controls the mixing of input streams into
a single output stream. This mixing happens in a graph structure that combines
MixStage nodes (which mix multiple input streams into a single output) and
EffectsStage nodes (which apply a sequence of transformations to an input
stream). MixStage nodes also perform basic transformations: source format
conversion, rechannelization, sample rate conversion, gain scaling, and
accumulation. Gain scaling can be configured at both a per-stream level
(per-AudioRenderer) and a per-AudioRenderUsage level (usually corresponding to
volume controls on the device).
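For a rough illustration of the gain-scaling and accumulation steps only (format conversion, rechannelization, and rate conversion are omitted, and the types are hypothetical rather than the real MixStage code), a mix step might look like:
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical source stream, already converted to the mix format.
struct Source {
  std::vector<float> samples;
  float stream_gain;  // per-AudioRenderer gain scale
  float usage_gain;   // per-AudioRenderUsage gain scale (volume control)
};

// Scale each source by its combined gain and accumulate into the destination.
void MixInto(const std::vector<Source>& sources, std::vector<float>& dest) {
  for (const Source& src : sources) {
    const float scale = src.stream_gain * src.usage_gain;
    const size_t n = std::min(src.samples.size(), dest.size());
    for (size_t i = 0; i < n; ++i) {
      dest[i] += scale * src.samples[i];
    }
  }
}

int main() {
  std::vector<Source> sources = {
      {{0.5f, 0.5f}, /*stream_gain=*/1.0f, /*usage_gain=*/0.5f},
      {{0.2f, -0.2f}, /*stream_gain=*/0.5f, /*usage_gain=*/1.0f},
  };
  std::vector<float> dest(2, 0.0f);  // starts as silence
  MixInto(sources, dest);            // dest == {0.35f, 0.15f}
  return 0;
}
```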
For example, the following graph illustrates an OutputPipeline created for four
renderers, three of which (r1, r2, r3) are mixed first because they share the
same AudioRenderUsage.
```plain
   r1  r2  r3
   |   |   |
+--V---V---V---+
|  m1  m2  m3  |
|   MixStage   |
+------|-------+
       |
+------V-------+
| EffectsStage |
| 1. high pass |
| 2. compress  |
+---------\----+   r4
           \       |
         +--V------V---+
         |  m1     m2  |
         |   MixStage  |
         +------|------+
                V
             device
```
If loopback capability is enabled for the given device, then a specific pipeline
stage can be designated as the loopback stage. For example, if certain
hardware-specific effects should not be included in the loopback stream, the
loopback stream can be injected at an early stage before the final output is
sent to the device.
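To make the stage graph and the loopback tap concrete, here is a hypothetical sketch of the example pipeline above; the `Stage` type and the way the loopback stage is designated are illustrative only, not the actual OutputPipeline API:
```cpp
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-ins; not the real audio_core classes.
struct Stage {
  std::string name;
  std::vector<std::shared_ptr<Stage>> sources;  // upstream stages (or renderers)
};

struct OutputPipelineSketch {
  std::shared_ptr<Stage> root;            // the stage whose output goes to the device
  std::shared_ptr<Stage> loopback_stage;  // where loopback capturers tap the stream
};

int main() {
  // MixStage for r1/r2/r3 (shared usage), followed by an EffectsStage.
  auto usage_mix = std::make_shared<Stage>(Stage{"MixStage(r1, r2, r3)", {}});
  auto effects = std::make_shared<Stage>(
      Stage{"EffectsStage(high pass, compress)", {usage_mix}});

  // The final MixStage combines the effects output with r4 and feeds the device.
  auto final_mix = std::make_shared<Stage>(Stage{"MixStage(effects, r4)", {effects}});

  OutputPipelineSketch pipeline;
  pipeline.root = final_mix;
  // Tap loopback before the final stage so that later, hardware-specific
  // processing is not included in the loopback stream.
  pipeline.loopback_stage = effects;
  return 0;
}
```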
## Input Pipelines
Input pipelines perform no complex processing: each AudioCapturer simply mixes
its source streams.
## Streams, Frames, Clocks, and Timestamps
We represent audio as PCM-encoded streams. Each stream is a sequence of frames
with N samples per frame, where the audio has N channels. Within a stream,
individual frames are identified by a _frame number_, which is a simple index
into the stream. Each pipeline stage has zero or more source streams and
destination streams; in a pipeline graph, nodes are transformations and edges
are streams.
Each stream has a _reference clock_ which controls how the frame timeline
advances relative to real time. An important question is how we translate
between frame numbers and timestamps. At any given point in time, two kinds of
frame numbers are important:
* The _presentation frame_ identifies the frame that is currently being played
at a speaker or captured at a microphone. If it were possible to hear an
individual frame, this would be the frame heard from the speaker at the
current time.
* In output pipelines, the _safe write frame_ identifies the frame that the
pipeline is about to push into the next stage of the pipeline (for example,
at the driver level, this is the next frame to be pushed into the hardware).
This is the stream's last chance to produce this frame; all frames before
this point have been pushed to the next pipeline stage. Once the frame
passes this point, there is a delay, known as the _presentation delay_,
before the frame is played at the speaker (see also the sketch after this
list). To illustrate:
```
                |<--- delay -->|<-- writable --> ...
frame timeline: +++++++++++++++++++++++++++++++++++
                ^              ^
           presentation    safe write
               frame          frame
```
In the above diagram, each `+` is a frame; frame numbers increase from left
to right; the arrows (`^`) are pointers at a specific time; and the arrows
advance from left to right as time increases. Viewed another way, frames
move on a conveyor belt from right to left: a frame is first writable, then
moves to the delay phase after it passes the safe write pointer, then is
finally presented.
* In input pipelines, the _safe read frame_ serves a similar purpose: it
identifies the frame that the stream has just obtained from the prior stage
of the pipeline. As above, there may be a delay between when the frame is
captured at the microphone and when it becomes safe to read. To illustrate:
```
               ... <-- readable -->|<--- delay -->|
frame timeline: +++++++++++++++++++++++++++++++++++
                                   ^              ^
                               safe read      presentation
                                 frame            frame
```
The above diagram uses the same format as the prior diagram: frame numbers
increase from left to right and the arrows advance left to right as time
increases; or, viewed another way, frames move on a conveyor belt from right
to left as time advances. Frames start "in the air", then are presented
(captured) at the microphone, then enter the delay phase, and finally become
readable. Note that frames are presented (captured) before they are safe to
read, while in output pipelines, frames must be written before they can be
presented (played at a speaker).
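Numerically, converting between these positions is just an offset by the
presentation delay expressed in frames. The sketch below uses hypothetical
helper names and assumes a constant delay and a fixed frame rate:
```cpp
#include <cstdint>

// Hypothetical helper: convert a presentation delay in nanoseconds to a whole
// number of frames at a fixed frame rate.
constexpr int64_t DelayFrames(int64_t delay_ns, int64_t frames_per_second) {
  return delay_ns * frames_per_second / 1'000'000'000;
}

// Output pipeline: a frame written at the safe-write pointer is heard
// `delay_frames` later, so the frame being presented now trails it.
constexpr int64_t PresentationFrameForOutput(int64_t safe_write_frame, int64_t delay_frames) {
  return safe_write_frame - delay_frames;
}

// Input pipeline: a frame becomes safe to read `delay_frames` after it was
// captured, so the safe-read frame trails the presentation frame.
constexpr int64_t SafeReadFrameForInput(int64_t presentation_frame, int64_t delay_frames) {
  return presentation_frame - delay_frames;
}

int main() {
  constexpr int64_t kFramesPerSecond = 48000;
  constexpr int64_t kDelay = DelayFrames(5'000'000, kFramesPerSecond);  // 5 ms -> 240 frames
  static_assert(PresentationFrameForOutput(10'000, kDelay) == 9'760);
  static_assert(SafeReadFrameForInput(10'000, kDelay) == 9'760);
  return 0;
}
```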
We anchor frames to real time using presentation frames. For each stream, we
define a translation function _ReferencePtsToFrame_ which translates between
_presentation timestamps_ (PTS) and frame numbers:
```
ReferencePtsToFrame(pts) = presentation frame number at pts
```
PTS is relative to the stream's reference clock. Different streams may have
different clocks, and clocks can drift, but between any two clocks there always
exists a linear translation.
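For a stream with a fixed frame rate, ReferencePtsToFrame and its inverse are
simple linear functions. Below is a minimal, self-contained sketch with
hypothetical parameter names (not audio_core's actual timeline code), for a
stream whose frame 0 is presented at a known reference-clock PTS:
```cpp
#include <cstdint>

// Frame presented at reference-clock time `pts_ns`, for a stream whose frame 0
// is presented at `first_pts_ns` and which runs at `frames_per_second`.
constexpr int64_t ReferencePtsToFrame(int64_t pts_ns, int64_t first_pts_ns,
                                      int64_t frames_per_second) {
  return (pts_ns - first_pts_ns) * frames_per_second / 1'000'000'000;
}

// Inverse: the reference-clock PTS at which `frame` is presented.
constexpr int64_t FrameToReferencePts(int64_t frame, int64_t first_pts_ns,
                                      int64_t frames_per_second) {
  return first_pts_ns + frame * 1'000'000'000 / frames_per_second;
}

int main() {
  // A 48kHz stream whose first frame is presented at PTS 2 ms.
  static_assert(ReferencePtsToFrame(1'002'000'000, 2'000'000, 48'000) == 48'000);
  static_assert(FrameToReferencePts(48'000, 2'000'000, 48'000) == 1'002'000'000);
  return 0;
}
```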
Within Audio Core, pipeline stages communicate using frame numbers. When an
output stream uses a different frame timeline than the stage's input streams, we
can convert between the two frame timelines by hopping from input stream frame
number, to input stream PTS (using an inverted ReferencePtsToFrame), to output
stream PTS (using a clock-to-clock transformation), to output stream frame
number (using ReferencePtsToFrame). When safe read/write frames are important,
such as in drivers, we translate to and from presentation frames by adding the
presentation delay, as illustrated in the timelines above.
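Putting the pieces together, the hop described above is a composition of three
linear maps. The sketch below uses a hypothetical `Linear` helper (again, not
the real audio_core types) to translate a source-stream frame number into a
destination-stream frame number:
```cpp
#include <cstdint>

// Hypothetical linear map: out = out_offset + (in - in_offset) * num / den.
struct Linear {
  int64_t in_offset, out_offset, num, den;
  constexpr int64_t Apply(int64_t in) const { return out_offset + (in - in_offset) * num / den; }
  constexpr int64_t Inverse(int64_t out) const { return in_offset + (out - out_offset) * den / num; }
};

// Hop: source frame -> source PTS -> destination PTS -> destination frame.
constexpr int64_t SourceFrameToDestFrame(const Linear& source_pts_to_frame,
                                         const Linear& source_clock_to_dest_clock,
                                         const Linear& dest_pts_to_frame,
                                         int64_t source_frame) {
  const int64_t source_pts = source_pts_to_frame.Inverse(source_frame);
  const int64_t dest_pts = source_clock_to_dest_clock.Apply(source_pts);
  return dest_pts_to_frame.Apply(dest_pts);
}

int main() {
  // Both streams run at 48kHz; the destination clock reads 1 ms ahead.
  constexpr Linear source_pts_to_frame{0, 0, 48'000, 1'000'000'000};
  constexpr Linear dest_pts_to_frame{0, 0, 48'000, 1'000'000'000};
  constexpr Linear clock_to_clock{0, 1'000'000, 1, 1};  // dest = source + 1 ms

  // Source frame 48000 (1 s) lands at destination frame 48048.
  static_assert(SourceFrameToDestFrame(source_pts_to_frame, clock_to_clock,
                                       dest_pts_to_frame, 48'000) == 48'048);
  return 0;
}
```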
Outside of Audio Core, AudioRenderer clients may define both a "reference clock"
and a "media timeline". Clients play audio by sending a stream of audio packets,
where the "StreamPacket.pts" field gives the packet's PTS relative to the
AudioRenderer's media timeline, and where the media timeline can be translated
to the reference clock timeline via the rate passed to SetPtsUnits and the
offsets passed to Play.
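For reference, here is a rough HLCPP client sketch that establishes those
translations: it declares nanosecond PTS units via SetPtsUnits and fixes the
media-to-reference offset via Play. The connection and buffer setup are
simplified and may vary across Fuchsia revisions; treat this as an illustration
of the calls named above rather than a production-ready client:
```cpp
#include <fuchsia/media/cpp/fidl.h>
#include <lib/async-loop/cpp/loop.h>
#include <lib/async-loop/default.h>
#include <lib/sys/cpp/component_context.h>
#include <lib/zx/vmo.h>

#include <utility>

int main() {
  async::Loop loop(&kAsyncLoopConfigAttachToCurrentThread);
  auto context = sys::ComponentContext::Create();

  // Connect to audio_core and create a renderer.
  fuchsia::media::AudioPtr audio;
  context->svc()->Connect(audio.NewRequest());
  fuchsia::media::AudioRendererPtr renderer;
  audio->CreateAudioRenderer(renderer.NewRequest());

  // 48kHz stereo float samples.
  fuchsia::media::AudioStreamType format;
  format.sample_format = fuchsia::media::AudioSampleFormat::FLOAT;
  format.channels = 2;
  format.frames_per_second = 48000;
  renderer->SetPcmStreamType(format);

  // Media-timeline ticks are nanoseconds: 10^9 ticks per second.
  renderer->SetPtsUnits(1'000'000'000, 1);

  // Payload buffer that packets refer to (contents omitted in this sketch).
  zx::vmo payload;
  zx::vmo::create(4096, 0, &payload);
  renderer->AddPayloadBuffer(0, std::move(payload));

  fuchsia::media::StreamPacket packet;
  packet.pts = 0;  // relative to the media timeline
  packet.payload_buffer_id = 0;
  packet.payload_offset = 0;
  packet.payload_size = 4096;
  renderer->SendPacketNoReply(std::move(packet));

  // Start playback as soon as possible, mapping media time 0 to that
  // reference time; this fixes the media-to-reference-clock translation.
  renderer->PlayNoReply(fuchsia::media::NO_TIMESTAMP, 0);

  loop.Run();
  return 0;
}
```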
## Debugging Tips
### Inspecting final output from the system mixer
In development builds of the Fuchsia OS, the WavWriter can be used to examine
the audio streams emitted by the system mixer. (This functionality is removed
from official production builds, where it could pose a media security or
resource consumption risk.)
To enable WavWriter support in the system audio mixer, set the bool
`kWavWriterEnabled`, found near the top of `driver_output.cc`, to `true`. The
Fuchsia system mixer produces a final mix output stream (potentially
multi-channel) for every audio output device found. Thus, enabling the WavWriter
will cause an audio file to be created for each output device.
These files are created on the target (Fuchsia) device at location
`/tmp/r/sys/<pkg>/wav_writer_N.wav`, where N is a unique integer for each output
and `<pkg>` is the name of the `audio_core` package (such as
`fuchsia.com:audio_core:0#meta:audio_core.cmx`). One can copy these files back
to the host with: `fx scp <ip of fuchsia device>:/tmp/.../wav_writer_*.wav
~/Desktop/`. At this time, once audio playback begins on any device, the system
audio mixer produces audio for ALL audio output devices (even if no client is
playing audio to that device). The WAV files for these devices will, naturally,
contain silence.
Additionally, at this time Fuchsia continues mixing (once it has started) to
output devices indefinitely, even after all clients have closed. This will
change in the future. Until then, however, the most effective way to use this
tracing feature is to `killall audio_core` on the target, once playback is
complete. (The _audio_core_ process restarts automatically when needed, so this
is benign.) The mixer calls `UpdateHeader` to update the 'length' fields in
both the RIFF chunk and the WAV header after every file write, so the files
should be complete and usable even if you kill audio_core during audio playback
(in which case `Close` is never called).