Head-Tracking Library For Immersive Audio

This library handles the processing of head-tracking information, necessary for Immersive Audio functionality. It goes from bare sensor reading into the final pose fed into a virtualizer.

Basic Usage

The main entry point into this library is the HeadTrackingProcessor class. This class is provided with the following inputs:

Head pose, relative to some arbitrary world frame.
Screen pose, relative to some arbitrary world frame.
Display orientation, defined as the angle between the “physical” screen and the “logical” screen.
Transform between the screen and the sound stage.
Desired operational mode:
- Static: only the sound stage pose is taken into account. This will result in an experience where the sound stage moved with the listener's head.
- World-relative: both the head pose and stage pose are taken into account. This will result in an experience where the sound stage is perceived to be located at a fixed place in the world.
- Screen-relative: the head pose, screen pose and stage pose are all taken into account. This will result in an experience where the sound stage is perceived to be located at a fixed place relative to the screen.

Once inputs are provided, the calculate() method will make the following output available:

Stage pose, relative to the head. This aggregates all the inputs mentioned above and is ready to be fed into a virtualizer.
Actual operational mode. May deviate from the desired one in cases where the desired mode cannot be calculated (for example, as result of dropped messages from one of the sensors).

A recenter() operation is also available, which indicates to the system that whatever pose the screen and head are currently at should be considered as the “center” pose, or frame of reference.

Pose-Related Conventions

Naming and Composition

When referring to poses in code, it is always good practice to follow conventional naming, which highlights the reference and target frames clearly:

Bad:

Pose3f headPose;

Good:

Pose3f worldToHead;  // “world” is the reference frame,
                     // “head” is the target frame.

By following this convention, it is easy to follow correct composition of poses, by making sure adjacent frames are identical:

Pose3f aToD = aToB * bToC * cToD;

And similarly, inverting the transform simply flips the reference and target:

Pose3f aToB = bToA.inverse();

Twist

“Twist” is to pose what velocity is to distance: it is the time-derivative of a pose, representing the change in pose over a short period of time. Its naming convention always states one frame, e.g.: Twist3f headTwist;

This means that this twist represents the head-at-time-T to head-at-time-T+dt transform. Twists are not composable in the same way as poses.

Frames of Interest

The frames of interest in this library are defined as follows:

Head

This is the listener’s head. The origin is at the center point between the ear-drums, the X-axis goes from left ear to right ear, Y-axis goes from the back of the head towards the face and Z-axis goes from the bottom of the head to the top.

Screen

This is the primary screen that the user will be looking at, which is relevant for some Immersive Audio use-cases, such as watching a movie. We will follow a different convention for this frame than what the Sensor framework uses. The origin is at the center of the screen. X-axis goes from left to right, Z-axis goes from the screen bottom to the screen top, Y-axis goes “into” the screen ( from the direction of the viewer). The up/down/left/right of the screen are defined as the logical directions used for display. So when flipping the display orientation between “landscape” and “portrait”, the frame of reference will change with respect to the physical screen.

Stage

This is the frame of reference used by the virtualizer for positioning sound objects. It is not associated with any physical frame. In a typical multi-channel scenario, the listener is at the origin, the X-axis goes from left to right, Y-axis from back to front and Z-axis from down to up. For example, a front-right speaker is located at positive X, Y and Z=0, a height speaker will have a positive Z.

World

It is sometimes convenient to use an intermediate frame when dealing with head-to-screen transforms. The “world” frame is a frame of reference in the physical world, relative to which we can measure the head pose and screen pose. It is arbitrary, but expected to be stable (fixed).

Processing Description

Pose processing graph

The diagram above illustrates the processing that takes place from the inputs to the outputs.

Predictor

The Predictor block gets pose + twist (pose derivative) and extrapolates to obtain a predicted head pose (w/ given latency).

Bias

The Bias blocks establish the reference frame for the poses by having the ability to set the current pose as the reference for future poses (recentering).

Orientation Compensation

The Orientation Compensation block applies the display orientation to the screen pose to obtain the pose of the “logical screen” frame, in which the Y-axis is pointing in the direction of the logical screen “up” rather than the physical one.

Screen-Relative Pose

The Screen-Relative Pose block is provided with a head pose and a screen pose and estimates the pose of the head relative to the screen. Optionally, this module may indicate that the user is likely not in front of the screen via the “valid” output.

Stillness Detector

The stillness detector blocks detect when their incoming pose stream has been stable for a given amount of time (allowing for a configurable amount of error). When the head is considered still, we would trigger a recenter operation (“auto-recentering”) and when the screen is considered not still, the mode selector would use this information to force static mode.

Mode Selector

The Mode Selector block aggregates the various sources of pose information into a head-to-stage pose that is going to feed the virtualizer. It is controlled by the “desired mode” signal that indicates whether the preference is to be in either static, world-relative or screen-relative.

The actual mode may diverge from the desired mode. It is determined as follows:

If the desired mode is static, the actual mode is static.
If the desired mode is world-relative:
- If head and screen poses are fresh and the screen is stable (stillness detector output is true), the actual mode is world-relative.
- Otherwise the actual mode is static.
If the desired mode is screen-relative:
- If head and screen poses are fresh and the ‘valid’ signal is asserted, the actual mode is screen-relative.
- Otherwise, apply the same rules as the desired mode being world-relative.

Rate Limiter

A Rate Limiter block is applied to the final output to smooth out any abrupt transitions caused by any of the following events:

Mode switch.
Display orientation switch.
Recenter operation.