A new player in audio has emerged to the delight of anyone looking to enhance their streaming content. It’s called spatial sound.
Videographers spend hours using the latest innovations to create the most impressive video they can. Audio, however, has long played second fiddle, with decades-old stereo sound often the default. Eventually, engineers came up with surround sound, in which a number of speakers “surround” the listener to deliver distinct sounds from distinct positions. But as good as that was, it wasn’t truly immersive, because of the physical limits of speaker placement.
Impacting how a person hears
When a sound occurs, the human brain instantly processes the audio entering the ears, analyzing it to determine the sound’s location. This analog auditory system can be duplicated digitally, provided the sound reaches the ears much as it would in the real world. This is why spatial sound works when the listener is wearing headphones. A digital algorithm based on Head-Related Transfer Functions (HRTFs) processes a source sound so that it mimics what each ear would hear; the brain then responds to this sound just as it would in the “real” world. Note that the headphones are ordinary ones and do not need multiple drivers (speakers) in each ear cup.
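The core of that HRTF processing is a convolution: the source sound is filtered through a measured impulse response for each ear. The sketch below is a minimal illustration, not a real renderer; the two impulse responses are toy arrays standing in for measured HRIRs, which in practice come from datasets of per-direction ear measurements.

```python
import numpy as np

def apply_hrtf(mono, hrir_left, hrir_right):
    """Render a mono signal for headphones by convolving it with a
    head-related impulse response (HRIR) for each ear."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)  # shape: (samples, 2)

# Toy impulse responses: the right ear hears the source later and
# quieter, a cue the brain reads as "the sound is off to the left".
hrir_l = np.array([1.0, 0.0, 0.0])
hrir_r = np.array([0.0, 0.0, 0.6])
stereo = apply_hrtf(np.array([1.0, 0.5]), hrir_l, hrir_r)
```

Swapping in HRIRs for a different direction moves the perceived source, which is exactly how a spatializer places sounds around the listener.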
Standard recorded audio (stereo or surround) expects the listener to remain in a static position, not to suddenly turn left or right. Should the person do that, audio that appeared to come from a fixed position moves along with them. With spatial sound, however, head tracking is part of the process: it keeps sounds in their designated places in 3D space, so a listener who turns their head will find that the sounds stay put. This offers unique opportunities for creating a more realistic and immersive sound field.
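The compensation at the heart of head tracking can be sketched simply: the renderer subtracts the head’s rotation from each sound’s position in the room, so the sound stays fixed in the world rather than following the head. This is an illustrative simplification (yaw only; real trackers also handle pitch and roll).

```python
def world_to_head_azimuth(source_azimuth_deg, head_yaw_deg):
    """Keep a sound fixed in the room: when the head turns right by
    some yaw angle, the source must be rendered that much further to
    the left. Result is wrapped into the range (-180, 180]."""
    return (source_azimuth_deg - head_yaw_deg + 180) % 360 - 180

# A source straight ahead (0 degrees); the listener turns 90 degrees
# to the right. The renderer now places the sound hard left (-90).
relative = world_to_head_azimuth(0, 90)
```

Running this update every time a new head orientation arrives from the tracker is what makes the sound field appear anchored to the room.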
How spatial sound differs
To compare stereo and surround sound with spatial sound, it helps to understand their limitations. For reference, picture a listener seated and facing forward. With stereo, the sound stage comes from two front-facing speakers, one to the left and one to the right of the listener. With surround, multiple speakers create distinct audio streams: hearing a sound directly to the left, or behind and to the left, requires a dedicated speaker positioned exactly there. With spatial sound, however, the audio can occupy any position the engineer wishes, with no dedicated speaker required. If a sound needs to sit directly to the left of the listener, raised 3 feet and angled slightly toward the front, it can.
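As an illustration of what “anywhere in 3D space” means in practice, an object-based mix stores each sound’s position as metadata, and a renderer converts that into a direction and distance relative to the listener. A minimal sketch, assuming a simple coordinate convention (x = right, y = front, z = up, in feet):

```python
import math

def object_position_to_polar(x, y, z):
    """Convert a sound object's position relative to the listener
    into the azimuth, elevation and distance a renderer works with."""
    distance = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(x, y))  # 0 = ahead, + = right
    elevation = math.degrees(math.asin(z / distance)) if distance else 0.0
    return azimuth, elevation, distance

# "Directly to the left, raised 3 feet, angled slightly forward":
az, el, dist = object_position_to_polar(-5.0, 1.0, 3.0)
```

The renderer then drives whatever output is available, headphones or a speaker array, from these same coordinates, which is why no dedicated speaker is needed at each position.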
To put it simply, spatial sound means a sound can be positioned anywhere in the 3D space around the listener. Unlike stereo and surround sound, it does not rely on the listener remaining in a fixed position in order to deliver distinct sounds within that space.
Two types of spatial sound
Spatial sound can be created as either binaural or object-based. Binaural is used with headphones, while object-based is used for listening through speakers; examples of the latter are formats like Dolby Atmos, DTS:X and Sony 360 Reality Audio. To record binaural sound, a pair of microphones is placed on either side of a dummy head; more specialized artificial heads can also be used. Another option is a microphone containing four cardioid capsules, each pointed in a different direction so that together they form a tetrahedron. A software decoder converts this into front-back, up-down and left-right signals.
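The decoding step for such a tetrahedral (“A-format”) microphone is essentially a sum-and-difference matrix. The sketch below shows the basic conversion to first-order B-format components; it is illustrative only, since real converters also apply frequency-dependent correction filters for the spacing between the capsules.

```python
def a_to_b_format(flu, frd, bld, bru):
    """Convert tetrahedral A-format capsule samples (front-left-up,
    front-right-down, back-left-down, back-right-up) into first-order
    B-format: an omnidirectional component plus three axis signals."""
    w = flu + frd + bld + bru  # omnidirectional pressure
    x = flu + frd - bld - bru  # front-back
    y = flu - frd + bld - bru  # left-right
    z = flu - frd - bld + bru  # up-down
    return w, x, y, z

# Identical signal on all four capsules decodes to a purely
# omnidirectional sound with no directional components.
components = a_to_b_format(1.0, 1.0, 1.0, 1.0)
```

Applying this per sample (or per block) across the recording yields the front-back, up-down and left-right signals the article describes.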
There are also field recorders with built-in multi-microphone arrays, as well as 360-degree spherical cameras that record through multiple microphones alongside the video. An example is the affordable GoPro Fusion, which uses two lenses for video and four mics for audio. Higher-end models carry even more microphones; the Argus spherical video camera, for example, sports 64.
As an interesting aside, calling spatial sound “new” is something of a misnomer, as its origins date back to the 1970s. However, the audio technology it needed wasn’t available until the emergence of digital signal processing. Recent interest in and application of the technology by companies like Apple and Sony has been a strong driver as well.
Creating spatial audio in the DAW
Creating spatial audio requires rendering the sound in a DAW (digital audio workstation), a software program designed for audio manipulation. Examples include Pro Tools HD, Premiere and Reaper, among others. These programs work with a spatializer plugin (a specialized add-on program) such as FB 360 Workstation or Dear VR, or with tools geared toward a specific device, such as Premiere for the Oculus Go. In all cases there are specific procedures to follow, none of which are onerous. The work does require a PC, but most modern models are more than powerful enough.
Also, each service requires spatial sound created to fit its own specifications. For example, YouTube supports two spatial audio formats: First Order Ambisonics, and First Order Ambisonics with Head-Locked Stereo. The rendered video must therefore be exported in one of those two formats. Facebook can publish 360 videos with spatial sound in its News Feed using the Audio 360 suite of tools to edit the audio in post-production and sync it with the 360 video’s field of view. Twitch, meanwhile, provides Twitch Studio for processing the audio into an immersive track for listeners wearing headphones.
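As a concrete check of the YouTube requirement above: first-order ambisonics carries four audio channels (the omnidirectional component plus three axes), and the head-locked variant adds a stereo pair. A trivial helper, assuming those channel counts, can sanity-check an export before upload.

```python
def youtube_spatial_channels(head_locked_stereo):
    """Channel count a first-order ambisonics export needs:
    four ambisonic channels, plus an optional head-locked
    stereo bed of two more."""
    return 4 + (2 if head_locked_stereo else 0)

# A plain First Order Ambisonics file must carry 4 channels;
# with Head-Locked Stereo it must carry 6.
plain = youtube_spatial_channels(False)
with_bed = youtube_spatial_channels(True)
```

Checking the rendered file’s channel count against this before upload catches the most common export mistake, a plain stereo mixdown.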
Sound moves forward and around
Spatial sound is an effective means of creating a realistic and highly immersive sound field to accompany video content. Like any audio technology, it should not be used indiscriminately, but rather wherever it best serves and enhances the video.