Chapter 2: The Ins and Outs of Audio Compression

OK, you might have noticed that sampling 44,100 times a second with 65,536 possible values per sample makes a lot of data. If you've ever extracted an audio CD to your PC as a WAV then you will know that uncompressed digital audio can be seriously bigger than the mp3s you get off the 'net.
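To put a number on "a lot of data", here is a quick back-of-the-envelope sketch. It assumes two channels (stereo), which the text above doesn't state but is what an audio CD uses; the function name is just for illustration.

```python
SAMPLE_RATE_HZ = 44_100   # samples per second
BITS_PER_SAMPLE = 16      # 2**16 = 65,536 possible values
CHANNELS = 2              # assumption: stereo, as on an audio CD

def uncompressed_bytes(seconds):
    """Size of raw (uncompressed) audio of the given length, in bytes."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS * seconds // 8

bits_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
print(bits_per_second / 1000, "kbps")               # 1411.2 kbps
print(uncompressed_bytes(60) / 1e6, "MB a minute")  # 10.584 MB a minute
```

So a four minute song comes out at roughly 42MB before any compression is applied.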

So, how do they compress the audio so much with so little apparent loss in quality?

How Audio Compresses

Well, there are two essential properties to audio compression.

1) Lossless Data compression. This is the zip (Huffman) type of compression where patterns are searched for in order to decrease the amount of data that needs to be stored. It's all math(s) and the net result is that you can make files smaller without getting rid of any of the data. This is useful. In video, there is a codec (COmpressor DECompressor) called HuffYUV that can do this. In audio there are a few codecs that can do it too; a popular one is Monkey's Audio, which is quite cool.

2) Psychoacoustic models. OK, this is the main area in which audio compression works. This is the lossy part of the compression where an encoder will throw away information in order to reduce the size. This is based on a mathematical model which attempts to describe what the human ear actually hears - i.e. with the intention of disregarding any information that cannot actually be heard.
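To make the lossless half of that pair concrete, here is a minimal sketch using Python's standard zlib module, which implements the same kind of scheme zip uses (pattern matching combined with Huffman coding). The repetitive sample bytes are made up for illustration; the only point is that the data comes back bit-for-bit identical, which is exactly what the lossy psychoacoustic stage does not promise.

```python
import zlib

# Made-up, very repetitive data: lots of patterns for the compressor to find.
original = b"la la la la " * 200

compressed = zlib.compress(original, 9)   # 9 = highest compression level
restored = zlib.decompress(compressed)

print(len(original), "bytes ->", len(compressed), "bytes")
print("identical after the round trip:", restored == original)
```

Real audio is far less repetitive than this, which is why lossless audio codecs shrink files much less dramatically than lossy ones.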

Exactly what information to throw away depends on the codec being used. Some codecs are designed to remove certain frequencies so that the compression is best for voices (telephones work with such a system which is why all On Hold music sounds like poo even if it's a tune you like ^^).

Various models have been formulated over the years in order to reduce the sizes of audio files. However, the most significant in recent years are undoubtedly the psychoacoustic models used in MPEG-1 Layer 3 (mp3) compression, which I will talk about soon.

The Stages of MP3 Compression

First, let's look at the stages that take place in compressing an audio file. For this example, the mp3 codec is described:

[Note that these stages may not necessarily occur in this order ^_^]

The waveform is separated into small sections called frames (think of them as similar to video frames) and it is within each frame that the audio will be analysed. [These frames, when the audio is encoded, can then be 'interleaved' with the corresponding video frames to sync your audio but more on that later]

The section is analysed to see what frequencies are present (aka spectral analysis).

These figures are then compared to tables of data in the codec that contain information on the psychoacoustic models. In the mp3 codec, these models are very advanced and a great deal of the modeling is based on the principle known as masking, which I will talk about in more detail in a moment. Whatever the model says you can actually hear is retained and the rest is discarded. This accounts for the majority of the audio compression.

Depending on the bitrate, the codec uses the allotted number of bits to store this data.

Once this has taken place, the result is then passed through the lossless Huffman zip-type compression which reduces the size by roughly another 10%. [This is why there is no point in zipping an mp3… it's already been 'zipped']
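The first two stages above (cutting the waveform into frames, then seeing which frequencies are present) can be sketched in a few lines. This is only an illustration under stated assumptions: the sample rate and frame size are picked to keep the sums quick, the frames don't overlap, and a plain, slow Fourier transform stands in for the codec's real analysis. None of the names come from an actual mp3 encoder.

```python
import math

SAMPLE_RATE = 8000   # Hz, assumption: kept low so the naive transform is quick
FRAME_SIZE = 256     # samples per frame, illustrative only

def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Cut the waveform into consecutive frames, dropping any partial tail."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def magnitude_spectrum(frame):
    """Naive discrete Fourier transform: strength of each frequency bin."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    """Frequency (in Hz) of the strongest bin in this frame."""
    mags = magnitude_spectrum(frame)
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    return peak_bin * sample_rate / len(frame)

# One second of a pure 1000 Hz tone as test input.
samples = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
frames = split_into_frames(samples)
print(len(frames), "frames")                 # 31 frames
print(dominant_frequency(frames[0]), "Hz")   # 1000.0 Hz
```

A real encoder would then hand each frame's spectrum to the psychoacoustic model rather than just picking out the loudest frequency.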

That's basically what an mp3 compressor does, but what about that psychoacoustic model? What is masking?

Well, the main way that the mp3 codec removes information is by discovering which sounds are apparently undetectable or 'masked' and so cannot be heard. These sounds are then removed (hopefully) without any audible loss in sound.

Psychoacoustics and Masked Sounds:

There are two main types of masking effects - Simultaneous Masking and Temporal Masking.

Simultaneous Masking works under the principle that certain sounds can drown out other sounds when played at the same time. In quiet pieces of music you may be able to hear very subtle sounds such as the breathing of a singer. In a loud song, these elements would no longer be audible but that doesn't mean that the sound has disappeared. It has also been ascertained that if you play two sounds together, with the second one at a slightly higher pitch and only slightly quieter, it will be very difficult to hear the second one. The brain basically does its own filtering on the sound. However, if you have two distinct sounds playing, even if you can't hear one of them, you still have much more information to store. This is the kind of information that is removed and that is the principle of simultaneous masking - the removal of the sounds the brain doesn't hear because of other sounds being present.
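Here is a toy version of that idea, just to show the shape of the decision. Every number in it is invented for illustration: the 10 dB offset, the 0.05 dB-per-Hz slope and the three example sounds are assumptions, not values from the mp3 codec, whose models are far more elaborate.

```python
def masking_threshold_db(masker_hz, masker_db, probe_hz,
                         offset_db=10.0, slope_db_per_hz=0.05):
    """Toy rule: a loud sound hides anything quieter than this level.

    The hidden zone is strongest near the masker's own pitch and fades
    as the other sound moves further away in frequency.
    """
    return masker_db - offset_db - slope_db_per_hz * abs(probe_hz - masker_hz)

def split_audible(components):
    """Sort (frequency_hz, level_db) pairs into audible and masked lists."""
    audible, masked = [], []
    for hz, db in components:
        hidden = any(
            db < masking_threshold_db(other_hz, other_db, hz)
            for other_hz, other_db in components
            if (other_hz, other_db) != (hz, db)
        )
        (masked if hidden else audible).append((hz, db))
    return audible, masked

# A loud tone, a slightly higher and quieter one, and a distant quiet one.
components = [(1000, 70.0), (1100, 50.0), (4000, 40.0)]
audible, masked = split_audible(components)
print("kept:   ", audible)   # [(1000, 70.0), (4000, 40.0)]
print("dropped:", masked)    # [(1100, 50.0)]
```

Note that the quiet 4000 Hz sound survives even though it is quieter than the dropped one: it is far enough away in pitch that the loud tone doesn't cover it.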

Temporal Masking works in a similar way but here the idea isn't that you can't hear one sound because of another one being similar; it's that if you play one sound slightly after another one you won't be able to hear the second one (and vice versa). Again, this is sound information that would be removed.

This all sounds great, doesn't it? - you can remove sounds that you can't hear anyway and get small files. Well, that's kinda true but unfortunately the fact of the matter is that you are getting rid of lots of data and some people can tell.

There is no such thing as an audiophile's mp3 player... an audiophile can tell the difference.

Storing the Data: Bitrates and how they work.

I've touched on this already in the compression description but it's certainly an area worth looking at. You know what 16-bit 44.1kHz audio means in terms of your audio sample but what about 128kbps in terms of your compressed file?

Well, what if you only had a certain number of bits to describe a second of audio? The way the data is stored is by working out how to represent the wave in each frame mathematically. Using a mathematical tool from the Discrete Cosine Transform (DCT) family (mp3 uses a modified version, the MDCT), a wave can be expressed as a sum of cosine waves. The more cos waves that are used in the mathematical expression, the closer the result will be to the original waveform. The bitrate limits how much of this data can be stored per frame, and hence how complex and accurate the description can be.
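The "more cos waves, closer result" idea is easy to try out. The sketch below uses a plain textbook DCT (type II) on a single made-up 64-sample frame as a stand-in; a real mp3 encoder uses a modified transform on overlapping blocks, so treat this as an illustration of the principle only. All function names are hypothetical.

```python
import math

def dct(signal):
    """DCT-II: how much of each cosine wave is present in the signal."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(signal))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() above: rebuild the signal from its cosine weights."""
    n = len(coeffs)
    return [(coeffs[0] / 2
             + sum(coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
                   for k in range(1, n))) * 2 / n
            for i in range(n)]

def keep_largest(coeffs, count):
    """Zero everything except the `count` strongest cosine weights."""
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    keep = set(order[:count])
    return [c if k in keep else 0.0 for k, c in enumerate(coeffs)]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Made-up test frame: two sine waves mixed together.
frame = [math.sin(2 * math.pi * 3 * i / 64)
         + 0.5 * math.sin(2 * math.pi * 7 * i / 64 + 1.0)
         for i in range(64)]
coeffs = dct(frame)

errors = []
for count in (4, 16, 64):
    approx = idct(keep_largest(coeffs, count))
    errors.append(rms_error(frame, approx))
    print(count, "cosines -> RMS error", errors[-1])
```

The error shrinks as more cosines are kept and is effectively zero when all 64 are used. A low bitrate is like being forced to stop at a small count.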

In plain terms, 128kbps means that you have 128kbits of data each second to describe your audio. This is an OK amount for a codec such as mp3 and is one of the more popular levels for mp3 encoding. Just like video, the quality of your audio encode will increase if it has more kbits/second to utilise. For example, a 192kbps mp3 often sounds much closer to CD quality than a 128kbps one. It's all about storing data and it's a simple fact that the bitrate is a limiting factor.
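The sums behind those bitrates are simple enough to check yourself. This sketch assumes a four minute track, takes 1 kbit as 1000 bits, and compares against 16-bit 44.1kHz stereo; the 1152 samples per frame figure is the MPEG-1 Layer 3 frame length, and the function name is just for illustration.

```python
CD_KBPS = 44_100 * 16 * 2 / 1000   # uncompressed 16-bit 44.1kHz stereo
SAMPLES_PER_FRAME = 1152           # frame length for MPEG-1 Layer 3
SAMPLE_RATE_HZ = 44_100

def mp3_size_bytes(bitrate_kbps, seconds):
    """File size of a constant bitrate encode, ignoring tags and headers."""
    return bitrate_kbps * 1000 * seconds // 8

for kbps in (128, 192):
    size_mb = mp3_size_bytes(kbps, 240) / 1e6
    ratio = CD_KBPS / kbps
    bytes_per_frame = kbps * 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ / 8
    print(f"{kbps}kbps: {size_mb:.2f}MB for 4 minutes, "
          f"about {ratio:.1f}x smaller than CD audio, "
          f"roughly {bytes_per_frame:.0f} bytes per frame")
```

At 128kbps each frame gets only around 418 bytes to describe its slice of the song, which is why the encoder has to be so picky about what it keeps.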

CBR and VBR:

Most audio is compressed at a Constant Bitrate (CBR), which means that every second will have the same number of bits available to it (there is a bit reservoir but essentially you are fixed to the amount of data per second). However, it is obvious that audio is anything but constant. There are quiet parts and loud parts, complicated parts and simple parts but in the end, if you encode with a constant bitrate, they all have to be described with the same number of bits.

Hence, Xing (who are now part of RealNetworks) developed a system of Variable Bitrate (VBR) encoding in which the bitrate for each frame is scaled based upon the principle that some sections require fewer bits and others require more. What this means is that for the same filesize, better quality audio can be encoded.
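As a rough sketch of the difference, imagine each frame has a "complexity" score. That score, the four example frames and the proportional sharing rule are all assumptions made up for this illustration; a real encoder decides this from its psychoacoustic model and is limited to certain bitrate steps.

```python
def cbr_allocation(complexities, average_bits_per_frame):
    """Constant bitrate: every frame gets the same number of bits."""
    return [float(average_bits_per_frame)] * len(complexities)

def vbr_allocation(complexities, average_bits_per_frame):
    """Toy variable bitrate: share the same total in proportion to complexity."""
    total_budget = average_bits_per_frame * len(complexities)
    total_complexity = sum(complexities)
    return [total_budget * c / total_complexity for c in complexities]

# Hypothetical scores: two simple frames, one busy frame, one in between.
complexities = [1.0, 1.0, 4.0, 2.0]
cbr = cbr_allocation(complexities, 3344)
vbr = vbr_allocation(complexities, 3344)

print("CBR bits per frame:", cbr)
print("VBR bits per frame:", vbr)
print("same total:", sum(cbr) == sum(vbr))
```

The file ends up the same size either way, but the busy frame gets twice what CBR would have given it while the simple frames give up bits they didn't need.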

The debated usefulness of this in terms of music videos will be discussed later in the final section about making your distributable video.

Joint Stereo

I figured it would also be worth mentioning the principle of Joint Stereo here as I will be recommending it as an option later on in chapter 6. Basically the idea is that for the most part, left and right channels are very, very similar. So why bother having twice the data for most of the song when lots of it can be duplicated for each channel? This is where the Joint Stereo idea comes in. It compares the left and right channels and works out how much data it can save by making them identical and encoding the data once. This means there will be elements of your audio that are, in effect, mono. These are only elements, however, and it is a very useful addition for the reduction of file sizes.
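One common way of exploiting that similarity is to store what the channels share (the "mid") and how they differ (the "side") instead of left and right. The sketch below shows the principle with made-up sample values and a simple divide-by-two scaling, which is an assumption chosen for readability and not necessarily what any particular encoder does.

```python
def encode_mid_side(left, right):
    """Turn left/right into 'what they share' and 'how they differ'."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def decode_mid_side(mid, side):
    """Rebuild left/right from the mid and side signals."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# Made-up samples: the two channels are nearly identical.
left = [100, 200, -50, 30]
right = [98, 202, -50, 28]

mid, side = encode_mid_side(left, right)
print("mid: ", mid)    # [99.0, 201.0, -50.0, 29.0]
print("side:", side)   # [1.0, -1.0, 0.0, 1.0]
print("round trip ok:", decode_mid_side(mid, side) == (left, right))
```

When the channels are alike, the side signal is tiny and needs very few bits. Throwing the side signal away entirely for some part of the sound is what makes that part, in effect, mono.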

There is a secondary pass to the Joint Stereo formation, which is quite clever and uses another psychoacoustic model. The theory is that we are very bad at telling where very high and very low frequency sounds are coming from. A practical example of this is subwoofer speakers - they can be stuck in the corner of the room away from the other speakers and you still can't really tell that the bass is coming from there. Taking this idea into consideration, bass sounds and very high pitched sounds are made mono - because you can't tell the difference.

Of course, with this method you do get a reduction in the stereo separation in your audio. Many people cannot tell the difference but it is there so if you want the best quality you may want to go for the normal stereo mode. Also, it can introduce errors that can't really be regained by increasing the bitrate. If the audio sounds a little crappy, try normal stereo.

AbsoluteDestiny - June 2002
