This copy was created on the 30th of January, 2025.

Previously archived on The Internet Archive (WAVE and Canon) on the 9th of July, 2014 and the 11th of September, 2014 from Timothy John Weber's old personal website.

Originally written by Timothy John Weber.


The WAVE File Format

Introduction

The WAVE file format is a subset of Microsoft's RIFF spec, which can include lots of different kinds of data. It was originally intended for multimedia files, but the spec is open enough to allow pretty much anything to be placed in such a file, and ignored by programs that read the format correctly.

This description is not meant to be exhaustive, but to suggest simple ways of doing common tasks with waveform audio, and give some pointers to other sources of information.

Basics of digital audio and sound

First, some basics. Sound is air pressure fluctuation. Digitized sound is a graph of the change in air pressure over time. That's all there is to it.

For a good picture of this, open up Windows Sound Recorder and record a short sound, then look at the green bars it shows. When they're wide, the air pressure is fluctuating a lot, which your ear detects as a loud noise. When they're flat in the middle, there's no change in air pressure, which your ear detects as silence. The faster they go up and down, the higher the sound you hear.

When you record a sound, your microphone changes the air pressure fluctuations into electrical voltage fluctuations, which your sound card measures every so often and changes into numbers, called samples. When you play a sound back, the process is reversed, except that the voltage fluctuations go to your speakers instead of your microphone, and are converted back into air pressure by the speaker cone.

The speed with which your sound card samples the voltage is called the sample rate, and is expressed in kilohertz (kHz). One kHz is a thousand samples per second.

It's important to note that digitized audio stores nothing directly about a sound's frequency, pitch, or perceived loudness. You can run certain algorithms on the samples to determine these values approximately, but you can't just read them from the file.

What is RIFF?

RIFF is a file format for storing many kinds of data, primarily multimedia data like audio and video. It is based on chunks and sub-chunks. Each chunk has a type, represented by a four-character tag. This chunk type comes first in the chunk, followed by the size of the chunk's data, then the data itself.

The entire RIFF file is a big chunk that contains all the other chunks. The first thing in the contents of the RIFF chunk is the "form type," which describes the overall type of the file's contents. So the structure of a RIFF file looks like this:

    Offset  Contents
    (hex)
    0000    'R', 'I', 'F', 'F'
    0004    Length of the entire file - 8 (32-bit unsigned integer)
    0008    form type (4 characters)
    
    000C    first chunk type (4 characters)
    0010    first chunk length (32-bit unsigned integer)
    0014    first chunk's data
    ...     ...

All integers are stored in the Intel low-high byte ordering (usually referred to as "little-endian").
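The layout above maps directly onto a small amount of code. The following Python sketch (the function name is mine, not from any spec) builds a minimal RIFF file in memory and walks its top-level chunks; the `<` prefix in the `struct` format strings selects the little-endian byte order just described:

```python
import io
import struct

def read_riff(f):
    """Parse a RIFF stream: return (form_type, [(chunk_id, data), ...])."""
    riff_id, file_len = struct.unpack("<4sI", f.read(8))
    if riff_id != b"RIFF":
        raise ValueError("not a RIFF file")
    form_type = f.read(4)
    chunks = []
    end = 8 + file_len                # file_len counts everything after offset 8
    while f.tell() < end:
        chunk_id, chunk_len = struct.unpack("<4sI", f.read(8))
        chunks.append((chunk_id, f.read(chunk_len)))
        if chunk_len % 2:             # chunks are padded to an even length
            f.read(1)
    return form_type, chunks

# Build a two-byte 'data' chunk inside a 'WAVE' form, then read it back.
payload = b"\x80\x80"
body = b"WAVE" + b"data" + struct.pack("<I", len(payload)) + payload
blob = b"RIFF" + struct.pack("<I", len(body)) + body
form, chunks = read_riff(io.BytesIO(blob))
```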

A more detailed description of the RIFF format can be found in the Microsoft Win32 Multimedia API documentation, which is supplied as a Windows Help file with many Windows programming tools such as C++ compilers.

What is WAVE?

The WAVE format is a subset of RIFF used for storing digital audio. Its form type is "WAVE", and it requires two kinds of chunks: a fmt chunk, which describes the sample format, and a data chunk, which contains the sample data.

WAVE can also contain any other chunk type allowed by RIFF, including LIST chunks, which are used to contain optional kinds of data such as the copyright date, author's name, etc. Chunks can appear in any order.

The WAVE file is thus very powerful, but also not trivial to parse. For this reason, and also possibly because a simpler (or inaccurate?) description of the WAVE format was promulgated before the Win32 API was released, a lot of older programs read and write a subset of the WAVE format, which I refer to as the "canonical" WAVE format. This subset basically consists of only two chunks, the fmt and data chunks, in that order, with the sample data in PCM format. For a detailed description of what this format looks like, and a description of the contents of the fmt chunk, look at the section under "The Canonical WAVE File Format".

What kind of compression is used in WAVE files?

The WAVE specification supports a number of different compression algorithms. The format tag entry in the fmt chunk indicates the type of compression used. A value of 1 indicates Pulse Code Modulation (PCM), which is a "straight," or uncompressed encoding of the samples. Values other than 1 indicate some form of compression. For more information on the values supported and how to decode the samples, see the Microsoft Win32 Multimedia API documentation.

How can I write data to a WAVE file?

The simplest way to write data into WAVE files produced by your own programs is to use the canonical format. This will be compatible with any other program, even the older ones.


The Canonical WAVE File Format

The canonical WAVE format starts with the RIFF header:

  Offset  Length   Contents
  0       4 bytes  'RIFF'
  4       4 bytes  <file length - 8>
  8       4 bytes  'WAVE'

(The '8' in the second entry is the length of the first two entries. I.e., the second entry is the number of bytes that follow in the file.)

Next, the fmt chunk describes the sample format:

  12      4 bytes  'fmt '
  16      4 bytes  0x00000010     // Length of the fmt data (16 bytes)
  20      2 bytes  0x0001         // Format tag: 1 = PCM
  22      2 bytes  <channels>     // Channels: 1 = mono, 2 = stereo
  24      4 bytes  <sample rate>  // Samples per second: e.g., 44100
  28      4 bytes  <bytes/second> // sample rate * block align
  32      2 bytes  <block align>  // channels * bits/sample / 8
  34      2 bytes  <bits/sample>  // 8 or 16
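The relationships among these fields can be checked in code. This Python fragment (illustrative values for 16-bit stereo PCM at 44100 Hz, not from the original article) packs a fmt body and unpacks it again, confirming that the two derived fields follow from the formulas in the comments above:

```python
import struct

# Pack the 16-byte fmt body described above: format tag, channels,
# sample rate, bytes/second, block align, bits/sample (all little-endian).
fmt_body = struct.pack("<HHIIHH", 1, 2, 44100, 176400, 4, 16)
assert len(fmt_body) == 16

(format_tag, channels, sample_rate,
 bytes_per_sec, block_align, bits_per_sample) = struct.unpack("<HHIIHH", fmt_body)

# The derived fields are consistent with the other entries.
assert block_align == channels * bits_per_sample // 8
assert bytes_per_sec == sample_rate * block_align
```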

Finally, the data chunk contains the sample data:

  36      4 bytes  'data'
  40      4 bytes  <length of the data block>
  44        bytes  <sample data>

The sample data must end on an even byte boundary; if the data has an odd length, a pad byte is appended (and not counted in the chunk length). All numeric data fields are in the Intel format of low-high byte ordering. 8-bit samples are stored as unsigned bytes, ranging from 0 to 255. 16-bit samples are stored as 2's-complement signed integers, ranging from -32768 to 32767.
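Putting the three pieces together, a canonical file can be written in a dozen lines. Here is a minimal sketch under the assumptions above (16-bit mono PCM; the function name and the sample values are mine):

```python
import struct

def write_canonical_wave(path, samples, sample_rate=8000):
    """Write 16-bit mono PCM samples (ints in -32768..32767) as a
    canonical WAVE file: RIFF header, then fmt chunk, then data chunk."""
    channels, bits = 1, 16
    block_align = channels * bits // 8
    data = b"".join(struct.pack("<h", s) for s in samples)  # signed 16-bit
    fmt = struct.pack("<HHIIHH", 1, channels, sample_rate,
                      sample_rate * block_align, block_align, bits)
    body = (b"WAVE"
            + b"fmt " + struct.pack("<I", len(fmt)) + fmt   # note the trailing space
            + b"data" + struct.pack("<I", len(data)) + data)
    with open(path, "wb") as f:
        f.write(b"RIFF" + struct.pack("<I", len(body)) + body)

# A 400-sample square-ish wave, just to produce a playable file.
write_canonical_wave("beep.wav", [0, 16384, 0, -16384] * 100)
```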

For multi-channel data, samples are interleaved between channels, like this:

sample 0 for channel 0
sample 0 for channel 1
sample 1 for channel 0
sample 1 for channel 1
...

For stereo audio, channel 0 is the left channel and channel 1 is the right.
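As a concrete illustration of this ordering (the sample values are made up), interleaving two channels in Python is a one-liner with zip, and strided slices split an interleaved stream back apart:

```python
# Per-channel sample lists for a 3-frame stereo clip (values are arbitrary).
left  = [10, 20, 30]    # channel 0 (left)
right = [11, 21, 31]    # channel 1 (right)

# Interleave frame by frame into the on-disk order shown above...
interleaved = [s for frame in zip(left, right) for s in frame]

# ...and split an interleaved stream back into its channels with strides.
back_left, back_right = interleaved[0::2], interleaved[1::2]
```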