Art of audio timing
When I started designing OSS (the Linux sound driver at that time) I wanted to create an audio API that behaves like a streaming tape drive. A streaming tape drive runs at its maximum speed as long as the application can read or write data fast enough to keep the tape running. If this fails then the tape has to be stopped, rewound and reaccelerated. The actual tape operation continues when the tape reaches the position where the buffering failure occurred. This doesn’t ruin the integrity of the data. However the speed will drop dramatically if the application cannot keep up with the tape.
Digital audio is actually very close to the streaming tape model. The audio signal is consumed/produced by the sound card at constant speed (depending on the sampling rate). The application must be able to read/write data at the rate the device runs. The difference is that buffering errors cannot be recovered. Every time a timing error happens some recorded data will be lost forever. During playback there will be more or less frequent gaps in the output and the signal will sound badly garbled.
In both cases the best strategy to avoid timing errors is to use normal blocking reads and writes. This is how audio devices should be used in OSS. There are some situations where the application should be designed for lower latencies than normal, or where it must be able to respond to external events (such as keyboard or mouse) without noticeable delays. These special cases will be discussed in this document. This text is based on my 18 years of experience with OSS and applications written for it. However, I have made the key observations only during the last couple of years.
Low latencies
All “advanced” audio programmers keep thinking and talking about low latencies. What do low latencies mean and is there really any reason to pay any attention to them?
In general, latency means the time difference between writing a sample to the audio device (say /dev/dsp) and the moment when it actually gets played by the device. There will always be some latency when playing or recording digital audio. Sometimes all latencies are bad. Sometimes they don’t matter. Sometimes bigger latencies may even be necessary to avoid playback buffer underruns or recording buffer overruns when the system is under load (say you want to recompile the kernel while listening to some MP3 files). In ALSA’s terminology these underruns and overruns are known as xruns. However, the audio subsystem (be it OSS, ALSA or something else) doesn’t matter, since these xruns are always caused by slowness of the application or by too high a CPU load on the system.
What does zero latency mean?
First of all, zero latencies are impossible. It is possible to push the latencies down to just a couple of samples, but the “laws of physics” make getting zero latencies as impossible as reaching the speed of light. It is not possible to get very low latencies using stock PC (PCI, USB or FireWire) audio hardware without paying too high a price in CPU consumption. Typical “PC” hardware is simply not designed for this. The same is true of typical operating systems (like Linux and Unix) that are designed for different kinds of purposes.
To push latencies to the bottom you will need to use special computers such as ones built around a DSP chip or a fast enough microcontroller. In addition you will need a DAC/ADC chip that is interfaced with the computer (controller) in a proper way. Also the operating system and the application software need to be designed in a proper way.
An ideal system
Let’s focus on playback (recording will work in a pretty much identical way). To play digital audio you will need a DAC (Digital to Analog Converter) that converts the samples to an analog signal that can be fed to the speakers. The application program runs in a microcontroller and we can assume that there is no operating system involved. The application simply reads and writes the audio hardware registers directly. DAC chips are typically connected to the controller using a serial interface. We will focus on the I2S interface, but the others (such as AC97 and HDaudio links) work in a similar way. I2S is a 3-wire interface. The first signal is the left/right (L/R) clock, which tells if the sample being transferred belongs to the left or right channel (in standard I2S, 0=left and 1=right). The frequency of the L/R signal is the sampling rate (which is 48000 Hz in our case). The L/R clock is also known as the word clock. The other signals are data and bit clock. The controller has a transmitter circuit (shift register) that sends the data to the DAC bit by bit. The rising edge of the clock signal is used to latch the bits into the DAC. The clock signal also drives the internal circuits of the DAC. Typically DACs require 64, 128 or 256 clocks to process one sample. For this reason the controller may need to feed extra (zero) bits to the DAC in addition to the actual data.
The DAC chip itself will cause some latencies. During each sample period it’s outputting the analog value of one sample. At the same time it is receiving a new sample over the I2S link. There may be additional processing steps that each require one sample period. This gives a pipeline that is two to a few samples deep. The controller in turn is sending one sample over its serial link. It probably needs another sample waiting in the registers for the moment when the previous one has been sent. The application is computing yet another sample, and it may also need another (already computed) sample that is ready to be written to the audio controller registers. The result is that even in the optimum case it will take 2 to N sample (L/R clock) periods before the actual signal can be heard from the speakers. So the conclusion is that obtaining truly zero latencies is impossible.
A dedicated audio controller/DSP chip can have lightning fast interrupt response times so the application can be woken up to write a new sample in no time. The application can also busy-wait on the status bits of the audio interface so that no interrupts are required.
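To make the clock arithmetic above concrete, here is a minimal sketch. It assumes that the “clocks per sample” figure means bit clocks per L/R (word clock) period and that the stream is 16-bit stereo; both are my assumptions for illustration, and the function names are made up.

```python
def i2s_bit_clock_hz(sample_rate_hz, clocks_per_period):
    """Bit clock frequency when the DAC wants a fixed number of clocks
    per L/R (word clock) period. Hypothetical helper for illustration."""
    return sample_rate_hz * clocks_per_period


def padding_bits(clocks_per_period, bits_per_sample=16, channels=2):
    """Extra (zero) bits the controller has to send per L/R period
    in addition to the actual sample data."""
    return clocks_per_period - bits_per_sample * channels


for clocks in (64, 128, 256):
    print(clocks, i2s_bit_clock_hz(48000, clocks), padding_bits(clocks))
```

With 64 clocks per period at 48000 Hz the bit clock runs at 3.072 MHz, and half of the transmitted bits are padding under these assumptions.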
How about “PC” hardware and OSS/ALSA?
Typical PC or Unix workstation/server hardware is designed for entirely different purposes than dedicated audio microcontroller/DSP hardware. The same is true of the operating systems, which are supposed to be able to serve as many concurrent processes and devices as possible. For this reason typical sound cards have been designed in a slightly different way. We still have a DAC and an audio controller that feeds the samples to the DAC one by one. The difference is that the host CPU is no longer the audio controller. Instead the controller is a chip that is located on the sound card (be it PCI, PCI-E, USB or whatever). To free the host CPU for other tasks the controller reads the samples from a buffer located in memory, in the background, using DMA transfers (or whatever). For this a circular DMA buffer is used. The controller will read all the samples placed in the DMA buffer and it will automatically skip back to the beginning after all the samples have been consumed. This process is repeated again and again until the application has written all available audio data to the DMA buffer. To optimize the bus bandwidth the controller will read the samples in bursts of a few samples instead of arbitrating the bus for each sample. To reduce interrupt and system call overhead the device driver (say OSS) will handle the samples in chunks called fragments (or periods in ALSA terminology). After each fragment the device will raise an interrupt and OSS will do some bookkeeping before waking up the waiting application so that it can write some more audio data. The fragment size depends on the requirements of the application and the capabilities of the device. Larger fragment sizes provide slightly smaller CPU overhead while small fragments are required to reduce latencies.
Synchronous timing vs. asynchronous
Digital audio is unique because it represents time. The DAC chip outputs samples at a fixed rate driven by a digital (crystal) clock. The sampling rate (Fs) is typically set up by the application but sometimes the device may be locked to some specific speed. Each sample will take exactly 1/Fs seconds to play. With a sampling rate of 48000 Hz each sample equals 1/48000th of a second. 48 samples equal one millisecond and 48000 samples equal one second. To be able to play a 48 kHz audio stream properly the application should be able to write 48000 samples or more per second (sustained) to the device. This also means that the time required to produce a sample should not be more than 1/48000th of a second. In addition the other concurrently running processes will take their time, so in practice the total (multitasking) CPU load should not be more than about 80% (this result can be derived from queuing theory).
When trying to obtain the lowest possible latencies it’s necessary to use a relatively small buffer. The application needs to make sure that it doesn’t write too many samples to the kernel level DMA buffer managed by OSS (or ALSA). There are two ways to do this. In the synchronous method the application does its timing based on the samples consumed by the audio device. This is the simple way. The application simply writes samples to the audio device and OSS will automatically block it if there is not enough room in the DMA buffer. In the asynchronous method the application monitors the DMA buffer and only writes new samples if the buffer level is low enough. Is there any difference in performance between these methods? Sure there is.
In the synchronous method the application determines the latencies by calling ioctl(SNDCTL_DSP_SETFRAGMENT). The application can request a given fragment size (in powers of 2) and number of fragments. Typical low latency applications will request two fragments of relatively small size. The average latency will be 1.5*fragment size (in samples). After setting up the fragment size the application will keep writing new samples in fragment size chunks. If the buffer is full then the application will block for a moment that is (on average) 0.5*fragment time.
In the asynchronous method the application uses ioctl(SNDCTL_DSP_GETODELAY) or some other ioctl to find out when it’s the right time to write more data to the device. Then it calls usleep() or some other system call to wait until it can write more data. This may look like as good an approach as the synchronous one but it’s not. In fact this approach will fail very frequently. Why?
When using dual buffering the device is playing one fragment of audio data. The application must write some more data before the buffer drains completely. If the fragment size is 48 samples then one fragment equals one millisecond. In the synchronous approach the application is blocking on the write call until the device has completed playing the current fragment. At that moment the device will start playing the other fragment and the first fragment will become free for new data. At the same time the fragment interrupt handler of OSS will get fired and it will wake up the application to write new data. There will be some scheduling latency but there is plenty of time. The clock starts ticking from the exact moment when a fragment becomes available. Even if there is some scheduling delay the application will have a chance to run as soon as possible.
The asynchronous approach is much more complicated. There are three different timings: the application’s own, the audio device’s and the system timer’s. The application runs at its own timing. When it has finished the processing for a new block of data the exact time relative to the audio buffer can be anything. At that moment it will call some OSS ioctl to check the buffer availability. However this may happen at any moment. It may happen a nanosecond after OSS has got a fragment interrupt. Equally well it may happen a nanosecond before that. Based on the information returned by the ioctl call the application will compute how long it should wait before writing the data. Already at this moment the application may be one fragment too late because the fragment interrupt has occurred one nanosecond after the ioctl call. Now the application will call usleep() or something else to wait until some buffer space becomes available. However this is another source of error. By definition the usleep() system call will block until at least the requested number of microseconds has passed. The resolution depends on the system clock rate, which may be 60, 100 or 1000 Hz. With the typical rate of 100 Hz the delay may be up to 10 milliseconds (0.01 seconds) too long. Even if the usleep() call returns at the right moment there is a chance for error. Also the system timer is asynchronous to the audio buffering. When using the synchronous method the application will wake up at the desired moment plus some scheduling delay. However in the asynchronous method the application may get resumed at a moment when just 1% of the fragment interval is left. If the application requires more than 1% of a fragment interval to produce the next fragment then it’s almost sure that it will be too late rather frequently. There is simply no spare time left. Even a single competing process may delay the audio task so that it cannot write more data to the audio device before it’s too late.
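The arithmetic above is easy to check with a few lines of code. This is only an illustrative sketch; the helper names are made up and the 80% figure is simply the rule of thumb quoted above applied to the per-sample time budget.

```python
def samples_to_ms(samples, rate_hz):
    """Playing time of a number of samples, in milliseconds."""
    return samples * 1000.0 / rate_hz


def per_sample_budget_us(rate_hz, max_cpu_load=1.0):
    """Time available to produce one sample, in microseconds.
    max_cpu_load scales the budget (0.8 = the 80% rule of thumb)."""
    return max_cpu_load * 1_000_000.0 / rate_hz


print(samples_to_ms(48, 48000))         # one millisecond
print(samples_to_ms(48000, 48000))      # one second
print(per_sample_budget_us(48000))      # about 20.8 microseconds
print(per_sample_budget_us(48000, 0.8)) # about 16.7 microseconds
```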
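As a rough sketch of the numbers involved: to my understanding the SNDCTL_DSP_SETFRAGMENT argument packs the maximum number of fragments into the upper 16 bits and log2 of the fragment size in bytes into the lower 16 bits. Treat that layout, the helper names and the 16-bit stereo format (4 bytes per sample frame) as assumptions made for illustration. The code only computes values; it does not touch any audio device.

```python
def setfragment_arg(fragment_bytes, max_fragments):
    """Build the integer passed to ioctl(SNDCTL_DSP_SETFRAGMENT),
    assuming the (max_fragments << 16) | log2(size) layout."""
    if fragment_bytes <= 0 or fragment_bytes & (fragment_bytes - 1):
        raise ValueError("fragment size must be a power of two")
    selector = fragment_bytes.bit_length() - 1  # log2 of the size
    return (max_fragments << 16) | selector


def average_latency_ms(fragment_bytes, bytes_per_frame, rate_hz):
    """Average latency of 1.5 fragments, converted to milliseconds."""
    fragment_frames = fragment_bytes / bytes_per_frame
    return 1.5 * fragment_frames * 1000.0 / rate_hz


# Two fragments of 256 bytes, 16-bit stereo (4 bytes per frame), 48 kHz.
print(hex(setfragment_arg(256, 2)))
print(average_latency_ms(256, 4, 48000))
```

The driver is free to pick a different size than requested, so a real application should query the actual fragment size afterwards instead of trusting the computed value.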
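The difference between the two wake-up models can be put into numbers with a toy calculation. This is a simplified model of my own, not a measurement: it assumes two fragments, that a sleeping process is resumed only on a system timer tick, and that ticks fall on multiples of the tick period. All times are in microseconds and the function names are hypothetical.

```python
def sync_slack_us(fragment_us, sched_delay_us):
    """Time left to produce the next fragment when the driver wakes the
    application at the fragment boundary (blocking write)."""
    return fragment_us - sched_delay_us


def async_slack_us(fragment_us, boundary_us, tick_us, sched_delay_us):
    """Time left when the application sleeps on the system timer instead.
    The sleep cannot end before the first timer tick at or after the
    fragment boundary. A negative result means an underrun."""
    first_tick_us = -(-boundary_us // tick_us) * tick_us  # ceiling division
    wake_us = first_tick_us + sched_delay_us
    deadline_us = boundary_us + fragment_us
    return deadline_us - wake_us


FRAGMENT_US = 1000  # 48 samples at 48 kHz

print(sync_slack_us(FRAGMENT_US, 50))
# 1000 Hz timer, boundary 10 us after a tick: only 1% of the fragment is left.
print(async_slack_us(FRAGMENT_US, 9010, 1_000_000 // 1000, 0))
# 100 Hz timer: the wake-up comes several fragments too late.
print(async_slack_us(FRAGMENT_US, 5000, 1_000_000 // 100, 50))
```

In the synchronous case the slack depends only on the scheduling delay. In the asynchronous case it also depends on the phase between the timer and the audio stream, which the application cannot control.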
In other words
So for short latencies you should ask OSS to use a small number of short fragments, typically two of them. In the synchronous method you simply feed audio data to the device by calling write with fragment size chunks. The first write will fill the first fragment and start playback on the device. The second write will fill the second fragment in the DMA buffer. The third write will block until the first fragment has been played and there is room for the third fragment. Finally, when the second fragment has been played, the next waiting write will continue immediately. Even if the CPU is under high load there will be maximum time left to compute the next fragment. This is possible because the process will always get woken up near the beginning of the fragment time slot. If blocking in write is not allowed then you can use select/poll to get equal results.
The asynchronous method works in a slightly different way. Write calls are triggered by some timer events that are asynchronous to the audio stream. For this reason the process will get a chance to produce new audio data much later. In the worst case this may happen near the very end of the fragment time slot. Any external activity in the system may delay the audio application and cause a buffer underrun.
Unfortunately it looks like typical legacy OSS/Free audio applications all use the asynchronous approach that is doomed to fail. I suspect the reason is that OSS/Free and ALSA’s OSS emulation don’t support normal blocking writes/reads properly, so application developers have been forced to use the highly unreliable asynchronous approach. This gives OSS a bad reputation because most of the OSS applications no longer work properly.
So do not use asynchronous events to drive a low latency audio application unless you can be sure that the timer has a resolution that is much higher than the fragment time period.
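The sequence of writes described above can be traced with a small model. It is a sketch under simplifying assumptions of my own: playback starts at time zero with the first write, the application needs no time to compute a fragment, and there is no scheduling delay. The function name is made up.

```python
def write_return_times(num_writes, fragment_ms, num_fragments=2):
    """Return the time (in ms) at which each blocking write() returns."""
    times = []
    now = 0.0
    for i in range(num_writes):
        if i >= num_fragments:
            # The buffer is full: wait until fragment (i - num_fragments)
            # has been played completely and its slot becomes free.
            free_at = (i - num_fragments + 1) * fragment_ms
            now = max(now, free_at)
        times.append(now)
    return times


# Two fragments of 1 ms each (48 samples at 48 kHz).
print(write_return_times(5, 1.0))  # [0.0, 0.0, 1.0, 2.0, 3.0]
```

After the buffer has been filled, every write returns exactly at a fragment boundary, so the application always starts its next computation with a whole fragment period ahead of it.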
Should I care about low latencies at all?
In most cases the answer is no. In some cases you need to pay some attention to it (for example when playing audio and video in lip sync). Only in very few cases will you need really low latencies. This is true if you need to do things like real time guitar effects processing or when there is a risk that latencies accumulate during a chain of multiple processing stages.
Many of you think that applications like Digital Audio Workstations (DAW) require low latencies. Typically a DAW application plays earlier recorded material and at the same time records new tracks performed by the vocalist or some other instruments. The claim is that if the latencies are not minimal then the later tracks will get delayed by the amount of latency. However this is all bullshit. It doesn’t matter how large the latencies are as long as you know the exact value. Even if the latency between output and input is seconds you can easily compensate for this by throwing away the right number of recorded samples.
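Compensating for a known latency really is this simple. The following sketch assumes that the total output-to-input latency is known exactly, in samples, and that the recorded track is held in a plain list; the function name and the sample values are made up for illustration.

```python
def compensate_latency(recorded, latency_samples):
    """Drop the leading samples that were captured before the performer
    could have heard the first sample of the playback track."""
    if latency_samples < 0:
        raise ValueError("latency must not be negative")
    return recorded[latency_samples:]


# Three samples of round-trip latency, then the actual performance.
recorded_track = [0, 0, 0, 11, 12, 13, 14]
aligned_track = compensate_latency(recorded_track, 3)
print(aligned_track)  # [11, 12, 13, 14]
```

After the cut, sample 0 of the new track lines up with sample 0 of the material that was playing during the recording.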
What if I should not block?
There are some special situations where an audio application must be able to react to external events very rapidly. Unfortunately this is one of the most difficult aspects of audio programming. It’s also one of the most commonly misunderstood areas. Audio application developers seem to think it’s sexy if they can manage to get their application to work without blocking. This is not correct.
The thing that irritates me most in the OSS world is that practically all popular audio applications try to avoid blocking audio reads and writes. They use clever looking algorithms to decide if they can read or write without blocking. Then they try to estimate when they will be able to read/write without blocking. Finally they call usleep() to wait until that time. This is all bogus. Instead of using the reliable synchronous timing mode they block asynchronously in usleep(). This gives no benefit. Instead it breaks the application because asynchronous timing is doomed to fail.
There are applications that should avoid blocking (both synchronous and asynchronous). For example GUI applications are supposed to be able to react to keyboard events and mouse clicks rapidly. However even they don’t need to avoid blocking completely. In most cases some amount of blocking is allowed before the user notices any delays in responsiveness. Typically GUI applications use mechanisms like poll, select or gdk_input_add to handle events from parallel sources (OSS, keyboard, mouse, serial port, network or whatever). This guarantees rapid response in all cases. However the OSS write/read calls can be blocking ones. When the application uses reasonably small fragments and writes, it’s guaranteed that the application will not block for too long. In most cases there will be no blocking at all.
It is perfectly OK to use mechanisms like poll and select to avoid blocking. However blocking on OSS reads and writes should not be avoided. In particular this is true when the application has to block on some asynchronous timer to avoid blocking on OSS. Application developers who do this should boil themselves in hot oil.