Overview
Proper audio configuration is essential for accurate transcription. This guide provides technical specifications and quality requirements for audio streaming with the WebSocket API.
Simple Audio Streaming: The API accepts WAV-format audio with automatic resampling. Just send clear audio and get accurate transcriptions back.
Primary Format
Format: WAV
Sample Rate: 16000 Hz
Bit Depth: 16-bit
Channels: 1 (Mono)
Encoding: PCM
Streaming
Chunk Size: 1-5 seconds of audio
Minimum: 0.5 seconds
Maximum: 30 seconds
Format: Binary data
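These limits translate directly into byte counts: at 16 kHz, 16-bit, mono, one second of PCM audio is 16,000 samples × 2 bytes = 32,000 bytes. A minimal sizing sketch in plain Python (the helper name is illustrative):
# Chunk sizing for the primary format: 16 kHz, 16-bit (2 bytes), mono
BYTES_PER_SECOND = 16000 * 2 * 1  # 32,000 bytes per second of audio

def chunk_bytes(seconds):
    """Return the PCM payload size for a chunk of the given duration."""
    return int(seconds * BYTES_PER_SECOND)

print(chunk_bytes(0.5))  # 16000 bytes  (minimum chunk)
print(chunk_bytes(5))    # 160000 bytes (top of the recommended range)
print(chunk_bytes(30))   # 960000 bytes (maximum chunk)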
Optimal (Recommended)
{
  "sample_rate": 16000,
  "bit_depth": 16,
  "channels": 1,
  "format": "wav",
  "use_case": "Real-time streaming (best performance)"
}
High Quality
{
  "sample_rate": 44100,
  "bit_depth": 16,
  "channels": 1,
  "format": "wav",
  "use_case": "High-quality recordings"
}
Low Bandwidth
{
  "sample_rate": 8000,
  "bit_depth": 16,
  "channels": 1,
  "format": "wav",
  "use_case": "Limited bandwidth scenarios"
}
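To confirm that a capture pipeline produces the recommended format, one option is to write a short test file with Python's standard wave module and inspect or play it back. A minimal sketch (the file name and silence payload are placeholders):
import wave

def write_test_wav(pcm_bytes, path="test_16k_mono.wav"):
    """Wrap raw 16-bit mono PCM in a WAV container at 16 kHz."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 16-bit samples
        wav_file.setframerate(16000)  # 16 kHz
        wav_file.writeframes(pcm_bytes)

# One second of digital silence as placeholder audio data
write_test_wav(b"\x00\x00" * 16000)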
Audio Quality Guidelines
Basic Quality Requirements
Audio Quality
Clear Speech: Avoid background noise
Consistent Volume: Maintain stable audio levels
No Clipping: Avoid audio distortion (see the level-check sketch below)
Single/Multiple Speakers: Both are supported
Environment
Quiet Environment: Minimal background noise
Good Microphone: Clear audio capture
Stable Connection: Consistent streaming
Proper Distance: Keep the speaker close to the microphone
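Levels and clipping can also be checked programmatically before streaming. A rough sketch using numpy; the thresholds are illustrative assumptions, not API values:
import numpy as np

def check_levels(pcm_bytes):
    """Flag clipping and very low levels in 16-bit mono PCM audio."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32)
    if samples.size == 0:
        return
    peak = np.abs(samples).max() / 32768.0
    rms = np.sqrt(np.mean(samples ** 2)) / 32768.0
    if peak >= 0.99:
        print("Warning: possible clipping (peak near full scale)")
    if rms < 0.01:  # illustrative threshold; tune for your setup
        print("Warning: audio may be too quiet")
    print(f"peak={peak:.3f} rms={rms:.3f}")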
Audio Capture Implementation
Browser Audio Capture
class AudioCapture {
  constructor(ws) {
    this.ws = ws; // open WebSocket used by sendToWebSocket()
    this.audioContext = null;
    this.mediaStream = null;
    this.processor = null;
    this.isRecording = false;
  }

  async initialize() {
    try {
      // Request microphone access
      this.mediaStream = await navigator.mediaDevices.getUserMedia({
        audio: {
          sampleRate: 16000,
          channelCount: 1,
          sampleSize: 16,
          echoCancellation: true,
          noiseSuppression: true
        }
      });

      // Create audio context
      this.audioContext = new AudioContext({
        sampleRate: 16000,
        latencyHint: 'interactive'
      });

      // Create media stream source
      const source = this.audioContext.createMediaStreamSource(this.mediaStream);

      // Create script processor
      this.processor = this.audioContext.createScriptProcessor(1024, 1, 1);
      this.processor.onaudioprocess = (event) => {
        if (this.isRecording) {
          this.processAudioData(event.inputBuffer);
        }
      };

      // Connect audio pipeline
      source.connect(this.processor);
      this.processor.connect(this.audioContext.destination);

      console.log('Audio capture initialized');
    } catch (error) {
      console.error('Failed to initialize audio capture:', error);
      throw error;
    }
  }

  processAudioData(inputBuffer) {
    const inputData = inputBuffer.getChannelData(0);

    // Convert to PCM 16-bit
    const pcmData = new Int16Array(inputData.length);
    for (let i = 0; i < inputData.length; i++) {
      // Clamp to [-1, 1] and convert to 16-bit
      const clamped = Math.max(-1, Math.min(1, inputData[i]));
      pcmData[i] = clamped * 0x7FFF;
    }

    // Send to WebSocket
    this.sendToWebSocket(pcmData.buffer);
  }

  sendToWebSocket(audioBuffer) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(audioBuffer);
    }
  }

  startRecording() {
    this.isRecording = true;
    console.log('Audio recording started');
  }

  stopRecording() {
    this.isRecording = false;
    console.log('Audio recording stopped');
  }

  destroy() {
    if (this.processor) {
      this.processor.disconnect();
    }
    if (this.audioContext) {
      this.audioContext.close();
    }
    if (this.mediaStream) {
      this.mediaStream.getTracks().forEach(track => track.stop());
    }
  }
}

// Usage
const ws = new WebSocket('ws://localhost:8000/ws/transcribe');
const audioCapture = new AudioCapture(ws);
await audioCapture.initialize();
audioCapture.startRecording();
Python Audio Capture
import pyaudio
import websocket

class AudioCapture:
    def __init__(self, sample_rate=16000, chunk_size=1024):
        self.sample_rate = sample_rate
        self.chunk_size = chunk_size
        self.format = pyaudio.paInt16
        self.channels = 1
        self.audio = pyaudio.PyAudio()
        self.stream = None
        self.is_recording = False

    def initialize(self):
        """Initialize audio capture"""
        try:
            self.stream = self.audio.open(
                format=self.format,
                channels=self.channels,
                rate=self.sample_rate,
                input=True,
                frames_per_buffer=self.chunk_size
            )
            print("Audio capture initialized:")
            print(f"  Sample Rate: {self.sample_rate} Hz")
            print(f"  Channels: {self.channels}")
            print("  Format: 16-bit PCM")
        except Exception as e:
            print(f"Failed to initialize audio: {e}")
            raise

    def start_recording(self):
        """Start audio recording"""
        self.is_recording = True
        print("Audio recording started")

    def stop_recording(self):
        """Stop audio recording"""
        self.is_recording = False
        print("Audio recording stopped")

    def read_audio(self):
        """Read audio data from the stream"""
        if self.stream and self.is_recording:
            try:
                return self.stream.read(self.chunk_size, exception_on_overflow=False)
            except Exception as e:
                print(f"Error reading audio: {e}")
                return None
        return None

    def cleanup(self):
        """Clean up resources"""
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()
        print("Audio capture cleaned up")

# Usage example
def main():
    capture = AudioCapture()
    try:
        capture.initialize()
        capture.start_recording()

        # Connect to WebSocket
        ws = websocket.WebSocket()
        ws.connect("ws://localhost:8000/ws/transcribe")

        # Stream audio data
        while capture.is_recording:
            audio_data = capture.read_audio()
            if audio_data:
                ws.send(audio_data, websocket.ABNF.OPCODE_BINARY)
    except KeyboardInterrupt:
        print("Stopping...")
    finally:
        capture.stop_recording()
        capture.cleanup()
        if 'ws' in locals():
            ws.close()

if __name__ == "__main__":
    main()
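The example above only sends audio. To see transcriptions, and the server warnings referenced in the troubleshooting section below, a receive loop can run on a background thread. A sketch that assumes the server sends JSON text messages; the exact message schema is not covered here:
import json
import threading

def start_receiver(ws):
    """Print incoming server messages on a background thread."""
    def receive_loop():
        try:
            while True:
                message = ws.recv()
                if isinstance(message, str):
                    print("Server message:", json.loads(message))
        except Exception as e:
            print(f"Receiver stopped: {e}")

    thread = threading.Thread(target=receive_loop, daemon=True)
    thread.start()
    return thread

# Usage: call start_receiver(ws) right after ws.connect(...) in main()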
Audio File Processing
// Process audio file
function processAudioFile(file) {
  const reader = new FileReader();

  reader.onload = function (e) {
    const arrayBuffer = e.target.result;

    // Create audio context
    const audioContext = new AudioContext();
    audioContext.decodeAudioData(arrayBuffer)
      .then(audioBuffer => {
        // Resample to 16kHz if needed
        const resampledBuffer = resampleAudio(audioBuffer, 16000);

        // Convert to chunks and send
        const chunkSize = 16000; // 1 second at 16kHz
        for (let i = 0; i < resampledBuffer.length; i += chunkSize) {
          const chunk = resampledBuffer.slice(i, i + chunkSize);
          const pcmData = new Int16Array(chunk.length);
          for (let j = 0; j < chunk.length; j++) {
            pcmData[j] = Math.max(-1, Math.min(1, chunk[j])) * 0x7FFF;
          }

          // Send chunk with delay to simulate real-time pacing
          setTimeout(() => {
            ws.send(pcmData.buffer);
          }, (i / chunkSize) * 1000); // 1 second between chunks
        }
      })
      .catch(error => {
        console.error('Error processing audio file:', error);
      });
  };

  reader.readAsArrayBuffer(file);
}
function resampleAudio(audioBuffer, targetSampleRate) {
  const originalSampleRate = audioBuffer.sampleRate;
  const originalData = audioBuffer.getChannelData(0);

  if (originalSampleRate === targetSampleRate) {
    return originalData;
  }

  // Linear-interpolation resampling
  const ratio = originalSampleRate / targetSampleRate;
  const newLength = Math.round(originalData.length / ratio);
  const result = new Float32Array(newLength);

  for (let i = 0; i < newLength; i++) {
    const index = i * ratio;
    const indexInt = Math.floor(index);
    const indexFrac = index - indexInt;

    if (indexInt + 1 < originalData.length) {
      // Interpolate between neighboring samples
      result[i] = originalData[indexInt] * (1 - indexFrac) +
                  originalData[indexInt + 1] * indexFrac;
    } else {
      result[i] = originalData[indexInt];
    }
  }

  return result;
}
Common Audio Issues
Troubleshooting
No Transcription Output
Problem: Sending audio but no transcription messages received
Solutions:
Check audio contains speech (not silence)
Verify audio format is correct (WAV, 16kHz, 16-bit, mono)
Ensure audio chunks are at least 0.5 seconds long
Monitor for warning messages from the server (a quick pre-send chunk check is sketched below)
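A quick pre-send check can catch the two most common causes above: too-short chunks and pure silence. A stdlib-only sketch for the primary format; the limits come from the streaming specs at the top of this page:
# Pre-send checks for a 16 kHz, 16-bit, mono PCM chunk
BYTES_PER_SECOND = 32000  # 16,000 samples/s * 2 bytes

def validate_chunk(pcm_bytes):
    """Return a list of problems found in the chunk (empty if none)."""
    problems = []
    duration = len(pcm_bytes) / BYTES_PER_SECOND
    if duration < 0.5:
        problems.append(f"chunk too short ({duration:.2f}s < 0.5s minimum)")
    if duration > 30:
        problems.append(f"chunk too long ({duration:.2f}s > 30s maximum)")
    if pcm_bytes and max(pcm_bytes) == 0:
        problems.append("chunk is digital silence")
    return problems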
Poor Transcription Quality
Problem: Transcriptions are inaccurate or incomplete
Solutions:
Improve audio quality (reduce background noise)
Ensure clear speech and proper microphone distance
Check for audio clipping or distortion
Use recommended audio format (16kHz WAV)
High Latency
Problem: Long delays between audio and transcription
Solutions:
Reduce audio chunk size (1-2 seconds)
Check for “processing backlog” warnings
Ensure stable network connection
Verify the server is not overloaded (a rough latency measurement is sketched below)
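Timestamping sends and responses helps locate where delay accumulates. A rough sketch, assuming one transcription message arrives per chunk sent; real correlation depends on the server's message schema:
import time
import websocket

send_times = []

def timed_send(ws, audio_bytes):
    """Send a chunk and record the send time."""
    send_times.append(time.monotonic())
    ws.send(audio_bytes, websocket.ABNF.OPCODE_BINARY)

def on_transcription_received():
    """Call when a transcription message arrives; prints a rough latency."""
    if send_times:
        latency = time.monotonic() - send_times.pop(0)
        print(f"Approximate latency: {latency:.2f}s")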
Connection Issues
Problem: Audio not being sent or connection dropping
Solutions:
Check WebSocket connection is open
Verify binary data format is correct
Implement connection recovery logic (a reconnect-with-backoff sketch follows this list)
Monitor network stability
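Connection recovery can be as simple as reconnecting with exponential backoff. A minimal sketch using the same websocket-client library as the Python example above; the retry limits are illustrative:
import time
import websocket

def connect_with_retry(url="ws://localhost:8000/ws/transcribe", max_attempts=5):
    """Connect with exponential backoff; returns an open WebSocket."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            ws = websocket.WebSocket()
            ws.connect(url)
            return ws
        except Exception as e:
            print(f"Connect attempt {attempt} failed: {e}; retrying in {delay:.0f}s")
            time.sleep(delay)
            delay = min(delay * 2, 30)  # cap the backoff interval
    raise ConnectionError("Could not reconnect to the transcription server")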
Audio Quality Tips
Use Good Hardware
Use a quality microphone and ensure proper positioning
Optimize Environment
Record in a quiet environment with minimal background noise
Check Audio Levels
Ensure audio is not too quiet or too loud (avoid clipping)
Test Audio Format
Use the recommended 16kHz, 16-bit, mono WAV format
Manage Chunk Size
Send audio in 1-5 second chunks for optimal processing
Optimal Settings
Audio Processing
Sample Rate: 16kHz for best performance
Chunk Size: 1-5 seconds
Format: WAV binary data
Channels: Mono (1 channel)
Network Usage
Bandwidth: ~32 KB/s (256 kbps) for 16kHz, 16-bit mono PCM
Latency: 100-300 ms typical
Reliability: Stable WebSocket connection required
Buffering: Minimal client-side buffering
Resource Usage
CPU: Low (mainly audio capture and format conversion)
Memory: Minimal (only the current audio buffer)
Network: Depends on audio quality and chunk size
Storage: Audio files are saved server-side for transcription
Best Practices
Audio Format
Use the recommended 16kHz, 16-bit, mono WAV format
Chunk Management
Send audio in 1-5 second chunks for optimal processing
Quality Control
Monitor audio quality and adjust if necessary
Error Handling
Listen for warning messages and handle audio issues
Resource Cleanup
Properly clean up audio resources when done
Coming Soon
Future Audio Features: Advanced audio processing, noise reduction, automatic gain control, and audio quality monitoring will be added in future versions.