
Overview

Proper audio configuration is essential for accurate transcription. This guide provides technical specifications and quality requirements for audio streaming with the WebSocket API.
Simple Audio Streaming: The API accepts WAV format audio with automatic resampling. Just send clear audio and get accurate transcriptions.

Audio Format Specifications

Primary Format

  • Format: WAV
  • Sample Rate: 16000 Hz
  • Bit Depth: 16-bit
  • Channels: 1 (Mono)
  • Encoding: PCM
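
For reference, the snippet below writes raw PCM samples into a WAV container that matches this specification, using only Python's standard-library wave module. The generated tone and output filename are illustrative.

import math
import struct
import wave

SAMPLE_RATE = 16000   # 16 kHz
SAMPLE_WIDTH = 2      # 16-bit samples = 2 bytes each
CHANNELS = 1          # mono

# Illustrative content: one second of a 440 Hz tone as 16-bit PCM samples
samples = [
    int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)
]

with wave.open("example_16k_mono.wav", "wb") as wav_file:
    wav_file.setnchannels(CHANNELS)
    wav_file.setsampwidth(SAMPLE_WIDTH)
    wav_file.setframerate(SAMPLE_RATE)
    wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))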

Streaming

  • Chunk Size: 1-5 seconds of audio
  • Minimum: 0.5 seconds
  • Maximum: 30 seconds
  • Format: Binary data
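
At 16 kHz, 16-bit, mono, one second of PCM audio is 32,000 bytes (about 256 kbps). The quick calculation below shows the byte size of chunks at several durations within the recommended range; the specific durations are just examples.

SAMPLE_RATE = 16000      # samples per second
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # mono

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS   # 32,000 bytes/s (~256 kbps)

for seconds in (0.5, 1, 2, 5, 30):
    print(f"{seconds:>4}s chunk -> {int(seconds * bytes_per_second):,} bytes")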

Supported Alternative Formats

Audio Quality Guidelines

Basic Quality Requirements

Audio Quality

  • Clear Speech: Avoid background noise
  • Consistent Volume: Stable audio levels
  • No Clipping: Avoid audio distortion (a quick level check follows this list)
  • Single/Multiple Speakers: Both supported
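
A simple way to check the clipping and volume requirements on a captured chunk is to inspect its peak and RMS levels, as in the sketch below; the 0.99 and 0.01 thresholds are rough starting points rather than values defined by the API.

import numpy as np

def check_levels(pcm_bytes: bytes) -> None:
    """Rough level check for a chunk of 16-bit mono PCM audio."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0

    peak = float(np.max(np.abs(samples)))
    rms = float(np.sqrt(np.mean(samples ** 2)))

    if peak >= 0.99:
        print("Warning: audio is clipping - reduce input gain")
    elif rms < 0.01:
        print("Warning: audio is very quiet or silent - check the microphone")
    else:
        print(f"Levels look OK (peak={peak:.2f}, rms={rms:.3f})")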

Environment

  • Quiet Environment: Minimal background noise
  • Good Microphone: Clear audio capture
  • Stable Connection: Consistent streaming
  • Proper Distance: Speaker close to microphone

Audio Capture Implementation

Browser Audio Capture

class AudioCapture {
  constructor() {
    this.audioContext = null;
    this.mediaStream = null;
    this.processor = null;
    this.ws = null;  // WebSocket instance, assigned by the caller before streaming
    this.isRecording = false;
  }
  
  async initialize() {
    try {
      // Request microphone access
      this.mediaStream = await navigator.mediaDevices.getUserMedia({
        audio: {
          sampleRate: 16000,
          channelCount: 1,
          sampleSize: 16,
          echoCancellation: true,
          noiseSuppression: true
        }
      });
      
      // Create audio context
      this.audioContext = new AudioContext({
        sampleRate: 16000,
        latencyHint: 'interactive'
      });
      
      // Create media stream source
      const source = this.audioContext.createMediaStreamSource(this.mediaStream);
      
      // Create script processor node (ScriptProcessorNode is deprecated in
      // favor of AudioWorklet, but is kept here for a compact example)
      this.processor = this.audioContext.createScriptProcessor(1024, 1, 1);
      
      this.processor.onaudioprocess = (event) => {
        if (this.isRecording) {
          this.processAudioData(event.inputBuffer);
        }
      };
      
      // Connect audio pipeline
      source.connect(this.processor);
      this.processor.connect(this.audioContext.destination);
      
      console.log('Audio capture initialized');
      
    } catch (error) {
      console.error('Failed to initialize audio capture:', error);
      throw error;
    }
  }
  
  processAudioData(inputBuffer) {
    const inputData = inputBuffer.getChannelData(0);
    
    // Convert to PCM 16-bit
    const pcmData = new Int16Array(inputData.length);
    for (let i = 0; i < inputData.length; i++) {
      // Clamp to [-1, 1] and convert to 16-bit
      const clamped = Math.max(-1, Math.min(1, inputData[i]));
      pcmData[i] = clamped * 0x7FFF;
    }
    
    // Send to WebSocket
    this.sendToWebSocket(pcmData.buffer);
  }
  
  sendToWebSocket(audioBuffer) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(audioBuffer);
    }
  }
  
  startRecording() {
    this.isRecording = true;
    console.log('Audio recording started');
  }
  
  stopRecording() {
    this.isRecording = false;
    console.log('Audio recording stopped');
  }
  
  destroy() {
    if (this.processor) {
      this.processor.disconnect();
    }
    
    if (this.audioContext) {
      this.audioContext.close();
    }
    
    if (this.mediaStream) {
      this.mediaStream.getTracks().forEach(track => track.stop());
    }
  }
}

// Usage
const audioCapture = new AudioCapture();
await audioCapture.initialize();
audioCapture.ws = new WebSocket('ws://localhost:8000/ws/transcribe');  // same endpoint as the Python example below
audioCapture.startRecording();

Python Audio Capture

import pyaudio
import numpy as np
import websocket

class AudioCapture:
    def __init__(self, sample_rate=16000, chunk_size=1024):
        self.sample_rate = sample_rate
        self.chunk_size = chunk_size
        self.format = pyaudio.paInt16
        self.channels = 1
        
        self.audio = pyaudio.PyAudio()
        self.stream = None
        self.is_recording = False
        
    def initialize(self):
        """Initialize audio capture"""
        try:
            self.stream = self.audio.open(
                format=self.format,
                channels=self.channels,
                rate=self.sample_rate,
                input=True,
                frames_per_buffer=self.chunk_size
            )
            
            print(f"Audio capture initialized:")
            print(f"  Sample Rate: {self.sample_rate} Hz")
            print(f"  Channels: {self.channels}")
            print(f"  Format: 16-bit PCM")
            
        except Exception as e:
            print(f"Failed to initialize audio: {e}")
            raise
    
    def start_recording(self):
        """Start audio recording"""
        self.is_recording = True
        print("Audio recording started")
    
    def stop_recording(self):
        """Stop audio recording"""
        self.is_recording = False
        print("Audio recording stopped")
    
    def read_audio(self):
        """Read audio data from stream"""
        if self.stream and self.is_recording:
            try:
                data = self.stream.read(self.chunk_size, exception_on_overflow=False)
                return data
            except Exception as e:
                print(f"Error reading audio: {e}")
                return None
        return None
    
    def cleanup(self):
        """Clean up resources"""
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        
        self.audio.terminate()
        print("Audio capture cleaned up")

# Usage example
def main():
    capture = AudioCapture()
    
    try:
        capture.initialize()
        capture.start_recording()
        
        # Connect to WebSocket
        ws = websocket.WebSocket()
        ws.connect("ws://localhost:8000/ws/transcribe")
        
        # Stream audio data
        while capture.is_recording:
            audio_data = capture.read_audio()
            if audio_data:
                ws.send(audio_data, websocket.ABNF.OPCODE_BINARY)
                
    except KeyboardInterrupt:
        print("Stopping...")
    finally:
        capture.stop_recording()
        capture.cleanup()
        if 'ws' in locals():
            ws.close()

if __name__ == "__main__":
    main()

Audio File Processing

File Input

// Process an audio file and stream it over an already-open WebSocket
function processAudioFile(file, ws) {
  const reader = new FileReader();
  
  reader.onload = function(e) {
    const arrayBuffer = e.target.result;
    
    // Create audio context
    const audioContext = new AudioContext();
    
    audioContext.decodeAudioData(arrayBuffer)
      .then(audioBuffer => {
        // Resample to 16kHz if needed
        const resampledBuffer = resampleAudio(audioBuffer, 16000);
        
        // Convert to chunks and send
        const chunkSize = 16000; // 1 second at 16kHz
        for (let i = 0; i < resampledBuffer.length; i += chunkSize) {
          const chunk = resampledBuffer.slice(i, i + chunkSize);
          const pcmData = new Int16Array(chunk.length);
          
          for (let j = 0; j < chunk.length; j++) {
            pcmData[j] = Math.max(-1, Math.min(1, chunk[j])) * 0x7FFF;
          }
          
          // Send chunk with delay
          setTimeout(() => {
            ws.send(pcmData.buffer);
          }, (i / chunkSize) * 1000); // 1 second delay between chunks
        }
      })
      .catch(error => {
        console.error('Error processing audio file:', error);
      });
  };
  
  reader.readAsArrayBuffer(file);
}

function resampleAudio(audioBuffer, targetSampleRate) {
  const originalSampleRate = audioBuffer.sampleRate;
  const originalData = audioBuffer.getChannelData(0);
  
  if (originalSampleRate === targetSampleRate) {
    return originalData;
  }
  
  const ratio = originalSampleRate / targetSampleRate;
  const newLength = Math.round(originalData.length / ratio);
  const result = new Float32Array(newLength);
  
  for (let i = 0; i < newLength; i++) {
    const index = i * ratio;
    const indexInt = Math.floor(index);
    const indexFrac = index - indexInt;
    
    if (indexInt + 1 < originalData.length) {
      result[i] = originalData[indexInt] * (1 - indexFrac) + 
                 originalData[indexInt + 1] * indexFrac;
    } else {
      result[i] = originalData[indexInt];
    }
  }
  
  return result;
}
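
The same file-streaming approach can be done in Python with the standard-library wave module and the websocket-client package. This is a minimal sketch that reuses the ws://localhost:8000/ws/transcribe endpoint from the capture example above and paces chunks at roughly real time; adjust the URL and pacing to your setup.

import time
import wave
import websocket

def stream_wav_file(path, url="ws://localhost:8000/ws/transcribe"):
    """Stream a 16 kHz, 16-bit mono WAV file in one-second binary chunks."""
    ws = websocket.WebSocket()
    ws.connect(url)

    try:
        with wave.open(path, "rb") as wav_file:
            assert wav_file.getframerate() == 16000, "expected 16 kHz audio"
            assert wav_file.getnchannels() == 1, "expected mono audio"

            frames_per_chunk = wav_file.getframerate()  # one second of audio
            while True:
                chunk = wav_file.readframes(frames_per_chunk)
                if not chunk:
                    break
                ws.send(chunk, websocket.ABNF.OPCODE_BINARY)
                time.sleep(1.0)  # pace chunks roughly in real time
    finally:
        ws.close()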

Common Audio Issues

Troubleshooting

Problem: Sending audio but no transcription messages received
Solutions:
  • Check that the audio contains speech (not silence)
  • Verify the audio format is correct (WAV, 16kHz, 16-bit, mono)
  • Ensure audio chunks are not too short (at least 0.5 seconds)
  • Monitor for warning messages from the server (a handler sketch follows this section)

Problem: Transcriptions are inaccurate or incomplete
Solutions:
  • Improve audio quality (reduce background noise)
  • Ensure clear speech and proper microphone distance
  • Check for audio clipping or distortion
  • Use the recommended audio format (16kHz WAV)

Problem: Long delays between audio and transcription
Solutions:
  • Reduce audio chunk size (1-2 seconds)
  • Check for "processing backlog" warnings
  • Ensure a stable network connection
  • Verify the server is not overloaded

Problem: Audio not being sent or connection dropping
Solutions:
  • Check that the WebSocket connection is open
  • Verify the binary data format is correct
  • Implement connection recovery logic
  • Monitor network stability
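
Several of the steps above involve watching for warning messages such as "processing backlog". This page does not define the exact message schema, so the handler below assumes the server sends JSON text messages with a type field; adapt it to the actual message format of your deployment.

import json

def handle_server_message(raw_message: str) -> None:
    """Log server warnings; assumes a JSON payload with a 'type' field."""
    try:
        message = json.loads(raw_message)
    except json.JSONDecodeError:
        print(f"Non-JSON message from server: {raw_message!r}")
        return

    if message.get("type") == "warning":
        # e.g. a "processing backlog" warning - consider sending smaller chunks
        print(f"Server warning: {message.get('message', message)}")
    else:
        print(f"Server message: {message}")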

Audio Quality Tips

1. Use Good Hardware: Use a quality microphone and ensure proper positioning.
2. Optimize Environment: Record in a quiet environment with minimal background noise.
3. Check Audio Levels: Ensure audio is not too quiet or too loud (avoid clipping).
4. Test Audio Format: Use the recommended 16kHz, 16-bit, mono WAV format.
5. Manage Chunk Size: Send audio in 1-5 second chunks for optimal processing.

Performance Considerations

Optimal Settings

Audio Processing

  • Sample Rate: 16kHz for best performance
  • Chunk Size: 1-5 seconds
  • Format: WAV binary data
  • Channels: Mono (1 channel)

Network Usage

  • Bandwidth: ~256 kbps (32 KB/s) for 16kHz, 16-bit mono PCM
  • Latency: 100-300ms typical
  • Reliability: Stable WebSocket connection
  • Buffering: Minimal client-side buffering

Resource Usage

  • CPU: Low - mainly audio capture and format conversion
  • Memory: Minimal - only current audio buffer
  • Network: Dependent on audio quality and chunk size
  • Storage: Audio files saved server-side for transcription

Best Practices

1. Audio Format: Use the recommended 16kHz, 16-bit, mono WAV format.
2. Chunk Management: Send audio in 1-5 second chunks for optimal processing.
3. Quality Control: Monitor audio quality and adjust if necessary.
4. Error Handling: Listen for warning messages and handle audio issues (see the sketch after this list).
5. Resource Cleanup: Properly clean up audio resources when done.
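
As a starting point for items 4 and 5, the sketch below wraps the Python AudioCapture class from earlier in a loop that reconnects after connection errors and always releases the audio device; the retry count, delay, and exception handling are illustrative rather than prescribed by the API.

import time
import websocket

def stream_with_recovery(capture, url="ws://localhost:8000/ws/transcribe",
                         max_retries=5, retry_delay=2.0):
    """Stream audio with simple reconnect logic and guaranteed cleanup."""
    retries = 0
    try:
        capture.start_recording()
        while capture.is_recording and retries <= max_retries:
            ws = websocket.WebSocket()
            try:
                ws.connect(url)
                retries = 0  # reset the counter after a successful connection
                while capture.is_recording:
                    chunk = capture.read_audio()
                    if chunk:
                        ws.send(chunk, websocket.ABNF.OPCODE_BINARY)
            except (websocket.WebSocketException, ConnectionError) as exc:
                retries += 1
                print(f"Connection lost ({exc}); retry {retries}/{max_retries}")
                time.sleep(retry_delay)
            finally:
                ws.close()
    finally:
        capture.stop_recording()
        capture.cleanup()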

Coming Soon

Future Audio Features: Advanced audio processing, noise reduction, automatic gain control, and audio quality monitoring will be added in future versions.

Next Steps