Overview

MediSync’s WebSocket API uses a simple message protocol for real-time audio transcription with speaker diarization. This page details all message types and data structures used for communication between client and server.
Simple Protocol: The WebSocket API uses only 3 message types for efficient real-time processing: session initialization, transcription results, and warnings.

Message Types

Server to Client Messages

The server sends JSON messages to the client for session management and transcription results.

1. Session Initialization

Sent immediately after a successful WebSocket connection to provide the unique session identifier.
{
  "type": "session_init",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
Fields:
  • type: Always "session_init"
  • session_id: Unique UUID for this WebSocket session
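For example, a client can capture the session ID from the first message the server sends after the connection opens:
// Capture the session ID from the initial server message
const ws = new WebSocket('ws://localhost:8000/ws/transcribe');
let sessionId = null;

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'session_init') {
    sessionId = msg.session_id;
    console.log('Session started:', sessionId);
  }
};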

2. Transcription Result

Sent after processing each audio chunk that contains speech. Includes speaker identification and timing information.
{
  "type": "transcription",
  "text": "[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?",
  "timestamp": 1703123456.789,
  "audio_file": "audio_chunks/550e8400-e29b-41d4-a716-446655440000_20231221_143056_123456.wav"
}
Fields:
  • type: Always "transcription"
  • text: Speaker-attributed transcription with timing information, in the format [SPEAKER_ID | start_time --> end_time]: transcribed_text
  • timestamp: Unix timestamp (seconds) when the audio chunk was received
  • audio_file: Path to the saved audio file for this transcription
Text Format: The transcription text follows this structured format:
[SPEAKER_ID | start_time --> end_time]: transcribed_text
  • SPEAKER_ID: Automatically assigned speaker identifier (SPEAKER_0, SPEAKER_1, etc.)
  • start_time: Start time in seconds with millisecond precision
  • end_time: End time in seconds with millisecond precision
  • transcribed_text: The actual transcribed speech
Example Multi-Speaker Conversation:
[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?
[SPEAKER_1 | 3.500 --> 6.120]: I'm doing great, thanks for asking!
[SPEAKER_0 | 6.200 --> 9.800]: That's wonderful to hear. What are your plans for the weekend?
[SPEAKER_1 | 10.100 --> 14.250]: I'm thinking about going hiking if the weather is nice.
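Since the text field packs speaker, timing, and content into a single string, clients typically parse it back into structured fields. A minimal sketch (the Error Handling section below uses the same regex):
// Parse "[SPEAKER_0 | 0.000 --> 3.450]: Hello, ..." into its parts
function parseTranscription(text) {
  const match = text.match(/^\[(\w+) \| ([\d.]+) --> ([\d.]+)\]: (.+)$/);
  if (!match) return null; // unexpected format
  const [, speaker, start, end, content] = match;
  return { speaker, start: parseFloat(start), end: parseFloat(end), content };
}

parseTranscription('[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?');
// → { speaker: 'SPEAKER_0', start: 0, end: 3.45, content: 'Hello, how are you doing today?' }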

3. Warning Messages

Sent when non-fatal issues occur during processing, such as audio processing delays or quality issues.
{
  "type": "warning",
  "message": "Audio chunk dropped - processing backlog."
}
Fields:
  • type: Always "warning"
  • message: Human-readable warning description
Common Warning Messages:
  • "Audio chunk dropped - processing backlog." - Server is behind on processing
  • "Audio quality low - consider improving recording conditions." - Audio quality issues detected
  • "No speech detected in audio chunk." - Silent audio received

Client to Server Messages

The client sends binary audio data to the server for transcription processing.

Audio Data

// Capture microphone audio and stream it as 16-bit PCM binary data.
// Note: ScriptProcessorNode is deprecated in favor of AudioWorklet,
// but it remains widely supported and keeps this example short.
const ws = new WebSocket('ws://localhost:8000/ws/transcribe');

navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    const audioContext = new AudioContext();
    const source = audioContext.createMediaStreamSource(stream);
    const processor = audioContext.createScriptProcessor(1024, 1, 1); // 1024-sample buffers, mono in/out
    
    processor.onaudioprocess = function(e) {
      const inputData = e.inputBuffer.getChannelData(0);
      
      // Convert float32 samples in [-1, 1] to int16
      const int16Buffer = new Int16Array(inputData.length);
      for (let i = 0; i < inputData.length; i++) {
        int16Buffer[i] = Math.max(-32768, Math.min(32767, Math.round(inputData[i] * 32767)));
      }
      
      // Send as binary data
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(int16Buffer.buffer);
      }
    };
    
    source.connect(processor);
    processor.connect(audioContext.destination);
  });
Audio Requirements:
  • Format: WAV (recommended) or raw PCM
  • Sample Rate: 16kHz (recommended); 44.1kHz and 48kHz are also supported
  • Bit Depth: 16-bit
  • Channels: Mono (1 channel)
  • Chunk Size: 1024-4096 samples recommended
Supported Audio Formats:
  • Primary: WAV (16kHz, 16-bit, mono)
  • Fallback: Most common audio formats (automatically resampled)
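Best Practices below recommend sending audio in 1-5 second chunks, while individual capture buffers are only 1024-4096 samples; a client can accumulate capture buffers into larger send units. A minimal sketch, assuming 16kHz mono Int16 audio and an open WebSocket ws:
// Accumulate roughly one second of 16kHz mono samples, then send one chunk
const SAMPLE_RATE = 16000;
const TARGET_SAMPLES = SAMPLE_RATE * 1; // ~1 second per send

let pending = [];
let pendingSamples = 0;

function queueAudio(int16Chunk) {
  pending.push(int16Chunk);
  pendingSamples += int16Chunk.length;
  if (pendingSamples < TARGET_SAMPLES) return;

  // Merge the queued pieces into one contiguous buffer
  const merged = new Int16Array(pendingSamples);
  let offset = 0;
  for (const chunk of pending) {
    merged.set(chunk, offset);
    offset += chunk.length;
  }
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(merged.buffer);
  }
  pending = [];
  pendingSamples = 0;
}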

Complete Message Flow

Basic Session Flow

1. Connect

const ws = new WebSocket('ws://localhost:8000/ws/transcribe');
2. Receive Session Init

{
  "type": "session_init",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
3. Send Audio Data

// Send binary audio chunks
ws.send(audioBuffer);
4. Receive Transcriptions

{
  "type": "transcription",
  "text": "[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?",
  "timestamp": 1703123456.789,
  "audio_file": "audio_chunks/550e8400-e29b-41d4-a716-446655440000_20231221_143056_123456.wav"
}

Error Handling

class TranscriptionClient {
  constructor(url) {
    this.url = url;
    this.ws = null;
    this.sessionId = null;
    this.reconnectInterval = 5000;
  }
  
  connect() {
    this.ws = new WebSocket(this.url);
    
    this.ws.onopen = () => {
      console.log('Connected to transcription service');
    };
    
    this.ws.onmessage = (event) => {
      let data;
      try {
        data = JSON.parse(event.data);
      } catch (err) {
        console.error('Non-JSON message received:', err);
        return;
      }
      this.handleMessage(data);
    };
    
    this.ws.onclose = () => {
      console.log('Connection closed, attempting reconnect...');
      setTimeout(() => this.connect(), this.reconnectInterval);
    };
    
    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
    };
  }
  
  handleMessage(data) {
    switch(data.type) {
      case 'session_init':
        this.sessionId = data.session_id;
        console.log('Session initialized:', this.sessionId);
        this.onSessionInit(data);
        break;
        
      case 'transcription':
        console.log('Transcription:', data.text);
        this.onTranscription(data);
        break;
        
      case 'warning':
        console.warn('Warning:', data.message);
        this.onWarning(data);
        break;
        
      default:
        console.log('Unknown message type:', data.type);
    }
  }
  
  sendAudio(audioBuffer) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(audioBuffer);
    }
  }
  
  // Override these methods in your implementation
  onSessionInit(data) {}
  onTranscription(data) {}
  onWarning(data) {}
}

// Usage
const client = new TranscriptionClient('ws://localhost:8000/ws/transcribe');
client.connect();

// Override handlers
client.onTranscription = (data) => {
  // Parse speaker information
  const match = data.text.match(/\[(\w+) \| ([\d.]+) --> ([\d.]+)\]: (.+)/);
  if (match) {
    const [, speaker, startTime, endTime, text] = match;
    console.log(`${speaker} (${startTime}-${endTime}s): ${text}`);
  }
};
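The client above reconnects at a fixed 5-second interval; a common refinement is exponential backoff. A sketch built on the same class (not part of the API):
// Hypothetical subclass adding exponential backoff, capped at 60 s
class BackoffTranscriptionClient extends TranscriptionClient {
  connect() {
    super.connect();
    this.ws.onopen = () => {
      console.log('Connected to transcription service');
      this.reconnectInterval = 5000; // reset after a successful connection
    };
    this.ws.onclose = () => {
      console.log(`Connection closed, reconnecting in ${this.reconnectInterval} ms...`);
      setTimeout(() => this.connect(), this.reconnectInterval);
      this.reconnectInterval = Math.min(this.reconnectInterval * 2, 60000);
    };
  }
}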

Best Practices

  • Use a 16kHz sample rate for optimal processing
  • Ensure clear audio with minimal background noise
  • Buffer audio appropriately and send it in 1-5 second chunks; avoid overwhelming the server with overly frequent sends
  • Always send audio as binary data
  • Handle all three message types, and validate the type field before processing
  • Monitor warning messages for audio quality issues and processing backlog, and log them for debugging
  • Implement reconnection logic for network issues and handle connection drops gracefully
  • Implement proper error recovery
  • Store the session_id for reference
  • Close connections gracefully when done (see the snippet after this list)
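Graceful shutdown is a one-liner: close with the standard normal-closure code so the server can clean up the session.
// Close with code 1000 (normal closure) when the session is finished
client.ws.close(1000, 'Client finished');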

Language Support

The transcription service supports 100+ languages through Whisper AI. The current configuration defaults to German (de), but supports:
  • English (en)
  • German (de) - Default
  • Spanish (es)
  • French (fr)
  • Italian (it)
  • Portuguese (pt)
  • Dutch (nl)
  • And many more…
Language detection is automatic. The service will process audio in the detected language and return transcriptions accordingly.