Overview

MediSync’s WebSocket API uses a simple message protocol for real-time audio transcription with speaker diarization. This page details all message types and data structures used for communication between client and server.
Simple Protocol: The WebSocket API uses only 3 message types for efficient real-time processing: session initialization, transcription results, and warnings.

Message Types

Server to Client Messages

The server sends JSON messages to the client for session management and transcription results.

1. Session Initialization

Sent immediately after a successful WebSocket connection to provide the unique session identifier.
{
  "type": "session_init",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
Fields:
  • type: Always "session_init"
  • session_id: Unique UUID for this WebSocket session
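For example, a client can capture the session ID from the first message the server sends after the connection opens:
// Capture the session ID from the initial server message
const ws = new WebSocket('ws://localhost:8000/ws/transcribe');
let sessionId = null;

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'session_init') {
    sessionId = msg.session_id;
    console.log('Session started:', sessionId);
  }
};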

2. Transcription Result

Sent after processing each audio chunk that contains speech. Includes speaker identification and timing information.
{
  "type": "transcription",
  "text": "[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?",
  "timestamp": 1703123456.789,
  "audio_file": "audio_chunks/550e8400-e29b-41d4-a716-446655440000_20231221_143056_123456.wav"
}
Fields:
  • type: Always "transcription"
  • text: Speaker-attributed transcription with timing information, in the format [SPEAKER_ID | start_time --> end_time]: transcribed_text
  • timestamp: Unix timestamp (seconds) when the audio chunk was received
  • audio_file: Path to the saved audio file for this transcription
Text Format: The transcription text follows this structured format:
[SPEAKER_ID | start_time --> end_time]: transcribed_text
  • SPEAKER_ID: Automatically assigned speaker identifier (SPEAKER_0, SPEAKER_1, etc.)
  • start_time: Start time in seconds with millisecond precision
  • end_time: End time in seconds with millisecond precision
  • transcribed_text: The actual transcribed speech
Example Multi-Speaker Conversation:
[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?
[SPEAKER_1 | 3.500 --> 6.120]: I'm doing great, thanks for asking!
[SPEAKER_0 | 6.200 --> 9.800]: That's wonderful to hear. What are your plans for the weekend?
[SPEAKER_1 | 10.100 --> 14.250]: I'm thinking about going hiking if the weather is nice.
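Since the text field packs speaker, timing, and content into a single string, clients typically parse it back into structured fields. A minimal sketch (the Error Handling section below uses the same regex):
// Parse "[SPEAKER_0 | 0.000 --> 3.450]: Hello, ..." into its parts
function parseTranscription(text) {
  const match = text.match(/^\[(\w+) \| ([\d.]+) --> ([\d.]+)\]: (.+)$/);
  if (!match) return null; // unexpected format
  const [, speaker, start, end, content] = match;
  return { speaker, start: parseFloat(start), end: parseFloat(end), content };
}

parseTranscription('[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?');
// → { speaker: 'SPEAKER_0', start: 0, end: 3.45, content: 'Hello, how are you doing today?' }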

3. Warning Messages

Sent when non-fatal issues occur during processing, such as audio processing delays or quality issues.
{
  "type": "warning",
  "message": "Audio chunk dropped - processing backlog."
}
Fields:
  • type: Always "warning"
  • message: Human-readable warning description
Common Warning Messages:
  • "Audio chunk dropped - processing backlog." - Server is behind on processing
  • "Audio quality low - consider improving recording conditions." - Audio quality issues detected
  • "No speech detected in audio chunk." - Silent audio received

Client to Server Messages

The client sends binary audio data to the server for transcription processing.

Audio Data

// Capture microphone audio and stream it as 16-bit PCM binary data.
// Note: ScriptProcessorNode is deprecated in favor of AudioWorklet,
// but it remains widely supported and keeps this example short.
const ws = new WebSocket('ws://localhost:8000/ws/transcribe');

navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    const audioContext = new AudioContext();
    const source = audioContext.createMediaStreamSource(stream);
    const processor = audioContext.createScriptProcessor(1024, 1, 1); // 1024-sample buffers, mono in/out
    
    processor.onaudioprocess = function(e) {
      const inputData = e.inputBuffer.getChannelData(0);
      
      // Convert float32 samples in [-1, 1] to int16
      const int16Buffer = new Int16Array(inputData.length);
      for (let i = 0; i < inputData.length; i++) {
        int16Buffer[i] = Math.max(-32768, Math.min(32767, Math.round(inputData[i] * 32767)));
      }
      
      // Send as binary data
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(int16Buffer.buffer);
      }
    };
    
    source.connect(processor);
    processor.connect(audioContext.destination);
  });
Audio Requirements:
  • Format: WAV (recommended) or raw PCM
  • Sample Rate: 16kHz (recommended); 44.1kHz and 48kHz are also supported
  • Bit Depth: 16-bit
  • Channels: Mono (1 channel)
  • Chunk Size: 1024-4096 samples recommended
Supported Audio Formats:
  • Primary: WAV (16kHz, 16-bit, mono)
  • Fallback: Most common audio formats (automatically resampled)
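Best Practices below recommend sending audio in 1-5 second chunks, while individual capture buffers are only 1024-4096 samples; a client can accumulate capture buffers into larger send units. A minimal sketch, assuming 16kHz mono Int16 audio and an open WebSocket ws:
// Accumulate roughly one second of 16kHz mono samples, then send one chunk
const SAMPLE_RATE = 16000;
const TARGET_SAMPLES = SAMPLE_RATE * 1; // ~1 second per send

let pending = [];
let pendingSamples = 0;

function queueAudio(int16Chunk) {
  pending.push(int16Chunk);
  pendingSamples += int16Chunk.length;
  if (pendingSamples < TARGET_SAMPLES) return;

  // Merge the queued pieces into one contiguous buffer
  const merged = new Int16Array(pendingSamples);
  let offset = 0;
  for (const chunk of pending) {
    merged.set(chunk, offset);
    offset += chunk.length;
  }
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(merged.buffer);
  }
  pending = [];
  pendingSamples = 0;
}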

Complete Message Flow

Basic Session Flow

1. Connect

const ws = new WebSocket('ws://localhost:8000/ws/transcribe');
2. Receive Session Init

{
  "type": "session_init",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
3. Send Audio Data

// Send binary audio chunks
ws.send(audioBuffer);
4. Receive Transcriptions

{
  "type": "transcription",
  "text": "[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?",
  "timestamp": 1703123456.789,
  "audio_file": "audio_chunks/550e8400-e29b-41d4-a716-446655440000_20231221_143056_123456.wav"
}

Error Handling

class TranscriptionClient {
  constructor(url) {
    this.url = url;
    this.ws = null;
    this.sessionId = null;
    this.reconnectInterval = 5000;
  }
  
  connect() {
    this.ws = new WebSocket(this.url);
    
    this.ws.onopen = () => {
      console.log('Connected to transcription service');
    };
    
    this.ws.onmessage = (event) => {
      let data;
      try {
        data = JSON.parse(event.data);
      } catch (err) {
        console.error('Non-JSON message received:', err);
        return;
      }
      this.handleMessage(data);
    };
    
    this.ws.onclose = () => {
      console.log('Connection closed, attempting reconnect...');
      setTimeout(() => this.connect(), this.reconnectInterval);
    };
    
    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
    };
  }
  
  handleMessage(data) {
    switch(data.type) {
      case 'session_init':
        this.sessionId = data.session_id;
        console.log('Session initialized:', this.sessionId);
        this.onSessionInit(data);
        break;
        
      case 'transcription':
        console.log('Transcription:', data.text);
        this.onTranscription(data);
        break;
        
      case 'warning':
        console.warn('Warning:', data.message);
        this.onWarning(data);
        break;
        
      default:
        console.log('Unknown message type:', data.type);
    }
  }
  
  sendAudio(audioBuffer) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(audioBuffer);
    }
  }
  
  // Override these methods in your implementation
  onSessionInit(data) {}
  onTranscription(data) {}
  onWarning(data) {}
}

// Usage
const client = new TranscriptionClient('ws://localhost:8000/ws/transcribe');
client.connect();

// Override handlers
client.onTranscription = (data) => {
  // Parse speaker information
  const match = data.text.match(/\[(\w+) \| ([\d.]+) --> ([\d.]+)\]: (.+)/);
  if (match) {
    const [, speaker, startTime, endTime, text] = match;
    console.log(`${speaker} (${startTime}-${endTime}s): ${text}`);
  }
};
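The client above reconnects at a fixed 5-second interval; a common refinement is exponential backoff. A sketch built on the same class (not part of the API):
// Hypothetical subclass adding exponential backoff, capped at 60 s
class BackoffTranscriptionClient extends TranscriptionClient {
  connect() {
    super.connect();
    this.ws.onopen = () => {
      console.log('Connected to transcription service');
      this.reconnectInterval = 5000; // reset after a successful connection
    };
    this.ws.onclose = () => {
      console.log(`Connection closed, reconnecting in ${this.reconnectInterval} ms...`);
      setTimeout(() => this.connect(), this.reconnectInterval);
      this.reconnectInterval = Math.min(this.reconnectInterval * 2, 60000);
    };
  }
}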

Best Practices

  • Use a 16kHz sample rate for optimal processing
  • Ensure clear audio with minimal background noise
  • Buffer audio appropriately and send it in 1-5 second chunks; avoid overwhelming the server with overly frequent sends
  • Always send audio as binary data
  • Handle all three message types, and validate the type field before processing
  • Monitor warning messages for audio quality issues and processing backlog, and log them for debugging
  • Implement reconnection logic for network issues and handle connection drops gracefully
  • Implement proper error recovery
  • Store the session_id for reference
  • Close connections gracefully when done (see the snippet after this list)
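Graceful shutdown is a one-liner: close with the standard normal-closure code so the server can clean up the session.
// Close with code 1000 (normal closure) when the session is finished
client.ws.close(1000, 'Client finished');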

Language Support

The transcription service supports 100+ languages through Whisper AI. The current configuration defaults to German (de), but supports:
  • English (en)
  • German (de) - Default
  • Spanish (es)
  • French (fr)
  • Italian (it)
  • Portuguese (pt)
  • Dutch (nl)
  • And many more…
Language detection is automatic. The service will process audio in the detected language and return transcriptions accordingly.