Overview
MediSync’s WebSocket API uses a simple message protocol for real-time audio transcription with speaker diarization. This page details all message types and data structures used for communication between client and server.
Simple Protocol: The WebSocket API uses only three message types for efficient real-time processing: session initialization, transcription results, and warnings.
Message Types
Server to Client Messages
The server sends JSON messages to the client for session management and transcription results.
1. Session Initialization
Sent immediately after successful WebSocket connection to provide the unique session identifier.
Message Format
{
  "type": "session_init",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
Fields:
type: Always "session_init"
session_id: Unique UUID for this WebSocket session
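For example, a client can capture and store the identifier when this message arrives (a minimal sketch; ws is assumed to be an already-open WebSocket to the transcription endpoint):
// Sketch: store the session ID from the first server message.
// Assumes `ws` is an already-open WebSocket to the transcription endpoint.
let sessionId = null;
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'session_init') {
    sessionId = data.session_id;
    console.log('Session started:', sessionId);
  }
};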
2. Transcription Result
Sent after processing each audio chunk that contains speech. Includes speaker identification and timing information.
Message Format
{
  "type": "transcription",
  "text": "[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?",
  "timestamp": 1703123456.789,
  "audio_file": "audio_chunks/550e8400-e29b-41d4-a716-446655440000_20231221_143056_123456.wav"
}
Fields:
type: Always "transcription"
text: Speaker-attributed transcription with timing information in the format [SPEAKER_ID | start_time --> end_time]: transcribed_text
timestamp: Unix timestamp when audio chunk was received
audio_file: Path to saved audio file for this transcription
Text Format:
The transcription text follows this structured format:
[SPEAKER_ID | start_time --> end_time]: transcribed_text
SPEAKER_ID: Automatically assigned speaker identifier (SPEAKER_0, SPEAKER_1, etc.)
start_time: Start time in seconds with millisecond precision
end_time: End time in seconds with millisecond precision
transcribed_text: The actual transcribed speech
Example Multi-Speaker Conversation:
[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?
[SPEAKER_1 | 3.500 --> 6.120]: I'm doing great, thanks for asking!
[SPEAKER_0 | 6.200 --> 9.800]: That's wonderful to hear. What are your plans for the weekend?
[SPEAKER_1 | 10.100 --> 14.250]: I'm thinking about going hiking if the weather is nice.
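Since speaker attribution and timing are embedded in the text field, clients usually parse them out. A minimal sketch (the regular expression simply mirrors the format documented above):
// Parse a transcription line of the form
// [SPEAKER_ID | start_time --> end_time]: transcribed_text
function parseTranscription(line) {
  const match = line.match(/\[(\w+) \| ([\d.]+) --> ([\d.]+)\]: (.+)/);
  if (!match) return null;
  const [, speaker, start, end, text] = match;
  return { speaker, start: parseFloat(start), end: parseFloat(end), text };
}

// parseTranscription('[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?')
// => { speaker: 'SPEAKER_0', start: 0, end: 3.45, text: 'Hello, how are you doing today?' }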
3. Warning Messages
Sent when non-fatal issues occur during processing, such as audio processing delays or quality issues.
Message Format
{
  "type": "warning",
  "message": "Audio chunk dropped - processing backlog."
}
Fields:
type: Always "warning"
message: Human-readable warning description
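Warnings are informational, but some are actionable. For instance, on the processing-backlog warning listed below, a client might briefly pause sending to let the server catch up (a minimal sketch; ws is assumed to be an open WebSocket, and the 2-second pause is an illustrative choice, not part of the protocol):
// Sketch: back off briefly when the server reports a processing backlog.
// Assumes `ws` is an open WebSocket; the pause duration is illustrative.
let paused = false;

function handleWarning(data) {
  console.warn('Server warning:', data.message);
  if (data.message.includes('processing backlog')) {
    paused = true;
    setTimeout(() => { paused = false; }, 2000);
  }
}

function sendAudioChunk(buffer) {
  if (!paused && ws.readyState === WebSocket.OPEN) {
    ws.send(buffer);
  }
}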
Common Warning Messages:
"Audio chunk dropped - processing backlog." - Server is behind on processing
"Audio quality low - consider improving recording conditions." - Audio quality issues detected
"No speech detected in audio chunk." - Silent audio received
Client to Server Messages
The client sends binary audio data to the server for transcription processing.
Audio Data
// Capture microphone audio, convert it to 16-bit PCM, and stream it as binary data.
// Assumes `ws` is an open WebSocket to the transcription endpoint.
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    const audioContext = new AudioContext();
    const source = audioContext.createMediaStreamSource(stream);
    // Note: ScriptProcessorNode is deprecated in favor of AudioWorklet,
    // but it remains widely supported and keeps this example short.
    const processor = audioContext.createScriptProcessor(1024, 1, 1);
    processor.onaudioprocess = (e) => {
      const inputData = e.inputBuffer.getChannelData(0);
      // Convert float32 samples in [-1, 1] to int16
      const int16Buffer = new Int16Array(inputData.length);
      for (let i = 0; i < inputData.length; i++) {
        int16Buffer[i] = Math.max(-32768, Math.min(32767, inputData[i] * 32767));
      }
      // Send as binary data
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(int16Buffer.buffer);
      }
    };
    source.connect(processor);
    processor.connect(audioContext.destination);
  });
Audio Requirements:
Format: WAV (recommended) or raw PCM
Sample Rate: 16kHz (recommended); 44.1kHz and 48kHz also supported
Bit Depth: 16-bit
Channels: Mono (1 channel)
Chunk Size: 1024-4096 samples recommended
Supported Audio Formats:
Primary: WAV (16kHz, 16-bit, mono)
Fallback: Most common audio formats (automatically resampled)
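To match the recommended format at capture time, a client can request a mono track and a 16kHz processing context up front (a minimal sketch; browsers treat these constraints as hints and may fall back to other rates, in which case the automatic resampling above applies):
// Sketch: request mono, 16kHz audio from the browser.
// Constraints are hints; the browser may deliver a different rate,
// in which case the service resamples automatically.
const constraints = {
  audio: {
    channelCount: { ideal: 1 },
    sampleRate: { ideal: 16000 },
    echoCancellation: true,
    noiseSuppression: true
  }
};

navigator.mediaDevices.getUserMedia(constraints)
  .then(stream => {
    // Creating the AudioContext at 16kHz avoids a client-side resampling step.
    const audioContext = new AudioContext({ sampleRate: 16000 });
    const source = audioContext.createMediaStreamSource(stream);
    // ...connect to a processor and stream as shown earlier
  });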
Complete Message Flow
Basic Session Flow
Connect
const ws = new WebSocket('ws://localhost:8000/ws/transcribe');
Receive Session Init
{
  "type": "session_init",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
Send Audio Data
// Send binary audio chunks
ws.send(audioBuffer);
Receive Transcriptions
{
  "type": "transcription",
  "text": "[SPEAKER_0 | 0.000 --> 3.450]: Hello, how are you doing today?",
  "timestamp": 1703123456.789,
  "audio_file": "audio_chunks/550e8400-e29b-41d4-a716-446655440000_20231221_143056_123456.wav"
}
Error Handling
Complete Client Implementation
class TranscriptionClient {
  constructor(url) {
    this.url = url;
    this.ws = null;
    this.sessionId = null;
    this.reconnectInterval = 5000;
  }

  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      console.log('Connected to transcription service');
    };

    this.ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      this.handleMessage(data);
    };

    this.ws.onclose = () => {
      console.log('Connection closed, attempting reconnect...');
      setTimeout(() => this.connect(), this.reconnectInterval);
    };

    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
    };
  }

  handleMessage(data) {
    switch (data.type) {
      case 'session_init':
        this.sessionId = data.session_id;
        console.log('Session initialized:', this.sessionId);
        this.onSessionInit(data);
        break;
      case 'transcription':
        console.log('Transcription:', data.text);
        this.onTranscription(data);
        break;
      case 'warning':
        console.warn('Warning:', data.message);
        this.onWarning(data);
        break;
      default:
        console.log('Unknown message type:', data.type);
    }
  }

  sendAudio(audioBuffer) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(audioBuffer);
    }
  }

  // Override these methods in your implementation
  onSessionInit(data) {}
  onTranscription(data) {}
  onWarning(data) {}
}
// Usage
const client = new TranscriptionClient('ws://localhost:8000/ws/transcribe');
client.connect();

// Override handlers
client.onTranscription = (data) => {
  // Parse speaker information from the documented text format:
  // [SPEAKER_ID | start_time --> end_time]: transcribed_text
  const match = data.text.match(/\[(\w+) \| ([\d.]+) --> ([\d.]+)\]: (.+)/);
  if (match) {
    const [, speaker, startTime, endTime, text] = match;
    console.log(`${speaker} (${startTime}s - ${endTime}s): ${text}`);
  }
};
Best Practices
Use 16kHz sample rate for optimal processing
Ensure clear audio with minimal background noise
Send audio in 1-5 second chunks for best results
Monitor for audio quality warnings
Implement reconnection logic for network issues
Handle all three message types properly
Store session_id for reference
Close connections gracefully when done (see the shutdown sketch after this list)
Always validate message types before processing
Implement proper error recovery
Log warnings for debugging
Handle connection drops gracefully
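As a concrete example of the reconnection and shutdown points above, a client can close with a normal status code and suppress its own reconnect logic (a minimal sketch building on TranscriptionClient; the shouldReconnect flag and GracefulClient subclass are illustrative additions, not part of the class as documented):
// Sketch: graceful shutdown that suppresses automatic reconnection.
// `shouldReconnect` and `GracefulClient` are illustrative additions.
class GracefulClient extends TranscriptionClient {
  constructor(url) {
    super(url);
    this.shouldReconnect = true;
  }

  connect() {
    super.connect();
    // Replace the default reconnect handler with one that checks the flag.
    this.ws.onclose = () => {
      if (this.shouldReconnect) {
        console.log('Connection dropped, reconnecting...');
        setTimeout(() => this.connect(), this.reconnectInterval);
      }
    };
  }

  close() {
    this.shouldReconnect = false;
    // 1000 = normal closure per the WebSocket protocol
    if (this.ws) this.ws.close(1000, 'Client finished streaming');
  }
}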
Language Support
The transcription service supports 100+ languages through Whisper AI. The current configuration defaults to German (de); supported languages include:
English (en)
German (de) - Default
Spanish (es)
French (fr)
Italian (it)
Portuguese (pt)
Dutch (nl)
And many more…
Language detection is automatic. The service will process audio in the detected language and return transcriptions accordingly.