## Overview
Conversimple supports two conversation modes, each optimized for different use cases:

- **STT Mode** (Speech-to-Text → LLM → Text-to-Speech): a traditional pipeline with maximum flexibility
- **STS Mode** (Speech-to-Speech): a unified pipeline for ultra-low latency
## STT Mode: Maximum Flexibility

### Architecture
### When to Use STT Mode

- **Custom LLM Logic**: you need to customize LLM behavior, prompts, or temperature settings
- **Multi-Provider**: you want to use different providers for STT, LLM, and TTS
- **Processing Pipeline**: you need to process or transform text between stages
- **Advanced Control**: you require fine-grained control over each stage
### Characteristics

**Latency**: under 1 second typical response time

**Flexibility**: very high
- Separate configuration for each service
- Custom prompt engineering
- Text transformation between stages
- Provider mixing (e.g., Deepgram STT + OpenAI LLM + ElevenLabs TTS)

**Best for**:
- Complex conversation logic
- Custom LLM prompting strategies
- Multi-language support with specific providers
- Advanced text processing requirements
### Example: STT Mode Configuration
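A sketch of what an STT-mode configuration could look like, mixing providers for each stage. The field names (`mode`, `stt`, `llm`, `tts`) and provider identifiers are illustrative assumptions, not the authoritative Conversimple schema — consult your SDK reference for the exact keys.

```python
# Hypothetical STT-mode configuration; all keys are illustrative,
# not the authoritative Conversimple schema.
stt_config = {
    "mode": "stt",  # Speech-to-Text -> LLM -> Text-to-Speech pipeline
    "stt": {
        "provider": "deepgram",   # assumed provider identifier
        "language": "en-US",
    },
    "llm": {
        "provider": "openai",
        "model": "gpt-4o",
        "temperature": 0.7,       # fine-grained LLM control is an STT-mode feature
        "system_prompt": "You are a helpful support agent.",
    },
    "tts": {
        "provider": "elevenlabs",
        "voice": "example-voice-id",  # placeholder voice identifier
    },
}
```

Note how each stage carries its own provider block — this is what enables the mix-and-match flexibility described above.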
## STS Mode: Ultra-Low Latency

### Architecture
### When to Use STS Mode

- **Ultra-Low Latency**: you need the fastest possible response times
- **Natural Flow**: you want the most natural conversation dynamics
- **Simplified Stack**: you prefer fewer moving parts and dependencies
- **Gemini Optimized**: you want to leverage Gemini’s native speech-to-speech capabilities
### Characteristics

**Latency**: ultra-low, typically under 600ms
- Single unified service for the fastest response
- Approximately 2x faster than STT mode
- Better interruption handling
- Single provider (currently Gemini Live)
- Less control over individual stages
- Function calling fully supported
- Optimized for conversation flow

**Best for**:
- Customer service chatbots
- Real-time support agents
- Interactive voice assistants
- Natural conversation experiences
### Example: STS Mode Configuration
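A sketch of the corresponding STS-mode configuration. Because the pipeline is unified, there are no per-stage blocks — a single provider and a system instruction replace the separate STT/LLM/TTS sections. As with the STT example, the keys (`mode`, `provider`, `system_instruction`) are illustrative assumptions, not the authoritative schema.

```python
# Hypothetical STS-mode configuration; keys are illustrative.
sts_config = {
    "mode": "sts",                # unified speech-to-speech pipeline
    "provider": "gemini-live",    # currently the only STS provider
    # In STS mode a system instruction stands in for the full prompt
    # control that STT mode's separate LLM stage offers.
    "system_instruction": "You are a friendly voice assistant.",
    "language": "en-US",
}
```

The much smaller surface area is the trade-off: lower latency and fewer moving parts, but less per-stage control.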
## Comparison
| Feature | STT Mode | STS Mode |
|---|---|---|
| Latency | < 1 second | < 600ms (Ultra-low) |
| Providers | Mix & match | Single provider |
| Flexibility | Very High | Moderate |
| Setup Complexity | Higher | Lower |
| Function Calling | ✅ Supported | ✅ Supported |
| Interruptions | Good | Excellent |
| Custom Prompts | Full control | System instruction |
| Multi-language | Provider-specific | Gemini languages |
| Cost | Per-service pricing | Single service pricing |
## Function Calling Support

Both modes fully support function calling.

### STT Mode Function Calling
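In STT mode, tool declarations are handed to the LLM stage. The sketch below uses a common JSON-Schema-style tool declaration; the declaration format and the handler wiring are assumptions for illustration, not the documented Conversimple API.

```python
# Hypothetical tool declaration for the LLM stage in STT mode.
# JSON-Schema-style parameters; names are illustrative.
weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

def get_weather(city: str) -> dict:
    """Stub handler invoked when the model calls get_weather."""
    # Placeholder data; a real handler would query a weather service.
    return {"city": city, "condition": "sunny"}
```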
### STS Mode Function Calling
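In STS mode the same style of tool declaration is attached to the unified session configuration rather than a separate LLM stage. Again, all keys here (`tools`, `system_instruction`) are illustrative assumptions about the schema:

```python
# Hypothetical STS-mode configuration with a tool attached; keys are illustrative.
transfer_tool = {
    "name": "transfer_to_agent",
    "description": "Transfer the caller to a human agent.",
    "parameters": {
        "type": "object",
        "properties": {
            "department": {"type": "string", "description": "Target department"},
        },
        "required": ["department"],
    },
}

sts_tool_config = {
    "mode": "sts",
    "provider": "gemini-live",
    "system_instruction": "You are a support assistant.",
    "tools": [transfer_tool],  # function calling is fully supported in STS mode
}
```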
## Choosing the Right Mode
### Choose STT Mode If:
- You need to use specific providers (e.g., OpenAI, Deepgram, ElevenLabs)
- You require custom LLM configuration or prompt engineering
- You need to process or transform text between stages
- You want maximum control over each component
- You need flexibility to mix and match AI services
### Choose STS Mode If:
- Minimal latency is critical for your use case
- You want the simplest architecture
- Natural conversation flow is a priority
- You’re comfortable with Gemini Live as your provider
- You prefer fewer dependencies to manage
## Switching Between Modes

You can switch between modes by changing the configuration.

## Best Practices
### For STT Mode
- Optimize LLM prompts for your use case
- Consider provider costs and rate limits
- Test latency across the full pipeline
- Monitor each service independently
### For STS Mode
- Use for latency-critical applications
- Leverage Gemini’s natural conversation capabilities
- Test interruption handling thoroughly
- Monitor overall conversation quality