Overview

Conversimple supports two conversation modes, each optimized for different use cases:
  1. STT Mode (Speech-to-Text → LLM → Text-to-Speech): Traditional pipeline with maximum flexibility
  2. STS Mode (Speech-to-Speech): Unified pipeline for ultra-low latency

STT Mode: Maximum Flexibility

Architecture

User Speech
    ↓ Audio Input
Speech-to-Text Service (Gemini Live STT)
    ↓ Transcription
Large Language Model (Gemini 2.5 Pro)
    ↓ Generated Text Response
Text-to-Speech Service (Gemini TTS)
    ↓ Audio Output
User Speaker

When to Use STT Mode

Custom LLM Logic

Need to customize LLM behavior, prompts, or temperature settings

Multi-Provider

Want to use different providers for STT, LLM, and TTS

Processing Pipeline

Need to process or transform text between stages

Advanced Control

Require fine-grained control over each stage

Characteristics

Latency: Under 1 second typical response time
Flexibility: Very High
  • Separate configuration for each service
  • Custom prompt engineering
  • Text transformation between stages
  • Provider mixing (e.g., Deepgram STT + OpenAI LLM + ElevenLabs TTS)
Use Cases:
  • Complex conversation logic
  • Custom LLM prompting strategies
  • Multi-language support with specific providers
  • Advanced text processing requirements
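
As a concrete example of "text transformation between stages", a mid-pipeline hook might redact sensitive data from the transcript before it reaches the LLM. The function below is a hypothetical sketch, not a Conversimple API:

```python
import re

def redact_transcript(text: str) -> str:
    """Illustrative mid-pipeline transform: mask digit runs that look
    like card or account numbers before the text reaches the LLM."""
    return re.sub(r"\b\d{4,}\b", "[REDACTED]", text)

# The transformed text, not the raw transcript, would be forwarded
# to the LLM stage.
safe_text = redact_transcript("My account number is 12345678")
```

The same pattern applies to any between-stage processing, such as normalizing spoken numbers or stripping filler words.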

Example: STT Mode Configuration

from conversimple import ConversimpleAgent, tool

class CustomAgent(ConversimpleAgent):
    """Agent using STT mode for maximum flexibility"""

    def __init__(self, **kwargs):
        super().__init__(
            mode="stt",  # Explicit STT mode
            stt_provider="gemini_live",
            llm_provider="gemini_pro",
            llm_config={
                "temperature": 0.7,
                "system_instruction": "You are a helpful assistant..."
            },
            tts_provider="gemini_live",
            **kwargs
        )

    @tool("Get customer information")
    def get_customer(self, customer_id: str) -> dict:
        return {"name": "John", "tier": "premium"}

STS Mode: Ultra-Low Latency

Architecture

User Speech
    ↓ Audio Input
Gemini Live STS Service
    ↓ Complete Speech-to-Speech Processing
User Speaker

When to Use STS Mode

Ultra-Low Latency

Need the fastest possible response times

Natural Flow

Want the most natural conversation dynamics

Simplified Stack

Prefer fewer moving parts and dependencies

Gemini Optimized

Leverage Gemini’s native speech-to-speech capabilities

Characteristics

Latency: Ultra-low, typically under 600ms
  • Single unified service for fastest response
  • Approximately 2x faster than STT mode
  • Better interruption handling
Flexibility: Moderate
  • Single provider (currently Gemini Live)
  • Less control over individual stages
  • Function calling fully supported
  • Optimized for conversation flow
Use Cases:
  • Customer service chatbots
  • Real-time support agents
  • Interactive voice assistants
  • Natural conversation experiences

Example: STS Mode Configuration

from conversimple import ConversimpleAgent, tool

class FastAgent(ConversimpleAgent):
    """Agent using STS mode for minimal latency"""

    def __init__(self, **kwargs):
        super().__init__(
            mode="sts",  # Speech-to-Speech mode
            sts_provider="gemini_live",
            system_instruction="You are a helpful assistant...",
            **kwargs
        )

    @tool("Get customer information")
    def get_customer(self, customer_id: str) -> dict:
        return {"name": "John", "tier": "premium"}

Comparison

Feature           | STT Mode            | STS Mode
Latency           | < 1 second          | < 600 ms (ultra-low)
Providers         | Mix & match         | Single provider
Flexibility       | Very High           | Moderate
Setup Complexity  | Higher              | Lower
Function Calling  | ✅ Supported        | ✅ Supported
Interruptions     | Good                | Excellent
Custom Prompts    | Full control        | System instruction
Multi-language    | Provider-specific   | Gemini languages
Cost              | Per-service pricing | Single-service pricing
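
The table can be condensed into a rule of thumb. The helper below is an illustrative sketch, not part of the SDK: any requirement for per-stage control forces STT mode; otherwise STS is the simpler, lower-latency default.

```python
def recommend_mode(needs_custom_llm: bool = False,
                   mixes_providers: bool = False) -> str:
    """Rule of thumb distilled from the comparison table: per-stage
    control requires STT; otherwise prefer STS for latency."""
    if needs_custom_llm or mixes_providers:
        return "stt"
    return "sts"
```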

Function Calling Support

Both modes fully support function calling:

STT Mode Function Calling

User: "Book me a flight to NYC"
    ↓ STT
"Book me a flight to NYC"
    ↓ LLM (decides to call tool)
tool_call: book_flight(destination="NYC")
    ↓ Your Agent
{"booking_id": "ABC123", "price": 450}
    ↓ LLM (generates response)
"I've booked your flight to NYC for $450"
    ↓ TTS
Audio: "I've booked your flight..."

STS Mode Function Calling

User: "Book me a flight to NYC"
    ↓ Gemini Live STS
tool_call: book_flight(destination="NYC")
    ↓ Your Agent
{"booking_id": "ABC123", "price": 450}
    ↓ Gemini Live STS
Audio: "I've booked your flight..."

Function calling works identically in both modes; the only difference is the processing pipeline.
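
Because tool handling is mode-agnostic, the agent side of the exchange can be sketched without reference to the pipeline. The registry and call format below are illustrative; the SDK presumably builds the registry from `@tool`-decorated methods:

```python
def book_flight(destination: str) -> dict:
    # Stand-in for the agent's real business logic.
    return {"booking_id": "ABC123", "price": 450}

# Illustrative registry mapping tool names to handlers.
TOOLS = {"book_flight": book_flight}

def handle_tool_call(name: str, arguments: dict) -> dict:
    """Dispatch a tool call from either pipeline to agent code."""
    return TOOLS[name](**arguments)

# Either pipeline would deliver the same call and receive the same result.
result = handle_tool_call("book_flight", {"destination": "NYC"})
```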

Choosing the Right Mode

Choose STT Mode If:

  • You need to use specific providers (e.g., OpenAI, Deepgram, ElevenLabs)
  • You require custom LLM configuration or prompt engineering
  • You need to process or transform text between stages
  • You want maximum control over each component
  • You need flexibility to mix and match AI services

Choose STS Mode If:

  • Minimal latency is critical for your use case
  • You want the simplest architecture
  • Natural conversation flow is a priority
  • You’re comfortable with Gemini Live as your provider
  • You prefer fewer dependencies to manage

Switching Between Modes

You can easily switch between modes by changing the configuration:
# Development: Use STS for fast iteration
dev_agent = MyAgent(mode="sts", sts_provider="gemini_live")

# Production: Switch to STT for custom LLM
prod_agent = MyAgent(
    mode="stt",
    stt_provider="deepgram",
    llm_provider="openai",
    tts_provider="elevenlabs"
)
Your tool definitions and business logic remain unchanged.
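
Because only constructor arguments differ, the switch can also be driven from configuration. The sketch below is illustrative: the `AGENT_MODE` variable name is arbitrary and the provider names simply mirror the example above; the resulting dict would be passed as `MyAgent(**agent_kwargs())`.

```python
import os

def agent_kwargs(mode: str = "") -> dict:
    """Build constructor kwargs for either pipeline; the mode can come
    from configuration (here, a hypothetical AGENT_MODE variable)."""
    mode = mode or os.environ.get("AGENT_MODE", "sts")
    if mode == "sts":
        return {"mode": "sts", "sts_provider": "gemini_live"}
    return {
        "mode": "stt",
        "stt_provider": "deepgram",
        "llm_provider": "openai",
        "tts_provider": "elevenlabs",
    }
```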

Best Practices

For STT Mode

  • Optimize LLM prompts for your use case
  • Consider provider costs and rate limits
  • Test latency across the full pipeline
  • Monitor each service independently
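
One way to act on "test latency across the full pipeline" is to time each stage separately so you can see where the budget goes. The stage functions below are stubs (assumptions, not Conversimple APIs); replace them with real service calls:

```python
import time

def time_stage(fn, *args):
    """Run one pipeline stage and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stubs standing in for the real STT, LLM, and TTS calls.
stt = lambda audio: "book me a flight"
llm = lambda text: "Sure, booking now."
tts = lambda text: b"<audio bytes>"

timings = {}
text, timings["stt"] = time_stage(stt, b"<mic audio>")
reply, timings["llm"] = time_stage(llm, text)
audio, timings["tts"] = time_stage(tts, reply)
timings["total"] = sum(timings.values())
```

Logging these per-stage numbers in production makes it obvious whether STT, the LLM, or TTS is dominating the sub-second budget.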

For STS Mode

  • Use for latency-critical applications
  • Leverage Gemini’s natural conversation capabilities
  • Test interruption handling thoroughly
  • Monitor overall conversation quality

Next Steps