Overview
As your voice agent grows, you’ll need to scale to handle more concurrent conversations. This guide covers practical scaling strategies.
Vertical Scaling
Increase Resources
Start by adding more resources to your existing server:
# Kubernetes: Increase resources
resources:
  requests:
    memory: "512Mi"  # Was 256Mi
    cpu: "500m"      # Was 250m
  limits:
    memory: "1Gi"    # Was 512Mi
    cpu: "1000m"     # Was 500m
When to use:
Conversation volume growing but manageable
Simple to implement
Cost-effective for moderate growth
Limits:
Single server can only scale so far
No redundancy
Horizontal Scaling
Multiple Instances
Run multiple agent instances:
# Kubernetes: Scale replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: conversimple-agent
spec:
  replicas: 5  # Run 5 instances
# Docker Compose: Scale services
docker-compose up --scale agent=5
Benefits:
Handle more concurrent conversations
Built-in redundancy
Easy to scale up/down
Load Balancing
Distribute conversations across instances:
# Simple round-robin load balancer
from itertools import cycle

class LoadBalancer:
    def __init__(self, agent_urls: list):
        # cycle() repeats the URLs endlessly in their original order
        self.agents = cycle(agent_urls)

    def get_next_agent(self):
        """Get next agent in rotation"""
        return next(self.agents)

# Usage
balancer = LoadBalancer([
    "http://agent-1:8000",
    "http://agent-2:8000",
    "http://agent-3:8000",
])
# Route new conversation
agent_url = balancer.get_next_agent()
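Round-robin assumes every conversation costs about the same. Voice conversations vary in length, so one possible alternative is to route each new conversation to the instance with the fewest active ones. The sketch below is an illustration only and not part of the Conversimple SDK; `LeastLoadedBalancer`, `acquire`, and `release` are hypothetical names.

```python
class LeastLoadedBalancer:
    """Route each new conversation to the agent with the fewest active ones."""

    def __init__(self, agent_urls: list[str]):
        if not agent_urls:
            raise ValueError("at least one agent URL is required")
        # Active conversation count per agent URL (insertion order is kept)
        self.active = {url: 0 for url in agent_urls}

    def acquire(self) -> str:
        """Pick the least-loaded agent and count the new conversation."""
        url = min(self.active, key=self.active.get)  # ties go to the earliest URL
        self.active[url] += 1
        return url

    def release(self, url: str) -> None:
        """Call when a conversation ends so the agent's count drops."""
        if self.active[url] > 0:
            self.active[url] -= 1


balancer = LeastLoadedBalancer(["http://agent-1:8000", "http://agent-2:8000"])
first = balancer.acquire()   # agent-1 (both idle, first URL wins the tie)
second = balancer.acquire()  # agent-2 (agent-1 already has one conversation)
balancer.release(first)      # the conversation on agent-1 ends
third = balancer.acquire()   # agent-1 again, since it is idle
```

The counts here live in one process, so this only works when a single router sees every conversation start and end.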
Auto-Scaling
Kubernetes Horizontal Pod Autoscaler
Automatically scale based on load:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: conversimple-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
This will:
Start with 2 instances minimum
Scale up to 10 instances maximum
Add instances when average CPU utilization across pods rises above the 70% target
Remove instances when average CPU utilization falls below the 70% target
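To get a feel for how the autoscaler arrives at a replica count, the sketch below applies the formula given in the Kubernetes documentation, `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, with the default 10% tolerance and the min/max bounds from the manifest above. It is a simplified model: the real controller also accounts for pod readiness, missing metrics, and stabilization windows. The function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_cpu: float,
                     target_cpu: float = 70, min_replicas: int = 2,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Simplified model of the HPA replica calculation."""
    ratio = current_cpu / target_cpu
    # Within the tolerance band the autoscaler leaves the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas / maxReplicas
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(4, current_cpu=90)     # 4 * 90/70 = 5.14, rounded up
no_change = desired_replicas(4, current_cpu=72)    # within 10% of the target
scale_down = desired_replicas(4, current_cpu=35)   # 4 * 35/70 = 2
capped = desired_replicas(8, current_cpu=140)      # 16 requested, capped at 10
```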
Connection Pooling
Reuse database connections:
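from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Create connection pool
# DATABASE_URL is assumed to come from your own configuration
engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=10,     # connections kept open in the pool
    max_overflow=20,  # extra connections allowed under burst load
)

class OptimizedAgent(ConversimpleAgent):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.db = engine  # all conversations on this instance share the pool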
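Pool settings interact with horizontal scaling: each instance can open up to `pool_size + max_overflow` connections, so the database sees that number multiplied by the replica count. The sketch below works through the arithmetic with the values used above. The function name and the database limit of 100 are hypothetical; check the actual limit of your database.

```python
def max_db_connections(replicas: int, pool_size: int = 10,
                       max_overflow: int = 20) -> int:
    """Upper bound on connections opened by all instances combined."""
    per_instance = pool_size + max_overflow  # QueuePool never exceeds this
    return replicas * per_instance


DB_CONNECTION_LIMIT = 100  # hypothetical server-side limit

at_min = max_db_connections(replicas=2)    # 2 * 30
at_max = max_db_connections(replicas=10)   # 10 * 30
fits_at_max = at_max <= DB_CONNECTION_LIMIT
```

If the worst case exceeds the database limit, options include a smaller pool per instance, a lower `maxReplicas`, or a connection proxy in front of the database.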
Caching
Cache frequently accessed data:
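from functools import lru_cache

@lru_cache(maxsize=1000)
def get_product(product_id: str):
    """Cached product lookup"""
    # Bind product_id as a parameter rather than formatting it into the SQL
    # (avoids SQL injection). Assumes database.query accepts bound parameters.
    return database.query("SELECT * FROM products WHERE id = :id", {"id": product_id})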
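The following self-contained sketch shows how `lru_cache` behaves on repeated lookups. `fetch_product_from_db` is a hypothetical stand-in for a real query; it records each call so the cache's effect is visible.

```python
from functools import lru_cache

calls = []  # every lookup that actually reaches the "database"


def fetch_product_from_db(product_id: str) -> dict:
    """Hypothetical stand-in for a real database query."""
    calls.append(product_id)
    return {"id": product_id, "name": f"Product {product_id}"}


@lru_cache(maxsize=1000)
def get_product(product_id: str) -> dict:
    return fetch_product_from_db(product_id)


get_product("42")  # miss: goes to the database
get_product("42")  # hit: served from the cache
get_product("7")   # miss: different key

info = get_product.cache_info()  # hit/miss counters kept by lru_cache
get_product.cache_clear()        # drop all entries, e.g. after a catalog update
```

Two caveats when scaling: `lru_cache` is per process, so each replica keeps its own copy, and entries never expire on their own, so stale data stays until it is evicted or the cache is cleared.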
Monitoring Capacity
Track Key Metrics
Monitor these metrics to know when to scale:
import time
import psutil

class MonitoredAgent(ConversimpleAgent):
    def get_metrics(self):
        # Assumes self.start_time was set with time.time() at startup
        return {
            "active_conversations": len(self.conversations),
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": psutil.virtual_memory().percent,
            "uptime_seconds": time.time() - self.start_time,
        }

# Alert when capacity reached (agent is a running MonitoredAgent instance)
metrics = agent.get_metrics()
if metrics["active_conversations"] > 80:
    send_alert("Consider scaling up - 80+ conversations active")
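A single hard-coded check becomes hard to maintain once several metrics matter. One way to generalize it, sketched here under assumptions, is to compare a metrics dictionary against a table of limits. `check_capacity` is a hypothetical helper; the limit of 80 conversations comes from the example above, while the CPU and memory limits are illustrative values.

```python
def check_capacity(metrics: dict, limits: dict) -> list[str]:
    """Return one warning per metric that exceeds its limit."""
    warnings = []
    for name, limit in limits.items():
        value = metrics.get(name)
        # Skip metrics that were not reported rather than failing
        if value is not None and value > limit:
            warnings.append(f"{name}={value} exceeds {limit}")
    return warnings


limits = {"active_conversations": 80, "cpu_percent": 70, "memory_percent": 85}
sample = {"active_conversations": 92, "cpu_percent": 55.0, "memory_percent": 88.5}

warnings = check_capacity(sample, limits)
```

Each returned string could be passed to whatever alerting function you already use.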
Best Practices
1. Start Small, Scale as Needed
# Development: Single instance
replicas: 1
# Production: Start with 2-3, auto-scale as needed
replicas: 2
maxReplicas: 10
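For a rough starting replica count, you can divide expected peak conversations by the usable capacity of one instance. The sketch below assumes the per-instance limit of 50 conversations and the 80% alert threshold used elsewhere in this guide, plus a floor of 2 replicas for redundancy. `replicas_needed` is a hypothetical helper and the peak figures are made-up inputs.

```python
def replicas_needed(peak_conversations: int, per_instance_limit: int = 50,
                    headroom_percent: int = 80, min_replicas: int = 2) -> int:
    """Estimate replicas so each instance stays under its headroom threshold."""
    usable = per_instance_limit * headroom_percent // 100  # 40 with the defaults
    needed = -(-peak_conversations // usable)  # ceiling division on integers
    return max(min_replicas, needed)


small = replicas_needed(30)    # one instance would do, but keep 2 for redundancy
medium = replicas_needed(90)   # 90 / 40 = 2.25, rounded up
large = replicas_needed(200)   # 200 / 40 = 5
```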
2. Set Conversation Limits
# Prevent overload
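class CapacityError(Exception):
    """Raised when this instance cannot accept another conversation."""

class LimitedAgent(ConversimpleAgent):
    MAX_CONVERSATIONS = 50

    async def start_conversation(self, conv_id):
        # Reject new conversations once the per-instance limit is reached
        if len(self.conversations) >= self.MAX_CONVERSATIONS:
            raise CapacityError("At capacity")
        await super().start_conversation(conv_id)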
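The sketch below shows how a caller might handle the rejection, for example by collecting the conversations that need to go to another instance. It runs without the SDK: `BaseAgent` is a hypothetical stand-in for `ConversimpleAgent`, and the limit is set to 2 so the demo reaches it quickly.

```python
import asyncio


class CapacityError(Exception):
    """Raised when an instance refuses a new conversation."""


class BaseAgent:
    """Hypothetical stand-in for ConversimpleAgent, for demonstration only."""

    def __init__(self):
        self.conversations = {}

    async def start_conversation(self, conv_id):
        self.conversations[conv_id] = "active"


class LimitedAgent(BaseAgent):
    MAX_CONVERSATIONS = 2  # tiny limit so the demo hits it quickly

    async def start_conversation(self, conv_id):
        if len(self.conversations) >= self.MAX_CONVERSATIONS:
            raise CapacityError("At capacity")
        await super().start_conversation(conv_id)


async def demo():
    agent = LimitedAgent()
    accepted, rejected = [], []
    for conv_id in ["a", "b", "c"]:
        try:
            await agent.start_conversation(conv_id)
            accepted.append(conv_id)
        except CapacityError:
            # In a real deployment, retry on another instance here
            rejected.append(conv_id)
    return accepted, rejected


accepted, rejected = asyncio.run(demo())
```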
3. Monitor and Alert
# Alert on high load
if active_conversations > (MAX_CONVERSATIONS * 0.8):
    logger.warning("Running at 80% capacity")
    send_alert("Consider scaling up")
Scaling Checklist
When scaling your agent:
Next Steps
Deployment: Deploy your agent
Monitoring: Monitor your agent