Cut LLM costs with model routing, prompt caching, output limits, semantic caching, and batch processing.
LLM Cost Optimisation
# API costs (approximate 2024)
# Claude Haiku: $0.25/$1.25 per million tokens (in/out)
# Claude Sonnet: $3/$15 per million tokens
# Claude Opus: $15/$75 per million tokens
# GPT-4o mini: $0.15/$0.60 per million tokens
# GPT-4o: $5/$15 per million tokens
# Strategy 1: Model routing
def choose_model(task_complexity: str) -> str:
if task_complexity == 'simple':
return 'claude-haiku-4-5' # 60x cheaper than Opus
elif task_complexity == 'medium':
return 'claude-sonnet-4-5'
return 'claude-opus-4-5'
# Strategy 2: Prompt caching (Anthropic)
# Mark static system prompts with cache_control
# Up to 90% discount on cached token reads
# Strategy 3: Output length control
# 'Respond in 100 words or fewer.'
# max_tokens=150 # hard cap
# Strategy 4: Semantic cache
# Cache responses for similar questions
# Threshold: cosine_sim >= 0.95 = cache hit
# Strategy 5: Batch API
# OpenAI: 50% discount for async batch jobs
# Anthropic: batch API for non-real-time tasks
# Strategy 6: Compress prompts
# Remove redundant instructions
# Use abbreviations in few-shot examples