Model Routing
The gateway's model routing system determines which AI provider handles each request. Routing is configured per-tenant using a tier-based system with priority fallback, letting you balance cost, performance, and reliability.
Tier-Based Routing
Models are organized into tiers that map to different use cases and cost levels:
| Tier | Use Case | Example Models |
|---|---|---|
| Standard | Everyday tasks, quick answers, high-volume usage | Claude Haiku, GPT-4o Mini, Gemini Flash |
| Premium | Complex reasoning, nuanced writing, detailed analysis | Claude Sonnet, GPT-4o, Gemini Pro |
| Enterprise | Mission-critical tasks requiring the most capable models | Claude Opus, GPT-4o (high-context) |
When a user or skill specifies a tier, the gateway selects the highest-priority model configured for that tier.
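The selection rule can be sketched as follows. This is an illustrative model of the behavior, not the gateway's actual code: it assumes each tier maps to a priority-ordered list of model entries, as in the example configuration later on this page.

```python
# Sketch of tier-based selection: return the highest-priority model
# (lowest priority number) configured for the requested tier.
# Structure mirrors the example configuration; names are illustrative.

def select_model(routing_config, tier):
    """Return the highest-priority model entry for a tier, or None."""
    entries = routing_config.get(tier, [])
    if not entries:
        return None
    return min(entries, key=lambda e: e["priority"])

config = {
    "premium": [
        {"provider": "openai", "model": "gpt-4o", "priority": 2},
        {"provider": "anthropic", "model": "claude-4-sonnet", "priority": 1},
    ]
}

print(select_model(config, "premium")["model"])  # claude-4-sonnet
```

Note that the entry order in the list does not matter; only the `priority` field does.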
Supported Providers
The gateway supports four AI providers:
| Provider | Models | Notes |
|---|---|---|
| Anthropic | Claude Opus, Claude Sonnet, Claude Haiku | Primary provider for most deployments |
| OpenAI | GPT-4o, GPT-4o Mini | Full OpenAI-compatible API |
| Google | Gemini Pro, Gemini Flash | Google AI Studio or Vertex AI |
| Ollama | Llama, Qwen, DeepSeek, Gemma | Self-hosted open-source models |
Tip: Use Ollama for sensitive workloads that must stay on-premises. The AI request never leaves your network when routed to a local Ollama instance.
Priority Fallback
Each tier can have multiple models configured with priority ordering. If the primary model is unavailable (provider outage, rate limit exceeded), the gateway automatically falls back to the next model in the priority list.
Example configuration:
Premium tier:
1. Claude Sonnet (priority 1) → Primary
2. GPT-4o (priority 2) → First fallback
3. Gemini Pro (priority 3) → Second fallback

Fallback is transparent to the user: the request is retried automatically with the next provider if the primary fails.
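The fallback behavior described above can be sketched as a loop over the priority-sorted list. This is a simplified model of the logic, assuming some provider-unavailable error signals an outage or rate limit; the names here are hypothetical, not the gateway's real types.

```python
# Illustrative priority fallback: try each model in ascending priority
# order and return the first successful response. ProviderUnavailable
# stands in for whatever error a provider outage or rate limit produces.

class ProviderUnavailable(Exception):
    pass

def route_with_fallback(models, send):
    """Try models in priority order; `send` issues the request to one entry."""
    last_error = None
    for entry in sorted(models, key=lambda e: e["priority"]):
        try:
            return send(entry)
        except ProviderUnavailable as err:
            last_error = err  # fall through to the next model in the list
    raise last_error or ProviderUnavailable("no models configured")

premium = [
    {"provider": "anthropic", "model": "claude-4-sonnet", "priority": 1},
    {"provider": "openai", "model": "gpt-4o", "priority": 2},
]

def send(entry):
    # Simulate an Anthropic outage so the request falls back to GPT-4o.
    if entry["provider"] == "anthropic":
        raise ProviderUnavailable("provider outage")
    return f"response from {entry['model']}"

print(route_with_fallback(premium, send))  # response from gpt-4o
```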
Max Token Limits
Configure maximum token limits per model to control costs and prevent unexpectedly long responses:
- Max input tokens: Limits the size of the prompt sent to the model
- Max output tokens: Limits the length of the model's response
When a request exceeds the input token limit, the gateway returns an error. When the output reaches the limit, the response is truncated.
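The two enforcement behaviors can be sketched like this (hypothetical function names; the real gateway's error shape may differ): input over the limit is rejected outright, while output is simply cut off at the limit.

```python
# Sketch of per-model token limit enforcement. Input over the limit is
# rejected with an error; output reaching the limit is truncated.

def check_input(prompt_tokens, max_input_tokens):
    """Reject prompts that exceed the configured input limit."""
    if prompt_tokens > max_input_tokens:
        raise ValueError(
            f"prompt is {prompt_tokens} tokens; limit is {max_input_tokens}"
        )

def truncate_output(output_tokens, max_output_tokens):
    """Return the response tokens, cut off at the configured output limit."""
    return output_tokens[:max_output_tokens]

check_input(3000, 4096)  # within the limit: no error
print(len(truncate_output(list(range(5000)), 4096)))  # 4096
```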
Configuring Routing Rules
Routing rules are configured per-tenant through the Portal gateway admin or the desktop app's settings:
1. Navigate to Gateway Admin in the Portal (or Settings in the desktop app)
2. Select the Model Routing tab
3. For each tier, add models with their provider, model ID, priority, and token limits
4. Click Save to apply the changes
Example Configuration
```json
{
  "standard": [
    { "provider": "anthropic", "model": "claude-3-haiku", "priority": 1, "maxTokens": 4096 },
    { "provider": "openai", "model": "gpt-4o-mini", "priority": 2, "maxTokens": 4096 }
  ],
  "premium": [
    { "provider": "anthropic", "model": "claude-4-sonnet", "priority": 1, "maxTokens": 8192 },
    { "provider": "openai", "model": "gpt-4o", "priority": 2, "maxTokens": 8192 }
  ],
  "enterprise": [
    { "provider": "anthropic", "model": "claude-4-opus", "priority": 1, "maxTokens": 16384 }
  ]
}
```
Request Model Override
Individual requests can specify a model directly (e.g., "model": "claude-4-sonnet")
rather than a tier. When a specific model is requested, the gateway routes directly to that
provider, bypassing tier selection but still applying DLP, audit, and rate limit policies.
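The override path can be sketched alongside tier routing (illustrative names only): a request naming a model is matched against the configured entries directly, while a tier-based request goes through priority selection as before.

```python
# Sketch of routing with a direct model override. A request that names a
# model bypasses tier selection; a request that names a tier uses the
# priority order. Function and field names are illustrative.

def resolve_route(routing_config, request):
    """Return the model entry a request should be routed to."""
    if "model" in request:  # explicit model override
        for tier_models in routing_config.values():
            for entry in tier_models:
                if entry["model"] == request["model"]:
                    return entry
        raise LookupError(f"model {request['model']!r} is not configured")
    tier_models = routing_config[request["tier"]]
    return min(tier_models, key=lambda e: e["priority"])

config = {
    "premium": [
        {"provider": "anthropic", "model": "claude-4-sonnet", "priority": 1},
        {"provider": "openai", "model": "gpt-4o", "priority": 2},
    ]
}

print(resolve_route(config, {"model": "gpt-4o"})["provider"])  # openai
print(resolve_route(config, {"tier": "premium"})["model"])     # claude-4-sonnet
```

Either way the request is resolved, cross-cutting policies such as DLP, audit logging, and rate limits would still apply before the request leaves the gateway.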
Next Steps
- Set up DLP policies for data protection
- Configure cost controls and budgets
- Manage routing from the Portal gateway admin