Model Routing

The gateway's model routing system determines which AI provider handles each request. Routing is configured per-tenant using a tier-based system with priority fallback, letting you balance cost, performance, and reliability.

Tier-Based Routing

Models are organized into tiers that map to different use cases and cost levels:

Tier        Use Case                                                  Example Models
Standard    Everyday tasks, quick answers, high-volume usage          Claude Haiku, GPT-4o Mini, Gemini Flash
Premium     Complex reasoning, nuanced writing, detailed analysis     Claude Sonnet, GPT-4o, Gemini Pro
Enterprise  Mission-critical tasks requiring the most capable models  Claude Opus, GPT-4o (high-context)

When a user or skill specifies a tier, the gateway selects the highest-priority model configured for that tier.
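Selection within a tier amounts to picking the entry with the lowest priority number. A minimal Python sketch, assuming tier models are represented as dicts with `provider`, `model`, and `priority` keys (the shape used by the example configuration later on this page):

```python
def select_model(tier_models):
    """Pick the highest-priority model (lowest priority number) for a tier."""
    return min(tier_models, key=lambda m: m["priority"])

# Illustrative Premium tier entries; real model IDs come from your routing config.
premium = [
    {"provider": "anthropic", "model": "claude-4-sonnet", "priority": 1},
    {"provider": "openai", "model": "gpt-4o", "priority": 2},
]
print(select_model(premium)["model"])  # prints "claude-4-sonnet"
```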

Supported Providers

The gateway supports four AI providers:

Provider   Models                                    Notes
Anthropic  Claude Opus, Claude Sonnet, Claude Haiku  Primary provider for most deployments
OpenAI     GPT-4o, GPT-4o Mini                       Full OpenAI-compatible API
Google     Gemini Pro, Gemini Flash                  Google AI Studio or Vertex AI
Ollama     Llama, Qwen, DeepSeek, Gemma              Self-hosted open-source models

Tip: Use Ollama for sensitive workloads that must stay on-premises. The AI request never leaves your network when routed to a local Ollama instance.

Priority Fallback

Each tier can have multiple models configured with priority ordering. If the primary model is unavailable (provider outage, rate limit exceeded), the gateway automatically falls back to the next model in the priority list.

Example configuration:

Premium tier:
  1. Claude Sonnet (priority 1) → Primary
  2. GPT-4o (priority 2)       → First fallback
  3. Gemini Pro (priority 3)   → Second fallback

Fallback is transparent to the user — the request is retried automatically with the next provider if the primary fails.
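The retry loop can be sketched as follows. This is an illustration, not the gateway's implementation: `ProviderError` and the `send` callback stand in for the real provider client and its failure modes.

```python
class ProviderError(Exception):
    """Placeholder for provider outage or rate-limit errors."""

def call_with_fallback(tier_models, send):
    # Try each model in priority order; move to the next on provider failure.
    last_err = None
    for model in sorted(tier_models, key=lambda m: m["priority"]):
        try:
            return send(model)
        except ProviderError as err:
            last_err = err  # outage or rate limit: fall back to the next model
    raise last_err  # every configured provider failed

# Simulate a primary outage: Claude Sonnet is rate-limited, GPT-4o answers.
premium = [
    {"provider": "anthropic", "model": "claude-4-sonnet", "priority": 1},
    {"provider": "openai", "model": "gpt-4o", "priority": 2},
]

def send(model):
    if model["provider"] == "anthropic":
        raise ProviderError("rate limit exceeded")
    return f"response from {model['model']}"

print(call_with_fallback(premium, send))  # prints "response from gpt-4o"
```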

Max Token Limits

Configure maximum token limits per model to control costs and prevent unexpectedly long responses:

  • Max input tokens: Limits the size of the prompt sent to the model
  • Max output tokens: Limits the length of the model's response

When a request exceeds the input token limit, the gateway returns an error. When the output reaches the limit, the response is truncated.
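The two limits behave differently, which the following sketch makes concrete (function and exception names are illustrative, not the gateway's API):

```python
class InputTooLarge(Exception):
    """Surfaced to the caller as an error when the prompt exceeds the input limit."""

def check_input(prompt_tokens: int, max_input: int) -> None:
    # Oversized prompts are rejected up front, not silently trimmed.
    if prompt_tokens > max_input:
        raise InputTooLarge(f"prompt is {prompt_tokens} tokens, limit is {max_input}")

def cap_output(output_tokens: list, max_output: int) -> list:
    # Responses are simply cut off once the output limit is reached.
    return output_tokens[:max_output]
```

Note the asymmetry: an input overrun fails the request, while an output overrun still returns a (truncated) response.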

Configuring Routing Rules

Routing rules are configured per-tenant through the Portal gateway admin or the desktop app's settings:

  1. Navigate to Gateway Admin in the Portal (or Settings in the desktop app)
  2. Select the Model Routing tab
  3. For each tier, add models with their provider, model ID, priority, and token limits
  4. Click Save to apply the changes

Example Configuration

{
  "standard": [
    { "provider": "anthropic", "model": "claude-3-haiku", "priority": 1, "maxTokens": 4096 },
    { "provider": "openai", "model": "gpt-4o-mini", "priority": 2, "maxTokens": 4096 }
  ],
  "premium": [
    { "provider": "anthropic", "model": "claude-4-sonnet", "priority": 1, "maxTokens": 8192 },
    { "provider": "openai", "model": "gpt-4o", "priority": 2, "maxTokens": 8192 }
  ],
  "enterprise": [
    { "provider": "anthropic", "model": "claude-4-opus", "priority": 1, "maxTokens": 16384 }
  ]
}

Request Model Override

Individual requests can specify a model directly (e.g., "model": "claude-4-sonnet") rather than a tier. When a specific model is requested, the gateway routes directly to that provider, bypassing tier selection but still applying DLP, audit, and rate limit policies.
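Conceptually, override resolution works like this sketch, where the `tiers` dict mirrors the example configuration above and the request field names are illustrative:

```python
def route(request: dict, tiers: dict) -> dict:
    # An explicit "model" in the request bypasses tier selection entirely.
    if "model" in request:
        for models in tiers.values():
            for entry in models:
                if entry["model"] == request["model"]:
                    return entry
        raise ValueError(f"unknown model: {request['model']}")
    # Otherwise, pick the highest-priority model for the requested tier.
    return min(tiers[request["tier"]], key=lambda m: m["priority"])

tiers = {
    "premium": [
        {"provider": "anthropic", "model": "claude-4-sonnet", "priority": 1},
        {"provider": "openai", "model": "gpt-4o", "priority": 2},
    ],
}
print(route({"model": "gpt-4o"}, tiers)["provider"])  # prints "openai"
```

Either path ends at a concrete provider entry, which is why DLP, audit, and rate-limit policies can be applied uniformly after routing.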

Next Steps