Model Routing

The gateway's model routing system determines which AI provider handles each request. Routing is configured per-tenant using a tier-based system with priority fallback, letting you balance cost, performance, and reliability.

Tier-Based Routing

Models are organized into tiers that map to different use cases and cost levels:

Tier        Use Case                                                  Example Models
Standard    Everyday tasks, quick answers, high-volume usage          Claude Haiku, GPT-4o Mini, Gemini Flash
Premium     Complex reasoning, nuanced writing, detailed analysis     Claude Sonnet, GPT-4o, Gemini Pro
Enterprise  Mission-critical tasks requiring the most capable models  Claude Opus, GPT-4o (high-context)

When a user or skill specifies a tier, the gateway selects the highest-priority model configured for that tier.
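Selection within a tier amounts to picking the entry with the lowest priority number. A minimal Python sketch, assuming tier models are represented as dicts with `provider`, `model`, and `priority` keys (the shape used by the example configuration later on this page):

```python
def select_model(tier_models):
    """Pick the highest-priority model (lowest priority number) for a tier."""
    return min(tier_models, key=lambda m: m["priority"])

# Illustrative Premium tier entries; real model IDs come from your routing config.
premium = [
    {"provider": "anthropic", "model": "claude-4-sonnet", "priority": 1},
    {"provider": "openai", "model": "gpt-4o", "priority": 2},
]
print(select_model(premium)["model"])  # prints "claude-4-sonnet"
```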

Supported Providers

The gateway supports four AI providers:

Provider   Models                                    Notes
Anthropic  Claude Opus, Claude Sonnet, Claude Haiku  Primary provider for most deployments
OpenAI     GPT-4o, GPT-4o Mini                       Full OpenAI-compatible API
Google     Gemini Pro, Gemini Flash                  Google AI Studio or Vertex AI
Ollama     Llama, Qwen, DeepSeek, Gemma              Self-hosted open-source models

Tip: Use Ollama for sensitive workloads that must stay on-premises. The AI request never leaves your network when routed to a local Ollama instance.

Priority Fallback

Each tier can have multiple models configured with priority ordering. If the primary model is unavailable (provider outage, rate limit exceeded), the gateway automatically falls back to the next model in the priority list.

Example configuration:

Premium tier:
  1. Claude Sonnet (priority 1) → Primary
  2. GPT-4o (priority 2)       → First fallback
  3. Gemini Pro (priority 3)   → Second fallback

Fallback is transparent to the user — the request is retried automatically with the next provider if the primary fails.
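The retry loop can be sketched as follows. This is an illustration, not the gateway's implementation: `ProviderError` and the `send` callback stand in for the real provider client and its failure modes.

```python
class ProviderError(Exception):
    """Placeholder for provider outage or rate-limit errors."""

def call_with_fallback(tier_models, send):
    # Try each model in priority order; move to the next on provider failure.
    last_err = None
    for model in sorted(tier_models, key=lambda m: m["priority"]):
        try:
            return send(model)
        except ProviderError as err:
            last_err = err  # outage or rate limit: fall back to the next model
    raise last_err  # every configured provider failed

# Simulate a primary outage: Claude Sonnet is rate-limited, GPT-4o answers.
premium = [
    {"provider": "anthropic", "model": "claude-4-sonnet", "priority": 1},
    {"provider": "openai", "model": "gpt-4o", "priority": 2},
]

def send(model):
    if model["provider"] == "anthropic":
        raise ProviderError("rate limit exceeded")
    return f"response from {model['model']}"

print(call_with_fallback(premium, send))  # prints "response from gpt-4o"
```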

Max Token Limits

Configure maximum token limits per model to control costs and prevent unexpectedly long responses:

  • Max input tokens: Limits the size of the prompt sent to the model
  • Max output tokens: Limits the length of the model's response

When a request exceeds the input token limit, the gateway returns an error. When the output reaches the limit, the response is truncated.
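The two limits behave differently, which the following sketch makes concrete (function and exception names are illustrative, not the gateway's API):

```python
class InputTooLarge(Exception):
    """Surfaced to the caller as an error when the prompt exceeds the input limit."""

def check_input(prompt_tokens: int, max_input: int) -> None:
    # Oversized prompts are rejected up front, not silently trimmed.
    if prompt_tokens > max_input:
        raise InputTooLarge(f"prompt is {prompt_tokens} tokens, limit is {max_input}")

def cap_output(output_tokens: list, max_output: int) -> list:
    # Responses are simply cut off once the output limit is reached.
    return output_tokens[:max_output]
```

Note the asymmetry: an input overrun fails the request, while an output overrun still returns a (truncated) response.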

Configuring Routing Rules

Routing rules are configured per-tenant through the Portal gateway admin or the desktop app's settings:

  1. Navigate to Gateway Admin in the Portal (or Settings in the desktop app)
  2. Select the Model Routing tab
  3. For each tier, add models with their provider, model ID, priority, and token limits
  4. Click Save to apply the changes

Example Configuration

{
  "standard": [
    { "provider": "anthropic", "model": "claude-3-haiku", "priority": 1, "maxTokens": 4096 },
    { "provider": "openai", "model": "gpt-4o-mini", "priority": 2, "maxTokens": 4096 }
  ],
  "premium": [
    { "provider": "anthropic", "model": "claude-4-sonnet", "priority": 1, "maxTokens": 8192 },
    { "provider": "openai", "model": "gpt-4o", "priority": 2, "maxTokens": 8192 }
  ],
  "enterprise": [
    { "provider": "anthropic", "model": "claude-4-opus", "priority": 1, "maxTokens": 16384 }
  ]
}

Request Model Override

Individual requests can specify a model directly (e.g., "model": "claude-4-sonnet") rather than a tier. When a specific model is requested, the gateway routes directly to that provider, bypassing tier selection but still applying DLP, audit, and rate limit policies.
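Conceptually, override resolution works like this sketch, where the `tiers` dict mirrors the example configuration above and the request field names are illustrative:

```python
def route(request: dict, tiers: dict) -> dict:
    # An explicit "model" in the request bypasses tier selection entirely.
    if "model" in request:
        for models in tiers.values():
            for entry in models:
                if entry["model"] == request["model"]:
                    return entry
        raise ValueError(f"unknown model: {request['model']}")
    # Otherwise, pick the highest-priority model for the requested tier.
    return min(tiers[request["tier"]], key=lambda m: m["priority"])

tiers = {
    "premium": [
        {"provider": "anthropic", "model": "claude-4-sonnet", "priority": 1},
        {"provider": "openai", "model": "gpt-4o", "priority": 2},
    ],
}
print(route({"model": "gpt-4o"}, tiers)["provider"])  # prints "openai"
```

Either path ends at a concrete provider entry, which is why DLP, audit, and rate-limit policies can be applied uniformly after routing.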

Next Steps