Multi-Provider AI Infrastructure

Vertex AI + NVIDIA NIM + Ollama Hybrid Stack

Executive Summary: RUNE operates a hybrid AI infrastructure combining Google Vertex AI (Gemini 1.5 Pro), NVIDIA NIM APIs (developer access), and Ollama local inference (rune-ai-v1 on an RTX 4070 Super). This multi-provider approach balances cost, latency, and capability: cloud APIs handle complex reasoning, and local models handle high-frequency, low-latency tasks.

Live API Configuration

Active API Integrations

Currently configured and operational in CONFIG/RUNE_CONFIG.json:

NVIDIA NIM API: Developer access to inference endpoints (nvapi-***), used for cost-effective batch processing.
Google Gemini API: Direct API key for Gemini 1.5 Pro multi-modal reasoning (AIzaSy***).
OpenAI API: GPT-4o for comparison benchmarks and fallback (sk-proj-***).
GitHub API: Automated issue creation for agent task distribution (github_pat_***).
Stripe API: Live payment processing for SHOWROOM e-commerce (sk_live_***).
Dialogflow CX: Webhook integration for conversational AI flows.
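
Because these keys sit in one config file, it helps to mask them before they reach logs or dashboards. The following is a minimal sketch of that idea. The `api_keys` layout, the sample values, and the function names are hypothetical; the actual schema of CONFIG/RUNE_CONFIG.json is not shown here and may differ.

```python
import json

# Hypothetical layout for CONFIG/RUNE_CONFIG.json; the real schema may differ.
SAMPLE_CONFIG = """
{
  "api_keys": {
    "nvidia_nim": "nvapi-EXAMPLE000",
    "gemini": "AIzaSyEXAMPLE000",
    "stripe": "sk_live_EXAMPLE000"
  }
}
"""

# Key prefixes as they appear in the integration list above.
KNOWN_PREFIXES = ("github_pat_", "sk-proj-", "sk_live_", "nvapi-", "AIzaSy")


def mask_key(key: str) -> str:
    """Keep only a recognized prefix and replace the rest with ***."""
    for prefix in KNOWN_PREFIXES:
        if key.startswith(prefix):
            return prefix + "***"
    return "***"  # unknown format: reveal nothing


def masked_keys(config_text: str) -> dict:
    """Parse the config JSON and return a provider -> masked key mapping."""
    config = json.loads(config_text)
    return {name: mask_key(value) for name, value in config["api_keys"].items()}
```

Calling `masked_keys(SAMPLE_CONFIG)` yields entries in the same masked form used in the list above, so the output is safe to print or display.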

NVIDIA Integration

RUNE currently uses NVIDIA developer APIs. A Google Cloud rep recommended the NVIDIA Inception program (application pending).

NIM Endpoints: Access to NVIDIA-hosted inference for Llama, Mistral, and custom models
Cost Optimization: NVIDIA APIs as primary backend, Gemini for complex multi-modal tasks
Inception Status: Developer access active, full Inception membership pending
Future Plan: Apply for NVIDIA Inception once revenue milestones are hit

Ollama Local Stack

Three models run locally on the RTX 4070 Super (16GB VRAM):

rune-ai-v1:latest (4.4 GB) — Custom fine-tuned model for RUNE HQ orchestration
rune-domain-agent:latest (4.4 GB) — Specialized agent for domain-specific workflows
mistral:7b (4.4 GB) — Base model for experimentation and fine-tuning
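
As a rough illustration of how one of these local models could be queried, the sketch below builds a non-streaming request for Ollama's `/api/generate` REST endpoint using only the standard library. It assumes an Ollama server on the default port 11434; the helper names `build_generate_request` and `generate` are illustrative and are not part of RUNE's codebase.

```python
import json
import urllib.request

# Assumes a local Ollama server listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `generate("rune-ai-v1:latest", "Summarize open tasks")` would return the model's reply as a string. Separating request construction from sending keeps the payload easy to inspect without a live server.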

Hardware Infrastructure

Production Rigs

Documented in STATE.json hardware_stack:

Rig 1: RTX 3060 (12GB VRAM) + 32GB RAM — Tier 2 worker inference
Rig 2: RTX 4070 Super (16GB VRAM) + 64GB RAM + AMD 9900X — Tier 1 supervisor (Llama 70B capable)
Laptop: Floater backup + daily coordinator interface

Inference Stack

Tier 1 (Supervisor): Llama 3.1 70B Q4_K_M (400-600ms latency, $0/month)
Tier 2 (Workers): Mixtral 8x7B MoE Q4 (100-200ms latency, $0/month)
Tier 3 (Floaters): Phi-3.5 Mini Q4 (20-50ms latency, $0/month)
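
One way to read the tier list is as a latency budget lookup: pick the most capable tier whose worst-case latency still fits the caller's budget. The sketch below encodes the upper end of each quoted latency range. The `Tier` class and `pick_tier` function are hypothetical names, and the selection rule itself is an assumption rather than RUNE's documented scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_latency_ms: int  # upper end of the latency range quoted above


# Ordered from most to least capable, using the figures from the list above.
TIERS = (
    Tier("Tier 1 (Supervisor)", 600),
    Tier("Tier 2 (Workers)", 200),
    Tier("Tier 3 (Floaters)", 50),
)


def pick_tier(latency_budget_ms: int) -> Tier:
    """Return the most capable tier whose worst-case latency fits the budget."""
    for tier in TIERS:
        if tier.max_latency_ms <= latency_budget_ms:
            return tier
    raise ValueError(f"no tier responds within {latency_budget_ms} ms")
```

Under this rule a 250 ms budget resolves to the worker tier, while a budget below 50 ms raises an error because no tier can meet it.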

The Google Cloud rep specifically recommended pursuing NVIDIA Inception for additional compute credits and support. Current strategy: Use NVIDIA NIM APIs for cost-effective inference, escalate complex multi-modal tasks to Gemini 1.5 Pro, and run high-frequency orchestration locally via Ollama. This hybrid approach keeps costs near $0 for development while maintaining access to premium capabilities.
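
The routing strategy described above can be sketched as a small decision function. The `Task` flags, the `Provider` enum, and the precedence order (multi-modal first, then high-frequency, then NIM as the default) are assumptions made for illustration; the source does not specify how conflicting cases are resolved.

```python
from dataclasses import dataclass
from enum import Enum


class Provider(Enum):
    OLLAMA_LOCAL = "ollama"
    NVIDIA_NIM = "nvidia_nim"
    GEMINI = "gemini"


@dataclass(frozen=True)
class Task:
    multimodal: bool = False      # needs complex multi-modal reasoning
    high_frequency: bool = False  # orchestration call made at high frequency


def route(task: Task) -> Provider:
    """Apply the strategy in an assumed order: escalate, keep local, else NIM."""
    if task.multimodal:
        return Provider.GEMINI        # escalate complex multi-modal work
    if task.high_frequency:
        return Provider.OLLAMA_LOCAL  # keep orchestration on local hardware
    return Provider.NVIDIA_NIM        # default cost-effective backend
```

For example, `route(Task(high_frequency=True))` selects the local Ollama stack, and a task with no flags set falls through to NVIDIA NIM.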

See It Live

Neural Hub

See Vertex AI integration in action. Multi-model orchestration with real-time cost tracking.


Ollama Docs

Documentation for local LLM inference. Run models on consumer hardware at $0/month.


NVIDIA NIM

Developer documentation for NVIDIA inference endpoints. Enterprise-grade model hosting.


Vertex AI Docs

Official Google Cloud documentation for Vertex AI platform and Gemini models.
