Multi-Provider AI Infrastructure
Vertex AI + NVIDIA NIM + Ollama Hybrid Stack
Executive Summary: RUNE operates a hybrid AI infrastructure combining Google Vertex AI (Gemini 1.5 Pro), NVIDIA NIM APIs (developer access), and Ollama local inference (rune-ai-v1 on RTX 4070). This multi-provider approach optimizes for cost, latency, and capability—using cloud APIs for complex reasoning and local models for high-frequency, low-latency tasks.
Live API Configuration
Active API Integrations
Currently configured and operational in CONFIG/RUNE_CONFIG.json:
- NVIDIA NIM API: Developer access for inference endpoints (nvapi-***). Used for cost-effective batch processing.
- Google Gemini API: Direct API key for Gemini 1.5 Pro multi-modal reasoning (AIzaSy***)
- OpenAI API: GPT-4o for comparison benchmarks and fallback (sk-proj-***)
- GitHub API: Automated issue creation for agent task distribution (github_pat_***)
- Stripe API: Live payment processing for SHOWROOM e-commerce (sk_live_***)
- Dialogflow CX: Webhook integration for conversational AI flows
NVIDIA Integration
Currently using NVIDIA developer APIs. Google rep recommended NVIDIA Inception program (application pending).
- NIM Endpoints: Access to NVIDIA-hosted inference for Llama, Mistral, and custom models
- Cost Optimization: NVIDIA APIs as primary backend, Gemini for complex multi-modal tasks
- Inception Status: Developer access active, full Inception membership pending
- Future Plan: Apply for NVIDIA Inception once revenue milestones hit
Ollama Local Stack
Three models running locally on RTX 4070 Super (16GB VRAM):
- rune-ai-v1:latest (4.4 GB) — Custom fine-tuned model for RUNE HQ orchestration
- rune-domain-agent:latest (4.4 GB) — Specialized agent for domain-specific workflows
- mistral:7b (4.4 GB) — Base model for experimentation and fine-tuning
Hardware Infrastructure
Production Rigs
Documented in STATE.json hardware_stack:
- Rig 1: RTX 3060 (12GB VRAM) + 32GB RAM — Tier 2 worker inference
- Rig 2: RTX 4070 Super (16GB VRAM) + 64GB RAM + AMD 9900X — Tier 1 supervisor (Llama 70B capable)
- Laptop: Floater backup + daily coordinator interface
Inference Stack
- Tier 1 (Supervisor): Llama 3.1 70B Q4_K_M — 400-600ms latency, $0/month
- Tier 2 (Workers): Mistral 8x7B MoE Q4 — 100-200ms latency, $0/month
- Tier 3 (Floaters): Phi-3.5 Mini Q4 — 20-50ms latency, $0/month
The Google Cloud rep specifically recommended pursuing NVIDIA Inception for additional compute credits and support. Current strategy: Use NVIDIA NIM APIs for cost-effective inference, escalate complex multi-modal tasks to Gemini 1.5 Pro, and run high-frequency orchestration locally via Ollama. This hybrid approach keeps costs near $0 for development while maintaining access to premium capabilities.
See It Live
Neural Hub
See Vertex AI integration in action. Multi-model orchestration with real-time cost tracking.
OPEN DASHBOARD →Ollama Docs
Documentation for local LLM inference. Run models on consumer hardware at $0/month.
OLLAMA DOCS →NVIDIA NIM
Developer documentation for NVIDIA inference endpoints. Enterprise-grade model hosting.
NVIDIA NIM →Vertex AI Docs
Official Google Cloud documentation for Vertex AI platform and Gemini models.
READ DOCS →