robotadocs
← robota.io

Local LLM Setup

Robota works with any OpenAI-compatible local inference server. This guide covers the two most popular options — Ollama and LM Studio — plus any llama.cpp server or custom endpoint.

No API key required. Local models run entirely on your machine. Your code, prompts, and conversation history never leave your device.


Quick Start

# 1. Start your local model server (see below)
 
# 2. Configure Robota to use it
robota --configure
 
# Select "Gemma / LM Studio" and enter your server URL:
#   Ollama:    http://localhost:11434/v1
#   LM Studio: http://localhost:1234/v1

Option 1: Ollama

Ollama is the easiest way to run local models. It manages downloads and serves an OpenAI-compatible API automatically.

Install and start

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
 
# Pull a model (examples)
ollama pull llama3.2          # 3B — fast, good for most tasks
ollama pull codellama         # Optimized for code
ollama pull qwen2.5-coder     # Strong at TypeScript/Python
ollama pull mistral           # Good general purpose
 
# Verify the server is running
curl http://localhost:11434/v1/models

Ollama starts automatically and listens on http://localhost:11434.

Configure Robota

robota --configure

When prompted:

  • Provider: Select Gemma / LM Studio (uses OpenAI-compatible API)
  • Base URL: http://localhost:11434/v1
  • Model: Enter the model name exactly as in ollama list (e.g. llama3.2, codellama)
  • API key: Enter any value (e.g. ollama) — Ollama does not validate it

Or set via environment variables:

ROBOTA_PROVIDER=gemma \
ROBOTA_BASE_URL=http://localhost:11434/v1 \
ROBOTA_MODEL=llama3.2 \
ROBOTA_API_KEY=ollama \
robota
ModelSizeBest for
qwen2.5-coder:7b7BTypeScript, Python, code review
codellama:13b13BGeneral code generation
llama3.2:3b3BFast responses, simple tasks
deepseek-coder-v2:16b16BComplex reasoning, refactoring

Larger models produce better results but require more RAM and run slower.


Option 2: LM Studio

LM Studio provides a GUI for downloading and running models, with a built-in local API server.

Install and start

  1. Download LM Studio from lmstudio.ai
  2. In the Discover tab, search for and download a model (e.g. Gemma 3, Llama 3.2, Qwen 2.5 Coder)
  3. In the Local Server tab, click Start Server

The server runs on http://localhost:1234 by default.

Configure Robota

robota --configure

When prompted:

  • Provider: Select Gemma / LM Studio
  • Base URL: http://localhost:1234/v1
  • Model: Enter the model name exactly as shown in LM Studio's Local Server tab
  • API key: lm-studio (placeholder — LM Studio does not validate it)

Option 3: llama.cpp Server

If you compile and run llama.cpp directly:

# Start llama.cpp server
./llama-server -m models/your-model.gguf --port 8080
 
# Configure Robota
robota --configure
# Base URL: http://localhost:8080/v1
# API key: any value (e.g. "local")

Troubleshooting

"Connection refused" or "Network error"

  • Verify the server is running: curl http://localhost:11434/v1/models
  • Check the port number matches your configuration
  • On some systems, the server may bind to 127.0.0.1 only — try http://127.0.0.1:11434/v1

Model not responding

  • Ensure the model name in Robota's config exactly matches the model loaded in your server
  • For Ollama, run ollama list to see available model names
  • For LM Studio, the model name is shown in the Local Server tab header

Slow responses

  • Local models are slower than cloud APIs, especially on CPU
  • Use smaller quantized models (e.g. q4_k_m variants) for faster inference
  • Enable GPU acceleration in Ollama or LM Studio settings if available

Tool calling not working

Some local models do not support the OpenAI function-calling format. If tool calls fail silently:

  1. Try a model known to support tools: qwen2.5-coder, llama3.2, mistral-nemo
  2. In Robota, run /mode plan to switch to plan mode, which requires fewer tool calls
  3. Check the model's documentation for tool-call support

Tips for Local Model Usage

  • Context windows are smaller. Most local models have 4K–32K token context vs 200K+ for cloud models. Use /compact when the context fills up.
  • Response quality varies. Code quality from 7B models is good for straightforward tasks; use 13B+ for complex refactoring.
  • No rate limits. You can run as many sessions as your hardware supports.
  • Offline capable. Once the model is downloaded, no internet connection is required.