
Local Models

Human supports multiple local inference servers. No API key is required for any local provider. Add your provider to ~/.human/config.json and set default_provider and default_model.
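The lookup this config drives can be sketched in a few lines. This is an illustrative Python snippet, not Human's actual implementation; it only shows how default_provider resolves to a provider's base_url in the config shape used throughout this page:

```python
import json

def resolve_base_url(config: dict) -> str:
    """Return the base_url of the provider named by default_provider."""
    name = config["default_provider"]
    for provider in config["providers"]:
        if provider["name"] == name:
            return provider["base_url"]
    raise KeyError(f"no provider named {name!r} in config")

# Example config matching the shape used throughout this page.
config = json.loads("""
{
  "default_provider": "ollama",
  "default_model": "llama3",
  "providers": [
    {"name": "ollama", "base_url": "http://localhost:11434"}
  ]
}
""")

print(resolve_base_url(config))  # http://localhost:11434
```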

llama.cpp provides a high-performance local server for GGUF models.

Install with Homebrew:
brew install llama.cpp

Or build from source:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j

Download a GGUF model from Hugging Face or run:

# Example: Llama 3.2 3B
curl -L -o models/llama-3.2-3b.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf"

Start the server:

llama-server --model path/to/model.gguf --port 8080

Or, if you built from source, run the binary from the build directory:

./llama-server -m models/llama-3.2-3b.gguf --port 8080

Add to ~/.human/config.json:

{
  "default_provider": "llamacpp",
  "default_model": "llama-3.2-3b",
  "providers": [
    {
      "name": "llamacpp",
      "base_url": "http://localhost:8080/v1"
    }
  ]
}

Use "llama.cpp" as the provider name if you prefer (same backend). The default_model should match the name you pass to llama-server (or a friendly alias; the server may accept partial matches).


Ollama is the easiest way to run local models.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

Then pull the models you want:

ollama pull llama3
ollama pull mistral
ollama pull codellama

Ollama runs as a background service. If it isn’t already running, start it manually:

ollama serve

Default URL: http://localhost:11434

Add to ~/.human/config.json:

{
  "default_provider": "ollama",
  "default_model": "llama3",
  "providers": [
    {
      "name": "ollama",
      "base_url": "http://localhost:11434"
    }
  ]
}

Models: llama3, mistral, codellama, qwen2, phi3, gemma2, etc. Use the exact name from ollama list:

ollama list

Example output:

NAME            ID          SIZE    MODIFIED
llama3:latest   abc123...   4.7 GB  2 days ago
mistral:7b      def456...   4.1 GB  1 week ago
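Since default_model must use the exact name shown by ollama list, the NAME column can be pulled out programmatically. This is a hypothetical helper, not part of Human, shown here only to make the "exact name" rule concrete:

```python
def model_names(ollama_list_output: str) -> list[str]:
    """Parse `ollama list` output and return the NAME column values."""
    lines = ollama_list_output.strip().splitlines()
    # Skip the header row; the name is the first whitespace-separated field.
    return [line.split()[0] for line in lines[1:]]

sample = """\
NAME            ID          SIZE    MODIFIED
llama3:latest   abc123...   4.7 GB  2 days ago
mistral:7b      def456...   4.1 GB  1 week ago
"""

print(model_names(sample))  # ['llama3:latest', 'mistral:7b']
```

Any of the returned strings is a valid default_model value for the ollama provider.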

LM Studio provides a graphical interface to load and run local models.

  1. Download and install LM Studio
  2. Download a model (e.g. Llama 3.2) from the in-app model browser
  3. Load the model and start the local server (Server tab)
  4. Default port: 1234

Add to ~/.human/config.json:

{
  "default_provider": "lmstudio",
  "default_model": "lmstudio-community/Llama-3.2-3B-Instruct-GGUF",
  "providers": [
    {
      "name": "lmstudio",
      "base_url": "http://localhost:1234/v1"
    }
  ]
}

Use "lm-studio" as the provider name if you prefer (same backend). The model name must match what LM Studio shows in the Server tab while the local server is running; check the “Loaded model” display for the exact identifier.


vLLM is a high-throughput server for HF-style models.

pip install vllm

Start the OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-3B-Instruct --port 8000

Example output:

INFO: Started server process
INFO: Uvicorn running on http://0.0.0.0:8000

Add to ~/.human/config.json:

{
  "default_provider": "vllm",
  "default_model": "meta-llama/Llama-3.2-3B-Instruct",
  "providers": [
    {
      "name": "vllm",
      "base_url": "http://localhost:8000/v1"
    }
  ]
}

The default_model must match the --model passed to the vLLM server.
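One way to check that match is against the server's /v1/models endpoint, which these OpenAI-compatible servers expose. The sketch below is illustrative, not Human's code; the embedded sample follows the standard OpenAI models-list response shape, with the id reflecting the --model passed at startup:

```python
import json

def model_is_served(default_model: str, models_response: dict) -> bool:
    """True if default_model appears among the ids in a /v1/models response."""
    return any(m["id"] == default_model for m in models_response["data"])

# Sample response in the OpenAI models-list shape.
response = json.loads("""
{
  "object": "list",
  "data": [{"id": "meta-llama/Llama-3.2-3B-Instruct", "object": "model"}]
}
""")

print(model_is_served("meta-llama/Llama-3.2-3B-Instruct", response))  # True
```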


sglang is a fast structured generation engine.

pip install "sglang[all]"

Start the server:

python -m sglang.launch_server --model-path meta-llama/Llama-3.2-3B-Instruct --port 30000

Add to ~/.human/config.json:

{
  "default_provider": "sglang",
  "default_model": "meta-llama/Llama-3.2-3B-Instruct",
  "providers": [
    {
      "name": "sglang",
      "base_url": "http://localhost:30000/v1"
    }
  ]
}

osaurus is another OpenAI-compatible server.

pip install osaurus
osaurus serve --model <model-name> --port 1337

Add to ~/.human/config.json:

{
  "default_provider": "osaurus",
  "default_model": "model-name",
  "providers": [
    {
      "name": "osaurus",
      "base_url": "http://localhost:1337/v1"
    }
  ]
}

Provider     Port    Best for
llama.cpp    8080    GGUF models, low RAM
Ollama       11434   Easiest setup, many models
LM Studio    1234    GUI, casual use
vLLM         8000    High throughput, HF models
sglang       30000   Fast structured output
osaurus      1337    Alternative server

All use the OpenAI chat completions API, so Human treats them uniformly.
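That uniformity means one request builder covers every provider in the table. A minimal illustrative sketch (the field names are the standard chat completions payload; the helper itself is hypothetical, not Human's code):

```python
def chat_request(base_url: str, model: str, prompt: str) -> tuple[str, dict]:
    """Build the endpoint URL and JSON payload for a chat completion call."""
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, payload

# The same builder works for any provider entry from the table above.
url, payload = chat_request("http://localhost:8080/v1", "llama-3.2-3b", "Hello")
print(url)  # http://localhost:8080/v1/chat/completions
```

POSTing that payload as JSON with any HTTP client returns a chat completion from whichever local server is listening at the base_url.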