Ollama

Run large language models locally with a single command. Ollama handles model downloads, quantization, and an OpenAI-compatible API — the fastest way to run Llama, Mistral, Gemma, and Phi locally.
★ 95.0k stars · ↓ 4.2M installs · by Ollama · updated Apr 1, 2025 · MIT license

## what it does

Ollama makes running local LLMs as simple as `ollama run llama3`. It handles model downloads (with content verification), selection of a quantized GGUF build, and GPU offloading, and it serves a local OpenAI-compatible REST API on port 11434.

With 95,000+ GitHub stars and millions of downloads, it’s one of the most widely used local LLM runtimes.
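
As a rough illustration of the local REST API on port 11434, the sketch below calls Ollama's native `/api/generate` endpoint using only the Python standard library. The helper names (`build_generate_request`, `generate`) are made up for this example, and it assumes the server is running on the default port with the `llama3` model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address; adjust if yours differs


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST for Ollama's native /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call, commented out because it requires a running Ollama server:
# print(generate("llama3", "Why is the sky blue?"))
```

Setting `"stream": False` asks for a single JSON object instead of a stream of chunks, which keeps the parsing to one `json.loads` call.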

Installation

```shell
# macOS / Linux (one-line installer):
curl -fsSL https://ollama.com/install.sh | sh

# macOS via Homebrew:
brew install ollama

# Windows: download installer from https://ollama.com/download

# Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
```
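
After installing, it can be handy to confirm that the CLI is on your PATH and that something is listening on port 11434. The following is a minimal sketch, not an official check; the function names are hypothetical and it only tests for an open TCP port, not that the listener is actually Ollama.

```python
import shutil
import socket


def ollama_on_path() -> bool:
    """True if an `ollama` executable is found on PATH."""
    return shutil.which("ollama") is not None


def port_open(host: str = "127.0.0.1", port: int = 11434, timeout: float = 1.0) -> bool:
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def install_status() -> dict:
    """Summarize both checks in one dictionary."""
    return {"cli_installed": ollama_on_path(), "server_listening": port_open()}


print(install_status())
```

With the Docker install, expect `cli_installed` to be false on the host even when `server_listening` is true, since the binary lives inside the container.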

Running models

```shell
# Pull and run (interactive):
ollama run llama3.3

# Pull only:
ollama pull mistral

# List installed models:
ollama list

# Remove a model:
ollama rm llama3.3
```
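
The same inventory that `ollama list` prints is also available over HTTP from the `/api/tags` endpoint. Here is a small sketch of reading it from Python; the helper names are illustrative, and the sample dictionary in the usage comment only mimics the shape of the response rather than real output.

```python
import json
import urllib.request


def parse_model_names(tags_response: dict) -> list[str]:
    """Extract sorted model names from a decoded /api/tags response body."""
    return sorted(m["name"] for m in tags_response.get("models", []))


def list_installed(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch installed model names from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))


# Illustrative input shape (not real server output):
# parse_model_names({"models": [{"name": "mistral:latest"}]})
```

Keeping the parsing separate from the network call makes it easy to test against a saved response without a server.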

OpenAI-compatible API

Ollama exposes an API compatible with OpenAI’s client libraries:

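```python
from openai import OpenAI

# The client requires an API key, but Ollama does not check its value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.3",  # must already be pulled locally
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```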

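Because the request format matches, most projects that use OpenAI’s chat completions API can point at Ollama by changing the base URL and the model name. The client library still requires an API key, but Ollama ignores its value.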
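
To make the "change the base URL" point concrete, the sketch below builds the same chat completions request for two targets with the standard library alone. The function name `build_chat_request`, the placeholder key, and the `gpt-4o-mini` model name are assumptions for illustration; nothing is sent over the network.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a POST to <base_url>/chat/completions in the OpenAI wire format."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Only the base URL, key, and model name differ between the two targets.
openai_req = build_chat_request("https://api.openai.com/v1", "YOUR_OPENAI_KEY", "gpt-4o-mini", "Hello!")
ollama_req = build_chat_request("http://localhost:11434/v1", "ollama", "llama3.3", "Hello!")

print(openai_req.full_url)
print(ollama_req.full_url)
```

Features beyond basic chat completions may not behave identically on both sides, so it is worth testing the specific calls your project relies on.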

| Model | Size | Use case |
| --- | --- | --- |
| llama3.3 | 70B (Q4) | General purpose, best quality |
| mistral | 7B | Fast, good for most tasks |
| gemma3 | 4B / 12B | Google’s efficient model |
| phi4 | 14B | Strong reasoning |
| deepseek-r1 | 7B–671B | Reasoning / coding |
| llava | 7B / 13B | Vision + text (multimodal) |

Hardware requirements

  • Minimum: 4GB RAM — runs small quantized models (Phi 3 Mini, Gemma 2B)
  • Comfortable: 16GB unified memory — most 7B–13B models run fast
  • Fast: M3 Max / M4 Pro, NVIDIA RTX 3090+ — 70B models usable

GPU offloading is automatic — Ollama detects your GPU and offloads as many layers as VRAM allows.
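
A back-of-the-envelope way to relate model size to these memory tiers is parameters times bits per weight, divided by eight. The sketch below applies that arithmetic; `weight_memory_gb` is a hypothetical helper, and the result covers weights only. Real quantized files are somewhat larger, and the context cache and runtime overhead add more on top.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare a 4-bit quantized model with its 16-bit counterpart.
for name, params in [("7B", 7), ("13B", 13), ("70B", 70)]:
    q4 = weight_memory_gb(params, 4)
    fp16 = weight_memory_gb(params, 16)
    print(f"{name}: ~{q4:.1f} GB at 4-bit, ~{fp16:.1f} GB at 16-bit")
```

By this estimate a 70B model needs roughly 35 GB for 4-bit weights alone, which is why it sits in the top hardware tier, while a 7B model at about 3.5 GB fits comfortably in 16GB of memory.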

## platforms

macOS · Linux · Windows · Docker

4GB RAM minimum (quantized). 8GB+ recommended for 7B models. M1/M2/M3 Mac or NVIDIA GPU for fast inference.
