# Local Inference

> Run Morphik completely offline with local embedding and completion models

Morphik comes with built-in support for running **both embeddings and completions** locally, ensuring your data never leaves your machine. Choose between two powerful local inference engines:

* **Lemonade** - Windows-only, optimized for AMD GPUs and NPUs
* **Ollama** - Cross-platform (Windows, macOS, Linux), supports various hardware

Both are pre-configured in Morphik and can be selected through the UI or configuration file.

## Why Local Inference?

Running models locally provides several key advantages:

* **Complete Privacy**: Your data never leaves your machine
* **No API Costs**: Eliminate ongoing API expenses
* **Low Latency**: No network round-trips for inference
* **Offline Capability**: Work without internet connectivity
* **Hardware Acceleration**: Leverage your local GPU, NPU, or specialized AI processors

## 🍋 Lemonade

Run embeddings & completions locally with AMD GPU/NPU acceleration

Lemonade SDK provides high-performance local inference on Windows, with optimizations for AMD hardware. It exposes an OpenAI-compatible API and is **already configured in Morphik**.

**Built-in Support**: Lemonade models are pre-configured in `morphik.toml` for both embeddings and completions. Simply install Lemonade Server and select the models in the UI.

### System Requirements

* **Windows 10/11 only** (x86/x64)
* **8GB+ RAM** (16GB recommended)
* **Python 3.10+**
* **Optional but recommended**:
  * AMD Ryzen AI 300 series (NPU acceleration)
  * AMD Radeon 7000/9000 series (GPU acceleration)

### Quick Start

1. Download and install Lemonade from the official site: [lemonade-server.ai](https://lemonade-server.ai/).
2. Start the Lemonade server following their documentation. Make sure it is running and note the port. The API is OpenAI-compatible (e.g., `/api/v1/models`).

### Option 1: Using the UI (Recommended)

1. Open the Morphik UI and go to Settings → API Keys
2. Select "Lemonade" (🍋). No API key is required
3. Enter the host and port where Lemonade is running

Lemonade provider settings with host and port

4. Open Chat and use the model selector pill (top left) to pick a Lemonade model

Chat model selector showing Lemonade models

Running inside Docker? Use `host.docker.internal` instead of `localhost` for the host field.

If you are not using a vision-capable model, turn off ColPali in chat settings (settings → ColPali) to avoid vision-dependent paths.

### Option 2: Edit morphik.toml

You can also set Lemonade models directly in `morphik.toml` so they're used by default. Ensure the `api_base` points to your Lemonade server:

```toml
lemonade_qwen = { model_name = "openai/Qwen2.5-VL-7B-Instruct-GGUF", api_base = "http://localhost:8020/api/v1", vision = true }
lemonade_embedding = { model_name = "openai/nomic-embed-text-v1-GGUF", api_base = "http://localhost:8020/api/v1" }

[completion]
model = "lemonade_qwen"

[embedding]
model = "lemonade_embedding"
```

If your system has under 16GB RAM, prefer models under ~4B parameters or smaller quantizations (e.g., Q4/Q5). Larger models may fail to load or will be very slow on low-memory systems.

### Performance Tips

* **Model Quantization**: Use GGUF quantized models for better performance
* **Low-memory systems**: Under 16GB RAM, prefer models under 4B parameters
* **Hardware Acceleration**: Automatically detects and uses AMD GPUs/NPUs when available
* **Memory Management**: Models are cached after first download

### Troubleshooting

* Verify server health: `curl http://localhost:8020/health`
* List models: `curl http://localhost:8020/api/v1/models`
* For Docker: Use `host.docker.internal` instead of `localhost`
* Check firewall settings for port 8020
* Ensure sufficient disk space (5-15GB per model)
* Try smaller quantized versions (Q4, Q5)
* Check model compatibility with `lemonade list`
* Use GGUF quantized models for better performance
* Monitor GPU/NPU usage with system tools
* Adjust batch size and context length in model config
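The "List models" check can also be scripted, which is handy when you want to confirm that the model names in `morphik.toml` are actually served. The following is a minimal sketch, not part of Morphik or Lemonade: it assumes the server listens on port 8020 as in the examples above and that `/api/v1/models` returns the usual OpenAI-style list (`{"data": [{"id": ...}]}`). The function names are illustrative.

```python
import json
import urllib.request


def lemonade_api_base(host: str = "localhost", port: int = 8020) -> str:
    """Build the base URL used for `api_base` in the morphik.toml examples."""
    return f"http://{host}:{port}/api/v1"


def extract_model_ids(payload: dict) -> list[str]:
    """Collect model ids from an OpenAI-style model list payload.

    Assumption: the payload looks like {"data": [{"id": "..."}, ...]}.
    Entries without an "id" key are skipped.
    """
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_models(api_base: str, timeout: float = 5.0) -> list[str]:
    """GET {api_base}/models and return the ids the server reports.

    Raises urllib.error.URLError (an OSError) if the server is unreachable.
    """
    with urllib.request.urlopen(f"{api_base}/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

For example, `list_models(lemonade_api_base())` queries a server on the same machine, while `list_models(lemonade_api_base(host="host.docker.internal"))` is the equivalent call from inside a Docker container. An `OSError` here usually points to the server not running, a wrong port, or a firewall rule.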

## Ollama - All Platforms

Run embeddings & completions locally on Windows, macOS, or Linux

Ollama provides cross-platform local inference for both embeddings and completions. It's **already configured in Morphik** and supports various hardware accelerators.

**Built-in Support**: Ollama models are pre-configured in `morphik.toml` for both embeddings and completions. Simply install Ollama and select the models in the UI.

### System Requirements

* **macOS**: Apple Silicon (M1/M2/M3) or Intel Mac with 8GB+ RAM
* **Linux**: x86_64 or ARM64, 8GB+ RAM, optional NVIDIA GPU
* **Windows**: Windows 10/11, 8GB+ RAM, optional NVIDIA GPU

### Quick Start

Install Ollama for your platform.

**macOS**

```bash
brew install ollama
# Or: curl -fsSL https://ollama.com/install.sh | sh
```

**Linux**

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows**

Download the installer from [ollama.com/download](https://ollama.com/download/windows).

Then start the server:

```bash
# Start Ollama service
ollama serve
```

Or use Docker Compose with Morphik:

```bash
docker compose --profile ollama -f docker-compose.run.yml up -d
```

### Option 1: Using the UI (Recommended)

1. Open Morphik UI and navigate to Settings
2. Select Ollama models from the dropdown for:
   * **Completion Model**: `ollama_qwen_vision` or `ollama_llama_vision`
   * **Embedding Model**: `ollama_embedding` (nomic-embed-text)

### Option 2: Edit morphik.toml

Morphik comes with pre-configured Ollama models:

```toml
# Already configured in morphik.toml
ollama_qwen_vision = { model_name = "ollama_chat/qwen2.5vl:latest", api_base = "http://localhost:11434", vision = true }
ollama_embedding = { model_name = "ollama/nomic-embed-text", api_base = "http://localhost:11434" }

# To use Ollama as default:
[completion]
model = "ollama_qwen_vision"

[embedding]
model = "ollama_embedding"
```

When running Morphik in Docker, change `localhost:11434` in `api_base` to `ollama:11434` if you use the Ollama Compose profile, or to `host.docker.internal:11434` if Ollama runs separately on the host.
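The host substitution described above is easy to get wrong when switching between setups, so it can help to generate the entries rather than edit them by hand. This is a hedged sketch only: the mode names (`native`, `compose-profile`, `docker-external`) and both function names are hypothetical labels introduced here, while the hostnames and port come from the note above.

```python
OLLAMA_PORT = 11434

# Hostname per deployment mode. The hostnames come from the Docker note;
# the mode names are hypothetical labels used only in this sketch.
_HOSTS = {
    "native": "localhost",                      # Morphik and Ollama on the same machine
    "compose-profile": "ollama",                # Ollama started via the Compose profile
    "docker-external": "host.docker.internal",  # Morphik in Docker, Ollama on the host
}


def ollama_api_base(mode: str) -> str:
    """Return the api_base URL to use for the given deployment mode."""
    try:
        host = _HOSTS[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(_HOSTS)}") from None
    return f"http://{host}:{OLLAMA_PORT}"


def model_entry(key: str, model_name: str, mode: str, vision: bool = False) -> str:
    """Render one inline-table line in the style of the morphik.toml snippet."""
    fields = [f'model_name = "{model_name}"', f'api_base = "{ollama_api_base(mode)}"']
    if vision:
        fields.append("vision = true")
    return f"{key} = {{ {', '.join(fields)} }}"
```

Calling `model_entry("ollama_embedding", "ollama/nomic-embed-text", "compose-profile")` produces the embedding line with `api_base = "http://ollama:11434"`, ready to paste over the pre-configured entry.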
Pull the pre-configured models:

```bash
# For embeddings (required for RAG)
ollama pull nomic-embed-text

# For completions (choose one)
ollama pull qwen2.5vl:latest    # Vision-capable, 7B
ollama pull llama3.2-vision     # Vision-capable, 11B
ollama pull qwen2:1.5b          # Text-only, fast
```

Then select them in the UI chat interface!

### Hardware Acceleration

**Apple Silicon (M1/M2/M3)**

* Ollama automatically uses Metal for GPU acceleration
* No additional configuration needed
* Excellent performance on unified memory architecture

**NVIDIA GPUs**

* Install CUDA drivers (11.8+ recommended)
* Ollama auto-detects and uses available GPUs
* Monitor usage: `nvidia-smi`

**AMD GPUs (Linux)**

* ROCm support is experimental
* Set environment variable: `HSA_OVERRIDE_GFX_VERSION=10.3.0`

### Performance Tuning

**Memory Management**

```bash
# Set GPU memory limit (NVIDIA)
OLLAMA_MAX_VRAM=8GB ollama serve

# Adjust number of parallel requests
OLLAMA_NUM_PARALLEL=4 ollama serve

# Keep models loaded in memory
OLLAMA_KEEP_ALIVE=30m ollama serve
```

**Model Quantization**

Ollama supports various quantization levels:

* `q4_0` - 4-bit quantization (smallest, fastest)
* `q5_1` - 5-bit quantization (balanced)
* `q8_0` - 8-bit quantization (best quality)

```bash
# Pull specific quantization
ollama pull llama3.2:3b-q4_0    # Smaller, faster
ollama pull llama3.2:3b-q8_0    # Better quality
```

### Monitoring & Management

**Check Status**

```bash
# List downloaded models
ollama list

# View running (loaded) models
ollama ps

# Check API health
curl http://localhost:11434/api/tags
```

**Resource Usage**

```bash
# Monitor in real-time
watch -n 1 ollama ps

# Check model details
ollama show llama3.2 --modelfile
```

### Creating Custom Models

Create specialized models for your use case:

```dockerfile
# Modelfile
FROM llama3.2:3b

# Set parameters
PARAMETER temperature 0.1
PARAMETER num_ctx 4096

# Add system prompt
SYSTEM """You are a helpful assistant specialized in document analysis and information retrieval.
Always provide accurate, concise responses based on the provided context."""
```

Build and use:

```bash
ollama create morphik-assistant -f Modelfile
ollama run morphik-assistant
```
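Besides `ollama run`, the custom model can be queried over Ollama's REST API, which is a quick way to confirm it responds before pointing Morphik at it. The sketch below assumes the server is reachable at `http://localhost:11434` and that the model was created as `morphik-assistant`; the helper names are illustrative and not part of Morphik.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "morphik-assistant",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 120.0, **kwargs) -> str:
    """Send the prompt and return the generated text.

    With "stream": false, Ollama replies with a single JSON object whose
    "response" field holds the full completion.
    """
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

A call such as `generate("Summarize the key points of this contract clause: ...")` returns the model's answer as a string. The first request can be slow while the model loads, hence the generous default timeout.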