> ## Documentation Index
> Fetch the complete documentation index at: https://morphik.ai/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Local Inference

> Run Morphik completely offline with local embedding and completion models

Morphik comes with built-in support for running **both embeddings and completions** locally, ensuring your data never leaves your machine. Choose between two powerful local inference engines:

* **Lemonade** - Windows-only, optimized for AMD GPUs and NPUs
* **Ollama** - Cross-platform (Windows, macOS, Linux), supports various hardware

Both are pre-configured in Morphik and can be selected through the UI or configuration file.

## Why Local Inference?

Running models locally provides several key advantages:

* **Complete Privacy**: Your data never leaves your machine
* **No API Costs**: Eliminate ongoing API expenses
* **Low Latency**: No network round-trips for inference
* **Offline Capability**: Work without internet connectivity
* **Hardware Acceleration**: Leverage your local GPU, NPU, or specialized AI processors

<Tabs>
  <Tab title="Lemonade">
    <div style={{display: 'flex', alignItems: 'center', gap: '20px', marginBottom: '30px'}}>
      <img src="https://mintcdn.com/databridge/eYNTu58F8b1Z2Eq-/images/amd-logo.svg?fit=max&auto=format&n=eYNTu58F8b1Z2Eq-&q=85&s=9bcbd32ddecaf9e66ecddfad2cb0e331" alt="AMD" style={{height: '60px', objectFit: 'contain'}} width="800" height="191" data-path="images/amd-logo.svg" />

      <span style={{fontSize: '48px'}}>🍋</span>

      <div>
        <h2 style={{margin: 0}}>Lemonade</h2>
        <p style={{margin: '5px 0 0 0', color: '#666'}}>Run embeddings & completions locally with AMD GPU/NPU acceleration</p>
      </div>
    </div>

    Lemonade SDK provides high-performance local inference on Windows, with optimizations for AMD hardware. It exposes an OpenAI-compatible API and is **already configured in Morphik**.

    <Note>
      **Built-in Support**: Lemonade models are pre-configured in `morphik.toml` for both embeddings and completions. Simply install Lemonade Server and select the models in the UI.
    </Note>

    ### System Requirements

    * **Windows 10/11 only** (x86/x64)
    * **8GB+ RAM** (16GB recommended)
    * **Python 3.10+**
    * **Optional but recommended**:
      * AMD Ryzen AI 300 series (NPU acceleration)
      * AMD Radeon 7000/9000 series (GPU acceleration)

    ### Quick Start

    <Steps>
      <Step title="Download Lemonade">
        Download and install Lemonade from the official site:
        [lemonade-server.ai](https://lemonade-server.ai/).
      </Step>

      <Step title="Start Lemonade Server">
        Start the Lemonade server following their documentation. Make sure it is running and note the port.
        The API is OpenAI-compatible (e.g., `/api/v1/models`).
      </Step>

      <Step title="Configure Morphik - Two Options">
        ### Option 1: Using the UI (Recommended)

        1. Open the Morphik UI and go to Settings → API Keys
        2. Select "Lemonade" (🍋). No API key is required
        3. Enter the host and port where Lemonade is running
                   <img src="https://mintcdn.com/databridge/eYNTu58F8b1Z2Eq-/images/add_port_to_lemonade.png?fit=max&auto=format&n=eYNTu58F8b1Z2Eq-&q=85&s=7ffe802c4360482e95fec1fec80d9bc3" alt="Lemonade provider settings with host and port" style={{maxWidth: '640px', margin: '16px 0'}} width="2208" height="1251" data-path="images/add_port_to_lemonade.png" />
        4. Open Chat and use the model selector pill (top left) to pick a Lemonade model
                   <img src="https://mintcdn.com/databridge/eYNTu58F8b1Z2Eq-/images/see_lemonade_models_in_chat.png?fit=max&auto=format&n=eYNTu58F8b1Z2Eq-&q=85&s=8f67516caac6dbdda1b22a6fe92a674a" alt="Chat model selector showing Lemonade models" style={{maxWidth: '640px', margin: '16px 0'}} width="2227" height="1194" data-path="images/see_lemonade_models_in_chat.png" />

        <Warning>
          Running inside Docker? Use `host.docker.internal` instead of `localhost` for the host field.
        </Warning>

        <Warning>
          If you are not using a vision-capable model, turn off ColPali in chat settings (settings → ColPali) to avoid vision-dependent paths.
        </Warning>

        ### Option 2: Edit morphik.toml

        You can also set Lemonade models directly in `morphik.toml` so they're used by default.
        Ensure the `api_base` points to your Lemonade server:

        ```toml theme={null}
        lemonade_qwen = {
          model_name = "openai/Qwen2.5-VL-7B-Instruct-GGUF",
          api_base = "http://localhost:8020/api/v1",
          vision = true
        }
        lemonade_embedding = {
          model_name = "openai/nomic-embed-text-v1-GGUF",
          api_base = "http://localhost:8020/api/v1"
        }

        [completion]
        model = "lemonade_qwen"

        [embedding]
        model = "lemonade_embedding"
        ```
      </Step>
    </Steps>

    <Warning>
      If your system has under 16GB RAM, prefer models under \~4B parameters or smaller quantizations
      (e.g., Q4/Q5). Larger models may fail to load or will be very slow on low-memory systems.
    </Warning>

    ### Performance Tips

    * **Model Quantization**: Use GGUF quantized models for better performance
    * **Low-memory systems**: Under 16GB RAM, prefer models under 4B parameters
    * **Hardware Acceleration**: Automatically detects and uses AMD GPUs/NPUs when available
    * **Memory Management**: Models are cached after first download

    ### Troubleshooting

    <AccordionGroup>
      <Accordion title="Connection Issues">
        * Verify server health: `curl http://localhost:8020/health`
        * List models: `curl http://localhost:8020/api/v1/models`
        * For Docker: Use `host.docker.internal` instead of `localhost`
        * Check firewall settings for port 8020
      </Accordion>

      <Accordion title="Model Loading Errors">
        * Ensure sufficient disk space (5-15GB per model)
        * Try smaller quantized versions (Q4, Q5)
        * Check model compatibility with `lemonade list`
      </Accordion>

      <Accordion title="Performance Issues">
        * Use GGUF quantized models for better performance
        * Monitor GPU/NPU usage with system tools
        * Adjust batch size and context length in model config
      </Accordion>
    </AccordionGroup>
  </Tab>

  <Tab title="Ollama">
    <div style={{display: 'flex', alignItems: 'center', gap: '20px', marginBottom: '30px'}}>
      <img src="https://mintcdn.com/databridge/eYNTu58F8b1Z2Eq-/images/ollama-logo.png?fit=max&auto=format&n=eYNTu58F8b1Z2Eq-&q=85&s=aa6d43acf6119056b5d758c677ab9d9e" alt="Ollama" style={{height: '60px', objectFit: 'contain'}} width="181" height="256" data-path="images/ollama-logo.png" />

      <div>
        <h2 style={{margin: 0}}>Ollama - All Platforms</h2>
        <p style={{margin: '5px 0 0 0', color: '#666'}}>Run embeddings & completions locally on Windows, macOS, or Linux</p>
      </div>
    </div>

    Ollama provides cross-platform local inference for both embeddings and completions. It's **already configured in Morphik** and supports various hardware accelerators.

    <Note>
      **Built-in Support**: Ollama models are pre-configured in `morphik.toml` for both embeddings and completions. Simply install Ollama and select the models in the UI.
    </Note>

    ### System Requirements

    * **macOS**: Apple Silicon (M1/M2/M3) or Intel Mac with 8GB+ RAM
    * **Linux**: x86\_64 or ARM64, 8GB+ RAM, optional NVIDIA GPU
    * **Windows**: Windows 10/11, 8GB+ RAM, optional NVIDIA GPU

    ### Quick Start

    <Steps>
      <Step title="Install Ollama">
        <Tabs>
          <Tab title="macOS">
            ```bash theme={null}
            brew install ollama
            # Or: curl -fsSL https://ollama.com/install.sh | sh
            ```
          </Tab>

          <Tab title="Linux">
            ```bash theme={null}
            curl -fsSL https://ollama.com/install.sh | sh
            ```
          </Tab>

          <Tab title="Windows">
            Download installer from [ollama.com/download](https://ollama.com/download/windows)
          </Tab>
        </Tabs>
      </Step>

      <Step title="Start Ollama">
        ```bash theme={null}
        # Start Ollama service
        ollama serve
        ```

        Or use Docker Compose with Morphik:

        ```bash theme={null}
        docker compose --profile ollama -f docker-compose.run.yml up -d
        ```
      </Step>

      <Step title="Configure Morphik - Two Options">
        ### Option 1: Using the UI (Recommended)

        1. Open Morphik UI and navigate to Settings
        2. Select Ollama models from the dropdown for:
           * **Completion Model**: `ollama_qwen_vision` or `ollama_llama_vision`
           * **Embedding Model**: `ollama_embedding` (nomic-embed-text)

        ### Option 2: Edit morphik.toml

        Morphik comes with pre-configured Ollama models:

        ```toml theme={null}
        # Already configured in morphik.toml
        ollama_qwen_vision = { 
          model_name = "ollama_chat/qwen2.5vl:latest", 
          api_base = "http://localhost:11434", 
          vision = true 
        }
        ollama_embedding = { 
          model_name = "ollama/nomic-embed-text", 
          api_base = "http://localhost:11434" 
        }

        # To use Ollama as default:
        [completion]
        model = "ollama_qwen_vision"

        [embedding]
        model = "ollama_embedding"
        ```

        <Warning>
          When running Morphik in Docker, change `localhost` to `ollama:11434` if using the Ollama profile, or `host.docker.internal:11434` if running Ollama separately.
        </Warning>
      </Step>

      <Step title="Download and Use Models">
        Pull the pre-configured models:

        ```bash theme={null}
        # For embeddings (required for RAG)
        ollama pull nomic-embed-text

        # For completions (choose one)
        ollama pull qwen2.5vl:latest    # Vision-capable, 7B
        ollama pull llama3.2-vision      # Vision-capable, 11B
        ollama pull qwen2:1.5b          # Text-only, fast
        ```

        Then select them in the UI chat interface!
      </Step>
    </Steps>

    ### Hardware Acceleration

    **Apple Silicon (M1/M2/M3)**

    * Ollama automatically uses Metal for GPU acceleration
    * No additional configuration needed
    * Excellent performance on unified memory architecture

    **NVIDIA GPUs**

    * Install CUDA drivers (11.8+ recommended)
    * Ollama auto-detects and uses available GPUs
    * Monitor usage: `nvidia-smi`

    **AMD GPUs (Linux)**

    * ROCm support is experimental
    * Set environment variable: `HSA_OVERRIDE_GFX_VERSION=10.3.0`

    ### Performance Tuning

    **Memory Management**

    ```bash theme={null}
    # Set GPU memory limit (NVIDIA)
    OLLAMA_MAX_VRAM=8GB ollama serve

    # Adjust number of parallel requests
    OLLAMA_NUM_PARALLEL=4 ollama serve

    # Keep models loaded in memory
    OLLAMA_KEEP_ALIVE=30m ollama serve
    ```

    **Model Quantization**

    Ollama supports various quantization levels:

    * `q4_0` - 4-bit quantization (smallest, fastest)
    * `q5_1` - 5-bit quantization (balanced)
    * `q8_0` - 8-bit quantization (best quality)

    ```bash theme={null}
    # Pull specific quantization
    ollama pull llama3.2:3b-q4_0  # Smaller, faster
    ollama pull llama3.2:3b-q8_0  # Better quality
    ```

    ### Monitoring & Management

    **Check Status**

    ```bash theme={null}
    # List loaded models
    ollama list

    # View running models
    ollama ps

    # Check API health
    curl http://localhost:11434/api/tags
    ```

    **Resource Usage**

    ```bash theme={null}
    # Monitor in real-time
    watch -n 1 ollama ps

    # Check model details
    ollama show llama3.2 --modelfile
    ```

    ### Creating Custom Models

    Create specialized models for your use case:

    ```dockerfile theme={null}
    # Modelfile
    FROM llama3.2:3b

    # Set parameters
    PARAMETER temperature 0.1
    PARAMETER num_ctx 4096

    # Add system prompt
    SYSTEM """You are a helpful assistant specialized in document analysis 
    and information retrieval. Always provide accurate, concise responses 
    based on the provided context."""
    ```

    Build and use:

    ```bash theme={null}
    ollama create morphik-assistant -f Modelfile
    ollama run morphik-assistant
    ```
  </Tab>
</Tabs>
