
    Using speech-to-text with OpenClaw

    OpenClaw, previously known as ClawBot and MoltBot, is a personal AI agent that you run on your own infrastructure. You control OpenClaw via a chat app of your choice, such as WhatsApp, and you can also use speech-to-text (STT) and/or text-to-speech (TTS). This means you don’t have to type messages—you can speak to OpenClaw instead. 

    In this guide, we explain how to use speech-to-text with OpenClaw on Ubuntu/Debian, what options are available, and the main trade-offs and benefits.

    • Before you start this guide, make sure you have a VPS/computer/laptop with OpenClaw, and that you’ve completed the onboarding.
       
    • Are you using a local model and want the fastest performance and highest accuracy? Then we recommend 4 CPU cores, but even with 2 CPU cores, STT performs surprisingly well.
     

     

    Which STT options does OpenClaw offer?

     

    OpenClaw essentially has two options for STT: 

    • Gateway voice-note handling (automatic STT): This uses tools.media.audio in your OpenClaw configuration (this is the best option for ‘natural conversation’). OpenClaw automatically chooses the first available provider, unless you’ve configured a specific order: 
      Local CLIs (e.g. local Whisper) → Gemini → OpenAI → Groq → Whisper.cpp CLI → sherpa-onnx, etc.
       
    • Agent skills, namely openai-whisper / openai-whisper-api. These skills teach the agent to run a CLI tool to transcribe an audio file when it decides to use the relevant skill.

     

    STT speed vs accuracy

    Option                                               | Speed        | Accuracy
    OpenAI gpt-4o-mini-transcribe                        | ⭐⭐⭐⭐     | ⭐⭐⭐⭐
    OpenAI gpt-4o-transcribe                             | ⭐⭐⭐       | ⭐⭐⭐⭐⭐
    OpenAI whisper-1                                     | ⭐⭐⭐       | ⭐⭐⭐
    Groq whisper-large-v3-turbo                          | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐
    Groq whisper-large-v3                                | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐⭐
    Gemini CLI (gemini), depends on the LLM              | ⭐→⭐⭐⭐⭐  | ⭐→⭐⭐⭐⭐
    Local Whisper.cpp (tiny→large), depends on the model | ⭐⭐⭐⭐⭐→⭐ | ⭐→⭐⭐⭐⭐⭐
    • For the fastest and best overall option, we recommend using gateway voice-note handling and not using skills.
       
    • Do you speak English well? Then we recommend the local model with gateway voice-note handling; even without a GPU and with just 2 CPU cores, this is a remarkably fast and accurate option.
     

     

    The differences between the OpenAI-Whisper and OpenAI-Whisper-API skills

     

    In both cases, the underlying technology is OpenAI’s Whisper STT. However, there are a few important differences:

    openai-whisper                           | openai-whisper-api
    Free                                     | Paid
    No API key required                      | Requires an API key from OpenAI
    Performance depends on your own hardware | Excellent performance

     

    Configuring gateway voice-note handling STT

     

    Local model (no API)

     

    In this section, you’ll install gateway voice-note handling for STT using a number of components:

    • whisper.cpp: a C++ implementation of Whisper that runs considerably faster on CPU than the reference Python implementation
    • The Whisper CLI tool
    • A quantised STT model, balancing performance and accuracy
    • FFmpeg for compatibility with different types of audio
    • OpenBLAS for a performance boost

     

    Step 1

    Install all dependencies, including tools (where needed), FFmpeg, the associated FFmpeg libraries, and OpenBLAS:

    sudo apt update
    sudo apt install -y git cmake build-essential pkg-config ffmpeg libavcodec-dev libavformat-dev libavutil-dev libopenblas-dev

     

    Step 2

    Build Whisper from scratch with FFmpeg and OpenBLAS support. 
    The final command makes the Whisper Command Line Interface (CLI) executable by adding it to a directory included in the PATH environment variable.

    cd /opt
    sudo git clone https://github.com/ggml-org/whisper.cpp.git
    cd whisper.cpp
    sudo cmake -B build -DWHISPER_FFMPEG=yes -DGGML_BLAS=1 -DCMAKE_BUILD_TYPE=Release
    sudo cmake --build build -j 4
    sudo cp -f ./build/bin/whisper-cli /usr/local/bin/whisper-cli

     

    Step 3

    Create a directory to store the STT models and install an STT model. 

    You can find an overview of available models on the Hugging Face whisper.cpp page (ggerganov/whisper.cpp). Optionally replace ggml-base.en-q5_1.bin with the name of the model you want; tiny is faster than base, and small is more accurate.

    sudo mkdir -p /opt/whisper-models
    cd /opt/whisper-models
    sudo wget -O ggml-base.en-q5_1.bin \
      "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en-q5_1.bin"
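    If you want to script the model choice, the quantised English models in this guide follow a simple naming pattern. The helper below is hypothetical (not an OpenClaw or whisper.cpp command), and not every size is published in every quantisation, so check the Hugging Face listing before downloading:

    ```shell
    # model_file: map a size choice (tiny/base/small) to the quantised
    # English model filename pattern used in this guide.
    model_file() {
      printf 'ggml-%s.en-q5_1.bin\n' "$1"
    }

    model_file tiny   # -> ggml-tiny.en-q5_1.bin
    ```

    For example, `wget -O "$(model_file small)" "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/$(model_file small)"` fetches the small English model.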

     

    Step 4

    Open the OpenClaw configuration in openclaw.json:

    nano ~/.openclaw/openclaw.json

    Look for the “tools” section. You’ll typically already see two “web” tools defined there. Below that, add a “media” block as shown in the example below. Replace: 

    • 3 in “-t”, “3” with the number of threads Whisper may use: at least 2, and at most your VPS core count minus 1 (e.g. 3 on a 4-core VPS).
    • Optionally, the directory and model under “command” and “--model” if you changed these yourself in the earlier steps.
      "tools": {
        "web": {
          "search": {
            "enabled": false
          },
          "fetch": {
            "enabled": false
          }
        },
        "media": {
          "audio": {
            "enabled": true,
            "models": [
              {
                "type": "cli",
                "command": "/usr/local/bin/whisper-cli",
                "args": [
                  "--model", "/opt/whisper-models/ggml-base.en-q5_1.bin",
                  "--file", "{{MediaPath}}",
                  "-t", "3"
                ],
                "timeoutSeconds": 20
              }
            ]
          }
        }
      },

    Save the changes and close the file (Ctrl + X > Y > Enter).
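    The thread rule above (core count minus 1, with a minimum of 2) can be sketched as a small shell helper. This is a hypothetical convenience, not an OpenClaw command; `nproc` reports the core count on Linux:

    ```shell
    # threads_for: apply the "-t" rule from this guide to a core count
    # (cores minus 1, clamped to a minimum of 2).
    threads_for() {
      t=$(( $1 - 1 ))
      if [ "$t" -lt 2 ]; then t=2; fi
      echo "$t"
    }

    threads_for "$(nproc)"
    ```

    On a 4-core VPS this prints 3, matching the example in the step above.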


     

    Step 5

    Restart the OpenClaw gateway to apply the changes:

    openclaw gateway restart

    That’s it! You can now send voice messages via your chosen communication channel (e.g. WhatsApp); they’ll be transcribed automatically and quickly, and OpenClaw will reply based on that text.


     

    Cloud model (API key required)

     

    Step 1

    OpenClaw reads environment variables from, among other places, ~/.openclaw/.env (and does not overwrite existing values). Create or open that file:

    nano ~/.openclaw/.env

    In the file you opened, add the API key(s) you want to use, then save your changes and close the file (Ctrl + X > Y > Enter):

    OPENAI_API_KEY="sk-..."
    GROQ_API_KEY="gsk_..."
    DEEPGRAM_API_KEY="dg_..." 
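    If you manage several keys, a quick sanity check can confirm the ones you expect are actually present in the file. The helper below is hypothetical (not part of OpenClaw):

    ```shell
    # check_keys: report any expected KEY=... line missing from an env file.
    check_keys() {
      file="$1"; shift
      for key in "$@"; do
        grep -q "^${key}=" "$file" || echo "missing: $key"
      done
    }

    # Demo against a throwaway file:
    printf 'OPENAI_API_KEY="sk-..."\n' > /tmp/env-demo
    check_keys /tmp/env-demo OPENAI_API_KEY GROQ_API_KEY   # -> missing: GROQ_API_KEY
    ```

    To check your real file: check_keys ~/.openclaw/.env OPENAI_API_KEY GROQ_API_KEY DEEPGRAM_API_KEY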

    OpenAI tip: you can also use the onboarding wizard; it asks for the OPENAI_API_KEY and automates this process, but it stores the key in ~/.openclaw/openclaw.json rather than .env. 


     

    Step 2 — switch tools.media.audio to provider model(s) 

    Open ~/.openclaw/openclaw.json and replace your current media.audio.models (CLI) with one or more provider entries. Then save your changes and close the file (Ctrl + X > Y > Enter). 

    nano ~/.openclaw/openclaw.json 

    Option A — OpenAI 

    OpenClaw’s OpenAI implementation supports:

    • default model: gpt-4o-mini-transcribe (fast)
    • a more accurate but slightly slower alternative: gpt-4o-transcribe 
    {
      "tools": {
        "media": {
          "audio": {
            "enabled": true,
            "models": [
              { "provider": "openai", "model": "gpt-4o-mini-transcribe" }
            ]
          }
        }
      }
    }

    Option B — Groq (often ‘snappy’ for short voice notes) 

    Available Groq STT models: 

    • whisper-large-v3-turbo (faster/cheaper) 
    • whisper-large-v3 (slightly more accurate)
    {
      "tools": {
        "media": {
          "audio": {
            "enabled": true,
            "models": [
              { "provider": "groq", "model": "whisper-large-v3-turbo" }
            ]
          }
        }
      }
    }
    

    Option C — Deepgram (simple + stable)

    Available Deepgram STT models: 

    • flux-general-en
    • nova-3
    • nova-2
    {
      "tools": {
        "media": {
          "audio": {
            "enabled": true,
            "models": [
              { "provider": "deepgram", "model": "nova-3" }
            ]
          }
        }
      }
    }
    

     

    Step 3 (optional)

    You can list multiple models in your configuration. OpenClaw tries the first model first; if it fails, times out, or the file is too large, OpenClaw moves on to the next one, e.g. Groq → OpenAI → Deepgram → local CLI fallback:

    {
      "tools": {
        "media": {
          "audio": {
            "enabled": true,
            "maxBytes": 20971520,
            "models": [
              { "provider": "groq", "model": "whisper-large-v3-turbo" },
              { "provider": "openai", "model": "gpt-4o-mini-transcribe" },
              { "provider": "deepgram", "model": "nova-3" },
              {
                "type": "cli",
                "command": "/usr/local/bin/whisper-cli",
                "args": ["--model", "/opt/whisper-models/ggml-base.en-q5_1.bin", "--file", "{{MediaPath}}", "-t", "3"],
                "timeoutSeconds": 20
              }
            ]
          }
        }
      }
    }
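    For reference, the maxBytes field in the example above sets the upper size limit (in bytes) before a model is skipped; the value shown corresponds to 20 MiB:

    ```shell
    # 20 MiB expressed in bytes, matching "maxBytes" in the example:
    echo $(( 20 * 1024 * 1024 ))   # -> 20971520
    ```

    Adjust it to taste, e.g. $(( 10 * 1024 * 1024 )) for a 10 MiB cap.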

     

    Step 4

    Restart the OpenClaw gateway to apply the changes:

    openclaw gateway restart

     

    Configuring OpenClaw STT via skills

     

    Configuring OpenClaw STT is very straightforward. However, there are a few caveats when using skills:

    • Skills are slower than tools—in this case, the gateway voice-note handling tool.
    • Skills give the agent the option to run a CLI tool, but the agent (i.e. the LLM) decides whether or not to use that skill. This isn’t a major issue for STT, but it is for TTS. 

    To configure STT via a skill, click ‘Skills’ in the left-hand menu of the web dashboard. 

    Then scroll down and, depending on your preference, click Enable next to openai-whisper (hosted locally on your VPS) or openai-whisper-api (hosted by OpenAI). In the latter case, also enter your API key and click Save key.

    That’s it! You can now send voice messages via your chosen communication channel (e.g. WhatsApp); they’ll automatically be converted to text, after which OpenClaw will reply based on that text.

