
    Using speech-to-text with OpenClaw

    OpenClaw, previously known as ClawBot and MoltBot, is a personal AI agent that you run on your own infrastructure. You control OpenClaw via a chat app of your choice, such as WhatsApp, and you can also use speech-to-text (STT) and/or text-to-speech (TTS). This means you don’t have to type messages—you can speak to OpenClaw instead. 

    In this guide, we explain how to use speech-to-text with OpenClaw on Ubuntu/Debian, what options are available, and the main trade-offs and benefits.

    • Before you start this guide, make sure you have a VPS/computer/laptop with OpenClaw, and that you’ve completed the onboarding.
       
    • Are you using a local model and want the fastest performance and highest accuracy? Then we recommend 4 CPU cores, but even with 2 CPU cores, STT performs surprisingly well.
     

     

    Which STT options does OpenClaw offer?

     

    OpenClaw essentially has two options for STT: 

    • Gateway voice-note handling (automatic STT): This uses tools.media.audio in your OpenClaw configuration (this is the best option for ‘natural conversation’). OpenClaw automatically chooses the first available provider, unless you’ve configured a specific order: 
      Local CLIs (e.g. local Whisper) → Gemini → OpenAI → Groq → Whisper.cpp CLI → sherpa-onnx, etc.
       
    • Agent skills, namely openai-whisper / openai-whisper-api. These skills teach the agent to run a CLI tool to transcribe an audio file when it decides to use the relevant skill.

     

    STT speed vs accuracy

    Option                                               | Speed        | Accuracy
    OpenAI gpt-4o-mini-transcribe                        | ⭐⭐⭐⭐     | ⭐⭐⭐⭐
    OpenAI gpt-4o-transcribe                             | ⭐⭐⭐       | ⭐⭐⭐⭐⭐
    OpenAI whisper-1                                     | ⭐⭐⭐       | ⭐⭐⭐
    Groq whisper-large-v3-turbo                          | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐
    Groq whisper-large-v3                                | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐⭐
    Gemini CLI (gemini), depends on the LLM              | ⭐→⭐⭐⭐⭐  | ⭐→⭐⭐⭐⭐
    Local Whisper.cpp (tiny→large), depends on the model | ⭐⭐⭐⭐⭐→⭐ | ⭐→⭐⭐⭐⭐⭐
    • For the fastest and best overall option, we recommend using gateway voice-note handling and not using skills.
       
    • Do you speak English well? Then we recommend the local model with gateway voice-note handling; even without a GPU and with just 2 CPU cores, this is a remarkably fast and accurate option.
     

     

    The differences between the OpenAI-Whisper and OpenAI-Whisper-API skills

     

    In both cases, the underlying technology is OpenAI’s Whisper STT. However, there are a few important differences:

    openai-whisper                           | openai-whisper-api
    Free                                     | Paid
    No API key required                      | Requires an API key from OpenAI
    Performance depends on your own hardware | Excellent performance

     

    Configuring gateway voice-note handling STT

     

    Local model (no API)

     

    In this section, you’ll install gateway voice-note handling for STT using a number of components:

    • whisper.cpp: a C++ implementation of Whisper that runs considerably faster on CPU than the reference Python implementation
    • The Whisper CLI tool
    • A quantised STT model, balancing performance and accuracy
    • FFmpeg for compatibility with different types of audio
    • OpenBLAS for a performance boost

     

    Step 1

    Install all dependencies, including tools (where needed), FFmpeg, the associated FFmpeg libraries, and OpenBLAS:

    sudo apt update
    sudo apt install -y git cmake build-essential pkg-config ffmpeg libavcodec-dev libavformat-dev libavutil-dev libopenblas-dev

     

    Step 2

    Build Whisper from scratch with FFmpeg and OpenBLAS support. 
    The final command makes the Whisper Command Line Interface (CLI) executable by adding it to a directory included in the PATH environment variable.

    cd /opt
    sudo git clone https://github.com/ggml-org/whisper.cpp.git
    cd whisper.cpp
    sudo cmake -B build -DWHISPER_FFMPEG=yes -DGGML_BLAS=1 -DCMAKE_BUILD_TYPE=Release
    sudo cmake --build build -j 4
    sudo cp -f ./build/bin/whisper-cli /usr/local/bin/whisper-cli

     

    Step 3

    Create a directory to store the STT models and install an STT model. 

    You can find an overview of available models on the Hugging Face whisper.cpp page (ggerganov/whisper.cpp). Optionally replace ggml-base.en-q5_1.bin with the name of the model you want; tiny is faster than base, and small is more accurate.

    sudo mkdir -p /opt/whisper-models
    cd /opt/whisper-models
    sudo wget -O ggml-base.en-q5_1.bin \
      "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en-q5_1.bin"
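    If you want to script the model choice, the quantised English models in this guide follow a simple naming pattern. The helper below is hypothetical (not an OpenClaw or whisper.cpp command), and not every size is published in every quantisation, so check the Hugging Face listing before downloading:

    ```shell
    # model_file: map a size choice (tiny/base/small) to the quantised
    # English model filename pattern used in this guide.
    model_file() {
      printf 'ggml-%s.en-q5_1.bin\n' "$1"
    }

    model_file tiny   # -> ggml-tiny.en-q5_1.bin
    ```

    For example, `wget -O "$(model_file small)" "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/$(model_file small)"` fetches the small English model.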

     

    Step 4

    Open the OpenClaw configuration in openclaw.json:

    nano ~/.openclaw/openclaw.json

    Look for the “tools” section. You’ll typically already see two “web” tools defined there. Below that, add a “media” block as shown in the example below. Replace: 

    • 3 in “-t”, “3” with the number of threads Whisper may use: at least 2, and at most your VPS core count minus 1 (e.g. 3 on a 4-core VPS).
    • Optionally, the directory and model under “command” and “--model” if you changed these yourself in the earlier steps.
      "tools": {
        "web": {
          "search": {
            "enabled": false
          },
          "fetch": {
            "enabled": false
          }
        },
        "media": {
          "audio": {
            "enabled": true,
            "models": [
              {
                "type": "cli",
                "command": "/usr/local/bin/whisper-cli",
                "args": [
                  "--model", "/opt/whisper-models/ggml-base.en-q5_1.bin",
                  "--file", "{{MediaPath}}",
                  "-t", "3"
                ],
                "timeoutSeconds": 20
              }
            ]
          }
        }
      },

    Save the changes and close the file (Ctrl + X > Y > Enter).
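    The thread rule above (core count minus 1, with a minimum of 2) can be sketched as a small shell helper. This is a hypothetical convenience, not an OpenClaw command; `nproc` reports the core count on Linux:

    ```shell
    # threads_for: apply the "-t" rule from this guide to a core count
    # (cores minus 1, clamped to a minimum of 2).
    threads_for() {
      t=$(( $1 - 1 ))
      if [ "$t" -lt 2 ]; then t=2; fi
      echo "$t"
    }

    threads_for "$(nproc)"
    ```

    On a 4-core VPS this prints 3, matching the example in the step above.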


     

    Step 5

    Restart the OpenClaw gateway to apply the changes:

    openclaw gateway restart

    That’s it! You can now send voice messages via your chosen communication channel (e.g. WhatsApp); they’ll be transcribed automatically and quickly, and OpenClaw will reply based on that text.


     

    Cloud model (API key required)

     

    Step 1

    OpenClaw reads environment variables from, among other places, ~/.openclaw/.env (and does not overwrite existing values). Create or open that file:

    nano ~/.openclaw/.env

    In the file you opened, add the API key(s) you want to use, then save your changes and close the file (Ctrl + X > Y > Enter):

    OPENAI_API_KEY="sk-..."
    GROQ_API_KEY="gsk_..."
    DEEPGRAM_API_KEY="dg_..." 
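    If you manage several keys, a quick sanity check can confirm the ones you expect are actually present in the file. The helper below is hypothetical (not part of OpenClaw):

    ```shell
    # check_keys: report any expected KEY=... line missing from an env file.
    check_keys() {
      file="$1"; shift
      for key in "$@"; do
        grep -q "^${key}=" "$file" || echo "missing: $key"
      done
    }

    # Demo against a throwaway file:
    printf 'OPENAI_API_KEY="sk-..."\n' > /tmp/env-demo
    check_keys /tmp/env-demo OPENAI_API_KEY GROQ_API_KEY   # -> missing: GROQ_API_KEY
    ```

    To check your real file: check_keys ~/.openclaw/.env OPENAI_API_KEY GROQ_API_KEY DEEPGRAM_API_KEY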

    OpenAI tip: you can also use the onboarding wizard; it asks for the OPENAI_API_KEY and automates this process, but it stores the key in ~/.openclaw/openclaw.json rather than .env. 


     

    Step 2 — switch tools.media.audio to provider model(s) 

    Open ~/.openclaw/openclaw.json and replace your current media.audio.models (CLI) with one or more provider entries. Then save your changes and close the file (Ctrl + X > Y > Enter). 

    nano ~/.openclaw/openclaw.json 

    Option A — OpenAI 

    OpenClaw’s OpenAI implementation supports:

    • default model: gpt-4o-mini-transcribe (fast)
    • a more accurate but slightly slower alternative: gpt-4o-transcribe 
    {
      "tools": {
        "media": {
          "audio": {
            "enabled": true,
            "models": [
              { "provider": "openai", "model": "gpt-4o-mini-transcribe" }
            ]
          }
        }
      }
    }

    Option B — Groq (often ‘snappy’ for short voice notes) 

    Available Groq STT models: 

    • whisper-large-v3-turbo (faster/cheaper) 
    • whisper-large-v3 (slightly more accurate)
    {
      "tools": {
        "media": {
          "audio": {
            "enabled": true,
            "models": [
              { "provider": "groq", "model": "whisper-large-v3-turbo" }
            ]
          }
        }
      }
    }
    

    Option C — Deepgram (simple + stable)

    Available Deepgram STT models: 

    • flux-general-en
    • nova-3
    • nova-2
    {
      "tools": {
        "media": {
          "audio": {
            "enabled": true,
            "models": [
              { "provider": "deepgram", "model": "nova-3" }
            ]
          }
        }
      }
    }
    

     

    Step 3 (optional)

    You can list multiple models in your configuration. OpenClaw tries the first model first; if it fails, times out, or the file is too large, OpenClaw moves on to the next one, e.g. Groq → OpenAI → Deepgram → local CLI fallback:

    {
      "tools": {
        "media": {
          "audio": {
            "enabled": true,
            "maxBytes": 20971520,
            "models": [
              { "provider": "groq", "model": "whisper-large-v3-turbo" },
              { "provider": "openai", "model": "gpt-4o-mini-transcribe" },
              { "provider": "deepgram", "model": "nova-3" },
              {
                "type": "cli",
                "command": "/usr/local/bin/whisper-cli",
                "args": ["--model", "/opt/whisper-models/ggml-base.en-q5_1.bin", "--file", "{{MediaPath}}", "-t", "3"],
                "timeoutSeconds": 20
              }
            ]
          }
        }
      }
    }
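    For reference, the maxBytes field in the example above sets the upper size limit (in bytes) before a model is skipped; the value shown corresponds to 20 MiB:

    ```shell
    # 20 MiB expressed in bytes, matching "maxBytes" in the example:
    echo $(( 20 * 1024 * 1024 ))   # -> 20971520
    ```

    Adjust it to taste, e.g. $(( 10 * 1024 * 1024 )) for a 10 MiB cap.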

     

    Step 4

    Restart the OpenClaw gateway to apply the changes:

    openclaw gateway restart

     

    Configuring OpenClaw STT via skills

     

    Configuring OpenClaw STT is very straightforward. However, there are a few caveats when using skills:

    • Skills are slower than tools—in this case, the gateway voice-note handling tool.
    • Skills give the agent the option to run a CLI tool, but the agent (i.e. the LLM) decides whether or not to use that skill. This isn’t a major issue for STT, but it is for TTS. 

    To configure STT via a skill, click ‘Skills’ in the left-hand menu of the web dashboard. 

    Then scroll down and, depending on your preference, click Enable next to openai-whisper (hosted locally on your VPS) or openai-whisper-api (hosted by OpenAI). In the latter case, also enter your API key and click Save key.

    That’s it! You can now send voice messages via your chosen communication channel (e.g. WhatsApp); they’ll automatically be converted to text, after which OpenClaw will reply based on that text.

