
    Using text-to-speech with OpenClaw

    OpenClaw, previously known as ClawBot and MoltBot, is a personal AI agent that you run on your own infrastructure. You control OpenClaw via a chat app of your choice, such as WhatsApp, and you can also use speech-to-text (STT) and/or text-to-speech (TTS). This means you don’t have to type messages—you can speak to OpenClaw instead.

    In this guide, we explain how to use text-to-speech with OpenClaw on Ubuntu/Debian, which options are available, and the main trade-offs and benefits of each choice.

    • Before you start this guide, make sure you have a VPS/computer/laptop with OpenClaw, and that you’ve completed the onboarding.
       
    • If you're using a local model and want the fastest performance and best quality, we recommend 4 CPU cores; even with 2 CPU cores, local speech processing performs well.
     

     

    Which TTS options does OpenClaw offer?

     

    OpenClaw offers a number of TTS options: 

    • The built-in TTS tool (recommended): Supports OpenAI, ElevenLabs and Edge (free).
       
    • SAG: A skill connected to ElevenLabs. You’ll need an ElevenLabs API key for this option.
       
    • sherpa-onnx-tts: Local TTS that requires some additional command-line configuration. The quality depends on the model you choose (see the lessac variants further on in this guide).
       
    • espeak-ng (unofficial): The fastest option, running locally on your VPS. It sounds much more robotic and is configured via the command line.

     

    TTS speed vs naturalness

    Option | Speed | Naturalness
    OpenAI gpt-4o-mini-tts | ⭐⭐⭐⭐ | ⭐⭐⭐⭐½
    OpenAI tts-1 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐
    OpenAI tts-1-hd | ⭐⭐⭐ | ⭐⭐⭐⭐⭐
    Edge (local) | ⭐⭐⭐⭐½ | ⭐⭐⭐
    ElevenLabs | ⭐⭐⭐ | ⭐⭐⭐⭐⭐
    sherpa-onnx-tts | ⭐⭐⭐⭐⭐ | ⭐⭐⭐→⭐⭐⭐⭐ (depending on the model)
    espeak-ng | ⭐⭐⭐⭐⭐ (very fast, even on CPU) | ⭐→⭐⭐ (if you like robotic voices)

     

    The built-in TTS tool

     

    OpenClaw has a built-in TTS tool that’s easy to work with, because you can manage it in several places:

    • The chat in the web dashboard
    • The OpenClaw command-line TUI (openclaw tui)
    • A communication channel such as WhatsApp. Note that when the tool is enabled, replies via a communication channel will be slower, simply because every response is also converted to speech via TTS.

    The TTS tool does not interfere with sherpa-onnx-tts or espeak-ng: if OpenClaw sends an audio message back via one of those skills, OpenClaw recognises this and does not trigger the TTS tool as well.

     

     

    Adding API key(s)

    First add your API key(s) via the command line as follows (not in a chat conversation):

    echo OPENAI_API_KEY=sk-proj-......... >> ~/.openclaw/.env
    echo ELEVENLABS_API_KEY=sk_.......... >> ~/.openclaw/.env
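    A hedged variant of the commands above: single quotes keep any shell metacharacters in the key intact, and chmod 600 restricts the file to your own user. The key values below are placeholders of ours, not real keys.

    ```shell
    # Append placeholder keys (replace with your real keys) and lock the file down.
    mkdir -p ~/.openclaw
    echo 'OPENAI_API_KEY=sk-proj-REPLACE_ME' >> ~/.openclaw/.env
    echo 'ELEVENLABS_API_KEY=sk_REPLACE_ME' >> ~/.openclaw/.env
    # Make the .env file readable only by your own user.
    chmod 600 ~/.openclaw/.env
    ```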

     

    Simply start a conversation with OpenClaw via one of the options above and use the commands below to manage the TTS tool.

     

    Checking TTS status

    /tts status

    Turning TTS on

    /tts on

    Turning TTS off

    /tts off

    Changing the TTS provider

    Use one of: openai, elevenlabs or edge:

    /tts provider openai

     

    Configuring SAG TTS

     

    The SAG TTS skill is connected to ElevenLabs. ElevenLabs produces the most natural-sounding voices for TTS. 

    Whether OpenClaw uses the SAG skill depends on whether the agent thinks it should use that skill. Whether it reaches that conclusion depends, among other things, on the quality of the LLM used and the instructions you give OpenClaw (editable in ~/.openclaw/workspace/TOOLS.md and SOUL.md). Results are mixed, and if you prefer speed and reliability, we recommend the TTS tool. 

     

    Step 1

    Enable SAG by navigating to ‘Skills’ in the OpenClaw dashboard (1), entering your ElevenLabs API key and saving it (2), and clicking ‘Enable’ (3).


     

    Step 2

    Out of the box, OpenClaw saves the API key in plain text in openclaw.json. For security reasons, we recommend replacing it with an environment-variable reference in openclaw.json:

    nano ~/.openclaw/openclaw.json

    Change the ‘Skills’ section so the SAG part looks as follows:

      "skills": {
        "install": {
          "nodeManager": "npm"
        },
        "entries": {
          "sag": {
            "enabled": true,
            "apiKey": "${SAG_API_KEY}"
          }
        }
      },

    Add your API key to the .env file (replace the example value with your own API key):

    echo SAG_API_KEY=sk_3c34083<redacted>b39d5a91e6c7d3 >> ~/.openclaw/.env
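    After switching to the environment variable, you can double-check that no raw key is left behind in the config file. This is a best-effort sketch; matching on the `sk_` prefix is our assumption about what ElevenLabs keys look like.

    ```shell
    # Look for anything resembling a raw ElevenLabs key in openclaw.json.
    if grep -q '"apiKey": *"sk_' ~/.openclaw/openclaw.json 2>/dev/null; then
      echo "WARNING: a raw key still appears in openclaw.json"
    else
      echo "openclaw.json looks clean"
    fi
    ```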

     

    Sherpa-onnx-tts

     

    How well Sherpa-onnx-tts works depends heavily on the LLM you choose (the more expensive models tend to perform better). If you use this option, you’ll probably need to add an instruction in ~/.openclaw/workspace/SOUL.md to use this skill, and not to call it more than once per run.

     

     

    Step 1

    OpenClaw has a skill called sherpa-onnx-tts that runs locally and doesn’t require a cloud TTS service. 

    First download the sherpa-onnx runtime:

    mkdir -p ~/.openclaw/tools/sherpa-onnx-tts/runtime
    cd ~/.openclaw/tools/sherpa-onnx-tts/runtime
    
    curl -L -o sherpa-onnx-runtime.tar.bz2 \
      https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.24/sherpa-onnx-v1.12.24-linux-x64-shared.tar.bz2
    
    tar -xjf sherpa-onnx-runtime.tar.bz2 --strip-components=1

    The latest version at the time of writing is 1.12.24. You can find an overview of available versions at https://github.com/k2-fsa/sherpa-onnx/releases/.


     

    Step 2

    Download a sherpa-onnx model. In this example, we download lessac-medium (US English). Optionally replace lessac-medium with lessac-low (faster) or lessac-high (better quality). You can find a full overview of available models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models.

    mkdir -p ~/.openclaw/tools/sherpa-onnx-tts/models
    cd ~/.openclaw/tools/sherpa-onnx-tts/models
    
    curl -L -o vits-piper-en_US-lessac-medium.tar.bz2 \
      https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-lessac-medium.tar.bz2
    
    tar -xjf vits-piper-en_US-lessac-medium.tar.bz2
    rm vits-piper-en_US-lessac-medium.tar.bz2
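    Before wiring the skill into OpenClaw, you can smoke-test the runtime and model directly. This is a sketch under two assumptions: that the runtime tarball ships bin/ and lib/ directories, and that the model archive extracts to vits-piper-en_US-lessac-medium/ with the file names shown.

    ```shell
    # Generate a test WAV with the local sherpa-onnx runtime, if both
    # the runtime and the model have been downloaded.
    RUNTIME="$HOME/.openclaw/tools/sherpa-onnx-tts/runtime"
    MODEL="$HOME/.openclaw/tools/sherpa-onnx-tts/models/vits-piper-en_US-lessac-medium"

    if [ -x "$RUNTIME/bin/sherpa-onnx-offline-tts" ] && [ -d "$MODEL" ]; then
      LD_LIBRARY_PATH="$RUNTIME/lib" "$RUNTIME/bin/sherpa-onnx-offline-tts" \
        --vits-model="$MODEL/en_US-lessac-medium.onnx" \
        --vits-tokens="$MODEL/tokens.txt" \
        --vits-data-dir="$MODEL/espeak-ng-data" \
        --output-filename=/tmp/sherpa-test.wav \
        "Hello from OpenClaw" \
      && echo "generated /tmp/sherpa-test.wav"
    else
      echo "runtime or model not found; finish the download steps first"
    fi
    ```

    If the test succeeds, play /tmp/sherpa-test.wav to judge the voice quality before committing to a model size.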

     

    Step 3

    Open openclaw.json to reference the sherpa-onnx runtime and the model you’ve chosen:

    nano ~/.openclaw/openclaw.json
    Scroll to the skills section and make sure sherpa-onnx-tts is included under entries as well, for example:

      "skills": {
        "entries": {
          "sherpa-onnx-tts": {
            "enabled": true
          }
        }
      }

    You’ll often already have some entries here. The full section might then look like this:

      "skills": {
        "install": {
          "nodeManager": "npm"
        },
        "entries": {
          "sherpa-onnx-tts": {
            "enabled": true
          },
          "openai-whisper": {
            "enabled": false
          },
          "openai-whisper-api": {
            "enabled": false
          }
        }
      },

     

    Espeak-ng (experimental)

     

    If you use espeak-ng, it’s important that the LLM you’re using supports tool calling. In particular, some self-hosted models, such as gpt-oss:20b, perform poorly here, so check the documentation for the relevant LLM before you implement this option.

     

     

    Step 1

    Install espeak-ng and the FFmpeg encoder if you don’t already have them.

    sudo apt update
    sudo apt install -y espeak-ng ffmpeg
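    You can confirm both binaries are on your PATH before continuing, for example with a quick check like this:

    ```shell
    # Verify the dependencies the fast-tts script will need.
    for bin in espeak-ng ffmpeg; do
      if command -v "$bin" >/dev/null 2>&1; then
        echo "$bin: installed"
      else
        echo "$bin: missing (re-run the apt install step)"
      fi
    done
    ```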

     

    Step 2

    Create a ‘fast TTS’ script that the OpenClaw skill can use later. For safety, text is converted to audio via environment variables rather than directly via shell arguments.

    Skills in the ~/.openclaw/skills/ folder are available to all agents on your server. 

     
    mkdir -p ~/.openclaw/skills/fast-tts/bin
    nano ~/.openclaw/skills/fast-tts/bin/fast-tts
    chmod +x ~/.openclaw/skills/fast-tts/bin/fast-tts
    In the nano step, paste the following code, save your changes, and close the file (Ctrl + X > Y > Enter).
    #!/usr/bin/env bash
    set -euo pipefail
    
    : "${TTS_TEXT:?Set TTS_TEXT in env}"
    TTS_CHANNEL="${TTS_CHANNEL:-whatsapp}"
    TTS_SPEED="${TTS_SPEED:-185}"
    TTS_VOICE="${TTS_VOICE:-}"
    TTS_APPEND_VOICE_MARKER="${TTS_APPEND_VOICE_MARKER:-0}"
    
    require_bin() {
      if ! command -v "$1" >/dev/null 2>&1; then
        echo "ERROR: missing dependency: $1" >&2
        exit 127
      fi
    }
    
    require_bin espeak-ng
    require_bin ffmpeg
    
    script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd -P)"
    base_dir="$(cd "$script_dir/.." && pwd -P)"
    
    # Prefer OpenClaw's media root so local-path safety checks accept the file.
    default_out="$base_dir/media/outbound"
    if [[ -n "${HOME:-}" ]]; then
      default_out="$HOME/.openclaw/media/outbound"
    fi
    OUT_DIR="${OUT_DIR:-$default_out}"
    mkdir -p "$OUT_DIR"
    
    fname="tts-$(date +%Y%m%d-%H%M%S)-$RANDOM.ogg"
    out="$OUT_DIR/$fname"
    
    wav="$(mktemp "$OUT_DIR/tts-XXXXXX.wav")"
    trap 'rm -f "$wav"' EXIT
    
    espeak_cmd=(espeak-ng -s "$TTS_SPEED" -w "$wav" --stdin)
    if [[ -n "$TTS_VOICE" ]]; then
      espeak_cmd+=(-v "$TTS_VOICE")
    fi
    
    # Read text from stdin so content does not appear in argv/process listings.
    printf '%s\n' "$TTS_TEXT" | "${espeak_cmd[@]}"
    
    ffmpeg -hide_banner -loglevel error -y \
      -i "$wav" -ac 1 -ar 48000 \
      -c:a libopus -b:a 64k -vbr on -compression_level 5 -application voip \
      "$out"
    
    if [[ ! -s "$out" ]]; then
      echo "ERROR: failed to create audio at $out" >&2
      exit 1
    fi
    
    chmod 644 "$out" 2>/dev/null || true
    
    # IMPORTANT: use absolute file URL so the channel worker CWD doesn't matter
    abs="$(cd "$(dirname "$out")" && pwd -P)/$(basename "$out")"
    
    media="MEDIA:file://$abs"
    channel="$(printf '%s' "$TTS_CHANNEL" | tr '[:upper:]' '[:lower:]')"
    if [[ "$channel" == "telegram" && "$TTS_APPEND_VOICE_MARKER" == "1" ]]; then
      # Some OpenClaw builds support this marker; default off for compatibility.
      media+="[[audio_as_voice]]"
    fi
    echo "$media"
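    Once the script is saved and executable, you can test it by hand before involving the agent. This invocation is our sketch; on success the script prints a single MEDIA: line whose exact filename will differ.

    ```shell
    # Manual test of the fast-tts script; paths assume the install steps above.
    FAST_TTS="$HOME/.openclaw/skills/fast-tts/bin/fast-tts"
    if [ -x "$FAST_TTS" ]; then
      TTS_TEXT="Hello from OpenClaw" TTS_CHANNEL=whatsapp \
        OUT_DIR=/tmp/fast-tts-test "$FAST_TTS"
    else
      echo "fast-tts not installed at $FAST_TTS"
    fi
    ```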

     

    Step 3

    Teach your agent how to actually use the skill. Create a new SKILL.md file:

    nano ~/.openclaw/skills/fast-tts/SKILL.md
    Paste the following code into the file, save your changes, and close the file (Ctrl + X > Y > Enter).
    ---
    name: fast-tts
    description: Ultra-fast local TTS (espeak-ng + ffmpeg) for OpenClaw chat channels. Use when you want audio-only replies (prefer no written text) in WhatsApp/Telegram. Includes a minimal plugin skeleton for OpenClaw voice-call telephony TTS with local espeak-ng.
    metadata: {"openclaw":{"requires":{"bins":["espeak-ng","ffmpeg"]}}}
    ---
    
    When the user wants an audio-only reply in chat channels:
    
    1) Put the full reply text into env var `TTS_TEXT` (do NOT put the text on the shell command line).
    2) Call the exec tool with:
       - command: "{baseDir}/bin/fast-tts"
       - env: { "TTS_TEXT": "<your reply text>", "TTS_CHANNEL": "<channel>" }
       - Set `TTS_CHANNEL` from message context. Supported values: `whatsapp`, `telegram`. Defaults to `whatsapp`.
    3) The command prints a single `MEDIA:` line.
    4) Respond with exactly that `MEDIA:` line and nothing else.
    
    Notes:
    - The script emits OGG/Opus at 48kHz mono and appends `[[audio_as_voice]]` automatically for Telegram voice bubbles.
    - Optional env vars: `TTS_SPEED` (default `185`), `TTS_VOICE` (espeak voice id), `OUT_DIR` (output directory).
    - Use `assets/openclaw-espeak-telephony-plugin/` as the minimal starting point for local `espeak-ng` telephony TTS in the OpenClaw voice-call path.
    - Integration notes for the voice-call fork are in `references/voice-call-provider-skeleton.md`.
    - Config guidance for audio-only chat behavior is in `references/openclaw-config-for-audio-only.md`.

     

    Step 4

    Enable the skill in your openclaw.json configuration as well:

    nano ~/.openclaw/openclaw.json
    Scroll to the skills section and make sure fast-tts is included under entries as well, for example:
      "skills": {
        "entries": {
          "fast-tts": { "enabled": true }
        }
      }

    You’ll often already have some entries here. The full section might then look like this:

      "skills": {
        "install": {
          "nodeManager": "npm"
        },
        "entries": {
          "fast-tts": {
            "enabled": true
          },
          "openai-whisper": {
            "enabled": false
          },
          "openai-whisper-api": {
            "enabled": false
          }
        }
      },

     

    Step 5

    Instruct your agent to actually use the skill. Open SOUL.md:

    nano ~/.openclaw/workspace/SOUL.md

    Add a section like the example below (don’t remove any existing content), then save your changes and close the file (Ctrl + X > Y > Enter).

    • Replace <yourusername> with your own username in the paths below.
    ## Telegram/WhatsApp
    
    When replying on Telegram/WhatsApp:
    
    - Call exec exactly once per turn with:
      {"command":"/home/<yourusername>/.openclaw/skills/fast-tts/bin/fast-tts","env":{"TTS_CHANNEL":"telegram","OUT_DIR":"/home/<yourusername>/.openclaw/media/outbound","TTS_APPEND_VOICE_MARKER":"0","TTS_TEXT":"<full reply text>"}}
    - Never put reply text in the command string.
    - Never output "Action:" or "Action Input:" to chat.
    - Assume your first call in a turn is successful; if exec or fast-tts fails once in a turn, do not retry in that same turn.
    - Never use the built-in TTS tool.
    - Final output must be only the MEDIA: line returned by the tool.

     

    Step 6

    Restart your OpenClaw gateway so that skills are reloaded:

    openclaw gateway restart

     

    Potential bug

     

    There is an open report that MEDIA: lines are sometimes displayed as plain text (without an attachment) with certain model/provider combinations. A temporary workaround is to send audio via the message tool instead. If you run into this issue, also try updating OpenClaw with openclaw update.
