
    Using text-to-speech with OpenClaw

    OpenClaw, previously known as ClawBot and MoltBot, is a personal AI agent that you run on your own infrastructure. You control OpenClaw via a chat app of your choice, such as WhatsApp, and you can also use speech-to-text (STT) and/or text-to-speech (TTS). This means you don’t have to type messages—you can speak to OpenClaw instead.

    In this guide, we explain how to use text-to-speech with OpenClaw on Ubuntu/Debian, which options are available, and the main trade-offs and benefits of each choice.

    • Before you start this guide, make sure you have a VPS/computer/laptop with OpenClaw, and that you’ve completed the onboarding.
       
    • If you're using a local model and want the fastest performance and best quality, we recommend 4 CPU cores; even with 2 CPU cores, local speech processing performs well.
     

     

    Which TTS options does OpenClaw offer?

     

    OpenClaw offers a number of TTS options: 

    • The built-in TTS tool (recommended): Supports OpenAI, ElevenLabs and Edge (free).
       
    • SAG: A skill connected to ElevenLabs. You’ll need an ElevenLabs API key for this option.
       
    • sherpa-onnx-tts: Local TTS that requires some additional command-line configuration. The quality depends on the model you choose (see the lessac variants further on in this guide).
       
    • espeak-ng (unofficial): The fastest option, running locally on your VPS. It sounds much more robotic and is configured via the command line.

     

    TTS speed vs naturalness

    Option | Speed | Naturalness
    OpenAI gpt-4o-mini-tts | ⭐⭐⭐⭐ | ⭐⭐⭐⭐½
    OpenAI tts-1 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐
    OpenAI tts-1-hd | ⭐⭐⭐ | ⭐⭐⭐⭐⭐
    Edge (local) | ⭐⭐⭐⭐½ | ⭐⭐⭐
    ElevenLabs | ⭐⭐⭐ | ⭐⭐⭐⭐⭐
    sherpa-onnx-tts | ⭐⭐⭐⭐⭐ | ⭐⭐⭐→⭐⭐⭐⭐ (depending on the model)
    espeak-ng | ⭐⭐⭐⭐⭐ (very fast, even on CPU) | ⭐→⭐⭐ (if you like robotic voices)

     

    The built-in TTS tool

     

    OpenClaw has a built-in TTS tool that’s easy to work with, because you can manage it in several places:

    • The chat in the web dashboard
    • The OpenClaw command-line TUI (openclaw tui)
    • A communication channel such as WhatsApp. Note that when the tool is enabled, replies via a communication channel will be slower, simply because every response is also converted to speech via TTS.

    The TTS tool does not interfere with sherpa-onnx-tts or espeak-ng: if OpenClaw sends an audio message back via one of those skills, OpenClaw recognises this and does not trigger the TTS tool as well.

     

     

    Adding API key(s)

    First add your API key(s) via the command line as follows (not in a chat conversation):

    echo OPENAI_API_KEY=sk-proj-......... >> ~/.openclaw/.env
    echo ELEVENLABS_API_KEY=sk_.......... >> ~/.openclaw/.env
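    A hedged variant of the commands above: single quotes keep any shell metacharacters in the key intact, and chmod 600 restricts the file to your own user. The key values below are placeholders of ours, not real keys.

    ```shell
    # Append placeholder keys (replace with your real keys) and lock the file down.
    mkdir -p ~/.openclaw
    echo 'OPENAI_API_KEY=sk-proj-REPLACE_ME' >> ~/.openclaw/.env
    echo 'ELEVENLABS_API_KEY=sk_REPLACE_ME' >> ~/.openclaw/.env
    # Make the .env file readable only by your own user.
    chmod 600 ~/.openclaw/.env
    ```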

     

    Simply start a conversation with OpenClaw via one of the options above and use the commands below to manage the TTS tool.

     

    Checking TTS status

    /tts status

    Turning TTS on

    /tts on

    Turning TTS off

    /tts off

    Changing the TTS provider

    Use one of: openai, elevenlabs or edge:

    /tts provider openai

     

    Configuring SAG TTS

     

    The SAG TTS skill is connected to ElevenLabs. ElevenLabs produces the most natural-sounding voices for TTS. 

    Whether OpenClaw uses the SAG skill depends on whether the agent thinks it should use that skill. Whether it reaches that conclusion depends, among other things, on the quality of the LLM used and the instructions you give OpenClaw (editable in ~/.openclaw/workspace/TOOLS.md and SOUL.md). Results are mixed, and if you prefer speed and reliability, we recommend the TTS tool. 

     

    Step 1

    Enable SAG by navigating to ‘Skills’ in the OpenClaw dashboard (1), entering your ElevenLabs API key and saving it (2), and clicking ‘Enable’ (3).


     

    Step 2

    Out of the box, OpenClaw saves the API key in plain text in openclaw.json. For security reasons, we recommend replacing it with an environment-variable reference in openclaw.json:

    nano ~/.openclaw/openclaw.json

    Change the ‘Skills’ section so the SAG part looks as follows:

      "skills": {
        "install": {
          "nodeManager": "npm"
        },
        "entries": {
          "sag": {
            "enabled": true,
            "apiKey": "${SAG_API_KEY}"
          }
        }
      },

    Add your API key to the .env file (replace the example value with your own API key):

    echo SAG_API_KEY=sk_3c34083<redacted>b39d5a91e6c7d3 >> ~/.openclaw/.env
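    After switching to the environment variable, you can double-check that no raw key is left behind in the config file. This is a best-effort sketch; matching on the `sk_` prefix is our assumption about what ElevenLabs keys look like.

    ```shell
    # Look for anything resembling a raw ElevenLabs key in openclaw.json.
    if grep -q '"apiKey": *"sk_' ~/.openclaw/openclaw.json 2>/dev/null; then
      echo "WARNING: a raw key still appears in openclaw.json"
    else
      echo "openclaw.json looks clean"
    fi
    ```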

     

    Sherpa-onnx-tts

     

    How well Sherpa-onnx-tts works depends heavily on the LLM you choose (the more expensive models tend to perform better). If you use this option, you’ll probably need to add an instruction in ~/.openclaw/workspace/SOUL.md to use this skill, and not to call it more than once per run.

     

     

    Step 1

    OpenClaw has a skill called sherpa-onnx-tts that runs locally and doesn’t require a cloud TTS service. 

    First download the sherpa-onnx runtime:

    mkdir -p ~/.openclaw/tools/sherpa-onnx-tts/runtime
    cd ~/.openclaw/tools/sherpa-onnx-tts/runtime
    
    curl -L -o sherpa-onnx-runtime.tar.bz2 \
      https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.24/sherpa-onnx-v1.12.24-linux-x64-shared.tar.bz2
    
    tar -xjf sherpa-onnx-runtime.tar.bz2 --strip-components=1

    The latest version at the time of writing is 1.12.24. You can find an overview of available versions at https://github.com/k2-fsa/sherpa-onnx/releases/.


     

    Step 2

    Download a sherpa-onnx model. In this example, we download lessac-medium (US English). Optionally replace lessac-medium with lessac-low (faster) or lessac-high (better quality). You can find a full overview of available models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models.

    mkdir -p ~/.openclaw/tools/sherpa-onnx-tts/models
    cd ~/.openclaw/tools/sherpa-onnx-tts/models
    
    curl -L -o vits-piper-en_US-lessac-medium.tar.bz2 \
      https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-lessac-medium.tar.bz2
    
    tar -xjf vits-piper-en_US-lessac-medium.tar.bz2
    rm vits-piper-en_US-lessac-medium.tar.bz2
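    Before wiring the skill into OpenClaw, you can smoke-test the runtime and model directly. This is a sketch under two assumptions: that the runtime tarball ships bin/ and lib/ directories, and that the model archive extracts to vits-piper-en_US-lessac-medium/ with the file names shown.

    ```shell
    # Generate a test WAV with the local sherpa-onnx runtime, if both
    # the runtime and the model have been downloaded.
    RUNTIME="$HOME/.openclaw/tools/sherpa-onnx-tts/runtime"
    MODEL="$HOME/.openclaw/tools/sherpa-onnx-tts/models/vits-piper-en_US-lessac-medium"

    if [ -x "$RUNTIME/bin/sherpa-onnx-offline-tts" ] && [ -d "$MODEL" ]; then
      LD_LIBRARY_PATH="$RUNTIME/lib" "$RUNTIME/bin/sherpa-onnx-offline-tts" \
        --vits-model="$MODEL/en_US-lessac-medium.onnx" \
        --vits-tokens="$MODEL/tokens.txt" \
        --vits-data-dir="$MODEL/espeak-ng-data" \
        --output-filename=/tmp/sherpa-test.wav \
        "Hello from OpenClaw" \
      && echo "generated /tmp/sherpa-test.wav"
    else
      echo "runtime or model not found; finish the download steps first"
    fi
    ```

    If the test succeeds, play /tmp/sherpa-test.wav to judge the voice quality before committing to a model size.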

     

    Step 3

    Open openclaw.json to reference the sherpa-onnx runtime and the model you’ve chosen:

    nano ~/.openclaw/openclaw.json
    Scroll to the skills section and make sure sherpa-onnx-tts is included under entries as well, for example:

      "skills": {
        "entries": {
          "sherpa-onnx-tts": {
            "enabled": true
          }
        }
      }

    You’ll often already have some entries here. The full section might then look like this:

      "skills": {
        "install": {
          "nodeManager": "npm"
        },
        "entries": {
          "sherpa-onnx-tts": {
            "enabled": true
          },
          "openai-whisper": {
            "enabled": false
          },
          "openai-whisper-api": {
            "enabled": false
          }
        }
      },

     

    Espeak-ng (experimental)

     

    If you use espeak-ng, it’s important that the LLM you’re using supports tool calling. In particular, some self-hosted models, such as gpt-oss:20b, perform poorly here, so check the documentation for the relevant LLM before you implement this option.

     

     

    Step 1

    Install espeak-ng and the FFmpeg encoder if you don’t already have them.

    sudo apt update
    sudo apt install -y espeak-ng ffmpeg
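    You can confirm both binaries are on your PATH before continuing, for example with a quick check like this:

    ```shell
    # Verify the dependencies the fast-tts script will need.
    for bin in espeak-ng ffmpeg; do
      if command -v "$bin" >/dev/null 2>&1; then
        echo "$bin: installed"
      else
        echo "$bin: missing (re-run the apt install step)"
      fi
    done
    ```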

     

    Step 2

    Create a ‘fast TTS’ script that the OpenClaw skill can use later. For safety, text is converted to audio via environment variables rather than directly via shell arguments.

    Skills in the ~/.openclaw/skills/ folder are available to all agents on your server. 

     
    mkdir -p ~/.openclaw/skills/fast-tts/bin
    nano ~/.openclaw/skills/fast-tts/bin/fast-tts
    chmod +x ~/.openclaw/skills/fast-tts/bin/fast-tts
    In the nano step, paste the following code, save your changes, and close the file (Ctrl + X > Y > Enter).
    #!/usr/bin/env bash
    set -euo pipefail
    
    : "${TTS_TEXT:?Set TTS_TEXT in env}"
    TTS_CHANNEL="${TTS_CHANNEL:-whatsapp}"
    TTS_SPEED="${TTS_SPEED:-185}"
    TTS_VOICE="${TTS_VOICE:-}"
    TTS_APPEND_VOICE_MARKER="${TTS_APPEND_VOICE_MARKER:-0}"
    
    require_bin() {
      if ! command -v "$1" >/dev/null 2>&1; then
        echo "ERROR: missing dependency: $1" >&2
        exit 127
      fi
    }
    
    require_bin espeak-ng
    require_bin ffmpeg
    
    script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd -P)"
    base_dir="$(cd "$script_dir/.." && pwd -P)"
    
    # Prefer OpenClaw's media root so local-path safety checks accept the file.
    default_out="$base_dir/media/outbound"
    if [[ -n "${HOME:-}" ]]; then
      default_out="$HOME/.openclaw/media/outbound"
    fi
    OUT_DIR="${OUT_DIR:-$default_out}"
    mkdir -p "$OUT_DIR"
    
    fname="tts-$(date +%Y%m%d-%H%M%S)-$RANDOM.ogg"
    out="$OUT_DIR/$fname"
    
    wav="$(mktemp "$OUT_DIR/tts-XXXXXX.wav")"
    trap 'rm -f "$wav"' EXIT
    
    espeak_cmd=(espeak-ng -s "$TTS_SPEED" -w "$wav" --stdin)
    if [[ -n "$TTS_VOICE" ]]; then
      espeak_cmd+=(-v "$TTS_VOICE")
    fi
    
    # Read text from stdin so content does not appear in argv/process listings.
    printf '%s\n' "$TTS_TEXT" | "${espeak_cmd[@]}"
    
    ffmpeg -hide_banner -loglevel error -y \
      -i "$wav" -ac 1 -ar 48000 \
      -c:a libopus -b:a 64k -vbr on -compression_level 5 -application voip \
      "$out"
    
    if [[ ! -s "$out" ]]; then
      echo "ERROR: failed to create audio at $out" >&2
      exit 1
    fi
    
    chmod 644 "$out" 2>/dev/null || true
    
    # IMPORTANT: use absolute file URL so the channel worker CWD doesn't matter
    abs="$(cd "$(dirname "$out")" && pwd -P)/$(basename "$out")"
    
    media="MEDIA:file://$abs"
    channel="$(printf '%s' "$TTS_CHANNEL" | tr '[:upper:]' '[:lower:]')"
    if [[ "$channel" == "telegram" && "$TTS_APPEND_VOICE_MARKER" == "1" ]]; then
      # Some OpenClaw builds support this marker; default off for compatibility.
      media+="[[audio_as_voice]]"
    fi
    echo "$media"
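    Once the script is saved and executable, you can test it by hand before involving the agent. This invocation is our sketch; on success the script prints a single MEDIA: line whose exact filename will differ.

    ```shell
    # Manual test of the fast-tts script; paths assume the install steps above.
    FAST_TTS="$HOME/.openclaw/skills/fast-tts/bin/fast-tts"
    if [ -x "$FAST_TTS" ]; then
      TTS_TEXT="Hello from OpenClaw" TTS_CHANNEL=whatsapp \
        OUT_DIR=/tmp/fast-tts-test "$FAST_TTS"
    else
      echo "fast-tts not installed at $FAST_TTS"
    fi
    ```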

     

    Step 3

    Teach your agent how to actually use the skill. Create a new SKILL.md file:

    nano ~/.openclaw/skills/fast-tts/SKILL.md
    Paste the following code into the file, save your changes, and close the file (Ctrl + X > Y > Enter).
    ---
    name: fast-tts
    description: Ultra-fast local TTS (espeak-ng + ffmpeg) for OpenClaw chat channels. Use when you want audio-only replies (prefer no written text) in WhatsApp/Telegram. Includes a minimal plugin skeleton for OpenClaw voice-call telephony TTS with local espeak-ng.
    metadata: {"openclaw":{"requires":{"bins":["espeak-ng","ffmpeg"]}}}
    ---
    
    When the user wants an audio-only reply in chat channels:
    
    1) Put the full reply text into env var `TTS_TEXT` (do NOT put the text on the shell command line).
    2) Call the exec tool with:
       - command: "{baseDir}/bin/fast-tts"
       - env: { "TTS_TEXT": "<your reply text>", "TTS_CHANNEL": "<channel>" }
       - Set `TTS_CHANNEL` from message context. Supported values: `whatsapp`, `telegram`. Defaults to `whatsapp`.
    3) The command prints a single `MEDIA:` line.
    4) Respond with exactly that `MEDIA:` line and nothing else.
    
    Notes:
    - The script emits OGG/Opus at 48kHz mono and appends `[[audio_as_voice]]` automatically for Telegram voice bubbles.
    - Optional env vars: `TTS_SPEED` (default `185`), `TTS_VOICE` (espeak voice id), `OUT_DIR` (output directory).
    - Use `assets/openclaw-espeak-telephony-plugin/` as the minimal starting point for local `espeak-ng` telephony TTS in the OpenClaw voice-call path.
    - Integration notes for the voice-call fork are in `references/voice-call-provider-skeleton.md`.
    - Config guidance for audio-only chat behavior is in `references/openclaw-config-for-audio-only.md`.

     

    Step 4

    Enable the skill in your openclaw.json configuration as well:

    nano ~/.openclaw/openclaw.json
    Scroll to the skills section and make sure fast-tts is included under entries as well, for example:
      "skills": {
        "entries": {
          "fast-tts": { "enabled": true }
        }
      }

    You’ll often already have some entries here. The full section might then look like this:

      "skills": {
        "install": {
          "nodeManager": "npm"
        },
        "entries": {
          "fast-tts": {
            "enabled": true
          },
          "openai-whisper": {
            "enabled": false
          },
          "openai-whisper-api": {
            "enabled": false
          }
        }
      },

     

    Step 5

    Instruct your agent to actually use the skill. Open SOUL.md:

    nano ~/.openclaw/workspace/SOUL.md

    Add a section like the example below (don’t remove any existing content), then save your changes and close the file (Ctrl + X > Y > Enter).

    • Replace <yourusername> with your own username in the paths below.
    ## Telegram/WhatsApp
    
    When replying on Telegram/WhatsApp:
    
    - Call exec exactly once per turn with:
      {"command":"/home/<yourusername>/.openclaw/skills/fast-tts/bin/fast-tts","env":{"TTS_CHANNEL":"telegram","OUT_DIR":"/home/<yourusername>/.openclaw/media/outbound","TTS_APPEND_VOICE_MARKER":"0","TTS_TEXT":"<full reply text>"}}
    - Never put reply text in the command string.
    - Never output "Action:" or "Action Input:" to chat.
    - Assume your first call in a turn is successful; if exec or fast-tts fails once in a turn, do not retry in that same turn.
    - Never use the built-in TTS tool.
    - Final output must be only the MEDIA: line returned by the tool.

     

    Step 6

    Restart your OpenClaw gateway so that skills are reloaded:

    openclaw gateway restart

     

    Potential bug

     

    There is an open report that MEDIA: lines are sometimes displayed as plain text (without an attachment) with certain model/provider combinations. A temporary workaround is to send audio via the message tool instead. If you run into this issue, also try updating OpenClaw with openclaw update.
