Creating a Single Speech Agent¶

This tutorial demonstrates how to create a single AI agent with real-time text-to-speech (TTS) capabilities using the Swarms framework and the voice-agents package. This setup is ideal for interactive applications where you want the agent to "speak" its responses as they are generated.

Prerequisites¶

Python 3.10+
OpenAI API key (for both LLM and TTS)
swarms library
voice-agents library

Tutorial Steps¶

Install Dependencies Install the necessary packages:
```
pip3 install -U swarms voice-agents
```
Set Up Environment Ensure your OpenAI API key is set in your environment:
```
export OPENAI_API_KEY="your-api-key-here"
```
Initialize the Agent Create an agent with streaming_on=True. This is crucial for the TTS callback to work in real-time.
Configure the TTS Callback Use the StreamingTTSCallback from the voice-agents package. You can choose different voices like "alloy", "echo", "fable", "onyx", "nova", or "shimmer".
Run the Agent Pass the streaming_callback to the agent.run() method.

Code Example¶

from swarms import Agent
from voice_agents import StreamingTTSCallback

# Initialize the agent
agent = Agent(
    agent_name="Quantitative-Trading-Agent",
    agent_description="Advanced quantitative trading and algorithmic analysis agent",
    model_name="gpt-4o",
    dynamic_temperature_enabled=True,
    max_loops=1,
    streaming_on=True, # Required for real-time TTS
)

# Create the streaming TTS callback
# voice: alloy, echo, fable, onyx, nova, shimmer
tts_callback = StreamingTTSCallback(voice="alloy", model="tts-1")

# Run the agent with streaming TTS callback
out = agent.run(
    task="What are the top five best energy stocks across nuclear, solar, gas, and other energy sources?",
    streaming_callback=tts_callback,
)

# Flush any remaining text in the buffer to ensure the last sentence is spoken
tts_callback.flush()

print(out)

How it Works¶

Streaming: When streaming_on=True, the agent yields tokens as they are generated.
Callback: The StreamingTTSCallback collects these tokens into sentences and sends them to the OpenAI TTS API.
Real-time Audio: Audio is played back as soon as the first sentence is processed, significantly reducing latency compared to waiting for the full response.
Flushing: The .flush() method is called at the end to process any remaining text that didn't end with a sentence delimiter.