---
slug: openai-ffmpeg-integration
title: How to Integrate OpenAI TTS with FFmpeg in a FastAPI Service
date: 2025-03-06
authors: [nicolad]
tags:
  [
    openai,
    ffmpeg,
    audio,
    text-to-speech,
    fastapi,
    python,
    machine-learning,
    generative-ai,
    api,
  ]
---

## Introduction

**OpenAI** offers powerful text-to-speech capabilities, enabling developers to generate spoken audio from raw text. Meanwhile, **FFmpeg** is the de facto standard tool for audio/video processing, used heavily for tasks like merging audio files, converting formats, and applying filters. Combining the two in a **FastAPI** application yields a scalable, production-ready text-to-speech (TTS) workflow that merges and manipulates audio via FFmpeg under the hood.

This article demonstrates how to:

1. Accept text input through a **FastAPI** endpoint
2. Chunk the text and use **OpenAI** to generate an MP3 segment per chunk
3. Merge the generated segments with **FFmpeg** (through the [pydub](https://github.com/jiaaro/pydub) interface)
4. Return or store a final MP3 file, ideal for streamlined TTS pipelines

By the end, you’ll understand how to build a simple but effective text-to-speech microservice that leverages the power of **OpenAI** and **FFmpeg**.

---

## 1. Why Combine OpenAI and FFmpeg?

- **Chunked Processing**: Long text can exceed API limits or time out. Splitting it into smaller parts ensures each piece is handled reliably.
- **Post-processing**: Merging segments, adding intros or outros, or applying custom filters (such as volume adjustments) becomes trivial with FFmpeg.
- **Scalability**: A background task system (like FastAPI’s `BackgroundTasks`) handles requests without blocking the main thread.
- **Automation**: One endpoint receives text and produces a final merged MP3, with no manual steps in between.

---

## 2. FastAPI Endpoint and Background Tasks

Below is the **FastAPI** code for a TTS service built on the OpenAI API and pydub (which invokes FFmpeg internally). It splits the input text into manageable chunks, generates an MP3 file per chunk, then merges the chunks:

```python
import os
import time
import logging
from pathlib import Path

from dotenv import load_dotenv
from fastapi import APIRouter, HTTPException, BackgroundTasks
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from openai import OpenAI
from pydub import AudioSegment

load_dotenv(".env.local")

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

router = APIRouter()

logging.basicConfig(
    level=logging.DEBUG,  # Set the root logger to debug level
    format="%(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

class AudioRequest(BaseModel):
    input: str

def chunk_text(text: str, chunk_size: int = 4096):
    """
    Generator that yields `text` in chunks of at most `chunk_size` characters.
    4096 characters is the maximum input length the OpenAI TTS API accepts.
    """
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]

@router.post("/speech")
async def generate_speech(body: AudioRequest, background_tasks: BackgroundTasks):
    """
    Kicks off TTS generation as a fire-and-forget background task and
    returns immediately. Progress is tracked via log messages.
    """
    model = "tts-1"
    voice = "onyx"

    if not body.input:
        raise HTTPException(
            status_code=400,
            detail="Missing required field: input"
        )

    # Millisecond timestamp used for folder naming and logging
    timestamp = int(time.time() * 1000)

    # Create a folder for storing output
    output_folder = Path(".") / f"speech_{timestamp}"
    output_folder.mkdir(exist_ok=True)

    # Split the input into chunks
    chunks = list(chunk_text(body.input, 4096))

    # Schedule the actual speech generation in the background
    background_tasks.add_task(
        generate_audio_files,
        chunks=chunks,
        output_folder=output_folder,
        model=model,
        voice=voice,
        timestamp=timestamp
    )

    # Log and return immediately
    logger.info(f"Speech generation task started at {timestamp} with {len(chunks)} chunks.")
    return JSONResponse({"detail": f"Speech generation started. Timestamp: {timestamp}"})

def generate_audio_files(chunks, output_folder, model, voice, timestamp):
    """
    Generates an MP3 file for each chunk. Runs in the background.
    After all chunks are created, merges them into a single MP3 file.
    """
    try:
        # Generate individual chunk MP3s
        for index, chunk in enumerate(chunks):
            speech_filename = f"speech-chunk-{index + 1}.mp3"
            speech_file_path = output_folder / speech_filename

            logger.info(f"Generating audio for chunk {index + 1}/{len(chunks)}...")

            response = client.audio.speech.create(
                model=model,
                voice=voice,
                input=chunk,
                response_format="mp3",
            )

            # Write the binary audio response to disk. Recent versions of the
            # openai SDK deprecate this method in favor of
            # client.audio.speech.with_streaming_response.create(...).
            response.stream_to_file(speech_file_path)
            logger.info(f"Chunk {index + 1} audio saved to {speech_file_path}")

        # Merge all generated MP3 files into a single file
        logger.info("Merging all audio chunks into one file...")
        merged_audio = AudioSegment.empty()

        def file_index(file_path: Path):
            # Expects file names like 'speech-chunk-1.mp3'
            return int(file_path.stem.split('-')[-1])

        sorted_audio_files = sorted(output_folder.glob("speech-chunk-*.mp3"), key=file_index)
        for audio_file in sorted_audio_files:
            chunk_audio = AudioSegment.from_file(audio_file, format="mp3")
            merged_audio += chunk_audio

        merged_output_file = output_folder / f"speech-merged-{timestamp}.mp3"
        merged_audio.export(merged_output_file, format="mp3")
        logger.info(f"Merged audio saved to {merged_output_file}")

        logger.info(f"All speech chunks generated and merged for timestamp {timestamp}.")
    except Exception as e:
        logger.error(f"OpenAI error (timestamp {timestamp}): {e}")
```
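
To try the service end to end, you can mount the router on an app and post some text. Here is a minimal sketch, assuming the code above lives in a module named `tts_service` (a hypothetical name) and that `OPENAI_API_KEY` is set in `.env.local`:

```python
# Minimal sketch for exercising the endpoint locally.
# `tts_service` is a hypothetical module name for the file above.
from fastapi import FastAPI
from fastapi.testclient import TestClient

from tts_service import router

app = FastAPI()
app.include_router(router)

client = TestClient(app)
resp = client.post("/speech", json={"input": "Hello from OpenAI TTS and FFmpeg!"})
print(resp.status_code)  # 200
print(resp.json())       # {"detail": "Speech generation started. Timestamp: ..."}
```

Note that Starlette’s `TestClient` runs background tasks before returning, so this call blocks until the audio is generated and merged; against a live `uvicorn` server the response comes back immediately.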

### Key Takeaways

- The **`AudioRequest`** model enforces the presence of an `input` field.
- **`chunk_text`** ensures no chunk exceeds 4096 characters, the maximum input length the OpenAI TTS API accepts (you can lower this for smaller segments).
- **`BackgroundTasks`** offloads the TTS generation so the API can respond promptly.
- **pydub** merges the MP3 files (calling FFmpeg under the hood).

---

## 3. Using FFmpeg Under the Hood

**[pydub](https://github.com/jiaaro/pydub)** installs via pip, but it relies on **FFmpeg** for MP3 decoding and encoding. Ensure FFmpeg is on your PATH; otherwise you’ll get errors when merging or exporting MP3 files. For Linux (Ubuntu/Debian):

```bash
sudo apt-get update
sudo apt-get install ffmpeg
```

For macOS (using Homebrew):

```bash
brew install ffmpeg
```

If you’re on Windows, install FFmpeg from [FFmpeg’s official site](https://ffmpeg.org/) or use a package manager like [Chocolatey](https://chocolatey.org/) or [Scoop](https://scoop.sh/).
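
If FFmpeg lives somewhere outside your PATH, you can also point pydub at the binary explicitly. A small sketch (the fallback path below is just an example; adjust it for your system):

```python
from pydub import AudioSegment
from pydub.utils import which

# Resolve ffmpeg from PATH, or fall back to an explicit location.
# "/opt/ffmpeg/bin/ffmpeg" is an example path, not a requirement.
AudioSegment.converter = which("ffmpeg") or "/opt/ffmpeg/bin/ffmpeg"
```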

---

## 4. Workflow Diagram

Below is a **Mermaid** sequence diagram illustrating the workflow:

```mermaid
sequenceDiagram
    participant User as User
    participant FastAPI as FastAPI Service
    participant OpenAI as OpenAI API
    participant FFmpeg as FFmpeg (pydub)

    User->>FastAPI: POST /speech {"input": "..."}
    Note right of FastAPI: Validate request & chunk text
    FastAPI->>FastAPI: Start background task (generate_audio_files)
    FastAPI-->>User: {"detail": "Speech generation started"}

    FastAPI->>OpenAI: For each chunk: request audio.speech.create()
    Note right of FastAPI: Receives chunk as MP3
    FastAPI->>FastAPI: Save chunk to disk

    FastAPI->>FFmpeg: Merge MP3 chunks with pydub
    Note right of FFmpeg: Produces single MP3 file
    FFmpeg-->>FastAPI: Merged MP3 path
    FastAPI-->>User: (Background task completes)
```

**Explanation**:

1. **User** sends a POST request with text data.
2. **FastAPI** quickly acknowledges the request, then spawns a background task.
3. Chunks of text are processed via **OpenAI** TTS, saving individual MP3 files.
4. **pydub** merges them (calling **FFmpeg** behind the scenes).
5. The final merged file is ready in the timestamped output directory.

---

## 5. Conclusion

Integrating **OpenAI** text-to-speech with **FFmpeg** via **pydub** in a **FastAPI** application provides a robust, scalable way to automate TTS pipelines:

- **Reliability**: Chunk-based processing handles large inputs without overloading the API.
- **Versatility**: FFmpeg’s audio manipulation potential is nearly limitless.
- **Speed**: Background tasks keep the main API responsive.

With the sample code above, you can adapt chunk sizes, add authentication, or expand the pipeline with more sophisticated post-processing such as watermarking, crossfading, or mixing in background music; see the sketch below. **OpenAI** and **FFmpeg** make a powerful duo for building richer audio capabilities into your apps.
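
As a taste of that post-processing, here is a minimal pydub sketch that crossfades an intro jingle into the merged speech and lays a quieter music bed underneath it. All file names are placeholders:

```python
from pydub import AudioSegment

# File names are placeholders; substitute your own assets.
speech = AudioSegment.from_file("speech-merged.mp3", format="mp3")
intro = AudioSegment.from_file("intro.mp3", format="mp3")
music = AudioSegment.from_file("music.mp3", format="mp3")

# Crossfade the intro into the speech over 500 ms.
combined = intro.append(speech, crossfade=500)

# Drop the music bed by 18 dB, trim it to the combined length
# (AudioSegment slices are in milliseconds), and lay it under the voice.
bed = (music - 18)[: len(combined)]
final = combined.overlay(bed)

final.export("speech-final.mp3", format="mp3")
```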