---
slug: openai-ffmpeg-integration
title: How to Integrate OpenAI TTS with FFmpeg in a FastAPI Service
date: 2025-03-06
authors: [nicolad]
tags:
  [
    openai,
    ffmpeg,
    audio,
    text-to-speech,
    fastapi,
    python,
    machine-learning,
    generative-ai,
    api,
  ]
---

## Introduction

**OpenAI** offers powerful text-to-speech capabilities, enabling developers to generate spoken audio from raw text. Meanwhile, **FFmpeg** is the de facto standard tool for audio/video processing—used heavily for tasks like merging audio files, converting formats, and applying filters. Combining these two in a **FastAPI** application can produce a scalable, production-ready text-to-speech (TTS) workflow that merges and manipulates audio via FFmpeg under the hood.

This article demonstrates how to:

1. Accept text input through a **FastAPI** endpoint
2. Chunk text and use **OpenAI** to generate MP3 segments
3. Merge generated segments with **FFmpeg** (through the [pydub](https://github.com/jiaaro/pydub) interface)
4. Return or store a final MP3 file, ideal for streamlined TTS pipelines

By the end, you’ll understand how to build a simple but effective text-to-speech microservice that leverages the power of **OpenAI** and **FFmpeg**.

---

## 1. Why Combine OpenAI and FFmpeg

- **Chunked Processing**: Long text might exceed API limits or timeouts. Splitting it into smaller parts ensures each piece is handled reliably.
- **Post-processing**: Merging segments, adding intros or outros, or applying custom filters (such as volume adjustments) becomes trivial with FFmpeg.
- **Scalability**: A background task system (like FastAPI’s `BackgroundTasks`) can handle requests without blocking the main thread.
- **Automation**: Minimizes manual involvement—one endpoint can receive text and produce a final merged MP3.

---

## 2. FastAPI Endpoint and Background Tasks

Below is the **FastAPI** code that implements a TTS service using the OpenAI API and pydub (which uses FFmpeg internally). It splits the input text into manageable chunks, generates an MP3 file per chunk, then merges them. Besides FastAPI itself, the code relies on the `openai`, `pydub`, and `python-dotenv` packages.

```python
import os
import time
import logging
from pathlib import Path

from dotenv import load_dotenv
from fastapi import APIRouter, HTTPException, Request, BackgroundTasks
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from openai import OpenAI
from pydub import AudioSegment

load_dotenv(".env.local")

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

router = APIRouter()

logging.basicConfig(
    level=logging.DEBUG,  # Set the root logger to debug level
    format="%(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)


class AudioRequest(BaseModel):
    input: str


def chunk_text(text: str, chunk_size: int = 4096):
    """
    Generator that yields `text` in chunks of at most `chunk_size` characters.
    """
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]


@router.post("/speech")
async def generate_speech(request: Request, body: AudioRequest, background_tasks: BackgroundTasks):
    """
    Fires off the TTS request in the background (fire-and-forget).
    Logs are added to track progress. No zip file is created.
    """
    model = "tts-1"
    voice = "onyx"

    if not body.input:
        raise HTTPException(
            status_code=400,
            detail="Missing required field: input",
        )

    # Current time in milliseconds, used for folder naming and logging
    timestamp = int(time.time() * 1000)

    # Create a folder for storing output
    output_folder = Path(".") / f"speech_{timestamp}"
    output_folder.mkdir(exist_ok=True)

    # Split the input into chunks no longer than the TTS input limit
    chunks = list(chunk_text(body.input, 4096))

    # Schedule the actual speech generation in the background
    background_tasks.add_task(
        generate_audio_files,
        chunks=chunks,
        output_folder=output_folder,
        model=model,
        voice=voice,
        timestamp=timestamp,
    )

    # Log and return immediately
    logger.info(f"Speech generation task started at {timestamp} with {len(chunks)} chunks.")
    return JSONResponse({"detail": f"Speech generation started. Timestamp: {timestamp}"})


def generate_audio_files(chunks, output_folder, model, voice, timestamp):
    """
    Generates an audio file for each chunk. Runs in the background.
    After all chunks are created, merges them into a single MP3 file.
    """
    try:
        # Generate one MP3 per chunk
        for index, chunk in enumerate(chunks):
            speech_filename = f"speech-chunk-{index + 1}.mp3"
            speech_file_path = output_folder / speech_filename

            logger.info(f"Generating audio for chunk {index + 1}/{len(chunks)}...")

            response = client.audio.speech.create(
                model=model,
                voice=voice,
                input=chunk,
                response_format="mp3",
            )

            # Write the MP3 bytes to disk (newer openai-python releases
            # deprecate this in favor of `with_streaming_response`)
            response.stream_to_file(speech_file_path)
            logger.info(f"Chunk {index + 1} audio saved to {speech_file_path}")

        # Merge all generated MP3 files into a single file
        logger.info("Merging all audio chunks into one file...")
        merged_audio = AudioSegment.empty()

        def file_index(file_path: Path):
            # Expects file names like 'speech-chunk-1.mp3'
            return int(file_path.stem.split("-")[-1])

        # Sort numerically so that chunk 10 does not sort before chunk 2
        sorted_audio_files = sorted(output_folder.glob("speech-chunk-*.mp3"), key=file_index)
        for audio_file in sorted_audio_files:
            chunk_audio = AudioSegment.from_file(audio_file, format="mp3")
            merged_audio += chunk_audio

        merged_output_file = output_folder / f"speech-merged-{timestamp}.mp3"
        merged_audio.export(merged_output_file, format="mp3")
        logger.info(f"Merged audio saved to {merged_output_file}")

        logger.info(f"All speech chunks generated and merged for timestamp {timestamp}.")
    except Exception as e:
        logger.error(f"OpenAI error (timestamp {timestamp}): {e}")
```

### Key Takeaways

- **`AudioRequest`** model enforces the presence of an `input` field.
- **`chunk_text`** ensures no chunk exceeds 4096 characters, the OpenAI TTS input limit per request (you can lower this size if needed).
- **BackgroundTasks** offloads the TTS generation so the API can respond promptly.
- **pydub** merges the MP3 files (which in turn calls FFmpeg); a sketch of wiring this router into a runnable app follows below.

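The router above still needs to be mounted in an application before it can serve requests. Here is a minimal sketch of that wiring; the module name `speech` and the run command are assumptions for illustration:

```python
# app.py: minimal wiring for the /speech endpoint.
# Assumes the code above is saved as speech.py next to this file
# (a hypothetical layout; adjust the import to your project).
from fastapi import FastAPI

from speech import router

app = FastAPI(title="TTS Service")
app.include_router(router)

# Run locally with:
#   uvicorn app:app --reload
```
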
---

## 3. Using FFmpeg Under the Hood

**[pydub](https://github.com/jiaaro/pydub)** relies on **FFmpeg** for reading and writing audio, so FFmpeg must be installed on your system. Ensure FFmpeg is in your PATH—otherwise you’ll get errors when merging or saving MP3 files. For Linux (Ubuntu/Debian):

```bash
sudo apt-get update
sudo apt-get install ffmpeg
```

For macOS (using Homebrew):

```bash
brew install ffmpeg
```

If you’re on Windows, install FFmpeg from [FFmpeg’s official site](https://ffmpeg.org/) or use a package manager like [chocolatey](https://chocolatey.org/) or [scoop](https://scoop.sh/).

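Once FFmpeg is installed, you can confirm that pydub is able to locate the binary. A minimal check, assuming FFmpeg is on your PATH:

```python
from pydub.utils import which

# pydub locates its converter binaries via the PATH; if this prints None,
# exporting or merging MP3s will fail with a "Couldn't find ffmpeg" warning.
print(which("ffmpeg"))   # e.g. /usr/bin/ffmpeg
print(which("ffprobe"))  # used by some pydub helpers to read media metadata
```
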
---

## 4. Mermaid JS Diagram

Below is a **Mermaid** sequence diagram illustrating the workflow:

```mermaid
sequenceDiagram
    participant User as User
    participant FastAPI as FastAPI Service
    participant OpenAI as OpenAI API
    participant FFmpeg as FFmpeg (pydub)

    User->>FastAPI: POST /speech {"input": "..."}
    note right of FastAPI: Validate request & chunk text
    FastAPI->>FastAPI: Start background task (generate_audio_files)
    FastAPI-->>User: {"detail": "Speech generation started"}

    FastAPI->>OpenAI: For each chunk: request audio.speech.create()
    note right of FastAPI: Receives chunk as MP3
    FastAPI->>FastAPI: Save chunk to disk

    FastAPI->>FFmpeg: Merge MP3 chunks with pydub
    note right of FFmpeg: Produces single MP3 file
    FFmpeg-->>FastAPI: Merged MP3 path
    FastAPI-->>User: (Background task completes)
```

**Explanation**:

1. **User** sends a POST request with text data.
2. **FastAPI** quickly acknowledges the request, then spawns a background task.
3. Chunks of text are processed via **OpenAI** TTS, saving individual MP3 files.
4. **pydub** merges them (calling **FFmpeg** behind the scenes).
5. The final merged file is ready in your output directory (see the smoke test below).
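
To exercise the whole flow, post a small payload and watch the logs. A quick smoke test, assuming the service from section 2 is running locally on port 8000 and `httpx` is installed (the URL and sample text are illustrative):

```python
import httpx

# Fire a request at the running service; the endpoint returns immediately
# while TTS generation and the FFmpeg merge continue in the background.
resp = httpx.post(
    "http://localhost:8000/speech",
    json={"input": "Hello from OpenAI TTS and FFmpeg!"},
    timeout=10,
)
print(resp.json())  # {"detail": "Speech generation started. Timestamp: ..."}
# Once the background task finishes, the merged file is at
# ./speech_<timestamp>/speech-merged-<timestamp>.mp3
```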

---

## 5. Conclusion

Integrating **OpenAI** text-to-speech with **FFmpeg** via **pydub** in a **FastAPI** application provides a robust, scalable way to automate TTS pipelines:

- **Reliability**: Chunk-based processing handles large inputs without overloading the API.
- **Versatility**: FFmpeg’s audio manipulation potential is nearly limitless.
- **Speed**: Background tasks ensure the main API remains responsive.

With the sample code above, you can adapt chunk sizes, add authentication, or expand the pipeline to include more sophisticated post-processing (like watermarking, crossfading, or mixing in music); one such refinement is sketched below. Enjoy building richer audio capabilities into your apps—**OpenAI** and **FFmpeg** make a powerful duo.
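
For instance, the merge loop could overlap chunk boundaries with a short crossfade instead of plain concatenation. A minimal sketch using pydub's `append`; the duration value is illustrative:

```python
from pydub import AudioSegment

def merge_with_crossfade(mp3_paths, crossfade_ms=50):
    """Merge MP3 files, overlapping each boundary by `crossfade_ms` milliseconds."""
    merged = AudioSegment.from_file(mp3_paths[0], format="mp3")
    for path in mp3_paths[1:]:
        chunk = AudioSegment.from_file(path, format="mp3")
        # append() crossfades the tail of `merged` into the head of `chunk`
        merged = merged.append(chunk, crossfade=crossfade_ms)
    return merged
```

A few tens of milliseconds can be enough to soften the seams between independently synthesized chunks without audibly overlapping speech.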
