A Python script that generates `llms.txt` and `llms-full.txt` files for any website using the Firecrawl and OpenAI APIs.

`llms.txt` is a standardized format for making website content more accessible to Large Language Models (LLMs). The script produces:

- `llms.txt`: A concise index of all pages, with a title and description for each
- `llms-full.txt`: The complete content of all pages, for comprehensive access
## Features

- 🗺️ Website Mapping: Automatically discovers all URLs on a website using Firecrawl's map endpoint
- 📄 Content Scraping: Extracts markdown content from each page
- 🤖 AI Summaries: Uses OpenAI's GPT-4o-mini to generate concise titles and descriptions
- ⚡ Parallel Processing: Processes multiple URLs concurrently for faster generation
- 🎯 Configurable Limits: Set the maximum number of URLs to process
- 📁 Flexible Output: Generate both files, or just `llms.txt`
## Requirements

- Python 3.7+
- A Firecrawl API key
- An OpenAI API key
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd <repository-directory>
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up API keys (choose one method):

   Option A: Using a `.env` file (recommended)

   ```bash
   cp env.example .env
   # Edit .env and add your API keys
   ```

   Option B: Using environment variables

   ```bash
   export FIRECRAWL_API_KEY="your-firecrawl-api-key"
   export OPENAI_API_KEY="your-openai-api-key"
   ```

   Option C: Using command-line arguments (see the usage examples below)
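For reference, a populated `.env` file uses plain `KEY=value` lines; the values here are placeholders, and the exact contents of `env.example` may differ:

```
FIRECRAWL_API_KEY=fc-your-firecrawl-api-key
OPENAI_API_KEY=sk-your-openai-api-key
```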
## Usage

Generate `llms.txt` and `llms-full.txt` for a website:

```bash
python generate-llmstxt.py https://example.com
```

More options:

```bash
# Limit to 50 URLs
python generate-llmstxt.py https://example.com --max-urls 50

# Save to a specific directory
python generate-llmstxt.py https://example.com --output-dir ./output

# Only generate llms.txt (skip the full text)
python generate-llmstxt.py https://example.com --no-full-text

# Enable verbose logging
python generate-llmstxt.py https://example.com --verbose

# Specify API keys via the command line
python generate-llmstxt.py https://example.com \
  --firecrawl-api-key "fc-..." \
  --openai-api-key "sk-..."
```
## Command-Line Arguments

- `url` (required): The website URL to process
- `--max-urls`: Maximum number of URLs to process (default: 20)
- `--output-dir`: Directory to save output files (default: current directory)
- `--firecrawl-api-key`: Firecrawl API key (defaults to the `.env` file or the `FIRECRAWL_API_KEY` environment variable)
- `--openai-api-key`: OpenAI API key (defaults to the `.env` file or the `OPENAI_API_KEY` environment variable)
- `--no-full-text`: Only generate `llms.txt`, skipping `llms-full.txt`
- `--verbose`: Enable verbose logging for debugging
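As a rough sketch, these options map naturally onto an `argparse` parser; this is illustrative, not necessarily how `generate-llmstxt.py` actually builds its CLI:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Generate llms.txt and llms-full.txt for a website")
parser.add_argument("url", help="The website URL to process")
parser.add_argument("--max-urls", type=int, default=20,
                    help="Maximum number of URLs to process")
parser.add_argument("--output-dir", default=".",
                    help="Directory to save output files")
parser.add_argument("--firecrawl-api-key", help="Firecrawl API key")
parser.add_argument("--openai-api-key", help="OpenAI API key")
parser.add_argument("--no-full-text", action="store_true",
                    help="Only generate llms.txt, skip llms-full.txt")
parser.add_argument("--verbose", action="store_true",
                    help="Enable verbose logging")
args = parser.parse_args()
```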
## Output Format

### llms.txt

```
# https://example.com llms.txt

- [Page Title](https://example.com/page1): Brief description of the page content here
- [Another Page](https://example.com/page2): Another concise description of page content
```

### llms-full.txt

```
# https://example.com llms-full.txt

<|firecrawl-page-1-lllmstxt|>
## Page Title

Full markdown content of the page...

<|firecrawl-page-2-lllmstxt|>
## Another Page

Full markdown content of another page...
```
## How It Works

1. Website Mapping: Uses Firecrawl's `/map` endpoint to discover all URLs on the website
2. Batch Processing: Processes URLs in batches of 10 for efficiency
3. Content Extraction: Scrapes each URL to extract markdown content
4. AI Summarization: For each page, GPT-4o-mini generates:
   - A 3-4 word title
   - A 9-10 word description
5. File Generation: Creates the formatted `llms.txt` and `llms-full.txt` files
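The end-to-end flow can be pictured with a minimal sketch. It assumes Firecrawl's v1 REST endpoints (`POST /v1/map`, `POST /v1/scrape`) and the official `openai` Python SDK; the response handling and prompt wording are illustrative, not the script's actual code:

```python
import os
import requests
from openai import OpenAI

FIRECRAWL_API = "https://api.firecrawl.dev/v1"
headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

# 1. Map: discover URLs on the site (assumed response shape: {"links": [...]})
resp = requests.post(f"{FIRECRAWL_API}/map",
                     json={"url": "https://example.com"}, headers=headers)
urls = resp.json().get("links", [])[:20]  # mirror the --max-urls default

client = OpenAI()  # reads OPENAI_API_KEY from the environment

entries = []
for url in urls:
    # 2-3. Scrape: extract markdown content for one page
    page = requests.post(f"{FIRECRAWL_API}/scrape",
                         json={"url": url, "formats": ["markdown"]},
                         headers=headers).json()
    markdown = page.get("data", {}).get("markdown", "")
    if not markdown:
        continue  # failed scrapes are logged and skipped in the real script

    # 4. Summarize: ask GPT-4o-mini for a short title and description
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": ("Give a 3-4 word title and a 9-10 word description "
                        f"for this page:\n\n{markdown[:4000]}"),
        }],
    )
    entries.append((url, chat.choices[0].message.content))

# 5. Write llms.txt (the real output separates title and description)
with open("llms.txt", "w") as f:
    f.write("# https://example.com llms.txt\n\n")
    for url, summary in entries:
        f.write(f"- [{summary}]({url})\n")
```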
## Error Handling

- Failed URL scrapes are logged and skipped
- If no URLs are found, the script exits with an error
- API errors are logged with details for debugging
- Rate limiting is handled with delays between batches
## Performance Notes

- Processing time depends on the number of URLs and on API response times
- The default batch size is 10 URLs processed concurrently
- Small delays between batches prevent rate limiting
- For large websites, consider using `--max-urls` to limit processing
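In outline, the batching just described might look like the following sketch; `scrape_page` is a stand-in for the real per-URL worker, and the one-second delay is an assumption, not the script's exact value:

```python
import time
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10          # default batch size described above
BATCH_DELAY_SECONDS = 1  # illustrative; the script's exact delay may differ

def scrape_page(url):
    # Placeholder: fetch one page's markdown via Firecrawl's /scrape
    # (see the pipeline sketch under "How It Works").
    return ""

def scrape_all(urls):
    results = []
    for i in range(0, len(urls), BATCH_SIZE):
        batch = urls[i:i + BATCH_SIZE]
        # Process one batch of up to 10 URLs concurrently
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            results.extend(pool.map(scrape_page, batch))
        time.sleep(BATCH_DELAY_SECONDS)  # small pause between batches
    return results
```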
## Examples

```bash
# A small blog, capped at 20 URLs
python generate-llmstxt.py https://small-blog.com --max-urls 20

# A larger documentation site, with verbose logging
python generate-llmstxt.py https://docs.example.com --max-urls 100 --verbose

# Index only (no full text), capped at 50 URLs
python generate-llmstxt.py https://example.com --no-full-text --max-urls 50
```
## API Key Configuration

The script checks for API keys in this order:

1. Command-line arguments (`--firecrawl-api-key`, `--openai-api-key`)
2. A `.env` file in the current directory
3. Environment variables (`FIRECRAWL_API_KEY`, `OPENAI_API_KEY`)
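That precedence boils down to a small lookup, sketched here with the `python-dotenv` package; `resolve_key` is an illustrative helper, not necessarily how the script implements it:

```python
import os
from typing import Optional
from dotenv import dotenv_values  # pip install python-dotenv

def resolve_key(cli_value: Optional[str], env_name: str) -> Optional[str]:
    if cli_value:                        # 1. command-line argument wins
        return cli_value
    file_values = dotenv_values(".env")  # 2. .env in the current directory
    return file_values.get(env_name) or os.getenv(env_name)  # 3. env var

firecrawl_key = resolve_key(None, "FIRECRAWL_API_KEY")
```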
## Troubleshooting

If the script cannot find your API keys, make sure you have either:

- Created a `.env` file with your API keys (copy it from `env.example`)
- Set the `FIRECRAWL_API_KEY` and `OPENAI_API_KEY` environment variables
- Passed the keys via command-line arguments
If you encounter rate limits:
- Reduce concurrent workers in the code
- Add longer delays between batches
- Process fewer URLs at once
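Beyond fixed delays, one adaptive option is to retry requests that come back rate-limited. A sketch, assuming the APIs signal rate limits with HTTP 429; `post_with_backoff` is an illustrative wrapper, not part of the script:

```python
import time
import requests

def post_with_backoff(url, max_retries=3, **kwargs):
    # Retry on HTTP 429 with exponential backoff: 1s, 2s, 4s.
    for attempt in range(max_retries):
        resp = requests.post(url, **kwargs)
        if resp.status_code != 429:
            return resp
        time.sleep(2 ** attempt)
    return resp
```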
For very large websites:

- Use `--max-urls` to limit the number of pages
- Process in smaller batches
- Use `--no-full-text` to skip full-content generation
## License

MIT License - see the LICENSE file for details.