🎯 Overview
The llms.txt Generator is a web crawling tool that creates structured summaries of websites in Markdown format. It categorizes the pages it finds and generates organized documentation suitable for AI training data, website analysis, or content audits.
🔍 Smart Crawling
Discovers sitemaps automatically, follows navigation links, and categorizes content across the whole website.
📱 Responsive Design
Professional interface that works seamlessly across all devices with modern styling and intuitive controls.
🎨 Professional Styling
Orange-themed design (#ff9500) matching modern web standards with clean typography and smooth interactions.
📥 Multiple Export Options
Download generated content as .txt or .md files, or copy directly to clipboard for immediate use.
📖 Usage Guide
🚀 Quick Start
1. Enter a website URL (e.g., https://example.com)
2. Choose your crawl type: Shallow or Deep
3. Select link source: All, Header, Navigation, or Footer
4. Click "Generate llms.txt"
5. Preview the generated Markdown content
6. Copy to clipboard or download as a .txt or .md file
🔍 Deep Crawl Process
When using deep crawl, the system follows this comprehensive discovery process:
- Sitemap Discovery: Checks /sitemap.xml first
- Fallback Options: If not found, tries /sitemap_index.xml
- Robots.txt Analysis: Extracts sitemap references from robots.txt
- Homepage Fallback: Uses homepage navigation if no sitemaps found
- Recursive Exploration: Follows internal links up to the configured page limit (1200 by default; see Backend Configuration)
- Smart Filtering: Excludes feeds, media files, and non-content pages
Note: Deep crawls take longer on large websites because they explore the full site structure rather than only the homepage navigation.
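The recursive exploration step can be pictured as a breadth-first walk over internal links that stops at a page cap. The sketch below is illustrative only: the real backend is PHP, and the names `crawlInternalLinks`, `getLinks`, and `pageLimit` are assumptions introduced here. `getLinks` stands in for "fetch the page and extract its links" and is synchronous to keep the example self-contained.

```javascript
// Hypothetical sketch of breadth-first link exploration with a page cap.
// Not the tool's actual code; all names are illustrative.
function crawlInternalLinks(startUrl, getLinks, pageLimit) {
  const start = new URL(startUrl);
  const origin = start.origin;
  const visited = new Set();
  const queue = [start.href];

  while (queue.length > 0 && visited.size < pageLimit) {
    const current = queue.shift();
    if (visited.has(current)) continue;
    visited.add(current);

    for (const href of getLinks(current)) {
      let absolute;
      try {
        absolute = new URL(href, current); // resolve relative links
      } catch {
        continue; // skip malformed links
      }
      absolute.hash = ""; // /page#a and /page#b are the same page
      if (absolute.origin !== origin) continue; // stay on the same site
      if (!visited.has(absolute.href)) queue.push(absolute.href);
    }
  }
  return [...visited];
}
```

A queue (rather than recursion) keeps pages close to the homepage ahead of deeper ones, so a low cap still returns the most prominent pages first.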
⚙️ Technical Specifications
🛠️ Technology Stack
- Frontend: HTML5, CSS3, JavaScript (ES6+)
- Backend: PHP 8.2 with cURL and DOMDocument
- Server: PHP built-in development server
- Dependencies: php82, php82Extensions.curl, php82Extensions.dom, php82Extensions.mbstring
🔧 Sitemap Discovery Algorithm
1. Try /sitemap.xml
2. If not found → try /sitemap_index.xml
3. If not found → parse /robots.txt for Sitemap: entries
4. If no sitemaps → fallback to homepage crawling
For sitemap indexes:
- Recursively parse all child sitemaps
- Extract all <loc> entries (page URLs)
- Filter and deduplicate results
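As a rough illustration of the fallback order and the sitemap index handling described above, here is a minimal JavaScript sketch. It is not the tool's PHP implementation: `fetchText` is a hypothetical synchronous callback that returns a document body or `null`, the function names are invented for this example, and a regular expression replaces proper XML parsing for brevity.

```javascript
// Hypothetical sketch of the sitemap discovery order; names are illustrative.

// Collect the text of every <loc> element in a sitemap document.
function extractLocs(xml) {
  const locs = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match;
  while ((match = pattern.exec(xml)) !== null) locs.push(match[1]);
  return locs;
}

// Pull "Sitemap: <url>" entries out of a robots.txt body.
function sitemapsFromRobots(robotsTxt) {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => line.match(/^\s*sitemap:\s*(\S+)/i))
    .filter(Boolean)
    .map((m) => m[1]);
}

// Read one sitemap; recurse into children if it is a sitemap index.
// Returns true when the document was found.
function collectPages(sitemapUrl, fetchText, seen, pages) {
  if (seen.has(sitemapUrl)) return true;
  seen.add(sitemapUrl);
  const xml = fetchText(sitemapUrl);
  if (!xml) return false;
  const locs = extractLocs(xml);
  if (/<sitemapindex[\s>]/i.test(xml)) {
    for (const child of locs) collectPages(child, fetchText, seen, pages);
  } else {
    for (const loc of locs) pages.add(loc); // Set deduplicates page URLs
  }
  return true;
}

function discoverPageUrls(baseUrl, fetchText) {
  const pages = new Set();
  const seen = new Set();
  // Steps 1 and 2: well-known sitemap locations, in order.
  for (const path of ["/sitemap.xml", "/sitemap_index.xml"]) {
    const url = new URL(path, baseUrl).href;
    if (collectPages(url, fetchText, seen, pages)) return [...pages];
  }
  // Step 3: Sitemap: entries in robots.txt.
  const robots = fetchText(new URL("/robots.txt", baseUrl).href);
  if (robots) {
    for (const url of sitemapsFromRobots(robots)) {
      collectPages(url, fetchText, seen, pages);
    }
  }
  return [...pages]; // an empty result means: fall back to homepage crawling
}
```

The `seen` set guards against sitemap indexes that reference each other, which would otherwise recurse forever.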
🚫 Filtering Rules
The system automatically excludes:
- URLs with fragments (#)
- Feed URLs (/feed/, RSS, Atom, .xml files)
- Media files (.jpg, .png, .pdf, .mp4, etc.)
- Tag, archive, and parameterized pages
- WordPress attachment and upload directories
- External domains (only crawls same domain)
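The rules above amount to a single predicate applied to each candidate URL. The following JavaScript sketch shows one way to express it. The actual patterns used by the PHP backend are not listed in this document, so every regular expression, the extension list behind "etc.", and the name `shouldCrawl` are assumptions.

```javascript
// Hypothetical filter; the patterns approximate the rules listed above.
const FEED_PATTERN = /\/(feed|rss|atom)(\/|$)|\.xml$/i;
const MEDIA_PATTERN = /\.(jpe?g|png|gif|svg|webp|pdf|mp4|mp3|zip)$/i; // assumed list
const TAXONOMY_PATTERN = /\/(tag|tags|archive|archives)(\/|$)/i;
const WORDPRESS_PATTERN = /\/(wp-content\/uploads|attachment)(\/|$)/i;

function shouldCrawl(rawUrl, siteHost) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a valid absolute URL
  }
  if (url.hostname !== siteHost) return false; // external domain
  if (url.hash) return false; // fragment link
  if (url.search) return false; // parameterized page
  const path = url.pathname;
  if (FEED_PATTERN.test(path)) return false;
  if (MEDIA_PATTERN.test(path)) return false;
  if (TAXONOMY_PATTERN.test(path)) return false;
  if (WORDPRESS_PATTERN.test(path)) return false;
  return true;
}
```

For example, `shouldCrawl("https://example.com/about", "example.com")` passes, while the same call with `/blog/feed/` or `/img/logo.png` as the path is rejected.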
📊 Performance Specifications
- Shallow Crawl: Homepage + immediate navigation (default: 25 pages)
- Deep Crawl: Comprehensive sitemap discovery (default: 1200 pages)
- Execution Time: 5 minutes for deep crawls, 2 minutes for shallow
- Request Timeout: 15 seconds per page request
- Memory: 512MB for deep crawls, optimized for large websites
- Output: Clean Markdown format with page descriptions and natural site hierarchy
⚙️ Backend Configuration
To adjust the number of pages crawled, modify these variables in crawler.php:
// Configuration: Adjust these values to control crawling limits
private $shallowCrawlLimit = 25; // Default limit for shallow crawl
private $deepCrawlLimit = 1200; // Maximum limit for deep crawl
// Examples:
// For faster crawling: set to 10 and 500
// For more comprehensive: set to 50 and 2000
// Maximum recommended: 100 and 3000 (with proper server resources)
Performance Note: Higher page limits will increase crawling time and server load. Deep crawls with 1200+ pages may take 5-15 minutes. See the Admin Features Guide for detailed configuration instructions.
🔧 Link Source Behavior (Shallow Crawl)
- All Links: Extracts from header, navigation, footer, and main content areas
- Header Only: Links from site header/banner section only
- Navigation Only: Links from navigation menus and site structure
- Footer Only: Links from footer area (legal, utility pages)
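One plausible way to implement these options is to map each link source to a set of CSS selectors. The sketch below is an assumption throughout: the selectors the tool really uses are not documented here, and `LINK_SOURCE_SELECTORS` and `selectorsFor` are hypothetical names.

```javascript
// Hypothetical selector map for each link source option.
const LINK_SOURCE_SELECTORS = {
  header: ["header a[href]", '[role="banner"] a[href]'],
  navigation: ["nav a[href]", '[role="navigation"] a[href]'],
  footer: ["footer a[href]", '[role="contentinfo"] a[href]'],
};

function selectorsFor(linkSource) {
  // "all" covers header, navigation, footer, and main content in one pass.
  if (linkSource === "all") return ["a[href]"];
  if (!Object.prototype.hasOwnProperty.call(LINK_SOURCE_SELECTORS, linkSource)) {
    throw new Error(`Unknown link source: ${linkSource}`);
  }
  return LINK_SOURCE_SELECTORS[linkSource];
}
```

In a browser context the result could be passed to `document.querySelectorAll(selectorsFor("footer").join(", "))`, since `querySelectorAll` accepts a comma-separated selector list.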