llms.txt Generator - Documentation

🎯 Overview

The llms.txt Generator is a web crawling tool that creates structured Markdown summaries of websites. It categorizes pages by type and produces organized output suitable for AI training data, website analysis, or content audits.

🔍 Smart Crawling

Discovers sitemaps automatically, follows navigation links, and categorizes content across the whole website.

📱 Responsive Design

Responsive interface that works across devices, with modern styling and intuitive controls.

🎨 Professional Styling

Orange-themed design (#ff9500) matching modern web standards with clean typography and smooth interactions.

📥 Multiple Export Options

Download generated content as .txt or .md files, or copy directly to clipboard for immediate use.

✨ Features & Capabilities

🔧 Crawling Options

Shallow Crawl: Extracts links from the homepage and main navigation only
Deep Crawl: Explores the whole site through sitemap discovery and recursive link following

🎯 Link Source Selection

All Links: Complete extraction from entire webpage
Header Only: Focus on main navigation and branding links
Navigation Only: Extract menu structure and site hierarchy
Footer Only: Utility links, legal pages, and secondary navigation

🗂️ Intelligent Categorization

Main Pages

About, Contact, Privacy, Terms, Legal

Services

Service offerings, Consulting, Solutions

Products

Product catalog, Shop, Pricing, Store

Tools

Applications, Calculators, Generators

Blog & Resources

Articles, News, Guides, Learning materials

Documentation

API docs, References, Manuals, Wiki

Case Studies

Portfolio, Work examples, Project showcases

Company Info

Team, Careers, Press, Company history
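
The category list above can be approximated with a keyword lookup on URL path segments. This is a minimal sketch, not the generator's actual rules: the keyword map, the `categorizeUrl` function, and the `Other` fallback are all assumptions made for illustration.

```php
<?php
// Hypothetical keyword map: category => prefixes matched against path segments.
// Keywords are derived from the category descriptions; the real crawler may differ.
const CATEGORY_KEYWORDS = [
    'Documentation'    => ['docs', 'documentation', 'api', 'reference', 'manual', 'wiki'],
    'Blog & Resources' => ['blog', 'article', 'news', 'guide', 'learn'],
    'Case Studies'     => ['case-stud', 'portfolio', 'work', 'project'],
    'Services'         => ['service', 'consulting', 'solution'],
    'Products'         => ['product', 'shop', 'pricing', 'store'],
    'Tools'            => ['tool', 'calculator', 'generator', 'application'],
    'Company Info'     => ['team', 'career', 'press', 'history'],
    'Main Pages'       => ['about', 'contact', 'privacy', 'terms', 'legal'],
];

/** Returns the first category whose keyword prefixes any path segment of $url. */
function categorizeUrl(string $url): string
{
    $path = strtolower((string) parse_url($url, PHP_URL_PATH));
    $segments = preg_split('~/+~', $path, -1, PREG_SPLIT_NO_EMPTY);

    foreach (CATEGORY_KEYWORDS as $category => $keywords) {
        foreach ($segments as $segment) {
            foreach ($keywords as $keyword) {
                if (str_starts_with($segment, $keyword)) {
                    return $category;
                }
            }
        }
    }
    return 'Other'; // hypothetical fallback; the source names no default category
}
```

Because the first match wins, the order of the map decides ties: a URL such as `/docs/about` lands in Documentation rather than Main Pages.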

📖 Usage Guide

🚀 Quick Start

Enter a website URL (e.g., https://example.com)
Choose your crawl type: Shallow or Deep
Select link source: All, Header, Navigation, or Footer
Click "Generate llms.txt"
Preview the generated markdown content
Copy to clipboard or download as .txt/.md file

🔍 Deep Crawl Process

A deep crawl discovers pages in the following order:

Sitemap Discovery: Checks /sitemap.xml first
Fallback Options: If not found, tries /sitemap_index.xml
Robots.txt Analysis: Extracts sitemap references from robots.txt
Homepage Fallback: Uses homepage navigation if no sitemaps found
Recursive Exploration: Follows internal links up to 100 pages
Smart Filtering: Excludes feeds, media files, and non-content pages

Note: Deep crawls take longer on large websites because they explore the full site structure and content.
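
The recursive exploration step can be pictured as a breadth-first traversal with a page cap. The sketch below is an assumption about how such a loop might look, not the code in crawler.php: `crawlBreadthFirst` and the `$fetchLinks` callback are hypothetical names, and the callback stands in for the real HTTP fetch so the example runs without network access.

```php
<?php
/**
 * Breadth-first crawl that stops after $maxPages pages.
 * $fetchLinks is a hypothetical callback: given a URL, it returns the
 * internal links found on that page.
 */
function crawlBreadthFirst(string $startUrl, callable $fetchLinks, int $maxPages = 100): array
{
    $visited = [];
    $queue   = [$startUrl];
    $queued  = [$startUrl => true]; // every URL ever queued, to avoid repeats

    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;

        foreach ($fetchLinks($url) as $link) {
            if (!isset($queued[$link])) {
                $queued[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// Usage with an in-memory link graph instead of real HTTP requests.
$site = [
    'https://example.com/'      => ['https://example.com/about', 'https://example.com/blog'],
    'https://example.com/about' => ['https://example.com/'],
    'https://example.com/blog'  => ['https://example.com/blog/post-1'],
];
$pages = crawlBreadthFirst(
    'https://example.com/',
    fn (string $url): array => $site[$url] ?? [],
    3
);
```

Breadth-first order means that when the cap is reached, the pages closest to the homepage have already been collected.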

⚙️ Technical Specifications

🛠️ Technology Stack

Frontend: HTML5, CSS3, JavaScript (ES6+)
Backend: PHP 8.2 with cURL and DOMDocument
Server: PHP built-in development server
Dependencies: php82, php82Extensions.curl, php82Extensions.dom, php82Extensions.mbstring

🔧 Sitemap Discovery Algorithm

1. Try /sitemap.xml
2. If not found → try /sitemap_index.xml  
3. If not found → parse /robots.txt for Sitemap: entries
4. If no sitemaps → fallback to homepage crawling

For sitemap indexes:
- Recursively parse all child sitemaps
- Extract all <loc> entries (page URLs)
- Filter and deduplicate results
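
The parsing side of the steps above can be sketched with PHP's DOMDocument. This is an illustration under assumptions rather than the tool's own code: `sitemapsFromRobots`, `parseSitemap`, `collectPageUrls`, the `$fetch` callback, and the depth guard of 3 are hypothetical, and fetching is injected so the example needs no network access.

```php
<?php
/** Returns the URLs listed on "Sitemap:" lines of a robots.txt body. */
function sitemapsFromRobots(string $robotsTxt): array
{
    preg_match_all('/^[ \t]*Sitemap:[ \t]*(\S+)/mi', $robotsTxt, $matches);
    return array_values(array_unique($matches[1]));
}

/** Parses a sitemap or sitemap index into its deduplicated <loc> values. */
function parseSitemap(string $xml): array
{
    $doc = new DOMDocument();
    // loadXML() throws on empty input and warns on malformed XML.
    if (trim($xml) === '' || !@$doc->loadXML($xml)) {
        return ['isIndex' => false, 'locs' => []];
    }
    $locs = [];
    foreach ($doc->getElementsByTagName('loc') as $node) {
        $locs[] = trim($node->textContent);
    }
    return [
        // In an index, each <loc> points at a child sitemap, not a page.
        'isIndex' => $doc->documentElement->localName === 'sitemapindex',
        'locs'    => array_values(array_unique($locs)),
    ];
}

/** Recursively resolves a sitemap URL to page URLs; $fetch returns a body string. */
function collectPageUrls(string $sitemapUrl, callable $fetch, int $depth = 0): array
{
    if ($depth > 3) {
        return []; // hypothetical guard against indexes that reference each other
    }
    $parsed = parseSitemap($fetch($sitemapUrl));
    if (!$parsed['isIndex']) {
        return $parsed['locs'];
    }
    $pages = [];
    foreach ($parsed['locs'] as $childUrl) {
        $pages = array_merge($pages, collectPageUrls($childUrl, $fetch, $depth + 1));
    }
    return array_values(array_unique($pages));
}
```

An empty result from `collectPageUrls` for every candidate sitemap would correspond to step 4, the fallback to homepage crawling.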

🚫 Filtering Rules

The system automatically excludes:

URLs with fragments (#)
Feed URLs (/feed/, RSS, Atom, .xml files)
Media files (.jpg, .png, .pdf, .mp4, etc.)
Tag, archive, and parameterized pages
WordPress attachment and upload directories
External domains (only crawls same domain)
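
The exclusion list above could be expressed as a single predicate. The following is a rough sketch only: `shouldCrawl` is a hypothetical name, and the path patterns (including `/attachment/` and `/wp-content/uploads/` for the WordPress rule) are assumptions about how the rules might be matched. Only the file extensions named in the list are included.

```php
<?php
/** Returns true when $url passes every exclusion rule for the site at $siteHost. */
function shouldCrawl(string $url, string $siteHost): bool
{
    // URLs with fragments.
    if (str_contains($url, '#')) {
        return false;
    }
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return false;
    }
    // External domains: only the same host is crawled.
    if (strcasecmp($parts['host'], $siteHost) !== 0) {
        return false;
    }
    // Parameterized pages.
    if (isset($parts['query'])) {
        return false;
    }
    $path = strtolower($parts['path'] ?? '/');
    // Media and .xml files (extensions taken from the list above).
    if (preg_match('~\.(xml|jpg|png|pdf|mp4)$~', $path)) {
        return false;
    }
    // Feed, tag, archive, and WordPress attachment/upload paths (assumed patterns).
    if (preg_match('~/(feed|rss|atom|tag|archive|attachment|wp-content/uploads)(/|$)~', $path)) {
        return false;
    }
    return true;
}
```

The trailing `(/|$)` in the last pattern keeps a page such as `/feedback` from being mistaken for a feed.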

📊 Performance Specifications

Shallow Crawl: Homepage + immediate navigation (default: 25 pages)
Deep Crawl: Comprehensive sitemap discovery (default: 1200 pages)
Execution Time: 5 minutes for deep crawls, 2 minutes for shallow crawls
Request Timeout: 15 seconds per page request
Memory: 512MB for deep crawls, optimized for large websites
Output: Clean Markdown format with page descriptions and natural site hierarchy
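
As an illustration, the figures above could be gathered into one settings array per crawl type. This sketch assumes the execution time and memory figures are limits applied through PHP's `set_time_limit()` and `memory_limit`, which the source does not state. The function names and array keys are hypothetical.

```php
<?php
/** Returns the documented limits for 'shallow' or 'deep' crawls. */
function crawlLimits(string $crawlType): array
{
    $deep = $crawlType === 'deep';
    return [
        'maxPages'          => $deep ? 1200 : 25,
        'timeLimitSec'      => $deep ? 300 : 120,
        'memoryLimit'       => $deep ? '512M' : null, // no shallow value is documented
        'requestTimeoutSec' => 15,
    ];
}

/** Applies the script-level limits (assumed mechanism, see the note above). */
function applyCrawlLimits(array $limits): void
{
    set_time_limit($limits['timeLimitSec']);
    if ($limits['memoryLimit'] !== null) {
        ini_set('memory_limit', $limits['memoryLimit']);
    }
    // The per-request timeout would be set on each cURL handle, for example:
    // curl_setopt($handle, CURLOPT_TIMEOUT, $limits['requestTimeoutSec']);
}
```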

⚙️ Backend Configuration

To adjust the number of pages crawled, modify these variables in crawler.php:

// Configuration: Adjust these values to control crawling limits
private $shallowCrawlLimit = 25;    // Default limit for shallow crawl
private $deepCrawlLimit = 1200;     // Maximum limit for deep crawl

// Examples:
// For faster crawling: set to 10 and 500
// For more comprehensive: set to 50 and 2000
// Maximum recommended: 100 and 3000 (with proper server resources)

Performance Note: Higher page limits increase crawling time and server load. Deep crawls with 1200+ pages may take 5-15 minutes. See the Admin Features Guide for detailed configuration instructions.

🔧 Link Source Behavior (Shallow Crawl)

All Links: Extracts from header, navigation, footer, and main content areas
Header Only: Links from site header/banner section only
Navigation Only: Links from navigation menus and site structure
Footer Only: Links from footer area (legal, utility pages)
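
One way to implement the four link sources is an XPath query per source, run with DOMDocument and DOMXPath. This is a minimal sketch that assumes the page uses semantic `<header>`, `<nav>`, and `<footer>` elements. The query map and the `extractLinks` function are hypothetical, and the real crawler may detect these regions differently.

```php
<?php
// Hypothetical mapping from link source to an XPath query.
const LINK_SOURCE_XPATH = [
    'all'        => '//a[@href]',
    'header'     => '//header//a[@href]',
    'navigation' => '//nav//a[@href]',
    'footer'     => '//footer//a[@href]',
];

/** Returns the unique href values found in the chosen region of $html. */
function extractLinks(string $html, string $source = 'all'): array
{
    if (trim($html) === '') {
        return []; // loadHTML() throws on empty input
    }
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = LINK_SOURCE_XPATH[$source] ?? LINK_SOURCE_XPATH['all'];

    $links = [];
    foreach ($xpath->query($query) as $anchor) {
        $links[] = trim($anchor->getAttribute('href'));
    }
    return array_values(array_unique($links));
}
```

Links are returned in document order, so the `all` source lists header links first and footer links last on a conventionally structured page.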