📊 Current Configuration Overview
The llms.txt Generator currently supports two crawling modes with configurable limits:
| Crawl Type | Current Limit | Recommended Range | Maximum Recommended |
| --- | --- | --- | --- |
| Shallow Crawl | 25 pages | 10-50 pages | 100 pages |
| Deep Crawl | 1200 pages | 100-1500 pages | 2000 pages |
📍 File Location: All crawl limits are configured in the crawler.php file, specifically at lines 19-20.
🚀 Step-by-Step: Increasing Crawl Limits
📂 Step 1: Locate the Configuration File
- Open your project directory
- Find the file named crawler.php in the root folder
- Open crawler.php in your text editor or IDE
- Navigate to approximately lines 19-20 (near the top of the class definition)
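Because the line numbers are only approximate, it can help to search for the two properties instead of counting lines. The following is a minimal sketch, not part of the generator itself: `findCrawlLimitLines()` is a hypothetical helper, and it assumes the properties are declared as `$shallowCrawlLimit` and `$deepCrawlLimit` with integer defaults.

```php
<?php
// Hypothetical helper: report the 1-based line numbers (and current values)
// of the crawl-limit property declarations in a PHP source file.
function findCrawlLimitLines(string $path): array
{
    $found = [];
    if (!is_readable($path)) {
        return $found; // nothing to report if the file cannot be read
    }
    $lines = file($path, FILE_IGNORE_NEW_LINES);
    foreach ($lines as $index => $line) {
        // Matches declarations such as: private $deepCrawlLimit = 1200;
        if (preg_match('/\$(shallowCrawlLimit|deepCrawlLimit)\s*=\s*(\d+)/', $line, $m)) {
            $found[$m[1]] = ['line' => $index + 1, 'value' => (int) $m[2]];
        }
    }
    return $found;
}

// Usage (assumes crawler.php sits next to this script):
// print_r(findCrawlLimitLines(__DIR__ . '/crawler.php'));
```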
🔍 Step 2: Find the Current Configuration
Look for these specific lines in the WebsiteCrawler class:
// Configuration: Adjust these values to control crawling limits
private $shallowCrawlLimit = 25; // Default limit for shallow crawl
private $deepCrawlLimit = 1200; // Maximum limit for deep crawl
✏️ Step 3: Modify the Limits
📈 To Increase Shallow Crawl Limit:
// Original
private $shallowCrawlLimit = 25;
// Examples of increases (keep only one declaration):
private $shallowCrawlLimit = 50; // For medium sites
private $shallowCrawlLimit = 100; // For large navigation structures
private $shallowCrawlLimit = 200; // For comprehensive shallow crawl (above the 100-page recommended maximum)
📈 To Increase Deep Crawl Limit:
// Original
private $deepCrawlLimit = 1200;
// Examples of increases (keep only one declaration):
private $deepCrawlLimit = 1500; // For large corporate sites
private $deepCrawlLimit = 2000; // For comprehensive analysis
private $deepCrawlLimit = 3000; // For maximum coverage (above the 2000-page recommended maximum; use with caution)
⚠️ Important: Increasing limits above 2000 pages may cause performance issues and timeouts. Monitor server resources carefully.
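Before committing a new value, you may want to compare it against the recommended ranges from the overview table. The sketch below does only that; `assessCrawlLimit()` and the `CRAWL_LIMIT_GUIDANCE` constant are hypothetical names introduced for illustration, and the numbers are copied from the table.

```php
<?php
// Ranges copied from the overview table; names are illustrative assumptions.
const CRAWL_LIMIT_GUIDANCE = [
    'shallow' => ['min' => 10,  'max' => 50,   'ceiling' => 100],
    'deep'    => ['min' => 100, 'max' => 1500, 'ceiling' => 2000],
];

// Classify a proposed limit relative to the documented guidance.
function assessCrawlLimit(string $crawlType, int $limit): string
{
    if (!array_key_exists($crawlType, CRAWL_LIMIT_GUIDANCE)) {
        throw new InvalidArgumentException("Unknown crawl type: $crawlType");
    }
    $g = CRAWL_LIMIT_GUIDANCE[$crawlType];
    if ($limit < 1) {
        return 'invalid';
    }
    if ($limit > $g['ceiling']) {
        return 'above maximum recommended';
    }
    if ($limit > $g['max']) {
        return 'above recommended range';
    }
    if ($limit < $g['min']) {
        return 'below recommended range';
    }
    return 'within recommended range';
}

echo assessCrawlLimit('shallow', 200), PHP_EOL; // above maximum recommended
echo assessCrawlLimit('deep', 2000), PHP_EOL;   // above recommended range
echo assessCrawlLimit('deep', 1200), PHP_EOL;   // within recommended range
```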
💾 Step 4: Save and Test
- Save the crawler.php file
- If using a development server, restart it to apply changes
- Test with a small website first to verify the changes work
- Monitor performance and adjust if needed
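One way to confirm that the edited values are the ones PHP actually loads is to read the class defaults through reflection. This is a sketch under assumptions: `readCrawlLimits()` is a hypothetical helper, and `WebsiteCrawlerStandIn` is a stand-in class so the example runs on its own. In a real project you would load crawler.php and pass `'WebsiteCrawler'` instead.

```php
<?php
// Stand-in for the real class so this sketch is self-contained.
class WebsiteCrawlerStandIn
{
    private $shallowCrawlLimit = 50;
    private $deepCrawlLimit = 1500;
}

// Hypothetical helper: read the default values of the two limit properties.
function readCrawlLimits(string $className): array
{
    // getDefaultProperties() also returns private properties.
    $defaults = (new ReflectionClass($className))->getDefaultProperties();
    return [
        'shallow' => $defaults['shallowCrawlLimit'] ?? null,
        'deep'    => $defaults['deepCrawlLimit'] ?? null,
    ];
}

print_r(readCrawlLimits('WebsiteCrawlerStandIn'));
```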
📉 Step-by-Step: Decreasing Crawl Limits
🎯 When to Decrease Limits
- Server has limited processing power or memory
- You need faster crawling for quick previews
- Working with very slow websites
- Reducing server load during peak usage
✏️ Decrease Configuration Examples
📉 For Faster Shallow Crawls:
// Original
private $shallowCrawlLimit = 25;
// Decreased options (keep only one declaration):
private $shallowCrawlLimit = 10; // Quick preview only
private $shallowCrawlLimit = 15; // Basic navigation structure
private $shallowCrawlLimit = 5; // Minimal crawl (homepage + key pages)
📉 For Faster Deep Crawls:
// Original
private $deepCrawlLimit = 1200;
// Decreased options (keep only one declaration):
private $deepCrawlLimit = 500; // Medium-depth analysis
private $deepCrawlLimit = 200; // Limited deep crawl
private $deepCrawlLimit = 100; // Shallow-deep hybrid
private $deepCrawlLimit = 50; // Minimal deep crawl
✅ Benefits of Lower Limits: Faster processing, reduced server load, quicker results, better for testing and development.
⚙️ Advanced Configuration Options
🕐 Execution Time Limits
If you increase crawl limits significantly, you may also need to adjust execution time:
// Find this section in the crawl() method (around lines 31-33):
if ($crawlType === 'deep') {
    set_time_limit(300); // 5 minutes for deep crawls
    ini_set('memory_limit', '512M'); // Increase memory for large sites
}
// Increase the time limit for larger crawls (use one value):
set_time_limit(600); // 10 minutes
set_time_limit(900); // 15 minutes
set_time_limit(1200); // 20 minutes
// Increase memory for very large sites (use one value):
ini_set('memory_limit', '1G'); // 1 GB
ini_set('memory_limit', '2G'); // 2 GB
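Some hosts restrict these settings, so it may be worth checking whether the calls succeeded rather than assuming they did. A minimal sketch follows; `applyCrawlResources()` is a hypothetical wrapper, not a function from crawler.php. It relies on `set_time_limit()` returning a boolean and `ini_set()` returning `false` on failure.

```php
<?php
// Hypothetical wrapper: apply time and memory settings and report success.
function applyCrawlResources(int $seconds, string $memory): array
{
    // set_time_limit() returns false if the limit could not be changed;
    // the function may also be disabled entirely on some hosts.
    $timeOk = function_exists('set_time_limit') && set_time_limit($seconds);
    // ini_set() returns the previous value on success, false on failure.
    $memoryOk = ini_set('memory_limit', $memory) !== false;
    return ['time_ok' => $timeOk, 'memory_ok' => $memoryOk];
}

$result = applyCrawlResources(600, '1G');
if (!$result['time_ok'] || !$result['memory_ok']) {
    error_log('Could not raise crawl resource limits: ' . json_encode($result));
}
```

If either flag comes back `false`, lowering the crawl limit is likely the safer fix than raising the resource settings further.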
🌐 Request Timeout Settings
Adjust individual page request timeouts (around line 106):
// Original
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
// Adjust for slower sites:
curl_setopt($ch, CURLOPT_TIMEOUT, 30); // 30 seconds per page
curl_setopt($ch, CURLOPT_TIMEOUT, 45); // 45 seconds per page
curl_setopt($ch, CURLOPT_TIMEOUT, 60); // 1 minute per page (for very slow sites)
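The per-page timeout and the crawl limit multiply, which is easy to overlook. The sketch below computes a rough upper bound on crawl duration, assuming pages are fetched one at a time and every request runs until the timeout expires. Both are assumptions, since the source does not say how the crawler schedules requests. `worstCaseCrawlSeconds()` is a hypothetical helper.

```php
<?php
// Hypothetical helper: worst-case duration for a strictly sequential crawl
// in which every request takes the full CURLOPT_TIMEOUT.
function worstCaseCrawlSeconds(int $pageLimit, int $timeoutPerPage): int
{
    return $pageLimit * $timeoutPerPage;
}

echo worstCaseCrawlSeconds(25, 15), PHP_EOL;   // 375
echo worstCaseCrawlSeconds(1200, 15), PHP_EOL; // 18000
echo worstCaseCrawlSeconds(1200, 60), PHP_EOL; // 72000
```

Real crawls rarely hit this bound, but the figures suggest that raising both the page limit and the per-page timeout together deserves a test run first.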
📊 Recommended Combinations
| Use Case | Shallow Limit | Deep Limit | Time Limit | Memory |
| --- | --- | --- | --- | --- |
| Quick Testing | 10 | 50 | 120s | 256M |
| Standard Use | 25 | 1200 | 300s | 512M |
| Large Sites | 50 | 2000 | 600s | 1G |
| Enterprise | 100 | 3000 | 1200s | 2G |
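If you switch between these combinations often, keeping them in one place may reduce copy-and-paste mistakes. The following sketch encodes the table as a PHP constant. The `CRAWL_PRESETS` name, the preset keys, and `crawlPreset()` are illustrative assumptions, and wiring the values into crawler.php is left to you.

```php
<?php
// Values copied from the Recommended Combinations table; names are hypothetical.
const CRAWL_PRESETS = [
    'quick_testing' => ['shallow' => 10,  'deep' => 50,   'time' => 120,  'memory' => '256M'],
    'standard'      => ['shallow' => 25,  'deep' => 1200, 'time' => 300,  'memory' => '512M'],
    'large_sites'   => ['shallow' => 50,  'deep' => 2000, 'time' => 600,  'memory' => '1G'],
    'enterprise'    => ['shallow' => 100, 'deep' => 3000, 'time' => 1200, 'memory' => '2G'],
];

// Look up a preset by name, failing loudly on typos.
function crawlPreset(string $name): array
{
    if (!array_key_exists($name, CRAWL_PRESETS)) {
        throw new InvalidArgumentException("Unknown preset: $name");
    }
    return CRAWL_PRESETS[$name];
}

$preset = crawlPreset('large_sites');
printf("deep=%d time=%ds memory=%s\n", $preset['deep'], $preset['time'], $preset['memory']);
// prints: deep=2000 time=600s memory=1G
```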
🚨 Performance Considerations & Warnings
⚠️ Important Warnings
Server Resources: Higher limits consume more CPU, memory, and bandwidth. Monitor your server resources carefully.
Timeout Risks: Very high limits may cause PHP or web server timeouts. Always test changes incrementally.
Target Site Impact: Aggressive crawling may overload target websites. Be respectful of rate limits.
📊 Performance Impact Guide
- 10-50 pages: Minimal impact, fast processing (under 30 seconds)
- 51-200 pages: Light impact, moderate processing (1-2 minutes)
- 201-500 pages: Moderate impact, longer processing (2-5 minutes)
- 501-1500 pages: High impact, extended processing (5-15 minutes)
- More than 1500 pages: Very high impact, may require optimization (15+ minutes)
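The guide above can be turned into a quick lookup for sanity-checking a planned limit. This is a sketch only: `crawlImpactTier()` is a hypothetical helper, the timings are the estimates from the list, and each boundary value (50, 200, 500, 1500) is assumed to belong to the lower tier.

```php
<?php
// Hypothetical helper: map a page count to the impact tier from the guide.
// Boundary values are assigned to the lower tier (an assumption).
function crawlImpactTier(int $pages): string
{
    if ($pages <= 50) {
        return 'minimal impact, under 30 seconds';
    }
    if ($pages <= 200) {
        return 'light impact, 1-2 minutes';
    }
    if ($pages <= 500) {
        return 'moderate impact, 2-5 minutes';
    }
    if ($pages <= 1500) {
        return 'high impact, 5-15 minutes';
    }
    return 'very high impact, 15+ minutes';
}

echo crawlImpactTier(25), PHP_EOL;   // minimal impact, under 30 seconds
echo crawlImpactTier(1200), PHP_EOL; // high impact, 5-15 minutes
```

Note that the default deep limit of 1200 lands in the 5-15 minute tier, while the default deep-crawl time limit is 300 seconds, so a full deep crawl may need a higher `set_time_limit()` value.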
🔧 Optimization Tips
- Test Incrementally: Start with small increases and monitor performance
- Monitor Memory: Watch server memory usage during large crawls
- Consider Caching: Implement result caching for frequently crawled sites
- Use Appropriate Mode: Use shallow crawl for quick analysis, deep crawl only when needed
- Monitor Logs: Check server logs for errors or timeouts
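For the caching tip, a simple file-based cache is often enough to avoid re-crawling the same site within a short window. The sketch below is one possible approach, not the generator's own mechanism. `cachedCrawl()` is hypothetical, the `$crawl` callable stands in for whatever function performs the crawl, and results are assumed to be arrays that survive a JSON round trip.

```php
<?php
// Hypothetical file-based cache keyed by crawl type and URL.
function cachedCrawl(
    string $url,
    string $crawlType,
    int $ttlSeconds,
    callable $crawl,
    ?string $cacheDir = null
): array {
    $cacheDir = $cacheDir ?? sys_get_temp_dir();
    $file = $cacheDir . '/llms_crawl_' . md5($crawlType . '|' . $url) . '.json';

    // Reuse the cached result if it exists and is younger than the TTL.
    if (is_file($file) && (time() - filemtime($file)) < $ttlSeconds) {
        $cached = json_decode((string) file_get_contents($file), true);
        if (is_array($cached)) {
            return $cached;
        }
    }

    // Cache miss: run the expensive crawl and store its result.
    $result = $crawl($url, $crawlType);
    file_put_contents($file, json_encode($result), LOCK_EX);
    return $result;
}

// Usage with a placeholder crawl function:
// $pages = cachedCrawl('https://example.com', 'shallow', 3600, $myCrawlFunction);
```

A short TTL (for example one hour) keeps results reasonably fresh while still absorbing repeated requests for the same site.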
🛠️ Troubleshooting Common Issues
❌ Problem: "Maximum execution time exceeded"
Solution: Increase the set_time_limit() value or decrease crawl limits.
❌ Problem: "Out of memory" errors
Solution: Increase memory_limit or reduce crawl limits for large sites.
❌ Problem: Slow crawling performance
Solution: Lower CURLOPT_TIMEOUT so unresponsive pages are abandoned sooner, or fetch pages in parallel (for example, with PHP's curl_multi_* functions).
❌ Problem: Incomplete results
Solution: Check whether the crawl limits are too low or whether the target site restricts crawlers (for example, through rate limiting).
✅ Testing Your Changes
- Start with a small, known website
- Test shallow crawl first, then deep crawl
- Monitor the browser developer console for errors
- Check server logs for any issues
- Gradually test with larger, more complex sites