How to Configure a QueryBurst Website Crawl for AI & SEO Analysis
Effective crawl configuration determines which pages are analysed and which are excluded. Key settings include URL vs Search Console seeding, subdirectory discovery, path exclusions (starts with, exact match, contains), file type exclusions, and a max page limit that controls crawl scope and processing time.
The Sites & Crawls screen in QueryBurst manages all of these settings, along with previous crawls, recrawling, and crawl quota.
How to Access Sites & Crawls
Click Sites & Crawls in the sidebar, or the Go to Sites & Crawls button on the home screen.
Previous Crawls

The default tab shows all previously crawled sites. Each row displays:
| Column | Description |
|---|---|
| Name | The site's domain |
| Status | Green dot (ready), spinner (crawling/processing), or red dot (failed) |
| URL | The crawl root URL |
| Pages | Total pages indexed (hidden while crawling) |
| Date | When the crawl was created |
Click any ready site to open it and access all reports.
Actions
- Recrawl — Re-index the site with updated configuration (available on ready/failed sites)
- Delete — Permanently remove the site and all associated data (pages, chunks, intelligence, search indices)
- Cancel — Stop a crawl that's currently in progress
Only one crawl can run at a time. While a crawl is active, the + New Crawl tab is disabled.
Starting a New Crawl

Switch to the + New Crawl tab. There are two ways to specify your site:
Enter URL
Type or paste your website URL (e.g. example.com or https://example.com). The protocol is added automatically if omitted. This mode works without a Search Console connection.
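The automatic protocol handling can be pictured with a small sketch. This is not QueryBurst's actual code: the function name `normalize_crawl_url` and the choice of `https` as the default scheme are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def normalize_crawl_url(raw: str, default_scheme: str = "https") -> str:
    """Return the URL with a scheme prepended when the user omitted one.

    Hypothetical helper; the https default is an assumption, not documented behaviour.
    """
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = f"{default_scheme}://{candidate}"
    # Reject input that still has no host part (for example an empty string).
    if not urlsplit(candidate).netloc:
        raise ValueError(f"not a usable site URL: {raw!r}")
    return candidate


print(normalize_crawl_url("example.com"))
print(normalize_crawl_url("https://example.com"))
```

Both calls yield the same crawl root, which is why either form can be entered.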
GSC Property
If Search Console is connected, you can select from your verified properties. This is preferred when available — it ensures the crawl URL matches your GSC property exactly, which enables keyword data in reports like Topic Coverage and Page Insights.
The mode toggle only appears when you have GSC properties available.
Subdirectory Discovery
After entering a URL, click Discover Subdirectories. For GSC properties, discovery runs automatically once a property is selected. Discovery fetches the site's sitemap and maps out all top-level subdirectories.
The panel shows:
- Each subdirectory path (e.g. /blog, /docs, /products)
- The number of URLs found in each
- A total URL count across the entire site
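The counts in this panel amount to grouping sitemap URLs by their first path segment. The sketch below shows one way to do that. The function name and the rule for what counts as a subdirectory (two or more segments, or a trailing slash) are assumptions; QueryBurst's own grouping logic is not documented here.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_top_level_dirs(urls):
    """Count URLs per top-level subdirectory; pages at the site root go under "/"."""
    counts = Counter()
    for url in urls:
        path = urlsplit(url).path
        segments = [s for s in path.split("/") if s]
        # Assumption: a single segment without a trailing slash is a root-level page.
        in_subdir = len(segments) > 1 or (len(segments) == 1 and path.endswith("/"))
        counts[f"/{segments[0]}" if in_subdir else "/"] += 1
    return counts


sitemap_urls = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog/",
    "https://example.com/blog/post-1",
    "https://example.com/docs/setup/install",
]
counts = count_top_level_dirs(sitemap_urls)
for directory, n in sorted(counts.items()):
    print(directory, n)
print("total", sum(counts.values()))
```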
Including and Excluding Subdirectories
Each subdirectory has a checkbox. Uncheck any subdirectory to exclude it from the crawl. This is useful for:
- Skipping large blog archives that aren't relevant to your analysis
- Excluding translated versions of pages (e.g. /fr/, /de/)
- Focusing the crawl on specific sections of a large site
Use Select all / Deselect all for bulk changes. The estimated page count updates as you toggle directories.
Exclude Paths Manually
Below the subdirectory panel, you can add custom path exclusions. Three match types are available:
| Match Type | Behaviour | Example |
|---|---|---|
| Starts with | Excludes the path and everything under it | /blog/archive excludes /blog/archive/2023, /blog/archive/old etc. |
| Exact path | Excludes one specific URL | /about/old-team excludes only that page |
| Contains | Excludes any URL containing the segment | /fr/ excludes all French language pages |
Added patterns appear as removable pills. These combine with subdirectory exclusions — both are applied during the crawl.
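To make the three match types concrete, here is a minimal sketch of how such rules could be evaluated against a URL path. The `Exclusion` class, the match-type strings, and the segment-aware handling of "Starts with" are illustrative assumptions, not QueryBurst's implementation.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Exclusion:
    match_type: str  # hypothetical values: "starts_with", "exact", "contains"
    pattern: str


def is_excluded(url: str, exclusions) -> bool:
    """Return True when any rule matches the URL's path."""
    path = urlsplit(url).path or "/"
    for rule in exclusions:
        if rule.match_type == "starts_with":
            # Assumption: match the path itself or anything under it,
            # so /blog/archive does not catch /blog/archived-post.
            prefix = rule.pattern.rstrip("/")
            if path == prefix or path.startswith(prefix + "/"):
                return True
        elif rule.match_type == "exact" and path == rule.pattern:
            return True
        elif rule.match_type == "contains" and rule.pattern in path:
            return True
    return False


rules = [
    Exclusion("starts_with", "/blog/archive"),
    Exclusion("exact", "/about/old-team"),
    Exclusion("contains", "/fr/"),
]
for url in (
    "https://example.com/blog/archive/2023",
    "https://example.com/about/old-team",
    "https://example.com/fr/pricing",
    "https://example.com/pricing",
):
    print(url, is_excluded(url, rules))
```

Subdirectory checkboxes could be modelled the same way, as "starts_with" rules generated from each unchecked directory.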
Excluded File Types
An expandable panel lists common non-content file types that are excluded by default:
- .pdf, .xml, .jpg, .png, .gif, .svg, .webp (media and documents)
- .zip (archives)
- .css, .js (static assets)
Uncheck any type to include it in the crawl (rarely needed). These exclusions prevent the crawler from wasting page quota on binary or non-content URLs.
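A file type filter of this kind usually comes down to checking the extension of the URL path. The sketch below uses the default list from this section. The function name and the case-insensitive comparison are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Default exclusions as listed in this section.
DEFAULT_EXCLUDED_EXTENSIONS = frozenset(
    {".pdf", ".xml", ".jpg", ".png", ".gif", ".svg", ".webp", ".zip", ".css", ".js"}
)


def has_excluded_extension(url: str, excluded=DEFAULT_EXCLUDED_EXTENSIONS) -> bool:
    """Return True when the URL path ends in an excluded file extension."""
    # urlsplit drops the query string, so /app.js?v=3 is still seen as .js.
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    return suffix in excluded


print(has_excluded_extension("https://example.com/files/report.PDF"))
print(has_excluded_extension("https://example.com/blog/post"))

# Unchecking a type corresponds to removing it from the set.
with_pdfs = DEFAULT_EXCLUDED_EXTENSIONS - {".pdf"}
print(has_excluded_extension("https://example.com/files/report.pdf", with_pdfs))
```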
Max Pages
A slider controls the maximum number of pages to crawl. The range is 50 to 5,000 (or your remaining monthly quota, whichever is lower). The default is 500.
If subdirectory discovery found an estimated page count, it's shown next to the label. Setting the slider higher than the estimate is fine — the crawl will simply stop when all reachable pages are indexed.
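The slider limits can be expressed as a short calculation. This sketch uses the numbers stated above (50, 5,000, default 500). The function names are hypothetical, and the behaviour when fewer than 50 pages of quota remain is an assumption, since this article does not describe that case.

```python
MIN_PAGES = 50
PER_CRAWL_CAP = 5_000
DEFAULT_MAX_PAGES = 500


def slider_range(remaining_quota: int) -> tuple:
    """Return (lowest, highest) selectable max-pages values."""
    upper = max(0, min(PER_CRAWL_CAP, remaining_quota))
    # Assumption: if fewer than 50 pages of quota remain, the range collapses to that amount.
    lower = min(MIN_PAGES, upper)
    return lower, upper


def pages_crawled(max_pages: int, reachable_pages: int) -> int:
    """The crawl stops at the limit or when every reachable page is indexed."""
    return min(max_pages, reachable_pages)


print(slider_range(10_000))
print(slider_range(1_200))
print(pages_crawled(2_000, 340))
```

The last call illustrates why a slider value above the estimate does no harm: only the reachable pages are indexed.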
Time Estimates
| Max Pages | Estimated Full Pipeline Time |
|---|---|
| Up to 100 | 10–20 minutes |
| 100–500 | 30–45 minutes |
| 500–1,000 | ~90 minutes |
| 1,000–2,000 | ~2 hours |
| 2,000–3,000 | ~3 hours |
| 3,000–5,000 | Up to 4 hours |
Core tools (Page Reports, Link Analysis, search) become available roughly halfway through — you don't need to wait for the full intelligence pipeline.
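For planning purposes, the table above can be turned into a simple lookup. The sketch below is only a restatement of the table. Treating each upper bound as inclusive (so exactly 500 pages falls in the 100 to 500 row) is an assumption, because the published ranges share their boundaries.

```python
from bisect import bisect_left

# Upper bound of each row in the time estimates table, with its estimate.
_UPPER_BOUNDS = [100, 500, 1_000, 2_000, 3_000, 5_000]
_ESTIMATES = [
    "10-20 minutes",
    "30-45 minutes",
    "~90 minutes",
    "~2 hours",
    "~3 hours",
    "Up to 4 hours",
]


def estimated_pipeline_time(max_pages: int) -> str:
    """Look up the full-pipeline estimate for a max-pages setting."""
    if not 1 <= max_pages <= _UPPER_BOUNDS[-1]:
        raise ValueError("max_pages must be between 1 and 5,000")
    # bisect_left finds the first row whose upper bound is >= max_pages.
    return _ESTIMATES[bisect_left(_UPPER_BOUNDS, max_pages)]


for pages in (50, 500, 501, 5_000):
    print(pages, estimated_pipeline_time(pages))
```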
Crawl Quota
A quota display shows your monthly page usage:
- Pages used / Pages remaining / Monthly limit
- Pro plans include 10,000 pages per month
- Individual crawls are capped at 5,000 pages
If your remaining quota is lower than the site's page count, the crawl will still run but will only index pages up to your limit.
Recrawling a Site
Click Recrawl on any ready or failed site. A configuration modal opens with:
- Max pages — Defaults to the previous crawl's setting
- Excluded paths — Pre-populated from the previous crawl's configuration
You can adjust both before starting. The recrawl re-fetches all pages and rebuilds the link graph. The intelligence pipeline also re-runs, but content-hash caching means it only re-processes pages whose content has changed.
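Content-hash caching generally works by storing a digest of each page's content and comparing it on the next fetch. The sketch below illustrates the idea under stated assumptions: SHA-256 as the hash, the page's markdown as the hashed input, and the function names are all chosen for illustration and are not taken from QueryBurst.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Digest of a page's content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def pages_to_reprocess(previous_hashes: dict, fetched: dict) -> list:
    """Return URLs that are new or whose content differs from the stored hash."""
    changed = []
    for url, markdown in fetched.items():
        if previous_hashes.get(url) != content_hash(markdown):
            changed.append(url)
    return changed


previous = {
    "/pricing": content_hash("Plans start at ..."),
    "/about": content_hash("Our old story"),
}
fetched = {
    "/pricing": "Plans start at ...",   # unchanged, skipped
    "/about": "Our new story",          # changed, re-processed
    "/careers": "We are hiring",        # new page, processed
}
print(pages_to_reprocess(previous, fetched))
```

Every page is still fetched in this model; the saving comes from skipping the expensive processing for pages whose digest is unchanged.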
Partial Recrawl Warning
If the site has more pages than your remaining quota, a warning appears. Pages not reached during the recrawl will be archived and their data removed.
What Happens After Starting a Crawl
- The page switches to the Previous Crawls tab with a spinner on the new site
- Pages are fetched and converted to markdown
- Content is split into chunks for retrieval
- The internal link graph is built (structural depth, PageRank, orphan detection)
- Homepage redirect detection runs automatically
- Paginated URLs are detected and excluded from analysis
- Once the crawl completes, the Site Intelligence pipeline starts automatically
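Two of the link graph measures listed above, structural depth and orphan detection, can be sketched with a breadth-first search. This is a generic illustration rather than QueryBurst's implementation: the function names, the adjacency-dict input, and the definition of an orphan as a known page with no inbound internal links are assumptions.

```python
from collections import deque


def structural_depths(links: dict, root: str) -> dict:
    """Minimum number of clicks from the root to each reachable page (BFS)."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def orphan_pages(links: dict, root: str) -> list:
    """Known pages (keys of links) that no other page links to."""
    linked_to = {t for targets in links.values() for t in targets}
    return sorted(p for p in links if p != root and p not in linked_to)


# Hypothetical site: each key is a page, each value its outgoing internal links.
site_links = {
    "/": ["/blog", "/pricing"],
    "/blog": ["/blog/post-1", "/"],
    "/pricing": [],
    "/blog/post-1": ["/blog"],
    "/old-landing": ["/"],  # e.g. found via the sitemap, but nothing links to it
}
print(structural_depths(site_links, "/"))
print(orphan_pages(site_links, "/"))
```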
Click the site at any time after it reaches "ready" status to start exploring reports. Intelligence-dependent features (Topics, Issues) become available once the pipeline finishes.
Tips
- Start small: For your first crawl, use 500 pages to get results quickly. You can always recrawl with a higher limit
- Exclude non-content paths: Blog archives, tag pages, and translated duplicates waste quota without adding analytical value
- Use subdirectory discovery: It's much easier than manually listing exclusion patterns
- Check the site's URL: If your site redirects (e.g. domain.com → domain.com/en/), QueryBurst auto-detects this, but entering the final URL directly avoids confusion
- Recrawl after content changes: Only changed pages are re-processed, so recrawls are faster than initial crawls