How to Configure a QueryBurst Website Crawl for AI & SEO Analysis

Effective crawl configuration determines which pages are analysed and which are excluded. Key settings include URL vs Search Console seeding, subdirectory discovery, path exclusions (starts with, exact match, contains), file type exclusions, and a max page limit that controls crawl scope and processing time.

The Sites & Crawls screen in QueryBurst manages all of these settings, along with previous crawls, recrawling, and crawl quota.

How to Access Sites & Crawls

Click Sites & Crawls in the sidebar, or the Go to Sites & Crawls button on the home screen.

Previous Crawls

The default tab shows all previously crawled sites. Each row displays:

Column   Description
Name     The site's domain
Status   Green dot (ready), spinner (crawling/processing), or red dot (failed)
URL      The crawl root URL
Pages    Total pages indexed (hidden while crawling)
Date     When the crawl was created

Click any ready site to open it and access all reports.

Actions

  • Recrawl — Re-index the site with updated configuration (available on ready/failed sites)
  • Delete — Permanently remove the site and all associated data (pages, chunks, intelligence, search indices)
  • Cancel — Stop a crawl that's currently in progress

Only one crawl can run at a time. While a crawl is active, the + New Crawl tab is disabled.

Starting a New Crawl

Switch to the + New Crawl tab. There are two ways to specify your site:

Enter URL

Type or paste your website URL (e.g. example.com or https://example.com). The protocol is added automatically if omitted. This mode works without a Search Console connection.
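
The automatic protocol handling can be pictured with a short Python sketch. It is an illustration, not QueryBurst's code: the function name is hypothetical, and the choice of https as the protocol that gets added is an assumption, since the text above does not say which one is used.

```python
from urllib.parse import urlparse


def normalize_crawl_url(raw: str, default_scheme: str = "https") -> str:
    """Prepend a scheme when the user typed a bare domain.

    Assumption: https is the scheme that gets added; the help text
    only says the protocol is added automatically.
    """
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = f"{default_scheme}://{candidate}"
    # Reject input that still has no host part after normalisation.
    if not urlparse(candidate).netloc:
        raise ValueError(f"not a usable URL: {raw!r}")
    return candidate


print(normalize_crawl_url("example.com"))          # https://example.com
print(normalize_crawl_url("https://example.com"))  # https://example.com
```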

GSC Property

If Search Console is connected, you can select from your verified properties. This is preferred when available — it ensures the crawl URL matches your GSC property exactly, which enables keyword data in reports like Topic Coverage and Page Insights.

The mode toggle only appears when you have GSC properties available.

Subdirectory Discovery

After entering a URL, click Discover Subdirectories. For GSC properties, discovery runs automatically once a property is selected. This fetches the site's sitemap and maps out all top-level subdirectories.

The panel shows:

  • Each subdirectory path (e.g. /blog, /docs, /products)
  • The number of URLs found in each
  • A total URL count across the entire site
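
The counts in this panel amount to grouping sitemap URLs by their first path segment. The Python sketch below shows one way to do that. It is illustrative only: the function name is hypothetical, the sitemap fetch is left out, and counting single-segment pages such as /about toward the site root is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse


def count_top_level_subdirectories(urls):
    """Group sitemap URLs by top-level subdirectory and count them."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path
        segments = [s for s in path.split("/") if s]
        # Assumption: a URL belongs to a subdirectory when it has more than
        # one segment or ends with a slash; otherwise it counts as root-level.
        in_subdir = bool(segments) and (len(segments) > 1 or path.endswith("/"))
        counts[f"/{segments[0]}" if in_subdir else "/"] += 1
    return counts


sitemap_urls = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog/first-post",
    "https://example.com/blog/second-post",
    "https://example.com/docs/intro",
]
counts = count_top_level_subdirectories(sitemap_urls)
print(dict(counts))          # per-subdirectory URL counts
print(sum(counts.values()))  # total URL count across the site
```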

Including and Excluding Subdirectories

Each subdirectory has a checkbox. Uncheck any subdirectory to exclude it from the crawl. This is useful for:

  • Skipping large blog archives that aren't relevant to your analysis
  • Excluding translated versions of pages (e.g. /fr/, /de/)
  • Focusing the crawl on specific sections of a large site

Use Select all / Deselect all for bulk changes. The estimated page count updates as you toggle directories.

Exclude Paths Manually

Below the subdirectory panel, you can add custom path exclusions. Three match types are available:

Match Type    Behaviour                                   Example
Starts with   Excludes the path and everything under it   /blog/archive excludes /blog/archive/2023, /blog/archive/old, etc.
Exact path    Excludes one specific URL                   /about/old-team excludes only that page
Contains      Excludes any URL containing the segment     /fr/ excludes all French language pages

Added patterns appear as removable pills. These combine with subdirectory exclusions — both are applied during the crawl.
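
The three match types can be expressed as a small Python sketch. This is an interpretation of the behaviour described above, not QueryBurst's implementation: the rule names and the function are hypothetical, and "Starts with" is read here as matching on whole path segments, so /blog/archive does not catch /blog/archive-news.

```python
from urllib.parse import urlparse


def is_excluded(url, rules):
    """Return True when any (match_type, pattern) rule matches the URL path."""
    path = urlparse(url).path or "/"
    for match_type, pattern in rules:
        if match_type == "starts_with":
            # The path itself, or anything nested below it.
            prefix = pattern.rstrip("/")
            if path == prefix or path.startswith(prefix + "/"):
                return True
        elif match_type == "exact":
            if path == pattern:
                return True
        elif match_type == "contains":
            if pattern in path:
                return True
    return False


rules = [
    ("starts_with", "/blog/archive"),
    ("exact", "/about/old-team"),
    ("contains", "/fr/"),
]
print(is_excluded("https://example.com/blog/archive/2023", rules))  # True
print(is_excluded("https://example.com/about/old-team/bio", rules))  # False
print(is_excluded("https://example.com/fr/pricing", rules))          # True
```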

Excluded File Types

An expandable panel lists common non-content file types that are excluded by default:

  • .pdf, .xml, .jpg, .png, .gif, .svg, .webp — media and documents
  • .zip — archives
  • .css, .js — static assets

Uncheck any type to include it in the crawl (rarely needed). These exclusions prevent the crawler from wasting page quota on binary or non-content URLs.
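
A file type filter of this kind usually compares the extension of the URL path against a set. The Python sketch below uses the default list above. The function name is hypothetical, and ignoring case and query strings is an assumption about how such a check would reasonably behave.

```python
import posixpath
from urllib.parse import urlparse

# Default exclusions as listed in the panel.
DEFAULT_EXCLUDED_EXTENSIONS = frozenset({
    ".pdf", ".xml", ".jpg", ".png", ".gif", ".svg", ".webp",
    ".zip", ".css", ".js",
})


def has_excluded_extension(url, excluded=DEFAULT_EXCLUDED_EXTENSIONS):
    """Check the path's extension, ignoring case and any query string."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension in excluded


print(has_excluded_extension("https://example.com/files/report.PDF?dl=1"))  # True
print(has_excluded_extension("https://example.com/blog/post"))              # False

# Unchecking .pdf in the panel corresponds to removing it from the set.
with_pdfs = DEFAULT_EXCLUDED_EXTENSIONS - {".pdf"}
print(has_excluded_extension("https://example.com/guide.pdf", with_pdfs))   # False
```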

Max Pages

A slider controls the maximum number of pages to crawl. The range is 50 to 5,000 (or your remaining monthly quota, whichever is lower). The default is 500.

If subdirectory discovery found an estimated page count, it's shown next to the label. Setting the slider higher than the estimate is fine — the crawl will simply stop when all reachable pages are indexed.
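
The slider limits reduce to a little arithmetic, sketched here in Python. The constants come from the figures above. The function names are hypothetical, and the handling of a remaining quota below 50 pages is an assumption, since that case is not described.

```python
MIN_PAGES = 50         # lower end of the slider
PER_CRAWL_CAP = 5000   # upper end of the slider
DEFAULT_MAX_PAGES = 500


def effective_max_pages(requested, remaining_quota):
    """Clamp a requested page limit to the slider's allowed range."""
    upper = min(PER_CRAWL_CAP, remaining_quota)
    # Assumption: with fewer than 50 pages of quota left, the quota itself
    # becomes the only available setting.
    lower = min(MIN_PAGES, upper)
    return max(lower, min(requested, upper))


def pages_crawled(max_pages, reachable_pages):
    """The crawl stops at the limit or when the site runs out of pages."""
    return min(max_pages, reachable_pages)


print(effective_max_pages(DEFAULT_MAX_PAGES, 10_000))  # 500
print(effective_max_pages(8_000, 10_000))              # 5000
print(effective_max_pages(3_000, 1_200))               # 1200
print(pages_crawled(2_000, 740))                       # 740
```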

Time Estimates

Max Pages      Estimated Full Pipeline Time
Up to 100      10–20 minutes
100–500        30–45 minutes
500–1,000      ~90 minutes
1,000–2,000    ~2 hours
2,000–3,000    ~3 hours
3,000–5,000    Up to 4 hours

Core tools (Page Reports, Link Analysis, search) become available roughly halfway through — you don't need to wait for the full intelligence pipeline.
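
For planning purposes, the table above can be turned into a simple lookup. The Python sketch below does that with a binary search over the band limits. The names are hypothetical, and assigning a boundary value such as 500 to the lower band is an assumption, because the published ranges overlap at their edges.

```python
from bisect import bisect_left

# Upper bound of each band and the matching estimate from the table.
UPPER_BOUNDS = [100, 500, 1000, 2000, 3000, 5000]
ESTIMATES = [
    "10–20 minutes",
    "30–45 minutes",
    "~90 minutes",
    "~2 hours",
    "~3 hours",
    "Up to 4 hours",
]


def estimated_pipeline_time(max_pages):
    """Look up the estimated full pipeline time for a max-pages setting."""
    if not 1 <= max_pages <= UPPER_BOUNDS[-1]:
        raise ValueError("max_pages must be between 1 and 5,000")
    # bisect_left places a boundary value (e.g. 500) in the lower band.
    return ESTIMATES[bisect_left(UPPER_BOUNDS, max_pages)]


print(estimated_pipeline_time(500))
print(estimated_pipeline_time(2500))
```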

Crawl Quota

A quota display shows your monthly page usage:

  • Pages used / Pages remaining / Monthly limit
  • Pro plans include 10,000 pages per month
  • Individual crawls are capped at 5,000 pages

If your remaining quota is lower than the site's page count, the crawl will still run but will only index pages up to your limit.

Recrawling a Site

Click Recrawl on any ready or failed site. A configuration modal opens with:

  • Max pages — Defaults to the previous crawl's setting
  • Excluded paths — Pre-populated from the previous crawl's configuration

You can adjust both before starting. The recrawl re-fetches all pages, rebuilds the link graph, and re-runs the intelligence pipeline (but only re-processes pages whose content has changed, thanks to content-hash caching).
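
Content-hash caching in general works by storing a digest of each page's content and comparing it on the next fetch. The Python sketch below shows the idea. It is not QueryBurst's implementation: the function names, the use of SHA-256, and hashing the converted markdown are all assumptions made for illustration.

```python
import hashlib


def content_hash(markdown):
    """Digest of a page's content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def pages_to_reprocess(previous_hashes, fetched_pages):
    """Return URLs whose content is new or differs from the stored hash."""
    changed = []
    for url, markdown in fetched_pages.items():
        if previous_hashes.get(url) != content_hash(markdown):
            changed.append(url)
    return changed


previous = {
    "/pricing": content_hash("Plans start at ..."),
    "/about": content_hash("Founded in ..."),
}
fetched = {
    "/pricing": "Plans start at ...",    # unchanged, skipped
    "/about": "Founded in ... and now",  # edited, re-processed
    "/careers": "We are hiring",         # new page, re-processed
}
print(pages_to_reprocess(previous, fetched))  # ['/about', '/careers']
```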

Partial Recrawl Warning

If the site has more pages than your remaining quota, a warning appears. Pages not reached during the recrawl will be archived and their data removed.

What Happens After Starting a Crawl

  1. The page switches to the Previous Crawls tab with a spinner on the new site
  2. Pages are fetched and converted to markdown
  3. Content is split into chunks for retrieval
  4. The internal link graph is built (structural depth, PageRank, orphan detection)
  5. Homepage redirect detection runs automatically
  6. Paginated URLs are detected and excluded from analysis
  7. Once the crawl completes, the Site Intelligence pipeline starts automatically

Click the site at any time after it reaches "ready" status to start exploring reports. Intelligence-dependent features (Topics, Issues) enable once the pipeline finishes.
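
Two of the link graph metrics from step 4, structural depth and orphan detection, can be illustrated with a breadth-first search. The Python sketch below is a generic version of those ideas under stated assumptions: depth is counted in clicks from the crawl root, an orphan is a page with no inbound internal links, and all names are hypothetical. PageRank is left out for brevity.

```python
from collections import deque


def structural_depths(links, root):
    """Breadth-first search: minimum number of clicks from the root."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def orphan_pages(links, all_pages, root):
    """Pages (other than the root) that no crawled page links to."""
    linked_to = {t for targets in links.values() for t in targets}
    return sorted(p for p in all_pages if p != root and p not in linked_to)


# Outbound internal links per page; /landing is known only from the sitemap.
links = {
    "/": ["/blog", "/docs"],
    "/blog": ["/blog/post-1"],
    "/docs": ["/"],
    "/landing": ["/"],
}
all_pages = {"/", "/blog", "/docs", "/blog/post-1", "/landing"}

print(structural_depths(links, "/"))
print(orphan_pages(links, all_pages, "/"))  # ['/landing']
```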

Tips

  1. Start small — For your first crawl, use 500 pages to get results quickly. You can always recrawl with a higher limit
  2. Exclude non-content paths — Blog archives, tag pages, and translated duplicates waste quota without adding analytical value
  3. Use subdirectory discovery — It's much easier than manually listing exclusion patterns
  4. Check the site's URL — If your site redirects (e.g. domain.com → domain.com/en/), QueryBurst auto-detects this, but entering the final URL directly avoids confusion
  5. Recrawl after content changes — Only changed pages are re-processed, so recrawls are faster than initial crawls