Advanced Crawlers

AskVio supports multiple crawling strategies. In the backend, each crawl is stored with:

  • type: simple, custompages, or sitemaps (how URLs are discovered)
  • mode: fast or advanced (how each page is fetched)

This means you can pair a URL discovery strategy with a rendering strategy to suit your site architecture.
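The two fields can be modelled as independent unions. The following is a minimal TypeScript sketch: only the field names `type` and `mode` and their values come from the text above, while `CrawlRecord` and `allCombinations` are hypothetical names, and the sketch assumes every pairing is allowed.

```typescript
// Hypothetical types: only `type`, `mode`, and their values come from the docs.
type CrawlType = "simple" | "custompages" | "sitemaps"; // how URLs are discovered
type CrawlMode = "fast" | "advanced"; // how each page is fetched

interface CrawlRecord {
  type: CrawlType;
  mode: CrawlMode;
}

const CRAWL_TYPES: CrawlType[] = ["simple", "custompages", "sitemaps"];
const CRAWL_MODES: CrawlMode[] = ["fast", "advanced"];

// Enumerate every discovery/rendering pairing (assumes all pairings are valid).
function allCombinations(): CrawlRecord[] {
  const combos: CrawlRecord[] = [];
  for (const type of CRAWL_TYPES) {
    for (const mode of CRAWL_MODES) {
      combos.push({ type, mode });
    }
  }
  return combos;
}
```

Keeping the two fields separate means a change of rendering strategy does not require changing how URLs are discovered.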

Normal vs advanced crawlers

| Crawler mode | Technology used | Best when | Tradeoffs |
| --- | --- | --- | --- |
| Fast (normal) | HTTP fetch + HTML parsing (fetch + cheerio) | Content exists in server-rendered HTML (docs, blogs, marketing pages) | Very fast and lightweight, but misses content only rendered after JavaScript execution |
| Advanced (dynamic) | Headless Chromium rendering (puppeteer-core + chrome-aws-lambda) | SPAs and JS-heavy pages where text appears after client-side rendering | More resource intensive and slower; use when fast mode misses meaningful content |

How URL discovery works (crawl types)

| Type | How it works | Use this when |
| --- | --- | --- |
| simple | Starts from one URL and discovers internal links recursively. | You want quick onboarding from a homepage or docs index. |
| custompages | Uses your explicit list of pages (plus optional start URL). | You need strict control over exactly what gets ingested. |
| sitemaps | Parses sitemap URLs and nested sitemap indexes to produce the URL list. | You have a large or frequently changing site with maintained sitemap files. |
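The request examples later on this page carry `url`, `pages`, or `sitemaps` but no explicit `type`, so the type is presumably derived from whichever field is present. The sketch below shows one way that could work. The function name `inferCrawlType`, the `IngestBody` shape, and the precedence order (sitemaps, then pages, then url) are assumptions for illustration, not confirmed backend behavior.

```typescript
type CrawlType = "simple" | "custompages" | "sitemaps";

// Hypothetical request shape using the field names from the examples on this page.
interface IngestBody {
  url?: string;
  pages?: string[];
  sitemaps?: string[];
}

// Assumed precedence: sitemaps, then pages, then a single start URL.
function inferCrawlType(body: IngestBody): CrawlType {
  if (body.sitemaps && body.sitemaps.length > 0) return "sitemaps";
  if (body.pages && body.pages.length > 0) return "custompages";
  if (body.url) return "simple";
  throw new Error("Request needs at least one of: url, pages, sitemaps");
}
```

Under this assumed ordering, a request with both `pages` and `url` resolves to `custompages`, which matches the "plus optional start URL" note in the table.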

When to use each mode

  • Choose fast mode first for most documentation and content websites. It is cheaper, simpler, and usually sufficient.
  • Use advanced mode when sampled crawled pages are missing key body text, product details, or navigation-generated content.
  • Prefer the sitemaps type for enterprise-scale websites because discovery is explicit and deterministic.
  • Prefer the custompages type for curated corpora (legal pages, support-only pages, policy pages).
  • Use the simple type to bootstrap quickly, then refine with custompages or sitemaps after reviewing crawl output.
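One way to decide whether fast mode is missing content is to measure how much visible text survives in the raw HTML of a few sampled pages. The sketch below is a rough heuristic, not AskVio's own logic: `visibleText`, `suggestMode`, the 200-character threshold, and the "more than half" rule are all assumptions, and the regex-based tag stripping is a simplification of what an HTML parser would do.

```typescript
// Strip scripts, styles, and tags, then collapse whitespace to approximate visible text.
function visibleText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical threshold: pages with fewer visible characters count as "thin".
const MIN_TEXT_CHARS = 200;

// Suggest a mode from the raw (unrendered) HTML of a few sampled pages.
function suggestMode(
  sampledHtml: string[],
  minChars: number = MIN_TEXT_CHARS
): "fast" | "advanced" {
  if (sampledHtml.length === 0) return "fast";
  const thin = sampledHtml.filter(
    (html) => visibleText(html).length < minChars
  ).length;
  // Assumption: if more than half the sample is thin, content is likely rendered client-side.
  return thin * 2 > sampledHtml.length ? "advanced" : "fast";
}
```

A typical SPA shell (an empty root element plus script tags) yields almost no visible text, so a sample dominated by such pages points toward advanced mode.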

Deep dive: common crawler setups

1) Simple crawl (single entry point)

Great for small help centers or docs hubs where links are well-connected.

POST /api/ingest
{
  "clientId": "acme",
  "url": "https://docs.acme.com",
  "maxPages": 200,
  "mode": "fast"
}
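
Recursive internal-link discovery with a page cap can be sketched as a breadth-first traversal. This is an illustration only: the in-memory `LinkGraph` stands in for real page fetches, `discoverInternal` is a hypothetical name, and treating "internal" as "same origin as the start URL" is an assumption about how the crawler scopes links.

```typescript
// Hypothetical in-memory link graph standing in for real fetches:
// normalized page URL -> hrefs found on that page.
type LinkGraph = Record<string, string[]>;

// Resolve an href against its page and drop the fragment; null if it cannot be parsed.
function resolveLink(href: string, base: string): URL | null {
  try {
    const resolved = new URL(href, base);
    resolved.hash = "";
    return resolved;
  } catch {
    return null;
  }
}

// Breadth-first discovery of same-origin pages, stopping at maxPages.
function discoverInternal(startUrl: string, links: LinkGraph, maxPages: number): string[] {
  const start = new URL(startUrl);
  const seen = new Set<string>([start.toString()]);
  const queue: string[] = [start.toString()];
  const visited: string[] = [];

  while (queue.length > 0 && visited.length < maxPages) {
    const current = queue.shift() as string;
    visited.push(current);
    for (const href of links[current] ?? []) {
      const resolved = resolveLink(href, current);
      if (resolved === null || resolved.origin !== start.origin) continue; // skip external or invalid links
      const url = resolved.toString();
      if (!seen.has(url)) {
        seen.add(url);
        queue.push(url);
      }
    }
  }
  return visited;
}
```

Breadth-first order means pages closest to the entry point are ingested first, so a low `maxPages` pilot still covers the top of the site.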

2) Curated pages crawl (high precision)

Best when you want only trusted URLs indexed (for example, compliance content).

POST /api/ingest
{
  "clientId": "acme",
  "pages": [
    "https://acme.com/pricing",
    "https://acme.com/security",
    "https://acme.com/terms"
  ],
  "maxPages": 50,
  "mode": "fast"
}

3) Sitemap-driven crawl (large structured site)

Parses XML sitemaps and follows nested sitemap indexes recursively.

POST /api/ingest
{
  "clientId": "acme",
  "sitemaps": [
    "https://acme.com/sitemap.xml",
    "https://docs.acme.com/sitemap_index.xml"
  ],
  "maxPages": 5000,
  "mode": "fast"
}
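
Following nested sitemap indexes can be sketched as a work queue: index files contribute more sitemaps, while regular sitemaps contribute page URLs. The example below is a simplified illustration. The `SitemapStore` map stands in for HTTP fetches, `extractLocs` and `collectPageUrls` are hypothetical names, and the regex-based `<loc>` extraction is a shortcut that does not decode XML entities such as `&amp;`; a production crawler would use a real XML parser.

```typescript
// Hypothetical in-memory store standing in for HTTP fetches: sitemap URL -> XML body.
type SitemapStore = Record<string, string>;

// Pull every <loc> value out of a sitemap or sitemap index document.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Expand sitemaps and nested indexes into a de-duplicated page list, capped at maxPages.
function collectPageUrls(sitemapUrls: string[], store: SitemapStore, maxPages: number): string[] {
  const pages: string[] = [];
  const seenPages = new Set<string>();
  const seenSitemaps = new Set<string>();
  const pending = [...sitemapUrls];

  while (pending.length > 0 && pages.length < maxPages) {
    const sitemapUrl = pending.shift() as string;
    if (seenSitemaps.has(sitemapUrl)) continue; // guards against index cycles
    seenSitemaps.add(sitemapUrl);

    const xml = store[sitemapUrl];
    if (xml === undefined) continue; // missing sitemap: skip it

    const locs = extractLocs(xml);
    if (/<sitemapindex[\s>]/i.test(xml)) {
      pending.push(...locs); // index file: each <loc> is another sitemap
    } else {
      for (const loc of locs) {
        if (pages.length >= maxPages) break;
        if (!seenPages.has(loc)) {
          seenPages.add(loc);
          pages.push(loc);
        }
      }
    }
  }
  return pages;
}
```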

4) Advanced dynamic crawl (JS-rendered content)

Use this when fast mode captures little or no meaningful text from SPA routes.

POST /api/ingest
{
  "clientId": "acme",
  "url": "https://app.acme.com/help",
  "maxPages": 300,
  "mode": "advanced"
}

5) Scheduled refresh crawl

Enable a schedule to keep answers aligned with changing docs. An hourly scheduler checks for due runs, and running crawls are finalized automatically.

POST /api/ingest
{
  "clientId": "acme",
  "sitemaps": ["https://docs.acme.com/sitemap.xml"],
  "maxPages": 1000,
  "mode": "fast",
  "scheduleEnabled": true,
  "intervalAmount": 1,
  "intervalUnit": "week"
}
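
The due-run check can be pictured as simple interval arithmetic on the last run time. The sketch below is an assumption about how such a check might look, not the actual scheduler: `nextRunAt` and `isDue` are hypothetical names, and only the `week` unit appears in the example above (`hour` and `day` are assumed here for illustration).

```typescript
// Only "week" appears in the docs; "hour" and "day" are assumed for illustration.
type IntervalUnit = "hour" | "day" | "week";

const UNIT_MS: Record<IntervalUnit, number> = {
  hour: 60 * 60 * 1000,
  day: 24 * 60 * 60 * 1000,
  week: 7 * 24 * 60 * 60 * 1000,
};

// Field names match the request body shown above.
interface Schedule {
  scheduleEnabled: boolean;
  intervalAmount: number;
  intervalUnit: IntervalUnit;
}

// Next run = last run + intervalAmount * unit length.
function nextRunAt(lastRun: Date, schedule: Schedule): Date {
  const intervalMs = schedule.intervalAmount * UNIT_MS[schedule.intervalUnit];
  return new Date(lastRun.getTime() + intervalMs);
}

// What an hourly scheduler tick might evaluate for each crawl.
function isDue(lastRun: Date, schedule: Schedule, now: Date): boolean {
  return schedule.scheduleEnabled && now.getTime() >= nextRunAt(lastRun, schedule).getTime();
}
```

If the check runs once per hour, as the description suggests, a crawl may start up to an hour after its computed due time.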

Operational recommendations

  • Start with a low maxPages pilot crawl and review quality before scaling.
  • Use /api/validate-sitemap before launching large sitemap crawls to catch invalid feeds early.
  • If your site has mixed rendering, split crawls by section: keep docs on fast, use advanced only for dynamic app/help routes.
  • Re-crawl after major navigation or URL structure changes so stale URLs are replaced.
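
Splitting a mixed-rendering site into two crawls can be done by partitioning a URL list on path prefix. The sketch below builds one fast and one advanced request body. It reuses the field names from the request examples on this page, while `splitByRendering`, the `dynamicPrefixes` parameter, and setting `maxPages` to the list length are assumptions for illustration.

```typescript
// Request shape using the field names shown in the curated pages example.
interface IngestRequest {
  clientId: string;
  pages: string[];
  maxPages: number;
  mode: "fast" | "advanced";
}

// Partition URLs by path prefix: dynamic sections go to advanced mode, the rest to fast.
function splitByRendering(
  clientId: string,
  urls: string[],
  dynamicPrefixes: string[]
): IngestRequest[] {
  const isDynamic = (url: string): boolean => {
    const path = new URL(url).pathname;
    return dynamicPrefixes.some((prefix) => path.startsWith(prefix));
  };
  const dynamicPages = urls.filter(isDynamic);
  const staticPages = urls.filter((url) => !isDynamic(url));

  const requests: IngestRequest[] = [];
  if (staticPages.length > 0) {
    requests.push({ clientId, pages: staticPages, maxPages: staticPages.length, mode: "fast" });
  }
  if (dynamicPages.length > 0) {
    requests.push({ clientId, pages: dynamicPages, maxPages: dynamicPages.length, mode: "advanced" });
  }
  return requests;
}
```

Each returned object could then be sent as the body of its own POST /api/ingest call, keeping the slower advanced crawl limited to the routes that need it.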

Screenshot placeholders

If you want to add UI screenshots later, keep placeholders like these:

  • <img src="/img/docs/advanced-crawlers-mode-selector.png" alt="Crawler creation form showing mode selector with Fast and Advanced options highlighted">
  • <img src="/img/docs/advanced-crawlers-sitemap-example.png" alt="Crawler setup screen with sitemap URLs and page limit configured before launch">
  • <img src="/img/docs/advanced-crawlers-history.png" alt="Crawl history table showing status, pages crawled, and termination reason for recent runs">