Advanced Crawlers
AskVio supports multiple crawling strategies. In the backend, each crawl is stored with:
- type: simple, custompages, or sitemaps (how URLs are discovered)
- mode: fast or advanced (how each page is fetched)
This means you can combine a URL discovery strategy with a rendering strategy, depending on your site architecture.
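As a rough illustration of how the two fields combine, the sketch below models a crawl record with a discovery type and a fetch mode. The `CrawlConfig` class and its validation are hypothetical and only mirror the values listed above; they are not AskVio's actual backend schema.

```python
from dataclasses import dataclass

CRAWL_TYPES = ("simple", "custompages", "sitemaps")  # how URLs are discovered
CRAWL_MODES = ("fast", "advanced")                   # how each page is fetched


@dataclass(frozen=True)
class CrawlConfig:
    """Hypothetical record pairing a discovery type with a fetch mode."""

    type: str
    mode: str

    def __post_init__(self) -> None:
        # Reject values outside the two documented sets.
        if self.type not in CRAWL_TYPES:
            raise ValueError(f"unknown crawl type: {self.type!r}")
        if self.mode not in CRAWL_MODES:
            raise ValueError(f"unknown crawl mode: {self.mode!r}")


# The two fields are independent, so every type can pair with every mode.
ALL_COMBINATIONS = [CrawlConfig(t, m) for t in CRAWL_TYPES for m in CRAWL_MODES]
```

Because the fields are independent, a sitemap-driven crawl can use advanced rendering just as easily as a simple crawl can use fast fetching.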
Normal vs advanced crawlers
| Crawler mode | Technology used | Best when | Tradeoffs |
|---|---|---|---|
| Fast (normal) | HTTP fetch + HTML parsing (fetch + cheerio) | Content exists in server-rendered HTML (docs, blogs, marketing pages) | Very fast and lightweight, but misses content only rendered after JavaScript execution |
| Advanced (dynamic) | Headless Chromium rendering (puppeteer-core + chrome-aws-lambda) | SPAs and JS-heavy pages where text appears after client-side rendering | More resource intensive and slower; use when fast mode misses meaningful content |
How URL discovery works (crawl types)
| Type | How it works | Use this when |
|---|---|---|
| simple | Starts from one URL and discovers internal links recursively. | You want quick onboarding from a homepage or docs index. |
| custompages | Uses your explicit list of pages (plus optional start URL). | You need strict control over exactly what gets ingested. |
| sitemaps | Parses sitemap URLs and nested sitemap indexes to produce the URL list. | You have a large or frequently changing site with maintained sitemap files. |
When to use each mode
- Choose fast mode first for most documentation and content websites. It is cheaper, simpler, and usually sufficient.
- Use advanced mode when sampled crawled pages are missing key body text, product details, or navigation-generated content.
- Prefer sitemaps type for enterprise-scale websites because discovery is explicit and deterministic.
- Prefer custompages type for curated corpora (legal pages, support-only pages, policy pages).
- Use simple type to bootstrap quickly, then refine with the custompages or sitemaps type after reviewing crawl output.
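One way to check whether fast mode is enough for a page is to measure how much visible text its raw, unrendered HTML contains. The sketch below is a heuristic under stated assumptions, not AskVio's own logic: `suggest_mode` and the 200-character threshold are hypothetical, and the right threshold depends on your content.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside <script>, <style>, and <noscript> elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(raw_html: str) -> int:
    """Length of the text a plain HTTP fetch would see, without running JavaScript."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    return len(" ".join(parser.chunks))


def suggest_mode(raw_html: str, min_chars: int = 200) -> str:
    """Suggest "advanced" when the unrendered HTML carries almost no text.

    min_chars is an arbitrary illustrative threshold, not an AskVio setting.
    """
    return "fast" if visible_text_length(raw_html) >= min_chars else "advanced"
```

A server-rendered docs page usually clears the threshold easily, while a typical SPA shell (an empty root element plus script tags) does not, which matches the guidance to escalate only when sampled pages are missing body text.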
Deep dive: common crawler setups
1) Simple crawl (single entry point)
Great for small help centers or docs hubs where links are well-connected.

    POST /api/ingest
    {
      "clientId": "acme",
      "url": "https://docs.acme.com",
      "maxPages": 200,
      "mode": "fast"
    }
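To make the recursive discovery behind a simple crawl more concrete, here is a minimal breadth-first sketch that follows same-host links up to a page limit. It illustrates the general technique only; AskVio's real traversal order, URL normalization, and filtering rules are not documented here, and `discover`, `fetch_html`, and the in-memory `SITE` are hypothetical names used for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover(start_url: str, fetch_html: Callable[[str], str], max_pages: int) -> list[str]:
    """Breadth-first walk over internal links, stopping at max_pages."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        parser = _LinkCollector()
        parser.feed(fetch_html(url))
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments before comparing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# In-memory stand-in for a small docs site, so the sketch runs without network access.
SITE = {
    "https://docs.acme.com": '<a href="/a">A</a> <a href="/b">B</a> <a href="https://other.example/x">X</a>',
    "https://docs.acme.com/a": '<a href="/b#top">B</a> <a href="/c">C</a>',
    "https://docs.acme.com/b": "",
    "https://docs.acme.com/c": "",
}
pages = discover("https://docs.acme.com", lambda u: SITE.get(u, ""), max_pages=200)
```

Pages that no internal link points to are never reached by this kind of walk, which is why a simple crawl works best on well-connected sites.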
2) Curated pages crawl (high precision)
Best when you only want trusted URLs indexed (for example compliance content).

    POST /api/ingest
    {
      "clientId": "acme",
      "pages": [
        "https://acme.com/pricing",
        "https://acme.com/security",
        "https://acme.com/terms"
      ],
      "maxPages": 50,
      "mode": "fast"
    }
3) Sitemap-driven crawl (large structured site)
Uses XML sitemap parsing and supports sitemap indexes recursively.

    POST /api/ingest
    {
      "clientId": "acme",
      "sitemaps": [
        "https://acme.com/sitemap.xml",
        "https://docs.acme.com/sitemap_index.xml"
      ],
      "maxPages": 5000,
      "mode": "fast"
    }
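The recursive handling of sitemap indexes can be sketched as follows. This is an illustration of the standard sitemap protocol rather than AskVio's parser: `collect_urls` and `fetch_xml` are hypothetical names, and the sketch assumes well-formed XML that uses the sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(
    sitemap_url: str,
    fetch_xml: Callable[[str], str],
    max_pages: int,
    _seen: Optional[set] = None,
) -> list[str]:
    """Return page URLs from a sitemap, following nested sitemap indexes."""
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen:  # guard against indexes that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch_xml(sitemap_url))
    urls: list[str] = []
    if root.tag.endswith("sitemapindex"):
        # An index lists child sitemaps, so recurse into each one.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            remaining = max_pages - len(urls)
            if remaining <= 0:
                break
            child = (loc.text or "").strip()
            if child:
                urls.extend(collect_urls(child, fetch_xml, remaining, seen))
    else:
        # A regular urlset lists page URLs directly.
        for loc in root.findall("sm:url/sm:loc", NS):
            if len(urls) >= max_pages:
                break
            page = (loc.text or "").strip()
            if page:
                urls.append(page)
    return urls
```

The resulting list follows the order of the sitemap files themselves, which is what makes this discovery style predictable for large sites.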
4) Advanced dynamic crawl (JS-rendered content)
Use this when fast mode captures little/no meaningful text from SPA routes.

    POST /api/ingest
    {
      "clientId": "acme",
      "url": "https://app.acme.com/help",
      "maxPages": 300,
      "mode": "advanced"
    }
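If you submit crawls from a script, the request body above can be wrapped in a POST request roughly like this. The base URL `https://askvio.example` is a placeholder, and authentication is omitted because the required headers are not described in this page; treat both as assumptions to replace with your deployment's values.

```python
import json
from urllib.request import Request


def build_ingest_request(base_url: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request to /api/ingest."""
    return Request(
        url=base_url.rstrip("/") + "/api/ingest",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same body as the advanced crawl example above.
request = build_ingest_request(
    "https://askvio.example",  # placeholder host, not a real endpoint
    {
        "clientId": "acme",
        "url": "https://app.acme.com/help",
        "maxPages": 300,
        "mode": "advanced",
    },
)
# To send it: urllib.request.urlopen(request), after adding whatever auth your deployment requires.
```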
5) Scheduled refresh crawl
Keep answers aligned with changing docs by enabling schedules. An hourly scheduler checks for due runs, and running crawls are finalized automatically.

    POST /api/ingest
    {
      "clientId": "acme",
      "sitemaps": ["https://docs.acme.com/sitemap.xml"],
      "maxPages": 1000,
      "mode": "fast",
      "scheduleEnabled": true,
      "intervalAmount": 1,
      "intervalUnit": "week"
    }
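The interaction between the interval fields and an hourly check can be sketched as below. Only the "week" unit appears in the example above; the "hour" and "day" units, and the `next_run` and `is_due` helpers, are assumptions made for illustration rather than documented AskVio behavior.

```python
from datetime import datetime, timedelta, timezone

# "week" comes from the example above; "hour" and "day" are assumed units.
UNIT_TO_DELTA = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}


def next_run(last_run: datetime, interval_amount: int, interval_unit: str) -> datetime:
    """Earliest time at which the crawl should run again."""
    return last_run + interval_amount * UNIT_TO_DELTA[interval_unit]


def is_due(last_run: datetime, interval_amount: int, interval_unit: str, now: datetime) -> bool:
    """What an hourly scheduler tick might evaluate for each scheduled crawl."""
    return now >= next_run(last_run, interval_amount, interval_unit)


last = datetime(2024, 1, 1, tzinfo=timezone.utc)
due_at = next_run(last, 1, "week")
```

Because the check only happens once per hour in this model, a crawl may start up to an hour after it becomes due.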
Operational recommendations
- Start with a low maxPages pilot crawl and review quality before scaling.
- Use /api/validate-sitemap before launching large sitemap crawls to catch invalid feeds early.
- If your site has mixed rendering, split crawls by section: keep docs on fast, use advanced only for dynamic app/help routes.
- Re-crawl after major navigation or URL structure changes so stale URLs are replaced.
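Splitting crawls by section could look something like the sketch below, which groups a page list into one fast payload and one advanced payload based on URL prefixes. The `split_by_section` helper and the prefix rule are hypothetical; the payload fields reuse the ones shown in the request examples earlier on this page.

```python
def split_by_section(
    pages: list[str],
    advanced_prefixes: list[str],
    client_id: str,
    max_pages: int,
) -> list[dict]:
    """Group pages into per-mode request bodies based on URL prefix."""
    fast: list[str] = []
    advanced: list[str] = []
    prefixes = tuple(advanced_prefixes)
    for url in pages:
        # Pages under a dynamic prefix need rendering; everything else stays on fast.
        (advanced if url.startswith(prefixes) else fast).append(url)

    payloads = []
    for mode, urls in (("fast", fast), ("advanced", advanced)):
        if urls:  # skip a mode entirely when no page needs it
            payloads.append(
                {"clientId": client_id, "pages": urls, "maxPages": max_pages, "mode": mode}
            )
    return payloads


payloads = split_by_section(
    pages=[
        "https://docs.acme.com/getting-started",
        "https://app.acme.com/help/billing",
        "https://docs.acme.com/api",
    ],
    advanced_prefixes=["https://app.acme.com/"],
    client_id="acme",
    max_pages=50,
)
```

Keeping the advanced payload as small as possible limits the slower, more resource-intensive rendering to the routes that actually need it.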
Screenshot placeholders
If you want to add UI screenshots later, keep placeholders like these:
<img src="/img/docs/advanced-crawlers-mode-selector.png" alt="Crawler creation form showing mode selector with Fast and Advanced options highlighted">
<img src="/img/docs/advanced-crawlers-sitemap-example.png" alt="Crawler setup screen with sitemap URLs and page limit configured before launch">
<img src="/img/docs/advanced-crawlers-history.png" alt="Crawl history table showing status, pages crawled, and termination reason for recent runs">