Web Page Reader Configuration
This page documents every setting under Web Page Reader in the Spaces configuration UI. Each setting is shown by its UI label, with the underlying configuration field name in italics.
A "web page reader" is what fetches a page's content after a search returns a URL — also called a "crawler" in the code. Some search engines return URLs only and the configured reader is run after every search; others return content directly and the reader is not used unless the orchestrator AI asks for a specific URL.
The list of available readers is decided at the platform level by the ACTIVE_INHOUSE_CRAWLERS environment variable plus the presence of API keys for third-party readers. Only readers on that list appear in the Web Page Reader selector. See the Activation Reference page under Platform / Infrastructure for what controls availability.
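As a rough illustration of the rule above, the following Python sketch computes which readers would appear in the selector. It rests on several assumptions that this page does not confirm: that `ACTIVE_INHOUSE_CRAWLERS` is a comma-separated list, that its entries match the UI display names, and that the API-key variable names look like the hypothetical ones below. The Activation Reference page is the authority on the real behavior.

```python
import os
from typing import List, Mapping

# Built-in readers, using the display names from this page.
INHOUSE_READERS = ("BasicCrawler", "Crawl4AiCrawler")

# Hypothetical API-key variable names, for illustration only.
THIRD_PARTY_KEY_VARS = {
    "FirecrawlCrawler": "FIRECRAWL_API_KEY",
    "JinaCrawler": "JINA_API_KEY",
    "TavilyCrawler": "TAVILY_API_KEY",
}


def available_readers(env: Mapping[str, str] = os.environ) -> List[str]:
    """List the readers that would be selectable under the stated assumptions."""
    # Assumption: the variable holds comma-separated reader names.
    raw = env.get("ACTIVE_INHOUSE_CRAWLERS", "")
    active = {name.strip() for name in raw.split(",") if name.strip()}
    readers = [name for name in INHOUSE_READERS if name in active]
    # A third-party reader is listed only when its key is present and non-empty.
    readers += [name for name, var in THIRD_PARTY_KEY_VARS.items() if env.get(var)]
    return readers
```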
For deployment-side credential setup (API keys for the third-party readers), see the Web Page Reader Setup page under Platform / Infrastructure. For sequence diagrams of how each reader actually fetches a page, see Web Page Readers under Platform / Architecture.
Note for the next iteration: most reader settings below currently render in the Spaces UI with auto-generated camelCase labels because their Pydantic models set only description=... and not title=... on each field. The labels in the Settings tables on this page are the recommended UI labels; they will become the actual labels once a follow-up code change adds explicit title= values.
Common setting
Every reader shares one setting:
| UI Label | Field | Type | Default |
|---|---|---|---|
| Timeout | | integer (seconds) | |
Per-page request timeout. Pages that take longer than this are abandoned with an error.
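The timeout behavior can be pictured with a short Python sketch. This is an illustration only, not the readers' actual code: `read_with_timeout` and the injected `fetch` coroutine are hypothetical names, and the sketch simply wraps one page fetch in `asyncio.wait_for`.

```python
import asyncio
from typing import Awaitable, Callable


async def read_with_timeout(
    fetch: Callable[[str], Awaitable[str]], url: str, timeout_seconds: int
) -> str:
    """Run one page fetch and abandon it with an error after the timeout."""
    try:
        return await asyncio.wait_for(fetch(url), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Report the abandoned page as an error instead of stalling the batch.
        raise TimeoutError(f"{url} exceeded {timeout_seconds}s") from None
```

Because the limit applies per page, one slow page fails on its own while the other pages in the same search continue.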
Basic Reader
Display name in the UI: BasicCrawler
What it does
Built-in lightweight reader. Fetches each URL with a single HTTP GET (using a randomized user-agent header) and converts the HTML response to Markdown locally. All URLs are fetched in parallel.
Cannot render JavaScript — only sees the initial HTML response.
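The fetch step described above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the BasicCrawler implementation: the user-agent strings, the 10-second timeout, the worker count, and all function names are hypothetical, and the HTML-to-Markdown conversion is left out.

```python
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Union

# Hypothetical user-agent pool; the reader's real list is not documented here.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) ExampleBrowser/2.0",
]


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying a randomly chosen User-Agent header."""
    return urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}, method="GET"
    )


def fetch_html(url: str, timeout: int = 10) -> str:
    """Single HTTP GET; returns the initial HTML only, since no JavaScript runs."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_all(
    urls: List[str],
    fetch: Callable[[str], str] = fetch_html,
    max_workers: int = 8,
) -> Dict[str, Union[str, Exception]]:
    """Fetch every URL in parallel; map each URL to its HTML or to its error."""

    def safe(url: str) -> Union[str, Exception]:
        try:
            return fetch(url)
        except Exception as exc:  # one failed page must not abort the others
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(safe, urls)))
```

Passing `fetch` as a parameter keeps the parallel logic testable without network access.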
Settings
| UI Label | Field | Type | Default |
|---|---|---|---|
| Timeout | | integer (seconds) | |
| URL Pattern Blacklist | | list of regex strings | |
| Unwanted Content Types | | set of strings | PDFs, Office documents, images, video, audio |
| Maximum Concurrent Requests | | integer | |
URL Pattern Blacklist: Regex patterns. URLs matching any pattern are skipped before any request is sent. The default blocks PDFs. This is also where to add internal-network ranges for SSRF defense (see Security Risk Assessment).
Unwanted Content Types: HTTP Content-Type values that are rejected after the response arrives. The default blocks binary file types.
Maximum Concurrent Requests: Maximum number of pages fetched in parallel.
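The two filters act at different points: the URL patterns run before a request is sent, and the content-type check runs once the response headers are available. The Python sketch below illustrates that split. The patterns and content types are hypothetical examples rather than the shipped defaults, and it assumes patterns are applied as case-insensitive searches, which this page does not specify.

```python
import re
from typing import Collection, Iterable

# Hypothetical example patterns; the shipped defaults are not reproduced here.
URL_BLACKLIST = [
    r"\.pdf(\?.*)?$",                                          # PDF links
    r"^https?://(localhost|127\.\d+\.\d+\.\d+)(:\d+)?(/|$)",   # loopback hosts
    r"^https?://10\.\d+\.\d+\.\d+(:\d+)?(/|$)",                # private 10.0.0.0/8
]
UNWANTED_CONTENT_TYPES = {"application/pdf", "image/png", "video/mp4"}


def is_url_allowed(url: str, patterns: Iterable[str] = URL_BLACKLIST) -> bool:
    """Checked before any request is sent: skip URLs matching any pattern."""
    return not any(re.search(p, url, flags=re.IGNORECASE) for p in patterns)


def is_content_type_allowed(
    header_value: str, unwanted: Collection[str] = UNWANTED_CONTENT_TYPES
) -> bool:
    """Checked after the response arrives: reject unwanted Content-Type values."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type not in unwanted
```

Pattern matching on the URL string only sees literal addresses. A hostname that resolves to an internal address passes this check, so regex rules are one layer of SSRF defense rather than a complete one.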
When to use it
Default for most Spaces. Fast, no external dependency, no API costs. Use when the sites you expect to fetch are mostly static HTML (news, documentation, public reference content) and don't depend on client-side JavaScript.
Limitations
Cannot fetch JavaScript-heavy sites (single-page applications, sites that render content client-side).
Often blocked by modern anti-bot protections on major news and finance sites.
Browser-Based Reader
Display name in the UI: Crawl4AiCrawler
What it does
Built-in browser-based reader. Each URL is loaded in a real headless Chromium browser inside assistants-core, with JavaScript execution, full-page scrolling, overlay removal, and human-browsing simulation. Then the rendered HTML is converted to Markdown.
Settings
| UI Label | Field | Type | Default |
|---|---|---|---|
| Timeout | | integer (seconds) | |
| Maximum Concurrent Requests | | integer | |
| Maximum Browser Sessions | | integer | |
Markdown Generator Config (markdown_generator_config):
| UI Label | Field | Default |
|---|---|---|
| Markdown Options | | |
Rate Limiter Config (rate_limiter_config):
| UI Label | Field | Default |
|---|---|---|
| Base Delay | | |
| Maximum Delay | | |
| Maximum Retries | | |
| Rate-Limit HTTP Codes | | |
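This page does not describe how the four rate-limiter settings interact. One common interpretation is capped exponential backoff that triggers only on rate-limit status codes, sketched below in Python. The function name, the default values, and the jitter range are all assumptions for illustration and may differ from what the Browser-Based Reader does.

```python
import random
from typing import FrozenSet, Optional


def next_delay(
    attempt: int,
    status_code: int,
    base_delay: float = 1.0,            # hypothetical default
    max_delay: float = 60.0,            # hypothetical default
    max_retries: int = 3,               # hypothetical default
    rate_limit_codes: FrozenSet[int] = frozenset({429, 503}),  # hypothetical
) -> Optional[float]:
    """Return seconds to wait before the next retry, or None to stop retrying.

    `attempt` counts retries already made, starting at 0.
    """
    if status_code not in rate_limit_codes:
        return None  # not a rate-limit response, so no backoff applies
    if attempt >= max_retries:
        return None  # retry budget exhausted
    # Double the delay on each attempt, never exceeding the maximum.
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Add up to 25% random jitter so parallel fetches do not retry in lockstep.
    return min(delay * random.uniform(1.0, 1.25), max_delay)
```

Under this reading, raising Maximum Delay only matters once Base Delay has doubled enough times to reach it, so Maximum Retries bounds the total wait.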
Browser Config (crawler_config):
| UI Label | Field | Default |
|---|---|---|
| Cache Mode | | |
| Scan Full Page | | |
| Wait Until | | |
| Scroll Delay | | |
| Remove Overlay Elements | | |
| Simulate User | | |
| Override Navigator | | |
Content Filter Config (pruning_content_filter_config):
| UI Label | Field | Default |
|---|---|---|
| Enable Content Filter | | |
| Quality Threshold | | |
| Threshold Type | | |
| Minimum Word Threshold | | |
When to use it
When the sites you expect to fetch render their content with JavaScript or use anti-bot protection that the Basic Reader cannot get past. Tradeoff: latency is roughly 3x higher than Basic, and resource usage on the pod is significantly higher (one Chromium instance per concurrent page).
Limitations
Significantly slower than Basic.
Heavy resource usage on the assistants-core pod.
Operational overhead (Chromium binary, profiles, memory).
Firecrawl Reader (Experimental)
Display name in the UI: FirecrawlCrawler (Experimental)
What it does
Third-party API-based reader. Sends URLs to Firecrawl's batch-scrape API and receives clean Markdown content per page.
Settings
| UI Label | Field | Type | Default |
|---|---|---|---|
| Timeout | | integer (seconds) | |
When to use it
When you want Firecrawl's reliability for difficult sites without managing a browser stack inside the pod. Off-loads the rendering work and the egress traffic to Firecrawl. Per-call cost.
Jina Reader (Experimental)
Display name in the UI: JinaCrawler (Experimental)
What it does
Third-party API-based reader. Sends each URL to Jina's Reader API, which fetches and parses the page (in browser mode by default) and returns structured Markdown content.
Settings
| UI Label | Field | Type | Default |
|---|---|---|---|
| Timeout | | integer (seconds) | |
| Headers | | dict | |
The default headers ask Jina for Markdown output, browser-mode rendering, and a "Do Not Track" hint. Override only when you have a specific reason — see Jina's documentation for header semantics.
When to use it
When you want consistently clean Markdown output and your sites benefit from Jina's browser-mode rendering. Per-call cost.
Tavily Reader (Experimental)
Display name in the UI: TavilyCrawler (Experimental)
What it does
Third-party API-based reader. Sends URLs (in batches of up to 20) to Tavily's extract API and receives Markdown content per page.
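The batching step can be illustrated with a small Python helper that splits a URL list into groups of at most 20. The helper name `batched` is hypothetical, and the call to Tavily itself is omitted because its request format is not described on this page.

```python
from typing import Iterator, List, Sequence

TAVILY_MAX_BATCH = 20  # batch ceiling stated above


def batched(urls: Sequence[str], size: int = TAVILY_MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])
```

With 45 URLs this yields three batches of 20, 20, and 5, so the number of extract calls grows in steps of 20 URLs rather than per page.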
Settings
| UI Label | Field | Type | Default |
|---|---|---|---|
| Timeout | | integer (seconds) | |
| Extraction Depth | | | |
Extraction Depth: advanced returns more thorough extracted content; basic is faster but lighter.
When to use it
When you want batch extraction at low latency and your sites are well-supported by Tavily. Per-call cost.
Choosing a Reader
| Reader | Built-in | JavaScript rendering | API key required | Best for |
|---|---|---|---|---|
| Basic | Yes | No | No | Static HTML sites, fast crawling, default for most Spaces |
| Browser-Based | Yes | Yes | No | JavaScript-heavy sites, single-page apps |
| Firecrawl (Experimental) | No | Yes | Yes | Reliable cloud-based scraping with no browser stack on the pod |
| Jina (Experimental) | No | Yes | Yes | High-quality Markdown extraction with browser-mode rendering |
| Tavily (Experimental) | No | Yes | Yes | Batch URL extraction at low latency |
Related pages
For the architectural sequence diagrams of each reader (request paths, what egresses where): see Web Page Readers under Platform / Architecture.
For platform-level activation (ACTIVE_INHOUSE_CRAWLERS, API-key auto-activation) and per-reader credential setup: see Activation Reference and Web Page Reader Setup under Platform / Infrastructure.