Web Page Reader Configuration


This page documents every setting under Web Page Reader in the Spaces configuration UI. Each setting is listed by its UI label, alongside the underlying configuration field name.

A "web page reader" (called a "crawler" in the code) fetches a page's content after a search returns a URL. Some search engines return only URLs, so the configured reader runs after every search; others return content directly, and the reader is used only when the orchestrator AI asks for a specific URL.

The list of available readers is decided at the platform level by the ACTIVE_INHOUSE_CRAWLERS environment variable plus the presence of API keys for third-party readers. Only readers on that list appear in the Web Page Reader selector. See the Activation Reference page under Platform / Infrastructure for what controls availability.
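
The availability rule above can be pictured with a short sketch. It is illustrative only: it assumes ACTIVE_INHOUSE_CRAWLERS holds a comma-separated list of reader names, and the three API-key variable names are hypothetical placeholders. The Activation Reference page remains the authority on the real format and names.

```python
import os

# Reader names are the UI display names used on this page.
INHOUSE_READERS = {"BasicCrawler", "Crawl4AiCrawler"}

# Hypothetical variable names, chosen for illustration only.
THIRD_PARTY_KEY_VARS = {
    "FirecrawlCrawler": "FIRECRAWL_API_KEY",
    "JinaCrawler": "JINA_API_KEY",
    "TavilyCrawler": "TAVILY_API_KEY",
}


def available_readers(env):
    """Return the reader names that would appear in the selector."""
    # Assumption: the variable is a comma-separated list of reader names.
    raw = env.get("ACTIVE_INHOUSE_CRAWLERS", "")
    requested = {name.strip() for name in raw.split(",") if name.strip()}
    readers = sorted(requested & INHOUSE_READERS)
    # A third-party reader is listed only when its API key is present.
    for reader, key_var in THIRD_PARTY_KEY_VARS.items():
        if env.get(key_var):
            readers.append(reader)
    return readers


if __name__ == "__main__":
    print(available_readers(os.environ))
```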

For deployment-side credential setup (API keys for the third-party readers), see the Web Page Reader Setup page under Platform / Infrastructure. For sequence diagrams of how each reader actually fetches a page, see Web Page Readers under Platform / Architecture.

Note for the next iteration: most reader settings below currently render in the Spaces UI with auto-generated camelCase labels because their Pydantic models use only description=... and not title=.... The labels in the Settings tables on this page are the recommended UI labels — they will become the actual labels once a follow-up code change adds explicit title= values.

Common setting

Every reader shares one setting:

UI Label	Field	Type	Default
Timeout	`timeout`	integer (seconds)	`10`

Per-page request timeout. Pages that take longer than this are abandoned with an error.

Basic Reader

Display name in the UI: BasicCrawler

What it does

Built-in lightweight reader. Fetches each URL with a single HTTP GET (using a randomized user-agent header) and converts the HTML response to Markdown locally. All URLs are fetched in parallel.

Cannot render JavaScript — only sees the initial HTML response.
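
As a rough illustration of this fetch pattern, the sketch below issues one GET per URL with a randomly chosen user-agent and runs the requests in parallel. It is not the reader's actual code: a thread pool stands in for whatever concurrency mechanism the reader uses, the user-agent strings are made up, and the HTML-to-Markdown step is left out.

```python
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Made-up user-agent strings; the reader's real pool is not documented here.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) ExampleBrowser/2.0",
]


def fetch_html(url, timeout=10):
    """Single HTTP GET with a randomly chosen user-agent header."""
    request = urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_all(urls, fetch=fetch_html, max_concurrent_requests=10):
    """Fetch every URL in parallel; map each URL to its HTML or to the error raised."""
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as error:  # a timeout or HTTP error affects only this page
            return error

    with ThreadPoolExecutor(max_workers=max_concurrent_requests) as pool:
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Because only the initial response body is read, anything a page adds later through client-side JavaScript never reaches the result.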

Settings

UI Label	Field	Type	Default
Timeout	`timeout`	integer (seconds)	`10`
URL Pattern Blacklist	`url_pattern_blacklist`	list of regex strings	`[".*\\.pdf$"]`
Unwanted Content Types	`unwanted_content_types`	set of strings	PDFs, Office documents, images, video, audio
Maximum Concurrent Requests	`max_concurrent_requests`	integer	`10`

URL Pattern Blacklist — Regex patterns; URLs matching any pattern are skipped before any request is sent. The default blocks PDFs. This is also where to add internal-network ranges for SSRF defense (see Security Risk Assessment).
Unwanted Content Types — HTTP Content-Type values that are rejected after the response arrives. The default blocks binary file types.
Maximum Concurrent Requests — Pages fetched in parallel.
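
The two filters act at different points: the blacklist before a request is sent, the content-type check after the response arrives. The sketch below shows one way to reason about both. It assumes patterns are matched from the start of the URL (as with Python's `re.match`), and the content-type set is a small illustrative subset, not the real default.

```python
import re

URL_PATTERN_BLACKLIST = [r".*\.pdf$"]  # same regex as the documented default

# Illustrative subset; the full default set is not listed on this page.
UNWANTED_CONTENT_TYPES = {"application/pdf", "image/png", "video/mp4"}


def is_blacklisted(url, patterns=URL_PATTERN_BLACKLIST):
    """True when the URL matches any pattern, so no request should be sent."""
    return any(re.match(pattern, url) for pattern in patterns)


def is_unwanted(content_type_header, unwanted=UNWANTED_CONTENT_TYPES):
    """True when the response's media type is in the unwanted set."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in unwanted


# Example of extending the blacklist with internal-network prefixes.
INTERNAL_PATTERNS = URL_PATTERN_BLACKLIST + [r"^https?://(10\.|192\.168\.|127\.)"]
```

Under these assumptions the default pattern catches `https://example.com/report.pdf` but not `https://example.com/report.pdf?download=1`, because the pattern is anchored at the end of the URL. PDFs served from such URLs are left to the content-type check.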

When to use it

Default for most Spaces. Fast, no external dependency, no API costs. Use when the sites you expect to fetch are mostly static HTML (news, documentation, public reference content) and don't depend on client-side JavaScript.

Limitations

Cannot fetch JavaScript-heavy sites (single-page applications, sites that render content client-side).
Often blocked by modern anti-bot protections on major news and finance sites.

Browser-Based Reader

Display name in the UI: Crawl4AiCrawler

What it does

Built-in browser-based reader. Each URL is loaded in a real headless Chromium browser inside assistants-core, with JavaScript execution, full-page scrolling, overlay removal, and human-browsing simulation. Then the rendered HTML is converted to Markdown.

Settings

UI Label	Field	Type	Default
Timeout	`timeout`	integer (seconds)	`10`
Maximum Concurrent Requests	`max_concurrent_requests`	integer	`10`
Maximum Browser Sessions	`max_session_permit`	integer	`10`

Markdown Generator Config (markdown_generator_config):

UI Label	Field	Default
Markdown Options	`markdown_generator_config.options`	`{"ignore_links": true, "ignore_emphasis": true, "ignore_images": true}`

Rate Limiter Config (rate_limiter_config):

UI Label	Field	Default
Base Delay	`rate_limiter_config.base_delay`	`(0.5, 1.0)` seconds
Maximum Delay	`rate_limiter_config.max_delay`	`1.0` seconds
Maximum Retries	`rate_limiter_config.max_retries`	`0`
Rate-Limit HTTP Codes	`rate_limiter_config.rate_limit_codes`	`[429, 503]`
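
To make the interaction of these four values easier to reason about, here is a minimal sketch of a generic backoff policy driven by them. It is an assumption about the semantics (random initial wait, exponential growth capped at the maximum delay, retries only on the listed codes), not a copy of the underlying library's logic. The function names are hypothetical.

```python
import random

RATE_LIMITER_DEFAULTS = {
    "base_delay": (0.5, 1.0),        # seconds; random wait drawn from this range
    "max_delay": 1.0,                # seconds; cap on any backoff wait
    "max_retries": 0,                # retries allowed after a rate-limit response
    "rate_limit_codes": [429, 503],  # statuses treated as "slow down"
}


def initial_delay(config, rng=random):
    """Random wait within the configured base range."""
    low, high = config["base_delay"]
    return rng.uniform(low, high)


def should_retry(status, retries_so_far, config):
    """Retry only on a rate-limit status and only while retries remain."""
    return status in config["rate_limit_codes"] and retries_so_far < config["max_retries"]


def backoff_delay(retries_so_far, config, rng=random):
    """Exponential backoff from the base delay, capped at max_delay."""
    return min(initial_delay(config, rng) * 2 ** retries_so_far, config["max_delay"])
```

Read this way, the defaults mean a 429 or 503 response is not retried at all, and raising Maximum Retries has little effect on pacing unless Maximum Delay is raised with it, since the cap already equals the top of the base range.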

Browser Config (crawler_config):

UI Label	Field	Default
Cache Mode	`crawler_config.cache_mode`	`bypass`
Scan Full Page	`crawler_config.scan_full_page`	`true`
Wait Until	`crawler_config.wait_until`	`domcontentloaded`
Scroll Delay	`crawler_config.scroll_delay`	`0.05` seconds
Remove Overlay Elements	`crawler_config.remove_overlay_elements`	`true`
Simulate User	`crawler_config.simulate_user`	`true`
Override Navigator	`crawler_config.override_navigator`	`true`

Content Filter Config (pruning_content_filter_config):

UI Label	Field	Default
Enable Content Filter	`pruning_content_filter_config.enabled`	`true`
Quality Threshold	`pruning_content_filter_config.threshold`	`0.5`
Threshold Type	`pruning_content_filter_config.threshold_type`	`fixed`
Minimum Word Threshold	`pruning_content_filter_config.min_word_threshold`	`10`

When to use it

When the sites you expect to fetch render their content with JavaScript or use anti-bot protection that the Basic Reader cannot get past. Tradeoff: latency is roughly 3x higher than Basic, and resource usage on the pod is significantly higher (one Chromium instance per concurrent page).

Limitations

Significantly slower than Basic.
Heavy resource usage on the assistants-core pod.
Operational overhead (Chromium binary, profiles, memory).

Firecrawl Reader (Experimental)

Display name in the UI: FirecrawlCrawler (Experimental)

What it does

Third-party API-based reader. Sends URLs to Firecrawl's batch-scrape API and receives clean Markdown content per page.

Settings

UI Label	Field	Type	Default
Timeout	`timeout`	integer (seconds)	`10`

When to use it

When you want Firecrawl's reliability for difficult sites without managing a browser stack inside the pod. Off-loads the rendering work and the egress traffic to Firecrawl. Per-call cost.

Jina Reader (Experimental)

Display name in the UI: JinaCrawler (Experimental)

What it does

Third-party API-based reader. Sends each URL to Jina's Reader API, which fetches and parses the page (in browser mode by default) and returns structured Markdown content.

Settings

UI Label	Field	Type	Default
Timeout	`timeout`	integer (seconds)	`10`
Headers	`headers`	dict	`{"X-Return-Format": "markdown", "X-Engine": "browser", "DNT": "1"}`

The default headers ask Jina for Markdown output, browser-mode rendering, and a "Do Not Track" hint. Override only when you have a specific reason — see Jina's documentation for header semantics.
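
When an override is needed, it is usually safest to start from the defaults and change one key at a time. The sketch below does that and builds (without sending) a request. The helper names are hypothetical, and the `https://r.jina.ai/` endpoint and bearer-token header are assumptions based on Jina's public Reader API rather than on this reader's code.

```python
import urllib.request

DEFAULT_HEADERS = {"X-Return-Format": "markdown", "X-Engine": "browser", "DNT": "1"}


def with_overrides(overrides=None, api_key=None):
    """Return a full headers dict: the documented defaults plus any changes."""
    headers = {**DEFAULT_HEADERS, **(overrides or {})}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


def build_request(page_url, headers):
    """Build the request object only; nothing is sent over the network."""
    # Assumption: the public endpoint takes the target URL as its path.
    return urllib.request.Request("https://r.jina.ai/" + page_url, headers=headers)
```

For example, `with_overrides({"X-Return-Format": "text"})` keeps the browser engine and the Do Not Track hint while changing only the output format.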

When to use it

When you want consistently clean Markdown output and your sites benefit from Jina's browser-mode rendering. Per-call cost.

Tavily Reader (Experimental)

Display name in the UI: TavilyCrawler (Experimental)

What it does

Third-party API-based reader. Sends URLs (in batches of up to 20) to Tavily's extract API and receives Markdown content per page.
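
The batching itself is simple to picture. The helper below (a hypothetical name, not taken from the reader's code) splits a URL list into consecutive groups of at most 20, one group per extract call.

```python
def batched(urls, batch_size=20):
    """Split urls into consecutive batches of at most batch_size items."""
    urls = list(urls)
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

A search that yields 45 URLs would therefore need three calls, carrying 20, 20, and 5 URLs.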

Settings

UI Label	Field	Type	Default
Timeout	`timeout`	integer (seconds)	`10`
Extraction Depth	`depth`	`basic` / `advanced`	`advanced`

Extraction Depth — advanced returns more thorough extracted content; basic is faster but lighter.

When to use it

When you want batch extraction at low latency and your sites are well-supported by Tavily. Per-call cost.

Choosing a Reader

Reader	Built-in	JavaScript rendering	API key required	Best for
Basic	Yes	No	No	Static HTML sites, fast crawling, default for most Spaces
Browser-Based	Yes	Yes	No	JavaScript-heavy sites, single-page apps
Firecrawl (Experimental)	No	Yes	Yes	Reliable cloud-based scraping with no browser stack on the pod
Jina (Experimental)	No	Yes	Yes	High-quality Markdown extraction with browser-mode rendering
Tavily (Experimental)	No	Yes	Yes	Batch URL extraction at low latency

Related pages

For the architectural sequence diagrams of each reader (request paths, what egresses where), see Web Page Readers under Platform / Architecture.
For platform-level activation (ACTIVE_INHOUSE_CRAWLERS, API-key auto-activation) and per-reader credential setup, see Activation Reference and Web Page Reader Setup under Platform / Infrastructure.
