Web Page Reader Configuration

5 min read

This page documents every setting under Web Page Reader in the Spaces configuration UI. Each setting is shown by its UI label, with the underlying configuration field name in italics.

A "web page reader" is what fetches a page's content after a search returns a URL — also called a "crawler" in the code. Some search engines return URLs only and the configured reader is run after every search; others return content directly and the reader is not used unless the orchestrator AI asks for a specific URL.

The list of available readers is decided at the platform level by the ACTIVE_INHOUSE_CRAWLERS environment variable plus the presence of API keys for third-party readers. Only readers on that list appear in the Web Page Reader selector. See the Activation Reference page under Platform / Infrastructure for what controls availability.

For deployment-side credential setup (API keys for the third-party readers), see the Web Page Reader Setup page under Platform / Infrastructure. For sequence diagrams of how each reader actually fetches a page, see Web Page Readers under Platform / Architecture.

Note for the next iteration: most reader settings below currently render in the Spaces UI with auto-generated camelCase labels because their Pydantic models use only description=... and not title=.... The labels in the Settings tables on this page are the recommended UI labels — they will become the actual labels once a follow-up code change adds explicit title= values.

Common setting

Every reader shares one setting:

UI Label

Field

Type

Default

Timeout

timeout

integer (seconds)

10

Per-page request timeout. Pages that take longer than this are abandoned with an error.


Basic Reader

Display name in the UI: BasicCrawler

What it does

Built-in lightweight reader. Fetches each URL with a single HTTP GET (using a randomized user-agent header) and converts the HTML response to Markdown locally. All URLs are fetched in parallel.

Cannot render JavaScript — only sees the initial HTML response.

Settings

UI Label

Field

Type

Default

Timeout

timeout

integer (seconds)

10

URL Pattern Blacklist

url_pattern_blacklist

list of regex strings

[".*\\.pdf$"]

Unwanted Content Types

unwanted_content_types

set of strings

PDFs, Office documents, images, video, audio

Maximum Concurrent Requests

max_concurrent_requests

integer

10

  • URL Pattern Blacklist — Regex patterns; URLs matching any pattern are skipped before any request is sent. The default blocks PDFs. This is also where to add internal-network ranges for SSRF defense (see Security Risk Assessment).

  • Unwanted Content Types — HTTP Content-Type values that are rejected after the response arrives. The default blocks binary file types.

  • Maximum Concurrent Requests — Pages fetched in parallel.

When to use it

Default for most Spaces. Fast, no external dependency, no API costs. Use when the sites you expect to fetch are mostly static HTML (news, documentation, public reference content) and don't depend on client-side JavaScript.

Limitations

  • Cannot fetch JavaScript-heavy sites (single-page applications, sites that render content client-side).

  • Often blocked by modern anti-bot protections on major news and finance sites.


Browser-Based Reader

Display name in the UI: Crawl4AiCrawler

What it does

Built-in browser-based reader. Each URL is loaded in a real headless Chromium browser inside assistants-core, with JavaScript execution, full-page scrolling, overlay removal, and human-browsing simulation. Then the rendered HTML is converted to Markdown.

Settings

UI Label

Field

Type

Default

Timeout

timeout

integer (seconds)

10

Maximum Concurrent Requests

max_concurrent_requests

integer

10

Maximum Browser Sessions

max_session_permit

integer

10

Markdown Generator Config (markdown_generator_config):

UI Label

Field

Default

Markdown Options

markdown_generator_config.options

{"ignore_links": true, "ignore_emphasis": true, "ignore_images": true}

Rate Limiter Config (rate_limiter_config):

UI Label

Field

Default

Base Delay

rate_limiter_config.base_delay

(0.5, 1.0) seconds

Maximum Delay

rate_limiter_config.max_delay

1.0 seconds

Maximum Retries

rate_limiter_config.max_retries

0

Rate-Limit HTTP Codes

rate_limiter_config.rate_limit_codes

[429, 503]

Browser Config (crawler_config):

UI Label

Field

Default

Cache Mode

crawler_config.cache_mode

bypass

Scan Full Page

crawler_config.scan_full_page

true

Wait Until

crawler_config.wait_until

domcontentloaded

Scroll Delay

crawler_config.scroll_delay

0.05 seconds

Remove Overlay Elements

crawler_config.remove_overlay_elements

true

Simulate User

crawler_config.simulate_user

true

Override Navigator

crawler_config.override_navigator

true

Content Filter Config (pruning_content_filter_config):

UI Label

Field

Default

Enable Content Filter

pruning_content_filter_config.enabled

true

Quality Threshold

pruning_content_filter_config.threshold

0.5

Threshold Type

pruning_content_filter_config.threshold_type

fixed

Minimum Word Threshold

pruning_content_filter_config.min_word_threshold

10

When to use it

When the sites you expect to fetch render their content with JavaScript or use anti-bot protection that the Basic Reader cannot get past. Tradeoff: latency is roughly 3x higher than Basic, and resource usage on the pod is significantly higher (one Chromium instance per concurrent page).

Limitations

  • Significantly slower than Basic.

  • Heavy resource usage on the assistants-core pod.

  • Operational overhead (Chromium binary, profiles, memory).


Firecrawl Reader (Experimental)

Display name in the UI: FirecrawlCrawler (Experimental)

What it does

Third-party API-based reader. Sends URLs to Firecrawl's batch-scrape API and receives clean Markdown content per page.

Settings

UI Label

Field

Type

Default

Timeout

timeout

integer (seconds)

10

When to use it

When you want Firecrawl's reliability for difficult sites without managing a browser stack inside the pod. Off-loads the rendering work and the egress traffic to Firecrawl. Per-call cost.


Jina Reader (Experimental)

Display name in the UI: JinaCrawler (Experimental)

What it does

Third-party API-based reader. Sends each URL to Jina's Reader API, which fetches and parses the page (in browser mode by default) and returns structured Markdown content.

Settings

UI Label

Field

Type

Default

Timeout

timeout

integer (seconds)

10

Headers

headers

dict

{"X-Return-Format": "markdown", "X-Engine": "browser", "DNT": "1"}

The default headers ask Jina for Markdown output, browser-mode rendering, and a "Do Not Track" hint. Override only when you have a specific reason — see Jina's documentation for header semantics.

When to use it

When you want consistently clean Markdown output and your sites benefit from Jina's browser-mode rendering. Per-call cost.


Tavily Reader (Experimental)

Display name in the UI: TavilyCrawler (Experimental)

What it does

Third-party API-based reader. Sends URLs (in batches of up to 20) to Tavily's extract API and receives Markdown content per page.

Settings

UI Label

Field

Type

Default

Timeout

timeout

integer (seconds)

10

Extraction Depth

depth

basic / advanced

advanced

  • Extraction Depthadvanced returns more thorough extracted content; basic is faster but lighter.

When to use it

When you want batch extraction at low latency and your sites are well-supported by Tavily. Per-call cost.


Choosing a Reader

Reader

Built-in

JavaScript rendering

API key required

Best for

Basic

Yes

No

No

Static HTML sites, fast crawling, default for most Spaces

Browser-Based

Yes

Yes

No

JavaScript-heavy sites, single-page apps

Firecrawl (Experimental)

No

Yes

Yes

Reliable cloud-based scraping with no browser stack on the pod

Jina (Experimental)

No

Yes

Yes

High-quality Markdown extraction with browser-mode rendering

Tavily (Experimental)

No

Yes

Yes

Batch URL extraction at low latency

  • For the architectural sequence diagrams of each reader (request paths, what egresses where): see Web Page Readers under Platform / Architecture.

  • For platform-level activation (ACTIVE_INHOUSE_CRAWLERS, API-key auto-activation) and per-reader credential setup: see Activation Reference and Web Page Reader Setup under Platform / Infrastructure.

Last updated