Crawler Setup and Configuration

Crawlers fetch web page content and convert it to readable text. They are used when a search engine returns only URLs and snippets (i.e., the engine "requires scraping"), and also for READ_URL steps in V2 mode.

How Crawlers Are Activated

The active crawler list is computed at startup from two sources:

1. Built-in crawlers from ACTIVE_INHOUSE_CRAWLERS (default: ["basic", "crawl4ai"]).
2. API-key-based crawlers, which are added automatically when their API key is provided:
   - FIRECRAWL_API_KEY present --> Firecrawl crawler enabled.
   - JINA_API_KEY present --> Jina crawler enabled.
   - TAVILY_API_KEY present --> Tavily crawler enabled.

The space-level configuration UI shows only the activated crawlers as options.
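
The startup computation described above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the crawler identifiers "firecrawl", "jina", and "tavily", the function name, and the assumption that ACTIVE_INHOUSE_CRAWLERS is stored as a JSON list are all hypothetical.

```python
import json
import os

# Hypothetical mapping from API-key variable to crawler name.
API_KEY_CRAWLERS = {
    "FIRECRAWL_API_KEY": "firecrawl",
    "JINA_API_KEY": "jina",
    "TAVILY_API_KEY": "tavily",
}

# Default taken from the text above.
DEFAULT_INHOUSE = ["basic", "crawl4ai"]


def compute_active_crawlers(env=None):
    """Return built-in crawlers followed by any crawler whose API key is set."""
    env = os.environ if env is None else env
    raw = env.get("ACTIVE_INHOUSE_CRAWLERS")
    # Assumption: when set, the variable holds a JSON list such as '["basic"]'.
    active = list(json.loads(raw)) if raw else list(DEFAULT_INHOUSE)
    for key_var, crawler in API_KEY_CRAWLERS.items():
        if env.get(key_var):  # an empty string counts as "not provided"
            active.append(crawler)
    return active
```

Passing a plain dictionary instead of the real environment, for example `compute_active_crawlers({"JINA_API_KEY": "..."})`, makes it easy to check which crawlers a given configuration would activate.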

Basic Crawler

Category: Built-in (no external dependency)

How it works: Sends HTTP GET requests to each URL using httpx with randomized User-Agent headers to avoid bot detection. Converts the HTML response to Markdown using the markdownify library. All URLs are fetched concurrently (up to max_concurrent_requests).
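
The concurrency limit and User-Agent rotation can be illustrated with a short sketch. It assumes a caller-supplied coroutine `fetch_one(url, headers)` (a hypothetical name) that would, in the Basic crawler, wrap the httpx GET and the markdownify conversion. The User-Agent strings are illustrative only, and the default limit of 5 is an assumption.

```python
import asyncio
import random

# Illustrative User-Agent pool; the real list is not given in this document.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


async def fetch_all(urls, fetch_one, max_concurrent_requests=5):
    """Fetch every URL concurrently, never running more than the limit at once.

    fetch_one(url, headers) is a hypothetical caller-supplied coroutine that
    returns the converted page text.
    """
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def guarded(url):
        # Pick a fresh User-Agent for each request.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        async with semaphore:
            try:
                return url, await fetch_one(url, headers)
            except Exception as exc:  # one failing page should not abort the batch
                return url, exc

    # gather preserves input order, so results line up with urls.
    return await asyncio.gather(*(guarded(u) for u in urls))
```

Each result is a `(url, value)` pair where `value` is either the page text or the exception raised for that URL, so callers can report failures per page.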

Data flow:

Outbound: HTTP GET requests to target websites with randomized User-Agent headers. Routed through the platform proxy if configured (see 03-proxy.md).
Inbound: HTML content, converted to Markdown locally.

Setup: No API keys or external services required. Included by default in ACTIVE_INHOUSE_CRAWLERS.

Limitations:

Cannot render JavaScript-heavy pages (only sees the initial HTML).
Blocks PDF and binary content types by default.
URL patterns can be blacklisted (default: *.pdf URLs are skipped).
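
The last two limitations amount to two checks, one on the URL before the request and one on the response's Content-Type. A minimal sketch, assuming glob-style patterns and an illustrative set of blocked content types (the document does not list which types count as binary, and the function names are hypothetical):

```python
from fnmatch import fnmatch

# Default from the text: URLs matching *.pdf are skipped.
DEFAULT_URL_BLACKLIST = ["*.pdf"]

# Assumption: illustrative prefixes for PDF and binary content types.
BLOCKED_CONTENT_PREFIXES = (
    "application/pdf",
    "application/octet-stream",
    "image/",
    "audio/",
    "video/",
)


def is_url_allowed(url, blacklist=DEFAULT_URL_BLACKLIST):
    """Return False when the URL matches any blacklisted glob pattern."""
    lowered = url.lower()
    return not any(fnmatch(lowered, pattern) for pattern in blacklist)


def is_content_type_allowed(content_type):
    """Return False for PDF and binary responses, based on the Content-Type header."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return not media_type.startswith(BLOCKED_CONTENT_PREFIXES)
```

Note that a pattern such as `*.pdf` only matches URLs that end in `.pdf`; a PDF served from a URL with a query string would instead be caught by the content-type check.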

Crawl4AI Crawler

Category: Built-in (no external API key, but uses a local Chromium browser)

How it works: Uses the crawl4ai library to launch a headless Chromium browser. Each page is loaded with JavaScript rendering, full page scrolling, overlay removal, and user simulation. Content is filtered using a pruning algorithm that removes low-quality blocks, then converted to Markdown.

Data flow:

Outbound: Browser HTTP requests to target websites (same as a real browser). Includes JavaScript execution, CSS loading, and image loading (in text mode).
Inbound: Fully rendered page content, filtered and converted to Markdown locally.

Setup: No API keys required. The Chromium browser is included in the Docker image. Included by default in ACTIVE_INHOUSE_CRAWLERS.

Advantages over Basic Crawler:

Renders JavaScript-heavy single-page applications (SPAs).
Removes overlays, popups, and modals automatically.
Simulates human browsing behavior to avoid bot detection.
Built-in content quality filtering (pruning).
Configurable rate limiting per domain.
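
To make the idea of per-domain rate limiting concrete, here is a small standalone sketch. It is not the crawl4ai implementation; the class name, the fixed minimum delay, and the injectable clock are assumptions chosen for illustration.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._next_allowed = {}  # domain -> earliest time the next request may start

    def wait_time(self, url):
        """Reserve the next slot for this URL's domain and return seconds to wait."""
        domain = urlparse(url).netloc.lower()
        now = self.clock()
        start = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = start + self.min_delay
        return start - now
```

A caller would sleep for the returned number of seconds before issuing the request. Requests to different domains do not delay each other, which keeps a crawl over many sites fast while staying polite to each one.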

Firecrawl Crawler

Category: API-based (experimental)

How it works: Uses the Firecrawl API's batch scrape endpoint. Sends all URLs in a single batch request and receives Markdown content for each page.

Data flow:

Outbound: List of URLs and API key sent to Firecrawl servers.
Inbound: Markdown content for each URL.

Setup:

Environment Variable	Value
`FIRECRAWL_API_KEY`	Your Firecrawl API key

Providing this key automatically adds the Firecrawl crawler to the active list. No further configuration is needed.

Jina Reader Crawler

Category: API-based (experimental)

How it works: Sends POST requests to the Jina Reader API (https://r.jina.ai/) with each URL. The API fetches and parses the page (using browser-mode rendering by default) and returns structured content including title, description, and Markdown text.

Data flow:

Outbound: URL (JSON body), API key (via Authorization: Bearer header), rendering preferences (via X-Return-Format, X-Engine headers).
Inbound: JSON response with title, description, and Markdown content.
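
The outbound request described above can be sketched as a small builder that assembles the endpoint, headers, and JSON body without sending anything. The function name is hypothetical, and the header values "markdown" and "browser" as well as the `Accept` header are assumptions about how the preferences are set, not confirmed platform defaults.

```python
import json

DEFAULT_ENDPOINT = "https://r.jina.ai/"


def build_jina_request(url, api_key, endpoint=DEFAULT_ENDPOINT,
                       return_format="markdown", engine="browser"):
    """Return (endpoint, headers, body) for one Jina Reader POST request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Accept": "application/json",        # assumption: ask for a JSON response
        "X-Return-Format": return_format,    # assumption: Markdown output
        "X-Engine": engine,                  # assumption: browser-mode rendering
    }
    # The target URL travels in the JSON body, not in the request path.
    body = json.dumps({"url": url}).encode("utf-8")
    return endpoint, headers, body
```

The returned triple can be handed to any HTTP client; keeping the builder free of network calls also makes it clear exactly which data leaves the platform.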

Setup:

Environment Variable	Value
`JINA_API_KEY`	Your Jina AI API key
`JINA_READER_API_ENDPOINT`	Reader endpoint (default: `https://r.jina.ai/`)

Providing the API key automatically adds the Jina crawler to the active list.

Tavily Crawler

Category: API-based (experimental)

How it works: Uses the Tavily Python client's extract API to retrieve page content in Markdown format. URLs are processed in batches of up to 20 (Tavily API limit). Supports basic and advanced extraction depth.

Data flow:

Outbound: List of URLs, API key, and extraction parameters (depth, format) sent to Tavily servers.
Inbound: Raw Markdown content for each URL, with separate tracking of failed URLs.
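
The batching and failure tracking can be sketched as follows. The `extract_batch` callable is a hypothetical stand-in for the Tavily client's extract call, and the response shape used here (a `results` list with `url` and `raw_content`, plus a `failed_results` list with `url`) is an assumption for illustration.

```python
TAVILY_BATCH_LIMIT = 20  # per the text: at most 20 URLs per extract call


def chunked(urls, size=TAVILY_BATCH_LIMIT):
    """Yield consecutive slices of urls, each holding at most `size` items."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def extract_all(urls, extract_batch):
    """Run extract_batch on each batch and merge the outcomes.

    extract_batch(batch) is a hypothetical callable returning a dict with
    'results' and 'failed_results' lists (assumed shape).
    Returns (contents, failed): a url -> Markdown mapping and a list of failed URLs.
    """
    contents, failed = {}, []
    for batch in chunked(urls):
        response = extract_batch(batch)
        for item in response.get("results", []):
            contents[item["url"]] = item["raw_content"]
        for item in response.get("failed_results", []):
            failed.append(item["url"])
    return contents, failed
```

With 45 URLs this issues three calls of 20, 20, and 5 URLs. Keeping failed URLs in a separate list lets the caller retry them or fall back to another crawler.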

Setup:

Environment Variable	Value
`TAVILY_API_KEY`	Your Tavily API key

Providing this key automatically adds the Tavily crawler to the active list.

Crawler Comparison

Crawler	Type	JS Rendering	API Key	Auto-Enabled	Best For
Basic	Built-in	No	Not needed	Default	Simple HTML pages, fast crawling
Crawl4AI	Built-in	Yes	Not needed	Default	JavaScript-heavy sites, SPAs
Firecrawl	API	Yes	Required	On key	Reliable cloud-based scraping
Jina	API	Yes	Required	On key	High-quality Markdown extraction
Tavily	API	Yes	Required	On key	Batch URL extraction