Crawler Setup and Configuration

Crawlers fetch web page content and convert it to readable text. They are used when a search engine returns only URLs and snippets (i.e., the engine "requires scraping"), and also for READ_URL steps in V2 mode.


How Crawlers Are Activated

The active crawler list is computed at startup from two sources:

  1. Built-in crawlers from ACTIVE_INHOUSE_CRAWLERS (default: ["basic", "crawl4ai"]).

  2. API-key-based crawlers that are automatically added when their API key is provided:

    • FIRECRAWL_API_KEY present --> Firecrawl crawler enabled.

    • JINA_API_KEY present --> Jina crawler enabled.

    • TAVILY_API_KEY present --> Tavily crawler enabled.

The space-level configuration UI shows only the activated crawlers as options.
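The activation rule above can be sketched in a few lines. This is a minimal illustration, not the platform's actual startup code: the crawler identifiers "firecrawl", "jina", and "tavily", the function name, and the assumption that ACTIVE_INHOUSE_CRAWLERS is supplied as a JSON list are all hypothetical.

```python
import json
import os

# Environment variable -> crawler name. The variable names come from the list
# above; the crawler identifiers on the right are assumed for illustration.
API_KEY_CRAWLERS = {
    "FIRECRAWL_API_KEY": "firecrawl",
    "JINA_API_KEY": "jina",
    "TAVILY_API_KEY": "tavily",
}
DEFAULT_INHOUSE = ["basic", "crawl4ai"]


def compute_active_crawlers(env=None):
    """Return built-in crawlers plus every crawler whose API key is set."""
    env = os.environ if env is None else env
    # Assumption: ACTIVE_INHOUSE_CRAWLERS is a JSON list such as '["basic"]'.
    raw = env.get("ACTIVE_INHOUSE_CRAWLERS")
    active = list(json.loads(raw)) if raw else list(DEFAULT_INHOUSE)
    for key_name, crawler in API_KEY_CRAWLERS.items():
        # An empty or whitespace-only key is treated as "not provided".
        if env.get(key_name, "").strip() and crawler not in active:
            active.append(crawler)
    return active
```

Passing a plain dictionary instead of the real environment, for example compute_active_crawlers({"JINA_API_KEY": "..."}), makes the rule easy to check in isolation.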


Basic Crawler

Category: Built-in (no external dependency)

How it works: Sends an HTTP GET request to each URL using httpx, with a randomized User-Agent header to reduce the chance of bot detection. Converts the HTML response to Markdown using the markdownify library. URLs are fetched concurrently, with at most max_concurrent_requests requests in flight at a time.

Data flow:

  • Outbound: HTTP GET requests to target websites with randomized User-Agent headers. Routed through the platform proxy if configured (see 03-proxy.md).

  • Inbound: HTML content, converted to Markdown locally.
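The fetch-and-convert flow described above might look roughly like the following. It is a sketch under stated assumptions rather than the Basic Crawler's real code: the User-Agent pool, the function names, the timeout, and the default limit of 5 are hypothetical, and proxy routing is omitted. Only httpx, markdownify, and the max_concurrent_requests setting come from the description.

```python
import asyncio
import random

# Hypothetical User-Agent pool; the platform's real list is not documented here.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


async def fetch_markdown(url):
    """Fetch one page with httpx and convert its HTML to Markdown."""
    # Imported lazily so the concurrency helper below works without them.
    import httpx
    from markdownify import markdownify

    headers = {"User-Agent": random.choice(USER_AGENTS)}
    async with httpx.AsyncClient(follow_redirects=True, timeout=10.0) as client:
        response = await client.get(url, headers=headers)
        response.raise_for_status()
        return markdownify(response.text)


async def crawl_all(urls, fetch=fetch_markdown, max_concurrent_requests=5):
    """Fetch every URL concurrently without exceeding the concurrency limit."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def bounded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:  # one failing page must not abort the batch
                return url, exc

    # Maps each URL to its Markdown text, or to the exception it raised.
    return dict(await asyncio.gather(*(bounded(u) for u in urls)))
```

The semaphore is what enforces the limit: all tasks are scheduled at once, but only max_concurrent_requests of them can be inside the fetch call at any moment.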

Setup: No API keys or external services required. Included by default in ACTIVE_INHOUSE_CRAWLERS.

Limitations:

  • Cannot render JavaScript-heavy pages (only sees the initial HTML).

  • Blocks PDF and binary content types by default.

  • URL patterns can be blacklisted (default: *.pdf URLs are skipped).
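To make the last two limitations concrete, here is one way such filtering could be written. The *.pdf default is taken from the list above; the constant names, the allowed content types, and the choice to match patterns against the URL path are assumptions made for this sketch.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# "*.pdf" is the documented default; everything else here is hypothetical.
DEFAULT_URL_BLACKLIST = ["*.pdf"]
ALLOWED_CONTENT_TYPES = ("text/html", "application/xhtml+xml", "text/plain")


def is_url_blacklisted(url, patterns=DEFAULT_URL_BLACKLIST):
    """Match glob patterns against the URL path so query strings are ignored."""
    path = urlparse(url).path.lower()
    return any(fnmatchcase(path, pattern.lower()) for pattern in patterns)


def is_content_type_allowed(content_type_header):
    """Accept only textual responses; PDFs, images, and other binaries fail."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in ALLOWED_CONTENT_TYPES
```

The URL check runs before any request is sent, while the content-type check applies to the response headers, which catches PDFs served from URLs that do not end in .pdf.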


Crawl4AI Crawler

Category: Built-in (no external API key, but uses a local Chromium browser)

How it works: Uses the crawl4ai library to launch a headless Chromium browser. Each page is loaded with JavaScript rendering, full page scrolling, overlay removal, and user simulation. Content is filtered using a pruning algorithm that removes low-quality blocks, then converted to Markdown.

Data flow:

  • Outbound: Browser HTTP requests to target websites (same as a real browser). Includes JavaScript execution, CSS loading, and image loading (in text mode).

  • Inbound: Fully rendered page content, filtered and converted to Markdown locally.

Setup: No API keys required. The Chromium browser is included in the Docker image. Included by default in ACTIVE_INHOUSE_CRAWLERS.

Advantages over Basic Crawler:

  • Renders JavaScript-heavy single-page applications (SPAs).

  • Removes overlays, popups, and modals automatically.

  • Simulates human browsing behavior to avoid bot detection.

  • Built-in content quality filtering (pruning).

  • Configurable rate limiting per domain.
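For orientation, a crawl with these features enabled could be configured along the following lines. This is a sketch against recent crawl4ai releases, not the platform's configuration: the pruning threshold is an illustrative value, the function and constant names are hypothetical, per-domain rate limiting is left out, and parameter names may differ between crawl4ai versions.

```python
# Page-handling options named in the description above, expressed as
# CrawlerRunConfig keyword arguments (names as in recent crawl4ai releases).
RUN_OPTIONS = {
    "scan_full_page": True,           # scroll through the whole page
    "remove_overlay_elements": True,  # drop popups and modals
    "simulate_user": True,            # mimic user interaction signals
}


async def crawl_with_browser(url):
    """Render one page in headless Chromium and return pruned Markdown."""
    # Imported lazily: crawl4ai and its Chromium install are heavy dependencies.
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
    from crawl4ai.content_filter_strategy import PruningContentFilter
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

    run_config = CrawlerRunConfig(
        **RUN_OPTIONS,
        markdown_generator=DefaultMarkdownGenerator(
            # Illustrative threshold, not the platform's setting.
            content_filter=PruningContentFilter(threshold=0.48)
        ),
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun(url=url, config=run_config)
    if not result.success:
        raise RuntimeError(result.error_message)
    # Recent releases expose the filtered text as result.markdown.fit_markdown;
    # fall back to the plain string form when that attribute is missing.
    return getattr(result.markdown, "fit_markdown", None) or str(result.markdown)
```

Running it requires the crawl4ai package and a Chromium install, for example asyncio.run(crawl_with_browser("https://example.com")).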


Firecrawl Crawler

Category: API-based (experimental)

How it works: Uses the Firecrawl API's batch scrape endpoint. Sends all URLs in a single batch request and receives Markdown content for each page.

Data flow:

  • Outbound: List of URLs and API key sent to Firecrawl servers.

  • Inbound: Markdown content for each URL.

Setup:

| Environment Variable | Value |
| --- | --- |
| FIRECRAWL_API_KEY | Your Firecrawl API key |

Providing this key automatically adds the Firecrawl crawler to the active list. No further configuration is needed.
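As an illustration of the single batch request, the sketch below builds (but does not send) such a call. The endpoint URL and payload shape are assumptions based on Firecrawl's public v1 REST API, and the function name is hypothetical; the platform may use a client library instead.

```python
import json
import urllib.request

# Assumed endpoint, based on Firecrawl's public v1 REST API.
FIRECRAWL_BATCH_URL = "https://api.firecrawl.dev/v1/batch/scrape"


def build_batch_scrape_request(urls, api_key):
    """Build one POST request that covers every URL in the list."""
    payload = {"urls": list(urls), "formats": ["markdown"]}
    return urllib.request.Request(
        FIRECRAWL_BATCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would submit the batch. How results are then retrieved (for example by polling a job identifier) depends on the Firecrawl API version, so check the Firecrawl documentation before relying on this shape.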


Jina Reader Crawler

Category: API-based (experimental)

How it works: Sends POST requests to the Jina Reader API (<https://r.jina.ai/>) with each URL. The API fetches and parses the page (using browser-mode rendering by default) and returns structured content including title, description, and Markdown text.

Data flow:

  • Outbound: URL (JSON body), API key (via Authorization: Bearer header), rendering preferences (via X-Return-Format, X-Engine headers).

  • Inbound: JSON response with title, description, and Markdown content.

Setup:

| Environment Variable | Value |
| --- | --- |
| JINA_API_KEY | Your Jina AI API key |
| JINA_READER_API_ENDPOINT | Reader endpoint (default: <https://r.jina.ai/>) |

Providing the API key automatically adds the Jina crawler to the active list.
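The request described in the data flow can be sketched as follows. The header names come from the description above; the header values ("markdown", "browser"), the Accept header, the response field names under "data", and both function names are assumptions for illustration.

```python
import json
import os
import urllib.request

DEFAULT_ENDPOINT = "https://r.jina.ai/"


def build_reader_request(url, api_key, endpoint=None):
    """Build (but do not send) a POST request for one page."""
    endpoint = endpoint or os.environ.get("JINA_READER_API_ENDPOINT", DEFAULT_ENDPOINT)
    return urllib.request.Request(
        endpoint,
        data=json.dumps({"url": url}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "application/json",   # ask for a JSON response
            "X-Return-Format": "markdown",  # assumed value
            "X-Engine": "browser",          # assumed value (browser-mode rendering)
        },
        method="POST",
    )


def parse_reader_response(body):
    """Pull title, description, and Markdown out of the JSON response."""
    # Assumption: the payload is wrapped in a top-level "data" object.
    data = json.loads(body).get("data", {})
    return {
        "title": data.get("title", ""),
        "description": data.get("description", ""),
        "markdown": data.get("content", ""),
    }
```

Keeping request construction separate from sending makes it possible to inspect the headers, or to point the endpoint at a self-hosted reader, without making a network call.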


Tavily Crawler

Category: API-based (experimental)

How it works: Uses the Tavily Python client's extract API to retrieve page content in Markdown format. URLs are processed in batches of up to 20 (Tavily API limit). Supports basic and advanced extraction depth.

Data flow:

  • Outbound: List of URLs and API key sent to Tavily servers, extraction parameters (depth, format).

  • Inbound: Raw Markdown content for each URL, with separate tracking of failed URLs.

Setup:

| Environment Variable | Value |
| --- | --- |
| TAVILY_API_KEY | Your Tavily API key |

Providing this key automatically adds the Tavily crawler to the active list.
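The batching behavior can be illustrated with a small helper. The limit of 20 comes from the description above; the function names are hypothetical, and the response keys ("results", "raw_content", "failed_results") and keyword arguments follow the Tavily Python client as commonly documented, so treat them as assumptions.

```python
TAVILY_MAX_URLS_PER_CALL = 20  # batch limit stated above


def chunk(urls, size=TAVILY_MAX_URLS_PER_CALL):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def extract_all(urls, extract, extract_depth="basic"):
    """Call `extract` once per batch and merge successes and failures.

    `extract` is any callable shaped like the Tavily client's extract
    method, for example TavilyClient(api_key=...).extract.
    """
    pages, failed = {}, []
    for batch in chunk(list(urls)):
        response = extract(urls=batch, extract_depth=extract_depth, format="markdown")
        for item in response.get("results", []):
            pages[item["url"]] = item["raw_content"]
        # Failed URLs are tracked separately instead of raising.
        failed.extend(item["url"] for item in response.get("failed_results", []))
    return pages, failed
```

Injecting the extract callable keeps the batching logic independent of the client library, so it can be exercised with a stub before a real API key is available.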


Crawler Comparison

| Crawler | Type | JS Rendering | API Key | Auto-Enabled | Best For |
| --- | --- | --- | --- | --- | --- |
| Basic | Built-in | No | Not needed | Default | Simple HTML pages, fast crawling |
| Crawl4AI | Built-in | Yes | Not needed | Default | JavaScript-heavy sites, SPAs |
| Firecrawl | API | Yes | Required | On key | Reliable cloud-based scraping |
| Jina | API | Yes | Required | On key | High-quality Markdown extraction |
| Tavily | API | Yes | Required | On key | Batch URL extraction |
