Crawler Setup and Configuration
Crawlers fetch web page content and convert it to readable text. They are used when a search engine returns only URLs and snippets (i.e., the engine "requires scraping"), and also for READ_URL steps in V2 mode.
How Crawlers Are Activated
The active crawler list is computed at startup from two sources:
Built-in crawlers from ACTIVE_INHOUSE_CRAWLERS (default: ["basic", "crawl4ai"]).
API-key-based crawlers that are automatically added when their API key is provided:
FIRECRAWL_API_KEY present --> Firecrawl crawler enabled.
JINA_API_KEY present --> Jina crawler enabled.
TAVILY_API_KEY present --> Tavily crawler enabled.
The space-level configuration UI shows only the activated crawlers as options.
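The activation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual startup code: the function name `compute_active_crawlers`, the crawler identifiers `firecrawl`, `jina`, and `tavily`, and the assumption that ACTIVE_INHOUSE_CRAWLERS is supplied as a JSON list are all hypothetical.

```python
import json
import os

# Hypothetical mapping from API-key variable to crawler identifier.
API_KEY_CRAWLERS = {
    "FIRECRAWL_API_KEY": "firecrawl",
    "JINA_API_KEY": "jina",
    "TAVILY_API_KEY": "tavily",
}

DEFAULT_INHOUSE = ["basic", "crawl4ai"]


def compute_active_crawlers(env=None):
    """Return the list of active crawlers for the given environment mapping."""
    env = os.environ if env is None else env
    raw = env.get("ACTIVE_INHOUSE_CRAWLERS")
    # Assumption: the variable, when set, holds a JSON list of crawler names.
    active = list(json.loads(raw)) if raw else list(DEFAULT_INHOUSE)
    for key_name, crawler in API_KEY_CRAWLERS.items():
        # A non-empty API key enables the corresponding crawler.
        if env.get(key_name) and crawler not in active:
            active.append(crawler)
    return active
```

Passing a plain dictionary instead of the real environment makes the rule easy to check in isolation, for example `compute_active_crawlers({"JINA_API_KEY": "k"})`.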
Basic Crawler
Category: Built-in (no external dependency)
How it works: Sends HTTP GET requests to each URL using httpx with randomized User-Agent headers to avoid bot detection. Converts the HTML response to Markdown using the markdownify library. All URLs are fetched concurrently (up to max_concurrent_requests).
Data flow:
Outbound: HTTP GET requests to target websites with randomized User-Agent headers. Routed through the platform proxy if configured (see 03-proxy.md).
Inbound: HTML content, converted to Markdown locally.
Setup: No API keys or external services required. Included by default in ACTIVE_INHOUSE_CRAWLERS.
Limitations:
Cannot render JavaScript-heavy pages (only sees the initial HTML).
Blocks PDF and binary content types by default.
URL patterns can be blacklisted (default: *.pdf URLs are skipped).
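The concurrency cap and the URL blacklist can be illustrated with a standard-library sketch. It is an assumption-laden outline rather than the crawler's real code: `is_blacklisted`, `crawl_all`, and the injected `fetch` callable are hypothetical names, and `fetch` stands in for the httpx GET plus markdownify conversion described above.

```python
import asyncio
from fnmatch import fnmatch
from urllib.parse import urlsplit

DEFAULT_BLACKLIST = ["*.pdf"]  # mirrors the default mentioned above


def is_blacklisted(url, patterns=DEFAULT_BLACKLIST):
    """True if the URL path matches any blacklist glob (case-insensitive)."""
    path = urlsplit(url).path.lower()
    return any(fnmatch(path, pattern.lower()) for pattern in patterns)


async def crawl_all(urls, fetch, max_concurrent_requests=5):
    """Fetch every allowed URL concurrently, at most N requests in flight.

    `fetch` is a hypothetical async callable (url -> Markdown string); in a
    real crawler it would issue the HTTP GET and convert the HTML.
    """
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def fetch_one(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception:
                # One failing page should not abort the whole batch.
                return url, None

    allowed = [u for u in urls if not is_blacklisted(u)]
    return dict(await asyncio.gather(*(fetch_one(u) for u in allowed)))
```

Matching the glob against the URL path only (not the query string) means a link such as `report.pdf?download=1` is still skipped.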
Crawl4AI Crawler
Category: Built-in (no external API key, but uses a local Chromium browser)
How it works: Uses the crawl4ai library to launch a headless Chromium browser. Each page is loaded with JavaScript rendering, full page scrolling, overlay removal, and user simulation. Content is filtered using a pruning algorithm that removes low-quality blocks, then converted to Markdown.
Data flow:
Outbound: Browser HTTP requests to target websites (same as a real browser). Includes JavaScript execution, CSS loading, and image loading (in text mode).
Inbound: Fully rendered page content, filtered and converted to Markdown locally.
Setup: No API keys required. The Chromium browser is included in the Docker image. Included by default in ACTIVE_INHOUSE_CRAWLERS.
Advantages over Basic Crawler:
Renders JavaScript-heavy single-page applications (SPAs).
Removes overlays, popups, and modals automatically.
Simulates human browsing behavior to avoid bot detection.
Built-in content quality filtering (pruning).
Configurable rate limiting per domain.
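To make the idea of per-domain rate limiting concrete, here is a small generic sketch. It does not reproduce crawl4ai's own rate limiter; the class `DomainRateLimiter`, its `reserve` method, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_interval_seconds=1.0, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock
        self._next_allowed = {}  # domain -> earliest time of the next request

    def reserve(self, url):
        """Reserve a slot for `url`; return the seconds the caller should wait."""
        domain = urlsplit(url).netloc.lower()
        now = self.clock()
        start = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = start + self.min_interval
        return start - now
```

A caller would sleep for the returned delay before issuing the request. Requests to different domains do not delay each other, because each domain keeps its own schedule.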
Firecrawl Crawler
Category: API-based (experimental)
How it works: Uses the Firecrawl API's batch scrape endpoint. Sends all URLs in a single batch request and receives Markdown content for each page.
Data flow:
Outbound: List of URLs and API key sent to Firecrawl servers.
Inbound: Markdown content for each URL.
Setup:
| Environment Variable | Value |
|---|---|
| FIRECRAWL_API_KEY | Your Firecrawl API key |
Providing this key automatically adds the Firecrawl crawler to the active list. No further configuration is needed.
Jina Reader Crawler
Category: API-based (experimental)
How it works: Sends POST requests to the Jina Reader API (https://r.jina.ai/) with each URL. The API fetches and parses the page (using browser-mode rendering by default) and returns structured content including title, description, and Markdown text.
Data flow:
Outbound: URL (JSON body), API key (via Authorization: Bearer header), rendering preferences (via X-Return-Format, X-Engine headers).
Inbound: JSON response with title, description, and Markdown content.
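The outbound request described above might be assembled roughly as follows. This sketch only builds the request pieces and does not send anything. The function name `build_reader_request` is hypothetical, and the header values `markdown` and `browser`, as well as the `Accept: application/json` header, are assumptions about typical settings rather than the platform's confirmed configuration.

```python
import json

DEFAULT_ENDPOINT = "https://r.jina.ai/"  # Reader URL quoted above


def build_reader_request(url, api_key, endpoint=DEFAULT_ENDPOINT,
                         return_format="markdown", engine="browser"):
    """Return (endpoint, headers, body) for one Reader call, without sending it."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Accept": "application/json",      # assumption: ask for a JSON response
        "X-Return-Format": return_format,  # assumption: Markdown output
        "X-Engine": engine,                # assumption: browser-mode rendering
    }
    body = json.dumps({"url": url}).encode("utf-8")
    return endpoint, headers, body
```

The three returned values can be handed to any HTTP client as the target, headers, and POST body.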
Setup:
| Environment Variable | Value |
|---|---|
| JINA_API_KEY | Your Jina AI API key |
| | Reader endpoint (default: |
Providing the API key automatically adds the Jina crawler to the active list.
Tavily Crawler
Category: API-based (experimental)
How it works: Uses the Tavily Python client's extract API to retrieve page content in Markdown format. URLs are processed in batches of up to 20 (Tavily API limit). Supports basic and advanced extraction depth.
Data flow:
Outbound: List of URLs, API key, and extraction parameters (depth, format) sent to Tavily servers.
Inbound: Raw Markdown content for each URL, with separate tracking of failed URLs.
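The batching and failure tracking described above can be sketched without the Tavily client itself. In this illustration `chunked`, `extract_all`, and the injected `extract_batch` callable are hypothetical names, and the response shape (`results` with `url` and `raw_content`, `failed_results` with `url`) is an assumption about what the extract call returns.

```python
TAVILY_BATCH_LIMIT = 20  # batch size stated above


def chunked(items, size=TAVILY_BATCH_LIMIT):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_all(urls, extract_batch):
    """Run `extract_batch` once per chunk and merge successes and failures.

    `extract_batch` is a hypothetical callable that takes a list of URLs and
    returns {"results": [{"url", "raw_content"}], "failed_results": [{"url"}]}.
    """
    pages, failed = {}, []
    for batch in chunked(urls):
        response = extract_batch(batch)
        for item in response.get("results", []):
            pages[item["url"]] = item["raw_content"]
        failed.extend(item["url"] for item in response.get("failed_results", []))
    return pages, failed
```

Keeping failed URLs in a separate list lets the caller retry them or fall back to another crawler without re-fetching the pages that succeeded.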
Setup:
| Environment Variable | Value |
|---|---|
| TAVILY_API_KEY | Your Tavily API key |
Providing this key automatically adds the Tavily crawler to the active list.
Crawler Comparison
| Crawler | Type | JS Rendering | API Key | Auto-Enabled | Best For |
|---|---|---|---|---|---|
| Basic | Built-in | No | Not needed | Default | Simple HTML pages, fast crawling |
| Crawl4AI | Built-in | Yes | Not needed | Default | JavaScript-heavy sites, SPAs |
| Firecrawl | API | Yes | Required | On key | Reliable cloud-based scraping |
| Jina | API | Yes | Required | On key | High-quality Markdown extraction |
| Tavily | API | Yes | Required | On key | Batch URL extraction |