Content Processing Configuration
This page documents every setting under Content Processing in the Spaces configuration UI. Each setting is shown by its UI label, with the underlying configuration field name in italics.
After a search retrieves URLs and a Web Page Reader fetches their content, the result goes through the content processing pipeline before being handed to the AI assistant. The pipeline runs in this order: cleaning, then summarization (if enabled), then truncation (if enabled), then chunking.

Each stage is independently configurable. Chunking always runs as the final step.
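The ordering described above can be sketched as a small function that applies each stage in turn. This is only an illustration: `run_pipeline` and its parameters are hypothetical names, and the stages are passed in as plain callables rather than reflecting how the platform implements them.

```python
from typing import Callable, List, Optional

Stage = Callable[[str], str]


def run_pipeline(
    page: str,
    clean: Stage,
    chunk: Callable[[str], List[str]],
    summarize: Optional[Stage] = None,
    truncate: Optional[Stage] = None,
) -> List[str]:
    """Apply the stages in the documented order (hypothetical helper)."""
    text = clean(page)              # cleaning always runs first
    if summarize is not None:       # summarization only if enabled
        text = summarize(text)
    if truncate is not None:        # truncation only if enabled
        text = truncate(text)
    return chunk(text)              # chunking is always the final step
```

Passing `None` for `summarize` or `truncate` models a disabled stage; cleaning and chunking are required arguments because they always run.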
Content Chunk Size (chunk_size)
UI Label | Field | Type | Default |
|---|---|---|---|
Content Chunk Size | chunk_size | integer (characters) | |
Maximum size of each piece when long pages are split into chunks. Smaller values create more granular pieces (better for relevancy ranking but more pieces); larger values keep more context per piece.
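To make the trade-off concrete, here is a minimal sketch of fixed-width splitting by character count. It assumes the simplest possible strategy; the real splitter may well respect sentence or paragraph boundaries, and `split_into_chunks` is a hypothetical name.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into pieces of at most chunk_size characters (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


page = "x" * 1000
small = split_into_chunks(page, 100)   # more, finer-grained pieces
large = split_into_chunks(page, 500)   # fewer pieces, more context in each
```

For the same 1,000-character page, the smaller setting produces ten pieces and the larger one produces two.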
Content Cleaning (cleaning)
Automatic cleanup steps run on every page before any other processing.
Enable Character Sanitization (cleaning.enable_character_sanitize)
UI Label | Field | Type | Default |
|---|---|---|---|
Enable Character Sanitization | cleaning.enable_character_sanitize | boolean | |
When enabled, strips null bytes, control characters, and other non-text binary content from page content. Recommended on. Mainly defends against pages that include null bytes or unusual control characters that downstream tooling cannot handle.
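As a rough illustration of this kind of cleanup, the sketch below drops Unicode control characters (category `Cc`, which includes the null byte) while keeping newlines and tabs. Which characters the platform actually strips is not documented here, so the kept set and the function name `sanitize_characters` are assumptions.

```python
import unicodedata

# Assumption: newlines and tabs are ordinary layout characters worth keeping.
KEEP = {"\n", "\t"}


def sanitize_characters(text: str) -> str:
    """Remove control characters (Unicode category Cc) except those in KEEP."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )
```

For example, `sanitize_characters("a\x00b\x07c\n")` returns `"abc\n"`.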
Line Removal (cleaning.line_removal)
Removes irrelevant lines from page content using regex patterns — typically navigation menus, cookie banners, "skip to content" links, and footer boilerplate.
UI Label | Field | Type | Default |
|---|---|---|---|
Enable Line Removal | | boolean | |
Removal Patterns | | list of regex strings | Built-in list (see below) |
Built-in patterns target:
"Skip to" / "Skip Navigation" / "Jump to" / "Accessibility help"
Standalone "Sign In" / "Log In" / "Register" / "My Account" lines
"Subscribe" / "Follow Us" / "Share This" / "Newsletter Sign Up"
"Cookie Policy" / "Privacy Policy" / "Terms of Service" / "Cookie Settings" / "Cookie Notice" / "Accept Cookies"
Accessibility labels in […accessibility…] form
Each pattern is a regular expression matched against full lines.
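Full-line matching can be sketched with `re.fullmatch`, which only succeeds when the pattern covers the entire line. The patterns below are hypothetical stand-ins, not the built-in list, and the case-insensitive matching and whitespace stripping are assumptions made for the example.

```python
import re
from typing import Iterable


def remove_lines(text: str, patterns: Iterable[str]) -> str:
    """Drop every line that one of the patterns matches in full."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    kept = [
        line for line in text.splitlines()
        if not any(rx.fullmatch(line.strip()) for rx in compiled)
    ]
    return "\n".join(kept)


# Hypothetical patterns in the spirit of the built-in targets listed above.
example_patterns = [
    r"skip to .*",
    r"(sign in|log in|register|my account)",
    r"accept cookies",
]
```

Because the match must span the whole line, a standalone "Sign In" line is removed while a sentence such as "How to sign in to your account" is kept.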
Enable Link and URL Cleanup (cleaning.enable_markdown_cleaning)
UI Label | Field | Type | Default |
|---|---|---|---|
Enable Link and URL Cleanup | cleaning.enable_markdown_cleaning | boolean | |
When enabled, applies a fixed set of Markdown transformations:
Strips the URL out of [text](url) Markdown links, keeping only the link text
Removes standalone URLs from the text
Normalizes excess whitespace and blank lines
This makes the content more compact and avoids the AI quoting noisy URLs.
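The three transformations can be approximated with a few regular expressions. This is a sketch under simple assumptions (only `http`/`https` URLs, no nested brackets in link text); the platform's actual rules are fixed and may differ, and `clean_markdown` is a hypothetical name.

```python
import re


def clean_markdown(text: str) -> str:
    """Approximate the link, URL, and whitespace cleanup described above."""
    # [text](url) -> text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Remove standalone http(s) URLs
    text = re.sub(r"https?://\S+", "", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Drop trailing spaces before line breaks
    text = re.sub(r" +\n", "\n", text)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Links are unwrapped before bare URLs are removed, so the link text survives even though its target is discarded.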
Additional Processing Strategies (processing_strategies)
These strategies run after cleaning. They differ in their defaults: AI Summarization is off until you toggle it on, while the Content Length Limit is on by default and can be raised or disabled.
Content Length Limit (processing_strategies.truncate)
Caps the size of any single page so very long pages don't dominate the context budget.
UI Label | Field | Type | Default |
|---|---|---|---|
Enable Content Length Limit | | boolean | |
Maximum Content Length | | integer (tokens) | |
Pages exceeding Maximum Content Length are cut off at the limit. A token is roughly 4 characters. Truncation runs after AI Summarization, so it only applies when the summarized output is still too long.
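The 4-characters-per-token rule of thumb gives a quick way to reason about the limit. The sketch below uses that ratio directly; a real implementation would more likely count tokens with the model's tokenizer, and both function names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough ratio from the text above, not an exact tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count by rounding len(text) / 4 up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Cut the text off at approximately max_tokens tokens."""
    return text[: max_tokens * CHARS_PER_TOKEN]
```

Under this heuristic, a limit of 10,000 tokens keeps roughly the first 40,000 characters of a page.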
AI Content Summarization (processing_strategies.llm_processor)
Use an AI model to summarize and compress page content per query. Useful for long pages where most of the content is irrelevant to the user's question.
Enable AI Summarization (processing_strategies.llm_processor.enabled)
UI Label | Field | Type | Default |
|---|---|---|---|
Enable AI Summarization | processing_strategies.llm_processor.enabled | boolean | |
Summarization Language Model (processing_strategies.llm_processor.language_model)
The AI model used for summarization. Must support structured output.
Minimum Content Length for Summarization (processing_strategies.llm_processor.min_tokens)
UI Label | Field | Type | Default |
|---|---|---|---|
Minimum Content Length for Summarization | processing_strategies.llm_processor.min_tokens | integer (tokens) | |
Pages shorter than this threshold are kept as-is. Only longer pages are sent to the AI for summarization. Ignored when Privacy Filtering is enabled — privacy filtering always runs regardless of length.
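The gating rule can be written as a small predicate. This sketch assumes AI Summarization itself is already enabled and treats a page exactly at the threshold as long enough, which the text does not specify; `should_summarize` is a hypothetical name.

```python
def should_summarize(token_count: int, min_tokens: int, privacy_filter_enabled: bool) -> bool:
    """Decide whether a page is sent to the AI model (illustrative only)."""
    if privacy_filter_enabled:
        # Privacy filtering always runs, so the length threshold is ignored.
        return True
    # Assumption: a page exactly at the threshold counts as long enough.
    return token_count >= min_tokens
```

With a threshold of 500 tokens, a 200-token page is kept as-is unless Privacy Filtering is on, in which case it is processed regardless of length.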
Privacy Filtering (processing_strategies.llm_processor.privacy_filter)
Privacy filtering is an additional layer on top of AI Summarization that detects and redacts GDPR Article 9 sensitive personal data (race, ethnicity, political opinions, religious beliefs, trade-union membership, biometric or genetic data, health data, data on sex life or sexual orientation).
UI Label | Field | Type | Default |
|---|---|---|---|
Enable Privacy Filtering | | boolean | |
Sanitization Pipeline Mode | | enum (see below) | |
Sensitive Content Flag Message | | textarea | Built-in compliance notice |
Privacy Filtering Rules | | textarea | Built-in GDPR Art. 9 ruleset |
Sanitization Pipeline Mode — five modes are exposed in the UI selector:
UI Option | Value | Behaviour |
|---|---|---|
Always Sanitize — summarize and redact every page unconditionally | | Every page goes through summarize-and-sanitize. Highest cost, strongest guarantee. |
Judge Only — judge first; if flagged, replace content and snippet with a compliance notice | | A lightweight LLM call decides if the page contains sensitive data. If flagged, the page content and snippet are replaced with the Sensitive Content Flag Message. If not flagged, the page passes through unchanged. |
Judge and Sanitize — single LLM call that judges and returns sanitized content when flagged | | One call does both: classify and (if flagged) return summarized-and-sanitized content. Cheaper than Judge then Sanitize. |
Judge then Sanitize — judge first; if flagged, run a full summarize-and-sanitize call | | Two-step: a cheap judge call gates a more expensive summarize-and-sanitize call. Best when most pages are not sensitive. |
Keyword Redact — extract sensitive phrases and apply local regex replacement (no summarization) | | LLM extracts verbatim sensitive phrases from the page; a local regex replaces those phrases with redaction markers. Page is not summarized. Useful when you want to keep page content as-is and only mask sensitive strings. |
Sensitive Content Flag Message — Replaces the content and snippet of pages flagged as sensitive when using Judge Only mode. Default reads: "THIS PAGE MAY CONTAIN SENSITIVE INFORMATION. ITS CONTENT HAS BEEN WITHHELD FOR COMPLIANCE REASONS. YOU CAN REFERENCE THE PAGE TO THE USER SO HE CAN EXPLORE THE CONTENT HIMSELF."
Privacy Filtering Rules — Rules given to the AI for what constitutes sensitive data. Default targets GDPR Article 9 categories.
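The local replacement step of Keyword Redact mode might look like the sketch below. The phrase list would come from the LLM extraction call; here it is hard-coded. The `[REDACTED]` marker and the function name `redact_phrases` are assumptions, since the actual redaction marker is not documented on this page.

```python
import re
from typing import Iterable


def redact_phrases(page: str, phrases: Iterable[str], marker: str = "[REDACTED]") -> str:
    """Replace each verbatim phrase with a marker, leaving the rest of the page intact."""
    # Longest phrases first so a short phrase never pre-empts a longer one containing it.
    unique = sorted({p for p in phrases if p}, key=len, reverse=True)
    if not unique:
        return page
    # re.escape ensures phrases are matched literally, not as regex syntax.
    pattern = re.compile("|".join(re.escape(p) for p in unique))
    return pattern.sub(marker, page)
```

Because only the listed phrases are touched, the surrounding page text stays verbatim, which is the point of this mode.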
Platform note: when the platform deploys with LLM_PROCESS_CONFIG set, all summarization and privacy-filter fields are visually disabled in the Spaces UI and locked to the platform's values. Space admins see the configuration as read-only. See Activation Reference under Platform / Infrastructure.
Advanced Prompts (processing_strategies.llm_processor.prompts)
Jinja2 prompt templates sent to the AI model. Edit only if you need to customise AI instructions — the defaults are well-tested.
UI Label | Field | Used by |
|---|---|---|
System Instructions | | Always Sanitize and the no-sanitize summarize path |
User Instructions | | Always Sanitize and the no-sanitize summarize path |
Judge System Instructions | | Judge Only and Judge then Sanitize |
Judge and Sanitize System Instructions | | Judge and Sanitize |
Page Context User Prompt | | Judge call user prompt and Keyword Redact mode |
Keyword Extraction System Instructions | | Keyword Redact mode |
All templates use Jinja2 syntax with variables like {{ page }}, {{ query }}, {{ sanitize }}, {{ sanitize_rules }}, and {{ output_schema }}.
Putting it together

The combinations most platforms use:
Profile | Cleaning | Summarization | Privacy Filter | Length Limit |
|---|---|---|---|---|
Default (fast) | All on | Off | Off | On at 10k tokens |
Privacy-conscious (open browse) | All on | On | On / Always Sanitize | On at 10k tokens |
Privacy-conscious (curated) | All on | On | On / Judge then Sanitize | On at 10k tokens |
Verbatim with masks | All on | On | On / Keyword Redact | On at 10k tokens |