Content Processing Configuration


This page documents every setting under Content Processing in the Spaces configuration UI. Each setting is shown by its UI label, with the underlying configuration field name in parentheses.

After a search retrieves URLs and a Web Page Reader fetches their content, that content passes through the content processing pipeline before it is handed to the AI assistant. The pipeline runs in this order: cleaning, then summarization (if enabled), then truncation (if enabled), then chunking.

Each stage is independently configurable. Chunking always runs as the final step.
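The stage order can be sketched as a small function. This is a minimal illustration, not the product's implementation: `run_pipeline` and its parameters are hypothetical names, and each stage is passed in as a plain callable.

```python
from typing import Callable, List, Optional

Stage = Callable[[str], str]


def run_pipeline(
    page: str,
    clean: Stage,
    chunk: Callable[[str], List[str]],
    summarize: Optional[Stage] = None,  # None models "summarization disabled"
    truncate: Optional[Stage] = None,   # None models "length limit disabled"
) -> List[str]:
    """Apply the stages in the documented order (hypothetical helper)."""
    text = clean(page)              # cleaning always runs first
    if summarize is not None:
        text = summarize(text)      # summarization, if enabled
    if truncate is not None:
        text = truncate(text)       # truncation, if enabled
    return chunk(text)              # chunking is always the final step
```

Passing `None` for an optional stage skips it without changing the order of the remaining stages.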

Content Chunk Size (`chunk_size`)

UI Label	Field	Type	Default
Content Chunk Size	`chunk_size`	integer (characters)	`1000`

Maximum size of each piece when long pages are split into chunks. Smaller values produce more, finer-grained pieces, which helps relevancy ranking; larger values keep more context per piece.
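To make the effect of `chunk_size` concrete, here is a naive fixed-width splitter. It is only a sketch: the actual splitter may well respect sentence or paragraph boundaries, and `split_into_chunks` is a hypothetical name.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 1000) -> List[str]:
    """Split text into pieces of at most chunk_size characters (naive sketch)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Step through the text in fixed-width windows; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

With the default of `1000`, a 2,500-character page yields three chunks of 1,000, 1,000, and 500 characters.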

Content Cleaning (`cleaning`)

Automatic cleanup steps run on every page before any other processing.

Enable Character Sanitization (`cleaning.enable_character_sanitize`)

UI Label	Field	Type	Default
Enable Character Sanitization	`cleaning.enable_character_sanitize`	boolean	`true`

When enabled, strips null bytes, control characters, and other non-text binary content from page content. Recommended on. Mainly defends against pages that include null bytes or unusual control characters that downstream tooling cannot handle.
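The kind of cleanup described above might look like the following. The exact character set the product strips is not documented here, so the range below (ASCII control characters other than tab, newline, and carriage return) is an assumption, as is the name `sanitize_characters`.

```python
import re

# Assumed character set: ASCII control characters except \t, \n and \r.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize_characters(text: str) -> str:
    """Drop null bytes and other control characters, keeping normal whitespace."""
    return _CONTROL_CHARS.sub("", text)
```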

Line Removal (`cleaning.line_removal`)

Removes irrelevant lines from page content using regex patterns — typically navigation menus, cookie banners, "skip to content" links, and footer boilerplate.

UI Label	Field	Type	Default
Enable Line Removal	`cleaning.line_removal.enabled`	boolean	`true`
Removal Patterns	`cleaning.line_removal.patterns`	list of regex strings	Built-in list (see below)

Built-in patterns target:

"Skip to" / "Skip Navigation" / "Jump to" / "Accessibility help"
Standalone "Sign In" / "Log In" / "Register" / "My Account" lines
"Subscribe" / "Follow Us" / "Share This" / "Newsletter Sign Up"
"Cookie Policy" / "Privacy Policy" / "Terms of Service" / "Cookie Settings" / "Cookie Notice" / "Accept Cookies"
Accessibility labels in […accessibility…] form

Each pattern is a regular expression matched against full lines.
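Full-line matching can be illustrated with `re.fullmatch`. The patterns below are hypothetical stand-ins written for this example, not the built-in list, and case-insensitive matching is an assumption.

```python
import re
from typing import Iterable

# Hypothetical example patterns; the product's built-in list is not reproduced here.
EXAMPLE_PATTERNS = [
    r"\s*(skip to .*|skip navigation|jump to .*)\s*",
    r"\s*(sign in|log in|register|my account)\s*",
    r"\s*(cookie policy|privacy policy|terms of service|accept cookies)\s*",
]


def remove_lines(text: str, patterns: Iterable[str] = EXAMPLE_PATTERNS) -> str:
    """Drop every line that one of the patterns matches in full."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    kept = [
        line
        for line in text.splitlines()
        # fullmatch: the pattern must cover the whole line, not just part of it
        if not any(c.fullmatch(line) for c in compiled)
    ]
    return "\n".join(kept)
```

Because the whole line must match, a standalone "Sign In" line is removed while a sentence such as "Please sign in to continue reading." is kept.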

Enable Link and URL Cleanup (`cleaning.enable_markdown_cleaning`)

UI Label	Field	Type	Default
Enable Link and URL Cleanup	`cleaning.enable_markdown_cleaning`	boolean	`true`

When enabled, applies a fixed set of Markdown transformations:

Strips the URL out of [text](url) Markdown links, keeping only the link text
Removes standalone URLs from the text
Normalizes excess whitespace and blank lines

This makes the content more compact and avoids the AI quoting noisy URLs.
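The three transformations can be approximated with a few regular expressions. This is a rough sketch under simplifying assumptions (only `http`/`https` URLs, no link titles, no nested brackets); `clean_markdown` is a hypothetical name and the real rules may differ.

```python
import re

_MD_LINK = re.compile(r"\[([^\]]+)\]\([^)\s]+\)")  # [text](url) without a title
_BARE_URL = re.compile(r"https?://\S+")            # standalone http(s) URLs
_SPACES = re.compile(r"[ \t]{2,}")                 # runs of spaces or tabs
_BLANKS = re.compile(r"\n{3,}")                    # more than one blank line


def clean_markdown(text: str) -> str:
    """Keep link text, drop URLs, and normalize whitespace (simplified sketch)."""
    text = _MD_LINK.sub(r"\1", text)   # keep only the link text
    text = _BARE_URL.sub("", text)     # remove standalone URLs
    text = _SPACES.sub(" ", text)      # collapse repeated spaces
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = _BLANKS.sub("\n\n", text)   # at most one blank line in a row
    return text.strip()
```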

Additional Processing Strategies (`processing_strategies`)

These strategies run after cleaning. The two have different defaults: AI summarization is off until you toggle it on, while the length limit is on by default and can be raised or disabled.

Content Length Limit (`processing_strategies.truncate`)

Caps the size of any single page so very long pages don't dominate the context budget.

UI Label	Field	Type	Default
Enable Content Length Limit	`processing_strategies.truncate.enabled`	boolean	`true`
Maximum Content Length	`processing_strategies.truncate.max_tokens`	integer (tokens)	`10000`

Pages exceeding Maximum Content Length are cut off at the limit. A token is roughly 4 characters, so the default of 10000 tokens corresponds to about 40,000 characters. Truncation runs after AI Summarization, so it only applies if the summarized output is still too long.
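Using the 4-characters-per-token rule of thumb, the limit can be approximated in characters. This sketch assumes a character-based estimate; the product presumably counts tokens with a real tokenizer, and `truncate_to_tokens` is a hypothetical name.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb from this page, not an exact tokenizer


def truncate_to_tokens(text: str, max_tokens: int = 10000) -> str:
    """Cut text off at approximately max_tokens tokens (character-based estimate)."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    # Slicing leaves shorter text unchanged and cuts longer text at the limit.
    return text[:max_chars]
```

Under this estimate, a 50,000-character page is cut to 40,000 characters at the default setting.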

AI Content Summarization (`processing_strategies.llm_processor`)

Use an AI model to summarize and compress page content per query. Useful for long pages where most of the content is irrelevant to the user's question.

Enable AI Summarization (`processing_strategies.llm_processor.enabled`)

UI Label	Field	Type	Default
Enable AI Summarization	`processing_strategies.llm_processor.enabled`	boolean	`false`

Summarization Language Model (`processing_strategies.llm_processor.language_model`)

The AI model used for summarization. Must support structured output.

Minimum Content Length for Summarization (`processing_strategies.llm_processor.min_tokens`)

UI Label	Field	Type	Default
Minimum Content Length for Summarization	`processing_strategies.llm_processor.min_tokens`	integer (tokens)	`5000`

Pages shorter than this threshold are kept as-is; only longer pages are sent to the AI for summarization. The threshold is ignored when Privacy Filtering is enabled, because privacy filtering runs on every page regardless of length.
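The gating rule can be written as a small predicate. This sketch assumes AI Summarization is already enabled, estimates tokens at 4 characters each, and treats a page exactly at the threshold as long enough to process; `goes_to_llm` is a hypothetical name.

```python
def goes_to_llm(content: str, min_tokens: int = 5000, privacy_filter: bool = False) -> bool:
    """Decide whether a page is sent to the AI model (assumes summarization is on)."""
    if privacy_filter:
        # Privacy filtering runs on every page, so the length threshold is ignored.
        return True
    estimated_tokens = len(content) // 4  # rough estimate, not a real tokenizer
    # Assumption: a page exactly at the threshold counts as "not shorter".
    return estimated_tokens >= min_tokens
```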

Privacy Filtering (`processing_strategies.llm_processor.privacy_filter`)

Privacy filtering is an additional layer on top of AI Summarization that detects and redacts GDPR Article 9 sensitive personal data (race, ethnicity, political opinions, religious beliefs, trade-union membership, biometric or genetic data, health data, data on sex life or sexual orientation).

UI Label	Field	Type	Default
Enable Privacy Filtering	`privacy_filter.sanitize`	boolean	`false`
Sanitization Pipeline Mode	`privacy_filter.sanitize_mode`	enum (see below)	`Always Sanitize`
Sensitive Content Flag Message	`privacy_filter.flag_message`	textarea	Built-in compliance notice
Privacy Filtering Rules	`privacy_filter.sanitize_rules`	textarea	Built-in GDPR Art. 9 ruleset

Sanitization Pipeline Mode — five modes are exposed in the UI selector:

UI Option	Value	Behaviour
Always Sanitize — summarize and redact every page unconditionally	`always_sanitize`	Every page goes through summarize-and-sanitize. Highest cost, strongest guarantee.
Judge Only — judge first; if flagged, replace content and snippet with a compliance notice	`judge_only`	A lightweight LLM call decides if the page contains sensitive data. If flagged, the page content and snippet are replaced with the Sensitive Content Flag Message. If not flagged, the page passes through unchanged.
Judge and Sanitize — single LLM call that judges and returns sanitized content when flagged	`judge_and_sanitize`	One call does both: classify and (if flagged) return summarized-and-sanitized content. Cheaper than Judge then Sanitize.
Judge then Sanitize — judge first; if flagged, run a full summarize-and-sanitize call	`judge_then_sanitize`	Two-step: a cheap judge call gates a more expensive summarize-and-sanitize call. Best when most pages are not sensitive.
Keyword Redact — extract sensitive phrases and apply local regex replacement (no summarization)	`keyword_redact`	LLM extracts verbatim sensitive phrases from the page; a local regex replaces those phrases with redaction markers. Page is not summarized. Useful when you want to keep page content as-is and only mask sensitive strings.
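The local replacement step of Keyword Redact mode might look like the sketch below. The phrase list would come from the LLM extraction call, which is not shown; the `[REDACTED]` marker and the name `redact_phrases` are assumptions made for illustration.

```python
import re
from typing import List


def redact_phrases(page: str, phrases: List[str], marker: str = "[REDACTED]") -> str:
    """Replace each verbatim phrase with a redaction marker (illustrative sketch)."""
    phrases = [p for p in phrases if p]  # ignore empty strings
    if not phrases:
        return page
    # Longest phrases first, so a long phrase is masked whole rather than in part.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = "|".join(re.escape(p) for p in ordered)  # escape: match literally
    return re.sub(pattern, marker, page)
```

The rest of the page is left untouched, which matches the mode's goal of keeping content as-is and masking only the sensitive strings.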

Sensitive Content Flag Message — Replaces the content and snippet of pages flagged as sensitive when using Judge Only mode. Default reads: "THIS PAGE MAY CONTAIN SENSITIVE INFORMATION. ITS CONTENT HAS BEEN WITHHELD FOR COMPLIANCE REASONS. YOU CAN REFERENCE THE PAGE TO THE USER SO HE CAN EXPLORE THE CONTENT HIMSELF."
Privacy Filtering Rules — Rules given to the AI for what constitutes sensitive data. Default targets GDPR Article 9 categories.

Platform note: when the platform deploys with LLM_PROCESS_CONFIG set, all summarization and privacy-filter fields are visually disabled in the Spaces UI and locked to the platform's values. Space admins see the configuration as read-only. See Activation Reference under Platform / Infrastructure.

Advanced Prompts (`processing_strategies.llm_processor.prompts`)

Jinja2 prompt templates sent to the AI model. Edit only if you need to customise AI instructions — the defaults are well-tested.

UI Label	Field	Used by
System Instructions	`prompts.system_prompt`	Always Sanitize and the no-sanitize summarize path
User Instructions	`prompts.user_prompt`	Always Sanitize and the no-sanitize summarize path
Judge System Instructions	`prompts.judge_prompt`	Judge Only and Judge then Sanitize
Judge and Sanitize System Instructions	`prompts.judge_and_sanitize_prompt`	Judge and Sanitize
Page Context User Prompt	`prompts.page_context_prompt`	Judge call user prompt and Keyword Redact mode
Keyword Extraction System Instructions	`prompts.keyword_extract_prompt`	Keyword Redact mode

All templates use Jinja2 syntax with variables like {{ page }}, {{ query }}, {{ sanitize }}, {{ sanitize_rules }}, and {{ output_schema }}.

Putting it together

The combinations most platforms use:

Profile	Cleaning	Summarization	Privacy Filter	Length Limit
Default (fast)	All on	Off	Off	On at 10k tokens
Privacy-conscious (open browse)	All on	On	On / Always Sanitize	On at 10k tokens
Privacy-conscious (curated)	All on	On	On / Judge then Sanitize	On at 10k tokens
Verbatim with masks	All on	On	On / Keyword Redact	On at 10k tokens
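As an illustration, the "Privacy-conscious (curated)" row could be expressed with the field names from this page. How the configuration is actually stored or serialized is not documented here, so the nested dictionary layout and the `get_field` helper are assumptions.

```python
from functools import reduce
from typing import Any, Dict

# Assumed layout: nesting follows the dotted field names used on this page.
CURATED_PROFILE: Dict[str, Any] = {
    "chunk_size": 1000,
    "cleaning": {
        "enable_character_sanitize": True,
        "line_removal": {"enabled": True},
        "enable_markdown_cleaning": True,
    },
    "processing_strategies": {
        "truncate": {"enabled": True, "max_tokens": 10000},
        "llm_processor": {
            "enabled": True,
            "min_tokens": 5000,
            "privacy_filter": {
                "sanitize": True,
                "sanitize_mode": "judge_then_sanitize",
            },
        },
    },
}


def get_field(config: Dict[str, Any], dotted_path: str) -> Any:
    """Look up a value by its dotted field name (hypothetical helper)."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), config)
```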
