Content Processing Configuration

This page documents every setting under Content Processing in the Spaces configuration UI. Each setting is shown by its UI label, with the underlying configuration field name in italics.

After a search retrieves URLs and a Web Page Reader fetches their content, the result passes through the content processing pipeline before it is handed to the AI assistant. The stages run in this order: cleaning, then summarization (if enabled), then truncation (if enabled), then chunking.

Each stage is independently configurable. Chunking always runs as the final step.
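The stage order can be sketched as a small function. This is an illustrative sketch only: the function name, the callables passed in, and the two flags are assumptions made for the example, not the platform's actual implementation.

```python
from typing import Callable, List


def run_pipeline(
    text: str,
    clean: Callable[[str], str],
    summarize: Callable[[str], str],
    truncate: Callable[[str], str],
    chunk: Callable[[str], List[str]],
    summarize_enabled: bool = False,
    truncate_enabled: bool = True,
) -> List[str]:
    """Apply the stages in the documented order (hypothetical helper)."""
    text = clean(text)              # cleaning always runs first
    if summarize_enabled:
        text = summarize(text)      # optional AI summarization
    if truncate_enabled:
        text = truncate(text)       # optional length limit, after summarization
    return chunk(text)              # chunking is always the final step
```

The default flag values mirror the documented defaults: summarization off, length limit on.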


Content Chunk Size (chunk_size)

UI Label | Field | Type | Default
Content Chunk Size | chunk_size | integer (characters) | 1000

Maximum size of each piece when long pages are split into chunks. Smaller values create more granular pieces (better for relevancy ranking but more pieces); larger values keep more context per piece.
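To make the effect of chunk_size concrete, here is a naive fixed-width splitter. It is a sketch under a stated assumption: the real chunker may split on sentence or paragraph boundaries, which this page does not specify. The function name is hypothetical.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 1000) -> List[str]:
    """Split text into pieces of at most chunk_size characters (naive sketch)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Step through the text in chunk_size strides; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

With the default of 1000, a 2,500-character page yields three pieces; halving chunk_size to 500 yields five.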


Content Cleaning (cleaning)

Automatic cleanup steps run on every page before any other processing.

Enable Character Sanitization (cleaning.enable_character_sanitize)

UI Label | Field | Type | Default
Enable Character Sanitization | cleaning.enable_character_sanitize | boolean | true

When enabled, strips null bytes, control characters, and other non-text binary content from page content. Recommended on. Mainly defends against pages that include null bytes or unusual control characters that downstream tooling cannot handle.
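A minimal sketch of this kind of sanitization, assuming that "control characters" means the Unicode Cc category and that newlines, tabs, and carriage returns should survive. The exact character set the platform strips is not documented here, so treat both choices as assumptions.

```python
import unicodedata

# Whitespace control characters that are kept (an assumption for this sketch).
KEEP = {"\n", "\t", "\r"}


def sanitize_characters(text: str) -> str:
    """Drop null bytes and other control characters, keeping common whitespace."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )
```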

Line Removal (cleaning.line_removal)

Removes irrelevant lines from page content using regex patterns — typically navigation menus, cookie banners, "skip to content" links, and footer boilerplate.

UI Label | Field | Type | Default
Enable Line Removal | cleaning.line_removal.enabled | boolean | true
Removal Patterns | cleaning.line_removal.patterns | list of regex strings | Built-in list (see below)

Built-in patterns target:

  • "Skip to" / "Skip Navigation" / "Jump to" / "Accessibility help"

  • Standalone "Sign In" / "Log In" / "Register" / "My Account" lines

  • "Subscribe" / "Follow Us" / "Share This" / "Newsletter Sign Up"

  • "Cookie Policy" / "Privacy Policy" / "Terms of Service" / "Cookie Settings" / "Cookie Notice" / "Accept Cookies"

  • Accessibility labels in […accessibility…] form

Each pattern is a regular expression matched against full lines.
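Full-line matching can be sketched with Python's re.fullmatch. The patterns below are illustrative stand-ins, not the built-in list, and the case-insensitive matching and whitespace stripping are assumptions made for the example.

```python
import re
from typing import List

# Hypothetical example patterns; the real built-in list is not reproduced here.
EXAMPLE_PATTERNS = [
    r"skip to .*",
    r"(sign in|log in|register|my account)",
    r"accept cookies",
]


def remove_lines(text: str, patterns: List[str]) -> str:
    """Drop every line that one of the patterns matches in full."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    kept = [
        line for line in text.splitlines()
        if not any(rx.fullmatch(line.strip()) for rx in compiled)
    ]
    return "\n".join(kept)
```

Because the match must cover the whole line, a standalone "Sign In" line is removed while a sentence that merely contains "sign in" is kept.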

Link and URL Cleanup (cleaning.enable_markdown_cleaning)

UI Label | Field | Type | Default
Enable Link and URL Cleanup | cleaning.enable_markdown_cleaning | boolean | true

When enabled, applies a fixed set of Markdown transformations:

  • Strips the URL out of [text](url) Markdown links, keeping only the link text

  • Removes standalone URLs from the text

  • Normalizes excess whitespace and blank lines

This makes the content more compact and avoids the AI quoting noisy URLs.
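The three transformations can be approximated with a few regular expressions. This is a sketch, not the platform's implementation: the exact regexes, and the choice to collapse runs of blank lines down to a single blank line, are assumptions.

```python
import re


def clean_markdown(text: str) -> str:
    """Approximate the link and URL cleanup with regex substitutions."""
    # [text](url) -> text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Remove standalone http(s) URLs.
    text = re.sub(r"https?://\S+", "", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces around line breaks.
    text = re.sub(r" *\n *", "\n", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```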


Additional Processing Strategies (processing_strategies)

These strategies run after cleaning. Their defaults differ: AI summarization is off until you enable it, while the length limit is on by default and can be raised or disabled.

Content Length Limit (processing_strategies.truncate)

Caps the size of any single page so very long pages don't dominate the context budget.

UI Label | Field | Type | Default
Enable Content Length Limit | processing_strategies.truncate.enabled | boolean | true
Maximum Content Length | processing_strategies.truncate.max_tokens | integer (tokens) | 10000

Pages that exceed Maximum Content Length are cut off at the limit. A token is roughly 4 characters, so the default of 10000 tokens corresponds to about 40,000 characters. Truncation runs after AI Summarization and only takes effect if the summarized output is still too long.
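The token arithmetic can be sketched using the 4-characters-per-token rule of thumb stated above. A real deployment would likely count tokens with the model's tokenizer; the character-based estimate and the function names here are assumptions for illustration.

```python
CHARS_PER_TOKEN = 4  # rule of thumb from this page, not an exact tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character count, rounding up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def truncate_to_tokens(text: str, max_tokens: int = 10000) -> str:
    """Cut text off at approximately max_tokens tokens."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return text if len(text) <= max_chars else text[:max_chars]
```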

AI Content Summarization (processing_strategies.llm_processor)

Uses an AI model to summarize and compress page content for each query. Useful for long pages where most of the content is irrelevant to the user's question.

Enable AI Summarization (processing_strategies.llm_processor.enabled)

UI Label | Field | Type | Default
Enable AI Summarization | processing_strategies.llm_processor.enabled | boolean | false

Summarization Language Model (processing_strategies.llm_processor.language_model)

The AI model used for summarization. Must support structured output.

Minimum Content Length for Summarization (processing_strategies.llm_processor.min_tokens)

UI Label | Field | Type | Default
Minimum Content Length for Summarization | processing_strategies.llm_processor.min_tokens | integer (tokens) | 5000

Pages shorter than this threshold are kept as-is; only longer pages are sent to the AI for summarization. The threshold is ignored when Privacy Filtering is enabled, because privacy filtering always runs regardless of length.
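The threshold rule reduces to a small decision function. This sketch assumes summarization is already enabled, estimates tokens at roughly 4 characters each, and treats a page exactly at the threshold as long enough to summarize; the function name and that boundary choice are assumptions.

```python
def should_send_to_llm(
    text: str,
    min_tokens: int = 5000,
    privacy_filter_enabled: bool = False,
) -> bool:
    """Decide whether a page goes to the AI model (hypothetical helper)."""
    if privacy_filter_enabled:
        # Privacy filtering always runs, so the length threshold is ignored.
        return True
    estimated_tokens = (len(text) + 3) // 4  # about 4 characters per token
    return estimated_tokens >= min_tokens
```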

Privacy Filtering (processing_strategies.llm_processor.privacy_filter)

Privacy filtering is an additional layer on top of AI Summarization that detects and redacts GDPR Article 9 sensitive personal data (race, ethnicity, political opinions, religious beliefs, trade-union membership, biometric or genetic data, health data, data on sex life or sexual orientation).

UI Label | Field | Type | Default
Enable Privacy Filtering | privacy_filter.sanitize | boolean | false
Sanitization Pipeline Mode | privacy_filter.sanitize_mode | enum (see below) | Always Sanitize
Sensitive Content Flag Message | privacy_filter.flag_message | textarea | Built-in compliance notice
Privacy Filtering Rules | privacy_filter.sanitize_rules | textarea | Built-in GDPR Art. 9 ruleset

Sanitization Pipeline Mode — five modes are exposed in the UI selector:

UI Option | Value | Behaviour
Always Sanitize — summarize and redact every page unconditionally | always_sanitize | Every page goes through summarize-and-sanitize. Highest cost, strongest guarantee.
Judge Only — judge first; if flagged, replace content and snippet with a compliance notice | judge_only | A lightweight LLM call decides if the page contains sensitive data. If flagged, the page content and snippet are replaced with the Sensitive Content Flag Message. If not flagged, the page passes through unchanged.
Judge and Sanitize — single LLM call that judges and returns sanitized content when flagged | judge_and_sanitize | One call does both: classify and (if flagged) return summarized-and-sanitized content. Cheaper than Judge then Sanitize.
Judge then Sanitize — judge first; if flagged, run a full summarize-and-sanitize call | judge_then_sanitize | Two-step: a cheap judge call gates a more expensive summarize-and-sanitize call. Best when most pages are not sensitive.
Keyword Redact — extract sensitive phrases and apply local regex replacement (no summarization) | keyword_redact | LLM extracts verbatim sensitive phrases from the page; a local regex replaces those phrases with redaction markers. Page is not summarized. Useful when you want to keep page content as-is and only mask sensitive strings.

  • Sensitive Content Flag Message — Replaces the content and snippet of pages flagged as sensitive when using Judge Only mode. Default reads: "THIS PAGE MAY CONTAIN SENSITIVE INFORMATION. ITS CONTENT HAS BEEN WITHHELD FOR COMPLIANCE REASONS. YOU CAN REFERENCE THE PAGE TO THE USER SO HE CAN EXPLORE THE CONTENT HIMSELF."

  • Privacy Filtering Rules — Rules given to the AI for what constitutes sensitive data. Default targets GDPR Article 9 categories.
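The local replacement step of Keyword Redact mode can be sketched as follows. The phrase list stands in for what the LLM would extract, and the "[REDACTED]" marker is a hypothetical placeholder; this page does not specify the marker format the platform actually uses.

```python
import re
from typing import Iterable


def redact_phrases(text: str, phrases: Iterable[str], marker: str = "[REDACTED]") -> str:
    """Replace each verbatim phrase with a redaction marker."""
    # Longest phrases first, so a phrase is not partially masked by a shorter one.
    unique = sorted({p for p in phrases if p}, key=len, reverse=True)
    if not unique:
        return text
    # re.escape keeps punctuation in the phrases from acting as regex syntax.
    pattern = re.compile("|".join(re.escape(p) for p in unique))
    return pattern.sub(marker, text)
```

The rest of the page is left untouched, which matches the mode's purpose of masking sensitive strings without summarizing.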

Platform note: when the platform deploys with LLM_PROCESS_CONFIG set, all summarization and privacy-filter fields are visually disabled in the Spaces UI and locked to the platform's values. Space admins see the configuration as read-only. See Activation Reference under Platform / Infrastructure.

Advanced Prompts (processing_strategies.llm_processor.prompts)

Jinja2 prompt templates sent to the AI model. Edit only if you need to customise AI instructions — the defaults are well-tested.

UI Label | Field | Used by
System Instructions | prompts.system_prompt | Always Sanitize and the no-sanitize summarize path
User Instructions | prompts.user_prompt | Always Sanitize and the no-sanitize summarize path
Judge System Instructions | prompts.judge_prompt | Judge Only and Judge then Sanitize
Judge and Sanitize System Instructions | prompts.judge_and_sanitize_prompt | Judge and Sanitize
Page Context User Prompt | prompts.page_context_prompt | Judge call user prompt and Keyword Redact mode
Keyword Extraction System Instructions | prompts.keyword_extract_prompt | Keyword Redact mode

All templates use Jinja2 syntax with variables like {{ page }}, {{ query }}, {{ sanitize }}, {{ sanitize_rules }}, and {{ output_schema }}.


Putting it together

The combinations most platforms use:

Profile | Cleaning | Summarization | Privacy Filter | Length Limit
Default (fast) | All on | Off | Off | On at 10k tokens
Privacy-conscious (open browse) | All on | On | On / Always Sanitize | On at 10k tokens
Privacy-conscious (curated) | All on | On | On / Judge then Sanitize | On at 10k tokens
Verbatim with masks | All on | On | On / Keyword Redact | On at 10k tokens
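As an illustration, the "Privacy-conscious (curated)" profile could be written out as a nested structure keyed by the field names on this page. The nested-dictionary layout and the get_setting helper are assumptions made for this sketch; the page documents only the dotted field names, not a storage format.

```python
from typing import Any, Dict

# Hypothetical representation of the "Privacy-conscious (curated)" profile.
PRIVACY_CURATED_PROFILE: Dict[str, Any] = {
    "chunk_size": 1000,
    "cleaning": {
        "enable_character_sanitize": True,
        "line_removal": {"enabled": True},
        "enable_markdown_cleaning": True,
    },
    "processing_strategies": {
        "truncate": {"enabled": True, "max_tokens": 10000},
        "llm_processor": {
            "enabled": True,
            "min_tokens": 5000,
            "privacy_filter": {
                "sanitize": True,
                "sanitize_mode": "judge_then_sanitize",
            },
        },
    },
}


def get_setting(config: Dict[str, Any], dotted_path: str) -> Any:
    """Look up a value by its dotted field name, e.g. 'cleaning.line_removal.enabled'."""
    node: Any = config
    for key in dotted_path.split("."):
        node = node[key]
    return node
```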
