Agentic Image Content Extraction Configuration


Agentic Image Content Extraction enriches PDF ingestion by extracting text from images, charts, diagrams, and figures found inside PDF pages.

How It Fits Into Ingestion

When Image Content Extraction is enabled:

1. The PDF is processed with Microsoft Document Intelligence.
2. Figures are detected on each page.
3. The page or figure image is sent internally to `agentic-ingestion`.
4. A vision-capable language model extracts the meaningful content.
5. The extracted text is inserted back into the ingested page content.
6. The normal chunking and embedding flow continues.

Image Content Extraction is therefore an enrichment step inside PDF ingestion, not a separate ingestion mode.
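
The enrichment step can be pictured with a minimal sketch. This is not the service's code: the `Figure` type, the placeholder tokens, and the `extract` callable are all hypothetical stand-ins, and the sketch assumes each detected figure is represented in the page text by a marker that the extracted text replaces.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Figure:
    # Hypothetical: a token marking where the figure sits in the page text.
    placeholder: str
    image: bytes


def enrich_page(page_text: str, figures: List[Figure],
                extract: Callable[[bytes], str]) -> str:
    """Replace each figure placeholder with the text extracted from that figure."""
    for figure in figures:
        # `extract` stands in for the vision-model call made by agentic-ingestion.
        extracted = extract(figure.image)
        page_text = page_text.replace(figure.placeholder, extracted, 1)
    return page_text
```

The enriched page text then goes through chunking and embedding like any other page content.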

Configuration Fields

| Field | Required | Description |
| --- | --- | --- |
| `pdfReadMode` | Yes | Must be `DOC_INTELLIGENCE_DEFAULT` for Image Content Extraction in PDF ingestion. |
| `pdfConfig.imageContentExtraction.enabled` | Yes | Enables or disables Image Content Extraction for PDF ingestion. |
| `pdfConfig.imageContentExtraction.languageModel` | Yes, when enabled | Vision-capable model used to extract image content. |
| `pdfConfig.imageContentExtraction.settings` | No | Advanced configuration for image rendering, fallback model, and extraction strategy. |
| `pdfConfig.imageContentExtraction.settings.imageProcessingConfig.dpiValue` | No | DPI used when rendering PDF pages as images. Default is 150. |
| `pdfConfig.imageContentExtraction.settings.imageProcessingConfig.compressionQuality` | No | Image compression quality. Default is 50. |
| `pdfConfig.imageContentExtraction.settings.languageModelFallbackConfig` | No | Optional per-config fallback model. If omitted, the service default is used. |
| `pdfConfig.imageContentExtraction.settings.strategy` | No | Extraction strategy. One of `ONE_STEP` (default, single vision call) or `TWO_STEP` (classify → extract per category). The service default is set by the `IMAGE_CONTENT_EXTRACTION_STRATEGY` env var on `agentic-ingestion`. |
| `pdfConfig.imageContentExtraction.settings.strategyConfig` | No | Strategy-specific overrides, including the system and user prompts used for image extraction. The schema depends on `strategy` (see "Customizing the Extractor Prompts" below). |
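
The requirements in this table can be checked before a config is submitted. The following is a minimal client-side sketch, not part of the platform: the function name `validate_ingestion_config` and its messages are hypothetical, and it only covers the rules stated above. It cannot tell whether a model is actually vision-capable.

```python
from typing import Any, Dict, List


def validate_ingestion_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems: List[str] = []
    if cfg.get("pdfReadMode") != "DOC_INTELLIGENCE_DEFAULT":
        problems.append("pdfReadMode must be DOC_INTELLIGENCE_DEFAULT")
    if "imageContentExtraction" in cfg:
        problems.append("imageContentExtraction must be nested under pdfConfig")

    ice = (cfg.get("pdfConfig") or {}).get("imageContentExtraction")
    if not isinstance(ice, dict):
        problems.append("pdfConfig.imageContentExtraction is missing")
        return problems

    if not isinstance(ice.get("enabled"), bool):
        problems.append("enabled must be true or false")
    if ice.get("enabled") is True and not ice.get("languageModel"):
        problems.append("languageModel is required when enabled is true")

    # strategy is optional; when absent the service default applies.
    strategy = (ice.get("settings") or {}).get("strategy")
    if strategy is not None and strategy not in ("ONE_STEP", "TWO_STEP"):
        problems.append("settings.strategy must be ONE_STEP or TWO_STEP")
    return problems
```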

Where Can This Be Configured?

Image Content Extraction is configured through the platform ingestionConfig.

| Place | How it is configured | Scope |
| --- | --- | --- |
| Admin / Knowledge Upload UI | The ingestion configuration form writes `pdfConfig.imageContentExtraction` into the folder ingestion config. | Folder / scope default |
| GraphQL `setScopeProperties` | Set `properties.ingestionConfig` on a scope. | Folder / scope default, optionally subfolders |
| GraphQL `contentUpsert` / `contentUpsertByChat` | Pass `input.ingestionConfig` when uploading or upserting content. | Single uploaded content item |
| SDK upload helpers | Pass the same object as `ingestion_config` / `ingestionConfig`, depending on the SDK language helper. | Single uploaded content item |

Folder-level configuration is used as the default for content uploaded into that folder. A per-upload ingestionConfig can override it for one content item.

What Does `ingestionConfig` Look Like?

`pdfReadMode` and `pdfConfig` are sibling fields inside the same `ingestionConfig` object.

Example:

```json
{
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "imageProcessingConfig": {
          "dpiValue": 150,
          "compressionQuality": 50
        },
        "languageModelFallbackConfig": "AZURE_GPT_4o_2024_0513"
      }
    }
  }
}
```

Only add settings when there is a specific reason, such as image quality tuning or an engineering-approved fallback model override.

Recommended Defaults

For most tenants:

```json
{
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "<vision-capable-model>"
    }
  }
}
```

Only add settings if there is a specific reason, such as image quality tuning or engineering-guided prompt changes.
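
One way to follow this guidance in code is to build the config so that `settings` appears only when a value is explicitly supplied. The helper below is a hypothetical sketch (the name `build_ingestion_config` and its parameters are not part of any SDK); it only assembles the JSON shape shown above.

```python
from typing import Any, Dict, Optional


def build_ingestion_config(language_model: str,
                           dpi_value: Optional[int] = None,
                           compression_quality: Optional[int] = None,
                           fallback_model: Optional[str] = None) -> Dict[str, Any]:
    """Build an ingestionConfig dict, omitting `settings` unless something is overridden."""
    extraction: Dict[str, Any] = {"enabled": True, "languageModel": language_model}

    image_cfg: Dict[str, Any] = {}
    if dpi_value is not None:
        image_cfg["dpiValue"] = dpi_value
    if compression_quality is not None:
        image_cfg["compressionQuality"] = compression_quality

    settings: Dict[str, Any] = {}
    if image_cfg:
        settings["imageProcessingConfig"] = image_cfg
    if fallback_model is not None:
        settings["languageModelFallbackConfig"] = fallback_model
    if settings:
        extraction["settings"] = settings

    return {
        "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
        "pdfConfig": {"imageContentExtraction": extraction},
    }
```

Calling the helper with only a model name yields the recommended default shape; the resulting dict can be serialized with `json.dumps` or passed to an upload helper as its ingestion-config argument.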

Customizing the Extractor Prompts

The vision LLM that extracts image content runs with hardcoded default prompts that live in agentic-ingestion. Both the system and user prompts can be overridden per scope, per assistant, or per upload through strategyConfig — without redeploying the service.

The available prompt fields and their resolution order depend on settings.strategy.

`ONE_STEP` (default strategy)

Two configurable prompt fields, both at pdfConfig.imageContentExtraction.settings.strategyConfig:

| Field | Type | Resolution order (first non-empty wins) |
| --- | --- | --- |
| `systemPrompt` | string | 1. request `strategyConfig.systemPrompt` → 2. env var `IMAGE_CONTENT_EXTRACTION_ONE_STEP_SYSTEM_PROMPT` on `agentic-ingestion` → 3. hardcoded default `get_one_step_system_prompt()` in `prompts.py` |
| `userPrompt` | string | 1. request `strategyConfig.userPrompt` → 2. env var `IMAGE_CONTENT_EXTRACTION_ONE_STEP_USER_PROMPT` on `agentic-ingestion` → 3. hardcoded default `get_one_step_user_prompt()` in `prompts.py` |

Each prompt is resolved independently (you can override the system prompt without overriding the user prompt). Empty strings are treated as unset and fall through to the next level.

`{language}` placeholder. The placeholder is substituted only when the prompt comes from an env var or from the hardcoded default. When you supply `systemPrompt` or `userPrompt` via `strategyConfig`, the string is sent to the LLM verbatim, so either write the target language directly into the text or omit the language instruction.
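
The resolution order and the placeholder rule can be summarized in a short sketch. This is an illustration of the documented behavior, not the code in `agentic-ingestion`: `resolve_prompt` is a hypothetical name, and plain string replacement is assumed for the `{language}` substitution.

```python
from typing import Optional


def resolve_prompt(requested: Optional[str], env_value: Optional[str],
                   default: str, language: str) -> str:
    """Pick the first non-empty prompt: request, then env var, then hardcoded default."""
    if requested:
        # Request-level overrides are sent verbatim; {language} is NOT substituted.
        return requested
    # Empty strings are treated as unset and fall through to the next level.
    template = env_value or default
    # Substitution applies only to env-var and default prompts.
    return template.replace("{language}", language)
```

Each prompt would be resolved with its own call, which is why overriding `systemPrompt` leaves `userPrompt` on its env-var or default value.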

Structured output. The service binds the LLM response to a { "reasoning": "...", "image_content": "..." } JSON schema (Pydantic structured output). A custom system prompt should describe this output shape so the model complies; if it does not, the primary model call fails and the request falls back to the fallback model.
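
When testing a custom prompt, it can help to check a sample model response against this shape before rolling the prompt out. The sketch below uses only the standard library as a stand-in for the service's Pydantic binding; `parse_extraction_response` is a hypothetical helper, and the real validation may be stricter.

```python
import json
from typing import Dict


def parse_extraction_response(raw: str) -> Dict[str, str]:
    """Parse a model response and confirm it has the two required string fields."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    for key in ("reasoning", "image_content"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    return {"reasoning": data["reasoning"], "image_content": data["image_content"]}
```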

Auto-injected context. When a figure caption is available, the service prepends a caption hint to the system prompt; when a full-page image is sent as context, it prepends a page-context notice. These additions cannot be disabled per request.

Example:

```json
{
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "strategy": "ONE_STEP",
        "strategyConfig": {
          "systemPrompt": "You are a legal-document image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
          "userPrompt": "Extract every visible clause, signature, and stamp from this image. Respond in English."
        }
      }
    }
  }
}
```

`TWO_STEP` (classify, then extract)

When `strategy` is `TWO_STEP`, each cropped figure is first classified into one of the keys of `extractorCategoryToSystemPrompts` or one of the categories listed in `noExtractionForCategories`. A category-specific extractor prompt is then run.

All prompt fields live at pdfConfig.imageContentExtraction.settings.strategyConfig. There is no env-var override layer for TWO_STEP — the resolution chain is request → hardcoded default only.

| Field | Type | Default |
| --- | --- | --- |
| `classifierSystemPrompt` | string | `prompts.CLASSIFIER_SYSTEM` |
| `classifierUserPrompt` | string | `prompts.CLASSIFIER_USER` |
| `extractorCategoryToSystemPrompts` | object (string → string) | 7 hardcoded entries: `chart_with_numerical_values`, `chart_without_numerical_values`, `table_structure`, `mixed_content`, `logo`, `text_or_numbers`, `default` |
| `extractorCategoryToUserPrompts` | object (string → string) | 7 hardcoded entries (same keys as above) |
| `documentReferencePrompt` | string | `"Here is the whole document page as a reference:"` |
| `noExtractionForCategories` | array of string | `["illustrative_picture", "icon", "humans", "content_filter_exception"]` |
| `imagesInParallel` | integer | `5` |

Note:

No {language} substitution. The TWO_STEP path does not perform any language substitution. Bake the language into your strings.
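
The classify-then-extract flow can be sketched as follows. This is a simplified illustration under several assumptions that the text above does not confirm: figures in `noExtractionForCategories` are skipped, a category without its own prompt falls back to the `default` entry, and `imagesInParallel` caps the number of figures processed concurrently. The `classify` and `extract` callables are hypothetical stand-ins for the two model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Sequence

NO_EXTRACTION = ("illustrative_picture", "icon", "humans", "content_filter_exception")


def extract_two_step(images: Sequence[bytes],
                     classify: Callable[[bytes], str],
                     extract: Callable[[bytes, str], str],
                     system_prompts: Dict[str, str],
                     no_extraction: Sequence[str] = NO_EXTRACTION,
                     images_in_parallel: int = 5) -> List[Optional[str]]:
    """Classify each figure, then run the extractor prompt for its category."""

    def process(image: bytes) -> Optional[str]:
        category = classify(image)
        if category in no_extraction:
            return None  # assumption: no extraction is attempted for these categories
        # Assumption: unknown categories use the "default" prompt.
        prompt = system_prompts.get(category, system_prompts["default"])
        return extract(image, prompt)

    # pool.map keeps results in the same order as the input figures.
    with ThreadPoolExecutor(max_workers=images_in_parallel) as pool:
        return list(pool.map(process, images))
```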

Where overrides can be set

The same strategyConfig shape is accepted at every layer; only the JSON path used to reach it differs.

| Layer | Path to `strategyConfig` |
| --- | --- |
| GraphQL `setScopeProperties` (folder / scope default) | `properties.ingestionConfig.pdfConfig.imageContentExtraction.settings.strategyConfig` |
| GraphQL `contentUpsert` / `contentUpsertByChat` (per-upload) | `input.ingestionConfig.pdfConfig.imageContentExtraction.settings.strategyConfig` |
| SDK upload helpers (`ingestion_config` / `ingestionConfig`) | same path under the helper's ingestion-config argument |
| Assistant `settings.ingestionConfig` (chat uploads) | same path |
| Knowledge-Upload UI Configuration textarea | `strategyConfig` directly at the root of the JSON typed into the textarea (the textarea's content is parsed into `settings`, so the leading `pdfConfig.imageContentExtraction.settings.` is supplied by the surrounding form) |
| Direct `agentic-ingestion` HTTP API | `strategy` and `strategyConfig` at the top level of the POST body (the worker adapter spreads `settings` to the top level on the wire) |
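
To make the path differences concrete, the sketch below wraps one `strategyConfig` for several of the layers above. The function and the layer labels (`"setScopeProperties"`, `"contentUpsert"`, `"uiTextarea"`, `"directApi"`) are hypothetical names chosen for illustration, and the returned dicts are fragments only: other required fields such as `pdfReadMode`, `enabled`, and `languageModel` are omitted.

```python
from typing import Any, Dict


def wrap_strategy_config(layer: str, strategy: str,
                         strategy_config: Dict[str, Any]) -> Dict[str, Any]:
    """Nest strategy and strategyConfig at the path each layer expects."""
    settings = {"strategy": strategy, "strategyConfig": strategy_config}
    ingestion_config = {"pdfConfig": {"imageContentExtraction": {"settings": settings}}}

    if layer == "setScopeProperties":
        return {"properties": {"ingestionConfig": ingestion_config}}
    if layer == "contentUpsert":
        return {"input": {"ingestionConfig": ingestion_config}}
    if layer in ("uiTextarea", "directApi"):
        # Both take the settings keys at the root of the JSON they receive.
        return settings
    raise ValueError(f"unknown layer: {layer}")
```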

Common Mistakes

- Putting `pdfConfig` outside of `ingestionConfig`.
- Putting `imageContentExtraction` directly at the root instead of under `pdfConfig`.
- Enabling Image Content Extraction without selecting a vision-capable language model.
- Selecting a model that does not support image input.
- Changing `dpiValue` without considering image quality, latency, and token cost.
- Overriding prompts via `strategyConfig` without keeping the structured output contract. The service binds responses to `{ "reasoning": "string", "image_content": "string" }`; a custom system prompt should describe this shape so the model complies. If it does not, the primary call fails and the request falls back to the fallback model.
- Mixing `ONE_STEP` and `TWO_STEP` keys in the same `strategyConfig`. Keys for the strategy you are not running are silently ignored.
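
Because mixed keys are ignored without an error, a small pre-submit check can surface them. The sketch below is a hypothetical client-side helper; the key sets are taken from the `ONE_STEP` and `TWO_STEP` field tables in this page.

```python
from typing import Any, Dict, List

ONE_STEP_KEYS = {"systemPrompt", "userPrompt"}
TWO_STEP_KEYS = {
    "classifierSystemPrompt", "classifierUserPrompt",
    "extractorCategoryToSystemPrompts", "extractorCategoryToUserPrompts",
    "documentReferencePrompt", "noExtractionForCategories", "imagesInParallel",
}


def ignored_keys(strategy: str, strategy_config: Dict[str, Any]) -> List[str]:
    """List strategyConfig keys that belong to the strategy that is NOT running."""
    if strategy == "ONE_STEP":
        other = TWO_STEP_KEYS
    elif strategy == "TWO_STEP":
        other = ONE_STEP_KEYS
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(key for key in strategy_config if key in other)
```

A non-empty result means part of the config has no effect under the selected strategy.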

Infra Dependencies

Image Content Extraction also depends on platform and deployment configuration:

- The Image Content Extraction feature flag must be enabled.
- `agentic-ingestion` must be deployed and reachable by `node-ingestion-worker`.
- Vision-capable models must be configured and available.
- Redis/job queue configuration for `agentic-ingestion` must be healthy.

These are infra/operator concerns and should be documented in the Infra Admin page.

Troubleshooting Checklist

If Image Content Extraction does not run:

- Check that `pdfReadMode` is `DOC_INTELLIGENCE_DEFAULT`.
- Check that `pdfConfig.imageContentExtraction.enabled` is `true`.
- Check that `languageModel` is configured.
- Check that the selected model supports vision.
- Check that the feature flag is enabled.
- Check that `node-ingestion-worker` can reach `agentic-ingestion`.
- Check `agentic-ingestion` logs for image extraction errors.
