Agentic Image Content Extraction Configuration


Agentic Image Content Extraction enriches PDF ingestion by extracting text from images, charts, diagrams, and figures found inside PDF pages.

How It Fits Into Ingestion

When Image Content Extraction is enabled:

  1. The PDF is processed with Microsoft Document Intelligence.

  2. Figures are detected on each page.

  3. The page or figure image is sent internally to agentic-ingestion.

  4. A vision-capable language model extracts the meaningful content.

  5. The extracted text is inserted back into the ingested page content.

  6. The normal chunking and embedding flow continues.

Image Content Extraction is therefore an enrichment step inside PDF ingestion, not a separate ingestion mode.

Configuration Fields

| Field | Required | Description |
| --- | --- | --- |
| pdfReadMode | Yes | Must be DOC_INTELLIGENCE_DEFAULT for Image Content Extraction in PDF ingestion. |
| pdfConfig.imageContentExtraction.enabled | Yes | Enables or disables Image Content Extraction for PDF ingestion. |
| pdfConfig.imageContentExtraction.languageModel | Yes, when enabled | Vision-capable model used to extract image content. |
| pdfConfig.imageContentExtraction.settings | No | Advanced configuration for image rendering, fallback model, and extraction strategy. |
| pdfConfig.imageContentExtraction.settings.imageProcessingConfig.dpiValue | No | DPI used when rendering PDF pages as images. Default is 150. |
| pdfConfig.imageContentExtraction.settings.imageProcessingConfig.compressionQuality | No | Image compression quality. Default is 50. |
| pdfConfig.imageContentExtraction.settings.languageModelFallbackConfig | No | Optional per-config fallback model. If omitted, the service default is used. |
| pdfConfig.imageContentExtraction.settings.strategy | No | Extraction strategy. One of ONE_STEP (default, single vision call) or TWO_STEP (classify → extract per category). The service default is set by the IMAGE_CONTENT_EXTRACTION_STRATEGY env var on agentic-ingestion. |
| pdfConfig.imageContentExtraction.settings.strategyConfig | No | Strategy-specific overrides, including the system and user prompts used for image extraction. The schema depends on the strategy (see "Customizing the Extractor Prompts" below). |

Where Can This Be Configured?

Image Content Extraction is configured through the platform ingestionConfig.

| Place | How it is configured | Scope |
| --- | --- | --- |
| Admin / Knowledge Upload UI | The ingestion configuration form writes pdfConfig.imageContentExtraction into the folder ingestion config. | Folder / scope default |
| GraphQL setScopeProperties | Set properties.ingestionConfig on a scope. | Folder / scope default, optionally subfolders |
| GraphQL contentUpsert / contentUpsertByChat | Pass input.ingestionConfig when uploading or upserting content. | Single uploaded content item |
| SDK upload helpers | Pass the same object as ingestion_config / ingestionConfig, depending on the SDK language helper. | Single uploaded content item |

Folder-level configuration is used as the default for content uploaded into that folder. A per-upload ingestionConfig can override it for one content item.

What Does `ingestionConfig` Look Like?

`pdfReadMode` and `pdfConfig` are sibling fields inside the same `ingestionConfig` object.

Example:

json
{
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "imageProcessingConfig": {
          "dpiValue": 150,
          "compressionQuality": 50
        },
        "languageModelFallbackConfig": "AZURE_GPT_4o_2024_0513"
      }
    }
  }
}

Only add settings when there is a specific reason, such as image quality tuning or an engineering-approved fallback model override.
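To get a feel for what dpiValue tuning costs, the size of a rendered page can be estimated as page size in inches multiplied by DPI. The sketch below is plain arithmetic for illustration; the helper name rendered_pixels is hypothetical, and the renderer used by the service may round or scale differently.

```python
def rendered_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Estimate the pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter page (8.5 x 11 inches) at the default 150 DPI and at 300 DPI.
default_size = rendered_pixels(8.5, 11.0, 150)   # (1275, 1650)
doubled_size = rendered_pixels(8.5, 11.0, 300)   # (2550, 3300)

# Doubling the DPI quadruples the number of pixels in the rendered image.
pixel_ratio = (doubled_size[0] * doubled_size[1]) / (default_size[0] * default_size[1])
```

Larger images generally mean more data to transfer and process, which is why a dpiValue change should be weighed against latency and token cost.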

For most tenants, the following minimal configuration is enough:

json
{
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "<vision-capable-model>"
    }
  }
}

Only add settings if there is a specific reason, such as image quality tuning or engineering-guided prompt changes.
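When the configuration is assembled in code before being passed to an SDK upload helper or a GraphQL call, a small builder can keep the nesting correct. This is a minimal sketch; build_ingestion_config is a hypothetical helper, not part of any SDK, and it only mirrors the field layout shown in the examples above.

```python
import json
from typing import Any, Optional


def build_ingestion_config(language_model: str,
                           settings: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build an ingestionConfig dict with Image Content Extraction enabled."""
    extraction: dict[str, Any] = {"enabled": True, "languageModel": language_model}
    if settings:  # only include settings when there is something to override
        extraction["settings"] = settings
    return {
        # pdfReadMode and pdfConfig are siblings inside ingestionConfig.
        "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
        "pdfConfig": {"imageContentExtraction": extraction},
    }


minimal = build_ingestion_config("<vision-capable-model>")
tuned = build_ingestion_config(
    "<vision-capable-model>",
    settings={"imageProcessingConfig": {"dpiValue": 150, "compressionQuality": 50}},
)
print(json.dumps(minimal, indent=2))
```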

Customizing the Extractor Prompts

The vision LLM that extracts image content runs with hardcoded default prompts that live in agentic-ingestion. Both the system and user prompts can be overridden per scope, per assistant, or per upload through strategyConfig — without redeploying the service.

The available prompt fields and their resolution order depend on settings.strategy.

ONE_STEP (default strategy)

ONE_STEP has two configurable prompt fields, both under pdfConfig.imageContentExtraction.settings.strategyConfig:

| Field | Type | Resolution order (first non-empty wins) |
| --- | --- | --- |
| systemPrompt | string | 1. request strategyConfig.systemPrompt → 2. env var IMAGE_CONTENT_EXTRACTION_ONE_STEP_SYSTEM_PROMPT on agentic-ingestion → 3. hardcoded default get_one_step_system_prompt() in prompts.py |
| userPrompt | string | 1. request strategyConfig.userPrompt → 2. env var IMAGE_CONTENT_EXTRACTION_ONE_STEP_USER_PROMPT on agentic-ingestion → 3. hardcoded default get_one_step_user_prompt() in prompts.py |

Each prompt is resolved independently (you can override the system prompt without overriding the user prompt). Empty strings are treated as unset and fall through to the next level.
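The resolution order above can be sketched as a "first non-empty wins" lookup. This is an illustration of the described behavior, not the service's code; resolve_prompt, the stand-in environment, and the default strings are all hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_prompt(request_value: Optional[str],
                   env_var_name: str,
                   hardcoded_default: str,
                   env: Mapping[str, str] = os.environ) -> str:
    """Return the first non-empty value: request, then env var, then default."""
    for candidate in (request_value, env.get(env_var_name)):
        if candidate:  # None and "" are both treated as unset
            return candidate
    return hardcoded_default


# Stand-in environment and default texts, for illustration only.
fake_env = {"IMAGE_CONTENT_EXTRACTION_ONE_STEP_USER_PROMPT": "env user prompt"}

# Empty request value and no env var: falls through to the hardcoded default.
system_prompt = resolve_prompt("", "IMAGE_CONTENT_EXTRACTION_ONE_STEP_SYSTEM_PROMPT",
                               "default system prompt", env=fake_env)
# No request value, env var present: the env var wins over the default.
user_prompt = resolve_prompt(None, "IMAGE_CONTENT_EXTRACTION_ONE_STEP_USER_PROMPT",
                             "default user prompt", env=fake_env)
```

Because each call is independent, the system prompt and user prompt can end up coming from different levels, as in this example.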

{language} placeholder. The placeholder is substituted only when the prompt comes from an env var or from the hardcoded default. When you supply systemPrompt or userPrompt via strategyConfig, the string is sent to the LLM verbatim, so write the target language directly into the text or omit the language instruction.
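The substitution rule can be illustrated as follows. This is a sketch under assumptions: the source tags ("request", "env", "default") and the use of a plain string replacement are hypothetical, chosen only to show which prompts are touched.

```python
def render_prompt(text: str, source: str, language: str) -> str:
    """Substitute {language} unless the prompt came from the request."""
    if source == "request":
        return text  # strategyConfig prompts are sent verbatim
    return text.replace("{language}", language)


from_env = render_prompt("Answer in {language}.", "env", "German")
from_request = render_prompt("Answer in {language}.", "request", "German")
```

In this sketch from_env becomes "Answer in German.", while from_request keeps the literal {language} text, which is why request-level prompts should name the language explicitly.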

Structured output. The service binds the LLM response to a { "reasoning": "...", "image_content": "..." } JSON schema (Pydantic structured output). A custom system prompt should describe this output shape so the model complies; if it does not, the primary model call fails and the request falls back to the fallback model.
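To check a custom prompt before rolling it out, it can help to verify that a sample model reply matches the expected shape. The service itself uses Pydantic structured output; the standard-library sketch below only illustrates the same two-field contract, and parse_extraction_response is a hypothetical helper.

```python
import json


def parse_extraction_response(raw: str) -> dict[str, str]:
    """Check that a model reply matches {"reasoning": str, "image_content": str}."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    for key in ("reasoning", "image_content"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    return {"reasoning": data["reasoning"], "image_content": data["image_content"]}


ok = parse_extraction_response('{"reasoning": "bar chart", "image_content": "Q1: 10, Q2: 12"}')

# A reply that ignores the contract is rejected.
try:
    parse_extraction_response('{"text": "Q1: 10"}')
    rejected = False
except ValueError:
    rejected = True
```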

Auto-injected context. When a figure caption is available, the service prepends a caption hint to the system prompt; when a full-page image is sent as context, it prepends a page-context notice. These additions cannot be disabled per request.

Example:

json
{
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "strategy": "ONE_STEP",
        "strategyConfig": {
          "systemPrompt": "You are a legal-document image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
          "userPrompt": "Extract every visible clause, signature, and stamp from this image. Respond in English."
        }
      }
    }
  }
}

TWO_STEP (classify, then extract)

When strategy is "TWO_STEP", each cropped figure is first classified into one of the keys of extractorCategoryToSystemPrompts or one of the categories listed in noExtractionForCategories. For a figure in an extractor category, the category-specific extractor prompt is then run.
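The classify-then-extract decision can be sketched like this. It is an illustration only: select_extractor_prompt is a hypothetical name, the prompt texts are placeholders rather than the service defaults, and skipping figures in the no-extraction categories is inferred from the field name.

```python
from typing import Mapping, Optional, Sequence


def select_extractor_prompt(category: str,
                            category_to_prompt: Mapping[str, str],
                            no_extraction_for: Sequence[str]) -> Optional[str]:
    """Return the extractor prompt for a classified figure, or None to skip it."""
    if category in no_extraction_for:
        return None  # e.g. icons or illustrative pictures
    return category_to_prompt[category]


# Placeholder prompts; the real defaults live in agentic-ingestion.
example_prompts = {
    "table_structure": "Transcribe the table as Markdown.",
    "default": "Describe the image content.",
}
skip_categories = ["illustrative_picture", "icon", "humans", "content_filter_exception"]

table_prompt = select_extractor_prompt("table_structure", example_prompts, skip_categories)
icon_prompt = select_extractor_prompt("icon", example_prompts, skip_categories)
```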

All prompt fields live at pdfConfig.imageContentExtraction.settings.strategyConfig. There is no env-var override layer for TWO_STEP — the resolution chain is request → hardcoded default only.

| Field | Type | Default |
| --- | --- | --- |
| classifierSystemPrompt | string | prompts.CLASSIFIER_SYSTEM |
| classifierUserPrompt | string | prompts.CLASSIFIER_USER |
| extractorCategoryToSystemPrompts | object (string → string) | 7 hardcoded entries: chart_with_numerical_values, chart_without_numerical_values, table_structure, mixed_content, logo, text_or_numbers, default |
| extractorCategoryToUserPrompts | object (string → string) | 7 hardcoded entries (same keys as above) |
| documentReferencePrompt | string | "Here is the whole document page as a reference:" |
| noExtractionForCategories | array of string | ["illustrative_picture", "icon", "humans", "content_filter_exception"] |
| imagesInParallel | integer | 5 |

Note:

  • No {language} substitution. The TWO_STEP path does not perform any language substitution. Bake the language into your strings.
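As an illustration of a TWO_STEP configuration, the sketch below builds the object in Python and serializes it to JSON. The prompt text and the model placeholder are examples only. The source does not say whether a partial extractorCategoryToSystemPrompts map is merged with the hardcoded entries or replaces them, so verify that behavior before relying on a single-key override.

```python
import json

two_step_config = {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {
        "imageContentExtraction": {
            "enabled": True,
            "languageModel": "<vision-capable-model>",
            "settings": {
                "strategy": "TWO_STEP",
                "strategyConfig": {
                    # The language is written into the prompt itself,
                    # because TWO_STEP performs no placeholder substitution.
                    "extractorCategoryToSystemPrompts": {
                        "table_structure": "Transcribe the table as Markdown. Respond in English.",
                    },
                    "noExtractionForCategories": [
                        "illustrative_picture", "icon", "humans", "content_filter_exception",
                    ],
                    "imagesInParallel": 5,
                },
            },
        }
    },
}

payload = json.dumps(two_step_config, indent=2)
```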

Where overrides can be set

The same strategyConfig shape is accepted at every layer; only the JSON path used to reach it differs.

| Layer | Path to strategyConfig |
| --- | --- |
| GraphQL setScopeProperties (folder / scope default) | properties.ingestionConfig.pdfConfig.imageContentExtraction.settings.strategyConfig |
| GraphQL contentUpsert / contentUpsertByChat (per-upload) | input.ingestionConfig.pdfConfig.imageContentExtraction.settings.strategyConfig |
| SDK upload helpers (ingestion_config / ingestionConfig) | same path under the helper's ingestion-config argument |
| Assistant settings.ingestionConfig (chat uploads) | same path |
| Knowledge-Upload UI Configuration textarea | strategyConfig directly at the root of the JSON typed into the textarea (the textarea's content is parsed into settings, so the leading pdfConfig.imageContentExtraction.settings. is supplied by the surrounding form) |
| Direct agentic-ingestion HTTP API | strategy and strategyConfig at the top level of the POST body; the worker adapter spreads settings to the top level on the wire |
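The path differences between layers can be made concrete by wrapping one strategyConfig object in each layer's nesting. This is a sketch: nest is a hypothetical helper, the fragments show only the path to strategyConfig, and a real request also needs the other fields (enabled, languageModel, and whatever else the endpoint requires).

```python
from typing import Any


def nest(path: str, value: Any) -> dict[str, Any]:
    """Build a nested dict from a dotted path, e.g. "a.b" -> {"a": {"b": value}}."""
    result: Any = value
    for key in reversed(path.split(".")):
        result = {key: result}
    return result


strategy_config = {"systemPrompt": "<custom system prompt>"}
settings_path = "pdfConfig.imageContentExtraction.settings.strategyConfig"

# GraphQL setScopeProperties and contentUpsert use the full path.
scope_default = nest("properties.ingestionConfig." + settings_path, strategy_config)
per_upload = nest("input.ingestionConfig." + settings_path, strategy_config)

# The Knowledge-Upload UI textarea takes strategyConfig at the JSON root.
ui_textarea = {"strategyConfig": strategy_config}

# Direct HTTP API: strategy and strategyConfig sit at the top level of the body
# (other body fields are not shown here).
http_body_fragment = {"strategy": "ONE_STEP", "strategyConfig": strategy_config}
```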

Common Mistakes

  • Putting pdfConfig outside of ingestionConfig.

  • Putting imageContentExtraction directly at the root instead of under pdfConfig.

  • Enabling Image Content Extraction without selecting a vision-capable language model.

  • Selecting a model that does not support image input.

  • Changing dpiValue without considering image quality, latency, and token cost.

  • Overriding prompts via strategyConfig without keeping the structured output contract. The service binds responses to { "reasoning": "string", "image_content": "string" }; a custom system prompt should describe this shape so the model complies. If it does not, the primary call fails and the request falls back to the fallback model.

  • Mixing ONE_STEP and TWO_STEP keys in the same strategyConfig. Keys for the strategy you are not running are silently ignored.
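Two of the mistakes above are structural and can be caught before upload. The following is a minimal sketch of such a check; find_config_mistakes is a hypothetical helper, and the key sets simply restate the ONE_STEP and TWO_STEP field names documented on this page.

```python
from typing import Any

ONE_STEP_KEYS = {"systemPrompt", "userPrompt"}
TWO_STEP_KEYS = {
    "classifierSystemPrompt", "classifierUserPrompt",
    "extractorCategoryToSystemPrompts", "extractorCategoryToUserPrompts",
    "documentReferencePrompt", "noExtractionForCategories", "imagesInParallel",
}


def find_config_mistakes(ingestion_config: dict[str, Any]) -> list[str]:
    """Return warnings for misplaced or mixed keys in an ingestionConfig."""
    warnings: list[str] = []
    if "imageContentExtraction" in ingestion_config:
        warnings.append("imageContentExtraction is at the root; move it under pdfConfig")
    extraction = ingestion_config.get("pdfConfig", {}).get("imageContentExtraction", {})
    strategy_keys = set(extraction.get("settings", {}).get("strategyConfig", {}))
    if strategy_keys & ONE_STEP_KEYS and strategy_keys & TWO_STEP_KEYS:
        warnings.append("strategyConfig mixes ONE_STEP and TWO_STEP keys")
    return warnings


bad_config = {
    "imageContentExtraction": {"enabled": True},  # misplaced at the root
    "pdfConfig": {
        "imageContentExtraction": {
            "enabled": True,
            "languageModel": "<vision-capable-model>",
            "settings": {"strategyConfig": {"systemPrompt": "a", "classifierSystemPrompt": "b"}},
        }
    },
}
mistakes = find_config_mistakes(bad_config)
```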

Infra Dependencies

Image Content Extraction also depends on platform and deployment configuration:

  • The Image Content Extraction feature flag must be enabled.

  • agentic-ingestion must be deployed and reachable by node-ingestion-worker.

  • Vision-capable models must be configured and available.

  • Redis/job queue configuration for agentic-ingestion must be healthy.

These are infra/operator concerns and should be documented on the Infra Admin page.

Troubleshooting Checklist

If Image Content Extraction does not run:

  1. Check that pdfReadMode is DOC_INTELLIGENCE_DEFAULT.

  2. Check that pdfConfig.imageContentExtraction.enabled is true.

  3. Check that languageModel is configured.

  4. Check that the selected model supports vision.

  5. Check that the feature flag is enabled.

  6. Check that node-ingestion-worker can reach agentic-ingestion.

  7. Check agentic-ingestion logs for image extraction errors.
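The first three checklist items concern the configuration itself and can be automated. This is a minimal sketch, assuming the config is available as a Python dict; failed_config_checks is a hypothetical helper, and items 4 to 7 still need to be verified against the deployment.

```python
from typing import Any


def failed_config_checks(ingestion_config: dict[str, Any]) -> list[int]:
    """Return the numbers of checklist items 1-3 that the config fails."""
    failed: list[int] = []
    extraction = ingestion_config.get("pdfConfig", {}).get("imageContentExtraction", {})
    if ingestion_config.get("pdfReadMode") != "DOC_INTELLIGENCE_DEFAULT":
        failed.append(1)
    if extraction.get("enabled") is not True:
        failed.append(2)
    if not extraction.get("languageModel"):
        failed.append(3)
    return failed


# Placeholder read mode and no languageModel: items 1 and 3 fail.
result = failed_config_checks({
    "pdfReadMode": "<other-read-mode>",
    "pdfConfig": {"imageContentExtraction": {"enabled": True}},
})
```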
