Agentic Image Content Extraction

This feature is currently in BETA. It may change before general availability in response to user and client feedback, but it is intended to be high quality and stable. Documentation may lag behind feature updates. Use in production environments at your own discretion. Please refer to our Upgrade and Release Process for more information.

Overview

Documents processed through Microsoft Document Intelligence (MDI) often contain embedded figures — charts, diagrams, infographics, and other visual elements — whose content is lost or poorly represented in the extracted text. Traditional OCR and layout analysis tools can detect that a figure exists, but cannot interpret what the figure actually shows.

Agentic Image Content Extraction solves this by using vision-capable AI models to understand and transcribe the content of individual images and figures found within PDF documents. This produces a richer, more complete representation of each document page — ensuring that information locked inside charts, graphs, and diagrams becomes searchable and available for RAG-based AI chat.

Who It's For

Knowledge base admins who configure document processing workflows and want higher-quality text extraction from image-heavy documents
Users who rely on AI chat to search and query documents containing charts, graphs, infographics, and diagrams
Organizations in industries (FSI, consulting, research) where critical data is often communicated through visual elements

Can this feature be enabled on non-Azure or self-hosted tenants?

Agentic Image Content Extraction requires Microsoft Document Intelligence (MDI) and access to vision LLMs via the platform API. MDI can be deployed on-prem.

Benefits

Captures Information Locked in Visuals

Charts and graphs: Bar charts, line graphs, pie charts — the AI model reads axes, labels, values, and trends
Diagrams and flowcharts: Extracts structured descriptions of process flows, architecture diagrams, and org charts
Infographics: Interprets complex visual layouts combining text, icons, and data
Tables rendered as images: When tables are embedded as figures rather than native table elements, the vision model can still extract the data

Seamless Integration

Works within the existing MDI pipeline — no need to switch to a different read mode
Extracted figure text is inserted at the correct position in the page markdown
Figure captions from MDI are preserved alongside the extracted content
Falls back gracefully to standard MDI output if figure extraction fails for any image

Example Use Cases

Financial Reports

Revenue charts: Extract quarterly revenue figures, growth percentages, and trend descriptions from bar/line charts
Portfolio allocations: Interpret pie charts showing asset allocation breakdowns
Performance dashboards: Capture KPI values and metrics from dashboard screenshots embedded in reports

Research and Analysis

Scientific figures: Extract data from experimental result charts, correlation plots, and distribution graphs
Market analysis: Interpret market share diagrams, competitive landscape visuals, and trend charts
Process diagrams: Transcribe workflow and process flow diagrams into searchable text

Regulatory and Compliance

Risk matrices: Extract risk ratings and categories from visual risk assessment matrices
Compliance dashboards: Capture compliance status from visual scorecards and dashboards
Organizational charts: Extract hierarchical information from org chart figures

Step-by-Step Guide

1. Enable Image Content Extraction

Image Content Extraction is configured as part of the PDF ingestion settings and can be enabled through two independent paths, depending on how documents are uploaded.

Path A — Knowledge Base uploads (folder / scope ingestion configuration)

This is the primary configuration path for documents uploaded to the Knowledge Base via the knowledge-upload app.

Open the Knowledge Base app and navigate to the target folder
Open the Ingestion Configuration for that folder (or scope)
Ensure the PDF Read Mode is set to DOC_INTELLIGENCE_DEFAULT
Enable the Image Content Extraction toggle
Select a vision-capable language model from the dropdown (e.g. AZURE_GPT_51_2025_1113)
Click Save — optionally apply to sub-scopes

(Optional) To override the system or user prompt for the vision LLM, paste a JSON object into the Configuration field of the Image Content Extraction section. For ONE_STEP (default):

json

{
  "strategy": "ONE_STEP",
  "strategyConfig": {
    "systemPrompt": "You are a financial-statements image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
    "userPrompt": "Extract every line of the balance sheet, income statement, or cash-flow statement visible in this image. Respond in English."
  }
}
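
Because the Configuration field expects strict JSON, a stray trailing comma or an unescaped quote makes the whole object invalid. The following is a minimal sketch for checking a draft locally before pasting it, assuming you prepare the override in a text editor first. The helper name `check_override` is hypothetical and is not part of the platform.

```python
import json

def check_override(text: str) -> dict:
    """Parse a draft override and return it, raising ValueError on bad JSON."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    if not isinstance(config, dict):
        raise ValueError("the override must be a JSON object")
    return config

draft = '{"strategy": "ONE_STEP", "strategyConfig": {"userPrompt": "Describe the chart."}}'
print(check_override(draft)["strategy"])
```

A draft such as `{"strategy": "ONE_STEP",}` is rejected by this check because strict JSON does not allow a trailing comma.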

The Image Content Extraction section in the UI is only visible when:

The feature flag FEATURE_FLAG_ENABLE_IMAGE_CONTENT_EXTRACTION_UN_17223 is enabled on the knowledge-upload service
The PDF Read Mode is set to DOC_INTELLIGENCE_DEFAULT

The list of available language models in the dropdown is controlled by the IMAGE_CONTENT_EXTRACTION_LANGUAGE_MODELS environment variable on the knowledge-upload service (comma-separated list of model identifiers).

All documents subsequently uploaded to that folder (or child folders, if applied to sub-scopes) will use the configured image content extraction settings.

Path B — Chat uploads (space / assistant configuration)

This path applies to documents uploaded directly in a chat conversation (e.g. drag-and-drop into the chat window).

Configuration in a Unique AI space

Navigate to Manage Spaces and click on your Space
On the Configuration tile, click the Advanced Settings link at the bottom

Open the Optimization tile and click on Configure File Ingestion

Configuration in a Unique Custom space

Navigate to Manage Spaces and click on your Space
Open Advanced Settings for the assistant
Configure the ingestionConfig to include imageContentExtraction:

json

{
  ...,
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {
      "imageContentExtraction": {
        "enabled": true,
        "languageModel": "AZURE_GPT_51_2025_1113",
        "settings": {
          "strategy": "ONE_STEP",
          "strategyConfig": {
            "systemPrompt": "You are a financial-statements image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
            "userPrompt": "Extract every line of the balance sheet, income statement, or cash-flow statement visible in this image. Respond in English."
          }
        }
      }
    }
  },
  ...
}

When a user uploads a file in a chat using that assistant, the assistant's settings.ingestionConfig is passed along with the upload and applied during processing. See Configuration Options → Custom Prompts (Advanced) for the full schema.

Important difference: For chat uploads, the ingestion configuration comes from the assistant settings — it is not inherited from any Knowledge Base folder/scope configuration. Each path is independent.

How configuration merging works

When a document is uploaded, the platform merges ingestion configuration from multiple layers (later layers override earlier ones):

Layer	Chat uploads	Knowledge Base uploads
1. Platform defaults	Service-level defaults (`imageContentExtraction.enabled: false`)	Service-level defaults
2. Owner config	Empty (`{}` — no scope config for chat)	Folder/scope `ingestionConfig`
3. Existing content config	From previous version of the content, if any	From previous version of the content, if any
4. Request input	`assistant.settings.ingestionConfig`	Per-upload override, if provided

The effective ("applied") ingestion configuration is stored on each content item and used by the ingestion worker.
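
The layering above can be pictured as a series of merges in which later layers win. The following is a minimal sketch, assuming a recursive (deep) merge of nested objects. This document does not state whether the platform merges deeply or replaces whole sub-objects, and `deep_merge` and `effective_config` are hypothetical names used only for illustration.

```python
from functools import reduce

def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def effective_config(*layers: dict) -> dict:
    """Merge layers in order: defaults, owner, existing content, request."""
    return reduce(deep_merge, layers, {})

platform_defaults = {"pdfConfig": {"imageContentExtraction": {"enabled": False}}}
owner_config = {}      # chat uploads carry no scope config
existing_content = {}  # first version of the content, nothing stored yet
request_input = {      # stands in for assistant.settings.ingestionConfig
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {
        "imageContentExtraction": {
            "enabled": True,
            "languageModel": "AZURE_GPT_51_2025_1113",
        }
    },
}

applied = effective_config(platform_defaults, owner_config, existing_content, request_input)
print(applied["pdfConfig"]["imageContentExtraction"])
```

In this chat-upload example the request layer switches `enabled` from the platform default of `false` to `true`, which matches the "later layers override earlier ones" rule.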

Important: The pdfReadMode must be set to DOC_INTELLIGENCE_DEFAULT for image content extraction to work. It does not apply when using CUSTOM_SINGLE_PAGE_API or other read modes.

2. Configure the Language Model

The languageModel field specifies which vision-capable AI model to use for extracting content from figures. This must be a model that supports image/vision inputs.

3. Optional: Advanced Settings

You can pass additional settings via the settings object:

json

{
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {
      "imageContentExtraction": {
        "enabled": true,
        "languageModel": "AZURE_GPT_51_2025_1113",
        "settings": {
          "strategy": "ONE_STEP",
          "strategyConfig": {
            "systemPrompt": "You are a financial-statements image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
            "userPrompt": "Extract every line of the balance sheet, income statement, or cash-flow statement visible in this image. Respond in English."
          },
          "languageModelConfig": {
            "supportsStructuredOutput": true
          }
        }
      }
    }
  }
}

The available strategyConfig fields and their resolution order are documented in Configuration Options → Custom Prompts (Advanced) below.

4. Upload Documents

Upload PDF documents through the standard Unique AI interface — either to the Knowledge Base (for scope-configured extraction) or directly in chat (for assistant-configured extraction). The system will automatically:

Process each page through MDI with figure detection enabled
Detect figures on each page using MDI layout analysis
Crop each figure from the rendered page image
Extract content from each figure using the configured vision model
Compose the final page by merging figure text back into the page markdown

5. Verify Results

Review the extracted content to verify that figure content has been captured. Look for:

Chart and graph data appearing as text within the document content
Diagram descriptions replacing what would have been blank or [Figure] placeholders
Correct positioning of figure text within the page flow

Configuration Options

Performance Settings

Setting	Description	Configuration key	Default	Recommended
Image DPI value	Resolution used when rendering PDF pages for figure cropping	`settings.imageProcessingConfig.dpiValue`	150	150–300 (higher = better quality but slower)
Image compression quality	JPEG compression quality for cropped figures	`settings.imageProcessingConfig.compressionQuality`	50	50–75
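
For a sense of scale, the pixel dimensions of a crop grow linearly with the DPI value, so the pixel count grows with its square. A small worked example, assuming a figure that occupies 4 × 3 inches on the page (an illustrative size, not taken from this document); `crop_size_px` is a hypothetical helper.

```python
def crop_size_px(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page region rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)

for dpi in (150, 300):
    width, height = crop_size_px(4.0, 3.0, dpi)
    print(dpi, width, height, width * height)
# 150 DPI -> 600 x 450 px (270,000 pixels)
# 300 DPI -> 1200 x 900 px (1,080,000 pixels, four times as many)
```

Doubling the DPI therefore quadruples the amount of image data per figure, which is one reason higher values are slower.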

Language Model Configuration

Setting	Description	Configuration key	Default
`languageModel`	Vision model used for extraction	`pdfConfig.imageContentExtraction.languageModel`	`AZURE_GPT_51_2025_1113`
`languageModelConfig.supportsStructuredOutput`	Whether to use native structured output	`settings.languageModelConfig.supportsStructuredOutput`	`true`
Fallback model	Automatic fallback if primary model fails	`settings.languageModelFallbackConfig.name`	`AZURE_GPT_4o_2024_1120`

Custom Prompts (Advanced)

The vision LLM uses built-in default prompts to extract image content. You can override these prompts per scope, per upload, or per assistant — without redeploying the service — by adding a strategyConfig block to the Configuration JSON.

Override prompts only when:

the documents need a domain-specific tone (legal, medical, financial regulator language),
the documents are not in a Latin script and the default English prompt confuses the model,
or you need to extract a different shape of information (e.g. always emit a markdown table of values).

Default behavior (no override) is appropriate for most customers.

Where to set custom prompts

Where the document is uploaded	Where to put `strategyConfig`
Knowledge Base (folder / scope)	Knowledge Upload app → folder Ingestion Configuration → Image Content Extraction → Configuration JSON textarea
Chat (drag-and-drop)	Admin app → space → assistant Advanced Settings → `ingestionConfig.pdfConfig.imageContentExtraction.settings.strategyConfig`
Single upload via API or SDK	`input.ingestionConfig.pdfConfig.imageContentExtraction.settings.strategyConfig` on `contentUpsert` / SDK upload helper

The same JSON keys work in all three places.

`ONE_STEP` (default strategy)

Field	Description	Default
`systemPrompt`	Replaces the hardcoded `ONE_STEP` system prompt. Should describe the `{ "reasoning": "...", "image_content": "..." }` JSON shape so the model complies with structured output.	You are a financial document image transcriber. Your task is to produce a structured text transcription of the image content that preserves all information for text-based retrieval systems. OUTPUT FORMAT (JSON): { "reasoning": "string" // your reasoning about the content, "image_content": "string" // the extracted content }
`userPrompt`	Replaces the hardcoded `ONE_STEP` user prompt.	Extract the content of this image following the appropriate strategy based on its type: If the image is a chart or data visualization (bar, line, pie, area, scatter, combined chart, etc.): State the chart type and title List all legend entries, axis labels, and category names Transcribe every visible data series into a markdown table with clear column headers For values printed on the chart, extract exactly; for values read from axes, estimate and mark as approximate Describe the overall trend or key comparison the chart illustrates If the image is a logo, icon, decorative illustration, or photo of a person: Output only the entity name or a single-sentence description Do not elaborate further

Each is resolved independently: request strategyConfig → service env var (IMAGE_CONTENT_EXTRACTION_ONE_STEP_SYSTEM_PROMPT / _USER_PROMPT) → platform default. Empty strings fall through.

{language} is only substituted when the prompt comes from an env var or platform default. Prompts you supply via strategyConfig are sent to the LLM verbatim — write the language directly into the text.

json

{
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "strategy": "ONE_STEP",
        "strategyConfig": {
          "systemPrompt": "You are a regulatory-filings image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
          "userPrompt": "Extract every figure, label, and footnote in this image. Respond in English."
        }
      }
    }
  }
}
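
The resolution order and the `{language}` rule described above can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: the placeholder default texts are invented, the second variable name is read as the `_USER_PROMPT` counterpart of the first, the environment is passed in as a plain dict, and the `language` argument stands in for whatever language the platform determines.

```python
PLATFORM_DEFAULTS = {  # placeholder text, not the real built-in prompts
    "systemPrompt": "Default system prompt. Respond in {language}.",
    "userPrompt": "Default user prompt. Respond in {language}.",
}
ENV_VARS = {
    "systemPrompt": "IMAGE_CONTENT_EXTRACTION_ONE_STEP_SYSTEM_PROMPT",
    "userPrompt": "IMAGE_CONTENT_EXTRACTION_ONE_STEP_USER_PROMPT",
}

def resolve_prompt(field: str, strategy_config: dict, env: dict, language: str) -> str:
    """Resolve one prompt: request -> env var -> platform default."""
    requested = strategy_config.get(field) or ""
    if requested:
        return requested  # sent verbatim, {language} is left untouched
    template = env.get(ENV_VARS[field], "") or PLATFORM_DEFAULTS[field]
    # str.replace is used instead of str.format because prompts contain JSON braces.
    return template.replace("{language}", language)

env = {"IMAGE_CONTENT_EXTRACTION_ONE_STEP_USER_PROMPT": "Transcribe this image. Respond in {language}."}
override = {"systemPrompt": "Respond in {language}.", "userPrompt": ""}

print(resolve_prompt("systemPrompt", override, env, "German"))
print(resolve_prompt("userPrompt", override, env, "German"))
```

The first call returns the override unchanged, including the literal `{language}` text, which is why overrides should name the language directly. The second call falls through the empty string to the environment variable and substitutes the language.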

`TWO_STEP` (classify, then extract)

Each figure is first classified into a category, then a category-specific extractor prompt is used. Configurable fields, all under strategyConfig:

Field	Type	Default
`classifierSystemPrompt`	string	platform default
`classifierUserPrompt`	string	platform default
`extractorCategoryToSystemPrompts`	object (string → string)	7 entries keyed by `chart_with_numerical_values`, `chart_without_numerical_values`, `table_structure`, `mixed_content`, `logo`, `text_or_numbers`, `default`
`extractorCategoryToUserPrompts`	object (string → string)	same 7 keys
`documentReferencePrompt`	string	`"Here is the whole document page as a reference:"`
`noExtractionForCategories`	array of string	`["illustrative_picture", "icon", "humans", "content_filter_exception"]`
`imagesInParallel`	integer	`5`

TWO_STEP does not have an env-var override layer (request → platform default only) and does not substitute {language}.
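
The classify-then-extract flow can be pictured with a short sketch. It assumes that a category listed in `noExtractionForCategories` is skipped entirely and that a category without its own prompt uses the `default` entry. The second point is an assumption suggested by the key name, not something this document confirms, and `pick_extractor_prompt` is a hypothetical helper.

```python
from typing import Dict, Optional

NO_EXTRACTION = {"illustrative_picture", "icon", "humans", "content_filter_exception"}

def pick_extractor_prompt(category: str, category_to_prompt: Dict[str, str]) -> Optional[str]:
    """Return the extractor prompt for a classified figure, or None to skip it."""
    if category in NO_EXTRACTION:
        return None  # no second model call for this figure
    # Assumption: categories without their own entry use the "default" key.
    return category_to_prompt.get(category, category_to_prompt["default"])

prompts = {  # illustrative prompts, not the platform defaults
    "table_structure": "Transcribe the table as markdown.",
    "default": "Describe the image content.",
}

for category in ("table_structure", "logo", "icon"):
    print(category, "->", pick_extractor_prompt(category, prompts))
```

Skipping decorative categories is what lets TWO_STEP recover some of the cost of the extra classification call.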

Custom Configuration

Performance settings and fallback model settings are configured through the optional settings JSON object inside pdfConfig.imageContentExtraction.

In the UI, this corresponds to the Configuration field shown below the Image Content Extraction language model. The same JSON can also be added directly in a space or assistant ingestionConfig.

Example:

json

{
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "imageProcessingConfig": {
          "dpiValue": 150,
          "compressionQuality": 50
        },
        "languageModelFallbackConfig": {
          "name": "AZURE_GPT_4o_2024_1120"
        }
      }
    }
  }
}

How It Works Under the Hood

When Image Content Extraction is enabled, PDF ingestion continues to use the standard Microsoft Document Intelligence pipeline, with an additional enrichment step for visual content.

Figure detection: Microsoft Document Intelligence detects figures, charts, diagrams, and other visual elements in the PDF.
AI-based extraction: Detected figures are analyzed with the configured vision-capable language model. Depending on the environment configuration, figures may be processed individually or in page batches.
Content enrichment: The extracted figure text is inserted back into the processed page content so it can be searched and used in AI chat answers.

If extraction fails for an individual figure, the rest of the page can still be processed. If figure extraction cannot run for a page, the system falls back to the standard Microsoft Document Intelligence output for that page.
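
The two fallback levels can be illustrated with a small sketch. Everything here is hypothetical: the `[Figure N]` placeholders, the `ExtractionUnavailable` error, and the `enrich_page` function are invented to show the control flow and do not describe the actual worker code.

```python
from typing import Callable, Dict

class ExtractionUnavailable(Exception):
    """Hypothetical error: figure extraction cannot run at all."""

def enrich_page(mdi_markdown: str, figures: Dict[str, bytes],
                extract: Callable[[bytes], str]) -> str:
    """Insert extracted figure text, tolerating per-figure failures."""
    page = mdi_markdown
    try:
        for placeholder, image in figures.items():
            try:
                text = extract(image)
            except ExtractionUnavailable:
                raise      # page-level failure, handled below
            except Exception:
                continue   # one bad figure: keep its placeholder, carry on
            page = page.replace(placeholder, text)
    except ExtractionUnavailable:
        return mdi_markdown  # fall back to the plain MDI output for this page
    return page

def fake_extract(image: bytes) -> str:
    if image == b"broken":
        raise ValueError("model returned no content")
    return "Bar chart: revenue by quarter."

page = "Intro\n[Figure 1]\n[Figure 2]"
print(enrich_page(page, {"[Figure 1]": b"ok", "[Figure 2]": b"broken"}, fake_extract))
```

In the example the first figure is replaced with extracted text while the second keeps its placeholder, so a single failure does not discard the rest of the page.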

Limitations

Current Limitations

This feature is currently in BETA. Extraction quality may vary depending on figure complexity, image resolution, and AI model capabilities.

Handwritten content in figures may not be reliably extracted
Very small figures (< 50px in any dimension) may produce low-quality extractions
Overlapping or nested figures are processed independently; the composer does not currently merge related figures

Performance Considerations

Enabling image content extraction increases per-page processing time because each figure requires an additional AI model call
Processing time scales with the number of figures per page (figures are processed in parallel, up to 5 concurrently)

Scenario	Typical Added Latency (per page)	Notes
Page with 1-2 figures	+3–7 sec	Minimal overhead
Page with 5+ figures	+7–20 sec	Parallel extraction helps
Two-step with many decorative images	Varies	Classification overhead offset by skipped extractions
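
Bounded parallelism of this kind is commonly implemented with a semaphore. The following is a generic sketch of that technique, assuming an async `extract` callable per figure. It is not the service's code, and `extract_all` and `fake_extract` are hypothetical names.

```python
import asyncio

async def extract_all(figures, extract, limit=5):
    """Run `extract` for every figure with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(figure):
        async with semaphore:
            return await extract(figure)

    # gather returns results in the same order as the input figures
    return await asyncio.gather(*(bounded(f) for f in figures))

async def demo():
    running = peak = 0

    async def fake_extract(figure):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stands in for the model call
        running -= 1
        return f"text for {figure}"

    results = await extract_all([f"fig{i}" for i in range(12)], fake_extract)
    return results, peak

results, peak = asyncio.run(demo())
print(len(results), peak)
```

With a limit of 5, a page with 12 figures is handled in roughly three waves, which is consistent with latency growing more slowly than the figure count.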

Resource Usage

Additional AI model API calls are incurred for each figure extracted
In the internal evaluation, page-batch mode used ~2,035 tokens per call on average, while per-figure mode used ~1,034 tokens per call.
Page-batch mode was faster but used about 1,001 more tokens per call, mostly prompt/input tokens, because the full page image is attached as context.
Token costs depend on the selected model and model pricing.
Network bandwidth increases due to base64-encoded figure images being sent to the Agentic Ingestion service
Redis is used for async job queue management in the Agentic Ingestion service
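
The token averages quoted above lend themselves to a rough budget estimate. A minimal sketch, assuming you already know how many model calls a document will trigger; the figure of 200 calls is purely illustrative, and the number of calls per page in each mode is not stated in this document.

```python
PER_FIGURE_TOKENS = 1034  # average tokens per call, per-figure mode
PAGE_BATCH_TOKENS = 2035  # average tokens per call, page-batch mode

def estimated_tokens(calls: int, tokens_per_call: int) -> int:
    """Rough total: number of model calls times the average tokens per call."""
    return calls * tokens_per_call

print(PAGE_BATCH_TOKENS - PER_FIGURE_TOKENS)      # 1001
print(estimated_tokens(200, PER_FIGURE_TOKENS))   # 206800
print(estimated_tokens(200, PAGE_BATCH_TOKENS))   # 407000
```

Multiply the result by the per-token price of the selected model to approximate cost.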

Troubleshooting

Common Issues

Figures not being extracted

Verify that imageContentExtraction.enabled is true in the ingestion config
Ensure pdfReadMode is set to DOC_INTELLIGENCE_DEFAULT (not CUSTOM_SINGLE_PAGE_API)
Check that a valid languageModel is specified
Verify that the Agentic Ingestion service is running and reachable from node-ingestion-worker
For Knowledge Base uploads: check the folder's ingestion configuration in the knowledge-upload app
For chat uploads: check the assistant's settings.ingestionConfig in the admin app
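
The configuration-related checks in this list can be run against an exported ingestion config. A minimal sketch, assuming the config is available as a parsed JSON object; `find_config_problems` is a hypothetical helper and covers only the first three checks, not service reachability.

```python
def find_config_problems(ingestion_config: dict) -> list:
    """Return human-readable problems that would prevent figure extraction."""
    problems = []
    if ingestion_config.get("pdfReadMode") != "DOC_INTELLIGENCE_DEFAULT":
        problems.append("pdfReadMode is not DOC_INTELLIGENCE_DEFAULT")
    extraction = ingestion_config.get("pdfConfig", {}).get("imageContentExtraction", {})
    if extraction.get("enabled") is not True:
        problems.append("imageContentExtraction.enabled is not true")
    if not extraction.get("languageModel"):
        problems.append("languageModel is not specified")
    return problems

config = {
    "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
    "pdfConfig": {"imageContentExtraction": {"enabled": True}},
}
for problem in find_config_problems(config):
    print(problem)
```

An empty result means the configuration itself looks correct and attention should shift to the service and network checks.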

Image Content Extraction toggle not visible in the Knowledge Upload UI

Ensure FEATURE_FLAG_ENABLE_IMAGE_CONTENT_EXTRACTION_UN_17223 is set to true on the knowledge-upload service
Ensure the PDF Read Mode is set to DOC_INTELLIGENCE_DEFAULT — the toggle is hidden for other read modes

Poor extraction quality

Increase the image DPI value for higher-resolution figure crops
Try a more capable language model (e.g., AZURE_GPT_51_2025_1113)

Slow processing

Reduce image DPI if extraction quality is acceptable at lower resolutions
Check Agentic Ingestion service health and resource allocation
Monitor Redis queue length for job backlogs

Fallback to plain MDI

If the logs show "Falling back to MDI without figure extraction", this means the figure extraction pipeline encountered an error. Check:

Agentic Ingestion service logs for errors
Network connectivity between node-ingestion-worker and agentic-ingestion
AI model endpoint availability

Getting Help

Check service logs in both node-ingestion-worker and agentic-ingestion pods
Review the ingestion configuration for the affected space or folder
Contact the infrastructure team with specific error messages
Open an issue in the repository with detailed logs

Best Practices

When to Enable

Enable for document sets that are rich in charts, graphs, and diagrams where the visual content carries important information
Enable when users report that chart data is missing or figures show as blank in AI chat responses
Consider the two-step strategy for document collections with a mix of informational and decorative images
For Knowledge Base content: enable at the folder level to apply consistently to all uploads within that scope
For chat uploads: configure on the assistant so all users uploading via that assistant benefit automatically

When Not to Enable

Documents that are primarily text and tables — MDI handles these well without figure extraction
When processing speed is critical and the added latency per figure is not acceptable
For non-PDF file types: image content extraction only applies to PDF processing with MDI. PowerPoint and Word documents can be converted to PDF before ingestion.

Optimization Tips

Use 150 DPI as a starting point; increase to 200–300 only if extraction quality is insufficient
Monitor processing times and adjust the language model if latency is too high
