Agentic Image Content Extraction

This feature is currently in BETA. It may change before general availability in response to user and client feedback, but it is targeted to be high quality and stable. Documentation may lag behind feature updates. Use in production environments at your own discretion. Please refer to our Upgrade and Release Process for more information.

Overview

Documents processed through Microsoft Document Intelligence (MDI) often contain embedded figures — charts, diagrams, infographics, and other visual elements — whose content is lost or poorly represented in the extracted text. Traditional OCR and layout analysis tools can detect that a figure exists, but cannot interpret what the figure actually shows.

Agentic Image Content Extraction solves this by using vision-capable AI models to understand and transcribe the content of individual images and figures found within PDF documents. This produces a richer, more complete representation of each document page — ensuring that information locked inside charts, graphs, and diagrams becomes searchable and available for RAG-based AI chat.

Who It's For

  • Knowledge base admins who configure document processing workflows and want higher-quality text extraction from image-heavy documents

  • Users who rely on AI chat to search and query documents containing charts, graphs, infographics, and diagrams

  • Organizations in industries (FSI, consulting, research) where critical data is often communicated through visual elements

Can this feature be enabled on non-Azure or self-hosted tenants?

Agentic Image Content Extraction requires Microsoft Document Intelligence (MDI) and access to vision LLMs via the platform API. MDI can be deployed on-prem.

Benefits

Captures Information Locked in Visuals

  • Charts and graphs: Bar charts, line graphs, pie charts — the AI model reads axes, labels, values, and trends

  • Diagrams and flowcharts: Extracts structured descriptions of process flows, architecture diagrams, and org charts

  • Infographics: Interprets complex visual layouts combining text, icons, and data

  • Tables rendered as images: When tables are embedded as figures rather than native table elements, the vision model can still extract the data

Seamless Integration

  • Works within the existing MDI pipeline — no need to switch to a different read mode

  • Extracted figure text is inserted at the correct position in the page markdown

  • Figure captions from MDI are preserved alongside the extracted content

  • Falls back gracefully to standard MDI output if figure extraction fails for any image

Example Use Cases

Financial Reports

  • Revenue charts: Extract quarterly revenue figures, growth percentages, and trend descriptions from bar/line charts

  • Portfolio allocations: Interpret pie charts showing asset allocation breakdowns

  • Performance dashboards: Capture KPI values and metrics from dashboard screenshots embedded in reports

Research and Analysis

  • Scientific figures: Extract data from experimental result charts, correlation plots, and distribution graphs

  • Market analysis: Interpret market share diagrams, competitive landscape visuals, and trend charts

  • Process diagrams: Transcribe workflow and process flow diagrams into searchable text

Regulatory and Compliance

  • Risk matrices: Extract risk ratings and categories from visual risk assessment matrices

  • Compliance dashboards: Capture compliance status from visual scorecards and dashboards

  • Organizational charts: Extract hierarchical information from org chart figures

Step-by-Step Guide

1. Enable Image Content Extraction

Image Content Extraction is configured as part of the PDF ingestion settings and can be enabled through two independent paths, depending on how documents are uploaded.

Path A — Knowledge Base uploads (folder / scope ingestion configuration)

This is the primary configuration path for documents uploaded to the Knowledge Base via the knowledge-upload app.

  1. Open the Knowledge Base app and navigate to the target folder

  2. Open the Ingestion Configuration for that folder (or scope)

  3. Ensure the PDF Read Mode is set to DOC_INTELLIGENCE_DEFAULT

  4. Enable the Image Content Extraction toggle

  5. Select a vision-capable language model from the dropdown (e.g. AZURE_GPT_51_2025_1113)

  6. Click Save — optionally apply to sub-scopes

  7. (Optional) To override the system or user prompt for the vision LLM, paste a JSON object into the Configuration field of the Image Content Extraction section. For ONE_STEP (default):

    json
    {
      "strategy": "ONE_STEP",
      "strategyConfig": {
        "systemPrompt": "...",
        "userPrompt": "..."
      }
    }

The Image Content Extraction section in the UI is only visible when:

  • The feature flag FEATURE_FLAG_ENABLE_IMAGE_CONTENT_EXTRACTION_UN_17223 is enabled on the knowledge-upload service

  • The PDF Read Mode is set to DOC_INTELLIGENCE_DEFAULT

The list of available language models in the dropdown is controlled by the IMAGE_CONTENT_EXTRACTION_LANGUAGE_MODELS environment variable on the knowledge-upload service (comma-separated list of model identifiers).
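As a rough illustration of the comma-separated format, the sketch below splits such a value into individual model identifiers. Trimming whitespace and dropping empty entries are assumptions about how the service reads the variable, and the fallback value is only an example.

```python
import os


def parse_model_list(raw: str) -> list[str]:
    """Split a comma-separated list of model identifiers, dropping blank entries."""
    return [item.strip() for item in raw.split(",") if item.strip()]


# Example fallback value only; the real list depends on your deployment.
raw_value = os.environ.get(
    "IMAGE_CONTENT_EXTRACTION_LANGUAGE_MODELS",
    "AZURE_GPT_51_2025_1113,AZURE_GPT_4o_2024_1120",
)
models = parse_model_list(raw_value)
```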

All documents subsequently uploaded to that folder (or child folders, if applied to sub-scopes) will use the configured image content extraction settings.

Path B — Chat uploads (space / assistant configuration)

This path applies to documents uploaded directly in a chat conversation (e.g. drag-and-drop into the chat window).

  1. Open the Admin app and navigate to the target space

  2. Open Advanced Settings for the assistant

  3. Configure the ingestionConfig within the assistant settings to include imageContentExtraction:

json
{
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {
      "imageContentExtraction": {
        "enabled": true,
        "languageModel": "AZURE_GPT_51_2025_1113",
        "settings": {
          "strategy": "ONE_STEP",
          "strategyConfig": {
            "systemPrompt": "...",
            "userPrompt": "..."
          }
        }
      }
    }
  }
}

When a user uploads a file in a chat using that assistant, the assistant's settings.ingestionConfig is passed along with the upload and applied during processing. See Configuration Options → Custom Prompts (Advanced) for the full schema.

Important difference: For chat uploads, the ingestion configuration comes from the assistant settings — it is not inherited from any Knowledge Base folder/scope configuration. Each path is independent.

How configuration merging works

When a document is uploaded, the platform merges ingestion configuration from multiple layers (later layers override earlier ones):

| Layer | Chat uploads | Knowledge Base uploads |
| --- | --- | --- |
| 1. Platform defaults | Service-level defaults (imageContentExtraction.enabled: false) | Service-level defaults |
| 2. Owner config | Empty ({}, no scope config for chat) | Folder/scope ingestionConfig |
| 3. Existing content config | From previous version of the content, if any | From previous version of the content, if any |
| 4. Request input | assistant.settings.ingestionConfig | Per-upload override, if provided |

The effective ("applied") ingestion configuration is stored on each content item and used by the ingestion worker.
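The layering can be pictured with a small sketch. It assumes a recursive (deep) merge in which nested objects are combined key by key; this page does not say whether the platform merges deeply or replaces whole objects, so treat the helper names deep_merge and applied_config as illustrative only.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`.

    Nested dicts are merged key by key; any other value is replaced outright.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def applied_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge layers in order, so later layers override earlier ones."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Chat upload example: layers 2 and 3 are empty, layer 4 comes from the assistant.
platform_defaults = {"pdfConfig": {"imageContentExtraction": {"enabled": False}}}
owner_config: dict[str, Any] = {}
existing_content_config: dict[str, Any] = {}
request_input = {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {
        "imageContentExtraction": {
            "enabled": True,
            "languageModel": "AZURE_GPT_51_2025_1113",
        }
    },
}

effective = applied_config(
    platform_defaults, owner_config, existing_content_config, request_input
)
```

In this example the request input switches enabled to true while leaving the platform defaults object itself untouched.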

Important: The pdfReadMode must be set to DOC_INTELLIGENCE_DEFAULT for image content extraction to work. It does not apply when using CUSTOM_SINGLE_PAGE_API or other read modes.

2. Configure the Language Model

The languageModel field specifies which vision-capable AI model to use for extracting content from figures. This must be a model that supports image/vision inputs.

3. Optional: Advanced Settings

You can pass additional settings via the settings object:

json
{
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {
      "imageContentExtraction": {
        "enabled": true,
        "languageModel": "AZURE_GPT_51_2025_1113",
        "settings": {
          "strategy": "ONE_STEP",
          "strategyConfig": {
            "systemPrompt": "You are a financial-statements image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
            "userPrompt": "Extract every line of the balance sheet, income statement, or cash-flow statement visible in this image. Respond in English."
          },
          "languageModelConfig": {
            "supportsStructuredOutput": true
          }
        }
      }
    }
  }
}

The available strategyConfig fields and their resolution order are documented in Configuration Options → Custom Prompts (Advanced) below.

4. Upload Documents

Upload PDF documents through the standard Unique AI interface — either to the Knowledge Base (for scope-configured extraction) or directly in chat (for assistant-configured extraction). The system will automatically:

  1. Process each page through MDI with figure detection enabled

  2. Detect figures on each page using MDI layout analysis

  3. Crop each figure from the rendered page image

  4. Extract content from each figure using the configured vision model

  5. Compose the final page by merging figure text back into the page markdown
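To make the cropping step more concrete, here is a minimal sketch of how a figure's bounding polygon could be turned into a pixel crop box. It assumes the polygon is a flat list of x/y coordinates in inches (the unit MDI uses for PDF input) and that the page image was rendered at the configured DPI. The function name is hypothetical and the platform's actual cropping code may differ.

```python
import math


def polygon_to_pixel_box(
    polygon_inches: list[float], dpi: int
) -> tuple[int, int, int, int]:
    """Convert a flat [x1, y1, x2, y2, ...] polygon in inches into a
    (left, top, right, bottom) pixel box on a page rendered at `dpi`."""
    xs = polygon_inches[0::2]
    ys = polygon_inches[1::2]
    # Round outwards so the crop never cuts into the figure.
    left = math.floor(min(xs) * dpi)
    top = math.floor(min(ys) * dpi)
    right = math.ceil(max(xs) * dpi)
    bottom = math.ceil(max(ys) * dpi)
    return left, top, right, bottom


# A 4 x 3 inch figure whose top-left corner is at (1 in, 2 in).
figure_polygon = [1.0, 2.0, 5.0, 2.0, 5.0, 5.0, 1.0, 5.0]
box_150 = polygon_to_pixel_box(figure_polygon, dpi=150)  # (150, 300, 750, 750)
box_300 = polygon_to_pixel_box(figure_polygon, dpi=300)  # (300, 600, 1500, 1500)
```

Doubling the DPI doubles each pixel dimension of the crop, which is why higher DPI values improve legibility for the vision model but increase image size and processing time.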

5. Verify Results

Review the extracted content to verify that figure content has been captured. Look for:

  • Chart and graph data appearing as text within the document content

  • Diagram descriptions replacing what would have been blank or [Figure] placeholders

  • Correct positioning of figure text within the page flow

Configuration Options

Performance Settings

| Setting | Description | Configuration key | Default | Recommended |
| --- | --- | --- | --- | --- |
| Image DPI value | Resolution used when rendering PDF pages for figure cropping | settings.imageProcessingConfig.dpiValue | 150 | 150–300 (higher = better quality but slower) |
| Image compression quality | JPEG compression quality for cropped figures | settings.imageProcessingConfig.compressionQuality | 50 | 50–75 |

Language Model Configuration

| Setting | Description | Configuration key | Default |
| --- | --- | --- | --- |
| languageModel | Vision model used for extraction | pdfConfig.imageContentExtraction.languageModel | AZURE_GPT_51_2025_1113 |
| languageModelConfig.supportsStructuredOutput | Whether to use native structured output | settings.languageModelConfig.supportsStructuredOutput | true |
| Fallback model | Automatic fallback if primary model fails | settings.languageModelFallbackConfig.name | AZURE_GPT_4o_2024_1120 |

Custom Prompts (Advanced)

The vision LLM uses built-in default prompts to extract image content. You can override these prompts per scope, per upload, or per assistant — without redeploying the service — by adding a strategyConfig block to the Configuration JSON.

Override prompts only when:

  • the documents need a domain-specific tone (legal, medical, financial regulator language),

  • the documents are not in a Latin script and the default English prompt confuses the model,

  • or you need to extract a different shape of information (e.g. always emit a markdown table of values).

Default behavior (no override) is appropriate for most customers.

Where to set custom prompts

| Where the document is uploaded | Where to put strategyConfig |
| --- | --- |
| Knowledge Base (folder / scope) | Knowledge Upload app → folder Ingestion Configuration → Image Content Extraction → Configuration JSON textarea |
| Chat (drag-and-drop) | Admin app → space → assistant Advanced Settings → ingestionConfig.pdfConfig.imageContentExtraction.settings.strategyConfig |
| Single upload via API or SDK | input.ingestionConfig.pdfConfig.imageContentExtraction.settings.strategyConfig on contentUpsert / SDK upload helper |

The same JSON keys work in all three places.

ONE_STEP (default strategy)

| Field | Description |
| --- | --- |
| systemPrompt | Replaces the hardcoded ONE_STEP system prompt. Should describe the { "reasoning": "...", "image_content": "..." } JSON shape so the model complies with structured output. |
| userPrompt | Replaces the hardcoded ONE_STEP user prompt. |

Each is resolved independently: request strategyConfig → service env var (IMAGE_CONTENT_EXTRACTION_ONE_STEP_SYSTEM_PROMPT / _USER_PROMPT) → platform default. Empty strings fall through.

{language} is only substituted when the prompt comes from an env var or platform default. Prompts you supply via strategyConfig are sent to the LLM verbatim — write the language directly into the text.
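The resolution order and the {language} rule can be sketched as follows. This is an illustration of the described behavior, not the service's code: the function name, the language argument, and the default prompt text used in the example are placeholders.

```python
import os
from typing import Optional

SYSTEM_PROMPT_ENV = "IMAGE_CONTENT_EXTRACTION_ONE_STEP_SYSTEM_PROMPT"


def resolve_prompt(
    request_value: Optional[str],
    env_var: str,
    platform_default: str,
    language: str = "English",
) -> str:
    """Resolve one prompt: request strategyConfig, then env var, then default."""
    if request_value:
        # Supplied via strategyConfig: sent verbatim, no {language} substitution.
        return request_value
    # Empty or missing values fall through to the next layer.
    env_value = os.environ.get(env_var, "")
    template = env_value if env_value else platform_default
    # str.replace is used instead of str.format so that literal braces in the
    # prompt (for example the JSON output shape) are left untouched.
    return template.replace("{language}", language)


# Placeholder default text; the real platform default is not shown on this page.
example = resolve_prompt("", SYSTEM_PROMPT_ENV, "Transcribe the image in {language}.")
```

Because request-level prompts are passed through unchanged, a custom prompt containing {language} would reach the model with the placeholder still in it.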

json
{
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "strategy": "ONE_STEP",
        "strategyConfig": {
          "systemPrompt": "You are a regulatory-filings image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
          "userPrompt": "Extract every figure, label, and footnote in this image. Respond in English."
        }
      }
    }
  }
}

TWO_STEP (classify, then extract)

Each figure is first classified into a category, then a category-specific extractor prompt is used. Configurable fields, all under strategyConfig:

| Field | Type | Default |
| --- | --- | --- |
| classifierSystemPrompt | string | platform default |
| classifierUserPrompt | string | platform default |
| extractorCategoryToSystemPrompts | object (string → string) | 7 entries keyed by chart_with_numerical_values, chart_without_numerical_values, table_structure, mixed_content, logo, text_or_numbers, default |
| extractorCategoryToUserPrompts | object (string → string) | same 7 keys |
| documentReferencePrompt | string | "Here is the whole document page as a reference:" |
| noExtractionForCategories | array of string | ["illustrative_picture", "icon", "humans", "content_filter_exception"] |
| imagesInParallel | integer | 5 |

TWO_STEP does not have an env-var override layer (request → platform default only) and does not substitute {language}.
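The classify-then-extract flow might look roughly like the sketch below. The classify and extract callables stand in for the two vision LLM calls and are hypothetical. Falling back to the "default" entry for a category without its own prompt is an assumption based on the presence of a default key, not documented behavior.

```python
from typing import Callable, Optional

# Default skip list from the noExtractionForCategories setting.
NO_EXTRACTION_FOR_CATEGORIES = [
    "illustrative_picture",
    "icon",
    "humans",
    "content_filter_exception",
]


def pick_prompt(category: str, prompts: dict[str, str]) -> str:
    """Pick the category-specific prompt, else the 'default' entry (assumed)."""
    return prompts.get(category, prompts["default"])


def two_step(
    figure: bytes,
    classify: Callable[[bytes], str],
    extract: Callable[[bytes, str], str],
    system_prompts: dict[str, str],
    skip_categories: Optional[list[str]] = None,
) -> Optional[str]:
    """Classify a figure, then extract it with the prompt for its category.

    Returns None when the category is on the skip list, so no extraction
    call is made for decorative images.
    """
    skip = NO_EXTRACTION_FOR_CATEGORIES if skip_categories is None else skip_categories
    category = classify(figure)
    if category in skip:
        return None
    return extract(figure, pick_prompt(category, system_prompts))
```

Skipping extraction for decorative categories is what lets the extra classification call pay for itself on documents with many icons or stock photos.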

Custom Configuration

The performance settings and the fallback model settings are configured through the optional settings JSON object inside pdfConfig.imageContentExtraction.

In the UI, this corresponds to the Configuration field shown below the Image Content Extraction language model. The same JSON can also be added directly in a space or assistant ingestionConfig.

Example:

json
{
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "imageProcessingConfig": {
          "dpiValue": 150,
          "compressionQuality": 50
        },
        "languageModelFallbackConfig": {
          "name": "AZURE_GPT_4o_2024_1120"
        }
      }
    }
  }
}

How It Works Under the Hood

When Image Content Extraction is enabled, PDF ingestion continues to use the standard Microsoft Document Intelligence pipeline, with an additional enrichment step for visual content.

  1. Figure detection: Microsoft Document Intelligence detects figures, charts, diagrams, and other visual elements in the PDF.

  2. AI-based extraction: Detected figures are analyzed with the configured vision-capable language model. Depending on the environment configuration, figures may be processed individually or in page batches.

  3. Content enrichment: The extracted figure text is inserted back into the processed page content so it can be searched and used in AI chat answers.

If extraction fails for an individual figure, the rest of the page can still be processed. If figure extraction cannot run for a page, the system falls back to the standard Microsoft Document Intelligence output for that page.
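The two failure modes can be sketched as nested error handling. Everything here is illustrative: the [[figure:<id>]] marker format, the crop_figures and extract_figure callables, and the function name are hypothetical stand-ins for the real pipeline.

```python
from typing import Callable


def enrich_page(
    mdi_markdown: str,
    crop_figures: Callable[[], dict[str, bytes]],
    extract_figure: Callable[[bytes], str],
) -> str:
    """Insert extracted figure text into the page, tolerating failures."""
    try:
        # Page-level step: if this fails, extraction cannot run for the page.
        figures = crop_figures()
    except Exception:
        return mdi_markdown  # fall back to the standard MDI output

    page = mdi_markdown
    for figure_id, image in figures.items():
        try:
            text = extract_figure(image)
        except Exception:
            continue  # one failed figure does not fail the rest of the page
        # Hypothetical marker showing where the figure sits in the markdown.
        page = page.replace(f"[[figure:{figure_id}]]", text)
    return page
```

With this shape, a single unreadable figure leaves its marker in place while the other figures on the page are still enriched.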

Limitations

Current Limitations

This feature is currently in BETA. Extraction quality may vary depending on figure complexity, image resolution, and AI model capabilities.

  • Handwritten content in figures may not be reliably extracted

  • Very small figures (< 50px in any dimension) may produce low-quality extractions

  • Overlapping or nested figures are processed independently; the composer does not currently merge related figures

Performance Considerations

  • Enabling image content extraction increases per-page processing time because each figure requires an additional AI model call

  • Processing time scales with the number of figures per page (figures are processed in parallel, up to 5 concurrently)
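Bounded parallelism of this kind is commonly implemented with a semaphore. The sketch below uses Python's asyncio for illustration; the service's actual concurrency mechanism is not described on this page, and extract_one is a hypothetical stand-in for the vision model call.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def extract_all(
    figures: list[T],
    extract_one: Callable[[T], Awaitable[R]],
    images_in_parallel: int = 5,
) -> list[R]:
    """Run extract_one over all figures with at most N calls in flight."""
    semaphore = asyncio.Semaphore(images_in_parallel)

    async def guarded(figure: T) -> R:
        async with semaphore:
            return await extract_one(figure)

    # gather returns results in the same order as the input figures.
    return await asyncio.gather(*(guarded(figure) for figure in figures))
```

With a limit of 5, a page with 12 figures is processed in roughly three waves rather than twelve sequential calls, which is why added latency grows more slowly than the figure count.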

| Scenario | Typical Added Latency (per page) | Notes |
| --- | --- | --- |
| Page with 1-2 figures | +3–7 sec | Minimal overhead |
| Page with 5+ figures | +7–20 sec | Parallel extraction helps |
| Two-step with many decorative images | Varies | Classification overhead offset by skipped extractions |

Resource Usage

  • Additional AI model API calls are incurred for each figure extracted

  • In the internal evaluation, page-batch mode used ~2,035 tokens per call on average, while per-figure mode used ~1,034 tokens per call.

  • Page-batch mode was faster, but used about 1,001 more tokens per call (~2,035 vs. ~1,034), mostly prompt/input tokens, because the full page image is attached as context.

  • Token costs depend on the selected model and model pricing.

  • Network bandwidth increases due to base64-encoded figure images being sent to the Agentic Ingestion service

  • Redis is used for async job queue management in the Agentic Ingestion service
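For rough budgeting, the per-call averages above can be turned into a back-of-the-envelope estimate. The sketch only multiplies the reported averages by a call count. The price per million tokens is a made-up placeholder, and the number of calls each mode needs for your documents is something you would have to measure.

```python
# Average tokens per call reported in the internal evaluation.
PER_FIGURE_TOKENS_PER_CALL = 1034
PAGE_BATCH_TOKENS_PER_CALL = 2035


def estimate_tokens(calls: int, tokens_per_call: int) -> int:
    """Estimated total tokens for a number of model calls."""
    return calls * tokens_per_call


def estimate_cost(total_tokens: int, price_per_million_tokens: float) -> float:
    """Estimated cost for a token total at a given (hypothetical) price."""
    return total_tokens / 1_000_000 * price_per_million_tokens


per_figure_total = estimate_tokens(1000, PER_FIGURE_TOKENS_PER_CALL)
page_batch_total = estimate_tokens(1000, PAGE_BATCH_TOKENS_PER_CALL)
```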

Troubleshooting

Common Issues

Figures not being extracted

  • Verify that imageContentExtraction.enabled is true in the ingestion config

  • Ensure pdfReadMode is set to DOC_INTELLIGENCE_DEFAULT (not CUSTOM_SINGLE_PAGE_API)

  • Check that a valid languageModel is specified

  • Verify that the Agentic Ingestion service is running and reachable from node-ingestion-worker

  • For Knowledge Base uploads: check the folder's ingestion configuration in the knowledge-upload app

  • For chat uploads: check the assistant's settings.ingestionConfig in the admin app
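The first three checks in this list can be automated against the applied ingestion configuration of a content item. The helper below is a hypothetical sketch that only inspects the JSON keys documented on this page; it does not call any platform API.

```python
def find_config_problems(ingestion_config: dict) -> list[str]:
    """Return human-readable problems that would prevent figure extraction."""
    problems: list[str] = []
    pdf_config = ingestion_config.get("pdfConfig") or {}
    extraction = pdf_config.get("imageContentExtraction") or {}

    if extraction.get("enabled") is not True:
        problems.append("imageContentExtraction.enabled is not true")
    if ingestion_config.get("pdfReadMode") != "DOC_INTELLIGENCE_DEFAULT":
        problems.append("pdfReadMode is not DOC_INTELLIGENCE_DEFAULT")
    if not extraction.get("languageModel"):
        problems.append("no languageModel specified")
    return problems


# Example: a config that uses the wrong read mode.
example_problems = find_config_problems(
    {
        "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
        "pdfConfig": {
            "imageContentExtraction": {
                "enabled": True,
                "languageModel": "AZURE_GPT_51_2025_1113",
            }
        },
    }
)
```

An empty result means the configuration itself looks fine, and the remaining checks (service reachability and where the configuration was set) are the next place to look.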

Image Content Extraction toggle not visible in the Knowledge Upload UI

  • Ensure FEATURE_FLAG_ENABLE_IMAGE_CONTENT_EXTRACTION_UN_17223 is set to true on the knowledge-upload service

  • Ensure the PDF Read Mode is set to DOC_INTELLIGENCE_DEFAULT — the toggle is hidden for other read modes

Poor extraction quality

  • Increase the image DPI value for higher-resolution figure crops

  • Try a more capable language model (e.g., AZURE_GPT_51_2025_1113)

Slow processing

  • Reduce image DPI if extraction quality is acceptable at lower resolutions

  • Check Agentic Ingestion service health and resource allocation

  • Monitor Redis queue length for job backlogs

Fallback to plain MDI

If the logs show "Falling back to MDI without figure extraction", this means the figure extraction pipeline encountered an error. Check:

  • Agentic Ingestion service logs for errors

  • Network connectivity between node-ingestion-worker and agentic-ingestion

  • AI model endpoint availability

Getting Help

  1. Check service logs in both node-ingestion-worker and agentic-ingestion pods

  2. Review the ingestion configuration for the affected space or folder

  3. Contact the infrastructure team with specific error messages

  4. Open an issue in the repository with detailed logs

Best Practices

When to Enable

  • Enable for document sets that are rich in charts, graphs, and diagrams where the visual content carries important information

  • Enable when users report that chart data is missing or figures show as blank in AI chat responses

  • Consider the two-step strategy for document collections with a mix of informational and decorative images

  • For Knowledge Base content: enable at the folder level to apply consistently to all uploads within that scope

  • For chat uploads: configure on the assistant so all users uploading via that assistant benefit automatically

When Not to Enable

  • Documents that are primarily text and tables — MDI handles these well without figure extraction

  • When processing speed is critical and the added latency per figure is not acceptable

  • For non-PDF file types: image content extraction only applies to PDF processing with MDI. PowerPoint and Word documents can be converted to PDF before ingestion.

Optimization Tips

  • Use 150 DPI as a starting point; increase to 200–300 only if extraction quality is insufficient

  • Monitor processing times and adjust the language model if latency is too high
