Agentic Image Content Extraction
Overview
Documents processed through Microsoft Document Intelligence (MDI) often contain embedded figures — charts, diagrams, infographics, and other visual elements — whose content is lost or poorly represented in the extracted text. Traditional OCR and layout analysis tools can detect that a figure exists, but cannot interpret what the figure actually shows.
Agentic Image Content Extraction solves this by using vision-capable AI models to understand and transcribe the content of individual images and figures found within PDF documents. This produces a richer, more complete representation of each document page — ensuring that information locked inside charts, graphs, and diagrams becomes searchable and available for RAG-based AI chat.
Who It's For
Knowledge base admins who configure document processing workflows and want higher-quality text extraction from image-heavy documents
Users who rely on AI chat to search and query documents containing charts, graphs, infographics, and diagrams
Organizations in industries (FSI, consulting, research) where critical data is often communicated through visual elements
Can this feature be enabled on non-Azure or self-hosted tenants?
Agentic Image Content Extraction requires Microsoft Document Intelligence (MDI) and access to vision LLMs via the platform API. MDI can be deployed on-prem.
Benefits
Captures Information Locked in Visuals
Charts and graphs: Bar charts, line graphs, pie charts — the AI model reads axes, labels, values, and trends
Diagrams and flowcharts: Extracts structured descriptions of process flows, architecture diagrams, and org charts
Infographics: Interprets complex visual layouts combining text, icons, and data
Tables rendered as images: When tables are embedded as figures rather than native table elements, the vision model can still extract the data
Seamless Integration
Works within the existing MDI pipeline — no need to switch to a different read mode
Extracted figure text is inserted at the correct position in the page markdown
Figure captions from MDI are preserved alongside the extracted content
Falls back gracefully to standard MDI output if figure extraction fails for any image
Example Use Cases
Financial Reports
Revenue charts: Extract quarterly revenue figures, growth percentages, and trend descriptions from bar/line charts
Portfolio allocations: Interpret pie charts showing asset allocation breakdowns
Performance dashboards: Capture KPI values and metrics from dashboard screenshots embedded in reports
Research and Analysis
Scientific figures: Extract data from experimental result charts, correlation plots, and distribution graphs
Market analysis: Interpret market share diagrams, competitive landscape visuals, and trend charts
Process diagrams: Transcribe workflow and process flow diagrams into searchable text
Regulatory and Compliance
Risk matrices: Extract risk ratings and categories from visual risk assessment matrices
Compliance dashboards: Capture compliance status from visual scorecards and dashboards
Organizational charts: Extract hierarchical information from org chart figures
Step-by-Step Guide
1. Enable Image Content Extraction
Image Content Extraction is configured as part of the PDF ingestion settings and can be enabled through two independent paths, depending on how documents are uploaded.
Path A — Knowledge Base uploads (folder / scope ingestion configuration)

This is the primary configuration path for documents uploaded to the Knowledge Base via the knowledge-upload app.
Open the Knowledge Base app and navigate to the target folder
Open the Ingestion Configuration for that folder (or scope)
Ensure the PDF Read Mode is set to DOC_INTELLIGENCE_DEFAULT
Enable the Image Content Extraction toggle
Select a vision-capable language model from the dropdown (e.g. AZURE_GPT_51_2025_1113)
Click Save (optionally apply to sub-scopes)
(Optional) To override the system or user prompt for the vision LLM, paste a JSON object into the Configuration field of the Image Content Extraction section. For ONE_STEP (default):
{ "strategy": "ONE_STEP", "strategyConfig": { "systemPrompt": "...", "userPrompt": "..." } }
The Image Content Extraction section in the UI is only visible when:
The feature flag FEATURE_FLAG_ENABLE_IMAGE_CONTENT_EXTRACTION_UN_17223 is enabled on the knowledge-upload service
The PDF Read Mode is set to DOC_INTELLIGENCE_DEFAULT
The list of available language models in the dropdown is controlled by the IMAGE_CONTENT_EXTRACTION_LANGUAGE_MODELS environment variable on the knowledge-upload service (comma-separated list of model identifiers).
All documents subsequently uploaded to that folder (or child folders, if applied to sub-scopes) will use the configured image content extraction settings.
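As an illustration of the comma-separated format used by IMAGE_CONTENT_EXTRACTION_LANGUAGE_MODELS, the following minimal sketch splits such a value into model identifiers. It is not the service's actual parser: the function name `parse_model_list`, the whitespace trimming, and the example fallback value are assumptions made for this example.

```python
import os


def parse_model_list(raw: str) -> list[str]:
    """Split a comma-separated list, trimming whitespace and dropping empty entries."""
    return [item.strip() for item in raw.split(",") if item.strip()]


# Hypothetical fallback value, used only when the variable is not set locally.
raw_value = os.environ.get(
    "IMAGE_CONTENT_EXTRACTION_LANGUAGE_MODELS",
    "AZURE_GPT_51_2025_1113, AZURE_GPT_4o_2024_1120",
)
models = parse_model_list(raw_value)
print(models)
```

Whether the real service tolerates spaces or empty entries is not stated here, so a value without either is the safer choice.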
Path B — Chat uploads (space / assistant configuration)
This path applies to documents uploaded directly in a chat conversation (e.g. drag-and-drop into the chat window).
Open the Admin app and navigate to the target space
Open Advanced Settings for the assistant
Configure the ingestionConfig within the assistant settings to include imageContentExtraction:
{
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {
      "imageContentExtraction": {
        "enabled": true,
        "languageModel": "AZURE_GPT_51_2025_1113",
        "settings": {
          "strategy": "ONE_STEP",
          "strategyConfig": {
            "systemPrompt": "...",
            "userPrompt": "..."
          }
        }
      }
    }
  }
}
When a user uploads a file in a chat using that assistant, the assistant's settings.ingestionConfig is passed along with the upload and applied during processing. See Configuration Options → Custom Prompts (Advanced) for the full schema.
Important difference: For chat uploads, the ingestion configuration comes from the assistant settings — it is not inherited from any Knowledge Base folder/scope configuration. Each path is independent.
How configuration merging works
When a document is uploaded, the platform merges ingestion configuration from multiple layers (later layers override earlier ones):
| Layer | Chat uploads | Knowledge Base uploads |
|---|---|---|
| 1. Platform defaults | Service-level defaults | Service-level defaults |
| 2. Owner config | Empty | Folder/scope |
| 3. Existing content config | From previous version of the content, if any | From previous version of the content, if any |
| 4. Request input | | Per-upload override, if provided |
The effective ("applied") ingestion configuration is stored on each content item and used by the ingestion worker.
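The layering described above can be sketched as a sequence of merges in which each later layer overrides the earlier ones. This is a minimal sketch, not the platform's implementation: it assumes nested objects are merged key by key (a deep merge), and the function names and example layer values are hypothetical.

```python
from copy import deepcopy
from typing import Any, Optional


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested objects
        else:
            merged[key] = deepcopy(value)  # scalars and lists are replaced
    return merged


def effective_config(*layers: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Merge layers in order; missing or empty layers are skipped."""
    result: dict[str, Any] = {}
    for layer in layers:
        if layer:
            result = deep_merge(result, layer)
    return result


# Hypothetical layer contents for a Knowledge Base upload.
platform_defaults = {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {"imageContentExtraction": {"enabled": False}},
}
owner_config = {
    "pdfConfig": {
        "imageContentExtraction": {
            "enabled": True,
            "languageModel": "AZURE_GPT_51_2025_1113",
        }
    }
}
applied = effective_config(platform_defaults, owner_config, None, {})
print(applied)
```

In this example the folder-level layer switches extraction on while the read mode from the defaults layer is kept.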
Important: The pdfReadMode must be set to DOC_INTELLIGENCE_DEFAULT for image content extraction to work. It does not apply when using CUSTOM_SINGLE_PAGE_API or other read modes.
2. Configure the Language Model
The languageModel field specifies which vision-capable AI model to use for extracting content from figures. This must be a model that supports image/vision inputs.
3. Optional: Advanced Settings
You can pass additional settings via the settings object:
{
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "pdfConfig": {
      "imageContentExtraction": {
        "enabled": true,
        "languageModel": "AZURE_GPT_51_2025_1113",
        "settings": {
          "strategy": "ONE_STEP",
          "strategyConfig": {
            "systemPrompt": "You are a financial-statements image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
            "userPrompt": "Extract every line of the balance sheet, income statement, or cash-flow statement visible in this image. Respond in English."
          },
          "languageModelConfig": {
            "supportsStructuredOutput": true
          }
        }
      }
    }
  }
}
The available strategyConfig fields and their resolution order are documented in Configuration Options → Custom Prompts (Advanced) below.
4. Upload Documents
Upload PDF documents through the standard Unique AI interface — either to the Knowledge Base (for scope-configured extraction) or directly in chat (for assistant-configured extraction). The system will automatically:
Process each page through MDI with figure detection enabled
Detect figures on each page using MDI layout analysis
Crop each figure from the rendered page image
Extract content from each figure using the configured vision model
Compose the final page by merging figure text back into the page markdown
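The final compose step can be pictured with the sketch below, which inserts each figure's caption and extracted text into the page markdown at the figure's position. It is a simplified illustration under stated assumptions: the `FigureResult` type, the use of character offsets to represent position, and the function name `compose_page` are all hypothetical and do not describe the service's internal data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FigureResult:
    offset: int              # hypothetical: character offset of the figure in the page markdown
    caption: Optional[str]   # caption reported by layout analysis, if any
    text: Optional[str]      # text returned by the vision model, None if extraction failed


def compose_page(markdown: str, figures: list[FigureResult]) -> str:
    """Insert caption and extracted text for each figure at its offset."""
    # Work from the last offset to the first so earlier offsets stay valid.
    for fig in sorted(figures, key=lambda f: f.offset, reverse=True):
        if fig.text is None:
            continue  # nothing to insert for a figure that could not be extracted
        parts = [part for part in (fig.caption, fig.text) if part]
        block = "\n" + "\n".join(parts) + "\n"
        markdown = markdown[: fig.offset] + block + markdown[fig.offset :]
    return markdown


page = "Intro.\nOutro."
figures = [FigureResult(offset=7, caption="Figure 1", text="Revenue rose 10%")]
print(compose_page(page, figures))
```

Processing the figures in reverse offset order avoids having to recompute positions after each insertion.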
5. Verify Results
Review the extracted content to verify that figure content has been captured. Look for:
Chart and graph data appearing as text within the document content
Diagram descriptions replacing what would have been blank or [Figure] placeholders
Correct positioning of figure text within the page flow
Configuration Options
Performance Settings
| Setting | Description | Configuration key | Default | Recommended |
|---|---|---|---|---|
| Image DPI value | Resolution used when rendering PDF pages for figure cropping | | 150 | 150–300 (higher = better quality but slower) |
| Image compression quality | JPEG compression quality for cropped figures | | 50 | 50–75 |
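To get a feel for what the DPI value means for a cropped figure, the sketch below converts a figure's size on the page into pixels. It relies on the standard PDF convention of 72 points per inch. The function name and the example figure size are assumptions for illustration, and the service's own cropping logic may differ.

```python
def points_to_pixels(points: float, dpi: int) -> int:
    """PDF user space uses 72 points per inch, so pixels = points / 72 * dpi."""
    return round(points * dpi / 72)


# Hypothetical figure occupying 3 in x 2 in on the page (216 x 144 points).
for dpi in (150, 300):
    width = points_to_pixels(216, dpi)
    height = points_to_pixels(144, dpi)
    print(f"{dpi} DPI -> {width} x {height} px")
# prints:
# 150 DPI -> 450 x 300 px
# 300 DPI -> 900 x 600 px
```

Doubling the DPI quadruples the number of pixels in each crop, which is why higher values improve legibility but increase rendering time and payload size.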
Language Model Configuration
| Setting | Description | Configuration key | Default |
|---|---|---|---|
| | Vision model used for extraction | | |
| | Whether to use native structured output | | |
| Fallback model | Automatic fallback if primary model fails | | |
Custom Prompts (Advanced)
The vision LLM uses built-in default prompts to extract image content. You can override these prompts per scope, per upload, or per assistant — without redeploying the service — by adding a strategyConfig block to the Configuration JSON.
Override prompts only when:
the documents need a domain-specific tone (legal, medical, financial regulator language),
the documents are not in a Latin script and the default English prompt confuses the model,
or you need to extract a different shape of information (e.g. always emit a markdown table of values).
Default behavior (no override) is appropriate for most customers.
Where to set custom prompts
| Where the document is uploaded | Where to put |
|---|---|
| Knowledge Base (folder / scope) | Knowledge Upload app → folder Ingestion Configuration → Image Content Extraction → Configuration JSON textarea |
| Chat (drag-and-drop) | Admin app → space → assistant Advanced Settings → |
| Single upload via API or SDK | |
The same JSON keys work in all three places.
ONE_STEP (default strategy)
| Field | Description |
|---|---|
| | Replaces the hardcoded |
| | Replaces the hardcoded |
Each is resolved independently: request strategyConfig → service env var (IMAGE_CONTENT_EXTRACTION_ONE_STEP_SYSTEM_PROMPT / _USER_PROMPT) → platform default. Empty strings fall through.
{language} is only substituted when the prompt comes from an env var or platform default. Prompts you supply via strategyConfig are sent to the LLM verbatim — write the language directly into the text.
{
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "strategy": "ONE_STEP",
        "strategyConfig": {
          "systemPrompt": "You are a regulatory-filings image transcriber. Respond in English.\n\nOUTPUT FORMAT (JSON):\n{ \"reasoning\": \"string\", \"image_content\": \"string\" }",
          "userPrompt": "Extract every figure, label, and footnote in this image. Respond in English."
        }
      }
    }
  }
}
TWO_STEP (classify, then extract)
Each figure is first classified into a category, then a category-specific extractor prompt is used. Configurable fields, all under strategyConfig:
| Field | Type | Default |
|---|---|---|
| | string | platform default |
| | string | platform default |
| | object (string → string) | 7 entries keyed by |
| | object (string → string) | same 7 keys |
| | string | |
| | array of string | |
| | integer | |
TWO_STEP does not have an env-var override layer (request → platform default only) and does not substitute {language}.
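The two resolution chains described in this section (ONE_STEP: request → env var → platform default; TWO_STEP: request → platform default) can be sketched as follows. This is an illustrative sketch, not the service code: the function names, the `language` parameter, and the placeholder default text are hypothetical, and it assumes that an empty env var falls through in the same way as an empty request value.

```python
from typing import Mapping, Optional

# Hypothetical stand-in; the real platform default prompt is not reproduced here.
PLATFORM_DEFAULT_PROMPT = "Transcribe the image. Respond in {language}."


def resolve_one_step_prompt(
    request_value: Optional[str],
    env: Mapping[str, str],
    env_key: str,
    language: str,
) -> str:
    """ONE_STEP: request strategyConfig -> service env var -> platform default."""
    if request_value:
        # Prompts supplied in strategyConfig are sent verbatim (no substitution).
        return request_value
    template = env.get(env_key) or PLATFORM_DEFAULT_PROMPT
    # str.replace is used instead of str.format so that literal JSON braces
    # in a prompt do not raise an error.
    return template.replace("{language}", language)


def resolve_two_step_prompt(request_value: Optional[str], platform_default: str) -> str:
    """TWO_STEP: request -> platform default, with no env layer and no substitution."""
    return request_value or platform_default


key = "IMAGE_CONTENT_EXTRACTION_ONE_STEP_SYSTEM_PROMPT"
print(resolve_one_step_prompt("", {key: "Describe the figure in {language}."}, key, "German"))
# prints: Describe the figure in German.
```

Because a request-level prompt is passed through unchanged, a `{language}` token left inside it reaches the model as literal text.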
Custom Configuration
The performance settings and the fallback model are configured through the optional settings JSON object inside pdfConfig.imageContentExtraction.
In the UI, this corresponds to the Configuration field shown below the Image Content Extraction language model. The same JSON can also be added directly in a space or assistant ingestionConfig.

Example:
{
  "pdfConfig": {
    "imageContentExtraction": {
      "enabled": true,
      "languageModel": "AZURE_GPT_51_2025_1113",
      "settings": {
        "imageProcessingConfig": {
          "dpiValue": 150,
          "compressionQuality": 50
        },
        "languageModelFallbackConfig": {
          "name": "AZURE_GPT_4o_2024_1120"
        }
      }
    }
  }
}
How It Works Under the Hood
When Image Content Extraction is enabled, PDF ingestion continues to use the standard Microsoft Document Intelligence pipeline, with an additional enrichment step for visual content.
Figure detection: Microsoft Document Intelligence detects figures, charts, diagrams, and other visual elements in the PDF.
AI-based extraction: Detected figures are analyzed with the configured vision-capable language model. Depending on the environment configuration, figures may be processed individually or in page batches.
Content enrichment: The extracted figure text is inserted back into the processed page content so it can be searched and used in AI chat answers.
If extraction fails for an individual figure, the rest of the page can still be processed. If figure extraction cannot run for a page, the system falls back to the standard Microsoft Document Intelligence output for that page.
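The two failure levels described above (a single figure failing versus the whole figure pipeline failing for a page) can be sketched like this. It is a minimal illustration with hypothetical function names and callables. Only the log message is taken from this page, and the real services may structure their error handling differently.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("figure_extraction_sketch")


def extract_each(
    figures: list[bytes], extract: Callable[[bytes], str]
) -> list[Optional[str]]:
    """Extract every figure; a failure on one figure does not stop the others."""
    texts: list[Optional[str]] = []
    for image in figures:
        try:
            texts.append(extract(image))
        except Exception:
            texts.append(None)  # keep going, this figure is simply left out
    return texts


def enrich_page(mdi_markdown: str, run_figure_pipeline: Callable[[], str]) -> str:
    """Return the enriched page, or the plain MDI output if the pipeline cannot run."""
    try:
        return run_figure_pipeline()
    except Exception:
        logger.warning("Falling back to MDI without figure extraction")
        return mdi_markdown
```

Here `extract` stands for the vision model call and `run_figure_pipeline` for the detect, crop, extract, and compose sequence for one page.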
Limitations
Current Limitations
This feature is currently in BETA. Extraction quality may vary depending on figure complexity, image resolution, and AI model capabilities.
Handwritten content in figures may not be reliably extracted
Very small figures (< 50px in any dimension) may produce low-quality extractions
Overlapping or nested figures are processed independently; the composer does not currently merge related figures
Performance Considerations
Enabling image content extraction increases per-page processing time because each figure requires an additional AI model call
Processing time scales with the number of figures per page (figures are processed in parallel, up to 5 concurrently)
| Scenario | Typical Added Latency (per page) | Notes |
|---|---|---|
| Page with 1–2 figures | +3–7 sec | Minimal overhead |
| Page with 5+ figures | +7–20 sec | Parallel extraction helps |
| Two-step with many decorative images | Varies | Classification overhead offset by skipped extractions |
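Bounded parallelism of the kind mentioned above (at most 5 figures in flight at once) is commonly implemented with a semaphore. The sketch below shows that pattern with asyncio. It is not the worker's actual code, and the names `extract_all` and `MAX_CONCURRENT_FIGURES` are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

MAX_CONCURRENT_FIGURES = 5  # concurrency limit stated in the text above


async def extract_all(
    figures: Sequence[bytes],
    extract: Callable[[bytes], Awaitable[str]],
) -> list[str]:
    """Run `extract` for every figure, with at most 5 calls in flight at a time."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_FIGURES)

    async def bounded(figure: bytes) -> str:
        async with semaphore:  # waits here while 5 extractions are already running
            return await extract(figure)

    # gather returns results in the same order as the input figures
    return list(await asyncio.gather(*(bounded(f) for f in figures)))
```

With this pattern a page containing 12 figures is processed in roughly three waves, which is consistent with latency growing more slowly than the figure count.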
Resource Usage
Additional AI model API calls are incurred for each figure extracted
In the internal evaluation, page-batch mode used ~2,035 tokens per call on average, while per-figure mode used ~1,034 tokens per call.
Page-batch mode was faster but used about 1,001 more tokens per call, mostly prompt/input tokens, because the full page image is attached as context.
Token costs depend on the selected model and model pricing.
Network bandwidth increases due to base64-encoded figure images being sent to the Agentic Ingestion service
Redis is used for async job queue management in the Agentic Ingestion service
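For rough capacity planning, the figures above can be turned into simple estimates. The sketch below multiplies the reported average tokens per call by a number of calls, and computes the size of a base64-encoded payload (base64 emits 4 characters for every 3 input bytes). The function names are hypothetical, and the averages come from the internal evaluation mentioned above, so real workloads may differ.

```python
import math

PER_FIGURE_TOKENS_PER_CALL = 1034  # average reported for per-figure mode
PAGE_BATCH_TOKENS_PER_CALL = 2035  # average reported for page-batch mode


def estimated_tokens(calls: int, tokens_per_call: int) -> int:
    """Total tokens for a given number of model calls at an average cost per call."""
    return calls * tokens_per_call


def base64_length(raw_bytes: int) -> int:
    """Length of the base64 encoding of `raw_bytes` bytes, including padding."""
    return 4 * math.ceil(raw_bytes / 3)


print(estimated_tokens(1000, PER_FIGURE_TOKENS_PER_CALL))  # prints: 1034000
print(base64_length(300_000))                              # prints: 400000
```

The estimate deliberately takes the number of calls as input, because how many calls page-batch mode makes per page depends on the environment configuration.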
Troubleshooting
Common Issues
Figures not being extracted
Verify that imageContentExtraction.enabled is true in the ingestion config
Ensure pdfReadMode is set to DOC_INTELLIGENCE_DEFAULT (not CUSTOM_SINGLE_PAGE_API)
Check that a valid languageModel is specified
Verify that the Agentic Ingestion service is running and reachable from node-ingestion-worker
For Knowledge Base uploads: check the folder's ingestion configuration in the knowledge-upload app
For chat uploads: check the assistant's settings.ingestionConfig in the admin app
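The configuration-level checks in this list can be scripted against an ingestion config that has been exported as JSON. The helper below is a hypothetical sketch and not a platform tool. It covers only the three config checks (read mode, enabled flag, language model) and cannot verify service reachability.

```python
from typing import Any


def find_config_problems(ingestion_config: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in an ingestionConfig dict."""
    problems: list[str] = []
    if ingestion_config.get("pdfReadMode") != "DOC_INTELLIGENCE_DEFAULT":
        problems.append("pdfReadMode must be DOC_INTELLIGENCE_DEFAULT")
    pdf_config = ingestion_config.get("pdfConfig") or {}
    extraction = pdf_config.get("imageContentExtraction") or {}
    if extraction.get("enabled") is not True:
        problems.append("imageContentExtraction.enabled must be true")
    if not extraction.get("languageModel"):
        problems.append("languageModel must be specified")
    return problems


config = {
    "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
    "pdfConfig": {"imageContentExtraction": {"enabled": True}},
}
for problem in find_config_problems(config):
    print(problem)
# prints:
# pdfReadMode must be DOC_INTELLIGENCE_DEFAULT
# languageModel must be specified
```

An empty result means the three config conditions hold; the remaining items in the list still need to be checked on the running services.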
Image Content Extraction toggle not visible in the Knowledge Upload UI
Ensure FEATURE_FLAG_ENABLE_IMAGE_CONTENT_EXTRACTION_UN_17223 is set to true on the knowledge-upload service
Ensure the PDF Read Mode is set to DOC_INTELLIGENCE_DEFAULT; the toggle is hidden for other read modes
Poor extraction quality
Increase the image DPI value for higher-resolution figure crops
Try a more capable language model (e.g., AZURE_GPT_51_2025_1113)
Slow processing
Reduce image DPI if extraction quality is acceptable at lower resolutions
Check Agentic Ingestion service health and resource allocation
Monitor Redis queue length for job backlogs
Fallback to plain MDI
If the logs show "Falling back to MDI without figure extraction", this means the figure extraction pipeline encountered an error. Check:
Agentic Ingestion service logs for errors
Network connectivity between node-ingestion-worker and agentic-ingestion
AI model endpoint availability
Getting Help
Check service logs in both node-ingestion-worker and agentic-ingestion pods
Review the ingestion configuration for the affected space or folder
Contact the infrastructure team with specific error messages
Open an issue in the repository with detailed logs
Best Practices
When to Enable
Enable for document sets that are rich in charts, graphs, and diagrams where the visual content carries important information
Enable when users report that chart data is missing or figures show as blank in AI chat responses
Consider the two-step strategy for document collections with a mix of informational and decorative images
For Knowledge Base content: enable at the folder level to apply consistently to all uploads within that scope
For chat uploads: configure on the assistant so all users uploading via that assistant benefit automatically
When Not to Enable
Documents that are primarily text and tables — MDI handles these well without figure extraction
When processing speed is critical and the added latency per figure is not acceptable
For non-PDF file types: image content extraction only applies to PDF processing with MDI. PowerPoint and Word documents can be converted to PDF before ingestion.
Optimization Tips
Use 150 DPI as a starting point; increase to 200–300 only if extraction quality is insufficient
Monitor processing times and adjust the language model if latency is too high