Agentic PDF Document Extraction for Infra Admins (Deprecated)
The Agentic PDF Document Extraction service is deprecated. We recommend using the default Doc Intelligence while enabling the Image Content Extraction service. See Agentic Image Content Extraction for Infra Admins.
Target Audience
Infrastructure Admins who will deploy and run the Agentic Ingestion service
Description
Agentic PDF Document Extraction is an advanced document processing service that leverages AI-powered extraction techniques to convert PDF documents into structured, searchable content. The service uses multiple extraction methods including MDI (Microsoft Document Intelligence), Vision-based extraction, and hybrid approaches to provide superior text extraction accuracy compared to traditional OCR methods.
Our platform previously relied on basic OCR and manual document processing workflows. While functional, this approach had several limitations for enterprise requirements, particularly in handling complex financial documents, tables, and multi-format content.
Trigger
Activated when pdfReadMode = CUSTOM_SINGLE_PAGE_API with the Agentic Ingestion API identifier configured in CUSTOM_API_DEFINITIONS.
Processing Flow
View diagram “pdf-extraction-flow” in Confluence
Step-by-step:
1. node-ingestion-worker splits the PDF into individual pages
2. For each page, the Custom API Definition Parser sends the full page as base64-encoded PDF to POST /agentic-ingestion/extractions
3. agentic-ingestion enqueues a job in Redis (taskiq:pdf-content-extraction) and returns a job_id
4. The worker picks up the job and selects the extraction method based on the extractionMethod parameter:
   MDI: Structured extraction via Azure Document Intelligence only
   VISION: Image-based extraction via Azure OpenAI vision model only
   MDI_VISION: Hybrid approach: MDI for structured content + Vision for image/chart content
5. The extraction runs against external services (Azure MDI and/or Azure OpenAI)
6. Optionally, the Page Content Optimizer post-processes the output (evaluator/generator loop to correct layout errors and improve readability)
7. The result (extracted markdown) is stored in Redis
8. node-ingestion-worker polls GET /agentic-ingestion/extractions/{job_id} until the job completes, then receives the extracted markdown
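The polling step at the end of this flow can be illustrated with a minimal Python sketch. It is not the node-ingestion-worker implementation. The response fields (status, markdown, error) and the status values are assumptions made for illustration, since the response schema is not described on this page. The fetch_job callable is a hypothetical stand-in for the GET /agentic-ingestion/extractions/{job_id} request.

```python
import time
from typing import Any, Callable, Mapping

# Assumed status values; the real response schema is not documented on this page.
DONE_STATES = {"completed"}
FAILED_STATES = {"failed"}


def poll_extraction(
    fetch_job: Callable[[str], Mapping[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll a job until it finishes and return the extracted markdown.

    fetch_job is a hypothetical stand-in for
    GET /agentic-ingestion/extractions/{job_id}.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status in DONE_STATES:
            # Assumed field name for the extracted markdown.
            return job.get("markdown", "")
        if status in FAILED_STATES:
            raise RuntimeError(f"extraction job {job_id} failed: {job.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"extraction job {job_id} did not finish in {timeout_s}s")
        # Wait before asking again so the service is not flooded with requests.
        sleep(interval_s)
```

The default timeout of 300 seconds mirrors the upper end of the typical processing time quoted on this page. Injecting sleep and clock keeps the loop easy to exercise without real waiting.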
Code Path (node-ingestion-worker)
PDF Ingestor Service
└─ pdfReadMode === CUSTOM_SINGLE_PAGE_API
   └─ CustomApiDefinitionParser.runCustomApi()
      ├─ POST {definition.url}/extractions (full page PDF base64)
      └─ Poll GET {definition.url}/extractions/{job_id}
Code Path (agentic-ingestion)
POST /agentic-ingestion/extractions
└─ Enqueue in Redis (taskiq:pdf-content-extraction)
   └─ Worker: process_extraction_job()
      ├─ MDI path → Azure Document Intelligence
      ├─ VISION path → render page → Vision LLM
      ├─ MDI_VISION path → both in parallel
      └─ Optional: PageContentOptimizer (evaluator/generator loops)
Environment Variables and Secrets
| Name | Description | Example Value | Default | Required |
| | Base URL for the Unique AI API | | | Yes |
| | | | | No, must be set to |
| | MDI service location identifier | | | Yes |
| | JSON array of MDI endpoint definitions | | | Yes |
| | MDI service version | | | No |
| | Redis host for job queue | | | Yes |
| | Redis port | | | No |
| | Redis password | | | No |
| | Redis database number | | | No |
| | Enable TLS for Redis | | | No |
| | Maximum concurrent job workers | | | No |
| | Number of Gunicorn worker processes | | | No |
| | Number of threads per worker | | | No |
| | Save extraction results locally for debugging | | | |
Connectivity Remarks
The service requires network access to Redis for job queue management
AI model endpoints must be accessible from the Kubernetes cluster
MDI service endpoints must be reachable for document processing
The service exposes HTTP endpoints on port 80 for health checks and API access
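To verify the connectivity requirements above from inside the cluster, a plain TCP reachability check is often enough as a first step. The following is a minimal sketch using only the Python standard library. It confirms that a TCP connection can be opened. It does not check authentication, TLS, or that the remote service is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

For example, can_connect("redis.example.internal", 6379) would test a Redis endpoint. The host name is a placeholder and 6379 is the standard Redis port; use the host and port configured for your deployment.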
Operating & Troubleshooting
Authentication Methods
The service uses:
API Keys: For Unique AI API access
Service Principal: For Azure OpenAI access
Redis Authentication: Username/password or TLS certificates
Troubleshooting
How can I verify that the service is connected to Redis?
You should find "Successfully connected to Redis" in the logs when the service starts. Otherwise, connection errors will be displayed.
How can service outages and processing failures be monitored?
Monitor pod health with Kubernetes probes
Check Redis queue length and job status
Review application logs for error patterns
Set up alerts for failed job processing
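As a starting point for reviewing logs for error patterns, the sketch below counts log lines that contain "ERROR" and lines that contain "Processing job", the same two strings used by the grep commands in the Log Analysis section. The function name and the counter keys are illustrative only, and the exact log format of the service is not assumed beyond those two substrings.

```python
from collections import Counter
from typing import Iterable


def summarize_logs(lines: Iterable[str]) -> Counter:
    """Count error lines and job-processing lines in an iterable of log lines."""
    counts: Counter = Counter()
    for line in lines:
        if "ERROR" in line:
            counts["error_lines"] += 1
        if "Processing job" in line:
            counts["processing_job_lines"] += 1
    return counts
```

The function accepts any iterable of strings, so it can be fed sys.stdin when the output of kubectl logs is piped into a small script. A rising ratio of error lines to job-processing lines is one simple signal that could feed an alert.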
How can we resolve failures during document processing?
Check MDI service availability
Verify Base URL endpoint access
Review job logs for specific error messages
Restart failed jobs using the API
When should the memory / CPU be increased?
Increase memory or CPU when usage reaches 80% of the allocated resources, or when job processing times exceed acceptable thresholds.
How long does document processing take?
Processing time depends on:
Document complexity and size
Extraction method used (MDI, Vision, MDI_VISION)
Typically 15 seconds to 5 minutes per document
Will there be any downtime for updating / upgrading versions?
No downtime should be expected as Kubernetes will perform rolling updates one replica at a time.
Monitoring
Key Metrics to Monitor:
Job queue length
Processing time per document
Success/failure rates
Resource utilization (CPU, Memory)
MDI model API response times
Redis connection health
Log Analysis:
# View recent logs
kubectl logs -n chat deployment/agentic-ingestion --tail=100
# Filter for errors
kubectl logs -n chat deployment/agentic-ingestion | grep ERROR
# Monitor job processing
kubectl logs -n chat deployment/agentic-ingestion | grep "Processing job"
Architecture Overview
System Context
The diagram illustrates the system context for Agentic Ingestion, highlighting the high-level interactions between document uploads and the processing pipeline.

Agentic Document Ingestion Overview
Document Processing Flow:
Document Upload: User uploads PDF via Unique AI interface
Job Creation: Agentic Ingestion creates processing job in Redis queue
Document Processing: Service extracts text using AI-powered methods
Storage: Processed content is stored in the knowledge base
Processing Methods:
MDI: Microsoft Document Intelligence for structured document extraction
Vision: AI vision models for image-based content extraction
MDI_VISION: Hybrid approach combining both methods
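The choice between the three processing methods can be sketched as a small dispatch function. This is an illustration, not the service's process_extraction_job() code. The mdi and vision callables are hypothetical stand-ins for the Azure Document Intelligence and Vision model calls, and joining the two results with a blank line is a placeholder, since this page does not describe how the hybrid output is merged.

```python
import asyncio
from typing import Awaitable, Callable

# An extractor takes the raw page bytes and returns markdown.
Extractor = Callable[[bytes], Awaitable[str]]


async def extract_page(
    page_pdf: bytes,
    method: str,
    mdi: Extractor,
    vision: Extractor,
) -> str:
    """Run the extractor(s) selected by the extraction method."""
    if method == "MDI":
        return await mdi(page_pdf)
    if method == "VISION":
        return await vision(page_pdf)
    if method == "MDI_VISION":
        # Run both extractors concurrently; gather preserves argument order.
        structured, visual = await asyncio.gather(mdi(page_pdf), vision(page_pdf))
        # Placeholder merge: the real merge strategy is not documented here.
        return "\n\n".join([structured, visual])
    raise ValueError(f"unknown extraction method: {method}")
```

Running both extractors concurrently for MDI_VISION means the hybrid path takes roughly as long as the slower of the two calls rather than their sum.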