Agentic PDF Document Extraction for Infra Admins (Deprecated)

The Agentic PDF Document Extraction service is deprecated. We recommend using the default Doc Intelligence extraction with the Image Content Extraction service enabled. See Agentic Image Content Extraction for Infra Admins.

Target Audience

Infrastructure Admins who will deploy and run the Agentic Ingestion service

Description

Agentic PDF Document Extraction is an advanced document processing service that leverages AI-powered extraction techniques to convert PDF documents into structured, searchable content. The service uses multiple extraction methods including MDI (Microsoft Document Intelligence), Vision-based extraction, and hybrid approaches to provide superior text extraction accuracy compared to traditional OCR methods.

Our platform previously relied on basic OCR and manual document processing workflows. While functional, this approach had several limitations for enterprise requirements, particularly in handling complex financial documents, tables, and multi-format content.

Trigger

The service is activated when pdfReadMode is set to CUSTOM_SINGLE_PAGE_API and the Agentic Ingestion API identifier is configured in CUSTOM_API_DEFINITIONS.

Processing Flow

View diagram “pdf-extraction-flow” in Confluence

Step-by-step:

1. node-ingestion-worker splits the PDF into individual pages
2. For each page, the Custom API Definition Parser sends the full page as base64-encoded PDF to POST /agentic-ingestion/extractions
3. agentic-ingestion enqueues a job in Redis (taskiq:pdf-content-extraction) and returns a job_id
4. The worker picks up the job and selects the extraction method based on the extractionMethod parameter:
   - MDI: Structured extraction via Azure Document Intelligence only
   - VISION: Image-based extraction via Azure OpenAI vision model only
   - MDI_VISION: Hybrid approach, MDI for structured content + Vision for image/chart content
5. The extraction runs against external services (Azure MDI and/or Azure OpenAI)
6. Optionally, the Page Content Optimizer post-processes the output (evaluator/generator loop to correct layout errors and improve readability)
7. The result (extracted markdown) is stored in Redis
8. node-ingestion-worker polls GET /agentic-ingestion/extractions/{job_id} until the job completes, then receives the extracted markdown
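
The submit-and-poll exchange in the steps above can be sketched from the caller's side as follows. This is a minimal illustration, not the worker's actual code: the `post` and `get` callables stand in for an HTTP client, and the request field `file`, the response fields `status` and `markdown`, and the status values `completed` and `failed` are assumptions. Only the two endpoint paths, `job_id`, and `extractionMethod` come from the description above.

```python
import base64
import time
from typing import Any, Callable, Dict

# Hypothetical transport signatures: post(path, json_body) -> dict, get(path) -> dict.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]
Get = Callable[[str], Dict[str, Any]]


def extract_page(
    page_pdf: bytes,
    post: Post,
    get: Get,
    method: str = "MDI_VISION",
    poll_interval: float = 1.0,
    max_attempts: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit one PDF page and poll until the extracted markdown is available."""
    # Assumed request shape: the field name "file" is illustrative only.
    body = {
        "file": base64.b64encode(page_pdf).decode("ascii"),
        "extractionMethod": method,
    }
    job_id = post("/agentic-ingestion/extractions", body)["job_id"]

    for _ in range(max_attempts):
        job = get(f"/agentic-ingestion/extractions/{job_id}")
        status = job.get("status")  # assumed response field and values
        if status == "completed":
            return job["markdown"]
        if status == "failed":
            raise RuntimeError(f"extraction job {job_id} failed")
        sleep(poll_interval)
    raise TimeoutError(f"extraction job {job_id} did not complete in time")
```

Bounding the number of polls keeps a stuck job from blocking the ingestion of the remaining pages indefinitely.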

Code Path (node-ingestion-worker)

PDF Ingestor Service
  └─ pdfReadMode === CUSTOM_SINGLE_PAGE_API
      └─ CustomApiDefinitionParser.runCustomApi()
          ├─ POST {definition.url}/extractions  (full page PDF base64)
          └─ Poll GET {definition.url}/extractions/{job_id}

Code Path (agentic-ingestion)

POST /agentic-ingestion/extractions
  └─ Enqueue in Redis (taskiq:pdf-content-extraction)
      └─ Worker: process_extraction_job()
          ├─ MDI path → Azure Document Intelligence
          ├─ VISION path → render page → Vision LLM
          ├─ MDI_VISION path → both in parallel
          └─ Optional: PageContentOptimizer (evaluator/generator loops)
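
The method selection in this code path could look roughly like the sketch below. It is not the service's implementation: `run_extraction`, the `mdi` and `vision` extractor callables, and the way the two hybrid results are merged (plain concatenation) are assumptions made for illustration. Only the three method names and the "both in parallel" behaviour of MDI_VISION are taken from the tree above.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable


class ExtractionMethod(str, Enum):
    MDI = "MDI"
    VISION = "VISION"
    MDI_VISION = "MDI_VISION"


# Hypothetical extractor signature: page bytes in, markdown out.
Extractor = Callable[[bytes], Awaitable[str]]


async def run_extraction(
    page_pdf: bytes,
    method: ExtractionMethod,
    mdi: Extractor,
    vision: Extractor,
) -> str:
    """Dispatch one page to the extractor(s) selected by the method."""
    if method is ExtractionMethod.MDI:
        return await mdi(page_pdf)
    if method is ExtractionMethod.VISION:
        return await vision(page_pdf)
    # MDI_VISION: run both extractors concurrently, then merge.
    # Concatenation is a placeholder; the real merge strategy is not documented here.
    structured, visual = await asyncio.gather(mdi(page_pdf), vision(page_pdf))
    return f"{structured}\n\n{visual}"
```

Passing the extractors in as arguments keeps the dispatch logic testable without reaching Azure Document Intelligence or Azure OpenAI.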

Environment Variables and Secrets

Name	Description	Example Value	Default	Required
`API_BASE`	Base URL for the Unique AI API	`http://node-chat.finance-gpt.svc.cluster.local:8092/public`	`""`	Yes
`FEATURE_FLAG_ENABLE_PDF_CONTENT_EXTRACTION`	Enables the PDF content extraction feature	`true`	`false`	No (set to `true` to enable the feature)
`MDI_LOCATION`	MDI service location identifier	`"switzerland-north"`	`""`	Yes
`MDI_ENDPOINT_DEFINITIONS`	JSON array of MDI endpoint definitions	`[{"location": "switzerland-north", "endpoint": "https://mdi.example.com"}]`	`[]`	Yes
`MDI_VERSION`	MDI service version	`"2024-11-30"`	`"2024-11-30"`	No
`REDIS_HOST`	Redis host for job queue	`"redis-service.chat.svc"`	`""`	Yes
`REDIS_PORT`	Redis port	`6379`	`10101`	No
`REDIS_PASSWORD`	Redis password	`"password"`	`null`	No
`REDIS_DB`	Redis database number	`4`	`4`	No
`REDIS_USE_TLS`	Enable TLS for Redis	`"true"`	`"false"`	No
`MAX_WORKERS`	Maximum concurrent job workers	`4`	`4`	No
`GUNICORN_WORKERS`	Number of Gunicorn worker processes	`2`	`2`	No
`GUNICORN_THREADS`	Number of threads per worker	`4`	`4`	No
`SAVE_RESULTS_LOCALLY`	Save extraction results locally for debugging	`"false"`	`"false"`	No
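
To illustrate how the required flags and defaults in this table interact, here is a small validation sketch that could be run against a deployment's environment before rollout. `Settings` and `load_settings` are hypothetical names, not part of the service. The check that `MDI_LOCATION` matches one entry in `MDI_ENDPOINT_DEFINITIONS` is an assumption inferred from the example values.

```python
import json
from dataclasses import dataclass
from typing import List, Mapping, Optional

# Variables marked "Yes" in the Required column of the table.
REQUIRED = ("API_BASE", "MDI_LOCATION", "MDI_ENDPOINT_DEFINITIONS", "REDIS_HOST")


@dataclass(frozen=True)
class Settings:
    api_base: str
    mdi_location: str
    mdi_endpoints: List[dict]
    mdi_version: str
    redis_host: str
    redis_port: int
    redis_password: Optional[str]
    redis_db: int
    redis_use_tls: bool
    feature_enabled: bool


def _as_bool(value: str) -> bool:
    return value.strip().lower() == "true"


def load_settings(env: Mapping[str, str]) -> Settings:
    """Validate required variables and apply the documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))

    endpoints = json.loads(env["MDI_ENDPOINT_DEFINITIONS"])
    location = env["MDI_LOCATION"]
    # Assumption: MDI_LOCATION must name one of the defined endpoints.
    if not any(entry.get("location") == location for entry in endpoints):
        raise ValueError(f"no MDI endpoint defined for location {location!r}")

    return Settings(
        api_base=env["API_BASE"],
        mdi_location=location,
        mdi_endpoints=endpoints,
        mdi_version=env.get("MDI_VERSION", "2024-11-30"),
        redis_host=env["REDIS_HOST"],
        redis_port=int(env.get("REDIS_PORT", "10101")),
        redis_password=env.get("REDIS_PASSWORD") or None,
        redis_db=int(env.get("REDIS_DB", "4")),
        redis_use_tls=_as_bool(env.get("REDIS_USE_TLS", "false")),
        feature_enabled=_as_bool(
            env.get("FEATURE_FLAG_ENABLE_PDF_CONTENT_EXTRACTION", "false")
        ),
    )
```

Calling `load_settings(os.environ)` inside the pod (after `import os`) would surface a missing or inconsistent variable as a single error message.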

Connectivity Remarks

The service requires network access to Redis for job queue management
AI model endpoints must be accessible from the Kubernetes cluster
MDI service endpoints must be reachable for document processing
The service exposes HTTP endpoints on port 80 for health checks and API access
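
The connectivity requirements above can be spot-checked from inside the cluster with a plain TCP probe. The sketch below uses only the Python standard library. The function name `check_reachable` and the target labels are illustrative, and a successful TCP connect only shows that the port is reachable, not that authentication or TLS will succeed.

```python
import socket
from typing import Dict, Tuple


def check_reachable(
    targets: Dict[str, Tuple[str, int]], timeout: float = 3.0
) -> Dict[str, bool]:
    """Try a TCP connection to each (host, port) and report which ones succeed."""
    results: Dict[str, bool] = {}
    for name, (host, port) in targets.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS resolution failures.
            results[name] = False
    return results
```

For example, `check_reachable({"redis": ("redis-service.chat.svc", 6379), "mdi": ("mdi.example.com", 443)})` uses the example hosts from the table above; substitute the values from your own deployment.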

Operating & Troubleshooting

Authentication Methods

The service uses:

API Keys: For Unique AI API access
Service Principal: For Azure OpenAI access
Redis Authentication: Username/password or TLS certificates

Troubleshooting

How can I verify that the service is connected to Redis?
You should find "Successfully connected to Redis" in the logs when the service starts. Otherwise, connection errors will be displayed.

How can service outages and processing failures be monitored?

Monitor pod health with Kubernetes probes
Check Redis queue length and job status
Review application logs for error patterns
Set up alerts for failed job processing

How can we resolve failures during document processing?

Check MDI service availability
Verify access to the Base URL endpoint (API_BASE)
Review job logs for specific error messages
Restart failed jobs using the API

When should the memory / CPU be increased?
When 80% of the allocated resources have been reached or when job processing times exceed acceptable thresholds.
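
As a worked example of the 80% rule, the check reduces to a single ratio. The helper name `needs_more_resources` is hypothetical, and the inputs are assumed to come from whatever metrics source you already use (for instance, values read off `kubectl top pod`), expressed in the same unit.

```python
def needs_more_resources(used: float, allocated: float, threshold: float = 0.80) -> bool:
    """Return True when usage has reached the given share of the allocation."""
    if allocated <= 0:
        raise ValueError("allocated must be positive")
    return used / allocated >= threshold
```

With a 2048 MiB memory limit, a pod using 1700 MiB is at about 83% and crosses the threshold, while one using 1500 MiB (about 73%) does not.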

How long does document processing take?
Typically 15 seconds to 5 minutes per document. Processing time depends on:

Document complexity and size
Extraction method used (MDI, VISION, MDI_VISION)

Will there be any downtime for updating / upgrading versions?
No downtime should be expected as Kubernetes will perform rolling updates one replica at a time.

Monitoring

Key Metrics to Monitor:

Job queue length
Processing time per document
Success/failure rates
Resource utilization (CPU, Memory)
MDI model API response times
Redis connection health
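
Several of these metrics can be derived from per-job records once they are collected somewhere. The sketch below shows one way to compute the failure rate and processing times. The `JobRecord` shape is hypothetical, since the service's actual job schema and metrics export are not described here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable


@dataclass
class JobRecord:
    """Hypothetical per-job record; field names are illustrative."""
    job_id: str
    succeeded: bool
    duration_seconds: float


def summarize(jobs: Iterable[JobRecord]) -> Dict[str, float]:
    """Aggregate job records into failure rate and duration statistics."""
    records = list(jobs)
    if not records:
        return {"total": 0, "failure_rate": 0.0, "mean_seconds": 0.0, "max_seconds": 0.0}
    failures = sum(1 for job in records if not job.succeeded)
    durations = [job.duration_seconds for job in records]
    return {
        "total": len(records),
        "failure_rate": failures / len(records),
        "mean_seconds": mean(durations),
        "max_seconds": max(durations),
    }
```

An alert could then compare `failure_rate` or `max_seconds` against thresholds chosen for your environment, such as the upper end of the typical processing time.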

Log Analysis:

```bash
# View recent logs
kubectl logs -n chat deployment/agentic-ingestion --tail=100

# Filter for errors
kubectl logs -n chat deployment/agentic-ingestion | grep ERROR

# Monitor job processing
kubectl logs -n chat deployment/agentic-ingestion | grep "Processing job"
```

Architecture Overview

System Context

The diagram illustrates the system context for Agentic Ingestion, highlighting the high-level interactions between document uploads and the processing pipeline.

Document Processing Flow:

Document Upload: User uploads PDF via Unique AI interface
Job Creation: Agentic Ingestion creates processing job in Redis queue
Document Processing: Service extracts text using AI-powered methods
Storage: Processed content is stored in the knowledge base

Processing Methods:

MDI: Microsoft Document Intelligence for structured document extraction
VISION: AI vision models for image-based content extraction
MDI_VISION: Hybrid approach combining both methods