Agentic PDF Document Extraction for Infra Admins (Deprecated)

Note: The Agentic PDF Document Extraction service is deprecated. We recommend using default Doc Intelligence with the Image Content Extraction service enabled instead. See Agentic Image Content Extraction for Infra Admins.

Target Audience

Infrastructure Admins who will deploy and run the Agentic Ingestion service

Description

Agentic PDF Document Extraction is an advanced document processing service that leverages AI-powered extraction techniques to convert PDF documents into structured, searchable content. The service uses multiple extraction methods including MDI (Microsoft Document Intelligence), Vision-based extraction, and hybrid approaches to provide superior text extraction accuracy compared to traditional OCR methods.

Our platform previously relied on basic OCR and manual document processing workflows. While functional, this approach had several limitations for enterprise requirements, particularly in handling complex financial documents, tables, and multi-format content.

Trigger

The service is activated when pdfReadMode is set to CUSTOM_SINGLE_PAGE_API and the Agentic Ingestion API identifier is configured in CUSTOM_API_DEFINITIONS.

Processing Flow

View diagram “pdf-extraction-flow” in Confluence

Step-by-step:

  1. node-ingestion-worker splits the PDF into individual pages

  2. For each page, the Custom API Definition Parser sends the full page as a base64-encoded PDF to POST /agentic-ingestion/extractions

  3. agentic-ingestion enqueues a job in Redis (taskiq:pdf-content-extraction) and returns a job_id

  4. The worker picks up the job and selects the extraction method based on the extractionMethod parameter:

    • MDI — Structured extraction via Azure Document Intelligence only

    • VISION — Image-based extraction via Azure OpenAI vision model only

    • MDI_VISION — Hybrid approach: MDI for structured content + Vision for image/chart content

  5. The extraction runs against external services (Azure MDI and/or Azure OpenAI)

  6. Optionally, the Page Content Optimizer post-processes the output (evaluator/generator loop to correct layout errors and improve readability)

  7. The result (extracted markdown) is stored in Redis

  8. node-ingestion-worker polls GET /agentic-ingestion/extractions/{job_id} until the job completes, then receives the extracted markdown
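
The polling in step 8 can be sketched as a loop with a deadline. This is a minimal illustration, not the actual node-ingestion-worker code: the `status`, `markdown`, and `error` response fields, the status values, and the interval and timeout defaults are all assumptions, and `fetch_job` is a hypothetical stand-in for the GET request.

```python
import time
from typing import Callable, Dict


def poll_extraction(
    fetch_job: Callable[[str], Dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it finishes and return the extracted markdown.

    fetch_job stands in for GET /agentic-ingestion/extractions/{job_id}.
    The "status", "markdown" and "error" keys are hypothetical field names.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job["markdown"]
        if status == "failed":
            raise RuntimeError(f"extraction job {job_id} failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"extraction job {job_id} did not finish within {timeout_s}s")
        # Wait before asking again so the service is not flooded with requests.
        sleep(interval_s)
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any HTTP library, so the same logic can be exercised against a fake in tests.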

Code Path (node-ingestion-worker)

PDF Ingestor Service
  └─ pdfReadMode === CUSTOM_SINGLE_PAGE_API
      └─ CustomApiDefinitionParser.runCustomApi()
          ├─ POST {definition.url}/extractions  (full page PDF base64)
          └─ Poll GET {definition.url}/extractions/{job_id}
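
As an illustration of the POST step in this code path, the sketch below builds a request body for one page. The real request schema is not documented here, so the `content` field name and the placement of `extractionMethod` in the body are assumptions; only the base64 encoding and the three method names come from this page.

```python
import base64
import json

# Extraction methods named in the processing flow.
ALLOWED_METHODS = {"MDI", "VISION", "MDI_VISION"}


def build_extraction_request(page_pdf: bytes, extraction_method: str = "MDI_VISION") -> str:
    """Return a JSON body carrying one PDF page as base64 text."""
    if extraction_method not in ALLOWED_METHODS:
        raise ValueError(f"unknown extractionMethod: {extraction_method}")
    body = {
        # "content" is a hypothetical field name for the encoded page.
        "content": base64.b64encode(page_pdf).decode("ascii"),
        # Assumed to travel in the body; the source only names the parameter.
        "extractionMethod": extraction_method,
    }
    return json.dumps(body)
```

Base64 increases the payload size by roughly one third, which is worth keeping in mind when sizing request limits for large pages.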

Code Path (agentic-ingestion)

POST /agentic-ingestion/extractions
  └─ Enqueue in Redis (taskiq:pdf-content-extraction)
      └─ Worker: process_extraction_job()
          ├─ MDI path → Azure Document Intelligence
          ├─ VISION path → render page → Vision LLM
          ├─ MDI_VISION path → both in parallel
          └─ Optional: PageContentOptimizer (evaluator/generator loops)
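
The method selection in this code path can be sketched as a small dispatcher. This is an assumption-laden illustration rather than the service's `process_extraction_job()`: the `mdi` and `vision` callables are hypothetical stand-ins for the Azure calls, and joining the two outputs with a blank line is a placeholder for whatever merge the service actually performs.

```python
import asyncio
from typing import Awaitable, Callable

# An extractor takes the page bytes and returns markdown.
Extractor = Callable[[bytes], Awaitable[str]]


async def run_extraction(page: bytes, method: str, mdi: Extractor, vision: Extractor) -> str:
    """Run the extractor(s) selected by the extractionMethod value."""
    if method == "MDI":
        return await mdi(page)
    if method == "VISION":
        return await vision(page)
    if method == "MDI_VISION":
        # Both extractors run concurrently; results come back in call order.
        structured, visual = await asyncio.gather(mdi(page), vision(page))
        # Placeholder merge: the real combination logic is not documented here.
        return structured + "\n\n" + visual
    raise ValueError(f"unknown extractionMethod: {method}")
```

Running the two extractors concurrently means the hybrid path takes about as long as the slower of the two calls rather than their sum.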

Environment Variables and Secrets

| Name | Description | Example Value | Default | Required |
|---|---|---|---|---|
| API_BASE | Base URL for the Unique AI API | http://node-chat.finance-gpt.svc.cluster.local:8092/public | "" | Yes |
| FEATURE_FLAG_ENABLE_PDF_CONTENT_EXTRACTION | | | false | No, must be set to true to enable the feature |
| MDI_LOCATION | MDI service location identifier | "switzerland-north" | "" | Yes |
| MDI_ENDPOINT_DEFINITIONS | JSON array of MDI endpoint definitions | [{"location": "switzerland-north", "endpoint": "https://mdi.example.com"}] | [] | Yes |
| MDI_VERSION | MDI service version | "2024-11-30" | "2024-11-30" | No |
| REDIS_HOST | Redis host for job queue | "redis-service.chat.svc" | "" | Yes |
| REDIS_PORT | Redis port | 6379 | 10101 | No |
| REDIS_PASSWORD | Redis password | "password" | null | No |
| REDIS_DB | Redis database number | 4 | 4 | No |
| REDIS_USE_TLS | Enable TLS for Redis | "true" | "false" | No |
| MAX_WORKERS | Maximum concurrent job workers | 4 | 4 | No |
| GUNICORN_WORKERS | Number of Gunicorn worker processes | 2 | 2 | No |
| GUNICORN_THREADS | Number of threads per worker | 4 | 4 | No |
| SAVE_RESULTS_LOCALLY | Save extraction results locally for debugging | "false" | "false" | |

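To make the required/optional split concrete, here is a minimal sketch of a loader that reads a subset of these variables, applies the documented defaults, and fails fast when a required one is missing. The `Settings` class and `load_settings` function are hypothetical names for illustration and do not describe how the service itself parses its configuration.

```python
import json
import os
from dataclasses import dataclass
from typing import Dict, List, Mapping, Optional

# Variables marked "Yes" in the Required column.
REQUIRED = ("API_BASE", "MDI_LOCATION", "MDI_ENDPOINT_DEFINITIONS", "REDIS_HOST")


@dataclass(frozen=True)
class Settings:
    api_base: str
    feature_enabled: bool
    mdi_location: str
    mdi_endpoints: List[Dict[str, str]]
    mdi_version: str
    redis_host: str
    redis_port: int
    redis_password: Optional[str]
    redis_db: int
    redis_use_tls: bool


def _flag(env: Mapping[str, str], name: str, default: str) -> bool:
    # Treat only the literal string "true" (any case) as enabled.
    return env.get(name, default).strip().lower() == "true"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    endpoints = json.loads(env["MDI_ENDPOINT_DEFINITIONS"])
    if not isinstance(endpoints, list) or not endpoints:
        raise ValueError("MDI_ENDPOINT_DEFINITIONS must be a non-empty JSON array")
    return Settings(
        api_base=env["API_BASE"],
        feature_enabled=_flag(env, "FEATURE_FLAG_ENABLE_PDF_CONTENT_EXTRACTION", "false"),
        mdi_location=env["MDI_LOCATION"],
        mdi_endpoints=endpoints,
        mdi_version=env.get("MDI_VERSION", "2024-11-30"),
        redis_host=env["REDIS_HOST"],
        redis_port=int(env.get("REDIS_PORT", "10101")),  # documented default
        redis_password=env.get("REDIS_PASSWORD"),
        redis_db=int(env.get("REDIS_DB", "4")),
        redis_use_tls=_flag(env, "REDIS_USE_TLS", "false"),
    )
```

Note that the documented REDIS_PORT default (10101) differs from the example value (6379), so deployments that use a standard Redis port need to set the variable explicitly.
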
Connectivity Remarks

  • The service requires network access to Redis for job queue management

  • AI model endpoints must be accessible from the Kubernetes cluster

  • MDI service endpoints must be reachable for document processing

  • The service exposes HTTP endpoints on port 80 for health checks and API access
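
A quick way to check these connectivity requirements is a plain TCP probe run from inside the cluster. The sketch below is a generic helper, not part of the service; the target names are arbitrary labels, and a successful TCP connection only shows that the port is reachable, not that authentication or TLS will succeed.

```python
import socket
from typing import Dict, Tuple


def check_reachable(targets: Dict[str, Tuple[str, int]], timeout_s: float = 3.0) -> Dict[str, bool]:
    """Try to open a TCP connection to each (host, port) and report the outcome."""
    results: Dict[str, bool] = {}
    for name, (host, port) in targets.items():
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                results[name] = True
        except OSError:
            # Covers DNS failures, refused connections and timeouts.
            results[name] = False
    return results
```

For example, `check_reachable({"redis": ("redis-service.chat.svc", 6379), "mdi": ("mdi.example.com", 443)})` uses the example values from the environment variable table; substitute the hosts and ports of the actual deployment.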


Operating & Troubleshooting

Authentication Methods

The service uses:

  • API Keys: For Unique AI API access

  • Service Principal: For Azure OpenAI access

  • Redis Authentication: Username/password or TLS certificates

Troubleshooting

How can I verify that the service is connected to Redis?
You should find "Successfully connected to Redis" in the logs when the service starts. Otherwise, connection errors will be displayed.

How can service outages and processing failures be monitored?

  • Monitor pod health with Kubernetes probes

  • Check Redis queue length and job status

  • Review application logs for error patterns

  • Set up alerts for failed job processing

How can we resolve failures during document processing?

  1. Check MDI service availability

  2. Verify that the Base URL endpoint (API_BASE) is reachable

  3. Review job logs for specific error messages

  4. Restart failed jobs using the API

When should the memory / CPU be increased?
When usage reaches 80% of the allocated resources, or when job processing times exceed acceptable thresholds.

How long does document processing take?
Typically 15 seconds to 5 minutes per document. Processing time depends on:

  • Document complexity and size

  • Extraction method used (MDI, VISION, MDI_VISION)

Will there be any downtime for updating / upgrading versions?
No downtime should be expected as Kubernetes will perform rolling updates one replica at a time.

Monitoring

Key Metrics to Monitor:

  • Job queue length

  • Processing time per document

  • Success/failure rates

  • Resource utilization (CPU, Memory)

  • MDI model API response times

  • Redis connection health
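
Two of these metrics, success rate and processing time, can be derived from a list of finished jobs. The sketch below assumes a hypothetical `JobRecord` shape; how job outcomes and durations are actually collected (logs, Redis, a metrics backend) depends on the deployment and is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JobRecord:
    # Hypothetical record of one finished extraction job.
    succeeded: bool
    duration_s: float


def summarize_jobs(jobs: Iterable[JobRecord]) -> Dict[str, float]:
    """Return job count, success rate and 95th percentile duration."""
    records = list(jobs)
    if not records:
        return {"count": 0, "success_rate": 0.0, "p95_duration_s": 0.0}
    durations = sorted(r.duration_s for r in records)
    # Nearest-rank percentile: smallest value with at least 95% of jobs at or below it.
    rank = (95 * len(durations) + 99) // 100
    return {
        "count": len(records),
        "success_rate": sum(r.succeeded for r in records) / len(records),
        "p95_duration_s": durations[rank - 1],
    }
```

A percentile is usually more useful for alerting than an average here, because a few slow pages can dominate the mean.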

Log Analysis:

```bash
# View recent logs
kubectl logs -n chat deployment/agentic-ingestion --tail=100

# Filter for errors
kubectl logs -n chat deployment/agentic-ingestion | grep ERROR

# Monitor job processing
kubectl logs -n chat deployment/agentic-ingestion | grep "Processing job"
```
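
The same checks can be combined into one pass over a saved log file. The following is a minimal sketch that counts the markers mentioned on this page ("Successfully connected to Redis", "ERROR", "Processing job"); it assumes these strings appear verbatim in the log lines, and the `summarize_logs` helper is hypothetical, not a tool shipped with the service.

```python
from typing import Dict, Iterable


def summarize_logs(lines: Iterable[str]) -> Dict[str, int]:
    """Count Redis connection messages, error lines and processed jobs."""
    summary = {"redis_connected": 0, "errors": 0, "jobs": 0}
    for line in lines:
        if "Successfully connected to Redis" in line:
            summary["redis_connected"] += 1
        if "ERROR" in line:
            summary["errors"] += 1
        if "Processing job" in line:
            summary["jobs"] += 1
    return summary
```

To use it, redirect the kubectl output to a file and pass the open file object to `summarize_logs`; a `redis_connected` count of zero after a restart suggests the startup connection did not succeed.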

Architecture Overview

System Context

The diagram illustrates the system context for Agentic Ingestion, highlighting the high-level interactions between document uploads and the processing pipeline.

[Image: system context diagram]

Agentic Document Ingestion Overview

Document Processing Flow:

  1. Document Upload: User uploads PDF via Unique AI interface

  2. Job Creation: Agentic Ingestion creates processing job in Redis queue

  3. Document Processing: Service extracts text using AI-powered methods

  4. Storage: Processed content is stored in the knowledge base

Processing Methods:

  • MDI: Microsoft Document Intelligence for structured document extraction

  • VISION: AI vision models for image-based content extraction

  • MDI_VISION: Hybrid approach combining both methods
