Agentic Ingestion Infrastructure

Description

Agentic Ingestion is a suite of AI-powered document processing services that enable intelligent content extraction and enrichment for enterprise knowledge management. The platform provides three complementary services deployed as a unified bundle:

Agentic Metadata Extraction: Automatically extracts structured metadata from document content using LLMs with configurable schemas.
Agentic Image Content Extraction: Extracts content from individual figures, charts, and diagrams detected within PDF pages using vision-capable LLMs.
Agentic PDF Document Extraction (deprecated; being replaced by Agentic Image Content Extraction): Converts PDF documents into structured, searchable text using vision-capable LLMs and Microsoft Document Intelligence (MDI).

All three services share common infrastructure (Kubernetes, Redis, container image) while maintaining independent API endpoints, job queues, and configurations. They use LLMs and vision-capable models with the aim of improving accuracy over traditional OCR and manual processing workflows.

Architecture Overview

The Agentic Ingestion ecosystem consists of two primary components that work together:

node-ingestion-worker (Node.js/NestJS) — The orchestrator that manages the overall document ingestion pipeline, including PDF page splitting, MDI analysis, and coordination of AI extraction
agentic-ingestion (Python/Quart) — The AI extraction service that provides specialized endpoints for PDF extraction, image content extraction, and metadata extraction

High-Level Architecture

The high-level architecture shows the interplay between the two services and external dependencies:

node-ingestion-worker contains:

PDF Ingestor Service — Entry point for document processing; routes pages to the correct processing path
MS Document Intelligence Client — Calls Azure MDI for page layout analysis and figure detection
MDI Page Composer — Renders PDF pages, crops detected figures, and merges extracted figure text back into page markdown
Custom API Definition Parser — Sends full pages to external APIs (used for Agentic PDF Document Extraction)
Agentic Ingestion Image Extraction Adapter — HTTP client that creates async jobs and polls for results on the /image-content-extraction endpoint

agentic-ingestion contains:

/agentic-ingestion — PDF Content Extraction blueprint (job queue: taskiq:pdf-content-extraction)
/image-content-extraction — Image Content Extraction blueprint (job queue: taskiq:image-content-extraction)
/metadata-extraction — Metadata Extraction blueprint (webhook-driven)
/probe — Health check endpoint (Redis connectivity)

External Services:

Azure Document Intelligence (MDI) — Called by node-ingestion-worker (figure detection) and agentic-ingestion (PDF extraction methods)
Azure OpenAI (via node-chat / API_BASE) — Vision LLM completions for both PDF and image content extraction

Service Endpoints Summary

Endpoint Prefix	Service	Purpose	Job Queue
`/agentic-ingestion`	PDF Content Extraction	Full-page PDF extraction using MDI, Vision, or hybrid	`taskiq:pdf-content-extraction`
`/image-content-extraction`	Image Content Extraction	Per-figure image content extraction using vision models	`taskiq:image-content-extraction`
`/metadata-extraction`	Metadata Extraction	LLM-based structured metadata extraction	(webhook-driven)
`/probe`	Health Check	Redis connectivity health probe	N/A

Agentic Ingestion Capabilities

Capability	Status	What it does	Detailed infrastructure documentation
Agentic Metadata Extraction	BETA	Extracts structured metadata from ingested document content using a configurable schema and language model.	Agentic Metadata Extraction for Infra Admins
Agentic Image Content Extraction	BETA	Extracts searchable text from figures, charts, diagrams, and other visual content detected inside PDF pages processed through the standard Microsoft Document Intelligence pipeline.	Agentic Image Content Extraction for Infra Admins
Agentic PDF Document Extraction	Deprecated as it is being replaced by Agentic Image Content Extraction	Legacy Custom API based PDF extraction flow using `CUSTOM_SINGLE_PAGE_API`. New configurations should use Image Content Extraction with the default Document Intelligence PDF ingestion flow instead.	Agentic PDF Document Extraction for Infra Admins (Deprecated)

Planning

Agentic Ingestion Deployment Options: Self-hosted vs Managed Services

Before using Agentic Ingestion-powered features, you need to decide how to deploy the underlying infrastructure. This choice affects operational overhead, costs, and performance.

Use Your Existing Infrastructure

If you already have Kubernetes running with Redis and AI model endpoints, you can deploy the Agentic Ingestion service directly to your existing cluster. This is often the most cost-effective option since you're leveraging infrastructure you're already maintaining.

Managed Services

Managed services like Azure Container Instances, Google Cloud Run, and AWS Fargate offer the fastest deployment path. They provide automatic maintenance, built-in scaling, comprehensive monitoring, and high availability. However, they come with higher ongoing costs and less configuration control.

Self-hosted Kubernetes

Self-hosting provides complete cost control and customization capabilities while maintaining data sovereignty. The trade-off is significant operational overhead requiring Kubernetes expertise, plus responsibility for all maintenance, updates, and monitoring.

Our Recommendation

Use your existing Kubernetes infrastructure if you already have a running cluster with available capacity. Start with a managed service if you lack Kubernetes experience. Consider self-hosting if you have a dedicated infrastructure team or strict compliance requirements.

Budget

The resources you need to dedicate vary with the volume of documents processed and the number of concurrent users.

The cost incurred will depend on the pricing model of your cloud provider and AI model usage.

Examples

These are rough estimates. Actual costs depend on usage patterns, deployment region, and pricing variations, so use the following figures at your own discretion.

Example 1: Small Scale

100 users, 2 replicas, 1000 documents/month
2 CPUs * 2 replicas = 4 CPUs
2 Gi Memory * 2 replicas = 4 Gi Memory
MDI costs: ~$200/month

For Azure pricing in Switzerland North this can be:

2 D4s v5 nodes (2 * $185 = $370)
MDI costs: $200
Total: ~$570 per month

Example 2: Enterprise Scale

5000 users, 4 replicas, 50000 documents/month
2 CPUs * 4 replicas = 8 CPUs
2 Gi Memory * 4 replicas = 8 Gi Memory
MDI costs: ~$2000/month

For Azure pricing in Switzerland North this can be:

4 D4s v5 nodes (4 * $185 = $740)
MDI costs: $2000
Total: ~$2740 per month
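
The arithmetic behind both examples can be reproduced with a few lines of Python. This is only a sketch for adapting the estimate to your own numbers: the node price ($185 per D4s v5 node) and the MDI costs are the rough figures quoted above, and the function names are hypothetical.

```python
# Rough monthly cost estimate, using the example figures from this section.
NODE_PRICE_USD = 185  # approximate D4s v5 price quoted above (Switzerland North)


def total_resources(replicas, cpu_per_replica=2, memory_gi_per_replica=2):
    """Return (total CPUs, total memory in Gi) across all replicas."""
    return replicas * cpu_per_replica, replicas * memory_gi_per_replica


def monthly_cost_usd(nodes, mdi_cost_usd, node_price_usd=NODE_PRICE_USD):
    """Node cost plus MDI cost; LLM token costs are not included."""
    return nodes * node_price_usd + mdi_cost_usd


small = monthly_cost_usd(nodes=2, mdi_cost_usd=200)
enterprise = monthly_cost_usd(nodes=4, mdi_cost_usd=2000)
print(small, enterprise)  # 570 2740
```

Replace `NODE_PRICE_USD` and the MDI figures with your provider's current pricing before relying on the result.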

Note: AI service costs vary significantly based on which services are enabled:

PDF extraction costs depend on Document Intelligence API usage
Image content extraction costs depend on vision model API calls per figure
Metadata extraction costs depend on LLM token usage

Provisioning

Prerequisites

Infrastructure:

Kubernetes Cluster: Version 1.24 or higher
Redis: Version 6.0 or higher (for job queue management)
Container Registry: Access to push/pull container images

Service Dependencies:

Unique AI API (node-chat)
AI service endpoints (Azure OpenAI, Document Intelligence, etc.)

Deployment

We recommend using Helm charts for deployment to Kubernetes.

Helm Chart: unique/backend-service

Chart Version: >= 9.0.1

All three services (Metadata Extraction, PDF Content Extraction, and Image Content Extraction) are deployed together as a single bundle using the same Helm chart. Services can be enabled/disabled via feature flags.

For complete deployment instructions, environment variables, and configuration:

Metadata Extraction Service Documentation
PDF Content Extraction Service Documentation
Image Content Extraction Service Documentation

For resource allocation recommendations, see Sizing below.

Connectivity Requirements

Source	Destination	Protocol	Purpose
node-ingestion-worker	agentic-ingestion	HTTP (port 8081)	Job creation and polling for both PDF and image extraction
agentic-ingestion	Redis	TCP (port 6379/TLS)	Job queue and result storage
agentic-ingestion	node-chat (API_BASE)	HTTPS	LLM completions via platform API gateway
agentic-ingestion	Azure MDI	HTTPS	Document Intelligence analysis (PDF extraction only)
node-ingestion-worker	Azure MDI	HTTPS	Direct MDI analysis with figure extraction
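
To check the TCP-level routes in the table above from inside a pod, a small script like the following can help. It is a minimal sketch: the Redis hostname is a hypothetical placeholder, the agentic-ingestion host is taken from the `AGENTIC_INGESTION_BASE_URL` example later in this document, and a successful TCP connect does not prove that TLS or authentication is configured correctly.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Hypothetical in-cluster targets; adjust to your own service names.
TARGETS = [
    ("agentic-ingestion.chat.svc", 8081),  # node-ingestion-worker -> agentic-ingestion
    ("redis.chat.svc", 6379),              # agentic-ingestion -> Redis (placeholder host)
]


def report(targets=TARGETS):
    """Print one reachability line per target."""
    for host, port in targets:
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} {state}")
```

Call `report()` from a Python shell inside the source pod of each route, since network policies may differ per pod.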

Sizing

Compute / Memory

We recommend a deployment with 2-4 replicas for high availability.

Recommended allocations per replica:

Concurrent Users	Replica CPU / Memory	Recommended Replicas
10-50	1000-2000m / 2-4Gi	2
50-200	2000-3000m / 4-6Gi	2-3
200-1000	3000-4000m / 6-8Gi	3-4
1000+	4000-5000m / 8-10Gi	4+
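
The sizing table can be expressed as a simple lookup, which is handy in provisioning scripts. This is a sketch under two assumptions not stated in the table: boundary values (50, 200, 1000) fall into the smaller tier, and fewer than 10 users map to the smallest tier. The `Tier` type and `recommend` function are hypothetical names.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    max_users: float  # upper bound of the tier (inclusive, by assumption)
    cpu: str          # CPU range per replica
    memory: str       # memory range per replica
    replicas: str     # recommended replica count


# Values copied from the sizing table above.
TIERS = [
    Tier(50, "1000-2000m", "2-4Gi", "2"),
    Tier(200, "2000-3000m", "4-6Gi", "2-3"),
    Tier(1000, "3000-4000m", "6-8Gi", "3-4"),
    Tier(float("inf"), "4000-5000m", "8-10Gi", "4+"),
]


def recommend(concurrent_users: int) -> Tier:
    """Return the first tier whose upper bound covers the user count."""
    for tier in TIERS:
        if concurrent_users <= tier.max_users:
            return tier
    return TIERS[-1]


print(recommend(500))
```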

Note on Image Content Extraction: When image content extraction is heavily used (many figures per document), consider increasing replicas or worker counts, as each figure requires an individual AI model call processed through the Redis job queue.

Storage

The service requires minimal persistent storage for:

Temporary file processing (ephemeral)
Log storage (if enabled)
Debug output (if SAVE_RESULTS_LOCALLY=true)

Storage should be SSD-based and support volume expansion.

Redis Requirements

Memory: 1-2GB for job queue management
CPU: 500-1000m for queue processing
Persistence: Enabled for job durability

Redis hosts separate queues per service:

Queue Name	Service	Result Prefix
`taskiq:pdf-content-extraction`	PDF Content Extraction	`taskiq:pdf-res`
`taskiq:image-content-extraction`	Image Content Extraction	`taskiq:image-res`

Initial Setup

Prerequisites

Agentic Ingestion service must be up and running
node-ingestion-worker must be configured with AGENTIC_INGESTION_BASE_URL pointing to the agentic-ingestion service

Verification Steps

Check Service Health

```bash
kubectl get pods -n chat -l app.kubernetes.io/name=agentic-ingestion
kubectl logs -n chat deployment/agentic-ingestion | grep "Successfully connected"
```

Check probe endpoint

```bash
curl -X POST http://agentic-ingestion-service/probe
```

Verify Image Content Extraction endpoint

```bash
curl -X POST http://agentic-ingestion-service/image-content-extraction/images/extractions \
  -H "Content-Type: application/json" \
  -H "x-user-id: test" \
  -d '{"companyId": "test", "data": "<base64-image>"}'
```
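
If you prefer to build the same request from a real image file, the following Python sketch mirrors the curl call. It assumes that `data` carries the plain base64 encoding of the image bytes (the curl example only shows a `<base64-image>` placeholder) and reuses the placeholder host `agentic-ingestion-service`. The helper name is hypothetical, and the response schema is not covered here.

```python
import base64
import json
import urllib.request

BASE_URL = "http://agentic-ingestion-service"  # placeholder host from the curl example


def build_extraction_request(image_bytes: bytes, company_id: str, user_id: str,
                             base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates an extraction job."""
    payload = {
        "companyId": company_id,
        # Assumption: plain base64 of the raw image bytes, no data-URI prefix.
        "data": base64.b64encode(image_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        url=f"{base_url}/image-content-extraction/images/extractions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-user-id": user_id},
        method="POST",
    )


request = build_extraction_request(b"\x89PNG", company_id="test", user_id="test")
print(request.get_method(), request.full_url)
```

Pass the request to `urllib.request.urlopen(request)` to send it, and consult the Image Content Extraction service documentation for the response format.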

Verify node-ingestion-worker connectivity

Check that node-ingestion-worker can reach the agentic-ingestion service:

```bash
kubectl logs -n chat deployment/node-ingestion-worker | grep "agentic-ingestion"
```

Performance Configuration

agentic-ingestion Service

Variable	Default	Description
`MAX_WORKERS`	4	Maximum number of concurrent job workers (shared across all modules)
`GUNICORN_WORKERS`	2	Number of Gunicorn worker processes
`GUNICORN_THREADS`	4	Number of threads per worker
`REDIS_JOB_TTL_SECONDS`	3600	Job timeout in seconds
`IMAGE_CONTENT_EXTRACTION_MAX_WORKERS`	4	Max async tasks for image extraction in-process worker
`IMAGE_CONTENT_EXTRACTION_CHAT_COMPLETION_TIMEOUT`	240000	Timeout in ms for LLM chat completion calls (image extraction)
`IMAGE_CONTENT_EXTRACTION_STRATEGY`	ONE_STEP	Default extraction strategy (ONE_STEP or TWO_STEP)
`IMAGE_CONTENT_EXTRACTION_REDIS_JOB_TTL_SECONDS`	3600	TTL for image extraction job results in Redis
`FEATURE_FLAG_ENABLE_IMAGE_CONTENT_EXTRACTION_UN_17223`	false	Enable/disable the image content extraction module
`FEATURE_FLAG_ENABLE_AGENTIC_INGESTION_PAGE_BATCH_FIGURE_EXTRACTION_UN_19457`	false	Enables page-batch figure extraction in `node-ingestion-worker`.
`FEATURE_FLAG_ENABLE_ATOMIC_FIGURE_CHUNKING_UN_19136`	false	Keeps extracted `<figure>` blocks atomic during markdown chunking in `node-ingestion-worker`
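
To illustrate how the defaults in this table combine, the sketch below reads the variables from an environment mapping and falls back to the documented defaults. It is not the service's actual configuration loader: the `AgenticIngestionSettings` class and `load_settings` function are hypothetical, and only a subset of the variables is shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AgenticIngestionSettings:
    max_workers: int
    gunicorn_workers: int
    gunicorn_threads: int
    redis_job_ttl_seconds: int
    image_max_workers: int
    image_chat_completion_timeout_ms: int
    image_strategy: str


def load_settings(env: Mapping[str, str] = os.environ) -> AgenticIngestionSettings:
    """Read settings from env, using the defaults documented in the table."""
    strategy = env.get("IMAGE_CONTENT_EXTRACTION_STRATEGY", "ONE_STEP")
    if strategy not in ("ONE_STEP", "TWO_STEP"):
        raise ValueError(f"Unsupported extraction strategy: {strategy}")
    return AgenticIngestionSettings(
        max_workers=int(env.get("MAX_WORKERS", "4")),
        gunicorn_workers=int(env.get("GUNICORN_WORKERS", "2")),
        gunicorn_threads=int(env.get("GUNICORN_THREADS", "4")),
        redis_job_ttl_seconds=int(env.get("REDIS_JOB_TTL_SECONDS", "3600")),
        image_max_workers=int(env.get("IMAGE_CONTENT_EXTRACTION_MAX_WORKERS", "4")),
        image_chat_completion_timeout_ms=int(
            env.get("IMAGE_CONTENT_EXTRACTION_CHAT_COMPLETION_TIMEOUT", "240000")
        ),
        image_strategy=strategy,
    )


defaults = load_settings({})
# With the defaults, Gunicorn provides 2 workers x 4 threads = 8 request threads per replica.
print(defaults.gunicorn_workers * defaults.gunicorn_threads)
```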

node-ingestion-worker Configuration

Variable	Default	Description
`AGENTIC_INGESTION_BASE_URL`	`""`	Base URL of the agentic-ingestion service (e.g., `http://agentic-ingestion.chat.svc:8081`)
`AGENTIC_INGESTION_IMAGE_POLLING_DURATION_MS`	3000	Polling interval in ms between status checks for image extraction jobs
`AGENTIC_INGESTION_IMAGE_TIMEOUT_MS`	300000	Maximum time in ms to wait for an image extraction job to complete
`AGENTIC_INGESTION_IMAGE_MAX_RETRIES`	3	Maximum retries per HTTP request (create job + each poll)
`AGENTIC_INGESTION_IMAGE_MIN_TIMEOUT_MS`	1000	Initial backoff delay for retries
`AGENTIC_INGESTION_IMAGE_BACKOFF_FACTOR`	2	Exponential backoff multiplier for retries
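
The following sketch shows how these polling and retry settings interact with their default values. It assumes a conventional exponential backoff (the delay before retry n is the initial delay multiplied by the backoff factor to the power n); the adapter's exact formula, and whether it adds jitter, is not documented here. All function names are hypothetical.

```python
import time
from typing import Callable, List, Optional

# Defaults from the table above.
POLLING_INTERVAL_MS = 3000
TIMEOUT_MS = 300000
MAX_RETRIES = 3
MIN_TIMEOUT_MS = 1000
BACKOFF_FACTOR = 2


def retry_delays_ms(max_retries: int = MAX_RETRIES,
                    min_timeout_ms: int = MIN_TIMEOUT_MS,
                    factor: int = BACKOFF_FACTOR) -> List[int]:
    """Delay before each retry, assuming delay_n = min_timeout * factor**n."""
    return [min_timeout_ms * factor ** n for n in range(max_retries)]


def max_polls(timeout_ms: int = TIMEOUT_MS,
              interval_ms: int = POLLING_INTERVAL_MS) -> int:
    """Approximate number of status checks that fit into the timeout."""
    return timeout_ms // interval_ms


def poll_until_done(check_status: Callable[[], Optional[str]],
                    timeout_ms: int = TIMEOUT_MS,
                    interval_ms: int = POLLING_INTERVAL_MS,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call check_status until it returns a result or the timeout elapses.

    check_status is a caller-supplied function returning None while pending.
    """
    waited_ms = 0
    while waited_ms <= timeout_ms:
        result = check_status()
        if result is not None:
            return result
        sleep(interval_ms / 1000)
        waited_ms += interval_ms
    return None


print(retry_delays_ms())  # [1000, 2000, 4000]
print(max_polls())        # 100
```

Under these assumptions, a single failing HTTP request adds at most 7 seconds of backoff, and a job is polled roughly 100 times before the worker gives up.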

Service-Specific Documentation

Each service has dedicated documentation covering:

Detailed configuration and environment variables
API endpoints and usage
Architecture and processing flow
Troubleshooting and monitoring

Metadata Extraction Service

LLM-based metadata extraction with configurable schemas
Webhook-driven processing
Documentation: Agentic Metadata Extraction for Infra Admins

PDF Content Extraction Service

Multiple extraction methods (MDI, Vision, Hybrid)
Synchronous and asynchronous processing
Documentation: Agentic PDF Document Extraction for Infra Admins

Image Content Extraction Service

Per-figure vision-based content extraction
One-step and two-step extraction strategies with automatic fallback
Integrated with node-ingestion-worker via dedicated adapter
Documentation: Agentic Image Content Extraction for Infra Admins

Operating & Troubleshooting

Authentication Methods

The service uses:

API Keys: For Unique AI API access (via API_BASE)
Service Principal: For Azure OpenAI access
Redis Authentication: Username/password or TLS certificates

Troubleshooting

How can I verify that the service is connected to Redis?
You should find "Successfully connected to Redis" in the logs when the service starts. Otherwise, connection errors will be displayed.

How can I verify that image content extraction is working?
Check the agentic-ingestion logs for image-content-extraction related messages. Successful extractions will show job creation and completion. In node-ingestion-worker, look for logs from the AgenticIngestionImageExtractionAdapter.

How can service outages and processing failures be monitored?

Monitor pod health with Kubernetes probes
Check Redis queue length for both taskiq:pdf-content-extraction and taskiq:image-content-extraction
Review application logs for error patterns
Set up alerts for failed job processing

How can we resolve failures during document processing?

Check MDI service availability
Verify API_BASE endpoint access
Review job logs for specific error messages
Check Redis connectivity and queue health
Restart failed jobs using the API

What can I do if image extraction jobs are timing out?

Increase AGENTIC_INGESTION_IMAGE_TIMEOUT_MS on node-ingestion-worker
Check IMAGE_CONTENT_EXTRACTION_CHAT_COMPLETION_TIMEOUT on agentic-ingestion
Verify LLM endpoint response times (vision calls can be slower than text-only)
Check if Redis is experiencing high load
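
When tuning these timeouts, it helps to check that they are consistent with each other. The sketch below encodes one plausible rule, stated here as an assumption rather than a documented requirement: the worker-side timeout should be larger than the LLM completion timeout, so that a single slow vision call can still finish before the worker gives up. The function name is hypothetical.

```python
from typing import List


def check_timeouts(worker_timeout_ms: int = 300000,
                   completion_timeout_ms: int = 240000,
                   polling_interval_ms: int = 3000) -> List[str]:
    """Return warnings for timeout combinations that are likely to misbehave."""
    warnings = []
    if worker_timeout_ms <= completion_timeout_ms:
        warnings.append(
            "AGENTIC_INGESTION_IMAGE_TIMEOUT_MS should exceed "
            "IMAGE_CONTENT_EXTRACTION_CHAT_COMPLETION_TIMEOUT"
        )
    if polling_interval_ms >= worker_timeout_ms:
        warnings.append(
            "AGENTIC_INGESTION_IMAGE_POLLING_DURATION_MS should be well below "
            "AGENTIC_INGESTION_IMAGE_TIMEOUT_MS"
        )
    return warnings


print(check_timeouts())  # [] with the documented defaults
```

If the TWO_STEP strategy issues more than one LLM call per figure, the worker-side timeout may need additional headroom beyond this check.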

When should memory or CPU be increased?
Increase resources when utilization reaches 80% of the allocated CPU or memory, or when job processing times exceed acceptable thresholds. Image content extraction may require more resources when documents with many figures are processed concurrently.
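
The 80% rule can be checked against CPU values in the millicore notation that Kubernetes uses (for example `1600m`). This is a small sketch with hypothetical helper names; it does not query the cluster itself, so feed it the usage numbers you observe in your monitoring.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '1600m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def needs_more_resources(used: float, allocated: float, threshold: float = 0.8) -> bool:
    """True when usage has reached the given fraction of the allocation."""
    return used >= threshold * allocated


used = parse_cpu_millicores("1600m")
allocated = parse_cpu_millicores("2")
print(needs_more_resources(used, allocated))  # True: 1600m is 80% of 2000m
```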

How long does document processing take?
Processing time depends on:

Document complexity and size
Extraction method and path used
Number of figures per page (for image content extraction)

Processing typically takes 15 seconds to 5 minutes per document. Image extraction adds 2-15 seconds per page, depending on figure count.

Will there be any downtime for updating / upgrading versions?
No downtime should be expected as Kubernetes will perform rolling updates one replica at a time.

Monitoring

Key Metrics to Monitor:

Job queue length (per queue: PDF extraction and image extraction)
Processing time per document and per figure
Success/failure rates per service
Resource utilization (CPU, Memory)
AI model API response times (MDI and LLM)
Redis connection health and memory usage

Log Analysis:

```bash
# View recent agentic-ingestion logs
kubectl logs -n chat deployment/agentic-ingestion --tail=100

# Filter for image extraction errors
kubectl logs -n chat deployment/agentic-ingestion | grep -i "image-content-extraction" | grep ERROR

# View node-ingestion-worker adapter logs
kubectl logs -n chat deployment/node-ingestion-worker | grep -i "agentic.*image"

# Monitor job processing across services
kubectl logs -n chat deployment/agentic-ingestion | grep "Processing job"
```