Agentic Ingestion Infrastructure
Description
Agentic Ingestion is a suite of AI-powered document processing services that enable intelligent content extraction and enrichment for enterprise knowledge management. The platform provides three complementary services deployed as a unified bundle:
Agentic Metadata Extraction: Automatically extracts structured metadata from document content using LLMs with configurable schemas
Agentic Image Content Extraction: Extracts content from individual figures, charts, and diagrams detected within PDF pages using vision-capable LLMs.
Agentic PDF Document Extraction (deprecated; being replaced by Agentic Image Content Extraction): Converts PDF documents into structured, searchable text using vision-capable LLMs and Microsoft Document Intelligence (MDI).
All three services share common infrastructure (Kubernetes, Redis, container image) while maintaining independent API endpoints, job queues, and configurations. They leverage advanced AI techniques to provide superior accuracy compared to traditional OCR and manual processing workflows.
Architecture Overview
The Agentic Ingestion ecosystem consists of two primary components that work together:
node-ingestion-worker (Node.js/NestJS) — The orchestrator that manages the overall document ingestion pipeline, including PDF page splitting, MDI analysis, and coordination of AI extraction
agentic-ingestion (Python/Quart) — The AI extraction service that provides specialized endpoints for PDF extraction, image content extraction, and metadata extraction
High-Level Architecture

The high-level architecture shows the interplay between the two services and external dependencies:
node-ingestion-worker contains:
PDF Ingestor Service — Entry point for document processing; routes pages to the correct processing path
MS Document Intelligence Client — Calls Azure MDI for page layout analysis and figure detection
MDI Page Composer — Renders PDF pages, crops detected figures, and merges extracted figure text back into page markdown
Custom API Definition Parser — Sends full pages to external APIs (used for Agentic PDF Document Extraction)
Agentic Ingestion Image Extraction Adapter — HTTP client that creates async jobs and polls for results on the /image-content-extraction endpoint
agentic-ingestion contains:
/agentic-ingestion — PDF Content Extraction blueprint (job queue: taskiq:pdf-content-extraction)
/image-content-extraction — Image Content Extraction blueprint (job queue: taskiq:image-content-extraction)
/metadata-extraction — Metadata Extraction blueprint (webhook-driven)
/probe — Health check endpoint (Redis connectivity)
External Services:
Azure Document Intelligence (MDI) — Called by node-ingestion-worker (figure detection) and agentic-ingestion (PDF extraction methods)
Azure OpenAI (via node-chat / API_BASE) — Vision LLM completions for both PDF and image content extraction
Service Endpoints Summary
Endpoint Prefix | Service | Purpose | Job Queue |
|---|---|---|---|
/agentic-ingestion | PDF Content Extraction | Full-page PDF extraction using MDI, Vision, or hybrid | taskiq:pdf-content-extraction |
/image-content-extraction | Image Content Extraction | Per-figure image content extraction using vision models | taskiq:image-content-extraction |
/metadata-extraction | Metadata Extraction | LLM-based structured metadata extraction | (webhook-driven) |
/probe | Health Check | Redis connectivity health probe | N/A |
Agentic Ingestion Capabilities
Capability | Status | What it does | Detailed infrastructure documentation |
|---|---|---|---|
Agentic Metadata Extraction | BETA | Extracts structured metadata from ingested document content using a configurable schema and language model. | |
Agentic Image Content Extraction | BETA | Extracts searchable text from figures, charts, diagrams, and other visual content detected inside PDF pages processed through the standard Microsoft Document Intelligence pipeline. | |
Agentic PDF Document Extraction | Deprecated as it is being replaced by Agentic Image Content Extraction | Legacy Custom API based PDF extraction flow using | Agentic PDF Document Extraction for Infra Admins (Deprecated) |
Planning
Agentic Ingestion Deployment Options: Self-hosted vs Managed Services
When using our Agentic Ingestion-powered feature, you'll need to decide how to deploy the underlying infrastructure. This choice affects operational overhead, costs, and performance.
Use Your Existing Infrastructure
If you already have Kubernetes running with Redis and AI model endpoints, you can deploy the Agentic Ingestion service directly to your existing cluster. This is often the most cost-effective option since you're leveraging infrastructure you're already maintaining.
Managed Services
Managed services like Azure Container Instances, Google Cloud Run, and AWS Fargate offer the fastest deployment path. They provide automatic maintenance, built-in scaling, comprehensive monitoring, and high availability. However, they come with higher ongoing costs and less configuration control.
Self-hosted Kubernetes
Self-hosting provides complete cost control and customization capabilities while maintaining data sovereignty. The trade-off is significant operational overhead requiring Kubernetes expertise, plus responsibility for all maintenance, updates, and monitoring.
Our Recommendation
Use your existing Kubernetes infrastructure if you already have a running cluster with available capacity. Start with a managed service if you lack Kubernetes experience. Consider self-hosting if you have a dedicated infrastructure team or strict compliance requirements.
Budget
The resources you need to dedicate vary with the volume of documents processed and the number of concurrent users.
The cost incurred will depend on the pricing model of your cloud provider and AI model usage.
Examples
These are rough estimates. Actual costs depend on usage patterns, deployment region, and pricing changes, so treat the figures below as indicative only.
Example 1: Small Scale
100 users, 2 replicas, 1000 documents/month
2 CPUs * 2 replicas = 4 CPUs
2 Gi Memory * 2 replicas = 4 Gi Memory
MDI costs: ~$200/month
For Azure pricing in Switzerland North this can be:
2 D4s v5 nodes (2 * $185 = $370)
MDI costs: $200
Total: ~$570 per month
Example 2: Enterprise Scale
5000 users, 4 replicas, 50000 documents/month
2 CPUs * 4 replicas = 8 CPUs
2 Gi Memory * 4 replicas = 8 Gi Memory
MDI costs: ~$2000/month
For Azure pricing in Switzerland North this can be:
4 D4s v5 nodes (4 * $185 = $740)
MDI costs: $2000
Total: ~$2740 per month
Note: AI service costs vary significantly based on which services are enabled:
PDF extraction costs depend on Document Intelligence API usage
Image content extraction costs depend on vision model API calls per figure
Metadata extraction costs depend on LLM token usage
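The arithmetic behind the two examples can be sketched as follows. This is a minimal illustration, not a pricing tool: the class and field names are made up for this example, the node price of $185 is the figure quoted above, and LLM token costs are deliberately left out because the examples do not quantify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyEstimate:
    """Rough monthly cost: compute nodes plus Document Intelligence (MDI)."""

    nodes: int             # number of D4s v5 nodes (one per replica in the examples)
    node_price_usd: float  # price per node per month, taken from the examples above
    mdi_usd: float         # estimated MDI spend per month

    @property
    def compute_usd(self) -> float:
        return self.nodes * self.node_price_usd

    @property
    def total_usd(self) -> float:
        return self.compute_usd + self.mdi_usd


# Example 1: small scale (2 replicas, ~$200 MDI)
small = MonthlyEstimate(nodes=2, node_price_usd=185, mdi_usd=200)
# Example 2: enterprise scale (4 replicas, ~$2000 MDI)
enterprise = MonthlyEstimate(nodes=4, node_price_usd=185, mdi_usd=2000)

print(small.total_usd)       # 570
print(enterprise.total_usd)  # 2740
```

To adapt the estimate, replace the node price with the current rate for your region and add a separate term for vision and LLM usage once you have measured it.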
Provisioning
Prerequisites
Infrastructure:
Kubernetes Cluster: Version 1.24 or higher
Redis: Version 6.0 or higher (for job queue management)
Container Registry: Access to push/pull container images
Service Dependencies:
Unique AI API (node-chat)
AI service endpoints (Azure OpenAI, Document Intelligence, etc.)
Deployment
We recommend using Helm charts for deployment to Kubernetes.
Helm Chart: unique/backend-service
Chart Version: >= 9.0.1
All three services (Metadata Extraction, PDF Content Extraction, and Image Content Extraction) are deployed together as a single bundle using the same Helm chart. Services can be enabled/disabled via feature flags.
For complete deployment instructions, environment variables, and configuration:
Metadata Extraction Service Documentation
PDF Content Extraction Service Documentation
Image Content Extraction Service Documentation
For resource allocation recommendations, see Sizing below.
Connectivity Requirements
Source | Destination | Protocol | Purpose |
|---|---|---|---|
node-ingestion-worker | agentic-ingestion | HTTP (port 8081) | Job creation and polling for both PDF and image extraction |
agentic-ingestion | Redis | TCP (port 6379/TLS) | Job queue and result storage |
agentic-ingestion | node-chat (API_BASE) | HTTPS | LLM completions via platform API gateway |
agentic-ingestion | Azure MDI | HTTPS | Document Intelligence analysis (PDF extraction only) |
node-ingestion-worker | Azure MDI | HTTPS | Direct MDI analysis with figure extraction |
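A quick way to check the TCP paths in the table from inside the cluster is a small reachability probe. The sketch below only verifies that a TCP connection can be opened; it does not validate TLS, authentication, or HTTP behavior. The ports come from the table above, the hostname agentic-ingestion-service is the one used in the verification commands in this document, and the Redis hostname is a placeholder you would need to replace.

```python
import socket
from typing import Dict, Tuple


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Hostnames are assumptions for illustration; adjust to your environment.
TARGETS: Dict[str, Tuple[str, int]] = {
    "agentic-ingestion": ("agentic-ingestion-service", 8081),
    "redis": ("redis", 6379),  # placeholder Redis hostname
}


def check_all(targets: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Probe every target and report reachability by name."""
    return {name: can_connect(host, port) for name, (host, port) in targets.items()}


if __name__ == "__main__":
    for name, ok in check_all(TARGETS).items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

Run it from a pod in the same namespace as the source service so that network policies and DNS resolution match the real traffic path.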
Sizing
Compute / Memory
We recommend a deployment with 2-4 replicas for High Availability.
Recommended allocations per replica:
Concurrent Users | Replica CPU / Memory | Recommended Replicas |
|---|---|---|
10-50 | 1000-2000m / 2-4Gi | 2 |
50-200 | 2000-3000m / 4-6Gi | 2-3 |
200-1000 | 3000-4000m / 6-8Gi | 3-4 |
1000+ | 4000-5000m / 8-10Gi | 4+ |
Note on Image Content Extraction: When image content extraction is heavily used (many figures per document), consider increasing replicas or worker counts, as each figure requires an individual AI model call processed through the Redis job queue.
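The sizing table can be expressed as a simple lookup, for example to drive a capacity-planning script. This is a sketch under two assumptions that the table does not settle: boundary values (50, 200, 1000) are assigned to the smaller tier, and fewer than 10 concurrent users fall into the smallest tier. The function and type names are illustrative.

```python
from typing import List, NamedTuple, Tuple


class Sizing(NamedTuple):
    cpu: str       # CPU range per replica
    memory: str    # memory range per replica
    replicas: str  # recommended replica count


# (upper bound of concurrent users, recommendation), mirroring the table above.
_TIERS: List[Tuple[int, Sizing]] = [
    (50, Sizing("1000-2000m", "2-4Gi", "2")),
    (200, Sizing("2000-3000m", "4-6Gi", "2-3")),
    (1000, Sizing("3000-4000m", "6-8Gi", "3-4")),
]
_LARGEST = Sizing("4000-5000m", "8-10Gi", "4+")


def recommend(concurrent_users: int) -> Sizing:
    """Return the table row that covers the given number of concurrent users."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    for upper_bound, sizing in _TIERS:
        if concurrent_users <= upper_bound:
            return sizing
    return _LARGEST


print(recommend(120))
```

The result is a starting point only. As the note above says, figure-heavy workloads may need more replicas or workers than the user count alone suggests.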
Storage
The service requires minimal persistent storage for:
Temporary file processing (ephemeral)
Log storage (if enabled)
Debug output (if SAVE_RESULTS_LOCALLY=true)
Storage should be SSD-based and support volume expansion.
Redis Requirements
Memory: 1-2GB for job queue management
CPU: 500-1000m for queue processing
Persistence: Enabled for job durability
Redis hosts separate queues per service:
Queue Name | Service | Result Prefix |
|---|---|---|
taskiq:pdf-content-extraction | PDF Document Extraction | |
taskiq:image-content-extraction | Image Content Extraction | |
Initial Setup
Prerequisites
Agentic Ingestion service must be up and running
node-ingestion-worker must be configured with AGENTIC_INGESTION_BASE_URL pointing to the agentic-ingestion service
Verification Steps
Check Service Health
kubectl get pods -n chat -l app.kubernetes.io/name=agentic-ingestion
kubectl logs -n chat deployment/agentic-ingestion | grep "Successfully connected"
Check probe endpoint
curl -X POST http://agentic-ingestion-service/probe
Verify Image Content Extraction endpoint
curl -X POST http://agentic-ingestion-service/image-content-extraction/images/extractions \
  -H "Content-Type: application/json" \
  -H "x-user-id: test" \
  -d '{"companyId": "test", "data": "<base64-image>"}'
Verify node-ingestion-worker connectivity
Check that node-ingestion-worker can reach the agentic-ingestion service:
kubectl logs -n chat deployment/node-ingestion-worker | grep "agentic-ingestion"
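The curl call for the image content extraction endpoint can also be scripted, which helps when you want to submit a real image instead of a placeholder. The sketch below only builds the request. It mirrors the URL, headers, and body fields of the curl command above, while the function name and default base URL are illustrative. The response format is not described in this document, so the example does not parse it.

```python
import base64
import json
import urllib.request

# Same service hostname as in the curl commands above; adjust if yours differs.
BASE_URL = "http://agentic-ingestion-service"


def build_extraction_request(
    image_bytes: bytes,
    company_id: str,
    user_id: str,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build the POST request that creates an image content extraction job."""
    payload = {
        "companyId": company_id,
        # The endpoint expects the image as a base64 string in "data".
        "data": base64.b64encode(image_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        url=f"{base_url}/image-content-extraction/images/extractions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-user-id": user_id},
        method="POST",
    )


request = build_extraction_request(b"abc", company_id="test", user_id="test")
print(request.get_method(), request.full_url)
```

To send it, pass the request to urllib.request.urlopen(request, timeout=30) from a pod that can reach the service, and inspect the raw response body.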
Performance Configuration
agentic-ingestion Service
Variable | Default | Description |
|---|---|---|
| 4 | Maximum number of concurrent job workers (shared across all modules) |
| 2 | Number of Gunicorn worker processes |
| 4 | Number of threads per worker |
| 3600 | Job timeout in seconds |
| 4 | Max async tasks for image extraction in-process worker |
| 240000 | Timeout in ms for LLM chat completion calls (image extraction) |
| ONE_STEP | Default extraction strategy (ONE_STEP or TWO_STEP) |
| 3600 | TTL for image extraction job results in Redis |
| false | Enable/disable the image content extraction module |
| false | Enables page-batch figure extraction in |
| false | Keeps extracted |
node-ingestion-worker Configuration
Variable | Default | Description |
|---|---|---|
AGENTIC_INGESTION_BASE_URL | | Base URL of the agentic-ingestion service (e.g., |
| 3000 | Polling interval in ms between status checks for image extraction jobs |
AGENTIC_INGESTION_IMAGE_TIMEOUT_MS | 300000 | Maximum time in ms to wait for an image extraction job to complete |
| 3 | Maximum retries per HTTP request (create job + each poll) |
| 1000 | Initial backoff delay for retries |
| 2 | Exponential backoff multiplier for retries |
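To show how the polling and retry settings interact, here is a minimal Python sketch of a poll loop with exponential backoff. It is not the adapter's actual code (the adapter lives in the Node.js worker). It assumes the initial backoff delay is in milliseconds, that "maximum retries" means retries after a first attempt, and that the caller supplies its own status check, since job status values are not listed in this document.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays_ms(max_retries: int = 3, initial_ms: int = 1000, multiplier: int = 2) -> List[int]:
    """Delay before each retry: initial_ms, then multiplied on every further retry."""
    return [initial_ms * multiplier ** attempt for attempt in range(max_retries)]


def call_with_retries(
    request: Callable[[], T],
    max_retries: int = 3,
    initial_ms: int = 1000,
    multiplier: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(); on failure wait with exponential backoff and try again."""
    for delay_ms in backoff_delays_ms(max_retries, initial_ms, multiplier):
        try:
            return request()
        except Exception:
            sleep(delay_ms / 1000)
    return request()  # final attempt; any exception propagates to the caller


def poll_until_done(
    get_status: Callable[[], T],
    is_terminal: Callable[[T], bool],
    interval_ms: int = 3000,
    timeout_ms: int = 300_000,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Poll get_status until is_terminal(status) is true or the timeout expires."""
    deadline = clock() + timeout_ms / 1000
    while True:
        status = call_with_retries(get_status, sleep=sleep)
        if is_terminal(status):
            return status
        if clock() >= deadline:
            raise TimeoutError("image extraction job did not finish in time")
        sleep(interval_ms / 1000)


print(backoff_delays_ms())  # [1000, 2000, 4000]
```

With the defaults from the table, a single failing request waits 1 s, 2 s, and 4 s between attempts, so a flaky poll can add up to about 7 s before it is reported as failed. Keep that in mind when tuning the overall job timeout.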
Service-Specific Documentation
Each service has dedicated documentation covering:
Detailed configuration and environment variables
API endpoints and usage
Architecture and processing flow
Troubleshooting and monitoring
Metadata Extraction Service
LLM-based metadata extraction with configurable schemas
Webhook-driven processing
Documentation: Agentic Metadata Extraction for Infra Admins
PDF Content Extraction Service
Multiple extraction methods (MDI, Vision, Hybrid)
Synchronous and asynchronous processing
Documentation: Agentic PDF Document Extraction for Infra Admins
Image Content Extraction Service
Per-figure vision-based content extraction
One-step and two-step extraction strategies with automatic fallback
Integrated with node-ingestion-worker via dedicated adapter
Documentation: Agentic Image Content Extraction for Infra Admins
Operating & Troubleshooting
Authentication Methods
The service uses:
API Keys: For Unique AI API access (via API_BASE)
Service Principal: For Azure OpenAI access
Redis Authentication: Username/password or TLS certificates
Troubleshooting
How can I verify that the service is connected to Redis?
You should find "Successfully connected to Redis" in the logs when the service starts. Otherwise, connection errors will be displayed.
How can I verify that image content extraction is working?
Check the agentic-ingestion logs for image-content-extraction related messages. Successful extractions will show job creation and completion. In node-ingestion-worker, look for logs from the AgenticIngestionImageExtractionAdapter.
How can service outages and processing failures be monitored?
Monitor pod health with Kubernetes probes
Check Redis queue length for both taskiq:pdf-content-extraction and taskiq:image-content-extraction
Review application logs for error patterns
Set up alerts for failed job processing
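The queue-length check above could be automated along the following lines. This sketch makes assumptions worth verifying first: that the queue names are the literal Redis keys, and that each key is either a list or a stream (this document does not say which structure the job queue uses). The client is any object exposing the type, llen, and xlen commands, as redis-py's Redis client does. The threshold is an arbitrary example value.

```python
from typing import Dict, Iterable, List

QUEUES = ("taskiq:pdf-content-extraction", "taskiq:image-content-extraction")


def queue_depths(client, queues: Iterable[str] = QUEUES) -> Dict[str, int]:
    """Return the number of pending entries per queue key."""
    depths: Dict[str, int] = {}
    for name in queues:
        kind = client.type(name)
        if isinstance(kind, bytes):  # clients without response decoding return bytes
            kind = kind.decode()
        if kind == "list":
            depths[name] = client.llen(name)
        elif kind == "stream":
            depths[name] = client.xlen(name)
        else:
            # "none" means the key does not exist, i.e. nothing is queued.
            depths[name] = 0
    return depths


def over_threshold(depths: Dict[str, int], threshold: int) -> List[str]:
    """Names of queues whose depth exceeds the alert threshold."""
    return sorted(name for name, depth in depths.items() if depth > threshold)


class FakeRedis:
    """In-memory stand-in so the example runs without a Redis server."""

    def __init__(self, lists: Dict[str, int]):
        self._lists = lists

    def type(self, name: str) -> bytes:
        return b"list" if name in self._lists else b"none"

    def llen(self, name: str) -> int:
        return self._lists[name]


demo = queue_depths(FakeRedis({"taskiq:image-content-extraction": 120}))
print(over_threshold(demo, threshold=100))
```

Against a real deployment, you would pass a redis-py client configured with your own host, port, and credentials instead of FakeRedis, and feed the result into whatever alerting you already use.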
How can we resolve failures during document processing?
Check MDI service availability
Verify API_BASE endpoint access
Review job logs for specific error messages
Check Redis connectivity and queue health
Restart failed jobs using the API
Image extraction jobs timing out
If image extraction jobs are timing out:
Increase AGENTIC_INGESTION_IMAGE_TIMEOUT_MS on node-ingestion-worker
Check IMAGE_CONTENT_EXTRACTION_CHAT_COMPLETION_TIMEOUT on agentic-ingestion
Verify LLM endpoint response times (vision calls can be slower than text-only)
Check if Redis is experiencing high load
When should the memory / CPU be increased?
Increase them when utilization reaches 80% of the allocated resources or when job processing times exceed acceptable thresholds. Image content extraction may require more resources when processing documents with many figures concurrently.
How long does document processing take?
Processing time depends on:
Document complexity and size
Extraction method and path used
Number of figures per page (for image content extraction)
Processing typically takes 15 seconds to 5 minutes per document. Image extraction adds 2-15 seconds per page, depending on figure count.
Will there be any downtime for updating / upgrading versions?
No downtime should be expected as Kubernetes will perform rolling updates one replica at a time.
Monitoring
Key Metrics to Monitor:
Job queue length (per queue: PDF extraction and image extraction)
Processing time per document and per figure
Success/failure rates per service
Resource utilization (CPU, Memory)
AI model API response times (MDI and LLM)
Redis connection health and memory usage
Log Analysis:
# View recent agentic-ingestion logs
kubectl logs -n chat deployment/agentic-ingestion --tail=100
# Filter for image extraction errors
kubectl logs -n chat deployment/agentic-ingestion | grep -i "image-content-extraction" | grep ERROR
# View node-ingestion-worker adapter logs
kubectl logs -n chat deployment/node-ingestion-worker | grep -i "agentic.*image"
# Monitor job processing across services
kubectl logs -n chat deployment/agentic-ingestion | grep "Processing job"