Agentic Ingestion Infrastructure

Description

Agentic Ingestion is a suite of AI-powered document processing services that enable intelligent content extraction and enrichment for enterprise knowledge management. The platform provides three complementary services deployed as a unified bundle:

  • Agentic Metadata Extraction: Automatically extracts structured metadata from document content using LLMs with configurable schemas.

  • Agentic Image Content Extraction: Extracts content from individual figures, charts, and diagrams detected within PDF pages using vision-capable LLMs.

  • Agentic PDF Document Extraction (deprecated; being replaced by Agentic Image Content Extraction): Converts PDF documents into structured, searchable text using vision-capable LLMs and Microsoft Document Intelligence (MDI).

All three services share common infrastructure (Kubernetes, Redis, container image) while maintaining independent API endpoints, job queues, and configurations. They use LLMs and vision-capable models with the aim of improving accuracy over traditional OCR and manual processing workflows.


Architecture Overview

The Agentic Ingestion ecosystem consists of two primary components that work together:

  • node-ingestion-worker (Node.js/NestJS) — The orchestrator that manages the overall document ingestion pipeline, including PDF page splitting, MDI analysis, and coordination of AI extraction

  • agentic-ingestion (Python/Quart) — The AI extraction service that provides specialized endpoints for PDF extraction, image content extraction, and metadata extraction

High-Level Architecture

Diagram: pdf-extraction-flow

The high-level architecture shows the interplay between the two services and external dependencies:

node-ingestion-worker contains:

  • PDF Ingestor Service — Entry point for document processing; routes pages to the correct processing path

  • MS Document Intelligence Client — Calls Azure MDI for page layout analysis and figure detection

  • MDI Page Composer — Renders PDF pages, crops detected figures, and merges extracted figure text back into page markdown

  • Custom API Definition Parser — Sends full pages to external APIs (used for Agentic PDF Document Extraction)

  • Agentic Ingestion Image Extraction Adapter — HTTP client that creates async jobs and polls for results on the /image-content-extraction endpoint

agentic-ingestion contains:

  • /agentic-ingestion — PDF Content Extraction blueprint (job queue: taskiq:pdf-content-extraction)

  • /image-content-extraction — Image Content Extraction blueprint (job queue: taskiq:image-content-extraction)

  • /metadata-extraction — Metadata Extraction blueprint (webhook-driven)

  • /probe — Health check endpoint (Redis connectivity)

External Services:

  • Azure Document Intelligence (MDI) — Called by node-ingestion-worker (figure detection) and agentic-ingestion (PDF extraction methods)

  • Azure OpenAI (via node-chat / API_BASE) — Vision LLM completions for both PDF and image content extraction
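The create-then-poll interaction between the Agentic Ingestion Image Extraction Adapter and the /image-content-extraction endpoint can be sketched in a few lines. This is an illustration under stated assumptions, not the adapter's actual code (the adapter lives in the Node.js worker; Python is used here only for brevity). The helper name wait_for_job, the callables create_job and get_status, and the state values "completed" and "failed" are hypothetical, because the job status schema is not specified on this page. The default interval and timeout mirror AGENTIC_INGESTION_IMAGE_POLLING_DURATION_MS and AGENTIC_INGESTION_IMAGE_TIMEOUT_MS.

```python
import time
from typing import Callable


def wait_for_job(
    create_job: Callable[[], str],
    get_status: Callable[[str], dict],
    polling_interval_s: float = 3.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Create a job, then poll until it reaches a terminal state or times out."""
    job_id = create_job()
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(job_id)
        # "completed" and "failed" are assumed terminal states (hypothetical names).
        if status.get("state") in ("completed", "failed"):
            return status
        # Give up if the next poll would land after the deadline.
        if time.monotonic() + polling_interval_s > deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")
        sleep(polling_interval_s)


# Demo with in-memory stand-ins instead of real HTTP calls.
responses = iter([{"state": "pending"}, {"state": "completed", "result": "figure text"}])
final = wait_for_job(
    create_job=lambda: "job-1",
    get_status=lambda job_id: next(responses),
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(final["state"])
```

Passing the HTTP calls in as callables keeps the loop independent of any particular client library, which also makes it easy to exercise without a running service.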

Service Endpoints Summary

| Endpoint Prefix | Service | Purpose | Job Queue |
| --- | --- | --- | --- |
| /agentic-ingestion | PDF Content Extraction | Full-page PDF extraction using MDI, Vision, or hybrid | taskiq:pdf-content-extraction |
| /image-content-extraction | Image Content Extraction | Per-figure image content extraction using vision models | taskiq:image-content-extraction |
| /metadata-extraction | Metadata Extraction | LLM-based structured metadata extraction | (webhook-driven) |
| /probe | Health Check | Redis connectivity health probe | N/A |


Agentic Ingestion Capabilities

| Capability | Status | What it does | Detailed infrastructure documentation |
| --- | --- | --- | --- |
| Agentic Metadata Extraction | BETA | Extracts structured metadata from ingested document content using a configurable schema and language model. | Agentic Metadata Extraction for Infra Admins |
| Agentic Image Content Extraction | BETA | Extracts searchable text from figures, charts, diagrams, and other visual content detected inside PDF pages processed through the standard Microsoft Document Intelligence pipeline. | Agentic Image Content Extraction for Infra Admins |
| Agentic PDF Document Extraction | Deprecated (being replaced by Agentic Image Content Extraction) | Legacy Custom API-based PDF extraction flow using CUSTOM_SINGLE_PAGE_API. New configurations should use Image Content Extraction with the default Document Intelligence PDF ingestion flow instead. | Agentic PDF Document Extraction for Infra Admins (Deprecated) |


Planning

Agentic Ingestion Deployment Options: Self-hosted vs Managed Services

When using Agentic Ingestion-powered features, you need to decide how to deploy the underlying infrastructure. This choice affects operational overhead, costs, and performance.

Use Your Existing Infrastructure

If you already have Kubernetes running with Redis and AI model endpoints, you can deploy the Agentic Ingestion service directly to your existing cluster. This is often the most cost-effective option since you're leveraging infrastructure you're already maintaining.

Managed Services

Managed services like Azure Container Instances, Google Cloud Run, and AWS Fargate offer the fastest deployment path. They provide automatic maintenance, built-in scaling, comprehensive monitoring, and high availability. However, they come with higher ongoing costs and less configuration control.

Self-hosted Kubernetes

Self-hosting provides complete cost control and customization capabilities while maintaining data sovereignty. The trade-off is significant operational overhead requiring Kubernetes expertise, plus responsibility for all maintenance, updates, and monitoring.

Our Recommendation

Use your existing Kubernetes infrastructure if you already have a running cluster with available capacity. Start with a managed service if you lack Kubernetes experience. Consider self-hosting if you have a dedicated infrastructure team or strict compliance requirements.

Budget

The resources you need to dedicate vary with the volume of documents processed and the number of concurrent users.

The cost incurred will depend on the pricing model of your cloud provider and AI model usage.

Examples

These are rough estimates. Actual costs depend on usage patterns, deployment regions, and pricing variations, so treat the figures below as indicative only.

Example 1: Small Scale

  • 100 users, 2 replicas, 1000 documents/month

  • 2 CPUs * 2 replicas = 4 CPUs

  • 2 Gi Memory * 2 replicas = 4 Gi Memory

  • MDI costs: ~$200/month

For Azure pricing in Switzerland North this can be:

  • 2 D4s v5 nodes (2 * $185 = $370)

  • MDI costs: $200

  • Total: ~$570 per month

Example 2: Enterprise Scale

  • 5000 users, 4 replicas, 50000 documents/month

  • 2 CPUs * 4 replicas = 8 CPUs

  • 2 Gi Memory * 4 replicas = 8 Gi Memory

  • MDI costs: ~$2000/month

For Azure pricing in Switzerland North this can be:

  • 4 D4s v5 nodes (4 * $185 = $740)

  • MDI costs: $2000

  • Total: ~$2740 per month

Note: AI service costs vary significantly based on which services are enabled:

  • PDF extraction costs depend on Document Intelligence API usage

  • Image content extraction costs depend on vision model API calls per figure

  • Metadata extraction costs depend on LLM token usage
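The arithmetic behind the two examples above can be reproduced with a small helper. This is a minimal sketch: the function name is hypothetical, the $185 node price is taken from the examples rather than from a current price list, and LLM token costs are left out because the examples do not quantify them.

```python
def monthly_cost_usd(node_count: int, node_price_usd: float, mdi_cost_usd: float) -> float:
    """Return compute plus MDI cost per month (LLM token costs are not included)."""
    return node_count * node_price_usd + mdi_cost_usd


# Example 1: 2 nodes at $185 each plus $200 of MDI usage.
small = monthly_cost_usd(node_count=2, node_price_usd=185, mdi_cost_usd=200)
# Example 2: 4 nodes at $185 each plus $2000 of MDI usage.
enterprise = monthly_cost_usd(node_count=4, node_price_usd=185, mdi_cost_usd=2000)
print(small, enterprise)  # 570 2740
```

Substituting your own node price and expected MDI spend gives a first-order estimate for other regions or providers.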

Provisioning

Prerequisites

  • Infrastructure:

  1. Kubernetes Cluster: Version 1.24 or higher

  2. Redis: Version 6.0 or higher (for job queue management)

  3. Container Registry: Access to push/pull container images

  • Service Dependencies:

  1. Unique AI API (node-chat)

  2. AI service endpoints (Azure OpenAI, Document Intelligence, etc.)
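The minimum versions listed above (Kubernetes 1.24, Redis 6.0) can be checked programmatically before deployment. The sketch below is an assumption-laden illustration: the helper name is hypothetical, it compares only the major and minor components, and it expects version strings shaped like "v1.27.3" or "6.2.14"; obtaining those strings from your cluster or Redis instance is left to your tooling.

```python
def meets_minimum(version: str, minimum: str) -> bool:
    """Compare major.minor numerically, ignoring a leading 'v' and any patch suffix."""
    def major_minor(value: str) -> tuple:
        parts = value.lstrip("v").split(".")[:2]
        return tuple(int(part) for part in parts)

    return major_minor(version) >= major_minor(minimum)


print(meets_minimum("v1.27.3", "1.24"))  # True
print(meets_minimum("1.9.0", "1.24"))    # False: 9 < 24 when compared as numbers
print(meets_minimum("6.0.16", "6.0"))    # True
```

Comparing integers rather than strings avoids the common mistake of treating "1.9" as newer than "1.24".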

Deployment

We recommend using Helm charts for deployment to Kubernetes.

Helm Chart: unique/backend-service

Chart Version: >= 9.0.1

All three services (Metadata Extraction, PDF Content Extraction, and Image Content Extraction) are deployed together as a single bundle using the same Helm chart. Services can be enabled/disabled via feature flags.

For complete deployment instructions, environment variables, and configuration:

  • Metadata Extraction Service Documentation

  • PDF Content Extraction Service Documentation

  • Image Content Extraction Service Documentation

For resource allocation recommendations, see Sizing below.

Connectivity Requirements

| Source | Destination | Protocol | Purpose |
| --- | --- | --- | --- |
| node-ingestion-worker | agentic-ingestion | HTTP (port 8081) | Job creation and polling for both PDF and image extraction |
| agentic-ingestion | Redis | TCP (port 6379/TLS) | Job queue and result storage |
| agentic-ingestion | node-chat (API_BASE) | HTTPS | LLM completions via platform API gateway |
| agentic-ingestion | Azure MDI | HTTPS | Document Intelligence analysis (PDF extraction only) |
| node-ingestion-worker | Azure MDI | HTTPS | Direct MDI analysis with figure extraction |

Sizing

Compute / Memory

We recommend a deployment with 2-4 replicas for high availability.

Recommended resource allocations per replica are as follows:

| Concurrent Users | Replica CPU / Memory | Recommended Replicas |
| --- | --- | --- |
| 10-50 | 1000-2000m / 2-4Gi | 2 |
| 50-200 | 2000-3000m / 4-6Gi | 2-3 |
| 200-1000 | 3000-4000m / 6-8Gi | 3-4 |
| 1000+ | 4000-5000m / 8-10Gi | 4+ |

Note on Image Content Extraction: When image content extraction is heavily used (many figures per document), consider increasing replicas or worker counts, as each figure requires an individual AI model call processed through the Redis job queue.
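The sizing tiers above can be expressed as a simple lookup, for example to sanity-check Helm values in a deployment script. This is a sketch with assumptions stated plainly: the names Sizing and recommend_sizing are hypothetical, boundary values (50, 200, 1000) are assigned to the higher tier because the ranges in the table overlap at their edges, and fewer than 10 users falls back to the smallest tier.

```python
from typing import NamedTuple


class Sizing(NamedTuple):
    cpu: str
    memory: str
    replicas: str


# (exclusive upper bound on concurrent users, recommendation per replica)
_TIERS = [
    (50, Sizing("1000-2000m", "2-4Gi", "2")),
    (200, Sizing("2000-3000m", "4-6Gi", "2-3")),
    (1000, Sizing("3000-4000m", "6-8Gi", "3-4")),
]
_LARGEST = Sizing("4000-5000m", "8-10Gi", "4+")


def recommend_sizing(concurrent_users: int) -> Sizing:
    """Return the recommended per-replica resources for a concurrent user count."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must not be negative")
    for upper_bound, sizing in _TIERS:
        if concurrent_users < upper_bound:
            return sizing
    return _LARGEST


print(recommend_sizing(120))
```

Workloads with many figures per document may need more than the lookup suggests, as described in the note above.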

Storage

The service requires minimal persistent storage for:

  • Temporary file processing (ephemeral)

  • Log storage (if enabled)

  • Debug output (if SAVE_RESULTS_LOCALLY=true)

Storage should be SSD-based and support volume expansion.

Redis Requirements

  • Memory: 1-2GB for job queue management

  • CPU: 500-1000m for queue processing

  • Persistence: Enabled for job durability

Redis hosts separate queues per service:

| Queue Name | Service | Result Prefix |
| --- | --- | --- |
| taskiq:pdf-content-extraction | PDF Document Extraction | taskiq:pdf-res |
| taskiq:image-content-extraction | Image Content Extraction | taskiq:image-res |


Initial Setup

Prerequisites

  1. Agentic Ingestion service must be up and running

  2. node-ingestion-worker must be configured with AGENTIC_INGESTION_BASE_URL pointing to the agentic-ingestion service

Verification Steps

  1. Check Service Health

    bash
    kubectl get pods -n chat -l app.kubernetes.io/name=agentic-ingestion
    kubectl logs -n chat deployment/agentic-ingestion | grep "Successfully connected"
  2. Check probe endpoint

    bash
    curl -X POST http://agentic-ingestion-service/probe
  3. Verify Image Content Extraction endpoint

    bash
    curl -X POST http://agentic-ingestion-service/image-content-extraction/images/extractions \
      -H "Content-Type: application/json" \
      -H "x-user-id: test" \
      -d '{"companyId": "test", "data": "<base64-image>"}'
  4. Verify node-ingestion-worker connectivity

    Check that node-ingestion-worker can reach the agentic-ingestion service:

    bash
    kubectl logs -n chat deployment/node-ingestion-worker | grep "agentic-ingestion"
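The request body used in step 3 contains a `<base64-image>` placeholder. The sketch below shows one way to build that body from raw image bytes. The field names companyId and data come from the curl example above; the helper name is hypothetical, and the assumption that data carries plain base64 (rather than a data URL) is not confirmed on this page.

```python
import base64
import json


def build_extraction_body(image_bytes: bytes, company_id: str) -> str:
    """Serialize an image extraction request body as JSON."""
    payload = {
        "companyId": company_id,
        # Assumption: the endpoint expects plain base64 without a data-URL prefix.
        "data": base64.b64encode(image_bytes).decode("ascii"),
    }
    return json.dumps(payload)


# Demo with a few placeholder bytes instead of a real figure crop.
body = build_extraction_body(b"\x89PNG\r\n", company_id="test")
print(body)
```

The resulting string can be passed to curl with `-d`, or sent with any HTTP client, together with the Content-Type and x-user-id headers shown in step 3.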

Performance Configuration

agentic-ingestion Service

| Variable | Default | Description |
| --- | --- | --- |
| MAX_WORKERS | 4 | Maximum number of concurrent job workers (shared across all modules) |
| GUNICORN_WORKERS | 2 | Number of Gunicorn worker processes |
| GUNICORN_THREADS | 4 | Number of threads per worker |
| REDIS_JOB_TTL_SECONDS | 3600 | Job timeout in seconds |
| IMAGE_CONTENT_EXTRACTION_MAX_WORKERS | 4 | Max async tasks for image extraction in-process worker |
| IMAGE_CONTENT_EXTRACTION_CHAT_COMPLETION_TIMEOUT | 240000 | Timeout in ms for LLM chat completion calls (image extraction) |
| IMAGE_CONTENT_EXTRACTION_STRATEGY | ONE_STEP | Default extraction strategy (ONE_STEP or TWO_STEP) |
| IMAGE_CONTENT_EXTRACTION_REDIS_JOB_TTL_SECONDS | 3600 | TTL for image extraction job results in Redis |
| FEATURE_FLAG_ENABLE_IMAGE_CONTENT_EXTRACTION_UN_17223 | false | Enable/disable the image content extraction module |
| FEATURE_FLAG_ENABLE_AGENTIC_INGESTION_PAGE_BATCH_FIGURE_EXTRACTION_UN_19457 | false | Enables page-batch figure extraction in node-ingestion-worker |
| FEATURE_FLAG_ENABLE_ATOMIC_FIGURE_CHUNKING_UN_19136 | false | Keeps extracted <figure> blocks atomic during markdown chunking in node-ingestion-worker |

node-ingestion-worker Configuration

| Variable | Default | Description |
| --- | --- | --- |
| AGENTIC_INGESTION_BASE_URL | "" | Base URL of the agentic-ingestion service (e.g., http://agentic-ingestion.chat.svc:8081) |
| AGENTIC_INGESTION_IMAGE_POLLING_DURATION_MS | 3000 | Polling interval in ms between status checks for image extraction jobs |
| AGENTIC_INGESTION_IMAGE_TIMEOUT_MS | 300000 | Maximum time in ms to wait for an image extraction job to complete |
| AGENTIC_INGESTION_IMAGE_MAX_RETRIES | 3 | Maximum retries per HTTP request (create job + each poll) |
| AGENTIC_INGESTION_IMAGE_MIN_TIMEOUT_MS | 1000 | Initial backoff delay for retries |
| AGENTIC_INGESTION_IMAGE_BACKOFF_FACTOR | 2 | Exponential backoff multiplier for retries |
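To see what the retry settings above imply in practice, the delay before each retry can be computed directly. The sketch assumes the conventional exponential backoff formula (initial delay multiplied by the factor raised to the attempt index) with no jitter and no upper cap; whether node-ingestion-worker applies jitter or a maximum delay is not documented here. The function name is hypothetical.

```python
def backoff_delays_ms(max_retries: int = 3, min_timeout_ms: int = 1000, backoff_factor: int = 2) -> list:
    """Return the assumed wait before each retry: min_timeout * factor ** attempt."""
    return [min_timeout_ms * backoff_factor ** attempt for attempt in range(max_retries)]


delays = backoff_delays_ms()
print(delays)       # [1000, 2000, 4000]
print(sum(delays))  # 7000
```

Under these assumptions and the default values, a single HTTP request that fails every time would spend about 7 seconds waiting between retries before giving up.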

Service-Specific Documentation

Each service has dedicated documentation covering:

  • Detailed configuration and environment variables

  • API endpoints and usage

  • Architecture and processing flow

  • Troubleshooting and monitoring

Metadata Extraction Service

PDF Content Extraction Service

Image Content Extraction Service

  • Per-figure vision-based content extraction

  • One-step and two-step extraction strategies with automatic fallback

  • Integrated with node-ingestion-worker via dedicated adapter

  • Documentation: Agentic Image Content Extraction for Infra Admins


Operating & Troubleshooting

Authentication Methods

The service uses:

  • API Keys: For Unique AI API access (via API_BASE)

  • Service Principal: For Azure OpenAI access

  • Redis Authentication: Username/password or TLS certificates

Troubleshooting

How can I verify that the service is connected to Redis?
You should find "Successfully connected to Redis" in the logs when the service starts. Otherwise, connection errors will be displayed.

How can I verify that image content extraction is working?
Check the agentic-ingestion logs for image-content-extraction related messages. Successful extractions will show job creation and completion. In node-ingestion-worker, look for logs from the AgenticIngestionImageExtractionAdapter.

How can service outages and processing failures be monitored?

  • Monitor pod health with Kubernetes probes

  • Check Redis queue length for both taskiq:pdf-content-extraction and taskiq:image-content-extraction

  • Review application logs for error patterns

  • Set up alerts for failed job processing
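The queue-length check in the list above can be scripted. This is a minimal sketch that takes a caller-supplied queue_length function: the helper name and the threshold of 100 are illustrative choices, not platform defaults. With redis-py, `client.llen` would fit if the queues are Redis lists and `client.xlen` if they are streams; which one applies depends on the taskiq broker in use, which is not stated on this page.

```python
from typing import Callable, Dict

QUEUES = ("taskiq:pdf-content-extraction", "taskiq:image-content-extraction")


def backlogged_queues(queue_length: Callable[[str], int], threshold: int = 100) -> Dict[str, int]:
    """Return the queues whose current depth exceeds the threshold."""
    depths = {name: queue_length(name) for name in QUEUES}
    return {name: depth for name, depth in depths.items() if depth > threshold}


# Demo with a dictionary lookup standing in for a Redis client call.
fake_depths = {"taskiq:pdf-content-extraction": 3, "taskiq:image-content-extraction": 250}
print(backlogged_queues(fake_depths.__getitem__))
```

A non-empty result could feed whatever alerting mechanism you already use; the right threshold depends on your worker count and typical figure volume.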

How can we resolve failures during document processing?

  1. Check MDI service availability

  2. Verify API_BASE endpoint access

  3. Review job logs for specific error messages

  4. Check Redis connectivity and queue health

  5. Restart failed jobs using the API

Image extraction jobs timing out

If image extraction jobs are timing out:

  • Increase AGENTIC_INGESTION_IMAGE_TIMEOUT_MS on node-ingestion-worker

  • Check IMAGE_CONTENT_EXTRACTION_CHAT_COMPLETION_TIMEOUT on agentic-ingestion

  • Verify LLM endpoint response times (vision calls can be slower than text-only)

  • Check if Redis is experiencing high load
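When tuning the two timeouts named above, it helps to check that they are consistent with each other. The sketch below encodes one plausible rule, stated here as an assumption rather than documented behavior: the worker-side wait should be longer than a single LLM completion timeout, otherwise the worker may give up while a call is still in flight. The function name is hypothetical; the defaults are the documented ones.

```python
def check_image_timeouts(
    worker_timeout_ms: int = 300_000,      # AGENTIC_INGESTION_IMAGE_TIMEOUT_MS
    completion_timeout_ms: int = 240_000,  # IMAGE_CONTENT_EXTRACTION_CHAT_COMPLETION_TIMEOUT
    polling_interval_ms: int = 3_000,      # AGENTIC_INGESTION_IMAGE_POLLING_DURATION_MS
) -> tuple:
    """Return (maximum number of polls, list of warnings) for the given settings."""
    warnings = []
    if worker_timeout_ms <= completion_timeout_ms:
        warnings.append(
            "worker timeout does not exceed the LLM completion timeout; "
            "jobs may be abandoned while a completion is still running"
        )
    max_polls = worker_timeout_ms // polling_interval_ms
    return max_polls, warnings


max_polls, warnings = check_image_timeouts()
print(max_polls, warnings)  # 100 []
```

With the defaults, the worker polls up to 100 times and leaves 60 seconds of headroom beyond one completion timeout.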

When should the memory / CPU be increased?
Increase them when utilization reaches 80% of the allocated resources or when job processing times exceed acceptable thresholds. Image content extraction may require more resources when processing documents with many figures concurrently.

How long does document processing take?
Processing time depends on:

  • Document complexity and size

  • Extraction method and path used

  • Number of figures per page (for image content extraction)

As a rough guide, processing typically takes 15 seconds to 5 minutes per document, and image extraction adds 2-15 seconds per page depending on figure count.

Will there be any downtime for updating / upgrading versions?
No downtime should be expected as Kubernetes will perform rolling updates one replica at a time.

Monitoring

Key Metrics to Monitor:

  • Job queue length (per queue: PDF extraction and image extraction)

  • Processing time per document and per figure

  • Success/failure rates per service

  • Resource utilization (CPU, Memory)

  • AI model API response times (MDI and LLM)

  • Redis connection health and memory usage

Log Analysis:

bash
# View recent agentic-ingestion logs
kubectl logs -n chat deployment/agentic-ingestion --tail=100

# Filter for image extraction errors
kubectl logs -n chat deployment/agentic-ingestion | grep -i "image-content-extraction" | grep ERROR

# View node-ingestion-worker adapter logs
kubectl logs -n chat deployment/node-ingestion-worker | grep -i "agentic.*image"

# Monitor job processing across services
kubectl logs -n chat deployment/agentic-ingestion | grep "Processing job"