Agentic Metadata Extraction for Infra Admins
Description
The Agentic Metadata Extraction Service automatically extracts structured metadata from document content using Large Language Models (LLMs). Unlike traditional metadata extraction, which relies on document properties, this service analyzes the actual text content to extract business-relevant metadata fields.
Processing Flow
View diagram “metadata-extraction-flow” in Confluence
How it works
User uploads document → Content is ingested and chunked
Platform sends webhook to /metadata-extraction/webhook
Service fetches document chunks and configuration
Chunks merged up to maxInputTokens limit
LLM extracts metadata based on configured schema
Extracted metadata merged with existing content metadata
Returns 200 OK once synchronous processing is complete
The service processes already-ingested content chunks, not raw documents. Processing is synchronous - the webhook completes when extraction finishes.
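The chunk-merging step above can be sketched as follows. This is a minimal illustration, not the service's actual code: `merge_chunks` and `count_tokens_approx` are hypothetical names, the whitespace word count is only a stand-in for the configured token encoder, and stopping at the first chunk that does not fit is an assumption about how the limit is applied.

```python
from typing import Callable, List


def count_tokens_approx(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def merge_chunks(chunks: List[str], max_input_tokens: int,
                 count_tokens: Callable[[str], int] = count_tokens_approx) -> str:
    """Concatenate chunks in order until the next one would exceed the limit."""
    merged: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_input_tokens:
            break  # assumption: stop at the first chunk that does not fit
        merged.append(chunk)
        used += cost
    return "\n".join(merged)


chunks = [
    "Invoice 2024-001 from ACME",
    "Total due: 1200 CHF",
    "Payment terms: 30 days",
]
# The first two chunks use 4 + 4 tokens; the third would push the total to 12.
merged = merge_chunks(chunks, max_input_tokens=8)
```

With a real encoder, `count_tokens` would be swapped for a function that returns the encoder's token count for a string.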
Code Path (agentic-ingestion)
POST /metadata-extraction
└─ Receive content + metadata schema
└─ LLM completion with structured output
└─ Return extracted metadata fields per schema
Provisioning
Prerequisites
See Agentic Ingestion Infrastructure for Kubernetes and general infrastructure requirements.
Service-Specific Requirements
Feature flag: FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619=true
Unique AI API access for content retrieval and updates
LLM endpoints (Azure OpenAI or compatible) supporting structured output
Network access to Unique AI API and LLM endpoints
Deployment
Environment Variables and Secrets
Change | Environment Variable Name | Application Default Value | Example | Required | Applications | Short Description |
|---|---|---|---|---|---|---|
Added | | | | No | | Enable metadata extraction |
Added | | | | No | | Token encoder |
Added | | | | Yes | | Comma-separated list of available LLM models for UI selection |
Added | | | | No | | |
Sizing
Performance
Scenario | Processing time |
|---|---|
Simple (3-5 fields) | 2-5 seconds |
Complex (10+ fields) | 5-15 seconds |
Resource recommendations
We recommend a deployment with 2-4 replicas for high availability.
Recommended resource allocations per replica:
Concurrent Users | Docs/Month | Replica CPU / Memory | Recommended Replicas |
|---|---|---|---|
10-100 | 1000-5000 | 1000-2000m / 2-4Gi | 2 |
100-500 | 5000-20000 | 2000-3000m / 4-6Gi | 2-3 |
Cost Estimation
LLM costs (GPT-4o example):
~$0.05-0.15 per document
10,000 docs/month: ~$500-1500/month
Costs may vary by model, maxInputTokens and schema complexity.
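The monthly figure above follows directly from the per-document range. A small sketch of that arithmetic, using the $0.05-0.15 range from this section as defaults (the function name is hypothetical, and amounts are kept in integer cents to avoid floating-point rounding):

```python
from typing import Tuple


def monthly_llm_cost_usd(docs_per_month: int,
                         low_cents_per_doc: int = 5,
                         high_cents_per_doc: int = 15) -> Tuple[float, float]:
    """Return the (low, high) monthly LLM cost estimate in US dollars."""
    low = docs_per_month * low_cents_per_doc / 100
    high = docs_per_month * high_cents_per_doc / 100
    return low, high


low, high = monthly_llm_cost_usd(10_000)
print(f"~${low:,.0f}-{high:,.0f}/month")
```

The per-document range is only a rough guide, so it is worth replacing the defaults with figures measured in your own environment.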
Configuration
Webhook Setup
To trigger the Agentic Metadata Extraction app, we need to send a POST request to the /metadata-extraction/webhook endpoint.
The webhook is triggered automatically when content ingestion completes (it subscribes to the unique.ingestion.content.finished event)
The app fetches the content's appliedIngestionConfig to check if metadata extraction is enabled
If metadata extraction is configured with a schema and enabled, it extracts structured metadata using the specified LLM
The extracted metadata is then written back to the content record
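For manual testing, the POST request can be built by hand. The sketch below uses only the Python standard library and the payload shape documented under API Endpoints on this page. The base URL is a placeholder, the function name is hypothetical, and any authentication headers your deployment requires are not shown because they are not specified here.

```python
import json
import urllib.request


def build_webhook_request(base_url: str, company_id: str, user_id: str,
                          content_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the extraction webhook."""
    body = {
        # Event string copied from the payload example on this page.
        "event": "unique.content.ingestion.finished",
        "companyId": company_id,
        "userId": user_id,
        "payload": {"contentId": content_id},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/metadata-extraction/webhook",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder base URL (assumption); replace with the service address.
request = build_webhook_request(
    "http://localhost:8080", "company-123", "user-456", "content-789"
)
# urllib.request.urlopen(request) would send it. Because processing is
# synchronous, the call returns only after extraction has finished.
```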

Metadata Schema
Schemas are configured per-folder in the Unique AI platform UI (Knowledge Upload → Folder Settings → Configure file ingestion).

Field Properties
Key | Description | Value |
|---|---|---|
type | Data type | string, number, boolean |
description | Natural language description to guide LLM | Text |
required | Whether field must be extracted or not | true/false |
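To show how these three properties constrain a result, here is a minimal sketch that checks an extracted metadata dictionary against a schema. `validate_metadata` is a hypothetical helper and is not part of the service; it only illustrates the meaning of `type` and `required`.

```python
from typing import Any, Dict, List

# Maps the schema's type names to Python type checks.
_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def validate_metadata(schema: Dict[str, Dict[str, Any]],
                      extracted: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the result fits the schema."""
    problems: List[str] = []
    for name, spec in schema.items():
        if extracted.get(name) is None:
            if spec.get("required", False):
                problems.append(f"missing required field: {name}")
            continue
        if not _TYPE_CHECKS[spec["type"]](extracted[name]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


schema = {
    "summary": {"type": "string", "required": True,
                "description": "A brief one-sentence summary of the document"},
    "pageCount": {"type": "number", "required": False,
                  "description": "Number of pages (hypothetical field)"},
}
problems = validate_metadata(schema, {"pageCount": "twelve"})
```

Here `problems` reports the missing `summary` and the non-numeric `pageCount`.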
Example Metadata Schema
{
  "summary": {
    "type": "string",
    "required": true,
    "description": "A brief one-sentence summary of the document"
  },
  "documentType": {
    "type": "string",
    "required": true,
    "description": "What type of document is this? (e.g., invoice, report, email)"
  }
}
LLM Configuration
Language Model Selection
Models listed in the env variable (METADATA_EXTRACTION_LANGUAGE_MODELS) are available in this dropdown.
LLMs must be provided by Azure OpenAI and support Structured Output.
The model selected from this dropdown is used for metadata extraction.
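Since the variable holds a comma-separated list, a quick way to see which entries the dropdown would receive is to split it the same way. This is a sketch under the assumption that whitespace and empty entries are ignored. The function name and the model names in the example are placeholders.

```python
import os
from typing import List, Mapping


def available_models(env: Mapping[str, str] = os.environ) -> List[str]:
    """Split METADATA_EXTRACTION_LANGUAGE_MODELS into a clean list of names."""
    raw = env.get("METADATA_EXTRACTION_LANGUAGE_MODELS", "")
    # Assumption: surrounding whitespace and empty entries are dropped.
    return [name.strip() for name in raw.split(",") if name.strip()]


# Placeholder model names, not a recommendation.
models = available_models(
    {"METADATA_EXTRACTION_LANGUAGE_MODELS": "model-a, model-b,,"}
)
```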

Input Tokens
Configurable in the range 1000-10000. The default value shown in the box comes from the env variable (DEFAULT_MAX_TOKENS_USED_FOR_METADATA_EXTRACTION).
This defines the maximum token limit for data sent to the LLM.
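The interplay between the per-folder value and the environment default can be sketched as below. The function name is hypothetical, and rejecting out-of-range values with an error is an assumption; the page only states the allowed range and where the default comes from.

```python
import os
from typing import Mapping, Optional

MIN_INPUT_TOKENS = 1000
MAX_INPUT_TOKENS = 10000


def resolve_max_input_tokens(folder_value: Optional[int],
                             env: Mapping[str, str] = os.environ) -> int:
    """Use the folder setting if present, else the environment default."""
    if folder_value is None:
        folder_value = int(env["DEFAULT_MAX_TOKENS_USED_FOR_METADATA_EXTRACTION"])
    if not MIN_INPUT_TOKENS <= folder_value <= MAX_INPUT_TOKENS:
        # Assumption: out-of-range values are treated as a configuration error.
        raise ValueError(
            f"max input tokens must be between {MIN_INPUT_TOKENS} "
            f"and {MAX_INPUT_TOKENS}, got {folder_value}"
        )
    return folder_value


env = {"DEFAULT_MAX_TOKENS_USED_FOR_METADATA_EXTRACTION": "4000"}
from_default = resolve_max_input_tokens(None, env)
from_folder = resolve_max_input_tokens(8000, env)
```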

Operating & Troubleshooting
Authentication Methods
The service uses:
API Keys: For Unique AI API access
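Access to the Unique AI API depends on the credentials being present in the service environment. A minimal pre-flight sketch, using the four variable names listed in the Troubleshooting section of this page (the helper name is hypothetical):

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("API_BASE", "TOKEN_URL", "CLIENT_ID", "CLIENT_SECRET")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with a partially configured environment.
missing = missing_env_vars({
    "API_BASE": "https://your-api-url.com",
    "CLIENT_ID": "your-client-id",
    "CLIENT_SECRET": "   ",
})
```

Running `missing_env_vars()` without arguments inside the pod checks the real process environment.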
API Endpoints
Webhook
Endpoint: POST /metadata-extraction/webhook
Payload:
{
  "event": "unique.content.ingestion.finished",
  "companyId": "company-123",
  "userId": "user-456",
  "payload": {
    "contentId": "content-789"
  }
}
Troubleshooting
Why is the metadata extraction feature not visible in the UI?
Verify FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619=true is set in the Frontend (knowledge-upload app)
Why is metadata not visible after ingestion has finished?
Verify FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619=true is set in the Backend (agentic-ingestion app)
Check that the module is loaded:
kubectl logs deployment/agentic-ingestion | grep "Registered blueprint for metadata_extraction"
Verify that the webhook is configured properly and triggered
Verify authentication is working
Check that all required environment variables are set in the Python service:
API_BASE=https://your-api-url.com
TOKEN_URL=https://your-auth-url.com/oauth/token
CLIENT_ID=your-client-id
CLIENT_SECRET=your-client-secret
In logs, look for:
"Error fetching or parsing metadata config for content" → API access issue
"Metadata settings not available" → Initialization failure
Monitoring
Key Metrics to Monitor:
Webhook processing lag/queue length
Processing time per document
Success/failure rates
Resource utilization (CPU, Memory)
LLM API response times
Content API health & authentication success rate
Token consumption & costs
Extraction quality (fields extracted, empty extraction rate)
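Several of these metrics can be derived from per-document processing records. The sketch below assumes a hypothetical record shape (`ExtractionRecord`) and is not tied to any particular monitoring stack; it only shows how success rate, empty extraction rate, and average processing time relate to each other.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExtractionRecord:
    """Hypothetical per-document record; field names are assumptions."""
    succeeded: bool
    fields_extracted: int
    seconds: float


def summarize(records: List[ExtractionRecord]) -> Dict[str, float]:
    """Compute success rate, empty extraction rate, and mean processing time."""
    total = len(records)
    if total == 0:
        return {"success_rate": 0.0, "empty_extraction_rate": 0.0, "avg_seconds": 0.0}
    succeeded = [r for r in records if r.succeeded]
    # An empty extraction is a successful run that returned no fields.
    empty = [r for r in succeeded if r.fields_extracted == 0]
    return {
        "success_rate": len(succeeded) / total,
        "empty_extraction_rate": len(empty) / len(succeeded) if succeeded else 0.0,
        "avg_seconds": sum(r.seconds for r in records) / total,
    }


records = [
    ExtractionRecord(True, 3, 2.0),
    ExtractionRecord(True, 0, 4.0),
    ExtractionRecord(False, 0, 6.0),
    ExtractionRecord(True, 5, 4.0),
]
summary = summarize(records)
```

A rising empty extraction rate with a stable success rate usually points at schema descriptions or input limits rather than infrastructure.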