Agentic Metadata Extraction for Infra Admins


This feature is currently in `BETA`. It may change before general availability in response to user and client feedback, but it is targeted to be high quality and stable. Documentation may lag behind feature updates. Use in production environments at your own discretion. Please refer to our Upgrade and Release Process for more information.

Description

The Agentic Metadata Extraction Service automatically extracts structured metadata from document content using Large Language Models (LLMs). Unlike traditional metadata extraction, which relies on document properties, this service analyzes the actual text content to extract business-relevant metadata fields.

Processing Flow

View diagram “metadata-extraction-flow” in Confluence

How it works

User uploads document → Content is ingested and chunked
Platform sends webhook to /metadata-extraction/webhook
Service fetches document chunks and configuration
Chunks merged up to maxInputTokens limit
LLM extracts metadata based on configured schema
Extracted metadata merged with existing content metadata
Service returns 200 OK once synchronous processing is complete

The service processes already-ingested content chunks, not raw documents. Processing is synchronous - the webhook completes when extraction finishes.
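The chunk-merging step in the flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's implementation: `merge_chunks` and `count_tokens` are hypothetical names, the whitespace-based counter stands in for the real token encoder, and dropping the first chunk that does not fit (rather than truncating it) is an assumption.

```python
from typing import Callable, Iterable


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): counts whitespace-separated words.
    # The real service would use a proper token encoder instead.
    return len(text.split())


def merge_chunks(
    chunks: Iterable[str],
    max_input_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> str:
    """Concatenate chunks in order until the next one would exceed the budget."""
    merged: list[str] = []
    used = 0
    for chunk in chunks:
        cost = counter(chunk)
        if used + cost > max_input_tokens:
            break  # stop at the first chunk that does not fit
        merged.append(chunk)
        used += cost
    return "\n".join(merged)
```

With a budget of 5 "tokens", `merge_chunks(["a b c", "d e", "f g h i"], 5)` keeps the first two chunks and stops before the third.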

Code Path (agentic-ingestion)

POST /metadata-extraction
  └─ Receive content + metadata schema
      └─ LLM completion with structured output
          └─ Return extracted metadata fields per schema

Provisioning

Prerequisites

See Agentic Ingestion Infrastructure for Kubernetes and general infrastructure requirements.

Service-Specific Requirements

Feature flag: FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619=true
Unique AI API access for content retrieval and updates
LLM endpoints (Azure OpenAI or compatible) supporting structured output
Network access to Unique AI API and LLM endpoints

Deployment

Environment Variables and Secrets

Change	Environment Variable Name	Application Default Value (if env variable unset)	Example	Required	Applications	Short Description
Added	`FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619`	`"false"`	`"true"`	No	`web-app-knowledge-upload` `backend-service-agentic-ingestion`	Enable metadata extraction
Added	`DEFAULT_ENCODER_NAME`	`"o200k_base"`	`"o200k_base"`	No	`backend-service-agentic-ingestion`	Token encoder
Added	`METADATA_EXTRACTION_LANGUAGE_MODELS`	`""`	`"AZURE_GPT_4o_2024_0806:GPT-4o (2024-08-06),AZURE_GPT_4o_2024_1120:GPT-4o (2024-11-20)"`	Yes	`web-app-knowledge-upload`	Comma-separated list of available LLM models for UI selection
Added	`DEFAULT_MAX_TOKENS_USED_FOR_METADATA_EXTRACTION`	`10000`	`10000`	No	`web-app-knowledge-upload`	Default maximum input tokens shown in the UI
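The example value for `METADATA_EXTRACTION_LANGUAGE_MODELS` suggests a comma-separated list of `ID:Display name` pairs. The sketch below parses that format; the function name `parse_language_models` is hypothetical, and the exact parsing rules used by the application are an assumption inferred from the example value only.

```python
def parse_language_models(raw: str) -> list[tuple[str, str]]:
    """Parse 'ID:Label,ID:Label' into a list of (model_id, label) pairs."""
    models: list[tuple[str, str]] = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty values and trailing commas
        # Split on the first colon only, so labels may contain further colons.
        model_id, sep, label = entry.partition(":")
        model_id = model_id.strip()
        # Assumption: fall back to the ID when no label is given.
        models.append((model_id, label.strip() if sep else model_id))
    return models
```

An unset variable (empty string) yields an empty list, which would leave the model dropdown without entries.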

Sizing

Performance

Scenario	Processing time
Simple (3-5 fields)	2-5 seconds
Complex (10+ fields)	5-15 seconds

Resource recommendations

We recommend a deployment with 2-4 replicas for High Availability.

Recommended allocations per replica:

Concurrent Users	Docs/Month	Replica CPU / Memory	Recommended Replicas
10-100	1000-5000	1000-2000m / 2-4Gi	2
100-500	5000-20000	2000-3000m / 4-6Gi	2-3

Cost Estimation

LLM costs (GPT-4o example):

~$0.05-0.15 per document
10,000 docs/month: ~$500-1500/month

Costs vary by model, maxInputTokens, and schema complexity.
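The monthly estimate is the per-document range multiplied by the monthly document volume. The helper below reproduces that arithmetic for other volumes; `monthly_cost_range` is a hypothetical name, and the default per-document costs are the GPT-4o example figures from this section, not measured values.

```python
def monthly_cost_range(
    docs_per_month: int,
    low_per_doc: float = 0.05,
    high_per_doc: float = 0.15,
) -> tuple[float, float]:
    """Return the (low, high) estimated monthly LLM cost in USD."""
    # Round to cents to avoid floating-point noise in the result.
    return (
        round(docs_per_month * low_per_doc, 2),
        round(docs_per_month * high_per_doc, 2),
    )
```

For 10,000 documents per month this returns the 500 to 1500 USD range quoted above.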

Configuration

Webhook Setup

The Agentic Metadata Extraction app is triggered by a POST request to the /metadata-extraction/webhook endpoint.

The webhook is triggered automatically when content ingestion completes (it subscribes to the unique.ingestion.content.finished event)
The app fetches the content's appliedIngestionConfig to check if metadata extraction is enabled
If metadata extraction is configured with a schema and enabled, it extracts structured metadata using the specified LLM
The extracted metadata is then updated back to the content record
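The enable/skip decision described above can be sketched as a small guard function. The layout of appliedIngestionConfig is not documented here, so the key names `metadataExtraction`, `enabled`, and `schema` are hypothetical placeholders, as is the function name.

```python
from typing import Any, Mapping, Optional


def resolve_extraction_settings(
    applied_ingestion_config: Optional[Mapping[str, Any]],
) -> Optional[Mapping[str, Any]]:
    """Return the extraction settings, or None when extraction should be skipped."""
    if not applied_ingestion_config:
        return None
    # Hypothetical key names; the real config layout may differ.
    settings = applied_ingestion_config.get("metadataExtraction") or {}
    # Extraction runs only when it is enabled AND a non-empty schema is configured.
    if not settings.get("enabled") or not settings.get("schema"):
        return None
    return settings
```

Returning `None` for every skip case keeps the caller simple: it extracts only when it receives a settings object.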

Metadata Schema

Schemas are configured per-folder in the Unique AI platform UI (Knowledge Upload → Folder Settings → Configure file ingestion).

Field Properties

Key	Description	Value
type	Data type	string, number, boolean
description	Natural language description to guide LLM	Text
required	Whether the field must be extracted	true/false

Example Metadata Schema

json

{
  "summary": {
    "type": "string",
    "required": true,
    "description": "A brief one-sentence summary of the document"
  },
  "documentType": {
    "type": "string",
    "required": true,
    "description": "What type of document is this? (e.g., invoice, report, email)"
  }
}
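For LLMs that accept a JSON Schema for structured output, a metadata schema like the one above could be translated as sketched below. How the service actually builds its structured-output request is not documented here; `to_json_schema` is a hypothetical helper that shows one plausible mapping of the type, description, and required properties.

```python
def to_json_schema(metadata_schema: dict) -> dict:
    """Map the per-field metadata schema onto a JSON Schema object definition."""
    properties = {
        name: {"type": spec["type"], "description": spec.get("description", "")}
        for name, spec in metadata_schema.items()
    }
    # Only fields flagged as required end up in the JSON Schema "required" list.
    required = [name for name, spec in metadata_schema.items() if spec.get("required")]
    return {"type": "object", "properties": properties, "required": required}
```

Applied to the example schema, both `summary` and `documentType` become required string properties, and each description is passed through as guidance for the model.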

LLM Configuration

Language Model Selection
- Models listed in the env variable (METADATA_EXTRACTION_LANGUAGE_MODELS) are available in this dropdown.
- LLMs must be provided by Azure OpenAI and support structured output.
- The model selected from this dropdown is used for metadata extraction.

Input Tokens
- Configurable in the range 1000-10000. The default value shown in the box comes from the env variable (DEFAULT_MAX_TOKENS_USED_FOR_METADATA_EXTRACTION).
- This defines the maximum token limit for data sent to the LLM.
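The interplay between the configured default and the allowed range can be sketched as follows. `resolve_max_input_tokens` is a hypothetical helper, and clamping out-of-range values (rather than rejecting them) is an assumption; the UI may validate differently.

```python
import os
from typing import Mapping, Optional

MIN_TOKENS, MAX_TOKENS = 1000, 10000  # configurable range stated above


def resolve_max_input_tokens(
    requested: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Pick the requested value, or the env default, clamped to the allowed range."""
    default = int(env.get("DEFAULT_MAX_TOKENS_USED_FOR_METADATA_EXTRACTION", "10000"))
    value = default if requested is None else requested
    # Assumption: out-of-range values are clamped, not rejected.
    return max(MIN_TOKENS, min(MAX_TOKENS, value))
```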

Operating & Troubleshooting

Authentication Methods

The service uses:

API Keys: For Unique AI API access

API Endpoints

Webhook

Endpoint: POST /metadata-extraction/webhook

Payload:

json

{
  "event": "unique.ingestion.content.finished",
  "companyId": "company-123",
  "userId": "user-456",
  "payload": {
    "contentId": "content-789"
  }
}
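When testing the webhook manually, it can help to validate the payload shape before sending it. The sketch below reads the four fields shown in the payload above into a typed object; `WebhookEvent` and `parse_webhook_event` are hypothetical names and are not part of the service's code.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class WebhookEvent:
    event: str
    company_id: str
    user_id: str
    content_id: str


def parse_webhook_event(body: Mapping[str, Any]) -> WebhookEvent:
    """Extract the documented fields; raise ValueError if any is missing."""
    try:
        return WebhookEvent(
            event=body["event"],
            company_id=body["companyId"],
            user_id=body["userId"],
            content_id=body["payload"]["contentId"],
        )
    except (KeyError, TypeError) as exc:
        # KeyError: a field is absent; TypeError: "payload" is not a mapping.
        raise ValueError(f"Malformed webhook payload: {exc!r}") from exc
```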

Troubleshooting

Why is the metadata extraction feature not visible in the UI?

Verify FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619=true is set in Frontend (knowledge-upload app)

Why is metadata not visible after ingestion has finished?

Verify FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619=true is set in Backend (agentic-ingestion app)

Check if module is loaded

kubectl logs deployment/agentic-ingestion | grep "Registered blueprint for metadata_extraction"

Verify if the webhook is configured properly and triggered
Verify authentication is working

Check all required environment variables are set in the Python service:

API_BASE=https://your-api-url.com
TOKEN_URL=https://your-auth-url.com/oauth/token
CLIENT_ID=your-client-id
CLIENT_SECRET=your-client-secret
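A quick way to confirm that these four variables are present is a small check run inside the service's environment. This is an illustrative helper, not part of the service; `missing_env_vars` is a hypothetical name, and treating empty values as missing is an assumption.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("API_BASE", "TOKEN_URL", "CLIENT_ID", "CLIENT_SECRET")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    print("Missing: " + ", ".join(missing) if missing else "All required variables are set")
```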

In logs, look for:

"Error fetching or parsing metadata config for content" → API access issue
"Metadata settings not available" → Initialization failure

Monitoring

Key Metrics to Monitor:

Webhook processing lag/queue length
Processing time per document
Success/failure rates
Resource utilization (CPU, Memory)
LLM API response times
Content API health & authentication success rate
Token consumption & costs
Extraction quality (fields extracted, empty extraction rate)
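Several of these metrics can be derived from per-document processing records. The sketch below computes the success rate, the empty extraction rate, and the mean processing time. The `ExtractionRecord` shape and the `summarize` function are hypothetical; the service's actual log or metric format is not documented here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class ExtractionRecord:
    succeeded: bool
    seconds: float          # processing time for this document
    fields_extracted: int   # number of metadata fields returned


def summarize(records: Sequence[ExtractionRecord]) -> dict[str, Optional[float]]:
    """Aggregate success rate, empty extraction rate, and mean processing time."""
    if not records:
        return {"success_rate": None, "empty_rate": None, "mean_seconds": None}
    ok = [r for r in records if r.succeeded]
    empty = [r for r in ok if r.fields_extracted == 0]
    return {
        "success_rate": len(ok) / len(records),
        # Empty rate is measured over successful runs only.
        "empty_rate": len(empty) / len(ok) if ok else None,
        "mean_seconds": mean(r.seconds for r in records),
    }
```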