Agentic Metadata Extraction for Infra Admins


This feature is currently in BETA. It may change before general availability in response to user and client feedback, but it is targeted to be high quality and stable. Documentation may lag behind feature updates. Use in production environments at your own discretion. Please refer to our Upgrade and Release Process for more information.



Description

The Agentic Metadata Extraction Service automatically extracts structured metadata from document content using Large Language Models (LLMs). Unlike traditional metadata extraction relying on document properties, this service analyzes actual text content to intelligently extract business-relevant metadata fields.

Processing Flow

View diagram “metadata-extraction-flow” in Confluence

How it works

  1. User uploads document → Content is ingested and chunked

  2. Platform sends webhook to /metadata-extraction/webhook

  3. Service fetches document chunks and configuration

  4. Chunks merged up to maxInputTokens limit

  5. LLM extracts metadata based on configured schema

  6. Extracted metadata merged with existing content metadata

  7. Service returns 200 OK once synchronous processing completes

The service processes already-ingested content chunks, not raw documents. Processing is synchronous - the webhook completes when extraction finishes.
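
Step 4 of the flow (merging chunks up to the maxInputTokens limit) can be illustrated with a minimal sketch. The function names, the greedy "stop at the first chunk that does not fit" strategy, and the whitespace-based token counter are assumptions for illustration only; the service presumably counts tokens with the configured encoder (DEFAULT_ENCODER_NAME), and its actual merging logic is not documented here.

```python
from typing import Callable, Iterable


def merge_chunks(
    chunks: Iterable[str],
    max_input_tokens: int,
    count_tokens: Callable[[str], int],
) -> str:
    """Concatenate chunks in order until the token budget would be exceeded."""
    merged: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_input_tokens:
            break  # assumption: whole chunks are dropped, never truncated
        merged.append(chunk)
        used += cost
    return "\n".join(merged)


def whitespace_tokens(text: str) -> int:
    # Hypothetical stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


# The first two chunks use 3 + 2 = 5 tokens; the third would exceed the budget of 5.
merged_text = merge_chunks(["a b c", "d e", "f g h i"], 5, whitespace_tokens)
```

In a real deployment the `count_tokens` callable would wrap the tokenizer matching the selected model, so that the budget reflects what the LLM actually receives.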

Code Path (agentic-ingestion)

POST /metadata-extraction
  └─ Receive content + metadata schema
      └─ LLM completion with structured output
          └─ Return extracted metadata fields per schema

Provisioning

Prerequisites

See Agentic Ingestion Infrastructure for Kubernetes and general infrastructure requirements.

Service-Specific Requirements

  • Feature flag: FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619=true

  • Unique AI API access for content retrieval and updates

  • LLM endpoints (Azure OpenAI or compatible) supporting structured output

  • Network access to Unique AI API and LLM endpoints

Deployment

Environment Variables and Secrets

| Change | Environment Variable Name | Application Default Value (if env variable unset) | Example | Required | Applications | Short Description |
| --- | --- | --- | --- | --- | --- | --- |
| Added | FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619 | "false" | "true" | No | web-app-knowledge-upload, backend-service-agentic-ingestion | Enable metadata extraction |
| Added | DEFAULT_ENCODER_NAME | "o200k_base" | "o200k_base" | No | backend-service-agentic-ingestion | Token encoder |
| Added | METADATA_EXTRACTION_LANGUAGE_MODELS | "" | "AZURE_GPT_4o_2024_0806:GPT-4o (2024-08-06),AZURE_GPT_4o_2024_1120:GPT-4o (2024-11-20)" | Yes | web-app-knowledge-upload | Comma-separated list of available LLM models for UI selection |
| Added | DEFAULT_MAX_TOKENS_USED_FOR_METADATA_EXTRACTION | 10000 | 10000 | No | web-app-knowledge-upload | |
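
To sanity-check a METADATA_EXTRACTION_LANGUAGE_MODELS value before deploying it, a small parser can help. The sketch below assumes the format implied by the example value, namely comma-separated `IDENTIFIER:Display name` pairs; the function name is hypothetical and this is not the parser used by the web app.

```python
def parse_language_models(raw: str) -> dict[str, str]:
    """Map model identifier -> display name from an 'ID:Label,ID:Label' string."""
    models: dict[str, str] = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an unset variable or a trailing comma
        # Split on the first colon only, so labels may contain further colons.
        model_id, sep, label = entry.partition(":")
        if not sep or not model_id.strip() or not label.strip():
            raise ValueError(f"Malformed model entry: {entry!r}")
        models[model_id.strip()] = label.strip()
    return models


example = (
    "AZURE_GPT_4o_2024_0806:GPT-4o (2024-08-06),"
    "AZURE_GPT_4o_2024_1120:GPT-4o (2024-11-20)"
)
models = parse_language_models(example)
```

An empty result for a non-empty deployment value is a quick signal that the dropdown in the UI would have nothing to offer.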

Sizing

Performance

| Scenario | Processing time |
| --- | --- |
| Simple (3-5 fields) | 2-5 seconds |
| Complex (10+ fields) | 5-15 seconds |

Resource recommendations

We recommend deploying 2-4 replicas for high availability.

Recommended resource allocations per replica:

| Concurrent Users | Docs/Month | Replica CPU / Memory | Recommended Replicas |
| --- | --- | --- | --- |
| 10-100 | 1000-5000 | 1000-2000m / 2-4Gi | 2 |
| 100-500 | 5000-20000 | 2000-3000m / 4-6Gi | 2-3 |

Cost Estimation

LLM costs (GPT-4o example):

  • ~$0.05-0.15 per document

  • 10,000 docs/month: ~$500-1500/month

Costs vary by model, maxInputTokens, and schema complexity.
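
The monthly figure is simply the per-document range multiplied by the document volume. A minimal sketch of that arithmetic, using the GPT-4o example numbers from this section (the function name is hypothetical, and real costs depend on the factors listed above):

```python
def monthly_llm_cost(
    docs_per_month: int,
    cost_per_doc_low: float,
    cost_per_doc_high: float,
) -> tuple[float, float]:
    """Return the (low, high) monthly LLM cost estimate in USD."""
    return (docs_per_month * cost_per_doc_low, docs_per_month * cost_per_doc_high)


low, high = monthly_llm_cost(10_000, 0.05, 0.15)
print(f"${low:,.0f}-${high:,.0f} per month")  # prints: $500-$1,500 per month
```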


Configuration

Webhook Setup

To trigger the Agentic Metadata Extraction app, we need to send a POST request to the /metadata-extraction/webhook endpoint.

  • The webhook is triggered automatically when content ingestion completes (it subscribes to the unique.ingestion.content.finished event)

  • The app fetches the content's appliedIngestionConfig to check if metadata extraction is enabled

  • If metadata extraction is configured with a schema and enabled, it extracts structured metadata using the specified LLM

  • The extracted metadata is then updated back to the content record


Metadata Schema

Schemas are configured per-folder in the Unique AI platform UI (Knowledge Upload → Folder Settings → Configure file ingestion).


Field Properties

| Key | Description | Value |
| --- | --- | --- |
| type | Data type | string, number, boolean |
| description | Natural language description to guide LLM | Text |
| required | Whether field must be extracted or not | true/false |

Example Metadata Schema

json
{
  "summary": {
    "type": "string",
    "required": true,
    "description": "A brief one-sentence summary of the document"
  },
  "documentType": {
    "type": "string",
    "required": true,
    "description": "What type of document is this? (e.g., invoice, report, email)"
  }
}
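
Because the LLM must support structured output, a schema like the one above has to be expressed in a form the model can enforce. The following sketch shows one plausible mapping to a generic JSON Schema object. The function name and the mapping itself are assumptions; the service's actual conversion is not documented here, and some providers' strict modes impose additional constraints on the schema.

```python
from typing import Any

ALLOWED_TYPES = {"string", "number", "boolean"}


def to_json_schema(metadata_schema: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Convert the per-folder metadata schema into a JSON Schema object."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, field in metadata_schema.items():
        if field["type"] not in ALLOWED_TYPES:
            raise ValueError(f"Unsupported type for {name!r}: {field['type']!r}")
        properties[name] = {
            "type": field["type"],
            # The description is what guides the LLM during extraction.
            "description": field.get("description", ""),
        }
        if field.get("required", False):
            required.append(name)
    return {
        "type": "object",
        "properties": properties,
        "required": required,
        "additionalProperties": False,
    }


schema = {
    "summary": {
        "type": "string",
        "required": True,
        "description": "A brief one-sentence summary of the document",
    },
    "documentType": {
        "type": "string",
        "required": True,
        "description": "What type of document is this? (e.g., invoice, report, email)",
    },
}
json_schema = to_json_schema(schema)
```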

LLM Configuration

  • Language Model Selection

    • Models listed in the env variable (METADATA_EXTRACTION_LANGUAGE_MODELS) are available in this dropdown.

    • LLMs must be provided by Azure OpenAI and support Structured Output

    • The model selected from this dropdown will be used for Metadata extraction

  • Input Tokens

    • This is configurable in the range 1000-10000. The default value shown in the box comes from the env variable (DEFAULT_MAX_TOKENS_USED_FOR_METADATA_EXTRACTION).

    • This defines the maximum token limit for data sent to the LLM.
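
The interplay between the 1000-10000 range and the env-variable default can be sketched as follows. Whether out-of-range values are clamped or rejected is not stated in this document; this sketch clamps, and the function and constant names are hypothetical.

```python
import os
from typing import Optional

MIN_INPUT_TOKENS = 1000
MAX_INPUT_TOKENS = 10000


def resolve_max_input_tokens(configured: Optional[int] = None) -> int:
    """Pick the per-folder value if set, else the env default, clamped to the range."""
    if configured is None:
        configured = int(
            os.environ.get("DEFAULT_MAX_TOKENS_USED_FOR_METADATA_EXTRACTION", "10000")
        )
    # Assumption: values outside the allowed range are clamped rather than rejected.
    return max(MIN_INPUT_TOKENS, min(MAX_INPUT_TOKENS, configured))
```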


Operating & Troubleshooting

Authentication Methods

The service uses:

  • API Keys: For Unique AI API access

API Endpoints

Webhook

Endpoint: POST /metadata-extraction/webhook

Payload:

json
{
  "event": "unique.content.ingestion.finished",
  "companyId": "company-123",
  "userId": "user-456",
  "payload": {
    "contentId": "content-789"
  }
}
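
When testing the webhook manually or writing a stub receiver, it can help to validate the payload shape first. The sketch below parses the fields shown in the example payload into a small dataclass. The class and function names are hypothetical, and this is not the service's own handler.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionFinishedEvent:
    event: str
    company_id: str
    user_id: str
    content_id: str


def parse_webhook_body(body: str) -> IngestionFinishedEvent:
    """Parse a webhook request body, raising ValueError if fields are missing."""
    data = json.loads(body)  # invalid JSON raises JSONDecodeError, a ValueError subclass
    try:
        return IngestionFinishedEvent(
            event=data["event"],
            company_id=data["companyId"],
            user_id=data["userId"],
            content_id=data["payload"]["contentId"],
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"Malformed webhook payload: {exc!r}") from exc


body = json.dumps(
    {
        "event": "unique.content.ingestion.finished",
        "companyId": "company-123",
        "userId": "user-456",
        "payload": {"contentId": "content-789"},
    }
)
parsed_event = parse_webhook_body(body)
```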

Troubleshooting

Why is the metadata extraction feature not visible in the UI?

  • Verify FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619=true is set in the frontend (knowledge-upload app)

Why is metadata not visible after ingestion has finished?

  • Verify FEATURE_FLAG_ENABLE_AGENTIC_METADATA_EXTRACTION_UN_15619=true is set in the backend (agentic-ingestion app)

  • Check if module is loaded

    kubectl logs deployment/agentic-ingestion | grep "Registered blueprint for metadata_extraction"
  • Verify if the webhook is configured properly and triggered

  • Verify authentication is working

Check all required environment variables are set in the Python service:

sh
API_BASE=https://your-api-url.com
TOKEN_URL=https://your-auth-url.com/oauth/token
CLIENT_ID=your-client-id
CLIENT_SECRET=your-client-secret
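
A quick way to confirm that these variables are present is a small check run inside the service container. This is a sketch, not part of the service; the function name is hypothetical, and only the four variable names above are taken from this document.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("API_BASE", "TOKEN_URL", "CLIENT_ID", "CLIENT_SECRET")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```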

In logs, look for:

  • "Error fetching or parsing metadata config for content" → API access issue

  • "Metadata settings not available" → Initialization failure
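
The two log messages above can be scanned for automatically. The following sketch maps each message to its likely cause; the function name is hypothetical and the mapping contains only the messages listed in this section.

```python
from typing import Iterable

# Log message fragment -> likely cause, as listed in this section.
KNOWN_ERRORS = {
    "Error fetching or parsing metadata config for content": "API access issue",
    "Metadata settings not available": "Initialization failure",
}


def triage(lines: Iterable[str]) -> list[tuple[str, str]]:
    """Return (cause, log line) pairs for every line containing a known message."""
    findings: list[tuple[str, str]] = []
    for line in lines:
        for needle, cause in KNOWN_ERRORS.items():
            if needle in line:
                findings.append((cause, line.rstrip("\n")))
    return findings


sample = [
    "INFO Registered blueprint for metadata_extraction\n",
    "ERROR Metadata settings not available\n",
]
for cause, line in triage(sample):
    print(f"[{cause}] {line}")
```

To run this against live logs, the output of the kubectl logs command shown earlier could be piped into a script that calls `triage(sys.stdin)`.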

Monitoring

Key Metrics to Monitor:

  1. Webhook processing lag/queue length

  2. Processing time per document

  3. Success/failure rates

  4. Resource utilization (CPU, Memory)

  5. LLM API response times

  6. Content API health & authentication success rate

  7. Token consumption & costs

  8. Extraction quality (fields extracted, empty extraction rate)
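
As an illustration of metric 8, extraction quality can be summarized from a batch of extraction results. The sketch below is an assumption about how such a metric might be computed: it treats None and blank strings as "not extracted", and the function names and input shape (one dict of extracted fields per document) are hypothetical.

```python
from typing import Any, Mapping, Sequence


def _is_filled(value: Any) -> bool:
    # None and empty or whitespace-only strings count as "not extracted".
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    return True  # numbers and booleans (including 0 and False) count as extracted


def extraction_quality(results: Sequence[Mapping[str, Any]]) -> dict[str, float]:
    """Average number of filled fields and share of documents with none filled."""
    if not results:
        return {"avg_fields_extracted": 0.0, "empty_extraction_rate": 0.0}
    filled_counts = [sum(_is_filled(v) for v in r.values()) for r in results]
    empty = sum(1 for count in filled_counts if count == 0)
    return {
        "avg_fields_extracted": sum(filled_counts) / len(results),
        "empty_extraction_rate": empty / len(results),
    }


quality = extraction_quality(
    [
        {"summary": "Quarterly report", "documentType": "report"},
        {"summary": "", "documentType": None},
    ]
)
```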
