Additional ingestion configuration options

This page lists the remaining ingestion configuration options, together with the other settings available in a Space's Advanced Settings.


Ingestion configuration

The complete ingestionConfig is shown below. Where a field lists several values, the first one is the default. These values can be adjusted as described in the following sections and set at different levels for documents.

json
{
  ingestionConfig: {
    // Core Chunking Settings
    chunkMaxTokens: 600,
    chunkMaxTokensOnePager: 1000,
    chunkMinTokens: 3,
    documentMinTokens: 25,

    // Chunking Strategy
    chunkStrategy: 'RECURSIVE_CHUNKING' | 'UNIQUE_DEFAULT_CHUNKING' | 'CUSTOM_CHUNKING_API' | 'CONTEXTUAL_CHUNKING' | 'CONTEXTUAL_CHUNKING_LIGHT',
    chunkingConfiguration: {
      systemPrompt: 'string',
      model: 'string',
      tokens: number
    },

    // Document Processing Modes
    pdfReadMode: 'DOC_INTELLIGENCE_DISABLED' | 'DOC_INTELLIGENCE_DEFAULT' | 'DOC_INTELLIGENCE_ON_TABLE' | 'DOC_INTELLIGENCE_FALLBACK' | 'PDFTODOCX_ONLY' | 'CUSTOM_SINGLE_PAGE_API',
    wordReadMode: 'MAMMOTH_ONLY' | 'DOC_INTELLIGENCE_DEFAULT' | 'CUSTOM_SINGLE_PAGE_API' | 'INGEST_WORD_AS_PDF',
    pptReadMode: 'INGEST_WITH_DEFAULT_SERVICE' | 'INGEST_PPT_AS_PDF',
    excelReadMode: 'INGEST_WITH_DEFAULT_SERVICE' | 'INGEST_EXCEL_AS_PDF',
    jpgReadMode: 'NO_INGESTION' | 'DOC_INTELLIGENCE_DEFAULT',

    // Ingestion Mode
    uniqueIngestionMode: 'INGESTION' | 'SKIP_INGESTION' | 'SKIP_EXCEL_INGESTION' | 'EXTERNAL_INGESTION',

    // Custom API Options
    customApiOptions: [] | Array<{
      customisationType: 'CUSTOM_SINGLE_PAGE_API' | 'CUSTOM_CHUNKING_API',
      apiIdentifier: 'YOUR IDENTIFIER',
      apiPayload?: '{"xxx": "yyyy"}'
    }>,

    // Format-Specific Configuration
    pdfConfig: {
      usePageBasedChunking: false
    },
    pptConfig: {
      usePageBasedChunking: false
    },
    excelConfig: {
      rowsPerChunk: number,
      tableFormat: 'MARKDOWN' | 'OBJECT',
      headerRows: [1],
      headerColumns: [],
      maxEmptyTableRows: 1,
      maxEmptyTableCols: 2,
      tableChunkTokenLimit: 2000,
      maxRows: 5000,
      maxCols: 100
    },
    csvConfig: {
      maxRows: 5000,
      maxCols: 100
    },
    vttConfig: {
      languageModel: 'string'
    },

    // Metadata Configuration
    metadata: {},
    shouldApplyToSubScopes: false,
    hideInChat: false,

    // Metadata Extraction (AI-powered)
    metadataExtractionConfig: {
      enabled: false,
      metadataSchema: {},
      languageModel: 'string',
      maxInputTokens: number
    }
  }
}
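
As a minimal sketch of how the defaults above can be combined with a few overrides before a configuration is submitted, the following Python helper copies a subset of the documented defaults and applies overrides on top. The helper name `build_ingestion_config` is hypothetical, and whether the platform itself fills in missing fields with defaults is not stated here, so treat the merge as a client-side convenience only.

```python
import json

# Defaults copied from the listing above; only a subset of fields is shown.
DEFAULT_INGESTION_CONFIG = {
    "chunkMaxTokens": 600,
    "chunkMaxTokensOnePager": 1000,
    "chunkMinTokens": 3,
    "documentMinTokens": 25,
    "chunkStrategy": "RECURSIVE_CHUNKING",
    "pdfReadMode": "DOC_INTELLIGENCE_DISABLED",
    "wordReadMode": "MAMMOTH_ONLY",
    "uniqueIngestionMode": "INGESTION",
}


def build_ingestion_config(overrides):
    """Hypothetical helper: return the defaults with overrides applied on top."""
    unknown = set(overrides) - set(DEFAULT_INGESTION_CONFIG)
    if unknown:
        # Only the subset above is known to this sketch.
        raise KeyError(f"not in this sketch's subset: {sorted(unknown)}")
    return {"ingestionConfig": {**DEFAULT_INGESTION_CONFIG, **overrides}}


settings = build_ingestion_config({"pdfReadMode": "DOC_INTELLIGENCE_DEFAULT"})
print(json.dumps(settings, indent=2))
```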

The ingestion configuration can be set on different levels:

  • On the content object directly at file upload (see Knowledge Base - Ingestion API)

  • At instance level for all companies on a tenant (contact Unique Customer Success)

  • At the Space (Assistant) level via Advanced Settings; this applies to all documents uploaded in chat for that space

  • At the Scope/Folder level via API or Admin UI


Configuring Ingestion in Space (Assistant) Settings

You can configure ingestion settings at the Space level to control how documents uploaded to chat are processed. This is done through the Advanced Settings section of Space Management.

Via Admin UI

  1. Navigate to Admin > Spaces

  2. Select the space to configure

  3. Click Advanced Settings

  4. In the JSON configuration, add or modify the ingestionConfig object

  5. Save the configuration

Via the Assistant Configuration JSON

The ingestionConfig can be set within the assistant's settings JSON:

json
{
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "wordReadMode": "INGEST_WORD_AS_PDF",
    "chunkMaxTokens": 600,
    "chunkStrategy": "RECURSIVE_CHUNKING"
  }
}

Example: Enable Microsoft Document Intelligence (MDI) for Upload in Chat

To use Microsoft Document Intelligence processing when uploading documents to a specific space's chat:

json
{
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "wordReadMode": "INGEST_WORD_AS_PDF"
  }
}

Example: Custom Single Page API for Chat Uploads

To use a custom ingestion API (such as Agentic Ingestion) for documents uploaded in chat:

json
{
  "ingestionConfig": {
    "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
    "customApiOptions": [{
      "customisationType": "CUSTOM_SINGLE_PAGE_API",
      "apiIdentifier": "Unique Text and Image Extraction API",
      "apiPayload": "{}"
    }]
  }
}
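
Note that apiPayload is a string containing JSON, not a nested object. A small sketch of building such an entry programmatically follows; the function name `custom_api_option` is an illustrative assumption, and the allowed customisationType values are taken from the listing on this page.

```python
import json


def custom_api_option(customisation_type, api_identifier, payload=None):
    """Build one customApiOptions entry; the payload is serialized to a JSON string."""
    allowed = {"CUSTOM_SINGLE_PAGE_API", "CUSTOM_CHUNKING_API"}
    if customisation_type not in allowed:
        raise ValueError(f"unknown customisationType: {customisation_type}")
    option = {
        "customisationType": customisation_type,
        "apiIdentifier": api_identifier,
    }
    if payload is not None:
        # apiPayload is optional and carried as a JSON-encoded string.
        option["apiPayload"] = json.dumps(payload)
    return option


config = {
    "ingestionConfig": {
        "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
        "customApiOptions": [
            custom_api_option(
                "CUSTOM_SINGLE_PAGE_API",
                "Unique Text and Image Extraction API",
                {},
            ),
        ],
    }
}
print(json.dumps(config, indent=2))
```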

Setting the general Unique AI ingestion mode

This mode defines the overall behaviour of the ingestion. There are four possible options:

| Value | Description |
| --- | --- |
| INGESTION | Default. Content is queued to be ingested by Unique. |
| SKIP_INGESTION | Directly sets status to FINISHED. No ingestion. Use for images/charts referenced in chat. |
| SKIP_EXCEL_INGESTION | Process all documents except Excel and CSV files (which are stored but not indexed). |
| EXTERNAL_INGESTION | Ingestion handled by SDK integration. Status set to QUEUED for SDK to pick up. |
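
The four modes can be read as a simple decision on each uploaded file. The sketch below only restates the descriptions above in code; the returned labels and the list of spreadsheet extensions are assumptions made for illustration, not platform status values.

```python
from pathlib import PurePath

MODES = {"INGESTION", "SKIP_INGESTION", "SKIP_EXCEL_INGESTION", "EXTERNAL_INGESTION"}
SPREADSHEET_SUFFIXES = {".xlsx", ".xls", ".csv"}  # assumed extension list


def initial_handling(mode, filename):
    """Return an illustrative label for what happens to an upload under each mode."""
    if mode not in MODES:
        raise ValueError(f"unknown uniqueIngestionMode: {mode}")
    if mode == "SKIP_INGESTION":
        return "finished-without-ingestion"
    if mode == "EXTERNAL_INGESTION":
        return "queued-for-sdk"
    is_spreadsheet = PurePath(filename).suffix.lower() in SPREADSHEET_SUFFIXES
    if mode == "SKIP_EXCEL_INGESTION" and is_spreadsheet:
        return "stored-not-indexed"
    return "queued-for-unique"


print(initial_handling("SKIP_EXCEL_INGESTION", "budget.xlsx"))
```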


Chunk Strategy Options

| Value | Description |
| --- | --- |
| RECURSIVE_CHUNKING | Default strategy. Recursively splits text at natural boundaries (paragraphs, sentences, words). |
| UNIQUE_DEFAULT_CHUNKING | Legacy strategy, now maps to RECURSIVE_CHUNKING. |
| CUSTOM_CHUNKING_API | Uses an external custom API for chunking. Requires customApiOptions configuration. |
| CONTEXTUAL_CHUNKING | Advanced strategy that generates per-chunk summaries using an LLM to improve retrieval. |
| CONTEXTUAL_CHUNKING_LIGHT | Lighter version that generates a single document summary prepended to all chunks. |
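
To illustrate the idea behind recursive chunking, here is a minimal sketch that splits at paragraphs first, then sentences, then words, and packs the pieces greedily up to a token limit. It is not the platform's implementation: the whitespace-based `count_tokens` is a stand-in for a real tokenizer, and the separator list is an assumption.

```python
import re


def count_tokens(text):
    # Stand-in tokenizer: whitespace-separated words, not real model tokens.
    return len(text.split())


# Paragraph breaks, then sentence ends, then any whitespace (words).
SEPARATORS = ["\n\n", r"(?<=[.!?])\s+", r"\s+"]


def recursive_chunks(text, max_tokens, level=0):
    """Split text into chunks of at most max_tokens, preferring coarse boundaries."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens or level >= len(SEPARATORS):
        return [text]
    pieces = [p for p in re.split(SEPARATORS[level], text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep packing into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_chunks(piece, max_tokens, level + 1))
    if current:
        chunks.append(current)
    return chunks


sample = "One two three. Four five six.\n\nSeven eight nine ten eleven."
for chunk in recursive_chunks(sample, max_tokens=4):
    print(repr(chunk))
```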

Chunking Configuration (for Contextual Chunking)

| Field | Type | Description |
| --- | --- | --- |
| systemPrompt | string | Custom system prompt for the summarization LLM. |
| model | string | The LLM model to use for summarization. |
| tokens | number | Token limit for generated summaries. |
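
The following sketch shows how these three fields could feed the CONTEXTUAL_CHUNKING_LIGHT strategy, where one document summary is prepended to every chunk. The `summarize` function is a placeholder for the LLM call (it simply truncates the document), and the way the summary is joined to each chunk is an assumption.

```python
def summarize(document, system_prompt, model, max_summary_tokens):
    # Placeholder for the LLM call: returns the first words of the document.
    # A real implementation would send system_prompt and document to `model`.
    return " ".join(document.split()[:max_summary_tokens])


def contextual_light(document, chunks, chunking_configuration):
    """Prepend a single document-level summary to every chunk."""
    summary = summarize(
        document,
        chunking_configuration["systemPrompt"],
        chunking_configuration["model"],
        chunking_configuration["tokens"],
    )
    return [f"{summary}\n\n{chunk}" for chunk in chunks]


configuration = {
    "systemPrompt": "Summarize the key information from this document section.",
    "model": "gpt-4o-mini",
    "tokens": 3,
}
enriched = contextual_light(
    "Quarterly revenue report for the board.",
    ["Revenue grew.", "Costs fell."],
    configuration,
)
print(enriched[0])
```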


PDF Read Mode Options

| Value | Description |
| --- | --- |
| DOC_INTELLIGENCE_DISABLED | Default. Do not use Document Intelligence. Uses standard PDF parsing only. |
| DOC_INTELLIGENCE_DEFAULT | Always use Azure Document Intelligence for PDF processing. Best for complex layouts and tables. |
| DOC_INTELLIGENCE_ON_TABLE | Use Document Intelligence only when tables are detected. |
| DOC_INTELLIGENCE_FALLBACK | Use Document Intelligence as a fallback when standard parsing fails. |
| PDFTODOCX_ONLY | Convert PDF to DOCX format first, then process. |
| CUSTOM_SINGLE_PAGE_API | Use a custom external API for page-by-page processing. Requires customApiOptions. |
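
Because CUSTOM_SINGLE_PAGE_API depends on a matching customApiOptions entry, a configuration can be checked before it is saved. The check below is a client-side sketch under that assumption; the function name and the error messages are illustrative.

```python
PDF_READ_MODES = {
    "DOC_INTELLIGENCE_DISABLED",
    "DOC_INTELLIGENCE_DEFAULT",
    "DOC_INTELLIGENCE_ON_TABLE",
    "DOC_INTELLIGENCE_FALLBACK",
    "PDFTODOCX_ONLY",
    "CUSTOM_SINGLE_PAGE_API",
}


def check_pdf_read_mode(ingestion_config):
    """Return a list of problems found with pdfReadMode (empty if none)."""
    mode = ingestion_config.get("pdfReadMode", "DOC_INTELLIGENCE_DISABLED")
    if mode not in PDF_READ_MODES:
        return [f"unknown pdfReadMode: {mode}"]
    problems = []
    if mode == "CUSTOM_SINGLE_PAGE_API":
        options = ingestion_config.get("customApiOptions", [])
        if not any(o.get("customisationType") == mode for o in options):
            problems.append("CUSTOM_SINGLE_PAGE_API needs a matching customApiOptions entry")
    return problems


print(check_pdf_read_mode({"pdfReadMode": "CUSTOM_SINGLE_PAGE_API"}))
```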


Word Read Mode Options

| Value | Description |
| --- | --- |
| MAMMOTH_ONLY | Default. Use the Mammoth library for Word to HTML conversion. |
| DOC_INTELLIGENCE_DEFAULT | Use Azure Document Intelligence for Word processing. |
| CUSTOM_SINGLE_PAGE_API | Use a custom external API for processing. |
| INGEST_WORD_AS_PDF | Convert Word to PDF first, then process using PDF pipeline. |


PowerPoint Read Mode Options

| Value | Description |
| --- | --- |
| INGEST_WITH_DEFAULT_SERVICE | Default. Use the default PowerPoint processing service. |
| INGEST_PPT_AS_PDF | Convert PowerPoint to PDF first, then process using PDF pipeline. |


Excel Read Mode Options

| Value | Description |
| --- | --- |
| INGEST_WITH_DEFAULT_SERVICE | Default. Use the default Excel processing service with table extraction. |
| INGEST_EXCEL_AS_PDF | Convert Excel to PDF first, then process using PDF pipeline. |


Image/JPG Read Mode Options

| Value | Description |
| --- | --- |
| NO_INGESTION | Default. Skip image ingestion entirely. |
| DOC_INTELLIGENCE_DEFAULT | Use Azure Document Intelligence OCR for text extraction from images. |


Core Chunking Parameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| chunkMaxTokens | number | 600 | Maximum number of tokens per chunk. Azure OpenAI supports up to 2048, but 600 is recommended for optimal retrieval. |
| chunkMinTokens | number | 3 | Minimum tokens required for a chunk. Chunks below this are merged with adjacent chunks. |
| chunkMaxTokensOnePager | number | 1000 | Maximum tokens for "one-pager" documents that should not be split. |
| documentMinTokens | number | 25 | Minimum tokens required for a document to be ingested. Documents below this are skipped. |
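
The two minimum thresholds can be pictured as a post-processing step over a document's chunks. The sketch below skips documents under documentMinTokens and merges chunks under chunkMinTokens into a neighbour. It assumes a whitespace token count and does not re-check chunkMaxTokens after merging, so it illustrates the rule rather than the actual pipeline.

```python
def count_tokens(text):
    # Stand-in tokenizer: whitespace-separated words, not real model tokens.
    return len(text.split())


def apply_minimums(chunks, chunk_min_tokens=3, document_min_tokens=25):
    """Skip too-small documents and merge too-small chunks into a neighbour."""
    if sum(count_tokens(c) for c in chunks) < document_min_tokens:
        return []  # document is skipped entirely
    merged = []
    for chunk in chunks:
        previous_small = merged and count_tokens(merged[-1]) < chunk_min_tokens
        current_small = merged and count_tokens(chunk) < chunk_min_tokens
        if previous_small or current_small:
            merged[-1] = f"{merged[-1]} {chunk}"
        else:
            merged.append(chunk)
    return merged


body = " ".join(["word"] * 30)
print(len(apply_minimums(["Intro", body, "End."])))
```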


Format-Specific Configuration

PDF Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pdfConfig.usePageBasedChunking | boolean | false | If true, creates separate chunks for each page rather than merging across pages. |

PowerPoint Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pptConfig.usePageBasedChunking | boolean | false | If true, creates separate chunks for each slide. |

Excel Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| excelConfig.rowsPerChunk | number | - | Number of rows per chunk. If not set, uses token-based chunking. |
| excelConfig.tableFormat | enum | MARKDOWN | Output format for tables (OBJECT or MARKDOWN). |
| excelConfig.headerRows | number[] | [1] | Array of row indices (1-based) to treat as header rows. |
| excelConfig.headerColumns | number[] | [] | Array of column indices (1-based) to treat as header columns. |
| excelConfig.maxEmptyTableRows | number | 1 | Maximum consecutive empty rows allowed before table is split. |
| excelConfig.maxEmptyTableCols | number | 2 | Maximum consecutive empty columns allowed. |
| excelConfig.tableChunkTokenLimit | number | 2000 | Maximum tokens per table chunk. |
| excelConfig.maxRows | number | 5000 | Maximum rows allowed. Ingestion fails if exceeded. |
| excelConfig.maxCols | number | 100 | Maximum columns allowed. Ingestion fails if exceeded. |
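
A short sketch of how rowsPerChunk, headerRows, maxRows, and maxCols might interact when a sheet is rendered in the MARKDOWN table format. Repeating the header rows at the top of every chunk is an assumption made for this illustration; the page does not describe the exact chunk layout.

```python
def rows_to_markdown_chunks(rows, rows_per_chunk, header_rows=(1,),
                            max_rows=5000, max_cols=100):
    """Split sheet rows into Markdown table chunks of rows_per_chunk body rows."""
    if len(rows) > max_rows or any(len(r) > max_cols for r in rows):
        raise ValueError("sheet exceeds maxRows/maxCols")
    header_idx = {i - 1 for i in header_rows}  # the config uses 1-based indices
    headers = [r for i, r in enumerate(rows) if i in header_idx]
    body = [r for i, r in enumerate(rows) if i not in header_idx]

    def line(cells):
        return "| " + " | ".join(str(c) for c in cells) + " |"

    chunks = []
    for start in range(0, len(body), rows_per_chunk):
        lines = [line(h) for h in headers]
        if headers:
            # Markdown separator row under the (assumed repeated) header.
            lines.append(line(["---"] * len(headers[-1])))
        lines += [line(r) for r in body[start:start + rows_per_chunk]]
        chunks.append("\n".join(lines))
    return chunks


sheet = [["Name", "Qty"], ["A", 1], ["B", 2], ["C", 3]]
for chunk in rows_to_markdown_chunks(sheet, rows_per_chunk=2):
    print(chunk, end="\n\n")
```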

CSV Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| csvConfig.maxRows | number | 5000 | Maximum rows allowed in CSV files. |
| csvConfig.maxCols | number | 100 | Maximum columns allowed in CSV files. |

VTT (Video Transcript) Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| vttConfig.languageModel | string | - | LLM model to use for transcript processing. |


Custom API Options

| Field | Type | Description |
| --- | --- | --- |
| customApiOptions | array | Configuration for custom processing APIs. |

CustomApiOptions Object:

| Field | Type | Description |
| --- | --- | --- |
| customisationType | enum | Type of customization: CUSTOM_SINGLE_PAGE_API or CUSTOM_CHUNKING_API |
| apiIdentifier | string | Identifier of the registered custom API endpoint. |
| apiPayload | string | Optional JSON payload to send to the custom API. |


Metadata Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| metadata | object | - | Key-value pairs of custom metadata to attach to all ingested content. |
| shouldApplyToSubScopes | boolean | false | If true, applies this configuration to all child folders when setting on a scope. |
| hideInChat | boolean | false | If true, content is indexed but hidden from chat search results. |


Metadata Extraction (AI-powered)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| metadataExtractionConfig | object | - | Configuration for automatic metadata extraction using LLMs. |

MetadataExtractionConfig Object:

| Field | Type | Description |
| --- | --- | --- |
| enabled | boolean | Whether to enable automatic metadata extraction. |
| metadataSchema | object | Schema defining what metadata fields to extract. |
| languageModel | string | LLM model to use for extraction. |
| maxInputTokens | number | Maximum input tokens to send to the LLM. |

MetadataFieldSchema (for each field in metadataSchema):

| Field | Type | Description |
| --- | --- | --- |
| type | enum | Field type: string, number, boolean, or array |
| description | string | Description to help the LLM understand what to extract. |
| required | boolean | Whether this field must be extracted. |

Example: AI Metadata Extraction Configuration

json
{
  "metadataExtractionConfig": {
    "enabled": true,
    "languageModel": "gpt-4o-mini",
    "maxInputTokens": 4000,
    "metadataSchema": {
      "document_date": {
        "type": "string",
        "description": "The date of the document in ISO format (YYYY-MM-DD)",
        "required": true
      },
      "author": {
        "type": "string",
        "description": "The author or authors of the document",
        "required": false
      },
      "topics": {
        "type": "array",
        "description": "Main topics covered in the document",
        "required": true
      }
    }
  }
}
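
To make the role of type and required concrete, the sketch below checks a dictionary of extracted values against a metadataSchema like the one above. It is an illustrative client-side check with a hypothetical function name; how the platform itself handles missing or mistyped fields is not described on this page.

```python
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
}


def validate_metadata(extracted, schema):
    """Return a list of problems; an empty list means the values fit the schema."""
    problems = []
    for name, spec in schema.items():
        value = extracted.get(name)
        if value is None:
            if spec.get("required"):
                problems.append(f"missing required field: {name}")
            continue
        if not TYPE_CHECKS[spec["type"]](value):
            problems.append(f"{name}: expected {spec['type']}")
    return problems


schema = {
    "document_date": {"type": "string", "required": True},
    "author": {"type": "string", "required": False},
    "topics": {"type": "array", "required": True},
}
print(validate_metadata({"document_date": "2024-01-31", "topics": "finance"}, schema))
```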

Other Assistant Configuration Options

Beyond ingestionConfig, there are several other configuration options that can be set in the Space/Assistant Advanced Settings.

Complete Assistant Settings Structure

json
{
  // User Interface Type
  userInterface: 'CHAT' | 'MAGIC_TABLE' | 'TRANSLATION',

  // Model Selection Strategy
  modelChoosing: 'BY_FUNCTION_CALL',

  // PDF Highlighting in Chat
  showPdfHighlighting: true | false,

  // Auto-execute prompt on space entry
  autoExecutePrompt: null | 'string',

  // Ingestion Configuration (see above)
  ingestionConfig: { ... },

  // Speech-to-Text Configuration
  sttConfig: {
    grammarList: []
  },

  // Magic Table Configuration (for Due Diligence spaces)
  magicTableConfig: {
    answerLibrary: true | false,
    hideSheetStatus: true | false
  }
}

Speech-to-Text Configuration (sttConfig)

The sttConfig object configures the Speech-to-Text (voice input) functionality for a space.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| sttConfig.grammarList | string[] | [] | List of phrases, words, or acronyms to help the speech recognition engine recognize company-specific terminology. |

What is grammarList?

The grammarList is an array of strings that help the speech recognition service (Microsoft Azure Speech-to-Text) better recognize domain-specific vocabulary, company names, acronyms, and technical terms that may not be in the standard vocabulary.

Use Cases

  • Company names: ["UniqueAI", "Acme Corp", "TechCo"]

  • Industry acronyms: ["KPI", "ROI", "EBITDA", "P&L", "YoY"]

  • Product names: ["UniqueChat", "MagicTable", "AgenticTable"]

  • Technical terms: ["chunking", "embeddings", "vectorization"]

Example Configuration

json
{
  "sttConfig": {
    "grammarList": [
      "UniqueAI",
      "EBITDA",
      "YoY",
      "MoM",
      "P&L",
      "Due Diligence",
      "KYC",
      "AML"
    ]
  }
}

How it Works

When voice input is used in the chat, the grammar list phrases are sent to the Azure Speech-to-Text service as a PhraseListGrammar. This improves recognition accuracy for these specific terms, especially when they might otherwise be misinterpreted (e.g., "EBITDA" being recognized as "eat a" or similar).
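
The hand-off from grammarList to a phrase list can be sketched as follows. The code assumes a phrase-list object that exposes an `addPhrase` method, which is how the Azure Speech SDK's Python `PhraseListGrammar` is understood to work; to stay runnable without the SDK, a small stand-in class is used here. The de-duplication step is an addition of this sketch, not documented platform behaviour.

```python
def apply_grammar_list(phrase_list, grammar_list):
    """Add each distinct, non-empty phrase to an object exposing addPhrase()."""
    seen = set()
    for phrase in grammar_list:
        cleaned = phrase.strip()
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            phrase_list.addPhrase(cleaned)
    return len(seen)


class RecordingPhraseList:
    """Stand-in for the SDK phrase-list object; it only records phrases."""

    def __init__(self):
        self.phrases = []

    def addPhrase(self, phrase):
        self.phrases.append(phrase)


phrase_list = RecordingPhraseList()
added = apply_grammar_list(phrase_list, ["UniqueAI", "EBITDA", " ebitda ", "", "KYC"])
print(added, phrase_list.phrases)
```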


Other Settings Reference

userInterface

| Value | Description |
| --- | --- |
| CHAT | Standard chat interface (default) |
| MAGIC_TABLE | Magic Table / Agentic Table interface for structured data workflows |
| TRANSLATION | Translation-focused interface |

showPdfHighlighting

| Value | Description |
| --- | --- |
| true | Enable PDF highlighting in chat responses (default) |
| false | Disable PDF highlighting |

autoExecutePrompt

| Value | Description |
| --- | --- |
| null | No auto-execute prompt (default) |
| "string" | A prompt that automatically executes when a user enters the space |

magicTableConfig (for Due Diligence/Magic Table spaces)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| answerLibrary | boolean | true | Enable/disable the answer library feature |
| hideSheetStatus | boolean | false | Hide/show sheet status indicators |


Complete Example: Full Assistant Settings

json
{
  "userInterface": "CHAT",
  "modelChoosing": "BY_FUNCTION_CALL",
  "showPdfHighlighting": true,
  "autoExecutePrompt": null,
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "wordReadMode": "INGEST_WORD_AS_PDF",
    "chunkMaxTokens": 600,
    "chunkStrategy": "RECURSIVE_CHUNKING"
  },
  "sttConfig": {
    "grammarList": [
      "UniqueAI",
      "EBITDA",
      "Due Diligence",
      "KYC"
    ]
  }
}

Configuration Examples

Example 1: Basic Space Configuration

json
{
  "chunkMaxTokens": 600,
  "chunkMinTokens": 3,
  "chunkMaxTokensOnePager": 1000,
  "documentMinTokens": 25,
  "chunkStrategy": "RECURSIVE_CHUNKING",
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "wordReadMode": "MAMMOTH_ONLY",
  "pptReadMode": "INGEST_WITH_DEFAULT_SERVICE",
  "excelReadMode": "INGEST_WITH_DEFAULT_SERVICE",
  "jpgReadMode": "NO_INGESTION",
  "uniqueIngestionMode": "INGESTION"
}

Example 2: Excel-Heavy Content Configuration

json
{
  "chunkStrategy": "RECURSIVE_CHUNKING",
  "excelReadMode": "INGEST_WITH_DEFAULT_SERVICE",
  "excelConfig": {
    "rowsPerChunk": 50,
    "tableFormat": "MARKDOWN",
    "headerRows": [1],
    "headerColumns": [1],
    "maxEmptyTableRows": 2,
    "maxEmptyTableCols": 3,
    "tableChunkTokenLimit": 2500,
    "maxRows": 10000,
    "maxCols": 200
  },
  "shouldApplyToSubScopes": true
}

Example 3: Contextual Chunking Configuration

json
{
  "chunkMaxTokens": 500,
  "chunkMinTokens": 10,
  "chunkStrategy": "CONTEXTUAL_CHUNKING_LIGHT",
  "chunkingConfiguration": {
    "systemPrompt": "Summarize the key information from this document section.",
    "tokens": 150,
    "model": "gpt-4o-mini"
  },
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "uniqueIngestionMode": "INGESTION"
}