Additional ingestion configuration options


This page describes the additional configuration options available for ingestion.

Ingestion configuration

The complete ingestionConfig is shown below. Where several values are listed for a field, the first one is the default. These values can be adjusted as described in the following sections and can be set at different levels for documents.

json

{
  ingestionConfig: {
    // Core Chunking Settings
    chunkMaxTokens: 600,
    chunkMaxTokensOnePager: 1000,
    chunkMinTokens: 3,
    documentMinTokens: 25,

    // Chunking Strategy
    chunkStrategy: 'RECURSIVE_CHUNKING' | 'UNIQUE_DEFAULT_CHUNKING' | 'CUSTOM_CHUNKING_API' | 'CONTEXTUAL_CHUNKING' | 'CONTEXTUAL_CHUNKING_LIGHT',
    chunkingConfiguration: {
      systemPrompt: 'string',
      model: 'string',
      tokens: number
    },

    // Document Processing Modes
    pdfReadMode: 'DOC_INTELLIGENCE_DISABLED' | 'DOC_INTELLIGENCE_DEFAULT' | 'DOC_INTELLIGENCE_ON_TABLE' | 'DOC_INTELLIGENCE_FALLBACK' | 'PDFTODOCX_ONLY' | 'CUSTOM_SINGLE_PAGE_API',
    wordReadMode: 'MAMMOTH_ONLY' | 'DOC_INTELLIGENCE_DEFAULT' | 'CUSTOM_SINGLE_PAGE_API' | 'INGEST_WORD_AS_PDF',
    pptReadMode: 'INGEST_WITH_DEFAULT_SERVICE' | 'INGEST_PPT_AS_PDF',
    excelReadMode: 'INGEST_WITH_DEFAULT_SERVICE' | 'INGEST_EXCEL_AS_PDF',
    jpgReadMode: 'NO_INGESTION' | 'DOC_INTELLIGENCE_DEFAULT',

    // Ingestion Mode
    uniqueIngestionMode: 'INGESTION' | 'SKIP_INGESTION' | 'SKIP_EXCEL_INGESTION' | 'EXTERNAL_INGESTION',

    // Custom API Options
    customApiOptions: [] | Array<{
      customisationType: 'CUSTOM_SINGLE_PAGE_API' | 'CUSTOM_CHUNKING_API',
      apiIdentifier: 'YOUR IDENTIFIER',
      apiPayload?: '{"xxx": "yyyy"}'
    }>,

    // Format-Specific Configuration
    pdfConfig: {
      usePageBasedChunking: false
    },
    pptConfig: {
      usePageBasedChunking: false
    },
    excelConfig: {
      rowsPerChunk: number,
      tableFormat: 'MARKDOWN' | 'OBJECT',
      headerRows: [1],
      headerColumns: [],
      maxEmptyTableRows: 1,
      maxEmptyTableCols: 2,
      tableChunkTokenLimit: 2000,
      maxRows: 5000,
      maxCols: 100
    },
    csvConfig: {
      maxRows: 5000,
      maxCols: 100
    },
    vttConfig: {
      languageModel: 'string'
    },

    // Metadata Configuration
    metadata: {},
    shouldApplyToSubScopes: false,
    hideInChat: false,

    // Metadata Extraction (AI-powered)
    metadataExtractionConfig: {
      enabled: false,
      metadataSchema: {},
      languageModel: 'string',
      maxInputTokens: number
    }
  }
}
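
The enumerated fields in the structure above can be checked before a configuration is submitted. The following is a minimal sketch, not part of the Unique platform: the allowed values are copied from the structure above, and the helper name `find_invalid_enum_values` is hypothetical.

```python
# Allowed values per enumerated field, copied from the ingestionConfig structure.
# The first value in each tuple is the documented default.
ALLOWED_VALUES = {
    "chunkStrategy": ("RECURSIVE_CHUNKING", "UNIQUE_DEFAULT_CHUNKING", "CUSTOM_CHUNKING_API",
                      "CONTEXTUAL_CHUNKING", "CONTEXTUAL_CHUNKING_LIGHT"),
    "pdfReadMode": ("DOC_INTELLIGENCE_DISABLED", "DOC_INTELLIGENCE_DEFAULT", "DOC_INTELLIGENCE_ON_TABLE",
                    "DOC_INTELLIGENCE_FALLBACK", "PDFTODOCX_ONLY", "CUSTOM_SINGLE_PAGE_API"),
    "wordReadMode": ("MAMMOTH_ONLY", "DOC_INTELLIGENCE_DEFAULT", "CUSTOM_SINGLE_PAGE_API",
                     "INGEST_WORD_AS_PDF"),
    "pptReadMode": ("INGEST_WITH_DEFAULT_SERVICE", "INGEST_PPT_AS_PDF"),
    "excelReadMode": ("INGEST_WITH_DEFAULT_SERVICE", "INGEST_EXCEL_AS_PDF"),
    "jpgReadMode": ("NO_INGESTION", "DOC_INTELLIGENCE_DEFAULT"),
    "uniqueIngestionMode": ("INGESTION", "SKIP_INGESTION", "SKIP_EXCEL_INGESTION",
                            "EXTERNAL_INGESTION"),
}


def find_invalid_enum_values(ingestion_config: dict) -> dict:
    """Return {field: value} for every enumerated field that holds an unknown value."""
    return {
        field: value
        for field, value in ingestion_config.items()
        if field in ALLOWED_VALUES and value not in ALLOWED_VALUES[field]
    }


config = {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "wordReadMode": "INGEST_WORD_AS_PDF",
    "chunkMaxTokens": 600,  # non-enumerated fields are ignored by this check
}
invalid = find_invalid_enum_values(config)
print(invalid)
```

A typo such as a Word mode placed in `pdfReadMode` is then reported before the configuration is saved.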

The ingestion configuration can be set on different levels:

On the content object, directly at file upload (see Knowledge Base - Ingestion API)
At the instance level, for all companies on a tenant. Contact Unique Customer Success to set this.
At the Space (Assistant) level via Advanced Settings. This applies to all documents uploaded in chat for that space.
At the Scope/Folder level via API or Admin UI

Configuring Ingestion in Space (Assistant) Settings

You can configure ingestion settings at the Space level to control how documents uploaded to chat are processed. This is done through the Advanced Settings section of Space Management.

Via Admin UI

Navigate to Admin > Spaces
Select the space to configure
Click Advanced Settings
In the JSON configuration, add or modify the ingestionConfig object
Save the configuration

Via the Assistant Configuration JSON

The ingestionConfig can be set within the assistant's settings JSON:

json

{
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "wordReadMode": "INGEST_WORD_AS_PDF",
    "chunkMaxTokens": 600,
    "chunkStrategy": "RECURSIVE_CHUNKING"
  }
}
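
If the settings JSON is maintained in a script rather than edited by hand, the ingestionConfig can be merged into an existing settings document before it is pasted into Advanced Settings. This is a minimal sketch under assumptions: the helper name `with_ingestion_config` is hypothetical, and the shallow merge reflects one reasonable choice for editing, not documented platform behaviour.

```python
import json


def with_ingestion_config(settings: dict, overrides: dict) -> dict:
    """Return a copy of the settings whose ingestionConfig includes the overrides."""
    merged = dict(settings)
    # Existing ingestionConfig keys are kept; overrides win on conflict.
    merged["ingestionConfig"] = {**settings.get("ingestionConfig", {}), **overrides}
    return merged


current = {"userInterface": "CHAT", "ingestionConfig": {"chunkMaxTokens": 600}}
updated = with_ingestion_config(
    current,
    {"pdfReadMode": "DOC_INTELLIGENCE_DEFAULT", "wordReadMode": "INGEST_WORD_AS_PDF"},
)
print(json.dumps(updated, indent=2))
```

The original dictionary is left unchanged, so the previous settings remain available for comparison or rollback.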

Example: Enable MDI for Upload in Chat

To use Microsoft Document Intelligence processing when uploading documents to a specific space's chat:

json

{
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "wordReadMode": "INGEST_WORD_AS_PDF"
  }
}

Example: Custom Single Page API for Chat Uploads

To use a custom ingestion API (like Agentic Ingestion) for documents uploaded in chat:

json

{
  "ingestionConfig": {
    "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
    "customApiOptions": [{
      "customisationType": "CUSTOM_SINGLE_PAGE_API",
      "apiIdentifier": "Unique Text and Image Extraction API",
      "apiPayload": "{}"
    }]
  }
}
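
Note that `apiPayload` in the example above is a string containing JSON, not a nested object. A small sketch of building such an entry in Python follows; the helper name `custom_api_option` is hypothetical, and the payload keys are placeholders taken from the structure at the top of this page.

```python
import json
from typing import Optional


def custom_api_option(customisation_type: str, api_identifier: str,
                      payload: Optional[dict] = None) -> dict:
    """Build one customApiOptions entry, serialising the payload to a JSON string."""
    option = {"customisationType": customisation_type, "apiIdentifier": api_identifier}
    if payload is not None:
        # apiPayload is optional and is carried as a JSON-encoded string.
        option["apiPayload"] = json.dumps(payload)
    return option


option = custom_api_option(
    "CUSTOM_SINGLE_PAGE_API",
    "Unique Text and Image Extraction API",
    {"xxx": "yyyy"},  # placeholder payload
)
print(json.dumps({"ingestionConfig": {"customApiOptions": [option]}}, indent=2))
```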

Setting the general Unique AI ingestion mode

This mode defines the overall behaviour of the ingestion. There are four possible options:

Value	Description
`INGESTION`	Default. Content is queued to be ingested by Unique.
`SKIP_INGESTION`	Sets the status directly to `FINISHED` without ingesting the content. Use for images or charts referenced in chat.
`SKIP_EXCEL_INGESTION`	Process all documents except Excel and CSV files (which are stored but not indexed).
`EXTERNAL_INGESTION`	Ingestion handled by SDK integration. Status set to `QUEUED` for SDK to pick up.
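
The table above can be read as a decision about whether Unique itself indexes a given file. The sketch below encodes that reading; it is an illustration, not the platform's implementation. The function name and the set of spreadsheet file extensions are assumptions.

```python
from pathlib import PurePath

# Assumption: Excel and CSV files are recognised by these extensions.
SPREADSHEET_SUFFIXES = {".xlsx", ".xls", ".csv"}


def is_indexed_by_unique(mode: str, filename: str) -> bool:
    """Return True if, per the table above, Unique would index this file itself."""
    if mode == "INGESTION":
        return True
    if mode == "SKIP_EXCEL_INGESTION":
        # Excel and CSV files are stored but not indexed.
        return PurePath(filename).suffix.lower() not in SPREADSHEET_SUFFIXES
    if mode in ("SKIP_INGESTION", "EXTERNAL_INGESTION"):
        # Either nothing is ingested, or the SDK integration handles it.
        return False
    raise ValueError(f"unknown uniqueIngestionMode: {mode}")


print(is_indexed_by_unique("SKIP_EXCEL_INGESTION", "budget.XLSX"))
print(is_indexed_by_unique("SKIP_EXCEL_INGESTION", "report.pdf"))
```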

Chunk Strategy Options

Value	Description
`RECURSIVE_CHUNKING`	Default strategy. Recursively splits text at natural boundaries (paragraphs, sentences, words).
`UNIQUE_DEFAULT_CHUNKING`	Legacy strategy, now maps to `RECURSIVE_CHUNKING`.
`CUSTOM_CHUNKING_API`	Uses an external custom API for chunking. Requires `customApiOptions` configuration.
`CONTEXTUAL_CHUNKING`	Advanced strategy that generates per-chunk summaries using an LLM to improve retrieval.
`CONTEXTUAL_CHUNKING_LIGHT`	Lighter version that generates a single document summary prepended to all chunks.
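
To illustrate what splitting "at natural boundaries" means for `RECURSIVE_CHUNKING`, here is a simplified sketch of the general technique. It is not Unique's implementation: tokens are approximated by whitespace-separated words, and the boundary patterns and function names are assumptions.

```python
import re

# (split pattern, joiner) pairs from coarse to fine: paragraphs, sentences, words.
BOUNDARIES = [
    (r"\n\s*\n", "\n\n"),
    (r"(?<=[.!?])\s+", " "),
    (r"\s+", " "),
]


def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word counts as one token.
    return len(text.split())


def recursive_chunks(text: str, max_tokens: int, boundaries=BOUNDARIES) -> list:
    """Split text at the coarsest boundary that keeps each chunk within max_tokens."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens or not boundaries:
        return [text]
    (pattern, joiner), finer = boundaries[0], boundaries[1:]
    chunks, current = [], ""
    for piece in filter(None, re.split(pattern, text)):
        candidate = current + joiner + piece if current else piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The piece alone is too large: recurse with the next finer boundary.
            chunks.extend(recursive_chunks(piece, max_tokens, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "One two three. Four five six.\n\nSeven eight."
chunks = recursive_chunks(sample, max_tokens=4)
print(chunks)
```

With a limit of four tokens, the first paragraph is too long and is split at the sentence boundary, while the second paragraph fits as a single chunk.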

Chunking Configuration (for Contextual Chunking)

Field	Type	Description
`systemPrompt`	string	Custom system prompt for the summarization LLM.
`model`	string	The LLM model to use for summarization.
`tokens`	number	Token limit for generated summaries.
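
The "light" variant can be pictured as follows: one summary is generated for the whole document and prepended to every chunk. This sketch uses a stand-in `summarize` callable instead of a real LLM call; the function names and the blank-line separator between summary and chunk are assumptions.

```python
from typing import Callable, List


def contextual_chunks_light(chunks: List[str], summarize: Callable[[str], str]) -> List[str]:
    """Prepend a single document-level summary to every chunk."""
    summary = summarize("\n\n".join(chunks))  # one summarisation call per document
    return [f"{summary}\n\n{chunk}" for chunk in chunks]


def fake_summarize(document: str) -> str:
    # Stand-in for the LLM configured via chunkingConfiguration (systemPrompt, model, tokens).
    return "Summary: quarterly report."


enriched = contextual_chunks_light(["Revenue grew.", "Costs fell."], fake_summarize)
print(enriched[0])
```

In contrast, the full `CONTEXTUAL_CHUNKING` strategy described above generates a summary per chunk, which implies one LLM call per chunk rather than one per document.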

PDF Read Mode Options

Value	Description
`DOC_INTELLIGENCE_DISABLED`	Default. Do not use Document Intelligence. Uses standard PDF parsing only.
`DOC_INTELLIGENCE_DEFAULT`	Always use Azure Document Intelligence for PDF processing. Best for complex layouts and tables.
`DOC_INTELLIGENCE_ON_TABLE`	Use Document Intelligence only when tables are detected.
`DOC_INTELLIGENCE_FALLBACK`	Use Document Intelligence as a fallback when standard parsing fails.
`PDFTODOCX_ONLY`	Convert PDF to DOCX format first, then process.
`CUSTOM_SINGLE_PAGE_API`	Use a custom external API for page-by-page processing. Requires `customApiOptions`.

Word Read Mode Options

Value	Description
`MAMMOTH_ONLY`	Default. Use the Mammoth library for Word to HTML conversion.
`DOC_INTELLIGENCE_DEFAULT`	Use Azure Document Intelligence for Word processing.
`CUSTOM_SINGLE_PAGE_API`	Use a custom external API for processing.
`INGEST_WORD_AS_PDF`	Convert Word to PDF first, then process using PDF pipeline.

PowerPoint Read Mode Options

Value	Description
`INGEST_WITH_DEFAULT_SERVICE`	Default. Use the default PowerPoint processing service.
`INGEST_PPT_AS_PDF`	Convert PowerPoint to PDF first, then process using PDF pipeline.

Excel Read Mode Options

Value	Description
`INGEST_WITH_DEFAULT_SERVICE`	Default. Use the default Excel processing service with table extraction.
`INGEST_EXCEL_AS_PDF`	Convert Excel to PDF first, then process using PDF pipeline.

Image/JPG Read Mode Options

Value	Description
`NO_INGESTION`	Default. Skip image ingestion entirely.
`DOC_INTELLIGENCE_DEFAULT`	Use Azure Document Intelligence OCR for text extraction from images.

Core Chunking Parameters

Field	Type	Default	Description
`chunkMaxTokens`	number	600	Maximum number of tokens per chunk. Azure OpenAI supports up to 2048, but 600 is recommended for optimal retrieval.
`chunkMinTokens`	number	3	Minimum tokens required for a chunk. Chunks below this are merged with adjacent chunks.
`chunkMaxTokensOnePager`	number	1000	Maximum tokens for "one-pager" documents that should not be split.
`documentMinTokens`	number	25	Minimum tokens required for a document to be ingested. Documents below this are skipped.
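
The two minimum thresholds can be illustrated with a short sketch. It approximates tokens by whitespace-separated words and is not the platform's implementation; in particular, merging an undersized chunk into its predecessor (or into its successor when it comes first) is an assumption about what "adjacent" means.

```python
def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word counts as one token.
    return len(text.split())


def should_ingest(text: str, document_min_tokens: int = 25) -> bool:
    """Documents below documentMinTokens are skipped."""
    return count_tokens(text) >= document_min_tokens


def merge_small_chunks(chunks: list, chunk_min_tokens: int = 3) -> list:
    """Merge chunks below chunkMinTokens into an adjacent chunk."""
    merged = []
    for chunk in chunks:
        if merged and count_tokens(chunk) < chunk_min_tokens:
            merged[-1] = merged[-1] + " " + chunk  # fold into the previous chunk
        else:
            merged.append(chunk)
    # An undersized first chunk has no predecessor, so fold it into its successor.
    if len(merged) > 1 and count_tokens(merged[0]) < chunk_min_tokens:
        merged[1] = merged[0] + " " + merged[1]
        merged.pop(0)
    return merged


result = merge_small_chunks(["Intro", "alpha beta gamma delta", "ok", "one two three"])
print(result)
```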

Format-Specific Configuration

PDF Configuration

Field	Type	Default	Description
`pdfConfig.usePageBasedChunking`	boolean	false	If true, creates separate chunks for each page rather than merging across pages.

PowerPoint Configuration

Field	Type	Default	Description
`pptConfig.usePageBasedChunking`	boolean	false	If true, creates separate chunks for each slide.

Excel Configuration

Field	Type	Default	Description
`excelConfig.rowsPerChunk`	number	-	Number of rows per chunk. If not set, uses token-based chunking.
`excelConfig.tableFormat`	enum	`MARKDOWN`	Output format for tables (`OBJECT` or `MARKDOWN`).
`excelConfig.headerRows`	number[]	[1]	Array of row indices (1-based) to treat as header rows.
`excelConfig.headerColumns`	number[]	[]	Array of column indices (1-based) to treat as header columns.
`excelConfig.maxEmptyTableRows`	number	1	Maximum number of consecutive empty rows allowed before the table is split.
`excelConfig.maxEmptyTableCols`	number	2	Maximum consecutive empty columns allowed.
`excelConfig.tableChunkTokenLimit`	number	2000	Maximum tokens per table chunk.
`excelConfig.maxRows`	number	5000	Maximum rows allowed. Ingestion fails if exceeded.
`excelConfig.maxCols`	number	100	Maximum columns allowed. Ingestion fails if exceeded.
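
The interaction of `rowsPerChunk`, `headerRows`, `maxRows`, `maxCols`, and the `MARKDOWN` table format can be sketched as follows. This is an illustration under assumptions, not the actual service: the function names are hypothetical, and repeating the header rows in every chunk is an assumed behaviour.

```python
def chunk_table(rows, rows_per_chunk, header_rows=(1,), max_rows=5000, max_cols=100):
    """Split table rows into chunks of rows_per_chunk body rows, each with the headers."""
    if len(rows) > max_rows or any(len(row) > max_cols for row in rows):
        raise ValueError("table exceeds maxRows or maxCols")  # ingestion would fail
    header_idx = {i - 1 for i in header_rows}  # headerRows is 1-based
    headers = [rows[i] for i in sorted(header_idx) if i < len(rows)]
    body = [row for i, row in enumerate(rows) if i not in header_idx]
    return [
        headers + body[start:start + rows_per_chunk]
        for start in range(0, len(body), rows_per_chunk)
    ]


def to_markdown(table) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    lines = []
    for idx, row in enumerate(table):
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
        if idx == 0:
            lines.append("|" + " --- |" * len(row))
    return "\n".join(lines)


rows = [["name", "qty"], ["a", 1], ["b", 2], ["c", 3]]
table_chunks = chunk_table(rows, rows_per_chunk=2)
print(to_markdown(table_chunks[1]))
```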

CSV Configuration

Field	Type	Default	Description
`csvConfig.maxRows`	number	5000	Maximum rows allowed in CSV files.
`csvConfig.maxCols`	number	100	Maximum columns allowed in CSV files.

VTT (Video Transcript) Configuration

Field	Type	Default	Description
`vttConfig.languageModel`	string	-	LLM model to use for transcript processing.

Custom API Options

Field	Type	Description
`customApiOptions`	array	Configuration for custom processing APIs.

CustomApiOptions Object:

Field	Type	Description
`customisationType`	enum	Type of customization: `CUSTOM_SINGLE_PAGE_API` or `CUSTOM_CHUNKING_API`
`apiIdentifier`	string	Identifier of the registered custom API endpoint.
`apiPayload`	string	Optional JSON payload to send to the custom API.
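
Because the custom read modes and the custom chunking strategy both require `customApiOptions`, a quick consistency check can catch a missing entry. This sketch assumes that the `customisationType` of an entry must match the mode or strategy that refers to it; the helper name is hypothetical.

```python
def missing_custom_api_types(ingestion_config: dict) -> set:
    """Return the customisation types that are required but not configured."""
    needed = set()
    if ingestion_config.get("chunkStrategy") == "CUSTOM_CHUNKING_API":
        needed.add("CUSTOM_CHUNKING_API")
    read_modes = (ingestion_config.get("pdfReadMode"), ingestion_config.get("wordReadMode"))
    if "CUSTOM_SINGLE_PAGE_API" in read_modes:
        needed.add("CUSTOM_SINGLE_PAGE_API")
    configured = {
        option.get("customisationType")
        for option in ingestion_config.get("customApiOptions", [])
    }
    return needed - configured


incomplete = {"pdfReadMode": "CUSTOM_SINGLE_PAGE_API", "customApiOptions": []}
print(missing_custom_api_types(incomplete))
```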

Metadata Configuration

Field	Type	Default	Description
`metadata`	object	-	Key-value pairs of custom metadata to attach to all ingested content.
`shouldApplyToSubScopes`	boolean	false	If true, applies this configuration to all child folders when it is set on a scope.
`hideInChat`	boolean	false	If true, content is indexed but hidden from chat search results.

Metadata Extraction (AI-powered)

Field	Type	Default	Description
`metadataExtractionConfig`	object	-	Configuration for automatic metadata extraction using LLMs.

MetadataExtractionConfig Object:

Field	Type	Description
`enabled`	boolean	Whether to enable automatic metadata extraction.
`metadataSchema`	object	Schema defining what metadata fields to extract.
`languageModel`	string	LLM model to use for extraction.
`maxInputTokens`	number	Maximum input tokens to send to the LLM.

MetadataFieldSchema (for each field in metadataSchema):

Field	Type	Description
`type`	enum	Field type: `string`, `number`, `boolean`, or `array`
`description`	string	Description to help the LLM understand what to extract.
`required`	boolean	Whether this field must be extracted.

Example: AI Metadata Extraction Configuration

json

{
  "metadataExtractionConfig": {
    "enabled": true,
    "languageModel": "gpt-4o-mini",
    "maxInputTokens": 4000,
    "metadataSchema": {
      "document_date": {
        "type": "string",
        "description": "The date of the document in ISO format (YYYY-MM-DD)",
        "required": true
      },
      "author": {
        "type": "string",
        "description": "The author or authors of the document",
        "required": false
      },
      "topics": {
        "type": "array",
        "description": "Main topics covered in the document",
        "required": true
      }
    }
  }
}
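
A schema like the one above can also be used on the client side to check extracted metadata. The following sketch is an illustration only: how the platform itself enforces `type` and `required` is not described here, and the function name and the mapping of schema types to Python types are assumptions.

```python
# Assumed mapping from the schema's field types to Python types.
PYTHON_TYPES = {"string": str, "number": (int, float), "boolean": bool, "array": list}


def validate_metadata(extracted: dict, schema: dict) -> list:
    """Return a list of problems found when checking extracted metadata against the schema."""
    problems = []
    for name, spec in schema.items():
        value = extracted.get(name)
        if value is None:
            if spec.get("required", False):
                problems.append(f"missing required field: {name}")
            continue
        expected = PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for "number".
        is_bool_as_number = spec["type"] == "number" and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_number:
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


schema = {
    "document_date": {"type": "string", "required": True},
    "author": {"type": "string", "required": False},
    "topics": {"type": "array", "required": True},
}
problems = validate_metadata({"document_date": "2024-01-31", "topics": "finance"}, schema)
print(problems)
```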

Other Assistant Configuration Options

Beyond ingestionConfig, there are several other configuration options that can be set in the Space/Assistant Advanced Settings.

Complete Assistant Settings Structure

json

{
  // User Interface Type
  userInterface: 'CHAT' | 'MAGIC_TABLE' | 'TRANSLATION',

  // Model Selection Strategy
  modelChoosing: 'BY_FUNCTION_CALL',

  // PDF Highlighting in Chat
  showPdfHighlighting: true | false,

  // Auto-execute prompt on space entry
  autoExecutePrompt: null | 'string',

  // Ingestion Configuration (see above)
  ingestionConfig: { ... },

  // Speech-to-Text Configuration
  sttConfig: {
    grammarList: []
  },

  // Magic Table Configuration (for Due Diligence spaces)
  magicTableConfig: {
    answerLibrary: true | false,
    hideSheetStatus: true | false
  }
}

Speech-to-Text Configuration (sttConfig)

The sttConfig object configures the Speech-to-Text (voice input) functionality for a space.

Field	Type	Default	Description
`sttConfig.grammarList`	string[]	[]	List of phrases, words, or acronyms to help the speech recognition engine recognize company-specific terminology.

What is grammarList?

The grammarList is an array of strings that help the speech recognition service (Microsoft Azure Speech-to-Text) better recognize domain-specific vocabulary, company names, acronyms, and technical terms that may not be in the standard vocabulary.

Use Cases

Company names: ["UniqueAI", "Acme Corp", "TechCo"]
Industry acronyms: ["KPI", "ROI", "EBITDA", "P&L", "YoY"]
Product names: ["UniqueChat", "MagicTable", "AgenticTable"]
Technical terms: ["chunking", "embeddings", "vectorization"]

Example Configuration

json

{
  "sttConfig": {
    "grammarList": [
      "UniqueAI",
      "EBITDA",
      "YoY",
      "MoM",
      "P&L",
      "Due Diligence",
      "KYC",
      "AML"
    ]
  }
}
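
When phrases are collected per category, as in the use cases above, they can be flattened into a single `grammarList`. This is a small convenience sketch; the helper name is hypothetical, and whether the speech service tolerates duplicates or blank entries is not stated here, so the helper simply removes them.

```python
def build_grammar_list(*phrase_groups) -> list:
    """Flatten phrase groups into one list: trimmed, no blanks, no duplicates, order kept."""
    seen, result = set(), []
    for group in phrase_groups:
        for phrase in group:
            cleaned = phrase.strip()
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                result.append(cleaned)
    return result


company_names = ["UniqueAI", "Acme Corp"]
acronyms = ["EBITDA", "KYC", " EBITDA ", ""]
stt_settings = {"sttConfig": {"grammarList": build_grammar_list(company_names, acronyms)}}
print(stt_settings)
```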

How it Works

When voice input is used in the chat, the grammar list phrases are sent to the Azure Speech-to-Text service as a PhraseListGrammar. This improves recognition accuracy for these specific terms, especially when they might otherwise be misinterpreted (e.g., "EBITDA" being recognized as "eat a" or similar).

Other Settings Reference

userInterface

Value	Description
`CHAT`	Standard chat interface (default)
`MAGIC_TABLE`	Magic Table / Agentic Table interface for structured data workflows
`TRANSLATION`	Translation-focused interface

showPdfHighlighting

Value	Description
`true`	Enable PDF highlighting in chat responses (default)
`false`	Disable PDF highlighting

autoExecutePrompt

Value	Description
`null`	No auto-execute prompt (default)
`"string"`	A prompt that executes automatically when a user enters the space

magicTableConfig (for Due Diligence/Magic Table spaces)

Field	Type	Default	Description
`answerLibrary`	boolean	true	Enable/disable the answer library feature
`hideSheetStatus`	boolean	false	Hide/show sheet status indicators

Complete Example: Full Assistant Settings

json

{
  "userInterface": "CHAT",
  "modelChoosing": "BY_FUNCTION_CALL",
  "showPdfHighlighting": true,
  "autoExecutePrompt": null,
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "wordReadMode": "INGEST_WORD_AS_PDF",
    "chunkMaxTokens": 600,
    "chunkStrategy": "RECURSIVE_CHUNKING"
  },
  "sttConfig": {
    "grammarList": [
      "UniqueAI",
      "EBITDA",
      "Due Diligence",
      "KYC"
    ]
  }
}

Configuration Examples

Each example below shows only the fields inside the ingestionConfig object. When adding them to a space's Advanced Settings JSON, wrap them in an ingestionConfig object as in the earlier examples.

Example 1: Basic Space Configuration

json

{
  "chunkMaxTokens": 600,
  "chunkMinTokens": 3,
  "chunkMaxTokensOnePager": 1000,
  "documentMinTokens": 25,
  "chunkStrategy": "RECURSIVE_CHUNKING",
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "wordReadMode": "MAMMOTH_ONLY",
  "pptReadMode": "INGEST_WITH_DEFAULT_SERVICE",
  "excelReadMode": "INGEST_WITH_DEFAULT_SERVICE",
  "jpgReadMode": "NO_INGESTION",
  "uniqueIngestionMode": "INGESTION"
}

Example 2: Excel-Heavy Content Configuration

json

{
  "chunkStrategy": "RECURSIVE_CHUNKING",
  "excelReadMode": "INGEST_WITH_DEFAULT_SERVICE",
  "excelConfig": {
    "rowsPerChunk": 50,
    "tableFormat": "MARKDOWN",
    "headerRows": [1],
    "headerColumns": [1],
    "maxEmptyTableRows": 2,
    "maxEmptyTableCols": 3,
    "tableChunkTokenLimit": 2500,
    "maxRows": 10000,
    "maxCols": 200
  },
  "shouldApplyToSubScopes": true
}

Example 3: Contextual Chunking Configuration

json

{
  "chunkMaxTokens": 500,
  "chunkMinTokens": 10,
  "chunkStrategy": "CONTEXTUAL_CHUNKING_LIGHT",
  "chunkingConfiguration": {
    "systemPrompt": "Summarize the key information from this document section.",
    "tokens": 150,
    "model": "gpt-4o-mini"
  },
  "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
  "uniqueIngestionMode": "INGESTION"
}