3rd party APIs for customisation of ingestion

Summary

This feature lets Unique AI customers plug a custom API into specific stages of the Unique AI ingestion. Based on the ingestion configuration of the content, Unique sends a synchronous API call at the configured stage so that the customer's custom logic handles that stage.

General Setup

Unique Managed Tenants

In the first version, the custom APIs must be provided to Unique so that Unique can add the configuration to the workload that handles the ingestion. The API configuration consists of an identifier, a URL and an optional API key, for example:

json

{
  "identifier": "Custom_Ingestor",
  "url": "https://customUrl.com/pdfPageIngestor",
  "apiKey": "myAPIKey"
}

This API configuration is required for all types of custom API calls during the Unique ingestion process.

Customer Managed Tenants

Instead of providing the above configuration to Unique, compose the configurations as an array of such elements and load the array as an environment variable into the backend-service-ingestion-worker application. The environment variable is called CUSTOM_API_DEFINITIONS.

Example:

yaml

env:
  CUSTOM_API_DEFINITIONS: >-
    [
    {"identifier": "Custom_Ingestor","url": "https://customUrl.com/pdfPageIngestor","apiKey": "myAPIKey"},
    ...
    ]
  ...
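As a minimal sketch of how the value of CUSTOM_API_DEFINITIONS could be generated rather than written by hand, the snippet below serialises a list of definitions into the JSON array string. The helper name build_custom_api_definitions is hypothetical, and the check for identifier and url only reflects the description above, not a validation that Unique is known to perform.

```python
import json


def build_custom_api_definitions(definitions):
    """Serialise API definitions into the JSON array string for CUSTOM_API_DEFINITIONS."""
    for definition in definitions:
        # identifier and url are required; apiKey is optional per the description above.
        if not definition.get("identifier") or not definition.get("url"):
            raise ValueError(f"definition needs 'identifier' and 'url': {definition!r}")
    return json.dumps(definitions)


definitions = [
    {
        "identifier": "Custom_Ingestor",
        "url": "https://customUrl.com/pdfPageIngestor",
        "apiKey": "myAPIKey",
    },
]

# The resulting string can be pasted into the deployment's env section.
env_value = build_custom_api_definitions(definitions)
print(env_value)
```

Generating the string this way guarantees that the array is valid JSON, which is easy to get wrong when editing a multi-line YAML scalar manually.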

Custom PDF Page Processing

Purpose

When a custom PDF page processing API is set up, Unique starts the normal document ingestion process. As soon as the stage that processes the PDF page by page is reached, Unique does not use its internal solution for parsing markdown text out of the PDF page but calls the custom API to complete this stage.

This means that for every PDF page, Unique calls the API with the base64-encoded data of that page and expects markdown text in return. After all pages have been processed via the API, Unique continues with the standard process for chunking, storing, embedding, etc.

Ingestion Config

To use custom PDF page processing, the ingestion config of the content needs to be adjusted. The workflow is similar to using Microsoft Document Intelligence. The ingestion config can be set either on scope level or directly on the content.

The configuration on a scope can be done via the Ingestion Config Dialog in the Knowledge Base.

Or via API:

curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
--header 'Authorization: Bearer <yourToken>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "properties": {
        "ingestionConfig": {
            "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
            "customApiOptions": [{
                "customisationType": "CUSTOM_SINGLE_PAGE_API",
                "apiIdentifier": "Custom_Ingestor",
                "apiPayload": "{\"stringified\": \"JSON object or just a string\"}"
            }]
        }
    },
    "applyToSubScopes": true
}'

Attention! Make sure you do not overwrite a previously customised ingestionConfig. If in doubt, first fetch and inspect the current properties of the scope.
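Because apiPayload is a string, a JSON object has to be stringified before it is placed in the request body. The following sketch builds the body of the request above in Python and lets json.dumps take care of the escaping. The function name build_scope_properties and the payload content {"language": "en"} are purely illustrative assumptions.

```python
import json


def build_scope_properties(api_identifier, api_payload, apply_to_sub_scopes=True):
    """Build the request body for the scope properties call shown above."""
    # apiPayload is transported as a string: stringify objects, pass strings through.
    payload_str = api_payload if isinstance(api_payload, str) else json.dumps(api_payload)
    return {
        "properties": {
            "ingestionConfig": {
                "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
                "customApiOptions": [
                    {
                        "customisationType": "CUSTOM_SINGLE_PAGE_API",
                        "apiIdentifier": api_identifier,
                        "apiPayload": payload_str,
                    }
                ],
            }
        },
        "applyToSubScopes": apply_to_sub_scopes,
    }


# Hypothetical payload; its meaning is defined entirely by the custom API.
body = build_scope_properties("Custom_Ingestor", {"language": "en"})
print(json.dumps(body, indent=2))
```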

API requirements

The Unique AI platform provides two options to request a text extraction from a Custom API.

Simple API Text Extraction Request

For each PDF page, Unique sends a POST request to the API defined in the API configuration (URL and ApiKey). The body has the following structure:

json

{
  "data": "<Base64EncodedPdfPage>",
  "ingestionConfiguration": {<ingestionConfig>},
  "companyId": "<companyId>",
  "chatId": "<chatId or null>",
  "pageNumber": <page number, from 1 to numberOfPages>
}

The API is expected to return a JSON response in the following format. The extractedText should be the markdown string parsed from, or describing, the PDF page that was sent.

json

{
  "extractedText": "Extracted text from this PDF page in markdown format. This is getting joined with all other pages and processed further."
}
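A minimal sketch of the customer-side logic for this request and response pair is shown below. It is framework-agnostic: handle_page_request takes the parsed JSON body and returns the response object. The function names, the placeholder extraction and the example values are assumptions. Authentication is left out because this page does not specify how the API key is transmitted.

```python
import base64


def extract_markdown(pdf_page_bytes, page_number):
    """Placeholder extraction: a real implementation would parse the PDF page here."""
    return f"## Page {page_number}\n\n({len(pdf_page_bytes)} bytes received)"


def handle_page_request(body):
    """Turn the request body sent by Unique into the expected response object."""
    pdf_page = base64.b64decode(body["data"])
    text = extract_markdown(pdf_page, body["pageNumber"])
    return {"extractedText": text}


# Example request with made-up values in the structure described above.
request_body = {
    "data": base64.b64encode(b"%PDF-1.7 example").decode("ascii"),
    "ingestionConfiguration": {},
    "companyId": "company-1",
    "chatId": None,
    "pageNumber": 1,
}
response = handle_page_request(request_body)
print(response)
```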

Job-Queue based API Text Extraction Request

For each PDF page, Unique first sends a POST request to URL + /extractions (using the URL and ApiKey from the API configuration) to create a job. The Custom API must therefore implement a POST endpoint whose base URL matches the URL in the API configuration and which exposes an /extractions route. The endpoint must return the job ID in the following format:

json

{
  "job_id": "Job ID"
}

Unique then polls the job status and result by sending a GET request to URL + /extractions/{job_id} (using the URL and ApiKey from the API configuration). The Custom API must therefore implement a GET endpoint whose base URL matches the URL in the API configuration and which exposes an /extractions/{job_id} route. The response must comply with the following format:

json

{
  "status": "PENDING | RUNNING | FINISHED | FAILED",
  "result": "Extracted text in case of FINISHED" (optional),
  "error": "Error text in case of FAILED" (optional)
}
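The two endpoints can be backed by a simple job store. The sketch below keeps jobs in memory and returns objects in the two formats shown above. All function names are hypothetical. It also assumes that the POST body matches the body of the simple request, which this page does not state explicitly. A production service would need persistent storage and a worker that actually runs the extraction.

```python
import uuid

# In-memory job store: job_id -> job record (illustration only, not persistent).
JOBS = {}


def create_job(body):
    """Logic behind POST /extractions: register a job and return its id."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "PENDING", "request": body}
    return {"job_id": job_id}


def complete_job(job_id, text=None, error=None):
    """Called by the (hypothetical) worker once the extraction succeeded or failed."""
    job = JOBS[job_id]
    if error is not None:
        job.update(status="FAILED", error=error)
    else:
        job.update(status="FINISHED", result=text)


def get_job(job_id):
    """Logic behind GET /extractions/{job_id}: report status plus optional fields."""
    job = JOBS[job_id]
    response = {"status": job["status"]}
    for key in ("result", "error"):
        if key in job:
            response[key] = job[key]
    return response


created = create_job({"data": "<Base64EncodedPdfPage>", "pageNumber": 1})
pending = get_job(created["job_id"])
complete_job(created["job_id"], text="# Page 1")
finished = get_job(created["job_id"])
print(pending, finished)
```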

The polling interval (duration) and the timeout can be configured in the node-ingestion-worker service via the following environment variables:

INGESTION_WORKER_CUSTOM_API_JOB_QUEUE_POLLING_DURATION_MS
INGESTION_WORKER_CUSTOM_API_JOB_QUEUE_TIMEOUT_MS
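To illustrate how a polling interval and a timeout typically interact, here is a generic polling loop. It is not Unique's actual implementation, whose details are not documented on this page. The function poll_job and its parameters are assumptions, and fetch_status stands in for the GET request to /extractions/{job_id}.

```python
import time


def poll_job(fetch_status, interval_ms, timeout_ms, sleep=time.sleep, clock=time.monotonic):
    """Poll until the job is FINISHED or FAILED, or until the timeout elapses."""
    deadline = clock() + timeout_ms / 1000
    while True:
        response = fetch_status()
        if response["status"] == "FINISHED":
            return response.get("result", "")
        if response["status"] == "FAILED":
            raise RuntimeError(response.get("error", "extraction failed"))
        if clock() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        # PENDING or RUNNING: wait one interval before asking again.
        sleep(interval_ms / 1000)


# Simulated status responses instead of real HTTP calls.
responses = iter(
    [
        {"status": "PENDING"},
        {"status": "RUNNING"},
        {"status": "FINISHED", "result": "# Page 1"},
    ]
)
result = poll_job(lambda: next(responses), interval_ms=10, timeout_ms=1000)
print(result)
```

When tuning the two variables, the timeout should comfortably exceed the longest expected extraction time of a single page, since each page is a separate job.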

Custom Chunking

Purpose

The Unique ingestion process allows customers to apply a custom chunking mechanism. Before the stage that chunks the whole markdown text into pieces runs, Unique checks the configuration of the content. When a custom chunking configuration is set, Unique calls a custom API with the whole text of the document and expects an ordered array of chunks in return. Unique then creates embeddings from those chunks and stores them in the database.

Ingestion Config

To configure custom chunking, the ingestion config of the content needs to be adjusted. The workflow is similar to using Microsoft Document Intelligence. The ingestion config can be set either on scope level or directly on the content.

The configuration on a scope can be done via the Ingestion Config Dialog in the Knowledge Base.

Or via API:

curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
--header 'Authorization: Bearer <yourToken>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "properties": {
        "ingestionConfig": {
            "chunkStrategy": "CUSTOM_CHUNKING_API",
            "customApiOptions": [{
                "customisationType": "CUSTOM_CHUNKING_API",
                "apiIdentifier": "Custom_Ingestor",
                "apiPayload": "{\"stringified\": \"JSON object or just a string\"}"
            }]
        }
    },
    "applyToSubScopes": true
}'

Attention! Make sure you do not overwrite a previously customised ingestionConfig. If in doubt, first fetch and inspect the current properties of the scope.
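One way to avoid overwriting an existing configuration is to merge the new settings into the current ingestionConfig on the client side before sending it. The sketch below assumes that the current ingestionConfig has already been fetched, since the endpoint for that is not covered here. The helper merge_custom_api_option is hypothetical, and the key name customisationType follows the PDF page processing example.

```python
import copy


def merge_custom_api_option(current_config, updates, option):
    """Return a new ingestionConfig with updates applied and one custom API option set."""
    merged = copy.deepcopy(current_config)
    merged.update(updates)
    # Keep options of other customisation types, replace one of the same type.
    options = [
        existing
        for existing in merged.get("customApiOptions", [])
        if existing.get("customisationType") != option["customisationType"]
    ]
    options.append(option)
    merged["customApiOptions"] = options
    return merged


# Example: the scope already uses custom PDF page processing.
current = {
    "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
    "customApiOptions": [
        {
            "customisationType": "CUSTOM_SINGLE_PAGE_API",
            "apiIdentifier": "Custom_Ingestor",
            "apiPayload": "{}",
        }
    ],
}
merged = merge_custom_api_option(
    current,
    {"chunkStrategy": "CUSTOM_CHUNKING_API"},
    {
        "customisationType": "CUSTOM_CHUNKING_API",
        "apiIdentifier": "Custom_Ingestor",
        "apiPayload": "{}",
    },
)
print(merged)
```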

API requirements

Once the whole text of the document is ready to be chunked, Unique sends a POST request to the API defined in the API configuration (URL and ApiKey). The body has the following structure:

json

{
  "text": "This is my plain text parsed from the document. It will be sent as whole text string.",
  "ingestionConfiguration": {<ingestionConfig>}
}

The API is expected to return a JSON response in the following format. The chunks field should be an ordered string array of text chunks derived from the text that was sent.

json

{
  "chunks": ["This is my plain text parsed from the document.", "It will be sent as whole text string."]
}
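A minimal sketch of a chunking handler that produces this response is shown below. It splits the text at sentence boundaries and packs consecutive sentences into chunks of at most max_chars characters. The strategy, the function names and the limit of 50 characters are illustrative assumptions, and any chunking logic that returns an ordered string array would fit the contract.

```python
import re


def chunk_text(text, max_chars=200):
    """Split text into sentences and pack consecutive sentences up to max_chars."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def handle_chunking_request(body):
    """Turn the request body sent by Unique into the expected response object."""
    return {"chunks": chunk_text(body["text"], max_chars=50)}


request_body = {
    "text": "This is my plain text parsed from the document. "
    "It will be sent as whole text string.",
    "ingestionConfiguration": {},
}
response = handle_chunking_request(request_body)
print(response)
```

The order of the returned array matters, because Unique expects the chunks in document order.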